Better handling of fatal errors #46846

Merged
merged 1 commit into from
Feb 25, 2023
1 change: 1 addition & 0 deletions src/Common/ThreadStatus.cpp
@@ -237,6 +237,7 @@ void ThreadStatus::setFatalErrorCallback(std::function<void()> callback)

 void ThreadStatus::onFatalError()
 {
+    std::lock_guard lock(thread_group->mutex);
Member Author

Taking the lock here allows the fatal error callback to be reset.

     if (fatal_error_callback)
         fatal_error_callback();
 }
17 changes: 16 additions & 1 deletion src/Daemon/BaseDaemon.cpp
@@ -134,6 +134,8 @@ static void terminateRequestedSignalHandler(int sig, siginfo_t *, void *)
}


+static std::atomic<bool> fatal_error_printed{false};
Member

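It's not quite correct to use std::atomic in a signal handler, because its operations are only guaranteed to be signal-safe when they are lock-free. It would be better to use std::atomic_flag, which is always lock-free. However, I doubt anyone actually uses ClickHouse on architectures where it matters.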
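
To illustrate the point about lock-freedom, here is a minimal sketch that is not part of this PR. It assumes the only concern is that atomic operations are safe in a signal handler when they are lock-free. The names `example_flag`, `exampleHandler` and `raiseAndCheck` are hypothetical. A `static_assert` on `ATOMIC_BOOL_LOCK_FREE` turns that assumption into a compile-time check, which is one possible alternative to switching to `std::atomic_flag`.

```cpp
#include <atomic>
#include <csignal>

// ATOMIC_BOOL_LOCK_FREE == 2 means std::atomic<bool> is always lock-free
// on the target, so the build fails on platforms where it is not.
static_assert(ATOMIC_BOOL_LOCK_FREE == 2,
              "std::atomic<bool> must be lock-free to be used in a signal handler");

// Hypothetical flag, standing in for a variable shared with a signal handler.
static std::atomic<bool> example_flag{false};

// Hypothetical handler: it only performs a lock-free atomic store.
extern "C" void exampleHandler(int)
{
    example_flag.store(true);
}

// Installs the handler, raises SIGINT in the current thread and reports
// whether the handler ran. std::raise returns only after the handler does.
bool raiseAndCheck()
{
    example_flag.store(false);
    std::signal(SIGINT, exampleHandler);
    std::raise(SIGINT);
    std::signal(SIGINT, SIG_DFL);
    return example_flag.load();
}
```

In this sketch the handler writes the flag, whereas in the diff the handler reads it; the lock-freedom requirement is the same in both directions.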


/** Handler for "fault" or diagnostic signals. Send data about fault to separate thread to write into log.
*/
static void signalHandler(int sig, siginfo_t * info, void * context)
@@ -159,7 +161,16 @@ static void signalHandler(int sig, siginfo_t * info, void * context)
     if (sig != SIGTSTP) /// This signal is used for debugging.
     {
         /// The time that is usually enough for separate thread to print info into log.
-        sleepForSeconds(20); /// FIXME: use some feedback from threads that process stacktrace
Member Author

Implements the FIXME: the signal handler now waits for feedback from the thread that processes the stack trace.
The logs will be sent to the client in more cases.

+        /// Under MSan full stack unwinding with DWARF info about inline functions takes 101 seconds in one case.
+        for (size_t i = 0; i < 300; ++i)
+        {
+            /// We will synchronize with the thread printing the messages with an atomic variable to finish earlier.
+            if (fatal_error_printed)
+                break;
+
+            /// This coarse method of synchronization is perfectly ok for fatal signals.
+            sleepForSeconds(1);
+        }
         call_default_signal_handler(sig);
     }
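
The bounded polling loop above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the code from the PR: `waitForFlag` is a hypothetical helper, and `std::this_thread::sleep_for` stands in for ClickHouse's `sleepForSeconds`. The idea is the same: check an atomic flag once per interval and give up after a fixed number of attempts, so the wait ends early when the other thread finishes and never lasts forever.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical helper: polls `flag` up to `max_attempts` times, sleeping
// `interval` between checks. Returns true if the flag was observed as set.
bool waitForFlag(const std::atomic<bool> & flag,
                 std::size_t max_attempts,
                 std::chrono::milliseconds interval)
{
    for (std::size_t i = 0; i < max_attempts; ++i)
    {
        // Finish early as soon as the other side reports completion.
        if (flag.load())
            return true;

        // Coarse synchronization: sleep and check again.
        std::this_thread::sleep_for(interval);
    }

    // One last check after the final sleep.
    return flag.load();
}
```

With 300 attempts and a one-second interval, this gives the same upper bound of about five minutes as the loop in the diff.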

@@ -309,7 +320,9 @@ class SignalListener : public Poco::Runnable
}

         if (auto logs_queue = thread_ptr->getInternalTextLogsQueue())
+        {
             DB::CurrentThread::attachInternalTextLogsQueue(logs_queue, DB::LogsLevel::trace);
+        }
     }

std::string signal_description = "Unknown signal";
@@ -407,6 +420,8 @@ class SignalListener : public Poco::Runnable
         /// When everything is done, we will try to send these error messages to client.
         if (thread_ptr)
             thread_ptr->onFatalError();
+
+        fatal_error_printed = true;
     }
 };

2 changes: 2 additions & 0 deletions src/Server/TCPHandler.cpp
@@ -611,6 +611,8 @@ void TCPHandler::runImpl()
         /// It is important to destroy query context here. We do not want it to live arbitrarily longer than the query.
         query_context.reset();

+        CurrentThread::setFatalErrorCallback({});
Member Author

We reset the callback before the destruction of TCPHandler.
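
As a rough illustration of why the reset and the lock go together, here is a minimal sketch, not ClickHouse code. `FatalErrorHook` and its methods are hypothetical names. It assumes the callback may capture a pointer to an object with a limited lifetime, so the owner clears it before being destroyed, and that setting and invoking the callback are serialized by one mutex.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical holder for a fatal error callback.
class FatalErrorHook
{
public:
    // Installs a callback; passing {} clears it.
    void set(std::function<void()> cb)
    {
        std::lock_guard<std::mutex> lock(mutex);
        callback = std::move(cb);
    }

    // Invokes the callback if one is installed. Returns whether it ran.
    // The lock ensures the callback cannot be cleared while it is running.
    bool fire()
    {
        std::lock_guard<std::mutex> lock(mutex);
        if (!callback)
            return false;
        callback();
        return true;
    }

private:
    std::mutex mutex;
    std::function<void()> callback;
};
```

An owner whose callback captures `this` would call `set({})` before its own destruction, after which `fire()` becomes a no-op.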


         if (is_interserver_mode)
         {
             /// We don't really have session in interserver mode, new one is created for each query. It's better to reset it now.