-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix possible permanent "Cannot schedule a task" error #9154
Fix possible permanent "Cannot schedule a task" error #9154
Conversation
Fails on upstream/master too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok but I don't understand what is the fix.
No logic in ThreadPool was changed. It should work with unhandled exceptions regardless to the change. |
In case of exception it will shutdown it, and in some cases the global pool can be terminated, which will lead to "cannot schedule task" permanently for every new connection/query d83f249 has some details, if this is not enough, ping me, I can try to get back with reproducer |
We don't use GlobalThreadPool directly, only via ThreadFromGlobalPool. |
The proper fix will be:
|
Ok. Does |
So the exception need to be re-thrown somehow, this can be done from the
I'm suggesting to not shutdown the loop on exceptions, for now this can done only if this is the global pool, but I would prefer not to add this condition, since this will make it less readable and the exception will be thrown from the |
580b5b4
to
ff82acf
Compare
Decided to make it for global pool only (at least for now) |
@azat Is it possible to create a test case for that? What is happening with 00974_query_profiler ? @alexey-milovidov any chances to finish the review? |
It is easy to cover this with unit test, but this is not that interesting since this is the problem of the callers. |
Otherwise GlobalThreadPool can be terminated (for example due to an exception from the ParallelInputsHandler::onFinish/onFinishThread, from ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread, since writeToTemporaryFile() can definitelly throw) and the server will not accept new connections (or/and execute queries) anymore. Here is possible stacktrace (it is a bit inaccurate, due to optimizations I guess, and it had been obtained with the DB::tryLogCurrentException() in the catch block of the ThreadPoolImpl::worker()): 2020.02.16 22:30:40.415246 [ 45909 ] {} <Error> ThreadPool: Unhandled exception in the ThreadPool(10000,1000,10000) the loop will be shutted down: Code: 241, e.displayText() = DB::Exception: Memory limit (total) exceeded: would use 279.40 GiB (attempt to allocate chunk of 4205536 bytes), maximum: 279.40 GiB, Stack trace (when copying this message, always include the lines below): 1. Common/Exception.cpp:35: DB::Exception::Exception(...) ... 6. Common/Allocator.h:102: void DB::PODArrayBase<8ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::reserve<>(unsigned long) (.part.0) 7. Interpreters/Aggregator.cpp:1040: void DB::Aggregator::writeToTemporaryFileImpl<...>(...) 8. Interpreters/Aggregator.cpp:719: DB::Aggregator::writeToTemporaryFile(...) 9. include/memory:4206: DB::Aggregator::writeToTemporaryFile(...) 10. DataStreams/ParallelInputsProcessor.h:223: DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::thread(...) Refs: ClickHouse#6833 (comment) (Reference to particular comment, since I'm not sure about the initial issue)
ff82acf
to
a15b2da
Compare
So I tried to do this just for fun, but because of compression (for temporary files in aggregator) this approach cannot be used.
I have no idea, but after rebase it does not fails anymore (maybe some issue that has been fixed in master already? anyway logs wasn't verbose enough to investigate this) |
@azat query profiler test was temporary broken in master a few weeks ago, then fixed. |
The original issue
should not affect versions 20.3+, because |
It was found on 20.2.1.2337-23 (even though AFAIK there wasn't official release), and AFAICS the processors has been enabled:
|
@azat But how it's related to |
Hm, query that triggers it had been initiated by the Distributed engine, and I guess it was possible in pre #8929 (v20.2.1.2425-testing-181-g1e8389eceb) (But I guess it is not important for stable releases) |
Sorry, this PR introduced race condition: (I will fix it) |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix possible permanent "Cannot schedule a task" error (due to unhandled exception in
ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread
)Refs: #6833