Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible permanent "Cannot schedule a task" error #9154

Conversation

azat
Copy link
Collaborator

@azat azat commented Feb 17, 2020

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix possible permanent "Cannot schedule a task" error (due to unhandled exception in ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread)

Refs: #6833

@azat
Copy link
Collaborator Author

azat commented Feb 17, 2020

00974_query_profiler | FAIL

Fails on upstream/master too

Copy link
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but I don't understand what is the fix.

@alexey-milovidov
Copy link
Member

No logic in ThreadPool was changed. It should work with unhandled exceptions regardless to the change.

@azat
Copy link
Collaborator Author

azat commented Feb 17, 2020

No logic in ThreadPool was changed. It should work with unhandled exceptions regardless to the change

In case of exception it will shutdown it, and in some cases the global pool can be terminated, which will lead to "cannot schedule task" permanently for every new connection/query

d83f249 has some details, if this is not enough, ping me, I can try to get back with reproducer

@alexey-milovidov
Copy link
Member

In case of exception it will shutdown it, and in some cases the global pool can be terminated, which will lead to "cannot schedule task" permanently for every new connection/query

We don't use GlobalThreadPool directly, only via ThreadFromGlobalPool.

@alexey-milovidov
Copy link
Member

The proper fix will be:

@azat
Copy link
Collaborator Author

azat commented Feb 17, 2020

don't throw from the job function of ThreadFromGlobalPool.

Ok.

Does onFinish* should be wrapped with try/catch/onException anyway? (looks like it still worth adding)

@azat
Copy link
Collaborator Author

azat commented Feb 17, 2020

don't throw from the job function of ThreadFromGlobalPool.

So the exception need to be re-thrown somehow, this can be done from the join(), but in this case the following should be done:

  • modify each join() that called in destructor -- not a problem
  • ThreadPoolImpl will not known that exception occurred until destructor (finalize)

I'm suggesting to not shutdown the loop on exceptions, for now this can done only if this is the global pool, but I would prefer not to add this condition, since this will make it less readable and the exception will be thrown from the wait anyway, but I can miss something, @alexey-milovidov ?

@azat azat force-pushed the ParallelInputsProcessor-GlobalThreadPool-shutdown-fix branch from 580b5b4 to ff82acf Compare February 17, 2020 20:40
@azat
Copy link
Collaborator Author

azat commented Feb 18, 2020

I'm suggesting to not shutdown the loop on exceptions, for now this can done only if this is the global pool, but I would prefer not to add this condition

Decided to make it for global pool only (at least for now)

@filimonov
Copy link
Contributor

filimonov commented Mar 14, 2020

@azat Is it possible to create a test case for that?

What is happening with 00974_query_profiler ?

@alexey-milovidov any chances to finish the review?

@azat
Copy link
Collaborator Author

azat commented Mar 15, 2020

Is it possible to create a test case for that?

It is easy to cover this with unit test, but this is not that interesting since this is the problem of the callers.
It is also possible to reproduce this via some query (i.e. function? test), but it is not that easy, since the exception should be thrown from the merge stage only (but this can be reproduced, by using the partition with exact size)

Otherwise GlobalThreadPool can be terminated (for example due to an
exception from the ParallelInputsHandler::onFinish/onFinishThread, from
ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread,
since writeToTemporaryFile() can definitelly throw) and the server will
not accept new connections (or/and execute queries) anymore.

Here is possible stacktrace (it is a bit inaccurate, due to
optimizations I guess, and it had been obtained with the
DB::tryLogCurrentException() in the catch block of the
ThreadPoolImpl::worker()):

    2020.02.16 22:30:40.415246 [ 45909 ] {} <Error> ThreadPool: Unhandled exception in the ThreadPool(10000,1000,10000) the loop will be shutted down: Code: 241, e.displayText() = DB::Exception: Memory limit (total) exceeded: would use 279.40 GiB (attempt to allocate chunk of 4205536 bytes), maximum: 279.40 GiB, Stack trace (when copying this message, always include the lines below):

    1.  Common/Exception.cpp:35: DB::Exception::Exception(...)
    ...
    6.  Common/Allocator.h:102: void DB::PODArrayBase<8ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::reserve<>(unsigned long) (.part.0)
    7.  Interpreters/Aggregator.cpp:1040: void DB::Aggregator::writeToTemporaryFileImpl<...>(...)
    8.  Interpreters/Aggregator.cpp:719: DB::Aggregator::writeToTemporaryFile(...)
    9.  include/memory:4206: DB::Aggregator::writeToTemporaryFile(...)
    10. DataStreams/ParallelInputsProcessor.h:223: DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::thread(...)

Refs: ClickHouse#6833 (comment)
(Reference to particular comment, since I'm not sure about the initial issue)
@azat azat force-pushed the ParallelInputsProcessor-GlobalThreadPool-shutdown-fix branch from ff82acf to a15b2da Compare March 15, 2020 10:13
@azat
Copy link
Collaborator Author

azat commented Mar 15, 2020

It is also possible to reproduce this via some query (i.e. function? test), but it is not that easy, since the exception should be thrown from the merge stage only (but this can be reproduced, by using the partition with exact size)

So I tried to do this just for fun, but because of compression (for temporary files in aggregator) this approach cannot be used.

What is happening with 00974_query_profiler ?

I have no idea, but after rebase it does not fails anymore (maybe some issue that has been fixed in master already? anyway logs wasn't verbose enough to investigate this)

@alexey-milovidov alexey-milovidov merged commit 8d9aba4 into ClickHouse:master Mar 19, 2020
@alexey-milovidov
Copy link
Member

alexey-milovidov commented Mar 19, 2020

@azat query profiler test was temporary broken in master a few weeks ago, then fixed.
See #9472 (comment)

@alexey-milovidov
Copy link
Member

The original issue

ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread

should not affect versions 20.3+, because experimental_use_processors is enabled.

@alexey-milovidov alexey-milovidov added the pr-bugfix Pull request with bugfix, not backported by default label Mar 19, 2020
@azat
Copy link
Collaborator Author

azat commented Mar 19, 2020

should not affect versions 20.3+, because experimental_use_processors is enabled.

It was found on 20.2.1.2337-23 (even though AFAIK there wasn't official release), and AFAICS the processors has been enabled:

$ git show v20.2.1.2337-testing:dbms/src/Core/Settings.h | fgrep processors
    M(SettingBool, experimental_use_processors, true, "Use processors pipeline.", 0) \

@alexey-milovidov
Copy link
Member

@azat But how it's related to ParallelInputsProcessor, that (despite it's name) has no relationship to processors?

@azat
Copy link
Collaborator Author

azat commented Mar 19, 2020

@azat But how it's related to ParallelInputsProcessor, that (despite it's name) has no relationship to processors?

Hm, query that triggers it had been initiated by the Distributed engine, and I guess it was possible in pre #8929 (v20.2.1.2425-testing-181-g1e8389eceb)

(But I guess it is not important for stable releases)

@azat azat deleted the ParallelInputsProcessor-GlobalThreadPool-shutdown-fix branch March 19, 2020 21:43
@alexey-milovidov
Copy link
Member

@alexey-milovidov
Copy link
Member

@azat we suspect that this issue is not fixed: #33712

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-docs-needed pr-bugfix Pull request with bugfix, not backported by default
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants