Fix possible permanent "Cannot schedule a task" error #9154

azat · 2020-02-17T07:18:22Z

Changelog category (leave one):

Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix possible permanent "Cannot schedule a task" error (due to unhandled exception in ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread)

Refs: #6833

azat · 2020-02-17T09:47:03Z

00974_query_profiler | FAIL

Fails on upstream/master too

alexey-milovidov

OK, but I don't understand what the fix is.

alexey-milovidov · 2020-02-17T11:33:22Z

No logic in ThreadPool was changed. It should work with unhandled exceptions regardless of the change.

azat · 2020-02-17T11:40:47Z

No logic in ThreadPool was changed. It should work with unhandled exceptions regardless of the change.

In case of an unhandled exception, the pool shuts itself down. In some cases the pool that gets terminated is the global one, which leads to a permanent "Cannot schedule a task" error for every new connection/query.

d83f249 has some details, if this is not enough, ping me, I can try to get back with reproducer
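A minimal sketch of the failure mode described above, assuming a toy single-worker pool. `ToyPool` and `poolSurvivesThrowingJob` are hypothetical names for illustration only; this is not the real ThreadPoolImpl, it only mimics the "shut down on unhandled exception" behaviour discussed here:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>
#include <utility>

// Hypothetical toy pool with one worker; NOT the real ThreadPoolImpl.
class ToyPool
{
public:
    ToyPool() : worker([this] { loop(); }) {}

    ~ToyPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            shutdown = true;
        }
        cv.notify_all();
        worker.join();
    }

    // Throws once the pool has been shut down.
    void schedule(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            if (shutdown)
                throw std::runtime_error("Cannot schedule a task");
            jobs.push(std::move(job));
        }
        cv.notify_one();
    }

    // Blocks until the pool is idle or has been shut down.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex);
        idle_cv.wait(lock, [this] { return shutdown || (jobs.empty() && !busy); });
    }

private:
    void loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                cv.wait(lock, [this] { return shutdown || !jobs.empty(); });
                if (shutdown)
                    return;
                job = std::move(jobs.front());
                jobs.pop();
                busy = true;
            }
            try
            {
                job();
            }
            catch (...)
            {
                // An unhandled exception stops the loop for good.
                std::lock_guard<std::mutex> lock(mutex);
                shutdown = true;
                busy = false;
                idle_cv.notify_all();
                return;
            }
            {
                std::lock_guard<std::mutex> lock(mutex);
                busy = false;
            }
            idle_cv.notify_all();
        }
    }

    std::mutex mutex;
    std::condition_variable cv;
    std::condition_variable idle_cv;
    std::queue<std::function<void()>> jobs;
    bool shutdown = false;
    bool busy = false;
    std::thread worker;  // declared last so the members above are ready first
};

// Returns true if scheduling still works after one job threw.
bool poolSurvivesThrowingJob()
{
    ToyPool pool;
    pool.schedule([] { throw std::runtime_error("Memory limit exceeded"); });
    pool.wait();
    try
    {
        pool.schedule([] {});
        return true;
    }
    catch (const std::runtime_error &)
    {
        return false;
    }
}
```

With a pool that lives as long as the process, a single throwing job is enough for every later `schedule()` call to fail, which is the permanent error reported in #6833.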

alexey-milovidov · 2020-02-17T12:48:50Z

In case of an unhandled exception, the pool shuts itself down. In some cases the pool that gets terminated is the global one, which leads to a permanent "Cannot schedule a task" error for every new connection/query.

We don't use GlobalThreadPool directly, only via ThreadFromGlobalPool.

alexey-milovidov · 2020-02-17T12:50:57Z

The proper fix will be:

don't throw from the job function of ThreadFromGlobalPool.
https://clickhouse-test-reports.s3.yandex.net/codebrowser/html_report/ClickHouse/dbms/src/Common/ThreadPool.h.html#164

azat · 2020-02-17T13:04:36Z

don't throw from the job function of ThreadFromGlobalPool.

Ok.

Should onFinish* be wrapped in try/catch/onException anyway? (It still looks worth adding.)

dbms/src/DataStreams/ParallelInputsProcessor.h
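A rough sketch of what wrapping onFinish* in try/catch/onException could look like. `ToyHandler`, `threadBody` and `demoHandlerCatchesFinishError` are hypothetical stand-ins; the real ParallelInputsHandler interface and the thread function in ParallelInputsProcessor.h differ in detail:

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>

// Hypothetical handler; only loosely modelled on ParallelInputsHandler.
struct ToyHandler
{
    std::exception_ptr first_exception;
    bool finish_called = false;

    void onFinishThread(std::size_t /*thread_num*/)
    {
        // Stands in for work that may throw, e.g. writing a temporary file.
        throw std::runtime_error("Memory limit exceeded");
    }

    void onFinish() { finish_called = true; }

    void onException(std::exception_ptr & exception, std::size_t /*thread_num*/)
    {
        // Remember only the first failure.
        if (!first_exception)
            first_exception = exception;
    }
};

// Thread body: nothing escapes; failures are routed to handler.onException().
template <typename Handler>
void threadBody(Handler & handler, std::size_t thread_num)
{
    std::exception_ptr exception;
    try
    {
        handler.onFinishThread(thread_num);
        handler.onFinish();
    }
    catch (...)
    {
        exception = std::current_exception();
    }

    if (exception)
        handler.onException(exception, thread_num);
}

// Returns true if the exception was captured and did not escape the thread body.
bool demoHandlerCatchesFinishError()
{
    ToyHandler handler;
    threadBody(handler, 0);
    return handler.first_exception != nullptr && !handler.finish_called;
}
```

The point of the wrapper is that the job function handed to the pool never throws, so the pool's own exception handling is never triggered.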

azat · 2020-02-17T19:36:24Z

don't throw from the job function of ThreadFromGlobalPool.

So the exception needs to be re-thrown somehow. This can be done from join(), but in that case the following has to be considered:

each join() that is called from a destructor has to be modified (not a problem)
ThreadPoolImpl will not know that an exception occurred until the destructor (finalize)

I'm suggesting not to shut down the loop on exceptions. For now this can be done only if this is the global pool, but I would prefer not to add this condition, since it would make the code less readable and the exception will be thrown from wait() anyway. But I may be missing something, @alexey-milovidov?
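To make the join() variant concrete, here is a small sketch assuming a hypothetical wrapper (`CapturingThread`) around std::thread. ThreadFromGlobalPool does not work this way; the sketch only shows the two consequences listed above: the exception surfaces no earlier than join(), and a destructor that joins has to swallow it.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>
#include <utility>

// Hypothetical wrapper; not ThreadFromGlobalPool.
class CapturingThread
{
public:
    explicit CapturingThread(std::function<void()> job)
        : thread([this, job = std::move(job)]
          {
              // The job function itself never throws into the pool/thread.
              try { job(); }
              catch (...) { exception = std::current_exception(); }
          })
    {
    }

    ~CapturingThread()
    {
        // A destructor must not throw, so a pending exception is dropped here.
        if (thread.joinable())
            thread.join();
    }

    // Joins the thread and rethrows whatever the job threw.
    void join()
    {
        thread.join();
        if (exception)
            std::rethrow_exception(exception);
    }

private:
    std::exception_ptr exception;  // declared before thread, so it exists when the thread starts
    std::thread thread;
};

// Returns true if the job's exception is observed by the caller of join().
bool joinRethrows()
{
    CapturingThread t([] { throw std::runtime_error("boom"); });
    try
    {
        t.join();
    }
    catch (const std::runtime_error &)
    {
        return true;
    }
    return false;
}
```

Nothing outside the wrapper learns about the failure until someone calls join(), which is the drawback mentioned for ThreadPoolImpl.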

azat · 2020-02-18T04:16:03Z

I'm suggesting not to shut down the loop on exceptions. For now this can be done only if this is the global pool, but I would prefer not to add this condition

Decided to do this for the global pool only (at least for now).
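The policy can be modelled in a few lines. This is a single-threaded sketch under assumptions: `ToyLoop`, its `shutdown_on_exception` flag and `acceptedJobs` are illustrative names, jobs run inline instead of on worker threads, and the real change in ThreadPool may be structured differently.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical single-threaded model of the worker-loop policy; not ThreadPoolImpl.
class ToyLoop
{
public:
    explicit ToyLoop(bool shutdown_on_exception_) : shutdown_on_exception(shutdown_on_exception_) {}

    // Runs the job inline, the way one worker iteration would.
    void schedule(const std::function<void()> & job)
    {
        if (shutdown)
            throw std::runtime_error("Cannot schedule a task");
        try
        {
            job();
        }
        catch (...)
        {
            if (!first_exception)
                first_exception = std::current_exception();
            // A "global" loop would be created with this flag set to false.
            if (shutdown_on_exception)
                shutdown = true;
        }
    }

    // Rethrows the first stored exception, if any, and clears it.
    void wait()
    {
        if (first_exception)
        {
            std::exception_ptr e;
            std::swap(e, first_exception);
            std::rethrow_exception(e);
        }
    }

private:
    const bool shutdown_on_exception;
    bool shutdown = false;
    std::exception_ptr first_exception;
};

// Returns how many of three jobs were accepted when the first one throws.
int acceptedJobs(bool shutdown_on_exception)
{
    ToyLoop loop(shutdown_on_exception);
    int accepted = 0;
    for (int i = 0; i < 3; ++i)
    {
        try
        {
            loop.schedule([i] { if (i == 0) throw std::runtime_error("boom"); });
            ++accepted;
        }
        catch (const std::runtime_error &)
        {
            // Rejected: the loop has been shut down.
        }
    }
    return accepted;
}
```

With the flag enabled only the first job is accepted; with it disabled all three are, and the stored exception can still be reported through wait().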

filimonov · 2020-03-14T21:10:22Z

@azat Is it possible to create a test case for that?

What is happening with 00974_query_profiler ?

@alexey-milovidov any chances to finish the review?

azat · 2020-03-15T06:36:47Z

Is it possible to create a test case for that?

It is easy to cover this with a unit test, but that is not very interesting, since this is a problem of the callers.
It is also possible to reproduce this via some query (i.e. function? test), but it is not that easy, since the exception has to be thrown from the merge stage only (though this can be reproduced by using a partition with an exact size).

Otherwise GlobalThreadPool can be terminated (for example due to an exception from ParallelInputsHandler::onFinish/onFinishThread, i.e. from ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread, since writeToTemporaryFile() can definitely throw) and the server will no longer accept new connections and/or execute queries.

Here is a possible stack trace (it is a bit inaccurate, due to optimizations I guess, and it was obtained with DB::tryLogCurrentException() in the catch block of ThreadPoolImpl::worker()):

    2020.02.16 22:30:40.415246 [ 45909 ] {} <Error> ThreadPool: Unhandled exception in the ThreadPool(10000,1000,10000) the loop will be shutted down: Code: 241, e.displayText() = DB::Exception: Memory limit (total) exceeded: would use 279.40 GiB (attempt to allocate chunk of 4205536 bytes), maximum: 279.40 GiB, Stack trace (when copying this message, always include the lines below):

    1. Common/Exception.cpp:35: DB::Exception::Exception(...)
    ...
    6. Common/Allocator.h:102: void DB::PODArrayBase<8ul, 4096ul, Allocator<false, false>, 15ul, 16ul>::reserve<>(unsigned long) (.part.0)
    7. Interpreters/Aggregator.cpp:1040: void DB::Aggregator::writeToTemporaryFileImpl<...>(...)
    8. Interpreters/Aggregator.cpp:719: DB::Aggregator::writeToTemporaryFile(...)
    9. include/memory:4206: DB::Aggregator::writeToTemporaryFile(...)
    10. DataStreams/ParallelInputsProcessor.h:223: DB::ParallelInputsProcessor<DB::ParallelAggregatingBlockInputStream::Handler>::thread(...)

Refs: ClickHouse#6833 (comment) (a reference to a particular comment, since I'm not sure about the initial issue)

azat · 2020-03-15T19:35:51Z

It is also possible to reproduce this via some query (i.e. function? test), but it is not that easy, since the exception should be thrown from the merge stage only (but this can be reproduced, by using the partition with exact size)

So I tried to do this just for fun, but because of compression (for temporary files in aggregator) this approach cannot be used.

What is happening with 00974_query_profiler ?

I have no idea, but after the rebase it does not fail anymore (maybe it was some issue that has already been fixed in master? Anyway, the logs weren't verbose enough to investigate this).

alexey-milovidov · 2020-03-19T18:35:42Z

@azat the query profiler test was temporarily broken in master a few weeks ago, then fixed.
See #9472 (comment)

alexey-milovidov · 2020-03-19T18:37:58Z

The original issue (ParallelAggregatingBlockInputStream::Handler::onFinish/onFinishThread) should not affect versions 20.3+, because experimental_use_processors is enabled.

azat · 2020-03-19T18:59:09Z

should not affect versions 20.3+, because experimental_use_processors is enabled.

It was found on 20.2.1.2337-23 (even though AFAIK there wasn't an official release), and AFAICS processors had already been enabled there:

$ git show v20.2.1.2337-testing:dbms/src/Core/Settings.h | fgrep processors
    M(SettingBool, experimental_use_processors, true, "Use processors pipeline.", 0) \

alexey-milovidov · 2020-03-19T19:11:01Z

@azat But how is it related to ParallelInputsProcessor, which (despite its name) has no relationship to processors?

azat · 2020-03-19T21:38:02Z

@azat But how is it related to ParallelInputsProcessor, which (despite its name) has no relationship to processors?

Hm, the query that triggered it had been initiated by the Distributed engine, and I guess this was possible before #8929 (v20.2.1.2425-testing-181-g1e8389eceb)

(But I guess it is not important for stable releases)

alexey-milovidov · 2020-05-30T18:55:20Z

Sorry, this PR introduced a race condition:
https://clickhouse-test-reports.s3.yandex.net/11301/8a4c2380ddd7a1e316f4d3d1eb5d178bccabec1f/stress_test_(thread)/stderr.log

(I will fix it)

alexey-milovidov · 2022-10-11T00:01:32Z

@azat we suspect that this issue is not fixed: #33712

qoega added the no-docs-needed label Feb 17, 2020

alexey-milovidov approved these changes Feb 17, 2020

azat force-pushed the ParallelInputsProcessor-GlobalThreadPool-shutdown-fix branch from 580b5b4 to ff82acf Compare February 17, 2020 20:40

alexey-milovidov removed the no-docs-needed label Feb 27, 2020

blinkov added the no-docs-needed label Feb 27, 2020

azat requested a review from alexey-milovidov March 12, 2020 13:29

azat added 2 commits March 15, 2020 13:13

Call onException if ParallelInputsHandler::onFinish* throws

6969191

azat force-pushed the ParallelInputsProcessor-GlobalThreadPool-shutdown-fix branch from ff82acf to a15b2da Compare March 15, 2020 10:13

alexey-milovidov merged commit 8d9aba4 into ClickHouse:master Mar 19, 2020

alexey-milovidov added the pr-bugfix Pull request with bugfix, not backported by default label Mar 19, 2020

azat mentioned this pull request Mar 19, 2020

DB::Exception: Cannot schedule a task #6833

Closed

alexey-milovidov added a commit that referenced this pull request Mar 19, 2020

Remove unused (obsolete) code from ThreadPool #9154

64a45e3

alexey-milovidov added v19.14 labels Mar 19, 2020

alexey-milovidov added v19.18 labels Mar 19, 2020

azat mentioned this pull request Mar 19, 2020

Remove unused (obsolete) code from ThreadPool #9761

Merged

azat deleted the ParallelInputsProcessor-GlobalThreadPool-shutdown-fix branch March 19, 2020 21:43

azat mentioned this pull request Apr 27, 2020

DB::Exception: Cannot schedule a task #10504

Closed

alexey-milovidov mentioned this pull request May 30, 2020

Fix very rare race condition in ThreadPool #11314

Merged

akuzm added v20.3-backported and removed v20.3 labels Jun 5, 2020

huoarter mentioned this pull request Nov 4, 2021

cannot build v19.11 For patch on ubuntu #31079

Closed
