Parallelize query processing right after reading FROM ... #48727
Conversation
- mostly simple or external storages
+ avoid parallelization after SourceFromSingleChunk, SystemOne
see test_dictionaries_redis/test_long.py::test_redis_dict_long
https://s3.amazonaws.com/clickhouse-test-reports/48727/bb60f10035f0ce7538b98bab02c219dff474fdae/integration_tests__asan__[1/6].html
test_storage_mysql/test.py:test_settings_connection_wait_timeout
Affected:
Note: I may have missed something. We probably need to add some tests/performance tests in addition.
+ disable parallelization for storage Null
Ok. Many performance tests use zeros/zeros_mt to check single/multithread performance.
Rather than changing the tests, it's much simpler to avoid parallelization after
It'd be nice to understand the reason for the failure with dictionaries. If output is parallelized for dictionaries, the count() in the query result increments with each additional thread:

SELECT count(), uniqExact(date), uniqExact(id) FROM redis_dict SETTINGS max_threads=2
1001 | 2 | 1000

SELECT count(), uniqExact(date), uniqExact(id) FROM redis_dict SETTINGS max_threads=3
1002 | 2 | 1000
...

The reason is that a row is generated somewhere for each stream w/o data. But I didn't figure out yet how/where.
@@ -133,6 +133,13 @@ void IStorage::read(
    size_t num_streams)
{
    auto pipe = read(column_names, storage_snapshot, query_info, context, processed_stage, max_block_size, num_streams);

    /// parallelize processing if not yet
- Should we do it here, or is it better to do it inside InterpreterSelectQuery?
- Are there any potential troubles with mutations and StorageFromMergeTreeDataPart?
- Should we do it here, or is it better to do it inside InterpreterSelectQuery?
I think it's ok to do it here with the following considerations ...
num_streams is provided by InterpreterSelectQuery as a recommendation, i.e. how many threads are available for data processing. The reading step has the following choices:
- (a) it knows the amount of data it can read, and it's not much data, so it creates only the necessary number of data streams based on parameters passed to IStorage::read(), i.e. max_block_size / storage_limits. In this case, we don't want to adjust the number of streams, and parallelizeOutputAfterReading() can return false
- (b) it's either an unknown amount of data or a known amount of data (but enough to utilize all available threads) -> in both cases the output is parallelized by num_streams
- Are there any potential troubles with mutations and StorageFromMergeTreeDataPart?

This change affects only storages that use the default plan step to read from storage, ReadFromStorageStep. Sophisticated engines use specialized steps to read from their storage, like ReadFromMergeTree in the MergeTree case. StorageFromMergeTreeDataPart is not affected since it uses the ReadFromMergeTree step, which overrides the read() method where resize() is added.
Just curious: how does it not fail lots of tests in CI? When I tried
Probably they were fixed in #48525?
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Query processing is parallelized right after reading from a data source. Affected data sources are mostly simple or external storages like table functions url, file.