Parallelize query processing right after reading FROM ... #48727

Merged · 22 commits · Apr 23, 2023

Conversation

@devcrafter (Member) commented Apr 12, 2023

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Query processing is parallelized right after reading from a data source. Affected data sources are mostly simple or external storages, such as the url and file table functions.

@devcrafter devcrafter marked this pull request as draft April 12, 2023 18:49
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-performance Pull request with some performance improvements label Apr 12, 2023
@alexey-milovidov alexey-milovidov self-assigned this Apr 12, 2023
@devcrafter (Member, Author) commented Apr 13, 2023

https://s3.amazonaws.com/clickhouse-test-reports/48727/bb60f10035f0ce7538b98bab02c219dff474fdae/integration_tests__asan__[1/6].html
count() returns an incorrect result with a Redis dictionary. I assume there is some bug around dictionaries (not yet clear whether it's Redis-specific or not). So, for now, avoid parallelization after reading from the Dictionary storage (see 7c84dc4).

@devcrafter devcrafter marked this pull request as ready for review April 14, 2023 23:57
@devcrafter (Member, Author) commented Apr 15, 2023

Affected:

  • table engines: mongodb, mysql, postgresql, sqlite, rocksdb, Hive, LogFamily engines, Buffer, Memory, KeeperMap, URL
  • table functions: remote, plus functions that use the engines listed above

Note:

  • the Dictionary table engine is not affected yet (see the comment above)
  • numbers is not affected (yet?), see the comment below

P.S. I may have missed something. We probably also need to add some tests/performance tests.

@alexey-milovidov (Member) commented Apr 15, 2023

Ok.

Many performance tests use zeros/zeros_mt to check single/multithread performance.
We can edit them by adding SETTINGS max_threads = 1 where it is zeros or numbers.
Single-threaded performance tests make sense for stability of the results.

@devcrafter (Member, Author) replied:

> Many performance tests use zeros/zeros_mt to check single/multithread performance.
> We can edit them by adding SETTINGS max_threads = 1 where it is zeros or numbers.
> Single-threaded performance tests make sense for stability of the results.

Rather than changing the tests, it's much simpler to avoid parallelization after zeros. It looks like we'd better stick to that for zeros and numbers, since they have multithreaded counterparts.
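
In code, the opt-out could look roughly like this. This is a minimal sketch, not the actual PR code; it assumes a per-storage hook like the parallelizeOutputAfterReading() discussed later in this thread, with an assumed signature:

class StorageSystemZeros : public IStorage
{
public:
    /// Sketch (assumed hook signature): zeros/zeros_mt model single-/multi-threaded
    /// reads explicitly, so keep exactly the number of streams the source produced
    /// instead of resizing the pipe after reading.
    bool parallelizeOutputAfterReading(ContextPtr /*context*/) const override
    {
        return false;
    }

    /// ... other members omitted.
};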

@devcrafter (Member, Author) commented Apr 15, 2023

It'd be nice to understand the reason for the failure with dictionaries. If output is parallelized for dictionaries, the query returns the correct result with max_threads = 1:

SELECT count(), uniqExact(date), uniqExact(id) FROM redis_dict SETTINGS max_threads = 1
1000 | 1 | 1000

Incrementing max_threads increments the count() result by 1, i.e.

SELECT count(), uniqExact(date), uniqExact(id) FROM redis_dict SETTINGS max_threads = 2
1001 | 2 | 1000

SELECT count(), uniqExact(date), uniqExact(id) FROM redis_dict SETTINGS max_threads = 3
1002 | 2 | 1000
...

The reason is that a row is generated somewhere for each stream without data, but I haven't figured out yet how or where.
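
The arithmetic above fits that hypothesis exactly. The following toy program (illustration only, not ClickHouse code; the "one spurious row per empty stream" behavior is exactly the assumption being tested) reproduces the observed counts:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const size_t dictionary_rows = 1000;
    for (size_t max_threads = 1; max_threads <= 3; ++max_threads)
    {
        /// After the resize, stream 0 holds all the data and the
        /// remaining max_threads - 1 streams are empty.
        std::vector<size_t> rows_per_stream(max_threads, 0);
        rows_per_stream[0] = dictionary_rows;

        size_t count = 0;
        for (size_t rows : rows_per_stream)
            count += (rows > 0) ? rows : 1; /// assumption: each empty stream yields one spurious row
        std::printf("max_threads=%zu -> count()=%zu\n", max_threads, count);
    }
    /// Prints 1000, 1001, 1002, matching the query results above.
}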

@@ -133,6 +133,13 @@ void IStorage::read(
size_t num_streams)
{
auto pipe = read(column_names, storage_snapshot, query_info, context, processed_stage, max_block_size, num_streams);

/// parallelize processing if not yet

A reviewer (Member) commented on this hunk:

  1. Should we do it here, or is it better to do it inside InterpreterSelectQuery?
  2. Are there any potential troubles with mutations and StorageFromMergeTreeDataPart?

@devcrafter (Member, Author) replied:

> 1. Should we do it here, or is it better to do it inside InterpreterSelectQuery?

I think it's OK to do it here, with the following considerations:

num_streams is provided by InterpreterSelectQuery as a recommendation, i.e. how many threads are available for data processing. The reading step has the following choices:

  • (a) it knows how much data it can read, and it's not much, so it creates only the necessary number of data streams based on the parameters passed to IStorage::read(), i.e. max_block_size/storage_limits. In this case we don't want to adjust the number of streams, and parallelizeOutputAfterReading() can return false

  • (b) the amount of data is either unknown, or known and large enough to utilize all available threads; in both cases the output is parallelized to num_streams

> 2. Are there any potential troubles with mutations and StorageFromMergeTreeDataPart?

This change affects only storages that use the default plan step for reading from storage, ReadFromStorageStep. Sophisticated engines use specialized steps to read from their storage, like ReadFromMergeTree in the MergeTree case.

StorageFromMergeTreeDataPart is not affected, since it uses the ReadFromMergeTree step, which overrides the read() method where the resize() is added.
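
For reference, the idea behind the hunk above boils down to something like the following. This is a sketch under the assumptions stated in this thread (the parallelizeOutputAfterReading() hook and a pipe resize), not the verbatim PR code:

void IStorage::read(
    QueryPlan & query_plan,
    const Names & column_names,
    const StorageSnapshotPtr & storage_snapshot,
    SelectQueryInfo & query_info,
    ContextPtr context,
    QueryProcessingStage::Enum processed_stage,
    size_t max_block_size,
    size_t num_streams)
{
    /// Build the pipe via the Pipe-based read() overload, as in the hunk above.
    auto pipe = read(column_names, storage_snapshot, query_info, context, processed_stage, max_block_size, num_streams);

    /// Sketch: if the source produced fewer output ports than the recommended
    /// num_streams (case (b) above), insert a resize so the rest of the query
    /// pipeline runs in parallel. Storages that size their own output
    /// (case (a)) opt out by returning false from the hook.
    if (parallelizeOutputAfterReading(context) && pipe.numOutputPorts() > 0 && pipe.numOutputPorts() < num_streams)
        pipe.resize(num_streams);

    /// ... wrap the pipe into a ReadFromStorageStep and add it to query_plan.
}

The resize redistributes blocks across num_streams output ports, which is also why it is skipped for storages like Dictionary (the count() bug above) and zeros/numbers (the performance-test concern above).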

@alexey-milovidov alexey-milovidov merged commit 67de39c into master Apr 23, 2023
144 checks passed
@alexey-milovidov alexey-milovidov deleted the parallel-processing-from-storages branch April 23, 2023 20:10
@al13n321 (Member) commented:
Just curious: how does it not fail lots of tests in CI? When I tried input_format_parquet_preserve_order = true by default, ~10 tests failed because of reordering (all straightforward to fix).

@devcrafter (Member, Author) replied:

> Just curious: how does it not fail lots of tests in CI? When I tried input_format_parquet_preserve_order = true by default, ~10 tests failed because of reordering (all straightforward to fix).

Probably they were fixed in #48525?
