
More parallel execution for queries with FINAL #36396

Merged
merged 30 commits into from
Jun 15, 2022

Conversation

nickitat
Member

@nickitat nickitat commented Apr 18, 2022

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Now we split data parts into layers and distribute them among threads instead of whole parts to make the execution of queries with FINAL more data-parallel.

An example of a query pipeline:
(Expression)
ExpressionTransform × 8
  (SettingQuotaAndLimits)
    (ReadFromMergeTree)
    ExpressionTransform × 8
      ReplacingSorted 4 → 1
        ExpressionTransform × 4
          FilterTransform × 4
          Description: filter values in [(4382720), +inf)
            MergeTreeInOrder × 4 0 → 1
              ReplacingSorted 4 → 1
                ExpressionTransform × 4
                  FilterTransform × 4
                  Description: filter values in [(3751936), (4382720))
                    MergeTreeInOrder × 4 0 → 1
                      ReplacingSorted 4 → 1
                        ExpressionTransform × 4
                          FilterTransform × 4
                          Description: filter values in [(3129344), (3751936))
                            MergeTreeInOrder × 4 0 → 1
                              ReplacingSorted 4 → 1
                                ExpressionTransform × 4
                                  FilterTransform × 4
                                  Description: filter values in [(2498560), (3129344))
                                    MergeTreeInOrder × 4 0 → 1
                                      ReplacingSorted 4 → 1
                                        ExpressionTransform × 4
                                          FilterTransform × 4
                                          Description: filter values in [(1875968), (2498560))
                                            MergeTreeInOrder × 4 0 → 1
                                              ReplacingSorted 4 → 1
                                                ExpressionTransform × 4
                                                  FilterTransform × 4
                                                  Description: filter values in [(1245184), (1875968))
                                                    MergeTreeInOrder × 4 0 → 1
                                                      ReplacingSorted 4 → 1
                                                        ExpressionTransform × 4
                                                          FilterTransform × 4
                                                          Description: filter values in [(622592), (1245184))
                                                            MergeTreeInOrder × 4 0 → 1
                                                              ReplacingSorted 4 → 1
                                                                ExpressionTransform × 4
                                                                  FilterTransform × 4
                                                                  Description: filter values in [-inf, (622592))
                                                                    MergeTreeInOrder × 4 0 → 1
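The eight layers in the dump above correspond to seven border values over the primary key. As a rough illustration of the idea only (invented names and types, not ClickHouse's actual code; a single `long` stands in for the possibly compound primary key), turning chosen borders into the half-open filter ranges could be sketched as:

```cpp
#include <limits>
#include <vector>

// Illustrative sketch: KeyRange and makeLayers are invented for this example.
struct KeyRange
{
    long begin; // inclusive; numeric_limits<long>::min() stands in for -inf
    long end;   // exclusive; numeric_limits<long>::max() stands in for +inf
};

// Split the whole key space into borders.size() + 1 non-overlapping layers,
// mirroring the FilterTransform descriptions in the pipeline dump.
std::vector<KeyRange> makeLayers(const std::vector<long> & borders)
{
    std::vector<KeyRange> layers;
    long prev = std::numeric_limits<long>::min();
    for (long border : borders)
    {
        layers.push_back({prev, border});
        prev = border;
    }
    layers.push_back({prev, std::numeric_limits<long>::max()});
    return layers;
}
```

With the seven borders from the dump (622592, 1245184, ..., 4382720) this yields the eight ranges shown, each merged by its own ReplacingSorted chain.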

perf tests

@nickitat nickitat added the can be tested Allows running workflows for external contributors label Apr 18, 2022
@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Apr 18, 2022
@nickitat nickitat added pr-performance Pull request with some performance improvements and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Apr 21, 2022
@nickitat nickitat changed the title [WIP] More data parallel final [WIP] More data parallel execution for queries with FINAL Apr 21, 2022
@nickitat nickitat changed the title [WIP] More data parallel execution for queries with FINAL More data parallel execution for queries with FINAL Apr 21, 2022
@nickitat nickitat changed the title More data parallel execution for queries with FINAL More parallel execution for queries with FINAL Apr 21, 2022
@nickitat nickitat marked this pull request as ready for review April 22, 2022 11:31
@nickitat
Member Author

@Mergifyio update

@mergify
Contributor

mergify bot commented Apr 23, 2022

update

✅ Branch has been successfully updated

@KochetovNicolai KochetovNicolai self-assigned this Apr 28, 2022
extern const int LOGICAL_ERROR;
}

struct IIndexAccess
Member

Generally, it's good to write a comment for

  • every class (both interface and implementation)
  • all virtual methods
  • and, actually, every method whose purpose is not obvious at first glance

Member

Looks like this interface has a single implementation.
Do we actually need it? I think that, generally, you should not create an interface if you don't plan to have more than one implementation...


struct IIndexAccess
{
struct Value : std::vector<Field>
Member

Inheritance from std containers is suspicious. Maybe use composition?
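A minimal sketch of the composition alternative, with a stand-in type in place of `DB::Field` and only the operations callers actually need forwarded:

```cpp
#include <cstddef>
#include <vector>

using Field = long; // stand-in for DB::Field in this sketch

// Wrap the vector instead of inheriting from it, exposing a deliberately
// narrow interface.
struct Value
{
    std::vector<Field> key_columns; // key column values at one index mark

    size_t size() const { return key_columns.size(); }
    Field & operator[](size_t i) { return key_columns[i]; }
    const Field & operator[](size_t i) const { return key_columns[i]; }
};
```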

Comment on lines 70 to 71
return std::accumulate(
parts.begin(), parts.end(), static_cast<size_t>(0), [](size_t sum, const auto & part) { return sum + part.getRowsCount(); });
Member

OK, but I would prefer a for loop :)
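The for-loop equivalent the reviewer suggests, sketched with a hypothetical minimal `Part` type (the real code sums `getRowsCount()` over MergeTree data parts):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a data part.
struct Part
{
    size_t rows = 0;
    size_t getRowsCount() const { return rows; }
};

// Plain loop instead of std::accumulate; same result, arguably easier to read.
size_t totalRows(const std::vector<Part> & parts)
{
    size_t sum = 0;
    for (const auto & part : parts)
        sum += part.getRowsCount();
    return sum;
}
```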

@UnamedRus
Contributor

UnamedRus commented Apr 28, 2022

Hm, can we use this to lower memory usage during GROUP BY column_from_order_by?

If each thread has its own range of column_from_order_by values, threads will not need to hold all possible values in their own hash tables.

For example, optimize_aggregation_in_order doesn't parallelize as much as it could.

Comment on lines 56 to 58
// NULL_LAST
if (value[i].isNull())
value[i] = POSITIVE_INFINITY;
Member

Not clear: why is this needed?

enum class Type
{
Border,
RangeBeginning,
Member

Maybe we can also add a RangeEnd event to remove some ranges from the event_queue.
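A hedged sketch of the idea behind this suggestion (invented names; the event type here is simplified and omits the Border case from the diff): with explicit RangeEnd events, a sweep over the sorted event queue can drop finished ranges instead of keeping them around, e.g. to track how many part ranges are open at any point:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

enum class Type
{
    RangeBeginning,
    RangeEnd,
};

// Sweep over (key, event) pairs and return the maximum number of part ranges
// open at once; RangeEnd events are what let finished ranges leave the queue.
size_t maxOpenRanges(std::vector<std::pair<long, Type>> events)
{
    std::sort(events.begin(), events.end(), [](const auto & a, const auto & b)
    {
        if (a.first != b.first)
            return a.first < b.first;
        // For half-open ranges, process RangeEnd before RangeBeginning at equal keys.
        return a.second == Type::RangeEnd && b.second == Type::RangeBeginning;
    });

    size_t open = 0;
    size_t max_open = 0;
    for (const auto & [key, type] : events)
    {
        if (type == Type::RangeBeginning)
            max_open = std::max(max_open, ++open);
        else
            --open;
    }
    return max_open;
}
```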

@nickitat nickitat marked this pull request as ready for review May 11, 2022 18:33
@jorisgio
Contributor

jorisgio commented Jun 3, 2022

We have been running this code for 2 weeks, and from my perspective it is very promising and delivering a huge performance gain 🎉. Here are some numbers (unfortunately I did not get any numbers before the patch to compare, but I can tell you it was an order of magnitude slower).

311.49 billion rows, 13.44 TB (33.43 billion rows/s., 1.44 TB/s.)

And it seems to scale really well.

The two pain points I have are mostly not directly related to this PR:

  1. The number of ranges directly depends on max_final_threads; it is like max_threads for normal queries. But for distributed queries, I would love to have a way to tweak threads per shard based on data size, some setting like max_range_size, so that it adjusts to the data to process.
  2. Minor and not directly related: it seems FINAL cannot stack with optimize_aggregation_in_order and keep using the ranges in the upper layer (but this can somehow be worked around with smarter queries).

@nickitat
Member Author

nickitat commented Jun 3, 2022

> We have been running this code for 2 weeks, and from my perspective it is very promising and delivering a huge performance gain. Here are some numbers (unfortunately I did not get any numbers before the patch to compare, but I can tell you it was an order of magnitude slower).
>
> 311.49 billion rows, 13.44 TB (33.43 billion rows/s., 1.44 TB/s.)
>
> And it seems to scale really well.
>
> The two pain points I have are mostly not directly related to this PR:
>
> 1. The number of ranges directly depends on max_final_threads; it is like max_threads for normal queries. But for distributed queries, I would love to have a way to tweak threads per shard based on data size, some setting like max_range_size, so that it adjusts to the data to process.
> 2. Minor and not directly related: it seems FINAL cannot stack with optimize_aggregation_in_order and keep using the ranges in the upper layer (but this can somehow be worked around with smarter queries).

Thank you for the feedback!
Regarding your points:

  1. Each node will do the splitting on its own (in the best way the current implementation can provide), so in this sense each node will adjust to the data it is looking at. What do you want to achieve by asking a node to split the data into more ranges than it has CPU cores?
  2. We definitely want to integrate this splitting functionality with aggregation in order. Without cross-thread merging, AIO should perform significantly better.

@UnamedRus
Contributor

And it can make sense to integrate with parallel replicas processing: #26748

@jorisgio
Contributor

jorisgio commented Jun 10, 2022

> each node will do the splitting on its own (in the best way the current implementation could provide), so in this sense, each node will adjust to the data it is looking into. what do you want to achieve by asking a node to split the data into more ranges than it has CPU cores?

If you have a server with a high core count but heavy load, it makes sense to run with max_final_threads=16 to process many requests in parallel. But if there is some imbalance in data distribution, one server might have twice as much data and could benefit from max_final_threads=32 only there. So a setting like 'max_ranges_size' instead would achieve that, with max_final_threads as an upper bound?

> we definitely want to integrate this splitting functionality with aggregation in order. without cross-thread merging, AIO should perform significantly better

That would be great 💯. My specific question, though, is: do you plan to make it stack? For cases like:

SELECT * FROM table FINAL WHERE somefilters LIMIT 1 BY some_prefix_of_pk

a good usecase to get first/last entry in time log data
Or stuff like

SELECT count(events) FROM table FINAL WHERE somefilters GROUP BY some_prefix_of_pk

In those cases it makes a lot of sense to have GROUP BY operate on the same ranges as FINAL? Though that is actually not totally trivial, because the ranges are computed for the full primary key, not the prefix, so some aggregations are not local unless only the prefix is used to compute the ranges.

@nickitat
Member Author

> If you have a server with a high core count but heavy load, it makes sense to run with max_final_threads=16 to process many requests in parallel. But if there is some imbalance in data distribution, one server might have twice as much data and could benefit from max_final_threads=32 only there. So a setting like 'max_ranges_size' instead would achieve that, with max_final_threads as an upper bound?

FINAL won't run faster if you set max_final_threads to a value higher than the number of cores, because each FINAL thread fully utilizes its core (assuming data is distributed evenly), and in practice it works as if it won't let any other thread execute anything until it finishes. Memory consumption also won't decrease if you increase the number of final threads.

> In those cases it makes a lot of sense to have GROUP BY operate on the same ranges as FINAL

Yep, AIO should be able to use the splitting regardless of the presence of FINAL in the query.

src/QueryPipeline/printPipeline.h Show resolved Hide resolved
src/Processors/QueryPlan/PartsSplitter.cpp Show resolved Hide resolved
src/Processors/QueryPlan/PartsSplitter.cpp Show resolved Hide resolved
src/Processors/QueryPlan/PartsSplitter.cpp Show resolved Hide resolved
src/Processors/QueryPlan/PartsSplitter.cpp Show resolved Hide resolved
src/Processors/QueryPlan/PartsSplitter.cpp Show resolved Hide resolved
@nickitat
Member Author

@rschu1ze Robert, thank you for the review. I will make the changes in a separate PR, to let this one land in the release.

@nickitat nickitat merged commit c8afeaf into ClickHouse:master Jun 15, 2022
tavplubix added a commit that referenced this pull request Jun 15, 2022
@tavplubix
Member

@nickitat, you forgot to take a look at failed tests before merging

nickitat added a commit to nickitat/ClickHouse that referenced this pull request Jun 15, 2022
@tavplubix tavplubix mentioned this pull request Jun 16, 2022
amosbird pushed a commit to amosbird/ClickHouse that referenced this pull request Jun 22, 2022
nickitat added a commit that referenced this pull request Jun 22, 2022
* Revert "Revert "More parallel execution for queries with `FINAL` (#36396)""

This reverts commit 5bfb152.

* fix tests

* fix review suggestions

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Labels
can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speeding up Final by 500% by splitting query into UNION ALL of non overlapping PK ranges
7 participants