
ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups #9234

Closed

Conversation

@Dandandan (Contributor) commented Jan 17, 2021:

Currently, we loop over every key in the hashmap for each batch.

However, as we receive a batch, if we have a lot of groups in the group-by expression (or receive sorted data, etc.), then we can create a lot of empty batches and call update_batch for each of the keys already in the hashmap.

In this PR we keep track of which keys we received in the batch and only update the accumulators for those keys instead of all accumulators.
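
The idea, as a minimal sketch with hypothetical, simplified types (the real change lives in `group_aggregate_batch` in `hash_aggregate.rs` and tracks Arrow accumulator state per group):

```rust
use std::collections::HashMap;

// Hypothetical simplified state: per-group row indices for the current batch.
type Key = Vec<u8>;

fn group_aggregate_batch(
    accumulators: &mut HashMap<Key, Vec<usize>>,
    row_keys: &[Key], // the group-by key of each row in the batch
) {
    // Keys received in this batch; each key is recorded at most once.
    let mut batch_keys: Vec<Key> = vec![];

    for (row, key) in row_keys.iter().enumerate() {
        let indices = accumulators.entry(key.clone()).or_insert_with(Vec::new);
        if indices.is_empty() {
            // First row with this key in this batch (either a new key, or a
            // key whose indices were drained after the previous batch).
            batch_keys.push(key.clone());
        }
        indices.push(row);
    }

    // Before this PR, every key in `accumulators` was visited here, paying
    // for an `update_batch` call even when a group had no rows in the batch.
    // Now only the groups actually present in the batch are updated.
    for key in &batch_keys {
        let indices = accumulators.get_mut(key).expect("inserted above");
        // update_batch(accumulator, rows at `indices`) would run here.
        indices.clear(); // reset for the next batch
    }
}
```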

On the db-benchmark (h2oai/db-benchmark#182) this makes the following difference (mainly on q3 and q5; the other deltas seem to be noise). It doesn't completely solve the problem, but it already reduces it quite a bit.

This PR:
```
q1 took 340 ms
q2 took 1768 ms
q3 took 10975 ms
q4 took 337 ms
q5 took 13529 ms
```
Master:
```
q1 took 330 ms
q2 took 1648 ms
q3 took 16408 ms
q4 took 335 ms
q5 took 21074 ms
```


@Dandandan changed the title from "ARROW-11290: Address hash aggregate performance issue with low cardinality keys" to "ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with low cardinality keys" on Jan 17, 2021
@codecov-io commented Jan 17, 2021:

Codecov Report

Merging #9234 (eaf918e) into master (e73f205) will increase coverage by 0.00%.
The diff coverage is 100.00%.


```
@@           Coverage Diff           @@
##           master    #9234   +/-   ##
=======================================
  Coverage   81.61%   81.61%
=======================================
  Files         215      215
  Lines       51891    51897    +6
=======================================
+ Hits        42353    42358    +5
- Misses       9538     9539    +1
```

Impacted Files | Coverage Δ
--- | ---
...ust/datafusion/src/physical_plan/hash_aggregate.rs | 85.02% <100.00%> (+0.23%) ⬆️
rust/parquet/src/encodings/encoding.rs | 94.86% <0.00%> (-0.20%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Dandandan changed the title from "ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with low cardinality keys" to "ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups" on Jan 17, 2021
@jorgecarleitao (Member) left a comment:

Cool idea. LGTM. Thanks a lot, @Dandandan

```
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
     // Make sure we can create the accumulators or otherwise return an error
     create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
 
+    // Keys received in this batch
+    let mut batch_keys = vec![];
```
@jorgecarleitao (Member) commented:
Should this be a set rather than a vec since it is intended to track the unique set of keys in the batch?

@Dandandan (Contributor, Author) replied:
That's what I thought first, but this is checked already when pushing the keys to the vec.

@Dandandan (Contributor, Author) commented Jan 17, 2021:
It checks either for an empty indices vec (which means no rows with this key yet in this batch) or for the first row with this key, via `or_insert_with`.
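
Roughly, using the standard library's `Entry` API (a sketch of the check being described, not the exact DataFusion code):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

type Key = Vec<u8>;

/// Record `row` under `key`, pushing `key` into `batch_keys` at most once
/// per batch.
fn record_row(
    accumulators: &mut HashMap<Key, Vec<usize>>,
    batch_keys: &mut Vec<Key>,
    key: &Key,
    row: usize,
) {
    match accumulators.entry(key.clone()) {
        // Key known from an earlier batch: an empty indices vec means no
        // rows for it yet in *this* batch, so record it exactly once.
        Entry::Occupied(mut entry) => {
            let indices = entry.get_mut();
            if indices.is_empty() {
                batch_keys.push(key.clone());
            }
            indices.push(row);
        }
        // First row with this key at all — the `or_insert_with` case.
        Entry::Vacant(entry) => {
            batch_keys.push(key.clone());
            entry.insert(vec![row]);
        }
    }
}
```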

@Dandandan (Contributor, Author) commented:

Added a ticket for remaining work including some profiling information here: https://issues.apache.org/jira/browse/ARROW-11300

@alamb (Contributor) commented Jan 20, 2021:

I merged this branch into master locally and re-ran all the tests. Things looked good, so merging it in.

@alamb closed this in 6912869 on Jan 20, 2021
kszucs pushed a commit that referenced this pull request Jan 25, 2021
…sue with high number of groups

Closes #9234 from Dandandan/hash_agg_speed2

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
jorgecarleitao pushed a commit that referenced this pull request Jan 28, 2021
…ash aggregation with small groups

Based on #9234, this PR improves the situation described in https://issues.apache.org/jira/browse/ARROW-11300.

The current situation is that we call `take` on the arrays once per group, which is fine, but it causes a lot of small `Arrays` to be created/allocated when we have only a small number of rows in each group.

This improves the results of the group-by queries in db-benchmark:

PR:
```
q1 took 32 ms
q2 took 422 ms
q3 took 3468 ms
q4 took 44 ms
q5 took 3166 ms
q7 took 3081 ms
```

#9234 (the results differ from that PR's description because partitioning is now enabled, along with a custom allocator):

```
q1 took 34 ms
q2 took 389 ms
q3 took 4590 ms
q4 took 47 ms
q5 took 5152 ms
q7 took 3941 ms
```
The PR changes the algorithm to (see the sketch after this commit message):

* Collect the row indices of all keys in the batch, together with the offset at which each key's run of indices starts.
* `take` the arrays based on those indices in one go (so only one bigger allocation is required per array).
* Use `slice` with the offsets to extract each group's values from the arrays and pass them to the accumulators.

Closes #9271 from Dandandan/hash_agg_few_rows

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
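
A sketch of that take-then-slice pattern, assuming the `arrow` crate's `take` kernel and `Array::slice` (the function and parameter names here are hypothetical):

```rust
use arrow::array::{Array, ArrayRef, UInt32Array};
use arrow::compute::take;
use arrow::error::ArrowError;

/// Gather each group's rows from `column` with a single `take`, then hand
/// out zero-copy slices per group (a sketch, not the exact DataFusion code).
fn gather_per_group(
    column: &ArrayRef,
    all_indices: &UInt32Array, // row indices of every group, concatenated
    offsets: &[usize],         // group i owns all_indices[offsets[i]..offsets[i + 1]]
) -> Result<Vec<ArrayRef>, ArrowError> {
    // One `take` for the whole batch: a single larger allocation instead of
    // many small per-group allocations.
    let gathered = take(column.as_ref(), all_indices, None)?;

    // `slice` is zero-copy; each slice is passed to that group's accumulator.
    Ok(offsets
        .windows(2)
        .map(|w| gathered.slice(w[0], w[1] - w[0]))
        .collect())
}
```

Since `slice` only adjusts offsets into shared buffers, the per-group cost after the single `take` stays small.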
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021