Currently, hash aggregates perform well when there is a small number of output groups, but the db-benchmark (h2oai/db-benchmark#182) results show poor performance on data with a high number of output groups. #9234 improved the situation a bit, but DataFusion is still much slower than even the slowest published result.
This seems to be mostly due to the way we process individual keys/groups.
For each new key, we take the indices of the group. When a batch has many keys with only a small number of rows per group (just 1-2), this results in lots of small allocations, cache unfriendliness, and other overhead. In addition, the indices are converted from a Vec to an Array, making the situation worse (this accounts for ~22% of the instructions on the master branch!); other profiling hotspots also appear to come from related allocations.
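To make the per-group overhead concrete, here is a hypothetical sketch (not the actual DataFusion code) of the problematic pattern: one fresh `Vec` of row indices per key, followed by a Vec-to-array conversion (a boxed slice stands in for an Arrow index array). With many groups of 1-2 rows, this means one or two heap allocations per key per batch.

```rust
use std::collections::HashMap;

/// Hypothetical illustration of the current per-group pattern: for every
/// key we collect a fresh Vec of row indices, then convert it to an
/// "index array" (here a boxed slice standing in for an Arrow UInt32Array).
fn per_group_indices(keys: &[u64]) -> HashMap<u64, Box<[u32]>> {
    let mut map: HashMap<u64, Vec<u32>> = HashMap::new();
    for (row, &k) in keys.iter().enumerate() {
        // One small Vec allocated (and repeatedly grown) per key.
        map.entry(k).or_default().push(row as u32);
    }
    // Vec -> array conversion: may reallocate/copy once more per group.
    map.into_iter().map(|(k, v)| (k, v.into_boxed_slice())).collect()
}
```

With a million groups of 1-2 rows each, this pattern performs millions of tiny allocations per batch, which matches the profiling observation above.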
To make it efficient for tiny groups, we should probably change the hash aggregate algorithm to take all indices from the batch in one go, and "slice" into the resulting array for the individual accumulators.
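The proposed direction could be sketched roughly as follows (a self-contained illustration, not the actual DataFusion implementation; `group_indices` and its return shape are hypothetical): build a single permutation of all row indices in the batch, grouped by key, plus one contiguous range per group. Each accumulator then works on a slice of the one shared buffer instead of its own small allocation.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Sketch of "take all indices in one go": returns one shared index
/// buffer (a permutation of 0..keys.len(), grouped by key) plus, for
/// each key, the contiguous range of that buffer holding its rows.
fn group_indices(keys: &[u64]) -> (Vec<usize>, Vec<(u64, Range<usize>)>) {
    // First pass: count rows per key.
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    // Assign each key a contiguous region in the shared buffer.
    // (HashMap iteration order is unspecified; fine for a sketch.)
    let mut offsets: HashMap<u64, usize> = HashMap::new();
    let mut groups = Vec::with_capacity(counts.len());
    let mut start = 0;
    for (&k, &n) in &counts {
        offsets.insert(k, start);
        groups.push((k, start..start + n));
        start += n;
    }
    // Second pass: scatter row indices into the single buffer.
    let mut indices = vec![0usize; keys.len()];
    for (row, &k) in keys.iter().enumerate() {
        let pos = offsets.get_mut(&k).unwrap();
        indices[*pos] = row;
        *pos += 1;
    }
    (indices, groups)
}
```

An accumulator for group `k` with range `r` would then read `&indices[r]` directly: two passes, two large allocations per batch, regardless of the number of groups.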
Profiling info for db-benchmark questions 1-5 against master is attached.
Reporter: Daniël Heres / @Dandandan
Assignee: Daniël Heres / @Dandandan
Note: This issue was originally created as ARROW-11300. Please see the migration documentation for further details.