perf: combine_hashes for primitive multi-column rehash#22350
Conversation
Aligns hash_array_primitive's rehash path with the existing pattern used by hash_array (generic variable-width) and hash_dictionary_array: hash the current value once with the query RandomState and combine it with the previous row hash via combine_hashes, instead of constructing a fresh foldhash::SeedableRandomState per row to fold the seed in. Measured on ClickBench q32 (`GROUP BY WatchID, ClientIP ORDER BY COUNT(*) DESC LIMIT 10`, partitioned dataset, dfbench, current main baseline 3988.27 ms over 3 iterations): a 10-iteration same-machine run with this change settled around 1.7 s with a warm tail of 1.4-1.5 s. Validation: - cargo test -p datafusion-common hash_utils - cargo clippy -p datafusion-common --all-targets --all-features -- -D warnings
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/primitive-rehash-multi-column (4712451) to c8b784a (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/primitive-rehash-multi-column (4712451) to c8b784a (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/primitive-rehash-multi-column (4712451) to c8b784a (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
When multiple group keys participate in a grouped aggregate,
create_hasheswalks the columns left-to-right and rehashes each row by combining the current value's hash with the previously accumulated row hash. The generic variable-width path (hash_array) and the dictionary path (hash_dictionary_array) already do this withcombine_hashes(value.hash_one(random_state), *hash). The primitive path (hash_array_primitive) instead constructs a freshfoldhash::fast::SeedableRandomStateper row, seeded from the previous hash, and runs the value through that. That approach is heavier on the hot loop and isn't necessary for correctness — the same hash distribution is achievable withcombine_hashes.The motivating workload is ClickBench q32 on the partitioned dataset:
The two group keys are both primitive (
Int64/Int32), and the partial aggregate spends most of its time in group-id calculation. On a current-main baseline of 3988.27 ms (dfbench clickbench --query 32 --iterations 3), a 10-iteration same-machine run with this change settled around 1.7 s, with a warm tail of 1.4-1.5 s per iteration.Guardrail single-query runs of other primitive multi-key aggregates (q30, q31, q35) and the q18 path that benefits from primitive rehashing as a second column did not regress.
What changes are included in this PR?
Replaces the two
seeded_state(...).build_hasher() / value.hash_write(...) / hasher.finish()blocks inhash_array_primitive's rehash branch (no-null and nullable variants) with the samecombine_hashes(value.hash_one(random_state), *hash)pattern already used by the surrounding generic and dictionary paths.seeded_stateitself is unchanged and still used by the byte-view hash paths.Doc comment on
hash_array_primitiveupdated to match the new behavior.Are these changes tested?
Covered by existing tests in
datafusion-common::hash_utils, including the multi-column hash tests:cargo test -p datafusion-common hash_utilscargo clippy -p datafusion-common --all-targets --all-features -- -D warningsAre there any user-facing changes?
No. This is an internal performance change with no API or behavioral impact — hash distribution properties are unchanged because
combine_hashesis already the standard pattern in this file.