
Benchmark single-column GROUP BY: hash vs direct-indexed flat array#22479

Closed
nathanb9 wants to merge 5 commits into apache:main from nathanb9:benchmark-single-group-by-flat-primitive

Summary

  • Adds GroupValuesFlatPrimitive, a direct-indexing GroupValues implementation for integer GROUP BY columns with bounded value ranges (no hashing, O(1) array lookup via value - min)
  • Adds a benchmark (single_group_by_primitive) comparing hash-based GroupValuesPrimitive vs GroupValuesFlatPrimitive
  • Makes single_group_by and row modules public for benchmark access
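To make the contrast concrete, here is a minimal sketch of the two lookup strategies over plain i64 keys. It is not the code from this PR: FlatGroups and intern_hashed are hypothetical stand-ins, the key range (min, max) is assumed to be known up front, and null handling and Arrow arrays are left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a direct-indexed group table (not the PR's type).
struct FlatGroups {
    min: i64,
    /// slots[value - min] holds the group id, or None if the value is unseen.
    slots: Vec<Option<usize>>,
    /// Group values in order of first appearance.
    values: Vec<i64>,
}

impl FlatGroups {
    /// Assumes the caller already knows an inclusive [min, max] bound.
    fn new(min: i64, max: i64) -> Self {
        let len = (max - min + 1) as usize;
        Self { min, slots: vec![None; len], values: Vec::new() }
    }

    /// Assigns a group id to every key with one array access per row.
    /// Panics if a key falls outside [min, max].
    fn intern(&mut self, keys: &[i64], out: &mut Vec<usize>) {
        out.clear();
        for &k in keys {
            let idx = (k - self.min) as usize;
            let id = match self.slots[idx] {
                Some(id) => id,
                None => {
                    let id = self.values.len();
                    self.values.push(k);
                    self.slots[idx] = Some(id);
                    id
                }
            };
            out.push(id);
        }
    }
}

/// Hash-based equivalent, using the standard HashMap for comparison.
fn intern_hashed(map: &mut HashMap<i64, usize>, keys: &[i64], out: &mut Vec<usize>) {
    out.clear();
    for &k in keys {
        let next = map.len();
        out.push(*map.entry(k).or_insert(next));
    }
}
```

Both functions assign ids in order of first appearance, so they produce identical output for the same input. The flat version trades memory proportional to max - min + 1 for skipping the hash and probe on every row.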

Benchmark design

Uses the same iter_batched_ref methodology as the multi-column GROUP BY benchmark in #22322:

  • Construction is in setup (not timed)
  • Only intern() calls are measured
  • black_box prevents dead-code elimination
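The three points above can be sketched with the standard library alone. This is an illustration of the idea, not Criterion's iter_batched_ref: time_routine is a hypothetical helper that rebuilds the state for each sample outside the timed region and measures only the routine.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: returns the mean time of `routine` over `samples` runs.
/// `setup` runs before every sample and is never timed.
fn time_routine<S, R>(
    samples: usize,
    mut setup: impl FnMut() -> S,
    mut routine: impl FnMut(&mut S) -> R,
) -> Duration {
    assert!(samples > 0, "need at least one sample");
    let mut total = Duration::ZERO;
    for _ in 0..samples {
        // Construction happens here, outside the timed region.
        let mut state = setup();
        let start = Instant::now();
        // black_box keeps the optimizer from discarding the call or its result.
        black_box(routine(black_box(&mut state)));
        total += start.elapsed();
    }
    total / samples as u32
}
```

A real harness such as Criterion also handles warm-up, outlier detection, and statistical reporting, which this sketch omits.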

Three experiments:

  1. Group count sweep (10–100K groups, 1M rows) — measures scaling with cardinality
  2. Density sweep (10K groups, 10%–100% density) — measures flat array sparsity impact
  3. Row count scaling (10K groups, 1M–10M rows) — measures per-row cost compounding
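As an illustration of the density sweep, the sketch below generates keys for a given group count and density. It assumes density means distinct groups divided by the size of the value range (the flat array length), and it assigns rows to groups round-robin. The PR does not state its exact definition or key distribution, so keys_for_density is a hypothetical helper.

```rust
/// Hypothetical generator: returns (keys, range), where `range` is the number
/// of slots a flat array starting at 0 would need.
/// Assumed definition: density = groups / range.
fn keys_for_density(groups: usize, density_pct: usize, rows: usize) -> (Vec<i64>, usize) {
    assert!(groups > 0 && (1..=100).contains(&density_pct));
    // range = ceil(groups / (density_pct / 100))
    let range = (groups * 100 + density_pct - 1) / density_pct;
    let keys = (0..rows)
        .map(|r| {
            let g = r % groups; // round-robin group assignment
            // Spread the groups evenly over [0, range); distinct because range >= groups.
            (g * range / groups) as i64
        })
        .collect();
    (keys, range)
}
```

Under this assumption, 10K groups at 10% density occupy a 100K-slot array, so nine of every ten slots stay empty.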

Local results (Apple Silicon, release mode)

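| Groups  | Hash   | Flat   | Speedup |
|---------|--------|--------|---------|
| 10      | 1.41ms | 0.96ms | 1.47x   |
| 100     | 1.38ms | 0.97ms | 1.43x   |
| 1,000   | 1.52ms | 0.98ms | 1.55x   |
| 10,000  | 2.31ms | 0.97ms | 2.37x   |
| 100,000 | 4.53ms | 1.52ms | 2.99x   |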

Related

Test plan

  • cargo test -p datafusion-physical-plan --lib flat_primitive — 5 unit tests pass
  • cargo clippy -p datafusion-physical-plan --benches -- -D warnings — clean
  • cargo bench -p datafusion-physical-plan --bench single_group_by_primitive — runs successfully

🤖 Generated with Claude Code

nathanb9 and others added 5 commits May 17, 2026 15:54

return_type() is not implemented by all UDAFs (e.g. first_value,
last_value). Use the universally-supported return_field() API to
derive the return type for default_value computation.

approx_distinct is semantically COUNT(DISTINCT), so it returns 0 (not
NULL) on empty input. Update the window_using_aggregates snapshot to
reflect this and add is_nullable() -> false for schema consistency.

…ingle-column GROUP BY

Adds a direct-indexing GroupValues implementation for integer-typed
GROUP BY columns with bounded value ranges, inspired by the ArrayMap
used for perfect hash joins. Instead of hashing and probing a hash
table, it computes `value - min` to index directly into a flat array.

Includes a benchmark comparing hash-based vs flat approaches across
varying group counts (10-100K) and densities (10%-100%), with both
cold (full lifecycle) and warm (pre-populated, pure lookup) variants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the optimizer, core, sqllogictest, functions, and physical-plan labels on May 23, 2026
nathanb9 closed this on May 23, 2026