Call take arrays once per repartitioned input batch #22159
gene-bordegaray wants to merge 1 commit
Conversation
run benchmarks

🤖 Benchmark running (GKE): comparing gene.bordegaray/2026/05/repartition-grouped-hash-take (a0a727c) to 937dfda (merge-base) using tpch, clickbench_partitioned, and tpcds.

🤖 Benchmark completed (GKE): tpch, tpcds, and clickbench_partitioned runs finished (base vs. branch resource-usage tables omitted).
This is intended for fanout at larger scale factors. The benchmarks in my description are run with […]. Can this be run with […]?
cc: @gabotechs |
Which issue does this PR close?
Rationale for this change
Hash repartition currently builds one output batch per non-empty target partition by calling `take_arrays` separately for each partition. At high fanout this means an input batch can issue many take kernels, which shows up in repartition-heavy queries.

This PR changes hash repartition to concatenate the per-partition row indices, call `take_arrays` once for the input batch, and then slice the reordered batch back into per-partition output batches.

This is complementary to #22010: that PR reduces channel/gate traffic from many small batches, while this PR reduces the Arrow take-kernel work required to create the repartitioned batches.
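For contrast, here is a minimal sketch of the current per-partition shape, assuming the per-partition row indices have already been computed by hashing each row; the helper name and signature are hypothetical, not the actual DataFusion internals:

```rust
use arrow::array::UInt32Array;
use arrow::compute::take_arrays;
use arrow::record_batch::RecordBatch;

// Hypothetical: one `take_arrays` kernel invocation per non-empty
// target partition, i.e. up to N take kernels per input batch.
fn repartition_per_partition(
    batch: &RecordBatch,
    partition_indices: &[Vec<u32>],
) -> arrow::error::Result<Vec<Option<RecordBatch>>> {
    partition_indices
        .iter()
        .map(|indices| {
            if indices.is_empty() {
                return Ok(None);
            }
            let indices = UInt32Array::from(indices.clone());
            // Each call materializes a fully independent output batch.
            let columns = take_arrays(batch.columns(), &indices, None)?;
            Ok(Some(RecordBatch::try_new(batch.schema(), columns)?))
        })
        .collect()
}
```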
What changes are included in this PR?
- Replaces the per-partition `take_arrays` calls with one grouped `take_arrays` call per input batch.
- Emits `RecordBatch::slice` outputs for each non-empty partition.

How the grouped take works:
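A minimal sketch, assuming per-partition row indices are already built by hashing each row and that `take_arrays` is the arrow-rs kernel named above (the helper below is hypothetical, not the PR's actual code): concatenate the indices, take once, then hand out zero-copy slices.

```rust
use arrow::array::UInt32Array;
use arrow::compute::take_arrays;
use arrow::record_batch::RecordBatch;

// Hypothetical: one grouped `take_arrays` call per input batch,
// followed by a zero-copy `RecordBatch::slice` per non-empty partition.
fn repartition_grouped(
    batch: &RecordBatch,
    partition_indices: &[Vec<u32>],
) -> arrow::error::Result<Vec<Option<RecordBatch>>> {
    // Concatenate per-partition indices, remembering each partition's
    // (offset, length) range within the reordered batch.
    let mut flat = Vec::with_capacity(batch.num_rows());
    let mut ranges = Vec::with_capacity(partition_indices.len());
    for indices in partition_indices {
        ranges.push((flat.len(), indices.len()));
        flat.extend_from_slice(indices);
    }
    let indices = UInt32Array::from(flat);

    // One take kernel for the whole input batch.
    let columns = take_arrays(batch.columns(), &indices, None)?;
    let reordered = RecordBatch::try_new(batch.schema(), columns)?;

    // Slices share the reordered batch's buffers (no copy).
    Ok(ranges
        .into_iter()
        .map(|(offset, len)| (len > 0).then(|| reordered.slice(offset, len)))
        .collect())
}
```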
Are these changes tested?
`cargo test -p datafusion-physical-plan repartition --lib`

Benchmarks:
Default TPCH SF10 summary, with no `--batch-size` override:

TPCH SF10 default batch size, 8 partitions, all queries
TPCH SF10 default batch size, 16 partitions, all queries
TPCH SF10 default batch size, 32 partitions, all queries
TPCH SF10 default batch size, 64 partitions, all queries
TPCH SF10 default batch size, 300 partitions, all queries
Stress cases:
These runs use `--batch-size 1024` to stress the repartition path. They are included to show the mechanism under smaller input batches and higher output fanout, not as the primary end-to-end performance claim.

TPCH SF10, 8 partitions, all queries
TPCH SF10, 16 partitions, all queries
TPCH SF10, 32 partitions, all queries
TPCH SF10, 64 partitions, all queries
TPCH SF10, 300 partitions, targeted high-fanout queries
Yes, this is a real use case for fanout in distributed-datafusion.
TPCH SF10, 300 partitions, peak RSS stress
Measured with `/usr/bin/time -l`, one iteration, `--batch-size 1024`, `--partitions 300`, and no DataFusion memory limit. RSS is the process's peak resident set size as reported by the OS.

Memory concern and follow-up work
This PR changes the output batches from independently materialized per-partition batches to slices of one reordered batch, which means sibling slices can share the same underlying buffers.
Potential concern:
A slow output partition can keep the shared reordered batch's buffers alive until its slice is dropped. Also, `RecordBatch::get_array_memory_size()` may count shared slice buffers repeatedly when repartition reserves memory per output batch.

The peak RSS stress above did not show a process-memory regression in the measured queries. Follow-up work should add buffer-aware accounting.
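To make the accounting concern concrete, here is a small self-contained sketch, assuming standard arrow-rs zero-copy slicing (illustrative only, not code from this PR):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0..8192));
    let batch = RecordBatch::try_from_iter([("v", col)])?;

    // Two zero-copy slices of one "reordered" batch share its buffers.
    let a = batch.slice(0, 4096);
    let b = batch.slice(4096, 4096);

    // Each slice can report the full shared buffer size, so summing
    // per-batch reservations may over-count the real allocation, and
    // either slice alone keeps the whole buffer alive.
    println!("parent: {} bytes", batch.get_array_memory_size());
    println!(
        "a: {} bytes, b: {} bytes",
        a.get_array_memory_size(),
        b.get_array_memory_size()
    );
    Ok(())
}
```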
Are there any user-facing changes?
No.