
feat: add standalone shuffle benchmark binary for profiling #3752

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:shuffle-bench-binary

Conversation

@andygrove
Member

Which issue does this PR close?

N/A - new tooling

Rationale for this change

The existing shuffle benchmarks use small synthetic data (8192 rows × 10 batches) under Criterion, which makes it difficult to:

  • Benchmark with realistic data distributions from TPC-H/TPC-DS at scale
  • Profile with tools like cargo flamegraph, perf, or instruments (Criterion's harness interferes)
  • Test both write and read paths (current benchmarks are write-only)
  • Explore different scenarios like spilling, high partition counts, or codec comparisons

What changes are included in this PR?

Adds a shuffle_bench binary (native/core/src/bin/shuffle_bench.rs) that benchmarks Comet shuffle write and read performance independently from Spark.

Features:

  • Parquet input: Point at TPC-H/TPC-DS Parquet files for realistic data distributions
  • Synthetic data generation: Configurable schema with int, string, decimal, and date columns
  • Write + read benchmarking: --read-back decodes all partitions and reports throughput
  • Configurable scenarios: partitioning (hash/single/round-robin), partition count, compression (none/lz4/zstd/snappy), memory limit for spilling
  • Profiler-friendly: Single long-running process with warmup and iteration support
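
As a rough illustration of the hash partitioning scheme listed above (this is a hypothetical sketch, not Comet's actual shuffle code; `hash_partition` is an invented helper name):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: assign a row key to one of `num_partitions`
/// output partitions by hashing, as hash partitioning does in a shuffle.
/// The real implementation hashes whole Arrow batches column-wise.
fn hash_partition(key: i64, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_partitions
}

fn main() {
    let num_partitions = 200;
    // Every key maps to a stable partition in [0, num_partitions).
    for key in 0..1_000i64 {
        let p = hash_partition(key, num_partitions);
        assert!(p < num_partitions);
        assert_eq!(p, hash_partition(key, num_partitions));
    }
    println!("ok");
}
```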

Example usage:

# Benchmark with TPC-H data
cargo run --release --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 --codec zstd --read-back

# Generate synthetic data
cargo run --release --bin shuffle_bench -- \
  --generate --gen-rows 10000000 \
  --partitions 200 --codec lz4 --read-back --iterations 3 --warmup 1

# Profile with flamegraph
cargo flamegraph --release --bin shuffle_bench -- \
  --input /data/lineitem/ --partitions 200 --codec zstd
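
The `--warmup`/`--iterations` pattern used above can be sketched as follows (a hypothetical `bench` helper for illustration, not the binary's actual code): run the workload a few times untimed to reach steady state, then time each measured iteration and report the mean.

```rust
use std::time::Instant;

/// Illustrative sketch of warmup + timed iterations for
/// profiler-friendly benchmarking in a single long-running process.
fn bench<F: FnMut()>(warmup: usize, iterations: usize, mut workload: F) -> f64 {
    for _ in 0..warmup {
        workload(); // untimed warmup passes
    }
    let mut total = 0.0;
    for _ in 0..iterations {
        let start = Instant::now();
        workload();
        total += start.elapsed().as_secs_f64();
    }
    total / iterations as f64
}

fn main() {
    // Dummy workload standing in for a shuffle write + read-back pass.
    let mut acc = 0u64;
    let mean = bench(1, 3, || {
        for i in 0..10_000u64 {
            acc = acc.wrapping_add(i);
        }
    });
    assert!(acc > 0);
    println!("mean iteration time: {mean:.6}s");
}
```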

How are these changes tested?

Manually tested with generated data across various configurations:

  • Different codecs (none, lz4, zstd, snappy)
  • Read-back verification (all rows decoded correctly)
  • Multiple iterations with warmup
  • Clippy clean, cargo fmt applied

Add a `shuffle_bench` binary that benchmarks shuffle write and read
performance independently from Spark, making it easy to profile with
tools like `cargo flamegraph`, `perf`, or `instruments`.

Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating
synthetic data with configurable schema. Covers different scenarios
including compression codecs, partition counts, partitioning schemes,
and memory-constrained spilling.
