
feat: add standalone shuffle benchmark binary for profiling #3752

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:shuffle-bench-binary

Conversation

@andygrove
Member

Which issue does this PR close?

N/A - new tooling

Rationale for this change

The existing shuffle benchmarks use small synthetic data (8192 rows × 10 batches) under Criterion, which makes it difficult to:

  • Benchmark with realistic data distributions from TPC-H/TPC-DS at scale
  • Profile with tools like cargo flamegraph, perf, or instruments (Criterion's harness interferes)
  • Test both write and read paths (current benchmarks are write-only)
  • Explore different scenarios like spilling, high partition counts, or codec comparisons

What changes are included in this PR?

Adds a shuffle_bench binary (native/core/src/bin/shuffle_bench.rs) that benchmarks Comet shuffle write and read performance independently from Spark.

Features:

  • Parquet input: Point at TPC-H/TPC-DS Parquet files for realistic data distributions
  • Synthetic data generation: Configurable schema with int, string, decimal, and date columns
  • Write + read benchmarking: --read-back decodes all partitions and reports throughput
  • Configurable scenarios: partitioning (hash/single/round-robin), partition count, compression (none/lz4/zstd/snappy), memory limit for spilling
  • Profiler-friendly: Single long-running process with warmup and iteration support
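
As a rough illustration of the hash partitioning scheme listed above (this is a hypothetical sketch, not Comet's actual shuffle code; `hash_partition` is an invented helper name):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: assign a row key to one of `num_partitions`
/// output partitions by hashing, as hash partitioning does in a shuffle.
/// The real implementation hashes whole Arrow batches column-wise.
fn hash_partition(key: i64, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_partitions
}

fn main() {
    let num_partitions = 200;
    // Every key maps to a stable partition in [0, num_partitions).
    for key in 0..1_000i64 {
        let p = hash_partition(key, num_partitions);
        assert!(p < num_partitions);
        assert_eq!(p, hash_partition(key, num_partitions));
    }
    println!("ok");
}
```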

Example usage:

# Benchmark with TPC-H data
cargo run --release --bin shuffle_bench -- \
  --input /data/tpch-sf100/lineitem/ \
  --partitions 200 --codec zstd --read-back

# Generate synthetic data
cargo run --release --bin shuffle_bench -- \
  --generate --gen-rows 10000000 \
  --partitions 200 --codec lz4 --read-back --iterations 3 --warmup 1

# Profile with flamegraph
cargo flamegraph --release --bin shuffle_bench -- \
  --input /data/lineitem/ --partitions 200 --codec zstd
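
The `--warmup`/`--iterations` pattern used above can be sketched as follows (a hypothetical `bench` helper for illustration, not the binary's actual code): run the workload a few times untimed to reach steady state, then time each measured iteration and report the mean.

```rust
use std::time::Instant;

/// Illustrative sketch of warmup + timed iterations for
/// profiler-friendly benchmarking in a single long-running process.
fn bench<F: FnMut()>(warmup: usize, iterations: usize, mut workload: F) -> f64 {
    for _ in 0..warmup {
        workload(); // untimed warmup passes
    }
    let mut total = 0.0;
    for _ in 0..iterations {
        let start = Instant::now();
        workload();
        total += start.elapsed().as_secs_f64();
    }
    total / iterations as f64
}

fn main() {
    // Dummy workload standing in for a shuffle write + read-back pass.
    let mut acc = 0u64;
    let mean = bench(1, 3, || {
        for i in 0..10_000u64 {
            acc = acc.wrapping_add(i);
        }
    });
    assert!(acc > 0);
    println!("mean iteration time: {mean:.6}s");
}
```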

How are these changes tested?

Manually tested with generated data across various configurations:

  • Different codecs (none, lz4, zstd, snappy)
  • Read-back verification (all rows decoded correctly)
  • Multiple iterations with warmup
  • Clippy clean, cargo fmt applied

Add a `shuffle_bench` binary that benchmarks shuffle write and read
performance independently from Spark, making it easy to profile with
tools like `cargo flamegraph`, `perf`, or `instruments`.

Supports reading Parquet files (e.g. TPC-H/TPC-DS) or generating
synthetic data with configurable schema. Covers different scenarios
including compression codecs, partition counts, partitioning schemes,
and memory-constrained spilling.
