@andygrove andygrove commented Jan 17, 2026

Summary

  • Make the --query parameter optional - runs all 22 TPC-H queries when not specified
  • Only print SQL queries when --debug flag is enabled
  • Write a single JSON output file for the entire benchmark run (instead of one per query)
  • Fix parquet file path resolution for datafusion benchmarks
  • Simplify output when --iterations 1 (no iteration number, no average)
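The optional `--query` behavior can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper name `queries_to_run` and the constant `NUM_TPCH_QUERIES` are hypothetical, and the sketch assumes the option is parsed into an `Option<usize>`.

```rust
/// TPC-H defines 22 queries, numbered 1 through 22.
const NUM_TPCH_QUERIES: usize = 22;

/// Hypothetical helper (not from the PR): resolve the optional `--query`
/// value into the list of query numbers to run.
fn queries_to_run(query: Option<usize>) -> Vec<usize> {
    match query {
        // A specific query was requested: run only that one.
        Some(q) => vec![q],
        // No query given: run all 22 queries in order.
        None => (1..=NUM_TPCH_QUERIES).collect(),
    }
}
```

With this shape, `queries_to_run(Some(1))` yields only query 1, while `queries_to_run(None)` yields queries 1 through 22 in order.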

Example usage

Run a single query:

cargo run --release --bin tpch -- benchmark datafusion --path ./data --format parquet --query 1

Run all 22 queries:

cargo run --release --bin tpch -- benchmark datafusion --path ./data --format parquet

Test plan

  • Code compiles with cargo check -p ballista-benchmarks
  • cargo fmt and cargo clippy pass
  • Manual testing with TPC-H data (both parquet and tbl formats)

🤖 Generated with Claude Code

@andygrove andygrove force-pushed the optional-query-param branch 5 times, most recently from a5e4d34 to f408775 Compare January 17, 2026 21:28
@andygrove andygrove changed the title from "feat: make query parameter optional in tpch benchmark" to "feat: improve tpch benchmark CLI" Jan 17, 2026

andygrove commented Jan 17, 2026

Ballista run:

$ cargo run --release --bin tpch benchmark ballista --host localhost --port 50050 --path $(pwd)/data --format parquet --iterations 1 --output .
    Finished `release` profile [optimized] target(s) in 0.16s
     Running `/home/andy/git/apache/datafusion-ballista/target/release/tpch benchmark ballista --host localhost --port 50050 --path /home/andy/git/apache/datafusion-ballista/benchmarks/data --format parquet --iterations 1 --output .`
Running benchmarks with the following options: BallistaBenchmarkOpt { query: None, debug: false, expected_results: None, iterations: 1, batch_size: 8192, path: "/home/andy/git/apache/datafusion-ballista/benchmarks/data", file_format: "parquet", partitions: 2, host: Some("localhost"), port: Some(50050), output_path: Some(".") }
Query 1 took 816.9 ms and returned 4 rows
Query 2 took 1425.0 ms and returned 100 rows
Query 3 took 1018.5 ms and returned 10 rows
Query 4 took 815.6 ms and returned 5 rows
Query 5 took 1625.8 ms and returned 5 rows
Query 6 took 408.3 ms and returned 1 rows
Query 7 took 1830.3 ms and returned 4 rows
Query 8 took 2236.7 ms and returned 2 rows
Query 9 took 1829.2 ms and returned 175 rows
Query 10 took 1220.9 ms and returned 20 rows
Query 11 took 1018.1 ms and returned 1048 rows
Query 12 took 812.4 ms and returned 2 rows
Query 13 took 916.4 ms and returned 42 rows
Query 14 took 611.3 ms and returned 1 rows
Query 15 took 1019.0 ms and returned 0 rows
Query 16 took 1220.8 ms and returned 18314 rows
Query 17 took 715.1 ms and returned 1 rows
Query 18 took 1325.1 ms and returned 57 rows
Query 19 took 716.4 ms and returned 1 rows
Query 20 took 1019.5 ms and returned 186 rows
Query 21 took 1628.9 ms and returned 100 rows
Query 22 took 816.5 ms and returned 7 rows
Writing summary file to ./tpch-1768685522.json
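For illustration, the simplified single-iteration output could be produced along these lines. This is a sketch under assumptions, not the PR's implementation: `format_result` and `average_ms` are hypothetical names, the single-iteration line matches the log above, and the multi-iteration wording is invented for the example since this conversation does not show it.

```rust
/// Hypothetical helper (not from the PR): format one per-query result line.
/// With a single iteration the iteration number is omitted, matching the
/// log above; the multi-iteration wording is an assumption.
fn format_result(
    query: usize,
    iteration: usize,
    iterations: usize,
    elapsed_ms: f64,
    rows: usize,
) -> String {
    if iterations == 1 {
        format!(
            "Query {} took {:.1} ms and returned {} rows",
            query, elapsed_ms, rows
        )
    } else {
        format!(
            "Query {} iteration {} took {:.1} ms and returned {} rows",
            query, iteration, elapsed_ms, rows
        )
    }
}

/// Hypothetical helper: average of the recorded timings, or `None` when
/// an average is not meaningful (zero or one iteration).
fn average_ms(timings_ms: &[f64]) -> Option<f64> {
    if timings_ms.len() < 2 {
        return None;
    }
    Some(timings_ms.iter().sum::<f64>() / timings_ms.len() as f64)
}
```

Returning `None` from `average_ms` for a single timing lets the caller skip the average line entirely instead of printing a value identical to the only measurement.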

DataFusion run:

$ cargo run --release --bin tpch benchmark datafusion --path $(pwd)/data --format parquet --iterations 1 --output .
   Compiling ballista-benchmarks v51.0.0 (/home/andy/git/apache/datafusion-ballista/benchmarks)
    Finished `release` profile [optimized] target(s) in 9.28s
     Running `/home/andy/git/apache/datafusion-ballista/target/release/tpch benchmark datafusion --path /home/andy/git/apache/datafusion-ballista/benchmarks/data --format parquet --iterations 1 --output .`
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: None, debug: false, iterations: 1, partitions: 2, batch_size: 8192, path: "/home/andy/git/apache/datafusion-ballista/benchmarks/data", file_format: "parquet", mem_table: false, output_path: Some(".") }
Query 1 took 222.8 ms and returned 4 rows
Query 2 took 45.5 ms and returned 100 rows
Query 3 took 76.7 ms and returned 10 rows
Query 4 took 101.8 ms and returned 5 rows
Query 5 took 111.8 ms and returned 5 rows
Query 6 took 56.4 ms and returned 1 rows
Query 7 took 162.9 ms and returned 4 rows
Query 8 took 98.8 ms and returned 2 rows
Query 9 took 151.0 ms and returned 175 rows
Query 10 took 116.8 ms and returned 20 rows
Query 11 took 24.9 ms and returned 1048 rows
Query 12 took 90.8 ms and returned 2 rows
Query 13 took 148.1 ms and returned 42 rows
Query 14 took 58.3 ms and returned 1 rows
Query 15 took 86.6 ms and returned 0 rows
Query 16 took 34.6 ms and returned 18314 rows
Query 17 took 160.8 ms and returned 1 rows
Query 18 took 286.1 ms and returned 57 rows
Query 19 took 117.3 ms and returned 1 rows
Query 20 took 78.6 ms and returned 186 rows
Query 21 took 163.9 ms and returned 100 rows
Query 22 took 39.3 ms and returned 7 rows
Writing summary file to ./tpch-1768685732.json
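The summary file names above look like `tpch-<Unix timestamp in seconds>.json` placed under the `--output` directory. A minimal sketch of building such a path is shown below; the naming scheme is inferred from the two log lines only, and `summary_path` and `now_unix_secs` are hypothetical helpers, not functions from the benchmark code.

```rust
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper (not from the PR): build the path of the single
/// summary file for a benchmark run. Assumes the name is
/// `tpch-<unix seconds>.json`, as inferred from the log output.
fn summary_path(output_dir: &Path, unix_secs: u64) -> PathBuf {
    output_dir.join(format!("tpch-{}.json", unix_secs))
}

/// Hypothetical helper: current time as whole seconds since the Unix
/// epoch, falling back to 0 if the system clock is before the epoch.
fn now_unix_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Calling `summary_path(Path::new("."), now_unix_secs())` once per run, rather than once per query, is what gives a single output file for the whole benchmark.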

When running the tpch benchmark, the --query parameter is now optional.
If not specified, all 22 TPC-H queries will be run sequentially.

Changes:
- Make --query optional for both datafusion and ballista benchmarks
- Run all 22 queries when --query is not specified
- Only print SQL queries when --debug flag is enabled
- Write a single JSON output file for the entire benchmark run
- Fix parquet file path resolution for datafusion benchmarks
- Simplify output when iterations=1 (no iteration number, no average)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove force-pushed the optional-query-param branch from f408775 to c3b9c01 Compare January 17, 2026 21:35
@andygrove andygrove requested a review from milenkovicm January 17, 2026 21:36
@milenkovicm milenkovicm merged commit 34f7513 into apache:main Jan 17, 2026
15 checks passed