Conversation

@comphead
Contributor

@comphead comphead commented Nov 28, 2025

Which issue does this PR close?

Rationale for this change

By default, DF assesses performance by running TPCH benchmarks. However, TPCDS is much more complicated and
can expose performance issues better, especially for join changes like

#18393
#18392

This is an initial PR to support TPCDS. The benchmark can now be run by following these instructions, which are also documented:

git clone https://github.com/apache/datafusion-benchmarks

Then run the benchmark with the following command:

DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@comphead comphead changed the title chore: Add TPCDS benchmarks [WIP]chore: Add TPCDS benchmarks Nov 28, 2025
@github-actions github-actions bot added core Core DataFusion crate and removed core Core DataFusion crate labels Dec 2, 2025
@comphead comphead changed the title [WIP]chore: Add TPCDS benchmarks chore: Add TPCDS benchmarks Dec 2, 2025
@comphead comphead marked this pull request as ready for review December 2, 2025 22:41
/// Run the tpcds benchmark.
#[derive(Debug, StructOpt, Clone)]
#[structopt(verbatim_doc_comment)]
pub struct RunOpt {
Member

Do we have a --help to see these commands?

Contributor Author

Yes, same as for the other benchmarks. I added the help command to the .md.

};

let mut benchmark_run = BenchmarkRun::new();
let mut config = self
Member

After PR #18971, the first round of a run will have statistics. Put differently, the first round will spend time fetching statistics, which may add some noise.

Contributor Author

If that's the case, we should probably fix all the benches, as all of them are powered by the same mechanics.
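One common way to keep one-time costs such as statistics fetching out of the measurements is an untimed warm-up round before the timed iterations. The sketch below only illustrates that idea with the standard library. `time_with_warmup` is a hypothetical helper, not part of dfbench or of this PR, and it assumes the work under test can be wrapped in a closure.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `work` once untimed (warm-up), then
/// `iterations` timed rounds, and returns the elapsed time of each round.
fn time_with_warmup<F: FnMut()>(mut work: F, iterations: usize) -> Vec<Duration> {
    // Warm-up round: pays one-time costs (for example, statistics fetching)
    // so they do not land in the first measured iteration.
    work();

    let mut timings = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        work();
        timings.push(start.elapsed());
    }
    timings
}
```

With `iterations` set to 5, the closure is invoked six times and five timings are returned. Whether a warm-up round is appropriate for these benchmarks is a separate design decision for the shared runner code.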

Contributor

@Jefffrey Jefffrey left a comment

Should this PR close the epic yet, considering it still has some open issues linked?

pub fn get_query_sql(base_query_path: &str, query: usize) -> Result<Vec<String>> {
if query > 0 && query < 100 {
let filename = format!("{base_query_path}/q{query}.sql");
let mut errors = vec![];
Contributor

nit: errors probably doesn't need to be a vec?

Contributor Author

It's very similar to the current TPCH code, but you are right. I also feel the benchmark runners can be improved.
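As a rough illustration of the nit above: only one file is tried per query, so a single error value is enough. The sketch below is not the code in this PR. It uses `std::io::Result` instead of DataFusion's `Result` to stay self-contained, and it assumes that statements in a query file are separated by `;`, which is a simplification.

```rust
use std::fs;
use std::io::{Error, ErrorKind, Result};

/// Sketch: loads the statements of query `query` from
/// `{base_query_path}/q{query}.sql`, returning a single error on failure.
pub fn get_query_sql(base_query_path: &str, query: usize) -> Result<Vec<String>> {
    if query == 0 || query >= 100 {
        return Err(Error::new(
            ErrorKind::InvalidInput,
            format!("invalid query {query}: expected a number from 1 to 99"),
        ));
    }
    let filename = format!("{base_query_path}/q{query}.sql");
    // Only one file is tried, so one error (not a Vec of errors) suffices.
    let contents = fs::read_to_string(&filename)
        .map_err(|e| Error::new(e.kind(), format!("failed to read {filename}: {e}")))?;
    // Assumption: statements are separated by ';' (naive split, ignores
    // semicolons inside string literals).
    Ok(contents
        .split(';')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect())
}
```

A call such as `get_query_sql("benchmarks/queries/tpcds", 13)` would then either return the statements of `q13.sql` or one error naming the file that could not be read.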

Comment on lines 610 to 613
# Points to TPCDS data generation instructions
data_tpcds() {
echo ""
echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:"
Contributor

It seems the repo already has the data? So might be confusing to call it data generation

Contributor

I suggest only showing this message when the directory is not present

When I tried it out, I was confused

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

For TPC-DS data generation, please clone the datafusion-benchmarks repository:
  git clone https://github.com/apache/datafusion-benchmarks

So I did what the script told me

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$   git clone https://github.com/apache/datafusion-benchmarks
Cloning into 'datafusion-benchmarks'...
remote: Enumerating objects: 283, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 283 (delta 18), reused 9 (delta 9), pack-reused 248 (from 3)
Receiving objects: 100% (283/283), 268.89 MiB | 40.49 MiB/s, done.
Resolving deltas: 100% (64/64), done.

And then I ran the data command again and got told to get the benchmarking scripts again 🤔

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

For TPC-DS data generation, please clone the datafusion-benchmarks repository:
  git clone https://github.com/apache/datafusion-benchmarks

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$

Contributor Author

I don't know how to do it better, to be honest. TPCDS data generation is quite complicated if you don't have dsdgen, which you need to download and build yourself.

The expected flow is to run bench.sh first with data and then with run. Since we don't generate the data ourselves for TPCDS, I just left instructions on how to get the data.

Let me know if you feel another approach would fit better.

Contributor

@alamb alamb left a comment

Thank you @comphead ❤️

I think we already have the tpc-ds queries in the repo here:
https://github.com/apache/datafusion/tree/main/datafusion/core/tests/tpc-ds

Is there any way to reuse the same queries so we are sure we are testing and benchmarking the same thing?

I am also trying it out locally

@comphead
Contributor Author

comphead commented Dec 3, 2025

> Thank you @comphead ❤️
>
> I think we already have the tpc-ds queries in the repo here: https://github.com/apache/datafusion/tree/main/datafusion/core/tests/tpc-ds
>
> Is there any way to reuse the same queries so we are sure we are testing and benchmarking the same thing?
>
> I am also trying it out locally

Oh, they are named 1-N.sql instead of q1-qN, that's why I never found them. Checking them now.

@alamb
Contributor

alamb commented Dec 3, 2025

When I ran the command locally

./benchmarks/bench.sh run tpcds

I got a bunch of errors like

Warning registering call_center: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/call_center.parquet

Details

./benchmarks/bench.sh run tpcds
...
     Running `/Users/andrewlamb/Software/datafusion/target/release/dfbench tpcds --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data --query_path /Users/andrewlamb/Software/datafusion/benchmarks/queries/tpcds --prefer_hash_join true -o /Users/andrewlamb/Software/datafusion/benchmarks/results/dev2/tpcds_sf1.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data", query_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/tpcds", mem_table: false, output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/dev2/tpcds_sf1.json"), disable_statistics: false, prefer_hash_join: true, enable_piecewise_merge_join: false, sorted: false }
Warning registering call_center: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/call_center.parquet
Warning registering customer_address: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/customer_address.parquet
Warning registering household_demographics: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/household_demographics.parquet
Warning registering promotion: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/promotion.parquet
Warning registering store_sales: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/store_sales.parquet
Warning registering web_page: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/web_page.parquet
Warning registering catalog_page: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/catalog_page.parquet
Warning registering customer_demographics: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/customer_demographics.parquet
Warning registering income_band: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/income_band.parquet
Warning registering reason: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/reason.parquet
Warning registering store: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/store.parquet
Warning registering web_returns: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/web_returns.parquet
Warning registering catalog_returns: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/catalog_returns.parquet
Warning registering customer: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/customer.parquet
Warning registering inventory: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/inventory.parquet
Warning registering ship_mode: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/ship_mode.parquet
Warning registering time_dim: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/time_dim.parquet
Warning registering web_sales: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/web_sales.parquet
Warning registering catalog_sales: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/catalog_sales.parquet
Warning registering date_dim: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/date_dim.parquet
Warning registering item: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/item.parquet
Warning registering store_returns: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/store_returns.parquet
Warning registering warehouse: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/warehouse.parquet
Warning registering web_site: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/web_site.parquet
Query 1 failed: Schema error: No field named sr_returned_date_sk.
Query 2 failed: Schema error: No field named ws_sold_date_sk.
Query 3 failed: Schema error: No field named dt.d_date_sk.
Query 4 failed: Schema error: No field named c_customer_sk.
Query 5 failed: Schema error: No field named ss_store_sk.
Query 6 failed: Schema error: No field named d_year.
Query 7 failed: Schema error: No field named ss_sold_date_sk.
Query 8 failed: Schema error: No field named ca_zip.
Query 9 failed: Schema error: No field named r_reason_sk.
Query 10 failed: Schema error: No field named c.c_customer_sk.
Query 11 failed: Schema error: No field named c_customer_sk.
Query 12 failed: Schema error: No field named ws_item_sk.
Query 13 failed: Schema error: No field named s_store_sk.
Query 14 failed: Schema error: No field named ss_item_sk.
Query 15 failed: Schema error: No field named cs_bill_customer_sk.
Query 16 failed: Schema error: No field named cs1.cs_order_number.
Query 17 failed: Schema error: No field named d1.d_quarter_name.
Query 18 failed: Schema error: No field named cs_sold_date_sk.
Query 19 failed: Schema error: No field named d_date_sk.
Query 20 failed: Schema error: No field named cs_item_sk.
Query 21 failed: Schema error: No field named i_current_price.
Query 22 failed: Schema error: No field named inv_date_sk.
Query 23 failed: Schema error: No field named ss_sold_date_sk.
Query 24 failed: Schema error: No field named ss_ticket_number.
Query 25 failed: Schema error: No field named d1.d_moy.
Query 26 failed: Schema error: No field named cs_sold_date_sk.
Query 27 failed: Schema error: No field named ss_sold_date_sk.
Query 28 failed: Schema error: No field named ss_quantity.
Query 29 failed: Schema error: No field named d1.d_moy.
Query 30 failed: Schema error: No field named wr_returned_date_sk.
Query 31 failed: Schema error: No field named ss_sold_date_sk.
Query 32 failed: Schema error: No field named cs_item_sk.
Query 33 failed: Schema error: No field named i_category.
Query 34 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 35 failed: Schema error: No field named c.c_customer_sk.
Query 36 failed: Schema error: No field named d1.d_year.
Query 37 failed: Schema error: No field named i_current_price.
Query 38 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 39 failed: Schema error: No field named inv_item_sk.
Query 40 failed: Schema error: No field named cs_order_number.
Query 41 failed: Schema error: No field named i_manufact.
Query 42 failed: Schema error: No field named dt.d_date_sk.
Query 43 failed: Schema error: No field named d_date_sk.
Query 44 failed: Schema error: No field named ss_store_sk.
Query 45 failed: Schema error: No field named i_item_sk.
Query 46 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 47 failed: Schema error: No field named ss_item_sk.
Query 48 failed: Schema error: No field named s_store_sk.
Query 49 failed: Schema error: No field named ws.ws_order_number.
Query 50 failed: Schema error: No field named d2.d_year.
Query 51 failed: Schema error: No field named ws_sold_date_sk.
Query 52 failed: Schema error: No field named dt.d_date_sk.
Query 53 failed: Schema error: No field named ss_item_sk.
Query 54 failed: Schema error: No field named cs_sold_date_sk.
Query 55 failed: Schema error: No field named d_date_sk.
Query 56 failed: Schema error: No field named i_color.
Query 57 failed: Schema error: No field named cs_item_sk.
Query 58 failed: Schema error: No field named d_date.
Query 59 failed: Schema error: No field named d_date_sk.
Query 60 failed: Schema error: No field named i_category.
Query 61 failed: Schema error: No field named ss_sold_date_sk.
Query 62 failed: Schema error: No field named d_month_seq.
Query 63 failed: Schema error: No field named ss_item_sk.
Query 64 failed: Schema error: No field named cs_item_sk.
Query 65 failed: Schema error: No field named ss_sold_date_sk.
Query 66 failed: Schema error: No field named ws_warehouse_sk.
Query 67 failed: Schema error: No field named ss_sold_date_sk.
Query 68 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 69 failed: Schema error: No field named c.c_customer_sk.
Query 70 failed: Schema error: No field named d_month_seq.
Query 71 failed: Schema error: No field named d_date_sk.
Query 72 failed: Schema error: No field named cs_item_sk.
Query 73 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 74 failed: Schema error: No field named c_customer_sk.
Query 75 failed: Schema error: No field named i_item_sk.
Query 76 failed: Schema error: No field named ss_customer_sk.
Query 77 failed: Schema error: No field named ss_sold_date_sk.
Query 78 failed: Schema error: No field named wr_order_number.
Query 79 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 80 failed: Schema error: No field named ss_item_sk.
Query 81 failed: Schema error: No field named cr_returned_date_sk.
Query 82 failed: Schema error: No field named i_current_price.
Query 83 failed: Schema error: No field named d_date.
Query 84 failed: Schema error: No field named ca_city.
Query 85 failed: Schema error: No field named ws_web_page_sk.
Query 86 failed: Schema error: No field named d1.d_month_seq.
Query 87 failed: Schema error: No field named store_sales.ss_sold_date_sk.
Query 88 failed: Schema error: No field named ss_sold_time_sk.
Query 89 failed: Schema error: No field named ss_item_sk.
Query 90 failed: Schema error: No field named ws_sold_time_sk.
Query 91 failed: Schema error: No field named cr_call_center_sk.
Query 92 failed: Schema error: No field named ws_item_sk.
Query 93 failed: Schema error: No field named sr_item_sk.
Query 94 failed: Schema error: No field named ws1.ws_order_number.
Query 95 failed: Schema error: No field named ws1.ws_order_number.
Query 96 failed: Schema error: No field named ss_sold_time_sk.
Query 97 failed: Schema error: No field named ss_sold_date_sk.
Query 98 failed: Schema error: No field named ss_item_sk.
Query 99 failed: Schema error: No field named d_month_seq.
Failed Queries: Query 1, Query 2, Query 3, Query 4, Query 5, Query 6, Query 7, Query 8, Query 9, Query 10, Query 11, Query 12, Query 13, Query 14, Query 15, Query 16, Query 17, Query 18, Query 19, Query 20, Query 21, Query 22, Query 23, Query 24, Query 25, Query 26, Query 27, Query 28, Query 29, Query 30, Query 31, Query 32, Query 33, Query 34, Query 35, Query 36, Query 37, Query 38, Query 39, Query 40, Query 41, Query 42, Query 43, Query 44, Query 45, Query 46, Query 47, Query 48, Query 49, Query 50, Query 51, Query 52, Query 53, Query 54, Query 55, Query 56, Query 57, Query 58, Query 59, Query 60, Query 61, Query 62, Query 63, Query 64, Query 65, Query 66, Query 67, Query 68, Query 69, Query 70, Query 71, Query 72, Query 73, Query 74, Query 75, Query 76, Query 77, Query 78, Query 79, Query 80, Query 81, Query 82, Query 83, Query 84, Query 85, Query 86, Query 87, Query 88, Query 89, Query 90, Query 91, Query 92, Query 93, Query 94, Query 95, Query 96, Query 97, Query 98, Query 99
+ set +x
Done

@comphead what would you think about updating the bench.sh data tpcds command so it automatically downloads the data using wget?

For example something like

mkdir -p benchmarks/data/tpcds_sf1
wget https://github.com/apache/datafusion-benchmarks/raw/refs/heads/main/tpcds/data/sf1/call_center.parquet -O benchmarks/data/tpcds_sf1/call_center.parquet

?

@comphead
Contributor Author

comphead commented Dec 3, 2025

> When I ran the command locally
>
> ./benchmarks/bench.sh run tpcds
>
> I got a bunch of errors like
>
> Warning registering call_center: Table file does not exist: /Users/andrewlamb/Software/datafusion/benchmarks/data/call_center.parquet
>
> @comphead what would you think about updating the bench.sh data tpcds command so it automatically downloads the data using wget?
>
> For example something like
>
> mkdir -p benchmarks/data/tpcds_sf1
> wget https://github.com/apache/datafusion-benchmarks/raw/refs/heads/main/tpcds/data/sf1/call_center.parquet -O benchmarks/data/tpcds_sf1/call_center.parquet
>
> ?

I provided instructions to clone the repo.
I was also confused by this: if a data file is not found, the schema cannot be inferred and query parsing fails.

What do you think about throwing an error if no input data exists, pointing users to clone the repo?
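The fail-early idea proposed here could look roughly like the sketch below: check that every expected `<table>.parquet` file exists before any query is planned, and return one error that points at the data repository. `missing_tables` and `check_data_dir` are hypothetical names, not functions from this PR, and the error wording is only an example.

```rust
use std::path::Path;

/// Hypothetical helper: returns the tables whose `<table>.parquet` file
/// is absent from `data_dir`.
fn missing_tables(data_dir: &Path, tables: &[&str]) -> Vec<String> {
    tables
        .iter()
        .filter(|t| !data_dir.join(format!("{t}.parquet")).is_file())
        .map(|t| (*t).to_string())
        .collect()
}

/// Hypothetical helper: fails with a pointer to the data repository
/// when any table file is missing, instead of only printing warnings.
fn check_data_dir(data_dir: &Path, tables: &[&str]) -> Result<(), String> {
    let missing = missing_tables(data_dir, tables);
    if missing.is_empty() {
        return Ok(());
    }
    Err(format!(
        "missing TPC-DS tables in {}: {}. Clone https://github.com/apache/datafusion-benchmarks and point DATA_DIR at tpcds/data/sf1/",
        data_dir.display(),
        missing.join(", ")
    ))
}
```

Run before table registration, a check like this would replace the 99 "No field named ..." schema errors with a single message that names the missing files.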

@comphead
Contributor Author

comphead commented Dec 3, 2025

I fixed the scripts, but after the latest merge I'm getting

thread 'main' (115794474) panicked at datafusion/physical-expr/src/projection.rs:374:22:
Expected column reference in projection

on Q13, investigating

@comphead
Contributor Author

comphead commented Dec 3, 2025

Filed #19075. It looks like a regression: after the merge, Q13 fails in datafusion-cli.

Development

Successfully merging this pull request may close these issues.

[EPIC] Support TPC-DS benchmarks

4 participants