Add h2o SQL benchmark #22660

Open
Omega359 wants to merge 1 commit into apache:main from Omega359:sql-benchmarks/h2o

@Omega359 Omega359 commented May 31, 2026

Which issue does this PR close?

Part of #21706

Rationale for this change

Continue work on the SQL benchmark migration.

What changes are included in this PR?

Adds the h2o SQL benchmark.

Are these changes tested?

Yes

BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=small H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=small H2O_FILE_TYPE=parquet cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=medium H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=medium H2O_FILE_TYPE=parquet cargo bench --bench sql

BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=small H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=small H2O_FILE_TYPE=parquet cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=medium H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=medium H2O_FILE_TYPE=parquet cargo bench --bench sql

BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=small H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=small H2O_FILE_TYPE=parquet cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=medium H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=medium H2O_FILE_TYPE=parquet cargo bench --bench sql
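
The twelve invocations above are the cross product of three subgroups, two sizes, and two file types. A minimal sketch of a loop that generates the same matrix is shown here; `run_h2o_matrix` and the `RUN` switch are hypothetical names introduced for illustration and are not part of this PR. By default the function only prints the commands.

```shell
# Hypothetical helper, not part of the PR: walks the subgroup x size x file-type
# matrix listed above. Prints each command unless RUN=1 is set, in which case
# it executes them.
run_h2o_matrix() {
  for subgroup in groupby join window; do
    for size in small medium; do
      for file_type in csv parquet; do
        if [ "${RUN:-0}" = "1" ]; then
          BENCH_NAME=h2o BENCH_SUBGROUP="$subgroup" \
            H2O_BENCH_SIZE="$size" H2O_FILE_TYPE="$file_type" \
            cargo bench --bench sql
        else
          echo "BENCH_NAME=h2o BENCH_SUBGROUP=$subgroup H2O_BENCH_SIZE=$size H2O_FILE_TYPE=$file_type cargo bench --bench sql"
        fi
      done
    done
  done
}

run_h2o_matrix
```

Adding `big` to the size list would cover the runs listed below as untested, given enough memory.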

I was unable to run the following because of limited memory:

BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=big H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=big H2O_FILE_TYPE=parquet cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=big H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=big H2O_FILE_TYPE=parquet cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=big H2O_FILE_TYPE=csv cargo bench --bench sql
BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=big H2O_FILE_TYPE=parquet cargo bench --bench sql

Are there any user-facing changes?

No

@Omega359 Omega359 marked this pull request as ready for review May 31, 2026 15:04
@adriangb adriangb requested a review from Copilot June 2, 2026 01:29

@adriangb adriangb left a comment

This looks good to my eyes; it's mechanical. It's very nice how the framework you designed already fits all of these shapes, including the single-dataset, multi-benchmark setup. I kicked off a Copilot review just to see if there's anything I didn't catch.


Copilot AI left a comment

Pull request overview

Adds the H2O SQL benchmark suite to DataFusion’s SQL benchmark harness as part of the ongoing migration to SQL-driven benchmarks (#21706). This introduces benchmark definitions for the H2O groupby/join/window subgroups plus corresponding dataset “load” SQL scripts.

Changes:

  • Added H2O .benchmark files for groupby, join, and window subgroups.
  • Added H2O init/ SQL scripts to register external tables for CSV and Parquet across small/medium/big dataset sizes.

Reviewed changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| benchmarks/sql_benchmarks/h2o/init/load_window_small_parquet.sql | Registers “small” window parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_small_csv.sql | Registers “small” window csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_medium_parquet.sql | Registers “medium” window parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_medium_csv.sql | Registers “medium” window csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_big_parquet.sql | Registers “big” window parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_big_csv.sql | Registers “big” window csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_join_small_parquet.sql | Registers “small” join parquet datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_small_csv.sql | Registers “small” join csv datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_medium_parquet.sql | Registers “medium” join parquet datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_medium_csv.sql | Registers “medium” join csv datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_big_parquet.sql | Registers “big” join parquet datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_big_csv.sql | Registers “big” join csv datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_small_parquet.sql | Registers “small” groupby parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_small_csv.sql | Registers “small” groupby csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_medium_parquet.sql | Registers “medium” groupby parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_medium_csv.sql | Registers “medium” groupby csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_big_parquet.sql | Registers “big” groupby parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_big_csv.sql | Registers “big” groupby csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q01.benchmark | Window benchmark query Q01 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q02.benchmark | Window benchmark query Q02 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q03.benchmark | Window benchmark query Q03 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q04.benchmark | Window benchmark query Q04 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q05.benchmark | Window benchmark query Q05 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q06.benchmark | Window benchmark query Q06 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q07.benchmark | Window benchmark query Q07 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q08.benchmark | Window benchmark query Q08 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q09.benchmark | Window benchmark query Q09 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q10.benchmark | Window benchmark query Q10 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q11.benchmark | Window benchmark query Q11 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q12.benchmark | Window benchmark query Q12 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q01.benchmark | Join benchmark query Q01 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q02.benchmark | Join benchmark query Q02 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q03.benchmark | Join benchmark query Q03 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q04.benchmark | Join benchmark query Q04 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q05.benchmark | Join benchmark query Q05 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q01.benchmark | Groupby benchmark query Q01 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q02.benchmark | Groupby benchmark query Q02 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q03.benchmark | Groupby benchmark query Q03 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q04.benchmark | Groupby benchmark query Q04 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q05.benchmark | Groupby benchmark query Q05 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q06.benchmark | Groupby benchmark query Q06 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q07.benchmark | Groupby benchmark query Q07 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q08.benchmark | Groupby benchmark query Q08 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q09.benchmark | Groupby benchmark query Q09 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q10.benchmark | Groupby benchmark query Q10 definition |

@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/G1_1e7_1e7_100_0.csv';
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/G1_1e7_1e7_100_0.parquet';
\ No newline at end of file
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/G1_1e8_1e8_100_0.csv';
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/G1_1e8_1e8_100_0.parquet';
\ No newline at end of file
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/G1_1e9_1e9_100_0.csv';
\ No newline at end of file
Comment on lines +1 to +5
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/J1_1e7_NA_0.csv';

CREATE EXTERNAL TABLE small STORED AS CSV LOCATION 'data/h2o/J1_1e7_1e1_0.csv';

CREATE EXTERNAL TABLE medium STORED AS CSV LOCATION 'data/h2o/J1_1e7_1e4_0.csv';
Comment on lines +1 to +5
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/J1_1e8_NA_0.parquet';

CREATE EXTERNAL TABLE small STORED AS PARQUET LOCATION 'data/h2o/J1_1e8_1e2_0.parquet';

CREATE EXTERNAL TABLE medium STORED AS PARQUET LOCATION 'data/h2o/J1_1e8_1e5_0.parquet';
Comment on lines +1 to +5
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/J1_1e8_NA_0.csv';

CREATE EXTERNAL TABLE small STORED AS CSV LOCATION 'data/h2o/J1_1e8_1e2_0.csv';

CREATE EXTERNAL TABLE medium STORED AS CSV LOCATION 'data/h2o/J1_1e8_1e5_0.csv';
Comment on lines +1 to +5
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/J1_1e9_NA_0.parquet';

CREATE EXTERNAL TABLE small STORED AS PARQUET LOCATION 'data/h2o/J1_1e9_1e3_0.parquet';

CREATE EXTERNAL TABLE medium STORED AS PARQUET LOCATION 'data/h2o/J1_1e9_1e6_0.parquet';

@Omega359 this does look like a real finding when compared with the existing TPCH scripts:

) STORED AS CSV LOCATION '${DATA_DIR:-data}/tpch_sf${BENCH_SIZE:-1}/csv/nation/nation.1.csv';
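
The TPCH line quoted above parameterizes the data directory with `${DATA_DIR:-data}`, while the H2O scripts in this diff hard-code `data/`. The sketch below shows how that default-value syntax resolves in a POSIX shell. Whether the benchmark harness applies exactly the same substitution rules to `.sql` files is an assumption, and `h2o_location` and `/mnt/bench` are hypothetical names used only for illustration.

```shell
# Hypothetical helper, not part of the PR: builds an H2O data path the way a
# POSIX shell expands ${DATA_DIR:-data}. That the benchmark harness resolves
# the placeholder identically is an assumption.
h2o_location() {
  printf '%s/h2o/%s\n' "${DATA_DIR:-data}" "$1"
}

unset DATA_DIR
h2o_location J1_1e7_NA_0.csv                          # DATA_DIR unset: falls back to "data"
(DATA_DIR=/mnt/bench; h2o_location J1_1e7_NA_0.csv)   # DATA_DIR set: override is used
```

With `DATA_DIR` unset the first call prints `data/h2o/J1_1e7_NA_0.csv`; the second prints `/mnt/bench/h2o/J1_1e7_NA_0.csv`.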

Comment on lines +1 to +5
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/J1_1e9_NA_0.csv';

CREATE EXTERNAL TABLE small STORED AS CSV LOCATION 'data/h2o/J1_1e9_1e3_0.csv';

CREATE EXTERNAL TABLE medium STORED AS CSV LOCATION 'data/h2o/J1_1e9_1e6_0.csv';
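
The join snippets above follow a regular naming pattern: for a left table of 1e`N` rows, the `small` file uses exponent `N-6` and the `medium` file uses exponent `N-3`. A minimal sketch that reproduces those names is shown here, with `h2o_join_files` as a hypothetical helper that is not part of the PR. The `large` table does not appear in the snippets, so it is left out rather than guessed.

```shell
# Hypothetical helper, not part of the PR: prints the join data file names for
# a given row-count exponent, following the pattern visible in the snippets.
# The "large" table is not shown in those snippets, so it is omitted here.
h2o_join_files() {
  exp=$1   # row-count exponent of table x, e.g. 7 for 1e7
  ext=$2   # file extension: csv or parquet
  echo "x=J1_1e${exp}_NA_0.${ext}"
  echo "small=J1_1e${exp}_1e$((exp - 6))_0.${ext}"
  echo "medium=J1_1e${exp}_1e$((exp - 3))_0.${ext}"
}

h2o_join_files 7 csv
```

Calling it with `8 parquet` or `9 csv` yields the file names used in the other join scripts shown above.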
3 participants