Add h2o SQL benchmark #22660
adriangb
left a comment
This looks good to my eyes; it's mechanical. It's very nice how the framework you designed already fits all of these shapes, including the single-dataset, multi-benchmark setup. I kicked off a Copilot review just to see if there's anything I didn't catch.
Pull request overview
Adds the H2O SQL benchmark suite to DataFusion’s SQL benchmark harness as part of the ongoing migration to SQL-driven benchmarks (#21706). This introduces benchmark definitions for the H2O groupby/join/window subgroups plus corresponding dataset “load” SQL scripts.
Changes:
- Added H2O `.benchmark` files for `groupby`, `join`, and `window` subgroups.
- Added H2O `init/` SQL scripts to register external tables for CSV and Parquet across small/medium/big dataset sizes.
Reviewed changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| benchmarks/sql_benchmarks/h2o/init/load_window_small_parquet.sql | Registers “small” window parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_small_csv.sql | Registers “small” window csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_medium_parquet.sql | Registers “medium” window parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_medium_csv.sql | Registers “medium” window csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_big_parquet.sql | Registers “big” window parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_window_big_csv.sql | Registers “big” window csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_join_small_parquet.sql | Registers “small” join parquet datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_small_csv.sql | Registers “small” join csv datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_medium_parquet.sql | Registers “medium” join parquet datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_medium_csv.sql | Registers “medium” join csv datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_big_parquet.sql | Registers “big” join parquet datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_join_big_csv.sql | Registers “big” join csv datasets (x/small/medium/large) |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_small_parquet.sql | Registers “small” groupby parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_small_csv.sql | Registers “small” groupby csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_medium_parquet.sql | Registers “medium” groupby parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_medium_csv.sql | Registers “medium” groupby csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_big_parquet.sql | Registers “big” groupby parquet dataset as external table |
| benchmarks/sql_benchmarks/h2o/init/load_groupby_big_csv.sql | Registers “big” groupby csv dataset as external table |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q01.benchmark | Window benchmark query Q01 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q02.benchmark | Window benchmark query Q02 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q03.benchmark | Window benchmark query Q03 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q04.benchmark | Window benchmark query Q04 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q05.benchmark | Window benchmark query Q05 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q06.benchmark | Window benchmark query Q06 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q07.benchmark | Window benchmark query Q07 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q08.benchmark | Window benchmark query Q08 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q09.benchmark | Window benchmark query Q09 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q10.benchmark | Window benchmark query Q10 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q11.benchmark | Window benchmark query Q11 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/window/q12.benchmark | Window benchmark query Q12 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q01.benchmark | Join benchmark query Q01 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q02.benchmark | Join benchmark query Q02 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q03.benchmark | Join benchmark query Q03 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q04.benchmark | Join benchmark query Q04 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/join/q05.benchmark | Join benchmark query Q05 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q01.benchmark | Groupby benchmark query Q01 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q02.benchmark | Groupby benchmark query Q02 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q03.benchmark | Groupby benchmark query Q03 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q04.benchmark | Groupby benchmark query Q04 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q05.benchmark | Groupby benchmark query Q05 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q06.benchmark | Groupby benchmark query Q06 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q07.benchmark | Groupby benchmark query Q07 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q08.benchmark | Groupby benchmark query Q08 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q09.benchmark | Groupby benchmark query Q09 definition |
| benchmarks/sql_benchmarks/h2o/benchmarks/groupby/q10.benchmark | Groupby benchmark query Q10 definition |
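
The 18 init scripts in the table follow a regular `load_<subgroup>_<size>_<format>.sql` pattern. As a rough sketch (the function name `h2o_init_scripts` is hypothetical and not part of the PR), the full set can be enumerated as follows, which may help when checking that no combination is missing:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the init script paths implied by the file table.
# The directory and naming pattern come from the table; the function name
# is an illustration only and does not exist in the PR.
h2o_init_scripts() {
  local dir="benchmarks/sql_benchmarks/h2o/init"
  local subgroup size format
  for subgroup in groupby join window; do
    for size in small medium big; do
      for format in csv parquet; do
        echo "${dir}/load_${subgroup}_${size}_${format}.sql"
      done
    done
  done
}

h2o_init_scripts
```

Piping the output through `xargs ls` from the repository root would be one way to confirm that each listed file exists.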
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/G1_1e7_1e7_100_0.csv';
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/G1_1e7_1e7_100_0.parquet';
\ No newline at end of file
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/G1_1e8_1e8_100_0.csv';
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/G1_1e8_1e8_100_0.parquet';
\ No newline at end of file
@@ -0,0 +1 @@
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/G1_1e9_1e9_100_0.csv';
\ No newline at end of file
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/J1_1e7_NA_0.csv';
CREATE EXTERNAL TABLE small STORED AS CSV LOCATION 'data/h2o/J1_1e7_1e1_0.csv';
CREATE EXTERNAL TABLE medium STORED AS CSV LOCATION 'data/h2o/J1_1e7_1e4_0.csv';
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/J1_1e8_NA_0.parquet';
CREATE EXTERNAL TABLE small STORED AS PARQUET LOCATION 'data/h2o/J1_1e8_1e2_0.parquet';
CREATE EXTERNAL TABLE medium STORED AS PARQUET LOCATION 'data/h2o/J1_1e8_1e5_0.parquet';
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/J1_1e8_NA_0.csv';
CREATE EXTERNAL TABLE small STORED AS CSV LOCATION 'data/h2o/J1_1e8_1e2_0.csv';
CREATE EXTERNAL TABLE medium STORED AS CSV LOCATION 'data/h2o/J1_1e8_1e5_0.csv';
CREATE EXTERNAL TABLE x STORED AS PARQUET LOCATION 'data/h2o/J1_1e9_NA_0.parquet';
CREATE EXTERNAL TABLE small STORED AS PARQUET LOCATION 'data/h2o/J1_1e9_1e3_0.parquet';
CREATE EXTERNAL TABLE medium STORED AS PARQUET LOCATION 'data/h2o/J1_1e9_1e6_0.parquet';
@Omega359 this does look like a real finding when compared to the existing TPCH scripts:
CREATE EXTERNAL TABLE x STORED AS CSV LOCATION 'data/h2o/J1_1e9_NA_0.csv';
CREATE EXTERNAL TABLE small STORED AS CSV LOCATION 'data/h2o/J1_1e9_1e3_0.csv';
CREATE EXTERNAL TABLE medium STORED AS CSV LOCATION 'data/h2o/J1_1e9_1e6_0.csv';
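
The join paths in these snippets follow a visible pattern: for a left table `x` with 1e`N` rows, the `small` and `medium` tables use the exponents `N-6` and `N-3`. A minimal sketch of that naming, assuming only what the snippets show (the function `h2o_join_paths` is hypothetical, and the `large` table is left out because its path does not appear in the snippets):

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the join dataset paths from the exponent of the
# left table's row count (7, 8, or 9) and a file extension (csv or parquet).
# The pattern is inferred from the quoted CREATE EXTERNAL TABLE statements.
h2o_join_paths() {
  local exp="$1" ext="$2"
  local base="data/h2o/J1_1e${exp}"
  echo "x ${base}_NA_0.${ext}"
  echo "small ${base}_1e$((exp - 6))_0.${ext}"
  echo "medium ${base}_1e$((exp - 3))_0.${ext}"
}

h2o_join_paths 7 csv
h2o_join_paths 9 parquet
```

Each output line pairs a table name with its file path, so the result can be compared directly against the statements in an init script.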
Which issue does this PR close?
Part of #21706
Rationale for this change
Continue work on the SQL benchmark migration.
What changes are included in this PR?
Adds the h2o SQL benchmark.
Are these changes tested?
Yes
- `BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=small H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=small H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=medium H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=medium H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=small H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=small H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=medium H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=medium H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=small H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=small H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=medium H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=medium H2O_FILE_TYPE=parquet cargo bench --bench sql`

I was unable to run the following because of limited memory:
- `BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=big H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=groupby H2O_BENCH_SIZE=big H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=big H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=join H2O_BENCH_SIZE=big H2O_FILE_TYPE=parquet cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=big H2O_FILE_TYPE=csv cargo bench --bench sql`
- `BENCH_NAME=h2o BENCH_SUBGROUP=window H2O_BENCH_SIZE=big H2O_FILE_TYPE=parquet cargo bench --bench sql`

Are there any user-facing changes?
No
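
The commands listed under "Are these changes tested?" form a regular matrix of subgroup, size, and file type. As a convenience sketch (not part of the PR; the `run_h2o_matrix` function and the `DRY_RUN` variable are hypothetical), the same commands can be generated with a loop. It prints the commands by default and only executes them when `DRY_RUN=0` is set:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the benchmark invocations quoted in the PR.
# The first argument is a space-separated list of sizes; it defaults to the
# sizes the author reports having run (small and medium).
run_h2o_matrix() {
  local sizes="${1:-small medium}"
  local subgroup size format
  for subgroup in groupby join window; do
    for size in $sizes; do
      for format in csv parquet; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
          # Dry run: print the command instead of executing it.
          echo "BENCH_NAME=h2o BENCH_SUBGROUP=${subgroup} H2O_BENCH_SIZE=${size} H2O_FILE_TYPE=${format} cargo bench --bench sql"
        else
          BENCH_NAME=h2o BENCH_SUBGROUP="${subgroup}" \
            H2O_BENCH_SIZE="${size}" H2O_FILE_TYPE="${format}" \
            cargo bench --bench sql
        fi
      done
    done
  done
}

run_h2o_matrix
```

Calling `run_h2o_matrix big` would print the six `big` combinations that could not be run because of limited memory.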