danielbeach/benchmarkingVortex

benchingVortex

Benchmarks comparing query engines against the same dataset stored in three formats: CSV, Parquet, and Vortex.

The query is simple but realistic: count transaction failures grouped by day across a multi-file dataset (Q3/Q4 2025, one CSV per day).
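To make the query concrete, here is a minimal sketch of the aggregation, once as DuckDB SQL and once as a standard-library reference implementation. The column names `transaction_date` and `status` and the status value `failed` are assumptions for illustration; the real CSV headers are not shown in this README and may differ.

```python
import csv
from collections import Counter
from pathlib import Path

# Assumed schema: these names and the 'failed' value are hypothetical.
DATE_COL = "transaction_date"
STATUS_COL = "status"
FAILED = "failed"


def failures_by_day_sql(source: str) -> str:
    """Build the aggregation as SQL.

    `source` is a table expression such as read_csv('data/*.csv').
    """
    return (
        f"SELECT {DATE_COL}, COUNT(*) AS failures "
        f"FROM {source} "
        f"WHERE {STATUS_COL} = '{FAILED}' "
        f"GROUP BY {DATE_COL} ORDER BY {DATE_COL}"
    )


def failures_by_day_python(csv_dir: str) -> dict[str, int]:
    """Reference implementation: scan every CSV and count failed rows per day."""
    counts: Counter = Counter()
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="") as fh:
            for row in csv.DictReader(fh):
                if row[STATUS_COL] == FAILED:
                    counts[row[DATE_COL]] += 1
    return dict(sorted(counts.items()))


def failures_by_day_duckdb(csv_glob: str = "data/*.csv"):
    """Run the same query in DuckDB over a glob of CSV files."""
    import duckdb  # imported lazily so the reference version runs without it

    return duckdb.sql(failures_by_day_sql(f"read_csv('{csv_glob}')")).fetchall()
```

The pure-Python version is only a correctness baseline; the benchmark scripts themselves are what get timed.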


Data

| Path | Format | Generated from |
|---|---|---|
| data/ | CSV (one file per day) | source |
| data_parquet/ | Parquet (10 CSVs batched per file) | convert_to_parquet.py |
| data_vortex/ | Vortex | convert_to_vortex.py |

Conversion pipeline: CSV → Parquet (via DuckDB) → Vortex (via vortex.io.write).
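The pipeline above could be sketched roughly as follows. This is not the content of `convert_to_parquet.py` or `convert_to_vortex.py`; the output file naming (`batch_000.parquet`) is hypothetical, and the exact argument types accepted by `vortex.io.write` are an assumption that may vary between Vortex versions.

```python
from pathlib import Path


def batch(items: list, size: int = 10) -> list[list]:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def copy_statement(csv_paths: list[str], out_path: str) -> str:
    """DuckDB COPY statement that writes several CSVs into one Parquet file."""
    file_list = ", ".join(f"'{p}'" for p in csv_paths)
    return (
        f"COPY (SELECT * FROM read_csv([{file_list}])) "
        f"TO '{out_path}' (FORMAT PARQUET)"
    )


def csv_to_parquet(csv_dir: str = "data", out_dir: str = "data_parquet") -> None:
    """Convert daily CSVs to Parquet, ten CSVs per output file."""
    import duckdb  # lazy import: only needed when actually converting

    Path(out_dir).mkdir(exist_ok=True)
    files = sorted(str(p) for p in Path(csv_dir).glob("*.csv"))
    for n, group in enumerate(batch(files)):
        # Hypothetical naming scheme for the batched files.
        duckdb.execute(copy_statement(group, f"{out_dir}/batch_{n:03d}.parquet"))


def parquet_to_vortex(parquet_dir: str = "data_parquet",
                      out_dir: str = "data_vortex") -> None:
    """Rewrite each Parquet file as a Vortex file."""
    import pyarrow.parquet as pq
    import vortex as vx

    Path(out_dir).mkdir(exist_ok=True)
    for p in sorted(Path(parquet_dir).glob("*.parquet")):
        table = pq.read_table(p)
        # Assumption: vortex.io.write takes a Vortex array and a path.
        vx.io.write(vx.array(table), str(Path(out_dir) / f"{p.stem}.vortex"))
```

Going through Parquet first means the Vortex files inherit the types DuckDB inferred from the CSVs, rather than re-inferring them.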


Benchmark scripts

CSV

| Script | Engine |
|---|---|
| failures_by_day.py | DuckDB |
| failures_by_day_polars.py | Polars |
| failures_by_day_datafusion.py | DataFusion |

Parquet (parquet_scripts/)

| Script | Engine |
|---|---|
| failures_by_day_duckdb.py | DuckDB |
| failures_by_day_polars.py | Polars |
| failures_by_day_datafusion.py | DataFusion |

Vortex (vortex_scripts/)

| Script | Engine |
|---|---|
| failures_by_day_duckdb.py | DuckDB (via PyArrow bridge) |
| failures_by_day_polars.py | Polars (via PyArrow dataset) |
| failures_by_day_pyarrow.py | PyArrow + DuckDB |
| failures_by_day_vortex.py | Native Vortex scan + push-down filter |

Note: DataFusion is not benchmarked against Vortex. DataFusion has no native Vortex reader, and bridging via PyArrow (the same approach used for DuckDB and Polars) has not been implemented yet.
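For readers unfamiliar with the PyArrow bridge mentioned for DuckDB, the idea can be sketched as below: once the data is available as an Arrow table or dataset, DuckDB can query it in place. How the Arrow object is produced from a Vortex file is deliberately left out, since the reader call depends on the Vortex version. The view name `transactions` and the column names in the usage note are hypothetical.

```python
def query_arrow_with_duckdb(arrow_obj, sql: str, name: str = "transactions"):
    """Expose an Arrow table or dataset to DuckDB under `name` and run `sql`.

    `arrow_obj` is anything DuckDB can register, for example a
    pyarrow.Table or a pyarrow.dataset.Dataset.
    """
    import duckdb  # lazy import so this module loads without duckdb installed

    con = duckdb.connect()  # in-memory database
    con.register(name, arrow_obj)  # registers the Arrow data as a view
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()
```

A call might look like `query_arrow_with_duckdb(table, "SELECT transaction_date, COUNT(*) FROM transactions WHERE status = 'failed' GROUP BY transaction_date")`, where `table` is the Arrow data read from Vortex.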


Running

```shell
# setup
uv sync

# run a single benchmark
uv run python failures_by_day.py

# run all benchmarks and produce a grouped bar chart
uv run python run_benchmarks.py
```
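A runner of this kind could be sketched as below. This is not the actual `run_benchmarks.py`: the best-of-N timing policy, the function names, and the `SCRIPTS` mapping (a subset of the scripts listed above) are assumptions, and the chart step is omitted.

```python
import subprocess
import sys
import time


def time_script(script: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the benchmark script exits with an error.
        subprocess.run([sys.executable, script], check=True, capture_output=True)
        best = min(best, time.perf_counter() - start)
    return best


def run_all(scripts: dict[str, dict[str, str]],
            repeats: int = 3) -> dict[str, dict[str, float]]:
    """Time every script; input and output are keyed by format, then engine."""
    return {
        fmt: {engine: time_script(path, repeats) for engine, path in engines.items()}
        for fmt, engines in scripts.items()
    }


# Hypothetical subset of the benchmark scripts, grouped by storage format.
SCRIPTS = {
    "CSV": {
        "DuckDB": "failures_by_day.py",
        "Polars": "failures_by_day_polars.py",
    },
    "Parquet": {
        "DuckDB": "parquet_scripts/failures_by_day_duckdb.py",
        "Polars": "parquet_scripts/failures_by_day_polars.py",
    },
}
```

Calling `run_all(SCRIPTS)` from the repository root returns nested timings in the shape a grouped bar chart needs: one group per format, one bar per engine. Each timing includes interpreter startup, which matters less as the dataset grows.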

Dependencies

About

Trying DuckDB, Polars, and DataFusion on Vortex.
