Benchmarks comparing query engines against the same dataset stored in three formats: CSV, Parquet, and Vortex.
The query is simple but realistic: count transaction failures grouped by day across a multi-file dataset (Q3/Q4 2025, one CSV per day).
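
To make the query concrete, here is a minimal pure-Python reference version of "count failures grouped by day across many CSV files". It is only a sketch for sanity-checking engine output, not one of the benchmarked scripts. The column names `date` and `status` and the value `failed` are assumptions, since the actual schema is not documented here.

```python
import csv
from collections import Counter
from pathlib import Path


def failures_by_day(data_dir, date_col="date", status_col="status", failed_value="failed"):
    """Count failed transactions per day across every CSV file in data_dir.

    The column names and the failure marker are hypothetical defaults;
    adjust them to match the real dataset.
    """
    counts = Counter()
    # Read the daily files in name order so the scan is deterministic.
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as fh:
            for row in csv.DictReader(fh):
                if row[status_col] == failed_value:
                    counts[row[date_col]] += 1
    # Return a plain dict ordered by day.
    return dict(sorted(counts.items()))
```

Days with no failures are absent from the result rather than reported as zero, which is also how a filtered `GROUP BY` behaves in the SQL engines.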

| Path | Format | Generated from |
|---|---|---|
| data/ | CSV (one file per day) | source |
| data_parquet/ | Parquet (batched 10 CSVs per file) | convert_to_parquet.py |
| data_vortex/ | Vortex | convert_to_vortex.py |

Conversion pipeline: CSV → Parquet (via DuckDB) → Vortex (via vortex.io.write).
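
The "10 CSVs per Parquet file" batching can be sketched as below. This illustrates the grouping step only and does not reproduce `convert_to_parquet.py`. The helper names and the `YYYY-MM-DD.csv` file naming are assumptions.

```python
from datetime import date, timedelta


def daily_csv_names(start, end):
    """Hypothetical file names, one per day, for the inclusive range start..end."""
    days = (end - start).days + 1
    return [f"{start + timedelta(days=i)}.csv" for i in range(days)]


def batch_paths(paths, batch_size=10):
    """Split paths into consecutive, name-ordered groups of at most batch_size."""
    ordered = sorted(paths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Q3/Q4 2025 spans 1 July through 31 December.
names = daily_csv_names(date(2025, 7, 1), date(2025, 12, 31))
batches = batch_paths(names)
print(len(names), len(batches), len(batches[-1]))
```

If every day in the range has a file, that is 184 CSVs, which gives 18 full batches plus a final batch of 4.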
| Script | Engine |
|---|---|
| failures_by_day.py | DuckDB |
| failures_by_day_polars.py | Polars |
| failures_by_day_datafusion.py | DataFusion |

| Script | Engine |
|---|---|
| failures_by_day_duckdb.py | DuckDB |
| failures_by_day_polars.py | Polars |
| failures_by_day_datafusion.py | DataFusion |

| Script | Engine |
|---|---|
| failures_by_day_duckdb.py | DuckDB (via PyArrow bridge) |
| failures_by_day_polars.py | Polars (via PyArrow dataset) |
| failures_by_day_pyarrow.py | PyArrow + DuckDB |
| failures_by_day_vortex.py | Native Vortex scan + push-down filter |

Note: DataFusion is not benchmarked against Vortex. DataFusion has no native Vortex reader, and bridging via PyArrow (the same approach used for DuckDB and Polars) has not been implemented yet.
```sh
# setup
uv sync
# run a single benchmark
uv run python failures_by_day.py
# run all benchmarks and produce a grouped bar chart
uv run python run_benchmarks.py
```

- DuckDB
- Polars
- DataFusion
- Vortex
- PyArrow
- matplotlib (for run_benchmarks.py)
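
The internals of `run_benchmarks.py` are not shown in this README. As a rough sketch of how a runner could time each script, the hypothetical helper below runs a command several times and keeps the fastest wall-clock time. The function name, the repeat count, and the best-of-N policy are all assumptions.

```python
import subprocess
import sys
import time


def time_script(argv, repeats=3):
    """Run argv `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Example: time a trivial Python process (stand-in for a benchmark script).
elapsed = time_script([sys.executable, "-c", "pass"], repeats=2)
print(f"{elapsed:.3f}s")
```

Timing a whole subprocess includes interpreter startup and library import time, so for short queries the measured numbers partly reflect how long each engine takes to load rather than how fast it scans.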