
Apache DataFusion Benchmarks

Overview

This repository is intended as a central resource for documentation and scripts for running queries derived from the industry standard TPC-H and TPC-DS benchmarks against DataFusion and its subprojects, as well as against other open-source query engines for comparison.

TPC-H and TPC-DS both operate on synthetic data, which can be generated at different "scale factors". A scale factor of 1 means that approximately 1 GB of CSV data is generated, and a scale factor of 1000 means that approximately 1 TB of data is generated.
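For example, using the reference TPC-H dbgen tool (assuming it has been built locally), scale factor 10, roughly 10 GB of raw data, can be generated with:

./dbgen -s 10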

TPC Legal Considerations

It is important to know that the TPC benchmarks are copyrighted IP of the Transaction Processing Performance Council (TPC), and only members of the TPC consortium are allowed to publish official TPC benchmark results. Fun fact: only four companies have published official TPC-DS benchmark results so far, and those results can be seen here.

However, anyone is welcome to create derivative benchmarks under the TPC's fair use policy, and that is what we are doing here. We do not aim to run a true TPC benchmark (which is a significant endeavor). We are just running the individual queries and recording the timings.

Throughout this document, and when talking about these benchmarks, you will see the terms "derived from TPC-H" and "derived from TPC-DS". We are required to use this terminology, as explained in the TPC fair-use policy (PDF).

The DataFusion benchmarks are non-TPC benchmarks. The TPC prohibits any comparison of official TPC results with non-TPC workloads.

Data Generation

See the benchmark-specific instructions for generating the CSV data for TPC-H and TPC-DS and for converting that data to Parquet format. Although it is valid to run benchmarks against CSV data, this is not representative of how most OLAP workloads are run in practice, especially at larger scales. When benchmarking DataFusion and its subprojects, we therefore typically query Parquet data. We also typically do not want a single file per table, so the data needs to be repartitioned. The provided scripts take care of both the conversion and the repartitioning.
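
As a rough illustration of this conversion step (not the repository's actual scripts), the following Python sketch uses pyarrow to read one generated CSV table and write it back out as several Parquet files. The table name, partition count, and output paths are illustrative assumptions:

    # Minimal sketch: convert one generated CSV table to Parquet and split it
    # into several files so the benchmark does not read a single file per table.
    # File names and partition count are hypothetical.
    import os
    import pyarrow.csv as pc
    import pyarrow.parquet as pq

    table = pc.read_csv("lineitem.csv")   # CSV produced by the data generator
    num_parts = 8                         # hypothetical partition count
    rows_per_part = (table.num_rows + num_parts - 1) // num_parts

    os.makedirs("lineitem", exist_ok=True)
    for i in range(num_parts):
        chunk = table.slice(i * rows_per_part, rows_per_part)
        pq.write_table(chunk, f"lineitem/part-{i}.parquet")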

Running the Benchmarks with DataFusion

Scripts are available for the following DataFusion projects:

These benchmarking scripts produce JSON files containing query timings.

Comparing Results

The Python script scripts/generate-comparison.py can be used to produce charts comparing results from different benchmark runs.

For example:

python scripts/generate-comparison.py file1.json file2.json --labels "Spark" "Comet" --benchmark "TPC-H 100GB"

This will create image files in the current directory in PNG format.
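
For readers curious about the general approach, the sketch below is not scripts/generate-comparison.py itself; it assumes a hypothetical timings format (a JSON object mapping query names to elapsed seconds) and shows how two runs could be plotted as grouped bars with matplotlib and saved as a PNG:

    # Illustrative sketch only -- not the repository's comparison script.
    # Assumes each JSON file maps query name -> elapsed seconds.
    import json
    import sys
    import numpy as np
    import matplotlib.pyplot as plt

    files, labels = sys.argv[1:3], sys.argv[3:5]   # e.g. file1.json file2.json Spark Comet
    runs = [json.load(open(f)) for f in files]
    queries = sorted(runs[0].keys())

    x = np.arange(len(queries))
    width = 0.8 / len(runs)
    for i, (run, label) in enumerate(zip(runs, labels)):
        plt.bar(x + i * width, [run[q] for q in queries], width, label=label)

    plt.xticks(x + width * (len(runs) - 1) / 2, queries, rotation=90)
    plt.ylabel("Query time (s)")
    plt.legend()
    plt.tight_layout()
    plt.savefig("comparison.png")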

Legal Notices

TPC-H is Copyright © 1993-2022 Transaction Processing Performance Council. The full TPC-H specification in PDF format can be found here.

TPC-DS is Copyright © 2021 Transaction Processing Performance Council. The full TPC-DS specification in PDF format can be found here.

TPC, TPC Benchmark, TPC-H, and TPC-DS are trademarks of the Transaction Processing Performance Council.