TPC-DS tools for Apache Impala
The official and latest TPC-DS tools and specification can be found at tpc.org
The query templates and sample queries provided in this repo are compliant with the standards set out by the TPC-DS benchmark specification and include only minor query modifications (MQMs) as set out by section 4.2.3 of the specification. The modification list can be found in
If you use this repo for any results publication, please see Fair Use of TPC Benchmarks.
Step 0: Environment Setup
Install Java JDK and Maven if need be:
sudo yum -y install java-1.8.0-openjdk-devel maven
Install the necessary development tools:
sudo yum -y install git gcc make flex bison byacc curl unzip patch
Step 1: Generate Data
Data generation is done via a MapReduce wrapper around TPC-DS
tpcds-gen/README.md for more details on the commands to generate the flat files.
Step 2: Load Data
Adjust the source/text and target/Parquet schema names and flat file paths in the sql files found in the
scripts/ directory. See the comments at the top of each.
Create external text file tables:
impala-shell -f impala-external.sql
Create Parquet tables:
impala-shell -f impala-parquet.sql
Load Parquet tables and compute stats:
impala-shell -f impala-insert.sql
Step 3: Run Queries
Sample queries from the 10TB scale factor can be found in the
queries/ directory. The
query-templates/ directory contains the Apache Impala TPC-DS query templates which can be used with
dsqgen (found in the official TPC-DS tools) to generate queries for other scale factors or to generate more queries with different substitution variables.