This is the Code Availability Package for the paper "Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines".
This project is written in the Scala programming language and uses SBT for build management. To build the project, run

```
sbt package
```

To submit a job to Spark, run the following command or use one of the provided benchmark scripts (see below):
```
spark-submit --master spark://hostname:7077 --class de.unikl.cs.dbis.waves.testjobs.$1 waves.jar \
  inputPath=/path/to/json/input wavesPath=hdfs://namenode/target/location inputSchemaPath=/path/to/schema.json
```

This repository contains a general framework for managing ingestion pipelines, including code for previous publications. The most relevant implementations for this paper are the following:
The Spark jobs run for the evaluation are the following:
- Gini approach
- Gini+ approach
- Global approach
- Builtin approach
- Schema Extraction (used for warming up the caches)
- Allocation Measurement
- BETZE Queries Twitter
- BETZE Queries GitHub
- TPC-DS Dataset Transformation
- TPC-DS Queries
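As an illustration, a single evaluation job can be submitted by substituting its class name into the template above. Note that the class name `SchemaExtraction` and all paths below are assumptions for this sketch, not names verified against the repository; substitute the actual job class and your cluster's paths.

```shell
# Hypothetical invocation of the schema-extraction job (class name and paths
# are placeholders; adapt them to your cluster and the actual job classes).
spark-submit --master spark://hostname:7077 \
  --class de.unikl.cs.dbis.waves.testjobs.SchemaExtraction waves.jar \
  inputPath=/path/to/json/input \
  wavesPath=hdfs://namenode/target/location \
  inputSchemaPath=/path/to/schema.json
```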
For ease of use, we also provide scripts that gather log files for the experimental results we present in the paper. Before running them, adapt config.sh to your cluster configuration.
- runtime.sh collects ingestion times and compressed sizes for Figures 3-5
- minSize.sh collects ingestion times and compressed sizes for Figures 6 and 7
- initSample.sh collects ingestion times and compressed sizes for Figure 8
- allocation.sh collects sizes per partition for Figure 9
- baseline.sh determines the ingestion times and compressed sizes without compression boosting, as reported in Table 3; these are required to compute the boost factor and slowdown
- allruns.sh measures the query times for Figure 10
To parse the log files and convert them into CSV tables, we also provide two Python scripts: one for ingestion logs and one for query logs.
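The repository ships its own parsing scripts; the following is only a minimal sketch of the kind of log-to-CSV conversion they perform. The `key=value` log line format assumed here is a hypothetical example, not the actual format written by the jobs.

```python
import csv
import re
import sys

# Illustrative sketch only: the repository provides its own parsing scripts.
# The key=value log line format matched below is an assumption for this example.
PAIR = re.compile(r"(\w+)=(\S+)")

def parse_log(lines):
    """Extract key=value pairs from each log line into a list of dicts."""
    rows = []
    for line in lines:
        pairs = dict(PAIR.findall(line))
        if pairs:  # skip lines without any recognizable pairs
            rows.append(pairs)
    return rows

def write_csv(rows, out):
    """Write the parsed rows as a CSV table over the union of all keys."""
    fields = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)

if __name__ == "__main__":
    # Read log lines from stdin, emit a CSV table on stdout.
    write_csv(parse_log(sys.stdin), sys.stdout)
```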