Artifact Availability Package for the Paper "Partition, Don’t Sort! Compression Boosters for Cloud Data Ingestion Pipelines"

dbislab/Partition-Dont-Sort

This is the Code Availability Package for the paper "Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines".

Build and Run

This project is written in the Scala programming language and employs SBT for build management. To build the project, run

sbt package

To submit a job to Spark, run the following Spark command, where $1 stands for the name of the job class to run, or use one of the provided benchmark scripts (see below):

spark-submit --master spark://hostname:7077 --class de.unikl.cs.dbis.waves.testjobs.$1 waves.jar \
             inputPath=/path/to/json/input wavesPath=hdfs://namenode/target/location inputSchemaPath=/path/to/schema.json

Code Overview

This repository contains a general framework for managing ingestion pipelines including code for previous publications. The most relevant implementations for this paper are as follows:

The Spark Jobs run for evaluation are the following:

Benchmark

For ease of use, we also provide scripts that gather log files for the experimental results we present in the paper. Before running, adapt config.sh to your cluster configuration.

  • runtime.sh collects ingestion times and compressed sizes for Figures 3 - 5
  • minSize.sh collects ingestion times and compressed sizes for Figures 6 and 7
  • initSample.sh collects ingestion times and compressed sizes for Figure 8
  • allocation.sh collects sizes per partition for Figure 9
  • baseline.sh measures the ingestion times and compressed sizes without compression boosting, as reported in Table 3. These baseline values are required to compute the boost factor and the slowdown.
  • allruns.sh measures the query times for Figure 10
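
The baseline measurements feed into two derived metrics, the boost factor and the slowdown. This README does not define them, so the following Scala sketch rests on an assumption: the boost factor is taken as the baseline compressed size divided by the boosted compressed size, and the slowdown as the boosted ingestion time divided by the baseline ingestion time. The names `BoostMetrics`, `Run`, `boostFactor`, and `slowdown` are hypothetical and not part of this repository; consult the paper for the authoritative definitions.

```scala
object BoostMetrics {
  /** One ingestion run: compressed output size and wall-clock ingestion time. */
  final case class Run(compressedBytes: Long, ingestionSeconds: Double)

  /** Assumed definition: baseline size / boosted size (values above 1 mean smaller output). */
  def boostFactor(baseline: Run, boosted: Run): Double =
    baseline.compressedBytes.toDouble / boosted.compressedBytes.toDouble

  /** Assumed definition: boosted time / baseline time (values above 1 mean slower ingestion). */
  def slowdown(baseline: Run, boosted: Run): Double =
    boosted.ingestionSeconds / baseline.ingestionSeconds
}
```

Under these assumed definitions, a baseline run of 1000 bytes in 10 seconds compared with a boosted run of 500 bytes in 15 seconds yields a boost factor of 2.0 and a slowdown of 1.5. These numbers are made up for illustration and are not measurements from the paper.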

To parse the log files and convert them into CSV tables, we also provide two Python scripts: one for ingestion logs and one for query logs.
