Drizzle integration with Apache Spark

Drizzle: Low Latency Execution for Apache Spark

Drizzle is a low latency execution engine for Apache Spark that is targeted at stream processing and iterative workloads. Currently, Spark uses a BSP (bulk synchronous parallel) computation model and notifies the scheduler at the end of each task. Invoking the scheduler this frequently adds overhead, which decreases throughput and increases latency.

In Drizzle, we introduce group scheduling, where multiple batches (or a group) of computation are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch.
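
The amortization argument can be illustrated with a toy cost model. This is only a sketch and not Drizzle's implementation: the function name `total_time_ms`, the assumption of one fixed scheduling cost per group, and the millisecond values are all hypothetical and chosen purely for illustration.

```python
def total_time_ms(num_batches, group_size, sched_cost_ms, exec_cost_ms):
    """Toy model: the scheduler is invoked once per group of batches,
    while every batch still pays its own execution cost."""
    if num_batches < 0 or group_size < 1:
        raise ValueError("num_batches must be >= 0 and group_size >= 1")
    # Ceiling division: a partial final group still needs one scheduling round.
    num_groups = -(-num_batches // group_size)
    return num_groups * sched_cost_ms + num_batches * exec_cost_ms


# Hypothetical costs: 5 ms per scheduling round, 2 ms per batch of work.
per_batch = total_time_ms(100, group_size=1, sched_cost_ms=5.0, exec_cost_ms=2.0)
grouped = total_time_ms(100, group_size=10, sched_cost_ms=5.0, exec_cost_ms=2.0)
print(per_batch, grouped)
```

Under these made-up costs, a group size of 1 pays the scheduling cost for every batch, while a group size of 10 pays it once per ten batches. The execution cost is unchanged, which is why the benefit shows up when scheduling overhead is a large share of the per-batch time.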

Drizzle Example

The current Drizzle prototype exposes a low level API using the runJobs method in SparkContext. This method takes in a Seq of RDDs and corresponding functions to execute on these RDDs. Examples of using this API can be seen in DrizzleSingleStageExample and DrizzleRunningSum.
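
To make the shape of this API concrete, here is a plain-Python stand-in that does not use Spark at all. The real method is `runJobs` on the Scala `SparkContext` in this repository, and its exact signature may differ. The name `run_jobs`, the list-of-partitions representation of an RDD, and the return layout below are assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def run_jobs(
    datasets: Sequence[List[List[T]]],
    funcs: Sequence[Callable[[Iterable[T]], U]],
) -> List[List[U]]:
    """Hypothetical analogue: apply funcs[i] to every partition of datasets[i].

    Each dataset stands in for an RDD and is modeled as a list of partitions.
    The whole group is submitted in one call instead of one call per batch.
    """
    if len(datasets) != len(funcs):
        raise ValueError("need exactly one function per dataset")
    return [
        [func(partition) for partition in dataset]
        for dataset, func in zip(datasets, funcs)
    ]


# Two batches, each with two partitions; sum every partition.
batches = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
results = run_jobs(batches, [sum, sum])
print(results)
```

For the actual usage, DrizzleSingleStageExample and DrizzleRunningSum in the examples directory remain the authoritative reference.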

To try out Drizzle locally, we first build Spark following the existing build instructions. For example, using SBT we can run:

  ./build/sbt package

We can then run the DrizzleRunningSum example with 4 cores for 10 iterations with a group size of 10. Note that this example requires at least 4 GB of memory on your machine.

  ./bin/run-example --master "local-cluster[4,1,1024]" org.apache.spark.examples.DrizzleRunningSum 10 10

To compare this with existing Spark, we can run the same 10 iterations, but now with a group size of 1:

  ./bin/run-example --master "local-cluster[4,1,1024]" org.apache.spark.examples.DrizzleRunningSum 10 1

The benefit from using Drizzle is more apparent on large clusters. Results from running the single stage benchmark for 100 iterations on an Amazon EC2 cluster of 128 machines are shown below.

Status

The source code in this repository is a research prototype and only implements the scheduling techniques described in our paper. The existing Spark unit tests pass with our changes and we are actively working on adding more tests for Drizzle. We are also working towards a Spark JIRA to discuss integrating Drizzle with the Apache Spark project.

Finally, we would like to note that extensions to integrate Structured Streaming and Spark ML will be implemented separately.

For more details

For more details about the architecture of Drizzle, please see our Spark Summit 2015 Talk and our Technical Report.

Acknowledgements

This is joint work with Aurojit Panda, Kay Ousterhout, Mike Franklin, Ali Ghodsi, Ben Recht and Ion Stoica from the AMPLab at UC Berkeley.