Skip to content
Code examples and docker environment for Spark
Branch: master
Clone or download

Latest commit

Fetching latest commit…
Cannot retrieve the latest commit at this time.


Type Name Latest commit message Commit time
Failed to load latest commit information.


Docker and docker-compose are used for running code samples:

  docker version 1.10
  docker-compose 1.6.0

For building the app, SBT is used

  SBT 0.13

The application was created with Typesafe Activator

##Setting Dockerized Environment

Creating docker-machine and launching containers

Depending on one's needs virtual machine memory could be adjusted to different value, but memory should be more than 2GB. Steps to create new docker-machine and launch docker images:

  docker-machine create -d virtualbox --virtualbox-memory "8000" --virtualbox-cpu-count "4" workshop
  eval "$(docker-machine env workshop)"
  (!) Add "sandbox" to /etc/hosts with address of `docker-machine ip workshop`. This comes really handy when you navigate 
  UI of components through browser or when debugging the code from IDE (e.g. connecting to Cassandra or Mongo containers)  

  docker-compose up -d

After that dockerized Cassandra, MongoDB and single-node Hadoop cluster will launched. docker ps could be used for verification and getting ids of containers. After a short delay all UI components should be accessible via browser (except SparkUI).

Project build directory is linked to hadoop container and available at /target folder. Every time fatjar is rebuilt it is visible inside the container.

For shutting down with deletion of all the data use docker-compose down

List of addresses

##Running spark-shell with samples To run and submit Spark Applications from this project the fatjar should be assembled via sbt clean assembly from the root of project dir.

Logging in to Docker container

  #container name could be used to login
  docker exec -ti sparkworkshop_hadoop_1 bash
  #or hadoop container id could be identified via docker ps
  docker exec -ti [hadoop container id] bash

Running Spark shell with fatjar in classpath will allow to execute applications right from spark-shell

  spark-shell \
  --master yarn \
  --deploy-mode client \
  --executor-cores 1 \
  --num-executors 2 \
  --jars /target/spark-workshop.jar \

Before running all the example apps sample data should be generated and written to different data storage (Cassandra, Mongo and HDFS). From spark-shell:

  import io.datastrophic.spark.workshop._
  SampleDataWriter.generateData(sc, sqlContext)

Progress of sample data generation could be observed via SparkUI at this moment. After the data is generated all the applications could be executed and analyzed via console output or SparkUI.

##List of applications

API examples provide implementation for the "naive lambda" application which reads aggregated data from campaigns data source and groups it with "raw" event data from events data source. After sample data is generated it is available in Json, Parquet and Bson files in HDFS (/workshop dir) and in Cassandra and MongoDB.

  • CassandraRDDExample - RDD based implementation using Cassandra as a data source:
  • CassandraSparkSQLExample - SparkSQL based implementation using sqlContext.sql API and Cassandra as a data source:, sqlContext)
  • CassandraDataFrameExample - SparkSQL based implementation using DataFrame API and Cassandra as a data source:, sqlContext)
  • SampleWriter provides examples of how to access the data in all of these sources: SampleDataWriter.generateData(sc, sqlContext), will fail if files are already present in HDFS, so cleanup or filenames change is needed)
  • DataSourcesExample - provides examples of how to access the data in all of these sources:, sqlContext, entriesToDisplay = 10)
  • ParametrizedApplicationExample - example of how Spark Applications could be parameterized.

ParametrizedApplicationExample could be submitted like that:

  spark-submit --class io.datastrophic.spark.workshop.ParametrizedApplicationExample \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --driver-memory 1g \
  --executor-memory 1g \
  /target/spark-workshop.jar \
  --cassandra-host cassandra \
  --keyspace demo \
  --table event \
  --target-dir /workshop/dumps

Data model

Through all the code samples the same Cassandra data model is used:

  CREATE TABLE IF NOT EXISTS $keyspace.event (
     id uuid,
     campaign_id uuid,
     event_type text,
     value bigint,
     time timestamp,
     PRIMARY KEY((id, event_type), campaign_id, time)
  CREATE TABLE IF NOT EXISTS $keyspace.campaign (
     id uuid,
     event_type text,
     day timestamp,
     value bigint,
     PRIMARY KEY((id, event_type), day)

Data cleanup

  #HDFS files
  docker exec -ti sparkworkshop_hadoop_1  bash 
  hadoop fs -rm -r /workshop/*
  #Cassandra tables
  docker exec -ti sparkworkshop_cassandra_1 cqlsh
  cqlsh> drop keyspace demo;
  #Mongo DB
  docker exec -ti sparkworkshop_mongo_1 mongo
  >use demo
You can’t perform that action at this time.