Apache Spark Playground

This repository can be used to test Apache Spark applications on a local cluster consisting of Docker containers. The cluster architecture is largely inspired by this awesome tutorial from the Towards Data Science blog on Medium.

Setup

Running the cluster only requires a working Docker installation (see the official Docker website for installation instructions).

After installing Docker, you should build the required images using the following commands:

  • Master, worker and submitter image: make build-image
  • Builder image: make build-sbt-image
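Both images can be built back to back from the repository root:

```sh
# Image for the master, worker and submitter containers
make build-image

# Image for the sbt builder container used to compile the Scala examples
make build-sbt-image
```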

Run cluster

To run a local Spark cluster, run docker-compose up --scale spark-worker=2. This starts one container for the spark-master and two containers for spark-worker instances. Adjust the scale parameter as needed to start more (or fewer) workers. The Spark Web UI is then available at http://localhost:8080/.
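The cluster layout is defined in docker-compose.yml. As a rough orientation, a master/worker pair in such a setup typically looks like the sketch below; the service names and the 8080 port come from this README, while the start-script invocation, the environment variable and port 7077 are assumptions, so the actual file may differ:

```yaml
# Illustrative sketch only; see docker-compose.yml in this repository for the actual definitions.
version: "3"
services:
  spark-master:
    build: .
    command: /start-master.sh
    ports:
      - "8080:8080"   # Spark Web UI
      - "7077:7077"   # standard Spark master port (assumed here)
  spark-worker:
    build: .
    command: /start-worker.sh
    environment:
      - SPARK_MASTER=spark://spark-master:7077   # hypothetical variable name
    depends_on:
      - spark-master
```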

Run examples

The following examples are included in this repository:

  • MyFirstScalaSpark: counts the lines containing the letters "a" and "b" in this README file and prints the counts to the console (see the sketch below).
  • RossmannSalesForecasting: loads data from the Rossmann Kaggle challenge, performs some basic feature engineering, and trains a gradient-boosted trees (GBT) model on the derived features (see the sketch below).
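A minimal sketch of what a line-counting application like MyFirstScalaSpark boils down to; the actual source lives under src/, and the README path inside the container is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only; the repository's actual implementation may differ.
object MyFirstScalaSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MyFirstScalaSpark").getOrCreate()
    // Hypothetical path to this README inside the container.
    val readme = spark.read.textFile("/local/README.md").cache()
    val numAs = readme.filter(line => line.contains("a")).count()
    val numBs = readme.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, lines with b: $numBs")
    spark.stop()
  }
}
```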

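For RossmannSalesForecasting, a heavily simplified sketch of the described pipeline (load the data, derive features, fit a GBT model) using Spark ML could look like the following; the input path, the selected columns and all parameters are illustrative and not the repository's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.GBTRegressor

// Illustrative sketch only; it skips the repo's actual feature engineering and any train/test split.
object RossmannSalesForecastingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RossmannSalesForecasting").getOrCreate()
    // Hypothetical location of the Kaggle training data inside the container.
    val train = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/rossmann/train.csv")

    // Assemble a few numeric columns from the Rossmann data into a feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("Store", "DayOfWeek", "Promo"))
      .setOutputCol("features")
    val features = assembler.transform(train)

    // Fit a gradient-boosted trees regressor on the derived features.
    val gbt = new GBTRegressor()
      .setLabelCol("Sales")
      .setFeaturesCol("features")
      .setMaxIter(20)
    val model = gbt.fit(features)

    model.transform(features).select("Sales", "prediction").show(5)
    spark.stop()
  }
}
```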
Running the included examples requires two steps:

  1. Building the executable JAR file from Scala code inside a dedicated container.
    • Start the builder container using make run-builder.
    • Inside the container, navigate to /local/src/main/scala and start the Scala build tool using the sbt command.
    • In the sbt console, run package to build the executable.
  2. Submitting the created executable to the Spark cluster.
    • Start the submitter using make run-submitter.
    • Inside the container, run one of the commands from commands.txt to start the respective application (a typical invocation is sketched below).
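Concretely, the build step inside the builder container amounts to the following, using the paths and commands described above:

```sh
make run-builder            # on the host: start the builder container
cd /local/src/main/scala    # inside the container
sbt                         # start the Scala build tool
# then, in the sbt console:
package
```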
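The exact submit commands are listed in commands.txt. For orientation, a submission to a standalone Spark master typically looks roughly like this; the class name is taken from the examples above, while the master URL (standard port 7077) and the JAR path are placeholders:

```sh
make run-submitter   # on the host: start the submitter container
# Inside the container (illustrative only; use the exact commands from commands.txt):
spark-submit \
  --class MyFirstScalaSpark \
  --master spark://spark-master:7077 \
  /path/to/myfirstscalaspark.jar
```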