Scalable Kafka - Spark streaming - Cassandra pipeline in Docker Swarm

This repository provides a basis to:

  • Create clusters using docker-machine and Docker Swarm. Scripts are provided to create clusters either locally using virtual machines, or on AWS (Amazon Web Services).
  • Start Kafka, a standalone Spark cluster, and Cassandra as Docker Swarm services.
  • Run an example demo using a Jupyter notebook connected to the cluster, where a Spark Streaming application collects messages from Kafka and writes them to Cassandra.

Preliminary requirements

  • Docker version 1.13.1 and docker-machine version 0.9.0
  • Additionally:
    • VirtualBox, in order to create virtual machines (VMs), and/or
    • An account on AWS, in order to create AWS instances.

The scripts were tested using Mac OS 10.10.5 and Ubuntu 14.04, on a host (for VMs) with 16GB RAM and 2 CPUs.
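
To compare the installed tools against the versions listed above, a small helper along these lines may be useful. It is a sketch and not part of the repository: the function name `version_ge` is hypothetical, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils), which older macOS releases may lack.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, e.g. from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the installed version of each tool, or report that it is missing.
for tool in docker docker-machine; do
  if command -v "$tool" >/dev/null 2>&1; then
    "$tool" --version
  else
    echo "$tool: not found on PATH"
  fi
done
```

For example, `version_ge 17.03.0 1.13.1` succeeds, while `version_ge 1.12.6 1.13.1` fails.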

Create cluster

Use the scripts in cluster-management:

  • One script creates a cluster of local virtual machines.

  • Another creates a cluster of AWS instances.

The scripts create a 3-node cluster (one master and two workers) by default. See the cluster management README for more details.
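
For orientation, the steps behind such a cluster can be sketched with plain docker-machine and swarm commands. The snippet below only prints the commands rather than running them, and it is an illustration under assumptions: the node names `master` and `workerN` and the function `print_cluster_plan` are hypothetical, and the repository's own scripts may proceed differently. For AWS, the analogous route would presumably use the `amazonec2` driver together with AWS credentials.

```shell
#!/usr/bin/env bash
# Sketch only: prints (does not run) docker-machine and swarm commands for a
# cluster with one master and N workers on VirtualBox. The node names
# "master" and "workerN" are assumptions, not taken from the repository scripts.
print_cluster_plan() {
  local workers="${1:-2}" i

  # One VM for the master, then one per worker.
  echo 'docker-machine create --driver virtualbox master'
  i=1
  while [ "$i" -le "$workers" ]; do
    echo "docker-machine create --driver virtualbox worker$i"
    i=$((i + 1))
  done

  # Initialise the swarm on the master, advertising its machine IP.
  echo 'docker-machine ssh master "docker swarm init --advertise-addr $(docker-machine ip master)"'

  # Each worker joins with the worker token issued by the master (port 2377).
  i=1
  while [ "$i" -le "$workers" ]; do
    printf 'docker-machine ssh %s "docker swarm join --token $(docker-machine ssh master "docker swarm join-token -q worker") $(docker-machine ip master):2377"\n' "worker$i"
    i=$((i + 1))
  done
}

print_cluster_plan 2
```

Printing the plan first makes it easy to review the commands before creating any machines; the printed lines can then be run one at a time in a shell.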

Start services

Use the 'start all' script in the service-management folder to start all services. Run it from a shell whose Docker client points at the master node:

eval $(docker-machine env master)
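
The `eval $(docker-machine env master)` line exports environment variables (including `DOCKER_HOST` and `DOCKER_MACHINE_NAME`) so that later `docker` commands in the same shell talk to the master node. A quick way to confirm which host the shell targets might look like the following; the helper name `docker_target` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports which Docker host the current shell targets.
# `docker-machine env` exports DOCKER_HOST and DOCKER_MACHINE_NAME (among others),
# so an unset or empty DOCKER_HOST means the client talks to the local daemon.
docker_target() {
  if [ -n "${DOCKER_HOST:-}" ]; then
    echo "${DOCKER_MACHINE_NAME:-unknown} (${DOCKER_HOST})"
  else
    echo "local daemon"
  fi
}

docker_target
```

Once the shell points at the master, `docker node ls` and `docker service ls` list the swarm nodes and the running services.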

Interact with the cluster from a notebook

Connect to the notebook demo container

docker run -it --network=cluster-network \
-p 8888:8888 -p 4040:4040 -p 23:22 \
yannael/notebook-demo bash

Create the Cassandra database for the demo script

cqlsh cassandra-seed -f init_cassandra.cql

Start the notebook


Connect to masterIP:8888 in your browser (where masterIP is the IP of the master node of the cluster) to access the notebooks.
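
If the cluster was created with docker-machine, the master IP can be looked up with `docker-machine ip`. The sketch below assumes the master machine is named `master` (as in `docker-machine env master`) and that the notebook is served over plain HTTP; the helper `notebook_url` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the notebook URL from the master node's IP.
# Port 8888 is the one published by the `docker run` command for the demo container.
# Assumption: the notebook is served over plain HTTP.
notebook_url() {
  echo "http://$1:8888"
}

# Assumes the master machine is named "master".
if ip="$(docker-machine ip master 2>/dev/null)"; then
  notebook_url "$ip"
else
  echo "could not resolve the master IP; is docker-machine installed and the machine running?" >&2
fi
```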


For providing base material to this repository

For funding