This repository contains a docker-compose stack with Kafka, PostgreSQL and Superset.
Spark and Python should be installed locally.
We will start by simulating a streaming process, then use Spark Structured Streaming to perform some data manipulation and write the results to PostgreSQL.
The data used in this project is the Online Retail II Data Set.
Container | Image | Port |
---|---|---|
zookeeper | wurstmeister/zookeeper | 2181 |
kafka | wurstmeister/kafka | 9092 |
kafka_manager | hlebalbau/kafka_manager | 9000 |
postgres | postgres | 5416 |
pgadmin | dpage/pgadmin4 | 5050 |
superset | amancevice/superset | 8088 |
git clone https://github.com/amine-akrout/Spark_Stuctured_Streaming.git
cd Spark_Stuctured_Streaming
pip install -r requirement.txt
Make sure you have installed Spark 2.4.7.
To start the Docker Compose stack, simply run:
docker-compose up -d
You can access the pgAdmin GUI through http://localhost:5050 and register the PostgreSQL server with:
- hostname : postgres
- port : 5432
- username : superset
- password : superset
You can change them in the config file
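If you want to sanity-check the connection from your host machine first, a minimal psycopg2 check could look like the following (an assumption of this guide: the host port 5416 from the table above maps to the container's 5432, while inside the Docker network the server is reachable as postgres:5432):

```python
# Minimal connectivity check with psycopg2 (a sketch; assumes host port 5416
# from the table above is mapped to the container's 5432).
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5416,              # use host="postgres", port=5432 from inside the Docker network
    user="superset",
    password="superset",
    dbname="postgres",      # default database; "orders" is created in the next step
)
print(conn.server_version)  # e.g. 130004 for PostgreSQL 13.4
conn.close()
```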
cd postgresql
python create_database.py
python create_table.py
This will create a database named orders and a table named retail using the psycopg2 library (you can change the names in the data.ini file).
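The scripts read their settings from data.ini; roughly, they do something like the sketch below (names, credentials, and columns here are illustrative, not the exact contents of the scripts):

```python
# Rough sketch of what the two scripts do (the real scripts read their
# settings from data.ini; names, credentials, and columns are illustrative).
import psycopg2

# create_database.py: CREATE DATABASE cannot run inside a transaction,
# so autocommit is required.
conn = psycopg2.connect(host="localhost", port=5416, user="superset",
                        password="superset", dbname="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE orders")
conn.close()

# create_table.py: connect to the new database and create the target table.
conn = psycopg2.connect(host="localhost", port=5416, user="superset",
                        password="superset", dbname="orders")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS retail (
            country      TEXT,
            total_amount NUMERIC
        )
    """)
conn.close()
```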
Start the stream of data
spark-submit kafka_producer.py
This will simulate a streaming process from our data by publishing roughly 20 rows per second.
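A minimal sketch of that producer logic, assuming the kafka-python package, a topic named retail, and the dataset saved as a local CSV (the real kafka_producer.py may differ in names and details):

```python
# Minimal sketch of the producer logic (assumes the kafka-python package,
# a topic named "retail", and the dataset as a local CSV; the real
# kafka_producer.py may differ in names and details).
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("online_retail_II.csv")
for _, row in df.iterrows():
    producer.send("retail", row.to_dict())
    time.sleep(1 / 20)  # throttle to roughly 20 rows per second
producer.flush()
```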
Access http://localhost:9000/addCluster and fill in:
- Cluster name: Kafka
- Zookeeper hosts: localhost:2181
spark-submit --conf "spark.driver.extraClassPath=$SPARK_HOME/jars/kafka-clients-2.2.2.jar" \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7,org.postgresql:postgresql:42.2.18 \
SparkConsumer.py
This will allow Spark to do the following (sketched after this list):
- Construct a streaming DataFrame that reads from the Kafka topic
- Perform some data manipulation on the streamed data
- Write the aggregated data to PostgreSQL
- Optionally, write batches to the console for debugging
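A condensed sketch of what such a consumer could look like (the topic name, JSON schema, aggregation, and connection details are illustrative assumptions, not the exact contents of SparkConsumer.py):

```python
# Condensed sketch of a Structured Streaming consumer (Spark 2.4.x; the
# Kafka package is supplied by the spark-submit command above). Topic name,
# schema, aggregation, and connection details are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("SparkConsumer").getOrCreate()

schema = (StructType()
          .add("Invoice", StringType())
          .add("Country", StringType())
          .add("Quantity", IntegerType())
          .add("Price", DoubleType()))

# 1. Streaming DataFrame that reads from the Kafka topic
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "retail")
       .load())

# 2. Parse the JSON payload and compute a running aggregate
orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("data"))
          .select("data.*"))
agg = (orders.withColumn("amount", F.col("Quantity") * F.col("Price"))
       .groupBy("Country")
       .agg(F.sum("amount").alias("total_amount")))

# 3. Write each micro-batch to PostgreSQL through JDBC
def write_to_postgres(batch_df, batch_id):
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://localhost:5416/orders")  # postgres:5432 inside the Docker network
     .option("dbtable", "retail")
     .option("user", "superset")
     .option("password", "superset")
     .option("driver", "org.postgresql.Driver")
     .mode("overwrite")
     .save())

query = (agg.writeStream
         .outputMode("complete")
         .foreachBatch(write_to_postgres)
         .start())
query.awaitTermination()
```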
You can check in pgAdmin that the retail table is being updated.
For debugging, you can also print the micro-batches to the console.
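For instance, reusing the agg DataFrame from the sketch above:

```python
# Console sink for debugging (a sketch; swap in your own streaming DataFrame)
debug_query = (agg.writeStream
               .outputMode("complete")
               .format("console")
               .option("truncate", "false")
               .start())
```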
Initialize Superset's database with an admin user and the Superset metadata tables:
docker exec -it superset superset-init
You can always change these settings in the superset_config.py file.
To set up Superset, access http://localhost:8088 and create a new data source with the following fields:
- Database: orders
- SQLAlchemy URI: postgresql+psycopg2://superset:superset@postgres:5432/orders
Finally!
We can now create some charts in Superset and combine them into a real-time dashboard.