Description
Demo architecture for scalable, distributed image stream processing using Kafka, TensorFlow and Spark Streaming. Our prototype contains a client that streams webcam images into Kafka, and an analytics component that detects objects on those images using TensorFlow on Spark. The results are sent back to the client and are also aggregated and reported in a dashboard built with Plotly Dash.
Context
Master programme Data Science & Business Analytics
Lecture BI- and Big-Data-Architectures
At the University of Media, Stuttgart (DE)
Goal / Task
Come up with a fictional use case for Big Data, design an architecture for it, give reasons for the architectural decisions, and implement a Proof-of-Concept.
Authors
Marcus and me (dynobo)
Timeline
Feb. 2018 - Mar. 2018
Repo
Code & documentation for the Proof-of-Concept; slides of our final presentation; project documentation (in German); screencast of the workflow in action.
Prerequisites: A local computer with enough resources (8 GB RAM, better 16 GB, plus 10-15 GB free disk space) and a current version of Oracle VirtualBox.
1. Install Image
- Download the Cloudera Quickstart VM - CDH 5.12 for VirtualBox
- Import the appliance in VirtualBox
2. Change VM-Settings in VirtualBox
- Give the VM as many resources as possible! (My specs: 32 GB RAM, i5 quad-core ~4 GHz, SSD)
- Make sure "Network" is set to "NAT"
- Add "Port Forwarding" (we'll use
90
as prefix for vm-ports):- SSH: Host 127.0.0.1:9022 to Guest 10.0.2.15:22
- HUE: Host 127.0.0.1:9033 to Guest 10.0.2.15:8888 (we want 9088 for jupyter)
- JUPYTER: Host 127.0.0.1:9088 to Guest 10.0.2.15:8889
- KAFKA: Host 127.0.0.1:9092 to Guest 10.0.2.15:9092
- SSH into the Cloudera VM, e.g. via ssh cloudera@127.0.0.1 -p 9022 (default user + password: cloudera)
- In the VM, run setup_cloudera_vm.sh to install additional packages and apply various configurations (no interaction needed):
  wget https://raw.githubusercontent.com/dynobo/PyctureStream/master/setup_cloudera_vm.sh && chmod +x ./setup_cloudera_vm.sh && ./setup_cloudera_vm.sh
- When the setup is finished, reboot the VM.
- You might need the Anaconda distribution with Python 3.6+ for the code that runs locally.
- The dependencies are listed in environment.yml. You can use this file to recreate the environment with Anaconda:
  conda env create -f environment.yml
Run the different components of the pipeline in parallel (e.g. using multiple consoles):
Kafka Producer that captures images from a connected webcam (make sure you have one attached!) and sends them together with metadata into the pycturestream topic in Kafka. It also stores the latest image in input.jpg for monitoring purposes.
Run client/stream_webcam_to_kafka.py on the Host.
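For orientation, a minimal sketch of such a producer loop using OpenCV and kafka-python; the exact message layout in client/stream_webcam_to_kafka.py may differ (here assumed: metadata in the message key, JPEG bytes in the value):

```python
import json
import time

import cv2
from kafka import KafkaProducer

# Broker address assumes the port forwarding set up above.
producer = KafkaProducer(bootstrap_servers="127.0.0.1:9092")
cam = cv2.VideoCapture(0)  # First attached webcam

while True:
    ok, frame = cam.read()
    if not ok:
        break
    # Encode the frame as JPEG and keep a copy for monitoring.
    _, jpeg = cv2.imencode(".jpg", frame)
    with open("input.jpg", "wb") as f:
        f.write(jpeg.tobytes())
    # Assumed layout: metadata in the key, image bytes in the value.
    key = json.dumps({"timestamp": time.time(), "camera": 0}).encode()
    producer.send("pycturestream", key=key, value=jpeg.tobytes())
    time.sleep(1)  # Throttle to roughly one frame per second
```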
Consumes the pycturestream topic, detects the objects in the pictures, and pushes the results into the resultstream topic. Leverages TensorFlow and Spark.
Run processing/detect_objects.py in the VM.
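The real script distributes the detection via Spark Streaming; stripped of Spark, the consume → detect → produce flow looks roughly like this, with detect_objects() as a stand-in for the TensorFlow object-detection call:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

def detect_objects(jpeg_bytes):
    """Stand-in for the TensorFlow object-detection model."""
    return [{"label": "person", "score": 0.87}]  # Hypothetical output

consumer = KafkaConsumer("pycturestream", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for msg in consumer:
    # Run detection on the raw JPEG bytes and forward image + results.
    detections = detect_objects(msg.value)
    producer.send("resultstream",
                  key=json.dumps(detections).encode(),
                  value=msg.value)
```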
Kafka Consumer of the resultstream topic that displays the results (detected objects, probabilities) as console output and stores the latest image in output.jpg.
Run client/receive_results_from_kafka.py on the Host, and open client/monitor_imagestream.html in a browser to monitor the input and output images.
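The receiving side mirrors the producer; a condensed sketch under the same assumed message layout (detections in the key, image in the value):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer("resultstream", bootstrap_servers="127.0.0.1:9092")
for msg in consumer:
    print(msg.key.decode())  # Detected objects and probabilities
    with open("output.jpg", "wb") as f:  # Latest image for the monitor page
        f.write(msg.value)
```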
Kafka Consumer of the resultstream topic that transforms the data and stores the detected objects (without images) in events.json.
Run reporting/store_results_from_kafka.py on the Host.
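A sketch of that transformation, again under the assumed message layout, appending one JSON object per line and discarding the image bytes:

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer("resultstream", bootstrap_servers="127.0.0.1:9092")
with open("events.json", "a") as f:
    for msg in consumer:
        # Keep only the detections; the image in msg.value is dropped.
        event = {"detections": json.loads(msg.key.decode())}
        f.write(json.dumps(event) + "\n")
```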
The dashboard itself was built with Dash and displays the data from events.json. (The dashboard was implemented by Marcus.)
Open reporting/dashboard.ipynb in Jupyter Notebook on the Host, or run it directly in non-interactive mode: jupyter nbconvert --to notebook --execute dashboard.ipynb
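As an illustration of the Dash side, a bare-bones app (using the dash_core_components / dash_html_components API current in 2018) that plots how often each label was detected; the field names in events.json are assumptions matching the sketches above:

```python
import json

import dash
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd

# Flatten the event log into one row per detection.
with open("events.json") as f:
    rows = [d for line in f for d in json.loads(line)["detections"]]
counts = pd.DataFrame(rows)["label"].value_counts()

app = dash.Dash(__name__)
app.layout = html.Div([
    html.H1("Detected Objects"),
    dcc.Graph(figure={
        "data": [{"type": "bar",
                  "x": list(counts.index),
                  "y": list(counts.values)}],
    }),
])

if __name__ == "__main__":
    app.run_server(debug=True)
```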
A set of test procedures, useful for debugging and identifying problems.
- Make sure your webcam is connected. Maybe test with other software whether it works in general.
- Run the test script, which probes the first 4 video devices for certain parameters (see the sketch after this list), e.g.:
python ./client/test_webcam.py
- Interpreting results:
VIDEOIO ERROR: V4L: index 1 is not correct! <--- camera might not be connected
Unable to stop the stream: Device or resource busy <--- camera in use by another program
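In essence, the test script does something like the following (a simplified sketch using OpenCV):

```python
import cv2

# Probe the first four video devices and report basic parameters.
for index in range(4):
    cam = cv2.VideoCapture(index)
    ok, frame = cam.read()
    if ok:
        width = int(cam.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cam.get(cv2.CAP_PROP_FRAME_HEIGHT))
        print("Device {}: OK, {}x{}".format(index, width, height))
    else:
        print("Device {}: no image (not connected or busy?)".format(index))
    cam.release()
```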
- Create a topic:
kafka-topics --create --zookeeper localhost:2181 --topic wordcounttopic --partitions 1 --replication-factor 1
- Open console consumer:
/usr/bin/kafka-console-consumer --zookeeper localhost:2181 --topic wordcounttopic
- Open console producer:
kafka-console-producer --broker-list localhost:9092 --topic wordcounttopic
- Produce some events and check whether the consumer receives them (a scripted alternative follows below)
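The same round trip can be scripted with kafka-python, e.g. from within the VM, which is handy when no second console is available:

```python
from kafka import KafkaConsumer, KafkaProducer

# Send a probe message into the test topic ...
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("wordcounttopic", b"hello kafka")
producer.flush()

# ... and read it back; gives up after 10 s if nothing arrives.
consumer = KafkaConsumer("wordcounttopic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=10000)
for msg in consumer:
    print(msg.value)
```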
- Create a topic (if not already done):
  kafka-topics --create --zookeeper localhost:2181 --topic wordcounttopic --partitions 1 --replication-factor 1
- Open ./notebooks/kafka_wordcount.ipynb in Jupyter of the VM (as Consumer)
- Open console producer:
  kafka-console-producer --broker-list localhost:9092 --topic wordcounttopic
- Produce some events and check whether the consumer receives & processes them correctly
- Check the Job Browser in Hue (http://127.0.0.1:9033/) for the status of the Spark job
Some of the resources that helped me with the implementation:
- https://scotch.io/tutorials/build-a-distributed-streaming-system-with-apache-kafka-and-python
- https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
- https://github.com/tensorflow/models/tree/master/official/resnet
- https://github.com/tensorflow/models/tree/master/research/object_detection
- https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb