The initial idea is to create a repository where I can place different experiments related streaming libraries. The idea is to start with Akka Streams and then compare with Spark Streaming and other libraries.
The main goal is to reproduce a production environment in your local machine. For doing that, what it is better than Docker?
In this example it is define:
- Kafka Broker using the spotify kafka image. Has been evaluated as well the wurstmeister/kafka, but it works better this one.
- A producer and consumer using the Akka Streams Kafka library.
- An example how to run Spark locally using docker
- Create a spark cluster: spark master and spark workers using docker compose.
- Create a consumer using spark streaming.
- To create the consumer I tried to use the scala template available here lhttps://github.com/big-data-europe/docker-spark
- The template was not up to date with the latest version of spark, so I created a new template available as part of this build. See the spark-docker-template subproject.
- Have sbt installed
- Have docker and docker-compose installed.
- Create kafka infrastructure using docker
- Create spark cluster infrastructure using docker.
- Create one simple producer and consumer that write and consume random numbers.
- Explore Akka Streams and Spark Streaming.
- Usage of docker-compose and swarm
1. sbt docker
2. docker swarm init
3. docker stack deploy -c docker-compose.yml streams
4. docker service ls
5. docker logs -f $akka_producer_container
6. docker logs -f $spark_consumer_container
7. docker logs -f $akka_consumer_container
8. http://localhost:8080 to access the spark cluster console.
8. docker stack rm streams
9. docker swarm leave --force
1. Modify the docker-compose.yml and the producer main and consumer main to point to localhost:9092
2. Run the kafka-spotify image alone: using docker run, or commenting the code in the yml file related the consumer and producer
Use kafka-manager to monitor kafka. Very useful.
https://github.com/yahoo/kafka-manager
Start localhost:9000 in the browser.
- Usage of docker in one click:
- Kafka broker
- Spark cluster. It is possible to see the workers and jobs submitted in http://localhost:8080
- Akka producer.
- Akka consumer.
- Spark streaming consumer.
- Creation of a scala template that allows to submit spark jobs to the spark master node.
- All the projects are dockerized using sbt. Has been used the marcuslonnberg plugin
- This is a good example how to run different subprojects using different scala versions: sbt-cross plugin
- Example how to use docker stack to scale some services (spark-workers)
1. brew cask install minikube
2. Install [VirtualBox](https://www.virtualbox.org/wiki/Downloads)
3. minikube start
1. minikube delete
2. rm -rf ~/.minikube
3. minikube start
4. minikube dashboard
Useful commands:
- kubectl create -f 1_kafka_broker.yml
- kubectl delete pod,service kafka-broker
- kubectl get services
- kubectl get deployments
- kubectl get pods
- //This creates a service
- kubectl expose deployment/kubernetes-bootcamp --type="NodePort" --port 80800
- echo $(minikube ip)
- kubectl logs kubernetes-bootcamp-5c69669756-mts2z
- kubectl delete pod,service,deployment -l app=spark-master
- kubectl describe service kafka-broker-svc
- kubectl exec -it kafka-broker-5d68b98857-l6bjw -- /bin/bash
- To create a service with a external ip address define the type of the container as LoadBalancer, but minikube doesnt support the LoadBalancer type.
Usage of spotify docker image:
https://github.com/spotify/docker-kafka
How to build multiprojects with different scala versions:
https://github.com/lucidsoftware/sbt-cross
Spark Docker images from: https://github.com/big-data-europe/docker-spark
Kafka vs Flume: https://www.linkedin.com/pulse/flume-kafka-real-time-event-processing-lan-jiang/
Kubernetes:
https://kubernetes.io/docs/user-journeys/users/application-developer/foundational/
https://github.com/kubernetes/minikube
https://www.virtualbox.org/wiki/Downloads
Interesting how kubectl creates the environment variables for the service, so another service
can use this service.
https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables
I had problems to read from local docker images with minikube:
https://stackoverflow.com/questions/38979231/imagepullbackoff-local-repository-with-minikube
Machine Learning: http://www.disfrutalasmatematicas.com/datos/desviacion-estandar.html https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ https://github.com/apache/spark/blob/dcdda19785a272969fb1e3ec18382403aaad6c91/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegressionExample.scala