cnoam/spark_samples

The GRAND PLAN

Implement this flow:

  • The user submits a Python file to the Checker.
  • The Checker uses the Livy client code to submit the Python file (possibly with data files) to the Livy server.
  • The Livy server runs somewhere in the cloud and connects to the Spark main node, which is part of an HDInsight cluster.
  • The main node dispatches the commands and collects the results.
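
To make the Checker → Livy step concrete, here is a minimal sketch of submitting a Python file as a Livy batch job over Livy's REST API. The host name, port and file path are placeholders, and the real Checker may use the Livy programmatic client instead:

# Minimal sketch (assumptions: Livy reachable at livy-server:8998, job file visible to the Livy server)
import time
import requests

LIVY_URL = "http://livy-server:8998"            # placeholder endpoint
payload = {"file": "local:/path/to/job.py"}     # placeholder path to the submitted Python file

# Create a batch session for the Python file
resp = requests.post(LIVY_URL + "/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch state until it reaches a terminal state
while True:
    state = requests.get("%s/batches/%d" % (LIVY_URL, batch_id)).json()["state"]
    print("batch state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)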

Playing with Spark on a local machine with Docker

Run Docker containers with the Spark master and a few workers. First using Scala, then with Python.

https://towardsdatascience.com/a-journey-into-big-data-with-apache-spark-part-1-5dfcc2bccdd2

We will create several "machines":

  • 1 master to send the jobs to workers
  • 3 workers (each running executors)
  • 1 machine to create the code and submit to the master

Instead of using real computers or virtual machines, we will use Docker containers. Each container behaves like a virtual machine, only much more lightweight.

Prerequisites

  • Docker is installed

First, create the docker network:

docker network create spark_network

Then, build the images:

docker build -t spark:latest .
docker build -t spark_py -f ./Dockerfile_with_python .

and the Livy docker image as well: TBD

and run the containers:

docker-compose up --scale spark-worker=3

NOTE: If this is not the first run and you have made some changes, docker-compose might try to reuse cached images and stopped containers. If the above command fails, remove the stopped containers: find their IDs with docker ps -aq and remove them with docker rm <id>.

Check the master web UI: http://localhost:8080/

To check using the built-in examples, run a new container:

docker run --rm -it --network spark_spark-network spark /bin/sh

Then in the new container:

/spark/bin/spark-submit --master spark://spark-master:7077 --class \
    org.apache.spark.examples.SparkPi \
    /spark/examples/jars/spark-examples_2.11-2.4.5.jar 1000

If everything works, you will see a lot of log lines, and after a while:

20/09/06 14:42:09 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 61.659824 s
Pi is roughly 3.14180503141805
20/09/06 14:42:09 INFO SparkUI: Stopped Spark web UI at http://ae4c528de6a2:4040



Playing with pyspark

Look at https://www.tutorialspoint.com/pyspark/pyspark_sparkcontext.htm (and previous pages)

Run pyspark locally

Start a docker container that contains both Spark and Python:

docker run --rm -it --network spark_spark-network spark_py /bin/sh

Then, inside the container:

cd /spark

Using the pyspark console:

bin/pyspark
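
For a quick sanity check inside the pyspark shell, something like the following should work (the shell already provides a SparkContext named sc):

# Inside the pyspark shell; `sc` is already defined.
rdd = sc.parallelize(range(1000))
print(rdd.filter(lambda x: x % 2 == 0).count())   # expect 500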

Using a .py file:

Create app.py:

from pyspark import SparkContext

# Count the lines containing 'a' and 'b' in the Spark README
logFile = "file:///spark/README.md"
sc = SparkContext("local", "first app")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

Then run it:

# bin/spark-submit app.py

Running pyspark on a cluster

Now we will run the Python code on our cluster of 3 workers.

NOTE: To run pyspark on the workers, the workers must have Python installed, hence the modification in docker-compose.yml.

  • Start the cluster: docker-compose up --scale spark-worker=3
  • In the new docker container, change "local" to "spark://spark-master:7077":
from pyspark import SparkContext

logFile = "file:///spark/README.md"
sc = SparkContext("spark://spark-master:7077", "first app")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

and run it again:

bin/spark-submit app.py
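
Hardcoding the master URL means keeping two copies of the script. A common alternative (just a sketch, not part of this repo) is to leave the master out of the code and pass it on the spark-submit command line instead:

# app_cluster_agnostic.py -- hypothetical variant: the master URL comes from spark-submit
from pyspark.sql import SparkSession

# getOrCreate() picks up whatever master was set via `spark-submit --master ...`,
# so the same file runs both locally and on the cluster.
spark = SparkSession.builder.appName("first app").getOrCreate()
sc = spark.sparkContext

logData = sc.textFile("file:///spark/README.md").cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
spark.stop()

Submitted with, for example, bin/spark-submit --master spark://spark-master:7077 app_cluster_agnostic.py, the same code also runs locally with --master local.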