# Submitting a Spark Job

Besides working with a Spark cluster interactively, for example via the PySpark console or a Jupyter notebook, a typical way of running Spark programs is to submit them as a [batch job](https://en.wikipedia.org/wiki/Batch_processing): we send a Python script to the cluster via the `spark-submit` command. This chapter explains how to do this.

## Configuration

Make sure that PySpark uses the right Python version for the following examples by pointing the environment variable `PYSPARK_PYTHON` to a Python 3 interpreter. One way to do this from Python:

In [1]:
import os
os.environ["PYSPARK_PYTHON"] = "python3"
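As a variation on the cell above, the following sketch points `PYSPARK_PYTHON` at the interpreter that is running the notebook itself, instead of whatever `python3` resolves to on the `PATH`. This assumes a local setup in which the driver and the workers can all reach that same interpreter path; on a real multi-node cluster that assumption may not hold.

```python
import os
import sys

# Assumption: Spark runs locally, so the notebook's own interpreter
# is also a valid interpreter for the PySpark workers.
if sys.executable:
    os.environ["PYSPARK_PYTHON"] = sys.executable

# Processes started from this notebook inherit the variable.
print(os.environ.get("PYSPARK_PYTHON"))
```
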

## spark-submit

`spark-submit` is the command line tool for sending jobs to a Spark cluster. It supports Spark jobs written in Java, Scala, Python, or R. Additionally, it offers various configuration options for tuning the performance of the job.

In [2]:
!spark-submit --help

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-
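To make the usage line concrete, here is a minimal sketch that assembles a `spark-submit` call from three of the options listed in the help text (`--master`, `--deploy-mode`, `--name`). The helper `build_submit_command` and the script name `my_job.py` are hypothetical and only serve as illustration; the defaults mirror the ones stated in the help output.

```python
def build_submit_command(script, master="local[*]", deploy_mode="client",
                         name=None, app_args=()):
    """Return the argument list for a spark-submit call (hypothetical helper)."""
    # Options come first, then the application file, then its own arguments,
    # following: spark-submit [options] <python file> [app arguments]
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if name is not None:
        command += ["--name", name]
    command.append(script)
    command.extend(app_args)
    return command


# "my_job.py" is a placeholder script name, not a file from this chapter.
command = build_submit_command("my_job.py", name="demo_job")
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where `spark-submit` is installed; in a notebook, the same line can simply be typed after a `!`.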

## A Minimal Spark Job

The following is a minimal example of a Python script containing a Spark job:

In [3]:
%%file scripts/spark_job_minimal.py

from pyspark import SparkContext, SparkConf

SPARK_APP_NAME='sparkjob_template'
conf = SparkConf().setAppName(SPARK_APP_NAME) 
spark_context = SparkContext(conf=conf)

#----------------------
# TODO: replace with your Spark code
rdd = spark_context.range(100)
#----------------------

spark_context.stop() # don't forget to cleanly shut down


Overwriting scripts/spark_job_minimal.py


The `%%file` command is an IPython "cell magic" that writes the content of the cell to a file instead of executing it. We can therefore submit the script directly to the cluster:

In [4]:
!spark-submit scripts/spark_job_minimal.py

2018-10-22 14:39:19 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-22 14:39:21 INFO  SparkContext:54 - Running Spark version 2.3.0
2018-10-22 14:39:21 INFO  SparkContext:54 - Submitted application: sparkjob_template
2018-10-22 14:39:21 INFO  SecurityManager:54 - Changing view acls to: cls
2018-10-22 14:39:21 INFO  SecurityManager:54 - Changing modify acls to: cls
2018-10-22 14:39:21 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-10-22 14:39:21 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-10-22 14:39:21 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(cls); groups with view permissions: Set(); users  with modify permissions: Set(cls); groups with modify permissions: Set()
2018-10-22 14:39:21 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 63182.
2018-10-22 14:39:21 INFO  
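Outside of a notebook the `%%file` magic is not available. The following sketch shows one way to get a comparable effect in plain Python. The helper `write_script` is hypothetical, it writes into a temporary directory so that nothing in `scripts/` is touched, and creating missing parent directories is a convenience of this sketch rather than a statement about what `%%file` does.

```python
import tempfile
from pathlib import Path

# A condensed copy of the minimal job from above, kept as a string.
JOB_SOURCE = """\
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("sparkjob_template")
spark_context = SparkContext(conf=conf)
rdd = spark_context.range(100)
spark_context.stop()
"""


def write_script(path, source):
    """Write source to path and report whether a file was replaced (hypothetical helper)."""
    path = Path(path)
    existed = path.exists()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(source, encoding="utf-8")
    return ("Overwriting " if existed else "Writing ") + str(path)


demo_dir = Path(tempfile.mkdtemp())
script_path = demo_dir / "scripts" / "spark_job_minimal.py"
first_message = write_script(script_path, JOB_SOURCE)
second_message = write_script(script_path, JOB_SOURCE)

# Syntax check only; this does not import pyspark or start Spark.
compile(JOB_SOURCE, str(script_path), "exec")
print(first_message)
print(second_message)
```
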

## More Convenience

Here is a slightly more elaborate and convenient example that serves nicely as a template for writing Spark jobs. `contextlib` enables us to use the `with` statement to create and close a Spark context in a concise and clean way, and to separate that from our actual Spark program.

In [5]:
%%file scripts/spark_job_template.py

SPARK_APP_NAME='sparkjob_template'

from contextlib import contextmanager
from pyspark import SparkContext, SparkConf

@contextmanager
def use_spark_context(appName):
    conf = SparkConf().setAppName(appName) 
    spark_context = SparkContext(conf=conf)

    try:
        print("starting", appName)
        yield spark_context
    finally:
        # runs even if the Spark code inside the `with` block raises
        spark_context.stop()
        print("stopping", appName)


with use_spark_context(appName=SPARK_APP_NAME) as sc:
    #----------------------
    # TODO: replace with your Spark code
    rdd = sc.range(100)
    #----------------------

Overwriting scripts/spark_job_template.py


In [6]:
!spark-submit scripts/spark_job_template.py

2018-10-22 14:40:18 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-22 14:40:19 INFO  SparkContext:54 - Running Spark version 2.3.0
2018-10-22 14:40:19 INFO  SparkContext:54 - Submitted application: sparkjob_template
2018-10-22 14:40:19 INFO  SecurityManager:54 - Changing view acls to: cls
2018-10-22 14:40:19 INFO  SecurityManager:54 - Changing modify acls to: cls
2018-10-22 14:40:19 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-10-22 14:40:19 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-10-22 14:40:19 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(cls); groups with view permissions: Set(); users  with modify permissions: Set(cls); groups with modify permissions: Set()
2018-10-22 14:40:19 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 63189.
2018-10-22 14:40:19 INFO  
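The main benefit of the `try`/`finally` pattern in the template is that the context is stopped even when the job fails. The sketch below illustrates this without starting Spark: `FakeContext` is a hypothetical stand-in for `SparkContext` that only records what happens to it.

```python
from contextlib import contextmanager

events = []


class FakeContext:
    """Hypothetical stand-in for SparkContext; records calls instead of starting Spark."""

    def __init__(self, app_name):
        self.app_name = app_name

    def stop(self):
        events.append("stop")


@contextmanager
def use_fake_context(app_name):
    context = FakeContext(app_name)
    try:
        events.append("start")
        yield context
    finally:
        # Executed on normal exit and when the with-body raises.
        context.stop()


try:
    with use_fake_context("demo") as fake_context:
        events.append("work")
        raise ValueError("job failed")
except ValueError:
    events.append("error handled")

print(events)  # ['start', 'work', 'stop', 'error handled']
```

Note that `stop` is recorded before the exception reaches the surrounding `except` block, which is the same cleanup order the template relies on.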

## Exercise: Pi Approximation as Spark Job

Now it's your turn. Remember the $\pi$ approximation program? Wrap it in a Spark job script and submit it to the cluster!

In [None]:
%%file scripts/pi_approximation_job.py

# TODO: write a job for the pi approximation program and run it via `spark-submit`


---
_This notebook is licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). Copyright © 2018 [Point 8 GmbH](https://point-8.de)_