# Spark cluster Standalone mode with dependencies

> The resource manager is Spark's built-in "Standalone" manager and the deploy mode is "client" (the only mode possible with an interactive session)

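Starting a SparkSession with Python dependencies shipped via the `--archives` option (set here through the `spark.archives` config)

+ N.B. passing the following configs to the builder does not work the same way as setting `os.environ`; the working cell further down sets `PYSPARK_PYTHON` through `os.environ` instead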

```python
    .config("spark.archives", "/app/jobs/pyspark_venv.tar.gz#environment")\
    .config("spark.pyspark.python", "./environment/bin/python")\
```
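
The archive named in these settings has to exist before the session starts. The notebook does not say how `pyspark_venv.tar.gz` was built; the sketch below assumes it was packed with the `venv-pack` tool from inside an activated virtual environment, which is one common approach. `build_pack_command` and `pack_venv` are hypothetical helper names used only for illustration.

```python
import shutil
import subprocess


def build_pack_command(output_path):
    """Return the venv-pack command line that writes the archive to output_path."""
    # Assumption: the archive is produced by venv-pack; the notebook does not confirm this.
    return ["venv-pack", "-o", output_path]


def pack_venv(output_path="/app/jobs/pyspark_venv.tar.gz"):
    """Pack the currently activated virtual environment into a tar.gz archive."""
    if shutil.which("venv-pack") is None:
        raise RuntimeError("venv-pack not found; install it with `pip install venv-pack`")
    # check=True raises CalledProcessError if packing fails.
    subprocess.run(build_pack_command(output_path), check=True)
    return output_path
```

Calling `pack_venv()` from the environment that already has the job's packages (for example numpy) installed would produce the file that `spark.archives` points at.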

In [1]:
import os
from pyspark.sql import SparkSession 
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# sets the Python interpreter used by the workers; the relative path points into the archive
# that Spark extracts under each executor's working directory
# e.g. the full path on a worker looks like: /opt/spark/work/app-20211009135519-0000/1/./environment

spark = SparkSession.builder\
    .appName("pyspark-notebook-dep")\
    .master("spark://spark-master:7077")\
    .config("spark.archives", "/app/jobs/pyspark_venv.tar.gz#environment")\
    .getOrCreate()
# spark.archives ships the file from this machine (the Jupyter server) to the workers
# Spark extracts it into the subdirectory "environment" of each executor's working directory
# note: this is client mode, so the driver's own Python environment must have the same dependencies installed
spark

21/10/09 15:59:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:
from pyspark.mllib.random import RandomRDDs

x = RandomRDDs.normalVectorRDD(
    spark.sparkContext, 
    numRows=10000, 
    numCols=5, 
    numPartitions=20, 
    seed=42
)
x.collect()[:3]
# requires numpy package

[array([-0.75661355, -0.83595055, -0.54290339,  0.83210849, -0.78727577]),
 array([ 0.95666722, -1.34126376, -0.68323051, -1.15742816, -0.03667599]),
 array([-0.66918965, -0.54477455, -0.34275965, -0.46614391, -1.07408784])]
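
To confirm that the executors really run the interpreter from the unpacked archive, one option is to ask each task which Python it is running. This is a minimal sketch, assuming the `spark` session from the first cell is still active; `probe_python` and `probe_executors` are hypothetical helper names, not part of PySpark.

```python
def probe_python(_=None):
    """Describe the Python process this function runs in."""
    # Imports live inside the function so it stays self-contained when shipped to executors.
    import importlib.util
    import socket
    import sys

    return {
        "host": socket.gethostname(),
        "executable": sys.executable,
        "has_numpy": importlib.util.find_spec("numpy") is not None,
    }


def probe_executors(spark, num_tasks=20):
    """Run probe_python in num_tasks tasks and return the distinct reports."""
    reports = (
        spark.sparkContext
        .parallelize(range(num_tasks), num_tasks)
        .map(probe_python)
        .collect()
    )
    # Many tasks land on the same executor, so keep one report per distinct result.
    unique = {tuple(sorted(report.items())) for report in reports}
    return [dict(items) for items in sorted(unique)]
```

With the archive in place, the `executable` values returned by `probe_executors(spark)` would be expected to point into an `environment` directory on the workers, while `probe_python()` called directly in the notebook reports the driver's interpreter.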

In [3]:
spark.stop()

# Spark cluster Standalone mode

> The resource manager is Spark's built-in "Standalone" manager and the deploy mode is "client" (the only mode possible with an interactive session)


The Python dependencies must already be installed on all workers (and on the Jupyter server)
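
Since nothing is shipped in this mode, a missing or mismatched package on a worker only shows up when a task runs. A minimal sketch for comparing versions, assuming an active `spark` session connected to the standalone master; `package_version` and `versions_across_cluster` are hypothetical helper names.

```python
def package_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    # Imported here so the function is self-contained when shipped to executors.
    from importlib import metadata

    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def versions_across_cluster(spark, name="numpy", num_tasks=20):
    """Return the driver's version and the set of versions seen on executors."""
    on_executors = (
        spark.sparkContext
        .parallelize(range(num_tasks), num_tasks)
        .map(lambda _: package_version(name))
        .collect()
    )
    return {"driver": package_version(name), "executors": set(on_executors)}
```

If the `executors` set contains `None` or a version different from `driver`, at least one worker is missing the package or has a different release installed.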

# Spark Local mode

> The scheduler and executors all run in the same JVM, i.e. on this Jupyter server instance

If the dependencies are already installed on the Jupyter server, this works without shipping anything

In [1]:
from pyspark.sql import SparkSession

# local[4] to use 4 cores, local[*] to use all
spark = SparkSession\
        .builder\
        .appName("pyspark-notebook")\
        .master("local[4]")\
        .getOrCreate()
spark

In [2]:
from pyspark.mllib.random import RandomRDDs

x = RandomRDDs.normalVectorRDD(
    spark.sparkContext, 
    numRows=10000, 
    numCols=5, 
    numPartitions=20, 
    seed=42
)
x.collect()[:3]
# this only needs numpy on the Jupyter server

[array([-0.75661355, -0.83595055, -0.54290339,  0.83210849, -0.78727577]),
 array([ 0.95666722, -1.34126376, -0.68323051, -1.15742816, -0.03667599]),
 array([-0.66918965, -0.54477455, -0.34275965, -0.46614391, -1.07408784])]