# Spark cluster Standalone mode, with packed Python dependencies

> The scheduler is Spark's built-in "standalone" cluster manager

The packed environment archive is created with:

```bash
python -m venv pyspark_venv
pyspark_venv/bin/pip install -r requirements.txt
pyspark_venv/bin/venv-pack --force -p pyspark_venv/ -o mounted_dirs/jobs/pyspark_venv.tar.gz
```

See: 
+ https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
+ https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
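
Before shipping the archive it can help to confirm that it actually contains an interpreter at the path the session configuration expects. The following is a minimal sketch, assuming the archive keeps the environment contents at its root (so the interpreter sits at `bin/python`); the helper name `archive_has_interpreter` is hypothetical and not part of Spark or venv-pack.

```python
import posixpath
import tarfile


def archive_has_interpreter(archive_path, member="bin/python"):
    """Return True if the packed archive lists `member` among its entries.

    Hypothetical helper: it assumes the environment was packed with its
    contents at the archive root, so the interpreter is `bin/python`.
    """
    # "r:*" lets tarfile detect the compression (gzip here) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        # Normalize names so "./bin/python" and "bin/python" compare equal.
        names = {posixpath.normpath(name) for name in tar.getnames()}
    return member in names
```

For example, `archive_has_interpreter("mounted_dirs/jobs/pyspark_venv.tar.gz")` should return `True` for an archive built with the commands above, if the layout assumption holds.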

In [1]:
import os
from pyspark.sql import SparkSession 
# os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# The line above would set the Python interpreter on the current instance, i.e. the Jupyter server.
# Python packages on the Jupyter server are installed system-wide, so there is no need to set it.

spark = SparkSession.builder\
    .appName("pyspark-notebook-dep")\
    .master("spark://spark-master:7077")\
    .config("spark.archives", "/app/jobs/pyspark_venv.tar.gz#environment")\
    .config("spark.pyspark.python", "./environment/bin/python")\
    .getOrCreate()
spark

In [2]:
from pyspark.mllib.random import RandomRDDs

x = RandomRDDs.normalVectorRDD(
    spark.sparkContext, 
    numRows=10000, 
    numCols=5, 
    numPartitions=20, 
    seed=42
)
x.collect()[:3]

[array([-0.75661355, -0.83595055, -0.54290339,  0.83210849, -0.78727577]),
 array([ 0.95666722, -1.34126376, -0.68323051, -1.15742816, -0.03667599]),
 array([-0.66918965, -0.54477455, -0.34275965, -0.46614391, -1.07408784])]

In [3]:
spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v")
).toPandas()  # requires the pandas package

|   | id | v    |
|---|----|------|
| 0 | 1  | 1.0  |
| 1 | 1  | 2.0  |
| 2 | 2  | 3.0  |
| 3 | 2  | 5.0  |
| 4 | 2  | 10.0 |
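
To check that executor tasks really run with the interpreter from the unpacked archive rather than a system Python, one option is to ask each task for its interpreter path. This is a rough sketch under assumptions: `executor_python_paths` is a hypothetical helper, `sc` is expected to be `spark.sparkContext`, and `num_tasks` is an arbitrary choice that does not guarantee every worker is sampled.

```python
import sys


def executor_python_paths(sc, num_tasks=4):
    """Return the distinct Python interpreter paths reported by executor tasks.

    Hypothetical helper: `sc` is assumed to be a SparkContext. The lambda is
    evaluated inside the tasks, so `sys.executable` is read on the executors.
    """
    return (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: sys.executable)
        .distinct()
        .collect()
    )
```

With the `spark.pyspark.python` setting used in this section, calling `executor_python_paths(spark.sparkContext)` would be expected to return paths that end in `environment/bin/python`.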


# Spark cluster Standalone mode

> The scheduler is Spark's built-in "standalone" cluster manager

The Python dependencies must already be installed on all workers (and on the Jupyter server).
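
A quick way to see which dependencies are missing in a given Python environment is to probe for them without importing them. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name, and it is meant for top-level module names such as `numpy` or `pandas`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found here.

    Hypothetical helper: intended for top-level module names only.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Run on the Jupyter server, `missing_modules(["numpy", "pandas"])` reports what is missing locally. To probe the workers, the same function could be called inside tasks, for instance with `spark.sparkContext.parallelize(range(8), 8).mapPartitions(lambda _: [missing_modules(["numpy", "pandas"])]).collect()`, where the partition count of 8 is an arbitrary assumption.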

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession\
        .builder\
        .appName("pyspark-notebook")\
        .master("spark://spark-master:7077")\
        .getOrCreate()
spark

In [2]:
from pyspark.mllib.random import RandomRDDs

x = RandomRDDs.normalVectorRDD(
    spark.sparkContext, 
    numRows=10000, 
    numCols=5, 
    numPartitions=20, 
    seed=42
)
x.collect()[:3]
# If numpy is not installed on the workers, this raises:
#   ModuleNotFoundError: No module named 'numpy'
# Running `pip install numpy` on each worker resolves it.

[array([-0.75661355, -0.83595055, -0.54290339,  0.83210849, -0.78727577]),
 array([ 0.95666722, -1.34126376, -0.68323051, -1.15742816, -0.03667599]),
 array([-0.66918965, -0.54477455, -0.34275965, -0.46614391, -1.07408784])]

# Spark Local mode

> Scheduler and executors all run in the same JVM, i.e. on this Jupyter server instance

In [1]:
from pyspark.sql import SparkSession

# local[4] to use 4 cores, local[*] to use all
spark = SparkSession\
        .builder\
        .appName("pyspark-notebook")\
        .master("local[4]")\
        .getOrCreate()
spark

In [2]:
from pyspark.mllib.random import RandomRDDs

x = RandomRDDs.normalVectorRDD(
    spark.sparkContext, 
    numRows=10000, 
    numCols=5, 
    numPartitions=20, 
    seed=42
)
x.collect()[:3]
# This only needs numpy on the Jupyter server

[array([-0.75661355, -0.83595055, -0.54290339,  0.83210849, -0.78727577]),
 array([ 0.95666722, -1.34126376, -0.68323051, -1.15742816, -0.03667599]),
 array([-0.66918965, -0.54477455, -0.34275965, -0.46614391, -1.07408784])]