## RDD example

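This notebook provides a dummy example of a `map` on a Spark RDD that can be used to check that parallelisation works on your setup.

It creates an RDD with `n_partitions` partitions and applies a map function that waits 2 seconds for each partition. With 8 partitions and 4 local cores, the job should take roughly 4 seconds if the tasks run in parallel, against 16 seconds if they run one after another.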

#### General imports

In [1]:
import os 
import time

#### Check relevant environment variables

In [2]:
os.environ['SPARK_HOME']

'/home/guest/spark'

In [3]:
os.environ['PATH']

'/home/guest/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin'

In [4]:
os.environ['PYTHONPATH']

'/home/guest/spark/python:/home/guest/spark/python/lib/pyspark.zip:/home/guest/spark/python/lib/py4j-0.10.9.2-src.zip:'
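
The lookups above raise a `KeyError` if a variable is not set. As a minimal sketch, the check can be made more forgiving by listing the missing variables instead; the helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative assumptions, not part of the original notebook.

```python
import os

# Variables inspected in the cells above
REQUIRED_VARS = ["SPARK_HOME", "PATH", "PYTHONPATH"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required variables are set")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.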

#### Start Spark session

A Spark session is created with the `pyspark.sql.SparkSession` builder. See [here](https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession) for the documentation on the `SparkSession` object.


In [5]:
# This is needed to start a Spark session from the notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=2g pyspark-shell"

from pyspark.sql import SparkSession

In [6]:
# Uncomment below to recreate a Spark session with other parameters
# spark.stop()
spark = SparkSession \
    .builder \
    .master("local[4]") \
    .appName("demoRDD") \
    .getOrCreate()

# When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc = spark.sparkContext

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/26 09:31:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### Start dummy Spark jobs

In [7]:
# Wait function: sleep for 2 seconds, then return the input unchanged
def wait2s(x):
    time.sleep(2)
    return x

In [8]:
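n_partitions = 8

# One element per partition, so each task sleeps exactly once
data = range(0, n_partitions)
datardd = sc.parallelize(data, n_partitions)

# collect() triggers the job and returns the elements to the driver
datardd.map(wait2s).collect()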

[0, 1, 2, 3, 4, 5, 6, 7]
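
To judge whether the job actually ran in parallel, it helps to compare the measured duration with a rough estimate. The sketch below assumes that tasks run in waves of at most one task per core and ignores scheduling overhead; the helpers `expected_wall_time` and `timed` are hypothetical names introduced for illustration.

```python
import math
import time


def expected_wall_time(n_tasks, n_cores, task_seconds):
    """Rough estimate: tasks run in waves of at most n_cores at a time."""
    return math.ceil(n_tasks / n_cores) * task_seconds


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Values from this notebook: 8 partitions, local[4], 2 s per task
print(expected_wall_time(8, 4, 2))  # 4
# Same job on a single core, for comparison
print(expected_wall_time(8, 1, 2))  # 16
```

In the notebook, `result, elapsed = timed(lambda: datardd.map(wait2s).collect())` should give an elapsed time close to the 4-second estimate plus some overhead. A value near 16 seconds would suggest that the tasks ran one after another.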

### Open Spark UI and check parallelisation

Open the Spark UI at `127.0.0.1:4040` (`192.168.99.100:4040` if you are running Docker with Docker Toolbox).

![](./DemoRDD.png)
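
If the page does not load, a quick check is whether anything is listening on the UI port at all. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is an assumption made for illustration, and the host and port are the ones mentioned above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


# Default Spark UI address used in this notebook
print(is_port_open("127.0.0.1", 4040))
```

Spark picks another port when 4040 is already taken, so in the notebook `sc.uiWebUrl` is a more reliable way to find the address the UI is actually bound to.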