## RDD example

This notebook provides a dummy example of a map on a Spark RDD, which can be used to check that parallelisation works on the cluster.

It consists of creating an RDD with `n_partitions` partitions (one element per partition) and applying a map function that waits 2 seconds on each element.

## Preliminaries: Cluster access

* Connect to the cluster with

```
ssh -p 30 -L 8000:jupyter:8000 -L 8888:hue:8888 -L 8088:hue:8088 yourlogin@bigdata.ulb.ac.be
```

Note the port redirections, which give local access to JupyterHub, Hue, and the Hadoop Web UI.

* Open JupyterHub locally by connecting to `127.0.0.1:8000`
* Upload this notebook


#### General imports

In [None]:
import os 
import time
import numpy as np
import pandas as pd

#### Start Spark session

A Spark session is created with the `pyspark.sql.SparkSession` object. See [here](https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession) for the API documentation on the `SparkSession` object.


In [None]:
# This is needed to start a Spark session from the notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=2g pyspark-shell"

# For Yarn, so that Spark knows where it runs
os.environ['HADOOP_CONF_DIR']="/etc/hadoop/conf"
# For Yarn, so Spark knows which version to use (and we want Anaconda to be used, so we have access to numpy, pandas, and so forth)
os.environ['PYSPARK_PYTHON']="/etc/anaconda3/bin/python"
os.environ['PYSPARK_DRIVER_PYTHON']="/etc/anaconda3/bin/python"


from pyspark.sql import SparkSession

In [None]:
#Uncomment below to recreate a Spark session with other parameters
#spark.stop()
spark = SparkSession \
    .builder \
    .master("yarn") \
    .config("spark.executor.instances","10") \
    .appName("demoRDD") \
    .getOrCreate()
    
# When dealing with RDDs, we work with the SparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc=spark.sparkContext

#### Start dummy Spark jobs

In [None]:
# Wait function
def wait2s(x):
    time.sleep(2)
    return x

In [None]:
n_partitions=8

data=range(0,n_partitions)
datardd=sc.parallelize(data,n_partitions)

datardd.map(wait2s).collect()
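
A quick way to check the parallelism without opening any UI is to time the action. The following is a minimal sketch, assuming a hypothetical helper `timed` that is not part of the original notebook. If all 8 partitions run at the same time, the `collect()` should take a little over 2 seconds; if they run one after another, it takes about 16.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

# Local stand-in for the Spark action, so that this snippet runs on its own.
# In the notebook, pass instead: lambda: datardd.map(wait2s).collect()
result, elapsed = timed(lambda: [x for x in range(8)])
print(f"collected {len(result)} elements in {elapsed:.2f} s")
```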

### Open Spark UI and check parallelisation

* Open the Hadoop Web UI at `127.0.0.1:8088`, and click on your running application. You should land on a URL similar to `http://127.0.0.1:8088/cluster/app/application_1523870291186_0032`
* Change `cluster/app` to `proxy` to get to the Spark UI: `http://127.0.0.1:8088/proxy/application_1523870291186_0032`
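
The URL rewrite in the second step can also be done programmatically. This is a small sketch with a hypothetical helper name (`spark_ui_url`), assuming the application URL has the `/cluster/app/` form shown above.

```python
def spark_ui_url(app_url):
    """Turn a YARN application URL into the proxied Spark UI URL."""
    # Replace only the first occurrence of the path segment.
    return app_url.replace("/cluster/app/", "/proxy/", 1)

print(spark_ui_url("http://127.0.0.1:8088/cluster/app/application_1523870291186_0032"))
```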

### Stop Spark

In [None]:
spark.stop()

### Scalability

The following runs a benchmark that measures the execution time of the `wait2s` mapping on an RDD with 100 partitions, for a number of executors increasing from 1 to 100. The mapping is repeated ten times for each number of executors (1, 2, 5, 10, 20, 50, 100).

In [None]:
#Ten runs (rows), for number of executors in (1,2,5,10,20,50,100) (in columns)
n_executors=[1,2,5,10,20,50,100]
results_benchmark=np.zeros((10,7))


In [None]:
for i, n in enumerate(n_executors):

    print("Run benchmark with " + str(n) + " executors")
    spark = SparkSession \
        .builder \
        .master("yarn") \
        .config("spark.executor.instances", str(n)) \
        .appName("demoRDD") \
        .getOrCreate()

    sc = spark.sparkContext

    # 100 partitions, one element per partition
    n_partitions = 100
    data = range(0, n_partitions)
    datardd = sc.parallelize(data, n_partitions)

    # Ten timed runs for this number of executors
    for j in range(10):
        time_start = time.time()
        datardd.map(wait2s).collect()
        time_end = time.time()
        execution_time = time_end - time_start
        results_benchmark[j, i] = execution_time

    # Stop the session so that the next iteration creates a new one
    spark.stop()
        

In [None]:
pd_results=pd.DataFrame(results_benchmark)

In [None]:
pd_results

```
203.9961552619934,102.81437921524048,42.03271555900574,22.165425539016724,12.114766836166382,6.621063947677612,4.2486162185668945
202.44642806053162,101.42299795150757,40.71962809562683,20.532904863357544,10.455049276351929,4.267818212509155,2.2575149536132812
202.36464619636536,101.41451358795166,40.69806241989136,20.44546890258789,10.30104923248291,4.22643518447876,3.038877010345459
202.4120738506317,101.30005168914795,40.68964910507202,20.46614170074463,10.302504062652588,4.2139832973480225,2.2561874389648438
202.29537892341614,101.28277230262756,40.719842195510864,20.5796639919281,10.278221130371094,4.6493425369262695,2.2038023471832275
202.13580703735352,101.29535865783691,40.626713514328,20.42010807991028,10.451300382614136,4.237480401992798,2.251011371612549
202.09938716888428,101.22942876815796,40.62957000732422,20.397658586502075,10.503596067428589,4.172134160995483,2.301781177520752
202.12336921691895,101.22201132774353,40.58945393562317,20.378247022628784,10.340943574905396,4.20829176902771,2.296875
202.11851453781128,101.29037380218506,40.55907702445984,20.389169216156006,10.39184284210205,4.261171340942383,2.4329993724823
202.07814502716064,101.16865134239197,40.75089430809021,20.365695476531982,10.275054216384888,4.304711818695068,2.1503186225891113

```
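
These timings can be compared with a simple model. The sketch below assumes that each executor runs one task at a time (an assumption about the cluster configuration, which the notebook does not set) and ignores all scheduling overhead, so that 100 tasks of 2 seconds run in `ceil(100 / n_executors)` successive waves. The function name `ideal_time` is hypothetical.

```python
import math

def ideal_time(n_tasks, n_slots, task_seconds=2.0):
    """Ideal wall time when n_slots tasks run at once and each takes task_seconds."""
    waves = math.ceil(n_tasks / n_slots)
    return waves * task_seconds

n_executors = [1, 2, 5, 10, 20, 50, 100]
for n in n_executors:
    print(f"{n:>3} executors: {ideal_time(100, n):6.1f} s")
```

The model gives 200, 100, 40, 20, 10, 4 and 2 seconds. After the first run in each column, the measured values stay within roughly 2.5 seconds of these figures. The first run is slower in every column, which would be consistent with a start-up cost, although the notebook does not measure where that time goes.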

In [None]:
pd_results.to_csv("resultsBenchmark.csv",index=False,header=False)
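
The saved results can then be summarised as a speedup relative to a single executor. The following is a sketch only: `median_speedup` is a hypothetical helper, and the two sample rows are runs 2 and 3 of the table above, rounded to 0.1 s, used so that the snippet runs without the CSV file.

```python
import statistics

def median_speedup(rows):
    """rows: list of runs, each a list of times ordered by executor count.

    Returns, for each column, the first column's median time divided by
    that column's median time.
    """
    columns = list(zip(*rows))
    medians = [statistics.median(col) for col in columns]
    return [medians[0] / m for m in medians]

# Runs 2 and 3 from the table above, rounded to 0.1 s.
rows = [
    [202.4, 101.4, 40.7, 20.5, 10.5, 4.3, 2.3],
    [202.4, 101.4, 40.7, 20.4, 10.3, 4.2, 3.0],
]
for n, s in zip([1, 2, 5, 10, 20, 50, 100], median_speedup(rows)):
    print(f"{n:>3} executors: speedup x{s:.1f}")
```

In the notebook, the full set of rows could be loaded with `pd.read_csv("resultsBenchmark.csv", header=None).values.tolist()` and passed to the same function.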