## RDD example

This notebook provides a dummy example of a map on a Spark RDD, which can be used to check that parallelisation works on the cluster.

It consists of creating an RDD with `n_partitions` partitions and applying a map function that waits 2 seconds for each element. Since each partition holds exactly one element, this amounts to 2 seconds per partition.

## Preliminaries: Cluster access

* Connect to the cluster with:

```
ssh -p 30 -L 8000:jupyter:8000 -L 8888:hue:8888 -L 8088:hue:8088 yourlogin@bigdata.ulb.ac.be
```

Note the port redirection, which gives local access to JupyterHub, Hue, and the Hadoop Web UI.

* Open JupyterHub locally by connecting to `127.0.0.1:8000`
* Upload this notebook
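
Before opening the browser, it can be handy to confirm that the SSH tunnel is actually forwarding the three ports. The following is a minimal sketch and not part of the original notebook: the helper name `port_is_open` is hypothetical, and it only checks that something accepts TCP connections on the local port, not that the service behind it is healthy.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Local ports forwarded by the ssh command above
for name, port in [("JupyterHub", 8000), ("Hue", 8888), ("Hadoop Web UI", 8088)]:
    status = "reachable" if port_is_open("127.0.0.1", port) else "not reachable"
    print(f"{name} (127.0.0.1:{port}): {status}")
```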


#### General imports

In [1]:
import os 
import sys
import time
import numpy as np
import pandas as pd

#### Start Spark session

A Spark session is created with the `pyspark.sql.SparkSession` object. See the [Spark SQL programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession) for the API documentation on `SparkSession`.


In [2]:
#This is needed to start a Spark session from the notebook
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=2g pyspark-shell"

# For Yarn, so that Spark knows where it runs
os.environ['HADOOP_CONF_DIR']="/etc/hadoop/conf"
# For Yarn, so Spark knows which version to use (and we want Anaconda to be used, so we have access to numpy, pandas, and so forth)
os.environ['PYSPARK_PYTHON']="/etc/anaconda3/bin/python"
os.environ['PYSPARK_DRIVER_PYTHON']="/etc/anaconda3/bin/python"


from pyspark.sql import SparkSession

In [3]:
#Uncomment below to recreate a Spark session with other parameters
#spark.stop()
spark = SparkSession \
    .builder \
    .master("yarn") \
    .config("spark.executor.instances","10") \
    .appName("demoRDD") \
    .getOrCreate()
    
#When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc=spark.sparkContext

#### Start dummy Spark jobs

In [11]:
# Wait function
def wait2s(x):
    time.sleep(2)
    return x

In [5]:
n_partitions=8

data=range(0,n_partitions)
datardd=sc.parallelize(data,n_partitions)

datardd.map(wait2s).collect()

[0, 1, 2, 3, 4, 5, 6, 7]
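
A quick way to check parallelisation without opening the Spark UI is to time the job. With 8 partitions that each sleep 2 seconds, a wall-clock time close to 2 seconds suggests the tasks ran concurrently, while a time close to 16 seconds suggests they ran one after another. The sketch below uses a hypothetical helper named `timed`; the Spark call is shown only as a comment because it needs the session created above.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the helper can be tried without a cluster
result, elapsed = timed(sum, range(10))
print(result, round(elapsed, 3))

# With the Spark objects defined above, the same helper would be used as:
# result, elapsed = timed(datardd.map(wait2s).collect)
# print(result, "collected in", round(elapsed, 1), "s")
```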

### Open Spark UI and check parallelisation

* Open the Hadoop Web UI at `127.0.0.1:8088`, and click on your running application. You should land on a URL similar to `http://127.0.0.1:8088/cluster/app/application_1523870291186_0032`
* Change `cluster/app` to `proxy` to get to the Spark UI: `http://127.0.0.1:8088/proxy/application_1523870291186_003`
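
The URL change described above can also be scripted. This is a small sketch with a hypothetical helper name, `spark_ui_url`; it assumes the application URL has the `/cluster/app/` form shown in the first bullet.

```python
def spark_ui_url(app_url):
    """Turn a YARN application URL into the proxied Spark UI URL."""
    if "/cluster/app/" not in app_url:
        raise ValueError("not a YARN application URL: " + app_url)
    # Replace only the first occurrence, as described in the steps above
    return app_url.replace("/cluster/app/", "/proxy/", 1)

print(spark_ui_url("http://127.0.0.1:8088/cluster/app/application_1523870291186_0032"))
# http://127.0.0.1:8088/proxy/application_1523870291186_0032
```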

### Stop session

In [23]:
spark.stop()

### Scalability

In [14]:
#Ten runs (rows), for number of executors in (1,2,5,10,20,50,100) (in columns)
n_executors=[1,2,5,10,20,50,100]
results_benchmark=np.zeros((10,7))


In [24]:
for i in range(len(n_executors)):
    
    print("Run benchmark with "+str(n_executors[i])+" executors")
    spark = SparkSession \
        .builder \
        .master("yarn") \
        .config("spark.executor.instances",str(n_executors[i])) \
        .appName("demoRDD") \
        .getOrCreate()
    
    sc=spark.sparkContext
    
    #100 partitions
    n_partitions=100
    data=range(0,n_partitions)
    datardd=sc.parallelize(data,n_partitions)

    for j in range(10):
        time_start=time.time()
        datardd.map(wait2s).collect()
        time_end=time.time()
        execution_time=time_end-time_start
        results_benchmark[j,i]=execution_time
    
    spark.stop()
        

Run benchmark with 1 executors
Run benchmark with 2 executors
Run benchmark with 5 executors
Run benchmark with 10 executors
Run benchmark with 20 executors
Run benchmark with 50 executors
Run benchmark with 100 executors


In [27]:
pd_results=pd.DataFrame(results_benchmark)

In [28]:
pd_results

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 203.996155 | 102.814379 | 42.032716 | 22.165426 | 12.114767 | 6.621064 | 4.248616 |
| 1 | 202.446428 | 101.422998 | 40.719628 | 20.532905 | 10.455049 | 4.267818 | 2.257515 |
| 2 | 202.364646 | 101.414514 | 40.698062 | 20.445469 | 10.301049 | 4.226435 | 3.038877 |
| 3 | 202.412074 | 101.300052 | 40.689649 | 20.466142 | 10.302504 | 4.213983 | 2.256187 |
| 4 | 202.295379 | 101.282772 | 40.719842 | 20.579664 | 10.278221 | 4.649343 | 2.203802 |
| 5 | 202.135807 | 101.295359 | 40.626714 | 20.420108 | 10.4513 | 4.23748 | 2.251011 |
| 6 | 202.099387 | 101.229429 | 40.62957 | 20.397659 | 10.503596 | 4.172134 | 2.301781 |
| 7 | 202.123369 | 101.222011 | 40.589454 | 20.378247 | 10.340944 | 4.208292 | 2.296875 |
| 8 | 202.118515 | 101.290374 | 40.559077 | 20.389169 | 10.391843 | 4.261171 | 2.432999 |
| 9 | 202.078145 | 101.168651 | 40.750894 | 20.365695 | 10.275054 | 4.304712 | 2.150319 |
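
These timings can be compared with a simple lower bound. The sketch below assumes (this is not stated in the notebook) that each executor runs one task at a time, so that the 100 two-second tasks are processed in waves of `n_slots` tasks. The function name `ideal_runtime` is hypothetical.

```python
import math

def ideal_runtime(n_tasks, n_slots, task_seconds=2.0):
    """Lower bound on wall-clock time when n_slots tasks run concurrently."""
    waves = math.ceil(n_tasks / n_slots)
    return waves * task_seconds

for n in [1, 2, 5, 10, 20, 50, 100]:
    print(n, "executors ->", ideal_runtime(100, n), "s")
```

Under that assumption the bounds are 200, 100, 40, 20, 10, 4 and 2 seconds. The measured times sit slightly above them (about 202 s for 1 executor, and a little over 2 s for 100 executors in most runs), which would be consistent with a small scheduling overhead per job.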


In [30]:
pd_results.to_csv("resultsBenchmark.csv",index=False,header=False)
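
Because the file is written without header or index, the executor counts have to be supplied again when reading it back. The following is a minimal sketch of one way to summarise the saved runs; the helper name `summarise_benchmark` is hypothetical, and it reports the median run time per column and the speed-up relative to the first column.

```python
import numpy as np

def summarise_benchmark(source, n_executors):
    """Return (median_seconds, speedup) per executor count from header-less CSV data."""
    # Rows are runs, columns are executor counts
    runs = np.loadtxt(source, delimiter=",", ndmin=2)
    if runs.shape[1] != len(n_executors):
        raise ValueError("column count does not match n_executors")
    medians = np.median(runs, axis=0)
    speedup = medians[0] / medians
    return medians, speedup

# Usage, assuming resultsBenchmark.csv was written by the cell above:
# counts = [1, 2, 5, 10, 20, 50, 100]
# medians, speedup = summarise_benchmark("resultsBenchmark.csv", counts)
# for n, m, s in zip(counts, medians, speedup):
#     print(n, round(m, 2), round(s, 1))
```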

### Scalability regression
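
The cells below appear to benchmark a linear regression solved through the normal equations, with the two matrix products accumulated row by row so that they can be computed with a Spark map and reduce. Writing $x_i$ for the $i$-th row of $X$ (as a column vector) and $y_i$ for the corresponding entry of $Y$, the quantities involved are:

$$
X^\top X = \sum_{i=1}^{N} x_i x_i^\top, \qquad
X^\top Y = \sum_{i=1}^{N} x_i y_i, \qquad
\hat{\theta} = (X^\top X)^{-1} X^\top Y
$$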



In [2]:
def genData(N,n,random_seed):
    
    start = time.time()

    np.random.seed(random_seed)

    #Inputs and the weights of the linear combination are drawn at random
    X=np.random.rand(N,n)
    theta=np.random.rand(n)
    #noise=np.random.rand(N)

    Y=np.dot(X,theta)#+noise
    Y=Y[:,np.newaxis]
    Z=np.concatenate((X,Y),axis=1)

    print("Number of observations :",N)
    print("Number of features :",n)

    print("Dimension of X :",X.shape)
    print("Dimension of theta :",theta.shape)
    print("Dimension of Y :",Y.shape)

    end = time.time()
    print("Time to create artificial data: ",round(end - start,2),"seconds")
    
    return (X,Y,Z,theta)

In [3]:
#Let us generate the dataset: 1M rows, 100 features
N=1000000
n=100
(X,Y,Z,theta)=genData(N,n,0)

Number of observations : 1000000
Number of features : 100
Dimension of X : (1000000, 100)
Dimension of theta : (100,)
Dimension of Y : (1000000, 1)
Time to create artificial data:  2.79 seconds


In [4]:
sys.getsizeof(Z)

808000112
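
The reported size matches the array dimensions: `Z` has 1,000,000 rows and 101 columns (100 features plus the target) of 8-byte floats. The arithmetic can be checked as below; attributing the remaining 112 bytes to the NumPy array object's own header is an assumption.

```python
n_rows = 1_000_000
n_cols = 100 + 1          # 100 features plus the target column
bytes_per_float64 = 8

data_bytes = n_rows * n_cols * bytes_per_float64
print(data_bytes)                 # 808000000
print(808000112 - data_bytes)     # 112
```

At roughly 0.8 GB, the array is sizeable for a driver process, which may be why the next cell raises `spark.driver.memory` to 20g.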

In [5]:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=20g pyspark-shell"

# For Yarn, so that Spark knows where it runs
os.environ['HADOOP_CONF_DIR']="/etc/hadoop/conf"
# For Yarn, so Spark knows which version to use (and we want Anaconda to be used, so we have access to numpy, pandas, and so forth)
os.environ['PYSPARK_PYTHON']="/etc/anaconda3/bin/python"
os.environ['PYSPARK_DRIVER_PYTHON']="/etc/anaconda3/bin/python"


from pyspark.sql import SparkSession

In [6]:
def xtx_xty_row(z):
    x=np.array(z[:-1])
    y=z[-1]
    xtx=np.outer(x,x)
    xty=np.dot(x,y)
    return (xtx,xty)
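
The per-row function can be checked locally, without Spark, on a small array: summing the per-row outer products should give the same matrix as `X.T @ X`, and likewise for the second term. The sketch below is such a check under stated assumptions. The names `X_small`, `theta_small` and `Z_small` are hypothetical, a plain Python `sum` stands in for the RDD map and reduce, and `np.linalg.solve` is used in place of an explicit matrix inverse.

```python
import numpy as np

def xtx_xty_row(z):
    # Same per-row computation as in the cell above
    x = np.array(z[:-1])
    y = z[-1]
    return (np.outer(x, x), np.dot(x, y))

rng = np.random.default_rng(0)
X_small = rng.random((50, 3))
theta_small = rng.random(3)
# Noise-free targets appended as the last column, as in genData
Z_small = np.column_stack([X_small, X_small @ theta_small])

# Local stand-in for Z_RDD.map(...).reduce(...)
XtX = sum(xtx_xty_row(z)[0] for z in Z_small)
XtY = sum(xtx_xty_row(z)[1] for z in Z_small)

print(np.allclose(XtX, X_small.T @ X_small))
print(np.allclose(XtY, X_small.T @ (X_small @ theta_small)))

# Solve the normal equations and compare with the true weights
theta_hat = np.linalg.solve(XtX, XtY)
print(np.allclose(theta_hat, theta_small))
```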

In [7]:
#Ten runs (rows), for number of executors in (1,2,5,10,20,50,100) (in columns)
n_executors=[1,2,5,10,20,50,100]
results_benchmark=np.zeros((10,7))


In [8]:
for i in reversed(range(len(n_executors))):  # run from the largest executor count down to the smallest
    
    #mem_per_exec=np.min(np.round(20/n_executors[i]),2)
    mem_per_exec=str(max([int(np.round(10/n_executors[i])),2]))+"g"
    
    print("Number of executors :"+str(n_executors[i]))
    print("Memory per executor: "+str(mem_per_exec))
    
    #spark.stop()
    print("Run benchmark with "+str(n_executors[i])+" executors")
    spark = SparkSession \
        .builder \
        .master("yarn") \
        .config("spark.executor.instances",n_executors[i]) \
        .config("spark.executor.memory",mem_per_exec) \
        .appName("demoRDD") \
        .getOrCreate()
    
    sc=spark.sparkContext
    
    time_start=time.time()
    
    B=400
    Z_RDD=sc.parallelize(Z,B)#.cache()
    
    time_end=time.time()
    
    print("Time to load data: "+str(time_end-time_start)+" s")
    
    print(Z_RDD.count())
    
    for j in range(10):
        time_start=time.time()
        
        (XtX,XtY)=Z_RDD.map(xtx_xty_row)\
        .reduce(lambda xtx_xty0,xtx_xty1:(xtx_xty0[0]+xtx_xty1[0],xtx_xty0[1]+xtx_xty1[1]))

        XtX_inverse=np.linalg.inv(XtX)

        theta_hat=np.dot(XtX_inverse,XtY)

        time_end=time.time()
        
        execution_time=time_end-time_start
        results_benchmark[j,i]=execution_time
    
    spark.stop()
        

Number of executors :100
Memory per executor: 2g
Run benchmark with 100 executors
Time to load data: 8.856817245483398 s
1000000
Number of executors :50
Memory per executor: 2g
Run benchmark with 50 executors
Time to load data: 6.934412479400635 s
1000000
Number of executors :20
Memory per executor: 2g
Run benchmark with 20 executors
Time to load data: 7.10111403465271 s
1000000
Number of executors :10
Memory per executor: 2g
Run benchmark with 10 executors
Time to load data: 7.819091081619263 s
1000000
Number of executors :5
Memory per executor: 2g
Run benchmark with 5 executors
Time to load data: 7.823525428771973 s
1000000
Number of executors :2
Memory per executor: 5g
Run benchmark with 2 executors
Time to load data: 8.298221826553345 s
1000000
Number of executors :1
Memory per executor: 10g
Run benchmark with 1 executors
Time to load data: 8.146310329437256 s
1000000
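
The "Memory per executor" lines in the log follow from the rule used in the loop: 10 GB divided by the number of executors, rounded, with a floor of 2 GB. A minimal sketch of that rule is given below; the function name `memory_per_executor` and its parameter names are hypothetical, and Python's built-in `round` is used here instead of `np.round`.

```python
def memory_per_executor(n_executors, total_gb=10, min_gb=2):
    """Split total_gb across executors, but never go below min_gb each."""
    return str(max(int(round(total_gb / n_executors)), min_gb)) + "g"

for n in [1, 2, 5, 10, 20, 50, 100]:
    print(n, memory_per_executor(n))
```

From 10 executors upward the 2 GB floor applies, so the total memory requested grows with the executor count (for example 100 executors at 2g each) rather than staying at 10 GB.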


In [9]:
pd_results=pd.DataFrame(results_benchmark)

In [10]:
pd_results

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 94.217127 | 48.167062 | 20.0484 | 10.594991 | 6.267728 | 4.874828 | 4.970212 |
| 1 | 93.237438 | 47.922709 | 19.783868 | 10.143557 | 5.708942 | 2.908823 | 3.648048 |
| 2 | 93.580764 | 47.577099 | 19.63239 | 10.164382 | 5.255147 | 3.128972 | 3.616532 |
| 3 | 93.616213 | 47.491464 | 19.439164 | 9.927299 | 5.256613 | 3.27298 | 3.562915 |
| 4 | 92.820755 | 48.071952 | 19.507984 | 10.116198 | 5.238891 | 2.59777 | 2.630375 |
| 5 | 91.648386 | 47.178033 | 19.568926 | 10.130866 | 5.98433 | 3.734691 | 3.337098 |
| 6 | 91.736841 | 46.738262 | 19.480052 | 9.88136 | 5.326632 | 2.423293 | 3.241172 |
| 7 | 92.94894 | 46.719901 | 19.45149 | 10.00039 | 5.134878 | 3.343642 | 6.38337 |
| 8 | 91.723531 | 47.284382 | 19.73941 | 10.155546 | 5.245079 | 3.095833 | 6.369322 |
| 9 | 91.498889 | 47.226938 | 19.263505 | 9.849107 | 5.332969 | 2.644213 | 3.723275 |


In [11]:
pd_results.to_csv("resultsBenchmarkRegression.csv",index=False,header=False)
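
To read the regression timings as scalability figures, one option is to convert them to speed-up and parallel efficiency. The sketch below does this for a single run, using the last row (run 9) of the results table above; the function name `speedup_and_efficiency` is hypothetical, and efficiency is taken as speed-up divided by the number of executors, which assumes the first entry is the single-executor time.

```python
n_executors = [1, 2, 5, 10, 20, 50, 100]
# Last row (run 9) of the regression benchmark table above, in seconds
run_times = [91.498889, 47.226938, 19.263505, 9.849107, 5.332969, 2.644213, 3.723275]

def speedup_and_efficiency(n_executors, run_times):
    """Speed-up relative to the first entry, and speed-up per executor."""
    baseline = run_times[0]
    rows = []
    for n, t in zip(n_executors, run_times):
        speedup = baseline / t
        rows.append((n, speedup, speedup / n))
    return rows

for n, s, e in speedup_and_efficiency(n_executors, run_times):
    print(f"{n:>3} executors: speed-up {s:5.1f}, efficiency {e:4.2f}")
```

In this run the 100-executor time is higher than the 50-executor time, so the speed-up stops improving beyond 50 executors.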