# Introduction to Spark and the RDD API

In this notebook, we will give an overview of Spark and introduce the RDD API.

You can refer to the introduction to Spark and documentation for the Spark RDD API on the Spark website.
https://spark.apache.org/docs/latest/rdd-programming-guide.html

You can also check out the academic paper that introduced Spark and RDDs.
'Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing', Zaharia et al.
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Below we assume you have read the introduction.

Create a Spark application  
local: workers are threads on your machine  
'*': one thread per core

In [None]:
from pyspark.sql import SparkSession

ss = SparkSession.builder \
    .master("local[*]")  \
    .appName('intro') \
    .getOrCreate()

The RDD API is available through the SparkContext sc.  
The SparkSession ss will be used for the DataFrame API in TP2.

In [None]:
sc = ss.sparkContext
sc

A UI is available to understand/monitor the state of your application.   
To access it, open http://localhost:4040 in a browser.

In [None]:
sc.uiWebUrl

Create a random array of 1M floats and convert it to an RDD with 16 partitions.

In [None]:
import numpy as np

local_data = np.random.rand(2**20)

In [None]:
print('First 4 elements of local array: ', local_data[0:4])

Let's create the distributed version of this array.

In [None]:
rdd = sc.parallelize(local_data, numSlices=16)

In [None]:
print('First 4 elements of RDD: ', rdd.take(4))

Notice that one job with only one task was created in the Spark UI.  
Spark does not need to evaluate the whole RDD to access only the first elements.

In [None]:
rdd_size = rdd.count()
print(f'We have {rdd_size} elements in our RDD, the same as in our local array: {len(local_data)}.')

Notice that another job was created in the Spark UI, this time with 16 tasks: one task per partition.
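
To build intuition for the two jobs above, here is a plain-Python sketch of a partitioned collection. It assumes an even, contiguous split, which may differ from how Spark actually slices the data. The helpers split, local_take and local_count are hypothetical names used for illustration, not Spark API.

```python
def split(data, num_slices):
    # Assumption: even, contiguous slices (Spark's real slicing may differ).
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def local_take(partitions, k):
    # Read partitions one by one and stop as soon as k elements are found.
    out, visited = [], 0
    for part in partitions:
        visited += 1
        out.extend(part[:k - len(out)])
        if len(out) >= k:
            break
    return out, visited

def local_count(partitions):
    # Every partition must be visited: one "task" per partition.
    return sum(len(part) for part in partitions)

partitions = split(list(range(2**20)), 16)
first, visited = local_take(partitions, 4)
total = local_count(partitions)
print(first, visited, total)
```

In this sketch, taking 4 elements touches a single partition, while counting has to go through all 16.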

How do we modify our RDD?  
We cannot modify an RDD: RDDs are immutable. However, we can transform an existing RDD into a new RDD. This does not consume memory for the data, because the new RDD is not materialized in memory. It is only a description of the computation.

In [None]:
rdd2 = rdd.map(lambda x: x - 0.5)

Notice this was instantaneous and no job was created in the Spark UI. Spark transformations (such as map above) are lazy. Only actions (such as count or take above) trigger the computation.
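
The laziness described above can be imitated in plain Python with the built-in lazy map. This is only an analogy, assuming a lazy iterator stands in for an RDD transformation and list() stands in for an action. The names shift and calls are made up for this sketch.

```python
calls = []  # records every time the mapped function actually runs

def shift(x):
    calls.append(x)
    return x - 0.5

data = [0.1, 0.6, 0.9, 0.3]

# "Transformation": building the lazy map runs nothing yet.
lazy = map(shift, data)
calls_after_transformation = len(calls)

# "Action": consuming the iterator triggers the computation.
result = list(lazy)
calls_after_action = len(calls)

print(calls_after_transformation, calls_after_action)
```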

A map does not change the partitioning.

In [None]:
rdd.getNumPartitions() == rdd2.getNumPartitions()

In [None]:
rdd2.take(4)

We can keep transforming our RDD.

In [None]:
rdd3 = rdd2.map(lambda x: 2*x)

An action will trigger the computation of the map function above. Here we sum all elements.

In [None]:
rdd3.reduce(lambda x,y: x+y)

Notice in the Spark UI that only 16 tasks were executed although we applied a map function twice.  
We could have expected to see 32 tasks: 16 tasks to apply the first map f1 and then 16 tasks to apply the second map f2.  
But that would mean iterating through our RDD twice. Spark is smart and merged our two functions to apply f2 o f1 instead.  
Everything happened as if we had written the code below directly.

In [None]:
rdd.map(lambda x: 2*(x-0.5)).reduce(lambda x,y: x+y)
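
A rough plain-Python model of this merging, assuming each task simply pushes every element of its partition through all pending functions in a single pass. The helper run_partition is hypothetical and only illustrates the idea; it is not how Spark is implemented internally.

```python
def f1(x):
    return x - 0.5

def f2(x):
    return 2 * x

def run_partition(partition, funcs):
    # One pass over the partition: each element goes through f1 then f2.
    out = []
    for x in partition:
        for f in funcs:
            x = f(x)
        out.append(x)
    return out

partitions = [[0.0, 0.25], [0.5, 1.0]]

# One "task" per partition, whatever the number of chained maps.
fused = [run_partition(p, [f1, f2]) for p in partitions]
direct = [[2 * (x - 0.5) for x in p] for p in partitions]
print(fused == direct)
```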

We are summing elements uniformly distributed in [-1, 1), so we would expect a sum close to 0 relative to the number of elements (the standard deviation of the sum is about 591). Let's make sure our random generator is behaving properly by also checking the max and min values.

In [None]:
rdd3.reduce(max)

In [None]:
rdd3.reduce(min)

Sounds good, but to compute the sum, max and min we generated 3 jobs, meaning we went over our RDD 3 times. Can we do it all in one job?

In [None]:
init_state = (0.0, -2.0, 2.0)
# State is a tuple of size 3: (sum, max, min)
# At position 0: the running sum, initialized to 0.0
# At position 1: the running max, initialized to -2.0 (below every element, since all are in [-1, 1))
# At position 2: the running min, initialized to 2.0 (above every element)

def add_one_element_to_state(state, elt):
    state_sum, state_max, state_min = state
    state_sum += elt
    state_max = max(state_max, elt)
    state_min = min(state_min, elt)
    return (state_sum, state_max, state_min)

def merge_two_states(state1, state2):
    sum1, max1, min1 = state1
    sum2, max2, min2 = state2
    return ( sum1+sum2, max(max1, max2), min(min1, min2) )

rdd3.aggregate(init_state, add_one_element_to_state, merge_two_states)
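
To see how the two functions cooperate, here is a small plain-Python emulation of aggregate on a list of partitions. It is a sketch under the assumption that each partition is folded from the initial state and that the partial states are then merged, again starting from the initial state. The names local_aggregate, add_elt and merge are hypothetical.

```python
from functools import reduce

def local_aggregate(partitions, zero, seq_op, comb_op):
    # One partial state per partition, each starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Merge the partial states, again starting from the zero value.
    return reduce(comb_op, partials, zero)

def add_elt(state, elt):
    s, mx, mn = state
    return (s + elt, max(mx, elt), min(mn, elt))

def merge(a, b):
    return (a[0] + b[0], max(a[1], b[1]), min(a[2], b[2]))

# Three toy partitions, one of them empty.
partitions = [[0.5, -0.25], [0.75], []]
total = local_aggregate(partitions, (0.0, -2.0, 2.0), add_elt, merge)
print(total)  # (1.0, 0.75, -0.25)
```

Because the initial state is reused for every partition, it has to be neutral for the merge: 0.0 for the sum, and values outside the data range for the max and min.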

Let's dig deeper. We would like to make sure the distribution is uniform. To check this, let's split the [-1, 1] interval into 2000 bins of width 0.001 and compute the average of the values falling in each bin.

In [None]:
bins = np.arange(-1000, 1001) / 1000
bins

In [None]:
def bucket(x):
    bin_idx = np.digitize(x, bins)
    return (bins[bin_idx-1], bins[bin_idx])

print(f'{local_data[0]} is in {bucket(local_data[0])}')
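
For readers unfamiliar with np.digitize, the lookup done by bucket can be sketched with the standard library. With increasing bin edges and the default right=False, np.digitize returns the index i such that bins[i-1] <= x < bins[i], which is what bisect_right computes. The names edges and bucket_stdlib are made up for this sketch.

```python
from bisect import bisect_right

# Same bin edges as above, built without NumPy: -1.0, -0.999, ..., 1.0
edges = [i / 1000 for i in range(-1000, 1001)]

def bucket_stdlib(x):
    # bisect_right returns i such that edges[i-1] <= x < edges[i].
    # Assumes -1 <= x < 1; x == 1.0 would fall past the last edge.
    i = bisect_right(edges, x)
    return (edges[i - 1], edges[i])

print(bucket_stdlib(0.1234))  # (0.123, 0.124)
```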

In [None]:
rdd4 = rdd3.keyBy(bucket)

In [None]:
rdd4.take(4)

Notice that the bucket function depends on the bins variable defined on the driver. When sending the bucket function to the executors, Spark will also send the bins variable: Spark sends the closure of the bucket function.

In [None]:
def bucket_is_bad(elt):
    (low, high), x = elt
    return x < low or x >= high

In [None]:
bad_bucket_rdd = rdd4.filter(bucket_is_bad)

In [None]:
bad_bucket_rdd.count()

In [None]:
init_state = (0, 0.0)
# At position 0, the number of values
# At position 1, the sum of all values

def accum_state(state, elt):
    num_elts, sum_elts = state
    num_elts += 1
    sum_elts += elt
    return (num_elts, sum_elts)

def merge_states(state1, state2):
    n1, sum1 = state1
    n2, sum2 = state2
    return (n1+n2, sum1+sum2)

In [None]:
state_per_bucket = rdd4.aggregateByKey(init_state, accum_state, merge_states, numPartitions=4)

Nothing happened. aggregateByKey is a transformation, not an action, so it is evaluated lazily, giving us the opportunity to chain additional transformations.

In [None]:
def compute_avg(state):
    num_elts, sum_elts = state
    return sum_elts / num_elts

avg_per_bucket = state_per_bucket.mapValues(compute_avg)

In [None]:
avg_per_bucket_local = avg_per_bucket.collect()
avg_per_bucket_local[0:4]

Notice in the Spark UI that the job we just executed has two stages. Spark cannot merge operations (like it did with the two map functions above) across a shuffle operation. The first stage corresponds to all map operations before the shuffle plus the preparation of the shuffle (the map phase). The second stage corresponds to the finalization of the shuffle (the reduce phase) plus all maps after the shuffle.
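
The two phases can be sketched in plain Python. This is a simplified model, assuming keys are routed to output partitions by their hash. The function aggregate_by_key below is a hypothetical local emulation, not the Spark implementation.

```python
from collections import defaultdict

def aggregate_by_key(partitions, zero, seq_op, comb_op, num_partitions):
    # Stage 1 (map side): combine values per key inside each input partition.
    map_outputs = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero), value)
        map_outputs.append(local)

    # Shuffle: route each key to an output partition chosen by its hash.
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for local in map_outputs:
        for key, state in local.items():
            shuffled[hash(key) % num_partitions][key].append(state)

    # Stage 2 (reduce side): merge the partial states of each key.
    result = {}
    for out_part in shuffled:
        for key, states in out_part.items():
            merged = states[0]
            for s in states[1:]:
                merged = comb_op(merged, s)
            result[key] = merged
    return result

def accum(state, elt):
    return (state[0] + 1, state[1] + elt)

def merge(s1, s2):
    return (s1[0] + s2[0], s1[1] + s2[1])

parts = [[("a", 1.0), ("b", 2.0), ("a", 3.0)], [("a", 5.0), ("b", 4.0)]]
result = aggregate_by_key(parts, (0, 0.0), accum, merge, num_partitions=4)
print(result["a"], result["b"])  # (3, 9.0) (2, 6.0)
```

All partial states of a given key must meet in the same output partition before they can be merged, which is why the work after the shuffle cannot be fused with the work before it.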

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

avg_per_bucket_local.sort(key=lambda x: x[0][0])

x = [(low+high)/2 for (low, high), _ in avg_per_bucket_local]
y = [avg for _, avg in avg_per_bucket_local]

plt.plot(x, y)

Let's finally check that we have the same number of elements in each bucket.

In [None]:
count_per_bucket = state_per_bucket.mapValues(lambda x: x[0]).collect()

count_per_bucket.sort(key=lambda x: x[0][0])

x = [(low+high)/2 for (low, high), _ in count_per_bucket]
y = [count for _, count in count_per_bucket]

plt.plot(x, y)

Notice that Spark executed everything from the beginning again, although the first stage is identical (we only modified the map of the second stage). By default, Spark never caches the result of intermediate computations. We have to tell Spark to save intermediate results when needed using the rdd.persist() method.

Replay the code above after persisting state_per_bucket. Notice the first stage of the job computing the count per bucket is skipped and that one RDD is persisted in the Storage tab.
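
The effect to look for can be illustrated with a toy model in plain Python. This is only an analogy: ToyRDD is a hypothetical class written for this sketch, and stage_one stands in for the expensive first stage. It counts how many times the first stage runs with and without persisting.

```python
runs = {"stage_one": 0}

def stage_one():
    # Stands in for the expensive first stage (keyBy + aggregateByKey).
    runs["stage_one"] += 1
    return {"a": (3, 9.0), "b": (2, 6.0)}

class ToyRDD:
    """Toy stand-in for an RDD: a recipe that is recomputed unless persisted."""

    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._cache = None

    def persist(self):
        self._persisted = True
        return self

    def _data(self):
        if not self._persisted:
            return self._compute()         # recompute on every use
        if self._cache is None:
            self._cache = self._compute()  # compute once, then reuse
        return self._cache

    def map_values(self, f):
        return ToyRDD(lambda: {k: f(v) for k, v in self._data().items()})

    def collect(self):
        return self._data()

# Without persist: two actions, the first stage runs twice.
state = ToyRDD(stage_one)
state.map_values(lambda s: s[1] / s[0]).collect()
state.map_values(lambda s: s[0]).collect()
runs_without_persist = runs["stage_one"]

# With persist: the first stage runs once and is reused.
runs["stage_one"] = 0
state = ToyRDD(stage_one).persist()
avg = state.map_values(lambda s: s[1] / s[0]).collect()
counts = state.map_values(lambda s: s[0]).collect()
runs_with_persist = runs["stage_one"]

print(runs_without_persist, runs_with_persist)  # 2 1
```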

Finally, stop the Spark application.

In [None]:
ss.stop()