# RDD: Resilient Distributed Dataset

* The RDD is the most fundamental abstraction that Spark operates on.
* It is one of the two low-level APIs in Spark; the other is shared variables (broadcast variables and accumulators).
* You typically don't need it, except when you need efficient processing that the high-level APIs cannot express.
* `SparkContext` is the entry point to RDD functionality in Spark.
* An RDD is an immutable, partitioned collection of records that can be operated on in parallel.

## Things to know

> * [Transformation](http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations): A manipulation you want to apply to the dataset. A transformation returns a new RDD.
> * [Action](http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions): An operation that triggers the computation and produces the output.
> * <strong>Lazy Evaluation</strong>: Spark accumulates the transformations on a dataset and evaluates them only when an action is called.

![img](img/RDD.png)
* Credits: image taken from [Transformation process in Apache Spark](https://stackoverflow.com/questions/39311616/transformation-process-in-apache-spark/39313146)

> * <strong>DAG</strong>: The Directed Acyclic Graph is the execution plan that Spark generates for processing RDDs.
> * <strong>Partitions</strong>: Subsets of the data that can be passed to the worker nodes and processed in parallel.
> * <strong>Shuffle</strong>: A shuffle redistributes data so that it’s grouped differently across partitions.
> * **Spark Job**: Every action results in a job. Every job has **stages**, and each stage is a collection of **tasks**. A task applies transformations to a single partition of the data and runs on a single executor. Tasks are the lowest level of Spark execution.
> * **Parallelism**: A combination of partitions and available executors. Parallelism largely determines the speed of your job. If you have one partition but many nodes, the job's parallelism is 1. If you have many partitions but only one executor with a single core, the job's parallelism is still 1.

* Flow of a Spark job
![img](img/spark-job.png) 
* Image taken from [Understand RDD Operations: Transformations and Actions](https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/) 

> * <strong>Broadcast</strong>: Use a broadcast variable to store read-only data on the executors so that calculations run faster ('data locality'). Instead of having map functions request the data from the driver node and deserialize it before every use, the shared variable is stored on the executor, so the computation is quick. A typical example is broadcasting a lookup table.
> * <strong>Accumulators</strong>: Use these to update a value inside transformations and bring it back to the driver. A typical use case is summing a value and propagating it to the driver for debugging.

In [None]:
from pyspark import SparkContext
sc = SparkContext(master="local[3]")

In [None]:
rdd1 = sc.parallelize(range(10),2)

In [None]:
rdd1

In [None]:
print("Number of partitions: {}".format(rdd1.getNumPartitions()))
print("Partitioner: {}".format(rdd1.partitioner))
print("Partitions structure: {}".format(rdd1.glom().collect()))

In [None]:
# collect() is an action, so rdd2 holds a plain Python list, not an RDD
rdd2 = rdd1.map(lambda x: x*x).collect()

In [None]:
rdd2

### Spark UI
* While the SparkContext is running, the Spark UI is available at http://localhost:4040 by default.

In [None]:
words = ['Data','is','fun',"Waterloo Data Science Meetup"]

In [None]:
wordsRDD = sc.parallelize(words)

In [None]:
# map() produces one output element per input element, so this yields a list of lists
flat = wordsRDD.map(lambda phrase: phrase.split(' '))

In [None]:
print(flat.collect())

In [None]:
# reduce() keeps the shorter element of each pair, so this returns the shortest string
wordsRDD.reduce(lambda w,v: w if len(w) < len(v) else v)

In [None]:
rdd1.getNumPartitions()

In [None]:
rdd1.min()

In [None]:
rdd1.max()

In [None]:
rdd1.take(3)

In [None]:
rdd1.collect()

### RDDs can also do Stats

In [None]:
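# A notebook cell only displays its last expression, so print each result
print(rdd1.mean())
print(rdd1.sum())
print(rdd1.stdev())
print(rdd1.variance())
print(rdd1.stats())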

In [None]:
data = [('AWS', 1),  ('GCP', 3), ('OpenStack', 4),('AZURE', 2), ('Oracle', 5), ('OnPrem', 6)]

In [None]:
sc.parallelize(data)

In [None]:
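# Sort by key (the provider name) in ascending order, using a single partition
sc.parallelize(data).sortByKey(ascending=True, numPartitions=1).collect()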