# RDD: Resilient Distributed Dataset

* The RDD is the fundamental data abstraction in Spark.
* It's one of the two low-level APIs in Spark, the other being shared variables (broadcast variables and accumulators).
* You typically don't need to use it, except when you need efficient processing that the high-level APIs cannot give you.
* SparkContext is the entry point to the RDD functionality in Spark.
* RDDs are immutable, partitioned collections of records that can be operated on in parallel.

## Things to know

> * [Transformation](http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations): the manipulation you want to apply to the dataset. A transformation returns a new RDD.
> * [Action](http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions): you use an action to trigger the computation and generate the output.
> * <strong>Lazy Evaluation</strong>: Spark accumulates the transformations on a dataset and evaluates them only when an action is called.

![img](img/RDD.png)
* Credits: image taken from [Transformation process in Apache Spark](https://stackoverflow.com/questions/39311616/transformation-process-in-apache-spark/39313146)

> * <strong>DAG</strong>: the Directed Acyclic Graph is the execution plan that Spark generates for processing RDDs.
> * <strong>Partitions</strong>: subsets of the data that can be passed to the worker nodes and processed in parallel.
> * <strong>Shuffle</strong>: a shuffle is a method for re-distributing data so it’s grouped differently across partitions.
> * **Spark Job**: Every action results in a job. Every job has **stages**, and stages are collections of **tasks**. A task applies the stage's transformations to a single partition and runs on a single executor. Tasks are the lowest level of Spark execution.
> * **Parallelism**: the combination of partitions and the cores available across the nodes. Parallelism largely determines the speed of your job. If you have one partition but many nodes, the job's parallelism is 1. If you have many partitions but a single node with a single core, the job's parallelism is still 1.

* Flow of a Spark job
![img](img/spark-job.png) 
* Image taken from [Understand RDD Operations: Transformations and Actions](https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/) 

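> * <strong>Broadcast</strong>: a read-only variable that is stored on every worker node, so tasks can use the data locally and calculations run faster ('data locality').
> * <strong>Accumulators</strong>: shared variables that tasks can only add to, used to collect values (for example counters and sums) back on the driver.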

In [2]:
from pyspark import SparkContext
sc = SparkContext(master="local[3]")

In [3]:
rdd1 = sc.parallelize(range(10),2)

In [4]:
rdd1

PythonRDD[1] at RDD at PythonRDD.scala:53

In [5]:
print("Number of partitions: {}".format(rdd1.getNumPartitions()))
print("Partitioner: {}".format(rdd1.partitioner))
print("Partitions structure: {}".format(rdd1.glom().collect()))

Number of partitions: 2
Partitioner: None
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]


In [6]:
squares = rdd1.map(lambda x: x*x).collect()  # collect() returns a Python list, not an RDD

In [7]:
squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

### Spark UI
* While the SparkContext is running, the Spark UI is available at http://localhost:4040 (the default port).

In [8]:
words = ['Data','is','fun',"Waterloo Data Science Meetup"]

In [9]:
wordsRDD = sc.parallelize(words)

In [10]:
flat = wordsRDD.map(lambda sentence: sentence.split(' '))  # map keeps one (nested) list per input element

In [11]:
print(flat.collect())

[['Data'], ['is'], ['fun'], ['Waterloo', 'Data', 'Science', 'Meetup']]


In [12]:
wordsRDD.reduce(lambda w,v: w if len(w) < len(v) else v)

'is'

In [13]:
rdd1.getNumPartitions()

2

In [14]:
rdd1.min()

0

In [15]:
rdd1.max()

9

In [16]:
rdd1.take(3)

[0, 1, 2]

In [17]:
rdd1.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

### RDDs can also do Stats

In [18]:
rdd1.mean()
rdd1.sum()
rdd1.stdev()
rdd1.variance()
rdd1.stats()

(count: 10, mean: 4.5, stdev: 2.8722813232690143, max: 9.0, min: 0.0)

In [22]:
data = [('AWS', 1),  ('GCP', 3), ('OpenStack', 4),('AZURE', 2), ('Oracle', 5), ('OnPrem', 6)]

In [23]:
sc.parallelize(data)

ParallelCollectionRDD[19] at parallelize at PythonRDD.scala:195

In [24]:
sc.parallelize(data).sortByKey(ascending=True, numPartitions=1).collect()  # sorts by the first element of each pair

[('AWS', 1),
 ('AZURE', 2),
 ('GCP', 3),
 ('OnPrem', 6),
 ('OpenStack', 4),
 ('Oracle', 5)]