# RDD: Resilient Distributed Dataset

* The RDD is Spark's fundamental data abstraction.
* It is one of the two low-level APIs in Spark; the other is shared variables (broadcast variables and accumulators).
* You typically don't need it, except for efficient, fine-grained processing that the high-level APIs (such as DataFrames) cannot express.
* SparkContext is the entry point to the RDD functionality in Spark.
* An RDD is an immutable, partitioned collection of records that can be operated on in parallel.

## Things to know

* <strong>[Transformation](http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)</strong>: An operation that defines a new RDD from an existing one, such as map or filter.
* <strong>[Action](http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)</strong>: An operation that triggers the computation and returns a result to the driver, such as collect or count.
* <strong>Lazy Evaluation</strong>: Spark accumulates the transformations on a dataset and evaluates them only when an action is called.
* <strong>DAG</strong>: Directed Acyclic Graph is the execution plan that Spark generates for processing RDDs
* <strong>Partitions</strong>: Subsets of the data that are distributed to the worker nodes so they can be processed in parallel.
* <strong>Broadcast</strong>: A read-only variable cached on each worker node, so tasks read it locally instead of receiving a copy with every task.
* <strong>Accumulators</strong>: Variables that tasks on the workers can only add to; the aggregated value is read in the driver.
* <strong>Shuffle</strong>: A mechanism for redistributing data so that it is grouped differently across partitions.

In [28]:
from pyspark import SparkContext
sc = SparkContext(master="local[3]")  # local mode with 3 worker threads

In [29]:
rdd1 = sc.parallelize(range(10),2)

In [55]:
rdd1

PythonRDD[6] at collect at <ipython-input-36-75ecb6e3744c>:1

In [30]:
print("Default parallelism: {}".format(sc.defaultParallelism))
print("Number of partitions: {}".format(rdd1.getNumPartitions()))
print("Partitioner: {}".format(rdd1.partitioner))
print("Partitions structure: {}".format(rdd1.glom().collect()))

Default parallelism: 3
Number of partitions: 2
Partitioner: None
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]


In [52]:
rdd2 = rdd1.map(lambda x: x*x).collect()  # collect() is an action, so rdd2 is a plain Python list, not an RDD

In [53]:
rdd2

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [47]:
words = ['Data','is','fun',"Napoleon lost at Waterloo"]

In [48]:
wordsRDD = sc.parallelize(words)

In [49]:
flat = wordsRDD.map(lambda line: line.split(' '))  # map keeps one list per input element

In [50]:
print(flat.collect())

[['Data'], ['is'], ['fun'], ['Napoleon', 'lost', 'at', 'Waterloo']]


In [19]:
# Keep the shorter string of each pair, so reduce returns the shortest element
wordsRDD.reduce(lambda w,v: w if len(w) < len(v) else v)

'is'

In [32]:
rdd1.getNumPartitions()

2

In [33]:
rdd1.min()

0

In [34]:
rdd1.max()

9

In [35]:
rdd1.take(3)

[0, 1, 2]

In [36]:
rdd1.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [56]:
rdd1.mean()
rdd1.sum()
rdd1.stdev()
rdd1.variance()
rdd1.stats()

(count: 10, mean: 4.5, stdev: 2.8722813232690143, max: 9.0, min: 0.0)

In [38]:
data = [('AWS', 1), ('AZURE', 2), ('GCP', 3), ('OpenStack', 4), ('Oracle', 5), ('OnPrem', 6)]

In [39]:
sc.parallelize(data)

ParallelCollectionRDD[7] at parallelize at PythonRDD.scala:195

In [41]:
sc.parallelize(data).sortByKey(ascending=True, numPartitions=1).collect()

[('AWS', 1),
 ('AZURE', 2),
 ('GCP', 3),
 ('OnPrem', 6),
 ('OpenStack', 4),
 ('Oracle', 5)]