## Low Level Understanding using RDD
#### Agenda
<hr>
1. SparkContext
2. RDD Creation
3. RDD Operations
4. RDD Transformations
5. RDD Actions

<hr>

### 1. SparkContext
<hr>
* Main entry point for Spark functionality.
* `sc` is already created in the Databricks environment.
* A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs on that cluster.

### 2. RDD Creation
* There are two ways to create an RDD
  - Parallelized Collection : Convert a Python collection to an RDD
  - External Datasets : PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc.

In [4]:
rdd = sc.parallelize(list('abcdefghi'))

In [5]:
rdd = sc.textFile('/abc.csv')

In [6]:
rdd.count()

In [7]:
rdd.collect()

### 3. RDD Operations
<hr>
* RDDs support two types of operations - transformations & actions.
* Transformations create a new dataset from an existing one.
* Actions return a value to the driver program after running a computation on the dataset.
* All transformations are lazy; they do not compute their result right away.
* Transformations are computed only when an action requires a result to be returned to the driver program.
* Examples of transformations - map, filter etc.
* Examples of actions - count, collect etc.
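The lazy behaviour described above can be modelled in plain Python. This is only a sketch: `LazyDataset`, `add_ten` and `calls` are hypothetical names made up for illustration, and the class is not how Spark is implemented. It records transformations and runs them only when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # pending transformations, in order

    # Transformations: return a new dataset and do no work yet
    def map(self, func):
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._data, self._steps + (("filter", func),))

    def _run(self):
        # Inside a method, map/filter refer to the Python builtins,
        # which are themselves lazy iterators
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: these finally pull data through the recorded steps
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def add_ten(x):
    calls.append(x)  # record each invocation so laziness is observable
    return x + 10

ds = LazyDataset(range(5)).map(add_ten).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # 0: nothing has run yet
result = ds.collect()             # the action triggers the whole chain
calls_after_action = len(calls)   # 5: add_ten ran once per element
```

Here `result` is `[10, 12, 14]`, and `add_ten` is not called at all until `collect()` runs.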

### 4. RDD Transformations
<hr>
* map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.
* Example : Add 10 to all the numbers

In [10]:
rdd = sc.parallelize(range(10))
rdd = rdd.map(lambda x:x+10)
rdd.collect()

* filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.
* Example : Keep only the even numbers

In [12]:
rdd = sc.parallelize(range(10))
rdd = rdd.filter(lambda x:x%2 == 0)
rdd.collect()

* flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence rather than a single item.
* Example : For each x, emit x, x+10 and x+100

In [14]:
rdd = sc.parallelize(range(3))
rdd = rdd.flatMap(lambda x:[x,x+10,x+100])
rdd.collect()
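To see how flatMap differs from map, the same function can be applied both ways in plain Python. This sketch uses no Spark; `expand`, `mapped` and `flat_mapped` are names chosen here for illustration.

```python
from itertools import chain

data = range(3)

def expand(x):
    return [x, x + 10, x + 100]

# Like map: one output item (here a list) per input item
mapped = [expand(x) for x in data]

# Like flatMap: the per-item lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(expand(x) for x in data))
```

`mapped` holds three nested lists, while `flat_mapped` holds nine plain numbers.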

* Set operations - union, intersection & distinct. union keeps duplicates; intersection and distinct return each element only once.

In [16]:
rdd1 = sc.parallelize(range(10,30,2))
rdd2 = sc.parallelize(range(8,15))

In [17]:
rdd1.intersection(rdd2).collect()

In [18]:
rdd1.union(rdd2).collect()

In [19]:
rdd1.distinct().collect()
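The contents these three operations produce for the data above can be checked with plain Python lists. This is a sketch without Spark; the variable names are made up for illustration, and the sorting is only for readability, since Spark does not guarantee the order of the output of intersection or distinct.

```python
a = list(range(10, 30, 2))  # same values as rdd1
b = list(range(8, 15))      # same values as rdd2

# union: simply all elements of both, duplicates included
union_like = a + b

# intersection: elements present in both, each reported once
intersection_like = sorted(set(a) & set(b))

# what union followed by distinct would contain
distinct_union = sorted(set(union_like))
```

Since 10, 12 and 14 occur in both inputs, they appear twice in `union_like` and once each in `intersection_like` and `distinct_union`.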

* Working with Key-Value Pairs
* The data is a collection of pairs such as [('a',1),('b',2),('c',3),('d',4)]. 'a','b', .. act as keys & 1,2, .. act as values.
* Available operations include groupByKey, reduceByKey, aggregateByKey, sortByKey & combineByKey.
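What groupByKey, reduceByKey and sortByKey compute can be sketched in plain Python on a small list of pairs. The sample data and variable names are hypothetical, and the sketch ignores partitioning entirely.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]  # hypothetical sample data

# Like groupByKey: collect all values of a key together
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Like reduceByKey with addition: fold the values of a key into one result
reduced = {}
for key, value in pairs:
    reduced[key] = reduced[key] + value if key in reduced else value

# Like sortByKey: order the pairs by their key
sorted_by_key = sorted(pairs, key=lambda kv: kv[0])
```

Grouping keeps every value per key, whereas reducing keeps only one combined value per key.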

In [21]:
# createCombiner: invoked the first time a key appears in a partition; d is that key's value
def mystr(d):
    print('In MyStr')
    return d

# mergeValue: invoked from the 2nd occurrence onwards of the same key in the same partition
def myconcat(a, b):
    print('In MyConcat')
    return a + b

# mergeCombiners: merges the per-partition results of the same key across partitions
def mypartConcat(a, b):
    print('In myPartConcat')
    return a + b

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2),("a",8),("c",4), ("a", 12),("a",18),("c",14)],2)

# mystr converts the first value (type V) of a key into the combiner (type C)
rdd.combineByKey(mystr, myconcat, mypartConcat).collect()

# Printed output (the order across partitions may vary):
# In MyStr
# In MyStr
# In MyConcat
# In MyConcat
# In MyStr
# In MyStr
# In MyConcat
# In MyConcat
# In myPartConcat
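The call counts for the three functions can be reproduced with a plain-Python model of combineByKey. This is a sketch under assumptions: `combine_by_key` and its helpers are hypothetical names, and the 8 pairs are assumed to be split into two partitions of 4 in their original order. It is not Spark's implementation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python model of combineByKey over a list of partitions."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key not in combiners:
                # first time the key is seen in this partition
                combiners[key] = create_combiner(value)
            else:
                # same key seen again in the same partition
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # merge the per-partition combiners of each key
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged


calls = {"create": 0, "merge_value": 0, "merge_combiners": 0}

def create(v):
    calls["create"] += 1
    return v

def merge_v(c, v):
    calls["merge_value"] += 1
    return c + v

def merge_c(c1, c2):
    calls["merge_combiners"] += 1
    return c1 + c2

# Assumed partition layout (hypothetical): first 4 pairs, then last 4 pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 2), ("a", 8)],
    [("c", 4), ("a", 12), ("a", 18), ("c", 14)],
]
totals = combine_by_key(partitions, create, merge_v, merge_c)
```

Under this layout the combiner is created 4 times, values are merged 4 times, and only key 'a' needs a cross-partition merge, giving `{'a': 41, 'b': 1, 'c': 18}`.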

* Changing number of partitions
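* This can be achieved using coalesce & repartition. coalesce reduces the number of partitions without a full shuffle; repartition can increase or decrease it and always shuffles the data.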

In [23]:
rdd = sc.parallelize(range(100),5)

In [24]:
rdd.getNumPartitions()

In [25]:
rdd = rdd.coalesce(3)

In [26]:
rdd.getNumPartitions()

In [27]:
rdd.repartition(6).getNumPartitions()

* cache() & persist() prevent recomputation of an RDD that is used more than once.
* Cached data consumes memory.
* Free the cache with unpersist() once the data is no longer needed.

In [29]:
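rdd = sc.parallelize(range(10000))
rdd.cache()  # only marks the RDD for caching; it is materialized by the first action
rdd1 = rdd.map(lambda x: x + 2)
rdd2 = rdd.map(lambda x: x + 3)
print(rdd1.count())  # computes rdd and stores it in the cache
print(rdd2.count())  # reuses the cached rdd instead of recomputing it

# Remove the data from the cache
rdd.unpersist()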
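The effect of caching on recomputation can be modelled in plain Python. `ToyRDD`, `source` and `runs` are hypothetical names for this sketch, which only imitates the idea and says nothing about how Spark stores cached partitions.

```python
class ToyRDD:
    """Toy stand-in: recomputes from its source on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute  # function that produces the data
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True   # only a marker; nothing is computed here
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None      # release the stored data
        return self

    def _materialize(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return list(self._compute())

    def map(self, func):
        # the child pulls its input from this dataset when an action runs
        return ToyRDD(lambda: (func(x) for x in self._materialize()))

    def count(self):
        return len(self._materialize())


runs = []

def source():
    runs.append(1)               # count how often the source is recomputed
    return range(10000)

base = ToyRDD(source)
base.map(lambda x: x + 2).count()
base.map(lambda x: x + 3).count()
runs_without_cache = len(runs)   # 2: the source is evaluated once per action

runs.clear()
base.cache()
base.map(lambda x: x + 2).count()
base.map(lambda x: x + 3).count()
runs_with_cache = len(runs)      # 1: evaluated once, then reused
base.unpersist()
```

Without the cache each action walks back to the source; with it, the second action reuses the stored data.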

### 5. RDD Actions
<hr>
* Computation of an RDD starts only when an action is called on it.
* collect - bring the transformed data from all executors to the driver. Recommended only for learning or small datasets, because all the data must fit in the driver's memory.
* count - return the number of elements in the dataset.
* first - return the first element of the dataset.
* take(n) - return a list with the first n elements of the dataset.
* saveAsTextFile(path) - save the contents of the RDD as text. path becomes a directory that holds one part file per partition.

In [31]:
rdd.saveAsTextFile('text.txt')

* foreach(func) - func is applied to every element of the RDD on the executors; nothing is returned to the driver.

In [33]:
def f(e):
    print(e)

# The print runs on the executors, so on a cluster the output goes to the executor logs, not to the driver
sc.parallelize([1, 2, 3, 4, 5]).foreach(f)