# Low-Level Understanding Using RDDs
#### Agenda
* SparkContext
* RDD Creation
* RDD Operations
* RDD Transformations
* RDD Actions

## 1. SparkContext
<hr>
* Main entry point for Spark functionality.
* `sc` is already created in the Databricks environment.
* A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs on that cluster.

## 2. RDD Creation
<hr>
* There are two ways to create an RDD:
  1. **Parallelized collection**: convert a Python collection to an RDD with `sc.parallelize`.
  2. **External datasets**: PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc.

In [2]:
# Importing libraries
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import Row

# setting up spark
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [3]:
# creating a dataframe 
df = spark.createDataFrame([Row(id=1, value='value1'),Row(id=2, value='value2')])

# let's have a look what's inside
df.show()

# let's print the schema
df.printSchema()

In [4]:
# register dataframe as query table
df.createOrReplaceTempView('df_view')

# execute SQL query
df_result = spark.sql('select value from df_view where id=2')

# examine contents of result
df_result.show()

# get result as string
df_result.first().value

In [5]:
# defining RDD
rdd = sc.parallelize(range(100))
# action 1
rdd.count()

In [6]:
# action 2
rdd.collect()

## 3. RDD Operations
<hr>
* RDDs support two types of operations: transformations and actions.
* Transformations create a new dataset from an existing one.
* Actions return a value to the driver program after running a computation on the dataset.
* All transformations are lazy: they do not compute their result right away.
* Transformations are only computed when an action requires a result to be returned to the driver program.
* Examples of transformations: map, filter, etc.
* Examples of actions: count, collect, etc.
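The laziness described above can be modeled without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `LazyDataset` is a hypothetical class that only records transformations and runs them when an action (`collect` or `count`) is called. The `calls` list is an assumption added purely to make the evaluation order visible.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable Python collection
        self._steps = tuple(steps)   # recorded (kind, func) pairs, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._steps + (("filter", func),))

    # Actions: run the recorded steps and return a value to the caller.
    def collect(self):
        data = iter(self._source)
        for kind, func in self._steps:
            # map/filter here are the Python built-ins, not the methods above
            data = map(func, data) if kind == "map" else filter(func, data)
        return list(data)

    def count(self):
        return len(self.collect())


calls = []  # records every invocation of the mapped function


def add_ten(x):
    calls.append(x)
    return x + 10


ds = LazyDataset(range(5)).map(add_ten).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
result = ds.collect()              # the action triggers the whole pipeline
calls_after_action = len(calls)    # add_ten ran once per element

print(calls_before_action, result, calls_after_action)  # 0 [10, 12, 14] 5
```

Defining the pipeline costs nothing; the work happens only inside `collect()`, which mirrors why an RDD transformation returns immediately while an action may take a long time.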

## 4. RDD Transformations
<hr>
### map(func) 
- Return a new distributed dataset formed by passing each element of the source through a function func.
* Example : Add 10 to all the numbers

In [8]:
# transformation using map
rdd = sc.parallelize(range(10))
rdd = rdd.map(lambda x:x+10)
rdd.collect()

### filter(func)
- Return a new dataset formed by selecting those elements of the source on which func returns true.
* Example: keep only the even numbers

In [10]:
# transformation using filter
rdd = sc.parallelize(range(10))
rdd = rdd.filter(lambda x:x%2 == 0)
rdd.collect()

### flatMap(func) 
* Similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence rather than a single item.
* Example: for each x, emit x, x+10 and x+100

In [12]:
# transformation using flatMap
rdd = sc.parallelize(range(3))
rdd = rdd.flatMap(lambda x:[x,x+10,x+100])
rdd.collect()
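The difference between `map` and `flatMap` is easiest to see side by side. Below is a minimal pure-Python sketch (plain lists and `itertools`, no Spark involved) of the shape each would return for the same function; the variable names are illustrative only.

```python
from itertools import chain

data = list(range(3))


def expand(x):
    # Same function as in the cell above: one input, three outputs
    return [x, x + 10, x + 100]


# map keeps exactly one output per input, so the result is a list of lists
mapped = [expand(x) for x in data]

# flatMap flattens the per-element results into a single sequence
flat_mapped = list(chain.from_iterable(expand(x) for x in data))

print(mapped)       # [[0, 10, 100], [1, 11, 101], [2, 12, 102]]
print(flat_mapped)  # [0, 10, 100, 1, 11, 101, 2, 12, 102]
```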

### Set Operations 
- union
- intersection
- distinct

In [14]:
# initializing 2 RDDs
rdd1 = sc.parallelize(range(10,30,2))
rdd2 = sc.parallelize(range(8,15))
# Intersection operation
print(rdd1.intersection(rdd2).collect())
# Union operation
print(rdd1.union(rdd2).collect())
# Distinct operation
print(rdd1.distinct().collect())

### Working with Key-Value Pairs
* The data consists of (key, value) tuples, for example [('a',1),('b',2),('c',3),('d',4)]. Here 'a', 'b', ... are the keys and 1, 2, ... are the values.
* Common operations on such RDDs: groupByKey, reduceByKey, aggregateByKey, sortByKey, combineByKey

In [16]:
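# createCombiner: called the first time a key appears in a partition; d is that key's value
def mystr(d):
    print('In MyStr')
    return d

# mergeValue: called for the second and later values of the same key within a partition
def myconcat(a,b):
    print('In MyConcat')
    return a + b

# mergeCombiners: merges the per-partition results for the same key across partitions
def mypartConcat(a,b):
    print('In myPartConcat')
    return a + b

# 8 key-value pairs distributed over 2 partitions
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2),("a",8),("c",4), ("a", 12),("a",18),("c",14)],2)

# mystr turns a value (type V) into the initial combiner (type C); here both are ints, so the result is a per-key sum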
rdd.combineByKey(mystr, myconcat, mypartConcat).collect()
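To see when each of the three functions is invoked, the flow can be modeled in plain Python. This is a simplified sketch of the semantics, not Spark's implementation: `combine_by_key`, `create_combiner`, `merge_value` and `merge_combiners` are hypothetical names, and the split of the eight pairs into two partitions of four is an assumption about how the data would be distributed.

```python
from collections import Counter


def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Pure-Python model of combineByKey over a list of partitions of (key, value) pairs."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key not in combiners:
                # first time this key is seen in this partition
                combiners[key] = create_combiner(value)
            else:
                # key already seen in this partition
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # merge the per-partition combiners of each key
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


call_counts = Counter()  # how often each function was invoked


def create_combiner(value):
    call_counts["create_combiner"] += 1
    return value


def merge_value(acc, value):
    call_counts["merge_value"] += 1
    return acc + value


def merge_combiners(a, b):
    call_counts["merge_combiners"] += 1
    return a + b


# Assumed layout: the 8 pairs split evenly into 2 partitions
partitions = [
    [("a", 1), ("b", 1), ("a", 2), ("a", 8)],
    [("c", 4), ("a", 12), ("a", 18), ("c", 14)],
]

totals = combine_by_key(partitions, create_combiner, merge_value, merge_combiners)
print(totals)             # {'a': 41, 'b': 1, 'c': 18}
print(dict(call_counts))  # {'create_combiner': 4, 'merge_value': 4, 'merge_combiners': 1}
```

Only key `a` occurs in both partitions, so it is the only key that needs a cross-partition merge in this model.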

### Changing number of partitions
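* This can be achieved using `coalesce` and `repartition`. `coalesce` reduces the number of partitions without a full shuffle, while `repartition` shuffles the data and can either increase or decrease the number of partitions.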

In [18]:
# initializing RDD
rdd = sc.parallelize(range(100),5)
# number of already existing partitions
print("Number of already existing partitions: ",rdd.getNumPartitions())
# reduce the number of partitions (coalesce avoids a full shuffle)
rdd = rdd.coalesce(3)
print("After coalesce: ",rdd.getNumPartitions())
# repartition shuffles the data and can also increase the number of partitions
print("After repartition: ",rdd.repartition(6).getNumPartitions())

### Preventing recomputation of RDD 
* Use cache() or persist() to keep an RDD that is reused by several actions.
* Cached data consumes memory.
* Free the cache with unpersist() once the data is no longer needed.

In [20]:
# initializing RDD
rdd = sc.parallelize(range(10000))
rdd.cache()
rdd1 = rdd.map(lambda x: x+2)
rdd2 = rdd.map(lambda x:x+3)
print(rdd1.count())
print(rdd2.count())

# Remove data from cache
rdd.unpersist()
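The effect of caching on recomputation can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not how Spark stores data: `Dataset` and `expensive_source` are hypothetical names, and the `runs` counter exists only to show how often the source is recomputed.

```python
class Dataset:
    """Hypothetical model of a lazily computed dataset with an optional cache."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the data
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True     # only marks the dataset; nothing is computed yet
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None        # release the memory held by the cache
        return self

    def _data(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()   # materialized by the first action
            return self._cached
        return self._compute()                   # recomputed on every action

    def map(self, func):
        return Dataset(lambda: [func(x) for x in self._data()])

    def count(self):
        return len(self._data())


runs = {"n": 0}  # number of times the source was computed


def expensive_source():
    runs["n"] += 1
    return list(range(10000))


base = Dataset(expensive_source)

# Without cache: each action recomputes the source
base.map(lambda x: x + 2).count()
base.map(lambda x: x + 3).count()
runs_without_cache = runs["n"]

# With cache: the source is computed once and then reused
runs["n"] = 0
base.cache()
base.map(lambda x: x + 2).count()
base.map(lambda x: x + 3).count()
runs_with_cache = runs["n"]
base.unpersist()

print(runs_without_cache, runs_with_cache)  # 2 1
```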

## 5. RDD Actions
<hr>
* Computation of an RDD starts only when an action is called on it.
* **collect** - brings the transformed data from all executors to the driver. Recommended only for small datasets, such as learning examples, because the whole result must fit in driver memory.
* **count** - returns the number of elements in the dataset.
* **first** - returns the first element of the dataset.
* **take(n)** - returns a list with the first n elements of the dataset.
* **saveAsTextFile** - saves the contents of the RDD as text files in the given directory.
* **foreach(func)** - runs func on every element of the dataset; the function executes on the executors.

In [22]:
def f(e):
    print(e)
    
# print() runs on the executors, so on a cluster the output goes to the executor logs, not to the driver
sc.parallelize([1, 2, 3, 4, 5]).foreach(f)