In [1]:
spark

The SparkContext can be accessed from the SparkSession object in either of the following ways.
* spark.sparkContext
* sc

In [2]:
spark.sparkContext

In [3]:
sc
# For backwards compatibility reasons, it’s also still possible to call the SparkContext with sc.

## Creating RDD from Collections Using Parallelize Method

In [4]:
dataList = list(range(0, 10))
dataList

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
dataRdd = sc.parallelize(dataList)
dataRdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:489

In [6]:
dataRdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [7]:
type(dataRdd)

pyspark.rdd.RDD

### The parallelize method takes two parameters
* Collection
* Number of slices (partitions)

In [8]:
sc.parallelize(dataList, 6).glom().collect()

[[0], [1, 2], [3, 4], [5], [6, 7], [8, 9]]
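The partition layout shown above can be reproduced in plain Python. This is only a sketch: the helper `slice_collection` is hypothetical, and the boundary formula is an assumption chosen because it matches the outputs in this notebook, not a statement of Spark's internals.

```python
def slice_collection(items, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size."""
    n = len(items)
    # Slice i covers the index range [i * n // num_slices, (i + 1) * n // num_slices)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


six_slices = slice_collection(list(range(0, 10)), 6)
two_slices = slice_collection([2, 5, 3, 1, 7, 9], 2)
print(six_slices)
print(two_slices)
```

The chunks stay contiguous and differ in length by at most one element, which is why some partitions hold one value and others hold two.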

In [9]:
dataRdd2 = sc.parallelize([2, 5, 3, 1, 7, 9], 2).glom()
sorted(dataRdd2.collect())

[[1, 7, 9], [2, 5, 3]]

In [10]:
keyVal = [('a', 1), ('b', 5), ('c', 7), ('a', 2), ('a', 3), ('b', 4), ('c', 8)]

In [11]:
keyValRdd = sc.parallelize(keyVal)

## ReduceByKey
* The function passed to reduceByKey must be both associative and commutative.
* A reduction operator breaks a task into partial tasks, each computing a partial result; the partial results are then combined into the final result.
* This allows certain serial operations to be performed in **parallel**, reducing the number of steps required.
* Each partial task stores its result in a private copy of the variable.
* These private copies are then merged into a shared copy at the end.
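The partial-result idea described above can be sketched in plain Python without Spark. The helper names (`reduce_partition`, `merge_partials`) and the way the data is split into three partitions are hypothetical, chosen only for illustration.

```python
from functools import reduce
from operator import add


def reduce_partition(pairs, func):
    """Reduce the values of each key within a single partition (a private copy)."""
    partial = {}
    for key, value in pairs:
        partial[key] = func(partial[key], value) if key in partial else value
    return partial


def merge_partials(left, right, func):
    """Merge two partial results into one shared result."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


key_val = [('a', 1), ('b', 5), ('c', 7), ('a', 2), ('a', 3), ('b', 4), ('c', 8)]

# Hypothetical partitioning; a real cluster decides this split itself
partitions = [key_val[:3], key_val[3:5], key_val[5:]]

partials = [reduce_partition(p, add) for p in partitions]
totals = reduce(lambda l, r: merge_partials(l, r, add), partials)
print(sorted(totals.items()))
```

Because `add` is associative and commutative, the totals do not depend on how the pairs are split or on the order in which the partial results are merged.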

In [12]:
newKeyVal = keyValRdd.reduceByKey(lambda a, b: a + b).collect()

In [13]:
newKeyVal

[('b', 9), ('c', 15), ('a', 6)]

#### Alternative way
Passing operator.add instead of a lambda gives the same result. The return type depends on the action used:
* collect method -> returns a list
* collectAsMap method -> returns a map (a Python dict).

In [14]:
from operator import add
newKeyVal = keyValRdd.reduceByKey(add).collect()

## GroupByKey

In [15]:
tupKeyVal = keyValRdd.groupByKey()

In [16]:
list(tupKeyVal.collect()[0][1])

[5, 4]

In [17]:
tupKeyVal.map(lambda para: (para[0], list(para[1]))).collect()

[('b', [5, 4]), ('c', [7, 8]), ('a', [1, 2, 3])]

In [18]:
tupKeyVal.map(lambda para: (para[0], list(para[1]))).collectAsMap()

{'a': [1, 2, 3], 'b': [5, 4], 'c': [7, 8]}
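The grouping above, and the difference between the list and map results, can be mimicked in plain Python. This is a minimal sketch that does not use Spark; the variable names are hypothetical, and on a real cluster the order of keys and of values within a group is not guaranteed.

```python
from collections import defaultdict

key_val = [('a', 1), ('b', 5), ('c', 7), ('a', 2), ('a', 3), ('b', 4), ('c', 8)]

# Collect every value that shares a key, similar in spirit to groupByKey
groups = defaultdict(list)
for key, value in key_val:
    groups[key].append(value)

as_list = sorted(groups.items())   # list of (key, values) pairs, like collect()
as_map = dict(as_list)             # key -> values mapping, like collectAsMap()

# Summing each group gives the same totals as the reduceByKey example
totals = {key: sum(values) for key, values in as_map.items()}
print(as_list)
print(as_map)
print(totals)
```

Grouping keeps every value for a key, so when only an aggregate such as a sum is needed, reducing per key avoids materializing the full lists.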

## Fun Part

In [19]:
from PyFiles import pipeDemo
from PyFiles import pipeDemoFunc

In [20]:
dataStr = ['james', 'john', 'vin']
dataRddStr = sc.parallelize(dataStr)

In [21]:
pipeRdd = dataRddStr.pipe(pipeDemo)
pipeFuncRdd = dataRddStr.map(lambda x : pipeDemoFunc.fun(x)).collect()

In [22]:
dataRddStr.collect()

['james', 'john', 'vin']

In [23]:
pipeFuncRdd

['Hello...james', 'Hello...john', 'Hello...vin']

In [24]:
# This does not work: RDD.pipe expects a shell command string, not an imported Python object
#pipeRdd.collect()
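For context, pipe sends the elements to an external process as lines on its standard input and returns the lines that process writes to standard output. The sketch below imitates that behaviour with `subprocess` for a single list of elements. The helper `pipe_elements` and the inline greeter command are hypothetical stand-ins, since the contents of pipeDemo are not shown here.

```python
import subprocess
import sys


def pipe_elements(elements, command):
    """Write each element as one line to command's stdin; return its stdout lines."""
    stdin_text = "".join(f"{element}\n" for element in elements)
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Hypothetical external program: prefixes every input line with "Hello..."
greeter = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print('Hello...' + line.strip())",
]

piped = pipe_elements(['james', 'john', 'vin'], greeter)
print(piped)
```

With Spark itself, the equivalent would be to pass pipe a command string that names an executable script reachable on every worker, rather than an imported module.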