**What is Spark, and why use it?**

1. Apache Spark is a data processing framework that can quickly run processing tasks on very large data sets and distribute those tasks across multiple computers.

2. Spark also takes some of the programming burden of these tasks off developers' shoulders with an easy-to-use API.

**Why?**
1. Speed. Spark's in-memory engine means that it can perform tasks up to 100 times faster than MapReduce in certain situations.
2. Developer-friendly API.


**Spark Context**
1. SparkContext is a client of Spark's execution environment.
2. SparkContext is the entry point to Spark functionality.
3. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
4. Only one SparkContext should be active per JVM.
5. SparkContext takes parameters such as master (URL of the cluster it connects to), appName (name of your job), and sparkHome (Spark installation directory).


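**Spark Session (After Spark 2.0)**

Since Spark 2.0, SparkSession is the unified entry point for the DataFrame and SQL APIs; it wraps a SparkContext, which remains available as its sparkContext attribute. The RDD examples below use a SparkContext directly.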

In [1]:
from pyspark import SparkContext
# Run Spark locally in a single process; "practice_app" is the application name.
sc = SparkContext("local", "practice_app")
sc

### RDD - Resilient Distributed Dataset

1. An RDD is a collection of elements spread across multiple nodes of a cluster so that it can be processed in parallel.
2. RDDs are immutable: once you create an RDD, you cannot change it.
3. RDDs are resilient, that is, fault tolerant: if part of an RDD is lost to a failure, Spark recovers it automatically.
4. Two kinds of operations can be applied to an RDD:
    * Transformation: produces a new RDD from an existing one
    * Action: computes a result and returns it to the driver program
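
To make the transformation/action split and the immutability point concrete, here is a plain-Python toy that runs without Spark. `MiniRDD` is a hypothetical class invented for this illustration, not part of PySpark; it only mimics the shape of the API. Transformations return a new object and leave the original untouched, while actions return ordinary Python values.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not a PySpark class)."""

    def __init__(self, items):
        # Store the data in a tuple so it cannot be modified in place.
        self._items = tuple(items)

    # --- transformations: each returns a NEW MiniRDD ---
    def map(self, func):
        return MiniRDD(func(x) for x in self._items)

    def filter(self, predicate):
        return MiniRDD(x for x in self._items if predicate(x))

    # --- actions: each returns a plain Python value ---
    def collect(self):
        return list(self._items)

    def take(self, n):
        return list(self._items[:n])


numbers = MiniRDD([1, 2, 3, 4])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.collect())  # [4, 16]
print(numbers.collect())        # [1, 2, 3, 4] (the source is unchanged)
```

Real RDDs also partition the data across nodes and evaluate transformations lazily, which this toy does not attempt to model.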

**Transformations**
1. map, flatMap
2. filter
3. distinct
4. reduceByKey
5. mapPartitions
6. sortBy
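
As a rough guide to what some of these transformations compute, the sketch below reproduces map, flatMap, filter, distinct, and sortBy on an ordinary Python list. It is a single-machine analogue with made-up input, not PySpark code. In Spark the same operations run across partitions, and distinct makes no promise about output order. mapPartitions is omitted because it only makes sense with partitioned data.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # made-up input

# map: exactly one output element per input element (here, a list of words per line)
mapped = [line.split() for line in lines]

# flatMap: like map, but the per-element results are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements for which the predicate is true
filtered = [word for word in flat_mapped if len(word) > 2]

# distinct: drop duplicates (sorted here only to get a stable order for printing)
distinct = sorted(set(flat_mapped))

# sortBy: order elements by a key function (longest words first)
sorted_by_length = sorted(distinct, key=len, reverse=True)

print(mapped)            # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flat_mapped)       # ['to', 'be', 'or', 'not', 'to', 'be']
print(filtered)          # ['not']
print(distinct)          # ['be', 'not', 'or', 'to']
print(sorted_by_length)  # ['not', 'be', 'or', 'to']
```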

**Actions**
1. collect, collectAsMap
2. reduce
3. countByKey, countByValue
4. take
5. first
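
The actions listed above all return plain values rather than RDDs. The following plain-Python sketch shows, on made-up data, what each one would hand back; it is an analogue for intuition only and does not call PySpark.

```python
from collections import Counter
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3)]  # made-up (key, value) data
words = ["x", "y", "x"]                 # made-up plain elements

# reduce: fold all elements into one value with a two-argument function
total = reduce(lambda x, y: x + y, [value for _, value in pairs])

# countByKey: number of elements per key (the values are ignored)
count_by_key = dict(Counter(key for key, _ in pairs))

# countByValue: number of occurrences of each distinct element
count_by_value = dict(Counter(words))

# take(n) and first: the first n elements, and the first element
first_two = pairs[:2]
first = pairs[0]

# collectAsMap: the pairs as a dict; for duplicate keys only one value survives
as_map = dict(pairs)

print(total)           # 6
print(count_by_key)    # {'a': 2, 'b': 1}
print(count_by_value)  # {'x': 2, 'y': 1}
print(first_two)       # [('a', 1), ('b', 2)]
print(first)           # ('a', 1)
print(as_map)          # {'a': 3, 'b': 2}
```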

In [3]:
# Load the file as an RDD with one element per line.
rdd = sc.textFile("sample.txt")
rdd

sample.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

In [5]:
rdd.take(1)

['The plan was Chett’s. He was the clever one; he’d been steward to old Maester Aemon for four good years before that bastard Jon Snow had done him out so his job could be handed to his fat pig of a friend. When he killed Sam Tarly tonight, he planned to whisper, “Give my love to Lord Snow,” right in his ear before he sliced Ser Piggy’s throat open to let the blood come bubbling out through all those layers of suet. Chett knew the ravens, so he wouldn’t have no trouble there, no more than he would with Tarly. One touch of the knife and that craven would piss his pants and start blubbering for his life. Let him beg, it won’t do him no good. After he opened his throat, he’d open the cages and shoo the birds away, so no messages reached the Wall. Softfoot and Small Paul would kill the Old Bear, Dirk would do Blane, and Lark and his cousins would silence Bannen and old Dywen, to keep them from sniffing after their trail. They’d been caching food for a fortnight, and Sweet Donnel and Clubfo

In [14]:
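# Lowercase each line and split it on whitespace; flatMap flattens the per-line word lists into one RDD of words.
rdd2 = rdd.flatMap(lambda line: line.lower().split())
rdd2, rdd2.take(5)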

(PythonRDD[15] at RDD at PythonRDD.scala:53,
 ['the', 'plan', 'was', 'chett’s.', 'he'])

In [15]:
# Pair every word with an initial count of 1.
rdd3 = rdd2.map(lambda word: (word, 1))
rdd3.take(3)

[('the', 1), ('plan', 1), ('was', 1)]

In [24]:
# Sum the counts of identical words (the keys).
rdd4 = rdd3.reduceByKey(lambda x, y: x + y)
rdd4.take(3)

[('the', 26), ('plan', 1), ('was', 4)]

In [26]:
# Sort the (word, count) pairs by count, highest first.
rdd5 = rdd4.sortBy(lambda pair: pair[1], ascending=False)
rdd5.take(5)

[('the', 26), ('and', 13), ('to', 9), ('he', 8), ('his', 7)]
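
For a small file, the result of the pipeline above can be cross-checked on a single machine with the standard library. This sketch is not part of the original notebook: `word_count` is a hypothetical helper, and it takes a list of lines instead of reading sample.txt so that it stays self-contained.

```python
from collections import Counter


def word_count(lines):
    """Count lowercase, whitespace-separated tokens, most frequent first."""
    counts = Counter()
    for line in lines:
        # Same tokenisation as the flatMap step: lowercase, then split on whitespace
        counts.update(line.lower().split())
    # most_common() orders by count, descending, like sortBy(..., ascending=False)
    return counts.most_common()


sample_lines = ["The plan was simple", "the plan was the plan"]  # made-up input
print(word_count(sample_lines)[:3])  # [('the', 3), ('plan', 3), ('was', 2)]
```

Like the Spark pipeline, this keeps punctuation attached to words (the flatMap output above contains 'chett’s.'); stripping punctuation would need an extra step.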