# First RDD

Importing SparkConf (configuration) and SparkContext

In [1]:
from pyspark import SparkConf, SparkContext

Defining a configuration as conf

In [2]:
conf = SparkConf().setAppName("RDD_TRANSFORMATIONS_1")

Creating a SparkContext sc and passing it the configuration created above

In [3]:
sc = SparkContext.getOrCreate(conf=conf)

Taking a simple data collection (a list) and creating an RDD by parallelizing it. After that, applying the take() action

In [4]:
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
rdd.take(5)

[1, 2, 3, 4, 5]

# Reading from files

In [11]:
rdd1 = sc.textFile("file:///D:/Code/big-data-stack/pyspark-rdd-operations/data/sample.txt")
rdd1.collect()

['Hello from pySpark']

# Persistence and Caching

### RDD Persistence

In [14]:
oddNums = sc.parallelize(range(1, 1000, 2))
oddNums.take(10)
# persist() only marks the RDD for storage; the data is stored when the next action runs
oddNums.persist()

PythonRDD[24] at RDD at PythonRDD.scala:53

### RDD Caching

In [15]:
rdd2 = sc.textFile("file:///D:/Code/big-data-stack/pyspark-rdd-operations/data/sample.txt")
rdd2.cache()

file:///D:/Code/big-data-stack/pyspark-rdd-operations/data/sample.txt MapPartitionsRDD[26] at textFile at NativeMethodAccessorImpl.java:0

# Spark Cache vs Persist

Using the <strong>cache()</strong> and <strong>persist()</strong> methods, Spark provides an optimization mechanism to store the intermediate computation of an RDD, DataFrame, or Dataset so that the result can be reused in subsequent actions.

Both caching and persisting save a Spark RDD, DataFrame, or Dataset. The difference is that the <strong>RDD cache()</strong> method always saves to memory (MEMORY_ONLY), whereas the <strong>persist()</strong> method accepts a user-defined storage level. Called without an argument, as in the persistence cell above, RDD persist() also defaults to MEMORY_ONLY.

When you persist a dataset, each node stores its partitions of the data (in memory, or on disk if the storage level allows it) and reuses them in other actions on that dataset. Spark's persisted data is also fault-tolerant: if any partition of a dataset is lost, it is automatically recomputed using the original transformations that created it.

# Operations of RDDs

There are two types of data operations we can perform on an RDD: <strong>transformations</strong> and <strong>actions</strong>.
- <strong>Transformations</strong> : return a new RDD, because RDDs are <strong>immutable</strong> and cannot be changed in place.
- <strong>Actions</strong> : return a value to the driver program (or write the data to external storage).

## Transformations on RDDs

Transformations are lazy operations on an RDD that create one or more new RDDs.

RDD transformations return a pointer to the new RDD and allow us to create dependencies between RDDs. Each RDD in the dependency chain has a function to calculate its data and a pointer to its parent RDD.

Spark is lazy, so nothing is executed when we call a transformation. Only an action triggers job creation and execution.
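The idea of a dependency chain that runs only when an action is called can be illustrated with a small pure-Python toy model. This is not how Spark is implemented; `LazyDataset`, `traced_double`, and the other names are hypothetical and exist only for this sketch.

```python
class LazyDataset:
    """Toy stand-in for an RDD: keeps a parent pointer and a compute function."""

    def __init__(self, source=None, parent=None, compute=None):
        self.source = source    # only set on the root dataset
        self.parent = parent    # pointer to the parent dataset
        self.compute = compute  # function applied to the parent's items

    def map(self, fn):
        # Transformation: records the step and returns a new dataset, runs nothing.
        return LazyDataset(parent=self, compute=lambda items: (fn(i) for i in items))

    def filter(self, pred):
        # Transformation: same pattern, still nothing is executed.
        return LazyDataset(parent=self, compute=lambda items: (i for i in items if pred(i)))

    def _iterate(self):
        # Walk the chain back to the root, then apply each recorded function.
        if self.parent is None:
            return iter(self.source)
        return self.compute(self.parent._iterate())

    def collect(self):
        # Action: the only place where the chain is actually evaluated.
        return list(self._iterate())


calls = []

def traced_double(n):
    calls.append(n)  # record every real invocation
    return n * 2

base = LazyDataset(source=[1, 2, 3, 4, 5])
doubled = base.map(traced_double)
large = doubled.filter(lambda n: n > 4)

calls_before_action = len(calls)  # transformations alone have not run anything
result = large.collect()          # the action triggers the whole chain
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

In this model, `traced_double` is never called until `collect()` runs, which mirrors how Spark defers work until an action such as collect() or take().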

## Map Transformation

Passes each element through a function and returns a new RDD containing the results, one output element per input element

In [17]:
x = sc.parallelize(["apple", "grapes", "banana", "orange", "kiwi"])
y = x.map(lambda fruit: (fruit, 1))
y.collect()

[('apple', 1), ('grapes', 1), ('banana', 1), ('orange', 1), ('kiwi', 1)]

## FlatMap Transformation

It is similar to the map transformation, but here each input item can be mapped to 0 or more output items, so the function should return a sequence rather than a single item. The returned sequences are flattened into one RDD

In [20]:
rdd3 = sc.parallelize([2, 3, 4])
result = rdd3.flatMap(lambda x: range(1, x))
result.collect()

[1, 1, 2, 1, 2, 3]

In [22]:
rdd3 = sc.parallelize([2, 3, 4])
result = rdd3.map(lambda x: range(1, x))
result.collect()

[range(1, 2), range(1, 3), range(1, 4)]
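The difference between the two results above can be reproduced with plain Python lists, which may help in reasoning about what flatMap does without a cluster. The helpers `local_map` and `local_flat_map` are hypothetical names for this sketch and are not part of PySpark.

```python
from itertools import chain

def local_map(fn, items):
    # One output per input: sequences returned by fn stay nested.
    return [fn(item) for item in items]

def local_flat_map(fn, items):
    # Apply fn to every item, then flatten the returned sequences by one level.
    return list(chain.from_iterable(fn(item) for item in items))

data = [2, 3, 4]

nested = local_map(lambda x: list(range(1, x)), data)
flattened = local_flat_map(lambda x: range(1, x), data)

# An input can also contribute zero outputs: range(1, 1) is empty.
dropped = local_flat_map(lambda x: range(1, x), [1])

print(nested)
print(flattened)
print(dropped)
```

The flattening removes exactly one level of nesting, so a function that returns an empty sequence simply contributes nothing to the result.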

## Filter Transformation

Keeps only the elements for which the given function returns True

In [26]:
rdd4 = sc.parallelize(range(1, 6))
rdd4.filter(lambda x: x%2 == 0).collect()

[2, 4]