<p>In-memory computation and parallel processing are the two main reasons Apache Spark is widely used in industry and is regarded as one of the leading frameworks for handling and analysing big data.</p>
<p>The RDD (Resilient Distributed Dataset) forms one of the backbones of Apache Spark.</p>
<p>It is one of Spark's fundamental data structures.</p>
<p>It is a schema-less structure that can handle both structured and unstructured data.</p>
<p>The data in an RDD is split into chunks called partitions, which are dispersed across the executor nodes. For key-value RDDs, the partitioning can be based on the key.</p>
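<p>RDDs are highly resilient, i.e. they can recover from failures. Spark records the lineage of transformations used to build each RDD, so if an executor fails, the lost partitions can be recomputed on another executor. Data chunks are only replicated across executor nodes when a replicated storage level is explicitly requested while persisting the RDD.</p>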
<p>Distributing the data across executors also lets us run functional computations against our dataset quickly by harnessing the power of multiple nodes.</p>
<p>RDDs support two types of operations: transformations and actions.</p>
<p>Transformations are operations applied to an RDD to form a new RDD. Transformations work on the principle of lazy evaluation: when we call a transformation on an RDD, it does not execute immediately. Instead, Spark records the requested operations in a DAG (directed acyclic graph) and only runs them when an action is called on the data. Hence, with lazy evaluation, the data is not loaded or processed until necessary. This lets Spark optimize the required computation, and the recorded lineage allows lost data partitions to be recomputed.</p>
<p>Actions are operations performed on an RDD that instruct Spark to run the computation and either pass the result back to the driver or write it out to storage such as a distributed file system. The moment an action is invoked, all the pending transformations in the pipeline are executed.</p>
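<p>The idea of recording transformations and running them only when an action is called can be sketched in plain Python. This is a toy model, not Spark's implementation: the class <code>LazyDataset</code> and the helper <code>double</code> are hypothetical names invented for illustration, and no partitioning or distribution is modelled.</p>

```python
class LazyDataset:
    """Toy stand-in for an RDD: source data plus a recorded list of steps."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # the recorded "lineage"

    def map(self, fn):
        # Transformation: record the step and return a new dataset; compute nothing.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: same idea, only the kind of step differs.
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: replay every recorded step over the source data.
        items = iter(self.source)
        for kind, fn in self.steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the steps were recorded
result = ds.collect()             # the action triggers the whole pipeline
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)  # 0 [6, 8] 4
```

<p>Because each dataset keeps its source and its recorded steps, calling <code>collect()</code> again simply replays them. This mirrors, in a very simplified way, how a recorded lineage lets a lost partition be rebuilt.</p>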

<ul><b>Transformations</b>
<li>map</li>
<li>flatMap</li>
<li>filter</li>
<li>distinct</li>
<li>reduceByKey</li>
<li>mapPartitions</li>
</ul>

<ul><b>Actions</b>
<li>collect</li>
<li>collectAsMap</li>
<li>reduce</li>
<li>countByKey</li>
<li>take</li>
<li>countByValue</li>
</ul>

<h2>Three ways to create an RDD</h2>
<ul>
<li>From a parallelized collection</li>
<li>From another RDD</li>
<li>From external data sources such as HDFS, Amazon S3, HBase, etc.</li>
</ul>

In [2]:
import findspark
findspark.init()
import pyspark
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[2]").setAppName("Sec_RDD_Tut")
sc = SparkContext(conf=conf)

# Creating an RDD from a parallelized collection

In [3]:
# To create an RDD from a parallelized collection we use the sc.parallelize method.
# sc stands for SparkContext, which can also be reached from a SparkSession via its sparkContext attribute.
# Before Spark 2.0, SparkContext, SQLContext and HiveContext had to be created separately;
# SparkSession now provides a single entry point that wraps them.
# StreamingContext is still created separately, from a SparkContext.

# sc.parallelize is SparkContext's parallelize method for creating a parallelized collection 
# and it helps Spark to distribute data across multiple nodes instead of depending on a single
# node to process the data

myRDD = sc.parallelize([('Ross', 19), ('Joey', 18), ('Rachael', 16), ('Phoebe', 18), ('Chandler', 17), ('Monica', 20)])

In [4]:
# When an action is invoked, all the computations recorded in the lineage graph of the
# transformations performed on the RDD are executed at once.
# A common approach in Spark is to use the collect() method, which returns all the values
# in the RDD from the Spark worker nodes to the driver node. This can hurt performance
# when working with large amounts of data, because a large volume of data is transferred
# from the worker nodes to the driver node and must fit in the driver's memory.
# For a small amount of data this is fine, but as a matter of habit, prefer take(n),
# which returns only the first n elements.

myRDD.take(3)

[('Ross', 19), ('Joey', 18), ('Rachael', 16)]

# Creating RDDs from text files

In [5]:
new_RDD = sc.textFile("file:///home/boom/Documents/programming/pyspark/data_files/rd*")

In [7]:
new_RDD.take(3)

['Hey there where ya goin’, not exactly knowin’, who says you have to call just one place home. He’s goin’ everywhere, B.J. McKay and his best friend Bear. He just keeps on movin’, ladies keep improvin’, every day is better than the last. New dreams and better scenes, and best of all I don’t pay property tax. Rollin’ down to Dallas, who’s providin’ my palace, off to New Orleans or who knows where. Places new and ladies, too, I’m B.J. McKay and this is my best friend Bear.',
 '',
 'Children of the sun, see your time has just begun, searching for your ways, through adventures every day. Every day and night, with the condor in flight, with all your friends in tow, you search for the Cities of Gold. Ah-ah-ah-ah-ah… wishing for The Cities of Gold. Ah-ah-ah-ah-ah… some day we will find The Cities of Gold. Do-do-do-do ah-ah-ah, do-do-do-do, Cities of Gold. Do-do-do-do, Cities of Gold. Ah-ah-ah-ah-ah… some day we will find The Cities of Gold.']

In [10]:
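csv_RDD = sc.textFile("file:///home/boom/Documents/programming/pyspark/my_data.csv", minPartitions=4).map(lambda element: element.split("\t"))
# map is used here to split each line (a string) on tab characters,
# transforming the RDD of strings into an RDD of lists of fields.
# Note: this assumes the file is tab-separated despite its .csv extension.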