<li>https://www.youtube.com/watch?v=e5ol7oyKV0A&t=633s </li>

<p>In-memory computation and parallel processing are the two major reasons why Apache Spark is widely used in industry and is regarded as one of the best frameworks for handling and analysing big data.</p>
<p>The RDD (Resilient Distributed Dataset) forms one of the backbones of Apache Spark.</p>
<p>It is one of Spark's fundamental data structures.</p>
<p>It is a schema-less structure that can handle both structured and unstructured data.</p>
<p>The data in an RDD is split into partitions (by key, for key-value RDDs that use a partitioner), and the partitions are distributed across the executor nodes.</p>
<p>RDDs are resilient, i.e. they can recover from failures: each RDD records the lineage of transformations that produced it, so if an executor fails, the lost partitions are recomputed on another executor. Persisted RDDs can additionally be replicated across nodes by choosing a replicated storage level.</p>
<p>Partitioning also lets us run functional computations over the dataset quickly, in parallel, by harnessing the power of multiple nodes.</p>
<p>RDDs support two types of operations: transformations and actions.</p>
<p>Transformations are operations applied to an RDD to form a new RDD. They work on the principle of lazy evaluation: when we call a transformation, it does not execute immediately. Instead, Spark records the requested operations in a DAG (directed acyclic graph) and runs them only when an action is called on the data. Hence, with lazy evaluation, the data is not loaded until necessary. This lets Spark optimize the required computation, and the recorded DAG is what allows a lost data partition to be recomputed.</p>
<p>Actions are operations performed on an RDD that instruct Spark to run the computation and pass the result back to the driver. The moment an action is invoked, all the pending computations in the pipeline are executed. The result is either returned to the driver or written to storage such as a distributed file system.</p>
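<p>The lazy behaviour described above can be sketched without Spark. The following is a plain-Python toy, not Spark's implementation: the class <code>LazyPipeline</code> and the helper <code>double</code> are hypothetical names made up for this illustration. Transformations only record a step; nothing runs until an action (<code>take</code> or <code>collect</code>) is called.</p>

```python
from itertools import islice


class LazyPipeline:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded lineage: ("map" | "filter", function)

    # Transformations: return a new pipeline with one more recorded step.
    def map(self, fn):
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self._source, self._steps + (("filter", fn),))

    def _run(self):
        # Replay the recorded steps over the source as a chain of lazy iterators.
        stream = iter(self._source)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    # Actions: these are the only methods that actually touch the data.
    def take(self, n):
        return list(islice(self._run(), n))

    def collect(self):
        return list(self._run())


calls = []  # records every element that double() is actually applied to


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline(range(1, 1001)).map(double).filter(lambda x: x % 3 == 0)
calls_before_action = len(calls)   # no element has been processed yet
first_two = pipeline.take(2)       # the action triggers the computation
calls_after_take = len(calls)      # only as many elements as take(2) needed
```

<p>Because the recorded steps can be replayed from the source at any time, the same toy also hints at how a lost partition can be rebuilt from its lineage rather than from a stored copy.</p>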

<ul><b>Transformations</b>
<li>map</li>
<li>flatMap</li>
<li>filter</li>
<li>distinct</li>
<li>reduceByKey</li>
<li>mapPartitions</li>
</ul>

<ul><b>Actions</b>
<li>collect</li>
<li>collectAsMap</li>
<li>reduce</li>
<li>countByKey</li>
<li>take</li>
<li>countByValue</li>
</ul>

<h2>Three ways for creating an RDD</h2>
<ul>
<li>From a parallelized collection</li>
<li>From another RDD</li>
<li>From external data sources such as HDFS, Amazon S3, HBase, etc.</li>
</ul>

In [1]:
import findspark
findspark.init()
import pyspark
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[2]").setAppName("Sec_RDD_Tut")
sc = SparkContext(conf=conf)

# Creating RDD from parallelized collection

In [2]:
# To create an RDD from a parallelized collection we use the sc.parallelize method.
# sc stands for SparkContext, which is also reachable from a SparkSession as spark.sparkContext.
# Since Spark 2.0, SparkSession is the unified entry point: it wraps SparkContext and replaces
# SQLContext and HiveContext, which previously had to be created separately.
# A StreamingContext is still created separately, on top of a SparkContext.

# sc.parallelize is SparkContext's method for creating a parallelized collection.
# It lets Spark distribute the data across multiple nodes instead of depending on a single
# node to process it.

myRDD = sc.parallelize([('Ross', 19), ('Joey', 18), ('Rachael', 16), ('Phoebe', 18), ('Chandler', 17), ('Monica', 20)])

In [3]:
# When an action is invoked, all the transformations recorded in the RDD's lineage graph
# are executed.
# A common approach in Spark is to use the collect() method, which returns all the values
# in the RDD from the Spark worker nodes to the driver node. With large amounts of data this
# hurts performance, because a large volume of data is transferred from the worker nodes
# to the driver node.
# For a small amount of data this is fine, but as a matter of habit we should prefer the
# take method when we only need to inspect a few elements.

myRDD.take(3)

[('Ross', 19), ('Joey', 18), ('Rachael', 16)]

# Creating RDDs from text files

In [4]:
new_RDD = sc.textFile("file:///home/boom/Documents/programming/pyspark/data_files/rd*")

In [5]:
new_RDD.take(3)

['Hey there where ya goin’, not exactly knowin’, who says you have to call just one place home. He’s goin’ everywhere, B.J. McKay and his best friend Bear. He just keeps on movin’, ladies keep improvin’, every day is better than the last. New dreams and better scenes, and best of all I don’t pay property tax. Rollin’ down to Dallas, who’s providin’ my palace, off to New Orleans or who knows where. Places new and ladies, too, I’m B.J. McKay and this is my best friend Bear.',
 '',
 'Children of the sun, see your time has just begun, searching for your ways, through adventures every day. Every day and night, with the condor in flight, with all your friends in tow, you search for the Cities of Gold. Ah-ah-ah-ah-ah… wishing for The Cities of Gold. Ah-ah-ah-ah-ah… some day we will find The Cities of Gold. Do-do-do-do ah-ah-ah, do-do-do-do, Cities of Gold. Do-do-do-do, Cities of Gold. Ah-ah-ah-ah-ah… some day we will find The Cities of Gold.']

In [6]:
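csv_RDD = sc.textFile("file:///home/boom/Documents/programming/pyspark/data_files/my_data.csv", minPartitions=4).map(lambda element: element.split("\t"))
# map is used here to transform each line (a string) into a list of fields.
# This file is comma-delimited, so splitting on "\t" leaves each row as a one-element list;
# split(",") would separate the fields.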

In [7]:
csv_RDD.take(3)

[['id,firstname,lastname,email,email2,profession'],
 ['100,Lynde,Orelee,Lynde.Orelee@yopmail.com,Lynde.Orelee@gmail.com,firefighter'],
 ['101,Vere,Charity,Vere.Charity@yopmail.com,Vere.Charity@gmail.com,police officer']]
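The effect of the delimiter can be checked in plain Python on two of the rows shown above. This is only a sketch of the per-line function passed to map, not a Spark job; the variable names are made up for the illustration, and the standard-library csv module is included as one option that also copes with quoted fields.

```python
import csv

# Two rows copied from the output above.
lines = [
    "id,firstname,lastname,email,email2,profession",
    "100,Lynde,Orelee,Lynde.Orelee@yopmail.com,Lynde.Orelee@gmail.com,firefighter",
]

# Splitting on a tab finds no delimiter, so each row stays a single string in a list.
tab_split = [line.split("\t") for line in lines]

# Splitting on a comma yields one list entry per field.
comma_split = [line.split(",") for line in lines]

# csv.reader gives the same result here and also handles quoted commas.
parsed = list(csv.reader(lines))
```

In the RDD version, the equivalent change would be to pass a function that splits on "," to map.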

In [8]:
csv_RDD.getNumPartitions()

4

In [9]:
csv_RDD.count()

1001

# Next we want to apply some transformations
We want to convert all the data into lowercase and divide the paragraphs into words.

In [10]:
def conv_func(lines):
    lines = lines.lower()
    lines = lines.split()
    return lines
split_RDD = new_RDD.map(conv_func)

In [11]:
split_RDD.take(5)

[['hey',
  'there',
  'where',
  'ya',
  'goin’,',
  'not',
  'exactly',
  'knowin’,',
  'who',
  'says',
  'you',
  'have',
  'to',
  'call',
  'just',
  'one',
  'place',
  'home.',
  'he’s',
  'goin’',
  'everywhere,',
  'b.j.',
  'mckay',
  'and',
  'his',
  'best',
  'friend',
  'bear.',
  'he',
  'just',
  'keeps',
  'on',
  'movin’,',
  'ladies',
  'keep',
  'improvin’,',
  'every',
  'day',
  'is',
  'better',
  'than',
  'the',
  'last.',
  'new',
  'dreams',
  'and',
  'better',
  'scenes,',
  'and',
  'best',
  'of',
  'all',
  'i',
  'don’t',
  'pay',
  'property',
  'tax.',
  'rollin’',
  'down',
  'to',
  'dallas,',
  'who’s',
  'providin’',
  'my',
  'palace,',
  'off',
  'to',
  'new',
  'orleans',
  'or',
  'who',
  'knows',
  'where.',
  'places',
  'new',
  'and',
  'ladies,',
  'too,',
  'i’m',
  'b.j.',
  'mckay',
  'and',
  'this',
  'is',
  'my',
  'best',
  'friend',
  'bear.'],
 [],
 ['children',
  'of',
  'the',
  'sun,',
  'see',
  'your',
  'time',
  'has',


In [12]:
flat_map_RDD = new_RDD.flatMap(conv_func)

In [13]:
# flatMap flattens the per-line word lists into a single RDD of words (map keeps one list per line)
flat_map_RDD.take(4)

['hey', 'there', 'where', 'ya']
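The difference between map and flatMap can be mirrored with ordinary Python lists. The sketch below reuses the logic of conv_func on three made-up sample lines (the sample text is an assumption, shortened from the data above) and is not Spark code.

```python
from itertools import chain


def conv_func(line):
    # Same logic as the notebook's conv_func: lowercase, then split on whitespace.
    return line.lower().split()


lines = ["Hey there where ya", "", "Children of the sun"]

# map-like: one output element per input line, so the result is a list of lists.
mapped = [conv_func(line) for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat list of words.
flat_mapped = list(chain.from_iterable(conv_func(line) for line in lines))
```

Note that the empty line produces an empty list under map but simply disappears under flatMap.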

In [14]:
# NEXT WE WANT TO REMOVE ALL THE STOPWORDS FROM THE GIVEN TEXT
# The text was lowercased earlier, so the capitalized entries "I" and "I'd" never match.
stopwords = ["a", "all", "the", "as", "is", "am", "an", "and", "be", "been", "from", "had", "I", "I'd", "why", "with", "hey", "there"]

In [15]:
stop_words_RDD = flat_map_RDD.filter(lambda x: x not in stopwords)
stop_words_RDD.take(4)

['where', 'ya', 'goin’,', 'not']

In [17]:
# Keep only the words that start with 'c'
filteredRDD = flat_map_RDD.filter(lambda x: x.startswith('c'))
filteredRDD.distinct().take(50)

['call',
 'condor',
 'cat!',
 'close',
 'championship.',
 'children',
 'cities',
 'cat.']

# Word-Count

In [24]:
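# Pair each word with a count of 1
rdd_mapped = flat_map_RDD.map(lambda x: (x,1))
# Group the 1s by word (reduceByKey would combine counts within each partition before the shuffle)
rdd_grouped = rdd_mapped.groupByKey()
# Sum each group, swap to (count, word), and sort by count in descending order
rdd_frequency = rdd_grouped.mapValues(sum).map(lambda x: (x[1], x[0])).sortByKey(False)
rdd_frequency.take(10)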

[(13, 'the'),
 (9, 'of'),
 (6, 'and'),
 (6, 'cities'),
 (6, 'gold.'),
 (5, 'he’s'),
 (4, 'to'),
 (4, 'day'),
 (4, 'top'),
 (3, 'just')]
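The two usual ways of counting words, grouping then summing versus reducing as you go, can be compared in plain Python. This is a sketch on a made-up word list, not Spark code; in Spark the practical difference is that groupByKey shuffles every (word, 1) pair, while reduceByKey combines counts within each partition first.

```python
from collections import Counter, defaultdict

words = ["the", "cities", "of", "gold", "the", "cities", "the"]

# groupByKey-style: collect all the 1s for each word, then sum each group.
grouped = defaultdict(list)
for word in words:
    grouped[word].append(1)
counts_grouped = {word: sum(ones) for word, ones in grouped.items()}

# reduceByKey-style: keep a running total per word, never materializing the groups.
counts_reduced = {}
for word in words:
    counts_reduced[word] = counts_reduced.get(word, 0) + 1

# Most frequent words first, like sorting the counts in descending order.
top = sorted(counts_reduced.items(), key=lambda kv: kv[1], reverse=True)
```

Both approaches give the same counts; they differ only in how much intermediate data is held or moved.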

In [25]:
stop_words_RDD.count()

180

In [26]:
distinct_stop_words_RDD = stop_words_RDD.distinct()
distinct_stop_words_RDD.count()

115

In [28]:
similar_starting_words_RDD = distinct_stop_words_RDD.groupBy(lambda w: w[0:3])
print([(k, list(v)) for (k,v) in similar_starting_words_RDD.take(5)])

[('whe', ['where', 'where.']), ('ya', ['ya']), ('goi', ['goin’,', 'goin’']), ('exa', ['exactly']), ('say', ['says'])]


In [29]:
sampled_RDD = stop_words_RDD.sample(False, 0.1)
# The first argument is withReplacement: False means each element can be picked at most once.
# The second argument is the fraction: the expected share of the data to sample, not an exact size.
# 0.1 gives roughly 10% of the original data (about 18 of 180 here); the actual count varies per run.

In [30]:
sampled_RDD.count()

20

In [32]:
sampled_RDD.take(3)

['who', 'you', 'pay']
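Why the sample held 20 elements rather than exactly 18 can be illustrated with a plain-Python sketch of sampling without replacement, where every element is kept independently with probability equal to the fraction. The function bernoulli_sample and its seed parameter are hypothetical names for this illustration, not part of the PySpark API.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


population = list(range(180))  # stands in for the 180 words counted above

# The sample size changes with the seed; 18 is only the expected value.
sizes = [len(bernoulli_sample(population, 0.1, seed)) for seed in range(5)]

# The same seed always reproduces the same sample.
sample_a = bernoulli_sample(population, 0.1, seed=42)
sample_b = bernoulli_sample(population, 0.1, seed=42)
```

Passing a fixed seed to a sampler is the usual way to make such a sample reproducible.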

# Demo of some functions

In [33]:
a = sc.parallelize([('a', 2), ('b', 3)])
b = sc.parallelize([('a', 9), ('b', 7), ('c', 10)])

In [34]:
c = a.join(b)

In [35]:
c.collect()

[('b', (3, 7)), ('a', (2, 9))]
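The result above is an inner join: the key 'c' exists only in b, so it is dropped. A plain-Python sketch of the same semantics follows; inner_join is a hypothetical helper written for this illustration, not Spark's implementation.

```python
def inner_join(left, right):
    """Pair up values from two lists of (key, value) tuples that share a key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Keys missing from `right` contribute nothing, which is what makes the join "inner".
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


a = [('a', 2), ('b', 3)]
b = [('a', 9), ('b', 7), ('c', 10)]
joined = inner_join(a, b)
```

The order of the joined pairs in Spark depends on partitioning, which is why the output above lists 'b' before 'a'.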

In [36]:
num_rdd = sc.parallelize(range(1, 50000))

In [37]:
num_rdd.reduce(lambda x,y: x+y)

1249975000
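The result can be checked against the closed form for the sum of the first n integers, n(n + 1) / 2. The sketch below uses functools.reduce from the standard library as a stand-in for the RDD's reduce; note that range(1, 50000) stops at 49999.

```python
from functools import reduce

numbers = range(1, 50000)                      # 1 .. 49999, the end of range is exclusive
total = reduce(lambda x, y: x + y, numbers)    # same reducer as in the cell above

n = 49999
closed_form = n * (n + 1) // 2                 # sum of 1 .. n
```

In Spark the reducer is applied within partitions and then across them, so it should be commutative and associative, as addition is.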

In [38]:
data_keydata = sc.parallelize([('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)], 4)

In [39]:
data_keydata.collect()

[('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]

In [41]:
data_keydata.reduceByKey(lambda x,y: x+y).collect()

[('b', 4), ('c', 2), ('a', 12), ('d', 5)]

In [42]:
# saveAsTextFile writes a directory with this name, containing one part file per partition
data_keydata.saveAsTextFile("file:///home/boom/Documents/programming/pyspark/data_files/output.txt")

In [43]:
test = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

In [45]:
sc.parallelize(test).sortByKey(True, 1).collect()
# True sorts the keys in ascending order; 1 is the number of partitions for the result

[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
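The keys here are strings, so '1' and '2' sort before the letters because string comparison goes by code point. A plain-Python sketch with the built-in sorted shows the same ordering; it is not Spark code.

```python
test = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

# Sort by key only, ascending and descending.
ascending = sorted(test, key=lambda kv: kv[0])
descending = sorted(test, key=lambda kv: kv[0], reverse=True)

# Digits have lower code points than lowercase letters.
code_points = {key: ord(key) for key, _ in test}
```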

In [46]:
union_rdd = sc.parallelize([1, 1, 2, 3])
union2 = sc.parallelize([4, 5, 6, 7, 8, 9])

In [47]:
union_rdd.union(union2).collect()

[1, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [48]:
intersect_rdd_a = sc.parallelize([1, 2, 3, 4])
intersect_rdd_b = sc.parallelize([3, 4, 5, 6])
intersect_rdd_a.intersection(intersect_rdd_b).collect()

[4, 3]

In [49]:
intersect_rdd_a.subtract(intersect_rdd_b).collect()

[1, 2]

In [50]:
intersect_rdd_a.cartesian(intersect_rdd_b).collect()

[(1, 3),
 (1, 4),
 (2, 3),
 (2, 4),
 (1, 5),
 (1, 6),
 (2, 5),
 (2, 6),
 (3, 3),
 (3, 4),
 (4, 3),
 (4, 4),
 (3, 5),
 (3, 6),
 (4, 5),
 (4, 6)]
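The four set-style operations above have close plain-Python counterparts, sketched below on the same inputs. This is an illustration only: element order in Spark's output depends on partitioning, so the sketch compares contents rather than order.

```python
from itertools import product

left = [1, 2, 3, 4]
right = [3, 4, 5, 6]

# union simply concatenates, so duplicates are kept.
union = [1, 1, 2, 3] + [4, 5, 6, 7, 8, 9]

# intersection keeps the elements present on both sides, without duplicates.
intersection = sorted(set(left) & set(right))

# subtract keeps the elements of the left side that are absent from the right.
right_set = set(right)
difference = [x for x in left if x not in right_set]

# cartesian pairs every left element with every right element.
pairs = list(product(left, right))
```

The cartesian product grows as the product of the two sizes (4 x 4 = 16 pairs here), so it becomes expensive quickly on large RDDs.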

# Finding Page Rank

In [51]:
# PageRank is Google's algorithm for ranking web pages, named after Larry Page of Google.
# The higher a page's rank, the more likely it is to show up in search results.

In [74]:
pageLinks = [['a', ['b', 'c', 'd']],
            ['c', ['b']], ['b', ['d', 'c']], ['d', ['a', 'c']]]

In [75]:
pageRanks = [['a', 1], ['c', 1], ['b', 1], ['d', 1]]

In [76]:
def rankContributions(uris, rank):
    numberOfUris = len(uris)
    rankContribution = float(rank) / numberOfUris
    newrank = []
    for uri in uris:
        newrank.append((uri, rankContribution))
    return newrank

In [77]:
pageLinksRDD = sc.parallelize(pageLinks, 2)
pageLinksRDD.collect()

[['a', ['b', 'c', 'd']], ['c', ['b']], ['b', ['d', 'c']], ['d', ['a', 'c']]]

In [78]:
pageRanksRDD = sc.parallelize(pageRanks, 2)
pageRanksRDD.collect()

[['a', 1], ['c', 1], ['b', 1], ['d', 1]]
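Before running the iteration on RDDs, the same computation can be sketched in plain Python on the four-page graph defined above. The update rule assumed here is new rank = (1 - d) + d * (sum of contributions received), with a damping factor d of 0.85 and 20 iterations; the names page_links, ranks, damping and num_iter are chosen for this illustration.

```python
# Same link graph as pageLinks, as a dict: page -> pages it links to.
page_links = {
    'a': ['b', 'c', 'd'],
    'c': ['b'],
    'b': ['d', 'c'],
    'd': ['a', 'c'],
}

ranks = {page: 1.0 for page in page_links}  # every page starts with rank 1
damping = 0.85
num_iter = 20

for _ in range(num_iter):
    # Each page splits its current rank evenly among the pages it links to.
    received = {}
    for page, links in page_links.items():
        share = ranks[page] / len(links)
        for target in links:
            received[target] = received.get(target, 0.0) + share
    # Apply the damping factor to the summed contributions.
    # A page that receives no contributions would drop out here; in this graph every page is linked to.
    ranks = {page: (1 - damping) + damping * total for page, total in received.items()}

total_rank = sum(ranks.values())
lowest = min(ranks, key=ranks.get)
```

Since every page in this graph has both incoming and outgoing links, the total rank stays at 4 across iterations, which is a handy sanity check for the RDD version as well.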

In [79]:
numIter = 20
damping_factor = 0.85

In [80]:
for i in range(numIter):
    # Attach each page's current rank to its outgoing links: (page, (links, rank))
    linksRank = pageLinksRDD.join(pageRanksRDD)
    # Split each page's rank evenly among the pages it links to
    contributedRDD = linksRank.flatMap(lambda x: rankContributions(x[1][0], x[1][1]))
    # Add up the contributions received by each page
    sumRanks = contributedRDD.reduceByKey(lambda v1, v2: v1+v2)
    # Apply the damping factor to get each page's new rank
    pageRanksRDD = sumRanks.map(lambda x: (x[0], (1-damping_factor)+damping_factor*x[1]))

In [81]:
pageRanksRDD.collect()
