
# Chapter 3: Programming with RDDs
## RDD Basics
* RDD: An immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster
* Created in 2 ways:
    1. Loading an external dataset
    2. Distributing a collection of objects (e.g. list or set) in your driver program
* RDDs offer 2 types of operations:
    1. Transformations: construct a new RDD from a previous one
    2. Actions: Compute a result based on an RDD, either return it to the driver program or save it to an external storage system (e.g. HDFS)
* Difference: Although you can define new RDDs at any time, Spark only computes them in a <i>lazy</i> fashion, that is, the first time they are used in an action
* RDDs are by default recomputed each time you run an action on them
    - If you would like to reuse an RDD in multiple actions, you can ask Spark to <i>persist</i> it using <code>RDD.persist()</code>. After computing the RDD the first time, Spark stores its contents in memory (partitioned across the machines in your cluster) and reuses them in future actions
    - Spark does not persist by default because, with big datasets, there is no reason to waste storage space on an RDD you will not reuse when Spark could instead stream through the data once and just compute the result
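
The recompute-versus-persist behavior described above can be imitated locally. The sketch below is plain Python, not PySpark: <code>TinyRDD</code>, <code>load_python_lines</code> and the sample lines are hypothetical names made up for illustration, and the class only mimics the idea that every action rebuilds the data unless <code>persist()</code> was requested.

```python
class TinyRDD(object):
    """Hypothetical local stand-in for an RDD; not part of PySpark."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._persist = False
        self._cache = None

    def persist(self):
        # Only marks the data for caching; nothing is computed yet
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        data = list(self._compute())
        if self._persist:
            self._cache = data
        return data

    # Two "actions"
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


reads = {"n": 0}  # counts how often the source is read

def load_python_lines():
    reads["n"] += 1
    lines = ["# Apache Spark", "Python shell", "Scala shell"]  # made-up sample data
    return [line for line in lines if "Python" in line]

# Without persist: each action re-reads the source
unpersisted = TinyRDD(load_python_lines)
unpersisted.count()
unpersisted.collect()
reads_without_persist = reads["n"]

# With persist: the source is read once and reused
reads["n"] = 0
persisted = TinyRDD(load_python_lines).persist()
persisted.count()
persisted.collect()
reads_with_persist = reads["n"]

print(reads_without_persist, reads_with_persist)
```

In this sketch the unpersisted object reads its source once per action, while the persisted one reads it only on the first action.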

### Example: Filter data that matches a predicate
Create a new RDD holding just the strings that contain "Python":


In [99]:
from __future__ import print_function

# spark_home = os.environ.get('SPARK_HOME', None)
# print 'spark_home: ', spark_home
text_file = sc.textFile("README_spark.md")
# print 'text_file: ',text_file


# lines is an RDD of strings
lines = sc.textFile("README_spark.md")
# pythonLines is a new RDD holding just the strings that contain "Python"
pythonLines = lines.filter(lambda line: "Python" in line)

print('pythonLines Object: ',pythonLines)
print('lines Object: ', lines)

pythonLines Object:  PythonRDD[274] at RDD at PythonRDD.scala:43
lines Object:  README_spark.md MapPartitionsRDD[273] at textFile at NativeMethodAccessorImpl.java:-2


<code>lines.filter</code> is a <b>Transformation</b>

In [3]:
print('first line in pythonLines RDD:\n', pythonLines.first())
print()
print('first line in lines RDD:\n', lines.first())

first line in pythonLines RDD:
 high-level APIs in Scala, Java, Python, and R, and an optimized engine that

first line in lines RDD:
 # Apache Spark


#### Benefits of lazy evaluation:
* Computations are only executed the first time they are needed by an action
* In this example, if Spark were to load and store all the lines in the file at:
        lines = sc.textFile(...)
  it would waste a lot of storage space, given that we then immediately filter out many lines. 
* Instead, once Spark sees the whole chain of transformations, it can compute just the data needed for its result
    - In fact, for the <code>first()</code> action, Spark only scans the file until it finds the first matching line; <b>it does not read the whole file</b>
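
Python generators behave in a similar lazy way, which makes the idea easy to try without a cluster. This is only a local analogy under that assumption, not how Spark is implemented; <code>read_lines</code>, <code>scanned</code> and the sample lines are hypothetical.

```python
scanned = []  # records which lines were actually read

def read_lines(lines):
    # Lazily yields lines, noting each one as it is consumed
    for line in lines:
        scanned.append(line)
        yield line

source = [
    "# Apache Spark",
    "",
    "high-level APIs in Scala, Java, Python, and R",
    "more text",
    "## Interactive Python Shell",
]

# Building the "filter" does no work yet (like a transformation)
python_lines = (line for line in read_lines(source) if "Python" in line)
scanned_before_action = len(scanned)

# Asking for the first match triggers the scan (like the first() action)
first_match = next(python_lines)
scanned_after_action = len(scanned)

print(first_match)
print(scanned_before_action, scanned_after_action)
```

Nothing is read when the filter is defined, and only the lines up to the first match are read once a result is requested.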
    

### Example: Practical use of persist
In practice, you will often use <code>persist</code> to load a subset of your data into memory and query it repeatedly. For example, if we knew that we wanted to compute multiple results about the README lines that contain "Python", we could write:

In [4]:
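# StorageLevel must be imported unless the session (e.g. the PySpark shell) already provides it
from pyspark import StorageLevel

pythonLines.persist(StorageLevel.MEMORY_ONLY_SER)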
print('number of python lines: ',pythonLines.count())
print('all python lines: ', pythonLines.collect())


number of python lines:  3
all python lines:  [u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', u'## Interactive Python Shell', u'Alternatively, if you prefer Python, you can use the Python shell:']


## Creating RDDs

* Two options:
    1. Load external dataset 
    2. Parallelize a collection in your driver program
    

In [5]:
# load external dataset (1):
lines = sc.textFile('README.md')
print('external dataset: ', lines)

# parallelize a collection in your own driver program (2):
lines = sc.parallelize(['pandas','i like pandas'])
print('parallelize a collection: ', lines)


external dataset:  README.md MapPartitionsRDD[9] at textFile at NativeMethodAccessorImpl.java:-2
parallelize a collection:  ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:423


## RDD Operations


### Transformations Filter Example
* <code>filter</code> does not mutate the existing <code>inputRDD</code>
* Instead, it returns a pointer to an entirely new RDD
* <code>inputRDD</code> can be used again to search for lines with the word "warning" in them
* We can then use another transformation, <code>union</code>, to print out the number of lines that contain either "error" or "warning"
    - <code>union</code> differs from <code>filter</code> in that it operates on two RDDs instead of one
    - Transformations can actually operate on any number of input RDDs
* As you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between the different RDDs, called the <i>lineage graph</i>
    - It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost
    
<img src ="figures/lineage_graph.png">

In [6]:
inputRDD = sc.textFile('log.txt')
errorsRDD = inputRDD.filter(lambda x: 'error' in x)
warningsRDD = inputRDD.filter(lambda x: 'warning' in x)
badLinesRDD = errorsRDD.union(warningsRDD)

print('badLines RDD:')
for i, a in enumerate(badLinesRDD.collect()):
    print('line ', i+1, a)

print('inputRDD:')
for i, a in enumerate(inputRDD.collect()):
    print('line ', i+1, a)

print('Note: filter operation does not mutate existing inputRDD')

badLines RDD:
inputRDD:
line  2 cats
line  4 cats
line  6 cats
Note: filter operation does not mutate existing inputRDD


### Actions Example
* Actions <i>do something</i> with our dataset
* <code>count()</code>: returns the count as a number
* <code>take(n)</code>: returns up to <code>n</code> elements from the RDD to the driver
* <code>collect()</code>: Retrieves the entire RDD; useful if your program filters RDDs down to a very small size and you'd like to deal with it locally. 
    - <b> Your entire dataset must fit in memory on a single machine to use <code>collect()</code></b>

In [7]:
print('Input has', badLinesRDD.count(), 'lines' )
print ('2 Examples: ')
for a in badLinesRDD.take(2):
    print(a)
    

Input has 4 lines
2 Examples: 


### MapReduce vs. Spark Excerpt
<i>Spark uses lazy evaluation to reduce the number of passes it has to take over our data
by grouping operations together. In MapReduce systems like Hadoop, developers often
have to spend a lot of time considering how to group together operations to minimize the
number of MapReduce passes. In Spark, there is no substantial benefit to writing a single
complex map instead of chaining together many simple operations. Thus, users are free
to organize their program into smaller, more manageable operations</i>

### Passing Functions to Spark
* When you pass a function that is a member of an object, or that contains a reference to a field of an object (e.g. <code>self.field</code>), Spark sends the <i>entire</i> object to the worker nodes, which can be much larger than the bit of information you need
* This can also cause your program to fail, if your class contains objects that Python can't figure out how to pickle

<b> Don't do this:</b>

    class WordFunctions(object):
        ...
        def getMatchesMemberReference(self, rdd):
            # Problem: "self.query" makes the lambda reference all of "self"
            return rdd.filter(lambda x: self.query in x)

<b> Do this:</b>

    class WordFunctions(object):
        ...
        def getMatchesNoReference(self, rdd):
            # Safe: extract only the field we need into a local variable
            query = self.query
            return rdd.filter(lambda x: query in x)
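
The difference between the two versions can be checked in plain Python by looking at what each lambda captures in its closure. This is a local sketch without Spark: <code>big_payload</code>, <code>matcher_with_self</code>, <code>matcher_with_local</code> and <code>captured</code> are hypothetical names, and inspecting <code>__closure__</code> only illustrates what a serializer would have to ship along with the function.

```python
class WordFunctions(object):
    def __init__(self, query):
        self.query = query
        # Hypothetical large field that we do not want to send to workers
        self.big_payload = list(range(100000))

    def matcher_with_self(self):
        # The lambda refers to self, so the whole object is captured
        return lambda x: self.query in x

    def matcher_with_local(self):
        # Only the local string is captured
        query = self.query
        return lambda x: query in x


def captured(fn):
    """Return the objects a function holds in its closure."""
    return [cell.cell_contents for cell in fn.__closure__]


wf = WordFunctions("error")
captured_with_self = captured(wf.matcher_with_self())
captured_with_local = captured(wf.matcher_with_local())

print(type(captured_with_self[0]).__name__)
print(captured_with_local)
```

The first lambda holds the whole <code>WordFunctions</code> instance, including the large field, while the second holds only the query string.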
        

### Common Transformations and Actions
* The two most <b>common transformations</b> you will likely be using are <code>map()</code> and <code>filter()</code>
    - <code>map()</code> takes in a function and applies it to each element in the RDD; the result of the function becomes the new value of that element in the resulting RDD
    - <code>filter()</code> takes in a function and returns an RDD that only has the elements for which that function returns true
* <code>flatMap()</code> produces multiple output elements for each input element
    - The function is called individually for each element in our input RDD
    - Instead of returning a single element, it returns an iterator with our return values
    - Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all the iterators
    
   
### Pseudo Set Operators 
* RDDs support many of the operations of mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets
* The set property most frequently missing from our RDDs is the uniqueness of elements
    - If we only want unique elements we can use the <code>RDD.distinct()</code> transformation to produce a new RDD with only distinct items
    - Note that <code>distinct()</code> is expensive as it requires shuffling all the data over the network to ensure that we only receive one copy of each element 
* <code>union()</code>: returns an RDD consisting of data from both sources, but if there are duplicates in the input RDDs, the result will also contain duplicates
* <code>intersection()</code>: only returns elements in both RDDs and removes all duplicates (including duplicates from a single RDD)
    - Performance of <code>intersection</code> is much worse than <code>union</code> since it requires a shuffle over the network to identify the common elements
        - In a shuffle operation we have to send the results to different machines rather than processing them locally
* <code>subtract(other)</code> takes in another RDD and returns an RDD that only has values present in the first RDD and not the second RDD
* <code>cartesian(other)</code> returns all possible pairs (a, b) where a is an element of the source RDD and b is an element of the other RDD
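
The duplicate-handling rules above can be mimicked with ordinary Python lists. This is a local sketch of the semantics only, assuming small in-memory data; it is not how Spark executes these operations, and a real RDD does not guarantee the element order shown here.

```python
from itertools import product

rdd1 = [1, 2, 3, 3]   # note the duplicate 3
rdd2 = [3, 4, 5]

union = rdd1 + rdd2                                  # keeps duplicates
distinct = sorted(set(rdd1))                         # one copy of each element
intersection = sorted(set(rdd1) & set(rdd2))         # common elements, duplicates removed
subtract = [x for x in rdd1 if x not in set(rdd2)]   # in rdd1 but not in rdd2
cartesian = list(product(rdd1, rdd2))                # every (a, b) pair

print(union)
print(distinct)
print(intersection)
print(subtract)
print(len(cartesian))
```

The union keeps all three copies of 3, while the intersection reduces them to one.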

<img src='figures/cartesian_product_figure.png'>

### Examples of Transformations

#### Map and Flat Map Example

In [8]:
# Map Example - square each element
print('Map Example')
original = [1,2,3,4]
nums = sc.parallelize(original)
squared = nums.map(lambda x: x*x).collect()
print('original: ', original)
print('mapped output: ',squared)

#Flat Map Example - splitting up an input string into words 
print('\nFlatmap Example ')
print('original: ', ['hello world','hi'])
lines = sc.parallelize(['hello world','hi'])
words = lines.flatMap(lambda line: line.split(' '))
print('flatmap output: ', words.collect())
    

Map Example
original:  [1, 2, 3, 4]
mapped output:  [1, 4, 9, 16]

Flatmap Example 
original:  ['hello world', 'hi']
flatmap output:  ['hello', 'world', 'hi']


#### Map, Sample, Union, Intersection Examples

In [9]:
# map example
rdd = sc.parallelize([1,2,3,3])
print('map output: ', rdd.map(lambda x: x+1).collect())

# sample 
print('sample output: ', rdd.sample(False,0.5).collect())


rdd1 = sc.parallelize([1,2,3])
rdd2 = sc.parallelize([3,4,5])

print('rdd1: ',rdd1.collect())
print('rdd2: ', rdd2.collect())

# union
print('union output:',rdd1.union(rdd2).collect())

# intersection
print('intersection output:',rdd1.intersection(rdd2).collect())
# subtract
print('subtract output:', rdd1.subtract(rdd2).collect())
# cartesian product
print('Cartesian product: ',rdd1.cartesian(rdd2).collect())

map output:  [2, 3, 4, 4]
sample output:  [2]
rdd1:  [1, 2, 3]
rdd2:  [3, 4, 5]
union output: [1, 2, 3, 3, 4, 5]
intersection output: [3]
subtract output: [1, 2]
Cartesian product:  [(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 3), (3, 4), (3, 5)]


### Actions
* <code>reduce()</code>: Takes in a function that operates on two elements of the type in your RDD and returns a new element of the same type
    - <code>reduce()</code> is somewhat special. The "worker function" must accept two arguments (call them x and y), not just one. Conceptually, the function is called with the first two elements, then with the result of that call and the third element, and so on, until all of the elements have been handled. This means that the function is called n-1 times if there are n elements, and the return value of the last call is the result of <code>reduce()</code>. For example, <code>lambda x, y: x + y</code> simply adds its arguments, so we get the sum of all elements. In Spark, elements are first combined within each partition and the partial results are then merged, so the function should be commutative and associative.
* <code>fold()</code>: takes a function with the same signature needed for <code>reduce</code>, but in addition takes a "zero value" to be used for the initial call on each partition. 
    - The zero value you provide should be the identity element for your operation 
    - That is, applying it multiple times with your function should not change the value (e.g. 0 for +, 1 for *, or an empty list for concatenation)
* <code>aggregate()</code>: this function frees us from the constraint of having the return be the same type as the RDD we are working with. <br>We supply:
    - An initial value of the type we want to return
    - A function to combine the elements from our RDD with the accumulator 
    - A second function to merge two accumulators, given that each node accumulates its own results locally
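
To see why the zero value for <code>fold()</code> should be the identity element, the per-partition behavior can be emulated locally. This is a plain-Python sketch, not PySpark: <code>local_fold</code> is a hypothetical helper, and the split of the data into two partitions is an assumption made for illustration.

```python
from functools import reduce

def local_fold(partitions, zero, op):
    """Hypothetical local emulation of fold().

    The zero value seeds the fold of every partition and is used once
    more when the per-partition results are merged.
    """
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)

partitions = [[1, 2], [3, 4]]   # assumed split of [1, 2, 3, 4] into 2 partitions
add = lambda a, b: a + b

correct_sum = local_fold(partitions, 0, add)  # 0 is the identity for +
skewed_sum = local_fold(partitions, 1, add)   # 1 is not, so it is counted 3 times

print(correct_sum, skewed_sum)
```

With a zero value of 1, the extra 1 is added once per partition and once more in the final merge, so the result depends on the number of partitions.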

#### Example of Fold and Reduce

In [205]:
L = ['1','25','8','4']
print('fold output:',sc.parallelize(L).fold('',lambda a,b:a+b))
print('reduce output:',sc.parallelize(L).reduce(lambda a,b:a+b))

fold output: 12584
reduce output: 12584


#### Testing the same type output claim
<i>Both <code>fold()</code> and <code>reduce()</code> require that the return type of our result be the same type as that of the elements in the RDD we are operating over</i>

Note that Python does not throw an exception when the function changes the output type from int to float, but Scala does

In [132]:
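# Python does not raise an exception here (Scala would), but division is not
# associative, so the result depends on the order in which elements are combined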
L = [1,2,3,4]
print('OUTPUT: ', sc.parallelize(L).reduce(lambda a,b:a*1./b))

OUTPUT:  0.0416666666667


#### Aggregate Example

<code>rdd.aggregate((init_val0,init_val1),combiner_function,merger_function)</code>

<code>combiner_function</code>: combine the elements from our RDD with the accumulator<br>
<code>merger_function</code>: merge two accumulators, given that each node accumulates its own results locally

Note that the number of partitions affects how the elements are combined as shown in the "intermediate" step

In [215]:
partitions = 2
rdd = sc.parallelize([1,2,3,4],partitions)
intermediate = rdd.aggregate((0,0),
              (lambda x, y: (x[0]+y, x[1] +1)),
              (lambda x, y: (x,y)))
print('intermediate:',intermediate)
sum_count = rdd.aggregate((0,0),
              (lambda x, y: (x[0] + y, x[1] +1)),
              (lambda x, y: (x[0] + y[0], x[1] +y[1])))
print('sum_count:',sum_count)
print('average:', sum_count[0]*1./sum_count[1])

intermediate: (((0, 0), (3, 2)), (7, 2))
sum_count: (10, 4)
average: 2.5
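
The nesting in the "intermediate" output above can be reproduced without Spark. The following is a local sketch, assuming the four elements are split into the partitions <code>[1, 2]</code> and <code>[3, 4]</code> (which is consistent with the recorded output); <code>local_aggregate</code>, <code>seq_op</code> and the merge lambdas are hypothetical names.

```python
from functools import reduce

def local_aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical local emulation of aggregate()."""
    # Step 1: fold each partition's elements into its own accumulator
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: merge the accumulators, starting again from the zero value
    return reduce(comb_op, per_partition, zero)

partitions = [[1, 2], [3, 4]]   # assumed partitioning
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # running (sum, count)

# Merging by pairing the accumulators exposes the per-partition results
intermediate = local_aggregate(partitions, (0, 0), seq_op, lambda a, b: (a, b))

# Merging by adding sums and counts gives the final (sum, count)
sum_count = local_aggregate(
    partitions, (0, 0), seq_op, lambda a, b: (a[0] + b[0], a[1] + b[1])
)
average = sum_count[0] * 1.0 / sum_count[1]

print(intermediate)
print(sum_count, average)
```

The pairs <code>(3, 2)</code> and <code>(7, 2)</code> are the (sum, count) accumulators of the two partitions, and <code>(0, 0)</code> is the zero value used to start the merge.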
