# Chapter 3: Programming with RDDs
## RDD Basics
* RDD: An immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster
* Created in 2 ways:
    1. Loading an external dataset
    2. Distributing a collection of objects (e.g. list or set) in your driver program
* RDDs offer 2 types of operations:
    1. Transformations: construct a new RDD from a previous one
    2. Actions: compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g. HDFS)
* Difference: transformations and actions differ in how Spark computes them. Although you can define new RDDs at any time, Spark computes them only in a <i>lazy</i> fashion, the first time they are used in an action
* RDDs are by default recomputed each time you run an action on them
    - If you would like to reuse an RDD in multiple actions, you can ask Spark to <i>persist</i> it using <code>RDD.persist()</code>, which stores the RDD contents in memory (partitioned across the machines in your cluster) so they can be reused in future actions
    - Spark does not persist by default because, in the context of big data, if you will not reuse the RDD there is no reason to waste storage space when Spark could instead stream through the data once and just compute the result
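
The recompute-versus-persist behaviour described above can be pictured with a small plain-Python analogy. This is only a sketch: `ToyRDD`, `read_source` and the sample strings are hypothetical names made up for illustration, and nothing here is Spark's actual implementation.

```python
class ToyRDD(object):
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterator
        self._persist = False     # set by persist()
        self._cache = None        # filled by the first action after persist()

    def filter(self, predicate):
        # Transformation: only describes a new dataset, runs nothing yet.
        parent = self
        return ToyRDD(lambda: (x for x in parent._iterate() if predicate(x)))

    def persist(self):
        self._persist = True
        return self

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)          # reuse stored contents
        if self._persist:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()                # recompute from scratch

    def count(self):
        # Action: forces evaluation.
        return sum(1 for _ in self._iterate())

    def collect(self):
        # Action: forces evaluation and returns everything.
        return list(self._iterate())


reads = {"n": 0}  # how many times the "file" was read

def read_source():
    reads["n"] += 1
    return iter(["Spark", "Python shell", "Scala", "Python API"])

lines = ToyRDD(read_source)
python_lines = lines.filter(lambda s: "Python" in s)
reads_after_transform = reads["n"]        # nothing has been read yet

python_lines.count()
python_lines.collect()
reads_without_persist = reads["n"]        # one read per action

python_lines.persist()
python_lines.count()                      # computes and stores the contents
python_lines.collect()                    # served from the stored contents
reads_with_persist = reads["n"] - reads_without_persist
```

Without `persist()` every action re-reads the source; after `persist()` only the first action does.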

### Example: Filter data that matches a predicate
Create a new RDD holding just the strings that contain "Python":


In [2]:
import os

spark_home = os.environ.get('SPARK_HOME', None)
# print 'spark_home: ', spark_home
text_file = sc.textFile(spark_home + "/README.md")
# print 'text_file: ',text_file


# lines is an RDD of strings
lines = sc.textFile(spark_home + "/README.md")
# pythonLines is a new RDD holding just the strings that contain "Python"
pythonLines = lines.filter(lambda line: "Python" in line)

print 'pythonLines Object: ',pythonLines
print 'lines Object: ', lines

pythonLines Object:  PythonRDD[4] at RDD at PythonRDD.scala:43
lines Object:  /home/jo/spark/spark-1.6.1-bin-hadoop2.4/README.md MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2


<code>lines.filter</code> is a <b>Transformation</b>

In [3]:
print 'first line in pythonLines RDD:\n', pythonLines.first()
print 
print 'first line in lines RDD:\n', lines.first()

first line in pythonLines RDD:
high-level APIs in Scala, Java, Python, and R, and an optimized engine that

first line in lines RDD:
# Apache Spark


#### Benefits of lazy evaluation:
* Computations are executed only the first time they are needed by an action
* In this example, if Spark were to load and store all the lines in the file as soon as we wrote:
        lines = sc.textFile(...)
  it would waste a lot of storage space, given that we then immediately filter out many lines.
* Instead, once Spark sees the whole chain of transformations, it can compute just the data needed for its result
    - In fact, for the <code>first()</code> action, Spark only scans the file until it finds the first matching line; <b>it does not read the whole file</b>
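
The "stop at the first match" behaviour can be imitated with a plain-Python generator. This is an analogy rather than Spark code; the `read_lines` helper and the four sample lines are made up for illustration.

```python
lines_read = {"n": 0}  # counts how many lines were actually pulled from the source

def read_lines(text):
    # Generator: hands out one line at a time, only when asked.
    for line in text.splitlines():
        lines_read["n"] += 1
        yield line

# Made-up sample text with four lines
sample = (
    "# Apache Spark\n"
    "high-level APIs in Scala, Java, Python, and R\n"
    "## Building Spark\n"
    "## Interactive Python Shell"
)

# Lazy "filter": defining it reads nothing
matching = (line for line in read_lines(sample) if "Python" in line)
read_before = lines_read["n"]

# Analogue of first(): pull a single matching element
first_match = next(matching)
read_after = lines_read["n"]

print(first_match)
print("lines read:", read_after, "of", len(sample.splitlines()))
```

Only the lines up to and including the first match are read; the remaining lines are never touched.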
    

### Example: Practical use of persist
In practice, you will often use <code>persist</code> to load a subset of your data into memory and query it repeatedly. For example, if we knew that we wanted to compute multiple results about the README lines that contain "Python", we could write:

In [4]:
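from pyspark import StorageLevel

# Cache the filtered RDD so the two actions below do not recompute it
pythonLines.persist(StorageLevel.MEMORY_ONLY_SER)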
print 'number of python lines: ',pythonLines.count()
print 'all python lines: ', pythonLines.collect()


number of python lines:  3
all python lines:  [u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', u'## Interactive Python Shell', u'Alternatively, if you prefer Python, you can use the Python shell:']


## Creating RDDs

* Two options:
    1. Load external dataset 
    2. Parallelize a collection in your driver program
    

In [5]:
# load external dataset (1):
lines = sc.textFile(spark_home + '/README.md')
print 'external dataset: ', lines

# parallelize a collection in your own driver program (2):
lines = sc.parallelize(['pandas','i like pandas'])
print 'parallelize a collection: ', lines


external dataset:  /home/jo/spark/spark-1.6.1-bin-hadoop2.4/README.md MapPartitionsRDD[9] at textFile at NativeMethodAccessorImpl.java:-2
parallelize a collection:  ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:423


## RDD Operations


### Transformations Filter Example
* <code>filter</code> does not mutate the existing <code>inputRDD</code> 
* Instead, it returns a pointer to an entirely new RDD
* <code>inputRDD</code> can be used again to search for lines with the word "warning" in them
* We can then use another transformation - <code>union</code> - to build an RDD of the lines that contain either "error" or "warning", which we can then count and print
    - <code>union</code> differs from <code>filter</code> in that it operates on two RDDs instead of one
    - Transformations can actually operate on any number of input RDDs
* As you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the <i>lineage graph</i>
    - It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost
    
<img src ="lineage_graph.png">

In [6]:
from __future__ import print_function
inputRDD = sc.textFile('log.txt')
errorsRDD = inputRDD.filter(lambda x: 'error' in x)
warningsRDD = inputRDD.filter(lambda x: 'warning' in x)
badLinesRDD = errorsRDD.union(warningsRDD)

print('badLines RDD:')
for i, a in enumerate(badLinesRDD.collect()):
    print('line ', i+1, a)

print('inputRDD:')
for i, a in enumerate(inputRDD.collect()):
    print('line ', i+1, a)

print('Note: filter operation does not mutate existing inputRDD')

badLines RDD:
inputRDD:
line  2 cats
line  4 cats
line  6 cats
Note: filter operation does not mutate existing inputRDD
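
The lineage graph for the four RDDs above can be sketched as a small tree in plain Python. The `Node` class and `lineage` helper are hypothetical illustrations and do not reflect how Spark stores dependencies internally; on a real RDD, <code>toDebugString()</code> is the usual way to inspect lineage.

```python
class Node(object):
    """Hypothetical lineage node: a name plus the nodes it was derived from."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def lineage(node, depth=0):
    """Return the dependency tree of `node` as indented lines, root first."""
    out = ["  " * depth + node.name]
    for parent in node.parents:
        out.extend(lineage(parent, depth + 1))
    return out


# Mirror the transformations used in the log example
input_rdd = Node("inputRDD")
errors_rdd = Node("errorsRDD (filter)", [input_rdd])
warnings_rdd = Node("warningsRDD (filter)", [input_rdd])
bad_lines_rdd = Node("badLinesRDD (union)", [errors_rdd, warnings_rdd])

lineage_lines = lineage(bad_lines_rdd)
print("\n".join(lineage_lines))
```

Walking the tree from `badLinesRDD` back to `inputRDD` is the information needed to rebuild a lost piece: re-read the input and re-apply the recorded transformations.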


### Actions Example
* Actions <i>do something</i> with our dataset
* <code>count()</code>: returns the count as a number
* <code>take(n)</code>: returns up to <code>n</code> elements from the RDD to the driver
* <code>collect()</code>: retrieves the entire RDD; useful if your program filters RDDs down to a very small size and you'd like to deal with it locally. <b>Your entire dataset must fit in memory on a single machine to use <code>collect()</code></b>

In [7]:
print('Input has', badLinesRDD.count(), 'lines' )
print ('2 Examples: ')
for a in badLinesRDD.take(2):
    print(a)
    

Input has 4 lines
2 Examples: 


### MapReduce vs. Spark Excerpt
<i>Spark uses lazy evaluation to reduce the number of passes it has to take over our data
by grouping operations together. In MapReduce systems like Hadoop, developers often
have to spend a lot of time considering how to group together operations to minimize the
number of MapReduce passes. In Spark, there is no substantial benefit to writing a single
complex map instead of chaining together many simple operations. Thus, users are free
to organize their program into smaller, more manageable operations</i>
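
As a rough plain-Python analogy for the point in the excerpt, chained lazy steps still make only one pass over the data, just like a single combined step. The generator pipeline below is an illustration under that assumption, not Spark's execution engine; `source` and the numbers are made up.

```python
passes = {"n": 0}  # how many times iteration over the source was started

def source():
    passes["n"] += 1
    for x in range(6):
        yield x

# Three simple chained steps, each lazy
doubled = (x * 2 for x in source())
shifted = (x + 1 for x in doubled)
kept = (x for x in shifted if x % 3 == 0)
chained_result = list(kept)   # the single pass happens here

# The same logic written as one complex step
single_result = [x * 2 + 1 for x in range(6) if (x * 2 + 1) % 3 == 0]

print(chained_result, single_result, passes["n"])
```

Each element flows through all three steps before the next one is read, so splitting the logic into small pieces costs no extra pass.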

### Passing Functions to Spark
...tbc pg 42 (in pdf)

# Word Count Example

The word count script below is quite simple. It takes the following steps:

1. Split each line from the file into words
2. Map each word to a tuple containing the word and an initial count of 1
3. Sum up the count for each word
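
The three steps can be traced on a tiny in-memory list with plain Python before running them on a cluster. This is a sketch only: the two sample lines are made up, and `reduce_by_key` is a hypothetical local helper that imitates what <code>reduceByKey</code> does, not Spark's implementation.

```python
from itertools import chain

lines = ["spark is fast", "spark is fun"]  # made-up sample input

# Step 1 (like flatMap): split every line and flatten into one list of words
words = list(chain.from_iterable(line.split() for line in lines))

# Step 2 (like map): pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# Step 3 (like reduceByKey): combine the values of equal keys with a function
def reduce_by_key(func, key_value_pairs):
    combined = {}
    for key, value in key_value_pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return combined

word_counts = reduce_by_key(lambda a, b: a + b, pairs)
print(sorted(word_counts.items()))
```

The intermediate `words` and `pairs` lists make each step visible; in Spark the same stages stay distributed and lazy until an action such as <code>collect()</code> runs.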

In [8]:
import os

spark_home = os.environ.get('SPARK_HOME', None)
text_file = sc.textFile(spark_home + "/README.md")

word_counts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

In [3]:
word_counts.collect()

[(u'when', 1),
 (u'R,', 1),
 (u'including', 3),
 (u'computation', 1),
 (u'using:', 1),
 (u'guidance', 2),
 (u'Scala,', 1),
 (u'environment', 1),
 (u'only', 1),
 (u'rich', 1),
 (u'Apache', 1),
 (u'sc.parallelize(range(1000)).count()', 1),
 (u'Building', 1),
 (u'guide,', 1),
 (u'return', 2),
 (u'Please', 3),
 (u'Try', 1),
 (u'not', 1),
 (u'Spark', 13),
 (u'scala>', 1),
 (u'Note', 1),
 (u'cluster.', 1),
 (u'./bin/pyspark', 1),
 (u'params', 1),
 (u'through', 1),
 (u'GraphX', 1),
 (u'[run', 1),
 (u'abbreviated', 1),
 (u'[project', 2),
 (u'##', 8),
 (u'library', 1),
 (u'see', 1),
 (u'"local"', 1),
 (u'[Apache', 1),
 (u'will', 1),
 (u'#', 1),
 (u'processing,', 1),
 (u'for', 11),
 (u'[building', 1),
 (u'provides', 1),
 (u'print', 1),
 (u'supports', 2),
 (u'built,', 1),
 (u'[params]`.', 1),
 (u'available', 1),
 (u'run', 7),
 (u'tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).',
  1),
 (u'This', 2),
 (u'Hadoop,', 2),
 (u'Tests', 1),
 (u'example:', 1),
 (u'-DskipT