In [None]:
%%javascript
$.getScript('http://asimjalis.github.io/ipyn-ext/js/ipyn-present.js')

<!-- 
This file was auto-generated from markdown using notedown.
Instead of modifying the ipynb modify the markdown source. 
-->

<h1 class="tocheading">Apache Spark</h1>
<div id="toc"></div>

<img src="images/spark-logo.png">

Apache Spark 
============

Key Value Pairs
===============

PairRDD
-------

At this point we know how to aggregate values across an RDD. If we
have an RDD containing sales transactions we can find the total
revenue across all transactions.

Q: Using the following sales data find the total revenue across all
transactions.

In [None]:
%%writefile sales.txt
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00

- Read the file.

In [None]:
sc.textFile('sales.txt')\
    .take(2)

- Split the lines.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .take(2)

- Remove `#`.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: x[0].startswith('#'))\
    .take(2)

- Try again.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .take(2)

- Pick off last field.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: x[-1])\
    .take(2)

- Convert to float and then sum.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: float(x[-1]))\
    .sum()

ReduceByKey
-----------

Q: Calculate revenue per state.

- Instead of creating a sequence of revenue numbers we can create
  tuples of states and revenue.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .collect()

- Now use `reduceByKey` to add them up.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .collect()

Q: Find the state with the highest total revenue.

- You can either use the action `top` or the transformation `sortBy`.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .sortBy(lambda state_amount:state_amount[1],ascending=False) \
    .collect()

Pop Quiz
--------

<details><summary>
Q: What does `reduceByKey` do?
</summary>
1. It is like a reducer.
<br>
2. If the RDD is made up of key-value pairs, it combines the values
   across all tuples with the same key by using the function we pass
   to it.
<br>
3. It only works on RDDs made up of key-value pairs or 2-tuples.
</details>

Notes
-----

- `reduceByKey` only works on RDDs made up of 2-tuples.

- `reduceByKey` works as both a reducer and a combiner.

- It requires that the operation is associative and commutative.
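
To make these notes concrete, here is a plain-Python sketch (no Spark needed) of what `reduceByKey` computes. The helper `reduce_by_key` and the two hand-made partitions are illustrative assumptions, not Spark APIs. Values are first combined inside each partition and then merged across partitions, which only gives a stable answer when the order and grouping of the combining steps do not matter.

```python
def reduce_by_key(partitions, func):
    """Hypothetical stand-in for reduceByKey over a list of partitions."""
    # Combiner step: reduce values per key inside each partition.
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        partials.append(local)
    # Reducer step: merge the per-partition results.
    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

# (state, amount) pairs from sales.txt, split into two made-up partitions.
partitions = [
    [('WA', 300.0), ('OR', 450.0), ('CA', 200.0)],
    [('CA', 330.0), ('WA', 750.0), ('CA', 200.0)],
]

totals = reduce_by_key(partitions, lambda a, b: a + b)
print(sorted(totals.items()))
```

The same function is used in both steps, which is the sense in which `reduceByKey` acts as both a combiner and a reducer.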

Word Count
----------

Q: Implement word count in Spark.

- Create some input.

In [None]:
%%writefile input.txt
hello world
another line
yet another line
yet another another line

- Count the words.

In [None]:
sc.textFile('input.txt')\
    .flatMap(lambda line: line.split())\
    .map(lambda word: (word,1))\
    .reduceByKey(lambda count1,count2: count1+count2)\
    .collect()
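
As a sanity check on the counts the Spark job should return, the same computation can be sketched in plain Python. This assumes the four lines written to `input.txt` above and uses `collections.Counter` in place of the `flatMap`/`map`/`reduceByKey` chain.

```python
from collections import Counter

text = """hello world
another line
yet another line
yet another another line
"""

# split() with no arguments splits on any whitespace, including newlines,
# so it plays the role of flatMap(lambda line: line.split()).
counts = Counter(text.split())
print(sorted(counts.items()))
```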

Making List Indexing Readable
-----------------------------

- While this code looks reasonable, the list indexes are cryptic and
  hard to read.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .sortBy(lambda state_amount:state_amount[1],ascending=False) \
    .collect()

- We can make this more readable using Python's argument unpacking
  feature.

Argument Unpacking
------------------

Q: Which version of `getCity` is more readable and why?

- Consider this code.

In [None]:
client = ('Dmitri','Smith','SF')

def getCity1(client):
    return client[2]

def getCity2((first,last,city)):
    return city

print getCity1(client)

print getCity2(client)

- What is the difference between `getCity1` and `getCity2`?

- Which is more readable?

- What is the essence of argument unpacking?

Pop Quiz
--------
<details><summary>
Q: Can argument unpacking work for deeper nested structures?
</summary>
Yes. It can work for arbitrarily nested tuples and lists.
</details>

<details><summary>
Q: How would you write `getCity` given 
`client = ('Dmitri','Smith',('123 Eddy','SF','CA'))`
</summary>
`def getCity((first,last,(street,city,state))): return city`
</details>

Argument Unpacking
------------------

- Let's test this out.

In [None]:
client = ('Dmitri','Smith',('123 Eddy','SF','CA'))

def getCity((first,last,(street,city,state))):
    return city

getCity(client)

- Whenever you find yourself indexing into a tuple, consider using
  argument unpacking to make it more readable.

- Here is what `getCity` looks like with tuple indexing.

In [None]:
def badGetCity(client):
    return client[2][1]

badGetCity(client)
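
Note that `def getCity((first,last,city))` is Python 2 syntax; tuple parameters were removed in Python 3 (PEP 3113). A sketch of the same idea that runs on both versions is to unpack inside the function body. The function names below are illustrative only.

```python
client = ('Dmitri', 'Smith', ('123 Eddy', 'SF', 'CA'))

def get_city(client):
    # Unpack in the body instead of in the parameter list.
    first, last, (street, city, state) = client
    return city

def get_city_by_index(client):
    # Equivalent lookup with positional indexes, for comparison.
    return client[2][1]

print(get_city(client))
```

The nested pattern on the left-hand side names every field, so the reader does not have to decode `[2][1]`.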

Argument Unpacking In Spark
---------------------------

Q: Rewrite the last Spark job using argument unpacking.

- Here is the original version of the code.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .sortBy(lambda state_amount:state_amount[1],ascending=False) \
    .collect()

- Here is the code with argument unpacking.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda (id,date,store,state,product,amount): (state,float(amount)))\
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\
    .sortBy(lambda (state,amount):amount,ascending=False) \
    .collect()

- In this case, because we have a long list or tuple, argument unpacking
  is a judgement call.

GroupByKey
----------

`reduceByKey` lets us aggregate values using sum, max, min, and other
associative operations. But what about non-associative operations like
average? How can we calculate them?

- There are several ways to do this.

- The first approach is to change the RDD tuples so that the operation
  becomes associative. 

- Instead of `(state, amount)` use `(state, (amount, count))`.

- The second approach is to use `groupByKey`, which is like
  `reduceByKey` except it gathers together all the values in an
  iterator. 
  
- The iterator can then be reduced in a `map` step immediately after
  the `groupByKey`.

Q: Calculate the average sales per state.

- Approach 1: Restructure the tuples.

In [None]:
sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],(float(x[-1]),1)))\
    .reduceByKey(lambda (amount1,count1),(amount2,count2): \
        (amount1+amount2, count1+count2))\
    .mapValues(lambda (amount,count): amount/count)\
    .collect()

- Note the argument unpacking we are doing in `reduceByKey` to name
  the elements of the tuples.

- Approach 2: Use `groupByKey`.

In [None]:
def mean(iter):
    total = 0.0; count = 0
    for x in iter:
        total += x; count += 1
    return total/count

sc.textFile('sales.txt')\
    .map(lambda x: x.split())\
    .filter(lambda x: not x[0].startswith('#'))\
    .map(lambda x: (x[-3],float(x[-1])))\
    .groupByKey() \
    .map(lambda (state,iter): (state, mean(iter)))\
    .collect()

- Note that we are using unpacking again.
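
The two approaches can be compared without Spark. The following plain-Python sketch uses the `(state, amount)` pairs from `sales.txt`; the variable names are illustrative. Approach 1 carries a running `(sum, count)` per key, which can be combined in any grouping, while approach 2 collects every value per key before averaging.

```python
from collections import defaultdict

pairs = [('WA', 300.0), ('OR', 450.0), ('CA', 200.0),
         ('CA', 330.0), ('WA', 750.0), ('CA', 200.0)]

# Approach 1: carry (sum, count) so that combining is associative.
sum_count = {}
for state, amount in pairs:
    total, count = sum_count.get(state, (0.0, 0))
    sum_count[state] = (total + amount, count + 1)
avg1 = {state: total / count for state, (total, count) in sum_count.items()}

# Approach 2: gather all values per key first, then average each group.
groups = defaultdict(list)
for state, amount in pairs:
    groups[state].append(amount)
avg2 = {state: sum(vals) / len(vals) for state, vals in groups.items()}

print(sorted(avg1.items()))
```

Approach 2 has to hold every value for a key at once, which is the memory cost that `groupByKey` pays in Spark.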

Pop Quiz
--------

<details><summary>
Q: What would be the disadvantage of not using unpacking?
</summary>
1. We will need to drill down into the elements.
<br>
2. The code will be harder to read.
</details>

<details><summary>
Q: What are the pros and cons of `reduceByKey` vs `groupByKey`?
</summary>
1. `groupByKey` stores the values for a particular key as an iterable.
<br>
2. This will take up space in memory or on disk.
<br>
3. `reduceByKey` therefore is more scalable.
<br>
4. However, `groupByKey` does not require an associative reducer
   operation.
<br>
5. For this reason `groupByKey` can be easier to program with.
</details>


Joins
-----

Q: Given a table of employees and a table of locations, find the cities
that the employees live in.


- The easiest way to do this is with a `join`.

In [None]:
# Employees: emp_id, loc_id, name
employee_data = [
    (101, 14, 'Alice'),
    (102, 15, 'Bob'),
    (103, 14, 'Chad'),
    (104, 15, 'Jen'),
    (105, 13, 'Dee') ]

# Locations: loc_id, location
location_data = [
    (14, 'SF'),
    (15, 'Seattle'),
    (16, 'Portland')]

employees = sc.parallelize(employee_data)
locations = sc.parallelize(location_data)

# Re-key employee records with loc_id
employees2 = employees.map(lambda (emp_id,loc_id,name):(loc_id,name))

# Now join.
employees2.join(locations).collect()

Pop Quiz
--------

<details><summary>
Q: How can we keep employees that don't have a valid location ID in
the final result?
</summary>
1. Use `leftOuterJoin` to keep employees whose location ID has no match in the locations.
<br>
2. Use `rightOuterJoin` to keep locations without employees. 
<br>
3. Use `fullOuterJoin` to keep both.
<br>
</details>
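
The difference between `join` and `leftOuterJoin` can be sketched in plain Python on the same data. The helpers `inner_join` and `left_outer_join` are hypothetical stand-ins written for this illustration, not Spark functions, and they use a simple nested loop rather than anything Spark does internally.

```python
employees2 = [(14, 'Alice'), (15, 'Bob'), (14, 'Chad'), (15, 'Jen'), (13, 'Dee')]
locations = [(14, 'SF'), (15, 'Seattle'), (16, 'Portland')]

def inner_join(left, right):
    # Keep only keys present on both sides.
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]

def left_outer_join(left, right):
    # Keep every left record; use None when the right side has no match.
    result = []
    for k, lv in left:
        matches = [rv for rk, rv in right if rk == k]
        for rv in (matches or [None]):
            result.append((k, (lv, rv)))
    return result

print(inner_join(employees2, locations))
print(left_outer_join(employees2, locations))
```

In the inner join, Dee (location 13) disappears; in the left outer join she is kept with `None` as her location. Portland (location 16) appears in neither, because no employee refers to it.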


Cogroup
-------

Q: What is `cogroup` over RDDs?

- If `rdd1` and `rdd2` are pair RDDs.

- Meaning their elements are key-value pairs.

- Then `rdd1.cogroup(rdd2)` will produce another pair RDD.

- The keys will be the union of the keys of `rdd1` and `rdd2`.

- For each key the value will be a pair.

- The first element of the pair is a sequence of values from `rdd1` for that key. 

- The second element is the sequence of values from `rdd2` for that key.
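
The description above can be sketched in plain Python. The `cogroup` function below is a hypothetical stand-in that works on lists of pairs instead of RDDs, and the sample data is made up for the illustration.

```python
def cogroup(pairs1, pairs2):
    """Hypothetical plain-Python stand-in for cogroup on two pair lists."""
    result = {}
    # Values from the first input go into the first list of each pair.
    for key, value in pairs1:
        result.setdefault(key, ([], []))[0].append(value)
    # Values from the second input go into the second list.
    for key, value in pairs2:
        result.setdefault(key, ([], []))[1].append(value)
    return result

left = [('a', 1), ('b', 2), ('a', 3)]
right = [('a', 'x'), ('c', 'y')]

grouped = cogroup(left, right)
for key, (seq1, seq2) in sorted(grouped.items()):
    print(key, seq1, seq2)
```

A key that occurs on only one side still appears in the result, with an empty sequence for the other side.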

Pop Quiz
--------

First, let's initialize Spark.

In [None]:
import pyspark
sc = pyspark.SparkContext()

Q: What will this output?

In [None]:
r2 = sc.parallelize(xrange(5)).map(lambda x:(x%2,x))
r3 = sc.parallelize(xrange(5)).map(lambda x:(x%3,x))
for (k,(seq1,seq2)) in r2.cogroup(r3).collect(): 
    print [k,list(seq1),list(seq2)]


Caching and Persistence
=======================

RDD Caching
-----------

- Consider this Spark job.

In [None]:
import random
num_count = 500*1000
num_list = [random.random() for i in xrange(num_count)]
rdd1 = sc.parallelize(num_list)
rdd2 = rdd1.sortBy(lambda num: num)

- Let's time running `count()` on `rdd2`.

In [None]:
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

- An RDD does no work until an action is called. When an action is
  called, Spark computes the answer and then throws away all the
  intermediate data, so the next action recomputes the RDD from scratch.

- If you have an RDD that you are going to reuse in your computation
  you can use `cache()` to make Spark cache the RDD.

- Let's cache it and try again.

In [None]:
rdd2.cache()
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

- Caching the RDD speeds up the job because the RDD does not have to
  be computed from scratch again.

Notes
-----

- Calling `cache()` flips a flag on the RDD. 

- The data is not cached until an action is called.

- You can uncache an RDD using `unpersist()`.
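
The behaviour in these notes can be modelled with a toy class in plain Python. `LazyDataset` is a hypothetical illustration, not part of PySpark: it recomputes its data on every action unless caching has been requested, and `cache()` itself does no work.

```python
class LazyDataset(object):
    """Toy model of an RDD: lazy, recomputed per action unless cached."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the data
        self._cache_requested = False
        self._cached = None
        self.compute_calls = 0         # how often the data was rebuilt

    def cache(self):
        # Only flips a flag; nothing is computed or stored yet.
        self._cache_requested = True
        return self

    def unpersist(self):
        # Drops the flag and any stored data immediately.
        self._cache_requested = False
        self._cached = None
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        if self._cache_requested:
            self._cached = data
        return data

    def count(self):
        # An action: forces the computation.
        return len(self._materialize())

ds = LazyDataset(lambda: sorted([3, 1, 2]))
ds.count()
ds.count()
calls_uncached = ds.compute_calls      # rebuilt on every action

ds.cache()
calls_after_flag = ds.compute_calls    # unchanged: cache() did no work

ds.count()
ds.count()
calls_cached = ds.compute_calls        # rebuilt once, then served from cache
print(calls_uncached, calls_after_flag, calls_cached)
```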

Pop Quiz
--------

<details><summary>
Q: Will `unpersist` uncache the RDD immediately or does it wait for an
action?
</summary>
It unpersists immediately.
</details>

Caching and Persistence
-----------------------

Q: Persist RDD to disk instead of caching it in memory.

- You can cache RDDs at different levels.

- Here is an example.

In [None]:
import pyspark
rdd = sc.parallelize(xrange(100))
rdd.persist(pyspark.StorageLevel.DISK_ONLY)

Pop Quiz
--------

<details><summary>
Q: Will the RDD be stored on disk at this point?
</summary>
No. It will get stored after we call an action.
</details>

Persistence Levels
------------------

Level                      |Meaning
-----                      |-------
`MEMORY_ONLY`              |Same as `cache()`
`MEMORY_AND_DISK`          |Cache in memory then overflow to disk
`MEMORY_AND_DISK_SER`      |Like above; in cache keep objects serialized instead of live 
`DISK_ONLY`                |Cache to disk not to memory

Notes
-----

- `MEMORY_AND_DISK_SER` is a good compromise between the levels. 

- It is fast because it uses memory first, but not too expensive because
  serialized objects take up less space.

- Make sure you unpersist when you don't need the RDD any more.


Spark Performance
=================

Narrow and Wide Transformations
-------------------------------

- Spark transformations are *narrow* if each partition of the parent RDD
  feeds at most one partition of the child RDD.

- Spark transformations are *wide* if each partition of the parent RDD
  can feed multiple partitions of the child RDD.

- Narrow transformations are map-like, while wide transformations are
  reduce-like.

- Narrow transformations are faster because they do not move data
  between executors, while wide transformations are slower because
  they shuffle data.
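
The distinction can be sketched in plain Python by modelling partitions as lists. The function `shuffle_by_key` is a hypothetical illustration; it assigns keys to partitions in sorted order, whereas Spark would normally use a hash partitioner.

```python
# Two partitions of (key, value) records.
partitions = [
    [('a', 1), ('b', 2)],
    [('a', 3), ('c', 4)],
]

# Narrow (map-like): each output partition depends on one input partition,
# so no record leaves its partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide (reduce-like): records with the same key must meet in one partition,
# so records are redistributed ("shuffled") by key first.
def shuffle_by_key(partitions, num_partitions):
    # A fixed key order stands in for a hash partitioner.
    keys = sorted({k for part in partitions for k, _ in part})
    target = {k: i % num_partitions for i, k in enumerate(keys)}
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for k, v in part:
            out[target[k]].append((k, v))
    return out

shuffled = shuffle_by_key(partitions, 2)
print(doubled)
print(shuffled)
```

After the shuffle, both `'a'` records sit in the same partition even though they started in different ones; that movement is what makes wide transformations expensive.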
 
Repartitioning
--------------

- Over time partitions can get skewed. 

- Or you might have less data or more data than you started with.

- You can rebalance your partitions using `repartition` or `coalesce`.

- `coalesce` is narrow while `repartition` is wide.

Pop Quiz
--------

<details><summary>
Q: Between `coalesce` and `repartition`, which one is faster? Which one
is more effective?
</summary>
1. `coalesce` is narrow so it is faster. 
<br>
2. However, it only combines partitions and does not shuffle them.
<br>
3. `repartition` is wide but it partitions more effectively because it
   reshuffles the records.
</details>
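
The trade-off can be sketched in plain Python with partitions modelled as lists. Both functions below are simplified, hypothetical models: Spark chooses which partitions to merge and where to send records in its own way, so only the qualitative behaviour carries over.

```python
def coalesce(partitions, n):
    # Merge whole partitions into n groups; individual records are not
    # redistributed, so existing skew can survive.
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(partitions, n):
    # Redistribute every record round-robin, which evens out sizes
    # at the cost of moving all the data.
    out = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, r in enumerate(records):
        out[i % n].append(r)
    return out

# One large partition and three tiny ones.
skewed = [list(range(10)), [10], [11], [12]]

coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]
print(coalesced_sizes)
print(repartitioned_sizes)
```

In this model the coalesced partitions stay lopsided, while the repartitioned ones come out nearly equal.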

Misc
====

Amazon S3
---------

- *"s3:" URLs break when Secret Key contains a slash, even if encoded*
    <https://issues.apache.org/jira/browse/HADOOP-3733>

- *Spark 1.3.1 / Hadoop 2.6 prebuilt pacakge has broken S3 filesystem access*
    <https://issues.apache.org/jira/browse/SPARK-7442>