# Day 3: Python and Apache Spark on a Cluster

## Recap

Yesterday, we talked generally about distributed computing, and learned about Spark Core and Spark SQL.

Distributed computing is quite different from working on your laptop:
* Each node holds only **part of the data**, and moving data around is expensive.
* Code is tiny compared to data, so we **send code to the data** instead of downloading data to your machine.
* **Failure is ubiquitous**: nodes go down, disks die, power goes off, AC fails, ...  all the time.  You need to store data redundantly and be prepared to detect failed partial calculations and rerun them elsewhere.
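
To make the first two points concrete, here is a plain-Python sketch with no Spark involved. The three-way split and the `partial_sum` helper are made up for illustration; the only point is that each "node" reduces its own slice and just the small partial results have to travel.

```python
# Plain-Python analogy, not Spark: pretend each sublist lives on a different node.
data = list(range(1, 101))
partitions = [data[i::3] for i in range(3)]   # hypothetical 3-node split

def partial_sum(partition):
    # This is the "code sent to the data": it runs where the partition lives.
    return sum(partition)

# Only one number per partition has to cross the network.
partials = [partial_sum(p) for p in partitions]
total = sum(partials)
print(total)  # 5050
```

If one of the partial sums were lost to a node failure, only that partition would need to be recomputed, not the whole job.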

`Spark` is a framework for writing distributed computations that handles most of the complications of distributed computing transparently.  The devil's bargain is that you need to **write code in a very functional style**, in order to expose the structure and parallelism of your calculations, as well as minimize data transfer and intermediate results.

The main abstraction is the RDD, or Resilient Distributed Dataset.  It behaves like a NumPy array or a Python list, but the data is split up among all the nodes.

In [1]:
# Record current working directory for later use
import os
cwd = os.getcwd()
cwd

'/Users/pat/Work/2015/PythonTraining/4DayTrainingOpenSource'

In [2]:
# A few ways to make an initial RDD

# Items in a Python list (also NumPy array, Pandas series, ...)
rdd1 = sc.parallelize([1,2,3,4,5])

# Lines in a file
rdd2 = sc.textFile("file://" + cwd + "/names/yob1880.txt")

# Files in a folder
# Items are (filename, contents) tuples
rdd3 = sc.wholeTextFiles("file://" + cwd + "/names")

One RDD can be transformed into another with a **transformation**.  The main ones:
* `map()`
* `flatMap()`
* `filter()`

In [3]:
rdd = sc.parallelize(['Hello world', 'My name is Patrick'])
print(
    rdd
    .flatMap(lambda sentence: sentence.split(' '))
    .map(lambda word: word.lower())
    .filter(lambda word: len(word) >= 5)
    .collect()
)

['hello', 'world', 'patrick']
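
For comparison, the same three steps can be written with ordinary list comprehensions. This is only a single-machine analogy (nothing here is distributed), but it may help to see what each transformation does to the items.

```python
sentences = ['Hello world', 'My name is Patrick']

# flatMap: each input item yields zero or more output items, flattened into one sequence
words = [word for sentence in sentences for word in sentence.split(' ')]

# map: exactly one output item per input item
lowered = [word.lower() for word in words]

# filter: keep only the items that pass the test
long_words = [word for word in lowered if len(word) >= 5]

print(long_words)  # ['hello', 'world', 'patrick']
```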


Some transformations produce one RDD from two or more input RDDs:

In [4]:
aRDD = sc.parallelize([1,2,3])
bRDD = sc.parallelize([2,3,4])

aRDD.union(bRDD).collect()   # Also intersection, subtract and cartesian

[1, 2, 3, 2, 3, 4]

Transformations are applied **lazily**: Spark only records them and delays the actual computation as long as possible.

**Actions** demand an immediate result.  The most common are:
* `collect()`
* `first()`
* `count()`
* `take()`
* `reduce()`
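
The interplay between lazy transformations and actions can be mimicked with a Python generator. This is an analogy rather than Spark's actual mechanism, and `noisy_square` and `log` are names invented for this sketch: building the pipeline runs nothing, and work only happens once something consumes it.

```python
log = []

def noisy_square(x):
    log.append(x)      # record when the function really runs
    return x * x

# Building the pipeline runs nothing, much like chaining transformations.
pipeline = (noisy_square(x) for x in range(5))
calls_before = len(log)

# Consuming the pipeline plays the role of an action.
result = list(pipeline)
calls_after = len(log)

print(calls_before, calls_after, result)  # 0 5 [0, 1, 4, 9, 16]
```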

In [5]:
# 10! in Spark
sc.parallelize(range(1, 10+1)).reduce(lambda x, y: x * y)

3628800
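
The function passed to `reduce()` should be associative and commutative, because Spark combines partial results from each partition in no guaranteed order. A plain-Python sketch of why that works for multiplication, using `functools.reduce` and a made-up three-way split of the range:

```python
from functools import reduce

def multiply(x, y):
    return x * y

# Single-machine version of the computation above
factorial_10 = reduce(multiply, range(1, 10 + 1))

# Hypothetical split into three "partitions": reduce each, then reduce the partials
chunks = [range(1, 4), range(4, 8), range(8, 11)]
partials = [reduce(multiply, chunk) for chunk in chunks]
combined = reduce(multiply, partials)

print(factorial_10, partials, combined)  # 3628800 [6, 840, 720] 3628800
```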

**Pair RDDs**, where each item is a `(key, value)` pair, have lots more transformations and actions that work on the keys:
* `groupByKey()`
* `reduceByKey()`
* `countByKey()`
* `collectAsMap()`

In [6]:
rdd = sc.parallelize(['Hello hello', 
                      'Hello New York', 
                      'York says hello'])
result = (
    rdd
    .flatMap(lambda sentence: sentence.split(' '))
    .map(lambda word: (word.lower(), 1))
    .countByKey()    # Could also be .reduceByKey(lambda x, y: x + y).collectAsMap()
)
for word, count in result.items():
    print("Word '{0}' appears {1} time(s)".format(word, count))

Word 'new' appears 1 time(s)
Word 'says' appears 1 time(s)
Word 'hello' appears 4 time(s)
Word 'york' appears 2 time(s)
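
Conceptually, a per-key reduction counts within each partition first and then merges the small per-partition results. The sketch below imitates that idea in plain Python with `collections.Counter`; the two-way split and the `count_words` helper are assumptions made for illustration, not how Spark lays out the data.

```python
from collections import Counter

# Hypothetical split of the three sentences over two "nodes"
partitions = [['Hello hello'],
              ['Hello New York', 'York says hello']]

def count_words(lines):
    # Local, per-partition word count
    counts = Counter()
    for line in lines:
        for word in line.split(' '):
            counts[word.lower()] += 1
    return counts

partial_counts = [count_words(p) for p in partitions]

# Merge step: add up the counts for each key across partitions
word_counts = sum(partial_counts, Counter())
print(dict(word_counts))
```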


Pair RDDs can also be **joined** by key:

In [7]:
be_gdp_per_capita = sc.parallelize(
    [(1913, 4220), (1950, 5462), (2000, 21205)])
nl_gdp_per_capita = sc.parallelize(
    [(1913, 4049), (1950, 5996), (2000, 21480)])
be_gdp_per_capita.join(nl_gdp_per_capita).collect()

[(2000, (21205, 21480)), (1913, (4220, 4049)), (1950, (5462, 5996))]
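
A single-machine sketch of what the join produces, using a dictionary lookup. It assumes each key appears at most once per side, which holds for this data; with repeated keys Spark's `join()` would emit every matching combination, which this simplified version does not do.

```python
be = [(1913, 4220), (1950, 5462), (2000, 21205)]
nl = [(1913, 4049), (1950, 5996), (2000, 21480)]

# Index one side by key, then look up each key from the other side
nl_by_year = dict(nl)
joined = [(year, (be_value, nl_by_year[year]))
          for year, be_value in be
          if year in nl_by_year]        # inner join: keep keys present on both sides

print(joined)
# [(1913, (4220, 4049)), (1950, (5462, 5996)), (2000, (21205, 21480))]
```

Unlike this list, the Spark result comes back in no particular order, as the output above shows.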

In [8]:
(be_gdp_per_capita.join(nl_gdp_per_capita)
 .mapValues(lambda pair: (float(pair[1]) / float(pair[0]) - 1.0) * 100)  # pair is (be, nl); works on Python 2 and 3
 .sortByKey()
 .collect())

[(1913, -4.052132701421796),
 (1950, 9.77663859392164),
 (2000, 1.2968639471822696)]

For many data manipulations, the Spark Core API is a bit too low-level.  Spark SQL is often more convenient:

In [9]:
from pyspark.sql import SQLContext, Row
sqlCtx = SQLContext(sc)

rdd = sc.parallelize([
        Row(country='BE', year=1913, gdp_per_capita=4220),
        Row(country='BE', year=1950, gdp_per_capita=5462),
        Row(country='BE', year=2003, gdp_per_capita=21205),
        Row(country='NL', year=1913, gdp_per_capita=4049),
        Row(country='NL', year=1950, gdp_per_capita=5996),
        Row(country='NL', year=2003, gdp_per_capita=21480),
    ])
schemaRDD = sqlCtx.inferSchema(rdd)
schemaRDD.registerTempTable("gdp")

resultRDD = sqlCtx.sql("""
    SELECT
       year,
       AVG(gdp_per_capita) as mean
    FROM gdp
    GROUP BY year
    ORDER BY year ASC
    """
)
resultRDD.collect()

[Row(year=1913, mean=4134.5),
 Row(year=1950, mean=5729.0),
 Row(year=2003, mean=21342.5)]
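
When no cluster is at hand, a query like this can be sanity-checked locally with the standard library's `sqlite3` module. This is a swapped-in engine, not Spark SQL, so dialect details can differ; the table layout below is an assumption that simply mirrors the `Row` fields above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE gdp (country TEXT, year INTEGER, gdp_per_capita INTEGER)")
conn.executemany("INSERT INTO gdp VALUES (?, ?, ?)", [
    ('BE', 1913, 4220), ('BE', 1950, 5462), ('BE', 2003, 21205),
    ('NL', 1913, 4049), ('NL', 1950, 5996), ('NL', 2003, 21480),
])

rows = conn.execute("""
    SELECT
       year,
       AVG(gdp_per_capita) AS mean
    FROM gdp
    GROUP BY year
    ORDER BY year ASC
    """).fetchall()
conn.close()

print(rows)  # [(1913, 4134.5), (1950, 5729.0), (2003, 21342.5)]
```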

In [10]:
resultRDD.map(lambda row: "Year {0} has GDP/capita of {1}"
             .format(row.year, row.mean)).collect()

['Year 1913 has GDP/capita of 4134.5',
 'Year 1950 has GDP/capita of 5729.0',
 'Year 2003 has GDP/capita of 21342.5']

We also contrasted `scikit-learn` with `MLlib`.

---

Today, we'll move to a cluster running Cloudera Hadoop on AWS.  By the end of today, you should be able to:
* Run Python scripts on the cluster from a shell and from IPython notebooks
* Use Spark to read from and write to HDFS
* Use Spark SQL to read data from and write data to Hive
* Understand how YARN works
* Submit Spark jobs on the cluster
* Use Spark, Spark SQL and Spark MLlib to run algorithms on large-scale data.

**Remember:**  The afternoon will be a free-form session for you to play with Spark on the cluster with your own data, to do an analysis that *you* care about.  I'll be available to help you out and answer questions.