<small><i>This notebook was put together by [Anderson Banihirwe](andersy005.github.io) as part of [2017 CISL/SIParCS Research Project](https://github.com/NCAR/PySpark4Climate): **PySpark for Big Atmospheric & Oceanic Data Analysis**</i></small>

![](http://spark.apache.org/images/spark-logo.png) 
![](https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg)

In [None]:
!hostname

In [None]:
!mpirun.lsf hostname

In [None]:
!bjobs

*To confirm that PySpark is running, run the cells below. If everything is set up correctly, they should run without errors.*

In [None]:
import pyspark
pyspark.__version__

# Table of Contents
- [I. PySpark](#1.-PySpark)
- [II. Resilient Distributed Datasets](#2.-Resilient-Distributed-Datasets)
- [III. Creating an RDD](#3.-Creating-an-RDD)
- [IV. Transformations](#4.-Transformations)
- [V. Actions](#5.-Actions)
- [VI. Caching RDDs](#6.-Caching-RDDs)
- [VII. Spark Program Lifecycle](#7.-Spark-Program-Lifecycle)
- [VIII. PySpark Closures](#8.-PySpark-Closures)
- [IX. Summary](#9.-Summary)

# 1. PySpark

PySpark is the Python programming interface to Spark.

PySpark provides an easy-to-use programming abstraction and parallel runtime:
> Here's an operation, run it on all the data.

Resilient Distributed Datasets (RDDs) are the key concept.

## 1.1. Spark Driver and Workers
- A Spark program is two programs:
    - **A driver program**
    - **A worker program**

- Worker programs run on cluster nodes or in local threads
- RDDs are distributed across workers

![](https://i.imgur.com/HJ9gpwd.jpg)
source: BerkeleyX-CS100.1x-Big-Data-with-Apache-Spark

## 1.2. Spark Context
- A Spark program first creates a **SparkContext** object
    - The SparkContext tells Spark how and where to access a cluster.
    - The PySpark shell automatically creates the **sc** variable
    - **Jupyter notebooks** and standalone programs must use a constructor to create a new **SparkContext**
 
- Use **SparkContext** to create RDDs.


In [None]:
from pyspark import SparkContext
# Create a new SparkContext
sc = SparkContext()


## 1.3. Master
The **master** parameter for a **SparkContext** determines which type and size of cluster to use.

| Master Parameter  | Description                                                                   |
|-------------------|-------------------------------------------------------------------------------|
| local             | run Spark locally with one worker thread (no parallelism)                     |
| local[K]          | run Spark locally with K worker threads (ideally set to number of cores)      |
| spark://HOST:PORT | connect to a Spark standalone cluster; PORT depends on config (7077 by default) |
| mesos://HOST:PORT | connect to a Mesos cluster; PORT depends on config (5050 by default)           |

The master parameter for the Spark installation running on Yellowstone is set to a **Spark standalone cluster**.

To learn more, check out [APACHE SPARK CLUSTER MANAGERS: YARN, MESOS, OR STANDALONE?](http://www.agildata.com/apache-spark-cluster-managers-yarn-mesos-or-standalone/)

# 2. Resilient Distributed Datasets
[back to top](#Table-of-Contents)

- The primary abstraction in Spark
    - Immutable once constructed
    - Spark tracks lineage information to efficiently recompute lost data
    - Enable operations on collection of elements in parallel
    
- You construct RDDs
    - by parallelizing existing Python collections (lists)
    - by transforming existing RDDs
    - from files in HDFS or any other storage system (GLADE in the case of Yellowstone and Cheyenne)
    
    
- The programmer can specify the number of partitions for an RDD; if it is not specified, a default value is used.

![Partitioning](https://i.imgur.com/zaOQIQY.jpg)
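
To make the idea of partitions concrete, here is a minimal plain-Python sketch (no Spark involved) of splitting a collection into contiguous, near-equal slices. The helper `split_into_partitions` is hypothetical and only illustrates the concept; it is not necessarily the exact scheme Spark uses internally.

```python
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, near-equal slices.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# Ten elements spread over four partitions
partitions = split_into_partitions(range(10), 4)
print(partitions)  # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

Each slice could then be processed by a different worker, which is what makes operations on an RDD parallel.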


There are two types of operations on RDDs:
- **transformations**
- **actions**

- **Transformations** are lazy: they are not computed immediately.
- A transformed RDD is computed only when an action runs on it.
- RDDs can be persisted (cached) in memory or on disk.
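
Lazy evaluation can be illustrated without Spark. The sketch below uses Python's built-in `map`, which is also lazy, as a stand-in for a transformation, and `list(...)` as a stand-in for an action. The names `calls`, `double`, and `numbers` are made up for this illustration.

```python
calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformation": builds a recipe, but runs nothing yet
lazy_doubled = map(double, numbers)
calls_before_action = len(calls)

# "Action": consuming the iterator forces the computation
result = list(lazy_doubled)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)  # 0 4 [2, 4, 6, 8]
```

The function is never called until the result is requested, which mirrors how an RDD transformation waits for an action.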

## 2.1 Working with RDDs
- Create an RDD from a data source
- Apply transformations to an RDD: ```.map(...)```
- Apply actions to an RDD: ```.collect()```, ```.count()```

![](https://i.imgur.com/iqvUJV5.jpg)


# 3. Creating an RDD
[back to top](#Table-of-Contents)

## 3.1. Creating RDDs from Python collections (lists)

In [None]:
import numpy as np
# create an array of 30 random integers in the range [0, 50)
data = np.random.randint(50, size=30)
data

In [None]:
rdd = sc.parallelize(data, 4)

In the above example, no computation occurs with **```sc.parallelize()```**. Spark only records how to create the RDD with four partitions.

In [None]:
rdd

## 3.2. Creating RDDs from a file
We can also create RDDs from HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, or any other Hadoop **InputFormat**.

In [None]:
distFile = sc.textFile("spark-cluster.sh", 4)

In [None]:
distFile

From the above example,
- The RDD is split into at least 4 partitions (the second argument to ```textFile``` is a minimum)
- Elements are lines of input
- **Lazy evaluation** means no execution happens yet.

# 4. Transformations
[back to top](#Table-of-Contents)

- Create new datasets from an existing one
- Use **lazy evaluation**: results are not computed right away; instead, Spark remembers the set of transformations applied to the base dataset.
    - Spark optimizes the required calculations.
    - Spark recovers from failures and slow workers.
    
## 4.1 Some Transformations

### ```.map(func)```

This method is applied to each element of the RDD:

-> returns a new distributed dataset formed by passing each element of the source through a function **func**.

In [None]:
rdd.take(5)

In [None]:
# returns an RDD with each element times two
mapped_rdd = rdd.map(lambda x: x * 2) 
mapped_rdd.take(5)

### ```.flatMap(func)```

-> The ```.flatMap(func)``` method works similarly to ```.map(func)``` but flattens the results: **func** should return a sequence (zero or more items) for each input element, and the returned sequences are concatenated into a single RDD rather than kept as nested lists.

In [None]:

flatmapped_rdd = rdd.flatMap(lambda x: [x, x+5])
flatmapped_rdd.take(5)
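
The difference between the two methods is easiest to see side by side. This is a plain-Python sketch (no Spark involved) that mimics the semantics with a list comprehension and `itertools.chain`; the names `pair_with_offset` and `values` are made up for the illustration.

```python
from itertools import chain


def pair_with_offset(x):
    # Same shape as the lambda used above: one input, a sequence of outputs
    return [x, x + 5]


values = [1, 2, 3]

# map-like behaviour: one output (here a list) per input element
mapped = [pair_with_offset(x) for x in values]

# flatMap-like behaviour: the per-element sequences are concatenated
flat_mapped = list(chain.from_iterable(pair_with_offset(x) for x in values))

print(mapped)       # [[1, 6], [2, 7], [3, 8]]
print(flat_mapped)  # [1, 6, 2, 7, 3, 8]
```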

### ```.filter(func)```
The ```.filter(func)``` method allows you to select the elements of a dataset that meet specified criteria.

-> Returns a new dataset formed by selecting those elements of the source on which func returns true

In [None]:
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
filtered_rdd.take(5)

### ```.distinct([numPartitions])```

-> This method returns a new RDD containing the distinct elements of the source RDD.


In [None]:
distinct_rdd = rdd.distinct()

# 5. Actions
- Cause Spark to execute the recipe of transformations on the source data
- Mechanism for getting results out of Spark.

## 5.1 Some Actions


### ```.take(n)```

-> This method returns the first n elements of the RDD. It scans one partition first and reads further partitions only if they are needed to reach n elements.


In [None]:
rdd.take(10)

If you want a random sample of records instead, use ```.takeSample(withReplacement, num)```.

### ```.reduce(func)```

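The ```.reduce(func)``` action **reduces** the elements of an RDD to a single value using a specified function **func**, which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.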


In [None]:
rdd1 = sc.parallelize([1,3,5])
rdd1.reduce(lambda a, b: a * b)
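
Why the function passed to a parallel reduce should be commutative and associative can be seen in a small plain-Python sketch (no Spark involved). It assumes a reduce that first reduces each partition and then combines the partial results; the helper `reduce_partitioned` and the partition layouts are made up for the illustration.

```python
from functools import reduce
from operator import mul, sub


def reduce_partitioned(func, partitions):
    """Reduce each non-empty partition, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


two_parts = [[1, 2, 3], [4, 5, 6]]
three_parts = [[1, 2], [3, 4], [5, 6]]

# Multiplication is associative and commutative: partitioning does not matter
product_two = reduce_partitioned(mul, two_parts)
product_three = reduce_partitioned(mul, three_parts)

# Subtraction is neither: the answer depends on how the data was split
diff_two = reduce_partitioned(sub, two_parts)
diff_three = reduce_partitioned(sub, three_parts)

print(product_two, product_three)  # 720 720
print(diff_two, diff_three)        # 3 1
```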

### ```.count()```

The ```.count()``` method counts the number of elements in the RDD.

```.count()``` causes Spark to:
- read the data
- count the elements within each partition
- combine the per-partition counts in the driver

In [None]:
rdd.count()
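
The count steps listed above can be sketched in plain Python (no Spark involved). The helper `count_partitioned` and the sample partitions are hypothetical and only mirror the idea of counting per partition and then combining at the driver.

```python
def count_partitioned(partitions):
    """Count elements per partition, then combine the counts."""
    per_partition = [sum(1 for _ in part) for part in partitions]
    return per_partition, sum(per_partition)


per_partition_counts, total_count = count_partitioned([[3, 1], [4, 1, 5], [9]])
print(per_partition_counts, total_count)  # [2, 3, 1] 6
```

Only the small per-partition counts travel to the driver, not the data itself.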

### ```.collect()```
-> Returns all the elements of the RDD to the driver as a list

**BIG WARNING:** make sure the result will fit in the driver program's memory.

In [None]:
rdd.collect()

### ```.reduceByKey(...)```
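The ```.reduceByKey(...)``` method works in a similar way to the ```.reduce(...)``` method but performs the reduction on a key-by-key basis. Unlike ```.reduce(...)```, it is a transformation: it returns a new RDD of (key, value) pairs, and the ```.collect()``` call below is the action that triggers the computation.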

In [None]:
data_key = sc.parallelize([('a', 4),('b', 3),('c', 2),('a', 8),('d', 2),('b', 1),('d', 3)],4)
data_key.reduceByKey(lambda x, y: x + y).collect()
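
As a cross-check of what a key-by-key reduction produces, here is a plain-Python sketch (no Spark involved) over the same pairs. The helper `reduce_by_key` is hypothetical. The output is sorted for display, since the order of the pairs returned by the Spark version is not guaranteed.

```python
def reduce_by_key(func, pairs):
    """Merge the values of each key with func, one pair at a time."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


pairs = [('a', 4), ('b', 3), ('c', 2), ('a', 8), ('d', 2), ('b', 1), ('d', 3)]
totals = reduce_by_key(lambda x, y: x + y, pairs)
print(sorted(totals.items()))  # [('a', 12), ('b', 4), ('c', 2), ('d', 5)]
```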

# 6. Caching RDDs
[back to top](#Table-of-Contents)

To avoid reloading or recomputing the data each time an RDD is used, we can call ```cache()``` on it.

In [None]:
# Keep the RDD in memory once computed instead of recomputing it
rdd.cache()

# 7. Spark Program Lifecycle

1. Create RDDs from external data or **parallelize** a collection in your driver program.
2. Lazily **transform** them into new RDDs
3. **```cache()```** some RDDs for reuse
4. Perform **actions** to execute parallel computation and produce results.

# 8. PySpark Closures
[back to top](#Table-of-Contents)

- PySpark automatically creates closures for:
    - Functions that run on RDDs at workers.
    - Any global variables used by workers
    
- One closure per worker
    - sent for every task
    - No communication between workers
    - changes to global variables at workers are not sent to driver
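
The one-way nature of closures can be imitated in plain Python (no Spark involved). The sketch assumes the driver's state is serialized and shipped to a worker, which is simulated here with `pickle`; the names `counter` and `run_task_on_worker` are made up for the illustration.

```python
import pickle

counter = {"blank_lines": 0}  # driver-side "global" variable


def run_task_on_worker(serialized_state, lines):
    """Simulated worker task: it receives its own copy of the state."""
    state = pickle.loads(serialized_state)
    for line in lines:
        if not line.strip():
            state["blank_lines"] += 1
    return state["blank_lines"]


worker_view = run_task_on_worker(pickle.dumps(counter), ["a", "", "b", "  "])

# The worker saw two blank lines, but the driver's variable is unchanged
print(worker_view, counter["blank_lines"])  # 2 0
```

The update stays in the worker's copy and never reaches the driver.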
    
## 8.2. Consider These Use Cases
- Iterative or single jobs with large global variables:
    - Sending a large read-only lookup table to workers
    - Sending a large feature vector in a machine learning algorithm to workers
    
- Counting events that occur during job execution
    - How many input lines were blank?
    - How many input records were corrupt?
 
<div style="color:red;">
Problems:
<ul>
    <li>Closures are (re-) sent with every job</li>
    <li>Inefficient to send large data to each worker</li>
    <li>Closures are one way: driver -> worker</li>
</ul>

</div>

<p style="color:red;">Solution:</p>

## 8.3. PySpark Shared Variables

### Broadcast Variables:
- Efficiently send large, **read-only** value to all workers.
- Saved at workers for use in one or more Spark operations.
- Like sending a large, read-only lookup table to all the nodes

Example: efficiently give every worker a large dataset

Broadcast variables are usually distributed using an efficient broadcast algorithm.

In [None]:
# At the driver
broadcastVar = sc.broadcast([1, 2, 3])

In [None]:
# At a worker (in code passed via a closure)
broadcastVar.value

### Accumulators:
- Aggregate values from workers back to the driver
- Only the driver can read an accumulator's value; for tasks, accumulators are write-only
- Variables that can only be **added** to through an associative and commutative operation
- Efficiently implement parallel counters and sums
- Use to count errors seen in an RDD across workers

In [None]:
accum = sc.accumulator(0)
rdd2 = sc.parallelize([1, 2, 3, 4])
def f(x):
    # Tasks may only add to the accumulator; they cannot read its value
    global accum
    accum += x

In [None]:
rdd2.foreach(f)
accum.value
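
The accumulator pattern can be sketched in plain Python (no Spark involved). The sketch assumes each task keeps a local update that starts at zero and that the driver merges these updates; the helper `run_tasks` is hypothetical and only mirrors the write-only behaviour seen by tasks.

```python
def run_tasks(partitions, zero=0):
    """Simulate tasks adding to an accumulator that only the driver reads."""
    driver_total = zero
    for part in partitions:           # each partition is handled by one "task"
        local_update = zero           # the task-local update starts at zero
        for x in part:
            local_update += x         # tasks only add; they never read the total
        driver_total += local_update  # the driver merges the task's update
    return driver_total


accumulated = run_tasks([[1, 2], [3, 4]])
print(accumulated)  # 10
```

Because addition is associative and commutative, the total is the same however the elements are split across tasks.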

# 9. Summary
![](https://i.imgur.com/EuyK62Q.jpg)

[back to top](#Table-of-Contents)

References:
1. https://spark.apache.org/docs/latest/programming-guide.html
2. https://spark.apache.org/docs/latest/api/python/index.html