# PySpark Tutorial
<div>
 <h2> CSCI 4283 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

Spark was originally developed using Scala, although there are Python and Java interfaces as well. This tutorial covers [most of the RDD API](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds) using Python bindings.

You may want to consult the [PySpark manual](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html) as well.

In [None]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

Note that I explicitly declare the number of local processes to use with `local[3]`.

In [None]:
conf=SparkConf().setAppName("pyspark tutorial").setMaster("local[3]")
sc = SparkContext(conf=conf)

## Partitions

RDDs are broken into multiple partitions or slices, which are the unit of work allocation (*i.e.*, more partitions give more potential for parallelism, but too many partitions give too much overhead). By default, the number of partitions is related to your cluster size. In this example, I use `local[3]` to specify three worker processes.

In [None]:
a = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7])

In [None]:
a.getNumPartitions()

We can also specify the number of partitions or slices when parallelizing a data structure.

In [None]:
a2 = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7], numSlices=2)

In [None]:
a2.getNumPartitions()

## When are partitions visible?

In general, you don't need to be aware of partitions / slices. 

**fold** takes an "identity value" and a function. It performs a reduction within each partition of the RDD, starting from the identity value, and then performs the reduction again, with the same identity value and function, once the per-partition values have been collected at the host. `fold` is a variant of `reduce` with an explicit starting value, which makes the effect of partitions visible.

For example, assume we want to create a list from elements of an RDD using list addition. There are numerous reasons why this is a bad idea, but it helps us illustrate the impact of partitions.

The problem is that you can't use "list addition" until you have a list. For example, you need to execute `[] + [1]` before you can create a list using `[] + [1] + [2]`.

In [None]:
[] + [1] + [2]

**fold** reduces the elements within each partition, starting from the identity value, and then applies the same function again to combine the per-partition results; **reduce** works the same way but without an identity value. Let's apply **fold** to a 3-partition structure:

In [None]:
a.fold([], lambda x,y: x + [y])
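
To see why the result comes back as a list of lists, here is a pure-Python sketch of what `fold` appears to do, with no Spark required. The helper `model_fold` and the partition layout are assumptions made for illustration; the real split of `a` across three partitions may differ.

```python
from functools import reduce

def model_fold(partitions, zero, op):
    """Hypothetical model of RDD.fold: fold each partition, then fold the partial results."""
    # Step 1: reduce every partition on its own, starting from the zero value.
    partials = [reduce(op, part, zero) for part in partitions]
    # Step 2: combine the per-partition results with the same function and zero value.
    return reduce(op, partials, zero)

# Assumed layout of `a` across three partitions (illustrative only).
partitions = [[7, 2, 3], [1, 2, 3], [4, 5, 6, 7]]

result = model_fold(partitions, [], lambda x, y: x + [y])
print(result)  # [[7, 2, 3], [1, 2, 3], [4, 5, 6, 7]]
```

Within a partition, `x + [y]` appends a number to a list. In the second step, `y` is itself a list, so each partition's list is appended as a single element and the partition boundaries become visible.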

Under the hood of **reduce** there's a set of tools that handle operations within a partition and then across partitions. We're going to look at the general **aggregate** method.

### Aggregate -- generalization of reduce and fold

To understand the order of operations, we need a function that will illustrate that order for an RDD. The **showAdd** function will show the order of operations using parentheses.

In [None]:
def showAdd(x,y):
    return "({} + {})".format(str(x),str(y))

In [None]:
showAdd( showAdd(1,2), 3)

In [None]:
oneslice = sc.parallelize([2,3,4,5,6],1)
oneslice.reduce(showAdd)

If we partition the same data into two slices, we see that one partition produces `(2 + 3)` and the other produces `((4 + 5) + 6)`. The reduce then combines these two results.

In [None]:
twoslice = sc.parallelize([2,3,4,5,6],2)
twoslice.reduce(showAdd)
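
The two orderings can be reproduced without Spark. The sketch below is a pure-Python model; `model_reduce` is a hypothetical helper, and the split into `[2, 3]` and `[4, 5, 6]` is an assumption about how the data is partitioned.

```python
from functools import reduce

def showAdd(x, y):
    return "({} + {})".format(str(x), str(y))

def model_reduce(partitions, op):
    """Hypothetical model of RDD.reduce: reduce each partition, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

one = model_reduce([[2, 3, 4, 5, 6]], showAdd)
two = model_reduce([[2, 3], [4, 5, 6]], showAdd)  # assumed two-partition split

print(one)  # ((((2 + 3) + 4) + 5) + 6)
print(two)  # ((2 + 3) + ((4 + 5) + 6))
```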

Now, let's see the semantics of **fold**:

In [None]:
oneslice.fold(1,showAdd)

In [None]:
twoslice.fold(1,showAdd)

In other words, the identity element is added to the first element of each partition, and once more when the per-partition results are combined. Like **reduce**, the **fold** operation really only works well for commutative-associative operators because it's applied to each slice of an RDD independently.
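
A numeric version makes the cost of a "wrong" identity value concrete. This is a pure-Python sketch under the same assumptions as the cells above: `model_fold` is a hypothetical stand-in for `RDD.fold`, and the two-partition split is assumed to be `[2, 3]` and `[4, 5, 6]`.

```python
from functools import reduce
import operator

def model_fold(partitions, zero, op):
    """Hypothetical model of RDD.fold: fold each partition, then fold the partial results."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

# The data sums to 20. Using 1 as the "identity" adds 1 per partition, plus 1 at the final merge.
one_part = model_fold([[2, 3, 4, 5, 6]], 1, operator.add)
two_part = model_fold([[2, 3], [4, 5, 6]], 1, operator.add)
true_zero = model_fold([[2, 3], [4, 5, 6]], 0, operator.add)

print(one_part)   # 22
print(two_part)   # 23
print(true_zero)  # 20
```

With a true identity (0 for addition), the answer no longer depends on how the data is partitioned.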

Recall that we explicitly specified that `twoslice` should have two slices.

In [None]:
twoslice.reduce(showAdd)

In [None]:
twoslice.fold(1, showAdd)

The **aggregate** function performs an operation like `fold` on each RDD partition and then uses a __combine function__ to join partitions.

For example, assume we have data:

In [None]:
twoPart = sc.parallelize([1,2,3,4], numSlices=2)

This data will (likely) be divided into `[1,2]` and `[3,4]`. Now assume we want to reduce to two values -- the first is the sum of the data (10) and the second is the length of the largest partition (likely 2).

We'll have two distinct functions -- `seqOp` will define operations within a partition and `combOp` will define how partitions are combined.

In [None]:
def seqOp( x, y):
    return "(" + str(x) + "+ S +" + str(y) + ")"

In [None]:
def combOp( x, y ):
    return "[" + str(x) + "+ C +" + str(y) + "]"

As with `fold`, we need a "zero-value" to start folding.

In [None]:
oneslice.aggregate( 0, seqOp, combOp )

Recall that `oneslice` has a single partition. The `seqOp` operation is applied to the elements of that partition, starting from the identity value (0). `combOp` then combines the identity value (0) with the result from that single partition.

Now, let's see what this is like for two slices.

In [None]:
twoslice.aggregate( 0, seqOp, combOp )

In this case, the values in each of the two partitions are combined using `seqOp`, starting from the identity element.

The results from the two partitions are then combined using `combOp` in a *left to right* ordering.

**aggregate** is the basis of many of the other operations in Spark. You can use it to build additional extensions, but many of the common operations we need are built in using **aggregate**.
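
Returning to the `twoPart` example, the sum of the data and the length of the largest partition can be computed in a single pass. The sketch below is a pure-Python model rather than a Spark call: `model_aggregate`, `seq_op`, and `comb_op` are hypothetical names, and the split into `[1, 2]` and `[3, 4]` is assumed.

```python
from functools import reduce

def model_aggregate(partitions, zero, seq_op, comb_op):
    """Hypothetical model of RDD.aggregate: seq_op within partitions, comb_op across them."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def seq_op(acc, x):
    # acc is (running sum, number of elements seen in this partition)
    total, count = acc
    return (total + x, count + 1)

def comb_op(a, b):
    # Add the sums; keep the larger of the two partition lengths.
    return (a[0] + b[0], max(a[1], b[1]))

# Assumed layout of twoPart across two partitions.
result = model_aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
print(result)  # (10, 2)
```

On the real RDD, the equivalent call would be `twoPart.aggregate((0, 0), seq_op, comb_op)`; the second value depends on how Spark actually splits the data.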