# PySpark Tutorial
<div>
 <h2> CSCI 4283 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

Spark was originally developed using Scala, although there are Python and Java interfaces as well. This tutorial covers [most of the RDD API](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds) using Python bindings.

You may want to consult the [PySpark manual](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html) as well.

In [3]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

Note that I explicitly set the number of local worker threads with `local[3]`.

In [4]:
conf=SparkConf().setAppName("pyspark tutorial").setMaster("local[3]")
sc = SparkContext(conf=conf)

## Partitions

RDDs are broken into multiple partitions (or slices), which are the unit of work allocation (*i.e.*, more partitions give more potential for parallelism, but too many partitions add too much overhead). By default, the number of partitions is related to your cluster size. In this example, I use `local[3]` to specify three local workers, so the default is three partitions.

In [5]:
a = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7])

In [7]:
a.getNumPartitions()

3

We can also specify the number of partitions or slices when parallelizing a data structure:

In [10]:
a2 = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7], numSlices=2)

In [11]:
a2.getNumPartitions()

2
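
To make the idea of slices concrete, here is a pure-Python sketch (no Spark required) that cuts a list into contiguous, nearly equal slices. The helper `split_into_slices` is hypothetical, and the even-split rule is an assumption for illustration; Spark's actual placement of elements may differ.

```python
def split_into_slices(data, num_slices):
    """Cut `data` into `num_slices` contiguous, nearly equal slices (illustrative only)."""
    n = len(data)
    return [
        data[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

data = [7, 2, 3, 1, 2, 3, 4, 5, 6, 7]
three = split_into_slices(data, 3)  # mirrors the default under local[3]
two = split_into_slices(data, 2)    # mirrors numSlices=2

print(three)
print(two)
```

Each slice can be processed independently, which is why the slice count bounds the available parallelism.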

`fold` takes a "zero value" and a function. It reduces each partition of the RDD starting from the zero value, and then applies the function again, also starting from the zero value, once the per-partition results have been collected at the host. `fold` is a general version of `reduce` that also handles the case of empty data, where `reduce` raises an error.

In [None]:
a.collect()

In [None]:
def showAdd(x,y):
    return "({} + {})".format(str(x),str(y))

In [None]:
oneslice = sc.parallelize([1,2,3,4,5],1)
oneslice.reduce(showAdd)

In [None]:
oneslice.fold(1,showAdd)

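Like `reduce`, the `fold` operation only works well for commutative and associative operators because it is applied to each slice of an RDD independently. The zero value is also applied once per slice and once more when the slices are combined.

Recall that `a` was created without `numSlices`, so it has three slices (the default under `local[3]`); `a2` is the RDD with two slices.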

In [None]:
a.reduce(showAdd)

In [None]:
a.fold(1, showAdd)
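
The way the zero value enters the computation can be modeled in plain Python. This is a sketch of the semantics only, not Spark code: `model_fold` and `show_add` are hypothetical names, and the nested lists stand in for the slices of an RDD.

```python
from functools import reduce
import operator

def model_fold(partitions, zero, op):
    """Fold each partition from `zero`, then fold the partial results from `zero` again."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

def show_add(x, y):
    # Build a string that records the order in which values are combined.
    return "({} + {})".format(x, y)

one_slice = [[1, 2, 3, 4, 5]]
two_slices = [[1, 2], [3, 4]]

traced = model_fold(one_slice, 1, show_add)
sum_zero = model_fold(two_slices, 0, operator.add)  # zero value is the identity for +
sum_one = model_fold(two_slices, 1, operator.add)   # zero value is NOT the identity

print(traced)             # (1 + (((((1 + 1) + 2) + 3) + 4) + 5))
print(sum_zero, sum_one)  # 10 13
```

With a zero value of 1 and two slices, the 1 is added three times (once per slice and once in the final combine), so the result is 13 rather than 10. Choosing the identity element of the operator as the zero value avoids this.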

`aggregate` performs an operation like `fold` on each RDD partition and then uses a __combine function__ to join partitions.

For example, assume we have data:

In [None]:
twoPart = sc.parallelize([1,2,3,4], numSlices=2)

This data will (likely) be divided into `[1,2]` and `[3,4]`. Now assume we want to reduce it to two values: the first is the sum of the data (10) and the second is the length of the largest partition (likely 2).

We'll have two distinct functions: `seqOp` defines the operations within a partition and `combOp` defines how partitions are combined.

In [None]:
def seqOp( x, y):
    xSum, xLth = x
    return (xSum+y, xLth + 1)

In [None]:
def combOp( x, y ):
    xSum, xLth= x
    ySum, yLth= y
    return (xSum + ySum, max(xLth, yLth))

As with `fold`, we need a "zero value" to start folding.

In [None]:
twoPart.aggregate( (0,0), seqOp, combOp )

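The following diagram (lifted from this nice [StackOverflow article](https://stackoverflow.com/questions/28240706/explain-the-aggregate-functionality-in-spark)) shows how the data flows. The diagram's final step adds both components, which gives `(10, 4)`; the `combOp` defined above takes the maximum of the two lengths instead, so the call above returns `(10, 2)`.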
```
(0, 0) <-- zeroValue

[1, 2]                  [3, 4]

0 + 1 = 1               0 + 3 = 3
0 + 1 = 1               0 + 1 = 1

1 + 2 = 3               3 + 4 = 7
1 + 1 = 2               1 + 1 = 2       
    |                       |
    v                       v
  (3, 2)                  (7, 2)
      \                    / 
       \                  /
        \                /
         \              /
          \            /
           \          / 
           ------------
           |  combOp  |
           ------------
                |
                v
             (10, 4)
```
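
The same flow can be traced in plain Python. This is a sketch of the semantics rather than Spark code: `model_aggregate`, `comb_max`, and `comb_sum` are hypothetical names, and the two inner lists assume the data really is split into `[1, 2]` and `[3, 4]`.

```python
from functools import reduce

def model_aggregate(partitions, zero, seq_op, comb_op):
    """Apply seq_op within each partition, then comb_op across the partial results."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def seq_op(acc, value):
    total, length = acc
    return (total + value, length + 1)

def comb_max(x, y):
    # Sum of the data, length of the largest partition.
    return (x[0] + y[0], max(x[1], y[1]))

def comb_sum(x, y):
    # Sum of the data, total number of items.
    return (x[0] + y[0], x[1] + y[1])

parts = [[1, 2], [3, 4]]
with_max = model_aggregate(parts, (0, 0), seq_op, comb_max)
with_sum = model_aggregate(parts, (0, 0), seq_op, comb_sum)

print(with_max)  # (10, 2)
print(with_sum)  # (10, 4)
```

Both variants see the same per-partition results, `(3, 2)` and `(7, 2)`; only the combine step differs.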

## Filter & Sorting

`filter` keeps only the items of an RDD for which a predicate function returns true.

In [None]:
isEven = lambda x: x %2 == 0

print(a.collect())
print(a.filter(isEven).collect())

`filter` also works on parsed records. The next cells load `/etc/passwd` into an RDD and keep only the entry for `root`.

In [None]:
passwdLines = open('/etc/passwd', 'r').readlines()
passwd = sc.parallelize( passwdLines )

In [None]:
passwd.map( lambda x : x.split(':' ) )\
   .filter( lambda x : x[0] == 'root' )\
   .collect()

**sortBy** and **sortByKey** serve a similar role to **takeOrdered**, but they sort the RDD itself rather than only the returned results.

In [None]:
userAndShell = passwd.map( lambda x: x.rstrip('\n').split(':') )\
    .map( lambda y: ( y[0], y[6] ) )
userAndShell.take(3)

`sortBy(keyfunc, ascending=True)` takes a function that returns the sort key for each item.

In [None]:
userAndShell.take(3)

In [None]:
userAndShell.sortBy(lambda x : x[0] ).take(3)

In [None]:
userAndShell.sortBy(lambda x : x[1] ).take(3)

`sortByKey(ascending=True)` assumes the data is in (k,v) pairs. In this case, the result is the same as sorting by user name with `sortBy(lambda x : x[0])` above.

In [None]:
userAndShell.sortByKey().take(3)
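
The relationship between the two sorts can be checked without Spark. The sketch below uses a few hypothetical passwd-style lines (real files will differ) and Python's built-in `sorted` as a stand-in for the RDD operations.

```python
# Hypothetical passwd-style lines, used only for illustration.
lines = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n",
    "alice:x:1000:1000:Alice:/home/alice:/bin/zsh\n",
]

# Same parsing as userAndShell: field 0 is the user, field 6 is the shell.
fields = [line.rstrip("\n").split(":") for line in lines]
user_and_shell = [(f[0], f[6]) for f in fields]

by_user = sorted(user_and_shell, key=lambda x: x[0])    # like sortBy(lambda x: x[0])
by_shell = sorted(user_and_shell, key=lambda x: x[1])   # like sortBy(lambda x: x[1])
by_key = sorted(user_and_shell, key=lambda kv: kv[0])   # like sortByKey()

print(by_user)
print(by_shell)
print(by_key == by_user)  # True
```

Sorting by the first element of each pair and sorting by key give the same order because the key of a (k,v) pair is its first element.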

## Set Operations

**union** and **intersection** produce new RDDs whose elements can be thought of as members of a set, although **union** keeps duplicates while **intersection** removes them. **distinct** returns the unique set of items in an RDD (*i.e.*, converting a multi-set to a set). **sample**(withReplacement: Boolean, fraction: Float, [seed: int]) draws samples with or without replacement. Because `fraction` is the expected share of items rather than an exact count, `sample` produces more representative samples with larger datasets and can look erratic with small ones.

In [None]:
a.collect()

In [None]:
b.collect()

In [None]:
a.union(b).collect()

In [None]:
a.intersection(b).collect()

`subtract` removes items from the RDD that are contained in a second RDD.

In [None]:
print(a.collect(), " - ", b.collect(), " = ", a.subtract(b).collect())

In [None]:
a.distinct().collect()
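
How these operations treat duplicates can be modeled with plain Python lists and sets. This is a sketch of the semantics, not Spark code: the contents of `b_items` are hypothetical, and Spark does not guarantee the order of the results, so the sorted output here is only for readability.

```python
a_items = [7, 2, 3, 1, 2, 3, 4, 5, 6, 7]
b_items = [2, 4, 6, 8]  # hypothetical contents for the second collection

b_set = set(b_items)

union = a_items + b_items                           # duplicates are kept
intersection = sorted(set(a_items) & b_set)         # duplicates are removed
subtract = [x for x in a_items if x not in b_set]   # drops every item found in b
distinct = sorted(set(a_items))                     # multi-set to set

print(len(union))     # 14
print(intersection)   # [2, 4, 6]
print(subtract)       # [7, 3, 1, 3, 5, 7]
print(distinct)       # [1, 2, 3, 4, 5, 6, 7]
```

Applying `distinct` after `union` gives a true set union when duplicates are not wanted.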

In [None]:
print("A has ", a.count(), "items")
s = a.sample(True, 0.2)
print("The sample has ", s.count(), "items: ", s.collect())
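
The varying sample size on small data can be illustrated with a simple model. The sketch below assumes sampling without replacement keeps each item independently with probability `fraction`; `model_sample` is a hypothetical helper, and Spark's own sampler (especially with replacement, as in the cell above) may behave differently in detail.

```python
import random

def model_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction` (illustrative model)."""
    return [x for x in items if rng.random() < fraction]

rng = random.Random(0)  # fixed seed so the run is repeatable

small = list(range(10))
small_sizes = [len(model_sample(small, 0.2, rng)) for _ in range(5)]

big = list(range(100000))
big_share = len(model_sample(big, 0.2, rng)) / len(big)

print(small_sizes)  # sizes vary from run to run around the expected 2
print(big_share)    # close to 0.2
```

With ten items the sample size swings widely relative to its expected value, while with many items the sampled share settles near `fraction`.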