# PySpark Tutorial
<div>
 <h2> CSCI 4283 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

Spark was originally developed using Scala, although there are Python and Java interfaces as well. This tutorial covers [most of the RDD API](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds) using the Python bindings.

In [1]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

The `SparkContext` object tells Spark how to access a cluster. The `SparkConf` object defines information about our job.

The `master` of a Spark configuration is the cluster manager (such as YARN or Mesos). It can also be "local", meaning that the Spark job runs on your local machine, which is what we'll do here; the `[*]` notation means to use all the available cores. In general, you shouldn't hardcode the `master` setting, because doing so ties the script to one environment.

Spark uses a function chaining notation. We'll use that throughout unless it makes the code confusing.

In [2]:
conf=SparkConf().setAppName("pyspark tutorial").setMaster("local[*]")
sc = SparkContext(conf=conf)
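One way to avoid hardcoding the master is to leave it out of the script and supply it when the job is launched (for example with the `--master` option of `spark-submit`). Another is to read it from the environment, which is what the minimal sketch below does. The helper `resolve_master` and the variable name `PYSPARK_TUTORIAL_MASTER` are hypothetical and were chosen only for this illustration; Spark itself does not read that variable.

```python
import os

def resolve_master(default="local[*]", env=os.environ):
    """Return the Spark master URL to use.

    PYSPARK_TUTORIAL_MASTER is a hypothetical variable name used only in
    this sketch; Spark does not look at it on its own.
    """
    return env.get("PYSPARK_TUTORIAL_MASTER", default)

# Falls back to local mode when the variable is not set.
print(resolve_master(env={}))

# A value supplied from outside the script wins over the default.
print(resolve_master(env={"PYSPARK_TUTORIAL_MASTER": "yarn"}))
```

The result could then be passed to the configuration above, as in `SparkConf().setAppName("pyspark tutorial").setMaster(resolve_master())`.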

## Create Datasets
The basic Spark data structure is the RDD (resilient distributed dataset), which is essentially a vector distributed across the nodes of a cluster or held on the local system. In PySpark, the vector can contain a heterogeneous collection of types (strings, ints, etc.).

You can create an RDD from a list or tuple, read it from a local file, or read it from distributed storage such as HDFS or S3.

The following shows creating three datasets from lists using the _parallelize_ method. RDDs are broken into multiple slices, which are the unit of work allocation (*i.e.*, more slices give more potential for parallelism, but too many slices add too much overhead).

In [3]:
a = sc.parallelize([7, 2, 3, 1, 2, 3, 4, 5, 6, 7], numSlices=2)
b = sc.parallelize([2, 3, 99, 22, -77])
c = sc.parallelize([ (1,2), (2,3), (1, 99), (3, 44), (2, 1), (4,5), (3, 19) ] )
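To make the idea of slices concrete, here is a plain-Python model (no Spark required) that cuts a list into contiguous, near-equal pieces. It is only a sketch under that assumption; the helper `split_into_slices` is hypothetical, and Spark's actual placement may differ. On a real RDD, `a.glom().collect()` shows the contents of each slice.

```python
def split_into_slices(items, num_slices):
    """Cut a list into num_slices contiguous, near-equal pieces."""
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

data = [7, 2, 3, 1, 2, 3, 4, 5, 6, 7]

# Two slices of five items each.
print(split_into_slices(data, 2))

# With three slices the sizes differ by at most one item.
print([len(s) for s in split_into_slices(data, 3)])
```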

In [4]:
passwd = sc.textFile("/etc/passwd")

It is also possible to read and write binary data files, including data formatted in Hadoop Sequence types.

Spark also supports _accumulators_ and _broadcast variables_. Accumulators are designed to sum or aggregate values from across the cluster; they are really only suitable for commutative-associative operators. Broadcast variables are efficiently disseminated to all nodes in the cluster; they can be used for the equivalent of "map-side joins".
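The idea behind a map-side join can be sketched in plain Python: a small lookup table is made available to every worker, and each record of the large dataset is joined by a dictionary lookup inside a map step, so the large dataset never has to be shuffled. The data and the helper `join_with_lookup` below are hypothetical. In Spark, the lookup table would be wrapped with `sc.broadcast(...)` and read through its `.value` attribute inside the mapped function.

```python
# Small lookup table; in Spark this is the value you would broadcast.
shells_by_user = {"root": "/bin/bash", "daemon": "/usr/sbin/nologin"}

# Larger dataset of (user, event) records; illustrative values only.
events = [("root", "login"), ("daemon", "start"),
          ("root", "logout"), ("guest", "login")]

def join_with_lookup(record, lookup):
    """Attach the looked-up value to one record (None if the key is absent)."""
    user, event = record
    return (user, event, lookup.get(user))

# The "map" step: every record is joined independently of the others.
joined = [join_with_lookup(r, shells_by_user) for r in events]
print(joined)
```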

## Transformations and Actions
*Transformations* produce new RDDs by transforming existing RDDs, and *Actions* compute a result *from* an RDD and return it to the driver program.

### Actions

Some of the simplest actions are:
* count() - Return the number of items in the RDD
* take(_n_) - Extract and return the first _n_ items from the RDD
* first() - Return the first item; like take(1), but returns the item itself rather than a one-element list
* collect() - Same as take(count()) - **returns full RDD**, so use it only when the data fits in the driver's memory
* takeSample(_withReplacement_, _num_, [_seed_]) - extract a random set of _num_ items from the RDD with or without replacement.
* takeOrdered(_num_) - extract the first _num_ items of the RDD in ascending sorted order.

In [5]:
print(a.count())
print(a.take(2))
print(a.first())
print(a.collect())
print(a.takeSample(True, 3))
print(a.takeOrdered(4))

10
[7, 2]
7
[7, 2, 3, 1, 2, 3, 4, 5, 6, 7]
[3, 3, 4]
[1, 2, 2, 3]
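`takeOrdered` does not need a full sort of the data. One plausible way to compute it, sketched here in plain Python, is to keep only the smallest _num_ items of each slice and then merge those short lists. The helper `take_ordered` and the slice contents are assumptions made for this illustration, not a description of Spark's internals.

```python
import heapq
from functools import reduce

def take_ordered(partitions, num):
    """Return the num smallest items across all partitions, in order."""
    # Step 1: each partition keeps only its own num smallest items.
    per_partition = [heapq.nsmallest(num, p) for p in partitions]
    # Step 2: merge the short lists pairwise, trimming to num each time.
    return reduce(lambda x, y: heapq.nsmallest(num, x + y), per_partition)

# Assumed split of the ten values in `a` into two slices.
partitions = [[7, 2, 3, 1, 2], [3, 4, 5, 6, 7]]
print(take_ordered(partitions, 4))  # [1, 2, 2, 3]
```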


## Lambda Functions in Python

Lambda functions, or anonymous functions, are common in other languages (e.g. Scala) and are widely used in PySpark. A Python lambda is restricted to a single expression; it cannot contain statements such as assignments or loops.

In [6]:
(lambda x: x + x)(1)

2

In [7]:
add2 = lambda x: x + 2
add2(3)

5
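A lambda is interchangeable with a small named function, which the following sketch illustrates using Python's built-in `map`. The names `add2_def`, `add2_lambda` and `sign` are made up for this example.

```python
# A named function and the equivalent lambda.
def add2_def(x):
    return x + 2

add2_lambda = lambda x: x + 2

# Both can be passed wherever a function is expected.
print(list(map(add2_def, [1, 2, 3])))
print(list(map(add2_lambda, [1, 2, 3])))

# A conditional *expression* is allowed inside a lambda,
# whereas an if *statement* is not.
sign = lambda x: "negative" if x < 0 else "non-negative"
print(sign(-4), sign(4))
```

When the logic needs more than one expression, a regular `def` is the better choice; PySpark accepts either.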

## Map, Reduce & flatMap

`map` is a transformation that produces a new RDD by applying a function to every element. `reduce` is an action that combines the elements of an RDD into a single value by repeatedly applying a function to pairs of them. `map` takes a single-argument (unary) function (often a *lambda*), while `reduce` takes a two-argument (binary, or dyadic) function.

Examples:

In [8]:
a.map( lambda x: x**2 ).collect()

[49, 4, 9, 1, 4, 9, 16, 25, 36, 49]

In [9]:
c.map( lambda x: x[1] ).collect()

[2, 3, 99, 44, 1, 5, 19]

The following `lambda` sums the first elements and, separately, the second elements across all the tuples in `c`.

In [10]:
c.collect()

[(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]

In [11]:
c.reduce( lambda x,y: (x[0] + y[0], x[1] + y[1]) )

(16, 173)

That should produce the same result as the more complex example
below, which builds an RDD for each field of the tuple and then
adds the values of each one using reduce. The operator.add function is "+".

In [12]:
( c.map( lambda x : x[0] ).reduce(operator.add), 
  c.map( lambda x: x[1] ).reduce(operator.add) )

(16, 173)
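The same two computations can be checked without Spark using Python's built-in `map` and `functools.reduce`, which have the same single-machine meaning. This is only a local sketch; the name `c_local` is made up and holds a plain-list copy of the data in `c`.

```python
from functools import reduce
import operator

# Plain-list copy of the data held in the RDD `c`.
c_local = [(1, 2), (2, 3), (1, 99), (3, 44), (2, 1), (4, 5), (3, 19)]

# One reduce over the tuples, adding both fields at once.
pairwise = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), c_local)

# One map plus reduce per field.
per_field = (reduce(operator.add, map(lambda x: x[0], c_local)),
             reduce(operator.add, map(lambda x: x[1], c_local)))

print(pairwise)   # (16, 173)
print(per_field)  # (16, 173)
```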

`flatMap` applies a function that returns a list (or other iterable) to each element, as `map` does, but then flattens the results so that the items of every returned list become elements of the new RDD. This is useful when processing a set of tuples, or when breaking documents into words and then processing the words rather than lines of words.

In [13]:
sent = sc.parallelize(["these are some", "sample words" ])

In [14]:
sent.map( lambda x : x.split() ).collect()

[['these', 'are', 'some'], ['sample', 'words']]

In [15]:
sent.flatMap( lambda x : x.split() ).collect()

['these', 'are', 'some', 'sample', 'words']
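The difference between the two calls can be modelled in plain Python: `map` corresponds to a list comprehension, and `flatMap` corresponds to the same comprehension followed by `itertools.chain.from_iterable`. This is a local sketch only; the variable names are made up.

```python
from itertools import chain

sentences = ["these are some", "sample words"]

# map: one output item (a list of words) per input item.
mapped = [s.split() for s in sentences]

# flatMap: the per-item lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(s.split() for s in sentences))

print(mapped)
print(flat_mapped)
```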

`fold` takes a "zero value" and a function. It reduces each slice of the RDD starting from the zero value, and then reduces the per-slice results, again starting from the zero value, once they have been collected at the host. `fold` is a more general version of `reduce`: because it always has a zero value to start from, it also handles empty data, where `reduce` raises an error.

In [16]:
a.collect()

[7, 2, 3, 1, 2, 3, 4, 5, 6, 7]

In [17]:
def showAdd(x,y):
    return "({} + {})".format(str(x),str(y))

In [18]:
oneslice = sc.parallelize([1,2,3,4,5],1)
oneslice.reduce(showAdd)

'((((1 + 2) + 3) + 4) + 5)'

In [19]:
oneslice.fold(1,showAdd)

'(1 + (((((1 + 1) + 2) + 3) + 4) + 5))'

Like `reduce`, the `fold` operation really only works well for commutative-associative operators because it's applied to each slice of an RDD independently.

Recall that we explicitly specified that `a` should have two slices.

In [20]:
a.reduce(showAdd)

'(((((7 + 2) + 3) + 1) + 2) + ((((3 + 4) + 5) + 6) + 7))'

In [21]:
a.fold(1, showAdd)

'((1 + (((((1 + 7) + 2) + 3) + 1) + 2)) + (((((1 + 3) + 4) + 5) + 6) + 7))'
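The nesting in these strings can be reproduced with a small plain-Python model of `fold`: reduce each slice starting from the zero value, then reduce the per-slice results starting from the zero value again. The helper `model_fold` and the slice contents are assumptions made for this sketch. It also shows why the zero value should be the identity of the operator: with addition, a zero value of 1 is counted once per slice and once more at the end.

```python
from functools import reduce
import operator

def show_add(x, y):
    """Record an addition as a string instead of performing it."""
    return "({} + {})".format(x, y)

def model_fold(partitions, zero, op):
    """Fold each partition from `zero`, then fold the partial results."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)

# Assumed split of the ten values in `a` into two slices.
partitions = [[7, 2, 3, 1, 2], [3, 4, 5, 6, 7]]

print(model_fold(partitions, 1, show_add))
print(model_fold(partitions, 0, operator.add))  # 40, the true sum
print(model_fold(partitions, 1, operator.add))  # 43, zero value added 3 times
```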

`aggregate` performs an operation like `fold` on each RDD partition and then uses a __combine function__ to merge the per-partition results.

For example, assume we have data:

In [22]:
twoPart = sc.parallelize([1,2,3,4], numSlices=2)

This data will (likely) be divided into `[1,2]` and `[3,4]`. Now assume we want to compute two values in one pass -- the first is the sum of the data (10) and the second is the length of the largest partition (likely 2).

We'll have two distinct functions -- `seqOp` defines the operation applied within a partition and `combOp` defines how the per-partition results are combined.

In [23]:
def seqOp( x, y):
    xSum, xLth = x
    return (xSum+y, xLth + 1)

In [24]:
def combOp( x, y ):
    xSum, xLth= x
    ySum, yLth= y
    return (xSum + ySum, max(xLth, yLth))

As with `fold`, we need a "zero-value" to start folding.

In [25]:
twoPart.aggregate( (0,0), seqOp, combOp )

(10, 2)

The following diagram (lifted from this nice [StackOverflow article](https://stackoverflow.com/questions/28240706/explain-the-aggregate-functionality-in-spark)) shows how the data flows.
```
(0, 0) <-- zeroValue

[1, 2]                  [3, 4]

0 + 1 = 1               0 + 3 = 3
0 + 1 = 1               0 + 1 = 1

1 + 2 = 3               3 + 4 = 7
1 + 1 = 2               1 + 1 = 2       
    |                       |
    v                       v
  (3, 2)                  (7, 2)
      \                    / 
       \                  /
        \                /
         \              /
          \            /
           \          / 
           ------------
           |  combOp  |
           ------------
                |
                v
             (10, 2)
```
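The flow can also be modelled in plain Python. The sketch below assumes the two partitions `[1, 2]` and `[3, 4]`; the helpers `model_aggregate`, `comb_max` and `comb_sum` are hypothetical names for this illustration. It runs the `max`-based combine function defined above and, for comparison, a variant that adds the lengths, which yields the total count and therefore the mean.

```python
from functools import reduce

def model_aggregate(partitions, zero, seq_op, comb_op):
    """Apply seq_op within each partition, then comb_op across partitions."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

def seq_op(acc, value):
    """Add one value to a (sum, length) accumulator."""
    total, length = acc
    return (total + value, length + 1)

def comb_max(x, y):
    """Add the sums and keep the larger partition length."""
    return (x[0] + y[0], max(x[1], y[1]))

def comb_sum(x, y):
    """Add the sums and add the lengths."""
    return (x[0] + y[0], x[1] + y[1])

partitions = [[1, 2], [3, 4]]

largest = model_aggregate(partitions, (0, 0), seq_op, comb_max)
print(largest)          # (10, 2)

total, count = model_aggregate(partitions, (0, 0), seq_op, comb_sum)
print(total, count)     # 10 4
print(total / count)    # 2.5
```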

## Filter & Sorting

`filter` keeps only the items of an RDD for which a given function returns true and drops the rest.

In [26]:
isEven = lambda x: x %2 == 0

print(a.collect())
print(a.filter(isEven).collect())

[7, 2, 3, 1, 2, 3, 4, 5, 6, 7]
[2, 2, 4, 6]


In [27]:
passwd.map( lambda x : x.split(':' ) )\
   .filter( lambda x : x[0] == 'root' )\
   .collect()

[['root', 'x', '0', '0', 'root', '/root', '/bin/bash']]

**sortBy** and **sortByKey** play a role similar to takeOrdered, but they sort the RDD itself and produce a new RDD rather than returning sorted results.

In [28]:
with open('/etc/passwd', 'r') as f:   # close the file once it has been read
    passwdLines = f.readlines()
passwd = sc.parallelize( passwdLines )

In [29]:
userAndShell = passwd.map( lambda x: x.rstrip('\n').split(':') )\
    .map( lambda y: ( y[0], y[6] ) )
userAndShell.take(3)

[('root', '/bin/bash'),
 ('daemon', '/usr/sbin/nologin'),
 ('bin', '/usr/sbin/nologin')]
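Indexing `y[6]` assumes that every line has seven colon-separated fields. On systems where the file contains blank or comment lines, that assumption may not hold. A defensive parser could look like the sketch below; the helper `parse_passwd_line` and the sample lines are hypothetical.

```python
def parse_passwd_line(line):
    """Return (user, shell) for a well-formed passwd line, else None."""
    line = line.strip()
    # Skip blank lines and comment lines.
    if not line or line.startswith("#"):
        return None
    fields = line.split(":")
    # A complete entry has seven colon-separated fields.
    if len(fields) < 7:
        return None
    return (fields[0], fields[6])

# Made-up input, including lines that should be skipped.
sample_lines = [
    "# comment line\n",
    "root:x:0:0:root:/root:/bin/bash\n",
    "\n",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n",
    "broken:line\n",
]

parsed = [p for p in map(parse_passwd_line, sample_lines) if p is not None]
print(parsed)
```

With an RDD, the same helper could be used as `passwd.map(parse_passwd_line).filter(lambda p: p is not None)`.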

`sortBy(keyfunc, ascending=True)` takes a function that returns the sort key for each element.

In [30]:
userAndShell.take(3)

[('root', '/bin/bash'),
 ('daemon', '/usr/sbin/nologin'),
 ('bin', '/usr/sbin/nologin')]

In [31]:
userAndShell.sortBy(lambda x : x[0] ).take(3)

[('_apt', '/usr/sbin/nologin'),
 ('backup', '/usr/sbin/nologin'),
 ('bin', '/usr/sbin/nologin')]

In [32]:
userAndShell.sortBy(lambda x : x[1] ).take(3)

[('root', '/bin/bash'), ('jovyan', '/bin/bash'), ('sync', '/bin/sync')]

`sortByKey(ascending=True)` assumes the data is in (k,v) pairs and sorts by k. In this case, the result is the same as sorting by `x[0]` above.

In [33]:
userAndShell.sortByKey().take(3)

[('_apt', '/usr/sbin/nologin'),
 ('backup', '/usr/sbin/nologin'),
 ('bin', '/usr/sbin/nologin')]
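The key-function style used by `sortBy` matches Python's built-in `sorted`, so the orderings can be tried locally first. The list `pairs` below is a made-up subset of the entries shown in the outputs above. A descending sort corresponds to `ascending=False` in `sortBy`, and a tuple key sorts by several fields at once.

```python
# Made-up subset of (user, shell) pairs for local experimentation.
pairs = [("root", "/bin/bash"),
         ("daemon", "/usr/sbin/nologin"),
         ("bin", "/usr/sbin/nologin"),
         ("_apt", "/usr/sbin/nologin"),
         ("backup", "/usr/sbin/nologin")]

# Ascending by user name.
by_user = sorted(pairs, key=lambda x: x[0])

# Descending by user name.
by_user_desc = sorted(pairs, key=lambda x: x[0], reverse=True)

# By shell first, then by user name to break ties.
by_shell_then_user = sorted(pairs, key=lambda x: (x[1], x[0]))

print(by_user[:3])
print(by_user_desc[0])
print(by_shell_then_user[:2])
```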

## Set Operations

**union** and **intersection** produce new RDDs where the elements can be thought of as being in a set, although `union` keeps duplicates. **distinct** returns the unique set of items in an RDD (*i.e.* converting a multi-set to a set). **sample**(withReplacement, fraction, [seed]) draws a random sample with or without replacement; `fraction` is the expected proportion of items to draw, not an exact count. As a result, samples are more representative with larger datasets, and the sample size can look erratic with small ones.

In [34]:
a.collect()

[7, 2, 3, 1, 2, 3, 4, 5, 6, 7]

In [35]:
b.collect()

[2, 3, 99, 22, -77]

In [36]:
a.union(b).collect()

[7, 2, 3, 1, 2, 3, 4, 5, 6, 7, 2, 3, 99, 22, -77]

In [37]:
a.intersection(b).collect()

[2, 3]

`subtract` removes items from the RDD that are contained in a second RDD. Duplicates of the items that remain are kept.

In [38]:
print(a.collect(), " - ", b.collect(), " = ", a.subtract(b).collect())

[7, 2, 3, 1, 2, 3, 4, 5, 6, 7]  -  [2, 3, 99, 22, -77]  =  [4, 1, 5, 6, 7, 7]


In [39]:
a.distinct().collect()

[2, 4, 6, 7, 3, 1, 5]
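These results can be cross-checked with plain Python lists and sets. The sketch below uses local copies of the data (`a_local` and `b_local` are made-up names) and sorts the results before printing, since the order of the items in the Spark outputs above is not sorted.

```python
a_local = [7, 2, 3, 1, 2, 3, 4, 5, 6, 7]
b_local = [2, 3, 99, 22, -77]

# union: simple concatenation, duplicates are kept.
union = a_local + b_local

# intersection: items present in both, without duplicates.
intersection = sorted(set(a_local) & set(b_local))

# subtract: items of a that do not appear in b; duplicates in a are kept.
b_set = set(b_local)
subtract = sorted(x for x in a_local if x not in b_set)

# distinct: each item once.
distinct = sorted(set(a_local))

print(len(union), intersection, subtract, distinct)
```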

In [40]:
print("A has ", a.count(), "items")
s = a.sample(True, 0.2)
print("The sample has ", s.count(), "items: ", s.collect())

A has  10 items
The sample has  3 items:  [2, 7, 7]
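The varying sample size is easier to understand with a model of per-item sampling. The plain-Python sketch below assumes that, without replacement, each item is kept independently with probability `fraction`, and that, with replacement, each item is repeated a Poisson-distributed number of times with mean `fraction`. All helper names are hypothetical, and Spark's internals may differ from this model.

```python
import math
import random

def bernoulli_sample(items, fraction, rng):
    """Without replacement: keep each item with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def poisson_sample(items, fraction, rng):
    """With replacement: repeat each item a Poisson(fraction) number of times."""
    out = []
    for x in items:
        out.extend([x] * poisson_draw(fraction, rng))
    return out

data = [7, 2, 3, 1, 2, 3, 4, 5, 6, 7]

# The sample size changes from seed to seed; only its average is 0.2 * 10.
sizes = [len(poisson_sample(data, 0.2, random.Random(seed)))
         for seed in range(5)]
print(sizes)
```

Under this model a fraction of 0.2 on ten items gives two items on average, but any single run can return fewer or more.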
