<img src="img/pySparkLogo.png" style="width: 400px;"/>
# PySpark Hands-on Training
PySpark 2.2.0

http://spark.apache.org/docs/latest/api/python/pyspark.html

In [1]:
import pyspark
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local').setAppName('myApp')
sc = SparkContext(conf=conf)

## Prerequisites
This tutorial is based on Jupyter Notebook. Make sure you have Python and Jupyter Notebook installed.

http://jupyter.readthedocs.io/en/latest/install.html

In [1]:
! python --version

Python 3.6.2 :: Anaconda, Inc.


In [2]:
! jupyter --version

4.3.0


# Agenda
* Installation
* Public Classes
    * SparkConf
    * SparkContext
    * RDD
    * Broadcast
    * Accumulator
    * SparkFiles
    * StorageLevel
    * Serializer
    * StatusTracker
    * Profiler
    * TaskContext
* Subpackages
    * SQL
    * Streaming
    * ML Pipeline
    * MLlib

<img style="float: right; width: 100px;" src="img/icons/download.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# Install  PySpark
## Prerequisites
Java is required for this installation. Make sure Java is installed and the `JAVA_HOME` path is set properly.

In [5]:
! java -version

openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)


In [5]:
! echo $JAVA_HOME
#/usr/lib/jvm/java-8-openjdk-amd64

/usr/lib/jvm/java-8-openjdk-amd64


<img style="float: right; width: 100px;" src="img/icons/download.png"/>
## Install PySpark with Conda
To install PySpark with conda run the following command in your terminal:

`conda install -c conda-forge pyspark`

## Install PySpark without Conda

To install PySpark without conda, download Spark from http://spark.apache.org/downloads.html and extract it. Then open your terminal, `cd` to the created folder (e.g. *spark-2.2.0-bin-hadoop2.7*) and type `bin/pyspark` to start the Spark shell.

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>

# `pyspark.SparkConf()`
`class pyspark.SparkConf(loadDefaults=True, _jvm=None, _jconf=None)`

Create your configuration for a Spark application. All properties are set as key-value pairs.

<img style="float: right; width: 100px;" src="img/icons/info.png"/>
## Spark application properties
* **`spark.app.name`** (none)
* `spark.callerContext`: application information (none)

<img style="float: right; width: 100px;" src="img/icons/info.png"/>
## Spark application properties
* `spark.driver.cores`: only in cluster mode (1)
* `spark.driver.maxResultSize`: max size of serialized results of all partitions for each Spark action (1g)
* **`spark.driver.memory`:** for driver process (1g)
* `spark.driver.supervise`: restarts driver automatically in standalone or Mesos mode if true (false)

<img style="float: right; width: 100px;" src="img/icons/info.png"/>
## Spark application properties
* `spark.executor.memory`: per executor process (1g) 
* `spark.extraListeners`: list of classes that implement *SparkListener* (none)
* **`spark.local.dir`:** for storing files and RDDs (/tmp)

<img style="float: right; width: 100px;" src="img/icons/info.png"/>
## Spark application properties
* `spark.logConf`: log SparkConf as INFO (false)
* **`spark.master`:** cluster manager URL (none)
* **`spark.submit.deployMode`**: `client` launches the driver locally, `cluster` launches it on one of the nodes inside the cluster (none)

See https://spark.apache.org/docs/latest/configuration.html for a full list of properties.

(default values in brackets)

* ***spark.local.dir:*** for storing files and RDDs; overridden by standalone's or Mesos' SPARK_LOCAL_DIRS and Yarn's LOCAL_DIRS environment variables when they are set (/tmp)


<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf()` methods
* `contains(key)`
* `get(key, defaultValue=None)`
* `getAll()`
* `toDebugString()`: printable configuration.

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf()` methods
* `set(key, value)`: set a configuration property.
* `setAll(pairs)`: set multiple parameters, passed as a list of key-value pairs.
* `setIfMissing(key, value)`: set a configuration property, if not already set.

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf()` methods
* `setAppName(value)`: set `spark.app.name` property.
* `setExecutorEnv(key=None, value=None, pairs=None)`: set an environment variable to be passed to executors.
* **`setMaster(value)`**: set master URL (`spark.master` property).
* `setSparkHome(value)`: set Spark installation path on worker nodes.


<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf().setMaster(value)`
* **local mode:** K worker threads, F maxFailures (4)
    * local
    * local[K]
    * local[K,F]
    * local[\*]
    * local[\*,F]
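
The `local[...]` forms above can be read mechanically. The following is a minimal plain-Python sketch of that reading. `parse_local_master` is a hypothetical helper written for this tutorial, not part of the PySpark API, and it makes no claim about how Spark parses the string internally.

```python
import re

# Hypothetical helper (not a PySpark function): splits a local master URL
# into its worker-thread count K and its optional maxFailures value F.
_LOCAL_MASTER = re.compile(r"^local(?:\[(\*|\d+)(?:\s*,\s*(\d+))?\])?$")

def parse_local_master(master):
    match = _LOCAL_MASTER.match(master)
    if match is None:
        raise ValueError("not a local master URL: %r" % master)
    threads, max_failures = match.groups()
    if threads is None:
        threads = 1             # plain 'local' runs with a single worker thread
    elif threads != "*":
        threads = int(threads)  # '*' means: as many threads as logical cores
    # F is only known when it is written explicitly in the URL
    max_failures = int(max_failures) if max_failures is not None else None
    return threads, max_failures

for url in ["local", "local[4]", "local[4,2]", "local[*]", "local[*,3]"]:
    print(url, "->", parse_local_master(url))
```

The helper returns `None` for F when the URL does not spell it out, so it does not assume any default.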

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf().setMaster(value)`
* **standalone cluster manager** (port 7077 by default)
    * spark://HOST:PORT 
    * spark://HOST1:PORT1,HOST2:PORT2 for Zookeeper

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf().setMaster(value)`
* **Mesos** (port 5050 by default)
    * mesos://HOST:PORT
    * mesos://zk://....

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf().setMaster(value)`
* **Yarn:** cluster location based on HADOOP_CONF_DIR or YARN_CONF_DIR variable
    * yarn
    
https://spark.apache.org/docs/latest/submitting-applications.html#master-urls

To set any other property, use `conf.set(key, value)`; e.g. `conf.set('spark.submit.deployMode', 'client')` changes the *spark.submit.deployMode* property to "client".

https://spark.apache.org/docs/latest/configuration.html

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkConf

<img style="float: right; width: 100px;" src="img/icons/settings.png"/>
## `pyspark.SparkConf()`

In [34]:
import pyspark
from pyspark import SparkConf

# all setter methods support chaining
conf = SparkConf().setMaster('local').setAppName('myApp')

In [8]:
conf.contains('spark.master')

True

In [11]:
conf.get('spark.master')

'local'

In [12]:
conf.getAll()

dict_items([('spark.master', 'local'), ('spark.app.name', 'myApp')])

In [13]:
print(conf.toDebugString())

spark.master=local
spark.app.name=myApp


In [14]:
# [optional:] set other properties using the set method
conf.set('spark.driver.memory', '2g')
conf.set('spark.submit.deployMode', 'client')

<pyspark.conf.SparkConf at 0x7f8cb0407a20>

In [15]:
conf.getAll()

dict_items([('spark.master', 'local'), ('spark.app.name', 'myApp'), ('spark.driver.memory', '2g'), ('spark.submit.deployMode', 'client')])

In [16]:
print(conf.toDebugString())

spark.master=local
spark.app.name=myApp
spark.driver.memory=2g
spark.submit.deployMode=client


<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.SparkContext()`
`class pyspark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>)`

The entry point of your **driver program**: a SparkContext represents the connection to a Spark cluster and is used to create RDDs, accumulators and broadcast variables on that cluster.

Pass your created config as argument into the constructor `SparkContext(conf=yourConfig)`.

In [17]:
from pyspark import SparkContext

# create SparkContext: Driver program
sc = SparkContext(conf=conf)

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
* Basic properties
* Basic methods
* Shared variables
* Dependencies
* RDDs
* Checkpointing
* Jobs

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/connect.png"/>
    <img style="width: 100px;" src="img/icons/info.png"/>
</div>
# `pyspark.SparkContext()`

## Basic properties

* `applicationId`: unique identifier for the Spark application.
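* `defaultMinPartitions`: default minimum number of partitions for Hadoop RDDs, used if not specified by the user.
* `defaultParallelism`: default level of parallelism (e.g. for reduce tasks), used if not specified by the user.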
* `startTime`: returns epoch time.
* `uiWebUrl`: URL of the SparkUI.
* `version`: of Spark.

In [18]:
sc.applicationId

'local-1509008374873'

In [19]:
sc.startTime

1509008373917

In [20]:
sc.uiWebUrl

'http://192.168.1.198:4040'

In [21]:
sc.version

'2.2.0'

In [22]:
sc.defaultParallelism

1

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## Basic methods 
* `dump_profiles(path)`: save profile stats to disk.
* **`getConf()`**: returns SparkConf.
* `getLocalProperty(key)`: gets a local property set in this thread, or `None` if it is missing.
* `setLocalProperty(key, value)`: sets a local property that affects jobs submitted from this thread.
* `show_profiles()`: print profile stats.
* `sparkUser()`: get SPARK_USER who is running SparkContext.
* **`stop()`**: shut down SparkContext.

In [17]:
sc.getConf()

<pyspark.conf.SparkConf at 0x7f1d9f8e7748>

In [18]:
sc.sparkUser()

'dan'

In [58]:
sc.stop()

In [19]:
sc._conf.get('spark.driver.memory')

'2g'

In [23]:
sc.getConf().getAll()

[('spark.app.id', 'local-1509008374873'),
 ('spark.master', 'local'),
 ('spark.app.name', 'myApp'),
 ('spark.driver.host', '192.168.1.198'),
 ('spark.driver.port', '35975'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.memory', '2g'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client')]

In [4]:
sc.serializer

AutoBatchedSerializer(PickleSerializer())

In [33]:
# Profiler: Python profiling is enabled through the spark.python.profile property
from pyspark import SparkContext, SparkConf
from pyspark import BasicProfiler

# Optional custom profiler (pass it as profiler_cls=MyCustomProfiler below):
# class MyCustomProfiler(BasicProfiler):
#     def show(self, id):
#         print("My custom profiles for RDD:%s" % id)

conf = SparkConf().set('spark.python.profile', 'true')
sc = SparkContext('local', 'test', conf=conf)  # , profiler_cls=MyCustomProfiler
sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)
sc.parallelize(range(1000)).count()
# sc.show_profiles()  # uncomment to print the collected profile stats
sc.stop()

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`

## Shared variables
* `accumulator(value, accum_param=None)`: creates **Accumulator** with given initial *value* and optional **AccumulatorParam** helper object to define how to add values (integer or floating-point numbers by default).
* `broadcast(value)`: broadcasts a read-only variable to the cluster and returns a **Broadcast** object for reading it in distributed functions.

In [5]:
sc.accumulator(0)

Accumulator<id=0, value=0>

In [30]:
sc.broadcast({'a': 0.1, 'b': 0.3, 'c': 0.2})

<pyspark.broadcast.Broadcast at 0x7f60b0f09828>

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## Dependencies

* `addFile(path, recursive=False)`: download file to every node.
* `addPyFile(path)`: adds a .py or .zip dependency for all tasks to be executed on this SparkContext.

In [32]:
from pyspark import SparkFiles

sc.addFile("data/radiohead.txt")
# use SparkFiles.get(fileName) to find the download location
SparkFiles.get("radiohead.txt")

'/tmp/spark-03203490-5004-470d-bb02-873597d2fad6/userFiles-c7295a1f-06e7-4afd-89ca-208b71cc4553/radiohead.txt'

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## RDDs
### Create
* `emptyRDD()`
* `parallelize(c, numSlices=None)`: distributes a **local Python collection to form an RDD**. If the input represents a range, passing `xrange` (`range` in Python 3) is recommended for performance.
* `range(start, end=None, step=1, numSlices=None)`: creates a new RDD of **int** containing elements from *start* to *end* (exclusive), increased by *step* every element. If called with a single argument, the argument is interpreted as end, and start is set to 0. Uses `parallelize(xrange(start, end, step), numSlices)` internally.

In [10]:
sc.emptyRDD().collect()

[]

In [11]:
sc.parallelize([1, 2, 3]).collect()

[1, 2, 3]

In [12]:
sc.range(3).collect()

[0, 1, 2]

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## RDDs
### Read Files
* `textFile(path, minPartitions=None, use_unicode=True)`: reads a text file and returns it as an **RDD of Strings**.
* `pickleFile(path, minPartitions=None)`: loads an RDD previously saved using `RDD.saveAsPickleFile` method.
* `sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)`: reads a Hadoop SequenceFile:
    1. A **Java RDD** is created from the SequenceFile or other InputFormat, the key and value Writable classes.
    2. **Serialization** is attempted via **Pyrolite pickling**.
    3. If this fails, the fallback is to call ‘toString’ on each key and value.
    4. **PickleSerializer** is used to deserialize pickled objects on the Python side.

In [28]:
sc.textFile("data/radiohead.txt").collect()

['Radiohead are an English rock band from Abingdon, Oxfordshire, formed in 1985.',
 "The band consists of Thom Yorke (lead vocals, guitar, piano, keyboards), Jonny Greenwood (lead guitar, keyboards, other instruments), Ed O'Brien (guitar, backing vocals), Colin Greenwood (bass), and Phil Selway (drums, percussion, backing vocals).",
 'They have worked with producer Nigel Godrich and cover artist Stanley Donwood since 1994.']

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## RDDs
### Read Directories
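* `wholeTextFiles(path, minPartitions=None, use_unicode=True)`: reads a directory of text files and returns a **key-value pair RDD with (path, content)**.
* `binaryFiles(path, minPartitions=None)`: experimental; reads a directory of binary files and returns a **key-value pair RDD with (path, content as bytes)**.
* `binaryRecords(path, recordLength)`: experimental; loads data from a flat binary file, assuming each record is a sequence of bytes of the fixed length *recordLength*.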

In [29]:
sorted(sc.wholeTextFiles("data/slices").collect())

[('file:/home/dan/git/u42/presentations/ApacheSpark/data/slices/r1.txt',
  'Radiohead are an English rock band from Abingdon, Oxfordshire, formed in 1985.\n'),
 ('file:/home/dan/git/u42/presentations/ApacheSpark/data/slices/r2.txt',
  "The band consists of Thom Yorke (lead vocals, guitar, piano, keyboards), Jonny Greenwood (lead guitar, keyboards, other instruments), Ed O'Brien (guitar, backing vocals), Colin Greenwood (bass), and Phil Selway (drums, percussion, backing vocals).\n"),
 ('file:/home/dan/git/u42/presentations/ApacheSpark/data/slices/r3.txt',
  'They have worked with producer Nigel Godrich and cover artist Stanley Donwood since 1994.\n')]

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## RDDs
### Read from Hadoop: File
Same mechanism used as in `sequenceFile(path)`
* `hadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)`
* `newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)`

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## RDDs
### Read from Hadoop: RDD
Same mechanism used as in `sequenceFile(path)`
* `hadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)`
* `newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)`

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## RDDs
### Combine
* `union(rdds)`: builds the union of a list of RDDs.

In [23]:
rdd1 = sc.parallelize([1])
rdd2 = sc.parallelize([2])
rdd3 = sc.parallelize([3])
sc.union([rdd1, rdd2, rdd3]).collect()

[1, 2, 3]

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## Checkpointing
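* `setCheckpointDir(dirName)`: sets the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.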

<img style="float: right; width: 100px;" src="img/icons/connect.png"/>
# `pyspark.SparkContext()`
## Jobs
* `statusTracker()`: returns **StatusTracker** object for monitoring job and stage progress.
* `setJobGroup(groupId, description, interruptOnCancel=False)`: assigns a *groupId* to all the jobs started by this thread.
* `runJob(rdd, partitionFunc, partitions=None, allowLocal=False)`: executes the given *partitionFunc* on the specified set of partitions, returning the result as an array of elements.
* `cancelJobGroup(groupId)`
* `cancelAllJobs()`

In [19]:
sc.statusTracker()

<pyspark.status.StatusTracker at 0x7feaa32543c8>

In [20]:
myRDD = sc.parallelize(range(6), 3)
sc.runJob(myRDD, lambda part: [x * x for x in part])

[0, 1, 4, 9, 16, 25]
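
What `runJob` computes in the cell above can be imitated without Spark. This sketch assumes the six elements are split evenly over three partitions. `split_evenly` and `run_job` are hypothetical stand-ins written for illustration, not PySpark functions.

```python
def split_evenly(data, num_slices):
    """Cut a list into num_slices contiguous chunks (stand-in for partitions)."""
    data = list(data)
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def run_job(partitions, partition_func, wanted=None):
    """Apply partition_func to each selected partition and concatenate the results."""
    if wanted is None:
        wanted = range(len(partitions))   # default: run on every partition
    result = []
    for index in wanted:
        result.extend(partition_func(iter(partitions[index])))
    return result

parts = split_evenly(range(6), 3)          # [[0, 1], [2, 3], [4, 5]]
squares = run_job(parts, lambda part: [x * x for x in part])
first_only = run_job(parts, lambda part: [x * x for x in part], wanted=[0])
print(squares)      # [0, 1, 4, 9, 16, 25]
print(first_only)   # [0, 1]
```

The `wanted` argument plays the role of the *partitions* parameter: restricting it to `[0]` evaluates the function on the first partition only.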

<img style="float: right; width: 100px;" src="img/icons/data.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.RDD(jrdd, ctx)`
`class pyspark.RDD(jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))`

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

* Properties
* Basic methods
* Actions
* Transformations


<img style="float: right; width: 100px;" src="img/icons/data.png"/>
## `pyspark.RDD(jrdd, ctx)` properties
* `_jrdd`: Java RDD
* `is_cached`
* `is_checkpointed`
* `ctx`: SparkContext
* `_jrdd_deserializer` (`AutoBatchedSerializer(PickleSerializer())`)
* `_id`
* `partitioner` (None)

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
* Meta data
* Basic information
* Cache / Persist
* Checkpoint
* Save
* Iterate

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Meta data
* `context`: returns **SparkContext** (`.ctx`)
* `id()`
* `name()`
* `setName(name)`
* `toDebugString()`: RDD description with recursive dependencies.

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Basic information
* `isEmpty()`
* `getNumPartitions()`
* `getStorageLevel()`

In [2]:
sc.parallelize([]).isEmpty()

True

In [6]:
rdd = sc.range(10, numSlices=3)

In [5]:
rdd.isEmpty()

False

In [4]:
rdd.getNumPartitions()

3

In [143]:
sc.range(10, numSlices=2).getNumPartitions()

2

In [6]:
print(rdd.getStorageLevel())


Serialized 1x Replicated


<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Cache / Persist
Keeps the RDD in the cache after it is computed for the first time (lazy operation); the lineage stays intact.
* `cache()`: persist with default (**MEMORY_ONLY**) storage level.
* `persist()`: persist with specified storage level (see *StorageLevel* class).
* `unpersist()`

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Checkpoint
Stores the RDD persistently (not deleted after the Spark application terminates) and removes its lineage.
* `checkpoint()`: mark for checkpointing, save RDD to checkpoint directory and remove all parent references.
* `localCheckpoint()`: sacrifices fault-tolerance for performance.
* `getCheckpointFile()`: file name (not defined for locally checkpointed RDDs).
* `isCheckpointed()`
* `isLocallyCheckpointed()`

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Save (1)
* `saveAsHadoopDataset(conf, keyConverter=None, valueConverter=None)`
* `saveAsNewAPIHadoopDataset(conf, keyConverter=None, valueConverter=None)`

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Save (2)
* `saveAsHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None, compressionCodecClass=None)`
* `saveAsNewAPIHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None)`

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Save (3)
* `saveAsPickleFile(path, batchSize=10)`: SequenceFile of serialized objects using *pyspark.serializers.PickleSerializer*. Python equivalent to Scala's `saveAsObjectFile()` method for **saving RDDs only containing values**.
* `saveAsSequenceFile(path, compressionCodecClass=None)`: **Convert pickled PythonRDD** (RDD[(K, V)]) into **RDD of Java objects using Pyrolite** and convert the keys and values into **org.apache.hadoop.io.Writable** types. Calls `saveAsHadoopFile()` internally.
* `saveAsTextFile(path, compressionCodecClass=None)`: using string representations of elements calling `BytesToString()`.

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/banana.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` basic methods
### Iterator
**These methods do not return a new RDD so the results are not saved!**
* `foreach(f)`: applies a function to all elements of this RDD.
* `foreachPartition(f)`: applies a function to each partition of this RDD.
* `toLocalIterator()`: returns an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD.

In [132]:
def f(x):
    x += 10
    print(x)
rdd.foreach(f)
# printed to console

In [133]:
[x+10 for x in rdd.toLocalIterator()]

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
**Return a result** to the driver program.

* Collect
* Lookup
* Get
* Count
* Stats
* Reshape
    * Aggregate
    * Fold
    * Reduce

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Collect
These methods should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
* `collect()`: returns a list that contains all of the elements in this RDD.
* `collectAsMap()`: returns a dict with the key-value pairs in this RDD (`dict(self.collect())`). If a key occurs more than once, later values overwrite earlier ones.

In [7]:
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [359]:
kvRdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('c', 4), ('d', 1)])
kvRdd.collectAsMap()

{'a': 3, 'b': 2, 'c': 4, 'd': 1}

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Lookup
* `lookup(key)`: filter by *key* and return a list of values for that *key*. Efficient lookup if RDD has a known partitioner.

In [297]:
kvRdd.lookup('a')

[1, 3]
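
The difference between `collectAsMap()` and `lookup(key)` for duplicate keys can be reproduced with plain Python on the same pairs. This is only a local sketch of the semantics; it does not touch Spark.

```python
pairs = [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('d', 1)]

# collectAsMap(): building a dict keeps only the last value seen for each key
as_map = dict(pairs)

# lookup(key): keeps every value stored under the key, in order
values_for_a = [value for key, value in pairs if key == 'a']

print(as_map)        # {'a': 3, 'b': 2, 'c': 4, 'd': 1}
print(values_for_a)  # [1, 3]
```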

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Get
These methods should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

* `first()`: **`take(1)`**.
* `take(num)`: take the **first *num* elements** of the RDD and return them in a list.
* `takeSample(withReplacement, num, seed=None)`: returns a **sampled subset of size `num`** as a list. ***withReplacement*** defines whether elements can be **sampled multiple times** (*True*: an element may appear more than once; *False*: each element is drawn at most once, so the sample is never larger than the RDD).
   

In [10]:
rdd.first()

0

In [11]:
rdd.take(3)

[0, 1, 2]

In [12]:
shuffledRdd = sc.parallelize(rdd.takeSample(False, 10))
shuffledRdd.collect()

[3, 9, 0, 6, 1, 7, 8, 2, 5, 4]

In [118]:
# note: withReplacement=True, so despite its name this sample is drawn WITH replacement
sampleRddWithoutReplacement = sc.parallelize(rdd.takeSample(True, 10, seed=8))
sampleRddWithoutReplacement.collect()

[1, 2, 2, 4, 2, 8, 3, 8, 4, 8]

In [14]:
rdd.takeSample(True, 20, 1)

[5, 9, 4, 1, 0, 0, 7, 9, 4, 7, 9, 9, 4, 2, 0, 3, 4, 3, 6, 9]

In [15]:
rdd.takeSample(False, 5, 2)

[5, 9, 3, 4, 6]

In [16]:
len(rdd.takeSample(False, 15, 3))

10
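
The effect of *withReplacement* seen in the cells above can be sketched with the standard `random` module. `take_sample` is a hypothetical local stand-in, and it is not the sampling algorithm Spark uses; it only mirrors the observable behaviour.

```python
import random

def take_sample(data, with_replacement, num, seed=None):
    """Stand-in for RDD.takeSample on a local list (not Spark's algorithm)."""
    rng = random.Random(seed)
    data = list(data)
    if with_replacement:
        # every draw sees the full data set, so repeats are possible
        return [rng.choice(data) for _ in range(num)] if data else []
    # without replacement each element can be drawn at most once,
    # so the sample can never be larger than the data set
    return rng.sample(data, min(num, len(data)))

data = list(range(10))
with_repl = take_sample(data, True, 20, seed=1)
without_repl = take_sample(data, False, 15, seed=3)
print(len(with_repl))        # 20
print(len(without_repl))     # 10
print(sorted(without_repl))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```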

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Get: Ordered
These methods should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
* `top(num, key=None)`: returns top *num* elements of an RDD in sorted list in **descending** order.
* `takeOrdered(num, key=None)`: returns *num* elements of RDD in a sorted list in **ascending** order or as specified by the optional *key* function.

`top` and `takeOrdered` use the same implementation but with opposite `heapq` methods (`nlargest` and `nsmallest`).

In [17]:
shuffledRdd.top(2)

[9, 8]

In [4]:
sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)

[4, 3, 2]

In [19]:
shuffledRdd.takeOrdered(3)

[0, 1, 2]

In [20]:
shuffledRdd.takeOrdered(6, key=lambda x: -x)

[9, 8, 7, 6, 5, 4]
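
The `heapq` relationship mentioned above can be sketched locally. The partition layout below is an assumption, and `top` / `take_ordered` are illustrative stand-ins rather than the PySpark source: each partition first yields its own candidates, and the candidates are then merged.

```python
import heapq

def top(partitions, num, key=None):
    # largest `num` per partition first, then the largest `num` of those candidates
    candidates = [x for part in partitions
                  for x in heapq.nlargest(num, part, key=key)]
    return heapq.nlargest(num, candidates, key=key)

def take_ordered(partitions, num, key=None):
    # same idea with the opposite heapq function
    candidates = [x for part in partitions
                  for x in heapq.nsmallest(num, part, key=key)]
    return heapq.nsmallest(num, candidates, key=key)

parts = [[3, 9, 0], [6, 1, 7], [8, 2, 5, 4]]      # assumed partition layout
print(top(parts, 2))                              # [9, 8]
print(take_ordered(parts, 3))                     # [0, 1, 2]
print(take_ordered(parts, 6, key=lambda x: -x))   # [9, 8, 7, 6, 5, 4]
print(top([[10, 4, 2], [12, 3]], 3, key=str))     # [4, 3, 2]
```

With `key=str` the elements are compared as strings, which is why `4` ranks above `12` in the last call.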

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Count
* `count()`: count the number of elements.
* `countByKey()`: count the number of elements per key.
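* `countByValue()`: count the occurrences of each unique value and return them as a dictionary of (value, count) pairs.
* `countApprox(timeout, confidence=0.95)`: experimental; approximate version of `count()` that returns a potentially incomplete result within *timeout*, even if not all tasks have finished.
* `countApproxDistinct(relativeSD=0.05)`: experimental; approximate number of distinct elements, where *relativeSD* is the relative accuracy.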

In [21]:
shuffledRdd.count()

10

In [298]:
kvRdd.countByKey()

defaultdict(int, {'a': 2, 'b': 1, 'c': 1, 'd': 1})

In [117]:
sampleRddWithoutReplacement.countByValue()

defaultdict(int, {1: 1, 2: 3, 3: 1, 4: 2, 8: 3})
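
For comparison, the three exact counting methods correspond to the following plain-Python expressions on local copies of the data used above. This is a sketch of the semantics only.

```python
from collections import Counter

pairs = [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('d', 1)]
values = [1, 2, 2, 4, 2, 8, 3, 8, 4, 8]

count = len(values)                          # count()
by_key = Counter(key for key, _ in pairs)    # countByKey(): the values are ignored
by_value = Counter(values)                   # countByValue(): the elements are the keys

print(count)            # 10
print(dict(by_key))     # {'a': 2, 'b': 1, 'c': 1, 'd': 1}
print(dict(by_value))   # {1: 1, 2: 3, 4: 2, 8: 3, 3: 1}
```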

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Stats (1)
* `sum()`: add up elements.
* `sumApprox(timeout, confidence=0.95)`: experimental; approximate sum, returned within *timeout* or once the *confidence* is met.
* `max(key=None)`: maximum item.
* `min(key=None)`: minimum item.
* `mean()`: mean of elements.
* `meanApprox(timeout, confidence=0.95)`: experimental; approximate mean, returned within *timeout* or once the *confidence* is met.

In [24]:
rdd.sum()

45

In [25]:
rdd.max()

9

In [26]:
rdd.min()

0

In [27]:
rdd.mean()

4.5

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Stats (2)
* `stdev()`: **standard deviation** of elements.
* `sampleStdev()`: estimates standard deviation by dividing by **N-1** instead of N to correct bias.
* `variance()`: variance of elements.
* `sampleVariance()`: estimates variance by dividing by **N-1** instead of N to correct bias.

In [28]:
rdd.stdev()

2.8722813232690143

In [29]:
rdd.variance()

8.25
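
The N versus N-1 distinction can be checked with Python's `statistics` module on the same ten numbers. The mapping to the RDD methods in the comments follows the descriptions above.

```python
import statistics

data = list(range(10))

population_variance = statistics.pvariance(data)  # divides by N,   like variance()
sample_variance = statistics.variance(data)       # divides by N-1, like sampleVariance()
population_stdev = statistics.pstdev(data)        # like stdev()
sample_stdev = statistics.stdev(data)             # like sampleStdev()

print(population_variance)   # 8.25
print(sample_variance)       # 82.5 / 9, slightly larger than 8.25
print(population_stdev)
print(sample_stdev)
```

The sample variants are always a little larger, because the same sum of squared deviations (82.5 here) is divided by 9 instead of 10.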

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Stats (3)
* `stats()`: Return a **StatCounter** object that captures the **min, max, mean, variance and count** of the RDD’s elements in one operation.
* `histogram(buckets)`: computes histogram with provided *buckets* (either a number or a list of bucket boundaries). Returns a tuple ***(list of bucket boundaries, list with number of elements in buckets)***.

In [30]:
rdd.stats()

(count: 10, mean: 4.5, stdev: 2.87228132327, max: 9.0, min: 0.0)

In [31]:
rdd.histogram(3)
# b1: 0-2, b2: 3-5, b3:6-9, 3 elements in b1, 3 elements in b2, 4 elements in b3

([0, 3, 6, 9], [3, 3, 4])

In [32]:
rdd.histogram([0, 3, 6, 9])

([0, 3, 6, 9], [3, 3, 4])
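
How the bucket counts come about can be sketched in plain Python. `histogram` below is a hypothetical local stand-in, assuming that every bucket is half-open `[low, high)` except the last one, which also includes its upper boundary (this is why 9 lands in the third bucket).

```python
from bisect import bisect_right

def histogram(data, buckets):
    """Plain-Python stand-in for RDD.histogram on a local list."""
    data = list(data)
    if isinstance(buckets, int):
        # evenly spaced boundaries between the minimum and the maximum
        lo, hi = min(data), max(data)
        width = (hi - lo) / buckets
        boundaries = [lo + i * width for i in range(buckets)] + [hi]
    else:
        boundaries = list(buckets)
    counts = [0] * (len(boundaries) - 1)
    for x in data:
        if x < boundaries[0] or x > boundaries[-1]:
            continue                      # outside all buckets
        # every bucket is [low, high) except the last one, which is [low, high]
        index = min(bisect_right(boundaries, x) - 1, len(counts) - 1)
        counts[index] += 1
    return boundaries, counts

print(histogram(range(10), 3))             # ([0.0, 3.0, 6.0, 9], [3, 3, 4])
print(histogram(range(10), [0, 3, 6, 9]))  # ([0, 3, 6, 9], [3, 3, 4])
```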

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Reshape RDD
* Aggregate
* Fold
* Reduce

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Reshape RDD: Aggregate
* `aggregate(zeroValue, seqOp, combOp)`: aggregates the elements of each partition, and then the results for all the partitions.
* `treeAggregate(zeroValue, seqOp, combOp, depth=2)`: aggregates the elements of this RDD in a multi-level tree pattern.
    * ***zeroValue*:** neutral initial value of type U.
    * ***seqOp*:** sequential operation that can return a different result type than the input, **maps from (U, T) to U**.
    * ***combOp*:** combine operation for merging two U, **maps (U, U) to U**.
    * ***depth*:** suggested depth of the tree (2).


1. **aggregate the elements of each partition** using ***seqOp*** (with ***zeroValue*** as initial value).
2. **aggregate the results** of all partitions using ***combOp***.

In [33]:
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)

(10, 4)

In [34]:
sc.parallelize([]).aggregate((0, 0), seqOp, combOp)

(0, 0)
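
The two steps listed above can be written out with `functools.reduce`. This is a minimal local sketch: the split of `[1, 2, 3, 4]` into two partitions is an assumption, and `aggregate` here is a stand-in function, not the RDD method.

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    # step 1: fold each partition on its own, starting from zero_value
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # step 2: merge the per-partition results, again starting from zero_value
    return reduce(comb_op, partials, zero_value)

seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # (U, T) -> U
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # (U, U) -> U

total, count = aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
print((total, count))                          # (10, 4)
print(total / count)                           # 2.5
print(aggregate([], (0, 0), seq_op, comb_op))  # (0, 0)
```

Carrying a `(sum, count)` tuple is what lets the result type U differ from the element type T; dividing the two at the end gives the mean in a single pass.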

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Trees
**Reduce the load on the driver program** so it does not become the bottleneck that has to combine all partitions.
### treeAggregate(zeroValue, seqOp, combOp, depth=2)
`treeAggregate` is a specialized implementation of `aggregate` that **iteratively applies the combine function (*combOp*) to a subset of partitions**. This avoids returning all partial results to the driver for a single-pass reduce, as the classic `aggregate` does. Logically, it creates an n-ary tree with the partitions at its leaves; the **root (driver) holds the final reduced value**.

<img src="img/treeAggregation.png" style="width: 500px;"/>

Same applies for `treeReduce`.

https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/treereduce_and_treeaggregate_demystified.html

In [36]:
from operator import add
rdd.treeAggregate(0, add, add, 10)

45

In [37]:
rdd.aggregate(0, add, add)

45
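
The tree idea can be illustrated locally. The sketch below is a simplification: it merges neighbouring partial results in groups of `fan_in` (a hypothetical parameter), whereas Spark derives its grouping from *depth*. The partition layout is assumed.

```python
from functools import reduce
from operator import add

def tree_aggregate(partitions, zero_value, seq_op, comb_op, fan_in=2):
    # leaves of the tree: one partial result per partition
    level = [reduce(seq_op, part, zero_value) for part in partitions]
    rounds = 0
    # merge neighbouring groups of `fan_in` partial results until one is left
    while len(level) > 1:
        level = [reduce(comb_op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        rounds += 1
    result = level[0] if level else zero_value
    return result, rounds

parts = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]     # assumed partition layout
print(tree_aggregate(parts, 0, add, add))            # (45, 3)
print(tree_aggregate(parts, 0, add, add, fan_in=5))  # (45, 1)
```

With `fan_in=5` all five partial results are merged in one round, which corresponds to the flat `aggregate`; with `fan_in=2` the same result is reached in three smaller rounds.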

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Reshape RDD: Fold
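* `fold(zeroValue, op)`: **aggregates the elements of each partition, and then the results for all the partitions**, using a given **associative function** and a neutral *zeroValue*.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala: the function is applied to each partition individually, and the partition results are then folded into the final result, rather than folding all elements sequentially in a defined order. For functions that are not commutative, the result may therefore differ from a fold over a non-distributed collection.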

In [38]:
rdd.fold(0, add)

45
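
Why the function and the *zeroValue* matter can be seen in a local imitation of the partition-wise fold. The partition layout is an assumption, and `fold` below is a stand-in function rather than the RDD method.

```python
from functools import reduce
from operator import add, sub

def fold(partitions, zero_value, op):
    # zero_value starts the fold of every partition and the final merge
    partials = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, partials, zero_value)

parts = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]   # assumed layout of three partitions
flat = [x for part in parts for x in part]

print(fold(parts, 0, add))     # 45, same as the sequential fold
print(fold(parts, 1, add))     # 49: the non-neutral 1 is added 3 + 1 times
print(reduce(sub, flat, 0))    # -45, sequential fold with subtraction
print(fold(parts, 0, sub))     # 45, partitioned fold with subtraction
```

Addition with a neutral 0 is unaffected by the partitioning, while a non-neutral start value or an operator such as subtraction gives results that depend on how the data is split.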

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Reshape RDD:  Reduce (1)
* `reduce(f)`: reduces the elements of this RDD using the specified **commutative and associative binary operator**. Currently reduces partitions locally.
* `treeReduce(f, depth=2)`: reduces the elements of this RDD in a multi-level tree pattern. ***depth*** is the suggested depth of the tree (default: 2).

In [13]:
rdd.reduce(add)

45

In [14]:
rdd.treeReduce(add, 2)

45

In [43]:
# ten 2s in RDD, change to ten 1s
sc.parallelize((2 for _ in range(10))).map(lambda x: 1).reduce(add)

10

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/redShell.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` actions
### Reshape RDD:  Reduce (2)
* `reduceByKeyLocally(func)`: merges the values for each key using an **associative and commutative reduce function**, but **returns the results immediately to the driver as a dictionary**. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.

In [299]:
kvRdd.reduceByKeyLocally(add)

{'a': 4, 'b': 2, 'c': 4, 'd': 1}

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
**Lazily** computed operations that return a **new RDD**.

* Repartition
* Filter
* Sort
* Sample
* Reshape
    * Group
    * Reduce
    * Fold
    * Aggregate
    * Combine
    * Map
* Pipe
* Relational Algebra
    * Cogroup
    * Set Operators
    * Joins
    * Zip


<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Repartition RDD (1)
* `glom()`: returns an RDD created by coalescing all elements within each partition into a list (**shows partitions**).
* `coalesce(numPartitions, shuffle=False)`: returns a new RDD that is **reduced** into *numPartitions* partitions. 

In [7]:
print(rdd.collect())
print(rdd.glom().collect())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]


In [63]:
rdd.coalesce(2).glom().collect()

[[0, 1, 2], [3, 4, 5, 6, 7, 8, 9]]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Repartition RDD (2)
* `repartition(numPartitions)`: returns a new RDD that has exactly *numPartitions* partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data (`self.coalesce(numPartitions, shuffle=True)`). If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

In [64]:
rdd.repartition(5).glom().collect()

[[], [0, 1, 2, 3, 4, 5], [], [], [6, 7, 8, 9]]

In [303]:
kvRdd.repartition(3).glom().collect()

[[], [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('d', 1)], []]

In [69]:
# repartition works better for larger RDDs because of the batch size
sc.range(100).repartition(10).glom().collect()

[[90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Repartition RDD (3)
* `partitionBy(numPartitions, partitionFunc=portable_hash)`: returns a copy of the RDD partitioned using the specified partitioner. Used for **partitioning by key in key-value RDDs**.

In [304]:
kvRdd.partitionBy(3).glom().collect()

[[('d', 1)], [('b', 2)], [('a', 1), ('a', 3), ('c', 4)]]

In [135]:
#http://parrotprediction.com/partitioning-in-apache-spark/
def myPartitionFunc(x):
    return hash(x)
kvRdd.partitionBy(2, myPartitionFunc).glom().collect()

[[('b', 4), ('c', 3), ('d', 1)], [('a', 2), ('a', 6)]]

In [305]:
kvRdd.partitionBy(2, lambda x: x<'c').glom().collect()

[[('c', 4), ('d', 1)], [('a', 1), ('b', 2), ('a', 3)]]
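
How the partition function decides where a pair goes can be modelled in a few lines of plain Python. The helper `partition_by` is hypothetical; it assumes that the target partition is the function's result modulo *numPartitions*, and it uses the built-in `hash` as a stand-in for `portable_hash`.

```python
def partition_by(pairs, num_partitions, partition_func=hash):
    """Hypothetical model: route each (key, value) pair to a bucket by its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Boolean results work too, because True % 2 == 1 and False % 2 == 0.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

kv = [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('d', 1)]
print(partition_by(kv, 2, lambda k: k < 'c'))
# [[('c', 4), ('d', 1)], [('a', 1), ('b', 2), ('a', 3)]]
```

Keys below `'c'` evaluate to `True` and land in partition 1, all other keys land in partition 0, which matches the last cell above.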

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Repartition RDD (4)
* `repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=portable_hash, ascending=True, keyfunc=lambda x: x)`: repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. Uses `partitionBy()` internally.

In [306]:
kvRdd.repartitionAndSortWithinPartitions(3).glom().collect()

[[('d', 1)], [('b', 2)], [('a', 1), ('a', 3), ('c', 4)]]

In [62]:
rdd2 = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
rdd2.repartitionAndSortWithinPartitions(2, lambda x: x % 2).glom().collect()

[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Filter RDD
* `filter(f)`: returns a new RDD containing only the elements that satisfy a predicate.
* `distinct(numPartitions=None)`: returns a new RDD containing the distinct elements in this RDD.
* `keys()`: returns an RDD with the keys of each tuple.
* `values()`: returns an RDD with the values of each tuple.

In [16]:
rdd.filter(lambda x: x % 3 == 0).collect()

[0, 3, 6, 9]

In [122]:
print(sampleRddWithoutReplacement.collect())
sampleRddWithoutReplacement.distinct().collect()

[1, 2, 2, 4, 2, 8, 3, 8, 4, 8]


[1, 2, 4, 8, 3]

In [307]:
kvRdd.keys().collect()

['a', 'b', 'a', 'c', 'd']

In [308]:
kvRdd.values().collect()

[1, 2, 3, 4, 1]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Sort RDD
* `sortBy(keyfunc, ascending=True, numPartitions=None)`: sorts this RDD by the given *keyfunc*.
* `sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)`: sorts this key-value RDD.


In [143]:
kvRdd.sortBy(lambda x: x[1]).collect()

[('d', 1), ('a', 2), ('c', 3), ('b', 4), ('a', 6)]

In [147]:
sampleRddWithoutReplacement.sortBy(lambda x: x).collect()

[1, 2, 2, 2, 3, 4, 4, 8, 8, 8]

In [309]:
kvRdd.sortByKey().collect()

[('a', 1), ('a', 3), ('b', 2), ('c', 4), ('d', 1)]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Sample RDD (1)
* `randomSplit(weights, seed=None)`: randomly splits this RDD with the provided *weights*.

In [392]:
rdd100 = sc.parallelize(range(100))

In [393]:
rdd20, rdd80 = rdd100.randomSplit([0.2, 0.8], 17)

In [395]:
len(rdd20.collect())

20

In [396]:
len(rdd80.collect())

80
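
The weights only describe expected proportions. The following plain-Python sketch shows one way such a split can work: the weights are normalized into cumulative boundaries and one uniform random draw per element selects the split. `random_split` is a hypothetical helper, and its random numbers differ from Spark's, so the sizes it produces will generally not be exactly 20 and 80.

```python
import random
from itertools import accumulate

def random_split(elements, weights, seed=None):
    """Hypothetical model: one uniform draw per element picks its split."""
    total = float(sum(weights))
    # Cumulative, normalized upper bounds, e.g. [0.2, 0.8] -> [0.2, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for element in elements:
        draw = rng.random()
        # The element goes to the first split whose upper bound exceeds the draw;
        # the last split also catches rounding errors in the final bound.
        for index, upper in enumerate(bounds):
            if draw < upper or index == len(bounds) - 1:
                splits[index].append(element)
                break
    return splits

small, large = random_split(range(100), [0.2, 0.8], seed=17)
print(len(small), len(large))
```

Every element ends up in exactly one split, so the splits are disjoint and together contain the whole input.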

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Sample RDD (2)
* `sample(withReplacement, fraction, seed=None)`: returns a sampled subset of this RDD. ***withReplacement*** defines whether elements can be sampled multiple times (that is, whether they are replaced after being sampled out). ***fraction*** is the expected size of the sample as a fraction of this RDD's size (without replacement: the probability that each element is chosen). Used by the `takeSample()` action.
* `sampleByKey(withReplacement, fractions, seed=None)`: returns a subset of this RDD sampled by key (via **stratified sampling**). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.

In [397]:
rdd100.sample(False, 0.1, 81).collect()

[4, 27, 28, 41, 49, 53, 58, 85, 93]

In [403]:
# rdd with (a, 0), (b, 0), (a, 1), (b, 1),..., (b, 99) pairs
fractions = {'a': 0.2, 'b': 0.1}
kvRdd100 = sc.parallelize(fractions.keys()).cartesian(rdd100)

In [413]:
sampleKvRdd100 = kvRdd100.sampleByKey(False, fractions, 2)
sampleKvRdd100Keys = dict(sampleKvRdd100.groupByKey().collect())
print(str(len(sampleKvRdd100Keys['a'])) + ', ' + str(len(sampleKvRdd100Keys['b'])))

19, 12


In [414]:
#kvRdd100.sampleByKey(False, fractions, 2).collect()

In [89]:
10 < len(sampleKvRdd100Keys["a"]) < 30 and 5 < len(sampleKvRdd100Keys["b"]) < 15

True

In [88]:
len(sampleKvRdd100Keys["b"])

12

In [87]:
#sorted(rdd.sampleByKey(False, fractions).collect())
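
Stratified sampling without replacement can be pictured as an independent coin flip per element, with the bias taken from the element's key. The helper `sample_by_key` is a plain-Python assumption for illustration; it does not reproduce Spark's random number stream, so its counts differ from the 19 and 12 shown above.

```python
import random

def sample_by_key(pairs, fractions, seed=None):
    """Hypothetical model of sampling without replacement, one rate per key."""
    rng = random.Random(seed)
    # Keep each pair independently with the probability configured for its key.
    return [(key, value) for key, value in pairs
            if rng.random() < fractions[key]]

fractions = {'a': 0.2, 'b': 0.1}
pairs = [(key, n) for key in fractions for n in range(100)]
sampled = sample_by_key(pairs, fractions, seed=2)

# Count the sampled pairs per key; the counts vary around 20 and 10.
counts = {key: sum(1 for k, _ in sampled if k == key) for key in fractions}
print(counts)
```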

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD
* Group
* Reduce
* Fold
* Aggregate
* Combine
* Map

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Group

* `groupBy(f, numPartitions=None, partitionFunc=portable_hash)`: returns an RDD of grouped items.
* `groupByKey(numPartitions=None, partitionFunc=portable_hash)`: **groups the values for each key** in the RDD into a single sequence. Hash-partitions the resulting RDD with *numPartitions* partitions.
* `keyBy(f)`: **creates tuples** of the elements in this RDD by applying *f*.

In [219]:
[(x, sorted(y)) for (x, y) in rdd.groupBy(lambda x: x % 3).collect()]

[(0, [0, 3, 6, 9]), (1, [1, 4, 7]), (2, [2, 5, 8])]

Use *reduceByKey* or *aggregateByKey* instead if you group in order to perform an aggregation (e.g. sum or average) over each key: both perform much better because they merge values within each partition before the shuffle.

In [310]:
kvRdd.groupByKey().mapValues(list).collect()

[('a', [1, 3]), ('b', [2]), ('c', [4]), ('d', [1])]

In [311]:
kvRdd.groupByKey().collect()

[('a', <pyspark.resultiterable.ResultIterable at 0x7f288e9d3780>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7f288e9d34e0>),
 ('c', <pyspark.resultiterable.ResultIterable at 0x7f288e9d3898>),
 ('d', <pyspark.resultiterable.ResultIterable at 0x7f288e9d32b0>)]

In [149]:
print(rdd.keyBy(lambda x: x*x).collect())

[(0, 0), (1, 1), (4, 2), (9, 3), (16, 4), (25, 5), (36, 6), (49, 7), (64, 8), (81, 9)]


<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Reduce
* `reduceByKey(func, numPartitions=None, partitionFunc=portable_hash)`: merges the values for each key using an **associative and commutative reduce function**. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be partitioned with *numPartitions* partitions, or the default parallelism level if *numPartitions* is not specified. Default partitioner is hash-partition.

`reduceByKey` takes no *zeroValue*: the first value encountered for each key serves as the starting value.

In [312]:
kvRdd.reduceByKey(add).collect()

[('a', 4), ('b', 2), ('c', 4), ('d', 1)]
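
The effect of the local merge can be sketched in plain Python. The two-partition layout and the helpers `combine_locally` and `reduce_by_key` are assumptions for illustration; the model counts how many records would still have to be shuffled after each partition has merged its own values.

```python
from operator import add

def combine_locally(partition, func):
    """Merge the values per key inside a single partition (map side)."""
    merged = {}
    for key, value in partition:
        # The first value seen for a key is the starting value.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def reduce_by_key(partitions, func):
    """Hypothetical model: local merge first, then merge across partitions."""
    local_results = [combine_locally(part, func) for part in partitions]
    shuffled = sum(len(result) for result in local_results)
    final = {}
    for result in local_results:
        for key, value in result.items():
            final[key] = func(final[key], value) if key in final else value
    return final, shuffled

# Assumed layout of the five pairs over two partitions
partitions = [[('a', 1), ('b', 2), ('a', 3)], [('c', 4), ('d', 1)]]
result, shuffled = reduce_by_key(partitions, add)
print(sorted(result.items()), shuffled)
# [('a', 4), ('b', 2), ('c', 4), ('d', 1)] 4
```

Four records leave the partitions instead of five, because the two values for `'a'` are merged before the shuffle.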

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Fold
* `foldByKey(zeroValue, func, numPartitions=None, partitionFunc=portable_hash)`: merges the values for each key using an **associative** function *func* and a **neutral *zeroValue* which may be added to the result an arbitrary number of times**, and must not change the result (e.g., 0 for addition, or 1 for multiplication).

In [313]:
kvRdd.foldByKey(0, add).collect()

[('a', 4), ('b', 2), ('c', 4), ('d', 1)]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Aggregate
* `aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=portable_hash)`: aggregates the values of each key, using given combine functions and a neutral *zeroValue*. This function **can return a different result type**, U, than the type of the values in this RDD, V.
    * ***seqFunc*: merging a V into a U** (merging values within a partition).
    * ***combFunc*: merging two U’s** (merging values between partitions).

To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

In [314]:
kvRdd.aggregateByKey(0, add, add).collect()

[('a', 4), ('b', 2), ('c', 4), ('d', 1)]

In [315]:
kvRdd.aggregateByKey(0, lambda x,y: x+y, lambda x,y: x+y).collect()

[('a', 4), ('b', 2), ('c', 4), ('d', 1)]
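
The two cells above keep U and V the same type. A case where they differ is a per-key average, with U being a `(sum, count)` tuple. The plain-Python helper `aggregate_by_key` and the partition layout are assumptions for illustration, not Spark code.

```python
def aggregate_by_key(partitions, zero_value, seq_func, comb_func):
    """Hypothetical model of aggregateByKey on a list of partitions."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_func merges a value (V) into the accumulator (U)
            acc[key] = seq_func(acc.get(key, zero_value), value)
        per_partition.append(acc)
    merged = {}
    for acc in per_partition:
        for key, u in acc.items():
            # comb_func merges two accumulators (U with U)
            merged[key] = comb_func(merged[key], u) if key in merged else u
    return merged

partitions = [[('a', 1), ('b', 2), ('a', 3)], [('c', 4), ('d', 1)]]
sums_counts = aggregate_by_key(
    partitions,
    (0, 0),                                          # U = (sum, count)
    lambda u, v: (u[0] + v, u[1] + 1),               # V into U
    lambda u1, u2: (u1[0] + u2[0], u1[1] + u2[1]),   # U with U
)
averages = {key: total / count for key, (total, count) in sums_counts.items()}
print(averages)  # {'a': 2.0, 'b': 2.0, 'c': 4.0, 'd': 1.0}
```

The tuples are immutable, so sharing the zero value between keys is safe in this model.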

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Combine
* `combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=portable_hash)`: generic function to **combine the elements for each key using a custom set of aggregation functions**. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C. Users provide three functions:
    * ***createCombiner***, which **turns a V into a C** (e.g., creates a one-element list).
    * ***mergeValue*** to **merge a V into a C** (e.g., adds it to the end of a list).
    * ***mergeCombiners*** to **combine two C’s** into a single one (e.g., merges the lists).
    

In [316]:
def append(l, e):# merge V into a C
    l.append(e)
    return l
def extend(a,b):# combine two C's
    a.extend(b)
    return a
kvRdd.combineByKey(lambda x: [x], append, extend).collect()# lambda turns V into a C

[('a', [1, 3]), ('b', [2]), ('c', [4]), ('d', [1])]

In [317]:
kvRdd.combineByKey(lambda x: [x], lambda x,y: x.append(y), lambda x,y: x.extend(y)).collect()
# This does not work: list.append() and list.extend() modify the list in place and return None.
# A lambda returns the value of its expression, so
# lambda x,y: x.append(y) translates to
#def f(x,y):
    #return x.append(y)  # always None

[('a', None), ('b', [2]), ('c', [4]), ('d', [1])]
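
Lambdas do work once they use expressions that evaluate to the new list, such as `c + [v]`. The plain-Python model below compares both variants; `combine_by_key` and the partition layout are assumptions for illustration, not Spark code.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Hypothetical model of combineByKey on a list of partitions."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)  # V into C
            else:
                combiners[key] = create_combiner(value)              # V to C
        per_partition.append(combiners)
    merged = {}
    for combiners in per_partition:
        for key, c in combiners.items():
            merged[key] = merge_combiners(merged[key], c) if key in merged else c
    return merged

partitions = [[('a', 1), ('b', 2), ('a', 3)], [('c', 4), ('d', 1)]]

# list.append() returns None, so the combiner for 'a' is lost
broken = combine_by_key(partitions, lambda v: [v],
                        lambda c, v: c.append(v), lambda c1, c2: c1.extend(c2))
# c + [v] and c1 + c2 evaluate to the new list
working = combine_by_key(partitions, lambda v: [v],
                         lambda c, v: c + [v], lambda c1, c2: c1 + c2)
print(broken)   # {'a': None, 'b': [2], 'c': [4], 'd': [1]}
print(working)  # {'a': [1, 3], 'b': [2], 'c': [4], 'd': [1]}
```

Only `'a'` is affected in the broken variant because it is the only key with more than one value, so it is the only key for which a merge function is ever called.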

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Map (1)
* `map(f, preservesPartitioning=False)`: returns a new RDD by **applying a function to each element** of this RDD.
* `mapValues(f)`: passes each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.

In [161]:
print(rdd.map(lambda x: (x, 1)).collect())

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]


In [318]:
kvRdd.groupByKey().mapValues(list).collect()

[('a', [1, 3]), ('b', [2]), ('c', [4]), ('d', [1])]

In [319]:
kvRdd.groupByKey().mapValues(lambda x: len(x)).collect()

[('a', 2), ('b', 1), ('c', 1), ('d', 1)]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Map (2)
* `mapPartitions(f, preservesPartitioning=False)`: returns a new RDD by **applying a function to each partition** of this RDD.
* `mapPartitionsWithIndex(f, preservesPartitioning=False)`: returns a new RDD by applying a function to each partition of this RDD, while **tracking the index** of the original partition.

In [164]:
def f(iterator): yield sum(iterator)
rdd.repartition(3).mapPartitions(f).collect()

[0, 15, 30]

In [165]:
def f(splitIndex, iterator): yield splitIndex
rdd.repartition(4).mapPartitionsWithIndex(f).collect()

[0, 1, 2, 3]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Reshape RDD: Map (3)
* `flatMap(f, preservesPartitioning=False)`: returns a new RDD by first applying a function to all elements of this RDD, and then **flattening the results**.
* `flatMapValues(f)`: passes each value in the key-value pair RDD through a flatMap function **without changing the keys**; this also retains the original RDD’s partitioning.

In [168]:
print(rdd.collect())
# range(x,y): y is exclusive
print((rdd.flatMap(lambda x: range(1, x)).collect()))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8]


In [320]:
print(kvRdd.groupByKey().mapValues(list).collect())
kvRdd.groupByKey().mapValues(list).flatMapValues(lambda x: x).collect()

[('a', [1, 3]), ('b', [2]), ('c', [4]), ('d', [1])]


[('a', 1), ('a', 3), ('b', 2), ('c', 4), ('d', 1)]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Pipe
* `pipe(command, env=None, checkCode=False)`: returns an RDD created by **piping elements to a forked external process**.

`cat` copies its standard input to its standard output, so every element comes back unchanged, but as a string.

In [179]:
rdd.pipe('cat').collect()

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [176]:
rdd.pipe('grep 0').collect()

['0']

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra on Multiple RDDs
* Cogroup
* Set Operators
* Joins
* Zip

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Cogroup
* `cogroup(other, numPartitions=None)`: For each key k in *self* or *other*, return a resulting RDD that contains a **tuple with the list of values** for that key in *self* as well as *other*.
* `groupWith(other, *others)`:  **alias for cogroup** but with support for multiple RDDs.

In [389]:
kvRdd2 = sc.parallelize([('a', 10), ('c', 20), ('e', 30), ('c', 40)], 2)
kvRdd2.collect()

[('a', 10), ('c', 20), ('e', 30), ('c', 40)]

In [336]:
kvRddCogroup = kvRdd.cogroup(kvRdd2)
kvRddCogroup.collect()[0]

('b',
 (<pyspark.resultiterable.ResultIterable at 0x7f288e9d30b8>,
  <pyspark.resultiterable.ResultIterable at 0x7f288e9d3b38>))

In [324]:
kvRddCogroup2 = kvRdd.cogroup(sc.parallelize([('a', 9), ('c', 2)]))
kvRddCogroup2.collect()[0]

('b',
 (<pyspark.resultiterable.ResultIterable at 0x7f288e9d9828>,
  <pyspark.resultiterable.ResultIterable at 0x7f288e9d97f0>))

In [332]:
[(k, tuple(map(list, v))) for k, v in sorted(kvRddCogroup.collect())]

[('a', ([1, 3], [10])),
 ('b', ([2], [])),
 ('c', ([4], [20, 40])),
 ('d', ([1], [])),
 ('e', ([], [30]))]

In [334]:
for k, v in sorted(kvRddCogroup.collect()):
    print(k + ': '+ str(tuple(map(list, v))))

a: ([1, 3], [10])
b: ([2], [])
c: ([4], [20, 40])
d: ([1], [])
e: ([], [30])


In [351]:
def resultIterableItems(r):
    l = []
    for i in r:
        l.append(i)
    return l
            
for k,v in sorted(kvRddCogroup.collect()):
    print('key: ' + k + ', \t ResultIterable[0]: ' + str(resultIterableItems(v[0])) + ', \t ResultIterable[1]: ' + str(resultIterableItems(v[1])))

key: a, 	 ResultIterable[0]: [1, 3], 	 ResultIterable[1]: [10]
key: b, 	 ResultIterable[0]: [2], 	 ResultIterable[1]: []
key: c, 	 ResultIterable[0]: [4], 	 ResultIterable[1]: [20, 40]
key: d, 	 ResultIterable[0]: [1], 	 ResultIterable[1]: []
key: e, 	 ResultIterable[0]: [], 	 ResultIterable[1]: [30]


In [122]:
w = sc.parallelize([("a", 5), ("b", 6)])
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
z = sc.parallelize([("b", 42)])
[(x, tuple(map(list, y))) for x, y in sorted(list(w.groupWith(x, y, z).collect()))]

[('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
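
What `cogroup` computes for two inputs can be written down in plain Python. The helper `cogroup` below is a hypothetical model that uses lists in place of `ResultIterable` objects.

```python
def cogroup(left, right):
    """Hypothetical model: per key, the values from each side as two lists."""
    grouped = {}
    for key, value in left:
        grouped.setdefault(key, ([], []))[0].append(value)
    for key, value in right:
        grouped.setdefault(key, ([], []))[1].append(value)
    return grouped

kv = [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('d', 1)]
kv2 = [('a', 10), ('c', 20), ('e', 30), ('c', 40)]
for key, (left_values, right_values) in sorted(cogroup(kv, kv2).items()):
    print(key, left_values, right_values)
# a [1, 3] [10]
# b [2] []
# c [4] [20, 40]
# d [1] []
# e [] [30]
```

A key that exists on only one side still appears in the result, with an empty list for the other side.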

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Set Operators (1)
* `union(other)`: returns the union of this RDD and another one.
* `intersection(other)`: returns the duplicate-free intersection of this RDD and another one. This method performs a shuffle internally.

In [352]:
rdd.union(rdd).collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [353]:
sorted(rdd.map(lambda x: x * x).intersection(rdd).collect())

[0, 1, 4, 9]

In [354]:
rdd.filter(lambda x: x < 5).intersection(rdd).collect()

[0, 1, 2, 3, 4]

In [355]:
rdd1 = sc.parallelize([1, 10, 2, 1, 2, 3, 4, 5])
rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
sorted(rdd1.intersection(rdd2).collect())

[1, 2, 3]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Set Operators (2)
* `cartesian(other)`: returns the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in *self* and b is in *other*.

In [356]:
sorted(rdd.filter(lambda x: x < 3).cartesian(rdd.filter(lambda x: x > 7)).collect())
# (0, 1, 2) x (8, 9)

[(0, 8), (0, 9), (1, 8), (1, 9), (2, 8), (2, 9)]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Set Operators (3)
* `subtract(other, numPartitions=None)`: returns each value in *self* that is **not** contained in *other*.
* `subtractByKey(other, numPartitions=None)`: returns each (key, value) pair in *self* that has no pair with matching key in *other*.

In [357]:
sorted(rdd.subtract(rdd.filter(lambda x: x < 5)).collect())

[5, 6, 7, 8, 9]

In [292]:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3), ("c", None)])
y = sc.parallelize([("a", 3), ("c", 8)])
sorted(x.subtract(y).collect())

[('a', 1), ('b', 4), ('b', 5), ('c', None)]

In [293]:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
sorted(x.subtractByKey(y).collect())

[('b', 4), ('b', 5)]

In [360]:
kvRdd.subtractByKey(kvRdd2).collect()

[('b', 2), ('d', 1)]

In [361]:
kvRdd2.subtractByKey(kvRdd).collect()

[('e', 30)]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Joins (1)
* `join(other, numPartitions=None)`: returns an RDD containing **all pairs of elements with matching keys** in *self* and *other*. Each pair of elements will be returned as a **(k, (v1, v2))** tuple, where (k, v1) is in *self* and (k, v2) is in *other*. Performs a hash join across the cluster.

In [363]:
sorted(kvRdd.join(kvRdd2).collect())

[('a', (1, 10)), ('a', (3, 10)), ('c', (4, 20)), ('c', (4, 40))]

In [141]:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
sorted(x.join(y).collect())

[('a', (1, 2)), ('a', (1, 3))]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Joins (2)
* `fullOuterJoin(other, numPartitions=None)`
    * for each element **(k, v)** in *self*:
        * **(k, (v, w))** for w in *other*, or
        * **(k, (v, None))** if no elements in *other* have key k. 
    * for each element (k, w) in *other*:
        * **(k, (v, w))** for v in *self*, or
        * **(k, (None, w))** if no elements in *self* have key k. 

Hash-partitions the resulting RDD into the given number of partitions.

In [364]:
sorted(kvRdd.fullOuterJoin(kvRdd2).collect())

[('a', (1, 10)),
 ('a', (3, 10)),
 ('b', (2, None)),
 ('c', (4, 20)),
 ('c', (4, 40)),
 ('d', (1, None)),
 ('e', (None, 30))]

In [104]:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 8)])
sorted(x.fullOuterJoin(y).collect())

[('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
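
The rules above can be modelled in plain Python by grouping both sides per key and padding a missing side with `None`. The helper `join_model` and its two flags are assumptions for illustration: with both flags on it behaves like a full outer join, with one flag off like a left or right outer join, and with both off like an inner join.

```python
from itertools import product

def join_model(left, right, pad_right=True, pad_left=True):
    """Hypothetical model: build join results from per-key value lists."""
    grouped = {}
    for key, value in left:
        grouped.setdefault(key, ([], []))[0].append(value)
    for key, value in right:
        grouped.setdefault(key, ([], []))[1].append(value)
    result = []
    for key in sorted(grouped):
        vs, ws = grouped[key]
        if not ws and pad_right:
            ws = [None]   # key only in left: (k, (v, None))
        if not vs and pad_left:
            vs = [None]   # key only in right: (k, (None, w))
        # An empty side without padding yields no pairs, which drops the key.
        result.extend((key, pair) for pair in product(vs, ws))
    return result

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("c", 8)]
print(join_model(x, y))                                   # full outer join
print(join_model(x, y, pad_left=False))                   # left outer join
print(join_model(x, y, pad_right=False, pad_left=False))  # inner join
```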

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Joins (3)
* `leftOuterJoin(other, numPartitions=None)`
    * for each element **(k, v)** in *self*:
        * **(k, (v, w))** for w in *other*, or
        * **(k, (v, None))** if no elements in *other* have key k. 

Hash-partitions the resulting RDD into the given number of partitions.

In [365]:
sorted(kvRdd.leftOuterJoin(kvRdd2).collect())

[('a', (1, 10)),
 ('a', (3, 10)),
 ('b', (2, None)),
 ('c', (4, 20)),
 ('c', (4, 40)),
 ('d', (1, None))]

In [146]:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 3)])
sorted(x.leftOuterJoin(y).collect())

[('a', (1, 2)), ('b', (4, None))]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Joins (4)
* `rightOuterJoin(other, numPartitions=None)`
    * for each element **(k, w)** in *other*:
        * **(k, (v, w))** for v in *self*, or
        * **(k, (None, w))** if no elements in *self* have key k.

Hash-partitions the resulting RDD into the given number of partitions.

In [366]:
sorted(kvRdd.rightOuterJoin(kvRdd2).collect())

[('a', (1, 10)),
 ('a', (3, 10)),
 ('c', (4, 20)),
 ('c', (4, 40)),
 ('e', (None, 30))]

In [63]:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
sorted(y.rightOuterJoin(x).collect())

[('a', (2, 1)), ('b', (None, 4))]

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Zip (1)
* `zip(other)`: zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the **same number of partitions and the same number of elements in each partition** (e.g. one was made through a map on the other).

In [371]:
print(rdd.zip(rdd.map(lambda x: x+10)).collect())

[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14), (5, 15), (6, 16), (7, 17), (8, 18), (9, 19)]


<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Zip (2)
* `zipWithIndex()`: zips this RDD with its element indices. The **ordering is first based on the partition index and then the ordering of items within each partition**. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a Spark job when this RDD contains more than one partition.

In [374]:
kvRdd.keys().zipWithIndex().collect()

[('a', 0), ('b', 1), ('a', 2), ('c', 3), ('d', 4)]

In [135]:
sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()

[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
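
The need for an extra job follows from the way the indices are built: the size of every preceding partition must be known before a partition can number its own items. A plain-Python sketch, with the hypothetical helper `zip_with_index` and an assumed partition layout:

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Hypothetical model: global index = partition offset + local position."""
    # First pass: count the items per partition (the extra job in Spark).
    sizes = [len(part) for part in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Second pass: number the items of each partition starting at its offset.
    return [[(item, offset + i) for i, item in enumerate(part)]
            for part, offset in zip(partitions, offsets)]

# Assumed layout of four elements over three partitions
print(zip_with_index([['a'], ['b'], ['c', 'd']]))
# [[('a', 0)], [('b', 1)], [('c', 2), ('d', 3)]]
```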

<div style="float: right; display: block">
    <img style="width: 100px;" src="img/icons/data.png"/>
    <img style="width: 100px" src="img/mario/mushroom.png"/>
</div>
## `pyspark.RDD(jrdd, ctx)` transformations
### Relational Algebra: Zip (3)
* `zipWithUniqueId()`: zips this RDD with **generated unique Long ids**. Items in the kth partition will get ids k, n+k, 2\*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won’t trigger a Spark job, which is different from `zipWithIndex()`.

In [391]:
print(kvRdd2.keys().glom().collect())
print(kvRdd2.keys().zipWithUniqueId().collect())

[['a', 'c'], ['e', 'c']]
[('a', 0), ('c', 2), ('e', 1), ('c', 3)]


In [142]:
sc.parallelize(["a", "b", "c", "d", "e"], 3).glom().collect()

[['a'], ['b', 'c'], ['d', 'e']]

In [143]:
sc.parallelize(["a", "b", "c", "d", "e"], 3).zipWithUniqueId().collect()

[('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]
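
The id formula needs only the partition index and the position inside the partition, which is why no job is required. A plain-Python sketch (the helper `zip_with_unique_id` is hypothetical) applied to the partition layout that `glom()` printed above:

```python
def zip_with_unique_id(partitions):
    """Hypothetical model: item i of partition k gets the id i * n + k."""
    n = len(partitions)
    return [[(item, i * n + k) for i, item in enumerate(part)]
            for k, part in enumerate(partitions)]

print(zip_with_unique_id([['a'], ['b', 'c'], ['d', 'e']]))
# [[('a', 0)], [('b', 1), ('c', 4)], [('d', 2), ('e', 5)]]
```

The id 3 is never assigned because partition 0 holds only one item; this is the kind of gap mentioned above.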

<img style="width: 100px; float: right;" src="img/icons/broadcast.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.Broadcast()`
`class pyspark.Broadcast(sc=None, value=None, pickle_registry=None, path=None)`

A broadcast variable created with `SparkContext.broadcast()`. Access its value through `value`.


<img style="width: 100px; float: right;" src="img/icons/broadcast.png"/>
## `pyspark.Broadcast()` methods
* `destroy()`: destroys all data and metadata related to this broadcast variable. Use this with caution; once a broadcast variable has been destroyed, it cannot be used again. This method blocks until destroy has completed.
* `dump(value, f)`: write pickled serialization of value to file *f*.
* `load(path)`: load pickled file.
* `unpersist(blocking=False)`: deletes cached copies of this broadcast on the executors. If the broadcast is used after this is called, it will need to be re-sent to each executor. `blocking` defines whether to block until unpersisting has completed.
* `value`: returns the broadcasted value.

In [2]:
b = sc.broadcast([1, 2, 3, 4, 5])
b.value

[1, 2, 3, 4, 5]

In [3]:
b.unpersist()

<img style="width: 100px; float: right;" src="img/icons/combine.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.Accumulator(aid, value, accum_param)`
`class pyspark.Accumulator(aid, value, accum_param)`

A shared variable that can be accumulated, i.e., has a **commutative and associative *add* operation**. Worker tasks on a Spark cluster can add values to an Accumulator with the `+=` operator, but only the driver program is allowed to access its value, using `value`. Updates from the workers get propagated automatically to the driver program.

While *SparkContext* supports accumulators for primitive data types like int and float, users can also define accumulators for custom types by providing a custom *AccumulatorParam* object. Refer to the doctest of this module for an example.

<img style="width: 100px; float: right;" src="img/icons/combine.png"/>
## `pyspark.Accumulator(aid, value, accum_param)` methods
* `add(term)`: adds a term to this accumulator’s value.
* `value`: get the accumulator’s value; only usable in driver program.

In [61]:
a = sc.accumulator(0)
sc.parallelize([1,2,3]).foreach(lambda x: a.add(x))
a.value

6

<img style="width: 100px; float: right;" src="img/icons/combine.png"/>
# `pyspark.AccumulatorParam`
`class pyspark.AccumulatorParam`

Helper object that defines how to accumulate values of a given type.

* `addInPlace(value1, value2)`: add two values of the accumulator’s data type, returning a new value; for efficiency, can also update value1 in place and return it.
* `zero(value)`: provide a “zero value” for the type, compatible in dimensions with the provided value (e.g., a zero vector)

In [62]:
from pyspark.accumulators import AccumulatorParam

class BatmanAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # the empty string is the identity for concatenation
        return ''

    def addInPlace(self, val1, val2):
        if isinstance(val2, str):
            # merging two partial results: concatenate them
            val1 += val2
        else:
            # a number n added via add(n): append 'na' n times
            val1 += 'na' * val2
        return val1

ba = sc.accumulator('', BatmanAccumulatorParam())
sc.parallelize([1,2,1]).foreach(lambda x: ba.add(x))

ba.value + ' batman'

'nananana batman'

<img style="width: 100px; float: right;" src="img/icons/file.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.SparkFiles`
`class pyspark.SparkFiles`

Resolves paths to files added through `SparkContext.addFile()`. SparkFiles contains only classmethods; **users should not create SparkFiles instances**.

* `classmethod get(filename)`: get the absolute path of a file added through SparkContext.addFile().
* `classmethod getRootDirectory()`: get the root directory that contains files added through SparkContext.addFile().

<img style="width: 100px; float: right;" src="img/icons/storage.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)`
`class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)`

Flags for **controlling the storage of an RDD**. Each StorageLevel records whether to use memory, whether to drop the RDD to disk if it falls out of memory, whether to keep the data in memory in a Java-specific serialized format, and whether to replicate the RDD partitions on multiple nodes. It also contains static constants for some commonly used storage levels, such as MEMORY_ONLY. Since the data is always serialized on the Python side, all the constants use the serialized formats.


<img style="width: 100px; float: right;" src="img/icons/storage.png"/>
## `pyspark.StorageLevel` flags
`class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)`

| Flag                       | StorageLevel(  | useDisk | useMemory | useOffHeap | deserialized | replication)|
|:---------------------------|:---------------|:-------:|:---------:|:----------:|:------------:|------------:|
| **DISK_ONLY**              | `StorageLevel(`| `True`  | `False`   | `False`    | `False`      | `1)`        |
| **DISK_ONLY_2**            | `StorageLevel(`| `True`  | `False`   | `False`    | `False`      | `2)`        |
| **MEMORY_AND_DISK**        | `StorageLevel(`| `True`  | `True`    | `False`    | `False`      | `1)`        |
| **MEMORY_AND_DISK_2**      | `StorageLevel(`| `True`  | `True`    | `False`    | `False`      | `2)`        |
| **MEMORY_AND_DISK_SER**    | `StorageLevel(`| `True`  | `True`    | `False`    | `False`      | `1)`        |
| **MEMORY_AND_DISK_SER_2**  | `StorageLevel(`| `True`  | `True`    | `False`    | `False`      | `2)`        |
| **MEMORY_ONLY**            | `StorageLevel(`| `False` | `True`    | `False`    | `False`      | `1)`        |
| **MEMORY_ONLY_2**          | `StorageLevel(`| `False` | `True`    | `False`    | `False`      | `2)`        |
| **MEMORY_ONLY_SER**        | `StorageLevel(`| `False` | `True`    | `False`    | `False`      | `1)`        |
| **MEMORY_ONLY_SER_2**      | `StorageLevel(`| `False` | `True`    | `False`    | `False`      | `2)`        |
| **OFF_HEAP**               | `StorageLevel(`| `True`  | `True`    | `True`     | `False`      | `1)`        |


<img style="width: 100px; float: right;" src="img/icons/pickle.jpg"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# Serializers
By default, PySpark uses *PickleSerializer* to serialize objects using Python's *cPickle* serializer. The serializer is chosen when creating the *SparkContext*.

## `pyspark.PickleSerializer`
`class pyspark.PickleSerializer`

Serializes objects using Python’s pickle serializer (http://docs.python.org/2/library/pickle.html). **Supports nearly any Python object, but may not be as fast as more specialized serializers**.
* `dumps(obj)`
* `loads(obj, encoding=None)`

<img style="width: 100px; float: right;" src="img/icons/pickle.jpg"/>
# Serializers
## `pyspark.MarshalSerializer`
`class pyspark.MarshalSerializer`

Serializes objects using Python’s marshal serializer (http://docs.python.org/2/library/marshal.html).
**Faster than PickleSerializer but supports fewer datatypes**.
* `dumps(obj)`
* `loads(obj)`

In [69]:
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer

# only one SparkContext can be active at a time, so stop the existing one first
sc.stop()
sc = SparkContext('local', 'test', serializer=MarshalSerializer())
sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
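
The difference in supported datatypes can be seen without Spark, using the standard-library `pickle` and `marshal` modules that the two serializers are named after. This is only a sketch of the trade-off, not of PySpark's serializer classes; the sample values are arbitrary.

```python
import datetime
import marshal
import pickle

# Plain built-in types and a value of a non-built-in type (arbitrary examples).
builtin_value = [1, 2.5, "three", (4, 5), {"six": 6}]
date_value = datetime.date(2017, 1, 1)

# Both modules round-trip plain built-in types.
pickle_ok = pickle.loads(pickle.dumps(builtin_value)) == builtin_value
marshal_ok = marshal.loads(marshal.dumps(builtin_value)) == builtin_value
print(pickle_ok, marshal_ok)

# pickle also handles objects such as dates ...
restored = pickle.loads(pickle.dumps(date_value))
print(restored)

# ... whereas marshal raises ValueError for types it does not support.
try:
    marshal.dumps(date_value)
    marshal_supports_date = True
except ValueError as err:
    marshal_supports_date = False
    print("marshal failed:", err)
```

With *MarshalSerializer*, an RDD therefore has to be limited to the simple types that `marshal` can handle.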

<img style="width: 100px; float: right;" src="img/icons/status.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.StatusTracker(jtracker)`
`class pyspark.StatusTracker(jtracker)`

Low-level status reporting APIs for **monitoring job and stage progress**.

These APIs intentionally provide very weak consistency semantics; consumers of these APIs should be prepared to handle empty / missing information. For example, a job’s stage ids may be known but the status API may not have any information about the details of those stages, so `getStageInfo` could potentially return *None* for a valid stage id.

To limit memory usage, these APIs only provide information on recent jobs / stages. These APIs will provide information for the last `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs.

<img style="width: 100px; float: right;" src="img/icons/status.png"/>
## `pyspark.StatusTracker(jtracker)` methods
* `getActiveJobsIds()`: returns an array containing the ids of all **active jobs**.
* `getActiveStageIds()`: returns an array containing the ids of all **active stages**.
* `getJobIdsForGroup(jobGroup=None)`: returns a **list of all known jobs** in a particular job group. If jobGroup is None, then returns all known jobs that are not associated with a job group. The returned list may contain running, failed, and completed jobs, and may vary across invocations of this method. This method does not guarantee the order of the elements in its result.
* `getJobInfo(jobId)`: returns a ***SparkJobInfo* object**, or *None* if the job info could not be found or was garbage collected.
* `getStageInfo(stageId)`: returns a ***SparkStageInfo* object**, or *None* if the stage info could not be found or was garbage collected.

<img style="width: 100px; float: right;" src="img/icons/status.png"/>
# `pyspark.SparkJobInfo`
`class pyspark.SparkJobInfo`

Exposes information about Spark Jobs.

# `pyspark.SparkStageInfo`
`class pyspark.SparkStageInfo`

Exposes information about Spark Stages.

<img style="width: 100px; float: right;" src="img/icons/stats.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.Profiler(ctx)`
`class pyspark.Profiler(ctx)`

**DeveloperApi**

Gather statistics of executed programs.


PySpark supports custom profilers, so that different profilers can be used and profiles can be output in formats other than those provided by the *BasicProfiler*.

A custom profiler has to define or inherit the following methods:
* `profile(func)`: produces a system profile of *func*.
* `stats()`: returns the collected stats (`pstats.Stats`).
* `dump(id, path)`: dumps the profiles to a path for the RDD id.
* `show(id)`: prints the profile stats to stdout for the RDD id.

The profiler class is chosen when creating a SparkContext.

<img style="width: 100px; float: right;" src="img/icons/stats.png"/>
## `pyspark.BasicProfiler(ctx)`
`class pyspark.BasicProfiler(ctx)`

BasicProfiler is the default profiler, which is implemented based on **cProfile and Accumulator**.

* `profile(func)`: runs and profiles the passed function. A profile object is returned.
* `stats()`: returns accumulated stats.
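
The building blocks behind these methods are the standard-library `cProfile` and `pstats` modules. The sketch below runs without Spark and only approximates what a profiler does for a single function call; the function `work` is a hypothetical stand-in for the code Spark would run on a partition, and the accumulation across tasks is left out.

```python
import cProfile
import io
import pstats


def work(n):
    """Hypothetical stand-in for the function run on a partition."""
    return sum(i * i for i in range(n))


# Roughly what profile(func) does: run the function under cProfile.
profiler = cProfile.Profile()
result = profiler.runcall(work, 10000)

# Roughly what stats() hands back: a pstats.Stats object.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)

# Roughly what show(id) does: sort the stats and print them.
stats.sort_stats("time", "cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

A custom profiler can keep this structure and replace only the reporting step, for example to write the stats in another format.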

<img style="width: 100px; float: right;" src="img/icons/task.png"/>
<img src="img/pySparkLogo.png" style="width: 400px;"/>
# `pyspark.TaskContext`
`class pyspark.TaskContext`

**Experimental**

Contextual information about a task which can be read or mutated during execution. To access the TaskContext for a running task, use `TaskContext.get()`.

<img style="width: 100px; float: right;" src="img/icons/task.png"/>
## `pyspark.TaskContext` methods
* `attemptNumber()`: how many times this task has been attempted. The first task attempt will be assigned attemptNumber = 0, and subsequent attempts will have increasing attempt numbers.
* `classmethod get()`: returns the currently active TaskContext. This can be called inside of user functions to access contextual information about running tasks. **Must be called on the worker, not the driver**. Returns None if not initialized.
* `partitionId()`: ID of the RDD partition that is computed by this task.
* `stageId()`: ID of the stage that this task belongs to.
* `taskAttemptId()`: ID that is unique to this task attempt (within the same *SparkContext*, no two task attempts will share the same attempt ID). This is roughly equivalent to Hadoop’s *TaskAttemptID*.

<img src="img/pySparkLogo.png" style="width: 400px;"/>
# Example

In [72]:
radioheadLyrics = sc.wholeTextFiles('/home/dan/Dropbox/Ancud/music/radioheadLyrics/')

In [73]:
sorted(radioheadLyrics.collect())

[('file:/home/dan/Dropbox/Ancud/music/radioheadLyrics/(Nice Dream).txt',
  "They love me like I was a brother\nThey protect me, listen to me\nThey dug me my very own garden\nGave me sunshine, made me happy\nNice dream\nNice dream\nNice dream\nI call up my friend, the good angel\nBut she's out with her ansa'phone\nShe said that she'd love to come help but\nThe sea would electrocute us all\nNice dream\nNice dream\nNice dream\nNice dream\nNice dream\nNice dream\nNice dream\nIf you think that you're strong enough\nNice dream\nIf you think you belong enough\nNice dream\nIf you think that you're strong enough\nNice dream\nIf you think you belong enough\nNice dream\nNice dream\nNice dream\nNice dream\n"),
 ('file:/home/dan/Dropbox/Ancud/music/radioheadLyrics/(Talking #1).txt',
  'That was better\nYes, yes!\nYep\nYeah\nI got a [?] just get the drums right!\nHang on a sec\n'),
 ('file:/home/dan/Dropbox/Ancud/music/radioheadLyrics/(Talking #2).txt',
  "Just gonna do a quick version of that last 

In [27]:
#help(sc)

In [10]:
# filter using lambda (`lines` is assumed to be an RDD of text lines defined earlier in the notebook)
radioheadLines = lines.filter(lambda line: "Radiohead" in line)

In [13]:
radioheadLines.first()

'Radiohead are an English rock band from Abingdon, Oxfordshire, formed in 1985.'

In [14]:
radioheadLines.count()

6

In [15]:
# filter using a function
def hasRadiohead(line):
    return "Radiohead" in line
hasRadioheadF = lines.filter(hasRadiohead)

In [16]:
hasRadioheadF.count()

6

In [17]:
hasRadioheadF

PythonRDD[8] at RDD at PythonRDD.scala:48

In [18]:
# note: the wildcard import shadows Python built-ins such as max() with Spark column functions
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.session import SparkSession

# Create a SparkSession object for using Spark SQL / Dataset
spark = SparkSession(sc)

# read file using SparkSession object
textFile = spark.read.text("data/radiohead.txt")
textFile

DataFrame[value: string]

In [19]:
# count words per row
lineCounts = textFile.select(size(split(textFile.value, r"\s+")).name("numWords"))
#print(lineCounts.collect())

# get largest word count per row
lineCounts.agg(max(col("numWords"))).collect()


[Row(max(numWords)=42)]

In [20]:
# count word appearances in file
wordCounts = textFile.select(explode(split(textFile.value, r"\s+")).alias("word")).groupBy("word").count()
wordCounts


DataFrame[word: string, count: bigint]

In [21]:
wordCounts.collect()

[Row(word='worked', count=1),
 Row(word='often', count=1),
 Row(word='music.', count=1),
 Row(word='production', count=1),
 Row(word='nominated', count=1),
 Row(word='guitarists,', count=1),
 Row(word='(2016),', count=1),
 Row(word='highly', count=1),
 Row(word='hit', count=1),
 Row(word='critical', count=2),
 Row(word='complex', count=1),
 Row(word='experimental', count=1),
 Row(word='listeners', count=1),
 Row(word='subsequent', count=1),
 Row(word='could', count=1),
 Row(word='set', count=1),
 Row(word='singers.', count=1),
 Row(word='albums', count=3),
 Row(word='greatest', count=2),
 Row(word='(drums,', count=1),
 Row(word='1992.', count=1),
 Row(word='style,', count=1),
 Row(word='ninth', count=1),
 Row(word='30', count=1),
 Row(word='producer', count=1),
 Row(word='alternative', count=1),
 Row(word='by', count=3),
 Row(word="O'Brien", count=2),
 Row(word='using', count=1),
 Row(word='consists', count=1),
 Row(word='Bends', count=1),
 Row(word='Thom', count=1),
 Row(word='popular

In [22]:
# return words that appear more than 10 times, in descending order of count
wordCounts.filter("count > 10").orderBy(desc("count")).collect()

[Row(word='of', count=18),
 Row(word='the', count=17),
 Row(word='and', count=17),
 Row(word='in', count=11)]