# pySpark Basics
### In Pictures and By Example

## Spark
<img src="img/spark-cluster-overview.png"/>
[ from https://spark.apache.org/docs/2.2.0/cluster-overview.html ]

## Setup

In [1]:
# sc.stop() # uncomment in case you want to rerun it
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="test")
sc

In [2]:
# this will be the size of our sample dataset
SIZE = 10000000

## Resilient Distributed Dataset (RDD)
It is lazy - creating an RDD, or applying a transformation such as map to it, doesn't actually execute the computation.
<img src="img/rdd-create.png"/>

In [3]:
import math

# building the RDD is lazy, so this returns quickly without computing any sqrt
%time rdd = sc.parallelize(range(1, SIZE), 1).map(lambda x: math.sqrt(x))
rdd

CPU times: user 1.91 ms, sys: 3.03 ms, total: 4.94 ms
Wall time: 257 ms


PythonRDD[1] at RDD at PythonRDD.scala:48
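
The laziness shown above can be mimicked in plain Python, without Spark. This is only an analogy: Python's built-in `map` stands in for an RDD transformation, and `traced_sqrt` is a hypothetical helper that records when work really happens.

```python
import math

calls = []  # every element the function is actually applied to


def traced_sqrt(x):
    """math.sqrt that records each call, so we can see when work happens."""
    calls.append(x)
    return math.sqrt(x)


# Building the pipeline: map() returns a lazy iterator, no sqrt is computed yet.
lazy = map(traced_sqrt, range(1, 6))
calls_before = len(calls)

# Consuming the iterator plays the role of an action such as count().
count = sum(1 for _ in lazy)
calls_after = len(calls)

print(calls_before, count, calls_after)
```

Nothing is recorded in `calls` until the iterator is consumed, much as the cell above returns in milliseconds although it describes ten million square roots.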

## RDD executed
You need to run a Spark 'action' (e.g. count, reduce, collect, first, take, save..., foreach) for the computation to actually execute.
<img src="img/count.png" />

In [4]:
rdd
# Nothing is cached yet, so every action recomputes the RDD:
# the second run takes about as long as the first one
%time rdd.count()
%time rdd.count()

CPU times: user 10.9 ms, sys: 5.36 ms, total: 16.3 ms
Wall time: 3.03 s
CPU times: user 4.87 ms, sys: 2.2 ms, total: 7.07 ms
Wall time: 2.24 s


9999999
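
Why the two runs above take a similar time can be sketched with a toy model, assuming that an uncached RDD stores only the recipe for its data. `ToyRDD` is a hypothetical class written for this illustration, not part of Spark.

```python
import math


class ToyRDD:
    """Hypothetical stand-in for an uncached RDD: it stores a recipe, not data."""

    def __init__(self, source, func):
        self.source = source      # e.g. a range, cheap to iterate again
        self.func = func          # the transformation to apply
        self.evaluations = 0      # how many times func has been called

    def count(self):
        # An "action": replays the whole recipe from the source every time.
        n = 0
        for x in self.source:
            self.func(x)
            self.evaluations += 1
            n += 1
        return n


toy = ToyRDD(range(1, 1000), math.sqrt)
first = toy.count()
second = toy.count()
print(first, second, toy.evaluations)
```

Both counts agree, but `evaluations` ends up at twice the number of elements: the second action paid for the whole computation again.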

## RDD cached
A cached RDD is kept in memory once the first action has computed it, so subsequent actions reuse the stored data instead of recomputing it.
<img src="img/count-cached.png" />

In [5]:
%time rddcache = rdd.cache()
rddcache

CPU times: user 1.34 ms, sys: 1.37 ms, total: 2.7 ms
Wall time: 6.24 ms


PythonRDD[1] at RDD at PythonRDD.scala:48

In [6]:
# Notice how the second run is quicker now below:
%time rddcache.count()
%time rddcache.count()

CPU times: user 5.59 ms, sys: 3.83 ms, total: 9.42 ms
Wall time: 3.14 s
CPU times: user 5.33 ms, sys: 2.82 ms, total: 8.14 ms
Wall time: 799 ms


9999999

### Q: Why is `rdd` suddenly quicker? (hint: check the UI)

In [7]:
%time rdd.count()
print(rdd)
print(rddcache)

CPU times: user 6.95 ms, sys: 4.29 ms, total: 11.2 ms
Wall time: 783 ms
PythonRDD[1] at RDD at PythonRDD.scala:48
PythonRDD[1] at RDD at PythonRDD.scala:48
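
One reading consistent with the two identical `PythonRDD[1]` lines above is that `cache()` marks the RDD itself as cached and hands back that same object rather than a copy. The toy model below sketches this behaviour under that assumption; `ToyCachedRDD` and its attributes are hypothetical and much simpler than Spark's real classes.

```python
import math


class ToyCachedRDD:
    """Hypothetical model of an RDD with cache(); not Spark's real classes."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.is_cached = False
        self._materialized = None   # filled in by the first action after cache()
        self.evaluations = 0

    def cache(self):
        # Only sets a flag and returns the same object; nothing is computed.
        self.is_cached = True
        return self

    def _compute(self):
        for x in self.source:
            self.evaluations += 1
            yield self.func(x)

    def count(self):
        if self.is_cached:
            if self._materialized is None:
                self._materialized = list(self._compute())
            return len(self._materialized)
        return sum(1 for _ in self._compute())


toy = ToyCachedRDD(range(1, 1000), math.sqrt)
toy_cached = toy.cache()

same_object = toy_cached is toy
toy_cached.count()   # first action: computes and stores the results
toy.count()          # the "original" name reaches the same stored results
print(same_object, toy.evaluations)
```

In this model the function runs once per element in total, whichever of the two names is used for the second action.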


## Computation Distribution
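Splitting the data into more partitions lets Spark process them in parallel, which speeds up execution as long as there are free cores to run the extra tasks.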

### Single Partition
<img src="img/single-part.png" />

In [8]:
rdd = sc.parallelize(range(1,SIZE), numSlices=1).cache()
rdd.count()
%time rdd.map(lambda x: math.sqrt(x)).count()

CPU times: user 4.09 ms, sys: 1.22 ms, total: 5.32 ms
Wall time: 2.4 s


9999999

### Multiple partitions
<img src="img/multi-part.png" />

In [9]:
rdd = sc.parallelize(range(1,SIZE), numSlices=8).cache()
rdd.count()
%time rdd.map(lambda x: math.sqrt(x)).count()

CPU times: user 6.01 ms, sys: 1.61 ms, total: 7.61 ms
Wall time: 788 ms


9999999
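
The idea behind `numSlices` can be sketched in plain Python: cut the input into contiguous chunks and treat each chunk as an independent task. The helpers `split_range` and `count_partition` are hypothetical, and the slicing rule is an assumption made for illustration rather than Spark's exact algorithm. Threads are used only to show the task structure; Spark runs tasks in separate worker processes, so this sketch will not reproduce the speed-up measured above.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, num_slices):
    """Cut range(start, stop) into num_slices contiguous, near-equal ranges."""
    size = stop - start
    bounds = [start + (i * size) // num_slices for i in range(num_slices + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_slices)]


def count_partition(part):
    """One 'task': apply the map function to a partition and count the results."""
    return sum(1 for _ in map(math.sqrt, part))


partitions = split_range(1, 100_000, 8)

# Each partition is an independent task, so the tasks can run side by side.
with ThreadPoolExecutor(max_workers=8) as pool:
    partial_counts = list(pool.map(count_partition, partitions))

total = sum(partial_counts)
print(len(partitions), total)
```

The per-partition counts add up to the same total a single partition would give; only the amount of work that can proceed at the same time changes.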

# Reduce vs GroupBy
Sample problem: group the items by key and return the sum of the elements in each group.


<img src="img/group.png" />

In [10]:
rdd = sc.parallelize(range(1,SIZE*2), numSlices=8).map(lambda x: (x%4, x)).cache()
rdd.count()
%time rdd.groupByKey().map(lambda t: (t[0], sum(t[1]))).collect()

CPU times: user 17.3 ms, sys: 4.44 ms, total: 21.7 ms
Wall time: 4.71 s


[(0, 49999990000000),
 (1, 49999995000000),
 (2, 50000000000000),
 (3, 50000005000000)]
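
The four sums above can be cross-checked without Spark, since each group is an arithmetic series. `group_sum` is a hypothetical helper written for this check; it assumes keys of the form `x % modulus` over `range(1, stop)`, as in the cell above.

```python
def group_sum(key, modulus, stop):
    """Sum of all x in range(1, stop) with x % modulus == key (arithmetic series)."""
    first = key if key > 0 else modulus    # smallest member of the group
    if first >= stop:
        return 0
    n = (stop - 1 - first) // modulus + 1  # number of members in the group
    last = first + (n - 1) * modulus
    return n * (first + last) // 2


SIZE = 10000000
expected = [(k, group_sum(k, 4, SIZE * 2)) for k in range(4)]
print(expected)
```

The list printed by this check has the same four pairs as the Spark result.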

<img src="img/reduce.png" />

In [11]:
rdd = sc.parallelize(range(1,SIZE*2), numSlices=8).map(lambda x: (x%4, x)).cache()
rdd.count()
%time rdd.reduceByKey(lambda x,y: x+y).collect()

CPU times: user 10.9 ms, sys: 3.59 ms, total: 14.5 ms
Wall time: 3.26 s


[(0, 49999990000000),
 (1, 49999995000000),
 (2, 50000000000000),
 (3, 50000005000000)]
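
A likely reason for the difference in run time is the amount of data that has to move between partitions. The sketch below models this in plain Python, assuming that grouping has to ship every value while reducing can first sum inside each partition and ship only one partial sum per key. All function names are hypothetical and the model ignores serialization, networking and everything else Spark really does.

```python
from collections import defaultdict


def make_partitions(pairs, num_slices):
    """Deal key/value pairs into num_slices contiguous chunks."""
    size = len(pairs)
    bounds = [(i * size) // num_slices for i in range(num_slices + 1)]
    return [pairs[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


def group_then_sum(partitions):
    """groupByKey-style: every value crosses the shuffle, sums happen afterwards."""
    shuffled = [kv for part in partitions for kv in part]
    groups = defaultdict(list)
    for k, v in shuffled:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)


def reduce_by_key(partitions):
    """reduceByKey-style: sum inside each partition, then shuffle the partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for k, v in part:
            local[k] += v
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for k, v in shuffled:
        totals[k] += v
    return dict(totals), len(shuffled)


pairs = [(x % 4, x) for x in range(1, 10_000)]
parts = make_partitions(pairs, 8)
grouped, values_moved_group = group_then_sum(parts)
reduced, values_moved_reduce = reduce_by_key(parts)
print(grouped == reduced, values_moved_group, values_moved_reduce)
```

Both strategies give the same sums, but in this model the reduce variant moves at most one value per key per partition, while the group variant moves every value.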

# More In Depth

## Spark UI

<img src="img/ui-jobs.png" />

<img src="img/ui-job.png" />

<img src="img/ui-stage.png" />

<img src="img/ui-storage.png" />

<img src="img/ui-storage-rdd.png" />

<img src="img/ui-executors.png" />

## Partitioners, shuffling etc
Be careful with operations such as joins, repartition, coalesce and the ...ByKey family: they may have to shuffle the data across the cluster, for example to put all the items with the same key on one machine.
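
What "putting all the items with the same key on one machine" means can be sketched with a toy hash partitioner: a key's partition is a function of the key modulo the number of partitions. `partition_index` is a hypothetical helper; Python's built-in `hash` is used here as a stand-in for the partition function, which is an assumption of this sketch.

```python
from collections import Counter


def partition_index(key, num_partitions, partition_func=hash):
    """Partition a key lands in: partition_func(key) modulo the partition count."""
    return partition_func(key) % num_partitions


pairs = [(x % 8, x) for x in range(1, 1000)]

# Before a shuffle, records sit wherever the input was split, so every key can
# show up in every chunk. After hash partitioning, each key lives in one place.
placement = {k: partition_index(k, 8) for k, _ in pairs}
records_per_partition = Counter(partition_index(k, 8) for k, _ in pairs)

print(placement)
print(sorted(records_per_partition.items()))
```

With eight keys and eight partitions each key gets a partition of its own, so the load is spread evenly once the records have been moved.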

In [12]:
# keys are evenly spread out on partitions
rdd = sc.parallelize(range(1,SIZE), numSlices=8).map(lambda x: (x%8, x)).cache()
%time rdd.count()
print(rdd.partitioner)
%time rdd.groupByKey().mapValues(lambda x: sum(x)).collect()

CPU times: user 4.8 ms, sys: 2.1 ms, total: 6.91 ms
Wall time: 1.67 s
None
CPU times: user 13.9 ms, sys: 5.49 ms, total: 19.4 ms
Wall time: 2.33 s


[(0, 6249995000000),
 (1, 6249996250000),
 (2, 6249997500000),
 (3, 6249998750000),
 (4, 6250000000000),
 (5, 6250001250000),
 (6, 6250002500000),
 (7, 6250003750000)]

![1](img/Shuffling-1.png)

In [13]:
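# the RDD is partitioned up front with the default partitioner, which groupByKey
# then reuses (the same Partitioner object is printed twice below)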
rdd = sc.parallelize(range(1,SIZE), numSlices=8).map(lambda x: (x%8, x)).partitionBy(numPartitions=8).cache()
%time rdd.count()
print(rdd.partitioner)
%time rdd1=rdd.groupByKey().mapValues(lambda x: sum(x))
print(rdd1.partitioner)
%time rdd1.collect()

CPU times: user 5.44 ms, sys: 2.14 ms, total: 7.58 ms
Wall time: 8.58 s
<pyspark.rdd.Partitioner object at 0x105e1d080>
CPU times: user 3.02 ms, sys: 1.93 ms, total: 4.95 ms
Wall time: 6.43 ms
<pyspark.rdd.Partitioner object at 0x105e1d080>
CPU times: user 6.46 ms, sys: 2.86 ms, total: 9.32 ms
Wall time: 1.75 s


[(0, 6249995000000),
 (1, 6249996250000),
 (2, 6249997500000),
 (3, 6249998750000),
 (4, 6250000000000),
 (5, 6250001250000),
 (6, 6250002500000),
 (7, 6250003750000)]

![3](img/Shuffling-3.png)

In [14]:
# the custom partition function (lambda x: 1) differs from the one groupByKey uses,
# so the partitioner changes in the middle and the data is shuffled twice
rdd = sc.parallelize(range(1,SIZE), numSlices=8).map(lambda x: (x%8, x)).partitionBy(8, lambda x: 1).cache()
%time rdd.count()
print(rdd.partitioner)
%time rdd1=rdd.groupByKey().mapValues(lambda x: sum(x))
%time rdd1.collect()
print(rdd1.partitioner)

CPU times: user 5.43 ms, sys: 2.1 ms, total: 7.53 ms
Wall time: 5.21 s
<pyspark.rdd.Partitioner object at 0x105e250f0>
CPU times: user 4.11 ms, sys: 1.62 ms, total: 5.73 ms
Wall time: 11.1 ms
CPU times: user 6.05 ms, sys: 2.8 ms, total: 8.86 ms
Wall time: 6.13 s
<pyspark.rdd.Partitioner object at 0x105dffda0>


![2](img/Shuffling-2.png)
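
The difference between the last two cells can be sketched with a toy model, assuming that a by-key operation may skip its shuffle only when the data is already laid out by an equal partitioner (same partition count, same function). `ToyPartitioner` and `needs_shuffle` are hypothetical names, and built-in `hash` stands in for the default partition function.

```python
from collections import Counter


class ToyPartitioner:
    """Hypothetical model: a partitioner is a partition count plus a key function."""

    def __init__(self, num_partitions, partition_func):
        self.num_partitions = num_partitions
        self.partition_func = partition_func

    def __eq__(self, other):
        return (isinstance(other, ToyPartitioner)
                and self.num_partitions == other.num_partitions
                and self.partition_func == other.partition_func)

    def __call__(self, key):
        return self.partition_func(key) % self.num_partitions


def needs_shuffle(current, required):
    """A by-key operation can skip the shuffle only if the layout already matches."""
    return current != required


default = ToyPartitioner(8, hash)            # what groupByKey asks for in this model
pre_partitioned = ToyPartitioner(8, hash)    # like partitionBy(numPartitions=8)
custom = ToyPartitioner(8, lambda x: 1)      # like partitionBy(8, lambda x: 1)

keys = [x % 8 for x in range(1, 1000)]
skew = Counter(custom(k) for k in keys)

print(needs_shuffle(pre_partitioned, default))
print(needs_shuffle(custom, default))
print(dict(skew))
```

In this model the custom function also sends every record to partition 1, so the first shuffle concentrates all the data in one place before the second shuffle spreads it out again.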

## pySpark
<img src="img/pyspark.png" />
[ from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals ]

## HADOOP YARN
<img src="img/yarnflow1.png" />
[ from https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/ ]

## HADOOP YARN
<img src="img/debt.jpg" />
[ from https://www.kdnuggets.com/2017/10/data-science-systems-engineering-approach.html ]