<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Databricks Learning" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Touching Spark - Part 1

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Spark Entry Points

Since Spark 2.x, the main entry point for Spark applications is the `SparkSession` class.

`SparkSession` gives you access to the `DataFrame` and `Dataset` APIs.

`SparkSession` is a replacement for the other entry points:
* `SparkContext`, available in our notebook as **sc**.
* `SQLContext`, or more specifically its subclass `HiveContext`, available in our notebook as **sqlContext**.

Since Spark 2.0, direct use of `SparkContext` and `SQLContext` is limited, mostly to direct `RDD` access.

It is worth noting that the `SparkContext` is still accessible, but you always need a `SparkSession`!

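`SparkSession` member review (the `Dataset` members exist only in Scala and Java, and in PySpark several of the others, such as `read` and `sparkContext`, are properties rather than functions):
* `createDataset(..)`
* `createDataFrame(..)`
* `emptyDataset(..)`
* `emptyDataFrame(..)`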
* `range(..)`
* `read(..)`
* `readStream(..)`
* `sparkContext(..)`
* `sqlContext(..)`
* `sql(..)`
* `streams(..)`
* `table(..)`
* `udf(..)`

In the next sections, the member we are most interested in is `sparkContext`, which returns a `SparkContext`. In PySpark it is a property, accessed as `spark.sparkContext`.

## Getting Started

Let's start by creating a `SparkSession` and some useful variables.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

spark

Now let's start from the bottom, the `RDD`.

We need the `SparkContext`!

In [None]:
sc = spark.sparkContext

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png)  RDD Overview

**Technical Accomplishments:**
* Get a first taste of RDDs
* Perform a simple task
  * Read a text file
  * Perform a word count operation

Let's take a look at the text file we are interested in.

In [None]:
qcutils.print_s3_bucket_object(key='training/word-count-small.txt')

Let's now create an RDD from the file and define a word count on it.

In [None]:
from __future__ import print_function

# Open textFile for Spark Context RDD
text_file = sc.textFile(baseUri + "word-count-small.txt")

# Execute word count
counts = text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
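
To see what each step of the chained expression produces, the same pipeline can be traced in plain Python on a couple of hard-coded lines. This is only a local sketch: the sample lines and the variable names are made up for illustration, and no Spark is involved.

```python
from collections import defaultdict
from itertools import chain

# Made-up sample standing in for the lines of the text file.
sample_lines = ["to be or not", "to be"]

# flatMap(lambda line: line.split()): split each line, then flatten the lists.
words = list(chain.from_iterable(line.split() for line in sample_lines))

# map(lambda word: (word, 1)): turn each word into a (key, value) pair.
pairs = [(word, 1) for word in words]

# reduceByKey(lambda a, b: a + b): merge the values that share the same key.
totals = defaultdict(int)
for word, one in pairs:
    totals[word] = totals[word] + one

local_counts = dict(totals)
print(local_counts)
```

On an RDD the three steps run per partition and in parallel; here they run sequentially over one small list, which is enough to follow the data shapes.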

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png)  Transformations & Actions

**Technical Accomplishments:**
* Review the Lazy vs. Eager design
* Quick review of Transformations
* Quick review of Actions
* Introduce the Catalyst Optimizer
* Wide vs. Narrow Transformations

### Laziness By Design

RDDs support two types of operations: 
* Transformations, which create a new dataset from an existing one, and 
* Actions, which return a value to the driver program after running a computation on the dataset. 

For example, `map` is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, `reduce` is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel `reduceByKey` that returns a distributed dataset).

Fundamental to Apache Spark are the notions that:
* Transformations are **LAZY**: they do not compute their results immediately. The system just remembers the transformations applied to some base dataset and applies them once they are needed.
* Actions are **EAGER**: they are computed immediately, together with all the previous transformations they depend on.

This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
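
The lazy/eager split can be imitated in a few lines of plain Python. The class below is a toy stand-in, not Spark code: `LazyDataset`, `traced_double`, and the `calls` list are hypothetical names used only to observe when the function really runs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: only remember the function, compute nothing.
        return LazyDataset(self._source, self._steps + (fn,))

    def collect(self):
        # Action: apply every recorded step now and return the results.
        results = []
        for item in self._source:
            for fn in self._steps:
                item = fn(item)
            results.append(item)
        return results


calls = []


def traced_double(x):
    calls.append(x)  # lets us observe when the function is actually invoked
    return x * 2


doubled = LazyDataset([1, 2, 3]).map(traced_double)
calls_before_action = list(calls)  # still empty: map() did not run anything
result = doubled.collect()         # the action triggers the computation
print(calls_before_action, result)
```

Until `collect()` is called, `traced_double` is never invoked, which mirrors why a chain of transformations alone triggers no Spark job.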

We see this play out when we run multiple transformations back-to-back, and no job is triggered:

#### Spark UI - Jobs

The **Jobs** page of the **Spark UI** shows a detailed list of the Spark jobs.

The jobs list in the Spark UI is empty... Why?

...

We only ran **transformations**!

Let's now try with an **action**!

In [None]:
# Collect the results
output = counts.collect()

# Print the first 10 elements of the output array
for (word, count) in output[:10]:
    print("%s: %i" % (word, count))

Now a single job is listed, a `collect` job.

Note that the `collect` function of the `RDD` returns (materializes in the driver) a list that contains all of the elements in this RDD, so it should only be used when the result fits in the driver's memory.

### A Little Step More

#### Wide vs. Narrow Transformations

Transformations can be classified into two broad categories: **wide** and **narrow**.

**Narrow Transformations**: The data required to compute the records in a single partition reside in at most one partition of the parent RDD.

Examples include:
* `filter(..)`
* `map(..)`
* `...`

<img src="https://www.quantiaconsulting.com/logos/img/transformations-narrow.png" alt="Narrow Transformations" style="height:300px"/>

<br/>

**Wide Transformations**: The data required to compute the records in a single partition may reside in many partitions of the parent RDD. 

Examples include:
* `groupBy(...).sum()` 
* `distinct()` 
* `...` 

<img src="https://www.quantiaconsulting.com/logos/img/transformations-wide.png" alt="Wide Transformations" style="height:300px"/>
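
The difference can be made concrete with a toy model in plain Python, where an RDD is just a list of partitions. The data, the `color_to_partition` mapping, and the function names are assumptions made for illustration only.

```python
# Toy parent RDD: three partitions, each a plain list (no Spark involved).
parent = [
    ["red", "blue"],     # partition 0
    ["green", "red"],    # partition 1
    ["blue", "green"],   # partition 2
]

# Hypothetical partitioner: decides which child partition receives each key.
color_to_partition = {"red": 0, "blue": 1, "green": 2}


def narrow_map(partitions, fn):
    """Narrow: child partition i is built from parent partition i alone."""
    return [[fn(record) for record in part] for part in partitions]


def wide_dependencies(partitions, partition_of):
    """Wide: list, for each child partition, the parent partitions it reads."""
    deps = {}
    for parent_idx, part in enumerate(partitions):
        for record in part:
            deps.setdefault(partition_of[record], set()).add(parent_idx)
    return {child: sorted(parents) for child, parents in sorted(deps.items())}


upper = narrow_map(parent, str.upper)
group_deps = wide_dependencies(parent, color_to_partition)
print(upper[0])    # built only from parent partition 0
print(group_deps)  # each child partition reads from two parent partitions
```

With this sample, grouping by color makes every child partition depend on more than one parent partition, which is exactly what forces data to move between executors.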

#### Shuffle

What if the data you need is not on the same executor and you want to optimize your execution?

You need to `shuffle` data!

Remember the previous image: what if you need to group by color? It will serve us best if...
  * All the reds are in one partition
  * All the blues are in a second partition
  * All the greens are in a third

From there we can easily sum/count/average all of the reds, blues, and greens.

To carry out the shuffle operation Spark needs to
* Write that data to disk on the local node - at this point the slot is free for the next task.
* Send that data across the wire to another executor
  * Technically the Driver decides which executor gets which piece of data.
  * Then the executor pulls the data it needs from the other executor's shuffle files.
* Copy the data back into RAM on the new executor
  * The concept, if not the action, is just like the initial read "every" `DataFrame` starts with.
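
The steps above can be sketched in plain Python. Each map task writes one bucket per target partition (standing in for shuffle files on local disk), and each reduce task pulls its bucket from every map task. The `zlib.crc32`-based partitioner, the sample data, and all names are assumptions for illustration; Spark's real implementation differs.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 3


def partition_for(key, num_partitions=NUM_REDUCERS):
    # Deterministic stand-in for a hash partitioner (not Spark's own).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_side_write(partition):
    """Bucket a map task's records by target reducer (a toy "shuffle file")."""
    buckets = {r: [] for r in range(NUM_REDUCERS)}
    for key in partition:
        buckets[partition_for(key)].append(key)
    return buckets


def reduce_side_read(shuffle_files, reducer_id):
    """Pull this reducer's bucket from every map task's output."""
    pulled = []
    for buckets in shuffle_files:
        pulled.extend(buckets[reducer_id])
    return pulled


map_partitions = [["red", "blue"], ["green", "red"], ["blue", "green"]]
shuffle_files = [map_side_write(p) for p in map_partitions]
reduced = [reduce_side_read(shuffle_files, r) for r in range(NUM_REDUCERS)]

# After the shuffle each color lives in exactly one partition,
# so counting can be done locally and then merged.
color_counts = Counter()
for records in reduced:
    color_counts.update(records)
print(dict(color_counts))
```

Which partition a given color lands in depends on the partitioner; what matters is that all records with the same key end up together.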

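**Note:** On a `DataFrame`, some actions such as `count()` also involve a shuffle, because the per-partition partial results are exchanged before the final aggregation. On an RDD, `count()` and `reduce(..)` do not shuffle: each partition computes a partial result and the driver combines them.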

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Caching

**Technical Accomplishments:**
* Understand how caching works
* Explore the different caching mechanisms
* Discuss tips for the best use of the cache

Spark can **persist** (or **cache**) a dataset in memory across operations.

Persisting an RDD means that each node stores the partitions of that RDD that it computes in memory, in order to speed up their reuse in other actions on that dataset (or on datasets derived from it).

Caching plays a key role in improving the performance of iterative algorithms. Good use of the cache can speed up operations, often by more than 10x.

Note that an RDD can be marked to be cached using the [persist()](https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#persist--) or [cache()](https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#cache--) methods. The caching operation is lazy, so the RDD will be cached the first time it is computed in an action.

Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
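
The lazy marking and the recomputation after a loss can be mimicked with a small plain-Python model. This is a toy, not Spark's implementation: the `ToyRDD` class, its `lose_cache()` method, and the `computations` counter are hypothetical names introduced only for illustration.

```python
class ToyRDD:
    """Toy model of a cacheable dataset; it does not use Spark."""

    def __init__(self, compute):
        self._compute = compute    # rebuilds the data from its lineage
        self._marked = False       # set by cache(), like a storage-level flag
        self._stored = None        # filled on the first action after cache()
        self.computations = 0      # how many times the lineage was replayed

    def cache(self):
        self._marked = True        # lazy: only a mark, nothing is stored yet
        return self

    def is_materialized(self):
        return self._stored is not None

    def lose_cache(self):
        self._stored = None        # simulate losing the cached partitions

    def count(self):
        """Action: uses the cached copy if present, otherwise recomputes."""
        if self._stored is None:
            self.computations += 1
            data = self._compute()
            if not self._marked:
                return len(data)
            self._stored = data
        return len(self._stored)


rdd = ToyRDD(lambda: ["a", "b", "c"]).cache()
materialized_before_action = rdd.is_materialized()  # False: cache() is lazy
rdd.count()                                         # first action fills the cache
rdd.count()                                         # served from the cache
computations_with_cache = rdd.computations          # 1
rdd.lose_cache()
rdd.count()                                         # recomputed from the lineage
computations_after_loss = rdd.computations          # 2
```

The same pattern shows up later in this notebook: `cache()` alone leaves the **Storage** page empty, and only the following action fills it.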

#### Spark UI - Storage

The **Storage** page of the **Spark UI** shows a detailed list of the cached RDDs.

Let's review its fields:
* RDD Name
* Storage Level
* Cached Partitions
* Fraction Cached
* Size in Memory
* Size on Disk

Let's play with cache.

Count the number of words in the file `...`

In [None]:
qcutils.print_s3_bucket_object(key='training/enwiki-latest-abstract10.xml')

In [None]:
# Open textFile for Spark Context RDD
text_file = sc.textFile(baseUri + "enwiki-latest-abstract10.xml")

# Execute word count
counts = text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
output = counts.collect()

# Print the first 10 elements of the output array
for (word, count) in output[:10]:
    print("%s: %i" % (word, count))

... now let's cache the file before running the word count again.

In [None]:
text_file.cache()

Let's check the Spark UI... no entry in the **Storage** page... Why?

...

In [None]:
text_file.count()

**NOTE:** the `count()` action **materializes the cache**.

`cache()` is neither an action nor a transformation: it only marks an RDD as cacheable, and Spark decides if and when to cache it.

We cannot force caching every time. In this case `count()` works well because we are running in an educational environment, with very little data.

In [None]:
counts = text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
output = counts.collect()

# Print the first 10 elements of the output array
for (word, count) in output[:10]:
    print("%s: %i" % (word, count))

### A Little Step More

#### Storage Level

Based on your system characteristics or operation needs, each persisted RDD can be stored using a different storage level:

* `MEMORY_ONLY`: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
* `MEMORY_AND_DISK`: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
* `MEMORY_ONLY_SER` (Java and Scala): Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
* `MEMORY_AND_DISK_SER` (Java and Scala): Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
* `DISK_ONLY`: Store the RDD partitions only on disk.
* `MEMORY_ONLY_2`, `MEMORY_AND_DISK_2`, etc.: Same as the levels above, but replicate each partition on two cluster nodes.
* `OFF_HEAP`: Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.

**Note** The storage level can be passed to the `persist()` method. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory).


#### Which Storage Level to Choose?

Different storage levels offer different memory usage/CPU efficiency trade-offs. But how to choose?

The following guidelines could help:
* If you have enough memory (RAM) and the RDDs fit comfortably, use `MEMORY_ONLY` storage level. The `MEMORY_ONLY` level is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
* Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

|        Level        	| Space used 	| CPU time 	| In memory 	| On disk 	| Serialized 	|
|:-------------------:	|:----------:	|:--------:	|:---------:	|:-------:	|:----------:	|
| MEMORY_ONLY         	|    High    	|    Low   	|     Y     	|    N    	|      N     	|
| MEMORY_ONLY_SER     	|     Low    	|   High   	|     Y     	|    N    	|      Y     	|
| MEMORY_AND_DISK     	|    High    	|  Medium  	|    Some   	|   Some  	|    Some    	|
| MEMORY_AND_DISK_SER 	|     Low    	|   High   	|    Some   	|   Some  	|      Y     	|
| DISK_ONLY           	|     Low    	|   High   	|     N     	|    Y    	|      Y     	|



### Deep Dive into Storage Levels

Let's go deeper into storage levels.

As a first step, we need to clean the cache: call the `unpersist()` method on every RDD that is currently cached.
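
A minimal sketch of that clean-up step, assuming you keep references to the RDDs you cached. The helper name `unpersist_all` is hypothetical; `is_cached` and `unpersist()` are members of PySpark's `RDD`.

```python
def unpersist_all(rdds):
    """Release the cache of every RDD in `rdds` that is marked as cached."""
    released = []
    for rdd in rdds:
        if rdd.is_cached:      # True after cache() or persist() was called
            rdd.unpersist()    # drop its cached blocks from memory and disk
            released.append(rdd)
    return released
```

In this notebook the only RDD cached so far is `text_file`, so the call would be `unpersist_all([text_file])`; its entry should then disappear from the **Storage** page.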

Let's now read the same file into three different RDDs and persist two of them.

In [None]:
qcutils.print_s3_bucket_object(key='training/enwiki-latest-abstract.xml')

In [None]:
from pyspark.storagelevel import StorageLevel

text_file_1 = sc.textFile(baseUri + "enwiki-latest-abstract.xml")
text_file_2 = sc.textFile(baseUri + "enwiki-latest-abstract.xml")
text_file_3 = sc.textFile(baseUri + "enwiki-latest-abstract.xml")

text_file_2.persist(StorageLevel.MEMORY_ONLY).count()
text_file_3.persist(StorageLevel.DISK_ONLY).count()

In the **Storage** tab of the Spark UI we can see the difference between the cached RDDs!

Now, let's try to perform transformations and actions on them.

In [None]:
text_file_1.filter(lambda s: "italy" in s).count()

In [None]:
text_file_2.filter(lambda s: "italy" in s).count()

In [None]:
text_file_3.filter(lambda s: "italy" in s).count()

What is the difference in performance between the operations on the 3 files?

* `text_file_1` is not cached, so every action implies a new read of the original file from the remote storage
* `text_file_2` is only partially cached, so every action implies a partial read of the original file from the remote storage, which leads to only a partial performance improvement
* `text_file_3` is completely cached on local disk and gives the best performance

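**Note** Since Spark 2.x the default storage level used by `cache()` on a `DataFrame` or `Dataset` is `MEMORY_AND_DISK`, while for RDDs it remains `MEMORY_ONLY`.
With the `MEMORY_ONLY` storage level, the partitions that do not fit in memory must be recomputed every time; with `MEMORY_AND_DISK` they are read back from local disk instead.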


Of course, the more often the cached data is reused, the more the cache pays off.

In [None]:
text_file_2.filter(lambda s: "europe" in s).count()

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png)  Take Home Messages

* RDDs are the basic building block of a Spark application
* RDDs are distributed and immutable
* Spark is **Lazy** (...not always!)
* Caching improves performance, but the cache must be used in a smart way

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Let's Get Dirty

Let's put what we've just learnt about RDDs to work with a [Lab](./lab/01a-touching-spark-part1-lab.ipynb)

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.