# <u><p style="text-align: center;">Introduction to Apache Spark</p></u>

### Contents of this notebook
* What a Resilient Distributed Dataset (RDD) is
* How to load data into Spark
* Difference between transformations and actions

### Background

Apache Spark is a system that allows us to process large amounts of data in parallel. A core concept of Spark that allows us to scale data processing operations is the *resilient distributed dataset* (**RDD**). RDDs are immutable collections of elements that can be partitioned across several computers and operated on in parallel.

When we read data with Spark, it is converted into RDDs. We can then apply `map` and `reduce` operations to those RDDs, and Spark automatically executes them in parallel.

In this notebook we are going to see how to initialize Spark, how to create RDDs, and how to work with RDDs.

#### Spark initialization
To work with Spark we need a Spark cluster and a way to access it. In the following cell we configure a cluster and create a `SparkSession` to access it. Some of the code in this cell may be unfamiliar, but it is not the focus of our course; these are just the steps needed to configure our cluster. Note that this code will output some red lines with warnings, which should not worry you.

In [None]:
import os
from pyspark.sql import SparkSession

#'swan_spark_conf' is a configuration provided by a plugin for Jupyter. We further extend this configuration with proxy settings.
swan_spark_conf = swan_spark_conf.setAll([('spark.ui.proxyBase', os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040')])

#instantiate a SparkSession object with our configuration
spark = SparkSession\
            .builder\
            .config(conf=swan_spark_conf)\
            .appName('Spark introduction')\
            .getOrCreate()

#get a SparkContext object which will allow us to work with RDDs
sc = spark.sparkContext

#set Spark log level
spark.sparkContext.setLogLevel('ERROR')

`SparkSession` is the entry point to Spark and provides access to its functionality. `SparkContext` represents a connection to a cluster; it can be used to create RDDs and distribute data on that cluster. In this and the remaining notebooks of this course, we initialize Spark by running this cell.

Note that this cell may take some time to be executed since Apache Spark is launching in the background. An asterisk on the left of the cell means that execution has not finished yet, while a number denotes that execution has already finished.

#### Creating an RDD

All parallel work in Spark is done on RDDs, so the first thing we need to do is to
convert our data to an RDD. To do this we are going to use the `parallelize`
method on our `SparkContext` `sc`. 

`parallelize` takes two arguments: (1) our data and, optionally, (2) the number of partitions into which to split the data. Below we create an RDD with two partitions from a list:

In [None]:
animals = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']
animals_rdd = sc.parallelize(animals, 2)

Notice the difference between the data types of the list and the RDD:

In [None]:
print('the type of animals is: ' + str(type(animals)))
print('the type of animals_rdd is: ' + str(type(animals_rdd)))

Also, observe below how the elements of `animals_rdd` are distributed into partitions. Each sub-list in the output represents a partition and its elements. The number of partitions is also reflected in the number of tasks executed for each job:

In [None]:
print(animals_rdd.glom().collect()) #2 partitions

In [None]:
animals_rdd = sc.parallelize(animals, 3) #3 partitions
print(animals_rdd.glom().collect())

In [None]:
animals_rdd = sc.parallelize(animals, 4) #4 partitions
print(animals_rdd.glom().collect())

The number of partitions affects the processing performance as it represents the number of 'pieces' of data that a cluster can work with in parallel. If we have too many partitions, not all of them will be processed in parallel because we might not have enough computing nodes. On the other hand, if we have too few partitions, some computing nodes may be left unused.

The number of partitions is a parameter that requires calibration for intensive tasks, but in the remaining examples of our notebooks we are always going to use two.
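To make the idea of partitions more concrete, the following plain-Python sketch splits a list into contiguous chunks of roughly equal size. The helper `split_into_partitions` is hypothetical and only mimics the shape of the output of `glom().collect()`; the way Spark actually places elements into partitions may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous chunks of near-equal size.

    Illustrative helper only; not part of Spark.
    """
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

animals = ['Dog', 'Cat', 'Rabbit', 'Hare', 'Deer', 'Gull', 'Woodpecker', 'Mole']

# Show how the same list could be divided into 2, 3 and 4 pieces.
for k in (2, 3, 4):
    print(k, split_into_partitions(animals, k))
```

Every element ends up in exactly one chunk, and with more chunks each one holds fewer elements, which is the trade-off described above.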

#### Lazy evaluation

Now let's suppose that we have the following RDD:

In [None]:
duplicates = ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Parrot']
duplicates_rdd = sc.parallelize(duplicates, 2)

and we want to find its distinct elements using the `distinct` function: 

In [None]:
distinct_elements = duplicates_rdd.distinct()
print(distinct_elements)

As we observe, printing `distinct_elements` shows only a description of the RDD object, not its values. That's because of an evaluation strategy called **lazy evaluation** that Spark follows.

In **lazy evaluation** parts of our code are executed only when there is a need to do so. The benefits of **lazy evaluation** are:
* Saving time by executing operations only when we ask for a result to be produced
* Similarly saving system resources
* 'Automatic' performance improvements through operation planning, since Spark knows the full sequence of operations it has to perform before any result is requested

Spark achieves this by having two distinct types of operations, **transformations** and **actions**.

### Transformations

Transformations are operations that are not carried out when the code in a cell is executed; they only run once an **action** is called. We can think of transformations as operations that we know how to do, but will not do until there is a reason for it. Examples of transformations are mapping a function over an RDD and filtering a set of values.

### Actions

Actions are commands that Spark computes as soon as the corresponding code is executed in a cell. They run all of the preceding transformations in order to return an actual result. An action is composed of one or more *jobs*, and each job consists of *tasks*. Tasks are executed in parallel when possible.
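The split between transformations and actions can be illustrated with a plain-Python analogy (no Spark involved). Python's built-in `map` is also lazy: it only describes the work, and nothing runs until the values are requested, here with `list`. The `calls` list and the function `to_upper` are introduced purely for this illustration.

```python
calls = []  # records every time the function actually runs

def to_upper(word):
    calls.append(word)
    return word.upper()

words = ['Dog', 'Cat', 'Parrot']

# Comparable to a transformation: builds a lazy iterator, nothing runs yet.
lazy_result = map(to_upper, words)
calls_before = len(calls)

# Comparable to an action: forces evaluation and returns concrete values.
result = list(lazy_result)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The function has not been called at all before `list` is applied, and is called once per element afterwards. This is only an analogy; Spark additionally plans and distributes the work across the cluster.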

Below are some examples of transformations and actions:

<img src="images/transformations_actions.png" width="350"/>

### Code examples

In the following examples we are going to use Spark to create RDDs and apply transformations and actions to them. 

***Example 1:*** revisits the conversion of Celsius temperatures to Fahrenheit using `map` with RDDs.  
***Example 2:*** calculates the average temperature of a lake using `reduce` with RDDs.   

When executing cells that contain Spark actions an interface will appear which shows the progress of the operations. Also, more details like operation planning for each example can be found in the [Spark UI]().

#### Example 1: Celsius to Fahrenheit

For our first example we are going to revisit the conversion of temperatures from Celsius to Fahrenheit. The function that we used for the conversion previously was:

In [None]:
def to_Fahrenheit(temperature):
    return temperature * 9/5 + 32

Here, we are going to do the same conversion but in a scalable way (by utilizing Spark). Our list of temperatures is:

In [None]:
celcius_temperatures = [10, 15, 9, -2, 30]

We first pass our data to Spark by converting the temperature list to an RDD:

In [None]:
celcius_temperatures_rdd = sc.parallelize(celcius_temperatures, 2)

Then we verify that the RDD contains the temperatures using `collect`, which returns the contents of an RDD as a list:

In [None]:
print(celcius_temperatures_rdd.collect())

Our next step is to map the function `to_Fahrenheit` over the RDD. In Spark, `map` is called as a method on the RDD and takes the function to apply as its argument, i.e. `rdd.map(function)`.

In our case this translates to:

In [None]:
celcius_temperatures_rdd.map(to_Fahrenheit)

However, `map` is a transformation, so we won't see any results until we use an action. For this reason we use `collect`:

In [None]:
temperatures = celcius_temperatures_rdd.map(to_Fahrenheit)
print(temperatures.collect())
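On an input this small, the Spark result can be cross-checked locally. The sketch below applies the same function with Python's built-in `map`, which is the single-machine counterpart of `RDD.map`; the name `local_result` is introduced only for this check.

```python
def to_Fahrenheit(temperature):
    return temperature * 9/5 + 32

celcius_temperatures = [10, 15, 9, -2, 30]

# Apply the conversion to every element on the local machine.
local_result = list(map(to_Fahrenheit, celcius_temperatures))
print(local_result)
```

The values should match those returned by `collect` above, up to floating-point rounding.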

#### Example 2: Temperature average

Now let's see how we could find the average water temperature of Lake Como, Italy, in July. First we convert our data to an RDD:

In [None]:
water_temperatures = [23.4, 27.5, 25.1, 22.1, 23.9]
water_temperatures_rdd = sc.parallelize(water_temperatures, 2)

Then, we define the addition function:

In [None]:
def add(number_1, number_2):
    return number_1 + number_2

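and after that we can add the temperatures using `reduce`. Like `map`, `reduce` is called as a method on the RDD and takes a function as its argument, i.e. `rdd.reduce(function)`; the function must accept two elements and return a single combined value.

So it becomes: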

In [None]:
added_temperatures = water_temperatures_rdd.reduce(add)

Next, to calculate the average temperature we need to know how many temperatures we have. To count them we can use `count`:

In [None]:
number_of_temperatures = water_temperatures_rdd.count()

And now we are ready to calculate the average water temperature in Lake Como:

In [None]:
average_temperature = added_temperatures / number_of_temperatures
print(average_temperature)
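To see why `reduce` parallelizes well, the following plain-Python sketch reduces each partition separately and then combines the partial results. The split of the temperatures into two partitions is an assumption made for illustration; Spark may distribute the elements differently. This only works reliably because addition is associative and commutative, which is what Spark expects of the function passed to `reduce`.

```python
from functools import reduce

def add(number_1, number_2):
    return number_1 + number_2

# Assumed split of the five temperatures into two partitions (illustrative only).
partitions = [[23.4, 27.5], [25.1, 22.1, 23.9]]

# Step 1: each partition is reduced independently (this part could run in parallel).
partial_sums = [reduce(add, partition) for partition in partitions]

# Step 2: the partial results are combined into the final sum.
total = reduce(add, partial_sums)

count = sum(len(partition) for partition in partitions)
average = total / count
print(round(average, 2))
```

The average comes out at about 24.4 degrees, regardless of how the elements are grouped into partitions.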

<span style="display:none" id="question1">W3sicXVlc3Rpb24iOiAiV2hlbiB3ZSBsb2FkIGRhdGEgdG8gU3BhcmsgdGhleSBhcmUgY29udmVydGVkIHRvIFJERHM6IiwgInR5cGUiOiAibXVsdGlwbGVfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJjb2RlIjogIlRydWUiLCAiY29ycmVjdCI6IHRydWV9LCB7ImNvZGUiOiAiRmFsc2UiLCAiY29ycmVjdCI6IGZhbHNlfV19XQ==</span>

<span style="display:none" id="question2">W3sicXVlc3Rpb24iOiAiJ01hcCcgYW5kICdSZWR1Y2UnIG9wZXJhdGlvbnMgaGFwcGVuIGluIHBhcmFsbGVsIGluIFJERHM6IiwgInR5cGUiOiAibXVsdGlwbGVfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJjb2RlIjogIlRydWUiLCAiY29ycmVjdCI6IHRydWV9LCB7ImNvZGUiOiAiRmFsc2UiLCAiY29ycmVjdCI6IGZhbHNlfV19XQ==</span>

<span style="display:none" id="question3">W3sicXVlc3Rpb24iOiAiU3BhcmsgdHJhbnNmb3JtYXRpb25zIGFyZSBleGVjdXRlZCB3aGVuIHRoZSBjb3JyZXNwb25kaW5nIGNlbGwgaXMgZXhlY3V0ZWQ6IiwgInR5cGUiOiAibXVsdGlwbGVfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJjb2RlIjogIlRydWUiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiVGhhdCdzIHdoYXQgYWN0aW9ucyBkby4ifSwgeyJjb2RlIjogIkZhbHNlIiwgImNvcnJlY3QiOiB0cnVlfV19XQ==</span>

### Practice questions

#### Q1:

In [None]:
from jupyterquiz import display_quiz

display_quiz('#question1')

#### Q2:

In [None]:
display_quiz('#question2')

#### Q3:

In [None]:
display_quiz('#question3')

### More advanced examples

##### Example A1: Operation chaining

There are cases where we need to perform multiple operations on an RDD. Suppose that we want to find how many distinct even numbers exist in the following list:

In [None]:
data = [34, 1, 12, 71, 92, 5, 6, 23, 11, 45]

A possible solution is to define a function that checks whether a number is even:

In [None]:
def is_even(number):
    return number % 2 == 0

convert the data to an RDD:
    

In [None]:
data_rdd = sc.parallelize(data, 2)

filter out odd numbers:

In [None]:
data_rdd = data_rdd.filter(is_even)

remove duplicates:

In [None]:
data_rdd = data_rdd.distinct()

and finally count the remaining numbers:

In [None]:
print(data_rdd.count())

These operations can be summarized as follows:

In [None]:
data_rdd = sc.parallelize(data, 2)
data_rdd = data_rdd.filter(is_even)
data_rdd = data_rdd.distinct()
result = data_rdd.count()
print(result)

We notice that as our operations increase in number, our code becomes verbose and difficult to comprehend. For this reason Spark allows us to *chain* operations. With chaining, the code of the previous cell could be rewritten as:

In [None]:
result = sc.parallelize(data, 2)\
            .filter(is_even)\
            .distinct()\
            .count()
print(result)

Now the code is more readable, and the flow of the operations is easier to follow.
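As a sanity check, the same pipeline can be reproduced locally in plain Python: `filter` keeps the even numbers, `set` removes duplicates, and `len` counts what is left. The name `local_result` is introduced only for this comparison and is not part of the Spark code.

```python
data = [34, 1, 12, 71, 92, 5, 6, 23, 11, 45]

def is_even(number):
    return number % 2 == 0

# filter -> distinct -> count, written with Python built-ins
local_result = len(set(filter(is_even, data)))
print(local_result)
```

The even numbers in the list are 34, 12, 92 and 6, so the count is 4, which is the value the chained Spark expression should return as well.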

### Further reading

* [RDDs](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)
* [Lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation)
* [RDD transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)
* [RDD actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)
* [Map with RDDs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html?highlight=map#pyspark.RDD.map)
* [Reduce with RDDs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html?highlight=reduce#pyspark.RDD.reduce)