![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)

# Spark Tutorial
_______________________

## The Problem

Your data is distributed across multiple disks on several computers connected by a network, likely surpassing the storage capacity of any single computer. Without proper infrastructure, managing and processing this data would involve tracking down each piece separately, processing them individually, and then combining the results—a cumbersome and time-consuming process.

Fortunately, **Hadoop** resolves many of these challenges. Its **Hadoop Distributed File System** (HDFS) manages distributed data storage, and its MapReduce API handles the processing. However, the MapReduce API is quite low-level and can be time-consuming to work with.

**Spark** offers high-level APIs for large-scale data processing. It can run on top of an existing HDFS or on a single machine (local mode), which doesn't require setting up an HDFS. Running Spark on a single machine is of limited use for real workloads, but it is a convenient way to explore and test its features. Moreover, code written and tested this way on a laptop can later run, largely unchanged, on huge datasets stored on HDFS.
______________________

### Spark Context

The [SparkContext] is the entry point to Spark functionality.

In the Spark framework, two primary actors are involved: the driver (e.g., a Python notebook) and the executors (e.g., Java Virtual Machines). The driver divides jobs into tasks, which are then submitted to the executors. The executors run these tasks and return the results to the driver.

We will run Spark in local mode so that we can avoid setting up a whole HDFS on our machine, and we will focus mainly on the programming paradigm. However, it is useful to know that platforms such as [Databricks] exist. They simplify much of the work needed to set up a real cluster on which Spark can run.

In local mode, you can access the Spark Web UI at http://localhost:4040.

[SparkContext]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html?highlight=pyspark%20sparkcontext#pyspark.SparkContext
[DataBricks]: https://databricks.com/

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
sc = spark.sparkContext
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/28 11:55:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Resilient Distributed Dataset (RDD)

[RDDs] are one of the main abstractions of Spark. An RDD is an **immutable** collection of elements distributed across different nodes.
- **Resilient**: The system is able to recompute/recover partitions that are missing or damaged due to node failures.
- **Distributed**: Data resides on multiple nodes in a cluster.
- **Dataset**: Collection of data.
- **Immutable**: Once created, they cannot change.
- **Lazily evaluated**: Operations are performed only when necessary.
- **Parallel**: Operations are performed in parallel.

<div style="text-align:center"><img src="http://spark-mooc.github.io/web-assets/images/partitions.png" alt="drawing" width="600"/></div>

An RDD can be created by calling SparkContext’s [parallelize] method ```sc.parallelize()``` on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset. ```sc.parallelize``` takes two arguments:
   1. The collection used to form the RDD.
   2. The number of partitions to cut the dataset into. If you omit it, Spark tries to set the number of partitions automatically based on your cluster.

[parallelize]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.parallelize.html?highlight=parallelize
[RDDs]:https://spark.apache.org/docs/latest/rdd-programming-guide.html
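
To build an intuition for what "cutting the dataset into partitions" means, here is a minimal plain-Python sketch. The helper `split_into_partitions` is hypothetical (it is not part of Spark), and it assumes contiguous, roughly equal slices; the exact way Spark assigns elements to partitions may differ.

```python
def split_into_partitions(data, num_partitions):
    """Cut a collection into contiguous, roughly equal slices (hypothetical helper)."""
    items = list(data)
    n = len(items)
    # Slice i covers the indices [i*n // k, (i+1)*n // k), with k = num_partitions
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

partitions = split_into_partitions(range(30), 4)
for index, part in enumerate(partitions):
    print(index, part)
```

On a real RDD, `rdd.glom().collect()` returns the elements grouped by partition, which lets you inspect the actual layout.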

In [6]:
# Parallelize data using 4 partitions
# parallelize() turns the local collection into an RDD
# Spark uses lazy evaluation, so no Spark job runs until collect() is called below
data = range(30)
rdd  = sc.parallelize(data, 4)

print(data)
print(rdd.collect())

range(0, 30)


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]


                                                                                

In [7]:
# Each RDD gets a unique ID
print(f'rdd id: {rdd.id()}')

rdd id: 1


In [8]:
# We can name each newly created RDD using the setName() method
rdd.setName('My first rdd')

My first rdd PythonRDD[1] at collect at /tmp/ipykernel_19993/1815227390.py:8

In [9]:
# Let's view the lineage (the set of transformations) of the RDD using toDebugString()
print(rdd.toDebugString())

b'(4) My first rdd PythonRDD[1] at collect at /tmp/ipykernel_19993/1815227390.py:8 []\n |  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289 []'


In [10]:
# Let's see how many partitions the RDD is split into using getNumPartitions()
rdd.getNumPartitions()

4

In [11]:
type(rdd)

pyspark.rdd.PipelinedRDD

## Transformations vs Actions
There are two types of operations that you can perform on an RDD: transformations and actions.
- **Transformations**. Transformations are applied to RDDs and produce other RDDs. They are lazily evaluated, meaning that they are not computed until an action is performed. Some common transformations are [map] and [filter].
- **Actions**. Actions do not return RDDs. Instead, they set in motion the sequence of transformations required to produce the result and return that result once the computation is done. Some common actions are [collect], [count], [reduce], and [take].
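
Lazy evaluation can be observed without Spark, because Python's built-in `map` is lazy too. The sketch below is only an analogy: `decrement` and the `calls` list are names introduced here for illustration, and consuming the pipeline with `list()` plays the role of an action.

```python
calls = []

def decrement(value):
    calls.append(value)      # record that the function actually ran
    return value - 1

# Building the pipeline is like declaring a transformation: nothing runs yet
pipeline = map(decrement, range(5))
calls_before = len(calls)

# Consuming the pipeline is like calling an action: now the function is applied
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)
```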

[map]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html?highlight=map
[filter]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.filter.html?highlight=filter
[reduce]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html?highlight=reduce
[collect]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collect.html?highlight=collect
[count]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html?highlight=count
[take]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.take.html?highlight=take


### The **map** Transformation
The ```map(f)``` transformation is the most common Spark transformation: it applies a function ```f``` to each item in the dataset and produces the resulting dataset. When an action triggers the execution of a [map], Spark runs it as part of a **stage**. A stage is a group of tasks that all perform the same computation but on different input data. One task is launched for each partition of the dataset. A task represents a unit of execution that runs on a single machine. In the example below, the dataset is divided into four partitions (spread over three workers), resulting in the launch of four ```map()``` tasks.

<img src="http://spark-mooc.github.io/web-assets/images/tasks.png" alt="drawing" width="500"/> <img src="http://spark-mooc.github.io/web-assets/images/map.png" alt="drawing" width="500"/>

[map]: https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.RDD.map.html?highlight=map
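
The "one task per partition" idea can be sketched in plain Python. This is an analogy, not how Spark is implemented: the four partitions are written by hand, `run_task` is a hypothetical helper, and three threads stand in for three workers.

```python
from concurrent.futures import ThreadPoolExecutor

def sub(value):
    return value - 1

def run_task(partition):
    """One task: apply the same function to every element of one partition."""
    return [sub(x) for x in partition]

# Four hand-made partitions standing in for the partitions of an RDD (assumed layout)
partitions = [list(range(0, 7)), list(range(7, 15)), list(range(15, 22)), list(range(22, 30))]

# Three worker threads share the four tasks, one task per partition
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped_partitions = list(pool.map(run_task, partitions))

print(len(mapped_partitions))
print(mapped_partitions[0])
```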

In [14]:
# Define a function that subtracts 1 from its input
def sub(value):
    return value - 1

# Apply sub to every element (lazy: nothing is computed yet)
rdd2 = rdd.map(sub)


We have applied `map(sub)` to the RDD, so each element of the resulting RDD will be decremented by ```1```. However, no computation has started yet. As mentioned earlier, Spark employs lazy evaluation, meaning that actual computation begins only when an action is required. Let's consider one such operation that triggers the computation: the [collect] action.

It's important to approach calling ```.collect()``` with caution, as it brings the requested data into your machine's memory. If, for example, you inadvertently request several gigabytes of data, your machine may crash as its memory becomes saturated, potentially leading to the loss of several hours' worth of work.

[collect]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collect.html

In [15]:
print(rdd2.collect())

[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]


### The **filter** Transformation
The [filter] transformation keeps only the elements of an RDD that satisfy a condition. Like the `map()` transformation, it applies a function to every element of the RDD, but here the function must return a boolean, and only the elements for which it returns `True` are kept. For example, consider a function `f` that returns `True` if the input is odd and `False` otherwise. If you have an RDD containing a list of numbers and filter it with `f`, you will obtain another RDD containing only the odd numbers.

Let's try this out.
 
<img src="http://spark-mooc.github.io/web-assets/images/tasks.png" alt="drawing" width="500"/><img src="http://spark-mooc.github.io/web-assets/images/filter.png" alt="drawing" width="500"/>

[filter]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.filter.html?highlight=filter


In [17]:
# Keep only the numbers less than 10

def isLessThan10(x): return x < 10

result = rdd.filter(isLessThan10).collect()
print(result)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
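
The odd-number example described above can be tried locally with Python's built-in `filter`, which follows the same idea on an ordinary list. This is a plain-Python analogue, not a Spark call, and `is_odd` is a name chosen here for illustration.

```python
def is_odd(x):
    """Return True for odd numbers and False otherwise."""
    return x % 2 == 1

numbers = list(range(10))
# Keep only the elements for which is_odd returns True
odd_numbers = list(filter(is_odd, numbers))
print(odd_numbers)  # [1, 3, 5, 7, 9]
```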


### The **reduce** function
The [reduce] function can be a bit trickier to grasp, so let's first explore it outside of the Spark framework.

```reduce(f)``` takes a function, which we'll call ```f```. This time, ```f(*,*)``` accepts two arguments: the first one, referred to as the **accumulator**, and the second one, known as the **current value**. For now, let's set aside the fact that data are stored in RDDs. Imagine you have a list of 100 numbers: [1, 2, 3, ..., 100]. The function ```f``` is called once for every element of the list after the first:

1) The first time ```f``` is called, the accumulator takes the first value of the list (```1``` in our example), while the second argument of ```f``` is the second number in the list (```2``` in our example).
2) The second time ```f``` is called, the accumulator takes the value of the output from step 1). Meanwhile, the second argument of ```f``` is the third number in the list (```3``` in our example).
3) The third time ```f``` is called, the accumulator takes the value of the output from step 2). Meanwhile, the second argument of ```f``` is the fourth number in the list (```4``` in our example).
4) And so on...
99) The 99th time ```f``` is called, the accumulator takes the value of the output from step 98). Meanwhile, the second argument of ```f``` is the 100th number in the list (```100``` in our example).

In practice, the [reduce] function combines all the elements of the list into a single result by repeatedly applying ```f``` while accumulating the intermediate results.
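
The sequence of calls described in the steps above can be made visible by recording the arguments of every call. The sketch uses a shorter list, `[1, 2, 3, 4, 5]`, and the names `trace` and `traced_sum` are introduced here only for illustration.

```python
from functools import reduce

trace = []

def traced_sum(acc, x):
    trace.append((acc, x))   # remember the arguments of each call
    return acc + x

total = reduce(traced_sum, [1, 2, 3, 4, 5])

for step, (acc, x) in enumerate(trace, start=1):
    print(f"call {step}: acc={acc}, x={x} -> {acc + x}")
print(total)
```

With five elements the function is called four times, and the accumulator of each call is the output of the previous one.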

[reduce]:https://docs.python.org/3/library/functools.html

In [20]:
from functools import reduce  # not using Spark

def sumAll(acc, x): return acc + x

result = reduce(sumAll, range(30))
print(result)

435


In [21]:
# Now let us accumulate only odd numbers. 
# You do not need to understand this function too deeply.
# Just keep in mind that the reduce function is a lot more flexible than it appears.

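def isOdd(x):
    return x % 2 == 1

def AccumulateOdds(acc, x):
    # First call: acc is still the first element of the list, not a list
    if not isinstance(acc, list):
        return [e for e in (acc, x) if isOdd(e)]
    # Later calls: append x to the accumulated list only if it is odd
    return acc + [x] if isOdd(x) else acc

print(reduce(AccumulateOdds, list(range(30))))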

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]


### Exercises

* (★) Given the list `[1,2,3,4,5]`, multiply all the elements together using `reduce` from functools (result: `120`).
* (★) Given the list `[1,2,3,4,5]`, sum all the elements greater than `2` using `reduce` from functools (result: `12`).
* (★) Given the list `[1,2,3,4,5]`, compute the pair (sum, number of elements) using `reduce` from functools (result: `(15,5)`).
* (★★) Given the list `[1,2,3,4,5]`, filter out the even elements using `reduce` from functools (result: `[1,3,5]`).

### The **reduce** action
The [reduce] action in Spark differs slightly from the functools `reduce`, but the main concepts remain valid. Once again, the `reduce(f)` action takes a function `f` that combines the elements of the RDD two at a time. This function accepts two arguments. The first argument accumulates the results and is fed back to successive calls of `f` (as in Python's `reduce`). However, Spark first reduces each partition on its own and then merges the partial results, so the second argument can also be an accumulated value. For this reason, `f` should be commutative and associative.

For simple reducers like our `sumAll`, this does not make much difference. However, for reducers like ```AccumulateOdds```, it can make a significant difference.
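
Why can both arguments be accumulated values? Spark reduces each partition locally and then combines the partial results. The plain-Python sketch below emulates that behaviour in a simplified way: `partitioned_reduce` is a hypothetical helper, and the split into two partitions is an assumption. It also shows that a function that is not associative, such as subtraction, gives a different answer than a sequential `reduce`.

```python
from functools import reduce

def partitioned_reduce(f, partitions):
    """Reduce each non-empty partition locally, then reduce the partial results."""
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)

def sum_all(x, y):
    return x + y

def subtract(x, y):
    return x - y   # neither associative nor commutative

data = list(range(1, 9))               # [1, 2, ..., 8]
partitions = [data[0:4], data[4:8]]    # assumed split into two partitions

sequential_sum = reduce(sum_all, data)
partitioned_sum = partitioned_reduce(sum_all, partitions)
sequential_diff = reduce(subtract, data)
partitioned_diff = partitioned_reduce(subtract, partitions)

print(sequential_sum, partitioned_sum)     # the two sums agree
print(sequential_diff, partitioned_diff)   # the two differences do not
```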

[reduce]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html?highlight=reduce

In [22]:
# Again, sum all the elements
result = rdd.reduce(sumAll)
print(result)

435


In [None]:
# Now let us accumulate only odd numbers. 
# You do not need to understand this function too deeply.
# Just keep in mind that the reduce function is a lot more flexible than it appears.

def AccumulateOddsSpark(x, y):
    if type(x) != list and type(y) != list: return list(filter(isOdd, [x,y]))
    if type(x) == list and type(y) != list: return x + [y] if isOdd(y) else x
    if type(x) != list and type(y) == list: return y + [x] if isOdd(x) else y
    if type(x) == list and type(y) == list: return x + y

result = sc.parallelize(range(20), 4).reduce(AccumulateOddsSpark)
print(result)

# A little side note: there is a drastic difference between using .filter(isOdd).collect() and using this reducer.
# In the first case, Spark is responsible for gathering all the filtered results.
# In this case, instead, we gather the results ourselves.
# Of course, this can lead to inefficiencies and is quite error-prone.
# Again, this is only meant to show the flexibility of the reduce function.
# If you can obtain your result using map and filter, just use them.

### Exercises

* (★) Given the list `[1,2,3,4,5]`, multiply all the elements together using `reduce` from Spark (result: `120`).
* (★) Given the list `[1,2,3,4,5]`, sum all the elements greater than `2` using `reduce` from Spark (result: `12`).
* (★) Given the list `[1,2,3,4,5]`, compute the pair (sum, number of elements) using `reduce` from Spark (result: `(15,5)`).
* (★★) Given the list `[1,2,3,4,5]`, filter out the even elements using `reduce` from Spark (result: `[1,3,5]`).


### The **count** Action

One of the most basic actions we can run is the [count] method, which counts the number of elements in an RDD.

Each task counts the entries in its partition and sends the result to your SparkContext, which then aggregates all of the counts. The figure below illustrates what would happen if we executed `count` on a small example dataset with just four partitions.

<img src="http://spark-mooc.github.io/web-assets/images/count.png" alt="drawing" width="500"/>


[count]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html?highlight=count
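
The two-step logic of `count` (one partial count per partition, then an aggregation) can be sketched in plain Python. The four partitions below are made up for illustration and the code is only an analogy of what the tasks and the driver do.

```python
# Four assumed partitions of a small dataset
partitions = [[0, 1, 2], [3, 4], [5, 6, 7, 8], [9]]

# Each task counts the entries of its own partition
partial_counts = [len(part) for part in partitions]

# The partial counts are then aggregated into the final result
total = sum(partial_counts)

print(partial_counts, total)
```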

In [None]:
print(rdd.count())

## Additional Actions

* [first]: `first()` returns the first element of the RDD. Which element comes first depends on how the RDD is partitioned.
* [take]: `take(num)` returns the first `num` elements of the RDD. Which elements come first depends on how the RDD is partitioned.
* [top]: `top(num, key=None)` returns the `num` largest elements, in descending order, compared according to `key`.
* [takeOrdered]: `takeOrdered(num, key=None)` returns the `num` smallest elements, in ascending order, compared according to `key`.
* [takeSample]: `takeSample(withReplacement, num, seed=None)` randomly selects `num` elements from the RDD. If `withReplacement=True`, the same element can be returned multiple times. Using the same `seed` twice yields the same result.
* [countByValue]: `countByValue()` returns the count of each unique value in the RDD as a dictionary of (value, count) pairs.
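
For intuition, each of these actions has a rough counterpart in the Python standard library. The sketch below works on an ordinary list and is only an analogy: in particular, the random sample drawn here will not match the one Spark draws for the same seed.

```python
import heapq
import random
from collections import Counter

data = [5, 3, 8, 3, 1, 8, 8]

first_element = data[0]                   # like first()
first_three = data[:3]                    # like take(3)
top_three = heapq.nlargest(3, data)       # like top(3): largest first
lowest_three = heapq.nsmallest(3, data)   # like takeOrdered(3): smallest first
value_counts = Counter(data)              # like countByValue()

# Like takeSample(True, 3, 14): sampling with replacement and a fixed seed
sample = random.Random(14).choices(data, k=3)
same_sample = random.Random(14).choices(data, k=3)

print(first_element, first_three, top_three, lowest_three)
print(dict(value_counts))
print(sample == same_sample)
```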

[first]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.first.html?highlight=first#pyspark.RDD.first 
[take]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.take.html?highlight=take#pyspark.RDD.take
[top]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.top.html?highlight=top#pyspark.RDD.top
[takeOrdered]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.takeOrdered.html?highlight=takeordered#pyspark.RDD.takeOrdered
[takeSample]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.takeSample.html?highlight=takesample#pyspark.RDD.takeSample
[countByValue]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.countByValue.html?highlight=countbyvalue#pyspark.RDD.countByValue

In [None]:
print(f"rdd.first()                    = {rdd.first()}")
print(f"rdd.take(5)                    = {rdd.take(5)}")
print(f"rdd.top(5, lambda x:x)         = {rdd.top(5, lambda x:x)}")
print(f"rdd.takeOrdered(5, lambda x:x) = {rdd.takeOrdered(5, lambda x:x)}")
print(f"rdd.takeSample(True, 5, 14)    = {rdd.takeSample(True, 5, 14)}")
print(f"rdd.countByValue()             = {rdd.countByValue()}")

## Additional Transformations
* [flatMap]: The `flatMap(f)` transformation returns a new RDD by applying the function `f` to all elements of the RDD and then flattening the results.
* [groupByKey]: The `groupByKey()` transformation groups the values for each key in the RDD into a single sequence.
* [reduceByKey]: The `reduceByKey(func)` transformation merges the values for each key using an associative and commutative reduce function.

The last two transformations ([groupByKey] and [reduceByKey]) operate on pair RDDs, where each element is a tuple (key, value). For example, `sc.parallelize([('a', 1), ('a', 2), ('b', 1)])` would create a pair RDD with keys 'a', 'a', 'b', and values 1, 2, 1 respectively.

The `reduceByKey()` transformation gathers pairs with the same key and applies a **reduce** function to the associated values. It operates by applying the function within each partition first, and then across partitions.

While both `groupByKey()` and `reduceByKey()` can often solve the same problem and produce the same answer, `reduceByKey()` is more efficient for large distributed datasets. This is because Spark can combine output with a common key on each partition before shuffling the data across nodes. Only use `groupByKey()` if reducing the data before redistribution (**shuffling**) would not benefit the operation.

To understand how `reduceByKey` works, observe the diagram below. Pairs with the same key on the same machine are combined before data shuffling occurs. Conversely, when using `groupByKey()`, all key-value pairs are shuffled, resulting in unnecessary data transfer over the network.

When Spark needs to shuffle data, it calls a partitioning function on the key of the pair to determine which machine to shuffle the pair to. If more data is shuffled onto a single executor machine than can fit in memory, Spark spills data to disk, impacting performance severely. This situation should be avoided to maintain optimal performance.
<img src="http://spark-mooc.github.io/web-assets/images/group_by.png" alt="drawing" width="500"/> <img src="http://spark-mooc.github.io/web-assets/images/reduce_by.png" alt="drawing" width="500"/>

[flatMap]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.flatMap.html?highlight=flatmap#pyspark.RDD.flatMap
[groupByKey]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.groupByKey.html?highlight=groupbykey#pyspark.RDD.groupByKey
[reduceByKey]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.reduceByKey.html?highlight=reducebykey#pyspark.RDD.reduceByKey
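
The difference in shuffled data can be illustrated with a plain-Python emulation. Everything here is a simplified sketch under stated assumptions: the two partitions are made up, `group_by_key` and `reduce_by_key` are hypothetical helpers (not Spark functions), and the "shuffle" is just a list whose length we measure.

```python
from collections import defaultdict

# Two assumed partitions of (key, value) pairs
partitions = [
    [('a', 1), ('a', 2), ('b', 1)],
    [('a', 3), ('b', 4), ('b', 5)],
]

def group_by_key(partitions):
    """Shuffle every pair, then group the values by key."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)

def reduce_by_key(f, partitions):
    """Combine values per key inside each partition, then shuffle only the partial results."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = f(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}
    for key, value in shuffled:
        merged[key] = f(merged[key], value) if key in merged else value
    return merged, len(shuffled)

groups, grouped_shuffle_size = group_by_key(partitions)
sums_from_groups = {key: sum(values) for key, values in groups.items()}
sums, reduced_shuffle_size = reduce_by_key(lambda x, y: x + y, partitions)

print(sums_from_groups, grouped_shuffle_size)
print(sums, reduced_shuffle_size)
```

Both approaches produce the same sums, but the emulated `reduce_by_key` moves four partial results instead of six raw pairs. The gap grows with the number of repeated keys per partition.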

In [25]:
pair_rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1)])
print("flatMap    :", pair_rdd.flatMap(lambda x: [x,x]).collect())
print("groupByKey :", pair_rdd.groupByKey().map(lambda x:(x[0],list(x[1]))).collect())
print("reduceByKey:", pair_rdd.reduceByKey(lambda x,y:x+y).collect())

flatMap    : [('a', 1), ('a', 1), ('a', 2), ('a', 2), ('b', 1), ('b', 1)]
groupByKey : [('a', [1, 2]), ('b', [1])]
reduceByKey: [('a', 3), ('b', 1)]


### Exercises

You will work on this RDD of (name, date, price) tuples:
```
('alpha', '9/2/22' , '932$'), 
('alpha', '10/2/22', '904$'), 
('alpha', '11/2/22', '806$'),
('beta' , '9/2/22' , '2831$'), 
('beta' , '10/2/22', '2732$'), 
('beta' , '11/2/22', '2685$'),
('gamma', '9/2/22' , '312$'), 
('gamma', '10/2/22', '301$'), 
('gamma', '11/2/22', '285$')
```

* (★) compute the total price. (result: `11788`) 
* (★★) compute the average per name. (result: `('alpha', 880.666), ('beta', 2749.333), ('gamma',299.333)`) 


## RDDs Memory Management

For efficiency, Spark stores RDDs in memory (RAM), allowing for quick access to the data. However, memory is limited, so if you attempt to keep too many RDDs in memory, Spark will automatically evict some of them to make space for new ones. If you later reference one of the evicted RDDs, Spark will recreate it automatically, but this process takes time.

To ensure that frequently used RDDs remain in memory, you can use the `cache()` operation to instruct Spark to keep the RDD in memory. However, the RDD will only be cached after you trigger an action on it, such as `collect()`. It's essential to note that if you [cache] too many RDDs and Spark exhausts its memory, it will evict the least recently used (LRU) RDD first. Again, accessing the RDD will trigger automatic recreation.

You can verify if an RDD is cached by using the `is_cached` attribute, and you can monitor your cached RDDs in the "Storage" section of the Spark web UI. Clicking on the RDD's name provides more details about its storage location.
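
The least-recently-used eviction described above can be illustrated with a toy cache in plain Python. This is only a conceptual sketch: `TinyLRUCache` is a hypothetical class, it stores whole datasets by name, and it does not reflect how Spark manages memory internally.

```python
from collections import OrderedDict

class TinyLRUCache:
    """Hypothetical stand-in for a memory-limited cache of computed datasets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.recomputations = 0

    def get(self, name, compute):
        if name in self.store:
            self.store.move_to_end(name)      # mark as most recently used
            return self.store[name]
        self.recomputations += 1              # miss: (re)build from its recipe
        value = compute()
        self.store[name] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict the least recently used entry
        return value

cache = TinyLRUCache(capacity=2)
cache.get("a", lambda: [x * 2 for x in range(3)])
cache.get("b", lambda: [x + 1 for x in range(3)])
cache.get("a", lambda: [x * 2 for x in range(3)])   # hit: "a" becomes most recent
cache.get("c", lambda: [x - 1 for x in range(3)])   # cache is full: "b" is evicted
cached_names = list(cache.store)
cache.get("b", lambda: [x + 1 for x in range(3)])   # miss: "b" is recomputed

print(cached_names, cache.recomputations)
```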

[cache]: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.cache.html?highlight=cache#pyspark.RDD.cache

In [None]:
new_rdd = sc.parallelize(range(10))
new_rdd.setName("MyRDD")
print("before .cache()              :", new_rdd.is_cached)
new_rdd.cache()
print("after  .cache()              :", new_rdd.is_cached)
new_rdd.collect()
print("after  .cache() and an action:", new_rdd.is_cached)

Spark automatically manages the RDDs cached in memory: if it runs out of memory, it evicts cached data or, depending on the storage level, saves it to disk. For efficiency, once you are finished using an RDD, you can tell Spark to stop caching it by calling the RDD's `unpersist()` method.

In [None]:
new_rdd.unpersist()