# Spark Tutorial
_______________________

### The Problem

Your data is split across several disks on several computers connected by a network. Taken together, the data probably takes up so much space that it would never fit on a single computer. Yet you still need to process it. Without proper infrastructure, you would have to track down each piece of data, process the pieces separately, and combine the results yourself. On top of that, you would have to repeat this process many times over.

Luckily, the Hadoop Distributed File System (HDFS) solves most of the storage problems, and Hadoop also provides the MapReduce API to do the processing. However, the MapReduce API is still very low-level, and getting things done with it takes too much time.

This is where Spark comes into play. Spark provides high-level APIs for large-scale data processing. It can run on top of an existing HDFS cluster, or it can run in local mode on a single machine (which does not require setting up HDFS). To be fair, Spark in local mode is not particularly useful for real workloads, but it provides an easy way to try out and test its functionality. Moreover, code that you write and run on your laptop is meant to work, largely unchanged, on a large cluster.
______________________

### Spark Context

The [SparkContext] is the entry point to Spark functionality.

In the Spark framework, there are two main actors: the driver and the executors. The driver has jobs that need to be run. The driver splits jobs into tasks. These tasks are submitted to executors. Once completed, results are sent back to the driver. 
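
The driver/executor split described above can be imitated in plain Python. The following is only a toy sketch, not Spark code: a thread pool stands in for the executors, and the names `split_into_tasks`, `run_task`, and `run_job` are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tasks(data, num_tasks):
    """Driver side: cut the input into roughly equal chunks, one per task."""
    size = max(1, -(-len(data) // num_tasks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_task(chunk):
    """Executor side: process one chunk and return a partial result."""
    return sum(x * x for x in chunk)

def run_job(data, num_executors=3, num_tasks=4):
    """Driver side: submit the tasks, wait for them, and combine the results."""
    tasks = split_into_tasks(data, num_tasks)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, tasks))
    return partial_results, sum(partial_results)

partials, total = run_job(list(range(10)))
print(partials)  # one partial sum of squares per task
print(total)     # the combined result computed by the "driver"
```

The job here is a sum of squares. Each task sees only its own chunk, and only the small partial results travel back to the driver.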

We will run Spark in local mode so that we can avoid running a whole HDFS cluster on our machine, and we will focus mainly on the programming paradigm. However, it may be useful to know that platforms such as [DataBricks] exist. They simplify much of the work needed to set up a real cluster on which Spark can run.

In local mode, you can access the Spark Web UI at http://localhost:4040.

[SparkContext]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html?highlight=pyspark%20sparkcontext#pyspark.SparkContext
[DataBricks]: https://databricks.com/

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
sc = spark.sparkContext
spark

22/02/19 12:43:39 WARN Utils: Your hostname, ataxia resolves to a loopback address: 127.0.1.1; using 192.168.1.91 instead (on interface enp4s0)
22/02/19 12:43:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/19 12:43:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Regardless of whether you are familiar with Python or not, these functions will help you a lot. [help] ```help(x)``` shows the documentation of ```x```. [type] ```type(x)``` returns the type of ```x```. [dir] ```dir(x)``` lists the names that are accessible inside ```x```. You can find many more on the [built-in] functions documentation page.

[help]:https://docs.python.org/3/library/functions.html#help
[type]:https://docs.python.org/3/library/functions.html#type
[dir]:https://docs.python.org/3/library/functions.html#dir
[built-in]:https://docs.python.org/3/library/functions.html
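
As a quick illustration of these helpers on an ordinary Python object, here is a small sketch. The variable names `values`, `kind`, and `public_names` are chosen only for this example.

```python
values = [3, 1, 2]

kind = type(values)  # the class of the object
# dir() also lists "dunder" names such as __len__; hide them for readability
public_names = [name for name in dir(values) if not name.startswith("_")]

print(kind)
print(public_names)
# help(values.sort)  # uncomment to print the documentation of list.sort
```

The same three calls work on Spark objects such as `sc` or an RDD.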

In [2]:
print(sc)
print(type(sc))
# help(sc)

<SparkContext master=local[*] appName=test>
<class 'pyspark.context.SparkContext'>


### Resilient Distributed Dataset (RDD)

[RDDs] are one of the main abstractions of Spark. An RDD represents an immutable collection of elements distributed across different nodes.
- **Resilient**: The system is able to recompute/recover missing or damaged partitions due to node failures.
- **Distributed**: Data resides on multiple nodes in a cluster.
- **Dataset**: Collection of data.
- **Immutable**: Once created, they cannot change.
- **Lazily evaluated**: Operations are performed only when necessary.
- **Parallel**: Operations are performed in parallel.

<div style="text-align:center"><img src="http://spark-mooc.github.io/web-assets/images/partitions.png" alt="drawing" width="600"/></div>

An RDD can be created by calling the SparkContext's [parallelize] method, ```sc.parallelize()```, on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset. ```sc.parallelize``` takes two arguments:
   1. The collection used to form the RDD.
   2. The number of partitions to cut the dataset into. If you omit it, Spark tries to set the number of partitions automatically based on your cluster.
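
To get a feeling for what "cutting a collection into partitions" means, here is a pure-Python sketch. It is an assumption-laden illustration: `slice_collection` is a hypothetical helper, and Spark's real slicing logic may distribute the elements differently. On a real RDD, `rdd.glom().collect()` shows the actual contents of each partition.

```python
def slice_collection(collection, num_partitions):
    """Split a collection into num_partitions contiguous slices of near-equal size."""
    items = list(collection)
    n = len(items)
    # Boundary i marks where partition i starts and partition i - 1 ends.
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

partitions = slice_collection(range(10), 4)
print(partitions)
```

Every element ends up in exactly one partition, and the partition sizes differ by at most one.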

[parallelize]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.parallelize.html?highlight=parallelize
[RDDs]:https://spark.apache.org/docs/latest/rdd-programming-guide.html

In [3]:
# Parallelize data using 4 partitions
# This operation turns the data into an RDD
# Spark uses lazy evaluation, so no Spark job runs until an action such as collect() is called
data = range(100)
rdd  = sc.parallelize(data, 4)

print(data)
print(rdd.collect())

range(0, 100)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


In [4]:
# Each RDD gets a unique ID
print('rdd id: {0}'.format(rdd.id()))

rdd id: 1


In [5]:
# We can name each newly created RDD using the setName() method
rdd.setName('My first rdd')

My first rdd PythonRDD[1] at collect at /tmp/ipykernel_20377/3543828524.py:8

In [6]:
# Let's view the lineage (the set of transformations) of the RDD using toDebugString()
print(rdd.toDebugString())

b'(4) My first rdd PythonRDD[1] at collect at /tmp/ipykernel_20377/3543828524.py:8 []\n |  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274 []'


In [7]:
# Let's see how many partitions the RDD is split into, using getNumPartitions()
rdd.getNumPartitions()

4

In [8]:
type(rdd)

pyspark.rdd.PipelinedRDD

### Transformations vs Actions
There are two types of operations that you can perform on an RDD: transformations and actions.
- **Transformations**. Transformations are applied to RDDs and produce other RDDs. They are lazily evaluated, meaning that they are not computed until an action is performed. Some common transformations are [map] and [filter].
- **Actions**. Actions do not return RDDs. Instead, an action sets in motion the sequence of transformations required to produce its result, and once the computation is done you get that result as output. Some common actions are [collect], [count], [reduce], and [take].

[map]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html?highlight=map
[filter]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.filter.html?highlight=filter
[reduce]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html?highlight=reduce
[collect]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collect.html?highlight=collect
[count]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html?highlight=count
[take]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.take.html?highlight=take
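
The difference between lazy transformations and actions can be mimicked without Spark. The class below is a toy stand-in written for this tutorial, not the way Spark is implemented: `LazyDataset` and `double` are hypothetical names, and the whole thing runs on a single machine.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of ("map" or "filter", function) pairs

    # Transformations: return a new dataset, compute nothing.
    def map(self, f):
        return LazyDataset(self._source, self._steps + (("map", f),))

    def filter(self, f):
        return LazyDataset(self._source, self._steps + (("filter", f),))

    def _run(self):
        items = iter(self._source)
        for kind, f in self._steps:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    # Actions: trigger the recorded transformations and return a plain result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

calls = []

def double(x):
    calls.append(x)  # record that the function really ran
    return 2 * x

lazy = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
print(len(calls))      # nothing has run yet
print(lazy.collect())  # the action triggers the computation
print(len(calls))      # now double() has been called once per element
```

Note that `map` and `filter` return a new `LazyDataset` rather than modifying the existing one, which mirrors the immutability of RDDs.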


### The map() Transformation
```map(f)``` is the most common Spark transformation: it applies a function ```f``` to each item in the dataset and outputs the resulting dataset. When you run [map] on a dataset, a single stage of tasks is launched. A stage is a group of tasks that all perform the same computation, but on different input data. One task is launched for each partition, as shown in the figure below. A task is a unit of execution that runs on a single machine. When we run ```map(f)```, each task applies ```f``` to all of the entries in one particular partition and outputs a new partition. In the example figure, the dataset is broken into four partitions (using three workers), so four ```map()``` tasks are launched.


<img src="http://spark-mooc.github.io/web-assets/images/tasks.png" alt="drawing" width="600"/><img src="http://spark-mooc.github.io/web-assets/images/map.png" alt="drawing" width="600"/>
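
The "one task per partition" idea can be sketched in a few lines of plain Python. This is a simplified model that runs the tasks one after another on a single machine. `map_task` and `map_stage` are hypothetical names used only here.

```python
def map_task(f, partition):
    """One task: apply f to every entry of a single partition."""
    return [f(entry) for entry in partition]

def map_stage(f, partitions):
    """One stage: the same task, launched once per partition."""
    return [map_task(f, partition) for partition in partitions]

partitions = [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
mapped = map_stage(lambda value: value - 1, partitions)
print(mapped)
```

Four input partitions produce four output partitions. No task needs to look at another task's data, which is what makes `map()` easy to parallelize.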




[map]: https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.RDD.map.html?highlight=map

In [9]:
# Create sub function to subtract 1
def sub(value): return (value - 1)

# let's apply this function three times.
rdd2 = rdd.map(sub).map(sub).map(sub)

We have applied ```sub()``` to ```rdd``` three times in a row, so each element of ```rdd``` will be decremented by ```1``` three times. However, no computation has started yet. As mentioned earlier, Spark is lazily evaluated: the computation actually starts only when we ask for a result. Let's look at one of the operations that force the computation to start, the [collect] action.

You should feel a little bit of fear each time you call ```.collect()```, as it brings all the data you requested into the memory of your machine. What if you requested several GBs of data by mistake? Your machine may crash as its memory fills up, and you may lose several hours' worth of work.

[collect]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collect.html

In [10]:
print(rdd2.collect())

[-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96]


### The filter() Transformation
The [filter] transformation is used, unsurprisingly, to filter the elements of an RDD. It works similarly to the ```map()``` transformation: it applies a function ```f``` to every element of an RDD, but it keeps only the elements for which ```f``` returns ```True```. For example, suppose that ```f``` returns ```True``` if its input is odd and ```False``` otherwise, and that you have an RDD containing a list of numbers. If you filter the RDD with ```f```, you obtain another RDD containing only the odd numbers. Let's try it out.

[filter]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.filter.html?highlight=filter


In [11]:
# Keep only odd numbers

def isOdd(x): return x % 2 == 1

result = rdd.filter(isOdd).collect()
print(result)

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99]


### The reduce() function
Let us first look at the reduce function outside the Spark framework. The [reduce] function is a little harder to understand. ```reduce(f)``` takes, again, a function, which we will call ```f```. This time, ```f(*,*)``` takes two arguments: the first one we will call the **accumulator**, and the second one the **current value**. For now, forget that data are stored in RDDs. Suppose that you have a list of 100 numbers [1,2,3,...,100]. ```f``` is called once for every element of the list after the first:
1) the first time ```f``` is called, the accumulator takes the first value of the list (```1```, in our example). Meanwhile the second argument of ```f``` is the second number in the list (```2```, in our example).
2) the second time ```f``` is called, the accumulator takes the value of the output of 1). Meanwhile the second argument of ```f``` is the third number in the list (```3```, in our example).
3) the third time ```f``` is called, the accumulator takes the value of the output of 2). Meanwhile the second argument of ```f``` is the fourth number in the list (```4```, in our example).
4) and so on ...
99) the 99th time ```f``` is called, the accumulator takes the value of the output of 98). Meanwhile the second argument of ```f``` is the 100th number in the list (```100```, in our example).

In practice, the [reduce] function applies ```f``` across the whole list while carrying the accumulated result from one call to the next.

[reduce]:https://docs.python.org/3/library/functools.html
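
The sequence of calls described above can be made visible by recording the arguments of each call. The sketch below uses a shorter list to keep the trace readable. `traced_sum` and `trace` are names introduced only for this illustration.

```python
from functools import reduce

trace = []

def traced_sum(acc, x):
    """Add x to the accumulator and record the call for inspection."""
    trace.append((acc, x))
    return acc + x

total = reduce(traced_sum, [1, 2, 3, 4, 5])

for call_number, (acc, x) in enumerate(trace, start=1):
    print(f"call {call_number}: accumulator={acc}, current value={x}")
print(total)
```

A list of five elements leads to four calls. In each call, the accumulator is the value returned by the previous call.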

In [12]:
from functools import reduce ### not using spark

def sumAll(acc, x): return acc + x

result = reduce(sumAll, range(100))
print(result)

4950


In [None]:
# Now let us accumulate only odd numbers. 
# You do not need to understand this function too deeply.
# Just keep in mind that the reduce function is a lot more flexible than it appears.
def AccumulateOdds(acc, x):
    if type(acc) != list: return [e for e in (acc,x) if isOdd(e)]
    else: return acc + [x] if isOdd(x) else acc
                
print(reduce(AccumulateOdds,list(range(100))))

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99]


### The reduce() action
The Spark [reduce] action differs a bit from the functools ```reduce()```, but the main concepts are still valid. Again, the ```reduce(*)``` action takes a function that is applied to every element in the RDD. This function, call it ```f(*,*)```, takes two arguments. The first one accumulates the results and is fed back to successive ```f(*,*)``` calls. However, because each partition is reduced separately and the partial results are then combined, the second argument can be an accumulated result too. With a simple reducer such as our ```sumAll(*,*)```, this does not make any difference. With reducers such as ```AccumulateOdds(*,*)```, it makes a lot of difference.


[reduce]:https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html?highlight=reduce
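
Why the second argument can itself be an accumulated value becomes clearer with a small model of a partitioned reduce. The sketch below is plain Python and only approximates the idea: the helper `reduce_in_partitions` is hypothetical, and the order in which Spark combines partial results is not modeled. The Spark documentation asks for a commutative and associative function, and the subtraction example shows what can go wrong otherwise.

```python
from functools import reduce

def reduce_in_partitions(f, partitions):
    """Reduce each non-empty partition on its own, then reduce the partial results."""
    partials = [reduce(f, partition) for partition in partitions if partition]
    return reduce(f, partials)

def add(acc, x): return acc + x
def subtract(acc, x): return acc - x

data = list(range(1, 9))
partitions = [data[0:4], data[4:8]]

# Addition is associative and commutative: both strategies agree.
print(reduce(add, data), reduce_in_partitions(add, partitions))

# Subtraction is neither: the partitioned result differs from the sequential one.
print(reduce(subtract, data), reduce_in_partitions(subtract, partitions))
```

In the second stage, `f` receives two partial results rather than an accumulator and a single element, which is exactly the case a reducer written for Spark has to handle.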

In [None]:
# Again, sum all the elements
result = rdd.reduce(sumAll)
print(result)

4950


In [None]:
# Now let us accumulate only odd numbers. 
# You do not need to understand this function too deeply.
# Just keep in mind that the reduce function is a lot more flexible than it appears.

def AccumulateOddsSpark(acc, x):
    # acc and x are each either a single number or a list of odd numbers gathered so far
    if type(acc) != list: acc = [acc] if isOdd(acc) else []
    if type(x) != list: x = [x] if isOdd(x) else []
    return acc + x

result = rdd.reduce(AccumulateOddsSpark)
print(result)

# A little side note. There is a drastic difference between using .filter(isOdd).collect() and using this reducer.
# In the first case, Spark is responsible for gathering all the filtered results.
# In this case, instead, we are gathering the results ourselves.
# Of course, this can lead to inefficiencies and it is quite error prone.
# Again, this is to show the flexibility of the reduce function.
# If you can obtain your result using map and reduce, just use them.

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99]


### The count() Action

One of the most basic actions that we can run is the [count()] method which will count the number of elements in an RDD.

Each task counts the entries in its partition and sends the result to your SparkContext, which adds up all of the counts. The figure below shows what would happen if we ran `count()` on a small example dataset with just four partitions.




[count()]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html?highlight=count
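
The counting scheme described above (each task counts its own partition, then the counts are added up) can be sketched as follows. This is a single-machine illustration with hypothetical helper names `count_task` and `count_all`, not Spark's implementation.

```python
def count_task(partition):
    """One task: count the entries of a single partition."""
    return sum(1 for _ in partition)

def count_all(partitions):
    """Driver side: collect one count per partition and add them up."""
    per_partition = [count_task(partition) for partition in partitions]
    return per_partition, sum(per_partition)

per_partition, total = count_all([[1, 3], [5, 7, 9], [], [11]])
print(per_partition)  # one small number per partition
print(total)          # the overall count
```

Only one integer per partition has to reach the driver, so this stays cheap even when the partitions themselves are large.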

In [22]:
# Count the odd numbers in our RDD
result = rdd.filter(isOdd).count()
print(result)

50
