## **CHAPTER ONE: BASICS**

### What is Apache Spark?


Apache Spark is an open source project usually placed within the **Hadoop ecosystem** (although perhaps
it should not be).

It is a processing engine (not a storage system) that builds on the MapReduce model, with the
following features:

- Runs tasks in memory as well as on disk (instead of only on disk).
- Developed in Scala, with “complete” interfaces for Java, Python and R.
- Integrates multiple forms of work: batch, interactive, streaming ...
- Supports multiple architectures: standalone, execution on YARN ...


It is regarded as the successor to “Hadoop MapReduce”, primarily because of its:
- Speed: thanks to in-memory execution and task optimization.
- Ease of use: functions and programming models beyond MapReduce.
- Flexibility: multiple languages, multiple architectures, multiple ways of working.
- Homogeneity: a single framework for everything instead of an ecosystem.

****

## Spark installation modes

* Spark currently offers three different installation modes: standalone, cluster over
Mesos, cluster on YARN.

* Standalone:
    
  Installation mode with the lowest “configuration” needs (if you are not going to work in
  a cluster).

  > Ideal for development or work environments with “intermediate” volumes of information.
  > Oriented toward working with "local" files.

* Cluster over Mesos / YARN:

  > Installation mode with the greatest need for “configuration” (the cluster configuration
  > must be specified "manually").

  > Ideal (and more than recommended) in production environments.
  > Oriented toward working with "distributed" files.
  
## TYPES OF OPERATIONS
 
* `map` is an example of a transformation

* `reduce` is an example of an action
 
**Transformations** consisting of **narrow dependencies** (we’ll call them narrow transformations) are those where each input partition contributes to only one output partition.

A **wide dependency** (or wide transformation) has input partitions contributing to many
output partitions. You will often hear this referred to as a shuffle, in which Spark exchanges partitions across the
cluster. With narrow transformations, Spark automatically performs an operation called pipelining: if we specify
multiple filters on DataFrames, they are all performed in memory. The same cannot be said for shuffles. When we
perform a shuffle, Spark writes the results to disk. You’ll see lots of talk about shuffle optimization across the web
because it’s an important topic, but for now all you need to understand is that there are two kinds of
transformations.
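The difference between the two kinds of transformations can be sketched without Spark at all. The block below is a pure-Python simulation, not Spark code: the names `partitions`, `narrow_pipeline` and `shuffle_by_key` are illustrative assumptions, and the key-routing rule (based on `ord`) only works for the single-character keys used here.

```python
from collections import defaultdict

# Hypothetical data already split into two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]

def narrow_pipeline(partition):
    """Two filters and a map, fused into one pass over a single partition."""
    return [(k, v * 10) for k, v in partition if v > 1 if k != "c"]

# Narrow: each output partition is built from exactly one input partition.
narrow_out = [narrow_pipeline(p) for p in partitions]

def shuffle_by_key(parts, num_out):
    """Wide: route every record to an output partition chosen by its key."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            # Assumes single-character keys; ord() keeps the routing deterministic.
            out[ord(k) % num_out][k].append(v)
    return [dict(d) for d in out]

# Records with the same key end up together, whatever partition they came from.
wide_out = shuffle_by_key(narrow_out, 2)
print(narrow_out)
print(wide_out)
```

In the narrow step no data moves between partitions, so the filters and the map can run back to back on each partition. In the wide step every input partition may feed every output partition, which is the exchange that the text calls a shuffle.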

## **Spark UI**
While Spark executes a job, users can monitor its progress through the Spark UI.
By default, the Spark UI is available on port 4040 of the driver node. If you are running in local mode, this is simply
http://localhost:4040.
 

> https://github.com/XD-DENG/Spark-practice


### **Now, we can start with some code:**

In [1]:

    from pyspark import SparkContext
    sc = SparkContext('local', 'pyspark tutorial')

**SparkContext parameters**

* the master (first argument) can be `local[*]`, a `spark://` URL, `yarn`, etc. What is available to you depends on how Spark has been deployed on the machine you use.

* the second argument is the application name and is a human readable string you choose.

In [2]:

    sc.stop()
    # To use a maximum of 2 tasks in parallel:
    from pyspark import SparkContext
    sc = SparkContext('local[2]', 'pyspark tutorial')

If you wish to use all the available cores, you can simply use ‘*’, i.e.

In [3]:

    sc.stop()
    from pyspark import SparkContext
    sc = SparkContext('local[*]', 'pyspark tutorial')

## Deployment of Spark:
It can be deployed on:

* a single machine such as your laptop (local)
* a set of pre-defined machines (stand-alone)
* a dedicated Hadoop-aware scheduler (YARN/Mesos)
* “cloud”, e.g. Amazon EC2

The development workflow is that you start small (local) and scale up to one of the other solutions, depending on your needs and resources.

At UIO, we have the Abel cluster where Spark is available.

Often, you don’t need to change any code to go between these methods of deployment!

## Spark Components

![Spark components](fig/components.png "Spark components")

* Spark Core: the core of Spark, the distributed in-memory execution engine
* Spark SQL: operations with SQL
* Spark Streaming: real-time processing
* MLlib: machine learning
* GraphX: graph processing

## **OPERATIONS**

* map
* reduce
* join
* filter
* sample

## **map/reduce**

Let’s start. In this exercise, the goal is to convert temperatures from Celsius to Kelvin.

Here is how it translates to PySpark.

In [5]:

    temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
    rdd_temp_c = sc.parallelize(temp_c)
    # collect() brings the results back to the driver as a plain Python list
    temp_K = rdd_temp_c.map(lambda x: x + 273.15).collect()
    print(temp_K)

[283.15, 276.15, 268.15, 298.15, 274.15, 282.15, 302.15, 263.15, 278.15]


## Remark:

It is often a very bad idea to pull all the elements of an RDD to the driver with `collect`, because we potentially handle very large amounts of data. Instead, we prefer `take`, which lets you specify how many elements you wish to pull from the RDD.

For instance, to pull the first 3 elements only:

In [6]:

    temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
    rdd_temp_c = sc.parallelize(temp_c)
    # take(3) returns only the first 3 results as a plain Python list
    temp_K = rdd_temp_c.map(lambda x: x + 273.15).take(3)
    print(temp_K)

[283.15, 276.15, 268.15]


## Challenge
Now let’s take another example. Recall the two types of operations: **map is a transformation** and **reduce is an action**. Here we apply reduce directly to an RDD.


In [7]:

    # We define a list of integers
    numbers = [1, 4, 6, 2, 9, 10]

    rdd_numbers = sc.parallelize(numbers)

    # Use reduce to combine numbers
    rdd_reduce = rdd_numbers.reduce(lambda x, y: "(" + str(x) + ", " + str(y) + ")")
    print(rdd_reduce)

(((1, (4, 6)), 2), (9, 10))
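The nesting in the output above is not a plain left-to-right fold. Spark reduces each partition separately and then merges the partial results, so the grouping depends on how the data happens to be split. The sketch below imitates that behaviour in pure Python. It is a simplified simulation under stated assumptions, not Spark's actual scheduling: `partitioned_reduce` and the two splits are hypothetical, and it does not try to reproduce the exact output shown above.

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

def combine(x, y):
    return "(" + str(x) + ", " + str(y) + ")"

# Two hypothetical ways the same six numbers could be split across partitions.
split_a = [[1, 4, 6], [2, 9, 10]]
split_b = [[1, 4], [6, 2], [9, 10]]

# String building is not associative, so the grouping depends on the split.
nested_a = partitioned_reduce(combine, split_a)
nested_b = partitioned_reduce(combine, split_b)
print(nested_a)
print(nested_b)

# Addition is associative and commutative, so every split gives the same sum.
add = lambda x, y: x + y
sum_a = partitioned_reduce(add, split_a)
sum_b = partitioned_reduce(add, split_b)
print(sum_a, sum_b)
```

This is why the function passed to an RDD's `reduce` should be associative and commutative: only then is the result independent of the partitioning.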


## Lambda functions

In the previous examples we used a construct called lambda.

Python supports the creation of anonymous functions (i.e. functions defined without a name), using a construct called “lambda”.

The general structure of a lambda function is:



`lambda <args>: <expr>`

Example of a normal function:

In [1]:

    def f(x):
        return x**2

    # For instance, to use this function:
    print(f(2))

4


In [2]:

    # The same function can be written as a lambda function:
    g = lambda x: x**2

    # And you call it:
    print(g(2))

4


As you can see, both functions do exactly the same thing and can be used in the same way.

Note that the lambda definition does not include a “return” statement: it always contains a single expression, whose value is returned.
Also note that you can put a lambda definition anywhere a function is expected, and you don’t have to assign it to a variable at all.
Lambda functions come from functional programming languages and the lambda calculus. Because a Python lambda is limited to a single expression, it is usually written on a single line.
This is not exactly the same as lambda in functional programming languages, but it is a very powerful concept that’s well integrated into Python.
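To illustrate that a lambda can be used anywhere a function is expected without ever being given a name, here is a minimal sketch. The helper `apply_twice` is a hypothetical function introduced only for this illustration.

```python
def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))

# The lambda is passed directly as an argument; it is never assigned to a name.
result = apply_twice(lambda x: x**2, 3)  # computes (3**2)**2
print(result)

# A lambda can also be called immediately after it is defined.
immediate = (lambda x, y: x + y)(2, 5)
print(immediate)
```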
 

## Conditional expression in Lambda functions
You can use a conditional expression in a lambda function and/or give it more than one input argument.

For example:



In [19]:

    f = lambda x, y: ["PASS", x, y] if x > 3 and y < 100 else ["FAIL", x, y]
    print(f(4, 50))


['PASS', 4, 50]


## Map
Map takes a function f and an array as input parameters and outputs an array where f is applied to every element.
In this respect, **using map is equivalent to writing a for loop**.

For instance, to convert a list of temperatures in Celsius to a list of temperatures in Kelvin:

 

In [21]:

    temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
    temp_K = list(map(lambda x: x + 273.15, temp_c))
    print(temp_K)

[283.15, 276.15, 268.15, 298.15, 274.15, 282.15, 302.15, 263.15, 278.15]


map() is a function with two arguments:

`r = map(func, seq)`

The first argument func is a function (given by name or as a lambda) and the second, seq, is a sequence (e.g. a list).
map() applies func to all the elements of seq. In Python 3 it returns a lazy iterator over the results, which is why the example above wraps it in list() to obtain a new list with the elements changed by func.
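The equivalence between map and a for loop can be checked directly. The short sketch below uses a shortened temperature list chosen for illustration. It also shows one practical difference in Python 3: the object returned by map() is an iterator that can only be consumed once.

```python
temp_c = [10, 3, -5]

# Explicit for loop.
loop_result = []
for t in temp_c:
    loop_result.append(t + 273.15)

# Same computation with map; list() consumes the lazy iterator.
mapped = map(lambda t: t + 273.15, temp_c)
map_result = list(mapped)

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(mapped)

print(loop_result == map_result)
print(second_pass)
```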

## Challenge 1

Start by defining a variable pairs:

    pairs = [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]

Write a **lambda function** and use it to sort pairs by their names (the second element of each pair).
You will be using the list.sort() method. It modifies the list in place (here, pairs) and has a key parameter that specifies a function to be called on each list element prior to making comparisons. The value of the key parameter is a function that takes a single argument and returns a key to use for sorting purposes. Define this function as a lambda function.

In [5]:

    pairs = [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
    type(pairs)

list

In [6]:

    pairs.sort(key=lambda pair: pair[1])
    print(pairs)

[(4, 'four'), (1, 'one'), (3, 'three'), (2, 'two')]


## Challenge 2


Let’s define a list of words:

    list_words = ["big", "small", "able", "about", "hairdresser", "laboratory"]

Use the **map function** to print the number of characters in each word:

In [37]:

    list_words = ["big", "small", "able", "about", "hairdresser", "laboratory"]
    print(list(map(len, list_words)))  # simpler than it seems

[3, 5, 4, 5, 11, 10]


## Filter
As the name suggests, **filter** can be used to filter your data.
It tests each element of your input data and returns the subset of elements for which the condition given by a function
is true. It does not modify your input data.


In [41]:

    numbers = range(-15, 15)
    less_than_zero = list(filter(lambda x: x < 0, numbers))
    print(less_than_zero)

[-15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
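As a small check of the claim that filter leaves its input untouched, the sketch below filters a short list (values chosen for illustration) and compares the result with the equivalent list comprehension.

```python
numbers = [3, -2, 7, -8, 0]

# filter keeps the elements for which the lambda returns True.
negatives = list(filter(lambda x: x < 0, numbers))

# Equivalent list comprehension.
negatives_lc = [x for x in numbers if x < 0]

print(negatives)
print(numbers)  # the input list still holds all five values
```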


## Challenge 3

Reuse numbers and extract all the odd numbers:

numbers = range(-15, 15)

In [43]:

    numbers = range(-15, 15)
    numbers

range(-15, 15)

In [48]:

    list(filter(lambda x: x % 2 == 1, numbers))

[-15, -13, -11, -9, -7, -5, -3, -1, 1, 3, 5, 7, 9, 11, 13]

## Reduce
Reduce takes a function f and an array as input. The function f takes two
input parameters. Reduce applies f cumulatively from left to right: it first combines the first two
elements, then combines that result with the third element, and so on, until a single value remains. Let’s take an example:



In [50]:

    from functools import reduce

    # We define a list of integers
    numbers = [1, 4, 6, 2, 9, 10]

    # Define a new function combine:
    # convert x and y to strings and build the string "(x, y)"
    def combine(x, y):
        return "(" + str(x) + ", " + str(y) + ")"

    # Use reduce to apply combine to numbers
    print(numbers)
    reduce(combine, numbers)

To use the Python reduce function, you need to import it from functools.
To define combine, we haven’t used a lambda function. With a lambda function, we would have:

    from functools import reduce

    # We define a list of integers
    numbers = [1, 4, 6, 2, 9, 10]

    # Use reduce to combine numbers
    print(numbers)
    reduce(lambda x, y: "(" + str(x) + ", " + str(y) + ")", numbers)

[1, 4, 6, 2, 9, 10]
[1, 4, 6, 2, 9, 10]


'(((((1, 4), 6), 2), 9), 10)'
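To see how reduce arrives at this nested result, it can help to print every intermediate value. One way to do so, sketched here, is `itertools.accumulate`, which applies the same two-argument function from left to right but keeps each partial result.

```python
from functools import reduce
from itertools import accumulate

def combine(x, y):
    return "(" + str(x) + ", " + str(y) + ")"

numbers = [1, 4, 6, 2, 9, 10]

# accumulate yields the first element, then each partial result in turn.
steps = list(accumulate(numbers, combine))
for step in steps:
    print(step)

# The last partial result is exactly what reduce returns.
final = reduce(combine, numbers)
print(final == steps[-1])
```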

## Challenge 4

Let’s define a string variable sentence:

    sentence = "Dis-moi ce que tu manges, je te dirai ce que tu es."

Compute the number of words in sentence.

## Solution to Challenge 4
* First we remove punctuation from the sentence.
* Then we split the resulting sentence and map 1 to each word of the sentence.
* The last step is to sum the 1s to find the number of words in the sentence:


In [54]:

    import string
    from functools import reduce

    sentence = "Dis-moi ce que tu manges, je te dirai ce que tu es."

    # Remove every punctuation character
    no_punctuation = sentence.translate(str.maketrans("", "", string.punctuation))

    # Map each word to 1, then sum the 1s
    reduce(lambda x, y: x + y, map(lambda x: 1, no_punctuation.split()))


12
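The same steps can be wrapped in a function so that they are easy to reuse on a longer text. This is a minimal sketch: the name `count_words` and the initial value `0` passed to reduce are additions made here, not part of the solution above. Note that stripping punctuation turns "Dis-moi" into the single word "Dismoi", which is why the sentence counts 12 words.

```python
import string
from functools import reduce

def count_words(text):
    """Count words map/reduce style: strip punctuation, map each word to 1, sum."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # The initial value 0 makes reduce safe for text that contains no words.
    return reduce(lambda x, y: x + y, map(lambda word: 1, cleaned.split()), 0)

sentence = "Dis-moi ce que tu manges, je te dirai ce que tu es."
total = count_words(sentence)
print(total)
```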


Apply it to an entire text to compute its total number of words. You can upload a text yourself to your Galaxy
history, or use a pre-loaded book from the Data Libraries available on the UIO Galaxy eduPortal
(Share data -> Data Libraries).

## **Spark SQL**

* Spark SQL is a component on top of Spark Core that facilitates the processing of structured and semi-structured
data and the integration of several data formats as sources (Hive, Parquet, JSON).

* It allows you to transform RDDs using SQL (Structured Query Language).

* To start Spark SQL within your notebook, you need to create a SQL context. (In Spark 2.0 and later, `SparkSession` is the recommended entry point; `SQLContext` is kept for backward compatibility.)

* For this exercise, import a JSON file in a new history “World Cup”. You can find the historical World Cup player dataset in JSON format in our Data Library named “Historical world cup player data”.

* Then create a new Python 3 (change the kernel if it is set to Python 2 by default) Jupyter notebook from this file:



In [58]:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc.stop()
    sc = SparkContext('local', 'Spark SQL')
    sqlc = SQLContext(sc)