# Spark Tutorial
Links:
* Blog post: https://towardsdatascience.com/the-hitchhikers-guide-to-handle-big-data-using-spark-90b9be0fe89a
* GitHub: https://github.com/MLWhiz/data_science_blogs/tree/master/spark_post

In [1]:
sc

In [2]:
spark

## Functional Programming in Python
Look for this book on Amazon:
* *Functional Python Programming: Discover the power of functional programming, generator functions, lazy evaluation, the built-in itertools library, and monads, 2nd Edition*

### Map
Say you want to apply some function to every element in a list.

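You could do this with a simple for loop, but Python's `map` combined with a lambda function lets you do it in a single line. In Python 3, `map` returns a lazy iterator, which is why the result is wrapped in `list` before printing.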

In [7]:
my_list = [1,2,3,4,5,6,7,8,9,10]

# Let's say I want to square each term in my_list.
squared_list = map(lambda x:x**2,my_list)

print(list(squared_list))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


### Filter
Another function that is used extensively is `filter`. It takes two arguments: a condition (a function that returns `True` or `False`) and the list to filter.

Use it whenever you want to keep only the elements of a list that satisfy some condition.

In [8]:
my_list = [1,2,3,4,5,6,7,8,9,10]

# Let's say I want only the even numbers in my list.
filtered_list = filter(lambda x:x%2==0,my_list)

print(list(filtered_list))

[2, 4, 6, 8, 10]


### Reduce
`reduce` takes two arguments:
* a reducing function that itself takes two arguments, and
* a list over which that function is applied.

In [10]:
# In Python 3, reduce lives in functools and must be imported
import functools as f
my_list = [1,2,3,4,5]

# Let's say I want to sum all elements in my list.
sum_list = f.reduce(lambda x,y:x+y,my_list)

print(sum_list)

15


When the reduction is run in parallel, as Spark does, the function passed to reduce must be:
* **commutative**, that is, a + b == b + a, and
* **associative**, that is, (a + b) + c == a + (b + c).

In the example above we used addition, which is both commutative and associative. Other functions we could have used include max, min, and multiplication (*).
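To see why these properties matter once the data is split into pieces, here is a minimal pure-Python sketch. The helper `partitioned_reduce` and its chunking scheme are assumptions made for illustration only; it mimics the idea of reducing each partition separately and then combining the partial results, and is not how Spark is implemented.

```python
import functools


def partitioned_reduce(func, data, num_partitions):
    """Hypothetical helper: reduce each chunk, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [functools.reduce(func, chunk) for chunk in chunks]
    return functools.reduce(func, partials)


my_list = [1, 2, 3, 4, 5]

# Addition is commutative and associative: partitioning does not change the result.
sequential_sum = functools.reduce(lambda x, y: x + y, my_list)
partitioned_sum = partitioned_reduce(lambda x, y: x + y, my_list, 2)

# Subtraction is neither: the result depends on how the data is split.
sequential_diff = functools.reduce(lambda x, y: x - y, my_list)
partitioned_diff = partitioned_reduce(lambda x, y: x - y, my_list, 2)

print(sequential_sum, partitioned_sum)
print(sequential_diff, partitioned_diff)
```

With addition both results agree, while with subtraction the sequential and partitioned results differ, which is the kind of silent error a non-associative function would cause in a distributed reduce.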

## Back to Spark
A Spark application consists of two kinds of processes: a *driver* and *workers*.

The workers do the actual computation, and the driver coordinates them and tells them what to do.

### RDD
An RDD (Resilient Distributed Dataset) is a parallelized data structure that is distributed across the worker nodes.

**They are the basic units of Spark programming.**

For example, consider this line: `lines = sc.textFile("/FileStore/tables/shakespeare.txt")`

Here we take a text file and distribute it across the worker nodes so that they can work on it in parallel.

We can also parallelize a list with `sc.parallelize`; the optional second argument sets the number of partitions.

In [11]:
data = [1,2,3,4,5,6,7,8,9,10]
new_rdd = sc.parallelize(data,4)
new_rdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195

In Spark, we can perform two different types of operations on RDDs: Transformations and Actions.

* **Transformations**: Create new datasets from existing RDDs
* **Actions**: Mechanism to get results out of Spark
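A rough local analogy, offered as a sketch rather than a description of Spark internals: Python 3's built-in `map` is lazy, so nothing is computed until something consumes the result. Spark transformations behave similarly in that they are only recorded, and the work happens when an action is called. The `calls` list and the `square` function below are hypothetical devices to make the evaluation order visible.

```python
calls = []  # records when the function actually runs


def square(x):
    calls.append(x)
    return x ** 2


data = [1, 2, 3, 4, 5]

# Like a transformation: this only builds a lazy iterator, nothing runs yet.
squared = map(square, data)
calls_before = len(calls)

# Like an action: consuming the iterator triggers the computation.
result = list(squared)
calls_after = len(calls)

print(calls_before, calls_after, result)
```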

## Transformation Basics
Say you have your data in the form of an RDD and now want to transform it.

You may want to filter it, apply some function to it, and so on.

In Spark, this is done with transformation functions.

Spark provides many transformation functions. You can see a comprehensive list [here](http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations). 

Some of the main ones are:

### Map
Applies a given function to each element of an RDD.

Note that the syntax is slightly different from Python's built-in `map`, but it essentially does the same thing.

In [20]:
# List
data = [1,2,3,4,5,6,7,8,9,10]
# List to RDD
rdd = sc.parallelize(data,4)
# Apply map
squared_rdd = rdd.map(lambda x:x**2)
# Show it
squared_rdd.take(30)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

### Filter
Again, no surprises here: `filter` takes a condition as input and keeps only the elements that satisfy it.

In [21]:
data = [1,2,3,4,5,6,7,8,9,10]
rdd = sc.parallelize(data,4)
filtered_rdd = rdd.filter(lambda x:x%2==0)
filtered_rdd.take(30)

[2, 4, 6, 8, 10]

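### flatMap
Similar to `map`, but each input element can produce zero or more output elements, and the results are flattened into a single RDD. Compare the two cells below: the first uses `map`, the second uses `flatMap`.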

In [28]:
data = [2,3,4,5]
rdd = sc.parallelize(data,4)
# With map, the output is a list of lists [number, number^3] (tuples would work too)
flat_rdd = rdd.map(lambda x:[x,x**3])
flat_rdd.take(10)

[[2, 8], [3, 27], [4, 64], [5, 125]]

In [25]:
data = [2,3,4,5]
rdd = sc.parallelize(data,4)
# With flatMap, the output is one flat list of consecutive values: number, number^3
flat_rdd = rdd.flatMap(lambda x:[x,x**3])
flat_rdd.take(10)

[2, 8, 3, 27, 4, 64, 5, 125]
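The difference between the two cells above can be reproduced locally without Spark. The sketch below is only an analogy in plain Python; `expand` is a hypothetical name for the lambda used above, and `itertools.chain.from_iterable` stands in for the flattening step.

```python
from itertools import chain

data = [2, 3, 4, 5]


def expand(x):
    return [x, x ** 3]


# Local analogue of rdd.map(expand): one output list per input element.
mapped = [expand(x) for x in data]

# Local analogue of rdd.flatMap(expand): the per-element lists are chained together.
flat_mapped = list(chain.from_iterable(expand(x) for x in data))

# The function may also return an empty list, which drops the element entirely.
only_even = list(chain.from_iterable([x] if x % 2 == 0 else [] for x in data))

print(mapped)
print(flat_mapped)
print(only_even)
```

The last line illustrates the "zero or more outputs per input" behaviour: a function that returns an empty list contributes nothing to the flattened result.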

### ReduceByKey
(Note: with RDDs, plain *reduce* is an action, not a transformation.)

Let's assume we have data in which each record holds a product, its category, and its selling price. We can still parallelize the data.

In [29]:
data = [('Apple','Fruit',200),('Banana','Fruit',24),('Tomato','Fruit',56),('Potato','Vegetable',103),
        ('Carrot','Vegetable',34)]
rdd = sc.parallelize(data,4)

Right now our RDD `rdd` holds tuples.

Now we want to find the total revenue for each category.

To do that, we have to transform `rdd` into a pair RDD, so that it contains only key-value pairs (tuples).

In [30]:
category_price_rdd = rdd.map(lambda x: (x[1],x[2]))
category_price_rdd.take(30)

[('Fruit', 200),
 ('Fruit', 24),
 ('Fruit', 56),
 ('Vegetable', 103),
 ('Vegetable', 34)]

So now `category_price_rdd` contains the product category and the price at which the product sold.

Next we want to reduce on the key (the category) and sum the prices. We can do this as follows:

In [31]:
category_total_price_rdd = category_price_rdd.reduceByKey(lambda x,y:x+y)
category_total_price_rdd.take(30)

[('Fruit', 280), ('Vegetable', 137)]
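What `reduceByKey` computes can be sketched locally with a dictionary. The function `reduce_by_key` below is a hypothetical stand-in written for this illustration, not a Spark API; it merges the values of each key pairwise with the supplied function, and sorts the result only to make the output order deterministic.

```python
def reduce_by_key(func, pairs):
    """Hypothetical local stand-in for reduceByKey on a list of (key, value) pairs."""
    totals = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; later values are merged with func.
        totals[key] = func(totals[key], value) if key in totals else value
    return sorted(totals.items())


category_price = [('Fruit', 200), ('Fruit', 24), ('Fruit', 56),
                  ('Vegetable', 103), ('Vegetable', 34)]

# Sum of prices per category, as in the cell above.
category_totals = reduce_by_key(lambda x, y: x + y, category_price)

# Any commutative and associative function works, for example max.
category_max = reduce_by_key(max, category_price)

print(category_totals)
print(category_max)
```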

### GroupByKey
Similar to `reduceByKey`, but instead of reducing the values it simply puts all the values for each key into an iterable.

For example, if we wanted the category as the key and all of its products as the value, we would use this function.

Let us again use `map` to get the data into the required form.

In [33]:
data = [('Apple','Fruit',200),('Banana','Fruit',24),('Tomato','Fruit',56),('Potato','Vegetable',103),
        ('Carrot','Vegetable',34)]
rdd = sc.parallelize(data,4)
category_product_rdd = rdd.map(lambda x: (x[1],x[0]))
category_product_rdd.take(30)

[('Fruit', 'Apple'),
 ('Fruit', 'Banana'),
 ('Fruit', 'Tomato'),
 ('Vegetable', 'Potato'),
 ('Vegetable', 'Carrot')]

In [35]:
grouped_products_by_category_rdd = category_product_rdd.groupByKey()

findata = grouped_products_by_category_rdd.take(30)

for data in findata:
    print(data[0],list(data[1]))

Fruit ['Apple', 'Banana', 'Tomato']
Vegetable ['Potato', 'Carrot']
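As a local sketch of the grouping step, the snippet below collects values per key with a `defaultdict`. The helper `group_by_key` is hypothetical and only imitates the result of `groupByKey`. It also shows that grouping the prices and then summing each group gives the same totals as reducing by key; in Spark itself, `reduceByKey` is generally the better choice for such aggregations because it can combine values within each partition before data is moved between workers.

```python
from collections import defaultdict

data = [('Apple', 'Fruit', 200), ('Banana', 'Fruit', 24), ('Tomato', 'Fruit', 56),
        ('Potato', 'Vegetable', 103), ('Carrot', 'Vegetable', 34)]


def group_by_key(pairs):
    """Hypothetical local stand-in for groupByKey: collect all values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Category -> list of products, as in the cell above.
products_by_category = group_by_key((category, product) for product, category, _ in data)

# Category -> list of prices, then an explicit sum per group.
prices_by_category = group_by_key((category, price) for _, category, price in data)
totals = {category: sum(prices) for category, prices in prices_by_category.items()}

print(products_by_category)
print(totals)
```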


## Action Basics

You have filtered your data, mapped some functions over it, and finished your computation.

Now you want to get the data onto your local machine, save it to a file, or show the results as graphs in Excel or some other visualization tool. **You will need actions for that**.

A comprehensive list of actions is provided [here](http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions).

Some of the most common actions are:

### collect
(Note: this did not work locally, but it did work in the toolbox.)
It takes the *whole RDD* and brings it back to the driver program, so the result must fit in the driver's memory.

### take
(Like collect, but it only takes the first n elements.)
Sometimes you need to see what your RDD contains without bringing all of its elements into memory. `take` returns a list with the first n elements of the RDD.

### reduce
(Note: unlike *reduce*, *reduceByKey* is a transformation on RDDs.)
Aggregates the elements of the dataset using a function `func` (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.


### takeOrdered
`takeOrdered` returns the first n elements of the RDD in ascending order, using either their natural order or a custom key function. Negating the key, as in the cells below, yields descending order.

In [3]:
rdd = sc.parallelize([5,3,12,23])
# descending order
rdd.takeOrdered(3,lambda s:-1*s)

[23, 12, 5]

In [4]:
rdd = sc.parallelize([(5,23),(3,34),(12,344),(23,29)])
# descending order
rdd.takeOrdered(3,lambda s:-1*s[1])

[(12, 344), (3, 34), (23, 29)]
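For small local lists, the same ordering logic can be tried out with `heapq.nsmallest` from the standard library. This is only a local analogue for experimenting with key functions, under the assumption that "first n in ascending order of the key" is the behaviour you want to reproduce.

```python
import heapq

pairs = [(5, 23), (3, 34), (12, 344), (23, 29)]

# Local analogue of rdd.takeOrdered(3): natural ascending order of the tuples.
smallest = heapq.nsmallest(3, pairs)

# Local analogue of rdd.takeOrdered(3, lambda s: -1 * s[1]):
# negating the second field sorts by it in descending order.
top_by_value = heapq.nsmallest(3, pairs, key=lambda s: -1 * s[1])

print(smallest)
print(top_by_value)
```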

## Understanding the WordCount Example

In [9]:
# Distribute the data: create an RDD
lines = sc.textFile("ej.txt")

# Split lines into words, create (word, 1) tuples, then reduce by key (the word)
counts = (lines.flatMap(lambda x: x.split(' '))          
                  .map(lambda x: (x, 1))                 
                  .reduceByKey(lambda x,y : x + y))

# Bring the first 10 results back to the local driver
output = counts.take(10)                                 

# print output
for (word, count) in output:                             
    print("%s: %i" % (word, count))

parque: 8
Pedro: 1
y: 6
hoy: 1
del: 3
que: 2
hay: 1
junto: 1
mi: 3
divierto: 1


### Second line
Let's look at what `map`, `flatMap`, and `reduceByKey` each do.

In [18]:
# With map, each line becomes its own separate list; to get all the words together, flatMap is the better choice
lines.flatMap(lambda x: x.split(' ')).take(3)

['El', 'parque', 'Me']

In [19]:
# With map, each line becomes its own separate list; to get all the words together, flatMap is the better choice
# lines.map(lambda x: x.split(' ')).take(3)
# flatMap puts all the words into a single list (similar to splitting the whole text with split(' ') in Python)
a = lines.flatMap(lambda x: x.split(' '))
a.take(5)

['El', 'parque', 'Me', 'llamo', 'Pedro']

In [21]:
# Put each word into a tuple (word, 1)
b = a.map(lambda x: (x, 1)) 
b.take(5)

[('El', 1), ('parque', 1), ('Me', 1), ('llamo', 1), ('Pedro', 1)]

In [28]:
# This groups and counts by word
c = b.reduceByKey(lambda x,y : x + y)
c.take(10)

[('parque', 8),
 ('Pedro', 1),
 ('y', 6),
 ('hoy', 1),
 ('del', 3),
 ('que', 2),
 ('hay', 1),
 ('junto', 1),
 ('mi', 3),
 ('divierto', 1)]
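The three steps can also be traced on a tiny in-memory example without Spark. The sketch below uses made-up `sample_lines` (not the contents of `ej.txt`) and plain Python stand-ins for each stage, so it is an illustration of the data flow rather than Spark code.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample input; each string plays the role of one line of the file.
sample_lines = ["to be or not to be", "that is the question"]

# flatMap step: split every line and chain the words into one flat list.
words = list(chain.from_iterable(line.split(' ') for line in sample_lines))

# map step: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey step: add up the counts for each distinct word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

for word, count in counts.items():
    print("%s: %i" % (word, count))
```

Note that, as in the Spark version, splitting on a single space is case-sensitive and keeps punctuation attached to words, so real text usually needs some normalization first.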

## Spark in Action with Example

See *02_MoviesRDDs.ipynb*