## SESSION 6 : Example 1

### Topics : 

* RDD Creation
* RDD Partitioning
* RDD Operations and Shuffling
* RDD Persistence

### Example objective :

Given a list of credit card payments per customer, compute:

 1. The total payments per country
 2. The average amount per country
 3. The min and max payment across all countries
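
As a reference point for the Spark results, the three objectives can be sketched in plain Python first. This is only a minimal sketch: the `sample` records and the `summarize` helper are hypothetical and are not part of the example's data or code.

```python
from collections import defaultdict

# Hypothetical sample records, shaped like the payments used in this example
sample = [
    {'name': 'a', 'amount': 500, 'country': 'India'},
    {'name': 'b', 'amount': 150, 'country': 'India'},
    {'name': 'c', 'amount': 200, 'country': 'Germany'},
]

def summarize(records):
    """Return (totals per country, averages per country, (min, max) amount)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        totals[record['country']] += record['amount']
        counts[record['country']] += 1
    averages = {country: totals[country] / counts[country] for country in totals}
    amounts = [record['amount'] for record in records]
    return dict(totals), averages, (min(amounts), max(amounts))

totals, averages, extremes = summarize(sample)
print(totals)    # {'India': 650, 'Germany': 200}
print(averages)  # {'India': 325.0, 'Germany': 200.0}
print(extremes)  # (150, 500)
```

The Spark version has to produce the same numbers, but with the records spread over several partitions.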

### Reference :

* SPARK Reference Documentation: https://spark.apache.org/docs/2.1.1/programming-guide.html#rdd-operations

### Additional Info :

* RDD Partitions : https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html


### Environment variables

In [1]:
import os
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook"

### Creating SparkSession and SparkContext

In [2]:
from pyspark.sql import SparkSession
# session
spark = SparkSession \
    .builder \
    .appName("Session6-Example-1") \
    .master("local[2]") \
    .getOrCreate()
# context
sc = spark.sparkContext

In [3]:
# Configuration Check
print("Spark version  : "+str(sc.version))
print("Spark app ID   : "+sc.applicationId)
print("Spark app name : "+sc.appName)
print("Spark mode     : "+sc.master)

Spark version  : 2.3.2
Spark app ID   : local-1542571232253
Spark app name : PySparkShell
Spark mode     : local[*]


In [4]:
# Customer credit card payment records (dummy list)
payments= [
    {'name': 'customer01', 'amount': 500, 'country': 'India'},
    {'name': 'customer02', 'amount': 150, 'country': 'India'},
    {'name': 'customer03', 'amount': 50 , 'country': 'India'},
    {'name': 'customer04', 'amount': 200, 'country': 'Germany'},
    {'name': 'customer05', 'amount': 750, 'country': 'India'},
    {'name': 'customer06', 'amount': 100, 'country': 'Poland'},
    {'name': 'customer08', 'amount': 100, 'country': 'Poland'},
    {'name': 'customer08', 'amount': 100, 'country': 'Spain'},
    {'name': 'customer09', 'amount': 100, 'country': 'Spain'},
    {'name': 'customer10', 'amount': 200, 'country': 'Spain'},
    {'name': 'customer11', 'amount': 100, 'country': 'Spain'},
    {'name': 'customer12', 'amount': 100, 'country': 'Spain'},
    {'name': 'customer13', 'amount': 100, 'country': 'Germany'},
    {'name': 'customer14', 'amount': 300, 'country': 'Germany'},
    {'name': 'customer15', 'amount': 100, 'country': 'Germany'},
    {'name': 'customer16', 'amount': 100, 'country': 'Spain'},
    {'name': 'customer17', 'amount': 100, 'country': 'Poland'},
    {'name': 'customer18', 'amount': 400, 'country': 'India'},
    {'name': 'customer19', 'amount': 100, 'country': 'India'},
    {'name': 'customer20', 'amount': 100, 'country': 'India'},
]

### Creating an RDD : by parallelizing a list

In [5]:
# Create an RDD by parallelizing this list
# and specify the number of partitions
nb_of_partitions=4
rdd = sc.parallelize(payments,nb_of_partitions)

### Theory : each partition has one task associated with it

<img src="images/partitions.png" height="700px" width="700px" >

### Partitioning Details
 
 *  For key-value pair RDDs (key, value), once a partitioner is applied, data is distributed between partitions depending on the value of the key.
 *  For all other RDDs, data is simply split into chunks of consecutive records, according to either the number of records or the size of a chunk.
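
The "chunks of consecutive records" case can be modeled with simple index arithmetic. The sketch below is a simplified model and not Spark's own implementation; `slice_positions` and `chunk` are hypothetical helper names.

```python
def slice_positions(length, num_slices):
    """Yield (start, end) index pairs for consecutive, near-equal chunks."""
    for i in range(num_slices):
        start = (i * length) // num_slices
        end = ((i + 1) * length) // num_slices
        yield start, end

def chunk(records, num_slices):
    """Split a list into num_slices consecutive chunks."""
    return [records[start:end] for start, end in slice_positions(len(records), num_slices)]

parts = chunk(list(range(20)), 4)
print([len(part) for part in parts])  # [5, 5, 5, 5]
```

With 20 records and 4 partitions this model gives 5 records per chunk, in the original order.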

### Checking Partition Details

 * Get the number of partitions of an RDD : with the getNumPartitions() function
 * Inspect the contents of each partition : with the glom() function
 
 * **Spark partitioner** (see API : https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/Partitioner.html) 
   * A partitioner defines how the elements of a key-value pair RDD are partitioned by key.
   * It maps each key to a partition ID, from 0 to numPartitions - 1.
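
The key-to-partition mapping can be sketched in plain Python as a hash followed by a modulo, which is also what PySpark's `partitionBy` does with the value returned by its partition function. This is a sketch under stated assumptions: `stable_hash` and `partition_id` are hypothetical names, and CRC32 is used only as a stand-in because Python's built-in `hash()` for strings is randomized per interpreter process unless `PYTHONHASHSEED` is set.

```python
import zlib

def stable_hash(key):
    # Stand-in hash function (assumption): CRC32 of the UTF-8 bytes,
    # which gives the same value on every run
    return zlib.crc32(key.encode('utf-8'))

def partition_id(key, num_partitions, hash_func=stable_hash):
    """Map a key to a partition ID in the range 0 .. num_partitions - 1."""
    return hash_func(key) % num_partitions

countries = ['India', 'Germany', 'Poland', 'Spain']
assignment = {country: partition_id(country, 4) for country in countries}
print(assignment)
```

Because of the modulo, two different keys can receive the same partition ID, and some partitions can stay empty.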

In [6]:
print("Default parallelism : {}".format(sc.defaultParallelism))
print("Number of partitions: {}".format(rdd.getNumPartitions()))
print("Partitioner         : {}".format(rdd.partitioner))
print("Partitions structure: ")
# Note that collect() is an action
i=1
for partition in rdd.glom().collect():
    print('')
    print('partition %d : %s' %(i,partition))
    print('nb of elements: %d'   %len(partition))
    i+=1

Default parallelism : 2
Number of partitions: 4
Partitioner         : None
Partitions structure: 

partition 1 : [{'name': 'customer01', 'amount': 500, 'country': 'India'}, {'name': 'customer02', 'amount': 150, 'country': 'India'}, {'name': 'customer03', 'amount': 50, 'country': 'India'}, {'name': 'customer04', 'amount': 200, 'country': 'Germany'}, {'name': 'customer05', 'amount': 750, 'country': 'India'}]
nb of elements: 5

partition 2 : [{'name': 'customer06', 'amount': 100, 'country': 'Poland'}, {'name': 'customer08', 'amount': 100, 'country': 'Poland'}, {'name': 'customer08', 'amount': 100, 'country': 'Spain'}, {'name': 'customer09', 'amount': 100, 'country': 'Spain'}, {'name': 'customer10', 'amount': 200, 'country': 'Spain'}]
nb of elements: 5

partition 3 : [{'name': 'customer11', 'amount': 100, 'country': 'Spain'}, {'name': 'customer12', 'amount': 100, 'country': 'Spain'}, {'name': 'customer13', 'amount': 100, 'country': 'Germany'}, {'name': 'customer14', 'amount': 300, 'country

### Partition contents 
* Spark has distributed the data evenly across the partitions: each of the 4 partitions holds 5 elements
* The records keep their original order, so each partition is a chunk of consecutive records from the list

###  Inspecting Spark Jobs
* What is the expected execution behavior, and how many tasks are launched?
* The Spark application can be inspected in the Spark History Server : http://localhost:18081/

### Apply some transformation + action 

In [21]:
# Compute the total of payments per country
# 
def sum_payments(group):
    # ---------------
    # TODO : exercise
    # ---------------
    return  

# To group all customers by country,
# apply two transformations (groupBy and map) and an action (collect)
# Tip : the map transformation is required here because groupBy returns an iterable
# object for each group, and we actually want to turn it into a list
result1 = rdd \
    .groupBy(lambda customer: customer['country']) \
    .map(lambda group : (group[0], list(group[1]))) \
    .collect()
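
One possible way to complete the `sum_payments` exercise is sketched below. It assumes that each group has the shape produced by `groupBy`, that is a `(country, iterable of customer dicts)` pair; `example_group` is hypothetical test data.

```python
def sum_payments(group):
    """Take (country, iterable of customer dicts) and return (country, total amount)."""
    country, customers = group
    return country, sum(customer['amount'] for customer in customers)

# Hypothetical group, shaped like one element produced by groupBy
example_group = ('Poland', [
    {'name': 'customer06', 'amount': 100, 'country': 'Poland'},
    {'name': 'customer17', 'amount': 100, 'country': 'Poland'},
])
print(sum_payments(example_group))  # ('Poland', 200)
```

In the Spark pipeline, a function of this shape could be passed to `map` right after `groupBy`, in place of the lambda that builds the lists.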

In [22]:
result1

[('Poland',
  [{'name': 'customer06', 'amount': 100, 'country': 'Poland'},
   {'name': 'customer08', 'amount': 100, 'country': 'Poland'},
   {'name': 'customer17', 'amount': 100, 'country': 'Poland'}]),
 ('Germany',
  [{'name': 'customer04', 'amount': 200, 'country': 'Germany'},
   {'name': 'customer13', 'amount': 100, 'country': 'Germany'},
   {'name': 'customer14', 'amount': 300, 'country': 'Germany'},
   {'name': 'customer15', 'amount': 100, 'country': 'Germany'}]),
 ('Spain',
  [{'name': 'customer08', 'amount': 100, 'country': 'Spain'},
   {'name': 'customer09', 'amount': 100, 'country': 'Spain'},
   {'name': 'customer10', 'amount': 200, 'country': 'Spain'},
   {'name': 'customer11', 'amount': 100, 'country': 'Spain'},
   {'name': 'customer12', 'amount': 100, 'country': 'Spain'},
   {'name': 'customer16', 'amount': 100, 'country': 'Spain'}]),
 ('India',
  [{'name': 'customer01', 'amount': 500, 'country': 'India'},
   {'name': 'customer02', 'amount': 150, 'country': 'India'},
   {'n

###  Inspecting Spark Jobs
* What is the expected execution behavior? Was there a shuffle, and why?
* The Spark application can be inspected in the Spark History Server : http://34.251.237.234:18081/
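
To reason about the shuffle question: `groupBy` needs all records of a country in the same partition, but the initial chunks mix countries, so records have to move between partitions. The sketch below is a simplified model that only counts the records that would have to move; `records_to_move`, `layout` and `target` are hypothetical and ignore how Spark actually transfers data.

```python
def records_to_move(partitions, target_partition):
    """Count records whose key belongs in a different partition than the current one."""
    moved = 0
    for current_id, records in enumerate(partitions):
        for record in records:
            if target_partition(record['country']) != current_id:
                moved += 1
    return moved

# Hypothetical layout: two partitions holding a mix of countries
layout = [
    [{'country': 'India'}, {'country': 'Germany'}],
    [{'country': 'India'}, {'country': 'Poland'}],
]

# Hypothetical target assignment: India -> partition 0, everything else -> partition 1
def target(country):
    return 0 if country == 'India' else 1

print(records_to_move(layout, target))  # 2
```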

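### Could we reduce the shuffling?
* Yes, by using a custom partitioner : hash partitioning. Repartitioning by country shuffles the data once; later per-country operations can then run inside each partition
* **Note** : custom partitioning is only possible for key-value pair RDDs
* Let's define a custom partitioner that places all customers of a given country in the same partition
* We use a hash function, which returns an integer for a given string. Spark takes this value modulo the number of partitions, so different countries can still end up in the same partition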

In [24]:
# Define a custom partitioner based on Python's built-in hash() of a string
# hash() returns an integer for a given string; the value is not guaranteed to be unique
def country_partitioner(country):
    return hash(country)

In [25]:
# Create a new RDD, now using the partitioner defined above
# --------------
# Important Note : custom partitioning can only be applied to (K,V) pair RDDs,
#                  therefore we need to create a key-value pair RDD below
# --------------
new_rdd = sc.parallelize(payments) \
    .map(lambda customer: (customer['country'],customer)) \
    .partitionBy(4, country_partitioner) \
    .persist()

In [26]:
# Compute the total of payments per country
# 

# The partitioner placed all customers of a given country in the same partition,
# which improves data locality and avoids any further shuffling for this operation

# Define the summing function : it yields one total per partition
# (each element of the iterator is a (country, customer) pair)
def sum_payments(iterator):
    yield sum(customer['amount'] for _, customer in iterator)


# Now we simply apply the function to each partition with the mapPartitions() transformation
# Each partition is processed independently, with NO additional shuffling
# Note : a total equals a country total only if the partition holds a single country
amount_per_country = new_rdd \
        .mapPartitions(sum_payments) \
        .collect()

In [27]:
amount_per_country

[300, 1400, 2050, 0]
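
The four values above are totals per partition, not per country. From the input data, Poland sums to 300 and India to 2050, while Germany and Spain sum to 700 each, so the 1400 comes from a partition that holds both of them, and the 0 comes from an empty partition. A minimal sketch of a per-partition function that keeps the country next to each total is shown below; `sum_payments_by_country` and the `partition` sample are hypothetical.

```python
from collections import defaultdict

def sum_payments_by_country(iterator):
    """Yield one (country, total) pair for each country found in a partition."""
    totals = defaultdict(int)
    for country, customer in iterator:
        totals[country] += customer['amount']
    for item in totals.items():
        yield item

# Hypothetical partition content holding two countries
partition = [
    ('Germany', {'amount': 200}),
    ('Spain', {'amount': 100}),
    ('Germany', {'amount': 100}),
]
print(list(sum_payments_by_country(iter(partition))))
# [('Germany', 300), ('Spain', 100)]
```

A function of this shape could be passed to `mapPartitions` instead of `sum_payments`. An empty partition then yields nothing rather than 0.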

In [28]:
print("Default parallelism : {}".format(sc.defaultParallelism))
print("Number of partitions: {}".format(new_rdd.getNumPartitions()))
print("Partitioner         : {}".format(new_rdd.partitioner))
print("Partitions structure: ")
# Note that collect() is an action
i=1
for partition in new_rdd.glom().collect():
    print('')
    print('partition %d : %s' %(i,partition))
    print('nb of elements: %d'   %len(partition))
    i+=1

Default parallelism : 2
Number of partitions: 4
Partitioner         : None
Partitions structure: 

partition 1 : [{'name': 'customer01', 'amount': 500, 'country': 'India'}, {'name': 'customer02', 'amount': 150, 'country': 'India'}, {'name': 'customer03', 'amount': 50, 'country': 'India'}, {'name': 'customer04', 'amount': 200, 'country': 'Germany'}, {'name': 'customer05', 'amount': 750, 'country': 'India'}]
nb of elements: 5

partition 2 : [{'name': 'customer06', 'amount': 100, 'country': 'Poland'}, {'name': 'customer08', 'amount': 100, 'country': 'Poland'}, {'name': 'customer08', 'amount': 100, 'country': 'Spain'}, {'name': 'customer09', 'amount': 100, 'country': 'Spain'}, {'name': 'customer10', 'amount': 200, 'country': 'Spain'}]
nb of elements: 5

partition 3 : [{'name': 'customer11', 'amount': 100, 'country': 'Spain'}, {'name': 'customer12', 'amount': 100, 'country': 'Spain'}, {'name': 'customer13', 'amount': 100, 'country': 'Germany'}, {'name': 'customer14', 'amount': 300, 'country