## SESSION 6 : Example 2

### Topics : 

* Dataframe Creation
* Dataframe and SQL Operations
* Dataframe Persistence

### Objective :

 * Get familiar with Spark Dataframe programming and SQL operations.
 * Get to know the different data formats used in big data.
 * Get an introduction to dataframe persistence.

### Reference :

* Spark reference documentation (SQL programming guide): https://spark.apache.org/docs/latest/sql-programming-guide.html

### Additional Info :

* Data sources and formats : https://spark.apache.org/docs/latest/sql-data-sources.html


### Environment variables

In [1]:
import os
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook"

### Creating SparkSession and SparkContext

In [2]:
from pyspark.sql import SparkSession
# session
spark = SparkSession \
    .builder \
    .appName("Session6-Example-2") \
    .master("local[2]") \
    .getOrCreate()
# context
sc = spark.sparkContext

In [3]:
# Configuration check (getOrCreate() may return an already running session)
print("Spark version  : "+str(sc.version))
print("Spark app ID   : "+sc.applicationId)
print("Spark app name : "+sc.appName)
print("Spark mode     : "+sc.master)

Spark version  : 2.3.2
Spark app ID   : local-1542571339786
Spark app name : PySparkShell
Spark mode     : local[*]


In [76]:
# Customer credit card payment records: a dummy list of dictionaries
payments= [
    {'name': 'customer01', 'amount': 500, 'country': 'India','age':20},
    {'name': 'customer02', 'amount': 150, 'country': 'India','age':25},
    {'name': 'customer03', 'amount': 50 , 'country': 'India','age':30},
    {'name': 'customer04', 'amount': 200, 'country': 'Germany','age':35},
    {'name': 'customer05', 'amount': 750, 'country': 'India','age':20},
    {'name': 'customer06', 'amount': 100, 'country': 'Poland','age':35},
    {'name': 'customer08', 'amount': 100, 'country': 'Poland','age':45},
    {'name': 'customer08', 'amount': 100, 'country': 'Spain','age':300},
    {'name': 'customer09', 'amount': 100, 'country': 'Spain','age':50},
    {'name': 'customer10', 'amount': 200, 'country': 'Spain','age':60},
    {'name': 'customer11', 'amount': 100, 'country': 'Spain','age':25},
    {'name': 'customer12', 'amount': 100, 'country': 'Spain','age':2},
    {'name': 'customer13', 'amount': 100, 'country': 'Germany','age':50},
    {'name': 'customer14', 'amount': 300, 'country': 'Germany','age':40},
    {'name': 'customer15', 'amount': 100, 'country': 'Germany','age':35},
    {'name': 'customer16', 'amount': 100, 'country': 'Spain','age':25},
    {'name': 'customer17', 'amount': 100, 'country': 'Poland','age':45},
    {'name': 'customer18', 'amount': 400, 'country': 'India','age':50},
    {'name': 'customer19', 'amount': 100, 'country': 'India','age':50},
    {'name': 'customer20', 'amount': 100, 'country': 'India','age':20},
    {'name': 'customer21', 'amount': 500, 'country': 'India','age':20},
    {'name': 'customer22', 'amount': 150, 'country': 'India','age':20},
    {'name': 'customer23', 'amount': 50 , 'country': 'India','age':30},
    {'name': 'customer24', 'amount': 200, 'country': 'Germany','age':60},
    {'name': 'customer25', 'amount': 750, 'country': 'India','age':20},
    {'name': 'customer26', 'amount': 100, 'country': 'Poland','age':20},
    {'name': 'customer28', 'amount': 100, 'country': 'Poland','age':20},
    {'name': 'customer28', 'amount': 100, 'country': 'Spain','age':30},
    {'name': 'customer29', 'amount': 300, 'country': 'Spain','age':20},
    {'name': 'customer30', 'amount': 200, 'country': 'Spain','age':50},
    {'name': 'customer31', 'amount': 100, 'country': 'Spain','age':40},
    {'name': 'customer32', 'amount': 100, 'country': 'Spain','age':40},
    {'name': 'customer33', 'amount': 100, 'country': 'Germany','age':40},
    {'name': 'customer34', 'amount': 300, 'country': 'Germany','age':20},
    {'name': 'customer35', 'amount': 100, 'country': 'Germany','age':40},
    {'name': 'customer36', 'amount': 100, 'country': 'Spain','age':50},
    {'name': 'customer37', 'amount': 100, 'country': 'Poland','age':30},
    {'name': 'customer38', 'amount': 400, 'country': 'India','age':20},
    {'name': 'customer39', 'amount': 100, 'country': 'India','age':20},
    {'name': 'customer40', 'amount': 100, 'country': 'India','age':30},
    {'name': 'customer41', 'amount': 500, 'country': 'India','age':35},
    {'name': 'customer42', 'amount': 150, 'country': 'India','age':35},
    {'name': 'customer43', 'amount': 50 , 'country': 'India','age':40},
    {'name': 'customer44', 'amount': 200, 'country': 'Germany','age':20},
    {'name': 'customer45', 'amount': 750, 'country': 'India','age':20},
    {'name': 'customer46', 'amount': 100, 'country': 'Poland','age':20},
    {'name': 'customer48', 'amount': 100, 'country': 'Poland','age':20},
    {'name': 'customer48', 'amount': 100, 'country': 'Spain','age':30},
    {'name': 'customer49', 'amount': 800, 'country': 'Spain','age':50},
    {'name': 'customer50', 'amount': 200, 'country': 'Spain','age':20},
    {'name': 'customer51', 'amount': 100, 'country': 'Spain','age':50},
    {'name': 'customer52', 'amount': 100, 'country': 'Spain','age':30},
    {'name': 'customer53', 'amount': 100, 'country': 'Germany','age':40},
    {'name': 'customer54', 'amount': 300, 'country': 'Germany','age':40},
    {'name': 'customer55', 'amount': 100, 'country': 'Germany','age':40},
    {'name': 'customer56', 'amount': 100, 'country': 'Spain','age':20},
    {'name': 'customer57', 'amount': 100, 'country': 'Poland','age':20},
    {'name': 'customer58', 'amount': 400, 'country': 'India','age':20},
    {'name': 'customer59', 'amount': 100, 'country': 'India','age':25},
    {'name': 'customer60', 'amount': 100, 'country': 'India','age':25},
]

### Creating a Dataframe from an RDD:

Steps :
1. Create the RDD
2. Define the schema required for the Dataframe, or infer it from the data
3. Create the actual Dataframe

In [77]:
# create an RDD by parallelizing this list
rdd = sc.parallelize(payments)

In [78]:
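# in this particular case we can let Spark infer the schema from the RDD itself,
# because each record is a dictionary whose keys become the column names
# (Spark 2.x emits a deprecation warning for dictionaries and suggests pyspark.sql.Row)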
df = spark.createDataFrame(rdd)



In [79]:
### Remember the notion of schema?
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- amount: long (nullable = true)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)



In [80]:
# let's inspect the contents
df.show()

+---+------+-------+----------+
|age|amount|country|      name|
+---+------+-------+----------+
| 20|   500|  India|customer01|
| 25|   150|  India|customer02|
| 30|    50|  India|customer03|
| 35|   200|Germany|customer04|
| 20|   750|  India|customer05|
| 35|   100| Poland|customer06|
| 45|   100| Poland|customer08|
|300|   100|  Spain|customer08|
| 50|   100|  Spain|customer09|
| 60|   200|  Spain|customer10|
| 25|   100|  Spain|customer11|
|  2|   100|  Spain|customer12|
| 50|   100|Germany|customer13|
| 40|   300|Germany|customer14|
| 35|   100|Germany|customer15|
| 25|   100|  Spain|customer16|
| 45|   100| Poland|customer17|
| 50|   400|  India|customer18|
| 50|   100|  India|customer19|
| 20|   100|  India|customer20|
+---+------+-------+----------+
only showing top 20 rows



### Dataframe operations by example :

Compute the following :

 1. The number of customers per country
 2. The total payments per country
 3. The average amount per country
 4. The min and max payment across all countries
 
Exercise for you: compute the following :

 5. The average age of our customers per country 

In [63]:
# 1. nb of customers per country
df.groupBy('country').count().show()

+-------+-----+
|country|count|
+-------+-----+
|Germany|   12|
|  India|   21|
|  Spain|   18|
| Poland|    9|
+-------+-----+



In [64]:
# 2. total payments per country
import pyspark.sql.functions as F
df.groupBy('country').agg(F.sum('amount')).show()

+-------+-----------+
|country|sum(amount)|
+-------+-----------+
|Germany|       2100|
|  India|       6150|
|  Spain|       3000|
| Poland|        900|
+-------+-----------+



In [65]:
# 3. avg amount per country
import pyspark.sql.functions as F
df.groupBy('country').agg(F.avg('amount')).show()

+-------+------------------+
|country|       avg(amount)|
+-------+------------------+
|Germany|             175.0|
|  India|292.85714285714283|
|  Spain|166.66666666666666|
| Poland|             100.0|
+-------+------------------+



In [66]:
# 4. min/max for all countries
df.agg(F.max('amount')).show()

+-----------+
|max(amount)|
+-----------+
|        800|
+-----------+



In [67]:
df.agg(F.min('amount')).show()

+-----------+
|min(amount)|
+-----------+
|         50|
+-----------+



### Dataframe SQL operations by example :

Compute the same quantities using the Spark SQL API :

 1. The number of customers per country
 2. The total payments per country
 3. The avg amount per country
 4. The min and max payment across all countries

In [68]:
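# 1. First we register the dataframe as a temporary view. The view is scoped to
#    this SparkSession and is not stored in the Hive metastore.
#    (registerTempTable is deprecated since Spark 2.0 in favour of createOrReplaceTempView.)
df.createOrReplaceTempView("payments")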

In [69]:
query="SELECT * FROM payments"
sql_df = spark.sql(query)

In [70]:
sql_df.show()

+---+------+-------+----------+
|age|amount|country|      name|
+---+------+-------+----------+
| 20|   500|  India|customer01|
| 25|   150|  India|customer02|
| 30|    50|  India|customer03|
| 35|   200|Germany|customer04|
| 20|   750|  India|customer05|
| 35|   100| Poland|customer06|
| 45|   100| Poland|customer08|
|300|   100|  Spain|customer08|
| 50|   100|  Spain|customer09|
| 60|   200|  Spain|customer10|
| 25|   100|  Spain|customer11|
|  2|   100|  Spain|customer12|
| 50|   100|Germany|customer13|
| 40|   300|Germany|customer14|
| 35|   100|Germany|customer15|
| 25|   100|  Spain|customer16|
| 45|   100| Poland|customer17|
| 50|   400|  India|customer18|
| 50|   100|  India|customer19|
| 20|   100|  India|customer20|
+---+------+-------+----------+
only showing top 20 rows



In [71]:
# 1. The number of customers per country in SQL
query="SELECT country,COUNT(*) as count FROM payments GROUP BY country" 
sql_df = spark.sql(query)
sql_df.show()

+-------+-----+
|country|count|
+-------+-----+
|Germany|   12|
|  India|   21|
|  Spain|   18|
| Poland|    9|
+-------+-----+



In [72]:
# 2. The total payments per country
query="SELECT country,SUM(amount) as total FROM payments GROUP BY country" 
sql_df = spark.sql(query)
sql_df.show()

+-------+-----+
|country|total|
+-------+-----+
|Germany| 2100|
|  India| 6150|
|  Spain| 3000|
| Poland|  900|
+-------+-----+



In [73]:
# 3. The avg amount per country
query="SELECT country,AVG(amount) as average FROM payments GROUP BY country" 
sql_df = spark.sql(query)
sql_df.show()

+-------+------------------+
|country|           average|
+-------+------------------+
|Germany|             175.0|
|  India|292.85714285714283|
|  Spain|166.66666666666666|
| Poland|             100.0|
+-------+------------------+



In [74]:
# 4. The min and max payment across all countries
query="SELECT MAX(amount) as max FROM payments" 
sql_df = spark.sql(query)
sql_df.show()

+---+
|max|
+---+
|800|
+---+



In [75]:
# 4. The min and max payment across all countries
query="SELECT MIN(amount) as min FROM payments" 
sql_df = spark.sql(query)
sql_df.show()

+---+
|min|
+---+
| 50|
+---+



### Persisting Dataframes
* This example persists a dataframe in Parquet, a columnar file format widely used in big data
* The output is partitioned by specific columns ('attributes') of the data
* Partitioning generally speeds up later loading and querying, because reads that filter on a partition column can skip the other directories

In [82]:
# Save the original dataframe in parquet format,
# partitioning the output by age and then by country.
my_home=os.environ['HOME']
out_dir="spark_sessions/session6/data/"
df.write.partitionBy(
        "age","country"
    ).parquet(
        "file://"
        + my_home
        +'/'
        + out_dir,
        mode='overwrite'
    )

In [84]:
### You can inspect the contents of what you just saved
! ls -l $my_home/$out_dir

total 40
drwxrwxr-x 3 ubuntu ubuntu 4096 Nov 18 21:06 age=2
drwxrwxr-x 6 ubuntu ubuntu 4096 Nov 18 21:06 age=20
drwxrwxr-x 4 ubuntu ubuntu 4096 Nov 18 21:06 age=25
drwxrwxr-x 5 ubuntu ubuntu 4096 Nov 18 21:06 age=30
drwxrwxr-x 3 ubuntu ubuntu 4096 Nov 18 21:06 age=300
drwxrwxr-x 5 ubuntu ubuntu 4096 Nov 18 21:06 age=35
drwxrwxr-x 5 ubuntu ubuntu 4096 Nov 18 21:06 age=40
drwxrwxr-x 3 ubuntu ubuntu 4096 Nov 18 21:06 age=45
drwxrwxr-x 5 ubuntu ubuntu 4096 Nov 18 21:06 age=50
drwxrwxr-x 4 ubuntu ubuntu 4096 Nov 18 21:06 age=60
-rw-r--r-- 1 ubuntu ubuntu    0 Nov 18 21:06 _SUCCESS


In [86]:
# You can see how the output data files have been structured according to the requested partitioning scheme
# age + country
! ls -l $my_home/$out_dir/age=50

total 12
drwxrwxr-x 2 ubuntu ubuntu 4096 Nov 18 21:06 country=Germany
drwxrwxr-x 2 ubuntu ubuntu 4096 Nov 18 21:06 country=India
drwxrwxr-x 2 ubuntu ubuntu 4096 Nov 18 21:06 country=Spain


In [87]:
! ls -l $my_home/$out_dir/age=50/country=Spain

total 8
-rw-r--r-- 1 ubuntu ubuntu 629 Nov 18 21:06 part-00000-64f3c1b1-3ad9-4ada-9b84-f4c996f9d673.c000.snappy.parquet
-rw-r--r-- 1 ubuntu ubuntu 661 Nov 18 21:06 part-00001-64f3c1b1-3ad9-4ada-9b84-f4c996f9d673.c000.snappy.parquet
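
The listings above follow a `column=value` naming convention, one directory level per partition column. As a plain-Python illustration of that convention (not Spark's own implementation), the hypothetical helper `partition_values` below recovers the partition values from a relative file path. The file name in the example is shortened and made up.

```python
def partition_values(relative_path):
    """Extract column=value pairs from the directory part of a relative path."""
    values = {}
    # The last path element is the data file name, so it is skipped.
    for part in relative_path.split("/")[:-1]:
        if "=" in part:
            column, value = part.split("=", 1)
            values[column] = value
    return values


# Shortened, hypothetical file name under the age=50/country=Spain directory.
example = "age=50/country=Spain/part-00000.snappy.parquet"
parsed = partition_values(example)
print(parsed)
```

The values come back as strings here; when Spark reads such a directory tree it also infers a type for each partition column.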
