---
# Lab Number : 2

## Title : *Data Analysis with Spark* 

## Goal : 

Getting familiar with Spark DataFrames and the Spark workflow

## Help:

1. Spark Programming Guide : https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
2. Spark API reference : https://spark.apache.org/docs/latest/api/python/index.html

## Datasets reference:

https://archive.ics.uci.edu/ml/datasets/bank+marketing


## Datasets local path:

* $HOME/spark-course/data/bank/bank-full.csv

## Reading :

[Moro et al., 2014](https://www.researchgate.net/publication/260805594_A_Data-Driven_Approach_to_Predict_the_Success_of_Bank_Telemarketing) A Data-Driven Approach to Predict the Success of Bank Telemarketing.

---

## Lab Specific Tasks


### Basic Analysis

 * Create a SparkSession
 * Create a DataFrame by loading the dataset: Bank Products Marketing
 * Inspect the dataset and analyze its structure (schema)
 * Report the number of columns and their names
 * Report the number of records in dataset

### Advanced Analysis

**Warning:** you will need to convert some columns from string to a numeric type (float, double).

 * Compute descriptive statistics (count, max, min, average, median, stddev) on numeric columns where applicable.
 * Compute the number of people by age bin
 * Compute the mean number of contacts performed during this campaign for each age range.
 * Investigate and quantify correlations (if any) between features

### Save Your Notebook

Once you're finished, save your notebook:

1. Go to File -> Save and Checkpoint

## Basic Analysis

### Create a SparkSession object

In [2]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

spark = SparkSession \
        .builder \
        .appName("Lab2") \
        .getOrCreate()

sc = spark.sparkContext

In [3]:
sc

### Create a Dataframe reading in data from file 

In [27]:
# Get the home directory path using Python os.environ
import os
my_home=os.environ.get('HOME')

bank_data=my_home+'/spark-course/data/bank/bank-full.csv'
# Use it to load some data
df= spark \
    .read \
    .option("header","true") \
    .csv(bank_data)

### Inspect the dataset and analyze its structure

In [28]:
# What is df ?
df

DataFrame["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string]

In [29]:
# OK, but this is not very telling; we also want to see some of the data
df.head(5)

[Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='47;"blue-collar";"married";"unknown";"no";1506;"yes";"n

In [30]:
# You can see how a Spark DataFrame is actually a Dataset[Row] abstraction

# [Row("age";...)]

# Let's analyze some data
# First let's check the schema
df.printSchema()

root
 |-- "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string (nullable = true)



In [31]:
# Something is odd here: there is only the 'root' node and a single flat leaf,
# with everything recorded as one string, even values that are certainly numeric.
# So let's try to get the schema (structure) of the data right.
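
The single string column suggests that the file is not comma-separated at all. A quick check outside Spark can confirm this. The sketch below uses Python's standard `csv` module on a header line reconstructed from the column name printed by `df` above (the exact line in the file is an assumption), and compares splitting on commas with splitting on semicolons.

```python
import csv

# Header line reconstructed from the column name shown by df (assumed to match the file)
header_line = (
    '"age";"job";"marital";"education";"default";"balance";"housing";"loan";'
    '"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"'
)

# Default csv.reader settings split on commas
fields_comma = next(csv.reader([header_line]))

# Same line, split on semicolons instead
fields_semicolon = next(csv.reader([header_line], delimiter=";"))

print(len(fields_comma))      # number of fields found with a comma delimiter
print(len(fields_semicolon))  # number of fields found with a semicolon delimiter
print(fields_semicolon)
```

With a comma the whole line stays one field, while a semicolon yields one field per column. In Spark the corresponding reader setting is `.option("sep", ";")`.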

### Creating a schema for your data:

* Option 1 : Manually specify the data schema
* Option 2 : Try to infer the schema from the data itself (requires an extra pass over the file)

In [33]:
# Option 1: we specify the schema ourselves
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Fields are matched to the file's columns by position.
# Note: the file's 17th column ("y") is not declared here.
fields = [
    StructField("age", DoubleType(), True),
    StructField("job", StringType(), True),
    StructField("marital", StringType(), True),
    StructField("education", StringType(), True),
    StructField("default", StringType(), True),
    StructField("balance", DoubleType(), True),
    StructField("housing", StringType(), True),
    StructField("loan", StringType(), True),
    StructField("contact", StringType(), True),
    StructField("day", StringType(), True),
    StructField("month", StringType(), True),
    StructField("duration", DoubleType(), True),
    StructField("campaign", DoubleType(), True),
    StructField("pdays", DoubleType(), True),
    StructField("previous", DoubleType(), True),
    StructField("poutcome", StringType(), True),
]

custom_schema=StructType(fields)

In [34]:
# The file is semicolon-separated, so the delimiter must be set explicitly
df_opt1 = spark \
    .read \
    .option("header", "true") \
    .option("sep", ";") \
    .schema(custom_schema) \
    .csv(bank_data)

In [35]:
df_opt1.printSchema()

root
 |-- age: double (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: double (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: double (nullable = true)
 |-- campaign: double (nullable = true)
 |-- pdays: double (nullable = true)
 |-- previous: double (nullable = true)
 |-- poutcome: string (nullable = true)



In [18]:
# If we wanted to infer the schema (the semicolon delimiter is still required)
df_opt2 = spark \
    .read \
    .option("header", "true") \
    .option("sep", ";") \
    .option("inferSchema", "true") \
    .csv(bank_data)

### Report DataFrame column names and their number

In [37]:
df_opt1.columns

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome']

In [38]:
print(len(df_opt1.columns))

16


### Report the number of records in the dataset

In [41]:
df_opt1.count()

45211

## Advanced Analysis

#### 1. Compute Descriptive Statistics

In [24]:
df_opt1.select('age').describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|             45211|
|   mean| 40.93621021432837|
| stddev|10.618762040975408|
|    min|              18.0|
|    max|              95.0|
+-------+------------------+



#### 2. Compute the number of people by age bin

In [52]:
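# groupBy() accepts only columns, so there is no 'bin' argument: derive the bin as a column expression.
# Each bin is labelled by the lower bound of a 10-year range (the width is an assumption).
from pyspark.sql import functions as F
df_opt1.groupBy((F.floor(F.col('age') / 10) * 10).alias('age_bin')).count().orderBy('age_bin').show()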