---
# Lab Number : 1

## Title : *Data Analysis with Spark* 

## Goal : 

Getting Familiar with Spark workflow  

## Help:

1. Spark Programming Guide : https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
2. Spark API reference : https://spark.apache.org/docs/latest/api/python/index.html

## Datasets reference:

https://archive.ics.uci.edu/ml/datasets/bank+marketing

## Input Datasets:

* bank.csv
* bank-full.csv
* bank-additional.csv


## Datasets local path:

* /spark-course/data/bank/

## Reading (not yet required):

[Moro et al., 2014](https://www.researchgate.net/publication/260805594_A_Data-Driven_Approach_to_Predict_the_Success_of_Bank_Telemarketing) A Data-Driven Approach to Predict the Success of Bank Telemarketing.

---

## Lab Specific Tasks


### Basic Analysis

 * Create a SparkSession
 * Load the dataset : Bank Products Marketing (**tip:** use the `inferSchema` option)
 * Inspect the dataset and analyze its structure (schema)
 * Report the number of columns and their names
 * Report the number of records in dataset

### Advanced Analysis

 * Compute descriptive statistics (count, max, min, average, median, stddev) on numeric columns where applicable. **Warning:** you will need to convert some columns from string to a numeric type (float, double).
 * Compute descriptive statistics for each 'job_type' and 'marital_status' (the `job` and `marital` columns in the data)
 * Compute the mean number of contacts performed during this campaign (the `campaign` column) for each age range.
 * Investigate and quantify correlations (if any) between features 

### Save and Report Your Results

Once you're finished, save your notebook:

1. Go to File -> Save and Checkpoint
2. **Note:** all **Basic** and **Advanced** bulleted tasks above are compulsory and must have corresponding results in your notebook.
3. Email your saved notebook (the .ipynb file) to the professor (aabreua@faculty.ie.edu)

In [33]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

spark = SparkSession \
        .builder \
        .appName("Lab1") \
        .getOrCreate()

sc = spark.sparkContext

#### Option: Infer Schema from data

In [45]:
datasets_path='/spark-course/data/bank/'
bank_data=datasets_path+'bank.csv'
# Use it to load some data
df= spark \
    .read \
    .option("header","true") \
    .option("inferSchema","true") \
    .csv(bank_data)

In [46]:
# What is df ?
df

DataFrame["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string]

In [47]:
# OK, but this is not very telling; we also want to see some of the data
df.take(5)

[Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"management";"married";"tertiary";"no";1476;"y

In [48]:
# You can see how a Spark DataFrame is actually a Dataset[Row] abstraction
# Let's analyze some data
# First let's check the schema
df.printSchema()

root
 |-- "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string (nullable = true)



In [6]:
# Something is odd here: there is only the 'root' node and a single flat leaf,
# with the whole header read as one string column, even fields that are certainly numeric.
# So let's provide the schema ourselves.

#### Option: Manually Specify data schema

In [27]:
# We can specify the schema ourselves.
# Note: the file header lists 17 columns; the last one, "y", is not declared here.
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

fields = [
    StructField("age", DoubleType(), True),
    StructField("job", StringType(), True),
    StructField("marital", StringType(), True),
    StructField("education", StringType(), True),
    StructField("default", StringType(), True),
    StructField("balance", DoubleType(), True),
    StructField("housing", StringType(), True),
    StructField("loan", StringType(), True),
    StructField("contact", StringType(), True),
    StructField("day", StringType(), True),
    StructField("month", StringType(), True),
    StructField("duration", DoubleType(), True),
    StructField("campaign", DoubleType(), True),
    StructField("pdays", DoubleType(), True),
    StructField("previous", DoubleType(), True),
    StructField("poutcome", StringType(), True),
]

custom_schema = StructType(fields)

In [49]:
mdf= spark \
    .read \
    .option("header","true") \
    .schema(custom_schema) \
    .csv(bank_data)

In [50]:
mdf.printSchema()

root
 |-- age: double (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: double (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: double (nullable = true)
 |-- campaign: double (nullable = true)
 |-- pdays: double (nullable = true)
 |-- previous: double (nullable = true)
 |-- poutcome: string (nullable = true)



In [51]:
# This looks better than the inferred schema.
# But do the values actually parse? Let's check a few columns.

In [52]:
mdf.select('age').describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|              4521|
|   mean| 41.17009511170095|
| stddev|10.576210958711263|
|    min|              19.0|
|    max|              87.0|
+-------+------------------+



In [53]:
mdf.select('duration').describe().show()

+-------+--------+
|summary|duration|
+-------+--------+
|  count|       0|
|   mean|    null|
| stddev|    null|
|    min|    null|
|    max|    null|
+-------+--------+



In [12]:
datasets_path='/spark-course/data/bank/'
bank_data=datasets_path+'bank.csv'

[Row(age=30.0, job=None, marital=None, education=None, default=None, balance=None, housing=None, loan=None, contact=None, day=None, month=None, duration=None, campaign=None, pdays=None, previous=None, poutcome=None),
 Row(age=33.0, job=None, marital=None, education=None, default=None, balance=None, housing=None, loan=None, contact=None, day=None, month=None, duration=None, campaign=None, pdays=None, previous=None, poutcome=None),
 Row(age=35.0, job=None, marital=None, education=None, default=None, balance=None, housing=None, loan=None, contact=None, day=None, month=None, duration=None, campaign=None, pdays=None, previous=None, poutcome=None),
 Row(age=30.0, job=None, marital=None, education=None, default=None, balance=None, housing=None, loan=None, contact=None, day=None, month=None, duration=None, campaign=None, pdays=None, previous=None, poutcome=None),
 Row(age=59.0, job=None, marital=None, education=None, default=None, balance=None, housing=None, loan=None, contact=None, day=None, 

In [54]:
df.select('duration').describe().show()

AnalysisException: 'cannot resolve \'`duration`\' given input columns: ["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""];;\n\'Project [\'duration]\n+- Relation["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""#599] csv\n'

In [18]:
rdd = sc.textFile(bank_data)

In [19]:
rdd.take(5)

['"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
 '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"',
 '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"',
 '35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"',
 '30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"no"']

In [23]:
rdd2=rdd.flatMap(lambda x : x.split(";")).map(lambda x : x.replace('"', ''))

In [25]:
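from pyspark.sql.functions import col  # note: `rawdata` (with 'house name' and 'price' columns) is not defined in this notebook
df = rawdata.select(col('house name'), rawdata.price.cast('float').alias('price'))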

76874