---
# Lab Number : 1

## Title : *Data Analysis with Spark* 

## Goal : 

Getting familiar with the Spark workflow  

## Help:

1. Spark Programming Guide : https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
2. Spark API reference : https://spark.apache.org/docs/latest/api/python/index.html

## Datasets reference:

https://archive.ics.uci.edu/ml/datasets/bank+marketing

## Input Datasets:

* bank-full.csv


## Datasets local path:

* /spark-course/data/bank/

## Reading (not needed yet):

[Moro et al., 2014](https://www.researchgate.net/publication/260805594_A_Data-Driven_Approach_to_Predict_the_Success_of_Bank_Telemarketing) A Data-Driven Approach to Predict the Success of Bank Telemarketing.

---

## Lab Specific Tasks


### Basic Analysis

 * Create a SparkSession
 * Load the dataset : Bank Products Marketing (**tip:** use the `inferSchema` option)
 * Inspect the dataset and analyze its structure (schema)
 * Report the number of columns and their names
 * Report the number of records in dataset

### Advanced Analysis

**Warning:** you will need to convert some columns from string to a numeric type (float, double).

 * Compute descriptive statistics (count, max, min, average, median, stddev) on numeric columns where applicable.
 * Compute the number of people per age bin
 * Compute the mean number of contacts performed during this campaign for each age range.
 * Investigate and quantify correlations (if any) between features 
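
The binning idea behind the two age-range tasks can be sketched in plain Python before writing it in Spark. This is only an illustration under stated assumptions: the helper `age_bin`, the bin width of 10 years, and the `records` list are hypothetical and not part of the lab material.

```python
from collections import defaultdict


def age_bin(age, width=10):
    """Return a label such as '30-39' for the bin that contains `age`."""
    low = int(age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical (age, campaign) pairs, made up for illustration only
records = [(58, 1), (44, 2), (33, 1), (47, 4), (35, 3)]

counts = defaultdict(int)   # number of people per age bin
totals = defaultdict(int)   # sum of campaign contacts per age bin
for age, campaign in records:
    label = age_bin(age)
    counts[label] += 1
    totals[label] += campaign

# Mean number of contacts per age bin
mean_campaign = {label: totals[label] / counts[label] for label in counts}

for label in sorted(counts):
    print(label, counts[label], mean_campaign[label])
```

In Spark, a comparable result could be obtained by deriving a bin column and grouping on it with `groupBy(...).agg(...)`, using `count` and `avg` from `pyspark.sql.functions`.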

### Save and Report Your Results

Once you're finished, save your notebook:

1. Go to File -> Save and Checkpoint
2. **Note:** all **Basic** and **Advanced** bulleted tasks above are compulsory and must have corresponding results in your notebook.
3. Email your saved notebook (the .ipynb file) to the professor (aabreua@faculty.ie.edu)

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

spark = SparkSession \
        .builder \
        .appName("Lab1") \
        .getOrCreate()

sc = spark.sparkContext

#### Option: Infer Schema from data

In [6]:
datasets_path='/spark-course/data/bank/'
bank_data=datasets_path+'bank-full.csv'
# Use it to load some data
df= spark \
    .read \
    .option("header","true") \
    .option("inferSchema","true") \
    .csv(bank_data)

In [7]:
# What is df ?
df

DataFrame["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string]

In [8]:
# OK, but this is not very telling; we also want to see some of the data
df.take(5)

[Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='47;"blue-collar";"married";"unknown";"no";1506;"yes";"n

In [9]:
# You can see that a Spark DataFrame is actually a Dataset[Row] abstraction
# Let's analyze some data
# First let's check the schema
df.printSchema()

root
 |-- "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string (nullable = true)



In [6]:
# Something is odd here: there is only the 'root' node and a single flat leaf,
# with the whole line recorded as one string, even values that are certainly numeric.
# The file separates fields with ';' while the CSV reader assumes ',' by default.
# So let's provide the schema ourselves.
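
To see why the whole header ended up in a single column, the sketch below parses the header line with Python's standard `csv` module, first with the default comma delimiter and then with `;`. It only illustrates the separator issue on one line copied from the column name printed above; it is not Spark code.

```python
import csv

# Header line as it appears in bank-full.csv (copied from the output above)
header_line = ('"age";"job";"marital";"education";"default";"balance";'
               '"housing";"loan";"contact";"day";"month";"duration";'
               '"campaign";"pdays";"previous";"poutcome";"y"')

# Default delimiter is ',': the line contains no commas, so we get one field
as_comma = next(csv.reader([header_line]))

# Telling the parser that fields are separated by ';' yields one field per column
as_semicolon = next(csv.reader([header_line], delimiter=';'))

print(len(as_comma), len(as_semicolon))
print(as_semicolon[:3])
```

In Spark's CSV reader the corresponding setting is the `sep` option, for example `.option("sep", ";")`.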

#### Option: Manually Specify data schema

In [10]:
# we can specify the schema ourselves
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
fields = [ \
          StructField("age", DoubleType(), True), \
          StructField("job", StringType(), True), \
          StructField("marital", StringType(), True), \
          StructField("education", StringType(), True), \
          StructField("default", StringType(), True), \
          StructField("balance", DoubleType(), True), \
          StructField("housing", StringType(), True), \
          StructField("loan", StringType(), True), \
          StructField("contact", StringType(), True), \
          StructField("day", StringType(), True), \
          StructField("month", StringType(), True), \
          StructField("duration", DoubleType(), True), \
          StructField("campaign", DoubleType(), True), \
          StructField("pdays", DoubleType(), True), \
          StructField("previous", DoubleType(), True), \
          StructField("poutcome", StringType(), True)]

custom_schema=StructType(fields)

In [11]:
# The file uses ';' as field separator, so tell the reader about it
mdf= spark \
    .read \
    .option("header","true") \
    .option("sep",";") \
    .schema(custom_schema) \
    .csv(bank_data)

In [12]:
mdf.printSchema()

root
 |-- age: double (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: double (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: double (nullable = true)
 |-- campaign: double (nullable = true)
 |-- pdays: double (nullable = true)
 |-- previous: double (nullable = true)
 |-- poutcome: string (nullable = true)



In [13]:
# This looks better: every column now has its own name and type
# Open question: what about inferring the schema instead?

In [14]:
mdf.select('age').describe().show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|             45211|
|   mean| 40.93621021432837|
| stddev|10.618762040975408|
|    min|              18.0|
|    max|              95.0|
+-------+------------------+
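
`describe()` reports count, mean, stddev, min and max, but not the median that the lab tasks also ask for. As a minimal sketch of what each figure means, the block below computes the same summary plus the median with Python's `statistics` module. The `ages` list is a tiny sample (the ages of the first four records shown earlier), so the numbers differ from the full-dataset table above.

```python
import statistics

# Ages of the first four records shown earlier; a tiny sample for illustration only
ages = [58.0, 44.0, 33.0, 47.0]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),    # sample standard deviation (divides by n - 1)
    "min": min(ages),
    "max": max(ages),
    "median": statistics.median(ages),   # not part of describe()
}

for name, value in summary.items():
    print(f"{name:>7}: {value}")
```

On the Spark side, one way to estimate a median is `DataFrame.approxQuantile`, for example `mdf.approxQuantile('age', [0.5], 0.01)`; treat this as a pointer rather than the required solution.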



In [17]:
# What about using the RDD API?

In [18]:
rdd = sc.textFile(bank_data)

In [19]:
rdd.take(5)

['"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
 '58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"',
 '44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"',
 '33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"',
 '47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"']

In [20]:
# The list above has one element per raw line of the file; within each line the fields are separated by ';'

In [95]:
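# Note the [:-1]: we keep all columns except the last one.
# We don't want the "y" variable since our schema does not include it.
# We also skip the header line and cast the numeric fields to float,
# so that the rows match the DoubleType fields of custom_schema.
header = rdd.first()
numeric_idx = {0, 5, 11, 12, 13, 14}  # age, balance, duration, campaign, pdays, previous
raw_rdd = rdd \
    .filter(lambda x : x != header) \
    .map(lambda x : x.replace('"','').split(';')[:-1]) \
    .map(lambda x : [float(v) if i in numeric_idx else v for i, v in enumerate(x)])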

In [101]:
raw_rdd.toDebugString()

b'(2) PythonRDD[157] at RDD at PythonRDD.scala:48 []\n |  /spark-course/data/bank/bank-full.csv MapPartitionsRDD[35] at textFile at NativeMethodAccessorImpl.java:0 []\n |  /spark-course/data/bank/bank-full.csv HadoopRDD[34] at textFile at NativeMethodAccessorImpl.java:0 []'

In [102]:
# sqlContext is not defined in this notebook; use the SparkSession created above
my_df = spark.createDataFrame(raw_rdd,schema=custom_schema)

In [110]:
my_df.dtypes

[('age', 'double'),
 ('job', 'string'),
 ('marital', 'string'),
 ('education', 'string'),
 ('default', 'string'),
 ('balance', 'double'),
 ('housing', 'string'),
 ('loan', 'string'),
 ('contact', 'string'),
 ('day', 'string'),
 ('month', 'string'),
 ('duration', 'double'),
 ('campaign', 'double'),
 ('pdays', 'double'),
 ('previous', 'double'),
 ('poutcome', 'string')]