##### Grading Feedback

# Question 0 (-2 If not answered)
Please provide the following data so we can verify your GitHub information and ensure accurate grading:
- Your Name: Chaithra Kopparam Cheluvaiah
- Your SU ID: 326926205

# IST 718: Big Data Analytics

- Professors: 
  - Willard Williamson <wewillia@syr.edu>
  - Emory Creel <emcreel@g.syr.edu>
- Faculty Assistants: 
  - Warren Justin Fernandes <wjfernan@syr.edu>
  - Ruchita Hiteshkumar Harsora <rharsora@g.syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets from the internet are allowed.  Code from the class textbooks or class provided code can be copied in its entirety.__
- Google Colab is the official class runtime environment so you should test your code on Colab before submission.
- Do not modify cells marked as grading cells or marked as do not modify.
- Before submitting your work, remember to check for runtime errors with the following procedure:
Runtime $\rightarrow$ Factory reset runtime, followed by Runtime $\rightarrow$ Run All.  All runtime errors will result in a minimum penalty of half off.
- All plots shall include descriptive title and axis labels.  Plot legends shall be included where possible.  Unless stated otherwise, plots can be made using any Python plotting package.
- Grading feedback cells are there for graders to provide feedback to students.  Don't change or remove grading feedback cells.
- Don't add or remove files from your git repo.
- Do not change file names in your repo.  This also means don't change the title of the ipython notebook.
- You are free to add additional code cells around the cells marked `your code here`.
- import * is not allowed because it is considered a very bad coding practice and in some cases can result in a significant delay (which slows down the grading process) in loading imports.  For example, the statement `from sympy import *` is not allowed.  You must import the specific packages that you need. 
- The graders reserve the right to deduct points for subjective things we see with your code.  For example, if we ask you to create a pandas data frame to display values from an investigation and you hard code the values, we will take points off for that.  This is only one of many different things we could find in reviewing your code.  In general, write your code like you are submitting it for a code peer review in industry.  
- Level of effort is part of our subjective grading.  For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements.  In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort.  We feel that the students who did a better job deserve a better grade.  We reserve the right to invoke level of effort grading at any time.
- Your notebook must run from start to finish without requiring manual input by the graders.  For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps.  In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.
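The import rule in the instructions above can be illustrated with a small sketch. It uses the standard-library `math` module purely as a stand-in (an assumption; the instructions name `sympy`) to compare how many names an explicit import and a wildcard import would add to the namespace.

```python
import math
from math import sqrt  # explicit import: exactly one name enters the namespace

# Public names that `from math import *` would add to the namespace.
# `math` defines no __all__, so a wildcard import takes every name
# that does not start with an underscore.
wildcard_names = [name for name in dir(math) if not name.startswith("_")]

print(f"explicit import adds 1 name; wildcard import would add {len(wildcard_names)}")
print(sqrt(16))
```

With the explicit form, a reader can tell where `sqrt` comes from, and no unrelated names can silently shadow variables defined elsewhere in the notebook.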

I was very disappointed with the linear regression model accuracy related to the insurance data set in homework 3.  In this homework, we will revisit the insurance data set and try to improve prediction scores.  Specifically, we will use random forest, gradient boosting trees, and deep learning to see if we can improve upon the scores achieved in homework 3.  Part 1 of the assignment will explore random forest and GBT.  Part 2 of the assignment will use deep learning.

In [1]:
# Grading Cell
enable_grid_search = False

The following cell is used to read the insurance data set into the colab environment.  Do not change or modify the following cell.

In [2]:
%%bash
# Do not change or modify this cell
# Need to install pyspark
# if pyspark is already installed, will print a message indicating pyspark already installed
pip install pyspark &> /dev/null

# Download the data files from github
# If the data file does not exist in the colab environment
data_file_1=insurance.csv

if [[ ! -f ./${data_file_1} ]]; then 
   # download the data file from github and save it in this colab environment instance
   wget https://raw.githubusercontent.com/wewilli1/ist718_data/master/${data_file_1} &> /dev/null
fi

In [3]:
#creating spark session and spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ist718-hw06-deeplearning').getOrCreate()
sc = spark.sparkContext

In [4]:
spark # checking the spark version

In [5]:
#loading the required libraries

# spark libraries
import pyspark.sql.functions as f
from pyspark.ml import Pipeline, feature, classification
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# non-spark libraries
import pandas as pd
import numpy as np

Your grade for grid search problems in this assignment will be determined in part on level of effort and your model performance results as compared to other students in the class.

# Question 0 (0 pts)
Copy the hard coded MSE scores from part 1 question 9 below (replace the code below from part 1 question 9).

In [6]:
# uncomment and hard code the following variables using output from above.  
# You can copy this code for use in part 2
hc_rf_train_mse = 16477454.74
hc_rf_validation_mse = 14688149.02
hc_gbt_train_mse = 16861956.02
hc_gbt_validation_mse = 16082586.13

# logistic regression AUC scores from HW-03
hc_lr_train_auc = 0.944161973667618
hc_lr_validation_auc= 0.9662866844530674

# Question 1 (0 pts)
- This question is worth 0 points because you can just copy your code from part 1 question 1.  
- Read the insurance data file into a spark data frame named `medical_df`.  Drop any rows that contain NAN / Null values.  Check the schema and fix if needed.  Perform needed feature engineering using **only** a string indexer to get ready for training decision trees.  One hot encoding is not needed for random forest - do not use one hot encoding or any other transformations other than string indexing. 
- Split the data into variables named exactly train, test, and validation. Set the spark randomSplit seed argument to 2019.
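As a rough illustration of what string indexing does, the sketch below mimics the mapping in plain Python. It assumes the default `frequencyDesc` ordering of Spark's `StringIndexer` (most frequent label gets index 0.0, ties broken alphabetically); the helper `string_index` and the toy column are hypothetical and not part of the assignment.

```python
from collections import Counter

def string_index(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column, not taken from insurance.csv.
smoker = ["no", "yes", "no", "no", "yes"]
mapping = string_index(smoker)
indexed = [mapping[value] for value in smoker]

print(mapping)
print(indexed)
```

Tree-based models split on thresholds, so an arbitrary but consistent numeric code per category is enough for them, which is why one hot encoding is not required here.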

In [7]:
# reading insurance data into spark dataframe
medical_df = spark.read.csv('insurance.csv', header=True, inferSchema=True)

# dropping any rows that contain NAN/Null values
medical_df = medical_df.dropna()

# feature engineering - converting categorical data (sex, smoker, and region) into numerical data using String Indexer
feature_encoding = feature.StringIndexer(inputCols=['sex', 'smoker', 'region'], 
                                         outputCols=['sex_indexed', 'smoker_indexed', 'region_indexed'])\
                                         .fit(medical_df)

medical_df = feature_encoding.transform(medical_df)

# binarizing charges at the median to create the binary label rate_pool (1.0 if charges > median, else 0.0)
median_charges = medical_df.approxQuantile('charges',probabilities=[0.5],relativeError=0)
median_discretizer = feature.Binarizer(threshold=median_charges[0], inputCol='charges', outputCol='rate_pool')
medical_df= median_discretizer.transform(medical_df)
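The median thresholding in the cell above can be sketched in plain Python. This is only an illustration: `statistics.median_low` stands in for `approxQuantile`, the `binarize` helper is hypothetical, and the charges are made-up values rather than rows from the data set.

```python
from statistics import median_low

def binarize(values, threshold):
    """Return 1.0 for values strictly above the threshold, otherwise 0.0."""
    return [1.0 if value > threshold else 0.0 for value in values]

# Made-up charges for illustration only.
charges = [1000.0, 2500.0, 4000.0, 12000.0, 30000.0]
threshold = median_low(charges)
rate_pool = binarize(charges, threshold)

print(threshold)
print(rate_pool)
```

Because the comparison is strict, a row whose charge equals the median lands in the 0.0 class. Splitting at the median gives two classes of roughly equal size, which makes the problem a balanced binary classification task.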

In [8]:
# splitting the data into train, test, and validation
train, test, validation = medical_df.randomSplit(weights=[0.9, 0.05, 0.05], seed=2019)
(train.count(), test.count(), validation.count())

(1192, 65, 81)
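The split sizes above are not exactly 90/5/5 of the 1338 rows because a random split assigns each row by sampling rather than by exact counts. The following plain-Python sketch shows the idea under that assumption; the `random_split` helper is hypothetical, it uses Python's own random generator, and it will not reproduce Spark's counts.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds in [0, 1], e.g. about [0.9, 0.95, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    last = len(bounds) - 1
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also absorbs any floating point shortfall in the bounds.
            if draw < upper or i == last:
                splits[i].append(row)
                break
    return splits

rows = list(range(1338))
train_rows, test_rows, validation_rows = random_split(rows, [0.9, 0.05, 0.05], seed=2019)
sizes = (len(train_rows), len(test_rows), len(validation_rows))
print(sizes)
```

Fixing the seed makes the assignment reproducible from run to run, which is why the assignment asks for a specific seed value.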

In [9]:
#Print the schema
medical_df.printSchema()
#Print the shape (computed in Spark, without collecting the data to pandas)
print('The shape of the dataframe is:', (medical_df.count(), len(medical_df.columns)))
# print the head
medical_df.show()

root
 |-- age: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- bmi: double (nullable = true)
 |-- children: integer (nullable = true)
 |-- smoker: string (nullable = true)
 |-- region: string (nullable = true)
 |-- charges: double (nullable = true)
 |-- sex_indexed: double (nullable = false)
 |-- smoker_indexed: double (nullable = false)
 |-- region_indexed: double (nullable = false)
 |-- rate_pool: double (nullable = true)

The shape of the dataframe is: (1338, 11)
+---+------+------+--------+------+---------+-----------+-----------+--------------+--------------+---------+
|age|   sex|   bmi|children|smoker|   region|    charges|sex_indexed|smoker_indexed|region_indexed|rate_pool|
+---+------+------+--------+------+---------+-----------+-----------+--------------+--------------+---------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|        1.0|           1.0|           2.0|      1.0|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|        0.0|  

##### Grading Feedback Cell

The following questions will use deep learning.  The goal is to see if we can improve upon the linear regression score from homework 3, and also compare MSE scores between deep learning and random forest / GBT.  The spark documentation for the multilayer perceptron classifier can be found [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.MultilayerPerceptronClassifier.html).

# Question 2 (10 pts)
Create and train a spark multilayer perceptron model using a grid search in the cell below.  Score your model using MSE.  You are free to use K-Fold Cross validation if you wish.  Your grid search must be entirely encapsulated in the `if enable_grid_search` if statement.  The `enable_grid_search` Boolean is defined in a grading cell above.  You will disable the grid search before you submit by setting `enable_grid_search` to False.  Setting `enable_grid_search` to False should not result in a runtime error.  You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the `enable_grid_search` variable to False.
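The guard requirement can be sketched without Spark. In the minimal example below, `toy_score` is a hypothetical stand-in for fitting a model and returning its validation error, and the grid values are arbitrary; the point is only that everything belonging to the search sits inside the `if`, while hard-coded defaults keep later code runnable when the flag is off.

```python
from itertools import product

def toy_score(step_size, block_size):
    """Hypothetical stand-in for 'fit a model and return its validation error'."""
    return abs(step_size - 0.01) + abs(block_size - 64) / 1000

def choose_params(enable_grid_search):
    # Hard-coded defaults keep later cells runnable when the search is disabled.
    best = {"step_size": 0.01, "block_size": 64}
    if enable_grid_search:
        # Every name used only by the search is created inside the guard.
        grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))
        scores = [toy_score(s, b) for s, b in grid]
        step_size, block_size = grid[scores.index(min(scores))]
        best = {"step_size": step_size, "block_size": block_size}
    return best

print(choose_params(False))
print(choose_params(True))
```

Nothing after the guard refers to `grid` or `scores`, so disabling the search cannot trigger a `NameError` further down.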

In [10]:
if enable_grid_search:
  # building MLP
  mlp_pipe = Pipeline(stages=[feature.VectorAssembler(inputCols=['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'], 
                                                      outputCol='mlp_features'),
                              classification.MultilayerPerceptronClassifier(featuresCol='mlp_features', layers=[6,10,10,10,10,10,2], 
                                                                            labelCol='rate_pool', seed=12072022)])
  
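  # grid search over step size and block size
  # (stepSize is documented as applying only to solver='gd'; with the default 'l-bfgs' solver it may have no effect)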
  mlp_grid = ParamGridBuilder().addGrid(mlp_pipe.getStages()[-1].stepSize, [0.00000001, 0.00001, 0.0001, 0.01, 0.1, 0.5, 0.9])\
                                .addGrid(mlp_pipe.getStages()[-1].blockSize, [32, 64, 128, 256, 512]).build()

  # training all the combination of models
  all_mlp_models = []
  for index, grid in enumerate(mlp_grid):
    print(f'Fitting the MLP Model {index}')
    mlp_model = mlp_pipe.fit(train, grid)
    all_mlp_models.append(mlp_model)

  # binary classification evaluator
  mlp_evaluator = BinaryClassificationEvaluator(labelCol=mlp_pipe.getStages()[-1].getLabelCol(),
                                                rawPredictionCol= mlp_pipe.getStages()[-1].getPredictionCol(),
                                                metricName='areaUnderROC')

  # evaluating the models on the test split (the evaluator returns AUC, not MSE)
  mlp_auc_scores = [mlp_evaluator.evaluate(mlp_model.transform(test)) for mlp_model in all_mlp_models]
  print(np.round(mlp_auc_scores,2))

  # higher AUC is better, so the best model is the argmax
  best_mlp_model_index = np.argmax(mlp_auc_scores)
  print(f'best MLP model index = {best_mlp_model_index}')

  # hyperparameter values of the best model
  print(mlp_grid[best_mlp_model_index].values())
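To get a feel for the size of the network defined by `layers=[6,10,10,10,10,10,2]`, the sketch below counts its trainable parameters. It assumes every layer is fully connected with one bias per output unit; the helper name `mlp_parameter_count` is hypothetical.

```python
def mlp_parameter_count(layers):
    """Count weights plus biases for a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [6, 10, 10, 10, 10, 10, 2]  # 6 input features, five hidden layers, 2 output classes
parameter_count = mlp_parameter_count(layers)
print(parameter_count)  # 532
```

The first entry must match the number of assembled features and the last entry the number of classes. Under these assumptions the network has 532 parameters, which is sizeable next to the 1192 training rows.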

##### Grading Feedback Cell

# Question 3 (10 pts)
Create a pipeline named `best_mlp_pipe` that hard codes the tuning parameters from the best model found by the grid search in question 2 above.  Train and test best_mlp_pipe.  Score your model using MSE.  Do not use k-fold cross validation in this question.  Clearly print the resulting **train and test MSE** for `best_mlp_pipe` so it's easy for the graders to see your resulting MSEs.  Save train and test MSE scores in variables named mlp_train_mse and mlp_validation_mse.
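For reference, MSE on a binary label can be sketched as follows. The labels and predictions are made up, and `mean_squared_error` is a hypothetical helper; the example only illustrates that for 0/1 labels and hard 0/1 predictions, MSE coincides with the misclassification rate.

```python
def mean_squared_error(labels, predictions):
    """Average of squared differences between labels and predictions."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and the same length")
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

# Made-up values for illustration only.
labels = [1.0, 0.0, 1.0, 1.0, 0.0]
predictions = [1.0, 0.0, 0.0, 1.0, 0.0]

mse = mean_squared_error(labels, predictions)
error_rate = sum(1 for y, p in zip(labels, predictions) if y != p) / len(labels)
print(mse, error_rate)
```

This also means an MSE computed on the binary `rate_pool` label is on a different scale from an MSE computed on raw charges, so the two kinds of score are not directly comparable.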

In [11]:
# hyper parameters
LEARNING_RATE = 0.00000001
BATCH_SIZE = 32

# building the model
best_mlp_pipe = Pipeline(stages=[feature.VectorAssembler(inputCols=['age', 'sex_indexed', 'bmi', 'children', 'smoker_indexed', 'region_indexed'], 
                                                      outputCol='mlp_features'),
                                 classification.MultilayerPerceptronClassifier(featuresCol='mlp_features', layers=[6,10,10,10,10,10,2], labelCol='rate_pool',
                                                                               stepSize=LEARNING_RATE, blockSize= BATCH_SIZE, seed=12072022)])
best_mlp_model = best_mlp_pipe.fit(train)

# model evaluation
best_mlp_evaluator = BinaryClassificationEvaluator(labelCol=best_mlp_pipe.getStages()[-1].getLabelCol(),
                                                   rawPredictionCol= best_mlp_pipe.getStages()[-1].getPredictionCol(),
                                                   metricName='areaUnderROC')

mlp_train_auc = best_mlp_evaluator.evaluate(best_mlp_model.transform(train))
mlp_validation_auc = best_mlp_evaluator.evaluate(best_mlp_model.transform(validation))

print("mlp_train_auc =", mlp_train_auc)
print("mlp_validation_auc =", mlp_validation_auc)

mlp_train_auc = 0.9465411992611614
mlp_validation_auc = 0.9761904761904762
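The evaluator above is given the hard 0/1 prediction column as its score. Assuming that column is used directly, the ROC curve has a single interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. A minimal sketch with made-up labels (the helper name is hypothetical):

```python
def auc_from_hard_predictions(labels, predictions):
    """Area under the ROC curve through (0, 0), (FPR, TPR), and (1, 1)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    false_pos = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tpr = true_pos / positives
    fpr = false_pos / negatives
    # Two trapezoids under the piecewise linear curve simplify to this expression.
    return (tpr + 1 - fpr) / 2

# Made-up values for illustration only.
labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]
auc = auc_from_hard_predictions(labels, predictions)
print(auc)  # 0.75
```

Scoring with class probabilities instead of hard predictions would generally give a different, finer-grained AUC.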


##### Grading Feedback Cell

# Question 4 (10 pts)
Create a pandas dataframe named `rf_gbt_mlp_mse_compare` which contains 3 columns: Model, Train MSE, and Test MSE.  Load the Model column with "RF", "GBT", or "MLP".  Load the train and validation score columns using the model train and validation scores (hc_rf_train_mse, hc_rf_validation_mse, hc_gbt_train_mse, hc_gbt_validation_mse, mlp_train_mse, and mlp_validation_mse).

Deep learning might be able to produce better results than decision trees.  I am not sure if that will be the case for this dataset but you will be graded in comparison to other students in the class.

In [12]:
rf_gbt_mlp_mse_compare = pd.DataFrame(data=list(zip(['Logistic Regression',  'MLP'],  
                                                    [hc_lr_train_auc, mlp_train_auc ], 
                                                    [hc_lr_validation_auc, mlp_validation_auc])), 
                                  columns=['Model', 'Train AUC', 'Validation AUC'])

In [13]:
# Grading Cell Do Not Modify
display(rf_gbt_mlp_mse_compare)

| | Model | Train AUC | Validation AUC |
|---|---|---|---|
| 0 | Logistic Regression | 0.944162 | 0.966287 |
| 1 | MLP | 0.946541 | 0.97619 |


##### Grading Feedback Cell

# Question 5 (-5 pts if not performed)
Set the `enable_grid_search` Boolean variable to False in the grading cell at the top of this notebook.  Perform a __Runtime -> Disconnect and Delete Runtime__, __Runtime -> Run all__ test to verify there are no runtime errors.  Leave the `enable_grid_search` variable set to False and turn in your assignment.