# Predict Liver Failure based on People's Demographics

***(Model Definition, Training and Evaluation Notebook)***

## 4. Model Definition, Training and Evaluation

In this notebook, we will select the model that is most appropriate for our use case: accurately predicting the possibility of liver failure in individuals, based on the demographic and health information gathered by the JPAC Center for Health Diagnosis and Control.

### 4.1. Load Feature Engineered Data from Data Store

Let us start by loading the data from the IBM data store into this notebook for further processing. We will connect to the object store, read a Parquet file and create a dataframe from it. Once the dataframe is registered as a temporary view, we can query it with Spark SQL like a database table.

In [None]:
# import required packages and libraries
import types
import pandas as pd
import numpy as np
import ibmos2spark

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190405085903-0002
KERNEL_ID = f22aa495-7cf8-4d49-8bdd-9402d6b8ebca


In [None]:
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'api_key': 'yR6pr44dLxKcEe_-J-YBRKtI9LaoOcG9v_c2zK_I1epP',
    'service_id': 'iam-ServiceId-dd08a5f3-28d2-4f87-bc12-4ec0662689f2',
    'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token'}

configuration_name = 'os_85bf8a7fa4e54387abd3bbb49b9490af_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_data = spark.read.parquet(cos.url('ALF_Normalized.parquet', 'fundamentalsofscalabledatascience-donotdelete-pr-qbkdskud4vsck0'))
print("Number of records = ", df_data.count(), "\n")
df_data.createOrReplaceTempView('alf_data')
df_data.show()

Number of records =  5221 

+---+--------------------+
|ALF|            features|
+---+--------------------+
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,0.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|1.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,0.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
+---+--------------------+
only showing top 20 rows



Let us rename the column ALF to label.

In [None]:
df_data_new = df_data.withColumnRenamed('ALF', 'label')
df_data_new.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,0.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,0.0,1.0,1.0,...|
|  0.0|[0.5,0.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,0.0,1.0,1.0,...|
|  0.0|[0.5,0.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  1.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,0.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,0.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
|  0.0|[0.5,1.0,1.0,1.0,...|
+-----+--------------------+
only showing top 20 rows



Now let us split the data into training and test datasets - 80% training data and 20% test data.

In [None]:
splits = df_data_new.randomSplit([0.8, 0.2])
df_train = splits[0] # training dataset
df_test = splits[1] # test dataset
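
A note on reproducibility: `randomSplit` also accepts an optional `seed` argument (for example `df_data_new.randomSplit([0.8, 0.2], seed=42)`), without which the split changes on every run. The sketch below is a plain Python illustration, not Spark's actual implementation. It assumes each record is assigned by an independent random draw, which is why the split sizes are only approximately 80/20. The function name `bernoulli_split` and the seed value are made up for this example.

```python
import random

def bernoulli_split(records, train_fraction, seed):
    """Assign each record to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed gives the same split on every run
    train, test = [], []
    for record in records:
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test

records = list(range(5221))  # stand-in for the 5221 rows loaded above

train_a, test_a = bernoulli_split(records, 0.8, seed=42)
train_b, test_b = bernoulli_split(records, 0.8, seed=42)

# Sizes are close to, but usually not exactly, 80% and 20% of the records
print(len(train_a), len(test_a))
# The same seed reproduces exactly the same split
print(train_a == train_b and test_a == test_b)
```

With a fixed seed, the class counts and accuracy figures reported later in the notebook would stay the same between runs.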

Let us take a quick look at how the class label is split between the training and test data sets.

In [None]:
df_train.createOrReplaceTempView('df_train')
spark.sql("select label, count(*) from df_train group by label").show()

+-----+--------+
|label|count(1)|
+-----+--------+
|  0.0|    3915|
|  1.0|     281|
+-----+--------+



In [None]:
df_test.createOrReplaceTempView('df_test')
spark.sql("select label, count(*) from df_test group by label").show()

+-----+--------+
|label|count(1)|
+-----+--------+
|  0.0|     952|
|  1.0|      73|
+-----+--------+
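
The counts above show that label 1 is rare in both datasets. As a reference point for the accuracy figures later in this notebook, the following minimal sketch computes the accuracy that a trivial model would reach by always predicting the most frequent label. The helper name `majority_baseline` is an assumption made for this example; the counts are copied from the two tables above.

```python
def majority_baseline(class_counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(class_counts.values())
    majority_label = max(class_counts, key=class_counts.get)
    return majority_label, class_counts[majority_label] / total

# Class counts taken from the SQL outputs above
train_counts = {0.0: 3915, 1.0: 281}
test_counts = {0.0: 952, 1.0: 73}

for name, counts in (("train", train_counts), ("test", test_counts)):
    label, accuracy = majority_baseline(counts)
    print(f"{name}: always predicting {label} gives accuracy {accuracy:.4f}")
```

Any model we train should be judged against this baseline rather than against 50%.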



### 4.2. Choice of Model

In our use case we are trying to predict the possibility of an individual suffering liver failure. To model this, we have a dataset generated by the JPAC Center for Health Diagnosis and Control. It contains a set of features that can be used for our prediction, along with a target variable (label) whose binary value of 0 or 1 indicates the possibility of liver failure.

Given our use case and dataset, this is a Supervised Machine Learning task and, more precisely, a Binary Classification problem.

#### 4.2.1 Choice of Machine Learning Algorithm

We have established that we should use a Supervised Machine Learning Algorithm to define our model. The next task is to decide which Supervised Machine Learning Algorithm best suits our use case and dataset for predicting the possibility of liver failure in individuals.

    1. Linear Regression - This algorithm is used to predict a continuous value.
    2. Logistic Regression - This algorithm is used to predict the probability of a binary outcome instead of a continuous value.
    3. Naive Bayes - This is a classification algorithm for binary (two-class) and multi-class classification problems.
    4. Support Vector Machine - This is a model that can be used for classification and regression analysis; in its basic form it is a binary classifier.
    5. Gradient Boosted Trees - This is an ensemble of decision trees that can be used for both regression and classification problems.

We can see that we have more than one algorithm to choose from. For our use case, we will go ahead and use ***Gradient Boosted Trees*** because Gradient Boosting is one of the more powerful techniques for building predictive models.

#### 4.2.2 Choice of Deep Learning Algorithm

With respect to Deep Learning Algorithms, there is a wide range of algorithms to choose from. Based on our use case and dataset, we will use a ***Feed Forward Neural Network / Multi Layer Perceptron*** for our model. With a single sigmoid output unit it acts as a binary classifier, and its hidden layer lets it learn non-linear decision boundaries, which meets our needs for this use case.
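
To make the idea of a feed forward network concrete, here is a minimal sketch of a forward pass through one hidden ReLU layer followed by a sigmoid output, written in plain Python. The weights are hand-picked, hypothetical values (they are not learned and are not taken from our model), chosen so that the network separates an XOR-like pattern that no single linear boundary can separate.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> hidden ReLU layer -> single sigmoid output."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(z)

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output unit
hidden_weights = [[1.0, -1.0], [-1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [4.0, 4.0]
output_bias = -2.0

for x in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    p = mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
    print(x, round(p, 3), "-> class", int(p >= 0.5))
```

The output is a probability between 0 and 1; thresholding it at 0.5 gives the predicted class.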


### 4.3. Gradient Boosted Trees - Supervised Machine Learning Model

#### 4.3.1. Data Prep for Model

The data prep that we have done so far addresses the data needs for this model. Earlier in this notebook, we have also split the data into training and test datasets. This will be used for training and evaluating our model performance and for measuring the training and validation accuracy for our Supervised Machine Learning Model.

#### 4.3.2. Model Definition and Training

Let us now define our Supervised Machine Learning Model using the Gradient Boosted Trees Algorithm and train it using the training dataset.

In [None]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol='label', featuresCol='features', maxIter=20)

model = gbt.fit(df_train)
prediction = model.transform(df_train)
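
For intuition about what the `maxIter=20` boosting iterations do, the sketch below implements a heavily simplified form of gradient boosting in plain Python: each round fits a one-split tree (a "stump") to the residuals of the current ensemble and adds a scaled copy of it. This is an illustration under simplifying assumptions (squared-error loss, a single feature, a toy dataset), not the algorithm as implemented by Spark's `GBTClassifier`. The names `fit_stump` and `boost` and all data values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave records on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Build an additive ensemble of stumps, each fitted to current residuals."""
    base = sum(ys) / len(ys)  # start from the mean label
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data: one feature, label switches from 0 to 1 around 0.5
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

model_fn = boost(xs, ys)
print([round(model_fn(x)) for x in xs])
```

Each additional round corrects part of the error left by the previous rounds, which is why more iterations generally fit the training data more closely.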

#### 4.3.3. Model Evaluation

We will capture ***accuracy*** as a measure of evaluation for our model.

Let us now validate the training performance of our model against our training dataset. 

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setPredictionCol('prediction').setLabelCol('label')
print("GBT Training Accuracy")
print("---------------------")
evaluator.evaluate(prediction)

GBT Training Accuracy
---------------------


0.9611534795042898

The model performs well on our training dataset, with an accuracy of 0.96 (96%).

Now let us validate the performance of our model using test dataset.

In [None]:
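# Evaluate the model that was trained on df_train against the held-out test data
# (the model must not be re-fitted on the test data)
prediction_test = model.transform(df_test)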
print("GBT Validation Accuracy")
print("---------------------")
evaluator.evaluate(prediction_test)

GBT Validation Accuracy
---------------------


0.9219512195121952

On our test / validation dataset the accuracy is 0.92 (92%). Note that roughly 93% of the test records (952 of 1025) have label 0, so accuracy alone says little about how well the rare label 1 cases are detected.

Overall, based on the above results, our Gradient Boosted Trees model predicts the possibility of liver failure from our feature set with high accuracy. The accuracy of our model with both the training and test datasets is as follows:  
* Training Accuracy = 0.96 (96%)
* Validation Accuracy = 0.92 (92%)
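
Because label 1 is rare in our data, it can help to look at precision and recall next to accuracy. The following is a minimal sketch in plain Python of how these measures are derived from true and predicted labels. The function name `binary_metrics` and the ten example labels are hypothetical and are not results from our model.

```python
def binary_metrics(labels, predictions):
    """Return accuracy, precision and recall for 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical example: 10 records, 2 positives, only 1 of them detected
labels = [0] * 8 + [1, 1]
predictions = [0] * 8 + [1, 0]

metrics = binary_metrics(labels, predictions)
print(metrics)
```

In this toy example the accuracy is 0.9 even though only half of the positive cases are found, which is the kind of gap that accuracy on its own does not reveal.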

### 4.4. Feed Forward Neural Network (MLP) - Deep Learning Model

#### 4.4.1. Data Prep for Model

In our data prep so far, we have created training and test datasets to be used for training and testing our model. However this data is represented as dataframes. Our Feed Forward Neural Network model here is implemented using Keras.

This Feed Forward Neural Network model expects the input and output dataset as arrays. So, let us first construct input and output data arrays from our training and test datasets.  

Let us first define a function which will take a dataframe as input and return the input array (Features array) and output array (Label array).

In [None]:
#########################################################
# FUNCTION TO CONSTRUCT INPUT AND OUTPUT DATASET ARRAYS #
#########################################################
def construct_arrays(df):
    # Initialize Input and Output arrays for the Data set
    X = []
    y = []

    # Convert dataframe from Spark DF to Pandas DF
    df_pd = df.toPandas()

    # Loop through dataframe and add data to input and output arrays,
    # selecting the columns by name rather than by position
    for index, row in df_pd.iterrows():
        X.append(row['features'])
        y.append(row['label'])

    # Convert input and output data arrays from Python lists to Numpy arrays
    X = np.array(X)
    y = np.array(y)

    return (X, y)

Now let us pass the training and test data frames to the above function to obtain the respective input and output arrays containing the feature set and the label data.

In [None]:
X_train, y_train = construct_arrays(df_train)
print("Size of Training Features dataset: ", len(X_train))
print(X_train)
print("Size of Training Label dataset: ", len(y_train))
print(y_train)

X_test, y_test = construct_arrays(df_test)
print("Size of Test Features dataset: ", len(X_test))
print(X_test)
print("Size of Test Label dataset: ", len(y_test))
print(y_test)

Size of Training Features dataset:  4196
[[ 0.5         0.          0.         ...,  0.19736842  0.32068311
   0.31073446]
 [ 0.5         0.          0.         ...,  0.51973684  0.33017078
   0.41242938]
 [ 0.5         0.          0.         ...,  0.18421053  0.33965844
   0.32580038]
 ..., 
 [ 0.5         1.          1.         ...,  0.32236842  0.21442125
   0.24105461]
 [ 0.5         1.          1.         ...,  0.20394737  0.13851992
   0.13182674]
 [ 0.5         1.          1.         ...,  0.42763158  0.21252372
   0.2693032 ]]
Size of Training Label dataset:  4196
[ 0.  0.  0. ...,  1.  1.  1.]
Size of Test Features dataset:  1025
[[ 0.5         0.          0.         ...,  0.20394737  0.28462998
   0.27683616]
 [ 0.5         0.          0.         ...,  0.28289474  0.4573055
   0.47080979]
 [ 0.5         0.          0.         ...,  0.17105263  0.32637571
   0.30885122]
 ..., 
 [ 0.5         1.          1.         ...,  0.35526316  0.27514231
   0.31073446]
 [ 0.5         1.  

#### 4.4.2. Model Definition, Training and Evaluation

Let us now define our Deep Learning Model using the Feed Forward Neural Network (Multilayer Perceptron) Algorithm and then train and evaluate the model using the training and test / validation datasets.

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# Define a Multilayer Perceptron (MLP) Model using Keras
model = Sequential()
model.add(Dense(25, input_dim = 25, kernel_initializer = 'normal', activation = 'relu')) # Hidden layer: 25 units fed by the 25 input features
model.add(Dense(1, kernel_initializer = 'normal', activation = 'sigmoid')) # Output layer: probability of label 1

# Compile our model
model.compile(loss = 'binary_crossentropy', optimizer = 'adadelta', metrics = ['accuracy'])

# Train our model
model.fit(X_train, y_train, epochs = 20, batch_size = 5, verbose = 1, validation_data = (X_test, y_test))

# Evaluate our model on the test data; score is [loss, accuracy]
score = model.evaluate(X_test, y_test, verbose = 0)
print(score)

Using TensorFlow backend.


Train on 4196 samples, validate on 1025 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[0.18123500248280966, 0.93073170731707322]


We can see that the training and validation accuracy of our Feed Forward Neural Network is also high. The accuracy values measured for our model are as follows:
* Training accuracy = 0.94 (94%)
* Validation accuracy = 0.93 (93%)
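
The two numbers returned by `model.evaluate` above are the loss and the accuracy on the test data. As a minimal sketch of what these two quantities measure, the plain Python below computes binary cross-entropy and accuracy from predicted probabilities. The function names and the four example probabilities are hypothetical, and the clipping constant `eps` is an assumption for numerical safety rather than a value taken from Keras.

```python
import math

def binary_crossentropy(labels, probabilities, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def accuracy(labels, probabilities, threshold=0.5):
    """Share of records whose thresholded probability matches the label."""
    hits = sum(1 for y, p in zip(labels, probabilities)
               if int(p >= threshold) == y)
    return hits / len(labels)

# Hypothetical predictions for four records
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]

print([binary_crossentropy(labels, probabilities),
       accuracy(labels, probabilities)])
```

The loss rewards confident correct probabilities and penalises confident wrong ones, while accuracy only looks at which side of the threshold each prediction falls on.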