## Model Building

##### We will build several ML models (logistic regression, a neural network, a support vector machine (SVM) and a decision tree) and tune their hyperparameters to predict the risk of diabetes. Recall is the performance metric we use to judge the models.

##### We choose recall because false negatives (FN) must be given priority. A false negative is a case where the model predicts no risk of diabetes when the patient is in fact at high risk. In such a scenario the patient's life may be in danger, because they may not realize that they need to control their sugar intake.
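To make the metric concrete, here is a minimal sketch of how recall is computed from confusion-matrix counts. The labels are made-up illustration data, not the diabetes dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 1 = diabetic, 0 = not diabetic
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = TP / (TP + FN): the share of actual positives the model catches
recall_manual = tp / (tp + fn)
print(recall_manual, recall_score(y_true, y_pred))
```

Every missed positive (FN) lowers recall, while false positives do not affect it at all, which is why recall alone cannot tell a careful model from one that flags everyone.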

### Loading all the required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

from imblearn.over_sampling import RandomOverSampler
from xgboost import XGBClassifier


import tensorflow as tf
from tensorflow import keras

# fix random seed for reproducibility
np.random.seed(1)
tf.random.set_seed(1)



### Loading the processed training and test datasets

In [2]:
X_train = pd.read_csv('./data/X_train.csv') 
y_train = pd.read_csv('./data/y_train.csv') 
X_test = pd.read_csv('./data/X_test.csv') 
y_test = pd.read_csv('./data/y_test.csv') 
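`pd.read_csv` returns a DataFrame, so targets loaded this way are two-dimensional with a single column, and scikit-learn warns when it receives a column-vector target. Below is a minimal sketch of flattening such a target to one dimension. It uses a made-up frame instead of the CSV files, and the column name `target` is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for y_train: a one-column DataFrame, as read_csv returns
y_frame = pd.DataFrame({"target": [0, 1, 1, 0, 1]})

# scikit-learn estimators expect a 1-D target; ravel() flattens (n, 1) to (n,)
y_flat = y_frame.values.ravel()

print(y_frame.shape, y_flat.shape)
```

Passing the flattened array to `fit` should avoid the conversion warning without changing the fitted model.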

### Building a dataframe to store our models' performance metrics

In [3]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### Implementing Logistic Regression model with hyperparameters

##### Using Randomized search to find the best parameters

In [4]:
param_grid = {'penalty': ['l1', 'l2'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'solver': ['liblinear', 'saga'],
              'l1_ratio': [0.25, 0.5, 0.75],
              'max_iter': np.arange(800, 1200)
              }

# Perform Randomized Search CV to find the best hyperparameters
best_lregression = RandomizedSearchCV(estimator=LogisticRegression(random_state=0, solver='saga'),
                                      scoring='recall', 
                                      param_distributions=param_grid, 
                                      cv=10, 
                                      verbose=0, 
                                      return_train_score=True, 
                                      n_iter=500, 
                                      n_jobs=-1)
best_lregression.fit(X_train, y_train)

# Print the best parameters found through Randomized Search CV
print(f"Best parameters found through Randomized Search CV: {best_lregression.best_params_}")






Best parameters found through Randomized Search CV: {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 984, 'l1_ratio': 0.75, 'C': 0.01}




##### Running a grid search over a narrow range around the randomized-search result to refine the best parameters

In [5]:
# Define the parameter grid for Grid Search CV
param_grid = { 
    'solver': [best_lregression.best_params_['solver']],
    'penalty': [best_lregression.best_params_['penalty']],
    'C': [0.1, 1, 10],
    'max_iter': np.arange(750,950)
}

# Perform Grid Search CV with the best parameters from Randomized Search CV
grid_lregression = GridSearchCV(estimator=LogisticRegression(random_state=0, solver=best_lregression.best_params_['solver']),
                                param_grid=param_grid,
                                scoring='recall',
                                cv=10,
                                n_jobs=-1)
grid_lregression.fit(X_train, y_train)

# Print the best parameters found through Grid Search CV
print(f"Best parameters found through Grid Search CV: {grid_lregression.best_params_}")

Best parameters found through Grid Search CV: {'C': 10, 'max_iter': 750, 'penalty': 'l1', 'solver': 'liblinear'}




##### Storing the performance metrics in the dataframe

In [6]:
# Evaluate the model using the best parameters found through Grid Search CV 
c_matrix = confusion_matrix(y_test, grid_lregression.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model': "LR", 
                                                     'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                     'Precision': [TP/(TP+FP)], 
                                                     'Recall': [TP/(TP+FN)], 
                                                     'F1': [2*TP/(2*TP+FP+FN)]
                                                    }, index=[0])])

In [7]:
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | LR | 0.916667 | 0.938144 | 0.928571 | 0.933333 |


### Implementing SVM model with hyperparameters

##### Using Randomized search to find the best parameters

In [8]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'C': np.arange(1,25),   
    'gamma': ['scale','auto'],
    'kernel':['linear','rbf','poly']
}

svm = SVC()
rand_search = RandomizedSearchCV(estimator = svm, param_distributions=param_grid, cv=kfolds, n_iter=140,
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 140 candidates, totalling 700 fits
The best recall score is 1.0
... with parameters: {'kernel': 'poly', 'gamma': 'scale', 'C': 2}




##### Running a grid search over a narrow range around the randomized-search result to refine the best parameters

In [9]:
score_measure = "recall"
kfolds = 5

C = rand_search.best_params_['C']
gamma = rand_search.best_params_['gamma']
kernel = rand_search.best_params_['kernel']

param_grid = {
    'C': np.arange(C-2,C+2),  
    'gamma': [gamma],
    'kernel': [kernel]
    
}

svm1 = SVC()
grid_search = GridSearchCV(estimator = svm1, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestprecision_SVM = grid_search.best_estimator_

5 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\akash\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\akash\Anaconda3\lib\site-packages\sklearn\svm\_base.py", line 255, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "C:\Users\akash\Anaconda3\lib\site-packages\sklearn\svm\_base.py", line 315, in _dense_fit
    ) = libsvm.fit(
  File "sklearn\svm\_libsvm.pyx", line 189, in sklearn.svm._libsvm.fit
ValueError: C <= 0



Fitting 5 folds for each of 4 candidates, totalling 20 fits
The best recall score is 1.0
... with parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'poly'}
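The five failed fits come from the candidate `C=0`: with a randomized-search result of `C=2`, `np.arange(C-2, C+2)` yields 0, 1, 2, 3, and `SVC` rejects any `C <= 0`. A minimal sketch of one way to build the neighbourhood without the invalid value follows. The helper name `c_neighbourhood` and its `width` and `minimum` parameters are hypothetical, introduced only for illustration.

```python
import numpy as np

def c_neighbourhood(best_c, width=2, minimum=1):
    """Integer C values around best_c, never below `minimum`."""
    low = max(minimum, best_c - width)
    # np.arange excludes the stop value, so add 1 to include best_c + width
    return np.arange(low, best_c + width + 1)

# With the randomized-search result C=2, the unclamped range contains 0;
# the clamped version starts at 1 instead
print(np.arange(2 - 2, 2 + 2))
print(c_neighbourhood(2))
```

Clamping keeps every candidate valid, so no folds are spent on fits that can only fail.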




##### Storing the performance metrics in the dataframe

In [10]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"SVM", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | LR | 0.916667 | 0.938144 | 0.928571 | 0.933333 |
| 0 | SVM | 0.628205 | 0.628205 | 1.0 | 0.771654 |
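In the SVM row, recall is 1.0 while accuracy and precision are identical (0.628205). That pattern is what a classifier produces when it labels every case positive. The sketch below reproduces the pattern with scikit-learn's `DummyClassifier` on made-up labels (60 positives, 40 negatives). The numbers are illustrative and are not taken from the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels for illustration only
y = np.array([1] * 60 + [0] * 40)
X = np.zeros((100, 1))  # features are ignored by DummyClassifier

# A "model" that always predicts the positive class
always_positive = DummyClassifier(strategy="constant", constant=1).fit(X, y)
y_pred = always_positive.predict(X)

# Recall is perfect, while precision and accuracy both equal the positive rate
print(recall_score(y, y_pred), precision_score(y, y_pred), accuracy_score(y, y_pred))
```

When a tuned model shows this signature, its perfect recall says nothing about whether it has learned to separate the classes.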


### Implementing Decision Tree model with hyperparameters

##### Using Randomized search to find the best parameters

In [11]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,50),  
    'min_samples_leaf': np.arange(1,50),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 50), 
    'max_depth': np.arange(1,20), 
    'criterion': ['gini', 'entropy'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator=dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                                 scoring=score_measure, verbose=1, n_jobs=-1, # n_jobs=-1 will utilize all available CPUs 
                                 return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestPrecTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.932020202020202
... with parameters: {'min_samples_split': 45, 'min_samples_leaf': 15, 'min_impurity_decrease': 0.0061, 'max_leaf_nodes': 30, 'max_depth': 13, 'criterion': 'gini'}


##### Running a grid search over a narrow range around the randomized-search result to refine the best parameters

In [12]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(26,36),  
    'min_samples_leaf': np.arange(8,16),
    'min_impurity_decrease': np.arange( 0.0005, 0.0010, 0.0020),
    'max_leaf_nodes': [10,30], 
    'max_depth': [5,15], 
    'criterion': ['entropy']
}


dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestPrecisionTree = grid_search.best_estimator_

Fitting 5 folds for each of 320 candidates, totalling 1600 fits
The best recall score is 0.9094949494949495
... with parameters: {'criterion': 'entropy', 'max_depth': 5, 'max_leaf_nodes': 10, 'min_impurity_decrease': 0.0005, 'min_samples_leaf': 8, 'min_samples_split': 26}
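The 320 candidates reported above can be checked before any model is fitted. The sketch below counts the values each parameter contributes using the same grid and scikit-learn's `ParameterGrid`. Note that `np.arange(0.0005, 0.0010, 0.0020)` has a step larger than its range, so it contributes a single value.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as the cell above
param_grid = {
    'min_samples_split': np.arange(26, 36),
    'min_samples_leaf': np.arange(8, 16),
    'min_impurity_decrease': np.arange(0.0005, 0.0010, 0.0020),
    'max_leaf_nodes': [10, 30],
    'max_depth': [5, 15],
    'criterion': ['entropy'],
}

# Number of values each parameter contributes
for name, values in param_grid.items():
    print(name, len(values))

# Total candidates = product of the lengths above
print(len(ParameterGrid(param_grid)))
```

Inspecting the grid this way makes it easy to spot a parameter that is not being varied as intended.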


##### Storing the performance metrics in the dataframe

In [13]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | LR | 0.916667 | 0.938144 | 0.928571 | 0.933333 |
| 0 | SVM | 0.628205 | 0.628205 | 1.0 | 0.771654 |
| 0 | Decision Tree | 0.923077 | 0.93 | 0.94898 | 0.939394 |


### Implementing Neural Networks

In [14]:
%%time

score_measure = "recall"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (50,), (70,),(50,30), (40,20), (60,40, 20)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2],
    'max_iter': [1000]
}

ann = MLPClassifier()
grid_search = RandomizedSearchCV(estimator = ann, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

bestRecallTree = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
{'solver': 'adam', 'max_iter': 1000, 'learning_rate_init': 0.2, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (50, 30), 'alpha': 0.7, 'activation': 'logistic'}
CPU times: total: 15.4 s
Wall time: 2min 15s




In [15]:

y_pred = bestRecallTree.predict(X_test)

c_matrix = confusion_matrix(y_test, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Network Randomized search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)],  
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | LR | 0.916667 | 0.938144 | 0.928571 | 0.933333 |
| 0 | SVM | 0.628205 | 0.628205 | 1.0 | 0.771654 |
| 0 | Decision Tree | 0.923077 | 0.93 | 0.94898 | 0.939394 |
| 0 | Neural Network Randomized search | 0.794872 | NaN | 0.693878 | 0.809524 |
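The Precision cell is empty for the neural-network row because the dictionary passed to `pd.DataFrame` in that cell has no `'Precision'` key, so `pd.concat` fills the column with a missing value. One way to keep rows consistent is a small helper that always computes all four metrics. The function name `metrics_row` is hypothetical and the labels are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def metrics_row(name, y_true, y_pred):
    """Hypothetical helper: one-row DataFrame with all four metrics."""
    # labels=[0, 1] fixes the layout so ravel() yields TN, FP, FN, TP
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [(tp + tn) / (tp + tn + fp + fn)],
        "Precision": [tp / (tp + fp)],
        "Recall": [tp / (tp + fn)],
        "F1": [2 * tp / (2 * tp + fp + fn)],
    })

# Made-up labels for illustration, not results from this notebook
row = metrics_row("example", [1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(row)
```

A row built this way could be appended with `pd.concat([performance, row])`, in the same manner as the cells above.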


In [16]:
%%time

score_measure = "recall"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (30,), (50,), (70,), (90,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam'],
    'alpha': [.5, .7, 1],
    'learning_rate': ['adaptive', 'invscaling'],
    'learning_rate_init': [0.005, 0.01, 0.15],
    'max_iter': [1000]
}

ann = MLPClassifier()
grid_search = GridSearchCV(estimator = ann, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

bestRecallTree = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
{'activation': 'tanh', 'alpha': 0.5, 'hidden_layer_sizes': (30,), 'learning_rate': 'adaptive', 'learning_rate_init': 0.15, 'max_iter': 1000, 'solver': 'adam'}
CPU times: total: 3.12 s
Wall time: 21.8 s




In [17]:
y_pred = bestRecallTree.predict(X_test)

c_matrix = confusion_matrix(y_test, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Network Grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)],  
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | LR | 0.916667 | 0.938144 | 0.928571 | 0.933333 |
| 0 | SVM | 0.628205 | 0.628205 | 1.0 | 0.771654 |
| 0 | Decision Tree | 0.923077 | 0.93 | 0.94898 | 0.939394 |
| 0 | Neural Network Randomized search | 0.794872 | NaN | 0.693878 | 0.809524 |
| 0 | Neural Network Grid search | 0.628205 | NaN | 1.0 | 0.771654 |


#### Deep Neural Network Model

In [18]:
import tensorflow.keras.backend as K

# Callback that reports recall on the test set at the end of every epoch
class Metrics(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.recall = []

    def on_epoch_end(self, epoch, logs=None):
        # Round the sigmoid outputs to 0/1 class labels before scoring
        y_pred = np.round(self.model.predict(X_test))
        _recall = recall_score(y_test, y_pred)
        self.recall.append(_recall)
        print("val_recall:", _recall)

# Recall written with Keras backend ops so it can be passed to model.compile(metrics=[...])
def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())
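To see what the backend-based `recall` metric computes, here is a NumPy mirror of the same formula applied to a few made-up values. The function name `recall_np` is hypothetical, and `eps=1e-7` is an assumption standing in for the Keras default epsilon.

```python
import numpy as np

def recall_np(y_true, y_prob, eps=1e-7):
    """NumPy mirror of the backend-based recall metric above."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # The product rounds to 1 only where the label is 1 and the
    # predicted probability is above 0.5
    true_positives = np.sum(np.round(np.clip(y_true * y_prob, 0, 1)))
    possible_positives = np.sum(np.round(np.clip(y_true, 0, 1)))
    # eps guards against division by zero when a batch has no positives
    return true_positives / (possible_positives + eps)

# Three actual positives, two of them predicted above 0.5
print(recall_np([1, 1, 0, 1], [0.9, 0.3, 0.8, 0.6]))
```

The high score on the negative example (0.8) has no effect on the result, since the product with a label of 0 is always 0.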



In [19]:

%%time

# create model structure
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(16,)))
model.add(keras.layers.Dense(10, activation='relu',kernel_initializer= tf.keras.initializers.GlorotNormal()))
model.add(keras.layers.Dense(10, activation='relu', kernel_initializer= tf.keras.initializers.GlorotNormal()))
model.add(keras.layers.Dense(10, activation='relu', kernel_initializer= tf.keras.initializers.GlorotNormal()))
model.add(keras.layers.Dense(1, activation='sigmoid')) 

# compile the model with the custom recall metric
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[recall])


CPU times: total: 46.9 ms
Wall time: 97.1 ms


In [20]:
%%time

# fit the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=100, callbacks=[Metrics()])


Epoch 1/20
val_recall: 1.0
Epoch 2/20
val_recall: 1.0
Epoch 3/20
val_recall: 1.0
Epoch 4/20
val_recall: 1.0
Epoch 5/20
val_recall: 1.0
Epoch 6/20
val_recall: 1.0
Epoch 7/20
val_recall: 1.0
Epoch 8/20
val_recall: 1.0
Epoch 9/20
val_recall: 1.0
Epoch 10/20
val_recall: 1.0
Epoch 11/20
val_recall: 1.0
Epoch 12/20
val_recall: 1.0
Epoch 13/20
val_recall: 1.0
Epoch 14/20
val_recall: 1.0
Epoch 15/20
val_recall: 1.0
Epoch 16/20
val_recall: 1.0
Epoch 17/20
val_recall: 1.0
Epoch 18/20
val_recall: 1.0
Epoch 19/20
val_recall: 1.0
Epoch 20/20
val_recall: 1.0
CPU times: total: 2.17 s
Wall time: 4.46 s


In [21]:
%%time

# continue training the same model for 20 more epochs, this time without the recall callback
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=100)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
CPU times: total: 359 ms
Wall time: 1.1 s


In [22]:
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)

c_matrix = confusion_matrix(y_test, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Deep Neural Network", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)],  
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance



| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | LR | 0.916667 | 0.938144 | 0.928571 | 0.933333 |
| 0 | SVM | 0.628205 | 0.628205 | 1.0 | 0.771654 |
| 0 | Decision Tree | 0.923077 | 0.93 | 0.94898 | 0.939394 |
| 0 | Neural Network Randomized search | 0.794872 | NaN | 0.693878 | 0.809524 |
| 0 | Neural Network Grid search | 0.628205 | NaN | 1.0 | 0.771654 |
| 0 | Deep Neural Network | 0.660256 | NaN | 0.989796 | 0.785425 |
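The deep network's predictions are turned into class labels with a fixed 0.5 cut-off (`y_pred > 0.5`). Since recall is the priority here, the cut-off itself affects the result: lowering it raises recall at the cost of precision. The sketch below shows the trade-off on made-up probabilities. The helper name `recall_precision_at` is hypothetical and the numbers are not from this model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.3, 0.1])

def recall_precision_at(threshold):
    """Recall and precision when probabilities above `threshold` count as positive."""
    y_pred = (y_prob > threshold).astype(int)
    return recall_score(y_true, y_pred), precision_score(y_true, y_pred)

# Lower thresholds flag more cases as positive
for threshold in (0.5, 0.35, 0.15):
    print(threshold, recall_precision_at(threshold))
```

Reporting precision next to recall at each cut-off makes it clear how many false alarms each gain in recall costs.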


#### Looking at the performance table, the highest recall (100 percent) belongs to the SVM and the grid-searched neural network. However, both have low accuracy (0.63) and F1 (0.77). The SVM's precision equals its accuracy, which happens when a model labels every test case as positive, and the grid-searched network has identical accuracy, recall and F1. No model predicts perfectly, so this 100 percent recall should not be taken at face value.

#### Overall, we can say that the DNN is the best model for predicting whether a person is at risk of diabetes, as it has a recall of about 99 percent (0.9898).