# Lab 4: Neural Networks with Keras

## Data

In [14]:
#| echo: False
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, preprocessing
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, f1_score

First we will begin by importing the data, the dataset warns that the data is imbalanced, so we will examine that aspect next. One nice thing about the dataset is that it is completely numeric, meaning there is no need for OneHotEncoding to deal with categorical variables.

In [331]:
data = pd.read_csv("/Users/Bnkes/Desktop/GitHub/AdvancedMachineLearning/Data/DiabetesData/diabetes_binary_health_indicators_BRFSS2015.csv")
data.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


In [24]:
data["Diabetes_binary"].value_counts()

Diabetes_binary
0.0    218334
1.0     35346
Name: count, dtype: int64

As can be see above, the majority of the target variable are people without diabetes (~86% of the dataset). This suggests that we may want to pursue some form of sampling to get rid of this bias, but first we will begin without sampling the data so as to establish a baseline score.

For each sampling technique, I will try three different neural networks, as well as a random forest model.

## Sampling Method 1: No Sampling

In [399]:
X = data.drop("Diabetes_binary", axis = 1)
y = data["Diabetes_binary"]

scaler = preprocessing.MinMaxScaler().fit(X)
X = scaler.transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y)

### Random Forest Baseline Model
We will begin with a random forest to create a baseline model to compare our neural networks against.

In [401]:
my_pipeline = Pipeline(
    [
        ("forest", RandomForestClassifier(n_jobs = -1))
    ]
)

parameters = {
    "forest__min_samples_leaf": [1, 2, 3, 4, 5, 10, 15, 25],
    "forest__min_samples_split": [2, 3, 4, 5, 10, 15, 25],
    "forest__ccp_alpha": [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
}

gscv = GridSearchCV(my_pipeline, parameters, cv = 5, scoring='f1', n_jobs=1, verbose = 1)
gscv_fitted = gscv.fit(X_train, y_train)
test_scores = gscv_fitted.cv_results_["mean_test_score"]
gscv_fitted.best_estimator_

Fitting 5 folds for each of 392 candidates, totalling 1960 fits


In [402]:
pd.DataFrame(gscv_fitted.cv_results_).sort_values(by = "rank_test_score", ascending = True).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_forest__ccp_alpha,param_forest__min_samples_leaf,param_forest__min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
57,1.127321,0.008818,0.070968,0.004372,1e-06,1,3,"{'forest__ccp_alpha': 1e-06, 'forest__min_samp...",0.248607,0.257599,0.243672,0.238294,0.247159,0.247066,0.006353,1
1,1.045382,0.014506,0.077248,0.00072,0.0,1,3,"{'forest__ccp_alpha': 1e-07, 'forest__min_samp...",0.24932,0.255656,0.243124,0.233842,0.241618,0.244712,0.007366,2
56,1.073587,0.026592,0.077323,0.00158,1e-06,1,2,"{'forest__ccp_alpha': 1e-06, 'forest__min_samp...",0.245117,0.24993,0.236205,0.235747,0.255741,0.244548,0.007767,3
2,1.008385,0.014793,0.076261,0.000482,0.0,1,4,"{'forest__ccp_alpha': 1e-07, 'forest__min_samp...",0.250827,0.246828,0.235584,0.233873,0.248801,0.243183,0.007038,4
0,1.043253,0.022021,0.079122,0.002693,0.0,1,2,"{'forest__ccp_alpha': 1e-07, 'forest__min_samp...",0.249825,0.242116,0.236302,0.233718,0.245253,0.241443,0.005854,5


In [409]:
my_pipeline = Pipeline(
    [
        ("forest", RandomForestClassifier(n_jobs=-1, ccp_alpha=1e-6, min_samples_leaf=1, min_samples_split=3))
    ]
)

fitted_pipeline = my_pipeline.fit(X_train, y_train)

In [410]:
y_pred = fitted_pipeline.predict(X)

cm = confusion_matrix(y_true = y, y_pred = y_pred)

cm_df = pd.DataFrame(cm, index=["Actual No Diabetes", "Actual Diabetes"], columns=["Predicted No Diabetes", "Predicted Diabetes"])

cm_df

Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,152692,65642
Actual Diabetes,3937,31409


In [411]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Random Forest Classifier: {f1}")

F1 Score for Random Forest Classifier: 0.4744669441150479


As can be seen above, the random forest has done a good job classifying the data, however it seems to struggle to distinguish a small group of people who have diabetes from those who do not. This is very good performance given the lack of sampling techniques used.

### Neural Network 1
We will begin with a simple neural network. It will have three layers all with the input size of 21.

In [103]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(21, activation="relu")(inputs)
x = layers.Dense(21, activation="relu")(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_1")
model.summary()

In [111]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 655us/step - f1_score: 0.2457 - loss: 0.3147 - val_f1_score: 0.2435 - val_loss: 0.3139
Epoch 2/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 615us/step - f1_score: 0.2433 - loss: 0.3120 - val_f1_score: 0.2435 - val_loss: 0.3130
Epoch 3/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 620us/step - f1_score: 0.2440 - loss: 0.3144 - val_f1_score: 0.2435 - val_loss: 0.3125
Epoch 4/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 784us/step - f1_score: 0.2470 - loss: 0.3161 - val_f1_score: 0.2435 - val_loss: 0.3126
Epoch 5/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 618us/step - f1_score: 0.2469 - loss: 0.3173 - val_f1_score: 0.2435 - val_loss: 0.3129
Epoch 6/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 622us/step - f1_score: 0.2453 - loss: 0.3148 - val_f1_score: 0.2435 - val_loss: 0.311

In [113]:
scores = model.evaluate(X_test, y_test, verbose=2)

1982/1982 - 1s - 421us/step - f1_score: 0.2439 - loss: 0.3138


In [115]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 448us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,214194,4140
Actual Diabetes,29639,5707


In this first neural network, we can see substantial overfitting. This is likely a result of the not employing any sampling techniques, however we will attempt to fix this by making some changes to the neural network first.

### Neural Network 2
With this neural network, we will add in dropout layers and change the activation function of the network.

In [117]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(21, activation="linear")(inputs)
x = layers.Dropout(rate=.1)(x)
x = layers.Dense(21, activation="linear")(x)
x = layers.Dropout(rate=.1)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_2")
model.summary()

In [127]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5, monitor="f1_score"), validation_split=.2)

Epoch 1/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 799us/step - f1_score: 0.2443 - loss: 0.3214 - val_f1_score: 0.2435 - val_loss: 0.3179
Epoch 2/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 713us/step - f1_score: 0.2445 - loss: 0.3222 - val_f1_score: 0.2435 - val_loss: 0.3175
Epoch 3/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 728us/step - f1_score: 0.2459 - loss: 0.3243 - val_f1_score: 0.2435 - val_loss: 0.3176
Epoch 4/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 719us/step - f1_score: 0.2457 - loss: 0.3240 - val_f1_score: 0.2435 - val_loss: 0.3214
Epoch 5/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 719us/step - f1_score: 0.2446 - loss: 0.3238 - val_f1_score: 0.2435 - val_loss: 0.3181
Epoch 6/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 719us/step - f1_score: 0.2431 - loss: 0.3201 - val_f1_score: 0.2435 - val_loss: 0.319

In [129]:
scores = model.evaluate(X_test, y_test, verbose=2)

1982/1982 - 1s - 457us/step - f1_score: 0.2439 - loss: 0.3219


In [131]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 412us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,216095,2239
Actual Diabetes,32284,3062


Adding the dropout layers and changing the activation function did improve the quality of the network, but there is still a large problem of predicting people who have diabetes not having diabetes. This would be dangerous if used in the real world. We will now try to solve this issue

### Neural Network 3
Now we will try changing the size of the network.

In [135]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(40, activation="relu")(inputs)
x = layers.Dropout(rate=.1)(x)
x = layers.Dense(30, activation="relu")(x)
x = layers.Dropout(rate=.1)(x)
x = layers.Dense(21, activation="relu")(x)
x = layers.Dropout(rate=.1)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_3")
model.summary()

In [139]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=3, monitor="f1_score"), validation_split=.2)

Epoch 1/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 868us/step - f1_score: 0.2431 - loss: 0.3205 - val_f1_score: 0.2435 - val_loss: 0.3153
Epoch 2/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 821us/step - f1_score: 0.2452 - loss: 0.3215 - val_f1_score: 0.2435 - val_loss: 0.3138
Epoch 3/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 825us/step - f1_score: 0.2442 - loss: 0.3214 - val_f1_score: 0.2435 - val_loss: 0.3152
Epoch 4/100
[1m2379/2379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 826us/step - f1_score: 0.2456 - loss: 0.3192 - val_f1_score: 0.2435 - val_loss: 0.3139


In [141]:
scores = model.evaluate(X_test, y_test, verbose=2)

1982/1982 - 1s - 460us/step - f1_score: 0.2439 - loss: 0.3163


In [143]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 503us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,215508,2826
Actual Diabetes,31289,4057


While we have slightly improved the results of the neural network, it still suffers from overfitting. At this point, the random forest performs significantly better than the neural network. For this reason we will now explore different sampling techniques to deal with this problem.

## Undersampling
We will begin with undersampling the predominant class. By strategically reducing the number of people without diabetes in the dataset, we will hopefully help the neural network distinguish between the two classes better. For this case, we will try creating two equal classes through undersampling

In [403]:
data_under_neg = data[data["Diabetes_binary"] == 0].sample(20000, random_state=1)
data_under_pos = data[data["Diabetes_binary"] == 1].sample(20000, random_state=1)

X_train = pd.concat([data_under_neg.drop("Diabetes_binary", axis = 1), data_under_pos.drop("Diabetes_binary", axis = 1)])
y_train = pd.concat([data_under_neg[["Diabetes_binary"]], data_under_pos[["Diabetes_binary"]]])
y_train = y_train["Diabetes_binary"]

train_indices = pd.concat([data_under_neg, data_under_pos]).index

X_test = data.drop(index=train_indices)
y_test = X_test["Diabetes_binary"]
X_test = X_test.drop("Diabetes_binary", axis = 1)

X = data.drop("Diabetes_binary", axis = 1)
y = data["Diabetes_binary"]

scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
X = scaler.fit_transform(X)

### Random Forest Baseline Model
We will begin with a random forest to create a baseline model to compare our neural networks against.

In [404]:
my_pipeline = Pipeline(
    [
        ("forest", RandomForestClassifier(n_jobs = -1))
    ]
)

parameters = {
    "forest__min_samples_leaf": [1, 2, 3, 4, 5, 10, 15, 25],
    "forest__min_samples_split": [2, 3, 4, 5, 10, 15, 25],
    "forest__ccp_alpha": [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
}

gscv = GridSearchCV(my_pipeline, parameters, cv = 5, scoring='f1', n_jobs=1, verbose = 1)
gscv_fitted = gscv.fit(X_train, y_train)
test_scores = gscv_fitted.cv_results_["mean_test_score"]
gscv_fitted.best_estimator_

Fitting 5 folds for each of 392 candidates, totalling 1960 fits


In [405]:
pd.DataFrame(gscv_fitted.cv_results_).sort_values(by = "rank_test_score", ascending = True).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_forest__ccp_alpha,param_forest__min_samples_leaf,param_forest__min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
172,0.261146,0.001594,0.033906,0.000735,0.0001,1,10,"{'forest__ccp_alpha': 0.0001, 'forest__min_sam...",0.758555,0.764902,0.761449,0.766546,0.764867,0.763264,0.002881,1
202,0.213954,0.002187,0.033444,0.000646,0.0001,5,25,"{'forest__ccp_alpha': 0.0001, 'forest__min_sam...",0.758678,0.763922,0.762301,0.767193,0.763584,0.763136,0.002751,2
69,0.215881,0.003757,0.033871,0.000553,1e-06,2,25,"{'forest__ccp_alpha': 1e-06, 'forest__min_samp...",0.756563,0.764104,0.761871,0.768695,0.763654,0.762977,0.003919,3
177,0.310524,0.004161,0.03291,0.000452,0.0001,2,4,"{'forest__ccp_alpha': 0.0001, 'forest__min_sam...",0.756118,0.765699,0.761859,0.768044,0.763054,0.762955,0.004035,4
197,0.235623,0.004204,0.033829,0.000777,0.0001,5,3,"{'forest__ccp_alpha': 0.0001, 'forest__min_sam...",0.757221,0.764517,0.762405,0.767044,0.763051,0.762848,0.003234,5


In [419]:
my_pipeline = Pipeline(
    [
        ("forest", RandomForestClassifier(n_jobs=-1, ccp_alpha=1e-4, min_samples_leaf=1, min_samples_split=10))
    ]
)

fitted_pipeline = my_pipeline.fit(X_train, y_train)

In [421]:
y_pred = fitted_pipeline.predict(X)

cm = confusion_matrix(y_true = y, y_pred = y_pred)

cm_df = pd.DataFrame(cm, index=["Actual No Diabetes", "Actual Diabetes"], columns=["Predicted No Diabetes", "Predicted Diabetes"])

cm_df

Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,149057,69277
Actual Diabetes,6167,29179


In [423]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Random Forest Classifier: {f1}")

F1 Score for Random Forest Classifier: 0.43615192598017966


While the undersampling seems to have helped the network better predict people with diabetes, it has also resulted in poor performance for predicting the people without diabetes accurately. Now we will move onto neural networks to see if we can best this performance.

### Neural Network 1
Once again, we will begin with a fairly simple neural network.

In [425]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(21, activation="relu")(inputs)
x = layers.Dense(21, activation="relu")(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_1")
model.summary()

In [427]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 842us/step - f1_score: 0.5458 - loss: 0.5777 - val_f1_score: 1.0000 - val_loss: 0.7070
Epoch 2/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 630us/step - f1_score: 0.5462 - loss: 0.5153 - val_f1_score: 1.0000 - val_loss: 0.8374
Epoch 3/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 671us/step - f1_score: 0.5450 - loss: 0.5063 - val_f1_score: 1.0000 - val_loss: 0.6664
Epoch 4/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 701us/step - f1_score: 0.5477 - loss: 0.5058 - val_f1_score: 1.0000 - val_loss: 0.6849
Epoch 5/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 684us/step - f1_score: 0.5452 - loss: 0.5036 - val_f1_score: 1.0000 - val_loss: 0.8010
Epoch 6/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 673us/step - f1_score: 0.5488 - loss: 0.4934 - val_f1_score: 1.0000 - val_loss: 0.8022
Epoch 7/10

In [429]:
scores = model.evaluate(X_test, y_test, verbose=2)

6678/6678 - 3s - 413us/step - f1_score: 0.1340 - loss: 0.4593


In [430]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 408us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,162621,55713
Actual Diabetes,8906,26440


In [432]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Neural Network 1: {f1}")

F1 Score for Neural Network 1: 0.4500463833734755


As can be seen above, this neural network is slightly better than the random forest model. I will now try to improve on this performance by altering the model.

### Neural Network 2
In this network, I am going to add dropout layers and increase the size of the network to try and alleviate any overfitting while also increasing the accuracy of the model.

In [435]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(40, activation="relu")(inputs)
x = layers.Dropout(rate = .1)(x)
x = layers.Dense(30, activation="relu")(x)
x = layers.Dropout(rate = .1)(x)
x = layers.Dense(21, activation="relu")(x)
x = layers.Dropout(rate = .1)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_2")
model.summary()

In [437]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 983us/step - f1_score: 0.5482 - loss: 0.5747 - val_f1_score: 1.0000 - val_loss: 0.7151
Epoch 2/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 760us/step - f1_score: 0.5469 - loss: 0.5238 - val_f1_score: 1.0000 - val_loss: 0.7264
Epoch 3/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 762us/step - f1_score: 0.5466 - loss: 0.5135 - val_f1_score: 1.0000 - val_loss: 0.5652
Epoch 4/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 744us/step - f1_score: 0.5468 - loss: 0.5109 - val_f1_score: 1.0000 - val_loss: 0.7513
Epoch 5/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 769us/step - f1_score: 0.5405 - loss: 0.5064 - val_f1_score: 1.0000 - val_loss: 0.6773
Epoch 6/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 760us/step - f1_score: 0.5503 - loss: 0.5042 - val_f1_score: 1.0000 - val_loss: 0.7751
Epoch 7/10

In [439]:
scores = model.evaluate(X_test, y_test, verbose=2)

6678/6678 - 3s - 474us/step - f1_score: 0.1340 - loss: 0.4431


In [440]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 451us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,167138,51196
Actual Diabetes,10026,25320


In [441]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Neural Network 2: {f1}")

F1 Score for Neural Network 2: 0.4527006490139636


Adding in the dropout layers and increasing the size of the network has resulted in the neural network performing better than the previous iteration, however it still struggles with false positives/negatives.

### Neural Network 3
For this model I am going to significantly increase the size of the neural network and increase the dropout rate to attempt to combat any overfitting that could result due to a larger network.

In [445]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(200, activation="relu")(inputs)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(150, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(100, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(50, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_3")
model.summary()

In [447]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5408 - loss: 0.5728 - val_f1_score: 1.0000 - val_loss: 0.6450
Epoch 2/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5471 - loss: 0.5230 - val_f1_score: 1.0000 - val_loss: 0.6293
Epoch 3/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5488 - loss: 0.5162 - val_f1_score: 1.0000 - val_loss: 0.7443
Epoch 4/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5454 - loss: 0.5057 - val_f1_score: 1.0000 - val_loss: 0.6654
Epoch 5/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5429 - loss: 0.5048 - val_f1_score: 1.0000 - val_loss: 0.6431
Epoch 6/100
[1m500/500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5459 - loss: 0.5051 - val_f1_score: 1.0000 - val_loss: 0.6576
Epoch 7/100
[1m500/50

In [449]:
scores = model.evaluate(X_test, y_test, verbose=2)

6678/6678 - 3s - 510us/step - f1_score: 0.1340 - loss: 0.3845


In [450]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 531us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,172446,45888
Actual Diabetes,11148,24198


In [452]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Neural Network 3: {f1}")

F1 Score for Neural Network 3: 0.4590257227407239


Increasing the size of the network and the dropout layer rate has resulted in the network performing slightly less well than the previous model. I believe the best way to improve performance at this point is to attempt oversampling the minority class.

## Oversampling
We will now attempt to increase the model performance by oversampling the minority class. In this section, we will duplicate some of the minority data to attempt to better teach the model how to identify that class.

In [455]:
data_over_pos = data[data["Diabetes_binary"] == 0].sample(40000, random_state=1)
data_over_neg = data[data["Diabetes_binary"] == 1].sample(20000, random_state=1)
data_over_neg = pd.concat([data_over_neg, data_over_neg])
data_over = pd.concat([data_over_pos, data_over_neg])

X_train = data_over.drop("Diabetes_binary", axis = 1)
y_train = data_over["Diabetes_binary"]

train_indices = pd.concat([data_over_pos, data_over_neg]).index

X_test = data.drop(index=train_indices)
y_test = X_test["Diabetes_binary"]
X_test = X_test.drop("Diabetes_binary", axis = 1)

scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
X = scaler.fit_transform(X)

### Random Forest
We will once again begin with a random forest model to set a baseline.

In [458]:
my_pipeline = Pipeline(
    [
        ("forest", RandomForestClassifier(n_jobs = -1))
    ]
)

parameters = {
    "forest__min_samples_leaf": [1, 2, 3, 4, 5, 10, 15, 25],
    "forest__min_samples_split": [2, 3, 4, 5, 10, 15, 25],
    "forest__ccp_alpha": [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
}

gscv = GridSearchCV(my_pipeline, parameters, cv = 5, scoring='f1', n_jobs=1, verbose = 1)
gscv_fitted = gscv.fit(X_train, y_train)
test_scores = gscv_fitted.cv_results_["mean_test_score"]
gscv_fitted.best_estimator_

Fitting 5 folds for each of 392 candidates, totalling 1960 fits


In [459]:
pd.DataFrame(gscv_fitted.cv_results_).sort_values(by = "rank_test_score", ascending = True).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_forest__ccp_alpha,param_forest__min_samples_leaf,param_forest__min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.420829,0.00958,0.045906,0.000802,0.0,1,2,"{'forest__ccp_alpha': 1e-07, 'forest__min_samp...",0.887658,0.888043,0.883646,0.883721,0.887496,0.886113,0.001992,1
56,0.430986,0.011921,0.046341,0.000351,1e-06,1,2,"{'forest__ccp_alpha': 1e-06, 'forest__min_samp...",0.885785,0.888777,0.884214,0.884009,0.887708,0.886098,0.001886,2
57,0.412292,0.005934,0.043536,0.004024,1e-06,1,3,"{'forest__ccp_alpha': 1e-06, 'forest__min_samp...",0.879006,0.88468,0.876735,0.876606,0.879174,0.87924,0.002928,3
1,0.407266,0.011561,0.046022,0.000667,0.0,1,3,"{'forest__ccp_alpha': 1e-07, 'forest__min_samp...",0.878356,0.883006,0.876059,0.87507,0.88022,0.878542,0.002863,4
112,0.540599,0.007344,0.043665,0.004336,1e-05,1,2,"{'forest__ccp_alpha': 1e-05, 'forest__min_samp...",0.879151,0.883025,0.876449,0.876763,0.877116,0.878501,0.002451,5


In [464]:
my_pipeline = Pipeline(
    [
        ("forest", RandomForestClassifier(n_jobs=-1, ccp_alpha=0, min_samples_leaf=1, min_samples_split=2))
    ]
)

fitted_pipeline = my_pipeline.fit(X_train, y_train)

In [466]:
y_pred = fitted_pipeline.predict(X)

cm = confusion_matrix(y_true = y, y_pred = y_pred)

cm_df = pd.DataFrame(cm, index=["Actual No Diabetes", "Actual Diabetes"], columns=["Predicted No Diabetes", "Predicted Diabetes"])

cm_df

Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,178349,39985
Actual Diabetes,5107,30239


In [468]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Random Forest Classifier: {f1}")

F1 Score for Random Forest Classifier: 0.5728710807994696


The random forest has done fairly well, the most impressive thing is that it does well at not misclasifying someone as not having diabetes when they actually have it. We can also see that oversampling does seem to be more effective in this use case than undersampling.

### Neural Network 1
For the last time, we will start with a simple neural network and build from there.

In [472]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(21, activation="relu")(inputs)
x = layers.Dense(21, activation="relu")(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_1")
model.summary()

In [474]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 748us/step - f1_score: 0.5480 - loss: 0.5629 - val_f1_score: 1.0000 - val_loss: 0.7331
Epoch 2/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 630us/step - f1_score: 0.5446 - loss: 0.5058 - val_f1_score: 1.0000 - val_loss: 0.5894
Epoch 3/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 646us/step - f1_score: 0.5452 - loss: 0.5024 - val_f1_score: 1.0000 - val_loss: 0.6504
Epoch 4/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 688us/step - f1_score: 0.5463 - loss: 0.4909 - val_f1_score: 1.0000 - val_loss: 0.7757
Epoch 5/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 684us/step - f1_score: 0.5482 - loss: 0.4933 - val_f1_score: 1.0000 - val_loss: 0.6277
Epoch 6/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 631us/step - f1_score: 0.5422 - loss: 0.4917 - val_f1_score: 1.0000 - val_loss: 0.731

In [476]:
scores = model.evaluate(X_test, y_test, verbose=2)

6053/6053 - 3s - 429us/step - f1_score: 0.1468 - loss: 0.4225


In [478]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 404us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,169673,48661
Actual Diabetes,10367,24979


In [480]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Neural Network 1: {f1}")

F1 Score for Neural Network 1: 0.4583891509001156


While this first neural network works alright, it is not better than the random forest above. We will try to solve this problem by adding dropout layers and increasing the size of the network.

### Neural Network 3

In [484]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(40, activation="relu")(inputs)
x = layers.Dropout(rate = .1)(x)
x = layers.Dense(30, activation="relu")(x)
x = layers.Dropout(rate = .1)(x)
x = layers.Dense(21, activation="relu")(x)
x = layers.Dropout(rate = .1)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_2")
model.summary()

In [486]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 859us/step - f1_score: 0.5472 - loss: 0.5616 - val_f1_score: 1.0000 - val_loss: 0.6918
Epoch 2/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 731us/step - f1_score: 0.5490 - loss: 0.5166 - val_f1_score: 1.0000 - val_loss: 0.6794
Epoch 3/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 738us/step - f1_score: 0.5439 - loss: 0.5072 - val_f1_score: 1.0000 - val_loss: 0.6991
Epoch 4/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 734us/step - f1_score: 0.5424 - loss: 0.5035 - val_f1_score: 1.0000 - val_loss: 0.7100
Epoch 5/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 724us/step - f1_score: 0.5458 - loss: 0.5050 - val_f1_score: 1.0000 - val_loss: 0.6574
Epoch 6/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 734us/step - f1_score: 0.5444 - loss: 0.5013 - val_f1_score: 1.0000 - val_loss: 0.677

In [488]:
scores = model.evaluate(X_test, y_test, verbose=2)

6053/6053 - 3s - 458us/step - f1_score: 0.1468 - loss: 0.3789


In [490]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 505us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,178306,40028
Actual Diabetes,12439,22907


In [492]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Neural Network 2: {f1}")

F1 Score for Neural Network 2: 0.4661531730446373


Adding the dropout layers appears to have slightly increased the performance of the model, but it is still less accurate than the random forest model.

### Neural Network 3
For this final neural network, we will try making the network much larger and increase the amount dropout rate to try and deal with the false negatives/positives.

In [498]:
inputs = keras.Input(shape = (21, ))
x = layers.Dense(200, activation="relu")(inputs)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(150, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(100, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(50, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(rate = .2)(x)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="diabetes_model_3")
model.summary()

In [500]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.F1Score()],
)

history = model.fit(X_train, y_train, batch_size = 64, epochs=100, callbacks=keras.callbacks.EarlyStopping(patience=5), validation_split=.2)

Epoch 1/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - f1_score: 0.5458 - loss: 0.5470 - val_f1_score: 1.0000 - val_loss: 0.6717
Epoch 2/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5438 - loss: 0.5099 - val_f1_score: 1.0000 - val_loss: 0.7105
Epoch 3/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5448 - loss: 0.5101 - val_f1_score: 1.0000 - val_loss: 0.7512
Epoch 4/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5464 - loss: 0.5055 - val_f1_score: 1.0000 - val_loss: 0.7039
Epoch 5/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5470 - loss: 0.5066 - val_f1_score: 1.0000 - val_loss: 0.7129
Epoch 6/100
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 0.5480 - loss: 0.5039 - val_f1_score: 1.0000 - val_loss: 0.6488
Epoch 7/10

In [502]:
scores = model.evaluate(X_test, y_test, verbose=2)

6053/6053 - 3s - 522us/step - f1_score: 0.1468 - loss: 0.4032


In [504]:
y_pred_prob = model.predict(X)  # Get predicted probabilities for each class
y_pred = (y_pred_prob > .5).astype(int) # Convert probabilities to class labels
pd.DataFrame(confusion_matrix(y, y_pred), columns=["Predicted No Diabetes", "Predicted Diabetes"], index = ["Actual No Diabetes", "Actual Diabetes"])

[1m7928/7928[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 508us/step


Unnamed: 0,Predicted No Diabetes,Predicted Diabetes
Actual No Diabetes,175125,43209
Actual Diabetes,11374,23972


In [506]:
f1 = f1_score(y_true=y, y_pred = y_pred)
print(f"F1 Score for Neural Network 3: {f1}")

F1 Score for Neural Network 3: 0.46762316267909915


While this network did perform better than the previous attempt, its accuracy is still very low for the intended use case.

## Final Thoughts

While the neural networks showed some promise in this application, no model or sampling technique was able to best the random forest. In a situation like this one where there is a large amount of data that is unbalanced, the random forest can make better use of the extra data to distinguish between the two classes than the neural network. This may be a result of the neural network becoming overfit, however attempting to correct this with dropout layers and increasing size did not result in better performance. Overall, I would pick the random forest model, especially because it is an interpretable model.