## Model performance on a dataset trimmed to a 50/50 split on the target variable (Diabetes_binary). The full dataset was over 85% non-diabetic, which can bias a model toward the majority class.
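
The 50/50 file is loaded ready-made below, and how it was produced is not shown in this notebook. As a minimal sketch, assuming the balance was obtained by randomly undersampling the majority class, the idea looks like this on a small synthetic frame (the `balance_classes` helper and the toy values are hypothetical):

```python
import pandas as pd

def balance_classes(df, target, random_state=42):
    """Randomly undersample every class down to the size of the rarest one."""
    n_min = df[target].value_counts().min()
    return (
        df.groupby(target, group_keys=False)
          .sample(n=n_min, random_state=random_state)
          .reset_index(drop=True)
    )

# Toy data: 85 non-diabetic rows and 15 diabetic rows (hypothetical values).
toy = pd.DataFrame({
    "Diabetes_binary": [0.0] * 85 + [1.0] * 15,
    "BMI": list(range(100)),
})
balanced = balance_classes(toy, "Diabetes_binary")
print(balanced["Diabetes_binary"].value_counts(normalize=True))
```

Undersampling discards majority-class rows, so the balanced frame is smaller than the original.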

## Part 1: Prepare the Data

In [1]:
# Import Dependencies
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report, confusion_matrix

Load the data into a Pandas DataFrame and fetch the top 10 rows.

In [2]:
# Read in CSV
file_path = Path("../Resources/diabetes_data_5050split.csv")
df = pd.read_csv(file_path)
df.head(10)

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0
5,0.0,0.0,0.0,1.0,18.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,2.0,7.0,0.0,0.0,0.0,1.0,4.0,7.0
6,0.0,0.0,1.0,1.0,26.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,13.0,5.0,6.0
7,0.0,0.0,0.0,1.0,31.0,1.0,0.0,0.0,0.0,1.0,...,1.0,0.0,4.0,0.0,0.0,0.0,1.0,6.0,4.0,3.0
8,0.0,0.0,0.0,1.0,32.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,6.0,8.0
9,0.0,0.0,0.0,1.0,27.0,1.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,6.0,0.0,1.0,6.0,4.0,4.0


List the DataFrame's data types to ensure they're aligned to the type of data stored on each column.

In [3]:
# List dataframe data types
df.dtypes

Diabetes_binary         float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object

Check every column for `null` values; any rows that contain them should be removed.

In [4]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

Column Diabetes_binary has 0 null values
Column HighBP has 0 null values
Column HighChol has 0 null values
Column CholCheck has 0 null values
Column BMI has 0 null values
Column Smoker has 0 null values
Column Stroke has 0 null values
Column HeartDiseaseorAttack has 0 null values
Column PhysActivity has 0 null values
Column Fruits has 0 null values
Column Veggies has 0 null values
Column HvyAlcoholConsump has 0 null values
Column AnyHealthcare has 0 null values
Column NoDocbcCost has 0 null values
Column GenHlth has 0 null values
Column MentHlth has 0 null values
Column PhysHlth has 0 null values
Column DiffWalk has 0 null values
Column Sex has 0 null values
Column Age has 0 null values
Column Education has 0 null values
Column Income has 0 null values


In [5]:
# Look at min and max values
for column in df.columns:
    print(f"Column {column} has {df[column].min()} as minimum value")
    print(f"Column {column} has {df[column].max()} as maximum value")

Column Diabetes_binary has 0.0 as minimum value
Column Diabetes_binary has 1.0 as maximum value
Column HighBP has 0.0 as minimum value
Column HighBP has 1.0 as maximum value
Column HighChol has 0.0 as minimum value
Column HighChol has 1.0 as maximum value
Column CholCheck has 0.0 as minimum value
Column CholCheck has 1.0 as maximum value
Column BMI has 12.0 as minimum value
Column BMI has 98.0 as maximum value
Column Smoker has 0.0 as minimum value
Column Smoker has 1.0 as maximum value
Column Stroke has 0.0 as minimum value
Column Stroke has 1.0 as maximum value
Column HeartDiseaseorAttack has 0.0 as minimum value
Column HeartDiseaseorAttack has 1.0 as maximum value
Column PhysActivity has 0.0 as minimum value
Column PhysActivity has 1.0 as maximum value
Column Fruits has 0.0 as minimum value
Column Fruits has 1.0 as maximum value
Column Veggies has 0.0 as minimum value
Column Veggies has 1.0 as maximum value
Column HvyAlcoholConsump has 0.0 as minimum value
Column HvyAlcoholConsump h

No null values were found. Check for duplicates.

In [6]:
# Find duplicate entries
print(f"Duplicate entries: {df.duplicated().sum()}")

Duplicate entries: 1635


In [7]:
# Drop duplicate entries
df = df.drop_duplicates()

In [8]:
# Confirm that duplicate entries have been dropped
print(f"Duplicate entries: {df.duplicated().sum()}")

Duplicate entries: 0


Check for duplicates but ignore the target variable.

In [9]:
# Find duplicate entries ignoring target column
features = ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income']
print(f"Duplicate entries excluding the target variable: {df.duplicated(subset=features).sum()}")

Duplicate entries excluding the target variable: 408


In [10]:
# Drop every row whose feature values appear more than once (keep=False removes all copies)
df = df.drop_duplicates(subset=features, keep=False)

In [11]:
# Confirm that duplicate entries have been dropped
print(f"Duplicate entries excluding the target variable: {df.duplicated(subset=features).sum()}")

Duplicate entries excluding the target variable: 0
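
Because exact duplicates were already removed, any rows that still share all 21 feature values must differ in `Diabetes_binary`. Passing `keep=False` drops every row in such a group rather than keeping one of the conflicting labels. A small sketch on hypothetical toy data shows the difference from the default `keep='first'`:

```python
import pandas as pd

# Rows 0 and 1 share the same features but carry conflicting labels.
toy = pd.DataFrame({
    "HighBP":          [1.0, 1.0, 0.0, 0.0],
    "BMI":             [26.0, 26.0, 31.0, 22.0],
    "Diabetes_binary": [0.0, 1.0, 0.0, 1.0],
})
toy_features = ["HighBP", "BMI"]

keep_first = toy.drop_duplicates(subset=toy_features)             # keeps row 0, drops row 1
keep_none = toy.drop_duplicates(subset=toy_features, keep=False)  # drops rows 0 and 1

print(len(keep_first))  # 3
print(len(keep_none))   # 2
```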


In [12]:
# Evaluate shape after dropping duplicates
df.shape

(68241, 22)

Export CSV for Tableau Visualizations

In [13]:
# Export CSV (commented out to avoid overwriting the file once exported)
#df.to_csv('../Resources/viz_data.csv', index_label='index')

In [14]:
# Function by Boern, found at the following link:
# https://stackoverflow.com/questions/31323499/sklearn-error-valueerror-input-contains-nan-infinity-or-a-value-too-large-for
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Drop NaN rows without modifying the caller's DataFrame
    df = df.dropna()
    # Keep rows with no NaN or infinite values; axis is passed by keyword
    # because newer pandas versions reject the positional form any(1)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

In [15]:
df_cleaned = clean_dataset(df)
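
The helper above keeps only rows whose values are all finite. An equivalent sketch, assuming every column is numeric, uses `np.isfinite` instead of `isin` (the name `keep_finite_rows` and the toy frame are hypothetical and not part of the notebook):

```python
import numpy as np
import pandas as pd

def keep_finite_rows(df):
    """Return a float64 copy of df without rows containing NaN or +/-inf."""
    values = df.astype(np.float64)
    finite_mask = np.isfinite(values).all(axis=1)
    return values[finite_mask]

# Toy frame with one NaN row and one infinite row.
toy = pd.DataFrame({
    "BMI": [26.0, np.nan, np.inf, 31.0],
    "Age": [4.0, 12.0, 13.0, 6.0],
})
cleaned = keep_finite_rows(toy)
print(cleaned)
```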

In [16]:
# Assign feature (X) and target (y) variables
X = df_cleaned.drop('Diabetes_binary', axis=1)
y = df_cleaned['Diabetes_binary']

Look at X and y to make sure everything looks as expected.

In [17]:
# Preview X
X.head()

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0


In [18]:
# Preview y
y.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: Diabetes_binary, dtype: float64

In [19]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [20]:
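def model_tester(model, X, y):
    """Fit model on a fixed train/test split, then print its report and accuracy scores."""
    # Same split parameters as the cell above, so every model sees identical data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = model.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(f'Training Score: {clf.score(X_train, y_train)}')
    print(f'Testing Score: {clf.score(X_test, y_test)}')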

In [21]:
# Look at different Logistic Regression models and find the best-performing one for further tuning
model_tester(LogisticRegression(random_state=42), X, y)
model_tester(LogisticRegression(random_state=42, max_iter=500), X, y)
model_tester(LogisticRegression(random_state=42, max_iter=1000), X, y)
model_tester(LogisticRegression(random_state=42, max_iter=10000), X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


              precision    recall  f1-score   support

         0.0       0.75      0.72      0.74      8342
         1.0       0.74      0.78      0.76      8719

    accuracy                           0.75     17061
   macro avg       0.75      0.75      0.75     17061
weighted avg       0.75      0.75      0.75     17061

Training Score: 0.7386479093395858
Testing Score: 0.7475528984233046
              precision    recall  f1-score   support

         0.0       0.76      0.72      0.74      8342
         1.0       0.75      0.78      0.76      8719

    accuracy                           0.75     17061
   macro avg       0.75      0.75      0.75     17061
weighted avg       0.75      0.75      0.75     17061

Training Score: 0.7449980461117625
Testing Score: 0.7541761913135221
              precision    recall  f1-score   support

         0.0       0.76      0.72      0.74      8342
         1.0       0.75      0.78      0.76      8719

    accuracy                           0.75 

The second model (max_iter=500) is selected as the 'best' LogisticRegression model. Its testing accuracy (0.7542) improved on the model with the default max_iter (0.7476), and the accuracy scores leveled off at max_iter=1000 and 10000, suggesting diminishing returns from additional iterations.
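
The convergence warning in the output above recommends either raising `max_iter` or scaling the data. As a hedged sketch of the second option, a `StandardScaler` can be chained in front of `LogisticRegression` with `make_pipeline`. Synthetic data from `make_classification` stands in for the diabetes features here, so the scores it prints say nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (assumption: 21 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=1000, n_features=21, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

# Inside the pipeline the scaler is fit on the training split only,
# then the same transformation is applied when scoring the test split.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scaled_lr.fit(X_train, y_train)

print(f"Training Score: {scaled_lr.score(X_train, y_train)}")
print(f"Testing Score: {scaled_lr.score(X_test, y_test)}")
```

Because the pipeline is itself an estimator, it could be passed to `model_tester` in place of a bare `LogisticRegression`.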

In [22]:
# Look at different RandomForest models and find the best-performing one for further tuning
model_tester(RandomForestClassifier(random_state=42), X, y)
model_tester(RandomForestClassifier(random_state=42, bootstrap=False), X, y)
model_tester(RandomForestClassifier(random_state=42, n_estimators=200), X, y)
model_tester(RandomForestClassifier(random_state=42, n_estimators=200, bootstrap=False), X, y)
model_tester(RandomForestClassifier(random_state=42, n_estimators=500), X, y)
model_tester(RandomForestClassifier(random_state=42, n_estimators=500, bootstrap=False), X, y)

              precision    recall  f1-score   support

         0.0       0.76      0.70      0.73      8342
         1.0       0.73      0.79      0.76      8719

    accuracy                           0.74     17061
   macro avg       0.75      0.74      0.74     17061
weighted avg       0.74      0.74      0.74     17061

Training Score: 0.9999804611176241
Testing Score: 0.743801652892562
              precision    recall  f1-score   support

         0.0       0.75      0.69      0.72      8342
         1.0       0.72      0.77      0.75      8719

    accuracy                           0.73     17061
   macro avg       0.73      0.73      0.73     17061
weighted avg       0.73      0.73      0.73     17061

Training Score: 1.0
Testing Score: 0.7337201805286911
              precision    recall  f1-score   support

         0.0       0.76      0.69      0.73      8342
         1.0       0.73      0.79      0.76      8719

    accuracy                           0.75     17061
   mac

The third model, with n_estimators=200 and the default bootstrap setting, is selected as the 'best' Random Forest model because it has the highest testing accuracy score (0.7460).
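
The Random Forest outputs shown above reach a training score at or near 1.0 while the testing score stays between 0.73 and 0.75, so the gap between the two is worth tracking alongside the testing score. A sketch of collecting both into one table, assuming a hypothetical `score_model` helper and synthetic data in place of the notebook's `X` and `y`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def score_model(model, X, y):
    """Fit on a fixed train/test split and return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Synthetic stand-in for the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=600, n_features=21, random_state=42)

rows = []
for n_estimators in (10, 50):  # small values so the sketch runs quickly
    for bootstrap in (True, False):
        model = RandomForestClassifier(
            random_state=42, n_estimators=n_estimators, bootstrap=bootstrap
        )
        train_score, test_score = score_model(model, X_demo, y_demo)
        rows.append({
            "n_estimators": n_estimators,
            "bootstrap": bootstrap,
            "train": train_score,
            "test": test_score,
            "gap": train_score - test_score,
        })

# Highest testing accuracy first.
results = pd.DataFrame(rows).sort_values("test", ascending=False)
print(results)
```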

In [23]:
# Look at different AdaBoost models and find the best-performing one for further tuning
model_tester(AdaBoostClassifier(random_state=42, n_estimators=100), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200, learning_rate=0.1), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=500, learning_rate=0.1), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=1000, learning_rate=0.1), X, y)

              precision    recall  f1-score   support

         0.0       0.77      0.72      0.75      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76     17061
   macro avg       0.76      0.76      0.76     17061
weighted avg       0.76      0.76      0.76     17061

Training Score: 0.7465806955842126
Testing Score: 0.758630795381279
              precision    recall  f1-score   support

         0.0       0.77      0.72      0.75      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76     17061
   macro avg       0.76      0.76      0.76     17061
weighted avg       0.76      0.76      0.76     17061

Training Score: 0.7467956232903478
Testing Score: 0.758220502901354
              precision    recall  f1-score   support

         0.0       0.77      0.72      0.75      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76   

The model with n_estimators=200 and learning_rate=0.1 performs best. Next, we tune further by varying the learning_rate while keeping n_estimators the same.

In [24]:
# further tune based on results of previous cell
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200, learning_rate=0.1), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200, learning_rate=0.05), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200, learning_rate=0.01), X, y)

              precision    recall  f1-score   support

         0.0       0.77      0.72      0.75      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76     17061
   macro avg       0.76      0.76      0.76     17061
weighted avg       0.76      0.76      0.76     17061

Training Score: 0.7467956232903478
Testing Score: 0.758220502901354
              precision    recall  f1-score   support

         0.0       0.77      0.72      0.75      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76     17061
   macro avg       0.76      0.76      0.76     17061
weighted avg       0.76      0.76      0.76     17061

Training Score: 0.7453497459945291
Testing Score: 0.7593341539182932
              precision    recall  f1-score   support

         0.0       0.77      0.72      0.74      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76  

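The model with n_estimators=200 and learning_rate=0.1 is selected as the 'best' AdaBoost model because it has the highest testing accuracy score (0.7593). Note that 0.1 is not the library default: AdaBoostClassifier's default learning_rate is 1.0, which is what the first model in the cell above uses.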
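
The learning-rate comparison above relies on a single train/test split. A different technique, not used in this notebook, is a cross-validated grid search over the same parameter, which scores each setting on several folds of the training data. A minimal sketch on synthetic data, with a deliberately small `n_estimators` so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=21, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=42)

# Candidate learning rates: the values tried above plus the library default of 1.0.
param_grid = {"learning_rate": [1.0, 0.1, 0.05, 0.01]}
search = GridSearchCV(
    AdaBoostClassifier(random_state=42, n_estimators=20),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Testing Score: {search.score(X_test, y_test)}")
```

The test split is only touched once, after the search has picked a setting, so the reported testing score is not used for model selection.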

In [25]:
# The best of each type of model
model_tester(LogisticRegression(random_state=42, max_iter=500), X, y)
model_tester(RandomForestClassifier(random_state=42, n_estimators=200), X, y)
model_tester(AdaBoostClassifier(random_state=42, n_estimators=200, learning_rate=0.1), X, y)

              precision    recall  f1-score   support

         0.0       0.76      0.72      0.74      8342
         1.0       0.75      0.78      0.76      8719

    accuracy                           0.75     17061
   macro avg       0.75      0.75      0.75     17061
weighted avg       0.75      0.75      0.75     17061

Training Score: 0.7449980461117625
Testing Score: 0.7541761913135221
              precision    recall  f1-score   support

         0.0       0.76      0.69      0.73      8342
         1.0       0.73      0.79      0.76      8719

    accuracy                           0.75     17061
   macro avg       0.75      0.74      0.74     17061
weighted avg       0.75      0.75      0.75     17061

Training Score: 1.0
Testing Score: 0.7459703417150225
              precision    recall  f1-score   support

         0.0       0.77      0.72      0.75      8342
         1.0       0.75      0.79      0.77      8719

    accuracy                           0.76     17061
   ma

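The best-performing model of all those tested is the AdaBoost classifier. With a testing accuracy score of 0.75933, it edged out Logistic Regression (0.75418) and the Random Forest classifier (0.74597). Its recall was also at least as high as the other models' for both classes: 0.72 for the non-diabetic class (tied with Logistic Regression) and 0.79 for the diabetic class (tied with the Random Forest).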
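
The recall quoted above comes from `classification_report`. As a sketch of where that number originates, recall for each class is the diagonal of the confusion matrix divided by its row sums; `confusion_matrix` is already imported at the top of the notebook. The labels below are hypothetical toy values, not model output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels (0.0 = non-diabetic, 1.0 = diabetic).
y_true = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
y_pred = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Recall per class = correctly predicted count / total true count for that class.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall_per_class)
print(recall_score(y_true, y_pred, average=None))  # same values via scikit-learn
```

In a screening context, recall on the diabetic class (1.0) is the fraction of actual diabetics the model catches.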

The models with unscaled data performed worse than the models with scaled data on the 50/50 split dataset.