# Prerequisites

In [2]:
# Import numpy and pandas libraries to begin with
import pandas as pd
import numpy as np


The question at hand: can we predict whether a patient's diabetes status will turn out negative or positive based on the results of the various tests?

The idea is to test more intensively those patients whom a model predicts are likely to test positive for diabetes.

This is a classification problem.

The success criterion is a model that predicts whether or not a patient has diabetes with an accuracy above 85%.

# Data Importation

In [3]:
# Load the diabetes dataset and preview first few records
diabetes_df = pd.read_csv("https://bit.ly/DiabetesDS")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# Data Exploration 

In [4]:
# Check dataframe structure
diabetes_df.shape

(768, 9)

In [6]:
# Check the column datatypes
diabetes_df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [7]:
# Check whether any column contains null values
diabetes_df.isnull().any()

Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool

In [11]:
# Select and preview unique Pregnancies values
diabetes_df.Pregnancies.unique().tolist()

[6, 1, 8, 0, 5, 3, 10, 2, 4, 7, 9, 11, 13, 15, 17, 12, 14]

In [13]:
# Select and preview unique Outcomes
diabetes_df.Outcome.unique().tolist()

[1, 0]

In [12]:
# Check for duplicate rows based on all columns
diabetes_df[diabetes_df.duplicated()]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome


# Data Exploration Observations
- The dataset has 9 columns and 768 rows
- All columns are of integer or float datatype
- There are no duplicate rows
- There are no null values in any of the columns
- The first 8 columns will form the features for our analysis, while the Outcome column will be our target
- So far the dataset looks OK.
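
The checks above can be bundled into one helper for reuse. The following is a minimal sketch on a small made-up dataframe rather than the diabetes data; `summarize_frame` is a hypothetical name, not part of pandas.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the structural checks from this section into one dictionary."""
    return {
        "shape": df.shape,                             # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),     # column name -> dtype name
        "columns_with_nulls": df.columns[df.isnull().any()].tolist(),
        "duplicate_rows": int(df.duplicated().sum()),  # fully identical rows
    }


# Toy data for illustration only (hypothetical values).
toy_df = pd.DataFrame(
    {
        "Glucose": [148, 85, 85, None],
        "BMI": [33.6, 26.6, 26.6, 23.3],
        "Outcome": [1, 0, 0, 1],
    }
)

summary = summarize_frame(toy_df)
print(summary)
```

On the toy frame the helper flags one column with a null and one duplicated row, which is the kind of result the individual cells above would surface one at a time.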



# Data Cleanup

We will undertake two cleanup exercises.
- When modeling, it is important to clean the data sample so that the observations best represent the problem.
- Sometimes a dataset contains extreme values that fall outside the expected range and are unlike the rest of the data, i.e. outliers.
- Outliers are known to cause models such as linear regression to learn a biased or skewed understanding of the problem, so removing them from the training set can allow a more effective model to be learned.

Our cleanup exercises will be:
- Round the Diabetes Pedigree Function to 2 decimal places
- Remove outliers from the dataset
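
To make the outlier rule concrete, here is a minimal sketch of the interquartile range (IQR) test on a single column. The values are toy numbers chosen for illustration, not taken from the diabetes dataset; the 1.5 multiplier matches the one used in the cells that follow.

```python
import pandas as pd

# Toy values for illustration only (not taken from the diabetes dataset).
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Anything beyond 1.5 * IQR from the quartiles is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
kept = values[~is_outlier]

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

Here the quartiles are 11.0 and 12.5, so the fences sit at 8.75 and 14.75 and only the value 100 is flagged.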

In [4]:
# Round the diabetes pedigree function to 2 decimal places
diabetes_df['DiabetesPedigreeFunction'] = diabetes_df['DiabetesPedigreeFunction'].round(decimals=2)
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.63,50,1
1,1,85,66,29,0,26.6,0.35,31,0
2,8,183,64,0,0,23.3,0.67,32,1
3,1,89,66,23,94,28.1,0.17,21,0
4,0,137,40,35,168,43.1,2.29,33,1


In [5]:
# Removing outliers from the dataframe
# We first define our quartiles using the quantile() function
# ---
# 
Q1 = diabetes_df.quantile(0.25)
Q3 = diabetes_df.quantile(0.75)
IQR = Q3 - Q1
IQR

# Then select the outlier rows: those with any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
# ---
#
diabetes_df_iqr = diabetes_df[((diabetes_df < (Q1 - 1.5 * IQR)) | (diabetes_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# One way of dealing with outliers is removing them
# Check how many rows contain outliers before removing them
# ---
#
diabetes_df_iqr.shape

(128, 9)

In [7]:
# Explore the outliers before deleting them
diabetes_df_iqr

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
4,0,137,40,35,168,43.1,2.29,33,1
7,10,115,0,0,0,35.3,0.13,29,0
8,2,197,70,45,543,30.5,0.16,53,1
9,8,125,96,0,0,0.0,0.23,54,1
12,10,139,80,0,0,27.1,1.44,57,0
...,...,...,...,...,...,...,...,...,...
706,10,115,0,0,0,0.0,0.26,30,1
707,2,127,46,21,335,34.4,0.18,22,0
710,3,158,64,13,387,31.2,0.30,24,0
715,7,187,50,33,392,33.9,0.83,34,1


We will omit the 128 rows that contain outliers, so that the remaining dataset helps create a more effective model.

In [6]:
# Let's drop the outliers and retain a clean dataframe
clean_df = diabetes_df[ ~((diabetes_df < (Q1 - 1.5 * IQR)) | (diabetes_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# Checking the size of our final dataset.
clean_df.shape

(640, 9)

Our clean dataframe has 640 rows and 9 columns
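
The same boolean expression is written out twice above, once to select the outliers and once to drop them. One option is to compute the row mask a single time and reuse it. The sketch below does this on a toy dataframe; `iqr_outlier_mask` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True for rows where any column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return outside.any(axis=1)


# Toy data for illustration only (hypothetical values).
toy_df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})

mask = iqr_outlier_mask(toy_df)
outlier_rows = toy_df[mask]     # rows to inspect
clean_rows = toy_df[~mask]      # rows to keep

# Every row lands in exactly one of the two frames.
print(len(outlier_rows), len(clean_rows), len(toy_df))
```

The same accounting holds for the notebook's data: 128 outlier rows plus 640 clean rows gives the original 768.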

# Data Preparation and Modeling

# Decision Tree

Decision Tree: 1- Find the max_depth value that gives the highest accuracy for our model

In [10]:
# import DecisionTreeClassifier from sklearn & train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=123456789
)

# try each tree depth from 1 to 19 and report the validation accuracy
for depth in range(1, 20):
    # create a model with the given max_depth
    model = DecisionTreeClassifier(random_state=123456789, max_depth=depth)

    # train the model
    model.fit(features_train, target_train)

    # predict on the validation set
    predictions_valid = model.predict(features_valid)

    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7875
max_depth = 2 : 0.7625
max_depth = 3 : 0.7625
max_depth = 4 : 0.725
max_depth = 5 : 0.73125
max_depth = 6 : 0.76875
max_depth = 7 : 0.775
max_depth = 8 : 0.7875
max_depth = 9 : 0.78125
max_depth = 10 : 0.76875
max_depth = 11 : 0.7625
max_depth = 12 : 0.74375
max_depth = 13 : 0.76875
max_depth = 14 : 0.775
max_depth = 15 : 0.75
max_depth = 16 : 0.75
max_depth = 17 : 0.75
max_depth = 18 : 0.75
max_depth = 19 : 0.75
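
The scores printed above can be ranked programmatically instead of by eye. This is a small sketch in plain Python; the numbers are copied from the output above and the variable names are illustrative.

```python
# Validation accuracies copied from the max_depth sweep output above.
depth_scores = {
    1: 0.7875, 2: 0.7625, 3: 0.7625, 4: 0.725, 5: 0.73125,
    6: 0.76875, 7: 0.775, 8: 0.7875, 9: 0.78125, 10: 0.76875,
    11: 0.7625, 12: 0.74375, 13: 0.76875, 14: 0.775, 15: 0.75,
    16: 0.75, 17: 0.75, 18: 0.75, 19: 0.75,
}

best_score = max(depth_scores.values())

# Collect every depth that reaches the best score, so ties are visible.
best_depths = [depth for depth, score in depth_scores.items() if score == best_score]

print("best accuracy:", best_score)
print("depths reaching it:", best_depths)
```

Depths 1 and 8 tie at 0.7875. The validation set holds 160 rows (25% of 640), so a single row changes accuracy by 0.00625, and the differences between neighbouring depths amount to only a handful of rows.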


Implement the model with a max_depth of 8, which gives the best accuracy of about 79% (0.7875, tied with max_depth=1) when random_state is set to 123456789

In [11]:
# import DecisionTreeClassifier from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=123456789
)

# create a decision tree classifier model and set max_depth to 8
model = DecisionTreeClassifier(random_state=123456789, max_depth=8)

# train the model
model.fit(features_train, target_train)

# predict on the validation set and wrap the result in a pandas Series
predicted_valid = pd.Series(model.predict(features_valid))
# print(predicted_valid.head())

# check model accuracy (true labels first, then predictions) and print it
accuracy_valid = accuracy_score(target_valid, predicted_valid)
print(accuracy_valid)

0.7875


# Random Forest

Random Forest: 1- Find the n_estimators and max_depth values that give the highest accuracy for our model

In [12]:
# import RandomForestClassifier from sklearn & train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# sweep n_estimators from 1 to 24 and report the validation accuracy
for estimator in range(1, 25):
    # create a model with the given number of trees
    model = RandomForestClassifier(random_state=42, n_estimators=estimator)

    # train the model
    model.fit(features_train, target_train)

    # score() returns the mean accuracy on the validation set
    accuracy_valid = model.score(features_valid, target_valid)

    print("n_estimators =", estimator, ": ", end='')
    print(accuracy_valid)

# sweep max_depth from 1 to 16 and report the validation accuracy
for depth in range(1, 17):
    # create a model with the given max_depth
    model = RandomForestClassifier(random_state=42, max_depth=depth)

    # train the model
    model.fit(features_train, target_train)

    # predict on the validation set
    predictions_valid = model.predict(features_valid)

    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

n_estimators = 1 : 0.70625
n_estimators = 2 : 0.74375
n_estimators = 3 : 0.74375
n_estimators = 4 : 0.76875
n_estimators = 5 : 0.775
n_estimators = 6 : 0.775
n_estimators = 7 : 0.7875
n_estimators = 8 : 0.775
n_estimators = 9 : 0.775
n_estimators = 10 : 0.79375
n_estimators = 11 : 0.76875
n_estimators = 12 : 0.775
n_estimators = 13 : 0.7875
n_estimators = 14 : 0.79375
n_estimators = 15 : 0.79375
n_estimators = 16 : 0.80625
n_estimators = 17 : 0.7875
n_estimators = 18 : 0.79375
n_estimators = 19 : 0.8125
n_estimators = 20 : 0.8375
n_estimators = 21 : 0.825
n_estimators = 22 : 0.825
n_estimators = 23 : 0.8125
n_estimators = 24 : 0.8375
max_depth = 1 : 0.7125
max_depth = 2 : 0.76875
max_depth = 3 : 0.775
max_depth = 4 : 0.79375
max_depth = 5 : 0.80625
max_depth = 6 : 0.80625
max_depth = 7 : 0.8
max_depth = 8 : 0.80625
max_depth = 9 : 0.78125
max_depth = 10 : 0.79375
max_depth = 11 : 0.81875
max_depth = 12 : 0.8125
max_depth = 13 : 0.825
max_depth = 14 : 0.8125
max_depth = 15 : 0.825
max_d

2: Implement the model with n_estimators set to 20 and max_depth set to 16 for the best accuracy of about 84% (0.8375) when random_state is set to 42

In [28]:
# import RandomForestClassifier from sklearn & train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=42)


# create a random forest classifier model and set n_estimators to 20 and max_depth at 16
model = RandomForestClassifier(random_state=42, n_estimators=20, max_depth=16)
      
# train the model
model.fit(features_train, target_train)

# predict on the validation set and wrap the result in a pandas Series
predicted_valid = pd.Series(model.predict(features_valid))
# print(predicted_valid.head())

# check model accuracy (true labels first, then predictions) and print it
accuracy_valid = accuracy_score(target_valid, predicted_valid)
print("Accuracy:", accuracy_valid)

print(model)

# Define a new entry and predict the outcome
new_features = pd.DataFrame(
    [
        [10, 130, 50, 45, 30, 33.0, 0.321, 35],
    ],
    columns=features_train.columns
)

# Predict the outcome
Outcome = model.predict(new_features)
print("Outcome:", Outcome)


Accuracy: 0.8375
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=16, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
Outcome: [0]


Our model has an accuracy of about 84% and predicts that the sample entry would test negative for diabetes.
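
A hard 0/1 prediction for a single entry hides how confident the forest is. As a hedged illustration, the sketch below compares `predict` with `predict_proba` on synthetic data generated by `make_classification`; this is stand-in data, not the diabetes dataset, and only the hyperparameters are borrowed from the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the diabetes dataset): 200 rows, 8 numeric features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Same hyperparameters as the notebook's random forest cell.
model = RandomForestClassifier(n_estimators=20, max_depth=16, random_state=42)
model.fit(X, y)

sample = X[:1]                       # one entry, kept 2-dimensional
label = model.predict(sample)        # hard class label
proba = model.predict_proba(sample)  # one probability per class, in model.classes_ order

print("label:", label)
print("class probabilities:", proba)
```

The predicted label is the class with the highest probability, so the probabilities show how close a given entry is to the decision boundary.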

2.1  Test RandomForestRegressor model

In [15]:
from sklearn.ensemble import RandomForestRegressor 
# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# sweep n_estimators from 1 to 24
# note: for a regressor, score() returns R^2 (coefficient of determination), not accuracy
for estimator in range(1, 25):
    # create a model with the given number of trees
    model = RandomForestRegressor(random_state=42, n_estimators=estimator)

    # train the model
    model.fit(features_train, target_train)

    # R^2 on the validation set
    r2_valid = model.score(features_valid, target_valid)

    print("n_estimators =", estimator, ": ", end='')
    print(r2_valid)

# sweep max_depth from 1 to 16 and report R^2 on the validation set
for depth in range(1, 17):
    # create a model with the given max_depth
    model = RandomForestRegressor(random_state=42, max_depth=depth)

    # train the model
    model.fit(features_train, target_train)

    print("max_depth =", depth, ": ", end='')
    print(model.score(features_valid, target_valid))

n_estimators = 1 : -0.33818181818181814
n_estimators = 2 : -0.040000000000000036
n_estimators = 3 : 0.040000000000000036
n_estimators = 4 : 0.13818181818181818
n_estimators = 5 : 0.2040727272727273
n_estimators = 6 : 0.23393939393939403
n_estimators = 7 : 0.21751391465677172
n_estimators = 8 : 0.19727272727272727
n_estimators = 9 : 0.20879910213243547
n_estimators = 10 : 0.23025454545454538
n_estimators = 11 : 0.23906836964688202
n_estimators = 12 : 0.24222222222222212
n_estimators = 13 : 0.24587412587412583
n_estimators = 14 : 0.23858998144712415
n_estimators = 15 : 0.2532040404040402
n_estimators = 16 : 0.2623863636363636
n_estimators = 17 : 0.2695061340044038
n_estimators = 18 : 0.27030303030303027
n_estimators = 19 : 0.2634600856207505
n_estimators = 20 : 0.25709090909090904
n_estimators = 21 : 0.2609193980622553
n_estimators = 22 : 0.25920360631104433
n_estimators = 23 : 0.26398350231998635
n_estimators = 24 : 0.26515151515151514
max_depth = 1 : 0.20935609830693347
max_depth = 2 :

2.2 Our data is numerical, so we also implement a RandomForestRegressor model, here with n_estimators set to 3.

In [27]:
from sklearn.ensemble import RandomForestRegressor 
# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=42)

# create a regressor object
random_regressor = RandomForestRegressor(random_state =42,  n_estimators=3)

# fit the regressor with X and Y data
random_regressor.fit(features_train, target_train)

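# predicting Outcome on the validation set
random_regressor.predict(features_valid)

# score() on a regressor returns R^2 (coefficient of determination), not classification accuracy;
# the value is printed under the label "Accuracy" but should be read as R^2
accuracy = random_regressor.score(features_valid, target_valid)
print("Accuracy:", accuracy)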

# Define a new entry and predict the outcome
new_features = pd.DataFrame(
    [
        [10, 130, 50, 45, 30, 33.0, 0.321, 35],
    ],
    columns=features_train.columns
)

# Predict the outcome
Outcome = random_regressor.predict(new_features)
print("Outcome:", Outcome)


Accuracy: 0.040000000000000036
Outcome: [1.]


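For the sample entry the regressor outputs 1.0, which corresponds to a diabetic prediction. The 0.04 printed above is the R² score returned by `score()`, not an error rate, so it cannot be read as 96% accuracy: an R² of 0.04 means the regressor explains only about 4% of the variance in the validation targets.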
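
The `score` method of scikit-learn regressors returns the coefficient of determination (R²), which is a different quantity from classification accuracy. The following plain-Python sketch computes both on four toy values to show how far apart they can be; the numbers and the 0.5 cut-off are assumptions made for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def accuracy(y_true, y_pred_labels):
    """Fraction of labels predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred_labels))
    return correct / len(y_true)


# Toy targets and regressor-style outputs (hypothetical values).
y_true = [0, 0, 1, 1]
y_pred = [0.0, 1.0, 1.0, 1.0]

# Turn continuous outputs into class labels with a 0.5 cut-off (an assumption).
y_labels = [1 if p >= 0.5 else 0 for p in y_pred]

r2 = r_squared(y_true, y_pred)
acc = accuracy(y_true, y_labels)

print("R^2:", r2)        # 0.0
print("accuracy:", acc)  # 0.75
```

Here three of four labels are correct (75% accuracy) while R² is exactly 0, so a low R² neither implies a low error rate nor converts to an accuracy by subtracting it from 1.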

In [14]:
# from sklearn.model_selection import GridSearchCV
# # Create the parameter grid based on the results of random search 
# param_grid = {
#     'bootstrap': [True],
#     'max_depth': [80, 90, 100, 110],
#     'max_features': [2, 3],
#     'min_samples_leaf': [3, 4, 5],
#     'min_samples_split': [8, 10, 12],
#     'n_estimators': [100, 200, 300, 1000]
# }
# # Create a based model
# model = RandomForestClassifier(random_state=42, n_estimators=20, max_depth=16)
# # Instantiate the grid search model
# grid_search = GridSearchCV(estimator = model, param_grid = param_grid, 
#                           cv = 3, n_jobs = -1, verbose = 2)

# # Fit the grid search to the data
# grid_search.fit(features_train, target_train)
# grid_search.best_params_
# {'bootstrap': True,
#  'max_depth': 80,
#  'max_features': 3,
#  'min_samples_leaf': 5,
#  'min_samples_split': 12,
#  'n_estimators': 100}
# best_grid = grid_search.best_estimator_
# # grid_accuracy = evaluate(best_grid, features_valid, target_valid)

# print(best_grid)

Fitting 3 folds for each of 288 candidates, totalling 864 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed:  7.6min finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=80, max_features=2,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=12,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
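
The grid search cell above is commented out, although its log shows it was run: 288 parameter combinations times 3 folds gives the 864 fits reported. For readers who want to try the technique quickly, here is a minimal sketch of `GridSearchCV` with a deliberately tiny grid on synthetic data. The data come from `make_classification` and the grid values are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the diabetes dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# A deliberately tiny grid so the search finishes quickly (values are illustrative).
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [4, 8],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,                # 3-fold cross-validation, as in the cell above
    scoring="accuracy",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```

With 4 combinations and 3 folds this runs 12 fits plus one final refit, which keeps the example fast while following the same pattern as the larger search.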


# Logistic Regression

In [84]:
# import LogisticRegression from sklearn & train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['Outcome'], axis=1)
target = clean_df['Outcome']

# split the dataset into a training set and a validation set, with the validation set being 25% of the dataset
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)


# create a logistic regression model with random_state set to 12345
model = LogisticRegression(random_state=12345, solver='liblinear')

# train the model
model.fit(features_train, target_train)

# score() returns the mean accuracy on the validation set
accuracy_valid = model.score(features_valid, target_valid)

print(accuracy_valid)

0.76875


# Findings and Recommendations

- Our data is fully numerical.
- None of the records were found to have null values.
- To prevent our model from being affected by outliers, the interquartile range (IQR) was used to create a clean dataframe.
- Four models were created and assessed: a decision tree classifier, a random forest classifier, a random forest regressor, and logistic regression.
- The problem at hand was a classification problem.
- The model of choice is the RandomForestClassifier, which gave us about 84% accuracy and predicted that the sample entry would not test positive for diabetes.
- Further tuning is required to achieve the target accuracy of above 85%.
- The alternative model (RandomForestRegressor) scored 0.04, but that value is an R² score rather than an error rate, so it does not indicate 96% accuracy and is not comparable with the classifiers' accuracy.
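
The comparison behind these findings can be written down in a few lines. The sketch below uses the validation accuracies printed by the three classifier cells; the variable names are illustrative.

```python
# Validation accuracies reported by the classifier cells in this notebook.
reported_accuracy = {
    "DecisionTreeClassifier": 0.7875,
    "RandomForestClassifier": 0.8375,
    "LogisticRegression": 0.76875,
}

TARGET_ACCURACY = 0.85  # success criterion stated at the start

# Best classifier by reported accuracy, and any that clear the target.
best_model = max(reported_accuracy, key=reported_accuracy.get)
meets_target = [name for name, acc in reported_accuracy.items() if acc > TARGET_ACCURACY]
gap = TARGET_ACCURACY - reported_accuracy[best_model]

print("best model:", best_model)
print("models above target:", meets_target)
print("gap to target:", round(gap, 4))
```

No classifier clears the target, and the random forest falls short by about 0.0125, which is 2 of the 160 validation rows. The three models were scored on splits made with different random_state values, so the ranking is indicative rather than a strict like-for-like comparison.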