Notebook made for completing our assignment


Introduction: 

    Diabetes is a metabolic disease in which the body either cannot produce insulin or cannot use it effectively, depending on the type of diabetes. The disease impairs various metabolic functions and can be fatal if left untreated. Strong predictive measures are therefore needed to identify the disease early. For this reason, our project aims to answer the predictive question: “Can variables such as plasma glucose concentration, blood pressure, and BMI predict whether an individual has diabetes?”
    
    The dataset we will use to answer this question is the Pima Indians Diabetes Dataset, built from data collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset is restricted to women of Pima Indian heritage who are at least 21 years old, with the goal of limiting confounding variables. It contains several medical predictors of diabetes (skin thickness, glucose concentration, BMI, number of pregnancies, blood pressure, insulin level, age, and the diabetes pedigree function) and one binary outcome variable.


In [1]:
#Import necessary packages 
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [2]:
diabetes=pd.read_csv('data/diabetes.csv')

In [3]:
#preview dataset
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [4]:
#Changing numerical into categorical for diagnosis
diabetes["Diagnosis"] = diabetes["Outcome"].replace({
    1 : "diabetes",
    0 : "none"
})
diabetes=diabetes.drop('Outcome', axis=1)
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Diagnosis
0,6,148,72,35,0,33.6,0.627,50,diabetes
1,1,85,66,29,0,26.6,0.351,31,none
2,8,183,64,0,0,23.3,0.672,32,diabetes
3,1,89,66,23,94,28.1,0.167,21,none
4,0,137,40,35,168,43.1,2.288,33,diabetes
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,none
764,2,122,70,27,0,36.8,0.340,27,none
765,5,121,72,23,112,26.2,0.245,30,none
766,1,126,60,0,0,30.1,0.349,47,diabetes


In [5]:
#Counting zeros in each column. The dataset uses 0 in 'SkinThickness', 'BMI', 'BloodPressure', 'Glucose' and 'Insulin' to mark a missing observation (a 0 in 'Pregnancies' is a genuine value)
(diabetes == 0).astype(int).sum(axis=0)

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Diagnosis                     0
dtype: int64

Note:
Many values are missing, especially in the Insulin (374 zeros) and SkinThickness (227 zeros) columns.
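
Note:
Raw counts are easier to judge as a share of each column. The sketch below shows one way to compute that share. It uses a small made-up frame (`toy`, a hypothetical name) rather than the real dataset, so the percentages are illustrative only.

```python
import pandas as pd

# Hypothetical toy data; 0 stands in for a missing measurement, as in the dataset above.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of zero-coded entries per column.
missing_pct = (toy == 0).mean() * 100
print(missing_pct.round(1))
```

The same expression applied to `diabetes` would give the percentage of zeros in each real column.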

In [6]:
#Replacing zeroes in these columns with NaN so that we can use imputing function
cols = ["BloodPressure","Insulin","BMI","Glucose","SkinThickness"]
diabetes[cols] = diabetes[cols].replace({
    0 : np.nan})
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Diagnosis
0,6,148.0,72.0,35.0,,33.6,0.627,50,diabetes
1,1,85.0,66.0,29.0,,26.6,0.351,31,none
2,8,183.0,64.0,,,23.3,0.672,32,diabetes
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,none
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,diabetes
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63,none
764,2,122.0,70.0,27.0,,36.8,0.340,27,none
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30,none
766,1,126.0,60.0,,,30.1,0.349,47,diabetes


In [7]:
preprocessor = make_column_transformer(
    (SimpleImputer(), ["Pregnancies", "Glucose", "BloodPressure","SkinThickness","Insulin","BMI"]),
    verbose_feature_names_out=False,
    remainder='passthrough'
)
preprocessor

In [8]:
preprocessor.fit(diabetes)
imputed_diabetes = pd.DataFrame(preprocessor.transform(diabetes))
imputed_diabetes

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6.0,148.0,72.0,35.0,155.548223,33.6,0.627,50,diabetes
1,1.0,85.0,66.0,29.0,155.548223,26.6,0.351,31,none
2,8.0,183.0,64.0,29.15342,155.548223,23.3,0.672,32,diabetes
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,none
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33,diabetes
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.0,180.0,32.9,0.171,63,none
764,2.0,122.0,70.0,27.0,155.548223,36.8,0.34,27,none
765,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30,none
766,1.0,126.0,60.0,29.15342,155.548223,30.1,0.349,47,diabetes


In [9]:
clean_diabetes=imputed_diabetes.rename(columns={0:'Pregnancies',1:"Glucose",2:"BloodPressure",3:"SkinThickness",4:"Insulin",5:"BMI",6:'DiabetesPedigreeFunction',7:"Age",8:"Diagnosis"})
clean_diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Diagnosis
0,6.0,148.0,72.0,35.0,155.548223,33.6,0.627,50,diabetes
1,1.0,85.0,66.0,29.0,155.548223,26.6,0.351,31,none
2,8.0,183.0,64.0,29.15342,155.548223,23.3,0.672,32,diabetes
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,none
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33,diabetes
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.0,180.0,32.9,0.171,63,none
764,2.0,122.0,70.0,27.0,155.548223,36.8,0.34,27,none
765,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30,none
766,1.0,126.0,60.0,29.15342,155.548223,30.1,0.349,47,diabetes


In [10]:
#Splitting into training and testing data

from sklearn.model_selection import train_test_split

# set the seed
np.random.seed(10)

#use stratify to keep the same proportion of diagnoses in the training and testing sets

diabetes_train, diabetes_test = train_test_split(
    clean_diabetes, train_size=0.75, stratify=clean_diabetes["Diagnosis"]
)
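
Note:
To check that stratification does what we expect, the class proportions can be compared before and after the split. The sketch below uses made-up labels (`toy` is a hypothetical frame with 40 "diabetes" and 60 "none" rows) and passes `random_state` directly, which is an alternative to setting the global NumPy seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 40 "diabetes" and 60 "none".
toy = pd.DataFrame({
    "x": range(100),
    "Diagnosis": ["diabetes"] * 40 + ["none"] * 60,
})

# random_state pins the shuffle for this call only.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Diagnosis"], random_state=10
)

# The three sets of proportions should be (nearly) identical.
print(toy["Diagnosis"].value_counts(normalize=True))
print(toy_train["Diagnosis"].value_counts(normalize=True))
print(toy_test["Diagnosis"].value_counts(normalize=True))
```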

In [11]:
#Optional: pairwise scatter plots of all variables (slow to render, so this cell can be commented out)

columns_to_plot=["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age"]
diabetes_pairplot = alt.Chart(diabetes_train).mark_point().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    #color=alt.Color("Diagnosis").title("Diagnosis")
).properties(
    width=200,
    height=200
).repeat(
    column=columns_to_plot,
    row=columns_to_plot
)
diabetes_pairplot

In [12]:

#Comparing the distribution of each variable between the two diagnosis groups

diabetes_box=alt.Chart(diabetes_train).mark_boxplot().encode(
    alt.X("Diagnosis"),
    alt.Y(alt.repeat("row"), type="quantitative"),
).properties(
    width=200,
    height=200
).repeat(
    row=columns_to_plot
)
diabetes_box

In [13]:
#Choosing our predictor variables

predictor_cols=["Glucose", "BMI"]

In [14]:
#Finding the mean values for the predictor variables (missing values were already imputed above)
diabetes_train[predictor_cols].mean()

Glucose    121.518658
BMI         32.735446
dtype: object

In [15]:
#Visualizing our two predictor variables for the training data set
diabetes_plot=alt.Chart(diabetes_train).mark_point(opacity=0.5).encode(
    x=alt.X("Glucose").title("Glucose"),
    y=alt.Y("BMI").title("BMI"),
    color=alt.Color("Diagnosis").title("Diagnosis")
)
diabetes_plot

In [16]:
#Building our model using K=3

knn_model = KNeighborsClassifier(n_neighbors=3)

preprocessor_model = make_column_transformer(
    (StandardScaler(), ["Glucose", "BMI"]),
)

X = diabetes_train[["Glucose", "BMI"]]
y = diabetes_train["Diagnosis"]


knn_pipeline = make_pipeline(preprocessor_model, knn_model)

knn_pipeline.fit(X, y)

knn_pipeline
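
Note:
The pipeline standardizes the predictors because K-nearest neighbors relies on distances, and a variable with a wider spread (here Glucose) would otherwise dominate. The sketch below illustrates this with made-up (Glucose, BMI) pairs and plain NumPy. The values and the `distance` helper are hypothetical, and the manual standardization is meant to mirror what `StandardScaler` does (mean 0, standard deviation 1 per column).

```python
import numpy as np

# Hypothetical (Glucose, BMI) pairs, in the units used above.
points = np.array([
    [150.0, 25.0],
    [150.0, 35.0],  # same glucose, clearly different BMI
    [165.0, 25.0],  # same BMI, moderately different glucose
    [90.0, 30.0],
    [120.0, 35.0],
])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Standardize each column to mean 0 and standard deviation 1.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)

# Raw units: the BMI difference (10) looks smaller than the glucose difference (15).
print(distance(points[0], points[1]), distance(points[0], points[2]))
# Standardized units: the BMI difference is now the larger of the two.
print(distance(scaled[0], scaled[1]), distance(scaled[0], scaled[2]))
```

In this toy example the nearest neighbor of the first point changes once the columns are standardized, which is why the scaler sits inside the pipeline.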

In [17]:
#Showing table with results
# diabetes_test["predicted"] = knn_pipeline.predict(diabetes_test[["Glucose", "BMI"]])
# diabetes_test[["Diagnosis", "predicted"]]

diabetes_df=diabetes_test.assign(predicted=knn_pipeline.predict(diabetes_test[["Glucose", "BMI"]]))
diabetes_df[["Diagnosis", "predicted"]]
                                

Unnamed: 0,Diagnosis,predicted
90,none,none
327,none,diabetes
755,diabetes,none
304,none,none
480,diabetes,diabetes
...,...,...
397,diabetes,diabetes
22,diabetes,diabetes
464,none,none
61,diabetes,none


In [18]:
#Finding the score/accuracy of the model
knn_pipeline.score(
    diabetes_test[["Glucose", "BMI"]],
    diabetes_test["Diagnosis"]
)

0.7552083333333334

In [20]:
from sklearn.metrics import recall_score, precision_score

#Precision, treating "none" as the positive class
precision_score(
    y_true=diabetes_df["Diagnosis"],
    y_pred=diabetes_df["predicted"],
    pos_label="none"
)

0.8

In [21]:
#Recall, treating "none" as the positive class
recall_score(
    y_true=diabetes_df["Diagnosis"],
    y_pred=diabetes_df["predicted"],
    pos_label="none"
)

0.832

In [22]:
#Confusion Matrix
pd.crosstab(
    diabetes_df["Diagnosis"],
    diabetes_df["predicted"]
)

predicted,diabetes,none
Diagnosis,Unnamed: 1_level_1,Unnamed: 2_level_1
diabetes,41,26
none,21,104
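
Note:
The accuracy, precision, and recall reported above can all be recovered from the four counts in this confusion matrix. The sketch below redoes the arithmetic by hand and also computes precision and recall with "diabetes" as the positive class, which the cells above do not report. The variable names and the `precision_recall` helper are ours, introduced only for illustration.

```python
# Counts copied from the K=3 confusion matrix above (rows: true label, columns: predicted).
diabetes_as_diabetes = 41
diabetes_as_none = 26
none_as_diabetes = 21
none_as_none = 104

total = diabetes_as_diabetes + diabetes_as_none + none_as_diabetes + none_as_none
accuracy = (diabetes_as_diabetes + none_as_none) / total

def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for one choice of positive class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# "none" as the positive class (matches pos_label="none" above).
precision_none, recall_none = precision_recall(
    none_as_none, diabetes_as_none, none_as_diabetes
)
# "diabetes" as the positive class.
precision_diabetes, recall_diabetes = precision_recall(
    diabetes_as_diabetes, none_as_diabetes, diabetes_as_none
)

print(round(accuracy, 4))                                         # 0.7552
print(round(precision_none, 4), round(recall_none, 4))            # 0.8 0.832
print(round(precision_diabetes, 4), round(recall_diabetes, 4))    # 0.6613 0.6119
```

With "diabetes" as the positive class, recall is about 0.61, meaning that 26 of the 67 true diabetes cases in the test set were missed by the K=3 model.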


In [23]:
#Evaluating on a single validation split (one step of cross-validation)

# create the 75/25 split of the *training data* into sub-training and validation
diabetes_subtrain, diabetes_validation = train_test_split(
    diabetes_train, train_size=0.75, stratify=diabetes_train["Diagnosis"]
)

# fit the model on the sub-training data
knn_cv = KNeighborsClassifier(n_neighbors=3)
X_cv = diabetes_subtrain[["Glucose", "BMI"]]
y_cv = diabetes_subtrain["Diagnosis"]
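# clone the preprocessor so that fitting here does not refit the scaler inside knn_pipeline
from sklearn.base import clone
knn_pipeline_cv = make_pipeline(clone(preprocessor_model), knn_cv)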
knn_pipeline_cv.fit(X_cv, y_cv)

# compute the score on validation data
acc = knn_pipeline_cv.score(
    diabetes_validation[["Glucose", "BMI"]],
    diabetes_validation["Diagnosis"]
)
acc

0.6875
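
Note:
A single validation split gives only one accuracy estimate, and that estimate depends on which rows happen to land in the validation set. Cross-validation repeats the split so that every fold serves once as validation data. The sketch below shows the idea with `cross_validate` on made-up two-feature data; `X_toy`, `y_toy`, and `toy_pipe` are hypothetical stand-ins for the Glucose/BMI data and pipeline above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data standing in for Glucose and BMI.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (40, 2))])
y_toy = np.array(["none"] * 60 + ["diabetes"] * 40)

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validation: the pipeline is refit five times,
# each time scored on the fold that was held out.
cv_results = cross_validate(toy_pipe, X_toy, y_toy, cv=5)
fold_scores = cv_results["test_score"]
print(fold_scores)
print(fold_scores.mean())
```

Averaging the five fold scores gives a more stable estimate than any single split.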

In [24]:
#Checking the best # of neighbors


knn = KNeighborsClassifier()
diabetes_preprocessor = make_column_transformer(
    (StandardScaler(), predictor_cols),
)
diabetes_tune_pipe = make_pipeline(diabetes_preprocessor, knn)

In [25]:
#Performing a 5-fold grid search
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 20, 1),
}
diabetes_tune_grid = GridSearchCV(
    estimator=diabetes_tune_pipe,
    param_grid=parameter_grid,
    cv=5
)

diabetes_tune_grid.fit(
    diabetes_train[["Glucose", "BMI"]],
    diabetes_train["Diagnosis"]
)
accuracies_grid = pd.DataFrame(diabetes_tune_grid.cv_results_)

# standard error of the mean across the 5 cross-validation folds
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 5**(1/2)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k
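
Note:
The standard error of the mean over n folds is the standard deviation of the fold scores divided by the square root of n. The sketch below works through that arithmetic for one value of K. The fold accuracies are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical per-fold accuracies for one value of K (5 folds).
fold_scores = np.array([0.74, 0.78, 0.71, 0.76, 0.75])

n_folds = len(fold_scores)
mean_score = fold_scores.mean()
# np.std defaults to the population standard deviation (ddof=0).
std_score = fold_scores.std()
# Standard error of the mean: spread of the fold scores scaled by sqrt(number of folds).
sem_score = std_score / np.sqrt(n_folds)

print(mean_score, std_score, sem_score)
```

A small standard error relative to the differences between neighboring values of K suggests that those differences are not just fold-to-fold noise.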

In [26]:
#Best parameter
diabetes_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 7}
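
Note:
By default `GridSearchCV` uses `refit=True`, so after fitting it also holds a copy of the pipeline retrained on all the data it was given, using the best parameters. The sketch below shows how that fitted model could be retrieved through `best_estimator_` on made-up data; `X_toy`, `y_toy`, `toy_pipe`, and `toy_grid` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data standing in for Glucose and BMI.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (40, 2))])
y_toy = np.array(["none"] * 60 + ["diabetes"] * 40)

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
toy_grid = GridSearchCV(
    estimator=toy_pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 20)},
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

best_k = toy_grid.best_params_["kneighborsclassifier__n_neighbors"]
# Already refit on all of X_toy and y_toy with the best K (refit=True is the default).
best_model = toy_grid.best_estimator_

print(best_k, toy_grid.best_score_)
print(best_model.predict(X_toy[:3]))
```

Rebuilding the pipeline by hand with the chosen K, as in the next cell, should give an equivalent model; using `best_estimator_` simply avoids retyping the value.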

In [27]:
#Trying with the new number of neighbors

knn_spec = KNeighborsClassifier(n_neighbors=7)

preprocessor_spec = make_column_transformer(
    (StandardScaler(), ["Glucose", "BMI"]),
)

X_spec = diabetes_train[["Glucose", "BMI"]]
y_spec = diabetes_train["Diagnosis"]


pipeline_spec = make_pipeline(preprocessor_spec, knn_spec)

pipeline_spec.fit(X_spec, y_spec)

pipeline_spec

In [28]:
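#Predicting on the test set with K=7 (assign returns a new frame, avoiding a SettingWithCopyWarning)
diabetes_test = diabetes_test.assign(predicted=pipeline_spec.predict(diabetes_test[["Glucose", "BMI"]]))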
diabetes_test[["Diagnosis", "predicted"]]

Unnamed: 0,Diagnosis,predicted
90,none,none
327,none,diabetes
755,diabetes,none
304,none,none
480,diabetes,diabetes
...,...,...
397,diabetes,none
22,diabetes,diabetes
464,none,none
61,diabetes,none


In [29]:
#Confusion matrix for K=7
pd.crosstab(
    diabetes_test["Diagnosis"],
    diabetes_test["predicted"]
)

predicted,diabetes,none
Diagnosis,Unnamed: 1_level_1,Unnamed: 2_level_1
diabetes,42,25
none,16,109


In [30]:
#Accuracy of the K=7 model on the test set
pipeline_spec.score(
    diabetes_test[["Glucose", "BMI"]],
    diabetes_test["Diagnosis"]
)

0.7864583333333334