# Predictive Analysis of Diabetes Risk Factors


## Introduction: 

**Diabetes** is a metabolic disease in which the body either cannot produce insulin or cannot use it effectively, depending on the type of diabetes. The disease impedes various metabolic functions and can be fatal if left untreated. Strong predictive measures are therefore needed to identify the disease early. For this reason, our project aims to answer the predictive question: **“How do variables such as plasma glucose concentration, blood pressure, and BMI predict whether an individual has diabetes or not?”**
    
The dataset we used to answer this question is the **Pima Indians Diabetes Dataset**, which was built from data collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset is restricted to women of Pima Indian heritage who are at least 21 years old, with the goal of limiting the number of confounding variables. It contains several medical predictors of diabetes (skin thickness, glucose concentration, BMI, number of pregnancies, blood pressure, insulin level, the diabetes pedigree function, and age) and one binary outcome variable.


## Method:

In our data analysis, we used scikit-learn's **k-nearest neighbours (KNN) classification** algorithm as a pivotal component of our predictive modelling strategy. Our initial steps involved loading and exploring the Pima Indians Diabetes Dataset using pandas, where we addressed any missing values or data cleaning requirements using a preprocessor. 

To guide our **variable selection**, we employed a **preliminary analysis**, utilizing data visualizations and **correlation matrices**. These visualizations assisted in identifying patterns, relationships, and potential predictors that may significantly contribute to diabetes prediction. For example, to compare blood pressure between individuals with and without diabetes, we used **boxplots**. These visualizations displayed the distribution of blood pressure values for each group, allowing for a quick comparison of central tendency and spread. The boxplots helped identify potential differences in blood pressure that may be indicative of its relevance as a predictive variable. 

Subsequently, we split the dataset into training and testing sets and trained the KNN classifier on the training data. **Model evaluations** were then conducted using metrics such as accuracy on the testing set. This approach gave us a KNN-based classifier for the early identification of diabetes in Pima Indian women.
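
To make the classification rule concrete, here is a minimal from-scratch sketch of a k-nearest-neighbours majority vote. The `knn_predict` helper and the toy points are illustrative assumptions, not part of our analysis, which relies on scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up points with two features each (hypothetical values)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9]])
y_toy = np.array(["none", "none", "none", "diabetes", "diabetes"])

toy_prediction = knn_predict(X_toy, y_toy, np.array([2.9, 3.0]), k=3)
```

The new point sits next to the two "diabetes" points, so two of its three nearest neighbours vote "diabetes" and that label wins.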


In [None]:
#Import necessary packages 
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

The first step of our analysis involved loading the data using pandas' read_csv function:


In [None]:
diabetes=pd.read_csv('data/diabetes.csv')

In [None]:
#preview dataset
diabetes

## Data Wrangling

Our target variable, "Outcome", was a binary numeric variable which took the value 1 for a diabetes diagnosis and 0 otherwise. To improve clarity, we renamed the column to "Diagnosis" and changed it to a categorical variable with the values "diabetes" and "none".


In [None]:
#Changing numerical into categorical for diagnosis
diabetes["Diagnosis"] = diabetes["Outcome"].replace({
    1 : "diabetes",
    0 : "none"
})
diabetes=diabetes.drop('Outcome', axis=1)
diabetes

The next step in cleaning the data was to check for missing values. The dataset uses 0 to represent a missing observation in the 'SkinThickness', 'BMI', 'BloodPressure', 'Glucose' and 'Insulin' columns, so we counted the zeros in each column:


In [None]:
(diabetes == 0).astype(int).sum(axis=0)

Note:
There is a lot of missing data, especially in the Insulin and SkinThickness columns.

Since the Insulin and SkinThickness columns were missing so many values, we chose to impute these values rather than drop the affected rows. To do so, we first replaced all the missing values (represented by 0 in the dataset) with "NaN":


In [None]:
#Replacing zeroes with NaN
cols = ["BloodPressure","Insulin","BMI","Glucose","SkinThickness"]
diabetes[cols] = diabetes[cols].replace({
    0 : np.nan})
diabetes

Then, we built a preprocessor that passes the columns with missing values through the SimpleImputer() transformer, which by default fills each gap with the column mean:


In [None]:
from sklearn.impute import SimpleImputer

preprocessor = make_column_transformer(
    (SimpleImputer(), ["Pregnancies", "Glucose", "BloodPressure","SkinThickness","Insulin","BMI"]),
    verbose_feature_names_out=False,
    remainder='passthrough'
)
preprocessor

With the preprocessor defined, we fit it to the dataframe and used it to fill in the missing values:


In [None]:
preprocessor.fit(diabetes)
imputed_diabetes = pd.DataFrame(preprocessor.transform(diabetes))
imputed_diabetes
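
One caveat with this step is that fitting the imputer on the full dataframe means the fill values are computed partly from rows that later end up in the test set. A minimal sketch of an alternative, assuming synthetic data and hypothetical names (`toy`, `toy_train`, `toy_test`), fits the imputer on the training rows only and then applies it to both parts:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a column with missing readings (not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Insulin": rng.normal(120, 30, size=40)})
toy.loc[::4, "Insulin"] = np.nan  # every fourth value is missing

toy_train, toy_test = train_test_split(toy, train_size=0.75, random_state=0)

# Learn the fill value (the mean, by default) from the training rows only
imputer = SimpleImputer()
imputer.fit(toy_train)

# Apply the same training-set mean to both splits
train_filled = imputer.transform(toy_train)
test_filled = imputer.transform(toy_test)
```

In a scikit-learn pipeline the same effect can be had by placing the imputer inside the pipeline, so that it is refit on each training fold.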

The resultant dataframe had numerical column names, so we renamed them to avoid confusion and improve presentability:


In [None]:
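#Restoring the column names lost in the transformation
clean_diabetes=imputed_diabetes.rename(columns={0:'Pregnancies',1:"Glucose",2:"BloodPressure",3:"SkinThickness",4:"Insulin",5:"BMI",6:'DiabetesPedigreeFunction',7:"Age",8:"Diagnosis"})
#The passed-through "Diagnosis" strings make every column object-typed, so cast the numeric ones back to float
numeric_cols=clean_diabetes.columns.drop("Diagnosis")
clean_diabetes[numeric_cols]=clean_diabetes[numeric_cols].astype(float)
clean_diabetes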

Next, we split our dataset, using 75% for training and 25% for testing. To ensure that both datasets contain equivalent proportions of diagnoses, we used the stratify argument as well:


In [None]:
#Splitting into training and testing data

from sklearn.model_selection import train_test_split

# set the seed
np.random.seed(10)

#use stratify to make sure there is the same proportion of diagnoses throughout the testing and training set

diabetes_train, diabetes_test = train_test_split(
    clean_diabetes, train_size=0.75, stratify=clean_diabetes["Diagnosis"]
)
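
As a small illustration of what the stratify argument does, the sketch below splits made-up labels with a 65/35 class balance (hypothetical values, not our data) and compares the class shares in each part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels with a 65/35 class balance (not the real data)
toy_labels = pd.DataFrame({"Diagnosis": ["none"] * 65 + ["diabetes"] * 35})

toy_train, toy_test = train_test_split(
    toy_labels, train_size=0.75, stratify=toy_labels["Diagnosis"], random_state=10
)

# Share of each class in the two splits
train_share = toy_train["Diagnosis"].value_counts(normalize=True)
test_share = toy_test["Diagnosis"].value_counts(normalize=True)
```

Both `train_share` and `test_share` stay close to the original 65/35 balance, which is what keeps accuracy on the test set comparable to accuracy on the training set.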

## Preliminary exploratory data analysis

To explore any potential correlations between the variables, we created **pairplots**:


In [None]:
#Can comment out

columns_to_plot=["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age"]
diabetes_pairplot = alt.Chart(diabetes_train).mark_point().encode(
    alt.X(alt.repeat("row"), type="quantitative"),
    alt.Y(alt.repeat("column"), type="quantitative"),
    #color=alt.Color("Diagnosis").title("Diagnosis")
).properties(
    width=200,
    height=200
).repeat(
    column=columns_to_plot,
    row=columns_to_plot
)
diabetes_pairplot

To further explore the relationship of each variable with the outcome variable, "Diagnosis", we created a series of **boxplots** aiming to identify variables with strong explanatory power.


In [None]:
#Evaluating variables 

diabetes_box=alt.Chart(diabetes_train).mark_boxplot().encode(
    alt.X("Diagnosis"),
    alt.Y(alt.repeat("row"), type="quantitative"),
).properties(
    width=200,
    height=200
).repeat(
    row=columns_to_plot
)
diabetes_box

From the boxplots, we found that BMI and Glucose showed the starkest contrast between the interquartile ranges of patients diagnosed with diabetes and those who were not. Hence, we chose these as our predictor variables:


In [None]:
#Choosing our predictor variables

predictor_cols=["Glucose", "BMI"]

In [None]:
#Finding the mean values for the predictor variables in the training set (missing values were imputed earlier)
diabetes_train[predictor_cols].mean()

Next, we visualized the two variables using a scatterplot. Here we see that patients with diabetes tend to have **higher BMI and glucose levels**.


In [None]:
#Visualizing our two predictor variables for the training data set
diabetes_plot=alt.Chart(diabetes_train).mark_point(opacity=0.5).encode(
    x=alt.X("Glucose").title("Glucose"),
    y=alt.Y("BMI").title("BMI"),
    color=alt.Color("Diagnosis").title("Diagnosis")
)
diabetes_plot

Next, we created our KNN model using a K value of 3, standardized the predictors using a StandardScaler(), and combined the two in a pipeline. After that, we fitted the pipeline to our training dataset.


In [None]:
#Building our model using K=3

knn_model = KNeighborsClassifier(n_neighbors=3)

preprocessor_model = make_column_transformer(
    (StandardScaler(), ["Glucose", "BMI"]),
)

X = diabetes_train[["Glucose", "BMI"]]
y = diabetes_train["Diagnosis"]


knn_pipeline = make_pipeline(preprocessor_model, knn_model)

knn_pipeline.fit(X, y)

knn_pipeline
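
Standardizing matters for KNN because the algorithm relies on distances between points, so a variable measured on a wider numeric scale would otherwise dominate. The sketch below uses made-up (Glucose, BMI) pairs, assumed purely for illustration, to show how the nearest point can change once both columns are scaled:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (Glucose, BMI) pairs; glucose spans a much wider numeric range
toy = np.array([[90.0, 22.0], [150.0, 23.0], [95.0, 40.0], [160.0, 38.0]])

# Raw distances from the first point to the second and third points
raw_to_second = np.linalg.norm(toy[0] - toy[1])  # differs mostly in glucose
raw_to_third = np.linalg.norm(toy[0] - toy[2])   # differs mostly in BMI

# After standardization each column has mean 0 and unit variance
scaled = StandardScaler().fit_transform(toy)
scaled_to_second = np.linalg.norm(scaled[0] - scaled[1])
scaled_to_third = np.linalg.norm(scaled[0] - scaled[2])
```

On the raw values the third point is closer to the first, because a large BMI gap is numerically small next to the glucose gap. After scaling, the second point becomes the closer one.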

Next, we used our model to make predictions about the test dataset and assigned these to a column called “predicted”.


In [None]:
#Showing table with results
diabetes_df=diabetes_test.assign(predicted=knn_pipeline.predict(diabetes_test[["Glucose", "BMI"]]))
diabetes_df[["Diagnosis", "predicted"]]

To assess the **performance** of our model, we then found its accuracy, precision, and recall scores. 


In [None]:
#Finding the score/accuracy of the model
knn_pipeline.score(
    diabetes_test[["Glucose", "BMI"]],
    diabetes_test["Diagnosis"]
)

In [None]:
from sklearn.metrics import recall_score, precision_score

precision_score(
    y_true=diabetes_df["Diagnosis"],
    y_pred=diabetes_df["predicted"],
    pos_label="none"
)

In [None]:
recall_score(
    y_true=diabetes_df["Diagnosis"],
    y_pred=diabetes_df["predicted"],
    pos_label="none"
)
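
Precision and recall depend on which class is treated as "positive", which is what the pos_label argument controls. The following sketch, on made-up labels that are not our results, computes both metrics for each choice:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels, for illustration only
y_true = ["diabetes", "diabetes", "diabetes", "none", "none", "none", "none", "none"]
y_pred = ["diabetes", "none", "none", "none", "none", "none", "none", "diabetes"]

# Treating "diabetes" as the positive class
precision_diabetes = precision_score(y_true, y_pred, pos_label="diabetes")
recall_diabetes = recall_score(y_true, y_pred, pos_label="diabetes")

# Treating "none" as the positive class
precision_none = precision_score(y_true, y_pred, pos_label="none")
recall_none = recall_score(y_true, y_pred, pos_label="none")
```

In this toy case recall for "diabetes" is only 1/3 even though recall for "none" is 4/5, so the choice of positive class can change how a screening model looks.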

Another measure we utilized was a **confusion matrix**:


In [None]:
#Confusion Matrix
pd.crosstab(
    diabetes_df["Diagnosis"],
    diabetes_df["predicted"]
)
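
The same table can be produced with scikit-learn's `confusion_matrix`, and accuracy can be read off its diagonal. A minimal sketch on made-up labels (assumed for illustration, not our predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true = ["diabetes", "diabetes", "diabetes", "none", "none", "none", "none", "none"]
y_pred = ["diabetes", "none", "none", "none", "none", "none", "none", "diabetes"]

# Rows are true classes and columns are predicted classes, in the order of `labels`
labels = ["diabetes", "none"]
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Correct predictions lie on the diagonal
accuracy = np.trace(matrix) / matrix.sum()
```

Here the diagonal holds 1 + 4 correct predictions out of 8, giving an accuracy of 0.625.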

To validate our model without touching the test set, we split our training data **75% to 25%** once again, this time into sub-training and validation sets, and fit the model to the sub-training portion. The accuracy score after testing the model on the validation data was about **69%**.


In [None]:
#Validating on a held-out portion of the training data

# create the 75/25 split of the *training data* into sub-training and validation
diabetes_subtrain, diabetes_validation = train_test_split(
    diabetes_train, train_size=0.75, stratify=diabetes_train["Diagnosis"]
)

# fit the model on the sub-training data
knn_cv = KNeighborsClassifier(n_neighbors=3)
X_cv = diabetes_subtrain[["Glucose", "BMI"]]
y_cv = diabetes_subtrain["Diagnosis"]
knn_pipeline_cv = make_pipeline(preprocessor_model, knn_cv)
knn_pipeline_cv.fit(X_cv, y_cv)

# compute the score on validation data
acc = knn_pipeline_cv.score(
    diabetes_validation[["Glucose", "BMI"]],
    diabetes_validation["Diagnosis"]
)
acc
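
A single validation split yields one accuracy estimate that depends on which rows happen to fall in the validation set. As a hedged sketch of the multi-fold alternative, scikit-learn's `cross_validate` repeats the procedure over several folds. The data below is synthetic (generated with `make_classification`) and the names `X_toy`, `y_toy`, and `toy_pipeline` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for (Glucose, BMI); not the real dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

toy_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Five folds: each fold serves once as the validation set
cv_results = cross_validate(toy_pipeline, X_toy, y_toy, cv=5)
mean_accuracy = cv_results["test_score"].mean()
```

Averaging the five fold scores gives a steadier estimate than any single split.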

To find the K value that maximizes the accuracy of our model, we performed a **grid search with 5-fold cross-validation**. From the graph, we can see that the model yields the highest estimated accuracy at **k=7**.


In [None]:
#Checking the best # of neighbors


knn = KNeighborsClassifier()
diabetes_preprocessor = make_column_transformer(
    (StandardScaler(), predictor_cols),
)
diabetes_tune_pipe = make_pipeline(diabetes_preprocessor, knn)

In [None]:
#Performing a 5-fold grid search
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 20, 1),
}
diabetes_tune_grid = GridSearchCV(
    estimator=diabetes_tune_pipe,
    param_grid=parameter_grid,
    cv=5
)

diabetes_tune_grid.fit(
    diabetes_train[["Glucose", "BMI"]],
    diabetes_train["Diagnosis"]
)
accuracies_grid = pd.DataFrame(diabetes_tune_grid.cv_results_)

#Standard error of the mean across the 5 folds
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 5**(1/2)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k
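
The standard error attached to each accuracy estimate is the spread of the fold scores divided by the square root of the number of folds. A small worked sketch with made-up fold accuracies (not our results):

```python
import numpy as np

# Made-up accuracies from five folds (illustrative, not our results)
fold_scores = np.array([0.74, 0.78, 0.71, 0.76, 0.75])

# Standard error of the mean: standard deviation of the fold scores
# divided by the square root of the number of folds
n_folds = len(fold_scores)
sem = fold_scores.std() / np.sqrt(n_folds)
```

With 5 folds the divisor is the square root of 5. `fold_scores.std()` uses the population standard deviation, which matches how `GridSearchCV` reports `std_test_score`.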

In [None]:
#Best parameter
diabetes_tune_grid.best_params_

We therefore retrained the model with **k=7** and used the retrained model for our predictions.


In [None]:
#Trying with the new number of neighbors

knn_spec = KNeighborsClassifier(n_neighbors=7)

preprocessor_spec = make_column_transformer(
    (StandardScaler(), ["Glucose", "BMI"]),
)

X_spec = diabetes_train[["Glucose", "BMI"]]
y_spec = diabetes_train["Diagnosis"]


pipeline_spec = make_pipeline(preprocessor_spec, knn_spec)

pipeline_spec.fit(X_spec, y_spec)

pipeline_spec
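
Because `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), the tuned pipeline is also available directly as `best_estimator_`. A minimal sketch on synthetic data, with hypothetical names (`toy_grid`, `tuned_pipeline`), shows the shortcut:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data; not the real dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

toy_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 20)},
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

best_k = toy_grid.best_params_["kneighborsclassifier__n_neighbors"]

# Already refit on all of X_toy, y_toy using best_k
tuned_pipeline = toy_grid.best_estimator_
toy_predictions = tuned_pipeline.predict(X_toy[:5])
```

This avoids hard-coding the chosen K when the pipeline is rebuilt by hand.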

In [None]:
diabetes_test["predicted"] = pipeline_spec.predict(diabetes_test[["Glucose", "BMI"]])
diabetes_test[["Diagnosis", "predicted"]]

In [None]:
pd.crosstab(
    diabetes_test["Diagnosis"],
    diabetes_test["predicted"]
)

The retrained model yields an accuracy of **78.64%** on the test set.


In [None]:
pipeline_spec.score(
    diabetes_test[["Glucose", "BMI"]],
    diabetes_test["Diagnosis"]
)

## Discussion:

## References: