Notebook made for completing our assignment


# Project Proposal

### Introduction

Diabetes is a metabolic disease in which the body either cannot produce insulin or cannot use it effectively, depending on the type of diabetes. It disrupts various metabolic functions and can be fatal if left untreated. Strong predictive measures are therefore needed to identify the disease early. For this reason, our project aims to answer the predictive question: “How well do variables such as plasma glucose concentration, blood pressure, and BMI predict whether an individual has diabetes?”


To answer this question we will use the Pima Indians Diabetes Dataset, built from data collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset is restricted to women of Pima Indian heritage who are at least 21 years old, with the goal of removing as many confounding variables as possible. It contains several medical predictors of diabetes (skin thickness, glucose concentration, BMI, number of pregnancies, blood pressure, insulin level, age, and the diabetes pedigree function) and one binary outcome variable.


### Preliminary exploratory data analysis

In [4]:
#Import necessary packages 
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

In [5]:
diabetes=pd.read_csv('data/diabetes.csv')

In [6]:
#preview dataset
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [7]:
#Changing numerical into categorical for diagnosis
diabetes["Diagnosis"] = diabetes["Outcome"].replace({
    1 : "diabetes",
    0 : "none"
})
diabetes=diabetes.drop('Outcome', axis=1)
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Diagnosis
0,6,148,72,35,0,33.6,0.627,50,diabetes
1,1,85,66,29,0,26.6,0.351,31,none
2,8,183,64,0,0,23.3,0.672,32,diabetes
3,1,89,66,23,94,28.1,0.167,21,none
4,0,137,40,35,168,43.1,2.288,33,diabetes
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,none
764,2,122,70,27,0,36.8,0.340,27,none
765,5,121,72,23,112,26.2,0.245,30,none
766,1,126,60,0,0,30.1,0.349,47,diabetes


In [8]:
#Counting zeros in each column. The dataset uses 0 to mark a missing observation in
#'SkinThickness', 'BMI', 'BloodPressure', 'Glucose' and 'Insulin'; zeros in 'Pregnancies' are genuine values
(diabetes == 0).sum(axis=0)

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Diagnosis                     0
dtype: int64

Note:
There is a lot of missing data, especially in the Insulin (374 zeros) and SkinThickness (227 zeros) columns.

In [9]:
#Replacing zeros in these columns with NaN so that they can be imputed later
cols = ["BloodPressure","Insulin","BMI","Glucose","SkinThickness"]
diabetes[cols] = diabetes[cols].replace({
    0 : np.nan})
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Diagnosis
0,6,148.0,72.0,35.0,,33.6,0.627,50,diabetes
1,1,85.0,66.0,29.0,,26.6,0.351,31,none
2,8,183.0,64.0,,,23.3,0.672,32,diabetes
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,none
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,diabetes
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63,none
764,2,122.0,70.0,27.0,,36.8,0.340,27,none
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30,none
766,1,126.0,60.0,,,30.1,0.349,47,diabetes
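A quick sanity check on this replacement step can be sketched as follows. The frame `toy` and its values are made up for illustration and are not the real dataset; the idea is only that the number of NaNs after the replacement should equal the number of zeros before it, while columns where 0 is a valid value stay untouched.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "Pregnancies": [6, 1, 0, 1],  # 0 is a genuine value here, so it is left alone
})
cols = ["Glucose", "Insulin"]

# Count zeros before the replacement, then NaNs after it.
zeros_before = (toy[cols] == 0).sum()
toy[cols] = toy[cols].replace({0: np.nan})
nans_after = toy[cols].isna().sum()

# The two counts should agree column by column.
print(pd.DataFrame({"zeros_before": zeros_before, "nans_after": nans_after}))
```

Running the same comparison on `diabetes[cols]` would confirm that no zeros were missed and that `Pregnancies` was not altered.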


In [10]:
#Splitting into training and testing data

from sklearn.model_selection import train_test_split

#use stratify to make sure there is the same proportion of diagnoses throughout the testing and training set

diabetes_train, diabetes_test = train_test_split(
    diabetes, train_size=0.75, stratify=diabetes["Diagnosis"]
)
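Because the split above is random, each run of the notebook produces different training and testing sets. A minimal sketch of one way to make it reproducible is shown below, using a toy frame with made-up values. The `random_state=123` argument is an assumption added for illustration (any fixed integer would do), and the printed proportions show what `stratify` is meant to guarantee.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 3:1 class imbalance; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": range(100, 180),
    "Diagnosis": ["none"] * 60 + ["diabetes"] * 20,
})

# random_state is an assumption added here so the split is repeatable.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Diagnosis"], random_state=123
)

# With stratify, both sets should keep roughly the original class proportions.
train_props = toy_train["Diagnosis"].value_counts(normalize=True)
test_props = toy_test["Diagnosis"].value_counts(normalize=True)
print(train_props)
print(test_props)
```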

In [11]:
#Choosing our predictor variables

predictor_cols=["BloodPressure","Glucose", "BMI", "Age"]

In [12]:
#Table showing basic information on each column of our training dataset (non-null counts and dtypes)
diabetes_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 576 entries, 733 to 404
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               576 non-null    int64  
 1   Glucose                   572 non-null    float64
 2   BloodPressure             550 non-null    float64
 3   SkinThickness             405 non-null    float64
 4   Insulin                   292 non-null    float64
 5   BMI                       565 non-null    float64
 6   DiabetesPedigreeFunction  576 non-null    float64
 7   Age                       576 non-null    int64  
 8   Diagnosis                 576 non-null    object 
dtypes: float64(6), int64(2), object(1)
memory usage: 45.0+ KB


In [13]:
#Finding the mean values for the predictor variables (NaNs not included)
diabetes_train[predictor_cols].mean(numeric_only=True)

BloodPressure     72.550909
Glucose          121.510490
BMI               32.468673
Age               33.269097
dtype: float64
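Overall means do not show whether a predictor differs between the two diagnosis groups. A possible follow-up, sketched here on a small made-up frame rather than the real training set, is to compute the means per group with `groupby`. As with `mean` above, NaNs are skipped by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diabetes_train; values are illustrative only.
toy_train = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, np.nan],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Diagnosis": ["diabetes", "none", "diabetes", "none", "diabetes"],
})

# Mean of each predictor within each diagnosis group (NaNs are ignored).
group_means = toy_train.groupby("Diagnosis")[["Glucose", "BMI"]].mean()
print(group_means)
```

On the real data the equivalent call would be `diabetes_train.groupby("Diagnosis")[predictor_cols].mean()`.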

In [14]:
#Visualizing two of our predictor variables for the training data set
#(the testing set is kept aside so that it does not influence our modelling choices)
diabetes_plot=alt.Chart(diabetes_train).mark_point(opacity=0.5).encode(
    x=alt.X("Glucose").title("Glucose"),
    y=alt.Y("BMI").title("BMI"),
    color=alt.Color("Diagnosis").title("Diagnosis")
)
diabetes_plot

In [15]:
#Visualizing two more of our predictor variables for the training data set
diabetes_plot2=alt.Chart(diabetes_train).mark_point(opacity=0.5).encode(
    x=alt.X("BloodPressure").title("Blood Pressure"),
    y=alt.Y("Age").title("Age"),
    color=alt.Color("Diagnosis").title("Diagnosis")
)
diabetes_plot2

### Methods

We will use **scikit-learn's k-nearest neighbours (KNN) classification** algorithm as the core of our predictive model. We first load and explore the Pima Indians Diabetes Dataset with pandas, then handle **missing values** and any other data cleaning with a **preprocessor**.
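One way such a preprocessor could be wired to the classifier is sketched below on synthetic data. Several choices here are assumptions rather than decisions taken in this proposal: mean imputation (`strategy="mean"`), standardisation of every predictor, and the placeholder value `n_neighbors=5`. Scaling matters for KNN because the algorithm relies on distances, so predictors measured on larger scales would otherwise dominate.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

predictor_cols = ["BloodPressure", "Glucose", "BMI", "Age"]

# Synthetic stand-in for the training data; values are not real measurements.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(loc=[72, 120, 32, 33], scale=[12, 30, 7, 11], size=(60, 4)),
    columns=predictor_cols,
)
X.iloc[::10, 1] = np.nan  # knock out some Glucose values to mimic missing data
y = np.where(X["BMI"] > 32, "diabetes", "none")  # made-up labels

# Impute missing values, then standardise, for every predictor column.
preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), predictor_cols),
)

# n_neighbors=5 is only a placeholder; it would normally be tuned.
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X, y)

predictions = knn_pipeline.predict(X.head())
print(predictions)
```

Keeping imputation and scaling inside the pipeline means they are fitted on the training data only and then reapplied unchanged to new observations.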


To guide our **variable selection**, we will carry out a preliminary analysis using data **visualizations** and **correlation matrices**. These will help identify patterns, relationships, and predictors that may contribute substantially to diabetes prediction. For example, to compare **blood pressure** between individuals with and without diabetes, we will use **boxplots**. These display the distribution of blood pressure values for each group, allowing a quick comparison of central tendency and spread. Any clear difference between the groups would suggest that blood pressure is a relevant predictive variable.
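A correlation matrix of the kind mentioned above could be computed as in the following sketch. It uses synthetic data, and the 0/1 column `HasDiabetes` is a hypothetical helper introduced only so that the label can enter the matrix; it is not a column of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; values are not real measurements.
rng = np.random.default_rng(1)
glucose = rng.normal(120, 30, size=200)
toy_train = pd.DataFrame({
    "Glucose": glucose,
    "BMI": rng.normal(32, 7, size=200),
    "Diagnosis": np.where(glucose > 130, "diabetes", "none"),  # made-up labels
})

# Hypothetical helper column: encode the label as 0/1 so it can be correlated.
toy_train["HasDiabetes"] = (toy_train["Diagnosis"] == "diabetes").astype(int)

# Pairwise Pearson correlations; NaNs are excluded pairwise by pandas.
corr = toy_train[["Glucose", "BMI", "HasDiabetes"]].corr()
print(corr.round(2))
```

Predictors with a stronger correlation to the encoded label are candidates for the model, although correlation only captures linear association and should be read alongside the plots.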


We split the dataset into training and testing sets and will train the KNN classifier on the training data only. **Model evaluation** will be conducted on the testing set using metrics such as the **accuracy score**. With this approach, we aim to build a robust and effective model for early identification of diabetes in Pima Indian women.
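The training and evaluation steps could look roughly like the sketch below, which runs on synthetic data. The grid of odd `n_neighbors` values, the 5-fold cross-validation, and `random_state=123` are assumptions made for illustration. Tuning the number of neighbours by cross-validation on the training set keeps the testing set untouched until the final accuracy estimate.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset; values and labels are made up.
rng = np.random.default_rng(2)
n = 300
glucose = rng.normal(120, 30, size=n)
bmi = rng.normal(32, 7, size=n)
X = pd.DataFrame({"Glucose": glucose, "BMI": bmi})
score = (glucose - 120) / 30 + (bmi - 32) / 7 + rng.normal(0, 0.5, size=n)
y = np.where(score > 0.5, "diabetes", "none")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=123
)

# Choose the number of neighbours by 5-fold cross-validation on the training set.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Evaluate the refitted best model once on the held-out testing set.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)
cm = confusion_matrix(y_test, grid.predict(X_test), labels=["diabetes", "none"])
print(best_k, test_accuracy)
print(cm)
```

The confusion matrix complements accuracy by separating missed diabetes cases from false alarms, which matters when the two kinds of error have different costs.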


### Expected outcomes and significance

Based on our preliminary analysis, we believe that __________________ are likely to be the best predictors of diabetes. Given that early diagnosis of diabetes can vastly reduce the risk of devastating complications, an accurate predictive model has the power to improve and save countless lives. An interesting avenue for future analysis would be to test whether our findings about the predictive power of these variables generalise to other segments of the population.