Notebook made for completing our assignment


# Project Proposal

### Introduction

Diabetes is a metabolic disease in which the body either cannot produce insulin or cannot use it effectively, depending on the type of diabetes. The disease impairs various metabolic functions and can be fatal if left untreated. Strong predictive measures are therefore needed to identify it early. For this reason, our project aims to answer the predictive question: “how well do variables such as plasma glucose concentration, blood pressure, and BMI predict whether an individual has diabetes or not?”


The dataset we will use to answer this question is the Pima Indians Diabetes Dataset, built from data collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It is restricted to women of Pima Indian heritage who are at least 21 years old, with the goal of limiting the influence of confounding variables. It contains several medical predictors of diabetes (including skin thickness, glucose concentration, BMI, number of pregnancies, blood pressure, insulin levels, and the diabetes pedigree function) and one binary (0/1) outcome variable.


### Preliminary exploratory data analysis

In [None]:
#Import necessary packages 
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

In [None]:
diabetes=pd.read_csv('data/diabetes.csv')

In [None]:
#preview dataset
diabetes.head()

In [None]:
#Changing numerical into categorical for diagnosis
diabetes["Diagnosis"] = diabetes["Outcome"].replace({
    1 : "diabetes",
    0 : "none"
})
diabetes=diabetes.drop('Outcome', axis=1)
diabetes

In [None]:
#Counting zeros in each column. The dataset uses 0 for a missing observation in 'SkinThickness', 'BMI', 'BloodPressure', 'Glucose' and 'Insulin'; zeros in 'Pregnancies' are genuine values
(diabetes == 0).astype(int).sum(axis=0)

Note:
There is a lot of missing data, especially in the insulin and skin thickness columns. 

In [None]:
#Replacing zeros in these columns with NaN so that they can be imputed later
cols = ["BloodPressure","Insulin","BMI","Glucose","SkinThickness"]
diabetes[cols] = diabetes[cols].replace({
    0 : np.nan})
diabetes

In [None]:
#Splitting into training and testing data

from sklearn.model_selection import train_test_split

#use stratify to keep the same proportion of diagnoses in the training and testing sets;
#random_state is an arbitrary seed that makes the split reproducible

diabetes_train, diabetes_test = train_test_split(
    diabetes, train_size=0.75, stratify=diabetes["Diagnosis"], random_state=123
)

In [None]:
#Choosing our predictor variables

predictor_cols=["BloodPressure","Glucose", "BMI", "Age"]

In [None]:
#Table to show basic information on each column of our training dataset (number of observations, non-null count and data type per column)
diabetes_train.info()

In [None]:
#Finding the mean values for the predictor variables (NaNs not included)
diabetes_train[predictor_cols].mean(numeric_only=True)

In [None]:
#Visualizing two of our predictor variables, using the training set so the testing set stays unseen
diabetes_plot=alt.Chart(diabetes_train).mark_point(opacity=0.5).encode(
    x=alt.X("Glucose").title("Glucose"),
    y=alt.Y("BMI").title("BMI"),
    color=alt.Color("Diagnosis").title("Diagnosis")
)
diabetes_plot

In [None]:
#Visualizing two more of our predictor variables, again using the training set
diabetes_plot2=alt.Chart(diabetes_train).mark_point(opacity=0.5).encode(
    x=alt.X("BloodPressure").title("Blood Pressure"),
    y=alt.Y("Age").title("Age"),
    color=alt.Color("Diagnosis").title("Diagnosis")
)
diabetes_plot2

### Methods

In our data analysis, we will use **scikit-learn's k-nearest neighbours (KNN) classification** algorithm as the core of our predictive modelling strategy. We first load and explore the Pima Indians Diabetes Dataset using pandas, then handle **missing values** and other data cleaning requirements using a **preprocessor**.
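
As a rough illustration of what such a preprocessor could look like, the sketch below fills each missing value with its column mean and then standardizes the predictors. The mean strategy and the use of `make_pipeline` are assumptions on our part, and the `toy` rows are made up for illustration; they are not values from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration only; NOT values from the real dataset.
toy = pd.DataFrame({
    "Glucose": [150.0, 90.0, np.nan, 110.0],
    "BloodPressure": [80.0, np.nan, 70.0, 66.0],
    "BMI": [34.0, 27.0, 31.0, np.nan],
    "Age": [50, 30, 40, 25],
})
predictor_cols = ["BloodPressure", "Glucose", "BMI", "Age"]

# Assumption: fill each NaN with its column mean, then standardize
# every predictor so that no single scale dominates KNN distances.
preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), predictor_cols),
)

transformed = preprocessor.fit_transform(toy)
```

Standardizing matters for KNN because the classifier relies on distances, so a predictor measured on a larger scale would otherwise carry more weight.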


To guide our **variable selection**, we will carry out a preliminary analysis using data **visualizations** and **correlation matrices**. These will help us identify patterns, relationships, and potential predictors that may contribute significantly to diabetes prediction. For example, to compare **blood pressure** between individuals with and without diabetes, we will use **boxplots**. They display the distribution of blood pressure values for each group, allowing a quick comparison of central tendency and spread, and will help show whether differences in blood pressure make it a relevant predictive variable.
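
The numbers behind these two tools can be sketched with pandas alone. The example below computes a correlation matrix and the per-group quartiles that a boxplot would draw. The `toy` rows are made up for illustration and are not values from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; NOT values from the real dataset.
toy = pd.DataFrame({
    "Glucose": [150, 90, 180, 95, 140, 110],
    "BloodPressure": [80, 70, 76, 68, 90, 72],
    "BMI": [34.0, 27.0, 31.0, 25.0, 40.0, 26.0],
    "Diagnosis": ["diabetes", "none", "diabetes", "none", "diabetes", "none"],
})

# Pairwise Pearson correlations between the numeric predictors.
corr_matrix = toy[["Glucose", "BloodPressure", "BMI"]].corr()

# The quartiles a boxplot draws, computed separately for each diagnosis group.
bp_quartiles = (
    toy.groupby("Diagnosis")["BloodPressure"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
```

Here `bp_quartiles` has one row per diagnosis group and one column per quantile, so the medians and interquartile ranges of the two groups can be compared side by side.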


Subsequently, we will split the dataset into training and testing sets and train the KNN classifier on the training data. **Model evaluation** will be conducted on the testing set using metrics such as the **accuracy score**. With this approach, we aim to build a robust and effective model for early identification of diabetes in Pima Indian women, drawing on the strengths of the KNN algorithm for classification.
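
A minimal sketch of this workflow is shown below, under several assumptions: the data are synthetic stand-ins rather than the real dataset, missing values are filled with column means, and the number of neighbours is tuned with 5-fold cross-validation over an arbitrary grid. The grid, the seeds, and names such as `knn_pipeline` are illustrative choices, not a finalized design.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only; NOT the real dataset.
rng = np.random.default_rng(0)
n = 200
labels = rng.choice(["diabetes", "none"], size=n)
shift = (labels == "diabetes").astype(float)
toy = pd.DataFrame({
    "Glucose": 100 + 40 * shift + rng.normal(0, 15, n),
    "BMI": 28 + 5 * shift + rng.normal(0, 4, n),
    "Diagnosis": labels,
})
toy.loc[::25, "BMI"] = np.nan  # inject a few missing values

X = toy[["Glucose", "BMI"]]
y = toy["Diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# Impute, standardize, then classify; all steps are fit on training data only.
knn_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler(), KNeighborsClassifier()
)

# Assumed grid of odd k values, scored by cross-validated accuracy.
param_grid = {"kneighborsclassifier__n_neighbors": range(1, 16, 2)}
knn_grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_grid.fit(X_train, y_train)

best_k = knn_grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = knn_grid.score(X_test, y_test)  # accuracy on the held-out set
```

Putting the imputer and scaler inside the pipeline means they are refit within each cross-validation fold, so no information from the validation folds or the testing set influences the preprocessing.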


### Expected outcomes and significance

To be finalized.