In [1]:
import pandas as pd
import pickle
from myst_nb import glue

In [2]:
# Define variables with glue here
knn_tree_cross_val = pd.read_csv("../results/tables/cross_val_results.csv")
glue("knn_tree_cross_val", knn_tree_cross_val, display=False)

## Summary
This project develops a predictive classification model for determining an individual's diabetic status and compares the performance of logistic regression and k-nearest neighbours (k-nn) algorithms. The dataset used in this analysis was collected through the Behavioral Risk Factor Surveillance System (BRFSS) by the Centers for Disease Control and Prevention (CDC) for the year 2015. The feature with the largest influence on the prediction is High Blood Pressure (HighBP), which has a coefficient of 0.354 in the logistic regression model. In addition to the logistic regression model, we also explored a k-nearest neighbours model and a decision tree model for predicting diabetes. The optimized logistic regression model achieves a test score of 0.728, while the k-nn model yields a test score of 0.746. Both test scores are relatively close to their validation scores, which suggests that the models will generalize well to unseen data; however, there is still room for improvement in the test scores.
 

## Introduction
Diabetes mellitus, commonly referred to as diabetes, is a disease that impairs the body’s control of blood glucose levels (Sapra, Bhandari 2023). There are different types of diabetes, although we do not explore this distinction in this project (Sapra, Bhandari 2023). Diabetes is a manageable disease thanks to the discovery of insulin in 1922. Globally, 1 in 11 adults have diabetes (Sapra, Bhandari 2023). As such, understanding the factors that are strongly related to diabetes can help researchers studying how to better prevent or manage the disease. In this project, we create several machine learning models to predict diabetes in a patient and evaluate the success of these models. We also explore the coefficients of a logistic regression model to better understand the factors associated with diabetes.

## Methods

### Data
The dataset used in this project is a collection of the Centers for Disease Control and Prevention (CDC) diabetes health indicators collected in response to the CDC's BRFSS 2015 survey. The data were sourced from the UCI Machine Learning Repository (Burrows, Hora, Geiss, Gregg, and Albright 2017) and can be found [here](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators). The file used from this repository for this analysis includes 70,692 survey responses, in which 50% of the respondents reported having either prediabetes or diabetes. Each row in the dataset represents a recorded survey response, including whether or not the respondent has diabetes or prediabetes, and a collection of 21 other diabetes health indicators identified by the CDC.

### Analysis
To determine the best model for classifying a patient as having diabetes or prediabetes versus having neither, we performed hyperparameter optimization on both a k-nn model and a decision tree model. We also explored a logistic regression model to gain insight into which features may contribute most to a classification of diabetes. All features from the original dataset were included in each model. In all cases, the data were split into training and testing datasets, with 80% of the data designated as training and 20% as testing. The data were preprocessed such that all continuous (non-binary) variables were scaled using scikit-learn's StandardScaler. Model performance was assessed using a 10-fold cross validation score. Feature importance was investigated using the coefficients generated by the logistic regression algorithm. The k-nn algorithm's hyperparameter K was optimized using the F1 score as the classification metric. The Python programming language (Van Rossum and Drake 2009) was used for all analysis. The following Python packages were used for this analysis: Pandas (McKinney 2010), altair (VanderPlas, 2018), and scikit-learn (Pedregosa et al. 2011).
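
The preprocessing and evaluation steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual script: the data values, the small set of columns (`BMI`, `Age`, `HighBP`, `Smoker`), the target name `Diabetes_binary`, the `random_state` values, and `max_iter` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data (hypothetical columns and values).
rng = np.random.default_rng(123)
n = 500
df = pd.DataFrame({
    "BMI": rng.normal(28, 6, n),       # continuous
    "Age": rng.integers(1, 14, n),     # ordinal age bracket, treated as non-binary
    "HighBP": rng.integers(0, 2, n),   # binary
    "Smoker": rng.integers(0, 2, n),   # binary
})
# Make the target depend loosely on HighBP and BMI so there is signal to learn.
logit = 0.8 * df["HighBP"] + 0.05 * (df["BMI"] - 28) - 0.4
df["Diabetes_binary"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = df.drop(columns="Diabetes_binary")
y = df["Diabetes_binary"]

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Scale only the non-binary columns; pass binary columns through unchanged.
continuous = ["BMI", "Age"]
preprocessor = make_column_transformer(
    (StandardScaler(), continuous),
    remainder="passthrough",
)

pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# 10-fold cross validation on the training data only.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=10)

# Fit once on the full training set and inspect the coefficients.
pipe.fit(X_train, y_train)
# The transformer outputs the scaled columns first, then the passthrough
# columns in their original order.
feature_names = continuous + [c for c in X.columns if c not in continuous]
coefs = pd.Series(pipe[-1].coef_[0], index=feature_names)
coefs = coefs.sort_values(key=abs, ascending=False)
```

Wrapping the scaler and the model in one pipeline means the scaler is refit on each training fold during cross validation, so no information from the validation fold leaks into preprocessing.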

### Results & Discussion

Using a random search, we performed hyperparameter optimization on a k-nn model and a decision tree model. 
The table below ({numref}`Figure {number} <knn_tree_cross_val>`) shows the results of performing cross validation on the best performing k-nn and decision tree models, respectively. 

```{glue:figure} knn_tree_cross_val
:figwidth: 400px
:name: "knn_tree_cross_val"

Cross validation results for the best performing models found by a random search hyperparameter optimization. 
```
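
A random search of this kind can be sketched roughly as follows. This is an illustration on synthetic data rather than the project's code: the search ranges, the number of iterations, the choice of `max_depth` as the tuned decision tree hyperparameter, the use of F1 for the decision tree, and the `random_state` values are all assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the survey responses.
X, y = make_classification(n_samples=400, n_features=8, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Each candidate pairs a model with the distribution its hyperparameter
# is sampled from (ranges are hypothetical).
candidates = {
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": randint(1, 50)},
    ),
    "decision_tree": (
        DecisionTreeClassifier(random_state=123),
        {"max_depth": randint(1, 20)},
    ),
}

results = {}
for name, (model, param_dist) in candidates.items():
    search = RandomizedSearchCV(
        model,
        param_dist,
        n_iter=10,        # number of sampled hyperparameter settings
        scoring="f1",     # F1 score as the selection metric
        cv=10,            # 10-fold cross validation
        random_state=123,
    )
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "mean_cv_f1": search.best_score_,
    }
```

Here `results` holds, for each model, the sampled setting with the highest mean cross validation F1 score, which is the kind of summary the table above reports.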

### References

{bibliography}