# Diabetes Prediction

In [None]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from ucimlrepo import fetch_ucirepo

# Lift Altair's default row limit so the full dataset can be plotted
alt.data_transformers.disable_max_rows()

## Summary

## Introduction

Approximately 10% of people in Canada and the USA live with diabetes: about 3.7 million people in Canada (2023) and about 38.4 million people in the USA (2021). In the USA, diabetes is the 8th leading cause of death.

In this project we try to predict diabetes from common health factors. A reliable model could help prescreen people and recommend a follow-up with a physician for those who are at risk.

## Methods and Results

The analysis uses the CDC Behavioral Risk Factor Surveillance System (BRFSS) 2015 Diabetes Health Indicators dataset (UCI ID 891), containing 253,680 survey responses with 21 health-related features and a binary diabetes outcome (0 = no diabetes, 1 = prediabetes or diabetes).  
No missing values were present and all features were already encoded numerically. The target classes are heavily imbalanced (≈86% non-diabetic, ≈14% diabetic).
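
The checks described above (no missing values, numeric encoding, class balance) can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the BRFSS data; the name `toy` and its values are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "BMI": [22, 31, 27, 35, 24, 29, 40],
    "Age": [3, 9, 7, 11, 4, 8, 10],
    "diabetes": [0, 0, 0, 1, 0, 0, 0],
})

# Total number of missing cells across the whole frame
n_missing = int(toy.isna().sum().sum())

# True only if every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

# Share of each class in the target column
class_share = toy["diabetes"].value_counts(normalize=True).sort_index()

print("missing cells:", n_missing)
print("all numeric:", all_numeric)
print(class_share.round(3))
```

The same three expressions could be applied to the combined feature and target frame built later in the notebook.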

### EDA
Group-wise mean differences revealed the strongest risk factors for diabetes:
- PhysHlth (days of poor physical health)
- BMI
- Age
- MentHlth (days of poor mental health)
- GenHlth (self-rated general health)

Weakest factors:
- HvyAlcoholConsump
- Fruits
- Veggies
- PhysActivity
- Education
- Income

Box plots of the top five predictors clearly separate the diabetic and non-diabetic groups.
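
Raw mean differences depend on each feature's scale and sign, so a feature measured in days can outrank one measured on a 1 to 5 scale, and a feature that is lower in the diabetic group sorts to the bottom. One possible refinement, sketched here on made-up data, is to rank features by the absolute standardized mean difference. The helper `standardized_mean_diff` and the toy values are hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(frame, target):
    """Absolute difference in group means, in units of the pooled standard deviation."""
    pos = frame[frame[target] == 1].drop(columns=target)
    neg = frame[frame[target] == 0].drop(columns=target)
    pooled_sd = np.sqrt((pos.var(ddof=1) + neg.var(ddof=1)) / 2)
    smd = (pos.mean() - neg.mean()) / pooled_sd
    return smd.abs().sort_values(ascending=False)

# Hypothetical toy data: BMI is higher and Income lower in the positive group
toy = pd.DataFrame({
    "BMI":      [22, 24, 26, 28, 30, 32, 34, 36],
    "Income":   [8, 7, 8, 6, 4, 3, 4, 2],
    "Noise":    [1, 2, 1, 2, 1, 2, 1, 2],
    "diabetes": [0, 0, 0, 0, 1, 1, 1, 1],
})

ranking = standardized_mean_diff(toy, "diabetes")
print(ranking)
```

In this toy example the feature with a negative raw difference (`Income`) ranks first once the sign and scale are removed, which is the kind of effect a signed, unscaled ranking can hide.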

### Modeling Approach
The data were split 70/30 into training and test sets with stratification on the target.  
Two classifiers were trained and tuned using 5-fold cross-validated grid search with **ROC AUC** as the scoring metric (more appropriate than accuracy given the class imbalance):

1. **Decision Tree** (class_weight='balanced')  
   Hyperparameters: max_depth ∈ {8,10,12,14}, min_samples_leaf ∈ {10,20,50}  
   **Best parameters**: max_depth=8, min_samples_leaf=50  
   **Best CV AUC** = 0.8169

2. **k-Nearest Neighbours** (with StandardScaler preprocessing)  
   Hyperparameters: n_neighbors ∈ {5,11,21,31,41,51}  
   **Best parameters**: n_neighbors=51  
   **Best CV AUC** = 0.8118
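
To illustrate why ROC AUC is used instead of accuracy under class imbalance, the sketch below fits a majority-class baseline on synthetic labels with roughly the class balance reported above. The data are made up; `DummyClassifier` is scikit-learn's built-in baseline estimator and is not one of the models tuned in this analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 86% negative, 14% positive (an assumption for illustration)
y_toy = np.array([0] * 86 + [1] * 14)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

acc = accuracy_score(y_toy, baseline.predict(X_toy))
auc = roc_auc_score(y_toy, baseline.predict_proba(X_toy)[:, 1])

print("baseline accuracy:", acc)
print("baseline ROC AUC:", auc)
```

A model that never predicts diabetes scores high on accuracy here but only chance level on ROC AUC, so AUC is the more informative target for tuning.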

### Results

| Model         | Test Accuracy | Test AUC   |
|---------------|---------------|------------|
| Decision Tree | 0.7272        | **0.8169** |
| KNN (k=51)    | **0.8639**    | 0.8118     |
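
When reading a table like this, per-class metrics such as recall can complement accuracy, because two models with similar accuracy may differ greatly in how many positive cases they catch. The sketch below uses made-up labels and two hypothetical prediction vectors; the numbers are not results from this analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical ground truth: 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

predictions = {
    "cautious": np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),  # never flags a case
    "eager":    np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]),  # flags more, catches all cases
}

metrics = {}
for name, y_pred in predictions.items():
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    metrics[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "false_positives": int(fp),
        "false_negatives": int(fn),
    }
    print(name, metrics[name])
```

For a prescreening use case, missed cases (false negatives) are arguably more costly than extra follow-ups, so a model with lower accuracy but higher recall may still be the more useful one.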


### Load Data

In [None]:
# fetch dataset 
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
# data (as pandas dataframes) 
X = cdc_diabetes_health_indicators.data.features 
y = cdc_diabetes_health_indicators.data.targets 


### Data Wrangling

In [None]:
# No major cleaning needed: the dataset is already clean.
# Combine features and target to get an overview of the full data set
df = X.copy()
df['diabetes'] = y

# Quick info
df.info()
df.head()

### Data Summary

In [None]:
df.describe()

The dataset has many features, and not all of them are equally useful, so we want to select the few with the highest impact. To identify the top risk factors, we compare the average value of each feature for people with and without diabetes. The features whose averages differ most between the two groups are the ones most strongly associated with diabetes.

A positive 'difference' value means the feature has a higher average value in people with diabetes.

In [None]:
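# Mean of every feature per class; transpose so that features become rows
summary = df.groupby('diabetes').mean().T
# Diabetic mean minus non-diabetic mean for each feature
summary['difference'] = summary[1] - summary[0]
summary.sort_values('difference', ascending=False)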

### Visualizations

In [None]:
# EDA on count of diabetes records
alt.Chart(df).mark_bar().encode(
    x=alt.X('diabetes:N', title='Has Diabetes'),
    y='count()'
).properties(title='Diabetes Prevalence in Dataset')

Based on the analysis above, we can drop the weakest features.

In [None]:
np.random.seed(522)

drop_cols = ['HvyAlcoholConsump', 'Fruits', 'Veggies','PhysActivity','Education','Income']

df_clean = df.drop(columns=drop_cols)

In [None]:
alt.data_transformers.enable('vegafusion')

top5 = ['PhysHlth', 'BMI', 'Age', 'MentHlth', 'GenHlth']
plot_data = df_clean.melt(id_vars='diabetes', value_vars=top5)

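alt.Chart(plot_data).mark_boxplot().encode(
    x='diabetes:N',
    y='value:Q',
    color='diabetes:N'
).facet(
    column='variable:N'
).resolve_scale(
    y='independent'  # the features have very different ranges, so give each panel its own y-axis
).properties(title='Top 5 Predictors: Diabetic vs Non-Diabetic')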



In [None]:
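from pathlib import Path

# Split the data 70/30, stratified on the target
train_df, test_df = train_test_split(
    df_clean, test_size=0.3, random_state=522, stratify=df_clean['diabetes']
)

# Save processed data (create the output directory if it does not exist yet)
out_dir = Path("../data/processed")
out_dir.mkdir(parents=True, exist_ok=True)
train_df.to_csv(out_dir / "diabetes_train.csv", index=False)
test_df.to_csv(out_dir / "diabetes_test.csv", index=False)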

In [None]:
X_train = train_df.drop('diabetes', axis=1)
y_train = train_df['diabetes']
X_test  = test_df.drop('diabetes', axis=1)
y_test  = test_df['diabetes']

### Classification Analysis

In [None]:
# class_weight='balanced' weights classes inversely to their frequency
tree = DecisionTreeClassifier(random_state=522, class_weight='balanced')

tree_params = {
    'max_depth': [8, 10, 12, 14],
    'min_samples_leaf': [10, 20, 50]
}

tree_grid = GridSearchCV(tree, tree_params, cv=5, scoring='roc_auc', n_jobs=-1)
tree_grid.fit(X_train, y_train)

best_tree = tree_grid.best_estimator_
print("Best Decision Tree params:", tree_grid.best_params_)
print("Best CV AUC:", tree_grid.best_score_.round(4))

In [None]:
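# KNN is distance-based, so standardize every feature before fitting
knn_preprocessor = make_column_transformer(
    (StandardScaler(), X_train.columns)
)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold
knn_pipe = make_pipeline(
    knn_preprocessor,
    KNeighborsClassifier(n_jobs=-1)
)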

knn_params = {'kneighborsclassifier__n_neighbors': [5, 11, 21, 31, 41, 51]}

knn_grid = GridSearchCV(knn_pipe, knn_params, cv=5, scoring='roc_auc', n_jobs=-1)
knn_grid.fit(X_train, y_train)

best_knn = knn_grid.best_estimator_
print("Best KNN k:", knn_grid.best_params_)
print("Best CV AUC:", knn_grid.best_score_.round(4))

In [None]:
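# Label the KNN entry with the k actually selected by the grid search
best_k = knn_grid.best_params_['kneighborsclassifier__n_neighbors']
models = {
    'Decision Tree': best_tree,
    f'KNN (k={best_k})': best_knn
}

results = []

for name, model in models.items():
    y_pred = model.predict(X_test)              # predicted class labels
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

    results.append({
        'Model': name,
        # Built-in round() works whether the metric returns a Python or NumPy float
        'Test Accuracy': round(accuracy_score(y_test, y_pred), 4),
        'Test AUC': round(roc_auc_score(y_test, y_prob), 4)
    })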

score_df = pd.DataFrame(results)
score_df

### Result Visualizations

In [None]:
score_melt = score_df.melt(id_vars='Model', var_name='Metric', value_name='Score')

alt.Chart(score_melt).mark_bar().encode(
    x='Model:N',
    y='Score:Q',
    color='Model:N',
    column='Metric:N'
).properties(
    title='Decision Tree vs KNN Performance on Test Set'
)

## Discussion

## References