# Diabetes Analysis

Data Reference: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

In [1]:
import pandas as pd
import numpy as np
import altair as alt

## Summary

This project attempts to predict diabetes status using Logistic Regression and LinearSVC models, compared against a baseline DummyClassifier on an imbalanced dataset. All three models achieved similar accuracy on the test set (approximately 0.86), which highlights a key issue: on imbalanced data, accuracy alone is not a reliable performance metric.

These findings motivate deeper exploratory data analysis, evaluation with additional metrics (precision, recall, F1), and exploration of alternative models and threshold tuning to obtain a more robust assessment of the models' predictive ability.
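To illustrate why accuracy can mislead here, the following is a minimal sketch on synthetic labels (an assumption for illustration, not the BRFSS data) with roughly the same class ratio. A majority-class baseline scores 0.86 accuracy while identifying no positive cases at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Synthetic labels (assumed for illustration): 86% negative, 14% positive.
y_true = np.array([0] * 860 + [1] * 140)
X_placeholder = np.zeros((len(y_true), 1))  # the baseline ignores the features

# Always predict the most frequent class, as the notebook's baseline does.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

baseline_metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # zero_division=0 avoids a warning when no positives are predicted.
    "precision": precision_score(y_true, y_pred, zero_division=0),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}
print(baseline_metrics)
```

Recall and F1 are both zero for this baseline, which is the gap that accuracy hides.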

## Introduction

Diabetes is a chronic disease that prevents the body from properly controlling blood sugar levels, which can lead to serious health problems including heart disease, vision loss, kidney disease, and limb amputation (Teboul, 2020). Given the severity of the disease, early detection can allow people to make lifestyle changes and receive treatment that can slow disease progression. We believe that machine learning models using survey data can offer a promising way to create accessible, cost-effective screening tools to identify high-risk individuals and support public health efforts.

### Research Question
Can we use health indicators and lifestyle factors from the CDC's Behavioral Risk Factor Surveillance System (BRFSS) survey to accurately predict whether an individual has diabetes?

We aim to:
1. Build and evaluate classification models that predict diabetes status based on 21 health and lifestyle features
2. Compare the performance and efficiency of logistic regression and support vector machine (SVM) classifiers
3. Assess whether survey-based features can provide sufficiently accurate predictions for practical screening applications

## Methods & Results

This analysis uses the diabetes_binary_health_indicators_BRFSS2015.csv dataset, a cleaned and preprocessed version of the CDC's 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey data, made available by Alex Teboul on Kaggle (Teboul, 2020). We load it from the UCI Machine Learning Repository (dataset ID 891).

For this analysis, we split the dataset into training (80%) and testing (20%) sets using a fixed random state (522) to ensure reproducibility. We implemented two classification algorithms:

1. Logistic Regression: A linear model appropriate for binary classification that estimates the probability of diabetes based on a linear combination of features.
2. Linear Support Vector Classifier (SVC): A classifier that finds an optimal hyperplane to separate diabetic from non-diabetic individuals.

Both models were implemented using scikit-learn pipelines. A column transformer standardizes the numeric and ordinal features (StandardScaler) so they are on comparable scales, and passes the binary features through unchanged. We evaluated model performance using 5-fold cross-validation (the scikit-learn default) on the training set, followed by a final accuracy assessment on the held-out test set.

Both models achieve approximately 86% accuracy on the test set (0.8627 for logistic regression, 0.8633 for LinearSVC), and logistic regression trains slightly faster (mean cross-validation fit time of about 1.1 s versus 1.6 s).

## Discussion

The baseline DummyClassifier achieves about 0.86 accuracy because it always predicts the most frequent class (non-diabetic). Logistic Regression and LinearSVC reach nearly the same accuracy (approximately 0.86), an improvement of less than 0.003 over the baseline in cross-validation.

The EDA shows class imbalance: there are far more non-diabetic than diabetic respondents. This may affect the models’ reliability, because a model can score well on accuracy while rarely identifying the minority class. More analysis is therefore needed before drawing strong conclusions: evaluating metrics that are sensitive to the minority class (such as precision and recall), examining confusion matrices, exploring additional models, and testing different data splits and hyperparameters to determine whether performance is stable across scenarios.
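As a sketch of what examining a confusion matrix could look like, the example below fits a logistic regression on synthetic imbalanced data generated with `make_classification` (an assumption for illustration; the names `X_te`, `y_te`, and `cm` are hypothetical and not part of the notebook).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data: about 86% negatives, 14% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.86, 0.14], random_state=522
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=522
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Per-class precision, recall, and F1 in a single report.
print(classification_report(y_te, y_hat, digits=3))
```

The false-negative count (`fn`) is the number to watch in a screening setting, since those are positive cases the model misses.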

The similarity in test scores is an unexpected finding. With a clean dataset containing informative and diverse features, we would expect the classification models to perform clearly better than the dummy classifier. Additionally, initial hyperparameter tuning for logistic regression did not affect accuracy (data not shown). This finding highlights the importance of understanding the data through EDA in order to interpret where accuracy scores come from.

The next step is therefore deeper EDA, including feature distributions by class, to see how much the classes overlap and whether a linear model can separate them effectively. Further questions include which features matter most for classifying an individual as diabetic or not, how well calibrated the probability estimates are, and whether all features actually contribute to the predictions.
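One possible way to approach the feature-importance question is to rank the coefficients of a logistic regression fitted on standardized features. The sketch below uses synthetic data with hypothetical column names (`strong_signal`, `weak_signal`, `pure_noise`), so it illustrates the technique only and says nothing about the BRFSS features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(522)
n = 2000

# Hypothetical features: one strongly predictive, one weakly, one unrelated.
X_demo = pd.DataFrame({
    "strong_signal": rng.normal(size=n),
    "weak_signal": rng.normal(size=n),
    "pure_noise": rng.normal(size=n),
})
logit = (2.0 * X_demo["strong_signal"] + 0.3 * X_demo["weak_signal"] - 2.0).to_numpy()
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardizing first makes coefficient magnitudes comparable across features.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=X_demo.columns
)
# Sort by absolute value so strong negative effects rank as high as positive ones.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient magnitude is only a rough guide when features are correlated, so it would be reasonable to cross-check any ranking against the EDA plots.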

## Analysis

### Read in and Explore Data

In [2]:
from ucimlrepo import fetch_ucirepo 
   
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
dat = cdc_diabetes_health_indicators.data.original

In [3]:
dat.head()

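|  | ID | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 |
| 1 | 1 | 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 |
| 2 | 2 | 0 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 |
| 3 | 3 | 0 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 |
| 4 | 4 | 0 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 |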


In [4]:
dat.tail()

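|  | ID | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 253675 | 253675 | 0 | 1 | 1 | 1 | 45 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 3 | 0 | 5 | 0 | 1 | 5 | 6 | 7 |
| 253676 | 253676 | 1 | 1 | 1 | 1 | 18 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 4 | 0 | 0 | 1 | 0 | 11 | 2 | 4 |
| 253677 | 253677 | 0 | 0 | 0 | 1 | 28 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 5 | 2 |
| 253678 | 253678 | 0 | 1 | 0 | 1 | 23 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 7 | 5 | 1 |
| 253679 | 253679 | 1 | 1 | 1 | 1 | 25 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 9 | 6 | 2 |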


In [5]:
dat.shape

(253680, 23)

In [6]:
dat.columns

Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income'],
      dtype='object')

### Data Visualization

In [7]:
# Check the imbalance in sample size between the two classes
alt.data_transformers.enable('vegafusion')

alt.Chart(dat, title = "Number of Records of Two Classes").mark_bar().encode(
    x = "Diabetes_binary", 
    y = "count()"
)

In [8]:
numeric_features = ["BMI", "Age"]
binary_features = ["HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", 
                   "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", 
                   "AnyHealthcare", "NoDocbcCost", "DiffWalk", "Sex"]
ordinal_features = ["GenHlth", "MentHlth", "PhysHlth", "Education", "Income"]

In [9]:
# Boxplot for Numeric Features
alt.Chart(dat).mark_boxplot().encode(
    x=alt.X('Diabetes_binary:N', title='Diabetes (0/1)'),
    y=alt.Y(alt.repeat('row'), type='quantitative')
).properties(
    width=200,
    height=150
).repeat(
    row=numeric_features, 
)

# Those with diabetes (Diabetes_binary = 1) tend to have a higher BMI and fall into older age categories

In [10]:
# Bar Chart of Proportion with Diabetes for Binary Features
alt.Chart(dat).mark_bar().transform_fold(
    binary_features,
    as_=['feature', 'value']
).encode(
    x=alt.X('value:N', title='0 or 1'),
    y=alt.Y('mean(Diabetes_binary):Q', title='Proportion with Diabetes'),
).properties(
    width=150, 
    height=150
).facet(
    facet='feature:N', 
    columns=5
)

In [11]:
# Bar Chart for Ordinal Features
alt.Chart(dat).mark_bar().encode(
    x=alt.X(alt.repeat("row"), type="ordinal", sort=[1,2,3,4,5]),
    y="count()",
    color="Diabetes_binary:N",
    column=alt.Column("Diabetes_binary:N")
).properties(
    width=200, 
    height=150
).repeat(
    row=ordinal_features
)


### Model Training

#### Train Test Split
Reminder: use `random_state=522` for reproducibility.

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

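# 80/20 split with a fixed seed for reproducibility.
# Note: the ID column stays in X_train/X_test here, but the column transformer
# defined below selects features by name, so ID never reaches the models.
train_df, test_df = train_test_split(dat, test_size=0.2, random_state=522)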

X_train, y_train = (
    train_df.drop(columns=["Diabetes_binary"]),
    train_df["Diabetes_binary"],
)
X_test, y_test = (
    test_df.drop(columns=["Diabetes_binary"]),
    test_df["Diabetes_binary"],
)

#### Feature Processing

In [13]:
dat.columns

Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income'],
      dtype='object')

In [14]:
# features
numeric_feats = ["GenHlth", "Education", "Income", "Age", "MentHlth", "PhysHlth", "BMI"]


passthrough_feats = [
    "HighBP",
    "HighChol",
    "CholCheck",
    "Smoker",
    "Stroke",
    "HeartDiseaseorAttack",
    "PhysActivity",
    "Fruits",
    "Veggies",
    "HvyAlcoholConsump",
    "AnyHealthcare",
    "NoDocbcCost",
    "DiffWalk",
    "Sex"
]

In [15]:
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    ("passthrough", passthrough_feats)
)

#### Dummy Classifier

In [16]:
from sklearn.dummy import DummyClassifier

# Baseline that always predicts the most frequent class
dummy_df = DummyClassifier(strategy="most_frequent", random_state=522)

scores_dummy = pd.DataFrame(cross_validate(dummy_df, X_train, y_train, return_train_score=True)).mean()
scores_dummy

fit_time       0.066497
score_time     0.008219
test_score     0.860922
train_score    0.860922
dtype: float64

#### Logistic Regression

In [17]:
lr_pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

scores_logistic = cross_validate(lr_pipe, X_train, y_train, return_train_score=True)
results = pd.DataFrame(scores_logistic)
results.mean()

fit_time       1.085523
score_time     0.029637
test_score     0.863731
train_score    0.863839
dtype: float64

#### Linear SVC

In [18]:
from sklearn.svm import LinearSVC

linear_svc_pipe = make_pipeline(preprocessor, LinearSVC(max_iter=5000))

scores = cross_validate(linear_svc_pipe, X_train, y_train, return_train_score=True)
results = pd.DataFrame(scores)
results.mean()

fit_time       1.566425
score_time     0.020817
test_score     0.863539
train_score    0.863546
dtype: float64
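The cross-validation calls above report accuracy only. As a hedged sketch, `cross_validate` also accepts a list of scorer names, which would let precision, recall, and F1 be collected in the same run. The example uses synthetic data from `make_classification` and a hypothetical `demo_pipe`, so the numbers it prints say nothing about the BRFSS results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumed for illustration only).
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=10, weights=[0.86, 0.14], random_state=522
)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Passing a list of scorer names yields one test_<name> column per metric.
cv_scores = cross_validate(
    demo_pipe,
    X_syn,
    y_syn,
    cv=5,
    scoring=["accuracy", "precision", "recall", "f1"],
)
cv_summary = pd.DataFrame(cv_scores).mean()
print(cv_summary)
```

The same `scoring` argument could be passed to the notebook's `lr_pipe` and `linear_svc_pipe` calls without other changes.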

#### Final Test (predict on the testset)

In [19]:
from sklearn.metrics import accuracy_score

lr_pipe.fit(X_train, y_train)
prediction_lr = lr_pipe.predict(X_test)
accuracy_lr = accuracy_score(y_test, prediction_lr)

linear_svc_pipe.fit(X_train, y_train)
prediction_svc = linear_svc_pipe.predict(X_test)
accuracy_svc = accuracy_score(y_test, prediction_svc)
print(f"The accuracy of the Logistic Regression model is {accuracy_lr}")
print(f"The accuracy of Linear SVC model is {accuracy_svc}")

The accuracy of the Logistic Regression model is 0.8627207505518764
The accuracy of Linear SVC model is 0.8632726269315674


## Conclusion

After training, Logistic Regression and Linear SVC produced similar accuracy on the test set (about 86%), with Logistic Regression training faster. Given the small difference, either model could be chosen for further evaluation; if speed, interpretability, or probability estimates are important, Logistic Regression is the more sensible choice.

A higher-priority next step is addressing class imbalance and re-evaluating both models to see if they outperform the dummy classifier. This motivates deeper EDA, examining feature distributions and predictions, reviewing confusion matrices, and conducting hyperparameter tuning to test for potential improvements. At this point, we cannot draw firm conclusions about the models’ predictive ability based on the current dataset and features.
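One common option for addressing class imbalance is the `class_weight="balanced"` setting of `LogisticRegression`, which reweights classes inversely to their frequency. The sketch below compares it with the default on synthetic data (an assumption for illustration; `results_by_weighting` and the other names are hypothetical). It is one possible starting point, not a result from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with some label noise (assumed for illustration).
X_imb, y_imb = make_classification(
    n_samples=5000,
    n_features=10,
    weights=[0.86, 0.14],
    flip_y=0.05,
    random_state=522,
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=522
)

results_by_weighting = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(max_iter=1000, class_weight=weight)
    clf.fit(X_fit, y_fit)
    pred = clf.predict(X_eval)
    results_by_weighting[label] = {
        "accuracy": accuracy_score(y_eval, pred),
        "recall": recall_score(y_eval, pred),
    }

for label, metrics in results_by_weighting.items():
    print(label, metrics)
```

Balanced weighting typically trades some accuracy for higher recall on the minority class, so the two metrics should be read together.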