# Diabetes prediction from UCI diabetes data

In [69]:
# Importing all used libraries
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

## Summary
This project develops a classification model to predict whether an individual has diabetes and compares the performance of logistic regression and k-nearest neighbours (kNN). The dataset was collected in 2015 through the Behavioral Risk Factor Surveillance System (BRFSS), run by the Centers for Disease Control and Prevention (CDC). Among the logistic regression coefficients reported below, BMI has the largest magnitude (0.537), followed by High Blood Pressure (HighBP) at 0.354. Hyperparameter optimization of the kNN model raised its validation score only modestly, from 0.71 to 0.74. On the test set, the logistic regression model scores 0.75 and the kNN model scores 0.73. Both test scores are close to the corresponding validation scores, which suggests that the models generalize well to unseen data; however, there is still room to improve the test scores.
 

## Introduction
Diabetes mellitus, commonly referred to as diabetes, is a disease that impairs the body’s control of blood glucose levels (Sapra, Bhandari 2023). There are several types of diabetes, although we do not explore this distinction in this project (Sapra, Bhandari 2023). Diabetes has been a manageable disease since insulin became available as a treatment in the early 1920s. Globally, 1 in 11 adults has diabetes (Sapra, Bhandari 2023). Understanding the factors that are strongly related to diabetes can therefore help researchers studying how to better prevent or manage the disease. In this project, we build several machine learning models to predict diabetes in a patient and evaluate how well they perform. We also examine the coefficients of a logistic regression model to better understand the factors associated with diabetes.

In [25]:
diabetes_df = pd.read_csv("data/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")
diabetes_df.head()

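|   | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5.0 | 30.0 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |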


In [26]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Diabetes_binary       70692 non-null  float64
 1   HighBP                70692 non-null  float64
 2   HighChol              70692 non-null  float64
 3   CholCheck             70692 non-null  float64
 4   BMI                   70692 non-null  float64
 5   Smoker                70692 non-null  float64
 6   Stroke                70692 non-null  float64
 7   HeartDiseaseorAttack  70692 non-null  float64
 8   PhysActivity          70692 non-null  float64
 9   Fruits                70692 non-null  float64
 10  Veggies               70692 non-null  float64
 11  HvyAlcoholConsump     70692 non-null  float64
 12  AnyHealthcare         70692 non-null  float64
 13  NoDocbcCost           70692 non-null  float64
 14  GenHlth               70692 non-null  float64
 15  MentHlth           

In [27]:
print(diabetes_df.shape)
diabetes_df.describe().T


(70692, 22)


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Diabetes_binary | 70692.0 | 0.5 | 0.500004 | 0.0 | 0.0 | 0.5 | 1.0 | 1.0 |
| HighBP | 70692.0 | 0.563458 | 0.49596 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| HighChol | 70692.0 | 0.525703 | 0.499342 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CholCheck | 70692.0 | 0.975259 | 0.155336 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| BMI | 70692.0 | 29.856985 | 7.113954 | 12.0 | 25.0 | 29.0 | 33.0 | 98.0 |
| Smoker | 70692.0 | 0.475273 | 0.499392 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Stroke | 70692.0 | 0.062171 | 0.241468 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| HeartDiseaseorAttack | 70692.0 | 0.14781 | 0.354914 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PhysActivity | 70692.0 | 0.703036 | 0.456924 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Fruits | 70692.0 | 0.611795 | 0.487345 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


In [28]:
# Check for duplicate rows in the dataset
duplicate_rows = diabetes_df.duplicated()
print(duplicate_rows.value_counts())

False    69057
True      1635
Name: count, dtype: int64


In [29]:
# Drop duplicate rows, then check the class balance
diabetes_df.drop_duplicates(inplace=True)
diabetes_df["Diabetes_binary"].value_counts()

Diabetes_binary
1.0    35097
0.0    33960
Name: count, dtype: int64
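
Because the de-duplicated classes are nearly balanced, a classifier that always predicts the larger class is right only about half the time. The following is a minimal sketch of that arithmetic, using only the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1.0: 35097, 0.0: 33960}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
majority_share = class_counts[majority_class] / total

print(f"rows after de-duplication: {total}")
print(f"majority class: {majority_class}, share: {majority_share:.4f}")
```

This share is roughly the accuracy a majority-class baseline can reach on this data, so any useful model should beat it by a clear margin.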

In [30]:
# Check for null values
diabetes_df.isnull().sum()

Diabetes_binary         0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

In [31]:
#Creating train and test data
train_df, test_df = train_test_split(diabetes_df, test_size = 0.2, random_state=123)

X_train = train_df.drop(columns = "Diabetes_binary")
y_train = train_df["Diabetes_binary"]

X_test = test_df.drop(columns = "Diabetes_binary")
y_test = test_df["Diabetes_binary"]

In [32]:
# plotting histogram distributions
alt.data_transformers.enable("vegafusion")
numeric_cols = train_df.select_dtypes(include=['float64']).columns.to_list()

hist_plot = alt.Chart(train_df).mark_bar(opacity=0.7).encode(
            x=alt.X(alt.repeat(),type='quantitative', bin=alt.Bin(maxbins=20)),
            y=alt.Y('count()').stack(False),
            color=alt.Color('Diabetes_binary:N')
        ).properties(
            width=150,
            height=150
        ).repeat(
            numeric_cols,
            columns=4
        )

hist_plot

In [33]:
#Creating the baseline for our model
dummy = DummyClassifier()
scores = cross_validate(dummy, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

|   | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.021863 | 0.001702 | 0.510363 | 0.510408 |
| 1 | 0.005243 | 0.00175 | 0.510363 | 0.510408 |
| 2 | 0.005849 | 0.00175 | 0.510363 | 0.510408 |
| 3 | 0.004645 | 0.001125 | 0.510453 | 0.510386 |
| 4 | 0.005647 | 0.001151 | 0.510453 | 0.510386 |


## Model comparison

In [34]:
# Designate binary and continuous cols
binary_cols = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 'HeartDiseaseorAttack', 
               'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
              'DiffWalk', 'Sex']
continuous_cols = ['BMI', 'Age', 'GenHlth', 'MentHlth', 'PhysHlth', 'Education', 'Income']


In [35]:
# Create a pre-processor which scales the continuous cols
preprocessor = ColumnTransformer(
    transformers=[
        ('continuous', StandardScaler(), continuous_cols),
        ('binary', 'passthrough', binary_cols)
    ])

In [47]:
# Models to test
models = {
    "Dummy": make_pipeline(preprocessor, DummyClassifier()),
    "Decision tree": make_pipeline(preprocessor, DecisionTreeClassifier(random_state=123)),
    "Logistic regression": make_pipeline(preprocessor, LogisticRegression(max_iter=1000)),
    "Knn": make_pipeline(preprocessor, KNeighborsClassifier())
}

In [48]:
# Evaluate each model
results_dict = {}

for name, pipeline in models.items():
    
    # Cross-validation on training data
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv = 5)
    mean_cv_score = round(cv_scores.mean(), 2)

    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    test_accuracy = round(accuracy_score(y_test, predictions), 2)

    results_dict[name] = (mean_cv_score, test_accuracy)

results_df = pd.DataFrame(list(results_dict.values()), index=results_dict.keys(), columns=['Mean CV Score', 'Test Accuracy'])
results_df

|   | Mean CV Score | Test Accuracy |
|---|---|---|
| Dummy | 0.51 | 0.5 |
| Decision tree | 0.64 | 0.64 |
| Logistic regression | 0.74 | 0.75 |
| Knn | 0.71 | 0.71 |


## Feature Importance

In [38]:
# Scale every feature (binary columns included) so the coefficients are comparable
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [39]:
# Show coefficients
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
cols = train_df.drop(columns=["Diabetes_binary"]).columns
data = {"features": cols, "coefficients": lr.coef_[0]}
pd.DataFrame(data)

|   | features | coefficients |
|---|---|---|
| 0 | HighBP | 0.353947 |
| 1 | HighChol | 0.289383 |
| 2 | CholCheck | 0.216358 |
| 3 | BMI | 0.536654 |
| 4 | Smoker | -0.011014 |
| 5 | Stroke | 0.045484 |
| 6 | HeartDiseaseorAttack | 0.092977 |
| 7 | PhysActivity | -0.012531 |
| 8 | Fruits | -0.011764 |
| 9 | Veggies | -0.021392 |
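
The coefficients are easier to compare once they are sorted by magnitude. The sketch below does this for the ten values shown above (the remaining features are not listed in the table and are left out). Since every feature was standardized before fitting, `exp(coefficient)` can be read as the multiplicative change in the odds of diabetes for a one-standard-deviation increase in that feature. The dictionary and variable names are illustrative.

```python
import math

# Coefficients copied from the (truncated) table above; features that are
# not shown in that table are deliberately not included here.
coefficients = {
    "HighBP": 0.353947,
    "HighChol": 0.289383,
    "CholCheck": 0.216358,
    "BMI": 0.536654,
    "Smoker": -0.011014,
    "Stroke": 0.045484,
    "HeartDiseaseorAttack": 0.092977,
    "PhysActivity": -0.012531,
    "Fruits": -0.011764,
    "Veggies": -0.021392,
}

# Rank by absolute value so strong negative effects are not hidden.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for feature, coef in ranked:
    odds_ratio = math.exp(coef)  # odds multiplier per 1 SD increase
    print(f"{feature:<22}{coef:>10.6f}{odds_ratio:>10.3f}")
```

Among the listed features, BMI and HighBP rank first and second by magnitude.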


## Exploring Hyperparameters

The logistic regression model had the highest accuracy of the models we explored. However, the kNN model was the second best, with a cross-validation accuracy only 0.03 lower than that of the logistic regression model (0.71 versus 0.74). We therefore explore the hyperparameters of the kNN model to see whether this score can be improved.

In [40]:
from sklearn.model_selection import RandomizedSearchCV

# Tune n_neighbors inside the pipeline so scaling is refit on each CV fold
knn_pipe = make_pipeline(preprocessor, KNeighborsClassifier())

param_grid = {
    "kneighborsclassifier__n_neighbors": [50, 100, 200, 300, 500]
}
# Only 5 candidates exist, so n_iter=5 already covers the whole grid
first_search = RandomizedSearchCV(knn_pipe, param_distributions=param_grid, n_iter=5, n_jobs=-1, return_train_score=True)
first_search.fit(X_train, y_train)



In [41]:
print("the best parameter:", first_search.best_params_)
print("the best score:", first_search.best_score_)

the best parameter: {'kneighborsclassifier__n_neighbors': 100}
the best score: 0.741732283464567
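
Beyond the single best score, the full search results show how the train and validation scores move as `n_neighbors` changes. The sketch below illustrates this on synthetic stand-in data from `make_classification`, not on the BRFSS data; the grid values, sample size, and variable names are assumptions chosen only to keep the example small. It also scores `best_estimator_`, which is the whole pipeline refit on the training split, so the same scaling is applied to held-out data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; this is NOT the diabetes dataset.
X, y = make_classification(n_samples=600, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {"kneighborsclassifier__n_neighbors": [5, 25, 50, 100]}

search = RandomizedSearchCV(pipe, param_distributions=grid, n_iter=4,
                            return_train_score=True, random_state=123)
search.fit(X_tr, y_tr)

# One row per candidate; a large train/validation gap suggests overfitting.
summary = pd.DataFrame(search.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_train_score", "mean_test_score"]
].sort_values("mean_test_score", ascending=False)
print(summary)

# best_estimator_ includes the scaler, so held-out data is scaled consistently.
held_out = search.best_estimator_.score(X_te, y_te)
print("held-out accuracy:", held_out)
```

In the notebook itself, the analogous objects would be `first_search.cv_results_` and `first_search.best_estimator_`.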


## Model Selection and Testing

In [44]:
# Note: this refits kNN on the raw features, without the preprocessor used during the search
final_knn = KNeighborsClassifier(n_neighbors=100)
final_knn.fit(X_train, y_train)


In [66]:
y_pred_knn = final_knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_knn)
print(f'Accuracy of knn model of n=100: {accuracy}')

# Additional metrics
print(classification_report(y_test, y_pred_knn))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_knn))
print('AUC-ROC Score:', roc_auc_score(y_test, final_knn.predict_proba(X_test)[:, 1]))

Accuracy of knn model of n=100: 0.7280625543006082
              precision    recall  f1-score   support

         0.0       0.76      0.67      0.71      6912
         1.0       0.71      0.78      0.74      6900

    accuracy                           0.73     13812
   macro avg       0.73      0.73      0.73     13812
weighted avg       0.73      0.73      0.73     13812

Confusion Matrix:
 [[4660 2252]
 [1504 5396]]
AUC-ROC Score: 0.8017921887580515


In [67]:
# Note: this fits logistic regression on the raw features, without the preprocessor
final_logistic = LogisticRegression(max_iter=1000)
final_logistic.fit(X_train, y_train)

In [68]:
y_pred_log = final_logistic.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_log)
print(f'Accuracy of logistic model: {accuracy}')

# Additional metrics
print(classification_report(y_test, y_pred_log))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_log))
print('AUC-ROC Score:', roc_auc_score(y_test, final_logistic.predict_proba(X_test)[:, 1]))

Accuracy of logistic model: 0.7461627570228787
              precision    recall  f1-score   support

         0.0       0.76      0.72      0.74      6912
         1.0       0.74      0.77      0.75      6900

    accuracy                           0.75     13812
   macro avg       0.75      0.75      0.75     13812
weighted avg       0.75      0.75      0.75     13812

Confusion Matrix:
 [[5006 1906]
 [1600 5300]]
AUC-ROC Score: 0.8207051273986851
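
The headline numbers in both reports can be recomputed directly from the confusion matrices. The following sketch does so for the positive (diabetic) class, using the matrices printed above; `summarize` is a hypothetical helper written for this illustration, and it assumes scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
def summarize(confusion):
    """Accuracy plus precision and recall for the positive (diabetic) class.

    `confusion` follows scikit-learn's layout: rows are true labels,
    columns are predicted labels, ordered [0.0, 1.0].
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }

# Confusion matrices copied from the outputs above.
knn_metrics = summarize([[4660, 2252], [1504, 5396]])
logistic_metrics = summarize([[5006, 1906], [1600, 5300]])

for name, metrics in [("kNN", knn_metrics), ("logistic", logistic_metrics)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Comparing the two dictionaries side by side shows where the models differ: logistic regression has higher accuracy and precision, while kNN has slightly higher recall on the diabetic class.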


## Conclusion
Based on the evaluation metrics, the logistic regression model performs better than the k-nearest neighbours model on the test set. It achieves higher accuracy (0.75 versus 0.73), higher macro-averaged precision and recall, and a higher AUC-ROC score (0.82 versus 0.80), although kNN has slightly higher recall on the diabetic class (0.78 versus 0.77).

Therefore, considering these results and the fact that Logistic Regression also offers interpretability of feature coefficients, it seems reasonable to prefer the Logistic Regression model for this specific classification task. 