<a href="https://colab.research.google.com/github/alivarastepour/diabetes_prediction/blob/master/diabetes_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook
This notebook aims to build a model that determines whether a person is prone to diabetes or not. Additionally, it seeks to identify a subset of features (risk factors) that can accurately predict the risk of diabetes. The weights of the optimal solution will be utilized in another project, where they will be applied to users' inputs in real time.

## Dataset
This notebook uses a subset of the 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey, which collects uniform, state-specific data on preventive health practices and risk behaviors associated with chronic diseases, injuries, and preventable infectious diseases in the adult population. The subset used here is the class-balanced (50/50 split) file with a binary diabetes label, which can be accessed [here](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv).

In [17]:
import pandas as pd
import numpy as np
from google.colab import drive

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [3]:
drive.mount('/drive')
DATASET_ADDRESS = '/drive/MyDrive/diabetes_info.csv'
raw_dataset = pd.read_csv(DATASET_ADDRESS)

Mounted at /drive


In [4]:
raw_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Diabetes_binary       70692 non-null  float64
 1   HighBP                70692 non-null  float64
 2   HighChol              70692 non-null  float64
 3   CholCheck             70692 non-null  float64
 4   BMI                   70692 non-null  float64
 5   Smoker                70692 non-null  float64
 6   Stroke                70692 non-null  float64
 7   HeartDiseaseorAttack  70692 non-null  float64
 8   PhysActivity          70692 non-null  float64
 9   Fruits                70692 non-null  float64
 10  Veggies               70692 non-null  float64
 11  HvyAlcoholConsump     70692 non-null  float64
 12  AnyHealthcare         70692 non-null  float64
 13  NoDocbcCost           70692 non-null  float64
 14  GenHlth               70692 non-null  float64
 15  MentHlth           

## The correlation matrix and its usage
A correlation matrix summarizes the pairwise linear relationship between the columns of a dataset. Each entry is a correlation coefficient between -1 and 1 (pandas computes Pearson's coefficient by default). A coefficient of 1 indicates a perfect positive correlation, meaning that the two variables increase or decrease together in a linear relationship. A coefficient of -1 indicates a perfect negative correlation, meaning that the two variables move in opposite directions in a linear relationship. A coefficient close to 0 suggests no linear relationship between the variables, although a non-linear relationship may still exist.

This matrix can be helpful when finding an optimal subset of features: columns that correlate strongly (in either direction) with `Diabetes_binary` are natural first candidates.

In [5]:
columns = raw_dataset.keys()
correlation = raw_dataset[columns].corr()
correlation["Diabetes_binary"]

Diabetes_binary         1.000000
HighBP                  0.381516
HighChol                0.289213
CholCheck               0.115382
BMI                     0.293373
Smoker                  0.085999
Stroke                  0.125427
HeartDiseaseorAttack    0.211523
PhysActivity           -0.158666
Fruits                 -0.054077
Veggies                -0.079293
HvyAlcoholConsump      -0.094853
AnyHealthcare           0.023191
NoDocbcCost             0.040977
GenHlth                 0.407612
MentHlth                0.087029
PhysHlth                0.213081
DiffWalk                0.272646
Sex                     0.044413
Age                     0.278738
Education              -0.170481
Income                 -0.224449
Name: Diabetes_binary, dtype: float64
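
As a rough illustration of how such a correlation column could drive feature selection, the sketch below ranks features by the absolute value of their correlation with the target and keeps those above a cutoff. The DataFrame `demo`, its column names, and the `THRESHOLD` value are all made up for the example; the notebook itself does not prescribe a cutoff.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's DataFrame (hypothetical column names).
rng = np.random.default_rng(0)
n = 1000
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)
target = (strong + 0.3 * weak + rng.normal(scale=0.5, size=n) > 0).astype(float)
demo = pd.DataFrame({"strong": strong, "weak": weak, "noise": noise, "target": target})

# Correlation of every feature with the target, ranked by absolute value
# so that strong negative correlations are not overlooked.
target_corr = demo.corr()["target"].drop("target")
ranked = target_corr.abs().sort_values(ascending=False)

# Keep features whose |correlation| clears an arbitrary, illustrative threshold.
THRESHOLD = 0.1
selected = ranked[ranked >= THRESHOLD].index.tolist()
print(ranked)
print("selected:", selected)
```

Applied to the real data, the same ranking would use `correlation["Diabetes_binary"]` in place of `target_corr`. Correlation only captures linear, one-feature-at-a-time effects, so this is a starting point rather than a final selection.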

In [6]:
y = raw_dataset["Diabetes_binary"]
x = raw_dataset.drop(columns=["Diabetes_binary"])

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=9)

# Model selection
While our data may appear relatively clean, clean data alone does not guarantee good predictive performance. We therefore train several machine learning models, compare their results, and use the comparison to identify modifications to the original data that could improve them.

## First model: Gradient boost classifier
Boosting algorithms have been widely recognized as effective choices for handling tabular data. Among them, gradient boosting stands out as a prominent technique that leverages decision trees to create a powerful ensemble model. Nonetheless, to ensure its optimal performance, careful consideration should be given to hyperparameter tuning.
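
One common way to approach the tuning mentioned above is a small cross-validated grid search. The sketch below is only an illustration: it runs on synthetic data from `make_classification`, and the grid values are arbitrary assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes quickly (not the BRFSS data).
X_gs, y_gs = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid; the real search space would be chosen per problem.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [20, 40],
}

search = GridSearchCV(
    GradientBoostingClassifier(subsample=0.8, random_state=90),
    param_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X_gs, y_gs)
print(search.best_params_, round(search.best_score_, 3))
```

On the full dataset each fit takes several seconds, so the grid size and the number of folds directly control how long the search runs.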

In [8]:
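# Note: scikit-learn 1.1 renamed loss='deviance' to loss='log_loss',
# and the old name was removed in 1.3; use 'log_loss' on newer versions.
reg = GradientBoostingClassifier(random_state=90,
                                loss='deviance',
                                learning_rate=0.05,
                                n_estimators=150,
                                subsample=0.8,
                                criterion='friedman_mse',
                                verbose=2,
                                )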

In [9]:
reg.fit(x_train, y_train)



      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3619           0.0245           10.25s
         2           1.3400           0.0224            9.53s
         3           1.3200           0.0203            9.15s
         4           1.3008           0.0178            9.35s
         5           1.2841           0.0163            9.24s
         6           1.2685           0.0151            9.20s
         7           1.2545           0.0136            9.04s
         8           1.2417           0.0127            9.03s
         9           1.2303           0.0117            9.03s
        10           1.2191           0.0112            8.91s
        11           1.2090           0.0101            8.91s
        12           1.1982           0.0100            8.81s
        13           1.1890           0.0094            8.70s
        14           1.1793           0.0088            8.70s
        15           1.1707           0.0078            8.59s
       

## The deviance loss
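Deviance (also known as log loss or binary cross-entropy) is a commonly used loss function in binary classification problems. A glance at its formula shows why:

$$
L(y, p) = -\left(y \log(p) + (1 - y) \log(1 - p)\right)
$$

where $y \in \{0, 1\}$ is the true class and $p$ is the predicted probability that the sample belongs to class 1. The loss is close to 0 when the model assigns a high probability to the true class and grows without bound as that probability approaches 0, so confident wrong predictions are penalized heavily.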
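
To make the behavior of this loss concrete, here is a minimal sketch that evaluates the negative log-likelihood by hand and compares it with `sklearn.metrics.log_loss`. The helper `binary_log_loss` and the toy labels and probabilities are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, p_pred):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return per_sample.mean()

y_toy = [1, 0, 1, 0]
p_confident = [0.9, 0.1, 0.8, 0.2]  # high probability on the true class
p_wrong = [0.1, 0.9, 0.2, 0.8]      # high probability on the wrong class

print(binary_log_loss(y_toy, p_confident))  # small loss
print(binary_log_loss(y_toy, p_wrong))      # much larger loss
print(log_loss(y_toy, p_confident))         # scikit-learn's value for comparison
```

The hand-written value should agree with scikit-learn's, and the second call shows how quickly the loss grows when the model is confidently wrong.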




## F-1 score
The F-1 score combines precision (the ratio of true positives to all predicted positives, i.e. true positives plus false positives) and recall (the ratio of true positives to all actual positives, i.e. true positives plus false negatives) into a single number, their harmonic mean, which balances the two:

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
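
As a small worked example of the formula, the sketch below counts true positives, false positives, and false negatives on a handful of made-up labels and checks the result against `sklearn.metrics.f1_score`. The arrays are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels: 4 actual positives, 4 actual negatives.
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 1, 0, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # 3
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # 1
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # 1

precision = tp / (tp + fp)  # 3 / 4
recall = tp / (tp + fn)     # 3 / 4
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
print(f1_score(y_true_demo, y_pred_demo))
```

The `macro avg` row in `classification_report` is the unweighted mean of this per-class F-1 computed once for each class.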


In [10]:
y_pred = reg.predict(x_test)

In [11]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.76      0.71      0.73      7010
         1.0       0.73      0.78      0.76      7129

    accuracy                           0.75     14139
   macro avg       0.75      0.75      0.75     14139
weighted avg       0.75      0.75      0.75     14139



In [15]:
scores = cross_val_score(reg, x_train, y_train, cv=5, scoring='f1_macro')
print("cross validation scores(F-1) where k=5: ", scores)



      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3618           0.0250            6.89s
         2           1.3393           0.0221            6.86s
         3           1.3186           0.0197            6.75s
         4           1.3001           0.0183            6.77s
         5           1.2847           0.0173            6.87s
         6           1.2681           0.0153            6.80s
         7           1.2545           0.0140            6.78s
         8           1.2407           0.0127            6.76s
         9           1.2278           0.0122            6.72s
        10           1.2161           0.0107            6.69s
        11           1.2076           0.0105            6.68s
        12           1.1964           0.0093            6.67s
        13           1.1879           0.0097            6.72s
        14           1.1787           0.0084            6.68s
        15           1.1716           0.0078            6.60s
       



         2           1.3396           0.0220            6.85s
         3           1.3189           0.0195            6.84s
         4           1.3011           0.0183            6.81s
         5           1.2853           0.0171            6.78s
         6           1.2685           0.0149            6.81s
         7           1.2552           0.0142            6.80s
         8           1.2417           0.0126            6.75s
         9           1.2289           0.0123            6.69s
        10           1.2168           0.0104            6.60s
        11           1.2080           0.0102            6.51s
        12           1.1982           0.0101            6.47s
        13           1.1895           0.0095            6.50s
        14           1.1787           0.0084            6.42s
        15           1.1716           0.0080            6.36s
        16           1.1653           0.0073            6.30s
        17           1.1579           0.0077            6.27s
        



         1           1.3619           0.0248            7.08s
         2           1.3401           0.0226            6.92s
         3           1.3187           0.0193            6.82s
         4           1.3010           0.0182            6.99s
         5           1.2848           0.0169            6.93s
         6           1.2686           0.0151            6.88s
         7           1.2555           0.0142            6.92s
         8           1.2418           0.0127            6.89s
         9           1.2288           0.0114            6.81s
        10           1.2172           0.0113            6.73s
        11           1.2087           0.0106            6.69s
        12           1.1983           0.0101            6.62s
        13           1.1895           0.0096            6.61s
        14           1.1794           0.0085            6.54s
        15           1.1706           0.0079            6.48s
        16           1.1655           0.0072            6.44s
        



      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3619           0.0243           10.27s
         2           1.3400           0.0222           10.12s
         3           1.3197           0.0199           10.68s
         4           1.3006           0.0175           10.75s
         5           1.2857           0.0170           10.55s
         6           1.2699           0.0153           10.50s
         7           1.2558           0.0139           10.37s
         8           1.2425           0.0127           10.22s
         9           1.2306           0.0123           10.13s
        10           1.2202           0.0112           10.05s
        11           1.2084           0.0100           10.08s
        12           1.2001           0.0098           10.01s
        13           1.1905           0.0085            9.88s
        14           1.1817           0.0093            9.77s
        15           1.1730           0.0080            9.70s
       



         3           1.3196           0.0196            6.84s
         4           1.3004           0.0171            7.12s
         5           1.2860           0.0171            7.04s
         6           1.2700           0.0153            6.97s
         7           1.2566           0.0143            6.88s
         8           1.2424           0.0125            6.82s
         9           1.2310           0.0121            6.74s
        10           1.2201           0.0108            6.66s
        11           1.2074           0.0105            6.61s
        12           1.1995           0.0094            6.55s
        13           1.1896           0.0085            6.52s
        14           1.1811           0.0091            6.48s
        15           1.1720           0.0079            6.48s
        16           1.1649           0.0072            6.42s
        17           1.1608           0.0076            6.36s
        18           1.1518           0.0065            6.31s
        

In [16]:
scores = cross_val_score(reg, x_train, y_train, cv=10, scoring='f1_macro')
print("cross validation scores(F-1) where k=10: ", scores)

      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3616           0.0244           11.80s




         2           1.3389           0.0219           12.14s
         3           1.3191           0.0204           12.38s
         4           1.3010           0.0186           12.15s
         5           1.2832           0.0164           12.22s
         6           1.2685           0.0152           12.02s
         7           1.2541           0.0141           11.84s
         8           1.2417           0.0129           11.81s
         9           1.2282           0.0123           11.70s
        10           1.2186           0.0114           11.53s
        11           1.2069           0.0102           11.37s
        12           1.1979           0.0097           11.25s
        13           1.1875           0.0094           11.13s
        14           1.1800           0.0093           11.05s
        15           1.1705           0.0081           10.99s
        16           1.1617           0.0073           10.90s
        17           1.1543           0.0066           10.81s
        



         1           1.3618           0.0245            7.55s
         2           1.3393           0.0218            8.16s
         3           1.3196           0.0203            7.99s
         4           1.3016           0.0184            8.01s
         5           1.2840           0.0161            7.99s
         6           1.2693           0.0152            7.93s
         7           1.2553           0.0141            8.02s
         8           1.2429           0.0130            7.92s
         9           1.2300           0.0117            7.90s
        10           1.2203           0.0112            7.87s
        11           1.2088           0.0107            7.82s
        12           1.1993           0.0096            7.77s
        13           1.1900           0.0089            7.71s
        14           1.1826           0.0093            7.63s
        15           1.1735           0.0078            7.53s
        16           1.1646           0.0071            7.53s
        



         4           1.3018           0.0182            8.00s
         5           1.2841           0.0162            7.91s
         6           1.2697           0.0151            7.88s
         7           1.2552           0.0138            7.91s
         8           1.2432           0.0129            7.87s
         9           1.2293           0.0123            7.76s
        10           1.2204           0.0114            7.65s
        11           1.2092           0.0102            7.59s
        12           1.1999           0.0094            7.51s
        13           1.1894           0.0094            7.44s
        14           1.1827           0.0084            7.37s
        15           1.1731           0.0085            7.30s
        16           1.1643           0.0069            7.24s
        17           1.1564           0.0075            7.19s
        18           1.1498           0.0063            7.13s
        19           1.1440           0.0067            7.06s
        



      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3616           0.0242           13.06s
         2           1.3399           0.0225           12.27s
         3           1.3200           0.0206           12.57s
         4           1.3005           0.0179           12.31s
         5           1.2841           0.0166           12.13s
         6           1.2697           0.0155           12.20s
         7           1.2538           0.0137           12.06s
         8           1.2408           0.0126           11.85s
         9           1.2298           0.0117           11.83s
        10           1.2175           0.0107           11.70s
        11           1.2089           0.0102           11.54s
        12           1.1991           0.0102           11.48s
        13           1.1888           0.0092           11.38s
        14           1.1803           0.0083           11.33s
        15           1.1715           0.0085           11.31s
       



         1           1.3619           0.0243            8.34s
         2           1.3401           0.0225            8.04s
         3           1.3205           0.0204            7.84s
         4           1.3009           0.0177            7.99s
         5           1.2851           0.0166            7.91s
         6           1.2701           0.0153            7.77s
         7           1.2550           0.0138            7.76s
         8           1.2420           0.0127            7.68s
         9           1.2300           0.0124            7.65s
        10           1.2180           0.0107            7.59s
        11           1.2084           0.0105            7.55s
        12           1.1996           0.0101            7.63s
        13           1.1890           0.0091            7.61s
        14           1.1808           0.0083            7.52s
        15           1.1716           0.0085            7.44s
        16           1.1641           0.0069            7.36s
        



         1           1.3618           0.0246            8.73s
         2           1.3398           0.0223            8.30s
         3           1.3202           0.0205            8.03s
         4           1.3006           0.0180            7.90s
         5           1.2845           0.0167            7.83s
         6           1.2695           0.0155            7.74s
         7           1.2543           0.0138            7.72s
         8           1.2412           0.0127            7.63s
         9           1.2293           0.0123            7.72s
        10           1.2179           0.0110            7.61s
        11           1.2078           0.0106            7.55s
        12           1.1981           0.0099            7.52s
        13           1.1890           0.0087            7.47s
        14           1.1804           0.0083            7.43s
        15           1.1709           0.0081            7.34s
        16           1.1634           0.0070            7.27s
        



      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3620           0.0243           11.51s
         2           1.3400           0.0220           11.66s
         3           1.3205           0.0201           12.04s
         4           1.3014           0.0179           11.91s
         5           1.2855           0.0166           11.76s
         6           1.2704           0.0153           11.83s
         7           1.2559           0.0139           11.68s
         8           1.2422           0.0125           11.55s
         9           1.2306           0.0123           11.51s
        10           1.2185           0.0108           11.36s
        11           1.2089           0.0105           11.26s
        12           1.1998           0.0098           11.31s
        13           1.1900           0.0086           11.18s
        14           1.1811           0.0085           11.07s
        15           1.1716           0.0084           11.00s
       



         1           1.3619           0.0247            8.79s
         2           1.3398           0.0223            8.73s
         3           1.3200           0.0203            8.39s
         4           1.3007           0.0180            8.31s
         5           1.2845           0.0167            8.50s
         6           1.2696           0.0152            8.47s
         7           1.2552           0.0139            8.28s
         8           1.2419           0.0127            8.15s
         9           1.2294           0.0121            8.12s
        10           1.2180           0.0108            8.00s
        11           1.2086           0.0103            7.95s
        12           1.2004           0.0102            7.83s
        13           1.1910           0.0087            7.75s
        14           1.1808           0.0083            7.65s
        15           1.1719           0.0080            7.58s
        16           1.1645           0.0070            7.53s
        



      Iter       Train Loss      OOB Improve   Remaining Time 
         1           1.3620           0.0246            8.27s
         2           1.3398           0.0219            8.06s
         3           1.3203           0.0203            7.90s
         4           1.3011           0.0178            8.19s
         5           1.2847           0.0166            8.06s
         6           1.2702           0.0152            7.91s
         7           1.2558           0.0141            7.80s
         8           1.2421           0.0126            7.77s
         9           1.2297           0.0120            7.84s
        10           1.2178           0.0109            7.74s
        11           1.2091           0.0104            7.65s
        12           1.2005           0.0102            7.60s
        13           1.1908           0.0088            7.51s
        14           1.1800           0.0089            7.45s
        15           1.1711           0.0081            7.41s
       



         3           1.3201           0.0202            7.78s
         4           1.3012           0.0180            7.82s
         5           1.2847           0.0167            7.73s
         6           1.2694           0.0150            7.64s
         7           1.2556           0.0140            7.64s
         8           1.2421           0.0127            7.72s
         9           1.2300           0.0117            7.65s
        10           1.2186           0.0108            7.58s
        11           1.2096           0.0100            7.57s
        12           1.2010           0.0103            7.48s
        13           1.1911           0.0090            7.41s
        14           1.1808           0.0089            7.38s
        15           1.1717           0.0083            7.34s
        16           1.1641           0.0070            7.27s
        17           1.1581           0.0067            7.19s
        18           1.1520           0.0062            7.14s
        

## Initial evaluation result
As demonstrated above, whether employing Gradient Boosting with or without cross-validation, the F1 score hovers around 0.75. While this performance is acceptable, there is room for improvement.
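
Because the classifier was created with `verbose=2`, every cross-validation fold prints its full training log, which buries the scores. A possible way to get a compact summary is sketched below on synthetic data: clone the estimator, switch verbosity off, and report the mean and standard deviation across folds. The names `noisy_model` and `quiet_model` and the reduced `n_estimators` are assumptions made for the example.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data so the example runs in a few seconds (not the BRFSS data).
X_cv, y_cv = make_classification(n_samples=400, n_features=10, random_state=1)

noisy_model = GradientBoostingClassifier(
    n_estimators=30, subsample=0.8, random_state=90, verbose=2
)
# clone() copies the hyperparameters without any fitted state;
# set_params() then silences the per-iteration log.
quiet_model = clone(noisy_model).set_params(verbose=0)

cv_scores = cross_val_score(quiet_model, X_cv, y_cv, cv=5, scoring="f1_macro")
print(f"macro F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference of about 0.01 between two models is meaningful.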

## Second model: Logistic regression
Logistic regression is a linear model and therefore less flexible than tree-based ensemble methods, but it remains a very common choice for classification problems. It offers several distinct advantages: strong interpretability, coefficients that indicate the direction and relative weight of each feature, and the ability to output class probabilities rather than only binary labels. This probabilistic output can prove particularly valuable when the result is a risk estimate rather than a hard yes/no decision.
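
The sketch below illustrates the probabilistic output on synthetic data: it reads the fitted coefficients and intercept, recomputes the positive-class probability with the sigmoid function, and compares it with `predict_proba`. The data and variable names are made up; the point is only that a binary logistic regression can be reproduced outside scikit-learn from its weights, which is relevant if the weights are exported to another project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the BRFSS data).
X_lr, y_lr = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=2
)
clf = LogisticRegression(max_iter=1000).fit(X_lr, y_lr)

# Class probabilities from scikit-learn: columns are P(class 0), P(class 1).
proba = clf.predict_proba(X_lr[:3])

# The same positive-class probability from the raw weights and bias.
weights, bias = clf.coef_[0], clf.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-(X_lr[:3] @ weights + bias)))
print(proba[:, 1])
print(manual)
```

The sign of each entry in `weights` shows whether a feature pushes the prediction toward class 1 or class 0; comparing magnitudes is only meaningful when the features are on similar scales.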







In [48]:
log_reg = LogisticRegression(random_state=32, solver='sag', multi_class='ovr', verbose=2)

In [49]:
log_reg.fit(x_train, y_train)

max_iter reached after 4 seconds




In [50]:
y_pred_log_reg = log_reg.predict(x_test)

In [51]:
report_log_reg = classification_report(y_test, y_pred_log_reg)
print(report_log_reg)

              precision    recall  f1-score   support

         0.0       0.75      0.73      0.74      7010
         1.0       0.74      0.76      0.75      7129

    accuracy                           0.74     14139
   macro avg       0.74      0.74      0.74     14139
weighted avg       0.74      0.74      0.74     14139



## Evaluation result
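Logistic regression performed slightly worse than Gradient Boosting (macro F1 of 0.74 versus 0.75), indicating that additional data preprocessing may be necessary to improve model outcomes. The `max_iter reached` message above also shows that the `sag` solver stopped before converging; this solver converges quickly only when the features are on approximately the same scale, which is not the case here (for example, `BMI` versus the binary indicator columns).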
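
One preprocessing step that could be tried is feature standardization. The sketch below, on synthetic data with one deliberately stretched column, chains a `StandardScaler` and a `sag` logistic regression in a pipeline so that the scaling is learned from the training split only. The data, the `max_iter=1000` setting, and the name `scaled_log_reg` are assumptions for the illustration; whether scaling actually improves the scores on this dataset would need to be checked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is stretched to mimic mixed feature ranges.
X_sc, y_sc = make_classification(n_samples=600, n_features=8, random_state=3)
X_sc[:, 0] *= 50.0

X_tr, X_te, y_tr, y_te = train_test_split(X_sc, y_sc, test_size=0.2, random_state=9)

# The scaler is fitted on the training data only, then applied before the model.
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000, random_state=32),
)
scaled_log_reg.fit(X_tr, y_tr)
print("test accuracy:", scaled_log_reg.score(X_te, y_te))
```

With the real data the pipeline would be fitted on `x_train` and `y_train` and evaluated with `classification_report` as before, so the two logistic regression runs stay directly comparable.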