Diabetes Progression Predictor

Diabetes is a chronic disease that alters how the body turns food into energy. Food is normally broken down into simple sugars (glucose) and released into the bloodstream. Insulin (a hormone made by the pancreas) helps the glucose pass from the bloodstream into the cells. In diabetic patients, the pancreas can no longer make insulin, or the body cannot adequately use the insulin it does produce. Without careful management, diabetes can lead to dangerous complications.

This is a linear regression model used to predict disease progression one year after baseline. The data come from a study by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) titled "Least Angle Regression," Annals of Statistics. It is accessible here:
Source URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
Data URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
Note: The Data URL above is taken from the source URL page, which provides detailed information about the dataset and its variables, along with reference links including the dataset link.

The dataset contains n=442 observations and ten baseline variables:

age: age in years
sex: sex
bmi: body mass index
bp: average blood pressure
s1: tc, total serum cholesterol
s2: ldl, low-density lipoproteins
s3: hdl, high-density lipoproteins
s4: tch, total cholesterol / HDL
s5: ltg, possibly log of serum triglycerides level
s6: glu, blood sugar level

Data Set Characteristics
Number of Instances: 442
Number of Attributes: First 10 columns are numeric predictive values
Target: Column 11 is a quantitative measure of disease progression one year after baseline. It is the variable of interest.


Diabetes Progression Predictor: an MLR Model

In [1]:
#Import tools 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn import datasets 

In [4]:
#Read the data from the URL into a pandas dataframe called "df".
#The file is tab-separated, so set sep="\t".
df = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep = "\t")
print(df.head())
print(df.tail())

   AGE  SEX   BMI     BP   S1     S2    S3   S4      S5  S6    Y
0   59    2  32.1  101.0  157   93.2  38.0  4.0  4.8598  87  151
1   48    1  21.6   87.0  183  103.2  70.0  3.0  3.8918  69   75
2   72    2  30.5   93.0  156   93.6  41.0  4.0  4.6728  85  141
3   24    1  25.3   84.0  198  131.4  40.0  5.0  4.8903  89  206
4   50    1  23.0  101.0  192  125.4  52.0  4.0  4.2905  80  135
     AGE  SEX   BMI      BP   S1     S2    S3    S4      S5   S6    Y
437   60    2  28.2  112.00  185  113.8  42.0  4.00  4.9836   93  178
438   47    2  24.9   75.00  225  166.0  42.0  5.00  4.4427  102  104
439   60    2  24.9   99.67  162  106.6  43.0  3.77  4.1271   95  132
440   36    1  30.0   95.00  201  125.2  42.0  4.79  5.1299   85  220
441   36    1  19.6   71.00  250  133.2  97.0  3.00  4.5951   92   57


In [5]:
#Get some information on the variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AGE     442 non-null    int64  
 1   SEX     442 non-null    int64  
 2   BMI     442 non-null    float64
 3   BP      442 non-null    float64
 4   S1      442 non-null    int64  
 5   S2      442 non-null    float64
 6   S3      442 non-null    float64
 7   S4      442 non-null    float64
 8   S5      442 non-null    float64
 9   S6      442 non-null    int64  
 10  Y       442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


In [7]:
#Convert SEX to a categorical variable; it should not be treated as an integer
categorical_variables = ["SEX"]
df[categorical_variables] = df[categorical_variables].astype("category")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   AGE     442 non-null    int64   
 1   SEX     442 non-null    category
 2   BMI     442 non-null    float64 
 3   BP      442 non-null    float64 
 4   S1      442 non-null    int64   
 5   S2      442 non-null    float64 
 6   S3      442 non-null    float64 
 7   S4      442 non-null    float64 
 8   S5      442 non-null    float64 
 9   S6      442 non-null    int64   
 10  Y       442 non-null    int64   
dtypes: category(1), float64(6), int64(4)
memory usage: 35.2 KB
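
Once SEX is categorical, a formula-based regression treats it as a factor with two levels rather than as a number, so it enters the model as a 0/1 indicator for one level relative to a baseline level. The sketch below imitates that encoding with pd.get_dummies on the five SEX values shown in df.head(). It only illustrates the idea and is an assumption about the encoding, not the call that statsmodels makes internally.

```python
import pandas as pd

# SEX values from the first five rows of the dataset (see df.head() output).
sex = pd.Series([2, 1, 2, 1, 1], name="SEX").astype("category")

# drop_first=True keeps level 1 as the baseline and leaves a single
# indicator column that is 1 when SEX == 2 and 0 otherwise.
dummies = pd.get_dummies(sex, prefix="SEX", drop_first=True).astype(int)
print(dummies)
```

With this coding, the SEX coefficient in the regression is read as the expected difference in Y between level 2 and level 1, holding the other predictors fixed.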


In [9]:
#Let's examine the data
dfDes = df.describe(include = "all")
print(dfDes)

               AGE    SEX         BMI          BP          S1          S2  \
count   442.000000  442.0  442.000000  442.000000  442.000000  442.000000   
unique         NaN    2.0         NaN         NaN         NaN         NaN   
top            NaN    1.0         NaN         NaN         NaN         NaN   
freq           NaN  235.0         NaN         NaN         NaN         NaN   
mean     48.518100    NaN   26.375792   94.647014  189.140271  115.439140   
std      13.109028    NaN    4.418122   13.831283   34.608052   30.413081   
min      19.000000    NaN   18.000000   62.000000   97.000000   41.600000   
25%      38.250000    NaN   23.200000   84.000000  164.250000   96.050000   
50%      50.000000    NaN   25.700000   93.000000  186.000000  113.000000   
75%      59.000000    NaN   29.275000  105.000000  209.750000  134.500000   
max      79.000000    NaN   42.200000  133.000000  301.000000  242.400000   

                S3          S4          S5          S6           Y  
count 

In [10]:
#Split data set into test (0.3) and train (0.7) with a random state of 42.
df_train, df_test = train_test_split(df, test_size= 0.3, random_state = 42)
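
The regression summaries further down report 309 observations, which follows from a 0.3 test fraction applied to 442 rows. The sketch below reproduces that arithmetic. The helper name split_sizes is hypothetical, and the rounding rule (test count rounded up, training set takes the remainder) is an assumption that is consistent with the 309 training rows reported.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives whatever rows remain.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(442, 0.3)
print(n_train, n_test)  # 309 133
```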

In [12]:
#Fit a multiple linear OLS regression model to the training dataset.
est_train = ols(formula= "Y ~ AGE + SEX + BMI + BP + S1 + S2 + S3 + S4 + S5 + S6", data = df_train).fit()
print(est_train.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.524
Model:                            OLS   Adj. R-squared:                  0.508
Method:                 Least Squares   F-statistic:                     32.86
Date:                Fri, 31 Jan 2025   Prob (F-statistic):           1.37e-42
Time:                        11:26:24   Log-Likelihood:                -1671.5
No. Observations:                 309   AIC:                             3365.
Df Residuals:                     298   BIC:                             3406.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -341.2349     78.615     -4.341      0.0

In [None]:
#Remove the nonsignificant coefficients (p > 0.05) and run the model again with only the variables that have significant coefficients (p < 0.05)

In [14]:
est_train = ols(formula = "Y ~ SEX + BMI + BP + S5", data = df_train).fit()
print(est_train.summary())
print(est_train.params)

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.481
Model:                            OLS   Adj. R-squared:                  0.474
Method:                 Least Squares   F-statistic:                     70.44
Date:                Fri, 31 Jan 2025   Prob (F-statistic):           3.76e-42
Time:                        11:30:04   Log-Likelihood:                -1685.0
No. Observations:                 309   AIC:                             3380.
Df Residuals:                     304   BIC:                             3399.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -332.2349     31.853    -10.430      0.0
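
The adjusted R-squared values in the two summaries can be checked by hand from the reported R-squared, the number of observations, and the number of model degrees of freedom, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below does this for both models. The function name adjusted_r2 is hypothetical, and the inputs are the rounded figures printed in the summaries.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Full model: R^2 = 0.524, n = 309, Df Model = 10
full = adjusted_r2(0.524, 309, 10)
# Reduced model: R^2 = 0.481, n = 309, Df Model = 4
reduced = adjusted_r2(0.481, 309, 4)

print(round(full, 3), round(reduced, 3))  # 0.508 0.474
```

Dropping six predictors lowers R-squared from 0.524 to 0.481, but the adjusted value falls less (0.508 to 0.474) because the penalty for the number of predictors shrinks as well.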

In [15]:
#Test the model: make predictions on the test data and measure the out-of-sample (OOS) R^2
test_pred = est_train.predict(df_test)
r2 = r2_score(df_test["Y"], test_pred)
print("OOS R-squared: " + str(r2))


OOS R-squared: 0.48587882336937604
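
For reference, the R^2 reported here is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. The sketch below computes it in plain Python. The function name r_squared is hypothetical, the y_true values are the first five Y values from df.head(), and the y_pred values are made up for illustration and are not output from the fitted model.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [151, 75, 141, 206, 135]  # first five Y values from the dataset
y_pred = [160, 90, 150, 180, 130]  # hypothetical predictions, illustration only

print(r_squared(y_true, y_pred))
```

A perfect prediction gives 1.0, and always predicting the mean of the observed values gives 0.0, so an out-of-sample value near 0.49 means the model explains roughly half of the variance in the held-out Y values.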


Conclusion:
R^2 is very similar for the training data (0.481) and the test data (0.486), which suggests that the model is not overfitting much. In this model, the significant predictors of diabetes progression are SEX, BMI, BP, and S5 (possibly the log of serum triglycerides). Stay Healthy!!!