Title: Predicting Diabetes Through Common Health Metrics


Description:

Predicting the onset of diabetes is a critical challenge in public health, and this project aims to address it by leveraging machine learning techniques applied to real-world data from the National Health and Nutrition Examination Survey (NHANES). The objective is to build a predictive model that identifies individuals at elevated risk for developing diabetes by examining key clinical and lifestyle variables. With diabetes rates increasing and its significant impact on healthcare systems, early detection can lead to timely interventions, better patient outcomes, and reduced long-term healthcare costs.

The project will utilize NHANES data from the Centers for Disease Control and Prevention (CDC) website (https://wwwn.cdc.gov/nchs/nhanes/Default.aspx). This dataset provides a wide range of information, including demographics, socioeconomic data, dietary habits, clinical measurements, and health exam results. By focusing on essential variables such as fasting plasma glucose, body mass index (BMI), blood pressure, lipid profiles, physical activity levels, and dietary intake, the project will isolate the factors that most strongly predict the risk of developing diabetes. I will probably use a technique I learned in an earlier project to get the two or three best variables that accurately predict diabetes outcomes. 

The modeling process will start with straightforward supervised learning algorithms like Logistic Regression and Decision Trees to establish a performance baseline. From there, more advanced techniques, such as Random Forests and SVMs, will be implemented to capture any nonlinear interactions and improve overall predictive accuracy. If the data demonstrates sufficient complexity, I will consider using an MLP. 



Status: I am still working on data processing, but should have it done soon and I have been playing a lot with the different types of variables I am going to use. I think I am also going to run a cross validation to see which model I am going to use, but it will probably be a SVM or Gradient Boosting. 


Data Processing

In [None]:
import pandas as pd


# Use pandas.read_sas to read the XPT file
diabetes_ques = pd.read_sas('DIQ_L.xpt', format='xport')
fasting_glucose = pd.read_sas('GLU_L.xpt', format='xport')
insulin = pd.read_sas('INS_L.xpt', format='xport')
tot_cholesterol = pd.read_sas('TCHOL_L.xpt', format='xport')
weight_hist = pd.read_sas('WHQ_L.xpt', format='xport')
demographics = pd.read_sas('DEMO_L.xpt', format='xport')

demographics = demographics[['SEQN', 'RIAGENDR', 'RIDAGEYR']]
print(demographics.head())



diabetes_ques = diabetes_ques[['SEQN', 'DIQ160']]
#DIQ160: Have you ever been told by a doctor or health professional that you have diabetes?
# 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know
#print(diabetes_ques['DIQ160'].value_counts())

#get fasting glucose (mg/dL), just means glucose level after fasting for 8 hours
fasting_glucose = fasting_glucose[['SEQN', 'LBXGLU']]
#LBXGLU: Glucose, Serum or Plasma (mg/dL)
fasting_glucose['diabetes_meas'] = fasting_glucose['LBXGLU'].apply(lambda x: 1 if x >= 126 else 0)
#print(fasting_glucose['diabetes'].value_counts())



#measured in (pmol/L)
insulin = insulin[['SEQN', 'LBDINSI']]

#measured in (mg/dL)
tot_cholesterol = tot_cholesterol[['SEQN', 'LBXTC']]

#weight history height (inch) and weight (lbs) 
weight_hist = weight_hist[['SEQN', 'WHD010', 'WHD020']]
weight_hist['BMI'] = (weight_hist['WHD020'] / (weight_hist['WHD010'] * weight_hist['WHD010'])) * 703
weight_hist['is_overweight'] = weight_hist['BMI'].apply(lambda x: 1 if x >= 30 else 0)
#print(weight_hist.head())

#comparing the questionnaire to the fasting glucose test
diab_merged = pd.merge(diabetes_ques, fasting_glucose, on='SEQN', how='inner')
diab_merged['is_diabetic'] = diab_merged['DIQ160'].apply(lambda x: 1 if x == 1.0 else 0)
matches = diab_merged[diab_merged['is_diabetic'] == diab_merged['diabetes_meas']]
#print(f"Number of matches: {len(matches)}")
#print(f"Number of mismatches: {len(diab_merged) - len(matches)}")

insulin_merged = pd.merge(insulin, fasting_glucose, on='SEQN', how='inner')
insulin_merged['HOMA_IR'] = (insulin_merged['LBDINSI'] * insulin_merged['LBXGLU']) / 405
insulin_merged['high_homa'] = insulin_merged['HOMA_IR'].apply(lambda x: 1 if x >= 2.5 else 0)
print(insulin_merged['high_homa'].value_counts())


       SEQN  RIAGENDR  RIDAGEYR  RIDRETH1
0  130378.0       1.0      43.0       5.0
1  130379.0       1.0      66.0       3.0
2  130380.0       2.0      44.0       2.0
3  130381.0       2.0       5.0       5.0
4  130382.0       1.0       2.0       3.0
high_homa
1    3456
0     540
Name: count, dtype: int64


In [14]:
from sklearn.metrics import confusion_matrix, classification_report

y_true = diab_merged['diabetes_meas']
y_pred = diab_merged['is_diabetic']

# confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}\nFN={fn}  TP={tp}")

# classification report
print(classification_report(y_true, y_pred, target_names=['no-diabetes','diabetes']))

TN=3196  FP=370
FN=384  TP=46
              precision    recall  f1-score   support

 no-diabetes       0.89      0.90      0.89      3566
    diabetes       0.11      0.11      0.11       430

    accuracy                           0.81      3996
   macro avg       0.50      0.50      0.50      3996
weighted avg       0.81      0.81      0.81      3996



In [None]:
#K-fold cross-validation

