# Diabetes Risk Prediction
This project uses Behavioral Risk Factor Surveillance System (BRFSS) survey data from the [CDC BRFSS 2024 annual data page](https://www.cdc.gov/brfss/annual_data/annual_2024.html) to predict the probability of developing different types of Diabetes. The features describe U.S. residents and include demographic data (e.g. income level, education, race) as well as data on health-related risk behaviors, chronic health conditions, and use of preventive services.

This is the **second notebook** for the project. It reads the dataframe created by the `parse_raw_data.ipynb` notebook, performs exploratory data analysis (EDA), and develops a predictive model for assessing Diabetes risk. If all goes well, I will try to add multiclass prediction to include Prediabetes risk. The primary target variable for this study is the DIABETE4 column in the dataframe (i.e. "(Ever told) you had diabetes"), which includes Prediabetes as a separate class. The target is highly imbalanced, so training the model effectively will require additional effort. The other two Diabetes-related columns (PREDIAB2 and DIABTYPE) are included for validation purposes and for potential future study.

There are 26 candidate features included in the input dataframe. EDA and modeling evaluation will likely reduce this number. The final columns used for the predictive model will be listed here when they have been selected.

The input dataset contains 2 identifier columns, 3 target variable candidates, and 26 potential features. Each column is listed below as its human-readable label (from the HTML file) followed by its name in the dataframe (i.e. the SAS variable name).
- Each row is uniquely identified by (i.e. the table's grain)
  1. "State FIPS Code" -> _STATE
  2. "Annual Sequence Number" -> SEQNO
- Target variable candidates related to Diabetes include
  1. "(Ever told) you had diabetes" -> DIABETE4
  2. "Ever been told by a doctor or other health professional that you have pre-diabetes or borderline diabetes?" -> PREDIAB2
  3. "What type of diabetes do you have?" -> DIABTYPE
- Demographic features include
  1. "Urban/Rural Status" -> _URBSTAT
  2. "Reported age in five-year age categories calculated variable" -> _AGEG5YR
  3. "Sex of Respondent" -> SEXVAR
  4. "Computed Race-Ethnicity grouping" -> _RACE
  5. "Education Level" -> EDUCA
  6. "Income Level" -> INCOME3
- Personal health features include
  1. "Have Personal Health Care Provider?" -> PERSDOC3
  2. "Could Not Afford To See Doctor" -> MEDCOST1
  3. "Computed Weight in Kilograms" -> WTKG3
  4. "Computed Height in Meters" -> HTM4
  5. "Computed body mass index" -> _BMI5
  6. "Exercise in Past 30 Days" -> EXERANY2
  7. "How often did you drink regular soda or pop that contains sugar?" -> SSBSUGR2
  8. "How often did you drink sugar-sweetened drinks?" -> SSBFRUT3
  9. "Computed Smoking Status" -> _SMOKER3
  10. "Computed number of drinks of alcohol beverages per week" -> _DRNKWK3
  11. "Drink any alcoholic beverages in past 30 days" -> DRNKANY6
  12. "Heavy Alcohol Consumption Calculated Variable" -> _RFDRHV9
  13. "General Health" -> GENHLTH
- Other disease indicator features include
  1. "Ever Diagnosed with Heart Attack" -> CVDINFR4
  2. "Ever Diagnosed with Angina or Coronary Heart Disease" -> CVDCRHD4
  3. "Ever Diagnosed with a Stroke" -> CVDSTRK3
  4. "Ever told you have kidney disease?" -> CHCKDNY2
  5. "Ever Told Had Asthma" -> ASTHMA3
  6. "(Ever told) you had a depressive disorder" -> ADDEPEV3
  7. "Told Had Arthritis" -> HAVARTH4

**Notes**
- The target variable is imbalanced, so I will need to use stratified train/validation/test splitting. A stretch goal would be to stratify by State as well, so that each split has a similar geographic distribution.

## Setup
### Define parameters
The input/output parameters are defined in the next cell.

In [51]:
input_data_file = "diabetes_data.pickle"
target_col = "DIABETE4"
target_val = "Yes"
cols_to_remove = [
    "PREDIAB2",
    "DIABTYPE",
    "SSBSUGR2",
    "SSBFRUT3",
    "DRNKANY6",
    "_RFDRHV9"
]
categorical_features = [
    "_URBSTAT",
    "SEXVAR",
    "_RACE",
    "PERSDOC3",
    "MEDCOST1",
    "EXERANY2",
    "_SMOKER3",
    "CVDINFR4",
    "CVDCRHD4",
    "CVDSTRK3",
    "CHCKDNY2",
    "ASTHMA3",
    "ADDEPEV3",
    "HAVARTH4"
]
ordinal_features = [
    "_AGEG5YR",
    "EDUCA",
    "INCOME3",
    "GENHLTH"
]
numeric_features = [
    "WTKG3",
    "HTM4",
    "_BMI5",
    "_DRNKWK3"
]
val_ratio = .2
test_ratio = .2

### Import packages

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score, roc_auc_score, f1_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

pd.set_option('display.max_columns', None)

### Define Functions

In [None]:
def split_train_val_test(df, val_ratio=.2, test_ratio=.2, r_seed=1, verbose=False):
    """Split the input dataframe into randomly shuffled train, validation, and
    test datasets using sklearn's train_test_split. The validation dataset
    contains val_ratio of the input rows and the test dataset contains
    test_ratio of the input rows; the remainder goes to the training dataset.
    """
    # Hold out the test dataset first
    full_train_df, test_df = train_test_split(df, test_size=test_ratio, random_state=r_seed)
    test_df = test_df.reset_index(drop=True)
    # Split the remainder into train and validation datasets. val_ratio is a
    # fraction of the full input, so rescale it to the rows that are left.
    val_ft_ratio = val_ratio / (1 - test_ratio)
    train_df, val_df = train_test_split(full_train_df, test_size=val_ft_ratio, random_state=r_seed)
    train_df = train_df.reset_index(drop=True)
    val_df = val_df.reset_index(drop=True)
    if verbose:
        n_split = len(train_df) + len(val_df) + len(test_df)
        print(f"All rows in the original dataframe are contained within the training, validation, or test datasets: {n_split == len(df)}")
    return train_df, val_df, test_df
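
The Notes section calls for stratified splitting, while `split_train_val_test` shuffles purely at random. Below is a minimal sketch of how a stratified variant might look. The function name `stratified_split`, the `strat_cols` parameter, and the toy dataframe are hypothetical and only for illustration; the sketch assumes the dataframe index is unique and that every combination of stratification values has enough rows to appear in each split (very rare State/target combinations would make `train_test_split` raise an error).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, strat_cols, val_ratio=0.2, test_ratio=0.2, r_seed=1):
    """Split df into train/val/test while preserving the joint distribution
    of the columns in strat_cols (hypothetical helper, not the notebook's)."""
    # One combined label per row, e.g. "1|Alabama" for target + state
    strata = df[strat_cols].astype(str).apply("|".join, axis=1)
    full_train_df, test_df = train_test_split(
        df, test_size=test_ratio, random_state=r_seed, stratify=strata
    )
    # val_ratio refers to the full input, so rescale it to the remaining rows
    val_ft_ratio = val_ratio / (1 - test_ratio)
    train_df, val_df = train_test_split(
        full_train_df,
        test_size=val_ft_ratio,
        random_state=r_seed,
        stratify=strata.loc[full_train_df.index],
    )
    return (
        train_df.reset_index(drop=True),
        val_df.reset_index(drop=True),
        test_df.reset_index(drop=True),
    )


# Toy data standing in for the real dataframe (values are made up)
toy_df = pd.DataFrame({
    "_STATE": ["Alabama"] * 50 + ["Alaska"] * 50,
    "DIABETE4": ([1] * 10 + [0] * 40) * 2,
})
train_df, val_df, test_df = stratified_split(toy_df, ["DIABETE4", "_STATE"])
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    print(name, len(part), round(part["DIABETE4"].mean(), 3))
```

Passing only `["DIABETE4"]` as `strat_cols` would stratify on the target alone, which is the simpler option if some States turn out to have too few positive rows.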

## Load data
### Input dataframe

In [3]:
input_df = pd.read_pickle(input_data_file)

In [24]:
display(input_df.head())
display(input_df.info())

|  | _STATE | SEQNO | DIABETE4 | PREDIAB2 | DIABTYPE | _URBSTAT | _AGEG5YR | SEXVAR | _RACE | EDUCA | INCOME3 | PERSDOC3 | MEDCOST1 | WTKG3 | HTM4 | _BMI5 | EXERANY2 | SSBSUGR2 | SSBFRUT3 | _SMOKER3 | _DRNKWK3 | DRNKANY6 | _RFDRHV9 | GENHLTH | CVDINFR4 | CVDCRHD4 | CVDSTRK3 | CHCKDNY2 | ASTHMA3 | ADDEPEV3 | HAVARTH4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Alabama | 2024000001 | No |  |  | Urban counties (_URBNRRL = 1,2,3,4,5) | Age 75 to 79 | Female | White only, non-Hispanic | Grade 12 or GED (High school graduate) | Refused | More than one | No | 59.42 | 1.63 | 22.49 | Yes |  |  | Never smoked | 0.0 | No | No | Good | No | No | No | No | No | No | Yes |
| 1 | Alabama | 2024000002 | No |  |  | Urban counties (_URBNRRL = 1,2,3,4,5) | Age 80 or older | Male | White only, non-Hispanic | College 4 years or more (College graduate) | $200,000 or more | Yes, only one | No | 81.65 | 1.78 | 25.83 | Yes |  |  | Former smoker | 0.0 | No | No | Excellent | No | Yes | No | No | No | No | Yes |
| 2 | Alabama | 2024000003 | No |  |  | Urban counties (_URBNRRL = 1,2,3,4,5) | Age 55 to 59 | Male | White only, non-Hispanic | College 1 year to 3 years (Some college or tec... | Refused | No | Yes | 88.45 | 1.98 | 22.53 | Yes |  |  | Current smoker - now smokes every day | 14.0 | Yes | No | Very good | No | No | No | No | No | No | Yes |
| 3 | Alabama | 2024000004 | No |  |  | Urban counties (_URBNRRL = 1,2,3,4,5) | Age 80 or older | Male | White only, non-Hispanic | College 4 years or more (College graduate) | Less than $50,000 ($35,000 to < $50,000) | Yes, only one | No | 74.84 | 1.73 | 25.09 | Yes |  |  | Never smoked | 0.0 | No | No | Excellent | No | No | No | No | No | No | Yes |
| 4 | Alabama | 2024000005 | No |  |  | Urban counties (_URBNRRL = 1,2,3,4,5) | Age 45 to 49 | Male | White only, non-Hispanic | College 1 year to 3 years (Some college or tec... | Less than $20,000 ($15,000 to < $20,000) | Yes, only one | No | 58.97 | 1.73 | 19.77 | No |  |  | Never smoked | 0.0 | No | No | Good | No | No | No | No | No | No | No |


<class 'pandas.core.frame.DataFrame'>
Index: 457656 entries, 0 to 457669
Data columns (total 31 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   _STATE    457656 non-null  object 
 1   SEQNO     457656 non-null  object 
 2   DIABETE4  457652 non-null  object 
 3   PREDIAB2  159191 non-null  object 
 4   DIABTYPE  13818 non-null   object 
 5   _URBSTAT  443033 non-null  object 
 6   _AGEG5YR  457656 non-null  object 
 7   SEXVAR    457656 non-null  object 
 8   _RACE     457656 non-null  object 
 9   EDUCA     457649 non-null  object 
 10  INCOME3   448387 non-null  object 
 11  PERSDOC3  457653 non-null  object 
 12  MEDCOST1  457650 non-null  object 
 13  WTKG3     421264 non-null  float64
 14  HTM4      433599 non-null  float64
 15  _BMI5     414632 non-null  float64
 16  EXERANY2  457653 non-null  object 
 17  SSBSUGR2  115597 non-null  object 
 18  SSBFRUT3  115311 non-null  object 
 19  _SMOKER3  457656 non-null  object 
 20  _DRNKWK3 

None

## Preprocessing
### Clean target data
Perform the following steps:
- Remove rows in the dataframe with `DIABETE4` **not** equal to either "Yes" or "No"
- Encode target values such that "Yes" -> 1 and "No" -> 0

In [61]:
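# Keep only rows with an unambiguous "Yes"/"No" answer; copy so the assignment
# below modifies an independent dataframe rather than a slice of input_df
pp_df = input_df[input_df[target_col].isin(["Yes", "No"])].copy()
# Encode the target: "Yes" -> 1, "No" -> 0
pp_df[target_col] = (pp_df[target_col] == target_val).astype(int)
pp_df[target_col].value_counts(normalize=True)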

DIABETE4
0    0.851091
1    0.148909
Name: proportion, dtype: float64
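
With roughly 85% negatives and 15% positives, one common way to handle the imbalance is to weight the classes during training. The sketch below only illustrates how such weights could be derived; it uses toy labels with the same rough proportions rather than the BRFSS rows, and whether class weighting is the right choice for this project is still an open question.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with roughly the 85% / 15% split shown above (not BRFSS rows)
y = np.array([0] * 85 + [1] * 15)

# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))

# Ratio of negatives to positives, a common starting value for
# XGBoost's scale_pos_weight parameter
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

print(class_weight)
print(scale_pos_weight)
```

The resulting dictionary could be passed to `LogisticRegression(class_weight=...)` or `RandomForestClassifier(class_weight=...)`, and the ratio to `xgb.XGBClassifier(scale_pos_weight=...)`.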

### Map Missing values to Unknown
For categorical and ordinal features, every value that represents a missing answer (e.g. "Refused", NaN) should be converted to "Unknown".
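
A minimal sketch of how this mapping might be done is shown below. The helper name `map_missing_to_unknown`, the toy dataframe, and the `"Don't know/Not sure"` label are assumptions for illustration; only "Refused" and NaN come from the description above, and the real list of labels would need to be checked against the data. The sketch also assumes the columns have `object` dtype, as in the `info()` output, rather than pandas `category` dtype.

```python
import numpy as np
import pandas as pd

# Labels treated as missing; "Refused" is from the text, the other is an assumption
missing_labels = ["Refused", "Don't know/Not sure"]


def map_missing_to_unknown(df, cols, missing_labels, fill_value="Unknown"):
    """Return a copy of df where NaN and the given labels in cols become fill_value."""
    out = df.copy()
    # True wherever the cell is NaN or holds one of the missing-value labels
    is_missing = out[cols].isna() | out[cols].isin(missing_labels)
    out[cols] = out[cols].mask(is_missing, fill_value)
    return out


# Toy data standing in for the real dataframe (values are made up)
toy_df = pd.DataFrame({
    "INCOME3": ["Refused", "$200,000 or more", np.nan],
    "_URBSTAT": [np.nan, "Urban counties", "Rural counties"],
    "WTKG3": [59.42, np.nan, 81.65],
})
clean_df = map_missing_to_unknown(toy_df, ["INCOME3", "_URBSTAT"], missing_labels)
print(clean_df)
```

In the notebook, `cols` would presumably be `categorical_features + ordinal_features`, which leaves missing values in the numeric columns untouched.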

## EDA
### Target variable

np.int64(65806)