# Diabetes Risk Prediction
This project uses 2024 Behavioral Risk Factor Surveillance System (BRFSS) survey data from [the CDC annual data page](https://www.cdc.gov/brfss/annual_data/annual_2024.html) to predict the probability of developing different types of Diabetes. Features about U.S. residents include demographic data (e.g., income level, education, race) as well as data on health-related risk behaviors, chronic health conditions, and use of preventive services.

This is the **second notebook** for the project. It reads the dataframe created by the `parse_raw_data.ipynb` notebook, performs exploratory data analysis (EDA), and develops a predictive model for assessing Diabetes risk. If all goes well, I will try to add multiclass prediction to include Prediabetes risk. The primary target variable for this study is the DIABETE4 column in the dataframe (i.e. "(Ever told) you had diabetes"), which includes Prediabetes as a separate class. The dataset is highly imbalanced, so training the model effectively will require additional effort. Note that the other two Diabetes-related columns (PREDIAB2 and DIABTYPE) are included for validation purposes and for potential future study.
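Because the target is imbalanced, one common starting point is to binarize `DIABETE4` against the positive value, stratify the train/test split so both parts keep the same positive rate, and let the classifier reweight the classes. The sketch below runs on a made-up toy frame (the `_BMI5` values are invented), and `class_weight="balanced"` is only one of several possible options, not necessarily the approach this notebook settles on.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataframe; all values are invented for illustration.
toy_df = pd.DataFrame({
    "DIABETE4": ["No"] * 16 + ["Yes"] * 4,
    "_BMI5": [22.0, 24.5, 23.1, 26.0, 21.7, 25.2, 27.3, 22.8,
              24.0, 23.6, 26.4, 25.9, 22.2, 24.8, 23.3, 27.0,
              31.5, 33.2, 29.8, 35.1],
})

# Binary target: 1 when the respondent answered "Yes", 0 otherwise.
y = (toy_df["DIABETE4"] == "Yes").astype(int)
X = toy_df[["_BMI5"]]

# stratify=y keeps the positive rate the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

print("positive rate (train):", y_train.mean())
print("positive rate (test): ", y_test.mean())
```

With an imbalanced target, metrics such as ROC AUC or F1 are usually more informative than plain accuracy.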

There are 26 candidate features included in the input dataframe. EDA and modeling evaluation will likely reduce this number. The final columns used for the predictive model will be listed here when they have been selected.

The input dataset contains 2 identifier columns, 3 target variable candidates, and 26 candidate features. Each column is listed below by its human-readable label from the HTML file, followed by its column name in the dataframe (i.e. the SAS variable name).
- Each row is uniquely identified by (i.e. the table's grain)
  1. "State FIPS Code" -> _STATE
  2. "Annual Sequence Number" -> SEQNO
- Target variable candidates related to Diabetes include
  1. "(Ever told) you had diabetes" -> DIABETE4
  2. "Ever been told by a doctor or other health professional that you have pre-diabetes or borderline diabetes?" -> PREDIAB2
  3. "What type of diabetes do you have?" -> DIABTYPE
- Demographic features include
  1. "Urban/Rural Status" -> _URBSTAT
  2. "Reported age in five-year age categories calculated variable" -> _AGEG5YR
  3. "Sex of Respondent" -> SEXVAR
  4. "Computed Race-Ethnicity grouping" -> _RACE
  5. "Education Level" -> EDUCA
  6. "Income Level" -> INCOME3
- Personal health features include
  1. "Have Personal Health Care Provider?" -> PERSDOC3
  2. "Could Not Afford To See Doctor" -> MEDCOST1
  3. "Computed Weight in Kilograms" -> WTKG3
  4. "Computed Height in Meters" -> HTM4
  5. "Computed body mass index" -> _BMI5
  6. "Exercise in Past 30 Days" -> EXERANY2
  7. "How often did you drink regular soda or pop that contains sugar?" -> SSBSUGR2
  8. "How often did you drink sugar-sweetened drinks?" -> SSBFRUT3
  9. "Computed Smoking Status" -> _SMOKER3
  10. "Computed number of drinks of alcohol beverages per week" -> _DRNKWK3
  11. "Drink any alcoholic beverages in past 30 days" -> DRNKANY6
  12. "Heavy Alcohol Consumption Calculated Variable" -> _RFDRHV9
  13. "General Health" -> GENHLTH
- Other disease indicator features include
  1. "Ever Diagnosed with Heart Attack" -> CVDINFR4
  2. "Ever Diagnosed with Angina or Coronary Heart Disease" -> CVDCRHD4
  3. "Ever Diagnosed with a Stroke" -> CVDSTRK3
  4. "Ever told you have kidney disease?" -> CHCKDNY2
  5. "Ever Told Had Asthma" -> ASTHMA3
  6. "(Ever told) you had a depressive disorder" -> ADDEPEV3
  7. "Told Had Arthritis" -> HAVARTH4
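The stated grain can be checked directly: if (`_STATE`, `SEQNO`) really identifies each row, no pair should appear twice. The following is a minimal sketch on a toy frame; the name `grain_cols` is introduced here for illustration only.

```python
import pandas as pd

# Toy rows; SEQNO repeats across states but each (state, seqno) pair is unique.
toy_df = pd.DataFrame({
    "_STATE": ["Alabama", "Alabama", "Alaska"],
    "SEQNO": [2024000001, 2024000002, 2024000001],
})

grain_cols = ["_STATE", "SEQNO"]  # hypothetical helper name

# duplicated() marks every repeat of an already-seen key combination.
n_duplicates = int(toy_df.duplicated(subset=grain_cols).sum())
print(f"duplicate grain keys: {n_duplicates}")
```

On the real dataframe, the same check would be run with `input_df` in place of `toy_df`.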

## Setup
### Define parameters
The input/output parameters are defined in the next cell.

In [1]:
input_data_file = "diabetes_data.pickle"
target_col = "DIABETE4"
target_val = "Yes"
feature_cols_to_remove = [
    "SSBSUGR2",
    "SSBFRUT3",
    "DRNKANY6",
    "_RFDRHV9"
]
categorical_features = [
    "_URBSTAT",
    "SEXVAR",
    "_RACE",
    "PERSDOC3",
    "MEDCOST1",
    "EXERANY2",
    "_SMOKER3"
]
ordinal_features = [
    "_AGEG5YR",
    "EDUCA",
    "INCOME3",
    "GENHLTH"
]
numeric_features = [
    "WTKG3",
    "HTM4",
    "_BMI5",
    "_DRNKWK3"
]
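Hand-maintained column lists like these are easy to get slightly wrong (a blank entry, a column listed in two groups, or a name that is not in the dataframe). A small check along the following lines could catch such problems before modeling. The function name and the toy inputs are hypothetical and not part of the original notebook.

```python
def find_feature_list_problems(feature_groups, available_columns):
    """Report blank, duplicated, and missing names across feature groups."""
    problems = {"blank": [], "duplicated": [], "missing": []}
    seen = set()
    for group_name, columns in feature_groups.items():
        for col in columns:
            if not col.strip():
                # Record which group holds the blank entry.
                problems["blank"].append(group_name)
                continue
            if col in seen:
                problems["duplicated"].append(col)
            if col not in available_columns:
                problems["missing"].append(col)
            seen.add(col)
    return problems


# Toy inputs with one deliberate problem of each kind.
toy_groups = {
    "categorical": ["SEXVAR", "_RACE", ""],
    "ordinal": ["EDUCA", "SEXVAR"],
    "numeric": ["_BMI5", "NOT_A_COLUMN"],
}
toy_columns = ["SEXVAR", "_RACE", "EDUCA", "_BMI5"]

problems = find_feature_list_problems(toy_groups, toy_columns)
print(problems)
```

In the notebook, the groups would be the three lists defined above and `available_columns` would be `input_df.columns`.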

### Import packages

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score, roc_auc_score, f1_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

pd.set_option('display.max_columns', None)

### Define Functions

## Load data
### Input dataframe

In [3]:
input_df = pd.read_pickle(input_data_file)

In [6]:
print(input_df.head())
input_df.info()  # info() prints its own summary and returns None

    _STATE       SEQNO DIABETE4 PREDIAB2 DIABTYPE  \
0  Alabama  2024000001       No      NaN      NaN   
1  Alabama  2024000002       No      NaN      NaN   
2  Alabama  2024000003       No      NaN      NaN   
3  Alabama  2024000004       No      NaN      NaN   
4  Alabama  2024000005       No      NaN      NaN   

                                _URBSTAT         _AGEG5YR  SEXVAR  \
0  Urban counties (_URBNRRL = 1,2,3,4,5)     Age 75 to 79  Female   
1  Urban counties (_URBNRRL = 1,2,3,4,5)  Age 80 or older    Male   
2  Urban counties (_URBNRRL = 1,2,3,4,5)     Age 55 to 59    Male   
3  Urban counties (_URBNRRL = 1,2,3,4,5)  Age 80 or older    Male   
4  Urban counties (_URBNRRL = 1,2,3,4,5)     Age 45 to 49    Male   

                      _RACE  \
0  White only, non-Hispanic   
1  White only, non-Hispanic   
2  White only, non-Hispanic   
3  White only, non-Hispanic   
4  White only, non-Hispanic   

                                               EDUCA  \
0             Grade 12 

## EDA
### Target variable

In [17]:
input_df['GENHLTH'].value_counts(dropna=False)

GENHLTH
Good         156224
Very good    145789
Fair          67919
Excellent     64213
Poor          22201
Unknown         915
Refused         390
NaN               5
Name: count, dtype: int64
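The counts above show that `GENHLTH` carries non-informative answers ("Unknown", "Refused") alongside true missing values. One possible treatment, sketched here on a toy series, is to fold those answers into missing values and encode the remaining levels as ordered integers. The ordering from "Poor" to "Excellent" is an assumption made for illustration, and the notebook may handle these values differently.

```python
import numpy as np
import pandas as pd

# Assumed ordering from worst to best self-reported health.
genhlth_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# Toy series mimicking the kinds of values seen in the real column.
toy = pd.Series(
    ["Good", "Very good", "Unknown", "Refused", np.nan, "Poor"],
    name="GENHLTH",
)

# Anything outside the ordered levels (Unknown, Refused, NaN) becomes NaN.
cleaned = toy.where(toy.isin(genhlth_order))

# An ordered categorical gives integer codes 0..4; missing values get -1.
dtype = pd.CategoricalDtype(categories=genhlth_order, ordered=True)
codes = cleaned.astype(dtype).cat.codes

# Convert to float so the -1 sentinel can be replaced with a real NaN.
encoded = codes.astype(float).where(codes >= 0)
print(encoded)
```

Whether "Unknown" and "Refused" should instead be kept as their own category depends on how informative non-response turns out to be for the target.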