# Analyzing heart_disease_health_indicators_BRFSS2015.csv

Location: /work/shibberu/share/MA384_Data_Mining_Projects_Winter_2023-24/CDSA

Website: https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset

## Background
The Centers for Disease Control and Prevention has identified high blood pressure, high blood cholesterol, and smoking as three key risk factors for heart disease. Roughly half of Americans have at least one of these three risk factors. The National Heart, Lung, and Blood Institute highlights a wider array of factors such as Age, Environment and Occupation, Family History and Genetics, Lifestyle Habits, Other Medical Conditions, Race or Ethnicity, and Sex for clinicians to use in diagnosing coronary heart disease. Diagnosis tends to be driven by an initial survey of these common risk factors followed by bloodwork and other tests.

## Data Set Info:

The data has already been cleaned. The notebook used for cleaning is available at:
https://www.kaggle.com/code/alexteboul/heart-disease-health-indicators-dataset-notebook/notebook

I downloaded a CSV of the dataset available on Kaggle for the year 2015. The original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions asked directly of participants or variables calculated from individual participant responses.

The cleaned dataset contains 253,680 survey responses from BRFSS 2015, to be used primarily for the binary classification of heart disease. Note that there is a strong class imbalance in this dataset: 229,787 respondents do not have and have not had heart disease, while 23,893 have had heart disease.
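
To put the imbalance in numbers, here is a small sketch that uses only the two class counts quoted above. The variable names are illustrative and not part of the original notebook.

```python
# Class counts as stated in the text above.
no_heart_disease = 229_787
heart_disease = 23_893
total = no_heart_disease + heart_disease

# Share of respondents in the positive class.
positive_rate = heart_disease / total

# Number of negative respondents for every positive one.
imbalance_ratio = no_heart_disease / heart_disease

# Accuracy of a trivial model that always predicts "no heart disease".
majority_baseline_accuracy = no_heart_disease / total

print(f"total responses: {total}")
print(f"positive class share: {positive_rate:.2%}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

The last figure is a reminder that plain accuracy can look high on this data even for a model that never predicts heart disease, so metrics such as recall or precision are likely to be more informative.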


## Other Data Sets from Other Years
https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system 

In [4]:
import pandas as pd
pd.set_option('display.max_columns', 35)

df = pd.read_csv("C:/Users/duvallar/OneDrive/1.RoseHulman/3.Junior/Winter/MA384/Data/heart_disease_health_indicators_BRFSS2015.csv")
df

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,0.0,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,0.0,1.0,1.0,1.0,18.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,0.0,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0


In [5]:
df.shape

(253680, 22)

In [7]:
df.dtypes

HeartDiseaseorAttack    float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
Diabetes                float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object
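
Every column is stored as `float64`, although the values shown so far are whole-number codes. One optional way to reduce memory is to convert to a compact integer type. The sketch below uses a made-up toy frame (`toy`) rather than the real `df`, and only converts after checking that no value has a fractional part.

```python
import pandas as pd

# Toy frame standing in for df: column names come from the dataset,
# values are made up for illustration.
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0],
    "BMI": [40.0, 25.0, 28.0],
    "Age": [9.0, 7.0, 11.0],
})

# Only convert if every value is a whole number, so nothing is truncated.
all_whole = bool((toy % 1 == 0).all().all())

if all_whole:
    # Cast to int64 first, then let pandas pick the smallest integer dtype
    # that can hold each column's values.
    toy_int = toy.astype("int64").apply(pd.to_numeric, downcast="integer")
else:
    toy_int = toy.copy()

print(toy_int.dtypes)
```

Whether this is worthwhile for the full dataset depends on memory constraints. Some libraries convert back to floats internally anyway.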

In [8]:
# Percentage of missing values per column, relative to the number of rows.
nan_count_per_column = df.isna().sum().sort_values(ascending=False)
print("PERCENT NaN:", '\n')
print(100 * (nan_count_per_column / len(df)))

PERCENT NaN: 

HeartDiseaseorAttack    0.0
HighBP                  0.0
Education               0.0
Age                     0.0
Sex                     0.0
DiffWalk                0.0
PhysHlth                0.0
MentHlth                0.0
GenHlth                 0.0
NoDocbcCost             0.0
AnyHealthcare           0.0
HvyAlcoholConsump       0.0
Veggies                 0.0
Fruits                  0.0
PhysActivity            0.0
Diabetes                0.0
Stroke                  0.0
Smoker                  0.0
BMI                     0.0
CholCheck               0.0
HighChol                0.0
Income                  0.0
dtype: float64
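
An equivalent way to compute the missing-value percentage is to take the mean of the boolean mask from `isna()`, which divides by the row count automatically. The sketch below uses a made-up toy frame with one missing value so that the result is not all zeros. It is an illustration, not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in HighBP (values are made up).
toy = pd.DataFrame({
    "HighBP": [1.0, np.nan, 0.0, 1.0],
    "BMI": [40.0, 25.0, 28.0, 27.0],
})

# The mean of a boolean mask is the fraction of True values,
# so the denominator is always the number of rows in the frame.
pct_nan = toy.isna().mean().mul(100).sort_values(ascending=False)
print(pct_nan)
```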


In [9]:
duplicate_count = df.duplicated().sum()
total_rows = len(df)
percentage_duplicates = (duplicate_count / total_rows) * 100
print("Percent Duplicates: ", percentage_duplicates.round(2), "  Duplicate Count", duplicate_count, "    Total Rows", total_rows )

Percent Duplicates:  9.42   Duplicate Count 23899     Total Rows 253680


In [11]:
df[df.duplicated(keep=False)].sort_values(by = list(df.columns))

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
4517,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,8.0,6.0,8.0
207307,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,8.0,6.0,8.0
42369,0.0,0.0,0.0,0.0,18.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,10.0,6.0,8.0
108949,0.0,0.0,0.0,0.0,18.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,10.0,6.0,8.0
17475,0.0,0.0,0.0,0.0,19.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,6.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124839,1.0,1.0,1.0,1.0,34.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,11.0,4.0,5.0
7123,1.0,1.0,1.0,1.0,34.0,1.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,10.0,5.0,6.0
231730,1.0,1.0,1.0,1.0,34.0,1.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,10.0,5.0,6.0
5545,1.0,1.0,1.0,1.0,34.0,1.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,11.0,5.0,7.0
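
The two calls used above count different things: `duplicated()` flags only the repeats after the first occurrence, while `duplicated(keep=False)` flags every member of a duplicate group. The sketch below shows the difference on a made-up toy frame, along with `drop_duplicates()`. Whether duplicates should be dropped here is a judgment call: with mostly categorical survey answers, identical rows may well come from different respondents.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical, the rest are unique (made-up values).
toy = pd.DataFrame({
    "HeartDiseaseorAttack": [0.0, 0.0, 1.0, 1.0, 0.0],
    "BMI": [17.0, 17.0, 34.0, 34.0, 25.0],
    "Age": [8.0, 8.0, 10.0, 11.0, 7.0],
})

# Counts only the repeats (the first occurrence is not flagged).
n_repeats = int(toy.duplicated().sum())

# Counts every row that belongs to a duplicate group.
n_in_groups = int(toy.duplicated(keep=False).sum())

# Keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, n_in_groups, len(deduped))
```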


In [12]:
for col in df.columns:
    print(df[col].value_counts().sort_values())
    print()

HeartDiseaseorAttack
1.0     23893
0.0    229787
Name: count, dtype: int64

HighBP
1.0    108829
0.0    144851
Name: count, dtype: int64

HighChol
1.0    107591
0.0    146089
Name: count, dtype: int64

CholCheck
0.0      9470
1.0    244210
Name: count, dtype: int64

BMI
78.0        1
96.0        1
85.0        1
90.0        1
86.0        1
        ...  
28.0    16545
25.0    17146
24.0    19550
26.0    20562
27.0    24606
Name: count, Length: 84, dtype: int64

Smoker
1.0    112423
0.0    141257
Name: count, dtype: int64

Stroke
1.0     10292
0.0    243388
Name: count, dtype: int64

Diabetes
1.0      4631
2.0     35346
0.0    213703
Name: count, dtype: int64

PhysActivity
0.0     61760
1.0    191920
Name: count, dtype: int64

Fruits
0.0     92782
1.0    160898
Name: count, dtype: int64

Veggies
0.0     47839
1.0    205841
Name: count, dtype: int64

HvyAlcoholConsump
1.0     14256
0.0    239424
Name: count, dtype: int64

AnyHealthcare
0.0     12417
1.0    241263
Name: count, dtype: int64


In [17]:
''' 
Binary Features:
        HeartDiseaseorAttack 
        HighBP 
        HighChol
        CholCheck (in past 5 years) 
        Smoker
        Stroke
        PhysActivity
        Fruits 
        Veggies
        HvyAlcoholConsump
        AnyHealthcare
        NoDocbcCost
        DiffWalk
        Sex (Male is 1) 

Many Differing Values:
        BMI
        MentHlth
        PhysHlth
        Age


Few Value Types:
        Diabetes
        GenHlth
        Education
        Income

'''

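
The grouping above was made by reading the `value_counts()` output. A rough way to get a similar grouping programmatically is to bucket columns by their number of distinct values. In the sketch below, the helper `group_by_cardinality` and its `few_max` cutoff are hypothetical, and the toy frame holds made-up values. The cutoff would need to be chosen against the real column counts.

```python
import pandas as pd

# Toy frame standing in for df (made-up values).
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "Diabetes": [0.0, 2.0, 1.0, 0.0, 0.0, 2.0],
    "BMI": [40.0, 25.0, 28.0, 27.0, 24.0, 45.0],
})


def group_by_cardinality(frame, few_max):
    """Bucket columns by number of distinct values.

    few_max is a hypothetical cutoff: columns with exactly two distinct
    values are 'binary', other columns with at most few_max values are
    'few', and the rest are 'many'.
    """
    groups = {"binary": [], "few": [], "many": []}
    for col, n_unique in frame.nunique().items():
        if n_unique == 2:
            groups["binary"].append(col)
        elif n_unique <= few_max:
            groups["few"].append(col)
        else:
            groups["many"].append(col)
    return groups


groups = group_by_cardinality(toy, few_max=4)
print(groups)
```

A count-based rule does not know what the values mean, so a column such as `Age` (an ordered category) may still need to be placed by hand.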

### Selected Subset of Features from BRFSS 2015
Given these risk factors, I selected features (columns/questions) in the BRFSS that relate to them. To understand what the columns mean, I consulted the BRFSS 2015 Codebook, which lists each question along with information about it, and matched the variable names in the codebook to the variable names in the dataset I downloaded from Kaggle. I also referenced some of the features chosen by Zidian Xie et al. in the research paper *Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques*, which uses the 2014 BRFSS. Diabetes and heart disease outcomes are strongly correlated, and heart disease complications are the primary cause of death for diabetics, so that paper's feature set is a useful starting point.

**BRFSS 2015 Codebook:** https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

**Relevant Research Paper using BRFSS for Diabetes ML:** https://www.cdc.gov/pcd/issues/2019/19_0109.htm


The **selected features** from the BRFSS 2015 dataset are:

**Response Variable / Dependent Variable:**
*   Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) --> _MICHD


**Independent Variables:**

**High Blood Pressure**
*   Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> _RFHYPE5

**High Cholesterol**
*   Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> TOLDHI2
*   Cholesterol check within past five years --> _CHOLCHK

**BMI**
*   Body Mass Index (BMI) --> _BMI5

**Smoking**
*   Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] --> SMOKE100

**Other Chronic Health Conditions**
*   (Ever told) you had a stroke. --> CVDSTRK3
*   (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> DIABETE3

**Physical Activity**
*   Adults who reported doing physical activity or exercise during the past 30 days other than their regular job --> _TOTINDA

**Diet**
*   Consume Fruit 1 or more times per day --> _FRTLT1
*   Consume Vegetables 1 or more times per day --> _VEGLT1

**Alcohol Consumption**
*   Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) --> _RFDRHV5

**Health Care**
*   Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?  --> HLTHPLN1
*   Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? --> MEDCOST

**General Health and Mental Health**
*   Would you say that in general your health is: --> GENHLTH
*   Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? --> MENTHLTH
*   Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? --> PHYSHLTH
*   Do you have serious difficulty walking or climbing stairs? --> DIFFWALK

**Demographics**
*   Indicate sex of respondent. --> SEX
*   Fourteen-level age category --> _AGEG5YR
*   What is the highest grade or year of school you completed? --> EDUCA
*   Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") --> INCOME2
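
The list above can be written as a lookup table from codebook variable names to the column names in the Kaggle CSV. The pairing below is an assumption: it is inferred by lining up the order of the list with the order of the columns in the CSV, and it has not been checked against the cleaning notebook. The `raw` frame is a made-up stand-in for the original BRFSS data.

```python
import pandas as pd

# Assumed pairing of BRFSS 2015 codebook names to Kaggle column names,
# inferred from the order of the feature list and the CSV columns.
CODEBOOK_TO_COLUMN = {
    "_MICHD": "HeartDiseaseorAttack",
    "_RFHYPE5": "HighBP",
    "TOLDHI2": "HighChol",
    "_CHOLCHK": "CholCheck",
    "_BMI5": "BMI",
    "SMOKE100": "Smoker",
    "CVDSTRK3": "Stroke",
    "DIABETE3": "Diabetes",
    "_TOTINDA": "PhysActivity",
    "_FRTLT1": "Fruits",
    "_VEGLT1": "Veggies",
    "_RFDRHV5": "HvyAlcoholConsump",
    "HLTHPLN1": "AnyHealthcare",
    "MEDCOST": "NoDocbcCost",
    "GENHLTH": "GenHlth",
    "MENTHLTH": "MentHlth",
    "PHYSHLTH": "PhysHlth",
    "DIFFWALK": "DiffWalk",
    "SEX": "Sex",
    "_AGEG5YR": "Age",
    "EDUCA": "Education",
    "INCOME2": "Income",
}

# Made-up stand-in for the raw BRFSS frame, with only three of the columns.
raw = pd.DataFrame({
    "_MICHD": [2.0, 1.0],
    "_RFHYPE5": [1.0, 2.0],
    "SEX": [1.0, 2.0],
})

# Keep only the mapped columns that are present, then rename them.
present = [name for name in CODEBOOK_TO_COLUMN if name in raw.columns]
renamed = raw[present].rename(columns=CODEBOOK_TO_COLUMN)
print(list(renamed.columns))
```

This sketch only renames columns. The cleaned dataset also uses different value codes from the raw survey, and that recoding is not reproduced here.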