# Heart Disease Health Indicators Dataset Notebook

## Purpose
The purpose of this code notebook is to clean BRFSS data into a usable format for machine learning algorithms. 
The dataset originally has 330 features (columns), but based on research into the factors that influence heart disease and other chronic health conditions, only selected features are included in this analysis.

## Link to Dataset Output
[Heart Disease Health Indicators Dataset](https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset)

**253,680 survey responses from cleaned BRFSS 2015 - binary classification**

#### Important Risk Factors
Research in the field has identified the following as **important risk factors** for heart disease and other chronic illnesses like diabetes (not in strict order of importance):

*   blood pressure (high)
*   cholesterol (high)
*   smoking
*   diabetes
*   obesity
*   age
*   sex
*   race
*   diet
*   exercise
*   alcohol consumption
*   BMI
*   Household Income
*   Marital Status
*   Sleep
*   Time since last checkup
*   Education
*   Health care coverage
*   Mental Health

### Selected Subset of Features from BRFSS 2015
Given these risk factors, I selected features (columns/questions) in the BRFSS related to them. To understand what the columns mean, I consulted the BRFSS 2015 Codebook, which lists each question along with information about it, and matched the variable names in the codebook to the variable names in the dataset I downloaded from Kaggle. I also referenced some of the same features chosen by Zidian Xie et al. in the research paper *Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques*, which uses the 2014 BRFSS. Diabetes and heart disease outcomes are strongly correlated, and heart disease complications are the primary cause of death for diabetics, so that paper is a useful starting point.

**BRFSS 2015 Codebook:** https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

**Relevant Research Paper using BRFSS for Diabetes ML:** https://www.cdc.gov/pcd/issues/2019/19_0109.htm


The **selected features** from the BRFSS 2015 dataset are:

**Response Variable / Dependent Variable:**
*   Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) --> _MICHD


**Independent Variables:**

**High Blood Pressure**
*   Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> _RFHYPE5

**High Cholesterol**
*   Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> TOLDHI2
*   Cholesterol check within past five years --> _CHOLCHK

**BMI**
*   Body Mass Index (BMI) --> _BMI5

**Smoking**
*   Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] --> SMOKE100

**Other Chronic Health Conditions**
*   (Ever told) you had a stroke. --> CVDSTRK3
*   (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> DIABETE3

**Physical Activity**
*   Adults who reported doing physical activity or exercise during the past 30 days other than their regular job --> _TOTINDA

**Diet**
*   Consume Fruit 1 or more times per day --> _FRTLT1
*   Consume Vegetables 1 or more times per day --> _VEGLT1

**Alcohol Consumption**
*   Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) --> _RFDRHV5

**Health Care**
*   Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?  --> HLTHPLN1
*   Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? --> MEDCOST

**Health General and Mental Health**
*   Would you say that in general your health is: --> GENHLTH
*   Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? --> MENTHLTH
*   Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? --> PHYSHLTH
*   Do you have serious difficulty walking or climbing stairs? --> DIFFWALK

**Demographics**
*   Indicate sex of respondent. --> SEX
*   Fourteen-level age category --> _AGEG5YR
*   What is the highest grade or year of school you completed? --> EDUCA
*   Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") --> INCOME2

## 1. Get the data

In [4]:
#imports
import os
import pandas as pd
import random


In [7]:
file_path = "C:/Users/duvallar/OneDrive/1.RoseHulman/3.Junior/Winter/MA384/Data/LLCP2016.XPT"
brfss_dataset_2016 = pd.read_sas(file_path)
brfss_dataset_2016

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_MAM5021,_RFPAP33,_RFPSA21,_RFBLDS3,_COL10YR,_HFOB3YR,_FS5YR,_FOBTFS,_CRCREC,_AIDTST3
0,1.0,1.0,b'01072016',b'01',b'07',b'2016',1100.0,b'2016000001',2.016000e+09,1.0,...,,,2.0,,,,,,,1.0
1,1.0,1.0,b'01112016',b'01',b'11',b'2016',1100.0,b'2016000002',2.016000e+09,1.0,...,1.0,,,1.0,1.0,1.0,,,1.0,2.0
2,1.0,1.0,b'01062016',b'01',b'06',b'2016',1100.0,b'2016000003',2.016000e+09,1.0,...,,,,,,,,,,2.0
3,1.0,1.0,b'01082016',b'01',b'08',b'2016',1100.0,b'2016000004',2.016000e+09,1.0,...,,,1.0,2.0,1.0,2.0,,2.0,1.0,9.0
4,1.0,1.0,b'01052016',b'01',b'05',b'2016',1100.0,b'2016000005',2.016000e+09,1.0,...,,,,,,,,,,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
486298,78.0,12.0,b'12312016',b'12',b'31',b'2016',1200.0,b'2016001262',2.016001e+09,,...,,,,,,,,,,
486299,78.0,12.0,b'12192016',b'12',b'19',b'2016',1100.0,b'2016001263',2.016001e+09,,...,,,2.0,,,,,,,2.0
486300,78.0,12.0,b'12092016',b'12',b'09',b'2016',1100.0,b'2016001264',2.016001e+09,,...,2.0,1.0,,2.0,2.0,2.0,2.0,2.0,2.0,1.0
486301,78.0,12.0,b'12312016',b'12',b'31',b'2016',1200.0,b'2016001265',2.016001e+09,,...,2.0,1.0,,2.0,,2.0,,2.0,,


In [8]:
#How many rows and columns
brfss_dataset_2016.shape

(486303, 275)
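The output above shows character columns such as `IDATE` coming back as byte strings (`b'01072016'`). A minimal sketch of one way to decode them follows. It runs on a toy frame rather than the real file, and the helper name `decode_bytes_columns` is hypothetical. `pd.read_sas` also accepts an `encoding` argument, which may make this step unnecessary.

```python
import pandas as pd

# Toy frame imitating the bytes values that pd.read_sas returns for character columns.
raw = pd.DataFrame({
    "IDATE": [b"01072016", b"01112016"],
    "IYEAR": [b"2016", b"2016"],
    "_STATE": [1.0, 1.0],
})

def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy in which every bytes value in object columns is decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out

decoded = decode_bytes_columns(raw)
print(decoded)
```

Numeric columns are left untouched, and the original frame is not modified.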

In [None]:
#check that the data loaded in is in the correct format
pd.set_option('display.max_columns', 500)
# brfss_dataset.head()
brfss_dataset.sort_index(axis=1)

In [9]:
desired_set = ['_MICHD',
               '_RFHYPE5',
               'TOLDHI2', '_CHOLCHK',
               '_BMI5',
               'SMOKE100',
               'CVDSTRK3', 'DIABETE3',
               '_TOTINDA',
               '_FRTLT1', '_VEGLT1',
               '_RFDRHV5',
               'HLTHPLN1', 'MEDCOST',
               'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK',
               'SEX', '_AGEG5YR', 'EDUCA', 'INCOME2']

In [14]:
columns_list = brfss_dataset.columns.tolist()
year = 2013
intersection_2013 = set(columns_list).intersection(desired_set)
print(year, "Missing: ", set(desired_set) - intersection_2013)
print(year, "Has:", intersection_2013)

2013 Missing:  {'_RFHYPE5', '_CHOLCHK', 'TOLDHI2', '_FRTLT1', '_VEGLT1'}
2013 Has: {'DIFFWALK', 'CVDSTRK3', 'GENHLTH', '_AGEG5YR', '_BMI5', 'INCOME2', '_RFDRHV5', 'SEX', '_TOTINDA', 'PHYSHLTH', 'EDUCA', 'MEDCOST', 'HLTHPLN1', 'DIABETE3', 'MENTHLTH', 'SMOKE100', '_MICHD'}
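The same availability check can be repeated for several survey years. The sketch below wraps it in a small function; `report_columns` and the toy frames are hypothetical stand-ins, and real use would pass the loaded BRFSS frames instead.

```python
import pandas as pd

def report_columns(df, wanted, label):
    """Print and return the wanted columns that are present and missing in df."""
    present = sorted(set(wanted) & set(df.columns))
    missing = sorted(set(wanted) - set(df.columns))
    print(f"{label} missing: {missing}")
    print(f"{label} has: {present}")
    return present, missing

# Hypothetical stand-in frames with only column names, no rows.
toy_frames = {
    "year_a": pd.DataFrame(columns=["_MICHD", "_BMI5", "SEX"]),
    "year_b": pd.DataFrame(columns=["_MICHD", "_RFHYPE5", "_BMI5", "SEX"]),
}
wanted = ["_MICHD", "_RFHYPE5", "_BMI5", "SEX"]
results = {label: report_columns(df, wanted, label) for label, df in toy_frames.items()}
```

Sorting the names keeps the printed output stable between runs, which plain `set` printing does not guarantee.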


In [None]:
# brfss_dataset._RFHYPE5.value_counts().sort_index(ascending=True)

## 3. Make feature names more readable

In [None]:
#Rename the columns to make them more readable
brfss_df_selected = brfss_dataset.rename(columns = {'_MICHD':'HeartDiseaseorAttack',
                                         '_RFHYPE5':'HighBP',
                                         'TOLDHI2':'HighChol',
                                         '_CHOLCHK':'CholCheck',
                                         '_BMI5':'BMI',
                                         'SMOKE100':'Smoker',
                                         'CVDSTRK3':'Stroke',
                                         'DIABETE3':'Diabetes',
                                         '_TOTINDA':'PhysActivity',
                                         '_FRTLT1':'Fruits',
                                         '_VEGLT1':'Veggies',
                                         '_RFDRHV5':'HvyAlcoholConsump',
                                         'HLTHPLN1':'AnyHealthcare',
                                         'MEDCOST':'NoDocbcCost',
                                         'GENHLTH':'GenHlth',
                                         'MENTHLTH':'MentHlth',
                                         'PHYSHLTH':'PhysHlth',
                                         'DIFFWALK':'DiffWalk',
                                         'SEX':'Sex',
                                         '_AGEG5YR':'Age',
                                         'EDUCA':'Education',
                                         'INCOME2':'Income' })
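By default `rename` silently skips mapping keys that are not columns of the frame, which matters when a survey year lacks some variables. The sketch below, on a toy frame with a shortened mapping, shows how `errors="raise"` surfaces that case. The toy data and the `missing_detected` flag are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"_MICHD": [1.0], "_BMI5": [2500.0]})
readable = {"_MICHD": "HeartDiseaseorAttack", "_BMI5": "BMI", "_RFHYPE5": "HighBP"}

# Default behaviour: the missing '_RFHYPE5' key is ignored without any message.
lenient = toy.rename(columns=readable)
print(lenient.columns.tolist())

# Strict behaviour: a KeyError is raised because '_RFHYPE5' is not a column.
try:
    toy.rename(columns=readable, errors="raise")
    missing_detected = False
except KeyError:
    missing_detected = True
print("missing column detected:", missing_detected)
```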

In [None]:
brfss_df_selected.head()

In [None]:
brfss_df_selected.shape

In [None]:
# select specific columns
brfss_df_selected = brfss_2015_dataset[['_MICHD', 
                                         '_RFHYPE5',  
                                         'TOLDHI2', '_CHOLCHK', 
                                         '_BMI5', 
                                         'SMOKE100', 
                                         'CVDSTRK3', 'DIABETE3', 
                                         '_TOTINDA', 
                                         '_FRTLT1', '_VEGLT1', 
                                         '_RFDRHV5', 
                                         'HLTHPLN1', 'MEDCOST', 
                                         'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', 
                                         'SEX', '_AGEG5YR', 'EDUCA', 'INCOME2' ]]

In [None]:
brfss_df_selected.shape

In [None]:
brfss_df_selected.head()

## 2. Clean the data

### 2.1 Drop missing values

In [None]:
#Drop missing values - knocks 100,000 rows out right away
original_row_count = brfss_df_selected.shape[0]
brfss_df_selected = brfss_df_selected.dropna()
# Note: droped_row_count holds the number of rows REMAINING after dropna (name kept because later cells use it)
droped_row_count = brfss_df_selected.shape[0]
print(brfss_df_selected.shape, "Percentage Dropped:  ", ((original_row_count - droped_row_count)/original_row_count) * 100)
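Before dropping rows, it can help to see which columns contribute most of the missing values. This is a minimal sketch on a made-up frame; the values are not taken from the BRFSS data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "_MICHD": [1.0, 2.0, 2.0, np.nan],
    "TOLDHI2": [1.0, np.nan, 2.0, np.nan],
    "SEX": [1.0, 2.0, 2.0, 1.0],
})

# Count missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Compare row counts before and after dropping any row with a missing value.
rows_before = len(toy)
rows_after = len(toy.dropna())
percent_dropped = 100 * (rows_before - rows_after) / rows_before

print(missing_per_column)
print(f"dropna removes {rows_before - rows_after} of {rows_before} rows ({percent_dropped:.1f}%)")
```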


### 2.2 Modify and clean the values to be more suitable to ML algorithms
In order to do this part, I referenced the codebook which says what each column/feature/question is: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

In [None]:
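def percentDropped(df):
    # Percentage of rows removed since the dropna step
    # (droped_row_count is the row count right after dropna)
    percent = 100 * ((droped_row_count - df.shape[0]) / droped_row_count)
    return round(percent, 2)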

In [None]:
# _MICHD
#Change 2 to 0 because this means did not have MI or CHD
brfss_df_selected['_MICHD'] = brfss_df_selected['_MICHD'].replace({2: 0})
print(brfss_df_selected._MICHD.unique())


In [None]:
#1 _RFHYPE5
#Change 1 to 0 so it represents no high blood pressure and 2 to 1 so it represents high blood pressure
#Remove all 9 (don't know/refused/missing)
brfss_df_selected['_RFHYPE5'] = brfss_df_selected['_RFHYPE5'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFHYPE5 != 9]
print(brfss_df_selected._RFHYPE5.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#2 TOLDHI2
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
brfss_df_selected['TOLDHI2'] = brfss_df_selected['TOLDHI2'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != 9]
print(brfss_df_selected.TOLDHI2.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])
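Most of the cleaning cells in this section follow one pattern: recode some values, then drop the "don't know" and "refused" codes. A minimal sketch of a helper that captures the pattern is below. The name `recode_and_drop` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def recode_and_drop(df, column, recode, drop_codes):
    """Drop rows whose `column` holds a sentinel code, then recode the remaining values."""
    kept = df[~df[column].isin(drop_codes)].copy()
    kept[column] = kept[column].replace(recode)
    return kept

toy = pd.DataFrame({
    "TOLDHI2": [1.0, 2.0, 7.0, 9.0, 2.0],
    "SEX": [1.0, 2.0, 1.0, 2.0, 1.0],
})

# 2 (No) becomes 0; rows coded 7 (don't know) or 9 (refused) are removed.
cleaned = recode_and_drop(toy, "TOLDHI2", recode={2: 0}, drop_codes=[7, 9])
print(cleaned)
```

Dropping before recoding avoids accidental clashes when a recoded value happens to equal a sentinel code, and the `.copy()` keeps the assignment from touching a view of the original frame.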


In [None]:
#3 _CHOLCHK
# Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years
# Remove 9
brfss_df_selected['_CHOLCHK'] = brfss_df_selected['_CHOLCHK'].replace({3:0,2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._CHOLCHK != 9]
print(brfss_df_selected._CHOLCHK.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#4 _BMI5 (stored as BMI * 100, so for example 4018 is really 40.18; divide by 100 and round to the nearest whole number)
brfss_df_selected['_BMI5'] = brfss_df_selected['_BMI5'].div(100).round(0)
brfss_df_selected._BMI5.unique()
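To make the scaling concrete, here is a small worked example of the same `div(100).round(0)` step on made-up raw values (the numbers are illustrative, not taken from the survey file).

```python
import pandas as pd

# Raw _BMI5 values are BMI with two implied decimal places.
bmi_raw = pd.Series([4018.0, 2509.0, 1875.0, 3349.0], name="_BMI5")

# 4018 -> 40.18 -> 40.0, 2509 -> 25.09 -> 25.0, and so on.
bmi = bmi_raw.div(100).round(0)
print(bmi.tolist())
```

Rounding to whole numbers discards the fractional part of the BMI; keeping `bmi_raw.div(100)` unrounded would be an alternative if that precision matters for a model.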

In [None]:
#5 SMOKE100
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
brfss_df_selected['SMOKE100'] = brfss_df_selected['SMOKE100'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 9]
print(brfss_df_selected.SMOKE100.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#6 CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
brfss_df_selected['CVDSTRK3'] = brfss_df_selected['CVDSTRK3'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 9]
print(brfss_df_selected.CVDSTRK3.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#7 DIABETE3
# Make this ordinal: 0 is for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes, 2 is for yes diabetes
# Remove all 7 (don't know)
# Remove all 9 (refused)
brfss_df_selected['DIABETE3'] = brfss_df_selected['DIABETE3'].replace({2:0, 3:0, 1:2, 4:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE3 != 9]
print(brfss_df_selected.DIABETE3.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])
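The ordinal recoding of `DIABETE3` can also be written with `Series.map`, which looks every value up in the dictionary independently and turns any code that is not listed into `NaN`. This is an alternative sketch on toy values, not the method used in the cell above.

```python
import pandas as pd

# Codebook values: 1 yes, 2 yes but only during pregnancy, 3 no,
# 4 pre-diabetes or borderline, 7 don't know, 9 refused.
diabetes_raw = pd.Series([1.0, 2.0, 3.0, 4.0, 7.0, 9.0], name="DIABETE3")

# Target scale: 0 no diabetes (or only during pregnancy), 1 pre-diabetes, 2 diabetes.
ordinal_codes = {3.0: 0, 2.0: 0, 4.0: 1, 1.0: 2}

# Codes 7 and 9 are absent from the mapping, so they become NaN and dropna removes them.
diabetes = diabetes_raw.map(ordinal_codes).dropna()
print(diabetes.tolist())
```

With this variant the explicit `!= 7` and `!= 9` filters are replaced by a single `dropna`, at the cost of also discarding any other unexpected code without notice.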


In [None]:
#8 _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)
brfss_df_selected['_TOTINDA'] = brfss_df_selected['_TOTINDA'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._TOTINDA != 9]
print(brfss_df_selected._TOTINDA.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#9 _FRTLT1
# Change 2 to 0. 2 means fruit consumed less than one time per day. 1 means fruit consumed one or more times per day
# Remove all 9 (don't know, refused, or missing)
brfss_df_selected['_FRTLT1'] = brfss_df_selected['_FRTLT1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._FRTLT1 != 9]
print(brfss_df_selected._FRTLT1.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#10 _VEGLT1
# Change 2 to 0. 2 means vegetables consumed less than one time per day. 1 means vegetables consumed one or more times per day
# Remove all 9 (don't know, refused, or missing)
brfss_df_selected['_VEGLT1'] = brfss_df_selected['_VEGLT1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._VEGLT1 != 9]
print(brfss_df_selected._VEGLT1.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#11 _RFDRHV5
# Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking)
# Remove all 9 (don't know, refused, or missing)
brfss_df_selected['_RFDRHV5'] = brfss_df_selected['_RFDRHV5'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFDRHV5 != 9]
print(brfss_df_selected._RFDRHV5.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#12 HLTHPLN1
# 1 is yes, change 2 to 0 because it is No health care coverage
# remove 7 and 9 for don't know or refused
brfss_df_selected['HLTHPLN1'] = brfss_df_selected['HLTHPLN1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.HLTHPLN1 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.HLTHPLN1 != 9]
print(brfss_df_selected.HLTHPLN1.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#13 MEDCOST
# Change 2 to 0 for no, 1 is already yes
# remove 7 for don't know and 9 for refused
brfss_df_selected['MEDCOST'] = brfss_df_selected['MEDCOST'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST != 9]
print(brfss_df_selected.MEDCOST.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#14 GENHLTH
# This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 9]
print(brfss_df_selected.GENHLTH.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#15 MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 (don't know/not sure) and 99 (refused)
brfss_df_selected['MENTHLTH'] = brfss_df_selected['MENTHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 99]
print(brfss_df_selected.MENTHLTH.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#16 PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 (don't know/not sure) and 99 (refused)
brfss_df_selected['PHYSHLTH'] = brfss_df_selected['PHYSHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 99]
print(brfss_df_selected.PHYSHLTH.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])
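`MENTHLTH` and `PHYSHLTH` use the same coding (88 for none, 77 and 99 for don't know and refused), so both can be handled in one loop. The sketch below uses invented values and ends with a check that every remaining value lies in the expected 0 to 30 range.

```python
import pandas as pd

toy = pd.DataFrame({
    "MENTHLTH": [88.0, 5.0, 77.0, 30.0, 88.0],
    "PHYSHLTH": [2.0, 88.0, 10.0, 99.0, 88.0],
})

day_columns = ["MENTHLTH", "PHYSHLTH"]
for col in day_columns:
    # Remove don't know (77) and refused (99), then turn 88 (none) into 0 days.
    toy = toy[~toy[col].isin([77, 99])].copy()
    toy[col] = toy[col].replace({88: 0})

# Every remaining value should be a day count between 0 and 30.
in_range = bool(toy[day_columns].apply(lambda s: s.between(0, 30).all()).all())
print(toy)
print("all values within 0-30:", in_range)
```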


In [None]:
#17 DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 and 9 for don't know not sure and refused
brfss_df_selected['DIFFWALK'] = brfss_df_selected['DIFFWALK'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 9]
print(brfss_df_selected.DIFFWALK.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#18 SEX
# in other words - is respondent male (somewhat arbitrarily chose this change because men are at higher risk for heart disease)
# change 2 to 0 (female as 0). Male is 1
brfss_df_selected['SEX'] = brfss_df_selected['SEX'].replace({2:0})
brfss_df_selected.SEX.unique()

In [None]:
#19 _AGEG5YR
# already ordinal. 1 is 18-24 all the way up to 13, which is 80 and older. 5 year increments.
# remove 14 because it is don't know or missing
brfss_df_selected = brfss_df_selected[brfss_df_selected._AGEG5YR != 14]
print(brfss_df_selected._AGEG5YR.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#20 EDUCA
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6
# Remove 9 for refused:
brfss_df_selected = brfss_df_selected[brfss_df_selected.EDUCA != 9]
print(brfss_df_selected.EDUCA.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#21 INCOME2
# Variable is already ordinal with 1 being less than $10,000 all the way up to 8 being $75,000 or more
# Remove 77 and 99 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME2 != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME2 != 99]
print(brfss_df_selected.INCOME2.unique())
print(brfss_df_selected.shape, "    Dropped percent: " , percentDropped(brfss_df_selected), "    Dropped rows:   ", droped_row_count - brfss_df_selected.shape[0])


In [None]:
#Check the shape of the dataset now: We have 253,680 cleaned rows and 22 columns (1 of which is our dependent variable)
print(brfss_df_selected.shape,  "Percent Dropped From Start: ", round(((original_row_count - brfss_df_selected.shape[0]) / original_row_count) * 100, 2), " Original Row Count: ", original_row_count)
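Once all columns are recoded, a quick check that no unexpected codes remain can catch a missed filter. This is a minimal sketch: the helper `unexpected_codes`, the `expected_values` table (shown for three columns only), and the toy frame are all illustrative assumptions.

```python
import pandas as pd

# Allowed values after cleaning, for a few of the columns.
expected_values = {
    "_MICHD": {0, 1},
    "DIABETE3": {0, 1, 2},
    "GENHLTH": {1, 2, 3, 4, 5},
}

def unexpected_codes(df, allowed):
    """Map each column to the set of values found outside its allowed set."""
    problems = {}
    for col, ok in allowed.items():
        extra = set(df[col].dropna().tolist()) - ok
        if extra:
            problems[col] = extra
    return problems

# Toy frame in which one DIABETE3 row still carries the 'don't know' code 7.
toy = pd.DataFrame({
    "_MICHD": [0.0, 1.0, 1.0],
    "DIABETE3": [0.0, 2.0, 7.0],
    "GENHLTH": [1.0, 5.0, 3.0],
})
problems = unexpected_codes(toy, expected_values)
print(problems)
```

An empty dictionary would indicate that every checked column contains only its expected codes.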

In [None]:
#Let's see what the data looks like after Modifying Values
brfss_df_selected.head()

In [None]:
#Check class sizes of the heart disease column. Note the class imbalance!
class_sizes = brfss_df_selected.groupby(['_MICHD']).size()
print(class_sizes)
percent_yes = 100 * (class_sizes[1.0] / class_sizes.sum())
print("Percent Yes (1.0): ", round(percent_yes, 2))
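The class balance can also be read directly from `value_counts`, which returns counts, or proportions when `normalize=True` is passed. The toy target below is invented to show the calculation and does not reflect the real class ratio.

```python
import pandas as pd

# Toy target with nine 'no' responses and one 'yes' response.
target = pd.Series([0.0] * 9 + [1.0], name="_MICHD")

counts = target.value_counts().sort_index()
shares = target.value_counts(normalize=True).sort_index() * 100

print(counts)
print(shares.round(2))
```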

## 4. Save to csv
First save a version where heart disease is the target variable and in the first column, then save one where diabetes is the target variable and in the first column.

In [None]:
#************************************************************************************************
brfss_df_selected.to_csv("C:/Users/duvallar/OneDrive/1.RoseHulman/3.Junior/Winter/MA384/Data/indicators_data_clean/indicators_BRFSS2015.csv" , sep=",", index=False)
#************************************************************************************************
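Placing the target variable in the first column, as described for this section, can be done by reordering columns before writing. The sketch below is one possible approach on a toy frame. The helper `target_first` and the file names are hypothetical, and a temporary directory stands in for the real output folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

def target_first(df, target):
    """Return df with `target` moved to the first column; other columns keep their order."""
    return df[[target] + [c for c in df.columns if c != target]]

toy = pd.DataFrame({
    "SEX": [1.0, 0.0],
    "DIABETE3": [0.0, 2.0],
    "_MICHD": [1.0, 0.0],
})

written = {}
with tempfile.TemporaryDirectory() as tmp:
    for target in ["_MICHD", "DIABETE3"]:
        out_path = Path(tmp) / f"toy_{target}_first.csv"  # hypothetical file name
        target_first(toy, target).to_csv(out_path, sep=",", index=False)
        # Read the file back to confirm the column order that was written.
        written[target] = pd.read_csv(out_path).columns.tolist()

print(written)
```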