In [2]:
#import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Data

---

### Key Indicators of Heart Disease

#### 2020 annual CDC survey data of 400k adults related to their health status

[Data Set Link](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)

[Original CDC Data Set Link](https://www.cdc.gov/brfss/annual_data/annual_data.htm)

#### Who Collects the Data

The dataset comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents.

#### Data Collection Process

The BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. The most recent dataset (as of February 15, 2022) includes data from 2020 consisting of **401,958 rows and 279 columns**. The vast majority of columns are questions asked to respondents about their health status, such as:

- "Do you have serious difficulty walking or climbing stairs?"
- "Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]"

#### Data Manipulation

The author of the Kaggle dataset noticed that many factors (questions) in the survey directly or indirectly influence heart disease, so they selected the most relevant variables and cleaned them so the data would be usable for machine learning projects. **The original dataset of nearly 300 variables was reduced to about 20 variables.**

## Data Assumptions

---

#### Author Manipulation

As described in the 'Data Manipulation' section above, the author removed roughly 260 of the original 279 variables on the grounds that they were unnecessary or uninformative. They gave no explanation of how this was determined. Whether they used PCA, Random Forest exploration, or some other technique, we don't know how or why these features were chosen over the others.
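Since the selection criteria are undocumented, one crude way to sanity-check whether a retained variable carries signal is to compare the heart disease rate across its groups. The sketch below is only an illustration: the `toy` rows are synthetic and not drawn from the survey, and this is not the dataset author's method.

```python
import pandas as pd

# Synthetic toy rows for illustration only; not real survey data.
toy = pd.DataFrame({
    "HeartDisease": ["Yes", "No", "No", "Yes", "No", "No"],
    "Smoking":      ["Yes", "Yes", "No", "Yes", "No", "No"],
    "Asthma":       ["No", "Yes", "No", "Yes", "No", "Yes"],
})

# Boolean target: True where the respondent reports heart disease
target = toy["HeartDisease"].eq("Yes")

# For each candidate variable, compare the heart disease rate
# between its "Yes" group and its "No" group.
gaps = {}
for col in ["Smoking", "Asthma"]:
    rates = target.groupby(toy[col]).mean()
    gaps[col] = rates["Yes"] - rates["No"]

# Larger gaps suggest a stronger (unadjusted) association with the target
ranking = pd.Series(gaps).sort_values(ascending=False)
print(ranking)
```

A rate gap like this ignores confounding and interactions, so it is a screening heuristic at best, not a substitute for a documented feature selection procedure.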

#### Survey User Bias

##### Health Bias

The BRFSS [website](https://www.cdc.gov/brfss/about/index.htm) does not appear to disclose how the roughly 400k annual respondents are chosen. People who are more conscious of their health may be more likely to complete health-based surveys, since doing so can make them feel good about themselves. On the other hand, people with heart disease or other physical health issues may be less inclined to complete a survey about parts of their lives they would rather not focus on.

##### Communication Bias

Looking at their [documentation guide](https://www.cdc.gov/brfss/data_documentation/pdf/UserguideJune2013.pdf), we see that the interviews are done over landline phones. With the surge in spam calls and the general public's growing reluctance to answer unknown numbers, older generations may be overrepresented: they are more willing to pick up and answer, while younger demographics tend to avoid phone calls from unknown numbers because of the risks they pose.
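One way to probe the age skew suggested above is to look at the share of respondents in each bracket of the dataset's `AgeCategory` column. The sketch below uses a handful of stand-in rows rather than the full data, and the 65+ cutoff for "older" is an assumption made for illustration.

```python
import pandas as pd

# Stand-in rows for illustration; with the real data, use the
# DataFrame loaded from heart_2020_cleaned.csv instead.
sample = pd.DataFrame(
    {"AgeCategory": ["55-59", "80 or older", "65-69", "75-79", "40-44"]}
)

# Share of respondents in each age bracket
age_share = sample["AgeCategory"].value_counts(normalize=True).sort_index()

# Brackets whose lower bound is 65 or above count as "older"
# (the threshold is an assumption, not part of the dataset).
older_labels = [label for label in age_share.index if int(label[:2]) >= 65]
older_share = age_share.loc[older_labels].sum()

print(age_share)
print(f"Share aged 65+: {older_share:.0%}")
```

Comparing these shares against census age distributions would indicate whether older respondents are in fact overrepresented.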


In [6]:
# Load the cleaned Kaggle dataset into a DataFrame
df = pd.read_csv("heart_2020_cleaned.csv")
df.head()

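| Unnamed: 0 | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |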
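
The `Unnamed: 0` column in the output above is what pandas produces when a CSV's first column has an empty header, which usually means a row index was saved along with the data. A minimal sketch of reading such a file so that the column becomes the index instead, using a small inline excerpt (an assumption standing in for `heart_2020_cleaned.csv`):

```python
import io
import pandas as pd

# Inline excerpt with an empty first header, standing in for the real file
csv_text = ",HeartDisease,BMI,Smoking\n0,No,16.6,Yes\n1,No,20.34,No\n"

# Default read: the unnamed first column shows up as "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

With the index handled this way, only the survey variables remain as columns.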
