**Analyzing the effects of different risk factors on the presence of heart disease in patients**

_Ethan Coates_

Heart disease is an umbrella term for several heart conditions that put patients at risk of serious consequences such as heart attack, heart failure, and arrhythmia. It has well-known risk factors; the three major ones (high blood pressure, high cholesterol, and smoking) affect 47% of the American population. Data science techniques can offer further insight into how strongly the many risk factors of heart disease relate to each other and to the occurrence of heart disease.

We will analyze a dataset derived from the CDC's 2020 survey of approximately 400,000 American adults on their health and lifestyle, conducted as part of the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS surveys, conducted in the United States since 1984, question respondents by phone. The original survey dataset contained 279 columns, each representing a different question, with one row per respondent.

This public domain dataset maintains all 401,958 rows of the original but reduces the large number of questions to 18. One of these is whether the respondent reported heart disease; the rest cover lifestyle choices and health conditions that may affect whether the respondent suffers from heart disease. The dataset is available as a CSV (comma-separated values) table, so we'll begin by importing it into a dataframe using pandas, a Python library built for working efficiently with tabular datasets.

Sources: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease
https://www.cdc.gov/heartdisease/risk_factors.htm
https://www.cdc.gov/heartdisease/about.htm

In [124]:
import pandas as pd
import numpy as np

df = pd.read_csv("heart_2020_cleaned.csv")

df.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


The dataframe has named columns with various types of entries: yes/no questions, categorical questions such as race, and quantitative metrics such as BMI.

The dataset's description on Kaggle formally defines what the responses to each of these questions mean:

| HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) | Body Mass Index (BMI) | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) | (Ever told) (you had) a stroke? | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 was your physical health not good? | Thinking about your mental health, for how many days during the past 30 days was your mental health not good? | Do you have serious difficulty walking or climbing stairs? | Are you male or female? |



| AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fourteen-level age category | Imputed race/ethnicity value | (Ever told) (you had) diabetes? | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job | Would you say that in general your health is... | On average, how many hours of sleep do you get in a 24-hour period? | (Ever told) (you had) asthma? | Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? | (Ever told) (you had) skin cancer? |



In [125]:
print(df.dtypes)

HeartDisease         object
BMI                 float64
Smoking              object
AlcoholDrinking      object
Stroke               object
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime           float64
Asthma               object
KidneyDisease        object
SkinCancer           object
dtype: object


When importing the CSV table, pandas did not assign specific data types to the non-numeric columns; instead, it represents them with the generic object type. We can begin to form a properly typed dataframe by peeking into each column and deciding on the appropriate representation ourselves.
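One way to peek into the non-numeric columns is to list their distinct values. The following is a minimal sketch on a toy dataframe (the name `toy` and its values are illustrative assumptions, not the real survey data); the same pattern should carry over to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are illustrative only)
toy = pd.DataFrame({
    "HeartDisease": ["No", "No", "Yes", "No"],
    "BMI": [16.6, 20.34, 26.58, 24.21],
    "Sex": ["Female", "Female", "Male", "Female"],
    "GenHealth": ["Very good", "Very good", "Fair", "Good"],
})

# Select every column that is not numeric
non_numeric_cols = toy.select_dtypes(exclude="number").columns

# Map each non-numeric column to its sorted distinct values
distinct = {col: sorted(toy[col].unique()) for col in non_numeric_cols}

for col, values in distinct.items():
    print(col, len(values), values)
```

A column with only two distinct values is a candidate for a boolean, while a column with a handful of repeated labels is a candidate for a categorical type.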

In [126]:
# Columns that pandas read as floats but that may hold only whole numbers
float_cols = ["PhysicalHealth", "MentalHealth", "SleepTime"]

for name in float_cols:
    # Checking the distinct values is enough and far cheaper than checking every row
    if all(float(val).is_integer() for val in df[name].unique()):
        print(name + " contains only integer values.")
    else:
        print(name + " contains at least one non-int float value.")

# Downcast each column to the smallest integer type that fits its values
for name in float_cols:
    df[name] = pd.to_numeric(df[name], downcast="integer")

PhysicalHealth contains only integer values.
MentalHealth contains only integer values.
SleepTime contains only integer values.


The descriptions of PhysicalHealth and MentalHealth show that the responses are numbers of days; for SleepTime, the response is in hours. pandas read these columns as floating-point types, but checking each of them reveals that every value is a whole number (nothing but zeros after the decimal point). So, we can safely convert them to integers.
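The check-then-convert step can also be written with a vectorized test. This is a minimal sketch on toy data, assuming a hypothetical helper `is_whole`; it is an alternative formulation, not the code used above. It also shows that a column with fractional values (BMI here) is left as a float.

```python
import pandas as pd

# Toy columns; the values are illustrative, not the full survey data
toy = pd.DataFrame({
    "PhysicalHealth": [3.0, 0.0, 20.0, 28.0],
    "BMI": [16.6, 20.34, 26.58, 24.21],
})

def is_whole(series):
    """Return True if every value in a float series is a whole number."""
    return bool((series % 1 == 0).all())

whole = {col: is_whole(toy[col]) for col in toy.columns}

# Downcast only the columns that passed the check
for col, ok in whole.items():
    if ok:
        toy[col] = pd.to_numeric(toy[col], downcast="integer")

print(whole)
print(toy.dtypes)
```

With `downcast="integer"`, `pd.to_numeric` picks the smallest signed integer type that can hold the values, which is why small day and hour counts end up as int8.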

In [127]:
# Yes/no columns to convert to boolean
cn = ["HeartDisease", "Smoking", "AlcoholDrinking", "Stroke", "DiffWalking",
      "Diabetic", "PhysicalActivity", "Asthma", "KidneyDisease", "SkinCancer"]

for i, col_name in enumerate(cn):
    temp_name = "temp" + str(i)
    # Start from a column of all False, then mask True over the "Yes" rows;
    # any response other than exactly "Yes" stays False
    df[temp_name] = False
    mask = df[col_name] == "Yes"
    df.loc[mask, temp_name] = True
    # Swap the new boolean column in under the original name
    df.drop(columns=[col_name], inplace=True)
    df.rename(columns={temp_name: col_name}, inplace=True)

Yes/no questions were logged as either "Yes" or "No", so we can convert those columns to booleans. While it is possible to use the apply() function in pandas to convert "Yes" into True (and "No" into False), it is very inefficient on a dataset this large because it processes values one at a time in Python. A vectorized solution is preferred instead: a new column of all False values is created, and True is "masked" over the rows where the original column holds "Yes". Vectorized code like this lets pandas apply under-the-hood optimizations, reducing the runtime dramatically (in this case, 0.4 vs. 18 seconds!).
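To make the difference concrete, here is a rough sketch that times both approaches on a synthetic column. The column size, the seed, and the variable names are assumptions for illustration, and the timings will vary by machine, so it is not expected to reproduce the figures quoted above. It uses a direct comparison (`== "Yes"`) as the vectorized form, which yields the same booleans as masking over a column of False values.

```python
import time
import numpy as np
import pandas as pd

# Synthetic yes/no column; the size and seed are arbitrary assumptions
rng = np.random.default_rng(0)
answers = pd.Series(rng.choice(["Yes", "No"], size=200_000))

# Element-by-element conversion with apply()
start = time.perf_counter()
via_apply = answers.apply(lambda value: value == "Yes")
apply_seconds = time.perf_counter() - start

# Vectorized conversion: one comparison over the whole column
start = time.perf_counter()
via_vectorized = answers == "Yes"
vectorized_seconds = time.perf_counter() - start

# Both approaches should agree on every row
same = bool((via_apply == via_vectorized).all())

print(f"apply: {apply_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
print("same result:", same)
```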

In [128]:
# Columns whose values come from a small, fixed set of labels
cn = ["GenHealth", "Sex", "AgeCategory", "Race"]

for col_name in cn:
    df[col_name] = df[col_name].astype("category")

print(df.dtypes)
df.head()

BMI                  float64
PhysicalHealth          int8
MentalHealth            int8
Sex                 category
AgeCategory         category
Race                category
GenHealth           category
SleepTime               int8
HeartDisease            bool
Smoking                 bool
AlcoholDrinking         bool
Stroke                  bool
DiffWalking             bool
Diabetic                bool
PhysicalActivity        bool
Asthma                  bool
KidneyDisease           bool
SkinCancer              bool
dtype: object


Unnamed: 0,BMI,PhysicalHealth,MentalHealth,Sex,AgeCategory,Race,GenHealth,SleepTime,HeartDisease,Smoking,AlcoholDrinking,Stroke,DiffWalking,Diabetic,PhysicalActivity,Asthma,KidneyDisease,SkinCancer
0,16.6,3,30,Female,55-59,White,Very good,5,False,True,False,False,False,True,True,True,False,True
1,20.34,0,0,Female,80 or older,White,Very good,7,False,False,False,True,False,False,True,False,False,False
2,26.58,20,30,Male,65-69,White,Fair,8,False,True,False,False,False,True,True,True,False,False
3,24.21,0,0,Female,75-79,White,Good,6,False,False,False,False,False,False,False,False,False,True
4,23.71,28,0,Female,40-44,White,Very good,8,False,False,False,False,True,False,True,False,False,False


Finally, all of the categorical columns can be converted to pandas' built-in category type, which will be easier to work with later than plain strings. At this point, we have assigned proper types to all of the columns, which will facilitate our data analysis.
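As a small illustration of what the category type stores, the sketch below converts a toy column of repeated labels and compares memory use. The series names and the repetition count are assumptions for illustration only.

```python
import pandas as pd

# Toy column repeating a few labels many times (illustrative values)
labels = ["Very good", "Fair", "Good", "Very good"]
as_strings = pd.Series(labels * 10_000)
as_category = as_strings.astype("category")

# The distinct labels are stored once...
categories = as_category.cat.categories.tolist()
# ...and each row holds a small integer code pointing into that list
first_codes = as_category.cat.codes.head().tolist()

# Compare memory footprints, counting the string contents as well
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(categories)
print(first_codes)
print(string_bytes, category_bytes)
```

Because each label is stored once and rows hold only integer codes, the categorical version is typically much smaller, and the set of valid labels is explicit.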

TO-DO: EDA, Machine learning models, visualizations