**Decoding Heart Disease** 

**Intro**

Heart disease can refer to a wide range of conditions, but here it refers to coronary artery disease, in which atherosclerosis reduces blood flow to the heart. Atherosclerotic plaque can build up in the coronary arteries and lead to heart attacks and other complications.

The data set combines data from 4 databases: Cleveland, Hungary, Switzerland, and the VA (Veterans Administration) Medical Center in Long Beach, California. Only the Cleveland data was processed.

It has 14 usable attributes, half of which are integer variables and the other half categorical. Our project revolves around answering the question, **what factors contribute to heart disease?**



In [1]:
#imports basic tools and the dataset below

import pandas as pd
import altair as alt
!pip3 install -U ucimlrepo



In [2]:
from ucimlrepo import fetch_ucirepo

#imports dataset
heart_disease = fetch_ucirepo(name='Heart Disease')

In [1]:
#Sets up classifiers and diagnosis
classifiers = heart_disease.data.features 
diagnosis = heart_disease.data.targets 

#Creates dataframe
heart_df = classifiers.assign(diagnosis = diagnosis)

#Drops all columns except for predictors and diagnosis
heart_df_filter = heart_df[["age", "sex", "cp", "trestbps", "chol", "thalach", "diagnosis"]]

#Imports relevant sk-learn tools
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

train_heart, test_heart = train_test_split(heart_df_filter, train_size = 0.75)

#Averages each predictor within each diagnosis level to get an initial summary table
train_heart_mean = train_heart.groupby("diagnosis").mean(numeric_only=True).reset_index()

train_heart_mean 

Grouping the data by diagnosis already shows a few trends. Most notably, a diagnosis of 0 (no heart disease) is associated with younger age, a lower sex value (0 being female), a lower chest pain code (in the UCI encoding, 1 is typical angina and 4 is asymptomatic), lower resting blood pressure, and a higher maximum heart rate.

To investigate a couple of these links further, let's look at some graphs.

In [4]:
#Plot of age and heart disease

age = alt.Chart(train_heart).mark_bar().encode(
    x=alt.X("age", bin=alt.Bin(step=5)).title("Age"),
    y=alt.Y("count()").title("# of Patients"),
    color=alt.Color("diagnosis:O").title("Severity")  # :O treats the 0-4 codes as ordered categories, not a continuous scale
).properties(title="Age and Heart Disease")


#Plot of max heart rate and heart disease

max_hr = alt.Chart(train_heart).mark_bar().encode(
    x=alt.X("thalach", bin=alt.Bin(step=20)).title("Max HR (Exercise)"),
    y=alt.Y("count()").title("# of Patients"),
    color=alt.Color("diagnosis:O").title("Severity")  # :O treats the 0-4 codes as ordered categories, not a continuous scale
).properties(title="Max Heart Rate and Heart Disease")

display(age | max_hr)

These visualizations show a couple of expected trends. The proportion of patients aged 25-55 with a more severe diagnosis (2 or 3) is much lower than among those aged 55 and up.

Maximum heart rate shows a clear negative association with heart disease. The 160-180 group is dominated by a diagnosis of 0, and in the 140-160 group a majority are still free of heart disease, but at lower heart rates heart disease becomes the norm, not the exception.
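
The per-bin proportions described above can also be computed rather than read off the chart. This is a minimal sketch, assuming a small made-up table in place of `train_heart` (the `toy` values are illustrative only, not taken from the dataset) and the same 20 bpm bin width as the histogram:

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be train_heart.
toy = pd.DataFrame({
    "thalach":   [172, 165, 168, 150, 155, 145, 130, 125, 118, 110],
    "diagnosis": [0,   0,   1,   0,   0,   2,   1,   3,   0,   2],
})

# 20 bpm bins; right=False makes them [100, 120), [120, 140), ...
bins = list(range(100, 201, 20))
toy["hr_bin"] = pd.cut(toy["thalach"], bins=bins, right=False)

# Share of patients in each bin with a diagnosis of 0 (no heart disease).
share_healthy = (
    toy.assign(healthy=toy["diagnosis"].eq(0))
       .groupby("hr_bin", observed=True)["healthy"]
       .mean()
)
print(share_healthy)
```

A share above 0.5 in a bin means most patients in that heart-rate range are disease free, which is the pattern the text describes for the higher bins.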

**Methods**

The variables we have decided to select are age (**age**), sex (**sex**), chest pain type (**cp**), resting blood pressure (**trestbps**), serum cholesterol (**chol**) and maximum heart rate (**thalach**).

These variables were chosen as the most relevant predictors. The others were excluded because they were less relevant or overlapped with variables we had already chosen.

Using these predictor variables, we will attempt to build a model that can accurately classify heart disease (**num**, stored as `diagnosis` in our dataframe) as the target variable. We will tune the model with cross-validation on the training data, repeat the process with different random seeds, and keep the unseen test data for a final check of how reliably it classifies the target variable.
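
The scikit-learn imports earlier in the notebook suggest a K-nearest neighbors classifier tuned with a grid search. Assuming that is the intended model, one way to tune it without touching the test set is sketched below. The data here is synthetic (`make_classification` stands in for the six predictors and a binary diagnosis), and the grid of `n_neighbors` values is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 features, binary target (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is standardized
# using only its own training portion.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Odd values of k from 1 to 29 (illustrative grid).
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}
search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
# The test set is used once, after tuning is finished.
test_accuracy = search.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

Changing `random_state` in `train_test_split` and rerunning gives a sense of how stable the chosen `n_neighbors` and the test accuracy are across splits.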

Most of our visualizations are likely to be histograms, since we want to show how the numerical values are distributed in relation to the target variable. Because even the categorical variables (e.g. sex and chest pain) are stored as numbers, this should make our life easier.


**Expectations**

We expect that heart disease will be correlated with a lower maximum heart rate and with higher age, cholesterol levels and resting blood pressure, as well as with angina symptoms and being male.
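
One rough way to check these expected directions is to correlate each numeric predictor with a 0/1 disease indicator. This is a minimal sketch on made-up rows: the `toy` values and the `has_disease` column are illustrative assumptions, and collapsing every diagnosis of 1 or above into a single "disease" class is a simplification introduced here, not something the notebook does:

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be train_heart.
toy = pd.DataFrame({
    "age":       [40, 45, 50, 55, 60, 65, 70, 58],
    "thalach":   [180, 170, 165, 150, 140, 130, 120, 145],
    "diagnosis": [0, 0, 0, 1, 2, 1, 3, 0],
})

# Hypothetical binary indicator: 1 if any heart disease was diagnosed.
toy["has_disease"] = (toy["diagnosis"] > 0).astype(int)

# Pearson correlation of each predictor with the indicator.
corr = toy[["age", "thalach"]].corrwith(toy["has_disease"])
print(corr)
```

A positive value for `age` and a negative value for `thalach` would match the expectations above. A correlation on its own does not show that a factor causes heart disease.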


These findings could support early mitigation of certain risk factors through changes in lifestyle, medications, and surgeries, if needed. Diagnosing heart disease will also help in preventing or reducing the likelihood of heart attacks, strokes, and heart failure.


Future questions that our findings could lead to include how to combat some of the variables that may contribute to heart disease, and whether fixing one factor could improve others (e.g. whether lowering resting blood pressure could increase max heart rate).
