**Decoding Heart Disease**

**Introduction**


Heart disease can refer to a wide range of conditions, but here it refers to coronary artery disease, which involves reduced blood flow to the heart due to atherosclerosis. Atherosclerotic plaque can build up in the arteries and, if it forms near the heart, can lead to heart attacks and other complications. The causes aren’t exactly clear, but plaque builds up gradually and is accompanied by inflammation of the arterial walls, reducing blood flow to vital organs and extremities (Johns Hopkins Medicine, 2021).


The data set combines data from 4 databases: Cleveland, Hungary, Switzerland, and VA Long Beach (the Veterans Administration Medical Center in Long Beach, California). Only the Cleveland data has been processed. It has 14 usable attributes, half of which are integer variables and half categorical. We have attempted to answer the question, **what factors contribute to heart disease?**


Heart disease is a growing concern, especially as populations age. It has become the leading cause of death worldwide, accounting for 16% of total deaths globally (WHO’s Global Health Estimates, 2020). It also disproportionately affects wealthier nations and, more importantly for this investigation, correlates with many health metrics and biological markers (WHO’s Global Health Estimates, 2020).

We have built a model that takes certain predictive health markers and attempts to assign a severity of heart disease (if present).


In [1]:
#imports basic tools and the dataset below

import pandas as pd
import altair as alt
import numpy as np
!pip3 install -U ucimlrepo



In [2]:
from ucimlrepo import fetch_ucirepo

#imports dataset
heart_disease = fetch_ucirepo(name='Heart Disease')

In [17]:
#Sets up classifiers and diagnosis
classifiers = heart_disease.data.features 
diagnosis = heart_disease.data.targets 

#Creates dataframe
heart_df = classifiers.assign(diagnosis = diagnosis)

#Drops all columns except for predictors and diagnosis
heart_df_filter = heart_df[["age", "cp", "trestbps", "chol", "thalach", "diagnosis"]]

#Imports sk-learn tools
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import recall_score

#sets an initial random seed to make the train/test split reproducible
np.random.seed(1000)

heart_train, heart_test = train_test_split(heart_df_filter, train_size = 0.75)

#Grabs all the different diagnoses to get an initial table
heart_train_mean = heart_train.groupby("diagnosis").mean(numeric_only=True).reset_index()

display("Figure 1: Heart Disease grouped by diagnoses", heart_train_mean)


'Figure 1: Heart Disease grouped by diagnoses'

| | diagnosis | age | cp | trestbps | chol | thalach |
|---|---|---|---|---|---|---|
| 0 | 0 | 52.128205 | 2.769231 | 129.247863 | 241.068376 | 158.410256 |
| 1 | 1 | 55.090909 | 3.295455 | 133.659091 | 252.590909 | 145.681818 |
| 2 | 2 | 58.555556 | 3.703704 | 135.185185 | 256.0 | 137.0 |
| 3 | 3 | 56.214286 | 3.857143 | 137.107143 | 238.892857 | 132.928571 |
| 4 | 4 | 59.363636 | 3.636364 | 138.545455 | 243.727273 | 140.181818 |


**Methods and Results**

Initially we can see a couple of trends just by grouping the data by diagnosis. Most notably, a diagnosis of 0 (no heart disease) is associated with younger age, a lower mean chest pain code (**cp** runs from 1, typical angina, to 4, asymptomatic), lower resting blood pressure, and a higher maximum heart rate.
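As a rough check on the trends in Figure 1, the sketch below computes a Pearson correlation between the diagnosis level and the group means of **age** and **thalach**, with the means copied from the table. It works on five group means rather than patient-level data, so it only indicates the direction of each trend and overstates how tightly the variables track each other. The helper name `pearson` is ours and is not part of the notebook.

```python
import math

# Group means copied from Figure 1, ordered by diagnosis level 0..4.
diagnosis_levels = [0, 1, 2, 3, 4]
mean_age = [52.128205, 55.090909, 58.555556, 56.214286, 59.363636]
mean_thalach = [158.410256, 145.681818, 137.0, 132.928571, 140.181818]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_age = pearson(diagnosis_levels, mean_age)
r_thalach = pearson(diagnosis_levels, mean_thalach)

print(f"diagnosis level vs mean age:     {r_age:+.2f}")
print(f"diagnosis level vs mean thalach: {r_thalach:+.2f}")
```

A positive sign for age and a negative sign for thalach would agree with the reading of the table above.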

To further investigate a couple of links let's look at some graphs.

In [8]:
#Plot of age and heart disease

age = alt.Chart(heart_train).mark_bar().encode(
    x=alt.X("age", bin=alt.Bin(step=5)).title("Age"),
    y=alt.Y("count()").title("# of Patients"),
    color=alt.Color("diagnosis").title("Severity")
).properties(title="Figure 2: Age and Heart Disease")


#Plot of max heart rate and heart disease

max_hr = alt.Chart(heart_train).mark_bar().encode(
    x=alt.X("thalach", bin=alt.Bin(step=20)).title("Max HR (Exercise)"),
    y=alt.Y("count()").title("# of Patients"),
    color=alt.Color("diagnosis").title("Severity")
).properties(title="Figure 3: Max Heart Rate and Heart Disease")

display(age | max_hr)

These visualizations show a couple of expected trends. The proportion of younger patients (ages 25-55) with more severe diagnoses, i.e. 2 or 3, is much lower than among those 55 and up.

As for max heart rate, there is a clear negative association between max heart rate and heart disease. The 160-180 group is dominated by patients with no diagnosis, and in the 140-160 group at least a majority are free of heart disease, but below that heart disease becomes the norm, not the exception.

We will now set up a KNeighborsClassifier model below that will attempt to make predictions from the variables **age**, **cp**, **trestbps**, **chol**, and **thalach** and output a predicted value for the target variable **diagnosis**.
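To make the idea behind a K-nearest-neighbors classifier concrete, here is a minimal sketch in plain Python: find the K training points closest to a query point and take a majority vote over their labels. It is an illustration, not scikit-learn's implementation (tie-breaking and neighbor search differ), and the function `knn_predict` and the toy points are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up, already-standardized two-feature points (illustrative only, not from the dataset).
toy_points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7), (1.0, 1.1), (0.9, 0.8), (1.2, 1.0)]
toy_labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(toy_points, toy_labels, query=(0.7, 0.9), k=3)
print(prediction)
```

The query point sits next to the three points labelled 1, so all three of its nearest neighbours vote for that label.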

Since KNeighborsClassifier relies on distances between observations, the variables have to be standardized so that those measured on larger scales (such as **chol**) do not dominate the result.
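The effect of scale on a distance-based model can be seen in a small sketch. The three "patients" below are made-up values for **cp** and **chol**, and `standardize` is a hypothetical helper that rescales each column to mean 0 and standard deviation 1 using the population standard deviation, which is what StandardScaler uses.

```python
import math
import statistics

# Made-up (cp, chol) values for three patients, for illustration only.
patients = [(1, 240), (4, 245), (1, 300)]


def standardize(rows):
    """Rescale each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


# Patients 0 and 1 differ mostly in cp; patients 0 and 2 differ only in chol.
raw_cp_gap = math.dist(patients[0], patients[1])
raw_chol_gap = math.dist(patients[0], patients[2])

scaled = standardize(patients)
scaled_cp_gap = math.dist(scaled[0], scaled[1])
scaled_chol_gap = math.dist(scaled[0], scaled[2])

print(f"raw:    cp gap {raw_cp_gap:.2f}, chol gap {raw_chol_gap:.2f}")
print(f"scaled: cp gap {scaled_cp_gap:.2f}, chol gap {scaled_chol_gap:.2f}")
```

On the raw values the cholesterol difference swamps the chest pain difference simply because cholesterol is measured in larger numbers. After scaling, the two gaps are of similar size.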

In order to find the best K value we will use GridSearchCV over the range 1-50 (inclusive). We will use a cv of 10 to maintain a solid balance between efficiency and performance.
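With 50 candidate K values and 10 folds, the search fits 500 models before refitting the best one. The sketch below shows how a 10-fold split partitions the training rows so that each row is used for validation exactly once. It is a simplified, unshuffled version: for a classifier, GridSearchCV with an integer `cv` uses stratified folds, which this sketch does not reproduce. The generator `k_fold_indices` and the sample size of 25 are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Hypothetical training-set size, chosen only for illustration.
folds = list(k_fold_indices(n_samples=25, n_folds=10))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

Each candidate K is scored on every validation fold in turn, and the mean of those ten scores is what `mean_test_score` reports.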


In [18]:
#Creates the preprocessor
heart_pre = make_column_transformer(
    (StandardScaler(), ["age","cp","trestbps","chol","thalach"]),
    remainder="passthrough",
    verbose_feature_names_out=False
)

#Creates the model
knn = KNeighborsClassifier()

#Creates pipeline
heart_pipe = make_pipeline(heart_pre, knn)

#Fits training data onto the pipeline with diagnosis as the target variable
#(this fit uses the default n_neighbors=5; the grid search below refits its own copies)
heart_pipe.fit(
    heart_train[["age","cp","trestbps","chol","thalach"]],
    heart_train["diagnosis"])

#Creates a grid and performs a grid search to attempt to find the best K value
grid = {
    "kneighborsclassifier__n_neighbors": range(1,51)
}

heart_grid = GridSearchCV(
    estimator = heart_pipe,
    param_grid = grid,
    cv = 10
)

#Fits it onto the training data
heart_grid.fit(
    heart_train[["age","cp","trestbps","chol","thalach"]],
    heart_train["diagnosis"])

In [26]:
#Creates a dataframe out of the grid
heart_grid_df = pd.DataFrame(heart_grid.cv_results_)

#Creates a line plot out of the dataframe, using neighbors on the X and mean test score on the Y
grid_plot = alt.Chart(heart_grid_df).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of Neighbors"),
    y=alt.Y("mean_test_score").title("Mean Test Score")
)

display("Figure 4: How Does Number of Neighbors Impact Test Score?", grid_plot)

'Figure 4: How Does Number of Neighbors Impact Test Score?'

With the grid fitted, we put its results into a dataframe and visualize them in Figure 4.

Looking at the line plot we can see that increasing the number of neighbors boosts the performance of the model at first, but the gains level off very quickly.

Let's sort the table by **rank_test_score** to find the best K value.

In [29]:
#Sorts the grid dataframe by rank_test_score
display("Figure 5: Best K Values",
        heart_grid_df.sort_values(by="rank_test_score").head(5)
        [["rank_test_score", "mean_test_score", "param_kneighborsclassifier__n_neighbors"]]
       )

'Figure 5: Best K Values'

| | rank_test_score | mean_test_score | param_kneighborsclassifier__n_neighbors |
|---|---|---|---|
| 39 | 1 | 0.524111 | 40 |
| 46 | 2 | 0.51996 | 47 |
| 45 | 2 | 0.51996 | 46 |
| 38 | 4 | 0.519565 | 39 |
| 44 | 5 | 0.515613 | 45 |
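The ranks in Figure 5 can be reproduced by hand from the mean scores. The sketch below assumes that tied scores share the lowest rank and the following rank is skipped, which matches the 1, 2, 2, 4, 5 pattern in the table. It only covers the five rows shown, and the function `competition_rank` is ours.

```python
# (n_neighbors, mean_test_score) pairs copied from Figure 5.
scores = [(40, 0.524111), (47, 0.51996), (46, 0.51996), (39, 0.519565), (45, 0.515613)]


def competition_rank(pairs):
    """Rank by score, highest first; tied scores share a rank and the next rank is skipped."""
    ranks = {}
    for k, score in pairs:
        # Rank = 1 + number of candidates with a strictly higher score.
        ranks[k] = 1 + sum(1 for _, other in pairs if other > score)
    return ranks


ranks = competition_rank(scores)
best_k = max(scores, key=lambda pair: pair[1])[0]

print(ranks)
print(best_k)
```

The gap between the top few candidates is under one percentage point, so the choice among them is sensitive to the particular train/test split.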


It looks like 40 is the best K value, with a mean cross-validated accuracy of about 0.524, and we will go with it going forward. Now that we have a K value let's train the

**References**

Johns Hopkins Medicine. (2021). Atherosclerosis. Hopkins Medicine. https://www.hopkinsmedicine.org/health/conditions-and-diseases/atherosclerosis

WHO’s Global Health Estimates. (2020, December 9). The top 10 causes of death. World Health Organization; WHO. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death