Aditi Surma: 84186501
\
Cathy Lei: 12532537
\
Lilian Wang: 35169481

# **Heart Disease Data Analysis**

**Introduction:**
\
\
&emsp; &emsp;Cardiovascular disease (CVD) is a leading cause of death globally, and deaths due to CVD have only increased (He et al. 80). To illustrate, in 1990 there were 12.1 million global CVD-caused deaths, which increased to 18.6 million deaths in 2019 (He et al. 80). CVD includes any condition which may impact the cardiovascular system, such as coronary heart disease or heart failure. Since CVD harms numerous patients globally, it is important to develop a system which can diagnose CVD as early as possible to ensure immediate treatment and possible recovery. This leads to our project’s central question: **Based on a patient’s resting systolic blood pressure, serum cholesterol levels and maximum achieved heart rate, do they have cardiovascular disease (CVD)?** 

&emsp; &emsp;The purpose of our project is to create a classifying system which uses three predictors, resting systolic blood pressure, serum cholesterol levels and maximum achieved heart rate, to determine whether a patient has CVD. The classifier will be made using a heart disease dataset from UC Irvine’s Machine Learning Repository with data collected by the Cleveland Clinic Foundation. The dataset has 303 data points and a total of 14 attributes: 4 continuous, 9 discrete, and 1 predicted attribute specifying whether an individual has heart disease. As mentioned, 3 continuous variables have been chosen as predictors for our classifier. 


**Methods**

&emsp; &emsp; The variables chosen for the classifier’s predictors are systolic blood pressure, serum cholesterol levels and maximum achieved heart rate. These were chosen based on previous clinical studies investigating their correlation with the presence of CVD in a patient, and because they are continuous variables. To elaborate, a study conducted by a research group from Zhengzhou University found a strong, positive, linear correlation between systolic blood pressure and the hazard ratio for CVD-caused mortality (He et al. 85). This conveys that systolic blood pressure is a significant indicator of CVD, and therefore a useful predictor for our classifier. In addition, the Framingham Study, a significant study in medicine which discovered the indicators of CVD, revealed a strong, positive, linear correlation between serum cholesterol and the incidence of coronary heart disease, a branch of CVD (Kannel et al. 43). This justifies the importance of serum cholesterol levels as a predictor of CVD in our classifier. Finally, a study conducted by the Cardiovascular Institute and Fu Wai Hospital found that a heart rate above 90 beats per minute had the greatest hazard risk for CVD and that heart rates above 75 beats per minute accounted for the greatest number of patients diagnosed with CVD (Mao 1644). Thus, heart rate is a vital indicator for CVD and is a predictor in our classification model. 

&emsp; &emsp; Our data analysis from beginning to end is as follows. First, we pulled our dataset from the UC Irvine Machine Learning Repository. We dropped missing values. We also renamed the columns  “num” and “thalach” to “heart_disease_presence” and “max_heart_rate” respectively to make our tables easier to interpret. 75 percent of the data was split off to use as our training set. Based on our background research, we decided to use systolic blood pressure (“trestbps(systolic)”), serum cholesterol levels (“chol”) and maximum achieved heart rate as our predictor variables. 

&emsp; &emsp; Next, we performed preliminary exploratory data analysis. To better understand the heart disease presence variable, we took the count and the mean for each level of heart disease presence, and then the mean value of each predictor we used. We then created three bar graphs to visualize the relationship between each predictor and the presence of heart disease. We chose to use bar graphs because we are comparing different amounts (the means of a given predictor) in each heart disease level category. 

&emsp; &emsp;From our bar graphs, we can see that higher levels of heart disease presence seem to be associated with lower maximum heart rates and higher systolic resting blood pressures. The relationship between serum cholesterol level and heart disease presence is less strong, but it does seem that having heart disease (1-4) is associated with higher cholesterol levels.

&emsp; &emsp; Our research question tries to classify whether or not someone has heart disease so a patient can know if they should seek additional testing and treatment. Regardless of their level of heart disease, a patient who has non-zero heart disease presence should seek additional care. Therefore we have replaced 1-4 with yes and 0 with no in the heart disease column.

&emsp; &emsp; Once again, we compared the mean value of each predictor variable with each category of heart disease presence and made bar graphs displaying each relationship. From these bar graphs, we can see that having heart disease is associated with lower maximum heart rates, higher cholesterol levels, and higher systolic blood pressures. 

&emsp; &emsp; We also created three scatterplots that each visualized the relationship between two of our quantitative variables. From these scatterplots, there does not seem to be a strong relationship between any pair of our predictors. 

&emsp; &emsp; After completing our preliminary data analysis, we built our classifier using the K-nearest neighbours algorithm. To choose the best K for the classifier, we standardized the data, created a pipeline, and performed cross-validation on our training set with GridSearchCV. We standardized our predictor variables so that the K-nearest neighbours algorithm does not weight the predictors differently based on their different scales and centers. 

&emsp; &emsp; From here, we visualized the relationship between the number of neighbours (K) tested and the mean test score with a line graph. Having 24 neighbours produces a high mean test score, so we chose it as K. Additionally, the mean test scores for neighbouring values (K = 23 and 25) are not dramatically different, so our choice of K seems reliable. 
Using K=24, we trained the classifier and predicted the labels for our testing set. 
Our accuracy from this model is 73%.

**Expected outcomes**

&emsp; &emsp; Based on previous studies, we expect our classifier to be more likely to predict patients as having CVD if they score highly on any predictor variable. Resting systolic blood pressures above 130 mm Hg (He et al. 85), heart rates over 90 bpm (Mao 1644), or high scores on multiple predictors are expected to be particularly strongly linked to CVD. 

**Preliminary Exploratory Data Analysis:**

In [164]:
import pandas as pd
import altair as alt
import numpy as np
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

**Figure 1: Heart disease original data.**

In [165]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
hd_original_data = pd.read_csv(url, header=None)
hd_original_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


To clean and wrangle this dataset into a tidy format, column names are added in the order given by the original website. The last column, initially referred to as "num", has been renamed to "heart_disease_presence", and refers to the heart disease status of the patient. This status ranges from 0-4, with 0 indicating no presence of heart disease. The original column "thalach" has also been renamed to "max_heart_rate", and refers to the maximum heart rate achieved. All rows with missing values (recorded as "?" in the original file) have been dropped.

**Figure 2: Cleaned and wrangled heart disease data.**

In [166]:
hd_original_data.columns = ["age", "sex", "cp", "trestbps(systolic)", "chol", "fbs", "restecg", "max_heart_rate", "exang", "oldpeak", "slope", "ca", "thal", "heart_disease_presence"]
hd_original_data['heart_disease_presence'] = pd.Categorical(hd_original_data.heart_disease_presence)
# Missing values are recorded as "?"; keep only the rows that contain none.
has_missing = hd_original_data.isin(["?"]).any(axis=1)
hd_data = hd_original_data[~has_missing]
hd_data

Unnamed: 0,age,sex,cp,trestbps(systolic),chol,fbs,restecg,max_heart_rate,exang,oldpeak,slope,ca,thal,heart_disease_presence
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
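
As an alternative way of handling the "?" markers, they could be converted to missing values at the moment the file is read. The sketch below is only an illustration on a made-up three-row sample in the same column layout; the sample rows and the names `sample`, `demo` and `demo_clean` are assumptions, not part of our analysis. It relies on the `na_values` option of `pd.read_csv` followed by `dropna`.

```python
import io

import pandas as pd

# Made-up sample in the layout of processed.cleveland.data.
# "?" marks a missing value, as in the original file.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
)

column_names = [
    "age", "sex", "cp", "trestbps(systolic)", "chol", "fbs", "restecg",
    "max_heart_rate", "exang", "oldpeak", "slope", "ca", "thal",
    "heart_disease_presence",
]

# Treat "?" as NaN while parsing, then drop every row that has a NaN.
demo = pd.read_csv(sample, header=None, names=column_names, na_values="?")
demo_clean = demo.dropna()

print(len(demo), len(demo_clean))
print(demo_clean["ca"].dtype)
```

A side effect of this approach is that columns such as "ca" and "thal" are parsed as numbers rather than as text.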


For our project, we will split 75% of the data to use as our training data and the other 25% as our testing data.

**Figure 3: Split data**

In [167]:
hd_train, hd_test = train_test_split(hd_data, test_size=0.25, random_state=123, stratify=hd_data["heart_disease_presence"]) 
hd_train

Unnamed: 0,age,sex,cp,trestbps(systolic),chol,fbs,restecg,max_heart_rate,exang,oldpeak,slope,ca,thal,heart_disease_presence
295,41.0,1.0,2.0,120.0,157.0,0.0,0.0,182.0,0.0,0.0,1.0,0.0,3.0,0
100,45.0,1.0,4.0,115.0,260.0,0.0,2.0,185.0,0.0,0.0,1.0,0.0,3.0,0
279,58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,0
163,58.0,0.0,4.0,100.0,248.0,0.0,2.0,122.0,0.0,1.0,2.0,0.0,3.0,0
38,55.0,1.0,4.0,132.0,353.0,0.0,0.0,132.0,1.0,1.2,2.0,1.0,7.0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,58.0,1.0,3.0,140.0,211.0,1.0,2.0,165.0,0.0,0.0,1.0,0.0,3.0,0
34,44.0,1.0,3.0,130.0,233.0,0.0,0.0,179.0,1.0,0.4,1.0,0.0,3.0,0
98,52.0,1.0,2.0,134.0,201.0,0.0,0.0,158.0,0.0,0.8,1.0,1.0,3.0,0
19,49.0,1.0,2.0,130.0,266.0,0.0,0.0,171.0,0.0,0.6,1.0,0.0,3.0,0
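
The `stratify` argument in the split above asks `train_test_split` to keep the class proportions roughly the same in the training and testing sets. A minimal sketch of this effect on made-up labels follows; the data frame `demo` and its 60/40 class mix are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 60 "no" labels and 40 "yes" labels.
demo = pd.DataFrame({"x": range(100), "label": ["no"] * 60 + ["yes"] * 40})

demo_train, demo_test = train_test_split(
    demo, test_size=0.25, random_state=123, stratify=demo["label"]
)

# Share of each class in the two splits.
train_share = demo_train["label"].value_counts(normalize=True)
test_share = demo_test["label"].value_counts(normalize=True)

print(train_share)
print(test_share)
```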


To explore our dataset, we found the count and percentage of each level of heart disease presence, **Figure 4**:

In [168]:
explore_hd_grouped = (hd_train.groupby('heart_disease_presence').count())
explore_hd = explore_hd_grouped[["age"]].rename(columns={"age":"count"})
explore_hd = explore_hd.assign(
    percentage=100*explore_hd['count']/len(hd_train)
)
explore_hd

| heart_disease_presence | count | percentage |
|---|---|---|
| 0 | 120 | 54.054054 |
| 1 | 40 | 18.018018 |
| 2 | 26 | 11.711712 |
| 3 | 26 | 11.711712 |
| 4 | 10 | 4.504505 |


The means of each predictor for the individual heart disease presence, **Figure 5**:

In [169]:
# Mean of each predictor within each heart disease level (0-4).
predictor_cols = ["trestbps(systolic)", "chol", "max_heart_rate"]
hd_mean2 = (
    hd_train.groupby("heart_disease_presence", observed=True)[predictor_cols]
    .mean()
    .add_suffix(" mean")
)
hd_mean2

| heart_disease_presence | trestbps(systolic) mean | chol mean | max_heart_rate mean |
|---|---|---|---|
| 0 | 129.483333 | 244.075 | 158.875 |
| 1 | 132.75 | 249.325 | 145.9 |
| 2 | 131.653846 | 259.692308 | 135.423077 |
| 3 | 133.730769 | 238.807692 | 134.115385 |
| 4 | 136.9 | 253.1 | 139.4 |


Graphs representing these means:

**Figure 6A: Heart disease presence versus maximum heart rate:**

In [170]:
hd_mean2_ = hd_mean2.reset_index()
hd_mean2_["heart_disease_presence"] = pd.Categorical(hd_mean2_.heart_disease_presence)
hdp_vs_max_htrt = (
    alt.Chart(hd_mean2_)
    .mark_bar()
    .encode(
        x=alt.X("heart_disease_presence", title="Heart disease presence"),
        y=alt.Y("max_heart_rate mean", title="Maximum heart rate (BPM)"),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence", scale=alt.Scale(scheme='dark2'))
    )
).configure_axis(titleFontSize=12)
hdp_vs_max_htrt


**Figure 6B: Heart disease presence versus cholesterol level:**

In [171]:
hdp_vs_chol = (
    alt.Chart(hd_mean2_)
    .mark_bar()
    .encode(
        x=alt.X("heart_disease_presence", title="Heart disease presence"),
        y=alt.Y("chol mean", title="Serum cholesterol level (mg/dl)"),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence", scale=alt.Scale(scheme='dark2'))
    )
).configure_axis(titleFontSize=12)
hdp_vs_chol

**Figure 6C: Heart disease presence versus systolic blood pressure:**

In [172]:
hdp_vs_restbps = (
    alt.Chart(hd_mean2_)
    .mark_bar()
    .encode(
        x=alt.X("heart_disease_presence", title="Heart disease presence"),
        y=alt.Y("trestbps(systolic) mean", title="Systolic resting blood pressure (mm Hg)"),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence", scale=alt.Scale(scheme='dark2'))
    )
).configure_axis(titleFontSize=12)
hdp_vs_restbps

We chose to use bar graphs because we are comparing different amounts (the means of a given predictor) in each heart disease level category.

From our bar graphs, we can see that higher levels of heart disease presence seem to be associated with lower maximum heart rates and higher systolic resting blood pressures. The relationship between serum cholesterol level and heart disease presence is less strong, but it does seem that having heart disease (1-4) is associated with higher cholesterol levels.

Our research question tries to classify whether or not someone has heart disease so a patient can know if they should seek additional testing and treatment. Regardless of their level of heart disease, a patient who has non-zero heart disease presence should seek additional care. Therefore we have replaced 1-4 with yes and 0 with no in the heart disease column.

In [173]:
hd_test["heart_disease_presence"] = (hd_test['heart_disease_presence']).astype(str)
hd_general_test = hd_test.replace({"heart_disease_presence": {"0":"no", "1":"yes", "2":"yes", "3":"yes", "4":"yes"}})
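
The same recoding can also be expressed as a single comparison, since every non-zero level counts as "yes". The sketch below is a minimal illustration on made-up levels; the data frame `demo` and the column name `has_cvd` are assumptions and do not appear in our analysis.

```python
import pandas as pd

# Made-up heart disease levels stored as a categorical column, as in hd_train.
demo = pd.DataFrame({"heart_disease_presence": pd.Categorical([0, 2, 0, 4, 1, 3])})

# Any level above 0 means heart disease is present.
levels = demo["heart_disease_presence"].astype(int)
demo["has_cvd"] = (levels > 0).map({False: "no", True: "yes"})

print(demo["has_cvd"].tolist())  # ['no', 'yes', 'no', 'yes', 'yes', 'yes']
```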

**Figure 7: Modified heart_disease_presence to yes or no:**

In [174]:
hd_train["heart_disease_presence"] = (hd_train['heart_disease_presence']).astype(str)
hd_general_train = hd_train.replace({"heart_disease_presence": {"0":"no", "1":"yes", "2":"yes", "3":"yes", "4":"yes"}})
hd_general_train

Unnamed: 0,age,sex,cp,trestbps(systolic),chol,fbs,restecg,max_heart_rate,exang,oldpeak,slope,ca,thal,heart_disease_presence
295,41.0,1.0,2.0,120.0,157.0,0.0,0.0,182.0,0.0,0.0,1.0,0.0,3.0,no
100,45.0,1.0,4.0,115.0,260.0,0.0,2.0,185.0,0.0,0.0,1.0,0.0,3.0,no
279,58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,no
163,58.0,0.0,4.0,100.0,248.0,0.0,2.0,122.0,0.0,1.0,2.0,0.0,3.0,no
38,55.0,1.0,4.0,132.0,353.0,0.0,0.0,132.0,1.0,1.2,2.0,1.0,7.0,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,58.0,1.0,3.0,140.0,211.0,1.0,2.0,165.0,0.0,0.0,1.0,0.0,3.0,no
34,44.0,1.0,3.0,130.0,233.0,0.0,0.0,179.0,1.0,0.4,1.0,0.0,3.0,no
98,52.0,1.0,2.0,134.0,201.0,0.0,0.0,158.0,0.0,0.8,1.0,1.0,3.0,no
19,49.0,1.0,2.0,130.0,266.0,0.0,0.0,171.0,0.0,0.6,1.0,0.0,3.0,no


**Figure 8: Means of each predictor for modified heart disease presence:**

In [175]:
# Mean of each predictor for patients without ("no") and with ("yes") heart disease.
predictor_cols = ["trestbps(systolic)", "chol", "max_heart_rate"]
hd_general_mean2 = (
    hd_general_train.groupby("heart_disease_presence")[predictor_cols]
    .mean()
    .add_suffix(" mean")
)
hd_general_mean2

| heart_disease_presence | trestbps(systolic) mean | chol mean | max_heart_rate mean |
|---|---|---|---|
| no | 129.483333 | 244.075 | 158.875 |
| yes | 133.127451 | 249.656863 | 139.588235 |


**Figure 9A: Heart disease presence versus maximum heart rate:**

In [176]:
hd_general_mean2_ = hd_general_mean2.reset_index()
hd_general_mean2_["heart_disease_presence"] = pd.Categorical(hd_general_mean2_.heart_disease_presence)
hdp_vs_max_htrt_general = (
    alt.Chart(hd_general_mean2_)
    .mark_bar()
    .encode(
        x=alt.X("heart_disease_presence", title="Heart disease presence"),
        y=alt.Y("max_heart_rate mean", title="Maximum heart rate (BPM)"),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence", scale=alt.Scale(scheme='dark2'))
    )
).configure_axis(titleFontSize=12)
hdp_vs_max_htrt_general


**Figure 9B: Heart disease presence versus cholesterol level:**

In [177]:
hdp_vs_chol_general = (
    alt.Chart(hd_general_mean2_)
    .mark_bar()
    .encode(
        x=alt.X("heart_disease_presence", title="Heart disease presence"),
        y=alt.Y("chol mean", title="Serum cholesterol level (mg/dl)"),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence", scale=alt.Scale(scheme='dark2'))
    )
).configure_axis(titleFontSize=12)
hdp_vs_chol_general

**Figure 9C: Heart disease presence versus systolic blood pressure:**

In [178]:
hdp_vs_restbps_general = (
    alt.Chart(hd_general_mean2_)
    .mark_bar()
    .encode(
        x=alt.X("heart_disease_presence", title="Heart disease presence"),
        y=alt.Y("trestbps(systolic) mean", title="Systolic resting blood pressure (mm Hg)"),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence", scale=alt.Scale(scheme='dark2'))
    )
).configure_axis(titleFontSize=12)
hdp_vs_restbps_general

Once again, we compared the mean value of each predictor variable with each category of heart disease presence and made bar graphs displaying each relationship. From these bar graphs, we can see that having heart disease is associated with lower maximum heart rates, higher cholesterol levels, and higher systolic blood pressures.

We also created three scatterplots that each visualized the relationship between two of our quantitative variables. From these scatterplots, there does not seem to be a strong relationship between any pair of our predictors.

**Figure 10A: Serum cholesterol level versus systolic blood pressure**

In [179]:
chol_vs_restbps = (
    alt.Chart(hd_general_train)
    .mark_circle()
    .encode(
        x=alt.X("chol", title="Serum cholesterol level (mg/dl)", scale=alt.Scale(zero=False)),
        y=alt.Y("trestbps(systolic)", title="Systolic resting blood pressure (mm Hg)", scale=alt.Scale(zero=False)),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence")
    )
).configure_axis(titleFontSize=12)
chol_vs_restbps

**Figure 10B: Serum cholesterol level versus maximum heart rate**

In [180]:
chol_vs_max_htrt = (
    alt.Chart(hd_general_train)
    .mark_circle()
    .encode(
        x=alt.X("chol", title="Serum cholesterol level (mg/dl)", scale=alt.Scale(zero=False)),
        y=alt.Y("max_heart_rate", title="Maximum heart rate (BPM)", scale=alt.Scale(zero=False)),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence")
    )
).configure_axis(titleFontSize=12)
chol_vs_max_htrt

**Figure 10C: Systolic blood pressure versus maximum heart rate:**

In [181]:
restbps_vs_max_htrt = (
    alt.Chart(hd_general_train)
    .mark_circle()
    .encode(
        x=alt.X("trestbps(systolic)", title="Systolic resting blood pressure (mm Hg)", scale=alt.Scale(zero=False)),
        y=alt.Y("max_heart_rate", title="Maximum heart rate (BPM)", scale=alt.Scale(zero=False)),
        color=alt.Color("heart_disease_presence", title="Heart Disease Presence")
    )
).configure_axis(titleFontSize=12)
restbps_vs_max_htrt

This is the count for each patient's presence of heart disease, grouped by "yes" and "no" (whether the patient has heart disease or not):

**Figure 11: Counts of "yes" and "no":**

In [182]:
explore_hd_general_grouped = (hd_general_train.groupby('heart_disease_presence').count())
explore_hd_general = explore_hd_general_grouped[["age"]].rename(columns={"age":"count"})
explore_hd_general

| heart_disease_presence | count |
|---|---|
| no | 120 |
| yes | 102 |


**Data Analysis**

To choose our best K for the classifier, we will standardize the data, create a pipeline, and perform cross-validation with GridSearchCV:

In [183]:
hd_preprocessor = make_column_transformer(
    (StandardScaler(), ["trestbps(systolic)", "max_heart_rate", "chol"]),
)

param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 30, 1),
}

hd_tune_pipe = make_pipeline(hd_preprocessor, KNeighborsClassifier())

knn_tune_grid = GridSearchCV(
    hd_tune_pipe, param_grid, cv=4,
)
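
To illustrate why the predictors are standardized before computing distances, the sketch below compares the nearest neighbour of one made-up patient before and after scaling. The values in `X_demo`, the helper `nearest_to_first`, and the use of only two features are assumptions for illustration, not our data or pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up patients: columns are systolic blood pressure (mm Hg) and cholesterol (mg/dl).
X_demo = np.array([
    [120.0, 200.0],  # reference patient
    [145.0, 205.0],  # very different blood pressure, similar cholesterol
    [121.0, 240.0],  # similar blood pressure, moderately different cholesterol
    [110.0, 140.0],
    [140.0, 320.0],
])


def nearest_to_first(X):
    """Return the row index of the point closest (Euclidean) to row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_neighbour = nearest_to_first(X_demo)
scaled_neighbour = nearest_to_first(StandardScaler().fit_transform(X_demo))

print(raw_neighbour, scaled_neighbour)  # 1 2
```

On the raw values, cholesterol has the larger spread and dominates the distance, so patient 1 looks closest. After standardization both features contribute on a comparable scale and patient 2 becomes the nearest neighbour.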

**Figure 12: Mean test scores:**

In [184]:
X_tune = hd_general_train[["trestbps(systolic)", "max_heart_rate", "chol"]]
y_tune = hd_general_train["heart_disease_presence"]

knn_model_grid = knn_tune_grid.fit(X_tune, y_tune)

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)
knn_model_grid
accuracies_grid

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.007621,0.001115,0.008789,0.00059,2,{'kneighborsclassifier__n_neighbors': 2},0.517857,0.535714,0.6,0.581818,0.558847,0.033304,28
1,0.006805,0.000106,0.008227,0.000168,3,{'kneighborsclassifier__n_neighbors': 3},0.607143,0.607143,0.654545,0.618182,0.621753,0.019462,18
2,0.006821,8.9e-05,0.008387,0.000376,4,{'kneighborsclassifier__n_neighbors': 4},0.589286,0.553571,0.654545,0.6,0.599351,0.036207,27
3,0.007442,0.000355,0.008448,0.000238,5,{'kneighborsclassifier__n_neighbors': 5},0.571429,0.642857,0.654545,0.654545,0.630844,0.034634,13
4,0.012263,0.005182,0.01561,0.006514,6,{'kneighborsclassifier__n_neighbors': 6},0.571429,0.607143,0.654545,0.636364,0.61737,0.031456,20
5,0.018262,0.000369,0.021675,0.000162,7,{'kneighborsclassifier__n_neighbors': 7},0.553571,0.625,0.618182,0.636364,0.608279,0.032246,23
6,0.018427,0.001417,0.040781,0.021582,8,{'kneighborsclassifier__n_neighbors': 8},0.553571,0.607143,0.636364,0.618182,0.603815,0.030827,25
7,0.033109,0.022026,0.035018,0.021099,9,{'kneighborsclassifier__n_neighbors': 9},0.607143,0.625,0.654545,0.672727,0.639854,0.025432,4
8,0.031965,0.022658,0.025407,0.009634,10,{'kneighborsclassifier__n_neighbors': 10},0.589286,0.660714,0.672727,0.636364,0.639773,0.031957,5
9,0.006266,0.000266,0.00893,0.003138,11,{'kneighborsclassifier__n_neighbors': 11},0.553571,0.642857,0.654545,0.672727,0.630925,0.045911,12


**Figure 13: Grid Search:**

In [185]:
accuracy_versus_k_grid = (
    alt.Chart(accuracies_grid, title="Grid Search")
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "param_kneighborsclassifier__n_neighbors",
            title="Neighbors",
            scale=alt.Scale(zero=False),
        ),
        y=alt.Y(
            "mean_test_score", 
            title="Mean Test Score", 
            scale=alt.Scale(zero=False)
        ),
    )
    .configure_axis(labelFontSize=10, titleFontSize=15)
    .properties(width=400, height=300)
)

accuracy_versus_k_grid


From the graph above, we can see that K=24 gives us the highest accuracy without a large drop or increase in the mean test score on either side, so we can be more confident that this high accuracy is not a fluke. So, we will use K=24 for our classification.
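
One possible way to make this kind of choice less dependent on reading the graph is to average each mean test score with its two neighbours and pick the K with the highest smoothed score. The sketch below is a heuristic under stated assumptions, not the procedure we used: the scores are hypothetical numbers (not the values from our grid search), and the names `scores`, `k_col`, `best_raw` and `best_stable` are illustrative.

```python
import pandas as pd

k_col = "param_kneighborsclassifier__n_neighbors"

# Hypothetical mean test scores, shaped like GridSearchCV's cv_results_ table.
scores = pd.DataFrame({
    k_col: [21, 22, 23, 24, 25, 26, 27],
    "mean_test_score": [0.62, 0.71, 0.63, 0.68, 0.69, 0.68, 0.64],
}).sort_values(k_col)

# Average each score with the scores at K-1 and K+1 (edges are left as NaN).
scores["smoothed"] = scores["mean_test_score"].rolling(window=3, center=True).mean()

# K with the single highest score versus K with the highest smoothed score.
best_raw = int(scores.loc[scores["mean_test_score"].idxmax(), k_col])
best_stable = int(scores.loc[scores["smoothed"].idxmax(), k_col])

print(best_raw, best_stable)  # 22 25
```

In this made-up example K=22 has the single highest score but sits between two much lower values, whereas K=25 has consistently high scores on both sides.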

Then, we will train the classifier and predict the labels.

**Figure 14: Heart disease presence predictions:**

In [186]:
knn = KNeighborsClassifier(n_neighbors=24) 

X = hd_general_train.loc[:, ["trestbps(systolic)", "max_heart_rate", "chol"]]
y = hd_general_train["heart_disease_presence"]

knn_fit = make_pipeline(hd_preprocessor, knn).fit(X, y)

hd_test_predictions = hd_general_test.assign(
    predicted = knn_fit.predict(hd_general_test.loc[:, ["trestbps(systolic)", "max_heart_rate", "chol"]])
)
hd_test_predictions[['heart_disease_presence', 'predicted']]

| | heart_disease_presence | predicted |
|---|---|---|
| 183 | no | no |
| 204 | no | no |
| 151 | no | yes |
| 218 | no | yes |
| 93 | no | no |
| ... | ... | ... |
| 177 | yes | yes |
| 75 | no | no |
| 291 | no | no |
| 188 | yes | no |


Our accuracy from this model is:

In [187]:
hd_acc = knn_fit.score(
    hd_general_test.loc[:, ["trestbps(systolic)", "max_heart_rate", "chol"]],
    hd_general_test["heart_disease_presence"]
)
hd_acc

0.7333333333333333

**Figure 15: Confusion matrix**

In [188]:
pd.crosstab(
    hd_test_predictions["heart_disease_presence"],
    hd_test_predictions["predicted"]
)

| heart_disease_presence | predicted no | predicted yes |
|---|---|---|
| no | 33 | 7 |
| yes | 13 | 22 |
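
The counts in the confusion matrix can be turned into a few summary metrics. The sketch below simply does the arithmetic on the four counts from Figure 15, treating "yes" as the positive class; the variable names are illustrative.

```python
# Counts from Figure 15 (rows are true labels, columns are predictions).
true_negative = 33   # true "no", predicted "no"
false_positive = 7   # true "no", predicted "yes"
false_negative = 13  # true "yes", predicted "no"
true_positive = 22   # true "yes", predicted "yes"

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
sensitivity = true_positive / (true_positive + false_negative)  # recall for "yes"
specificity = true_negative / (true_negative + false_positive)
precision = true_positive / (true_positive + false_positive)

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3), round(precision, 3))
# 0.733 0.629 0.825 0.759
```

The accuracy matches the 73% reported above, while the lower sensitivity shows that the classifier misses more patients with heart disease (13 false negatives) than it wrongly flags (7 false positives).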


**Discussion**

&emsp; &emsp; Through the process of building our classifier, we found that its accuracy is 73%. This is not what we expected: given our choice of predictors, we expected a much greater accuracy. Our prior research, which was based on previous clinical studies, suggested that systolic blood pressure, serum cholesterol levels and maximum achieved heart rate were effective indicators of a patient having cardiovascular disease. Thus, this outcome was unexpected. 

&emsp; &emsp; Looking at the Heart Disease Presence vs Maximum Heart Rate (BPM) bar chart made from the hd_mean2_ dataframe of the training data, we can see that there are distinguishable differences between the mean maximum heart rate of individuals without CVD and the mean maximum heart rates of those with CVD. In addition, looking at the scatter plots of Serum Cholesterol Level vs Maximum Heart Rate and Systolic Resting Blood Pressure vs Maximum Heart Rate, we can see that there are moderately distinct areas of orange (those with CVD) data points and blue (those without CVD) data points. The distribution of blue and orange data points in both plots shows that those without CVD have higher maximum heart rates, while those with CVD have lower maximum heart rates; in both groups, serum cholesterol levels and systolic resting blood pressures vary widely. This suggests that maximum heart rate is a useful predictor of whether or not a patient has CVD, because it is able to separate the data points into two groups.

&emsp; &emsp; However, serum cholesterol level and systolic resting blood pressure varied widely for cases both with and without CVD. The two groups did not seem to have different levels for these two predictors; people with CVD could have the same serum cholesterol level and systolic resting blood pressure as people without CVD. This means that these predictors are not as useful for predicting cases of CVD, because the differences in these values between people with and without CVD are not large enough. This observation can be seen in the scatter plot of serum cholesterol level vs systolic resting blood pressure: the blue points and orange points are not concentrated in two different areas but are spread out over the same areas, indicating that these two predictors may not be as useful for predicting whether a patient has CVD. Looking at the plots of serum cholesterol level vs maximum heart rate and systolic resting blood pressure vs maximum heart rate, it can be observed that maximum heart rate is the predictor which best separates the two groups of data points. 

&emsp; &emsp; In addition, the bar graphs of heart disease presence vs serum cholesterol level and heart disease presence vs systolic resting blood pressure show no major differences in the mean serum cholesterol level and the mean systolic resting blood pressure between those with CVD and those without CVD, leading again to the inference that serum cholesterol level and systolic resting blood pressure may not be as strong predictors of CVD as maximum heart rate. Therefore, with only one strong predictor for CVD classification, the unexpectedly low accuracy of the classifier can be rationalized. 

&emsp; &emsp; The dataset we used to build the classifier was from the Cleveland Clinic Foundation in the United States of America, whereas two of the three studies we used to choose our predictors (He et al.; Mao) were conducted in China, with the majority of the patients being Chinese. This is of significance because studies suggest that people from different countries have different systolic blood pressures, cholesterol levels and heart rates (Bathula et al. 92) as well as differences in lifestyle. Therefore, factors contributing to CVD in the population of the USA may be different from those in China as shown in the prior research. Thus, predictors which may be useful for the Chinese population, such as resting systolic blood pressure, may not be as useful for data from the American population, which may partly explain the lower than expected accuracy of our classifier. 

&emsp; &emsp; Another explanation for the lower accuracy of our classifier is that the dataset only has 303 data points; after rows with missing values were dropped and 25% of the data was held out for testing, only 222 data points remained for training (Figure 11). This number of data points may not be enough to train the classifier to classify a wide range of CVD or non-CVD cases based on various combinations of values for maximum heart rate, systolic resting blood pressure and serum cholesterol level. In other words, this might not be enough data to exhibit strong correlations across a wide range of combinations of values for the predictor variables, and thus may only exhibit weak correlations which are not enough to correctly classify unseen data.

&emsp; &emsp; A minor contributing factor to the lower accuracy is that the data is slightly imbalanced. In the training dataset, there are 120 data points for those without CVD and 102 data points for those with CVD. Looking back at the scatter plots of serum cholesterol level vs maximum heart rate, resting systolic blood pressure vs maximum heart rate, and serum cholesterol level vs resting systolic blood pressure, we can see that the data points for those with and without CVD are mixed together rather than concentrated in separate areas, especially along the resting systolic blood pressure and serum cholesterol axes. This indicates, as mentioned above, that these may not be useful predictors. Paired with the imbalance, this overlap makes the classifier more likely to produce false negatives, because data points from those without CVD have the greater influence on its predictions.
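
&emsp; &emsp; The size of this imbalance can be put in perspective with a majority-class baseline. The short Python sketch below is illustrative only and uses the two class counts quoted above; the helper name `majority_baseline` and the class labels are hypothetical. It computes the accuracy that a trivial rule would reach by always predicting the more common class.

```python
def majority_baseline(class_counts: dict[str, int]) -> tuple[str, float]:
    """Return the most common class and the share of rows that belong to it."""
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / sum(class_counts.values())


# Training-set class counts as quoted in the text.
train_counts = {"no CVD": 120, "CVD": 102}

label, share = majority_baseline(train_counts)
print(f"always predicting '{label}' would be right for {share:.1%} of the training rows")
```

With these counts the baseline is about 54%, so the imbalance is mild, and any accuracy should be read against that figure rather than against 50%.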

&emsp; &emsp; The 73% accuracy of the classifier makes it inadequate for use in a professional health care setting, because it can produce both false positive and false negative diagnoses. Neither outcome is desirable for hospitals or patients. Those who receive a false positive may be advised to undergo expensive and unnecessary medical treatment, which also wastes medical resources. Those who receive a false negative are left with health conditions that worsen because treatment is delayed. Thus, the accuracy of this classifier would have to be much higher for it to be trusted and to make a positive contribution in a medical setting.
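
&emsp; &emsp; Because false positives and false negatives carry different costs, it can help to look beyond overall accuracy. The Python sketch below shows how sensitivity and specificity are derived from a confusion matrix. The four counts are hypothetical, chosen only so that the overall accuracy lands near 73%; they are not results from this project, and the function name `confusion_metrics` is likewise an assumption.

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,    # share of all patients classified correctly
        "sensitivity": tp / (tp + fn),    # share of CVD patients correctly flagged
        "specificity": tn / (tn + fp),    # share of non-CVD patients correctly cleared
    }


# Hypothetical counts for illustration only (not taken from the project's results).
metrics = confusion_metrics(tp=25, fn=10, fp=10, tn=31)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a diagnostic setting, sensitivity is often the figure to watch most closely, since a low value means many patients with CVD would be sent home undiagnosed even when the overall accuracy looks acceptable.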

&emsp; &emsp; Although our classifier has a lower accuracy than expected, it does lead to several useful findings. Firstly, it suggests that the same classifier for CVD may not transfer well between countries, because each country may have different factors that contribute significantly to CVD due to differences in lifestyle and genetics. Different predictors may therefore be needed to build a classifier for each country. In addition, it suggests that for the USA, serum cholesterol level and resting systolic blood pressure may not be useful indicators for diagnosing CVD, as these predictors were chosen partly on the basis of research conducted in China. It can also be seen that indicators of heart disease may not follow straightforward and concrete trends, as shown by the serum cholesterol level and resting systolic blood pressure predictors. Therefore, it would be useful to have a dataset with many more data points than were used in this project, so that the classifier can be exposed to a wider variety of correlations and trends in predictor values for those who do and do not have CVD.

&emsp; &emsp; Future questions that this project can lead to include determining which predictors are the most useful for building classifiers to diagnose CVD in each country or continent. Research on this question could help increase the accuracy of CVD classifiers and potentially make the process of diagnosing patients more efficient. In addition, we used a single measurement of each predictor, but future projects could investigate how changes in these predictors over time are linked to the presence of CVD. For instance, if systolic blood pressure were found to be a strong predictor of CVD in a particular country other than the USA, it would be worth asking whether a person is more likely to have CVD if their systolic blood pressure increases quickly over a few years rather than gradually over decades.


**Citations**
\
\
He, L. et al. “Relationship of resting heart rate and blood pressure with all-cause and cardiovascular disease mortality.” Public Health, vol. 208, June 2022, pages 80-88, ScienceDirect. https://www.sciencedirect.com/science/article/pii/S0033350622001032. Accessed March 11, 2023.
\
\
Kannel, W. B. et al. “Factors of risk in the development of coronary heart disease--six year follow-up experience. The Framingham Study.” Annals of Internal Medicine, vol. 55, no. 1, July 1961, pages 33-50, National Library of Medicine. https://pubmed.ncbi.nlm.nih.gov/13751193/. Accessed March 11, 2023. 
\
\
Mao, Qunxia et al. “Heart rate influence on incidence of cardiovascular disease among adults in China.” International Journal of Epidemiology, vol. 39, no. 6, December 2010, pages 1638–1646, Oxford Academic. https://academic.oup.com/ije/article/39/6/1638/738860?login=false. Accessed March 11, 2023. 
\
\
Bathula, Rajaram et al. “Ethnic differences in heart rate: can these be explained by conventional cardiovascular risk factors?” Clinical Autonomic Research, vol. 18, no. 2, April 2008, pages 90-95, National Library of Medicine. https://pubmed.ncbi.nlm.nih.gov/18414771/. Accessed April 11, 2023.