# Group Project Proposal: Predicting Heart Disease

# Introduction

Coronary artery disease, a type of heart disease, is defined by >50% narrowing of a coronary vessel's diameter. Blood pressure is the force of blood against the walls of the arteries, measured in mmHg. Typically, a low maximum heart rate and a high resting blood pressure are associated with health risks, including a higher likelihood of heart disease (Sandvik et al., 1995; Wu et al., 2015). Thus, our group poses the following question: Can resting blood pressure (on admission to the hospital) and maximum heart rate achieved predict heart disease?

For our project, we are using the Heart Disease dataset from the UC Irvine Machine Learning Repository (Janosi et al., 1988). The database contains 76 attributes, but all published experiments use only a subset of 14. The data consists of both categorical and numeric variable types. There is very little missing data, and the variables we will be looking at do not contain any missing data. In this project, we will explore the data to see how different variables interact.

# 1. Import Libraries

In [91]:
#installing -U altair 

In [92]:
!pip install  -U altair



In [93]:
#pip installs and necessary libraries 
!pip install ucimlrepo



In this step, we'll import the following libraries: Altair, which provides statistical visualization for Python; Pandas, for cleaning, manipulating, and analyzing data; and NumPy, for efficient numerical operations on arrays.  
We'll also use the Scikit-learn library for its machine learning and statistical modelling capabilities.

In [94]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

In [95]:
from ucimlrepo import fetch_ucirepo 

# 2. Uploading the data

In [96]:
#uploading the data 

Here, we access the description of the dataset variables and other vital information for further manipulation and analysis. For example, by inspecting the `metadata` attribute, our group fetched details such as the demographic columns ("Age" and "Sex"), the repository URL, and the location of the data file.

In [97]:
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
  
# metadata 
heart_disease.metadata

{'uci_id': 45,
 'name': 'Heart Disease',
 'repository_url': 'https://archive.ics.uci.edu/dataset/45/heart+disease',
 'data_url': 'https://archive.ics.uci.edu/static/public/45/data.csv',
 'abstract': '4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach',
 'area': 'Health and Medicine',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 303,
 'num_features': 13,
 'feature_types': ['Categorical', 'Integer', 'Real'],
 'demographics': ['Age', 'Sex'],
 'target_col': ['num'],
 'index_col': None,
 'has_missing_values': 'yes',
 'missing_values_symbol': 'NaN',
 'year_of_dataset_creation': 1989,
 'last_updated': 'Fri Nov 03 2023',
 'dataset_doi': '10.24432/C52P4X',
 'creators': ['Andras Janosi',
  'William Steinbrunn',
  'Matthias Pfisterer',
  'Robert Detrano'],
 'intro_paper': {'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.',
  'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M.

***Fig.1***: Exploring Heart Disease Dataset: Fetching Data, Extracting Features (X) and Targets (y), and Examining Metadata Information.

In [98]:
# variable information 
heart_disease.variables.head(5)

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,age,Feature,Integer,Age,,years,no
1,sex,Feature,Categorical,Sex,,,no
2,cp,Feature,Categorical,,,,no
3,trestbps,Feature,Integer,,resting blood pressure (on admission to the ho...,mm Hg,no
4,chol,Feature,Integer,,serum cholestoral,mg/dl,no


***Table 1***: Variable information (name, role, type, units, and missing values) for the first 5 variables of the Heart Disease dataset

In [99]:
#set dataset url to a variable and read in the dataset 

In [100]:
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
url = "https://archive.ics.uci.edu/static/public/45/data.csv"
heart_disease = pd.read_csv(url)
heart_disease.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0


***Table 2***: First 5 rows of the original Heart Disease dataset used in this report

# 3. Cleaning data 

Selecting columns and cleaning/removing NAN

In [101]:
heart_disease.columns 

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

***Fig.2***: Heart Disease Dataset Columns: Displaying the Features and Targets Present in the Dataset.

Since we're mostly interested in the "trestbps", "chol", and "num" variables to answer our research question, we'll be filtering the data set to only include these columns.

In [102]:
heart_disease_filtered = heart_disease.loc[:,['trestbps', 'chol', 'num']]

In [103]:
heart_disease_filtered.head(5)

Unnamed: 0,trestbps,chol,num
0,145,233,0
1,160,286,2
2,120,229,1
3,130,250,0
4,130,204,0


***Table 3***: The loaded data set after it was filtered to contain only the resting blood pressure, serum cholesterol, and diagnosis values.

# 4. Data splitting 

In this step, we randomly split the data into a training set (75%) and a testing set (25%), using random state 123 so that the split is reproducible.

In [104]:
heart_traindata, heart_testdata = train_test_split(heart_disease_filtered, test_size=0.25, random_state=123)
heart_traindata.head(5)

Unnamed: 0,trestbps,chol,num
36,120,177,3
148,128,308,0
21,150,283,0
187,160,246,2
161,125,304,4


***Table 4***: First 5 rows of the training set from the filtered, split data set

In [105]:
heart_traindata, heart_testdata = train_test_split(heart_disease_filtered, test_size=0.25, random_state=123)
heart_testdata.head(5)

Unnamed: 0,trestbps,chol,num
11,140,294,0
292,120,169,2
269,130,180,0
268,152,223,1
94,135,252,0


***Table 5***: First 5 rows of the testing set from the filtered, split data set
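As a quick sanity check on the split, the expected set sizes can be worked out by hand. The sketch below assumes the test set size is rounded up and the training set receives the remaining rows; the names `n_rows`, `n_test`, and `n_train` are ours and do not appear in the notebook.

```python
import math

n_rows = 303       # rows in the full dataset, as reported by heart_disease.info()
test_size = 0.25   # fraction passed to train_test_split

# Assumed rounding rule: the test set is rounded up, training gets the rest.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under this assumption the split gives 227 training rows and 76 testing rows, which agrees with the entry counts that `info()` reports for the two sets later in the notebook.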

# 5. Exploring the Dataset 

In [106]:
#Exploring the data with plots

In this section of our analysis, we're plotting each of the two independent variables (resting blood pressure and cholesterol) against the dependent variable (heart disease diagnosis) to observe a visual relationship.

In [107]:
trestbps_plot = alt.Chart(heart_traindata).mark_point().encode(
    x = alt.X('trestbps').title('Resting bp (mmHg)'),
    y = alt.Y('num:N').title('Heart Disease Symptoms'),
    
).properties(title='Resting Blood Pressure vs. Heart Disease Symptoms')
trestbps_plot
#looks at the resting blood pressure upon entering the hospital and Heart Disease Symptoms

***Fig.3***: Scatter plot of resting blood pressure vs. heart disease symptoms

In [109]:
chol_plot = alt.Chart(heart_traindata).mark_point().encode(
    x = alt.X('chol').title('Serum Cholestoral (mg/dl)'),
    y = alt.Y('num:N').title('Heart Disease Symptoms'),
).properties(title='Serum Cholestoral (mg/dl) vs. Heart Disease Symptoms')
chol_plot

***Fig.4***: Scatter plot of serum cholesterol vs. heart disease symptoms

In [110]:
# also may be good to make 2 plots faceted with number of cases with 0  and the other as cases with num recorded between 1-4

In [111]:
summary_table=heart_traindata[["trestbps","chol","num"]].groupby("num").agg({
    "trestbps":"mean",
    "chol":"mean",
    "num":"count",
})
summary_table

Unnamed: 0_level_0,trestbps,chol,num
num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,128.639344,240.295082,122
1,132.512821,242.666667,39
2,136.785714,271.285714,28
3,137.148148,246.962963,27
4,139.0,261.454545,11


***Table 6***: Mean resting blood pressure, mean serum cholesterol, and number of observations for each class of num (the severity of heart disease symptoms) in the Heart Disease training data.

# 5.1 Describing the Variables in the data set 

In [112]:
heart_disease.info()
#preview the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  num       303 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 33.3 KB


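***Fig.5***: Summary information of the full Heart Disease dataset: data types and non-null counts for each column.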
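The non-null counts in the `info()` output above can be turned into missing-value counts to confirm that the variables we use are complete. This is a minimal sketch that copies a few counts from that output by hand; the dictionary and variable names are ours.

```python
n_rows = 303  # total entries reported by heart_disease.info()

# Non-null counts copied from the info() output for the columns of interest
non_null = {
    "trestbps": 303,
    "chol": 303,
    "thalach": 303,
    "ca": 299,
    "thal": 301,
    "num": 303,
}

# A column's missing count is the total row count minus its non-null count
missing = {column: n_rows - count for column, count in non_null.items()}
columns_with_gaps = [column for column, n in missing.items() if n > 0]

print(missing)
print(columns_with_gaps)
```

Only `ca` and `thal` have gaps (4 and 2 values respectively), so no rows need to be dropped for an analysis restricted to `trestbps`, `chol`, `thalach`, and `num`.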

In [113]:
#looking at the unique values in the target variable
heart_disease['num'].unique()

array([0, 2, 1, 3, 4])

***Fig.6***: the classes in the variable num "the presence of symptoms of Heart Disease"

In [114]:
#counts of how many observations are in each diagnosis class
heart_disease['num'].value_counts()

0    164
1     55
2     36
3     35
4     13
Name: num, dtype: int64

***Fig.7***: The number of observations in each diagnosis class

In [115]:
#exploring how balanced the data is by checking the percentage of observations in each class
100 * heart_disease.groupby('num').size() / heart_disease.shape[0]

num
0    54.125413
1    18.151815
2    11.881188
3    11.551155
4     4.290429
dtype: float64

***Fig.8***: The percentage of observations in each of the five original Heart Disease Symptoms classes (0-4)

From the counts we can see that the classes are unbalanced, which suggests we should either:
- A) combine classes 1-4 of num into a single class, so that the data is labelled 0 (no symptoms) or 1 (any symptoms of heart disease), or
- B) create more data to balance the data set.

In [116]:
chol_trestbps_plot = alt.Chart(heart_disease).mark_circle(opacity=0.7).encode(
    x=alt.X("trestbps").title("Resting bp (mmHg)"),
    y=alt.Y("chol").title("Serum Cholestoral (mg/dl)"),
    color=alt.Color("num:N").title("Symptoms of Heart Disease")
).properties(title='Serum Cholestoral vs. Resting Blood Pressure')
chol_trestbps_plot

***Fig.9***: Scatter Plot of the Relationship between Serum Cholestoral and Resting blood pressure Coloured based on the presence of Heart Disease symptoms. The higher the class number the more symptoms (0= no symptoms and 4= many symptoms)

From the plot we can see that the classes overlap heavily across trestbps and chol, so these two variables together may not be good predictors.

Therefore, we will go with option A and regroup the num variable into only two classes: 0 for no heart disease symptoms and 1 for symptoms present.
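To see what this regrouping does to the class balance before touching the data, the class counts reported earlier can be combined by hand. This is a small sketch using those counts; the variable names are ours.

```python
# Observations per original class, copied from the value_counts() output
counts = {0: 164, 1: 55, 2: 36, 3: 35, 4: 13}
total = sum(counts.values())

# Merge classes 1-4 into a single "symptoms present" class
binary_counts = {
    0: counts[0],
    1: sum(n for label, n in counts.items() if label != 0),
}
binary_pct = {label: round(100 * n / total, 1) for label, n in binary_counts.items()}

print(binary_counts)
print(binary_pct)
```

After merging, the two classes hold 164 and 139 observations (about 54.1% and 45.9%), which is far closer to balanced than the original five classes.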

# 5.2 Re-grouping of Data

In [117]:
heart_disease['num'] = heart_disease['num'].replace({
    2 : 1,
    3 : 1,
    4 : 1
})

In [118]:
#viewing the modified dataset 
heart_disease["num"]

0      0
1      1
2      1
3      0
4      0
      ..
298    1
299    1
300    1
301    1
302    0
Name: num, Length: 303, dtype: int64

***Fig.10***: Viewing the num variable of the modified dataset after re-grouping

In [119]:
#ensuring all values are changed to 0 or 1 for column "num"
heart_disease['num'].unique()

array([0, 1])

***Fig.11***: Unique values in the 'num' column of the heart_disease dataset after ensuring they are either 0 or 1.

In [120]:
#checking percentage of observations in each class
100 * heart_disease.groupby('num').size() / heart_disease.shape[0]

num
0    54.125413
1    45.874587
dtype: float64

***Fig.12***: Class distribution percentages in the 'heart_disease' dataset.

In [121]:
#data looks balanced now!
heart_disease['num'].value_counts()

0    164
1    139
Name: num, dtype: int64

***Fig.13***: Balanced distribution of classes in the 'heart_disease' dataset.

# 5.3 Choosing the two variables to use

In [122]:
#exploring relationship between chol and trest bps for diagnosis with new groupings
regrouped_plot = alt.Chart(heart_disease).mark_circle(opacity =0.7).encode(
    x=alt.X("trestbps").title("Resting blood pressure (mmHg)"),
    y=alt.Y("chol").title("Serum Cholestoral (mg/dl)"),
    color=alt.Color("num:N").title("Symptoms of Heart Disease")
).properties(title='Relationship between re-grouped Serum Cholestoral and Resting blood pressure')
regrouped_plot


***Fig.14***: Scatter Plot of the Relationship between re-grouped Serum Cholestoral and Resting blood pressure for diagnosis

Since the two classes still overlap heavily for the two variables used so far, we will make a correlation matrix to search for different variables.

In [123]:
#correlation matrix on the data frame
corr = heart_disease.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
age,1.0,-0.097542,0.104139,0.284946,0.20895,0.11853,0.148868,-0.393806,0.091661,0.203805,0.16177,0.362605,0.127389,0.22312
sex,-0.097542,1.0,0.010084,-0.064456,-0.199915,0.047862,0.021647,-0.048663,0.146201,0.102173,0.037533,0.093185,0.380936,0.276816
cp,0.104139,0.010084,1.0,-0.036077,0.072319,-0.039975,0.067505,-0.334422,0.38406,0.202277,0.15205,0.233214,0.265246,0.414446
trestbps,0.284946,-0.064456,-0.036077,1.0,0.13012,0.17534,0.14656,-0.045351,0.064762,0.189171,0.117382,0.098773,0.133554,0.150825
chol,0.20895,-0.199915,0.072319,0.13012,1.0,0.009841,0.171043,-0.003432,0.06131,0.046564,-0.004062,0.119,0.014214,0.085164
fbs,0.11853,0.047862,-0.039975,0.17534,0.009841,1.0,0.069564,-0.007854,0.025665,0.005747,0.059894,0.145478,0.071358,0.025264
restecg,0.148868,0.021647,0.067505,0.14656,0.171043,0.069564,1.0,-0.083389,0.084867,0.114133,0.133946,0.128343,0.024531,0.169202
thalach,-0.393806,-0.048663,-0.334422,-0.045351,-0.003432,-0.007854,-0.083389,1.0,-0.378103,-0.343085,-0.385601,-0.264246,-0.279631,-0.417167
exang,0.091661,0.146201,0.38406,0.064762,0.06131,0.025665,0.084867,-0.378103,1.0,0.288223,0.257748,0.14557,0.32968,0.431894
oldpeak,0.203805,0.102173,0.202277,0.189171,0.046564,0.005747,0.114133,-0.343085,0.288223,1.0,0.577537,0.295832,0.341004,0.42451


***Fig.15***: Correlation Matrix: Visualizing the Relationship Between Features in the Heart Disease Dataset Using a Coolwarm Color Gradient.
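One way to read the matrix is to rank the features by the absolute value of their correlation with `num`. The sketch below copies the `num` column from the rows visible in the output above (the rows for `slope`, `ca`, and `thal` are cut off in the output, so they are left out); the variable names are ours.

```python
# Correlation of each visible feature with num, copied from the matrix above
corr_with_num = {
    "age": 0.22312,
    "sex": 0.276816,
    "cp": 0.414446,
    "trestbps": 0.150825,
    "chol": 0.085164,
    "fbs": 0.025264,
    "restecg": 0.169202,
    "thalach": -0.417167,
    "exang": 0.431894,
    "oldpeak": 0.42451,
}

# Sort by strength of the linear relationship, ignoring its sign
ranked = sorted(corr_with_num.items(), key=lambda item: abs(item[1]), reverse=True)
for feature, r in ranked:
    print(f"{feature:>8}  {r:+.3f}")
```

Among the visible rows, `thalach` is one of the features most strongly (and negatively) correlated with `num`, while `chol` is one of the weakest, which motivates trying `thalach` in place of `chol`.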

In [124]:
clean_heart_disease= heart_disease 
clean_heart_disease = clean_heart_disease.drop(['age', "sex","slope", "oldpeak","exang", "cp", "fbs", "restecg", "ca"], axis = 1)
clean_heart_disease

Unnamed: 0,trestbps,chol,thalach,thal,num
0,145,233,150,6.0,0
1,160,286,108,3.0,1
2,120,229,129,7.0,1
3,130,250,187,3.0,0
4,130,204,172,3.0,0
...,...,...,...,...,...
298,110,264,132,7.0,1
299,144,193,141,7.0,1
300,130,131,115,7.0,1
301,130,236,174,3.0,1


***Table 6***: Cleaned Heart Disease Dataset: A Subset of the Original Dataset with Selected Columns Dropped for Simplified Analysis.

In [125]:
#looking at other combos starting with thalach
#exploring relationship between chol and thalach for diagnosis

thal_chol_plot = alt.Chart(clean_heart_disease).mark_circle(opacity =0.7).encode(
    x=alt.X("thalach").title("Maximum heart rate achieved (bpm)"),
    y=alt.Y("chol").title("Serum Cholestoral (mg/dl)"),
    color=alt.Color("num:N").title("Heart Disease Symptoms")
).properties(title='Relationship between Serum Cholestoral and Maximum heart rate achieved')
thal_chol_plot


***Fig. 16***: Scatter Plot of the Relationship between Serum Cholesterol and Maximum Heart Rate Achieved colour-coded by presence/absence Heart Disease symptoms.

In [127]:
#looking at other combos starting with thalach
#exploring relationship between thalach and trestbps for diagnosis

thal_trestbps_plot = alt.Chart(clean_heart_disease).mark_circle(opacity =0.7).encode(
    x=alt.X("thalach").title("Maximum heart rate achieved (bpm)"),
    y=alt.Y("trestbps").title("Resting blood pressure (mmHg)"),
    color=alt.Color("num:N").title("Heart Disease Symptoms")
).properties(title='Relationship between the Maximum heart rate and Resting blood pressure')
thal_trestbps_plot


***Fig.17***: Scatter Plot of the Relationship between the Maximum heart rate and Resting blood pressure colour-coded by presence/absence Heart Disease symptoms.

We will go with thalach (Maximum Heart Rate Achieved) and trestbps (Resting Blood Pressure), as the two classes seemed most distinct for this pair among the 3 combinations tested (chol vs. trestbps, chol vs. thalach, and thalach vs. trestbps).

# 5.4 Preparing the thalach and trestbps variables

In [128]:
#splitting data into training data and test data 
#split randomly with state 123
heart_traindata, heart_testdata = train_test_split(heart_disease, test_size=0.25, random_state=123)

In [129]:
heart_traindata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 227 entries, 36 to 98
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       227 non-null    int64  
 1   sex       227 non-null    int64  
 2   cp        227 non-null    int64  
 3   trestbps  227 non-null    int64  
 4   chol      227 non-null    int64  
 5   fbs       227 non-null    int64  
 6   restecg   227 non-null    int64  
 7   thalach   227 non-null    int64  
 8   exang     227 non-null    int64  
 9   oldpeak   227 non-null    float64
 10  slope     227 non-null    int64  
 11  ca        223 non-null    float64
 12  thal      226 non-null    float64
 13  num       227 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 26.6 KB


***Fig.18***: Summary Information of the Heart Disease Training Data: Displaying Data Types and Non-null Counts for Each Attribute.

In [130]:
heart_testdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76 entries, 11 to 179
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       76 non-null     int64  
 1   sex       76 non-null     int64  
 2   cp        76 non-null     int64  
 3   trestbps  76 non-null     int64  
 4   chol      76 non-null     int64  
 5   fbs       76 non-null     int64  
 6   restecg   76 non-null     int64  
 7   thalach   76 non-null     int64  
 8   exang     76 non-null     int64  
 9   oldpeak   76 non-null     float64
 10  slope     76 non-null     int64  
 11  ca        76 non-null     float64
 12  thal      75 non-null     float64
 13  num       76 non-null     int64  
dtypes: float64(3), int64(11)
memory usage: 8.9 KB


***Fig.19***: Summary Information of the Heart Disease Testing Data: Displaying Data Types and Non-null Counts for Each Attribute.

# 6. Classification with K-nearest neighbors- Building/Training the classifier

# 6.1 Creating and fitting the model 

In this step, we're aiming to build and fit the initial model using k-nearest neighbours (KNN) classifier. 

In [131]:
#build the initial model with 80 neighbors
#80 is only a provisional choice; the number of neighbours is tuned in section 6.4
hknn = KNeighborsClassifier(n_neighbors=80) 

X = heart_traindata[["trestbps", "thalach"]]
y = heart_traindata["num"]
hknn

***Fig.20***: Building and Fitting an Initial k-Nearest Neighbors Model: Utilizing 80 Neighbors for Classification Based on Resting Blood Pressure ('trestbps') and Maximum Heart Rate Achieved ('thalach').
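To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of the k-nearest neighbours vote. It is not scikit-learn's implementation. It uses the (trestbps, thalach) values and re-grouped labels of the first five rows of the dataset, an unscaled Euclidean distance for simplicity, and a hypothetical query patient that is not in the data. The function name `knn_predict` is ours.

```python
import math
from collections import Counter

# (trestbps, thalach) and re-grouped num for the first five rows of the dataset
train_points = [(145, 150), (160, 108), (120, 129), (130, 187), (130, 172)]
train_labels = [0, 1, 1, 0, 0]


def knn_predict(query, points, labels, k):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(query, point), label) for point, label in zip(points, labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


query_patient = (140, 120)  # hypothetical patient, not taken from the dataset
prediction = knn_predict(query_patient, train_points, train_labels, k=3)
print(prediction)
```

With k=3, two of the three closest rows carry label 1, so the hypothetical patient is assigned class 1. In the notebook the features are standardized first, which changes which neighbours count as closest.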

# 6.2 Preprocessing the data 

Next, we standardize our two predictors so that they are on the same scale. K-nearest neighbours relies on distances between observations, so without scaling the variable with the larger range would dominate the classification.

In [132]:
#preprocess the data
heart_preprocessor = make_column_transformer(
    (StandardScaler(), ["trestbps","thalach"]),
    remainder = "passthrough",
    verbose_feature_names_out=False
)
heart_preprocessor

***Fig.21***: Data Preprocessing: Standard Scaling Applied to 'trestbps' and 'thalach' Features, with Remaining Features Passed Through Unaltered.
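As a worked example of what standard scaling does, the sketch below standardizes the five `trestbps` values from the first rows of the dataset by hand. It uses the population standard deviation, which is what `StandardScaler` uses; the variable names are ours, and the notebook itself scales the full training set rather than five rows.

```python
from statistics import mean, pstdev

trestbps = [145, 160, 120, 130, 130]  # first five rows of the dataset

center = mean(trestbps)    # mean of the column
spread = pstdev(trestbps)  # population standard deviation (divides by n)

# z-score: how many standard deviations each value lies from the mean
z_scores = [(value - center) / spread for value in trestbps]

print(center, spread)
print([round(z, 3) for z in z_scores])
```

For these five values the mean is 137 and the standard deviation is 14, so a reading of 130 mmHg becomes -0.5. After scaling, the column has mean 0 and unit variance, which puts it on the same footing as `thalach`.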

# 6.3 Putting it together in a Pipeline 

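Using Scikit-learn's `make_pipeline` function, we chain the preprocessor and the classifier into a single pipeline and fit it to our training data. This ensures that the scaling learned from the training data is applied in the same way to any data we later predict on.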

In [133]:
hknn_fit = make_pipeline(heart_preprocessor, hknn).fit(X, y)
hknn_fit

***Fig.22***: Model Training: Constructing a Pipeline with Preprocessing Steps and Fitting a k-Nearest Neighbors Classifier to the Heart Disease Training Data.

# 6.4 Choosing the optimal k-value 

Cross-validation involves dividing the training data into subsets to assess the classifier's performance comprehensively. The objective is to compare various parameter values and identify the optimal one. This process includes splitting the training data into two parts, one for actual training and the other for evaluation, referred to as the validation set.
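The repeated splitting can be sketched as follows. This is a simplified illustration with contiguous, unshuffled folds; scikit-learn's default for classifiers additionally stratifies the folds by class, which is not reproduced here. The function name `k_fold_indices` is ours, and 170 is the number of sub-training rows that a 75/25 split of the 227 training rows produces.

```python
def k_fold_indices(n_rows, n_folds):
    """Return (training, validation) row-index lists for each fold."""
    # Spread any remainder over the first folds so that sizes differ by at most 1
    fold_sizes = [
        n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
        for i in range(n_folds)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n_rows) if i < start or i >= start + size]
        folds.append((training, validation))
        start += size
    return folds


folds = k_fold_indices(n_rows=170, n_folds=5)
for training, validation in folds:
    print(len(training), len(validation))
```

Each row serves as validation data exactly once, and every fold trains on the remaining rows, so the accuracy estimate does not depend on a single lucky or unlucky split.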

# 6.4.1 Cross-Validation

In [134]:
#cross validate
# split the *training data* 75/25 into sub-training and validation sets
heart_subtrain, heart_validation = train_test_split(
    heart_traindata, test_size=0.25, random_state=123)

# fit the model on the sub-training data with 10 neighbors
hknn = KNeighborsClassifier(n_neighbors= 10) 
X = heart_subtrain[["trestbps", "thalach"]]
y = heart_subtrain["num"]
hknn_fit = make_pipeline(heart_preprocessor, hknn).fit(X, y)

# compute the score on validation data
acc = hknn_fit.score(
    heart_validation[["trestbps", "thalach"]],
    heart_validation["num"]
)
acc
#0.7368421052631579
#so accuracy is 73.7%

0.7368421052631579

In [135]:
#5 fold cross-validate 
hknn = KNeighborsClassifier(n_neighbors=3) 
heart_pipe = make_pipeline(heart_preprocessor, hknn)

cv_5_df = pd.DataFrame(
    cross_validate(
        estimator=heart_pipe,
        cv=5,
        X=X,
        y=y
    )
)

cv_5_df

Unnamed: 0,fit_time,score_time,test_score
0,0.014124,0.01758,0.5
1,0.006725,0.005512,0.705882
2,0.007205,0.005334,0.705882
3,0.006945,0.005383,0.529412
4,0.006171,0.005168,0.617647


***Table 7***: 5-Fold Cross-Validation Results: Assessing the Performance of the Heart Disease Classifier Using a k-Nearest Neighbors Model with 3 Neighbors.

In the first cell of this section, we split the training data 75/25 into sub-training and validation sets. The resulting score of 73.7% is the accuracy of the 10-neighbour classifier on that single validation set. The 5-fold cross-validation in Table 7 instead averages over five such splits, here using 3 neighbours.

In [136]:
cv_5_metrics = cv_5_df.agg(['mean', 'sem'])
cv_5_metrics

Unnamed: 0,fit_time,score_time,test_score
mean,0.008234,0.007795,0.611765
sem,0.001482,0.002447,0.043026


***Table 8***: Aggregated Metrics from 5-Fold Cross-Validation: Calculating the Mean and Standard Error (SEM) Across Performance Metrics for the Heart Disease Classifier.
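The aggregated values can be reproduced by hand from the five fold scores reported in the 5-fold cross-validation output. The sketch below assumes the standard error is the sample standard deviation divided by the square root of the number of folds, which is how pandas computes `sem`; the variable names are ours.

```python
from math import sqrt
from statistics import mean, stdev

# test_score column of the 5-fold cross-validation output
fold_scores = [0.5, 0.705882, 0.705882, 0.529412, 0.617647]

mean_score = mean(fold_scores)
# Standard error of the mean: sample standard deviation / sqrt(number of folds)
sem_score = stdev(fold_scores) / sqrt(len(fold_scores))

print(round(mean_score, 6), round(sem_score, 6))
```

The standard error of roughly 0.043 is large relative to the differences between neighbouring k-values, so small differences in mean accuracy should be read with caution.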

In [137]:
#testing if 10 is better for fold
cv_10 = pd.DataFrame(
    cross_validate(
        estimator=heart_pipe,
        cv=10,
        X=X,
        y=y
    )
)

cv_10_df = pd.DataFrame(cv_10)
cv_10_metrics = cv_10_df.agg(['mean', 'sem'])
cv_10_metrics
#10 folds gave a slightly lower standard error than 5 folds

Unnamed: 0,fit_time,score_time,test_score
mean,0.006938,0.005414,0.6
sem,0.000351,0.000285,0.039992


***Table 9***: Cross-Validation Results with 10 Folds: Evaluating the Mean and Standard Error of Model Metrics, Considering the Performance of the Heart Disease Classifier.

# 6.4.2 Parameter value selection

In [138]:
#parameter value selection 
knn = KNeighborsClassifier()
heart_tune_pipe = make_pipeline(heart_preprocessor, knn)

heart_tune_pipe.get_params()

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 30, 1),
}

heart_tune_grid = GridSearchCV(
    estimator=heart_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)
heart_tune_grid

***Fig.23***: Parameter Value Selection: Performing Grid Search with 10-Fold Cross-Validation to Optimize the Number of Neighbors for the Heart Disease Classifier.

In [139]:
#Extract and visualize accuracy vs. neighbors for a smaller parameter grid
accuracies_grid = pd.DataFrame(
    heart_tune_grid.fit(
        heart_traindata[["trestbps", "thalach"]],
        heart_traindata["num"]
    ).cv_results_
)
accuracies_grid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 19 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   mean_fit_time                            29 non-null     float64
 1   std_fit_time                             29 non-null     float64
 2   mean_score_time                          29 non-null     float64
 3   std_score_time                           29 non-null     float64
 4   param_kneighborsclassifier__n_neighbors  29 non-null     object 
 5   params                                   29 non-null     object 
 6   split0_test_score                        29 non-null     float64
 7   split1_test_score                        29 non-null     float64
 8   split2_test_score                        29 non-null     float64
 9   split3_test_score                        29 non-null     float64
 10  split4_test_score                        29 non-null

***Fig.24***: Accuracy vs. Neighbors: Extracting and Visualizing Model Performance Across a Smaller Parameter Grid Using Grid Search and 10-Fold Cross-Validation.

In [140]:
#Extract and visualize accuracy vs. neighbors for a smaller parameter grid
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)
accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.545257,0.025015
1,2,0.540711,0.028451
2,3,0.620356,0.030874
3,4,0.593676,0.023255
4,5,0.643083,0.020684
5,6,0.638735,0.017791
6,7,0.669763,0.024999
7,8,0.66502,0.025942
8,9,0.669368,0.02677
9,10,0.674111,0.024608


***Table 10***: Refined Accuracy vs. Neighbors: Extracting and Preparing Data for Visualizing Model Performance Across a Smaller Parameter Grid Using Grid Search and 10-Fold Cross-Validation.

In [141]:
#Extract and visualize accuracy vs. neighbors for a smaller parameter grid
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(domain=(0.5, 0.8))
        .title("Accuracy estimate")
).properties(title='line plot of estimated accuracy versus the number of neighbors')

accuracy_vs_k

***Fig.25***: Visualizing Accuracy vs. Neighbors: Line Plot Illustrating the Estimated Accuracy Across a Smaller Parameter Grid Using Grid Search and 10-Fold Cross-Validation.

***The optimal k-value is 25 neighbours.***
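Rather than reading the best k-value off the plot, it can also be taken from the fitted grid search object. The sketch below is one possible way to do this and is not the notebook's own code: it runs on synthetic two-feature data generated with `make_classification` (standing in for `trestbps` and `thalach`, not real patients), and the names `demo_grid`, `best_k`, and `test_accuracy` are ours.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two numeric features and a binary label
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

demo_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 30)},
    cv=10,
)
demo_grid.fit(X_train, y_train)

# best_params_ holds the k with the highest mean cross-validation accuracy
best_k = demo_grid.best_params_["kneighborsclassifier__n_neighbors"]
# With the default refit=True, score() uses the pipeline refitted with best_k
test_accuracy = demo_grid.score(X_test, y_test)

print(best_k, test_accuracy)
```

Because the grid search refits the best pipeline on the whole training set by default, the fitted object can be used directly for prediction on the test set with the tuned number of neighbours.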

# 7. Classification with K-nearest neighbors- Evaluating the model

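Here, we evaluated the model's accuracy on the test set by comparing its predictions against the recorded observations. We then computed the accuracy in two ways: directly from the predictions on the test set, and with the score method. Note that `hknn_fit` in the cells below is the pipeline fitted in Section 6.4.1 (10 neighbours, fitted on the sub-training data), not a pipeline refitted with the tuned k-value.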

In [142]:
#Evaluate model accuracy on the test set
#predict the labels in the test set 
heart_test_predictions = heart_testdata.assign(
    predicted = hknn_fit.predict(heart_testdata[["trestbps", "thalach"]])
)
heart_test_predictions[['num', 'predicted']].head(5)

Unnamed: 0,num,predicted
11,0,0
292,1,1
269,0,1
268,1,0
94,0,0


***Table 11***: Model Evaluation on Test Set: Predicted Labels vs. Actual Labels for the First 5 Samples in the Heart Disease Test Data.

In [143]:
#Evaluate model accuracy on the test set
#check performance
correct_preds = heart_test_predictions[
    heart_test_predictions['num'] == heart_test_predictions['predicted']
]

correct_preds.shape[0] / heart_test_predictions.shape[0]

0.7368421052631579

In [144]:
#Evaluate model accuracy using score method
heart_acc_1 = hknn_fit.score(
    heart_testdata[["trestbps", "thalach"]],
    heart_testdata["num"]
)
heart_acc_1

0.7368421052631579

Accuracy is 73.7%

In [61]:
pd.crosstab(
    heart_test_predictions["num"],
    heart_test_predictions["predicted"]
)

predicted,0,1
num,Unnamed: 1_level_1,Unnamed: 2_level_1
0,32,10
1,10,24


***Table 12***: Confusion Matrix: Cross-tabulation of Predicted vs. Actual Diagnosis Categories in the Heart Disease Test Set.

The confusion matrix shows that 32 observations were correctly predicted as 0 (no diagnosis) and 24 were correctly predicted as 1. It also shows that the classifier wrongly classified 10 observations as 1 when they were truly 0, and 10 as 0 when they were truly 1.


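Accuracy = (32 + 24) / 76 = 73.7%. Treating class 0 (no symptoms) as the positive label, precision = 32 / 42 = 76.2% and recall = 32 / 42 = 76.2%. Treating class 1 (symptoms present) as the positive label, precision = 24 / 34 = 70.6% and recall = 24 / 34 = 70.6%.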
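These metrics can be recomputed directly from the four cells of the confusion matrix. The sketch below does so for either choice of positive label: 76.2% corresponds to treating class 0 as positive, while treating class 1 (symptoms present) as positive gives 70.6%. The helper name `precision_recall` is ours.

```python
# Confusion matrix cells keyed by (actual, predicted), copied from the crosstab
confusion = {(0, 0): 32, (0, 1): 10, (1, 0): 10, (1, 1): 24}


def precision_recall(confusion, positive):
    """Return (precision, recall) when `positive` is treated as the positive label."""
    true_pos = confusion[(positive, positive)]
    false_pos = sum(
        n for (actual, predicted), n in confusion.items()
        if predicted == positive and actual != positive
    )
    false_neg = sum(
        n for (actual, predicted), n in confusion.items()
        if actual == positive and predicted != positive
    )
    return true_pos / (true_pos + false_pos), true_pos / (true_pos + false_neg)


accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / sum(confusion.values())

print(f"accuracy: {accuracy:.3f}")
for label in (0, 1):
    precision, recall = precision_recall(confusion, label)
    print(f"positive={label}: precision={precision:.3f}, recall={recall:.3f}")
```

Precision and recall coincide here only because the two kinds of error happen to be equally frequent (10 each); in general they differ.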

In [62]:
results_plot = alt.Chart(heart_test_predictions).mark_circle(opacity =0.7).encode(
    x=alt.X("thalach").title("Maximum heart rate achieved (bpm)"),
    y=alt.Y("trestbps").title("Resting blood pressure (mmHg)"),
    color=alt.Color("num:N").title("Diagnosis")
).properties(title='Relationship between the Maximum heart rate and Resting blood pressure')
results_plot

***Fig 26***: Result Plot: Visualizing the Relationship Between Maximum Heart Rate and Resting Blood Pressure, Color-Coded by Heart Disease Symptoms.

# Discussion

Our analysis focused on predicting heart disease using two key variables: resting blood pressure (trestbps) and maximum heart rate achieved (thalach). Although we began our exploratory data analysis by looking at resting blood pressure and serum cholesterol (chol), we found that the diagnosis classes overlapped considerably for these variables (see Fig. 9), and thus switched to resting blood pressure (trestbps) and maximum heart rate achieved (thalach) instead. Furthermore, in the original dataset, heart disease (num) was grouped into 5 different categories (on a scale of 0-4), with a higher number indicating higher severity. However, because of the uneven distribution between categories, we decided to group 1-4 together. Therefore, we classified observations into one of two categories: no heart disease symptoms (0) and heart disease symptoms of any severity (1) (see Fig. 17).

Our results indicated a correlation between the chosen features and the likelihood of heart disease. In the correlation matrix (Fig. 15), resting blood pressure is weakly positively correlated with heart disease (r ≈ 0.15), and maximum heart rate achieved is moderately negatively correlated with heart disease (r ≈ -0.42). These results are consistent with our initial hypothesis. Overall, our analysis suggests that these features carry some predictive signal for heart disease, with maximum heart rate achieved being the stronger of the two.

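Furthermore, the grid search with 10-fold cross-validation indicated that the optimal k-value is 25 neighbours. The pipeline we evaluated on the test set (`hknn_fit`, fitted with 10 neighbours in Section 6.4.1) achieved an accuracy score of 73.7%, meaning that our classifier predicted 73.7% of test observations correctly. To improve our classifier, we could add more observations.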

These findings could influence future medical practice, patient education, preventative medicine, and research funding. Some future questions include: 
- What are the predictors of high blood pressure and low maximum heart rate? 
- How can high blood pressure and low maximum heart rate be prevented? 
- What biological mechanism underlies the relationship between these factors? 
- Does the duration of high blood pressure and low maximum heart rate impact heart disease? 
- Are medications aimed at treating high blood pressure and low maximum heart rate effective in preventing heart disease? 
- What other factors play a role in this relationship, such as age and sex?

# Conclusion

In summary, our research aimed to uncover whether resting blood pressure and maximum heart rate can predict the presence of heart disease. Having completed this exploratory data analysis, we conclude that there is a positive relationship between resting blood pressure and heart disease. In contrast, a negative correlation was observed between maximum heart rate and heart disease, which was congruent with our hypothesis. In the machine learning and statistical modeling portion of our analysis, our classifier achieved an accuracy score of 73.7%. While our analysis is limited to a small number of observations (303), it sheds light on the influential, ever-changing landscape of clinical research, which can shape the lives of those affected by cardiovascular diseases. With our project, we're eager to see the future of data science in medicine.

# References

1. Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). Heart Disease. *UCI Machine Learning Repository*. doi: 10.24432/C52P4X.
2. Sandvik, L., Erikssen, J., Ellestad, M., Erikssen, G., Thaulow, E., Mundal, R., & Rodahl, K. (1995). Heart rate increase and maximal heart rate during exercise as predictors of cardiovascular mortality: a 16-year follow-up study of 1960 healthy men. *Coronary Artery Disease*. doi: 10.1097/00019501-199508000-00012.
3. Wu, C. Y., Hu, H. Y., Chou, Y. J., Huang, N., Chou, Y. C., & Li, C. P. (2015). High Blood Pressure and All-Cause and Cardiovascular Disease Mortalities in Community-Dwelling Older Adults. *Medicine (Baltimore)*. doi: 10.1097/MD.0000000000002160.