# Group Project Report

## Introduction

Heart disease is increasingly common and one of the leading causes of death around the world. As algorithms and modeling become more accessible and sophisticated, they can be used to predict and diagnose heart disease. Here, we create a simple model using the k-nearest neighbors (KNN) algorithm to predict the type of chest pain in individuals being tested for heart disease. Our question is: how accurately can a k-nearest neighbors model predict the type of chest pain?

The dataset we will use is “processed.cleveland.data” from the UCI Machine Learning Repository. The data were collected between May 1981 and September 1984 and stem from the angiography results of 303 patients at the Cleveland Clinic in Cleveland, Ohio. The average age of study participants was 54, and 206 of the 303 participants were men (“International application of a new probability algorithm”). Similar work exists: in Detrano et al.’s International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease, the same dataset, along with similar ones, was used to train an algorithm to predict heart disease.


## Methods and Results

As mentioned previously, we will develop a model using this dataset to predict the type of chest pain that an individual is experiencing. The potential types of chest pain are typical angina, atypical angina, non-anginal pain, and asymptomatic pain. Our model will use the k-nearest neighbors algorithm with age, sex, resting electrocardiogram results (restecg), systolic blood pressure (trestbps), and cholesterol (chol) as predictors.

Age and sex were chosen because the risk of cardiovascular disease (CVD) differs between the sexes and rises with age: women report a greater risk of developing CVD than men of the same age group, and the risk increases with age in both sexes (Rodgers et al., 2019). The American Heart Association reported that in the US, the likelihood of developing CVD for men and women is ~40% between the ages of 40 and 59, ~75% between the ages of 60 and 79, and ~86% above the age of 80. By incorporating age and sex, we can better estimate an individual’s likelihood of developing CVD (Rodgers et al., 2019).

The electrocardiogram (ECG) at rest provides information about heart rate, rhythm, and potential heart enlargement, making it a useful tool for investigating symptoms (Electrocardiogram, 2023). An ECG measures the heart's activity by recording the electrical impulses that travel through the heart (Electrocardiogram, 2023). The path of these impulses can be tracked and used to determine whether the activity is regular or irregular. An ECG is often performed when a patient experiences chest pain, and the result can be correlated with numerous heart conditions (Electrocardiogram, 2023).

Another factor we selected is blood pressure (BP). High BP is one of the most important risk factors for CVD, the leading cause of mortality (Wu et al., 2015). Heart disease is associated with elevated systolic blood pressure, which damages arteries and impedes blood flow to the heart muscle. High BP is estimated to affect 65% of individuals over the age of 60 (Wu et al., 2015).

Additionally, abnormal levels of serum cholesterol in the blood increase the risk of heart disease by promoting the development of fatty deposits in blood vessels, which can hinder flow through arteries (Carson, 2023). Although the studies conducted do not indicate a strong association with CVD risk, we still include cholesterol as a variable because of the relationship between dietary cholesterol and heart health (Carson, 2023).

Let's load the required packages and set a seed so that the project is reproducible.

In [1]:


library(tidyverse)
library(tidymodels)


set.seed(777)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

We first need to read the dataset from the web. Because the file has no header row, the columns are given the default names X1 to X14, so we rename them to make the data more informative. This is done below:

In [2]:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cleveland_data <- read_csv(url, col_names = FALSE)
cleveland_data <- rename(cleveland_data,
       age = X1, 
       sex = X2, 
       cp = X3,
       trestbps = X4, 
       chol = X5, 
       fbs = X6, 
       restecg = X7, 
       thalach = X8, 
       exang = X9, 
       oldpeak = X10, 
       slope = X11, 
       ca = X12, 
       thal = X13, 
       num = X14)


Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Next, we convert columns to the appropriate variable type. Since cp (chest pain type) is categorical, we use the as_factor function to convert that column to a factor.

In [3]:
cd_mutate <- cleveland_data |>
    mutate(cp = as_factor(cp))
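
The numeric codes in cp can be hard to read in tables and plots. The following is a minimal sketch of attaching descriptive labels, assuming the coding given in the UCI documentation for this dataset (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic). The vector `cp_codes` is a made-up sample for illustration, not rows taken from the data.

```r
# Hypothetical sample of chest pain codes (not taken from the dataset)
cp_codes <- c(1, 4, 4, 3, 2)

# Assumed coding from the UCI dataset documentation
cp_labels <- c("typical angina", "atypical angina",
               "non-anginal pain", "asymptomatic")

# factor() pairs each value in `levels` with the label in the same position
cp_factor <- factor(cp_codes, levels = 1:4, labels = cp_labels)

print(cp_factor)
print(table(cp_factor))
```

The rest of the analysis keeps the numeric codes as factor levels, so this relabelling is optional and only affects how results are displayed.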



The heart disease dataset uses "?" in place of NA for missing values, which could be difficult to deal with later on. We can replace every "?" with NA as shown below.

In [4]:
cd_mutate[cd_mutate == "?"] <- NA
cd_mutate

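| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
| 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
| 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
| 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
| 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |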
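
An alternative to replacing "?" after loading is to declare it as a missing-value marker when the file is read, so that affected columns are parsed as numbers rather than text. The sketch below shows the idea with the `na` argument of `read_csv` on a small inline sample. The first two rows are copied from the table above; the third row is hypothetical and exists only to include a "?".

```r
library(readr)

# Inline sample in the same layout as the data file.
# Rows 1-2 come from the table above; row 3 is made up to contain a "?".
raw_text <- paste(
    "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0",
    "67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2",
    "50,0,3,120,200,0,0,150,0,0.0,1,?,3.0,0",
    sep = "\n"
)

# I() tells readr to treat the string as literal data rather than a file path.
# na = "?" converts every "?" to NA while parsing.
sample_data <- read_csv(I(raw_text), col_names = FALSE, na = "?",
                        show_col_types = FALSE)

# Count the missing values in each column
print(colSums(is.na(sample_data)))
```

With this approach, a column such as X12 (ca) is read as a double with an NA entry instead of a character column.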


For this model, we will be using the variables age, sex, chol, restecg, trestbps to predict the cp type. Hence, we should select these columns from the dataset. 

In [5]:
cd_select <- select(cd_mutate, age, sex, chol, restecg, trestbps, cp)

We have cleaned and wrangled the data to give a tidy data frame. We can now proceed to split the data into training and testing sets to conduct the analysis.

In [6]:

cd_split <- initial_split(cd_select, prop = 0.75, strata = cp)
cd_training <- training(cd_split)
cd_test <- testing(cd_split)
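
To illustrate what stratifying by cp is meant to achieve, here is a conceptual sketch in base R: 75% of the rows are sampled within each class, so the training and test sets keep similar class proportions. This is not how rsample implements `strata` internally (its handling of small strata, for example, may differ), and the data frame `toy` with its class counts is made up for illustration.

```r
set.seed(777)

# Hypothetical outcome with unbalanced classes (not the real cp column)
toy <- data.frame(
    id = 1:100,
    cp = factor(rep(c("1", "2", "3", "4"), times = c(8, 16, 28, 48)))
)

# Within each class, draw 75% of the row ids for the training set
train_rows <- unlist(lapply(
    split(toy$id, toy$cp),
    function(ids) ids[sample.int(length(ids), size = round(0.75 * length(ids)))]
))

toy_training <- toy[toy$id %in% train_rows, ]
toy_test <- toy[!toy$id %in% train_rows, ]

# Class proportions in each set
print(prop.table(table(toy_training$cp)))
print(prop.table(table(toy_test$cp)))
```

Without stratification, a rare class could by chance end up almost entirely in one of the two sets, which would make the test accuracy harder to interpret.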


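Let's begin by choosing a value of K. We tune the number of neighbors with 10-fold cross-validation on the training set and plot the estimated accuracy for each candidate K.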

In [7]:
# Upsampling with the themis package is not applied yet (to test out later).
# Corrected sketch; note that the outcome column is cp:
# install.packages("themis")
# library(themis)
# upsampled_recipe <- recipe(cp ~ ., data = cd_training) |>
#     step_upsample(cp, over_ratio = 1, skip = FALSE) |>
#     prep()
# upsampled_cd_training <- bake(upsampled_recipe, cd_training)

options(repr.plot.height = 5, repr.plot.width = 6)

# Standardize all predictors; the recipe uses the training data directly
# because the upsampled data above is never created
cd_recipe <- recipe(cp ~ ., data = cd_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# 10-fold cross-validation, stratified by chest pain type
cd_vfold <- vfold_cv(cd_training, v = 10, strata = cp)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Candidate values of K: 2, 7, 12, ..., 47
k_vals <- tibble(neighbors = seq(2, 50, 5))

knn_results <- workflow() |>
    add_recipe(cd_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = cd_vfold, grid = k_vals) |>
    collect_metrics()

accuracies <- knn_results |>
    filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    scale_x_continuous(breaks = seq(2, 50, by = 5)) +  # ticks match the tested K values
    scale_y_continuous(limits = c(0.4, 1.0)) # adjusting the y-axis

cross_val_plot
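
To make the role of K concrete, the following base R sketch classifies one new observation by hand: it standardizes the predictors, computes Euclidean distances, and takes an unweighted majority vote among the K nearest training rows. It is a simplified illustration and not the kknn implementation. The data, the class labels "a" and "b", and the choice k = 3 are all made up; the real model uses five predictors and four chest pain classes.

```r
# Hypothetical training data: two predictors on very different scales
train_x <- data.frame(
    age  = c(40, 42, 45, 60, 63, 65),
    chol = c(180, 190, 200, 280, 300, 310)
)
train_y <- factor(c("a", "a", "a", "b", "b", "b"))

# Hypothetical new observation to classify
new_x <- data.frame(age = 61, chol = 290)

# Standardize with the training means and standard deviations
# (the role played by step_center() and step_scale() in a recipe)
mu <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)
train_z <- scale(train_x, center = mu, scale = sigma)
new_z <- scale(new_x, center = mu, scale = sigma)

# Euclidean distance from the new observation to every training row
dists <- sqrt(rowSums(sweep(train_z, 2, as.numeric(new_z))^2))

# Unweighted majority vote among the k closest rows
# (weight_func = "rectangular" gives every neighbor an equal vote)
k <- 3
nearest <- order(dists)[1:k]
votes <- table(train_y[nearest])
prediction <- names(votes)[which.max(votes)]

print(prediction)
```

Standardizing matters here because cholesterol spans a much wider numeric range than age; without it, the distance would be dominated by cholesterol alone.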


## References