# Group Project Report

## Introduction

Heart disease is increasingly common and is one of the leading causes of death worldwide. As algorithms and modeling become more accessible and sophisticated, they can be used to help predict and diagnose heart disease. Here, we build a simple model using the k-nearest neighbors (KNN) algorithm to predict the type of chest pain in individuals being tested for heart disease. Our question is: how accurately can a KNN model predict the type of chest pain?

The dataset we use is “processed.cleveland.data” from the UCI Machine Learning Repository. The data were collected between May 1981 and September 1984 and stem from the angiography results of 303 patients at the Cleveland Clinic in Cleveland, Ohio. The average age of study participants was 54, and 206 of the 303 participants were men (“International application of a new probability algorithm”). Similar algorithms already exist. One is described in Detrano et al.’s “International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease”, where the same dataset, along with similar ones, was used to train an algorithm to predict heart disease.


## Methods and Results

As mentioned previously, we will develop a model using this dataset to predict the type of chest pain that an individual is experiencing. The possible types of chest pain are typical angina, atypical angina, non-anginal pain, and asymptomatic. Our model uses the k-nearest neighbors algorithm with age, sex, resting electrocardiogram results (restecg), resting systolic blood pressure (trestbps), and serum cholesterol (chol) as predictors.

Age and sex were chosen because women with coronary heart disease (CHD) tend to be older than men with CHD, and the risk of CHD increases with age in both sexes. By incorporating age and sex, we can better estimate an individual’s likelihood of developing CHD. The electrocardiogram at rest provides information about heart rate, rhythm, and potential heart enlargement, making it a useful tool for investigating symptoms. Elevated systolic blood pressure is associated with a higher risk of heart disease because it damages arteries and impedes blood flow to the heart muscle. Additionally, a high level of serum cholesterol increases the risk of heart disease by promoting the development of fatty deposits in blood vessels, which can hinder flow through the arteries.
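
To make the k-nearest neighbors idea concrete, here is a minimal base R sketch of how one new observation could be classified: standardize the predictors, measure the Euclidean distance to every training point, and take a majority vote among the K closest. The function `knn_predict_one` and the toy data are hypothetical and made up for illustration. This is not the kknn implementation used later in the report.

```r
# Minimal k-nearest neighbors classifier in base R (illustrative sketch only).
# train_x: numeric matrix of predictors; train_y: factor of class labels;
# new_x: numeric vector for one new observation; k: number of neighbors.
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Standardize with the training means and standard deviations,
  # so that no predictor dominates the distance because of its units.
  centers <- colMeans(train_x)
  scales <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centers, scale = scales)
  new_z <- (new_x - centers) / scales

  # Euclidean distance from the new point to every training point.
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Majority vote among the k closest training points (equal weights).
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy data (made up for illustration; not from the Cleveland dataset).
toy_x <- matrix(c(1, 1,
                  1, 2,
                  2, 1,
                  8, 8,
                  8, 9,
                  9, 8),
                ncol = 2, byrow = TRUE)
toy_y <- factor(c("a", "a", "a", "b", "b", "b"))

knn_predict_one(toy_x, toy_y, new_x = c(2, 2), k = 3)
```

With these toy points, the three nearest neighbors of (2, 2) all belong to class "a", so that is the predicted class.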



Let's load the packages we need and set a seed so that the project is reproducible.

In [2]:


library(tidyverse)
library(tidymodels)


set.seed(777)



We first need to read the dataset from the web. After reading the data, we rename the columns so that they are more informative. This is done below:

In [None]:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cleveland_data <- read_csv(url, col_names = FALSE)
cleveland_data <- rename(cleveland_data,
       age = X1, 
       sex = X2, 
       cp = X3,
       trestbps = X4, 
       chol = X5, 
       fbs = X6, 
       restecg = X7, 
       thalach = X8, 
       exang = X9, 
       oldpeak = X10, 
       slope = X11, 
       ca = X12, 
       thal = X13, 
       num = X14)
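
As an alternative sketch, the column names can be supplied when the file is read, which avoids the separate rename step. The example uses base R's `read.csv` on two made-up rows so that it runs without a network connection or extra packages. The names `heart_cols`, `sample_text`, and `sample_data` are hypothetical.

```r
# Column names in the same order as the rename step above.
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Two made-up rows in the same comma-separated layout (not real patient records).
sample_text <- "50,1,4,130,250,0,2,140,0,1.5,2,0,3,0
61,0,2,120,210,0,0,160,0,0.0,1,1,7,1"

# header = FALSE because the file has no header row; names come from col.names.
sample_data <- read.csv(text = sample_text, header = FALSE,
                        col.names = heart_cols)
str(sample_data)
```

readr's `read_csv` accepts a character vector for `col_names` in a similar way, so the same idea should carry over to the tidyverse code in this report.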


The data is then mutated to convert columns to the right variable type. For example, since cp (chest pain type) is a categorical variable, we use the as_factor function to convert that column to a factor.

In [None]:
cd_mutate <- cleveland_data |>
  mutate(cp = as_factor(cp))
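
The factor levels produced above are the numeric codes 1 to 4. As an optional sketch, the codes could be given readable labels. This assumes the coding in the UCI documentation for this dataset (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic). The values in `cp_codes` are made up for illustration.

```r
# Coding from the UCI documentation for the cp column:
# 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic.
cp_labels <- c("typical angina", "atypical angina",
               "non-anginal pain", "asymptomatic")

# Made-up cp codes for illustration (not taken from the dataset).
cp_codes <- c(1, 4, 4, 3, 2, 4)

# Convert the numeric codes to a factor with descriptive labels.
cp_factor <- factor(cp_codes, levels = 1:4, labels = cp_labels)
table(cp_factor)
```

Readable labels make tables and plots easier to interpret, but they do not change how the classifier treats the classes.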



The heart disease dataset uses "?" in place of NA for missing values, which can be difficult to deal with later on. We replace every "?" with NA as shown below.

In [None]:
cd_mutate[cd_mutate == "?"] <- NA
cd_mutate
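
A column that contains "?" is read as text, so it stays a character column even after the replacement. The sketch below shows the replacement followed by a conversion to numeric on a small made-up vector. It assumes a column such as ca is affected, and `ca_raw` and `ca_clean` are hypothetical names.

```r
# Made-up values standing in for a text column that contains "?" entries.
ca_raw <- c("0.0", "3.0", "?", "1.0")

# Replace the "?" placeholder with NA, then convert the column to numeric.
ca_clean <- ca_raw
ca_clean[ca_clean == "?"] <- NA
ca_clean <- as.numeric(ca_clean)

ca_clean
sum(is.na(ca_clean))  # number of missing values
```

The columns selected for the model in the next step are not converted here, so this matters mainly if ca or thal are used in later analysis.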

For this model, we will use the variables age, sex, chol, restecg, and trestbps to predict the chest pain type (cp), so we select these columns from the dataset.

In [None]:
cd_select <- select(cd_mutate, age, sex, chol, restecg, trestbps, cp)

We have cleaned and wrangled the data to give a tidy data frame. We can now proceed to split the data into training and testing sets to conduct the analysis.

In [None]:

cd_split <- initial_split(cd_select, prop = 0.75, strata = cp)
cd_training <- training(cd_split)
cd_test <- testing(cd_split)
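
Stratifying on cp keeps the share of each chest pain type roughly the same in the training and testing sets. The base R sketch below illustrates the idea on made-up labels by sampling 75% of the rows within each class. It is a simplified stand-in, and the internals of `initial_split` may differ.

```r
set.seed(777)

# Made-up class labels with an imbalanced distribution (illustration only).
labels <- factor(rep(c("1", "2", "3", "4"), times = c(8, 16, 28, 48)))

# Within each class, sample 75% of the row indices for training.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions in the training and testing parts.
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
```

Because every class size here is a multiple of four, the two parts end up with identical class proportions. With real data the match is only approximate.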


Let's begin by choosing a value of K, using 10-fold cross-validation on the training set.

In [3]:
# Upsampling the cp classes needs the themis package, which failed to install
# in this environment, so the step is left commented out.
# install.packages("themis")
# library(themis)
# upsampled_recipe <- recipe(cp ~ ., data = cd_training) |>
#   step_upsample(cp, over_ratio = 1, skip = FALSE) |>
#   prep()
# upsampled_cd_training <- bake(upsampled_recipe, cd_training)

options(repr.plot.height = 5, repr.plot.width = 6)

# Standardize the predictors so each contributes equally to the distance
cd_recipe <- recipe(cp ~ ., data = cd_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 10-fold cross-validation on the training set, stratified by cp
cd_vfold <- vfold_cv(cd_training, v = 10, strata = cp)

# KNN specification with the number of neighbors left to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Candidate K values: 2, 7, 12, ..., 47
k_vals <- tibble(neighbors = seq(2, 50, 5))

knn_results <- workflow() |>
  add_recipe(cd_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = cd_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  scale_x_continuous(breaks = seq(2, 50, by = 5)) + # breaks match the K grid
  scale_y_continuous(limits = c(0.4, 1.0)) # adjusting the y-axis

cross_val_plot
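
Besides reading K off the plot, the best value can be picked programmatically. The sketch below uses a hypothetical helper, `best_k`, that returns the `neighbors` value with the highest `mean` accuracy. The numbers in `toy_acc` are made up for illustration and are not results from this analysis.

```r
# Hypothetical helper: return the neighbors value with the highest mean accuracy.
# `acc` is a data frame with columns `neighbors` and `mean`, like `accuracies`.
best_k <- function(acc) {
  acc$neighbors[which.max(acc$mean)]
}

# Made-up accuracy estimates for illustration (not results from this analysis).
toy_acc <- data.frame(neighbors = seq(2, 50, 5),
                      mean = c(0.40, 0.45, 0.48, 0.50, 0.52,
                               0.51, 0.50, 0.49, 0.49, 0.48))
best_k(toy_acc)
```

In practice it is worth checking the plot as well, since a K whose neighboring values give similar accuracy is usually a more stable choice than an isolated peak.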



## References