# Predicting Heart Disease in Patients Using Classification

### Introduction:

Heart disease, also known as cardiovascular disease, refers to conditions which affect the heart, the most common of these conditions being coronary artery disease. Heart disease can lead to heart attacks, heart failure, arrhythmias and heart valve diseases. Some risk factors include high blood pressure, high cholesterol, obesity, and a sedentary lifestyle.

The question we attempt to answer is: "Can we predict the severity of heart disease, if any, in a patient based on their age, chest pain type, cholesterol, maximum heart rate, exercise-induced angina, and sex?"

To answer this, we use the “Heart Disease” dataset. Each observation describes one individual. The file has 16 columns: an id, the source region, and 14 attributes relating to heart disease, including a value from 0 to 4 for the presence of heart disease, where 0 means no disease.


### Methods

We will conduct a classification analysis on the heart disease dataset, with the goal of predicting the severity of heart disease (represented by the "severity" variable).

__Data Processing__
* Clean and wrangle the dataset.
* Filter the dataset to include only data from Cleveland in the  "region" column.
* Rename the "num" column to "severity."
* Select the response column "severity" and the candidate predictor columns "age," "pain_type," "chol," "max_hr," "exang," "resting_ecg," and "sex."

__Data Splitting__
* Split the dataset into training and testing datasets. The training data will be used to train the model, while the testing data will be used to evaluate its performance.

__Model Building__
* Create a classification model based on the chosen specifications.
* Fit the model using the training data, allowing it to learn from the selected input features and class labels.

__Model Evaluation__
* Evaluate the model's performance using evaluation metrics such as accuracy, precision, and recall.
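
For a problem with several severity classes, these metrics can be written in terms of the confusion matrix. The notation below is an assumption introduced here for illustration: $n_{ij}$ is the number of observations predicted as class $i$ whose true class is $j$. Precision and recall are then defined separately for each class $c$.

```latex
\[
\text{accuracy} = \frac{\sum_{c} n_{cc}}{\sum_{i}\sum_{j} n_{ij}}, \qquad
\text{precision}_c = \frac{n_{cc}}{\sum_{j} n_{cj}}, \qquad
\text{recall}_c = \frac{n_{cc}}{\sum_{i} n_{ic}}
\]
```

Precision for a class is undefined when the model never predicts that class, because the denominator is zero.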

__Model Tuning__
* Finally, fine-tune the model to optimize its performance.

__Visualization__

* After evaluating and tuning the model, we want to know which number of neighbors gives the best accuracy. We can visualize this by plotting the estimated accuracy against K values using a geom_point + geom_line graph.


### Cleaning, Wrangling, Summary:

In [2]:
#Load all libraries and set plot dimensions
options(repr.plot.height = 8, repr.plot.width = 10)
library(tidyverse)
library(repr)
library(dplyr)
library(tidymodels)
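# Install kknn (the engine used for the nearest neighbor model) only if it is missing
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")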

#Load data frame and rename columns
urlfile <- "https://raw.githubusercontent.com/cocom250/DSCI-100-Group-17/main/heart_disease_uci.csv"
heart_disease_data <- read_csv(url(urlfile))
head(heart_disease_data)

heart_disease_data <- rename(heart_disease_data, 
                             region = dataset,
                             pain_type = cp,
                             resting_bps = trestbps,
                             fasting_bs = fbs,
                             resting_ecg = restecg,
                             max_hr = thalch,
                             n_major_vessel = ca,
                             severity = num) 

#Cleaning up data, filtering for Cleveland and selecting predictor columns
heart_disease_data <- filter(heart_disease_data, region == "Cleveland") |>
    mutate(severity = as.factor(severity)) |>
    select(severity, age, pain_type, chol, max_hr, exang, resting_ecg, sex)

#Split the data into training set and testing set (seed set so the split is reproducible)
set.seed(2024)
heart_disease_split <- initial_split(heart_disease_data, prop = 0.75, strata = severity)
heart_disease_train <- training(heart_disease_split)
heart_disease_test <- testing(heart_disease_split)

#Table of the number and percentage of training observations with each pain type at each severity level
pain_type_distribution <- heart_disease_train |>
    group_by(severity, pain_type) |>
    summarize(n = n(), .groups = "drop") |>
    mutate(percent = 100 * n / nrow(heart_disease_train))

pain_type_distribution

#Visualization of Pain Type distribution across severity levels
pain_type_distribution_plot <- ggplot(pain_type_distribution, 
                                 aes(x= pain_type, y =n)) +
    geom_bar(stat ="identity") +
    labs(x = "Types of Pain", y = "Count")+
    theme(text = element_text(size =12)) +
    facet_grid(rows = vars(severity)) +
    ggtitle("Distribution of Pain Type across Severity Levels")

pain_type_distribution_plot


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Rows: 920 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): sex, dataset, cp, restecg, slope, thal
dbl (8): id, age, trestbps, chol, thalch, oldpeak, ca, num
lgl (2): fbs, exang

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<lgl>` | `<chr>` | `<dbl>` | `<lgl>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` |
| 1 | 63 | Male | Cleveland | typical angina | 145 | 233 | True | lv hypertrophy | 150 | False | 2.3 | downsloping | 0 | fixed defect | 0 |
| 2 | 67 | Male | Cleveland | asymptomatic | 160 | 286 | False | lv hypertrophy | 108 | True | 1.5 | flat | 3 | normal | 2 |
| 3 | 67 | Male | Cleveland | asymptomatic | 120 | 229 | False | lv hypertrophy | 129 | True | 2.6 | flat | 2 | reversable defect | 1 |
| 4 | 37 | Male | Cleveland | non-anginal | 130 | 250 | False | normal | 187 | False | 3.5 | downsloping | 0 | normal | 0 |
| 5 | 41 | Female | Cleveland | atypical angina | 130 | 204 | False | lv hypertrophy | 172 | False | 1.4 | upsloping | 0 | normal | 0 |
| 6 | 56 | Male | Cleveland | atypical angina | 120 | 236 | False | normal | 178 | False | 0.8 | upsloping | 0 | normal | 0 |


ERROR: Error in group_by(heart_disease_train, severity, pain_type): object 'heart_disease_train' not found


### Performance Evaluation

In [None]:
# set the seed so our evaluation is reproducible
set.seed(2024)

#Split the data into training set and testing set
heart_disease_split <- initial_split(heart_disease_data, prop = 0.75, strata = severity)
heart_disease_train <- training(heart_disease_split)
heart_disease_test <- testing(heart_disease_split)

#Table of the number and percentage of observations in each severity level, with average age and cholesterol
severity_stats <- heart_disease_train |>
    group_by(severity) |>
    summarize(n = n(), age_avg = mean(age, na.rm = TRUE), chol_avg = mean(chol, na.rm = TRUE)) |>
    mutate(percent = 100*n/nrow(heart_disease_train))

severity_stats

In [None]:
# set the seed so our evaluation is reproducible
set.seed(2024)

# use the training data as the recipe template, matching the data the workflow is fitted on
heart_recipe <- recipe(severity ~ age + chol + max_hr, data = heart_disease_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# training the classifier
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_spec) |>
  fit(data = heart_disease_train)

knn_fit
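
To make the workflow above more concrete, here is a minimal base R sketch of what standardizing the predictors and taking an unweighted ("rectangular") vote among the nearest neighbours amounts to. It is not the kknn implementation; the toy data, the variable names, and the choice of k = 3 are assumptions made purely for illustration.

```r
# Hypothetical toy data: two numeric predictors and a class label.
train_x <- matrix(c(1, 10,
                    2, 12,
                    8, 30,
                    9, 33,
                    10, 31), ncol = 2, byrow = TRUE)
train_y <- factor(c("a", "a", "b", "b", "b"))
new_x <- c(2, 11)  # hypothetical new observation to classify

# Standardize using the training means and standard deviations only.
centers <- colMeans(train_x)
scales <- apply(train_x, 2, sd)
train_z <- scale(train_x, center = centers, scale = scales)
new_z <- (new_x - centers) / scales

# Euclidean distance from the new point to every training point.
dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

# Unweighted majority vote among the k nearest neighbours.
k <- 3
nearest <- order(dists)[1:k]
votes <- table(train_y[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

The new point is standardized with the training statistics rather than its own, which is the same idea as fitting the recipe on the training set and then applying it to the test set.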

In [None]:
# set the seed so our evaluation is reproducible
set.seed(2024)

#predicting labels in the test set 
heart_test_predictions <- predict(knn_fit, heart_disease_test) |>
    bind_cols(heart_disease_test)

heart_test_predictions

# evaluating performance of the classifier on the test set
heart_test_predictions |>
    metrics(truth = severity, estimate = .pred_class) |>
    filter(.metric == "accuracy")

# looking at the confusion matrix
confusion <- heart_test_predictions |>
    conf_mat(truth = severity, estimate = .pred_class)

confusion

In [None]:
#calculating accuracy, precision and recall for each category by hand from the confusion matrix counts
calculations <- tibble(accuracy = (33+1+2+2+0)/(33+9+5+3+2+5+1+1+1+0+4+1+2+2+1+0+1+2+2+1+0+1+0+1+0),
                       precision_0 = (33)/(33+9+5+3+2),
                       precision_1 = (1)/(5+1+1+1+0),
                       precision_2 = 2/(4+1+2+2+1),
                       precision_3 = 2/(0+1+2+2+1),
                       precision_4 = 0/(0+1+0+1+0),
                       recall_0 = (33)/(33+5+4),
                       recall_1 = 1/(9+1+1+1+1),
                       recall_2 = 2/(5+1+2+2),
                       recall_3 = 2/(3+1+2+2+1),
                       recall_4 = 0/(2+0+1+1))
calculations
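
The hand-typed sums above are easy to get wrong, so the same quantities can be derived from a matrix instead. The sketch below is one possible way to do it in base R. It assumes the counts are laid out with predictions in rows and true classes in columns, which is the layout implied by the sums in the hand calculation; the object names are hypothetical.

```r
# Counts copied from the hand calculation above
# (assumed layout: rows = predicted severity, columns = true severity).
cm <- matrix(c(33, 9, 5, 3, 2,
                5, 1, 1, 1, 0,
                4, 1, 2, 2, 1,
                0, 1, 2, 2, 1,
                0, 1, 0, 1, 0),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = as.character(0:4),
                             truth = as.character(0:4)))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Precision divides by everything predicted as the class (row sums);
# recall divides by everything truly in the class (column sums).
precision <- diag(cm) / rowSums(cm)
recall <- diag(cm) / colSums(cm)

per_class <- data.frame(severity = 0:4,
                        precision = unname(precision),
                        recall = unname(recall))
accuracy
per_class
```

Working from the matrix means the calculation does not need to be retyped when the model or the split changes.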
                       

### Tuning

In [None]:
# set the seed so our evaluation is reproducible
set.seed(2024)

# create the 75/25 split of the training data into sub-training and validation sets
heart_split <- initial_split(heart_disease_train, prop = 0.75, strata = severity)
heart_subtrain <- training(heart_split)
heart_validation <- testing(heart_split)

# recreate the standardization recipe from before 
# (since it must be based on the training data)
heart_train_recipe <- recipe(severity ~age + chol + max_hr, data = heart_subtrain) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# fit the knn model (we can reuse the old knn_spec model from before)
knn_train_fit <- workflow() |>
  add_recipe(heart_train_recipe) |>
  add_model(knn_spec) |>
  fit(data = heart_subtrain)

# get predictions on the validation data
validation_predicted <- predict(knn_train_fit, heart_validation) |>
  bind_cols(heart_validation)

# compute the accuracy
acc <- validation_predicted |>
  metrics(truth = severity, estimate = .pred_class) |>
  filter(.metric == "accuracy") |>
  select(.estimate) |>
  pull()

acc

# perform 5-fold cross-validation on the training data (v = 5)
heart_vfold <- vfold_cv(heart_disease_train, v = 5, strata = severity)


#recreate standardization recipe using training data
heart_train_recipe <- recipe(severity ~age + chol + max_hr,
                        data = heart_disease_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# resampling
knn_fit_1 <- workflow() |>
    add_recipe(heart_train_recipe) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = heart_vfold)


knn_fit_1 |>
    collect_metrics()

# TODO: review the accuracy and decide whether to repeat the cross-validation with a different v
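
As a rough illustration of what the 5-fold cross-validation above does, the base R sketch below assigns rows to folds and shows that each row is held out exactly once. It is a simplification: unlike `vfold_cv()` with `strata`, it does not stratify by severity, and the row count of 20 is a hypothetical value chosen so the folds divide evenly.

```r
set.seed(2024)

n_rows <- 20  # hypothetical number of training rows
v <- 5        # number of folds

# Randomly assign each row to one of the v folds, keeping fold sizes equal.
fold_id <- sample(rep(seq_len(v), length.out = n_rows))
fold_sizes <- table(fold_id)

# For fold i, the rows in that fold are held out for validation
# and all remaining rows are used to fit the model.
validation_rows <- lapply(seq_len(v), function(i) which(fold_id == i))
analysis_rows <- lapply(seq_len(v), function(i) which(fold_id != i))

fold_sizes
```

The accuracy reported by `collect_metrics()` is the mean over the folds, so a larger v gives each model more rows to fit on at the cost of more model fits.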


In [None]:
# set the seed so our evaluation is reproducible
set.seed(2024)
options(repr.plot.height = 8, repr.plot.width = 10)

# parameter value selection
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = heart_vfold, grid = k_vals) |>
  collect_metrics() 

accuracies <- knn_results |>
  filter(.metric == "accuracy")


# visualizing most accurate k value
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") + 
  ggtitle("Estimated Accuracy versus the Number of Neighbours")+
  theme(plot.title = element_text(hjust =0.5), text = element_text(size =12))
  

accuracy_vs_k
#based on the graph, k=21 seems to provide the highest accuracy
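
Reading the best k off a plot can be checked by picking the row with the highest mean accuracy directly. The sketch below uses base R on a small stand-in table; the accuracy values are placeholders invented for illustration and are not the results of this notebook.

```r
# Placeholder values standing in for the `accuracies` table
# (NOT the notebook's actual cross-validation results).
accuracies_demo <- data.frame(
  neighbors = c(1, 6, 11, 16),
  mean = c(0.40, 0.45, 0.50, 0.48)
)

# Pick the number of neighbours with the highest estimated accuracy.
best_row <- which.max(accuracies_demo$mean)
best_k <- accuracies_demo$neighbors[best_row]
best_k
```

With the real `accuracies` tibble, the same selection could be done with `slice_max(accuracies, mean, n = 1)` from dplyr. Nearby k values often have very similar accuracy, so the plot is still worth inspecting.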

In [None]:
#using optimal k value = 21
# set the seed so our evaluation is reproducible
set.seed(2024)

# use the training data as the recipe template, matching the data the workflow is fitted on
heart_recipe <- recipe(severity ~ age + chol + max_hr, data = heart_disease_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# training the classifier
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 21) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_spec) |>
  fit(data = heart_disease_train)

knn_fit

#predicting labels in the test set 
heart_test_predictions <- predict(knn_fit, heart_disease_test) |>
    bind_cols(heart_disease_test)

heart_test_predictions

# evaluating performance of the classifier on the test set
heart_test_predictions |>
    metrics(truth = severity, estimate = .pred_class) |>
    filter(.metric == "accuracy")

# looking at the confusion matrix
confusion <- heart_test_predictions |>
    conf_mat(truth = severity, estimate = .pred_class)

confusion


In [None]:
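#calculating accuracy, precision and recall for each category by hand from the confusion matrix counts
#precision is NaN (0/0) for classes the model never predicted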
calculations <- tibble(accuracy = (39+1)/(39+12+5+8+4+3+1+4+1+1),
                       precision_0 = (39)/(39+12+5+8+4),
                       precision_1 = (1)/(3+1+4+1+0),
                       precision_2 = 0/(0),
                       precision_3 = 0/(1),
                       precision_4 = 0/(0),
                       recall_0 = (39)/(39+3),
                       recall_1 = 1/(12+1),
                       recall_2 = 0/(5+4+1),
                       recall_3 = 0/(8+1),
                       recall_4 = 0/(4))
calculations