# Project on Heart Disease (Group 73)

## Introduction

Heart disease is a broad term that covers a range of heart conditions. What they have in common is their effect on blood flow to the heart and on the blood vessels that supply it. Other factors such as age, cholesterol, and blood pressure are also closely linked with heart disease.  
Our project investigates whether these factors are associated with heart disease and builds a classification model that predicts whether a person has heart disease based on four factors:
1. Age [age]
2. Resting blood pressure (in mm Hg on admission to the hospital) [trestbps]
3. Cholesterol [chol]
4. Maximum heart rate achieved [thalach]

We will use data from the UCI Machine Learning Repository's Heart Disease dataset to train our model to predict whether someone has heart disease based on the four risk factors listed above. The diagnosis our model predicts is based on the percentage of narrowing in the major arteries (<50% narrowing meaning no heart disease and >50% narrowing meaning heart disease).
The repository (retrieved from https://archive.ics.uci.edu/ml/datasets/Heart+Disease) contains four databases with data from Cleveland, Hungary, Long Beach (California), and Switzerland. We chose the Hungarian data because it has a binary column indicating whether or not a person is likely to have heart disease. This dataset was collected at the Hungarian Institute of Cardiology in Budapest.

The four source files are:
- Cleveland Clinic Foundation (cleveland.data)
- Hungarian Institute of Cardiology, Budapest (hungarian.data)
- V.A. Medical Center, Long Beach, CA (long-beach-va.data)
- University Hospital, Zurich, Switzerland (switzerland.data)


This project aims to build a classification model that predicts whether a patient presenting to the hospital is at risk of heart disease, using results from multiple medical tests.
The dataset (https://archive.ics.uci.edu/ml/datasets/Heart+Disease) provides multiple attributes, such as age and sex, as well as a categorical attribute indicating whether a patient has been diagnosed with heart disease.

The term “heart disease” refers to several types of heart conditions. The most common type of heart disease in the United States is coronary artery disease (CAD), which affects the blood flow to the heart. Decreased blood flow can cause a heart attack.

## Data Preparation

In [48]:
# Attach the libraries
library(tidyverse)
library(tidymodels)

In [44]:
# Load the processed Hungarian heart disease data (the file has no header row)
hungarian_data <- read_csv("data/processed.hungarian.data", col_names = FALSE)
head(hungarian_data)

Rows: 294 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X4, X5, X6, X7, X8, X9, X11, X12, X13
dbl (5): X1, X2, X3, X10, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>
28,1,2,130,132,0,2,185,0,0,?,?,?,0
29,1,2,120,243,0,0,160,0,0,?,?,?,0
29,1,2,140,?,0,0,170,0,0,?,?,?,0
30,0,1,170,237,0,1,170,0,0,?,?,6,0
31,0,2,100,219,0,1,150,0,0,?,?,?,0
32,0,2,105,198,0,0,165,0,0,?,?,?,0


In [50]:
# Rename the columns to the attribute names used in the dataset documentation
hungarian_set <- rename(hungarian_data,
    age = X1, 
    sex = X2, 
    cp = X3,
    trestbps = X4, 
    chol = X5, 
    fbs = X6,
    restecg = X7,
    thalach = X8, 
    exang = X9, 
    oldpeak = X10,                        
    slope = X11, 
    ca = X12,
    thal = X13, 
    num = X14)
head(hungarian_set)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>
28,1,2,130,132,0,2,185,0,0,?,?,?,0
29,1,2,120,243,0,0,160,0,0,?,?,?,0
29,1,2,140,?,0,0,170,0,0,?,?,?,0
30,0,1,170,237,0,1,170,0,0,?,?,6,0
31,0,2,100,219,0,1,150,0,0,?,?,?,0
32,0,2,105,198,0,0,165,0,0,?,?,?,0


In [47]:
# Select the four predictors and the outcome, drop rows where a value is
# missing (coded as "?"), and convert each column to its proper type.
# age and num were already read as numbers, so they contain no "?" entries.
hungarian_clean <- hungarian_set |>
    select(age, trestbps, chol, thalach, num) |>
    filter(trestbps != "?", chol != "?", thalach != "?") |>
    mutate(across(c(trestbps, chol, thalach), as.double),
           num = as.factor(num))

head(hungarian_clean)

age,trestbps,chol,thalach,num
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
28,130,132,185,0
29,120,243,160,0
30,170,237,170,0
31,100,219,150,0
32,105,198,165,0
32,110,225,184,0


To clean up the data, we first renamed the columns to match the attributes given in the dataset documentation. We then kept only the four predictors and the outcome, removed rows with missing values (recorded as "?"), converted the predictors to doubles, and changed the variable num to a factor because it is a categorical variable.
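The same cleaning steps can also be sketched more compactly by telling read_csv to treat "?" as missing when the file is read. This is only an illustrative alternative, not the code used in this notebook: the helper name read_heart_data and the vector heart_columns are hypothetical, and the sketch assumes the file has the 14 columns shown above in that order.

```r
library(tidyverse)

# Hypothetical column names, in the order the attributes appear in the file
heart_columns <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                   "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Hypothetical helper: read the file, keep the four predictors and the
# outcome, and drop rows with a missing value in any of those columns
read_heart_data <- function(path) {
    read_csv(path, col_names = heart_columns, na = "?", show_col_types = FALSE) |>
        select(age, trestbps, chol, thalach, num) |>
        drop_na() |>
        mutate(num = as.factor(num))
}
```

Because "?" is parsed as NA, the predictor columns are read as doubles directly, so no later conversion with as.double is needed. A call such as read_heart_data("data/processed.hungarian.data") would be the intended usage.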

In [5]:
# split data into training set and testing set
set.seed(1000)

hungarian_split <- initial_split(hungarian_clean, prop=0.75, strata=num)
hungarian_train <- training(hungarian_split)
hungarian_test <- testing(hungarian_split)

We then split the cleaned data into a training set and a testing set: 75% of the data is used for training and the remaining 25% for testing. We chose these proportions so that there is enough data to train the model while still leaving enough to estimate its accuracy on unseen data. The split is stratified by num so that both sets have a similar share of each diagnosis.
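One way to check that the stratified split kept the diagnosis classes balanced is to compare the class proportions in the two sets. The helper below is a small sketch; the name class_proportions is hypothetical, and it assumes the outcome column is called num as in the cells above.

```r
library(tidyverse)

# Hypothetical helper: count and proportion of each diagnosis class in `data`
class_proportions <- function(data) {
    data |>
        count(num) |>
        mutate(prop = n / sum(n))
}
```

Calling class_proportions(hungarian_train) and class_proportions(hungarian_test) should give similar values of prop if the stratification worked as intended.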

## Method

We will build the classification model using k-nearest neighbors. We will fit it on the training set hungarian_train and measure its accuracy on the testing set hungarian_test. The value of k will be tuned through cross-validation to give the highest accuracy. We only use predictor variables stored as doubles, not factors, because k-nearest neighbors relies on the distance between observations and distance cannot be calculated for unordered categories.

We can visualize the tuning by plotting the cross-validation accuracy estimate against k.
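The tuning step described above could look roughly like the sketch below. It is one possible setup rather than a finished analysis: the function names tune_knn_k and plot_k_accuracy, the default grid of odd k values from 1 to 31, and the choice of 5 folds are all assumptions made for illustration.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: estimate accuracy for each candidate k using v-fold
# cross-validation. `train_data` must contain age, trestbps, chol, thalach
# and the factor outcome num.
tune_knn_k <- function(train_data, k_values = seq(1, 31, by = 2), folds = 5) {
    knn_recipe <- recipe(num ~ age + trestbps + chol + thalach, data = train_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    # neighbors = tune() marks k as the parameter to be tuned
    knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    cv_folds <- vfold_cv(train_data, v = folds, strata = num)

    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_tune_spec) |>
        tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) |>
        collect_metrics() |>
        filter(.metric == "accuracy")
}

# Hypothetical helper: plot the cross-validated accuracy estimate against k
plot_k_accuracy <- function(cv_results) {
    ggplot(cv_results, aes(x = neighbors, y = mean)) +
        geom_point() +
        geom_line() +
        labs(x = "Number of neighbors (k)", y = "Cross-validation accuracy estimate")
}
```

With the training set from this notebook, the intended usage would be cv_results <- tune_knn_k(hungarian_train) followed by plot_k_accuracy(cv_results). Only the training data is passed in, so the testing set stays untouched until the final evaluation.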


## Expected Outcomes and Significance

From this research project, we expect to find a correlation between risk factors (specifically, a person's age, cholesterol, maximum heart rate achieved, and resting blood pressure) and the overall likelihood that they have heart disease. With this research, we hope to create a model that can predict, with a high level of accuracy, whether a person has heart disease. If our model is successful, it could aid the medical community by making the diagnosis of heart disease faster and easier, allowing doctors to quickly assist those in need. A doctor could simply input the measurements taken for each of the 4 risk factors and the model would indicate whether the patient is likely to have heart disease.

After our model is complete, future questions may be raised on this topic such as:
- Are there other diseases that can be accurately predicted based on a chosen set of measurable risk factors?
- Will it ever be ethical to nearly remove a doctor from the situation and trust a model instead?
- Is it possible to create a model that is 100% accurate?
- How might a model, such as ours, affect the speed at which heart disease is detected and treated, and could this lead to more lives saved in the future?

In [6]:
# Build the recipe on the training data: predict num from the four
# predictors, then scale and center them so that each one contributes
# equally to the distance calculation.

long_beach_recipe <- recipe(num ~ age + trestbps + chol + thalach, data = hungarian_train) |>
   step_scale(all_predictors()) |>
   step_center(all_predictors())

long_beach_recipe

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Operations:

Scaling for all_predictors()
Centering for all_predictors()

In [34]:
# Now we need to make KNN spec and workflow

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 27) |>
      set_engine("kknn") |>
      set_mode("classification")
#knn_spec
long_beach_fit <- workflow() |>
      add_recipe(long_beach_recipe) |>
      add_model(knn_spec) |>
      fit(data = hungarian_train)
#long_beach_fit

In [35]:
# Predict the diagnosis for the test set and attach the true values
long_beach_predictions <- predict(long_beach_fit, hungarian_test) |>
      bind_cols(hungarian_test)
long_beach_predictions |>
      print(n = 10)

# A tibble: 69 × 6
   .pred_class   age trestbps  chol thalach num  
   <fct>       <dbl>    <dbl> <dbl>   <dbl> <fct>
 1 0              29      120   243     160 0    
 2 0              31      100   219     150 0    
 3 0              32      110   225     184 0    
 4 0              34      130   161     190 0    
 5 0              35      140   167     150 0    
 6 0              35      120   308     180 0    
 7 0              37      120   260     130 0    
 8 0              37      130   315     158 0    
 9 0              39      190   241     106 0    
10 0              40      140   289     172 0    
# … with 59 more rows


In [36]:
# Compute the accuracy (and Cohen's kappa) of the predictions on the test set

long_beach__prediction_accuracy <- long_beach_predictions |>
        metrics(truth = num, estimate = .pred_class)   # true labels vs. predicted labels

long_beach__prediction_accuracy



.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.7101449
kap,binary,0.3320426
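
An accuracy of about 0.71 with a kappa of about 0.33 is easier to interpret next to a confusion matrix and a simple baseline. The sketch below shows one way to get both; the helper name evaluate_predictions is hypothetical, and the baseline is assumed to be a classifier that always predicts the most common class in the data it is given.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: confusion matrix plus the accuracy of always
# predicting the most common class. `predictions` must contain the factor
# columns num (truth) and .pred_class (estimate).
evaluate_predictions <- function(predictions) {
    list(
        confusion = conf_mat(predictions, truth = num, estimate = .pred_class),
        majority_baseline = predictions |>
            count(num) |>
            mutate(prop = n / sum(n)) |>
            summarise(baseline = max(prop)) |>
            pull(baseline)
    )
}
```

Calling evaluate_predictions(long_beach_predictions) would show how many patients of each class were misclassified. If the model's accuracy is not clearly above majority_baseline, it is adding little beyond guessing the most common diagnosis.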
