# Project on Heart Disease (Group 73)

## Introduction

Heart disease is a broad term covering several heart conditions. What they have in common is their effect on blood flow to the heart and on the blood vessels that supply it. Factors such as age, sex, and exercise are also closely linked with heart disease. Our data directory contains four databases: Cleveland, Hungarian, Long Beach (California), and Switzerland. We chose Long Beach because its culture (such as food and lifestyle) is similar to Vancouver's.



The relationship between such factors and heart disease can be examined through our research question: can a classification model predict whether a person has heart disease based on their age, cholesterol, maximum heart rate, resting blood pressure, and ST depression?

The four databases in the directory are:

1. Cleveland Clinic Foundation (cleveland.data)
2. Hungarian Institute of Cardiology, Budapest (hungarian.data)
3. V.A. Medical Center, Long Beach, CA (long-beach-va.data)
4. University Hospital, Zurich, Switzerland (switzerland.data)




This project aims to build a classification model that predicts whether a patient presenting to the hospital is at risk of heart disease, based on the results of multiple medical tests.
The dataset (https://archive.ics.uci.edu/ml/datasets/Heart+Disease) provides multiple attributes such as age and sex, as well as a categorical attribute (num) recording the patient's heart disease diagnosis.

The term “heart disease” refers to several types of heart conditions. The most common type of heart disease in the United States is coronary artery disease (CAD), which affects the blood flow to the heart. Decreased blood flow can cause a heart attack.


## Data Preparation

In [2]:
# Load libraries
library(tidyverse)
library(tidymodels)


-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.6      v purrr   0.3.4 
v tibble  3.1.8      v dplyr   1.0.10
v tidyr   1.2.0      v stringr 1.4.1 
v readr   2.1.2      v forcats 0.5.2 

"package 'ggplot2' was built under R version 4.1.3"
"package 'tibble' was built under R version 4.1.3"
"package 'tidyr' was built under R version 4.1.2"
"package 'readr' was built under R version 4.1.2"
"package 'dplyr' was built under R version 4.1.3"
"package 'stringr' was built under R version 4.1.3"
"package 'forcats' was built under R version 4.1.3"
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats

In [3]:
#loading dataset from database

original <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/long-beach-va.data"
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data"
longbeach_data <- read_csv(url, col_names = FALSE)

Rows: 200 Columns: 14
-- Column specification ------------------------------------------------------------------------------------------------
Delimiter: ","
chr (9): X4, X5, X6, X8, X9, X10, X11, X12, X13
dbl (5): X1, X2, X3, X7, X14

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
#renaming data frame's column names accordingly
longbeach_set <- rename(longbeach_data,
    age = X1, 
    sex = X2, 
    cp = X3,
    trestbps = X4, 
    chol = X5, 
    fbs = X6,
    restecg = X7,
    thalach = X8, 
    exang = X9, 
    oldpeak = X10,                        
    slope = X11, 
    ca = X12,
    thal = X13, 
    num = X14)

In [5]:
#taking our most important columns
#then changing num to factor
longbeach_select <- longbeach_set |>
    select(age, chol, thalach, trestbps, oldpeak, num) |>
    mutate(num = as.factor(num))
longbeach_select

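| age | chol | thalach | trestbps | oldpeak | num |
|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<fct>` |
| 63 | 260 | 112 | 140 | 3 | 2 |
| 44 | 209 | 127 | 130 | 0 | 0 |
| 60 | 218 | 140 | 132 | 1.5 | 2 |
| 55 | 228 | 149 | 142 | 2.5 | 1 |
| 66 | 213 | 99 | 110 | 1.3 | 0 |
| 66 | 0 | 120 | 120 | -0.5 | 0 |
| 65 | 236 | 105 | 150 | 0 | 3 |
| 60 | 0 | 140 | 180 | 1.5 | 0 |
| 60 | 0 | 141 | 120 | 2 | 3 |
| 60 | 267 | 157 | 160 | 0.5 | 1 |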


To clean up the data, we first renamed all the columns according to the attributes of the dataset. We then dropped the unused attributes and converted the variable num to a factor because it is a categorical statistical variable.
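The column specification printed when the file was loaded shows several measurement columns read as chr rather than dbl. One plausible cause, stated here as an assumption, is that the raw file marks missing values with "?". The sketch below shows how `read_csv()` could be told to treat "?" as NA so those columns parse as numbers. The three rows in `sample_csv` are a hypothetical excerpt written for illustration, not the real file.

```r
library(tidyverse)

# Hypothetical three-row excerpt in the same 14-column layout as the raw file.
# Assumption: "?" is the missing-value marker.
sample_csv <- "63,1,4,140,260,0,1,112,1,3,2,?,?,2
44,1,4,130,209,0,1,127,0,0,?,?,?,0
60,1,4,132,218,0,1,140,1,1.5,3,?,?,2"

# I() marks the string as literal data; na = "?" turns "?" into NA
sample_data <- read_csv(I(sample_csv), col_names = FALSE, na = "?",
                        show_col_types = FALSE)

# Same cleaning steps as above, applied to the excerpt
sample_clean <- sample_data |>
    rename(age = X1, trestbps = X4, chol = X5,
           thalach = X8, oldpeak = X10, num = X14) |>
    select(age, chol, thalach, trestbps, oldpeak, num) |>
    mutate(num = as.factor(num))

sample_clean
```

With the predictors stored as numbers, distance-based methods can use them directly. Rows that still contain NA would need to be dropped or imputed before modelling.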

In [7]:
# split data into training set and testing set
longbeach_split <- initial_split(longbeach_select, prop=0.75, strata=num)
longbeach_train <- training(longbeach_split)
longbeach_test <- testing(longbeach_split)

We then split the cleaned data into a training set and a testing set: 75% of the data is used for training and the remaining 25% for testing. This proportion was chosen so that there is enough data to train the model while leaving enough to test it and estimate its accuracy.
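Because `initial_split()` assigns rows at random, the split changes on every run unless a seed is set. The following is a minimal sketch of a reproducible stratified split, using a hypothetical stand-in table (`toy_data`) rather than the real data. The seed value is arbitrary.

```r
library(tidymodels)

set.seed(73)  # arbitrary seed, chosen only for this sketch

# Hypothetical stand-in for longbeach_select: 200 rows, classes 0 to 4
toy_data <- tibble(
    age = round(runif(200, min = 35, max = 75)),
    num = factor(sample(0:4, 200, replace = TRUE))
)

# 75/25 split, stratified so each class is represented in both sets
toy_split <- initial_split(toy_data, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Compare class proportions in the two sets
count(toy_train, num) |> mutate(prop = n / sum(n))
count(toy_test, num) |> mutate(prop = n / sum(n))
```

If stratification works as intended, the class proportions in the two printed tables should be close to each other.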

## Method

We will build the classification model using k-nearest neighbors, training it on longbeach_train and evaluating its accuracy on longbeach_test. The value of k will be tuned through cross-validation to maximize accuracy. Only variables of type double will be used as predictors, not factors, because k-nearest neighbors relies on distances between observations and distance cannot be computed for factor variables.
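The tuning procedure described above could look roughly like the sketch below. It runs on a hypothetical toy training set (`toy_train`) with two numeric predictors and a two-level class. The fold count, the grid of k values, and the standardization step are assumptions made for illustration, not choices stated in this proposal. The "kknn" engine requires the kknn package to be installed.

```r
library(tidymodels)

set.seed(73)  # arbitrary seed for this sketch

# Hypothetical stand-in for longbeach_train
toy_train <- tibble(
    age = rnorm(120, mean = 58, sd = 8),
    trestbps = rnorm(120, mean = 135, sd = 20),
    num = factor(sample(c("0", "1"), 120, replace = TRUE))
)

# Standardize predictors so no single variable dominates the distance
knn_recipe <- recipe(num ~ age + trestbps, data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# neighbors = tune() leaves k open so it can be chosen by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = num)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Estimate accuracy for each candidate k
knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = toy_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy
best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
```

Since the toy classes are assigned at random, the accuracies here carry no meaning. The point is only the shape of the workflow.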

We can produce preliminary visualizations of each predictor variable against the predicted class. We can also visualize the cross-validation tuning by plotting accuracy against the value of k.
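As one possible form for a preliminary visualization, the sketch below draws a boxplot of a single predictor for each class. The data in `toy_train` is hypothetical and randomly generated, so the plot shows the layout only, not any real pattern.

```r
library(tidyverse)

set.seed(73)  # arbitrary seed for this sketch

# Hypothetical stand-in for longbeach_train
toy_train <- tibble(
    age = round(rnorm(100, mean = 58, sd = 8)),
    num = factor(sample(0:4, 100, replace = TRUE))
)

# One box per class shows how the predictor is distributed within each class
age_plot <- ggplot(toy_train, aes(x = num, y = age)) +
    geom_boxplot() +
    labs(x = "Diagnosis (num)", y = "Age (years)")

age_plot
```

A plot of accuracy against k would follow the same pattern, mapping the number of neighbors to the x axis and mean accuracy to the y axis with `geom_point()` and `geom_line()`.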

## Expected Outcomes and Significance

From this research project, we expect to find a relationship between risk factors (specifically, a person's age, cholesterol, maximum heart rate, resting blood pressure, and ST depression) and the overall likelihood that they have heart disease. With this research, we hope to create a model that can predict, with a high level of accuracy, whether a person has heart disease. We will use data taken from the UCI Machine Learning Repository's Heart Disease Dataset to train our model on the 5 risk factors listed above. If our model is successful, it could aid the medical community by supporting fast and easy diagnosis of heart disease and allowing doctors to quickly assist those in need. A doctor could simply input the measurements taken for each of the 5 risk factors, and the model would indicate whether the patient is likely to have heart disease.
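To illustrate the intended use, here is a minimal sketch of fitting a k-nearest neighbors model and predicting the class of one new patient. Everything in it is hypothetical: the training data is random, only two of the five predictors are used, k is fixed at 5, and the recoding of num assumes that 0 means no disease and 1 to 4 mean disease is present.

```r
library(tidymodels)

set.seed(73)  # arbitrary seed for this sketch

# Hypothetical training data; num uses the dataset's 0 to 4 coding
toy_train <- tibble(
    age = rnorm(100, mean = 58, sd = 8),
    chol = rnorm(100, mean = 240, sd = 40),
    num = sample(0:4, 100, replace = TRUE)
) |>
    # Assumption: 0 = disease absent, 1 to 4 = disease present
    mutate(disease = factor(if_else(num == 0, "absent", "present"))) |>
    select(-num)

knn_recipe <- recipe(disease ~ age + chol, data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# k = 5 is a placeholder; in practice it would come from tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    fit(data = toy_train)

# Measurements for one hypothetical new patient
new_patient <- tibble(age = 61, chol = 255)
prediction <- predict(knn_fit, new_patient)
prediction
```

The result is a one-row table whose `.pred_class` column holds the predicted label for the new patient.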

After our model is complete, future questions may be raised on this topic such as:
- Are there other diseases that can be accurately predicted based on a chosen set of measurable risk factors?
- Will it ever be ethical to nearly remove a doctor from the situation and trust a model instead?
- Is it possible to create a model that is 100% accurate?
- How might a model, such as ours, affect the speed at which heart disease is detected and treated, and could this lead to more lives saved in the future?