# DSCI 100 Project Proposal

## **I. Title**
### Categorizing heart disease patients in Cleveland based on Age, Resting Blood Pressure and Maximum Heart Rate
By: Davina Deng, Kevin Peng

--- ---

## **II. Introduction** 
Heart disease, also known as cardiovascular disease, encompasses a range of conditions that affect the heart and blood vessels. Common types include coronary artery disease, arrhythmias and heart failure. Causes usually involve lifestyle factors such as poor diet, physical inactivity and smoking, as well as genetic predisposition. Symptoms can vary widely but usually include chest pain, shortness of breath and fatigue. Preventive measures include maintaining a healthy lifestyle and controlling risk factors such as high blood pressure and high cholesterol.

Elevated resting blood pressure puts extra strain on the artery walls, which over time can damage them and increase the risk of heart disease, stroke and other cardiovascular complications. High blood pressure also forces the heart to work harder to pump blood, which can lead to heart failure if left unchecked.
Similarly, maximum heart rate can give insight into cardiovascular health. A lower maximum heart rate can indicate poorer heart function, while a high heart rate during exercise may signal underlying cardiovascular problems. Resting heart rate and maximum heart rate are important for predicting cardiovascular events and managing long-term heart health (Christofaro et al., 2017).

Based on these connections, our question is: **Are new patients likely to have heart disease based on their resting blood pressure and maximum heart rate?** To answer this, we will use a K-nearest neighbors (K-NN) classification algorithm.

We will use `processed.cleveland.data` from the Heart Disease Database (originally collected by the Cleveland Clinic Foundation) to predict whether or not a patient in Cleveland has heart disease.

The variables (data columns) are as follows:

1. **age**: age
2. **sex**: sex (1 = male, 0 = female)
3. **cp**: chest pain type
4. **trestbps**: resting blood pressure in mmHg
5. **chol**: serum cholesterol in mg/dl
6. **fbs**: fasting blood sugar > 120 mg/dl? (1 = True, 0 = False)
7. **restecg**: resting electrocardiographic results
8. **thalach**: maximum heart rate achieved
9. **exang**: whether exercise induced angina (1 = True, 0 = False)
10. **oldpeak**: ST depression induced by exercise, relative to rest
11. **slope**: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
12. **ca**: number of major vessels (0-3) colored by fluoroscopy
13. **thal**: (3 = normal, 6 = fixed defect, 7 = reversible defect)
14. **num**: diagnosis of heart disease (1, 2, 3, 4 = presence, 0 = absence)

There are 303 rows in total. Each column is numeric-valued, with missing data represented as the string "?".

Based on the list above, we will use `trestbps` and `thalach` as predictors to classify patients by whether or not they have heart disease.
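
A minimal sketch of how such a K-NN classifier could be set up with `tidymodels` follows. It is not the final model for this project: the helper `fit_knn()`, the label column `diagnosis`, the choice `k = 3`, and the `toy` data are all assumptions made for illustration, and the toy values are made up rather than taken from the Cleveland data. The sketch also assumes the `kknn` package is installed.

```r
library(tidymodels)

# Hypothetical helper: fits a K-NN classifier on a data frame with numeric
# columns trestbps and thalach and a factor column diagnosis.
fit_knn <- function(train_data, k = 5) {
  # Standardize both predictors so neither dominates the distance.
  knn_recipe <- recipe(diagnosis ~ trestbps + thalach, data = train_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
    set_engine("kknn") |>
    set_mode("classification")

  workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    fit(data = train_data)
}

# Made-up values for illustration only (not from the Cleveland data).
toy <- tibble(
  trestbps = c(120, 130, 140, 150, 160, 125, 135, 145),
  thalach = c(170, 165, 150, 120, 110, 175, 140, 115),
  diagnosis = factor(c("absent", "absent", "present", "present",
                       "present", "absent", "absent", "present"))
)

knn_fit <- fit_knn(toy, k = 3)

# Predict the class of one hypothetical new patient.
pred <- predict(knn_fit, tibble(trestbps = 128, thalach = 168))
pred
```

Scaling and centering are included because K-NN compares observations by distance, and `trestbps` and `thalach` are measured on different scales.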



## **III. Preliminary exploratory data analysis**

In [10]:
# loading all the packages we need for our proposal
library(tidyverse)
library(utils)
library(tidymodels)
library(rvest)
library(GGally)

# Formatting graphs
options(repr.matrix.max.rows = 6)

Use `read_csv()` to load the processed.cleveland.data dataset from the online directory.

In [20]:
# Source: https://raw.githubusercontent.com/UBC-DSCI/dsci-100-project_template/main/data/heart_disease/processed.cleveland.data

clev_data <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/dsci-100-project_template/main/data/heart_disease/processed.cleveland.data", 
                     col_names = FALSE) 
clev_data

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3
57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1
38,1,3,138,175,0,0,173,0,0.0,1,?,3.0,0
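
The `?` in the last row shown above is how this file marks a missing value, and it is why `X12` and `X13` are parsed as `<chr>`. As a minimal sketch of an alternative (not the approach used in the rest of this proposal), the `na` argument of `read_csv()` can declare that placeholder at load time. The two rows below are copied from the output above; `raw_text` and `demo` are illustrative names.

```r
library(readr)

# Two rows copied from the output above; the second contains a "?".
raw_text <- paste0(
  "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0\n",
  "38,1,3,138,175,0,0,173,0,0.0,1,?,3.0,0\n"
)

# I() tells read_csv() that the string is literal data, not a file path.
# na = "?" converts every "?" cell to NA while parsing.
demo <- read_csv(I(raw_text), col_names = FALSE, na = "?",
                 show_col_types = FALSE)

# Count the missing cells and inspect the parsed column types.
sum(is.na(demo))
sapply(demo, class)
```

With the placeholder declared this way, the column holding the `?` can be parsed as numeric instead of character.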


#### Adding column names
The data frame does not come with column names, so they must be added. Several categorical columns are also read as `<dbl>` or `<chr>` and need to be converted to factors.

In [26]:
#Adding column names
colnames(clev_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

clev_data

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num,NA
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0,FALSE
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2,TRUE
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1,TRUE
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
57,1,4,130,131,0,0,115,1,1.2,2,1,7,3,TRUE
57,0,2,130,236,0,2,174,0,0.0,2,1,3,1,TRUE
38,1,3,138,175,0,0,173,0,0.0,1,,3,0,FALSE
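
One way the conversion of coded columns to factors might be written is sketched below. It uses three rows copied from the output above, restricted to a few columns. The names `toy` and `toy_factors`, the derived `diagnosis` column, and its labels `"absent"` and `"present"` are assumptions for illustration, based on the variable list where any `num` from 1 to 4 means presence of heart disease.

```r
library(dplyr)
library(tibble)

# Three rows copied from the output above, limited to a few columns.
toy <- tibble(
  age = c(63, 67, 38),
  sex = c(1, 1, 1),
  cp = c(1, 4, 3),
  trestbps = c(145, 160, 138),
  num = c(0, 2, 0)
)

toy_factors <- toy |>
  mutate(
    # Treat coded categorical variables as factors rather than numbers.
    across(c(sex, cp), as.factor),
    # Hypothetical binary label: any num from 1 to 4 counts as presence.
    diagnosis = factor(if_else(num > 0, "present", "absent"),
                       levels = c("absent", "present"))
  )

toy_factors
```

Collapsing `num` into two levels matches the yes/no question posed in the introduction. Keeping all five levels would instead turn this into a multi-class problem.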


#### Removing NA values
In this section, we remove the missing values from our raw data set. K-NN relies on distances between observations, and a distance cannot be computed when a predictor value is missing, so removing these rows is necessary before modelling. Here, we first determine the number of missing values, then remove them using `na.omit()` and perform a final check to confirm that no missing values remain. With the cleaned data set, we can move on to building the classification model.

In [27]:
# Recode the "?" placeholder as NA so that missing values can be detected
clev_data[clev_data == "?"] <- NA

# Count the missing values in the variables that will be used for analysis
# (sum() counts the TRUE cells; nrow() would only return the number of rows)
clev_na <- clev_data |>
    select(age, trestbps, thalach) |>
    is.na() |>
    sum()
clev_na

# Remove rows with missing values
clev_no_NA <- clev_data |>
    select(age, trestbps, thalach) |>
    na.omit()
clev_no_NA

# Check that no NA values remain after removal
check <- clev_no_NA |>
    map_df(~ sum(is.na(.x)))
check

age,trestbps,thalach
<dbl>,<dbl>,<dbl>
63,145,150
67,160,108
67,120,129
⋮,⋮,⋮
57,130,115
57,130,174
38,138,173


age,trestbps,thalach
<int>,<int>,<int>
0,0,0


Now the data is tidy: each observation has its own row, each column holds a single variable, and each cell contains one value.

## Bibliography

Cleveland Clinic. (n.d.). Heart disease. https://my.clevelandclinic.org/health/diseases/24129-heart-disease

Christofaro, D. G. D., Casonatto, J., Vanderlei, L. C. M., Cucato, G. G., & Dias, R. M. R. (2017, May). Relationship between resting heart rate, blood pressure and pulse pressure in adolescents. Arquivos Brasileiros de Cardiologia. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5444886/

Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X

The creators of the Heart Disease Database are:

1. Andras Janosi
2. William Steinbrunn
3. Matthias Pfisterer
4. Robert Detrano