# Title: Analysis of Resting Blood Pressure (mm Hg) and Maximum Heart Rate (bpm) for Heart Disease Severity Classification

# Introduction

For my final project, I will use a modified version of the UCI Heart Disease dataset, focusing on the Cleveland database. The UCI Heart Disease dataset combines data from four sources: Cleveland, Hungary, Switzerland, and the VA Long Beach. I chose the Cleveland database because it has a sufficient number of observations. Heart disease is a major global health concern and is responsible for a significant portion of deaths worldwide. Early detection and accurate classification of heart disease are vital for effective treatment. Additionally, being able to predict the severity of heart disease can help healthcare professionals prioritize interventions and tailor treatments to individual patients' needs. The Cleveland data frame includes the variables "trestbps" and "thalach", which I will use to predict my variable of interest, "num". In this data frame, "trestbps" is the resting blood pressure (mm Hg) on admission to the hospital, "thalach" is the maximum heart rate achieved (beats per minute, bpm), and "num" is the diagnosis of heart disease, where 0 means no heart disease and 1-4 are the levels of heart disease[1]. I classify these levels as 1 = Mild, 2 = Moderate, 3 = Severe, and 4 = Life threatening[2]. With these variables, I will answer the question: Can we classify the extent of heart disease using resting blood pressure and maximum heart rate achieved?

[1]: Source from Piazza question @555

[2]: For the levels of heart disease, I refer to the document https://www.rigshospitalet.dk/afdelinger-og-klinikker/kraeft-og-organsygdomme/blodsygdomme/forskning/forsoegsbehandling/Documents/Lymfomer/Triangle/Triangle-SAE.pdf

# Methods and Results:

The dataset will be loaded into my Jupyter notebook with the read_csv() function in R so that I can easily manipulate and tidy the data. Each variable will then be converted to factor or numeric according to the description of the original dataset. The analysis will use the "trestbps", "thalach", and "num" columns, which represent resting blood pressure (mm Hg), maximum heart rate achieved (bpm), and the diagnosis of heart disease, respectively. This analysis aims to understand how maximum heart rate and resting blood pressure relate to the different diagnoses of heart disease.
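As a minimal sketch of the column selection described above, the block below keeps only the three columns of interest and summarizes them per diagnosis. It is self-contained: `heart_rows` is a hypothetical name, and its six rows are copied from the wrangled output displayed later in this report, so the summary only illustrates the mechanics and says nothing about the full dataset.

```r
library(tidyverse)

# Six rows copied from the wrangled output shown later in this report
# (illustration only; far too few rows for any real conclusion).
heart_rows <- tibble(
  age = c(63, 67, 67, 68, 57, 57),
  trestbps = c(145, 160, 120, 144, 130, 130),
  thalach = c(150, 108, 129, 141, 115, 174),
  num = factor(c("No heart disease", "Moderate", "Mild",
                 "Moderate", "Serious", "Mild"))
)

# Keep only the two predictors and the class variable
heart_selected <- heart_rows |>
  select(trestbps, thalach, num)

# Count the observations and average each predictor within each diagnosis
class_summary <- heart_selected |>
  group_by(num) |>
  summarize(count = n(),
            mean_trestbps = mean(trestbps),
            mean_thalach = mean(thalach))
class_summary
```

On the full data, the same summary would show how many observations fall into each diagnosis, which matters for k-NN because rare classes are harder to predict.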

The prediction will be made with a classification model. Both predictor variables are quantitative, and the class being predicted is categorical. I will use the k-nearest neighbors (k-NN) algorithm because it does not require the relationship between the predictors and the class to have any specific shape. I will split the dataset into training and testing sets. To find the optimal number of neighbors, I will train k-NN models on the training set with different values of k and compare them using cross-validation, which estimates how well each model generalizes to unseen data. Lastly, I will use the classifier trained on the training set to classify the heart disease diagnoses in the testing set. I will then compute the accuracy, precision, and recall of the classifier on the testing set and analyze the confusion matrix to understand how well the model predicts each class of heart disease.
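The plan above could be sketched with tidymodels roughly as follows. This is an illustration under stated assumptions, not the final analysis: `toy_heart` is simulated data with made-up values and only three classes, and the 75/25 split, 5 folds, and grid of k values are arbitrary choices. The `kknn` package is assumed to be installed as the k-NN engine.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Simulated stand-in data (NOT the Cleveland data; values are made up)
toy_heart <- tibble(
  num = factor(rep(c("No heart disease", "Mild", "Moderate"), each = 50)),
  trestbps = rnorm(150, mean = rep(c(125, 135, 145), each = 50), sd = 10),
  thalach = rnorm(150, mean = rep(c(160, 145, 130), each = 50), sd = 15)
)

# Split into training and testing sets, stratified by the class
heart_split <- initial_split(toy_heart, prop = 0.75, strata = num)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

# Standardize both predictors so neither dominates the distance calculation
heart_recipe <- recipe(num ~ trestbps + thalach, data = heart_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# k-NN specification with the number of neighbors left to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation on the training set over a grid of k values
heart_vfold <- vfold_cv(heart_train, v = 5, strata = num)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = heart_vfold, grid = k_grid,
            metrics = metric_set(accuracy)) |>
  collect_metrics()

# Pick the k with the highest cross-validated accuracy
best_k <- cv_results |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)

# Refit on the whole training set with the chosen k
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_spec) |>
  fit(data = heart_train)

# Classify the testing set and evaluate the predictions
heart_predictions <- predict(knn_fit, heart_test) |>
  bind_cols(heart_test)

class_metrics <- metric_set(accuracy, precision, recall)
test_metrics <- heart_predictions |>
  class_metrics(truth = num, estimate = .pred_class)
test_confusion <- heart_predictions |>
  conf_mat(truth = num, estimate = .pred_class)

test_metrics
test_confusion
```

With more than two classes, yardstick reports precision and recall as macro averages by default, so the confusion matrix is still needed to see how each individual class is predicted.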

To visualize the result, I will use a scatter plot of the standardized variables. Standardizing puts both variables on the same scale, so the classifier does not treat one of them as more important than the other. Each heart disease diagnosis will be shown in a different color for clear identification.
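A sketch of such a plot is shown below. It assumes the hypothetical name `heart_rows` for six rows copied from the wrangled output displayed later in this report, and it standardizes each variable with base R's `scale()`; the real plot would use the full dataset.

```r
library(tidyverse)

# Six rows copied from the wrangled output shown later in this report
heart_rows <- tibble(
  trestbps = c(145, 160, 120, 144, 130, 130),
  thalach = c(150, 108, 129, 141, 115, 174),
  num = factor(c("No heart disease", "Moderate", "Mild",
                 "Moderate", "Serious", "Mild"))
)

# Standardize: subtract the mean and divide by the standard deviation
heart_scaled <- heart_rows |>
  mutate(trestbps_scaled = as.numeric(scale(trestbps)),
         thalach_scaled = as.numeric(scale(thalach)))

# Scatter plot of the standardized predictors, colored by diagnosis
heart_plot <- ggplot(heart_scaled,
                     aes(x = trestbps_scaled, y = thalach_scaled, color = num)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(x = "Resting blood pressure (standardized)",
       y = "Maximum heart rate achieved (standardized)",
       color = "Diagnosis") +
  theme(text = element_text(size = 12))
heart_plot
```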

# This step is to load the packages I need to use to tidy the dataset:

In [2]:
library(tidyverse)
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6) #limits output of dataframes to 6 rows

# Loading Data from the original source

In [7]:
url <- "https://raw.githubusercontent.com/UBC-DSCI/dsci-100-project_template/main/data/heart_disease/processed.cleveland.data"
cleveland_data <- read_csv(url, col_names = FALSE)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# Data Wrangling:

In [8]:
# First, assign the column names of each column based on the UCI Website:
# https://archive.ics.uci.edu/dataset/45/heart+disease
colnames(cleveland_data) <- c('age','sex','cp','trestbps','chol','fbs', 'restcg', 
                              'thalach','exang','oldpeak','slope','ca','thal','num')

# Second, convert each column to the correct type according to the website
# that the original dataset comes from. ca and thal are read as text because
# missing values are recorded as "?": as.numeric() turns "?" in ca into NA
# (with a coercion warning), and na_if() turns "?" in thal into NA so that it
# does not survive as a factor level.
cleveland_clean <- cleveland_data |>
    mutate(age = as.numeric(age), sex = as_factor(sex), cp = as_factor(cp),
           trestbps = as.numeric(trestbps), chol = as.numeric(chol), fbs = as_factor(fbs),
           restcg = as_factor(restcg), thalach = as.numeric(thalach), exang = as_factor(exang),
           oldpeak = as.numeric(oldpeak), slope = as.numeric(slope), ca = as.numeric(ca),
           thal = as_factor(na_if(thal, "?")), num = as.factor(num)) |>
    # Third, recode sex from numbers (1 and 0) to words (male and female) with
    # fct_recode() to make the data more readable.
    mutate(sex = fct_recode(sex, "male" = "1", "female" = "0"),
    # Fourth, recode the heart disease levels from numbers to readable
    # descriptions of the diagnosis, again with fct_recode().
           num = fct_recode(num, "No heart disease" = "0", "Mild" = "1", "Moderate" = "2",
                            "Serious" = "3", "Life threatening" = "4")) |>
    # Fifth, use drop_na() to remove rows containing missing (NA) values.
    drop_na()
cleveland_clean

[1m[22m[36mℹ[39m In argument: `ca = as.numeric(ca)`.
[33m![39m NAs introduced by coercion”


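| age | sex | cp | trestbps | chol | fbs | restcg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` |
| 63 | male | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6.0 | No heart disease |
| 67 | male | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3.0 | Moderate |
| 67 | male | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7.0 | Mild |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 68 | male | 4 | 144 | 193 | 1 | 0 | 141 | 0 | 3.4 | 2 | 2 | 7.0 | Moderate |
| 57 | male | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1 | 7.0 | Serious |
| 57 | female | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1 | 3.0 | Mild |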
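The coercion warning in the wrangling step comes from missing values that are recorded as "?" in the raw file. One possible alternative, sketched here as an assumption and not as the method used in this report, is to tell `read_csv()` about the marker through its `na` argument so that the affected columns are parsed as numbers from the start. The three lines in `toy_lines` are made up in the same 14-column layout as the raw file, and `toy_lines`, `heart_names`, and `toy_data` are hypothetical names.

```r
library(tidyverse)

# Three made-up lines in the same 14-column layout as processed.cleveland.data
# (hypothetical values; "?" marks a missing entry)
toy_lines <- paste(
  "60.0,1.0,4.0,130.0,250.0,0.0,2.0,140.0,1.0,1.5,2.0,?,7.0,2",
  "55.0,0.0,2.0,125.0,210.0,0.0,0.0,160.0,0.0,0.0,1.0,0.0,?,0",
  "48.0,1.0,3.0,118.0,200.0,0.0,0.0,170.0,0.0,0.5,1.0,1.0,3.0,1",
  sep = "\n"
)

heart_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restcg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# I() marks the string as literal data; na = "?" turns every "?" into NA
toy_data <- read_csv(I(toy_lines), col_names = heart_names, na = "?",
                     show_col_types = FALSE)

# ca and thal are now numeric, and drop_na() removes the incomplete rows
toy_complete <- toy_data |>
  drop_na()
toy_complete
```

Applied to the real file, the same `col_names` and `na` arguments would be passed together with the URL, and no `as.numeric()` coercion of `ca` would be needed.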


# Discussion:

1. summarize what you found2. 
discuss whether this is what you expected to find
3. 
discuss what impact could such findings hav
4. discuss what future questions could this lead to?eto?

# References: