# Heart Disease Prediction

## 1. Introduction

**Research Question:**  *How accurately can the health factors ___ predict if a person has heart disease?*

The phrase "heart disease" refers to a range of conditions affecting the heart. Milder symptoms include chest pain and shortness of breath; severe outcomes include heart attack and heart failure. Many health attributes can affect the likelihood that a person has heart disease, so it would be useful for doctors to be able to use these attributes to predict whether an individual has it.

This project examines the Heart Disease Data Set from the UCI Machine Learning Repository. The repository provides datasets from four different hospitals, but according to the page description the Cleveland database is the only one that has been used by ML researchers, so it is the one we analyze. The goal of this project is to determine how accurately heart disease can be predicted from selected factors. The dataset provides numerous factors, such as age and maximum heart rate; the full list appears under Attribute Information.

## 2. Preliminary Data Analysis

### Loading Libraries

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

### Importing Data

In [2]:
link <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
untidy_cleveland <- read_csv(link, col_names = FALSE)
untidy_cleveland

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


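| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> | <dbl> |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1.0 | 7.0 | 3 |
| 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1.0 | 3.0 | 1 |
| 38 | 1 | 3 | 138 | 175 | 0 | 0 | 173 | 0 | 0.0 | 1 | ? | 3.0 | 0 |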


Figure 1. Untidy tibble of Cleveland heart disease data

### Data Cleaning 

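For this data to be tidy, the columns must be given meaningful names, rows containing "?" (the marker this file uses for a missing value) must be removed, columns X12 and X13 (renamed `ca` and `thal` below, and read in as character because of the "?" entries) must be converted to double, and the rightmost column, `num`, must be converted to a factor.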
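
Before dropping rows, it can help to confirm which columns actually contain the "?" marker. The following is a minimal sketch, assuming a small stand-in tibble (`sample_rows`, a hypothetical name) built from the last three rows shown in Figure 1 and restricted to columns X11 to X14:

```r
library(dplyr)
library(tibble)

# Stand-in for untidy_cleveland: last three rows of Figure 1, columns X11-X14 only.
sample_rows <- tibble(
  X11 = c(2, 2, 1),
  X12 = c("1.0", "1.0", "?"),
  X13 = c("7.0", "3.0", "3.0"),
  X14 = c(3, 1, 0)
)

# Count the "?" entries in every character column.
question_counts <- sample_rows |>
  summarise(across(where(is.character), ~ sum(.x == "?")))

question_counts
```

Run against the full tibble, the same `summarise()` call would show how many rows each character column contributes to the rows that are removed.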

In [15]:
# Rename the columns using the attribute names from the UCI data set description
colnames(untidy_cleveland) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

tidy_cleveland <- untidy_cleveland |>
    filter(ca != "?", thal != "?") |>   # drop rows where ca or thal is "?" (missing)
    mutate(ca = as.numeric(ca), thal = as.numeric(thal)) |>  # safe to convert to <dbl> once the "?" rows are gone
    mutate(num = as.factor(num))   # num is categorical, so store it as a factor rather than <dbl>

tidy_cleveland

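| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 68 | 1 | 4 | 144 | 193 | 1 | 0 | 141 | 0 | 3.4 | 2 | 2 | 7 | 2 |
| 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1 | 7 | 3 |
| 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1 | 3 | 1 |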


Figure 2. Tidy tibble of Cleveland heart disease data
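
An alternative route to the same kind of result, offered as a sketch rather than as the approach used in this notebook, is to tell `read_csv()` to treat "?" as missing so that the affected columns parse as numeric directly, and then drop incomplete rows with `drop_na()`. The inline text is the last three rows of Figure 1; `raw_text` and `cleaned` are hypothetical names.

```r
library(readr)
library(tidyr)

# Last three rows of Figure 1, passed as literal text via I() (readr >= 2.0).
raw_text <- I(paste(
  "57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3",
  "57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1",
  "38,1,3,138,175,0,0,173,0,0.0,1,?,3.0,0",
  sep = "\n"
))

# na = "?" turns the marker into NA, so X12 is read as <dbl> instead of <chr>.
cleaned <- read_csv(raw_text, col_names = FALSE, na = "?", show_col_types = FALSE) |>
  drop_na()   # removes the row whose X12 value was "?"

cleaned
```

With this route the later `as.numeric()` conversions would no longer be needed, at the cost of relying on the reader to interpret the missing-value marker.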

#### Attribute Information

The column names are taken from the data set description on the UCI Machine Learning Repository page.

**age:** age in years

**sex:** sex (1 = male, 0 = female)

**cp:** chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)

**trestbps:** resting blood pressure (in mmHg)

**chol:** serum cholesterol (in mg/dl)

**fbs:** fasting blood sugar > 120 mg/dl (1 = true,  0 = false)

**restecg:** resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy)

**thalach:** maximum heart rate achieved

**exang:** exercise induced angina (1 = yes, 0 = no)

**oldpeak:** ST depression induced by exercise relative to rest

**slope:** slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)

**ca:** number of major vessels (0-3) colored by fluoroscopy

**thal:** 3 = normal, 6 = fixed defect, 7 = reversible defect

**num:** diagnosis of heart disease (0 = no heart disease, 1-4 = heart disease present)
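
Because the research question asks only whether a person has heart disease, one option is to collapse `num` into a two-level factor. This is a sketch under that assumption, not a step taken from the analysis above; `example_rows`, `binary_rows`, and `diagnosis` are hypothetical names, and the `num` values are those of the six rows visible in Figure 2.

```r
library(dplyr)
library(tibble)

# num values from the six rows visible in Figure 2, stored as a factor like tidy_cleveland$num.
example_rows <- tibble(num = factor(c(0, 2, 1, 2, 3, 1)))

# Level "0" becomes "absent"; every other level becomes "present".
binary_rows <- example_rows |>
  mutate(diagnosis = factor(if_else(num == "0", "absent", "present"),
                            levels = c("absent", "present")))

count(binary_rows, diagnosis)
```

Keeping the original `num` column alongside the new one leaves the finer-grained diagnosis available if it is needed later.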

### Splitting into Training and Testing Sets
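
A minimal sketch of how this step might look with `initial_split()` from rsample (loaded as part of tidymodels). The 75/25 proportion, the seed value, and stratifying on `num` are assumptions rather than choices stated in this document, and `toy_data` is a made-up placeholder, not the Cleveland data.

```r
library(rsample)
library(tibble)

set.seed(1)  # arbitrary seed so the random split is reproducible

# Placeholder stand-in for tidy_cleveland: invented values, two classes of 40 rows each.
toy_data <- tibble(
  age = rep(c(45, 55, 65, 75), times = 20),
  num = factor(rep(c(0, 1), each = 40))
)

# Hold out roughly a quarter of the rows, keeping class proportions similar in both sets.
data_split <- initial_split(toy_data, prop = 0.75, strata = num)
train_data <- training(data_split)
test_data  <- testing(data_split)

nrow(train_data)
nrow(test_data)
```

Splitting before any exploratory analysis or model tuning keeps the test set untouched until the final accuracy estimate.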

### Data Analysis

## 3. Methods

## 4. Expected outcomes and significance

## 5. Sources

Detrano, R. (1990). *Cleveland* [Data set]. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation.      
     https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data