# DSCI 100 Group 11 Project Proposal 

Columns to use: 3,4,9,10,12,13,17,18,58

# 1. Introduction

Heart disease is a common cause of death among all groups of people in the United States. Both genetic and environmental factors can contribute to the likelihood of developing the condition.

Multiple risk factors affect the likelihood of heart disease, including age, blood pressure, and cholesterol. Older people are more likely to be diagnosed with heart disease (Rogers et al., 2019). According to the CDC, high cholesterol levels lead to plaque formation in blood vessels, which makes them narrower and less flexible. This restricts blood circulation around the body and can lead to irregular heartbeats and, ultimately, death. Furthermore, high blood pressure causes arteries to lose their elasticity, which in turn reduces blood flow to the heart.

Using the Heart Disease dataset from the UCI Machine Learning Repository, we will use three predictors (age, resting blood pressure, and cholesterol) to classify each observation as either <50% diameter narrowing, which the dataset treats as the absence of heart disease, or >50% diameter narrowing, which indicates the presence of heart disease.

By comparing a new patient's values for these three measurements against the existing observations, the resulting classifier could help anticipate that patient's risk level.


# 2. Preliminary Data Analysis

In [1]:
## Load the libraries used for data analysis
library(repr)
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0


In [4]:
## Raw data URL
raw_data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

## Read the data from the URL (the file has no header row) and assign column names
raw_data <- read_csv(raw_data_url,
                     col_names = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                                   "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"))

## Tidy the data: keep the three predictors and the diagnosis, give them readable names,
## and recode the diagnosis (0 = Absent, any other value = Present) as a factor
tidy_data <- raw_data |>
    select(age, trestbps, chol, num) |>
    rename(resting.blood.pressure = trestbps,
           cholestrol = chol,
           likelihood = num) |>
    mutate(likelihood = as.factor(case_when(likelihood == 0 ~ "Absent",
                                            likelihood != 0 ~ "Present")))
tidy_data

[1mRows: [22m[34m303[39m [1mColumns: [22m[34m14[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (2): ca, thal
[32mdbl[39m (12): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


| age (dbl) | resting.blood.pressure (dbl) | cholestrol (dbl) | likelihood (fct) |
|---|---|---|---|
| 63 | 145 | 233 | Absent |
| 67 | 160 | 286 | Present |
| 67 | 120 | 229 | Present |
| 37 | 130 | 250 | Absent |
| 41 | 130 | 204 | Absent |
| 56 | 120 | 236 | Absent |
| 62 | 140 | 268 | Present |
| 57 | 120 | 354 | Absent |
| 63 | 130 | 254 | Present |
| 53 | 140 | 203 | Present |


In [5]:
set.seed(1) ## DO NOT CHANGE

## splitting into training and testing data
data_split <- initial_split(tidy_data, prop = 0.75, strata = likelihood)
data_train <- training(data_split)
data_test <- testing(data_split)

In [6]:
data_count <- data_train |> 
        group_by(likelihood) |>
        summarize(count = n())
data_count

| likelihood (fct) | count (int) |
|---|---|
| Absent | 123 |
| Present | 104 |


In [7]:
## Average values of all predictors for observations labelled "Absent"

data_absent_mean <- data_train |>
    filter(likelihood == "Absent") |>
    summarize(across(age:cholestrol, mean, .names = "mean.{.col}"))

data_absent_mean

| mean.age (dbl) | mean.resting.blood.pressure (dbl) | mean.cholestrol (dbl) |
|---|---|---|
| 52.96748 | 130.4553 | 243.9106 |


# 3. Methods

From the data set, we will use four variables: age, resting blood pressure, cholesterol, and the diagnosis of heart disease. We will plot the three predictors against one another and colour the points by class (<50% diameter narrowing or >50% diameter narrowing). In the data, the diagnosis is recorded as a number: 0 indicates the absence of heart disease and any other value indicates its presence, so we relabel these as "Absent" and "Present". When we receive a new data point, we will plot it against the existing observations and assign it the class of the closest data point.
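
The closest-point idea can be sketched in base R. This is only a minimal illustration under stated assumptions, not the final analysis: it uses just the first ten rows printed in the preliminary analysis rather than the full training set, the new patient's values are invented for the example, and the predictors are standardized first (an assumption, made because an unscaled distance would be dominated by whichever variable has the largest numeric range).

```r
# First ten rows of tidy_data as printed in the preliminary analysis
# (illustration only; the real analysis would use data_train)
toy <- data.frame(
  age = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  resting.blood.pressure = c(145, 160, 120, 130, 130, 120, 140, 120, 130, 140),
  cholestrol = c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203),
  likelihood = factor(c("Absent", "Present", "Present", "Absent", "Absent",
                        "Absent", "Present", "Absent", "Present", "Present"))
)

# Hypothetical new patient (values invented for this example)
new_patient <- data.frame(age = 60, resting.blood.pressure = 150, cholestrol = 260)

predictors <- c("age", "resting.blood.pressure", "cholestrol")

# Standardize using the means and standard deviations of the existing observations
centers <- colMeans(toy[, predictors])
scales <- apply(toy[, predictors], 2, sd)
train_scaled <- scale(toy[, predictors], center = centers, scale = scales)
new_scaled <- scale(new_patient[, predictors], center = centers, scale = scales)

# Euclidean distance from the new patient to every existing observation
distances <- sqrt(rowSums(sweep(train_scaled, 2, as.numeric(new_scaled))^2))

# Assign the class of the single closest observation
nearest <- which.min(distances)
predicted <- as.character(toy$likelihood[nearest])
predicted
```

Relying on a single closest point makes the prediction sensitive to individual unusual observations; taking a majority vote over several nearby points is a common alternative worth considering.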

To illustrate our observations, we will use scatterplots. To keep the visualizations clear, we will describe and explain each one in an accompanying markdown cell.
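
A scatterplot of this kind might look like the following sketch. It assumes ggplot2 is available and again uses only the first ten rows printed in the preliminary analysis; the choice of age and cholesterol as the two axes, and the name `risk_plot`, are illustrative rather than a decision made in this proposal.

```r
library(ggplot2)

# First ten rows of tidy_data as printed in the preliminary analysis
# (illustration only; the real plot would use data_train)
toy <- data.frame(
  age = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  resting.blood.pressure = c(145, 160, 120, 130, 130, 120, 140, 120, 130, 140),
  cholestrol = c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203),
  likelihood = factor(c("Absent", "Present", "Present", "Absent", "Absent",
                        "Absent", "Present", "Absent", "Present", "Present"))
)

# Age against cholesterol, with points coloured by diagnosis class
risk_plot <- ggplot(toy, aes(x = age, y = cholestrol, colour = likelihood)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(x = "Age (years)",
       y = "Serum cholesterol (mg/dl)",
       colour = "Heart disease") +
  theme(text = element_text(size = 14))

risk_plot
```

The same pattern applies to the other predictor pairs by swapping the `x` and `y` aesthetics.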


# 4. Expected Outcomes and Significance 
We expect to find that people with higher cholesterol, higher blood pressure, and greater age are at a higher risk of heart disease than those with lower values.

Using these findings, we can encourage people to maintain a healthy lifestyle to help prevent cardiovascular disease.

A question for future work is how much genetic and environmental factors each contribute to the probability of developing the disease.


# 5. Citations

- Centers for Disease Control and Prevention. (2022, September 8). Heart disease and stroke. Retrieved March 2, 2023, from https://www.cdc.gov/chronicdisease/resources/publications/factsheets/heart-disease-stroke.htm
- Centers for Disease Control and Prevention. (2021, May 18). High blood pressure symptoms and causes. Retrieved March 2, 2023, from https://www.cdc.gov/bloodpressure/about.htm