In [10]:
library(tidyverse)
library(tidymodels)
set.seed(999)

Heart (cardiovascular) disease is a term covering a wide range of conditions that directly affect the heart, its associated blood vessels, and the muscles surrounding it (Heart and Stroke Foundation Canada, n.d.). These conditions can impair the function of the heart in the short or long term, and their effects can spread to other internal organs. One of these conditions, coronary artery disease, is common in the United States; in it, the patient's blood vessels narrow and restrict the amount of blood supplying the heart. Many factors may influence the likelihood of developing coronary artery disease or another form of cardiovascular disease, including, but not limited to, fasting blood sugar, cholesterol, and resting blood pressure.   

High resting blood pressure is among the leading causes of cardiovascular disease, which can result in stroke. It damages the lining of the arteries, which increases the probability of plaque buildup that narrows the arteries leading to the heart. Additionally, cholesterol from increased intake can build up inside the blood vessels and restrict blood flow to the heart, brain, lungs, and kidneys (Centers for Disease Control and Prevention, 2022). Similarly, studies have identified fasting blood sugar as an underlying predictor of mortality from heart disease and of its effects on the heart (National Library of Medicine, 2013). 

The objective of this project is to classify patients by their potential risk of developing heart disease.  

The question we will be addressing is: How likely is a patient to be at risk for heart disease, based on their cholesterol, fasting blood sugar, and resting blood pressure? 

**THIS CELL WILL BE DELETED ONCE WE ARE SURE WE HAVE THE DATA WE NEED**  
Columns: 
      1. #3  (age)       
      2. #4  (sex)       
      3. #9  (cp)        
      4. #10 (trestbps)  
      5. #12 (chol)      
      6. #16 (fbs)       
      7. #19 (restecg)   
      8. #32 (thalach)   
      9. #38 (exang)     
      10. #40 (oldpeak)   
      11. #41 (slope)     
      12. #44 (ca)        
      13. #51 (thal)      
      14. #58 (num)       (the predicted attribute)
      
1. Age in years.
2. Sex of Patient: 1 = Male 0 = Female
3. CP:  chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11.  slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14.  num: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing
        (in any major vessel: attributes 59 through 68 are vessels)


The columns in the data frame represent the following:   
resting_bp: Resting blood pressure in mmHg on admission to the hospital.

fast_bp: Fasting blood sugar indicator, based on the value in milligrams per deciliter (mg/dL) of blood.  
                        0 = Below 120 mg/dL  
                        1 = Above 120 mg/dL

cholesterol: Serum cholesterol in milligrams per deciliter (mg/dL) of blood. High cholesterol is considered to be over 240 mg/dL. 

heart_disease: Presence of heart disease in general.  
                        0 = No  
                        1 = Yes
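
To make the 0/1 coding concrete, here is a minimal sketch on made-up measurements. The `raw` tibble and the `fbs_indicator` and `high_cholesterol` columns are hypothetical names used only for illustration; the 120 mg/dL and 240 mg/dL cutoffs are the ones stated in the descriptions above.

```r
library(dplyr)

# Hypothetical raw measurements (not taken from the heart data)
raw <- tibble(
    fasting_sugar_mg_dl = c(95, 130, 118),
    cholesterol = c(212, 250, 240)
)

coded <- raw |>
    mutate(
        # 1 when fasting blood sugar is above 120 mg/dL, otherwise 0
        fbs_indicator = as.integer(fasting_sugar_mg_dl > 120),
        # TRUE when serum cholesterol is over 240 mg/dL
        high_cholesterol = cholesterol > 240
    )

coded
```

In the data file itself the fasting blood sugar column already arrives coded as 0 or 1, so this step is only meant to show what the coded values stand for.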


Below, we read in the heart disease data file (heart.csv) and tidy it into a table containing only the columns we want to use. The column names and the values in the heart_disease column have been renamed for ease of understanding.

In [39]:
heart_data <- as_tibble(read.csv("heart.csv")) |>
    select(trestbps, fbs, chol, target) |>

    rename(resting_bp = trestbps,
           fast_bp = fbs,
           cholesterol = chol, 
           heart_disease = target) |>

    mutate(heart_disease = as_factor(heart_disease)) |>
    mutate(heart_disease = fct_recode(heart_disease, "Yes" = "1", "No" = "0"))
    
head(heart_data)

resting_bp,fast_bp,cholesterol,heart_disease
<int>,<int>,<int>,<fct>
125,0,212,No
140,1,203,No
145,0,174,No
148,0,203,No
138,1,294,No
100,0,248,Yes
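
The recoding step can be seen in isolation on a short made-up vector. This is a sketch only: `target` here is a hypothetical stand-in for the column in the data file, and the point is that `as_factor()` followed by `fct_recode()` turns the 0/1 codes into labelled levels.

```r
library(forcats)

# Hypothetical 0/1 codes standing in for the target column
target <- c(0L, 1L, 1L, 0L)

# Convert to a factor, then give the levels readable labels
heart_disease <- fct_recode(as_factor(target), "Yes" = "1", "No" = "0")

levels(heart_disease)
heart_disease
```

Relabelling does not reorder the levels, so "No" (originally 0) stays first and "Yes" (originally 1) stays second, which matters for any later step that relies on row or level order.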


The tidy data is then split into training and testing data sets. The percentage of heart_data rows that fall in training_heart is also calculated to confirm the splitting ratio.

In [37]:
# The proportion argument of initial_split() is named `prop`, not `props`
heart_data_split <- initial_split(heart_data, prop = 0.75, strata = heart_disease)
training_heart <- training(heart_data_split)
testing_heart <- testing(heart_data_split)

split_percent <- round(nrow(training_heart)/ (nrow(training_heart) + nrow(testing_heart)) * 100)

split_percent
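
The same check can be tried on made-up data. The sketch below assumes a hypothetical `toy` data frame with a two-level `outcome` column; it is not the heart data. A stratified split with `prop = 0.75` should leave roughly three quarters of the rows in the training set.

```r
library(rsample)

set.seed(999)

# Hypothetical toy data: 200 rows with a two-level outcome
toy <- data.frame(
    x = rnorm(200),
    outcome = factor(rep(c("No", "Yes"), times = c(80, 120)))
)

# Stratified 75/25 split on the outcome column
toy_split <- initial_split(toy, prop = 0.75, strata = outcome)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of all rows that ended up in the training set, as a percentage
toy_split_percent <- round(nrow(toy_train) / nrow(toy) * 100)
toy_split_percent
```

Because the split is made within each stratum, the result can differ from exactly 75 by a small amount when the class sizes do not divide evenly.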

All three data sets (heart_data, training_heart, and testing_heart) are summarized, and the ratio of negative ("No") to positive ("Yes") values in the heart_disease column is calculated for each to check that the split data sets are consistent with the original tidy data. 

In [57]:
# Ratio of "No" to "Yes" counts in each data set. Despite the *_percent names,
# these values are ratios, not percentages.
# TODO (Doug): report these as percentages instead of ratios.
heart_data_percent <- heart_data |>
    group_by(heart_disease) |>
    summarize(count = n()) |>
    summarize(heart_data_percent = count[1] / count[2])

training_heart_percent <- training_heart |>
    group_by(heart_disease) |>
    summarize(count = n()) |>
    summarize(training_heart_percent = count[1] / count[2])

testing_heart_percent <- testing_heart |>
    group_by(heart_disease) |>
    summarize(count = n()) |>
    summarize(testing_heart_percent = count[1] / count[2])

# Print the three ratios
heart_data_percent
training_heart_percent
testing_heart_percent

heart_data_percent
<dbl>
0.9486692


training_heart_percent
<dbl>
0.9492386


testing_heart_percent
<dbl>
0.9469697
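
The values above are ratios of "No" counts to "Yes" counts rather than percentages. One possible way to report the class balance as percentages is sketched below on a small made-up tibble; `toy_heart` and `toy_percent` are hypothetical names, and the same pipeline could be pointed at any of the three data sets.

```r
library(dplyr)

# Hypothetical toy data: 3 "No" rows and 5 "Yes" rows
toy_heart <- tibble(
    heart_disease = factor(c("No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"))
)

toy_percent <- toy_heart |>
    group_by(heart_disease) |>
    summarize(count = n()) |>
    # summarize() drops the single grouping level, so sum(count) is the total
    mutate(percent = 100 * count / sum(count))

toy_percent
```

Comparing the `percent` column across the full, training, and testing sets gives the same consistency check as the ratios, in a form that is easier to read.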


