# DSCI 100 Project Report: Heart Disease and Age, Sex, and Cholesterol Levels

#### Beth Koschel, 

## Introduction

**Background:** Heart disease is a term that encompasses several different types of heart conditions (1). The most common is coronary artery disease (CAD), which can reduce blood flow to the heart and result in a heart attack (1). Because heart disease is a prominent cause of mortality in Canada, it is important to investigate the factors that may contribute to its development (2).

**Question:** We want to know if age, sex, and cholesterol levels might play a role in the presence or absence of heart disease.

**ID and Describe the dataset used:** The dataset we use to answer this question is the Cleveland heart disease dataset, provided through the UC Irvine Machine Learning Repository (3). It contains 14 attributes, including age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar level, resting electrocardiograph results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, the number of major vessels (0-3) colored by fluoroscopy, and the diagnosis of heart disease.

## Methods

#### Preprocessing

Imported the libraries and the processed.cleveland.data dataset.

Cleaned and tidied the data to make it usable by assigning column names and types and adding a new column, diagnosis.

Split the data into training and testing sets, working only with the training set until the very end.

Summarized the training set to inform how we build the classifier.

Visualized the relationship between thalach and chol to get a deeper understanding of how the data are distributed.

##### Importing Libraries

In [52]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)


##### Importing the Data

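The read_delim function was used to import the data (processed.cleveland.data). The file has no header row, so col_names = FALSE is set and the columns are temporarily named X1 to X14.

In [53]:
# reading the data from data/processed.cleveland.data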
heart_data <- read_delim("data/processed.cleveland.data", delim=",", col_names = FALSE)

head(heart_data)

[1mRows: [22m[34m303[39m [1mColumns: [22m[34m14[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (2): X12, X13
[32mdbl[39m (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
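
As a possible alternative to renaming and cleaning in later steps, the column names and the missing-value marker could be supplied at import time. The sketch below is an illustration, not the approach used in this report. It reads two rows from an inline string through `I()` (which assumes readr 2.0 or later); the second row is a made-up example with "?" in the ca field, and `heart_cols` and `heart_sample` are hypothetical names.

```r
library(readr)

# Column names in the order used by the rename step in this report
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Two illustrative rows; the "?" in the second row is made up for this example
sample_text <- paste0(
  "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0\n",
  "67,1,4,160,286,0,2,108,1,1.5,2,?,3.0,2\n"
)

# na = "?" turns the marker into NA on import, so ca is parsed as a number
heart_sample <- read_delim(I(sample_text), delim = ",",
                           col_names = heart_cols, na = "?",
                           show_col_types = FALSE)

heart_sample$ca
```

With the real file, the first argument would be the path used above ("data/processed.cleveland.data") instead of the inline string.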


##### Cleaning and Tidying the Heart Disease Data

##### Setting the Seed

In [54]:
# setting the seed to 1
set.seed(1)

##### Renaming the Columns

In [55]:
heart_data <- rename(heart_data,
                    age = X1,
                    sex = X2,
                    cp = X3,
                    trestbps = X4,
                    chol = X5,
                    fbs = X6,
                    restecg = X7,
                    thalach = X8,
                    exang = X9,
                    oldpeak = X10,
                    slope = X11,
                    ca = X12,
                    thal = X13,
                    num = X14)
head(heart_data)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0


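The table above shows the first six rows after the columns were renamed to descriptive attribute names. The ca and thal columns are still stored as character (<chr>) because some of their entries are "?".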

##### Replacing "?" with NA

In [47]:
# removing "?" from data and replacing it with NA

heart_data[heart_data == "?"] <- NA
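
To check how many values the replacement affects, the missing entries can be counted per column afterwards. The sketch below uses a small made-up data frame (`toy` is a hypothetical name and its values are for illustration only) and base R, so it runs without the dataset.

```r
# Small made-up data frame with "?" as the missing-value marker
toy <- data.frame(
  ca   = c("0.0", "3.0", "?"),
  thal = c("6.0", "?", "7.0"),
  age  = c(63, 67, 52),
  stringsAsFactors = FALSE
)

# Same replacement as in the cell above
toy[toy == "?"] <- NA

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```

Running `colSums(is.na(heart_data))` in the same way would give the per-column counts for the real data.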

##### Adding a New Column 

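The num column records the diagnosis as a number, where 0 means no heart disease and any value greater than 0 means heart disease. Because we only want to predict the presence or absence of disease, we add a new column, diagnosis, that is TRUE when num is greater than 0 and FALSE when num is 0.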

In [76]:
# adding diagnosis column and setting the values to 'TRUE' if the cell value is > 0 or 'FALSE' if the cell value is 0

heart_data <- heart_data |>
    mutate(diagnosis = as.factor(ifelse(is.na(num), NA, (num > 0))))
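
The sketch below illustrates the recoding on a made-up vector (the values are for illustration only). It shows that the comparison `num > 0` already propagates NA, so the explicit `ifelse` gives the same result as the direct comparison, and that `factor()` can attach readable labels. The labels "absent" and "present" are hypothetical and are not used elsewhere in this report.

```r
# Made-up values of num, including a missing one
num <- c(0, 2, 1, NA, 0)

# Recoding as written in the cell above
with_ifelse <- as.factor(ifelse(is.na(num), NA, num > 0))

# Direct comparison: NA > 0 is NA, so missing values are kept
direct <- as.factor(num > 0)

# Optional: the same recoding with descriptive (hypothetical) labels
labelled <- factor(num > 0, levels = c(FALSE, TRUE),
                   labels = c("absent", "present"))

identical(with_ifelse, direct)
labelled
```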

##### Switching to Factors

In [75]:
# switching the categorical columns to factors and recoding sex as M/F

heart_clean <- heart_data |>
    mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal, num), as.factor),
           sex = fct_recode(sex, "M" = "1", "F" = "0"))
head(heart_clean)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num,diagnosis
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>
63,M,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0,False
67,M,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2,True
67,M,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1,True
37,M,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0,False
41,F,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0,False
56,M,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0,False


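The table above shows the first six rows of heart_clean. The categorical columns are now factors (<fct>), sex is recoded as M or F, and the new diagnosis column indicates whether heart disease is present.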

##### Creating Training and Testing Datasets

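We split the data into a training set (75% of the observations) and a testing set (25%), stratifying by num so that both sets contain similar proportions of each diagnosis value. Only the training set is used until the classifier is evaluated at the very end.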

In [79]:
# splitting dataframe into training and testing datasets

heart_split <- initial_split(heart_clean, prop = 0.75, strata = num)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)

head(heart_training)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num,diagnosis
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>
63,M,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0,False
37,M,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0,False
41,F,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0,False
56,M,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0,False
57,F,4,120,354,0,0,163,1,0.6,1,0.0,3.0,0,False
57,M,4,140,192,0,0,148,0,0.4,2,0.0,6.0,0,False


The table above shows the first six rows of the training set, which has the same columns as heart_clean.
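
One way to confirm that a stratified split behaves as intended is to compare class proportions before and after splitting. The sketch below does this on simulated data, assuming the rsample package (loaded with tidymodels) is installed. The data frame `toy` and its 60/40 class balance are made up for illustration, and it stratifies on a two-level outcome rather than on num.

```r
library(rsample)

set.seed(1)

# Simulated data: 120 observations of class "0" and 80 of class "1"
toy <- data.frame(
  x   = rnorm(200),
  num = factor(rep(c("0", "1"), times = c(120, 80)))
)

# 75/25 split, stratified by the outcome
toy_split <- initial_split(toy, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions in the full data and in the training set
full_props  <- prop.table(table(toy$num))
train_props <- prop.table(table(toy_train$num))

full_props
train_props
```

The same comparison could be made for heart_clean and heart_training using the diagnosis column.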

#### Summarizing the Data

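We summarize the training set to see how many patients fall into each diagnosis group and how age, cholesterol, and sex differ between the groups before building the classifier.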

In [78]:

# number of male patients in the training dataset, by diagnosis
male_count <- heart_training |> filter(sex == "M") |> group_by(diagnosis) |> summarize(male = n()) 

# number of female patients in the training dataset, by diagnosis
female_count <- heart_training |> filter(sex == "F") |> group_by(diagnosis) |> summarize(female = n()) 

# joining the male and female tables
sex_join <- full_join(male_count, female_count)

# getting the patient count, percentage, and the min, max, and mean of age and cholesterol for each diagnosis group
num_obs <- nrow(heart_training)
heart_summary <- heart_training |> 
    group_by(diagnosis) |>
    summarize(
        num_of_patients = n(),
        percentage = n()/num_obs * 100,
        min_age = min(age),
        max_age = max(age),
        mean_age = mean(age),
        min_chol = min(chol),
        max_chol = max(chol),
        mean_chol = mean(chol)) 

heart_summary <- full_join(heart_summary, sex_join)
heart_summary

[1m[22mJoining with `by = join_by(diagnosis)`
[1m[22mJoining with `by = join_by(diagnosis)`


diagnosis,num_of_patients,percentage,min_age,max_age,mean_age,min_chol,max_chol,mean_chol,male,female
<fct>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
False,124,54.86726,29,76,52.68548,126,564,244.2903,70,54
True,102,45.13274,35,77,56.48039,131,409,249.3627,87,15


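The summary table shows that the training set contains 226 patients: 124 (54.9%) without heart disease and 102 (45.1%) with heart disease, so the two classes are reasonably balanced. Patients with heart disease are slightly older on average (56.5 vs. 52.7 years) and have a slightly higher mean cholesterol level (249.4 vs. 244.3). Sex shows the largest difference: 87 of the 157 male patients (55%) have heart disease, compared with 15 of the 69 female patients (22%).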
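
The male and female counts could also be produced in a single step instead of two filtered tables and a join. The sketch below is one possible alternative, not the method used above. It assumes dplyr and tidyr (both loaded with tidyverse), and `toy` is a made-up five-row example for illustration only.

```r
library(dplyr)
library(tidyr)

# Made-up example data with the same column names as the training set
toy <- tibble(
  diagnosis = factor(c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE")),
  sex       = factor(c("M", "F", "F", "M", "M"))
)

# Count each diagnosis/sex combination, then spread sex into columns;
# combinations that never occur are filled with 0
sex_counts <- toy |>
  count(diagnosis, sex) |>
  pivot_wider(names_from = sex, values_from = n, values_fill = 0)

sex_counts
```

Applied to heart_training, this would give one row per diagnosis group with an M and an F column, which could then be joined to the summary with `by = "diagnosis"`.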

#### Finding the Best K Value

#### Visualizing the Results

#### Testing the Classifier