In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Loading in the data. The source files use "?" as their missing-value marker, so we read every "?" as NA, which is easier to deal with in R.

In [183]:
# Setting names of columns
names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Reading all of the data from https://archive.ics.uci.edu/ml/datasets/heart+Disease
cleveland_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = names, na = "?", show_col_types = FALSE)
switzerland_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data", col_names = names, na = "?", show_col_types = FALSE)
hungary_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data", col_names = names, na = "?", show_col_types = FALSE)
va_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data", col_names = names, na = "?", show_col_types = FALSE)
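
To illustrate what `na = "?"` does, here is a minimal sketch on inline toy data. The values and the three-column layout are made up for illustration and are not taken from the UCI files; `I()` is readr's way of passing literal text instead of a file path (available in readr 2.0 and later).

```r
library(readr)

# Toy CSV text (made-up values); "?" marks a missing entry, as in the UCI files.
toy_csv <- I("63,1,?\n67,?,3\n")

toy <- read_csv(toy_csv, col_names = c("age", "sex", "ca"),
                na = "?", show_col_types = FALSE)

# Each "?" is parsed as NA, so the columns stay numeric instead of
# falling back to character.
colSums(is.na(toy))
```

Without `na = "?"`, any column containing a "?" would be read as character, and every later numeric operation on it would need an extra conversion step.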

Next, we count the NA values in every column of each dataset. This shows early on which columns have too much missing data to keep and which are likely to cause problems later.

In [69]:
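# Empty data frame with the same columns as the data; one row of NA counts is added per dataset
NA_counts <- as.data.frame(cleveland_data[FALSE, ])
# Capture the dataset names unevaluated so they can double as row labels
x <- substitute(list(cleveland_data, switzerland_data, hungary_data, va_data))

for (i in as.list(x)[-1]) {
    # Count the NA values in every column of the current dataset
    NA_counts[nrow(NA_counts) + 1,] <- map_df(get(i), ~sum(is.na(.x)))
    # Label the new row with the name of the dataset it describes
    rownames(NA_counts)[nrow(NA_counts)] <- deparse(i)
}
NA_counts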

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
cleveland_data,0,0,0,0,0,0,0,0,0,0,0,4,2,0
switzerland_data,0,0,0,2,0,75,1,1,1,6,17,118,52,0
hungary_data,0,0,0,1,23,8,1,1,1,0,190,291,266,0
va_data,0,0,0,56,7,7,0,53,53,56,102,198,166,0
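
The same kind of per-dataset NA table can also be built from a named list, which avoids the `substitute()`/`get()` step. The following is only a sketch of that alternative in base R, using two made-up toy data frames (`a` and `b`) in place of the four real datasets; it assumes every data frame has the same columns.

```r
# Toy stand-ins for the real datasets (made-up values).
datasets <- list(
  a = data.frame(x = c(1, NA, 3), y = c(NA, NA, 1)),
  b = data.frame(x = c(1, 2, 3),  y = c(1, NA, 1))
)

# colSums(is.na(df)) counts the NAs per column; sapply() returns one column
# per dataset, so t() flips the result to one row per dataset.
na_counts <- t(sapply(datasets, function(df) colSums(is.na(df))))
na_counts
```

The list names become the row labels, so adding another dataset only requires adding one more named entry to the list.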


The columns `slope`, `ca`, and `thal` have a large number of NA values in three of the four datasets (all but Cleveland). We therefore drop these three columns: imputing from so little usable data would be unreliable, and omitting NA values instead would delete every row in which any of them is missing. We also drop `fbs`, `exang`, and `oldpeak` for a similar reason; if we keep these columns and omit every row that contains an NA, we lose ~150 rows of data. Removing them lets us keep more observations for our analysis and simplifies our model later on, trading fewer features for more data.
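
The trade-off can be seen on a small example. This sketch uses a made-up toy data frame (not the real data) in which one column is almost entirely NA, and compares how many complete rows survive `na.omit()` with and without that column.

```r
# Toy data (made-up values): `ca` is missing in three of four rows.
toy <- data.frame(
  age  = c(63, 67, 41, 56),
  chol = c(233, NA, 204, 236),
  ca   = c(NA, NA, NA, 0)
)

# Keeping every column: a row is dropped if any of its values is NA.
rows_all_columns <- nrow(na.omit(toy))

# Dropping the sparse column first: only the row with a missing `chol` is lost.
rows_without_ca <- nrow(na.omit(toy[, c("age", "chol")]))

c(all_columns = rows_all_columns, without_ca = rows_without_ca)
```

Here only one row is complete when `ca` is kept, while three rows remain once it is removed, which is the same effect that motivates dropping the sparse columns before calling `na.omit()` on the combined data.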

In [192]:
data_tidy_multi <- rbind(cleveland_data, switzerland_data, hungary_data, va_data) %>%
    select(-slope, -ca, -thal, -fbs, -exang, -oldpeak) %>%
    na.omit()

num_counts_multi <- data_tidy_multi %>%
    group_by(num) %>%
    summarize(counts = n(), proportions = counts/nrow(data_tidy_multi))
num_counts_multi

num,counts,proportions
<dbl>,<int>,<dbl>
0,373,0.44831731
1,244,0.29326923
2,100,0.12019231
3,91,0.109375
4,24,0.02884615


We can see a large class imbalance in the dataset, particularly for the more severe heart disease categories (category 4 makes up under 3% of the rows). Since this analysis focuses on how a patient's attributes relate to whether they have heart disease, we are not interested in the severity of the condition, only in whether heart disease is present at all. We therefore binarize the output variable by grouping categories 1-4 into a single class (has heart disease).

In [189]:
data_tidy <- data_tidy_multi %>%
    mutate(num = ifelse(num == 0, 0, 1))

num_counts <- data_tidy %>%
    group_by(num) %>%
    summarize(counts = n(), proportions = counts/nrow(data_tidy))

num_counts

num,counts,proportions
<dbl>,<int>,<dbl>
0,373,0.4483173
1,459,0.5516827
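
As a quick sanity check, the binarized counts should follow directly from the five-category table above: class 1 is the sum of categories 1-4, and the total number of rows is unchanged. The sketch below simply re-enters the counts printed earlier by hand; the vector name `multi_counts` is introduced here for illustration.

```r
# Counts per category, copied from the num_counts_multi output above.
multi_counts <- c(`0` = 373, `1` = 244, `2` = 100, `3` = 91, `4` = 24)

total    <- sum(multi_counts)       # all rows after na.omit()
diseased <- sum(multi_counts[-1])   # categories 1-4 combined

c(healthy = multi_counts[["0"]] / total, diseased = diseased / total)
```

The combined count is 459 out of 832 rows, which matches the proportions of roughly 0.448 and 0.552 reported in the table.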


The two classes are now much more balanced (roughly 45% without and 55% with heart disease). Now that we have fully tidied our data (`data_tidy`), we can begin the variable selection portion of the analysis.

In [195]:
head(data_tidy)

age,sex,cp,trestbps,chol,restecg,thalach,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
63,1,1,145,233,2,150,0
67,1,4,160,286,2,108,1
67,1,4,120,229,2,129,1
37,1,3,130,250,0,187,0
41,0,2,130,204,2,172,0
56,1,2,120,236,0,178,0
