# Group Project
### Members:
- Minh Au
- Andrew Carriere
- Veronika Bumbulovic
- Kevin Yoon Jeong

## Preliminary exploratory data analysis
The following two cells load the necessary libraries and read in the data, which is the first step of the analysis.

In [101]:
library(tidyverse)
library(repr)
library(tidymodels)

In [299]:
# Reading the data
hungarian_heart_data <- read_csv("data/hungarian-heart.csv", col_names = FALSE)
longbeach_heart_data <- read_csv("data/long-beach-va-heart.csv", col_names = FALSE)
switzerland_heart_data <- read_csv("data/switzerland-heart.csv", col_names = FALSE)
heart_data <- rbind(hungarian_heart_data, longbeach_heart_data) |>
    rbind(switzerland_heart_data)
head(heart_data, 10)

Rows: 2940 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2000 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1230 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1



X1
<chr>
1254 0 40 1 1 0 0
-9 2 140 0 289 -9 -9 -9
0 -9 -9 0 12 16 84 0
0 0 0 0 150 18 -9 7
172 86 200 110 140 86 0 0
0 -9 26 20 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 12
20 84 0 -9 -9 -9 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name


As the small subset above shows, the data is not in tidy format and is therefore not ready for analysis. According to the documentation, each record has 76 attributes, but they are currently all stored in a single column. Furthermore, the 76 attributes of each record are spread across 10 lines with varying numbers of values per line: in the record shown above, the first line holds 7 values, the next eight lines hold 8 each, and the last line holds 5. The data wrangling strategy is therefore as follows:
1. Combine the 76 attributes of each record into one row, separated by a specific delimiter such as ",".
2. Separate each row into multiple columns, one per attribute.
3. Load the names of the attributes.
4. Select the following factors, as described in the method section, and convert them to the correct types:
    - age - age in years
    - sex - 0 for female, 1 for male
    - chol - serum cholesterol in mg/dl
    - cigs - cigarettes per day
    - years - number of years as a smoker
    - thalach - maximum heart rate achieved 
    - thalrest - resting heart rate
    - trestbpd - resting blood pressure
    - num - diagnosis of heart disease (0 for absence, 1-4 for presence)
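
To make steps 1 and 2 concrete, here is a minimal base R sketch applied to the single record printed above. It takes a simpler route than a comma delimiter (joining the ten lines with spaces and splitting on whitespace), so it illustrates the idea rather than the exact pipeline used in this notebook. The variable names `record_lines`, `record_string`, and `values` are illustrative only.

```r
# The ten lines of the first record, copied from the output above
record_lines <- c(
  "1254 0 40 1 1 0 0",
  "-9 2 140 0 289 -9 -9 -9",
  "0 -9 -9 0 12 16 84 0",
  "0 0 0 0 150 18 -9 7",
  "172 86 200 110 140 86 0 0",
  "0 -9 26 20 -9 -9 -9 -9",
  "-9 -9 -9 -9 -9 -9 -9 12",
  "20 84 0 -9 -9 -9 -9 -9",
  "-9 -9 -9 -9 -9 1 1 1",
  "1 1 -9. -9. name"
)

# Step 1: join the lines of one record into a single string
record_string <- paste(record_lines, collapse = " ")

# Step 2: split on whitespace to recover the individual attribute values
values <- strsplit(record_string, " +")[[1]]

# Number of attribute values recovered for this record
length(values)

# age, sex, and chol sit at positions 3, 4, and 12 of the attribute list
values[c(3, 4, 12)]
```

Because the values come back in documentation order, an attribute can be looked up by its position until the column names are attached.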

In [300]:
separated_with_commas <- heart_data |>
    # Each line holds at most 8 space-separated values; shorter lines are padded with NA
    separate(col = X1, into = paste0("X", 1:8), sep = " ") |>
    rowwise() |>
    # Re-join the pieces with commas, turning the NA padding into empty strings
    mutate(combined_col = paste0(str_replace_na(c_across(X1:X8), replacement = ""), collapse = ",")) |>
    ungroup() |>
    select(combined_col)

“Expected 8 pieces. Missing pieces filled with `NA` in 1235 rows [1, 10, 11, 20, 21, 30, 31, 40, 41, 50, 51, 60, 61, 70, 71, 80, 81, 90, 91, 100, ...].”


In [307]:
# The 76 attribute names, in documentation order
attribute_names <- c(
    'id', 'ccf', 'age', 'sex', 'painloc', 'painexer', 'relrest', 'pncaden', 'cp', 'trestbps',
    'htn', 'chol', 'smoke', 'cigs', 'years', 'fbs', 'dm', 'famhist', 'restecg', 'ekgmo',
    'ekgday', 'ekgyr', 'dig', 'prop', 'nitr', 'pro', 'diuretic', 'proto', 'thaldur', 'thaltime',
    'met', 'thalach', 'thalrest', 'tpeakbps', 'tpeakbpd', 'dummy', 'trestbpd', 'exang', 'xhypo', 'oldpeak',
    'slope', 'rldv5', 'rldv5e', 'ca', 'restckm', 'exerckm', 'restef', 'restwm', 'exeref', 'exerwm',
    'thal', 'thalsev', 'thalpul', 'earlobe', 'cmo', 'cday', 'cyr', 'num', 'lmt', 'ladprox',
    'laddist', 'diag', 'cxmain', 'ramus', 'om1', 'om2', 'rcaprox', 'rcadist', 'lvx1', 'lvx2',
    'lvx3', 'lvx4', 'lvf', 'cathef', 'junk', 'name'
)
clean_data <- separated_with_commas |>
    # Every 10 consecutive lines form one record; `record` avoids clashing with the `num` attribute
    mutate(record = ceiling(row_number() / 10)) |>
    group_by(record) |>
    summarise(combined_col = gsub(',,', ',', gsub('NA', '', paste0(combined_col, collapse = ",")))) |>
    ungroup() |>
    select(combined_col) |>
    separate(col = combined_col, into = attribute_names, sep = ",")
head(clean_data, 1)
processed_heart_data <- clean_data |>
    select(age, sex, chol, cigs, years, thalach, thalrest, trestbpd, num) |>
    # sex is categorical; every other selected attribute is numeric
    mutate(sex = as_factor(sex), across(-sex, as.numeric))
head(processed_heart_data, 5)

“Expected 76 pieces. Additional pieces discarded in 616 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 76 pieces. Missing pieces filled with `NA` in 1 rows [290].”


id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,⋯,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1254,0,40,1,1,0,0,-9,2,140,⋯,-9,-9,1,1,1,1,1,-9.0,-9.0,name


age,sex,chol,cigs,years,thalach,thalrest,trestbpd,num
<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
40,1,289,-9,-9,172,86,86,0
49,0,180,-9,-9,156,100,90,1
37,1,283,-9,-9,98,58,80,0
48,0,214,-9,-9,108,54,86,3
54,1,-9,-9,-9,122,74,90,0


According to the dataset documentation, a value of -9 denotes missing data. As seen in the first five rows, `cigs` and `years` contain only -9, and this holds for most of the dataset. Removing every row that has -9 in those two columns would shrink the dataset significantly, so we cannot use the smoking variables as predictors because we do not have enough information on them. <br>
Also, note that the diagnosis of heart disease is either present or absent. Since it is a binary category, it is better represented as 0 and 1, where 0 denotes the absence and 1 the presence of heart disease. For clarity, we name this column **diagnosis**. <br>
We also replace the -9 values with NA, to be consistent with R's convention for missing values.
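
As a minimal sketch of the two recoding steps described above, the base R snippet below works on a small made-up data frame (`toy`, which is not the project data). It first measures how often the -9 sentinel occurs in each column, then recodes -9 to `NA` before deriving the binary label. Recoding first is an assumption of this sketch: it ensures a missing `num` becomes `NA` rather than being counted as absence.

```r
# Hypothetical toy data, NOT taken from the heart disease files
toy <- data.frame(
  cigs = c(-9, -9, 20, -9),
  chol = c(289, -9, 283, 214),
  num  = c(0, 1, -9, 3)
)

# Share of -9 (missing) entries in each column
missing_share <- colMeans(toy == -9)
missing_share

# Recode the -9 sentinel as NA across the whole data frame
toy[toy == -9] <- NA

# Binary label: 0 = absence, 1 = presence (num of 1 to 4); NA stays NA
toy$diagnosis <- factor(as.integer(toy$num >= 1), levels = c(0, 1))
toy
```

A column whose missing share is close to 1, as `cigs` is in this toy example, carries too little information to be a useful predictor.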

In [312]:
final_heart_data <- processed_heart_data |>
    # Collapse num (0 = absent, 1-4 = present) into a binary factor
    mutate(diagnosis = as.factor(1 * (num >= 1))) |>
    select(-cigs, -years, -num)
# Recode the -9 missing-value sentinel as NA
final_heart_data[final_heart_data == -9] <- NA
head(final_heart_data, 5)

age,sex,chol,thalach,thalrest,trestbpd,diagnosis
<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
40,1,289.0,172,86,86,0
49,0,180.0,156,100,90,1
37,1,283.0,98,58,80,0
48,0,214.0,108,54,86,1
54,1,,122,74,90,0


The table above shows a subset of the wrangled data. Now we can split the data into training and testing sets and start gaining more insight into it.
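
To picture what a stratified 80/20 split does, the base R sketch below samples 80% of the rows within each class of a made-up label. It only illustrates the idea of stratification; it is not how `initial_split` is implemented, and the `toy` data frame and its class counts are assumptions chosen for the example.

```r
set.seed(1337)

# Hypothetical toy data: 15 rows of class 0 and 5 rows of class 1
toy <- data.frame(
  id = 1:20,
  diagnosis = factor(rep(c(0, 1), times = c(15, 5)))
)

# Within each diagnosis class, draw 80% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$diagnosis)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Class counts in each set
table(toy_train$diagnosis)
table(toy_test$diagnosis)
```

Sampling within each class keeps the class balance of the training and testing sets close to that of the full data, which matters when one class is much rarer than the other.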

In [23]:
# Splitting the wrangled data into training and testing sets, stratified by diagnosis
set.seed(1337)
heart_split <- initial_split(final_heart_data, prop = 0.8, strata = diagnosis)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

In [4]:
# nrow(heart_train)
# heart_train_missing <- heart_train |>
#     mutate(cp = mean(cp))

# heart_train_missing