# Group Project
### Members:
- Minh Au
- Andrew Carriere
- Veronika Bumbulovic
- Kevin Yoon Jeong

## Preliminary exploratory data analysis
The following two cells load the necessary libraries and read in the data, which is the first step of the analysis.

In [101]:
library(tidyverse)
library(repr)
library(tidymodels)

In [149]:
# Reading the data
heart_data <- read_csv("data/hungarian-heart.csv", col_names = FALSE)
head(heart_data, 10)

Rows: 2940 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1
<chr>
1254 0 40 1 1 0 0
-9 2 140 0 289 -9 -9 -9
0 -9 -9 0 12 16 84 0
0 0 0 0 150 18 -9 7
172 86 200 110 140 86 0 0
0 -9 26 20 -9 -9 -9 -9
-9 -9 -9 -9 -9 -9 -9 12
20 84 0 -9 -9 -9 -9 -9
-9 -9 -9 -9 -9 1 1 1
1 1 -9. -9. name


As seen in the small subset above, the data is not in tidy format and is therefore not ready for analysis. From the documentation, we know that there are 76 attributes, but currently they are all grouped in one column. Furthermore, the 76 attributes of each observation are spread across 10 lines with differing numbers of values per line: in the record shown above, the first line has 7 values, the next eight have 8 each, and the last has 5. As such, the data wrangling strategy is as follows:
1. Combine the 76 attributes of each observation into one row, separated by a specific delimiter such as ",".
2. Separate each row into multiple columns.
3. Load the names of the attributes.
4. Select the 7 factors described in the method section and convert them to numeric or categorical types.
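Steps 1 and 2 can also be illustrated on a toy input. The sketch below is not the approach used in the cells that follow; it is a minimal base R alternative that assumes every record contributes exactly the same number of whitespace-separated values. The toy lines, `n_attr`, and the column names are made up for illustration (the real file would use 76 attributes per record).

```r
# Toy input: 2 records, each with 5 attributes split unevenly over 2 lines
raw_lines <- c("1 40 1", "140 name", "2 49 0", "160 name")

# Split every line on whitespace and flatten the result into one token vector
tokens <- unlist(strsplit(trimws(raw_lines), "\\s+"))

# Every n_attr consecutive tokens form one record (assumption: no record is short)
n_attr <- 5
records <- matrix(tokens, ncol = n_attr, byrow = TRUE)

# One row per record, one column per attribute (all still character at this point)
toy <- as.data.frame(records, stringsAsFactors = FALSE)
names(toy) <- c("id", "age", "sex", "trestbps", "name")
toy
```

If any record in the real file were incomplete, the fixed-width reshape would silently misalign every later record, so the token count should be checked against a multiple of the attribute count before relying on it.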

In [190]:
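# Step 1: split each raw line into at most 8 values, then re-join them with commas
separated_with_commas <- heart_data |>
    separate(col = X1, into = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"), sep = " ") |>
    rowwise() |>
    # Lines with fewer than 8 values are padded with NA; replace those with ""
    mutate(combined_col = paste0(str_replace_na(c_across(X1:X8), replacement = ""), collapse = ",")) |>
    ungroup() |>
    select(combined_col)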

“Expected 8 pieces. Missing pieces filled with `NA` in 589 rows [1, 10, 11, 20, 21, 30, 31, 40, 41, 50, 51, 60, 61, 70, 71, 80, 81, 90, 91, 100, ...].”


In [203]:
clean_data <- separated_with_commas |>
    mutate(num = ceiling(row_number() / 10)) |>
    group_by(num) |>
    summarise(combined_col = gsub(',,', ',', gsub('NA', '', paste0(combined_col, collapse = ",")))) |>
    ungroup() |>
    separate(col = combined_col, into = c('id', 'ccf', 'age', 'sex', 'painloc', 'painexer', 'relrest', 'pncaden', 'cp', 'trestbps', 'htn', 'chol', 'smoke', 'cigs', 'years', 'fbs', 'dm', 'famhist', 'restecg', 'ekgmo', 'ekgday', 'ekgyr', 'dig', 'prop', 'nitr', 'pro', 'diuretic', 'proto', 'thaldur', 'thaltime', 'met', 'thalach', 'thalrest', 'tpeakbps', 'tpeakbpd', 'dummy', 'trestbpd', 'exang', 'xhypo', 'oldpeak', 'slope', 'rldv5', 'rldv5e', 'ca', 'restckm', 'exerckm', 'restef', 'restwm', 'exeref', 'exerwm', 'thal', 'thalsev', 'thalpul', 'earlobe', 'cmo', 'cday', 'cyr', 'num', 'lmt', 'ladprox', 'laddist', 'diag', 'cxmain', 'ramus', 'om1', 'om2', 'rcaprox', 'rcadist', 'lvx1', 'lvx2', 'lvx3', 'lvx4', 'lvf', 'cathef', 'junk', 'name'), sep = ",")
head(clean_data, 1)
processed_heart_data <- clean_data |>
    select(age, sex, chol, cigs, years, thalach, thalrest, trestbpd) |>
    mutate(age = as.numeric(age), sex = as_factor(sex), chol = as.numeric(chol), cigs = as.numeric(cigs), 
          years = as.numeric(years), thalach = as.numeric(thalach), thalrest = as.numeric(thalrest), trestbpd = as.numeric(trestbpd))
head(processed_heart_data, 5)

“Expected 76 pieces. Additional pieces discarded in 293 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 76 pieces. Missing pieces filled with `NA` in 1 rows [290].”


id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,⋯,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1254,0,40,1,1,0,0,-9,2,140,⋯,-9,-9,1,1,1,1,1,-9.0,-9.0,name


age,sex,chol,cigs,years,thalach,thalrest,trestbpd
<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
40,1,289,-9,-9,172,86,86
49,0,180,-9,-9,156,100,90
37,1,283,-9,-9,98,58,80
48,0,214,-9,-9,108,54,86
54,1,-9,-9,-9,122,74,90


The first five rows show that the dataset is now in tidy format: every column is a variable, every row is a single observation, and every cell holds one value. However, we still need to modify the dataset slightly to suit the classification problem. More specifically, since we are trying to predict the type of chest pain, it needs to be represented as a category/factor instead of a number. Similarly, other categorical attributes include: sex, fasting blood sugar larger than $120$ mg/dl, resting electrocardiographic results, exercise-induced angina, number of major vessels, and thalassemia.
Selected factors: age, sex, cholesterol, smoking (cigarettes per day or number of years as a smoker), maximum heart rate, resting heart rate, resting blood pressure.
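The rows printed above also contain the value -9 in several columns (all of `cigs` and `years`, and one `chol` entry). Assuming -9 is this dataset's placeholder for a missing value, those entries would distort any summary statistic or distance computation if left as numbers. The sketch below shows one possible way to recode them as `NA`; the tibble `toy` is a made-up subset that reuses values from the output above, not the full dataset.

```r
library(dplyr)

# Toy subset reusing values from the printed rows (assumption: -9 means "missing")
toy <- tibble(
    age  = c(40, 49, 54),
    chol = c(289, 180, -9),
    cigs = c(-9, -9, -9)
)

# Recode -9 as NA in every numeric column
toy_na <- toy |>
    mutate(across(where(is.numeric), ~ na_if(.x, -9)))

# Count the missing values per column to see which attributes are usable
colSums(is.na(toy_na))
```

A per-column count like this helps decide whether a predictor such as `cigs` has enough observed values to be worth keeping.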

In [1]:
# Convert some attributes to factors
heart_data <- heart_data |>
    mutate(sex = as_factor(sex), cp = as_factor(cp), fbs = as_factor(fbs), restecg = as_factor(restecg),
           exang = as_factor(exang), ca = as_factor(ca), thal = as_factor(thal))
head(heart_data, 5)

ERROR: Error in mutate(heart_data, sex = as_factor(sex), cp = as_factor(cp), : could not find function "mutate"


Before processing the data any further, we must set aside the test data so that it cannot affect any aspect of our data analysis process. As there are 1025 rows (1025 different observations), we assign $80\%$ of them to the training set and the remaining $20\%$ to the testing set. Furthermore, to prevent bias, the split is random but stratified on chest pain type, so that both subsets preserve the proportions of the different chest pain types. For reproducibility, a random seed of $1337$ is used.

In [23]:
# Splitting the data into training and testing
set.seed(1337)
heart_split <- initial_split(heart_data, prop = 0.8, strata = cp)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)
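To see what stratification does, one option is to compare class proportions in the two subsets after the split. The sketch below does this on synthetic data; the data frame `toy`, its class sizes, and the predictor `x` are invented for illustration and are not the heart dataset.

```r
library(rsample)

set.seed(1337)

# Synthetic data: 100 rows with three imbalanced classes in `cp`
toy <- data.frame(
    cp = factor(rep(c("a", "b", "c"), times = c(20, 30, 50))),
    x  = rnorm(100)
)

# Same call pattern as above: 80% training, stratified on the class label
toy_split <- initial_split(toy, prop = 0.8, strata = cp)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions in each subset; with stratification they should be similar
prop.table(table(toy_train$cp))
prop.table(table(toy_test$cp))
```

Running the same comparison on `heart_train` and `heart_test` would be a quick sanity check that the chest pain types are balanced across the split.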

In [4]:
# nrow(heart_train)
# heart_train_missing <- heart_train |>
#     mutate(cp = mean(cp))

# heart_train_missing