# Problem with rare categorical data
If you look at `restecg` and `thal`, you see that each has one level with only a small number of observed cases.

In [1]:
source("helpers.r")
df <- get_training_df()

In [2]:
df %>% count(restecg)
df %>% count(thal)

restecg,n
normal,115
ST-T_abnormalty,124
hypertrophy,4


thal,n
normal,2
fixed_defect,16
reversable_defect,131
?,94
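
Rare levels like these can also be found programmatically instead of by reading the counts. The following is a minimal sketch in base R; `rare_levels()` and its threshold `min_n` are hypothetical names that are not part of `helpers.r`, and the toy data frame only imitates the shape of the real data.

```r
# Hypothetical helper: for every factor or character column, return the
# levels that occur in fewer than `min_n` rows.
rare_levels <- function(df, min_n = 5) {
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))
  rare <- lapply(df[is_cat], function(col) {
    tab <- table(col)
    names(tab)[as.vector(tab) < min_n]
  })
  # Keep only the columns that actually contain a rare level
  rare[lengths(rare) > 0]
}

# Toy data (made up for illustration): "hypertrophy" occurs only twice
toy <- data.frame(
  restecg = c(rep("normal", 10), rep("ST-T_abnormalty", 12), rep("hypertrophy", 2)),
  sex = rep(c("Female", "Male"), 12),
  age = 41:64,
  stringsAsFactors = TRUE
)

print(rare_levels(toy))
```

The threshold of 5 is an arbitrary choice for this sketch; a sensible value depends on the number of folds used in cross-validation.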


Such rare levels cause problems when evaluating a model's performance with cross-validation. Imagine fitting a model on a training set in which the level `hypertrophy` of the variable `restecg` is not present. What should the model do when this level appears in an observation of the test set? R says:

`Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor restecg has new levels hypertrophy`
  
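This makes sense if you look at how a logistic regression, for example, treats categorical variables under dummy encoding: each non-reference level seen during training gets its own indicator column and coefficient, so a level that was absent from the training set has no coefficient to predict with. For now I will solve this problem by simply dropping these observations from the data set, treating them as outliers.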
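
The error can be reproduced on a small scale. The sketch below uses made-up toy data (not the heart-disease data set) and base R's `glm()`. It assumes the unseen level appears only in the test observation, and it shows that the dummy encoding has no column for it.

```r
# Toy training data: only two of the three restecg levels are present.
train <- data.frame(
  restecg = factor(c("normal", "ST-T_abnormalty", "normal",
                     "ST-T_abnormalty", "normal", "ST-T_abnormalty"),
                   levels = c("normal", "ST-T_abnormalty")),
  target = c(0, 1, 1, 1, 0, 0)
)
# Toy test observation carrying a level the model has never seen.
test <- data.frame(restecg = factor("hypertrophy"))

fit <- glm(target ~ restecg, data = train, family = binomial)

# Dummy encoding: an intercept plus one indicator column per non-reference
# level. There is no column, and so no coefficient, for "hypertrophy".
dummy_cols <- colnames(model.matrix(fit))
print(dummy_cols)

# predict() stops because it cannot encode the unseen level;
# tryCatch() captures the error message instead of aborting.
msg <- tryCatch(
  {
    predict(fit, newdata = test)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The same failure occurs inside a cross-validation loop whenever all observations of a rare level end up in the held-out fold.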

In [3]:
get_clean_df <- function(df) {
  # Drop the observations that carry the rare levels:
  # restecg == 2 ("hypertrophy") and thal == 0 ("normal").
  df <- df %>%
    filter(restecg != 2, thal != 0)

  # Keep the numeric columns as they are and turn the coded columns into
  # labelled factors. Columns are referenced by name, not via df$...
  result <- df %>%
    transmute(age,
              thalach,
              trestbps,
              oldpeak,
              ca,
              chol,
              restecg = factor(restecg, levels = c(0, 1), labels = c("normal", "ST-T_abnormalty")),
              fbs = factor(fbs, levels = c(0, 1), labels = c("no", "yes")),
              sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
              exang = factor(exang, levels = c(0, 1), labels = c("no", "yes")),
              cp = factor(cp, levels = c(0, 1, 2, 3), labels = c("typical_angina", "atypical_angina", "non-anginal_pain", "asymptomatic")),
              slope = factor(slope, levels = c(0, 1, 2), labels = c("upsloping", "flat", "downsloping")),
              target = factor(target, levels = c(0, 1), labels = c("no_disease", "disease")),
              thal = factor(thal, levels = c(1, 2, 3), labels = c("fixed_defect", "reversable_defect", "?"))
    )

  # Return explicitly; a trailing assignment would only return invisibly.
  return(result)
}

get_training_df_clean <- function(p = 0.8) {
  set.seed(25)
  df_raw <- read_csv("data.csv")
  df <- get_clean_df(df_raw)
  inTraining <- createDataPartition(df$target, p = p, list = FALSE)
  training <- df[inTraining,]
  # testing  <- df[-inTraining,]
  return(training)
}

From now on I will use this function to get the training data set.