# Assignment 3

The objective of assignment 3 is to assess the extent to which dataset size and hyperparameter tuning affect the performance of prognostic models.
 
The learning objectives are to understand: (1) how performance is affected by sample size, (2) how this varies by modeling technique, and (3) when hyperparameter tuning is most useful.

You will be given a large dataset and asked to sample datasets of various sizes from it. For each sample, train a machine learning model. Then plot the relationship between the sample size and the model performance. You should expect to see a power curve: performance improves as the dataset gets larger, with diminishing gains at larger sizes.

## Overview

The workflow of this assignment is as follows:

1. The code for loading and preprocessing the data, including predictor selection, is provided. Please do not change it.
2. We will generate a vector of integers as the sample sizes of each subset (code will be provided).
3. Then for each dataset:
    1. Train CART and LGBM models, with and without tuning, on __all of the records in the dataset__ with 5-fold nested CV to establish a baseline.
    2. Train CART and LGBM models, with and without tuning, on __each subset with different sample size__ with 5-fold nested CV.
    
    (These two steps don't have to be in this particular order. See below.)
    
4. You will complete the function at the top to complete these steps. 
5. Plot the relationship between sample size and model performance (AUC and scaled Brier score) and answer the questions based on your plots. Elaborate on your answers by explaining your thought process.
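To make the two performance measures in step 5 concrete, here is a minimal base R sketch of how AUC and a scaled Brier score can be computed from observed outcomes and predicted probabilities. The function names `auc_rank`, `brier_score`, and `scaled_brier` are hypothetical helpers written for illustration, and the scaling used here (one minus the Brier score divided by the score of a model that always predicts the observed prevalence) is a common definition that is assumed, not confirmed, to match what the `sdgm` package reports.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic).
# rank() assigns average ranks to tied predictions.
auc_rank <- function(y, p) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  r <- rank(p)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Brier score: mean squared difference between prediction and outcome.
brier_score <- function(y, p) mean((p - y)^2)

# Scaled Brier score (assumed definition): 1 - Brier / Brier_ref, where
# Brier_ref is the score of always predicting the observed prevalence.
scaled_brier <- function(y, p) {
  prev <- mean(y)
  1 - brier_score(y, p) / (prev * (1 - prev))
}

# Toy example with made-up values
y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(y, p)      # 0.75
scaled_brier(y, p)  # 0.3675
```

With this scaling, higher values are better for both measures, which makes the two plots easier to read side by side.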

## Some notes

The dataset used in this assignment is a public dataset, and there is some information about it in the help file and in the attached document. The dataset is: faers (FDA adverse drug events). The help file documents the binary prediction models that can be trained.

Based on the previous assignment, you should only be using nested CV (no need for repeated nested CV because as we saw the variation is very small with nested CV and therefore the repeated part does not really add much value).

## Changes of functions in the updated `sdgm` package

Before you start working on the main question, please note that we have updated the `sdgm` package. Below are some important changes. Read them carefully, as you will need to use the affected functions in this assignment.

### Model building functions

In the updated `sdgm` package, the model-building functions have a new parameter called `tune` that controls whether to tune the model (default = T). You can use `tune = F` to turn off hyperparameter tuning when needed.

In [None]:
# One example to show the changes. This applies to all the model building functions.
?sdgm::cart.bestmodel.bin()

### `nested.cv.bin` for nested CV

In the updated version of the `sdgm` package, there is a new function called `nested.cv.bin` that implements nested CV. See the help page below for more details. We will use this function in this assignment.

__Important note:__ With this `nested.cv.bin` function, you don't need to implement parallel computing yourself. There is a parameter called `par` to control if the outer loop should be parallelized, and its default value is `T`. So without changing it, the program will parallelize the outer loop automatically.
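As a reminder of what nested CV does conceptually, the following base R sketch builds the outer and inner fold assignments for 5-fold nested CV on a toy index set. It only illustrates the fold structure and is not how `nested.cv.bin` is implemented; the variable names (`outer_fold`, `inner_fold`, and so on) are hypothetical.

```r
set.seed(1)
n <- 20   # toy number of records
k <- 5    # number of outer and inner folds

# Assign each record to one of k outer folds at random.
outer_fold <- sample(rep(1:k, length.out = n))

for (i in 1:k) {
  test_idx  <- which(outer_fold == i)   # held out for performance estimation
  train_idx <- which(outer_fold != i)   # used for tuning and fitting

  # Inner folds are drawn from the outer training set only, so the
  # held-out records never influence hyperparameter selection.
  inner_fold <- sample(rep(1:k, length.out = length(train_idx)))

  # Here one would tune on the inner folds, refit on train_idx,
  # and evaluate the refitted model on test_idx.
}

table(outer_fold)  # 4 records in each of the 5 outer folds
```

The outer loop is the part that can be parallelized, since each outer fold is processed independently of the others.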

In [None]:
?sdgm::nested.cv.bin()

## Load useful packages

In [None]:
library(sdgm)
library(dplyr)
library(ggplot2)

## The dataset: `sdgm::faers` (FDA adverse events)

For more details about this dataset, see below and the attached file. 

In [None]:
# check the help file of this dataset. It documents the binary prediction model 
# that can be trained from this dataset.
?sdgm::faers

In [None]:
# In this assignment, we will provide the data pre-processing steps (same as in the help file above). Please don't change code here.

# get the dataset
data<-sdgm::getdata("faers")

# create binary outcome
data$status <- ifelse(data$outc_cod_0 == "DE", 1,
                     ifelse(data$outc_cod_0 %in% c("CA", "DS", "HO", "LT", "OT", "RI"), 0, NA))

# transform event_dt into days
data$date <- as.Date(as.character(data$event_dt), format = "%Y%m%d")
data$days <- as.numeric(as.Date("2020-01-01") - data$date)

data$weight                               <- data$wt
data$weight[which(data$wt_cod == "LBS")]  <- data$wt[which(data$wt_cod == "LBS")] * 0.45359237

data$age_yr                               <- data$age
data$age_yr[which(data$age_cod == "DEC")] <- data$age[which(data$age_cod == "DEC")] * 10
data$age_yr[which(data$age_cod == "DY")]  <- data$age[which(data$age_cod == "DY")] / 365
data$age_yr[which(data$age_cod == "HR")]  <- data$age[which(data$age_cod == "HR")] / (24*365)
data$age_yr[which(data$age_cod == "MON")] <- data$age[which(data$age_cod == "MON")] / 12
data$age_yr[which(data$age_cod == "WK")]  <- data$age[which(data$age_cod == "WK")] / 52

data$age_yr                               <- ifelse(data$age_yr >= 150, NA, data$age_yr)

# transform sex into factor
data$sex <- factor(data$sex) 

# select predictors (categorical)
data$drug            <- as.factor(data$drugname_0)

data$indi_pt         <- as.factor(data$indi_pt_0)

# subset dataset
cols_select <- c("status", "days", "sex", "age_yr", "weight", "drug", "indi_pt")

In [None]:
# note here we need to transform the dataset from tibble to dataframe to avoid some issues
full_data <- as.data.frame(data %>% select(all_of(cols_select)))

# have a look
dim(full_data)
summary(full_data)
str(full_data)

In [None]:
# define the outcome variable
voutcome <- "Your answer"

table(full_data[,voutcome])

## Question 1: Describe the missingness pattern

Using "Lecture 6 - Missingness Patterns on CCHS 2023 - v2.ipynb" as your reference, describe the missingness pattern in this dataset.
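As a generic illustration of what describing missingness can involve, the base R sketch below computes the proportion of missing values per variable and tabulates the distinct missingness patterns on a small made-up data frame. The data frame `toy` and its values are invented for illustration, and the approach in the referenced lecture notebook may differ.

```r
# Toy data frame standing in for the real data (values are made up).
toy <- data.frame(
  age    = c(34, NA, 51, NA),
  weight = c(70, 82, NA, NA),
  sex    = c("F", "M", "M", NA)
)

# Proportion of missing values per variable.
miss_prop <- colMeans(is.na(toy))
miss_prop

# Missingness patterns: one label per row showing which variables are
# missing, then the frequency of each distinct pattern.
pattern <- apply(is.na(toy), 1, function(r) {
  paste(ifelse(r, "NA", "ok"), collapse = " | ")
})
table(pattern)
```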

In [None]:
# Describe the missingness pattern
"Your answer"

You may have noticed that the data has three potential issues:
1. It has a lot of NAs. The current implementation of CART and LGBM in the `sdgm` package can deal with missingness in predictors, so you don't need to worry about it. However, we need to filter out the `NA`s in the __outcome variable__.
2. Variables like `drug` and `indi_pt` have a lot of categories, i.e., very high cardinality. In this assignment, you will see that LGBM and CART are able to handle high cardinality variables. They both implement a mean target encoding scheme for a binary outcome.
3. The outcome variable in this dataset is highly imbalanced. However, since the learning objectives of this assignment are to assess the influence of dataset size and hyperparameter tuning, we will ignore the imbalance issue.
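To illustrate the idea behind mean target encoding mentioned in point 2, here is a minimal base R sketch on made-up data: each category of a high-cardinality variable is replaced by the mean of the binary outcome within that category. The data frame `toy` and the column name `drug_enc` are hypothetical, and this is only the basic idea, not the scheme as implemented in the `sdgm` package.

```r
# Toy data (values are made up).
toy <- data.frame(
  drug   = c("A", "A", "B", "B", "B", "C"),
  status = c(1, 0, 0, 0, 1, 1)
)

# Mean of the binary outcome within each category.
enc <- tapply(toy$status, toy$drug, mean)

# Replace each category by its encoded value.
toy$drug_enc <- unname(as.numeric(enc[as.character(toy$drug)]))
toy
```

In practice the encoding would be learned from the training folds only and then applied to the held-out fold, so that the outcome of the test records does not leak into the predictors.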

In [None]:
# Filter out NAs in the outcome variable
full_data <- "Your answer"

# have a look
dim(full_data)
summary(full_data)
str(full_data)

## Define a vector of integers as sample sizes of the subsets

In [None]:
set.seed(10) # fix the seed so the sample sizes are reproducible
b <- rnorm(1, 1.5, 0.005)
set_sizes <- round(b^(10:32))

# make sure the subset size doesn't exceed the whole data size
size_whole <- nrow(full_data)
set_sizes <- set_sizes[set_sizes < size_whole]

# Add the full sample size at the end
set_sizes <- c(set_sizes, size_whole)
set_sizes

# total number of subsets including the whole dataset
n_set <- length(set_sizes)
n_set

ind_sizes <- seq_len(n_set)

## Question 2 (optional - bonus): More subsets to smooth out the final plot
In the current setup in the above cell, you'll have about 22 subsets with different sample sizes, including the whole dataset. You'll get bonus points if you can increase the number of subsets in a reasonable way to make the final plot smoother.

If you choose to work on this bonus question, please just make changes in the above cell. The change will be reflected by `set_sizes`, `n_set`, and the final plots.
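The sample sizes above are geometrically spaced because they are powers of a common base. One possible way to get more sizes, sketched below under the assumption that a fixed base of 1.5 is a reasonable stand-in for `b`, is to use fractional exponents. The names `sizes_coarse` and `sizes_dense` are hypothetical, and other approaches are equally valid.

```r
b <- 1.5   # illustrative base, close to the value drawn for b above

# Integer exponents, as in the cell above.
sizes_coarse <- round(b^(10:32))

# Half-step exponents give roughly twice as many sizes while keeping
# the geometric spacing; unique() guards against duplicates after rounding.
sizes_dense <- unique(round(b^seq(10, 32, by = 0.5)))

length(sizes_coarse)  # 23
length(sizes_dense)   # 45
```

Keep in mind that every additional sample size adds four more nested CV runs, so the running time grows with the number of subsets.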

## Question 3: Building and evaluating 4 models (`cart` and `lgbm`, with and without tuning) on all the subsets

For reference, please see "Lecture 4 - Nested K-cross validation of CART on CCHS.ipynb" and your code for assignment 2.

Use the code below to control `n_iter` throughout the assignment. You can set it as `n_iter <- 1` when you develop and debug the code. Then set it as `n_iter <- 20` for the final submission.

In [None]:
n_iter <- 1

In [None]:
# ============= Building 4 models for each subset ==================

# loop through all subsets including the whole dataset
res <- sapply(set_sizes, function(size) { 
    
    # slice the data using the current sample size: size
    sub_data <- full_data %>% slice_sample(n=size)

    # build and evaluate 4 models
    res.model <- sapply(1:4, function(i_model) {
        # name of file
        file_name <- paste0("result.",nrow(sub_data), ".", i_model, ".rds")
        
        # check if file already exists
        if (file.exists(file_name)) { # exist, read the file
           res_vec <- readRDS(file_name) 
        } else { # not exist, run the code
            # build and evaluate 
            "Your answer"

            # collect results
            res_vec <- "Your answer"

            # save it to a file
            saveRDS(res_vec, file = file_name)
        }
        # return results
        res_vec
    })

    # return 
    res.model
})

# have a look
res

In order to generate the plots using the code provided below, you'll need to organize your result `res` into a dataframe with five variables: 

1. `size`: sample size of each subset
2. `model`: a categorical variable of `cart` and `lgbm`
3. `tune`: a categorical variable of `with tuning` and `without tuning`
4. `auc`: the AUC value
5. `brier`: the Brier score
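As a sketch of the target layout, the snippet below builds an empty data frame with one row per combination of size, model, and tuning setting using `expand.grid`. The names `sizes`, `models`, `tunes`, and `df_toy` are hypothetical, and how you fill in `auc` and `brier` depends on how you stored your results in `res`.

```r
sizes  <- c(100, 200)   # toy sample sizes
models <- c("cart", "lgbm")
tunes  <- c("with tuning", "without tuning")

# One row per (size, model, tune) combination.
# expand.grid varies the first argument fastest, then the second, and so on.
df_toy <- expand.grid(tune = tunes, model = models, size = sizes,
                      stringsAsFactors = FALSE)

# Placeholder metric columns, to be filled from the nested CV results.
df_toy$auc   <- NA_real_
df_toy$brier <- NA_real_

# Reorder the columns to match the list above.
df_toy <- df_toy[, c("size", "model", "tune", "auc", "brier")]
df_toy
```

Whatever construction you use, check that the row order of the grid matches the order in which the results were produced, otherwise the metrics will be attached to the wrong model.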

In [None]:
# -------- re-organize `res` and save it as a data.frame -------- 
df.res <- "Your answer"

# have a look at the results dataframe
df.res

In [None]:
# change plot size to 18 * 6
options(repr.plot.width=18, repr.plot.height=6) 

In [None]:
# Plot of auc 
ggplot(df.res, aes(size, auc, color = tune)) +
    geom_point(size = 3) +
    geom_path()+
    geom_hline(data = df.res %>% filter(size == max(df.res$size)), 
               aes(yintercept = auc, color = tune), linetype = "dashed") +
    facet_wrap(~model)+
    labs(x = "Sample size",
         y = "AUC value",
         title = paste0("faers: FDA adverse events, n_iter = ", n_iter)) +
    theme(text = element_text(size = 20),
          legend.title = element_blank())

# Plot of brier 
ggplot(df.res, aes(size, brier, color = tune)) +
    geom_point(size = 3) +
    geom_path()+
    geom_hline(data = df.res %>% filter(size == max(df.res$size)), 
               aes(yintercept = brier, color = tune), linetype = "dashed") +
    facet_wrap(~model) +
    labs(x = "Sample size",
         y = "Brier score") +
    theme(text = element_text(size = 20),
          legend.title = element_blank())

## Question 4: How large does the dataset need to be to get reasonable discrimination and calibration?

## Question 5: Does tuning affect the discrimination and calibration results?

## Question 6: How do tuning and sample size interact in influencing the discrimination and calibration results?

# Congratulations! You have completed Assignment 3!