# Assignment 2

In Assignment 2, you will build five classification models on each of six datasets, then compare them to see which modeling technique works best on each dataset, using AUC and the scaled Brier score as the metrics.
 
Learning objectives: (a) training and evaluating models, (b) comparing modeling techniques
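
To make the two metrics concrete, here is a minimal base-R sketch. It assumes the scaled Brier score is defined as 1 - Brier / Brier_ref, where Brier_ref is the Brier score of a model that always predicts the observed prevalence; the `sdgm` implementations may differ in detail. The names `auc_manual` and `brier_scaled` are made up for this illustration.

```r
# Rank-based (Mann-Whitney) AUC: probability that a random positive
# case receives a higher prediction than a random negative case.
auc_manual <- function(pred, y) {
  r  <- rank(pred)                 # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Scaled Brier score (assumed definition): 1 means perfect, 0 means
# no better than always predicting the prevalence.
brier_scaled <- function(pred, y) {
  brier     <- mean((pred - y)^2)
  brier_ref <- mean(y) * (1 - mean(y))
  1 - brier / brier_ref
}

# Tiny hand-made example, not real data
y    <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)
auc_manual(pred, y)    # 0.75
brier_scaled(pred, y)  # 0.3675
```

Higher is better for both quantities under these definitions, which is worth keeping in mind when reading the violin plots later on.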

## Some notes

For reference, please see "Lecture 3 - Tree Examples 2023.ipynb", "Lecture 4 - Nested K-cross validation of CART on CCHS.ipynb", "Lecture 4 - Parallel nested CV on CCHS.ipynb", and "sdgm-getting started.pdf"

The five classification models available in `sdgm` are:
1. logistic regression: `lr.bestmodel.bin`
2. CART: `cart.bestmodel.bin`
3. random forest: `rf.bestmodel.bin`
4. LightGBM (lgbm): `lgbm.bestmodel.bin`
5. support vector machine (svm): `svm.bestmodel.bin`
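
Because the five function names share the pattern `<model>.bestmodel.bin`, a fitting function can be looked up from the model's short name instead of writing a long if/else chain. The sketch below uses stand-in functions with an assumed `(data, outcome)` signature so that it runs without `sdgm`; `fit_by_name` is a hypothetical helper, not part of the package.

```r
# Stand-in fitters (hypothetical); in the notebook these would be the sdgm functions
lr.bestmodel.bin   <- function(data, outcome) "lr fit"
cart.bestmodel.bin <- function(data, outcome) "cart fit"

# Build the function name from the model's short name and call it
fit_by_name <- function(model, data, outcome) {
  fitter <- get(paste0(model, ".bestmodel.bin"), mode = "function")
  fitter(data, outcome)
}

fit_by_name("cart", data = NULL, outcome = "y")  # "cart fit"
```

With the real package, `getExportedValue("sdgm", paste0(model, ".bestmodel.bin"))` would be one way to perform the same lookup, assuming those functions are exported.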

## Load useful packages

In [3]:
library(sdgm)
library(dplyr)
library(ggplot2)

## Some preparation work

In [4]:
# Change plot size to 18 * 6
options(repr.plot.width=18, repr.plot.height=6) 

In [5]:
n_iter <- 15 # number of repetitions; matches the 15 rows of the result data frames below
train_data_split <- 0.8
model_vec <- c("lr", "cart", "rf", "lgbm", "svm") # names of models

In [6]:
# -------- define dataframes to save the intermediate results ---------
# AUC values of the repeated train/test split, 15 times
auc_split <- data.frame(type = rep("train/test split", 15),
                      metric = rep("auc", 15),
                      lr = rep(0, 15),
                      cart = rep(0, 15),
                      rf = rep(0, 15),
                      lgbm = rep(0, 15),
                      svm = rep(0, 15))

# AUC values of the repeated nested cv, 15 times
auc_cv <- data.frame(type = rep("nested cv", 15),
                      metric = rep("auc", 15),
                      lr = rep(0, 15),
                      cart = rep(0, 15),
                      rf = rep(0, 15),
                      lgbm = rep(0, 15),
                      svm = rep(0, 15))

# brier scores of the repeated train/test split, 15 times
brier_split <- data.frame(type = rep("train/test split", 15),
                        metric = rep("brier", 15),
                        lr = rep(0, 15),
                        cart = rep(0, 15),
                        rf = rep(0, 15),
                        lgbm = rep(0, 15),
                        svm = rep(0, 15))

# brier scores of the repeated nested cv, 15 times
brier_cv <- data.frame(type = rep("nested cv", 15),
                        metric = rep("brier", 15),
                        lr = rep(0, 15),
                        cart = rep(0, 15),
                        rf = rep(0, 15),
                        lgbm = rep(0, 15),
                        svm = rep(0, 15))
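
The four result data frames above differ only in their `type` and `metric` labels, so they could also be produced by one small helper. This is an optional sketch; `make_result_df` is a made-up name and not part of `sdgm`.

```r
# Build an empty result table: one row per repetition, one column per model
make_result_df <- function(type, metric, models, n_rep) {
  scores <- as.data.frame(matrix(0, nrow = n_rep, ncol = length(models),
                                 dimnames = list(NULL, models)))
  cbind(data.frame(type = rep(type, n_rep), metric = rep(metric, n_rep)), scores)
}

models <- c("lr", "cart", "rf", "lgbm", "svm")
auc_split_alt <- make_result_df("train/test split", "auc", models, n_rep = 15)
dim(auc_split_alt)    # 15 7
names(auc_split_alt)
```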

## Dataset 1: `sdgm::C2` BankNote

For more details about this dataset, see [here](https://archive.ics.uci.edu/dataset/267/banknote+authentication)

In [None]:
# show the description of this dataset
?sdgm::C2

In [None]:
# first glance of the dataset
full_data <- sdgm::C2
head(full_data)
dim(full_data)

In [None]:
# Convert characters to numbers. Note because of the special format in the data, `as.numeric` doesn't work here.
# The code below is to extract numbers from strings
full_data$class <- as.numeric(gsub("\\D", "", full_data$class))
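
A short illustration of why `as.numeric` alone fails and what the `gsub` call does. It assumes the values are stored with literal quote characters (for example `'0'`), as the printed tables later in this notebook suggest.

```r
raw <- c("'0'", "'1'", "'0'")      # strings that contain literal quote characters

suppressWarnings(as.numeric(raw))   # NA NA NA: the quotes are not part of a number
as.numeric(gsub("\\D", "", raw))    # 0 1 0: "\\D" matches every non-digit character

# Caveat: the pattern also strips minus signs and decimal points
gsub("\\D", "", c("-1", "2.5"))     # "1" "25"
```

The caveat means this trick is only safe for columns that hold non-negative integers.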

In [None]:
# Check the data now
summary(full_data)
head(full_data)

# Check the outcome variable
table(full_data$class)

### Question 1.1: Based on what you have learnt about the dataset, is there anything that should be done to prepare the data?

**Hint 1:** Are all the variables in the correct variable type?

**Hint 2 (Important!):** Some models require the outcome variable to be numeric and coded as 0/1.

***Note:*** Some datasets need a bit of data preparation and some don't. If you think this one needs preparation, choose "yes" and add your code in the cell below. Otherwise, choose "no" and leave the cell below unchanged.

Your answer: yes or no? (choose one by deleting the other)

In [None]:
# The data preparation step if needed

# Check the data now
summary(full_data)

In [None]:
# remove one variable to add more variance to the outputs across the repeated evaluations
full_data <- full_data %>% select(-var)

### Question 1.2: Build five different models and evaluate them

In [None]:
# define the outcome variable
voutcome <- Your answer

Note: You will use parallel computing here, see "Lecture 4 - Parallel nested CV on CCHS.ipynb".

Hint: remember to include "model" in the vector when you call `parallel::clusterExport`. For more details about `parallel::clusterExport`, see [here](https://cran.r-project.org/web/packages/SimDesign/vignettes/Fixed_obj_fun.html).
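
The general pattern (create a cluster, export the objects the workers need, run the repeated task, stop the cluster) can be sketched with a toy computation. The variable `scale_factor` and the worker count of 2 are arbitrary choices for this illustration.

```r
library(parallel)

scale_factor <- 10   # an object the workers need, like `model` in the assignment

cl <- makeCluster(2)                                       # start 2 worker processes
clusterExport(cl, c("scale_factor"), envir = environment()) # copy the object to each worker

# Each repetition returns two numbers, so the result is a 2 x 4 matrix
res <- parSapply(cl, 1:4, function(i) c(i, i * scale_factor))

stopCluster(cl)                                            # always release the workers
res[1, ]  # 1 2 3 4
res[2, ]  # 10 20 30 40
```

Worker processes start with an empty workspace, which is why any variable used inside the function, other than its own arguments, has to be exported by name.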

In [None]:
# ============= 15 repeated train/test split ============= 
for (model in model_vec) {
    
    # parallel computing
    Your answer
    
    res <- parallel::parSapply(cl, 1:15, function(x)
    {
        # partition data into train and test portions
        Your answer
        
        # retrieve train and test data
        Your answer
        
        # build the model
        if (model == "lr") {
            Your answer
        } else if (model == "cart") {
            Your answer
        } else if (model == "rf") {
            Your answer
        } else if (model == "lgbm") {
            Your answer
        } else if (model == "svm") {
            Your answer
        }
        
        # predict
        Your answer
  
        # calculate and return AUC and brier score
        Your answer
        
        c(test_auc, test_brier)
    })
    parallel::stopCluster(cl)
    
    # save results of the model
    if (model == "lr") {
        auc_split$lr <- res[1,]
        brier_split$lr <- res[2,]
    } else if (model == "cart") {
        auc_split$cart <- res[1,]
        brier_split$cart <- res[2,]
    } else if (model == "rf") {
        auc_split$rf <- res[1,]
        brier_split$rf <- res[2,]
    } else if (model == "lgbm") {
        auc_split$lgbm <- res[1,]
        brier_split$lgbm <- res[2,]
    } else if (model == "svm") {
        auc_split$svm <- res[1,]
        brier_split$svm <- res[2,]
    }
}

# save results
res.split.df1 <- rbind(auc_split, brier_split)

# print results
print(res.split.df1)
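
The if/else chain that stores each model's results can be written more compactly, because a data frame column can be selected by a name held in a variable with `[[`. A minimal sketch with two models and arbitrary placeholder numbers (not real results):

```r
model_vec   <- c("lr", "cart")
auc_split   <- data.frame(lr = rep(0, 3), cart = rep(0, 3))
brier_split <- auc_split

for (model in model_vec) {
  # placeholder for the 2 x n matrix returned by parSapply (values are arbitrary)
  res <- rbind(c(0.80, 0.90, 0.85), c(0.30, 0.40, 0.35))

  auc_split[[model]]   <- res[1, ]   # row 1: AUC per repetition
  brier_split[[model]] <- res[2, ]   # row 2: Brier score per repetition
}

auc_split
```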

Following the structure of the previous cell, complete the cell below for nested CV. Remember to use parallel computing.
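
As a reminder of what the outer loop does, here is a self-contained sketch of 5-fold cross-validation on simulated data with base R's `glm`. The data, the fold construction with `split`, and the plain (unscaled) Brier score are all assumptions made for illustration; in the assignment the folds come from `caret::createFolds` and, presumably, any inner tuning loop runs inside the `sdgm` model-fitting functions.

```r
set.seed(1)
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(toy$x))   # simulated binary outcome

# Randomly assign each row to one of 5 folds; each list element holds test indices
folds <- split(sample(seq_len(n)), rep(1:5, length.out = n))

fold_res <- sapply(folds, function(test_idx) {
  train <- toy[-test_idx, ]            # everything outside the fold
  test  <- toy[test_idx, ]             # the held-out fold
  fit   <- glm(y ~ x, data = train, family = binomial())
  p     <- predict(fit, newdata = test, type = "response")
  mean((p - test$y)^2)                 # unscaled Brier score on the held-out fold
})

mean(fold_res)                         # average over the 5 folds
```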

In [None]:
# ============= repeated nested 5-fold CV ============= 
for (model in model_vec) {
    # parallel computing
    Your answer
    
    # this is the repeated loop
    res <- parallel::parSapply(cl, seq(15), function(i) 
    {
        # this is the nested CV outer loop
        nested_res <- sapply(caret::createFolds(full_data[, voutcome], k=5), function(x) 
        {
            Your answer
        })
        nested_cv_auc <- mean(nested_res[1,], na.rm=T)
        nested_cv_brier <- mean(nested_res[2,], na.rm=T)
                               
        c(nested_cv_auc, nested_cv_brier)
    })
    parallel::stopCluster(cl)
    
    # save results of the model
    if (model == "lr") {
        auc_cv$lr <- res[1,]
        brier_cv$lr <- res[2,]
    } else if (model == "cart") {
        auc_cv$cart <- res[1,]
        brier_cv$cart <- res[2,]
    } else if (model == "rf") {
        auc_cv$rf <- res[1,]
        brier_cv$rf <- res[2,]
    } else if (model == "lgbm") {
        auc_cv$lgbm <- res[1,]
        brier_cv$lgbm <- res[2,]
    } else if (model == "svm") {
        auc_cv$svm <- res[1,]
        brier_cv$svm <- res[2,]
    }
}

# save results
res.cv.df1 <- rbind(auc_cv, brier_cv)

print(res.cv.df1)

In [None]:
auc.df1 <- rbind(res.split.df1 %>% filter(metric == "auc"), 
                 res.cv.df1 %>% filter(metric == "auc"))

# pivot the dataframe from wide to long for plotting
auc.df1.long <- auc.df1 %>% 
    tidyr::pivot_longer(-c(type, metric), 
                        names_to = "model",
                        values_to = "auc")
auc.df1.long$model <- factor(auc.df1.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
auc.df1.long$type <- factor(auc.df1.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(auc.df1.long, aes(model, auc, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

In [None]:
brier.df1 <- rbind(res.split.df1 %>% filter(metric == "brier"), 
                   res.cv.df1 %>% filter(metric == "brier"))

# pivot the dataframe from wide to long for plotting
brier.df1.long <- brier.df1 %>% 
    tidyr::pivot_longer(-c(type,metric), 
                        names_to = "model",
                        values_to = "brier")
brier.df1.long$model <- factor(brier.df1.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
brier.df1.long$type <- factor(brier.df1.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(brier.df1.long, aes(model, brier, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

### Question 1.3: Based on the above violin plot, which model is the best one? Why?
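
A numeric summary can complement the visual comparison. The sketch below computes the median and interquartile range per evaluation type and model using base R's `aggregate`; the values in `toy_long` are simulated stand-ins, not results from any dataset in this assignment.

```r
set.seed(42)
# Simulated long-format scores with the same columns as auc.df1.long
toy_long <- data.frame(
  type  = rep(c("train/test split", "nested cv"), each = 6),
  model = rep(c("lr", "cart"), times = 6),
  auc   = runif(12, 0.7, 0.9)
)

# One row per (type, model) combination
auc_summary <- aggregate(auc ~ type + model, data = toy_long,
                         FUN = function(v) c(median = median(v), iqr = IQR(v)))
auc_summary
```

A model with a high median and a small spread across repetitions is generally easier to defend as "best" than one that is high on average but unstable.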

## Dataset 2: `sdgm::C3` Breast Cancer Wisconsin

For more details about this dataset, see [here](https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original)

In [None]:
# show the description of this dataset
?sdgm::C3

In [None]:
# first glance of the dataset
full_data <- sdgm::C3
head(full_data)
dim(full_data)

In [None]:
# Convert characters to numbers. Note because of the special format in the data, `as.numeric` doesn't work here.
# The code below is to extract numbers from strings
full_data$Class <- as.numeric(gsub("\\D", "", full_data$Class))

In [None]:
# Check the data now
summary(full_data)

# Check the outcome variable
table(full_data$Class)

### Question 2.1: Based on what you have learnt about the dataset, is there anything that should be done to prepare the data?

**Hint 1:** Are all the variables in the correct variable type?

**Hint 2 (Important!):** Some models require the outcome variable to be numeric and coded as 0/1.

***Note:*** Some datasets need a bit of data preparation and some don't. If you think this one needs preparation, choose "yes" and add your code in the cell below. Otherwise, choose "no" and leave the cell below unchanged.

Your answer: yes or no? (choose one by deleting the other)

In [None]:
# The data preparation step if needed

# Check the data now
summary(full_data)

### Question 2.2: Build five different models and evaluate them

In [None]:
# define the outcome variable
voutcome <- Your answer

Based on what you have done for Dataset 1, complete the cells below for this dataset. Remember to use parallel computing.

In [None]:
# ============= 15 repeated train/test split ============= 
for (model in model_vec) {
    Your answer
}

# save results
res.split.df2 <- rbind(auc_split, brier_split)

# print results
print(res.split.df2)  

In [None]:
# ============= repeated nested 5-fold CV ============= 
for (model in model_vec) {
    Your answer
}

# save results
res.cv.df2 <- rbind(auc_cv, brier_cv)

print(res.cv.df2)

In [None]:
auc.df2 <- rbind(res.split.df2 %>% filter(metric == "auc"), 
                 res.cv.df2 %>% filter(metric == "auc"))

# pivot the dataframe from wide to long for plotting
auc.df2.long <- auc.df2 %>% 
    tidyr::pivot_longer(-c(type, metric), 
                        names_to = "model",
                        values_to = "auc")
auc.df2.long$model <- factor(auc.df2.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
auc.df2.long$type <- factor(auc.df2.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(auc.df2.long, aes(model, auc, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

In [None]:
brier.df2 <- rbind(res.split.df2 %>% filter(metric == "brier"), 
                   res.cv.df2 %>% filter(metric == "brier"))

# pivot the dataframe from wide to long for plotting
brier.df2.long <- brier.df2 %>% 
    tidyr::pivot_longer(-c(type,metric), 
                        names_to = "model",
                        values_to = "brier")
brier.df2.long$model <- factor(brier.df2.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
brier.df2.long$type <- factor(brier.df2.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(brier.df2.long, aes(model, brier, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

### Question 2.3: Based on the above violin plot, which model is the best one? Why?

## Dataset 3: `sdgm::C16` Diabetic Retinopathy

For more details about this dataset, see [here](https://archive.ics.uci.edu/dataset/329/diabetic+retinopathy+debrecen)

In [None]:
# show the description of this dataset
?sdgm::C16

In [None]:
# first glance of the dataset
full_data <- sdgm::C16
head(full_data)
dim(full_data)

In [None]:
# Convert characters to numbers. Note because of the special format in the data, `as.numeric` doesn't work here.
# The code below is to extract numbers from strings
full_data$Quality <- as.numeric(gsub("\\D", "", full_data$Quality))
full_data$PreScreen <- as.numeric(gsub("\\D", "", full_data$PreScreen))
full_data$AM_FM <- as.numeric(gsub("\\D", "", full_data$AM_FM))
full_data$Class <- as.numeric(gsub("\\D", "", full_data$Class))

In [None]:
# Check the data now
summary(full_data)

# Check the outcome variable
table(full_data$Class)

### Question 3.1: Based on what you have learnt about the dataset, is there anything that should be done to prepare the data?

**Hint 1:** Are all the variables in the correct variable type?

**Hint 2 (Important!):** Some models require the outcome variable to be numeric and coded as 0/1.

***Note:*** Some datasets need a bit of data preparation and some don't. If you think this one needs preparation, choose "yes" and add your code in the cell below. Otherwise, choose "no" and leave the cell below unchanged.

Your answer: yes or no? (choose one by deleting the other)

In [None]:
# The data preparation step if needed
           
# Check the data now
summary(full_data)

### Question 3.2: Build five different models and evaluate them

In [None]:
# define the outcome variable
voutcome <- Your answer

Based on what you have done for Dataset 1, complete the cells below for this dataset. Remember to use parallel computing.

In [None]:
# ============= 15 repeated train/test split ============= 
for (model in model_vec) {
    Your answer
}

# save results
res.split.df3 <- rbind(auc_split, brier_split)

# print results
print(res.split.df3)  

In [None]:
# ============= repeated nested 5-fold CV ============= 
for (model in model_vec) {
    Your answer
}

# save results
res.cv.df3 <- rbind(auc_cv, brier_cv)

print(res.cv.df3)

In [None]:
auc.df3 <- rbind(res.split.df3 %>% filter(metric == "auc"), 
                 res.cv.df3 %>% filter(metric == "auc"))

# pivot the dataframe from wide to long for plotting
auc.df3.long <- auc.df3 %>% 
    tidyr::pivot_longer(-c(type, metric), 
                        names_to = "model",
                        values_to = "auc")
auc.df3.long$model <- factor(auc.df3.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
auc.df3.long$type <- factor(auc.df3.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(auc.df3.long, aes(model, auc, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

In [None]:
brier.df3 <- rbind(res.split.df3 %>% filter(metric == "brier"), 
                   res.cv.df3 %>% filter(metric == "brier"))

# pivot the dataframe from wide to long for plotting
brier.df3.long <- brier.df3 %>% 
    tidyr::pivot_longer(-c(type,metric), 
                        names_to = "model",
                        values_to = "brier")
brier.df3.long$model <- factor(brier.df3.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
brier.df3.long$type <- factor(brier.df3.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(brier.df3.long, aes(model, brier, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

### Question 3.3: Based on the above violin plot, which model is the best one? Why?

## Dataset 4: `sdgm::C18` EEG Eye State

For more details about this dataset, see [here](https://archive.ics.uci.edu/dataset/264/eeg+eye+state). 

In [32]:
# show the description of this dataset
?sdgm::C18

C18 {sdgm}, R Documentation


In [33]:
# first glance of the dataset
# Considering the sample size, we will sample 10% of it for this assignment.
full_data <- sdgm::C18 %>% slice_sample(prop=0.1)
head(full_data)
dim(full_data)


| | AF3 | F7 | F3 | FC5 | T7 | P7 | O1 | O2 | P8 | T8 | FC6 | F4 | F8 | AF4 | eyeDetection |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 4278.46 | 3974.36 | 4261.03 | 4106.67 | 4326.67 | 4616.41 | 4085.13 | 4624.1 | 4209.74 | 4225.13 | 4192.82 | 4281.03 | 4584.1 | 4339.49 | '0' |
| 2 | 4285.64 | 3987.18 | 4256.92 | 4109.23 | 4350.77 | 4628.72 | 4076.92 | 4618.97 | 4204.62 | 4231.28 | 4209.23 | 4266.15 | 4611.79 | 4341.54 | '0' |
| 3 | 4292.31 | 3987.69 | 4261.54 | 4104.1 | 4312.82 | 4611.79 | 4054.87 | 4610.77 | 4201.54 | 4234.87 | 4213.85 | 4285.13 | 4614.36 | 4362.05 | '0' |
| 4 | 4435.38 | 4088.21 | 4298.97 | 4169.74 | 4317.95 | 4592.31 | 4058.97 | 4589.23 | 4173.85 | 4219.49 | 4174.36 | 4312.31 | 4655.9 | 4473.85 | '0' |
| 5 | 4265.13 | 3977.44 | 4236.41 | 4110.77 | 4329.23 | 4600.0 | 4052.31 | 4632.31 | 4221.03 | 4235.9 | 4212.31 | 4270.77 | 4602.05 | 4350.26 | '1' |
| 6 | 4311.79 | 4010.77 | 4279.49 | 4151.79 | 4360.0 | 4640.51 | 4070.77 | 4637.95 | 4227.18 | 4262.05 | 4226.67 | 4299.49 | 4623.08 | 4368.72 | '0' |
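
Because the 10% subsample is drawn at random, every run of the notebook works on different rows unless a seed is set first. A minimal base-R sketch of a reproducible 10% sample follows; the seed value and the toy data frame are arbitrary choices for this illustration.

```r
set.seed(2023)  # arbitrary seed, chosen only for this illustration
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# Draw 10% of the row indices without replacement, then subset
keep       <- sample(nrow(toy), size = floor(0.1 * nrow(toy)))
toy_sample <- toy[keep, ]
dim(toy_sample)  # 100 2
```

Calling `set.seed()` immediately before `slice_sample(prop = 0.1)` should have the same effect in the dplyr version, since it draws from R's random number generator.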


In [34]:
# Convert characters to numbers. Note because of the special format in the data, `as.numeric` doesn't work here.
# The code below is to extract numbers from strings
full_data$eyeDetection <- as.numeric(gsub("\\D", "", full_data$eyeDetection))

In [35]:
# Check the data now
missing_data <- sapply(full_data, function(x) sum(is.na(x)))
variables_with_missing_data <- names(full_data)[missing_data > 0]
print(variables_with_missing_data)
summary(full_data)
head(full_data)

# Check the outcome variable
table(full_data$eyeDetection)

character(0)


      AF3             F7             F3            FC5             T7      
 Min.   :4199   Min.   :3915   Min.   :4208   Min.   :4061   Min.   :4305  
 1st Qu.:4279   1st Qu.:3991   1st Qu.:4250   1st Qu.:4108   1st Qu.:4332  
 Median :4294   Median :4005   Median :4262   Median :4120   Median :4339  
 Mean   :4301   Mean   :4010   Mean   :4264   Mean   :4122   Mean   :4341  
 3rd Qu.:4312   3rd Qu.:4023   3rd Qu.:4271   3rd Qu.:4133   3rd Qu.:4347  
 Max.   :4504   Max.   :4155   Max.   :4383   Max.   :4236   Max.   :4464  
       P7             O1             O2             P8             T8      
 Min.   :4566   Min.   :4027   Min.   :4574   Min.   :4158   Min.   :4167  
 1st Qu.:4611   1st Qu.:4058   1st Qu.:4604   1st Qu.:4190   1st Qu.:4219  
 Median :4617   Median :4070   Median :4613   Median :4199   Median :4228  
 Mean   :4620   Mean   :4073   Mean   :4615   Mean   :4201   Mean   :4230  
 3rd Qu.:4626   3rd Qu.:4084   3rd Qu.:4624   3rd Qu.:4209   3rd Qu.:4239  
 Max.   :475

| | AF3 | F7 | F3 | FC5 | T7 | P7 | O1 | O2 | P8 | T8 | FC6 | F4 | F8 | AF4 | eyeDetection |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 4278.46 | 3974.36 | 4261.03 | 4106.67 | 4326.67 | 4616.41 | 4085.13 | 4624.1 | 4209.74 | 4225.13 | 4192.82 | 4281.03 | 4584.1 | 4339.49 | 0 |
| 2 | 4285.64 | 3987.18 | 4256.92 | 4109.23 | 4350.77 | 4628.72 | 4076.92 | 4618.97 | 4204.62 | 4231.28 | 4209.23 | 4266.15 | 4611.79 | 4341.54 | 0 |
| 3 | 4292.31 | 3987.69 | 4261.54 | 4104.1 | 4312.82 | 4611.79 | 4054.87 | 4610.77 | 4201.54 | 4234.87 | 4213.85 | 4285.13 | 4614.36 | 4362.05 | 0 |
| 4 | 4435.38 | 4088.21 | 4298.97 | 4169.74 | 4317.95 | 4592.31 | 4058.97 | 4589.23 | 4173.85 | 4219.49 | 4174.36 | 4312.31 | 4655.9 | 4473.85 | 0 |
| 5 | 4265.13 | 3977.44 | 4236.41 | 4110.77 | 4329.23 | 4600.0 | 4052.31 | 4632.31 | 4221.03 | 4235.9 | 4212.31 | 4270.77 | 4602.05 | 4350.26 | 1 |
| 6 | 4311.79 | 4010.77 | 4279.49 | 4151.79 | 4360.0 | 4640.51 | 4070.77 | 4637.95 | 4227.18 | 4262.05 | 4226.67 | 4299.49 | 4623.08 | 4368.72 | 0 |



  0   1 
834 664 

### Question 4.1: Based on what you have learnt about the dataset, is there anything that should be done to prepare the data?

**Hint 1:** Are all the variables in the correct variable type?

**Hint 2 (Important!):** Some models require the outcome variable to be numeric and coded as 0/1.

***Note:*** Some datasets need a bit of data preparation and some don't. If you think this one needs preparation, choose "yes" and add your code in the cell below. Otherwise, choose "no" and leave the cell below unchanged.

Your answer: yes or no? (choose one by deleting the other)

In [None]:
# The data preparation step if needed

# Check the data now
summary(full_data)

### Question 4.2: Build five different models and evaluate them

In [None]:
# define the outcome variable
voutcome <- Your answer

Based on what you have done for Dataset 1, complete the cells below for this dataset. Remember to use parallel computing.

In [None]:
# ============= 15 repeated train/test split ============= 
for (model in model_vec) {
    Your answer
}

# save results
res.split.df4 <- rbind(auc_split, brier_split)

# print results
print(res.split.df4)  

In [None]:
# ============= repeated nested 5-fold CV ============= 
for (model in model_vec) {
    Your answer
}

# save results
res.cv.df4 <- rbind(auc_cv, brier_cv)

print(res.cv.df4)

In [None]:
auc.df4 <- rbind(res.split.df4 %>% filter(metric == "auc"), 
                 res.cv.df4 %>% filter(metric == "auc"))

# pivot the dataframe from wide to long for plotting
auc.df4.long <- auc.df4 %>% 
    tidyr::pivot_longer(-c(type, metric), 
                        names_to = "model",
                        values_to = "auc")
auc.df4.long$model <- factor(auc.df4.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
auc.df4.long$type <- factor(auc.df4.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(auc.df4.long, aes(model, auc, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

In [None]:
brier.df4 <- rbind(res.split.df4 %>% filter(metric == "brier"), 
                   res.cv.df4 %>% filter(metric == "brier"))

# pivot the dataframe from wide to long for plotting
brier.df4.long <- brier.df4 %>% 
    tidyr::pivot_longer(-c(type,metric), 
                        names_to = "model",
                        values_to = "brier")
brier.df4.long$model <- factor(brier.df4.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
brier.df4.long$type <- factor(brier.df4.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(brier.df4.long, aes(model, brier, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

### Question 4.3: Based on the above violin plot, which model is the best one? Why?

## Dataset 5: `sdgm::C23` Stroke

For more details about this dataset, see [here](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data)

In [7]:
# show the description of this dataset
?sdgm::C23

C23 {sdgm}, R Documentation


In [8]:
# first glance of the dataset
# Considering the sample size, we will sample 10% of it for this assignment.
full_data <- sdgm::C23 %>% slice_sample(prop=0.1)
head(full_data)
dim(full_data)

| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<chr>` | `<chr>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<chr>` |
| 1 | Female | 60 | '0' | '0' | Yes | Self-employed | Rural | 72.77 | 30.7 | never smoked | '0' |
| 2 | Female | 30 | '0' | '0' | Yes | Private | Rural | 114.68 | 25.2 | never smoked | '0' |
| 3 | Male | 11 | '0' | '0' | No | children | Urban | 92.16 | 20.3 | never smoked | '0' |
| 4 | Male | 20 | '0' | '0' | No | Private | Urban | 85.57 | 25.3 | never smoked | '0' |
| 5 | Male | 22 | '0' | '0' | No | Private | Urban | 89.43 | 27.3 | never smoked | '0' |
| 6 | Male | 58 | '0' | '0' | Yes | Private | Rural | 90.51 | 29.2 | smokes | '0' |


In [9]:
# Convert characters to numbers. Note because of the special format in the data, `as.numeric` doesn't work here.
# The code below is to extract numbers from strings
full_data$hypertension <- as.numeric(gsub("\\D", "", full_data$hypertension))
full_data$heart_disease <- as.numeric(gsub("\\D", "", full_data$heart_disease))
full_data$stroke <- as.numeric(gsub("\\D", "", full_data$stroke))

In [10]:
# Check the data now
summary(full_data)


# Check the outcome variable
table(full_data$stroke)

    gender          age         hypertension    heart_disease     ever_married
 Female:1830   Min.   :10.00   Min.   :0.0000   Min.   :0.00000   No : 704    
 Male  :1077   1st Qu.:33.00   1st Qu.:0.0000   1st Qu.:0.00000   Yes:2203    
 Other :   0   Median :48.00   Median :0.0000   Median :0.00000               
               Mean   :47.62   Mean   :0.1087   Mean   :0.04919               
               3rd Qu.:62.00   3rd Qu.:0.0000   3rd Qu.:0.00000               
               Max.   :82.00   Max.   :1.0000   Max.   :1.00000               
         work_type    Residence_type avg_glucose_level      bmi       
 children     :  47   Rural:1481     Min.   : 55.12    Min.   :10.10  
 Govt_job     : 448   Urban:1426     1st Qu.: 77.72    1st Qu.:25.10  
 Never_worked :   7                  Median : 92.02    Median :29.20  
 Private      :1913                  Mean   :106.66    Mean   :30.11  
 Self-employed: 492                  3rd Qu.:114.07    3rd Qu.:33.90  
                     


   0    1 
2853   54 
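
With only 54 positive cases among 2,907 rows, a purely random split can leave very few positives in the test portion. Stratifying on the outcome keeps the class proportions roughly equal in both portions. The base-R sketch below illustrates the idea with the class counts printed above; `stratified_split` is a made-up helper and the seed is arbitrary. In the notebook, passing the outcome column to `splitTools::partition(..., type = "stratified")` serves the same purpose.

```r
# Sample a fixed fraction of indices within each outcome class
stratified_split <- function(y, p_train = 0.7) {
  train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(p_train * length(idx)))]
  }))
  train_idx <- sort(unname(train_idx))
  list(train = train_idx, test = setdiff(seq_along(y), train_idx))
}

set.seed(123)                          # arbitrary seed for reproducibility
y   <- c(rep(0, 2853), rep(1, 54))     # same class counts as the table above
idx <- stratified_split(y)

table(y[idx$train])  # 1997 negatives, 38 positives
table(y[idx$test])   #  856 negatives, 16 positives
```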

### Question 5.1: Based on what you have learnt about the dataset, is there anything that should be done to prepare the data?

**Hint 1:** Are all the variables in the correct variable type?

**Hint 2 (Important!):** Some models require the outcome variable to be numeric and coded as 0/1.

***Note:*** Some datasets need a bit of data preparation and some don't. If you think this one needs preparation, choose "yes" and add your code in the cell below. Otherwise, choose "no" and leave the cell below unchanged.

Your answer: yes or no? (choose one by deleting the other)

In [11]:
# The data preparation step if needed

# Check the data now


### Question 5.2: Build five different models and evaluate them

In [12]:
# define the outcome variable
voutcome <- "stroke"

Based on what you have done for Dataset 1, complete the cells below for this dataset. Remember to use parallel computing.

In [None]:
# ============= 15 repeated train/test split ============= 
for (model in model_vec) {
      
    # parallel computing
    cl<-parallel::makeCluster(5)
    parallel::clusterExport(cl, c("full_data", "voutcome", "model"), envir = environment() )
    
    res <- parallel::parSapply(cl, 1:15, function(x)
    {
        # partition data into train and test portions, stratified on the outcome
        idx <- splitTools::partition(full_data[, voutcome], p=c(train=0.7, test=0.3), type="stratified")
        
        # retrieve train and test data
        train_data <- full_data[idx$train,]
        test_data <- full_data[idx$test,]
        
        # build the model
        if (model == "lr") {
            best_model<-sdgm::lr.bestmodel.bin(train_data, voutcome)
        } else if (model == "cart") {
            best_model<-sdgm::cart.bestmodel.bin(train_data, voutcome)
        } else if (model == "rf") {
            best_model<-sdgm::rf.bestmodel.bin(train_data, voutcome)
        } else if (model == "lgbm") {
            best_model<-sdgm::lgbm.bestmodel.bin(train_data, voutcome)
        } else if (model == "svm") {
            best_model<-sdgm::svm.bestmodel.bin(train_data, voutcome)
        }
        
        # predict
        preds<-predict(best_model, test_data)
  
        # calculate and return AUC and brier score (NA if prediction failed)
        if (!is.null(preds)) {
            test_auc <- sdgm::auc(preds, test_data[, voutcome])
            test_brier <- sdgm::brier(preds, test_data[, voutcome])
        } else {
            test_auc <- NA
            test_brier <- NA
            print("AUC and Brier calculation failed because there are no predicted values")
        }
        print(paste0("AUC on C23 Data: ", test_auc))
        print(paste0("Brier Score on C23 Data: ", test_brier))

        c(test_auc, test_brier)
    })
    parallel::stopCluster(cl)
    
    # save results of the model
    if (model == "lr") {
        auc_split$lr <- res[1,]
        brier_split$lr <- res[2,]
    } else if (model == "cart") {
        auc_split$cart <- res[1,]
        brier_split$cart <- res[2,]
    } else if (model == "rf") {
        auc_split$rf <- res[1,]
        brier_split$rf <- res[2,]
    } else if (model == "lgbm") {
        auc_split$lgbm <- res[1,]
        brier_split$lgbm <- res[2,]
    } else if (model == "svm") {
        auc_split$svm <- res[1,]
        brier_split$svm <- res[2,]
    }
}

# save results
res.split.df5 <- rbind(auc_split, brier_split)

# print results
print(res.split.df5)  

In [13]:
# ============= repeated nested 5-fold CV ============= 
for (model in model_vec) {
    
    # parallel computing
    cl <- parallel::makeCluster(5)
    parallel::clusterExport(cl, c("full_data", "voutcome", "model"), envir = environment())
    
    # this is the repeated loop
    res <- parallel::parSapply(cl, seq(15), function(i) 
    {
        # this is the nested CV outer loop
        nested_res <- sapply(caret::createFolds(full_data[, voutcome], k=5), function(x) 
        {
            testInds <- x
            trnInds <- setdiff(1:nrow(full_data), testInds)
            train_data <- full_data[trnInds,]
            test_data <- full_data[testInds,]

            if (model == "lr") {
                best_model <- sdgm::lr.bestmodel.bin(train_data, voutcome)
            } else if (model == "cart") {
                best_model <- sdgm::cart.bestmodel.bin(train_data, voutcome)
            } else if (model == "rf") {
                best_model <- sdgm::rf.bestmodel.bin(train_data, voutcome)
            } else if (model == "lgbm") {
                best_model <- sdgm::lgbm.bestmodel.bin(train_data, voutcome)
            } else if (model == "svm") {
                best_model <- sdgm::svm.bestmodel.bin(train_data, voutcome)
            }
            
            #predict
            preds <- predict(best_model, test_data)
            
            # calculate and return AUC and brier score
             if (!is.null(preds)) {
                nested_auc <- sdgm::auc(preds, test_data[, voutcome])
                nested_brier <- sdgm::brier(preds, test_data[, voutcome])
            } else {
                nested_auc <- NA
                nested_brier <- NA
            }

            c(nested_auc, nested_brier)
        })
        nested_cv_auc <- mean(nested_res[1,], na.rm=T)
        nested_cv_brier <- mean(nested_res[2,], na.rm=T)
                               
        c(nested_cv_auc, nested_cv_brier)
    })
    parallel::stopCluster(cl)
    
    # save results of the model
    if (model == "lr") {
        auc_cv$lr <- res[1,]
        brier_cv$lr <- res[2,]
    } else if (model == "cart") {
        auc_cv$cart <- res[1,]
        brier_cv$cart <- res[2,]
    } else if (model == "rf") {
        auc_cv$rf <- res[1,]
        brier_cv$rf <- res[2,]
    } else if (model == "lgbm") {
        auc_cv$lgbm <- res[1,]
        brier_cv$lgbm <- res[2,]
    } else if (model == "svm") {
        auc_cv$svm <- res[1,]
        brier_cv$svm <- res[2,]
    }
}

# save results
res.cv.df5 <- rbind(auc_cv, brier_cv)

print(res.cv.df5)

In [None]:
auc.df5 <- rbind(res.split.df5 %>% filter(metric == "auc"), 
                 res.cv.df5 %>% filter(metric == "auc"))

# pivot the dataframe from wide to long for plotting
auc.df5.long <- auc.df5 %>% 
    tidyr::pivot_longer(-c(type, metric), 
                        names_to = "model",
                        values_to = "auc")
auc.df5.long$model <- factor(auc.df5.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
auc.df5.long$type <- factor(auc.df5.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(auc.df5.long, aes(model, auc, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

In [None]:
brier.df5 <- rbind(res.split.df5 %>% filter(metric == "brier"), 
                   res.cv.df5 %>% filter(metric == "brier"))

# pivot the dataframe from wide to long for plotting
brier.df5.long <- brier.df5 %>% 
    tidyr::pivot_longer(-c(type,metric), 
                        names_to = "model",
                        values_to = "brier")
brier.df5.long$model <- factor(brier.df5.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
brier.df5.long$type <- factor(brier.df5.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(brier.df5.long, aes(model, brier, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

### Question 5.3: Based on the above violin plots, which model is the best one? Why?
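
When reading the plots, it helps to recall which direction is "better" for each metric. The sketch below computes an AUC and a scaled Brier score by hand in base R on a few made-up predictions. It assumes the common definition of the scaled Brier score, 1 - BS / BS_ref, where BS_ref is the Brier score of a model that always predicts the observed prevalence; whether `sdgm::auc` and `sdgm::brier` use exactly these definitions is an assumption, and the helper names `auc_by_rank` and `scaled_brier` are hypothetical.

```r
# Toy example: hand-computed AUC and scaled Brier score (base R only).
# The labels and probabilities below are made up for illustration.
y <- c(0, 0, 1, 1, 0, 1)               # observed outcomes
p <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.7)  # predicted probabilities

# AUC via the rank (Mann-Whitney) formulation
auc_by_rank <- function(p, y) {
    r <- rank(p)                       # average ranks handle ties
    n1 <- sum(y == 1)
    n0 <- sum(y == 0)
    (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Scaled Brier score (assumed definition): 1 - BS / BS_ref,
# where BS_ref comes from always predicting the prevalence mean(y)
scaled_brier <- function(p, y) {
    bs <- mean((p - y)^2)
    bs_ref <- mean((mean(y) - y)^2)
    1 - bs / bs_ref
}

auc_value <- auc_by_rank(p, y)
sbrier_value <- scaled_brier(p, y)
print(c(auc = auc_value, scaled_brier = sbrier_value))
```

Under these definitions, higher is better for both metrics: an AUC of 0.5 corresponds to random ranking, and a scaled Brier score of 0 corresponds to a model no better than predicting the prevalence.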

## Dataset 6: `sdgm::C27` Titanic

For more details about this dataset, see [here](https://www.kaggle.com/competitions/titanic/data)

In [14]:
# show the description of this dataset
?sdgm::C27

0,1
C27 {sdgm},R Documentation


In [15]:
# first glance of the dataset
full_data <- sdgm::C27
head(full_data)
dim(full_data)

Unnamed: 0_level_0,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked
Unnamed: 0_level_1,<chr>,<chr>,<fct>,<chr>,<chr>,<dbl>,<fct>
1,'0','3',male,'1','0',7.25,S
2,'1','1',female,'1','0',71.2833,C
3,'1','3',female,'0','0',7.925,S
4,'1','1',female,'1','0',53.1,S
5,'0','3',male,'0','0',8.05,S
6,'0','3',male,'0','0',8.4583,Q


In [16]:
# Convert characters to numbers. Note that, because of the special format of the data, `as.numeric` alone doesn't work here.
# The code below removes every non-digit character from the strings before converting them
full_data$Survived <- as.numeric(gsub("\\D", "", full_data$Survived))
full_data$Pclass <- as.numeric(gsub("\\D", "", full_data$Pclass))
full_data$SibSp <- as.numeric(gsub("\\D", "", full_data$SibSp))
full_data$Parch <- as.numeric(gsub("\\D", "", full_data$Parch))

In [17]:
# Check the data now
missing_data <- sapply(full_data, function(x) sum(is.na(x)))
variables_with_missing_data <- names(full_data)[missing_data > 0]
print(variables_with_missing_data)
                       
summary(full_data)
head(full_data)
str(full_data)

character(0)


    Survived          Pclass          Sex          SibSp           Parch       
 Min.   :0.0000   Min.   :1.000   female:314   Min.   :0.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:2.000   male  :577   1st Qu.:0.000   1st Qu.:0.0000  
 Median :0.0000   Median :3.000                Median :0.000   Median :0.0000  
 Mean   :0.3838   Mean   :2.309                Mean   :0.523   Mean   :0.3816  
 3rd Qu.:1.0000   3rd Qu.:3.000                3rd Qu.:1.000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :3.000                Max.   :8.000   Max.   :6.0000  
      Fare        Embarked
 Min.   :  0.00    :  2   
 1st Qu.:  7.91   C:168   
 Median : 14.45   Q: 77   
 Mean   : 32.20   S:644   
 3rd Qu.: 31.00           
 Max.   :512.33           

Unnamed: 0_level_0,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked
Unnamed: 0_level_1,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<fct>
1,0,3,male,1,0,7.25,S
2,1,1,female,1,0,71.2833,C
3,1,3,female,0,0,7.925,S
4,1,1,female,1,0,53.1,S
5,0,3,male,0,0,8.05,S
6,0,3,male,0,0,8.4583,Q


'data.frame':	891 obs. of  7 variables:
 $ Survived: num  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass  : num  3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ SibSp   : num  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : num  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


### Question 6.1: Based on what you have learnt about the dataset, is there anything that should be done to prepare the data?

**Hint 1:** Are all the variables in the correct variable type?

**Hint 2 (Important!):** Some models require the outcome variable to be a numeric variable between 0 and 1.

***Note:*** Some datasets need a bit of data preparation and some don't. If you think this one needs to be prepared, choose "yes" and add your code in the cell below. Otherwise, choose "no" and leave the cell below unchanged.
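
The two hints can be turned into quick checks. The following is a minimal sketch on a made-up data frame (`toy`, hypothetical) whose column types mimic the ones printed above; it is not the required answer. Recoding the empty-string factor level as `NA` is only one possible treatment and is an assumption here.

```r
# Toy data frame mimicking the column types seen above (values are made up)
toy <- data.frame(
    Survived = c(0, 1, 1, 0),
    Pclass   = c(3, 1, 3, 2),
    Embarked = factor(c("S", "C", "", "Q"), levels = c("", "C", "Q", "S"))
)

# Hint 1: inspect the type of every column
print(sapply(toy, class))

# Hint 2: confirm the outcome is numeric and only takes the values 0 and 1
outcome_ok <- is.numeric(toy$Survived) && all(toy$Survived %in% c(0, 1))
print(outcome_ok)

# Count rows per factor level; an empty-string level may hide missing values
print(table(toy$Embarked))

# One possible treatment (an assumption, not the required answer):
# recode "" as NA, then drop the now-unused level
toy$Embarked[toy$Embarked == ""] <- NA
toy$Embarked <- droplevels(toy$Embarked)
print(levels(toy$Embarked))
```

Whether rows with a missing level should be dropped, imputed, or kept as their own category depends on the models being compared.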

Your answer: yes or no? (choose one by deleting the other)

In [18]:
# The data preparation step if needed
str(full_data)

#encoded_data <- full_data
#categorical_columns <- c("Sex", "Embarked")

#for (col in categorical_columns) {
  #encoded_data <- cbind(encoded_data, model.matrix(~0 + as.factor(full_data[[col]])))
  #colnames(encoded_data) <- make.names(colnames(encoded_data), unique = TRUE)
#}

# Remove the original categorical columns from encoded_data
#encoded_data <- encoded_data[, !colnames(encoded_data) %in% categorical_columns]

# View the result
#head(encoded_data)
# Check the data now
#summary(encoded_data)
#str(encoded_data)

'data.frame':	891 obs. of  7 variables:
 $ Survived: num  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass  : num  3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ SibSp   : num  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : num  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


### Question 6.2: Build five different models and evaluate them

In [19]:
# define the outcome variable
voutcome <- "Survived"

Based on what you have done for Dataset 1, complete the cells below for this dataset. Remember to use parallel computing.
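
As a reminder of the parallel pattern, here is a minimal sketch using only the base `parallel` package. The function `score_once` is a hypothetical stand-in for "split the data, fit a model, score it"; it only returns the split sizes so that the example runs without `sdgm`. The worker count of 2 is an arbitrary choice for the sketch.

```r
library(parallel)

# Hypothetical stand-in for "fit a model and score it"; returns two values
score_once <- function(i, n_rows) {
    set.seed(i)                          # reproducible per repetition
    idx <- sample(n_rows, size = floor(0.7 * n_rows))
    c(train_n = length(idx), test_n = n_rows - length(idx))
}

n_rows <- 100
cl <- makeCluster(2)                     # 2 workers keeps the sketch light

# Workers start with an empty workspace, so export what the function needs
clusterExport(cl, c("score_once", "n_rows"), envir = environment())

# Each repetition returns a length-2 vector, so the result is a 2 x 15 matrix
res <- parSapply(cl, 1:15, function(i) score_once(i, n_rows))
stopCluster(cl)                          # always release the workers

print(dim(res))
```

Because `parSapply` binds the returned vectors column-wise, `res[1,]` holds the first value from every repetition and `res[2,]` the second, which is the layout the cells below rely on.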

In [20]:
# ============= 15 repeated train/test split ============= 
for (model in model_vec) {
      
    # parallel computing
    cl<-parallel::makeCluster(5)
    parallel::clusterExport(cl, c("full_data", "voutcome", "model"), envir = environment() )
    
    res <- parallel::parSapply(cl, 1:15, function(x)
    {
        # partition data into train and test portions, stratified on the outcome
        idx <- splitTools::partition(full_data[, voutcome], p=c(train=0.7, test=0.3), type="stratified")
        
        # retrieve train and test data
        train_data <- full_data[idx$train,]
        test_data <- full_data[idx$test,]
        
        # build the model
        if (model == "lr") {
            best_model<-sdgm::lr.bestmodel.bin(train_data, voutcome)
        } else if (model == "cart") {
            best_model<-sdgm::cart.bestmodel.bin(train_data, voutcome)
        } else if (model == "rf") {
            best_model<-sdgm::rf.bestmodel.bin(train_data, voutcome)
        } else if (model == "lgbm") {
            best_model<-sdgm::lgbm.bestmodel.bin(train_data, voutcome)
        } else if (model == "svm") {
            best_model<-sdgm::svm.bestmodel.bin(train_data, voutcome)
        }
        
        # predict
        preds<-predict(best_model, test_data)
  
        # calculate and return AUC and brier score
        if (!is.null(preds)) {
            test_auc <- sdgm::auc(preds, test_data[,voutcome])
        } else {
            test_auc <- NA
            print("AUC calculation failed because there are no predicted values")
        }
        print(paste0("AUC on C27 Data: ", test_auc))

        if (!is.null(preds)) {
            test_brier <- sdgm::brier(preds, test_data[,voutcome])
        } else {
            test_brier <- NA
            print("Brier calculation failed because there are no predicted values")
        }
        print(paste0("Brier Score on C27 Data: ", test_brier))

        c(test_auc, test_brier)
    })
    parallel::stopCluster(cl)
    
    # save results of the model
    if (model == "lr") {
        auc_split$lr <- res[1,]
        brier_split$lr <- res[2,]
    } else if (model == "cart") {
        auc_split$cart <- res[1,]
        brier_split$cart <- res[2,]
    } else if (model == "rf") {
        auc_split$rf <- res[1,]
        brier_split$rf <- res[2,]
    } else if (model == "lgbm") {
        auc_split$lgbm <- res[1,]
        brier_split$lgbm <- res[2,]
    } else if (model == "svm") {
        auc_split$svm <- res[1,]
        brier_split$svm <- res[2,]
    }
}

# save results
res.split.df6 <- rbind(auc_split, brier_split)

# print results
print(res.split.df6)  

In [None]:
# ============= repeated nested 5-fold CV ============= 
for (model in model_vec) {
    
    # parallel computing
    cl <- parallel::makeCluster(5)
    parallel::clusterExport(cl, c("full_data", "voutcome", "model"), envir = environment())
    
    # this is the repeated loop
    res <- parallel::parSapply(cl, seq(15), function(i) 
    {
        # this is the nested CV outer loop
        nested_res <- sapply(caret::createFolds(full_data[, voutcome], k=5), function(x) 
        {
            testInds <- x
            trnInds <- setdiff(1:nrow(full_data), testInds)
            train_data <- full_data[trnInds,]
            test_data <- full_data[testInds,]

            if (model == "lr") {
                best_model <- sdgm::lr.bestmodel.bin(train_data, voutcome)
            } else if (model == "cart") {
                best_model <- sdgm::cart.bestmodel.bin(train_data, voutcome)
            } else if (model == "rf") {
                best_model <- sdgm::rf.bestmodel.bin(train_data, voutcome)
            } else if (model == "lgbm") {
                best_model <- sdgm::lgbm.bestmodel.bin(train_data, voutcome)
            } else if (model == "svm") {
                best_model <- sdgm::svm.bestmodel.bin(train_data, voutcome)
            }
            
            #predict
            preds <- predict(best_model, test_data)
            
            # calculate and return AUC and brier score
             if (!is.null(preds)) {
                nested_auc <- sdgm::auc(preds, test_data[, voutcome])
                nested_brier <- sdgm::brier(preds, test_data[, voutcome])
            } else {
                nested_auc <- NA
                nested_brier <- NA
            }

            c(nested_auc, nested_brier)
        })
        nested_cv_auc <- mean(nested_res[1,], na.rm=TRUE)
        nested_cv_brier <- mean(nested_res[2,], na.rm=TRUE)
                               
        c(nested_cv_auc, nested_cv_brier)
    })
    parallel::stopCluster(cl)
    
    # save results of the model
    if (model == "lr") {
        auc_cv$lr <- res[1,]
        brier_cv$lr <- res[2,]
    } else if (model == "cart") {
        auc_cv$cart <- res[1,]
        brier_cv$cart <- res[2,]
    } else if (model == "rf") {
        auc_cv$rf <- res[1,]
        brier_cv$rf <- res[2,]
    } else if (model == "lgbm") {
        auc_cv$lgbm <- res[1,]
        brier_cv$lgbm <- res[2,]
    } else if (model == "svm") {
        auc_cv$svm <- res[1,]
        brier_cv$svm <- res[2,]
    }
}

# save results
res.cv.df6 <- rbind(auc_cv, brier_cv)

print(res.cv.df6)

In [None]:
auc.df6 <- rbind(res.split.df6 %>% filter(metric == "auc"), 
                 res.cv.df6 %>% filter(metric == "auc"))

# pivot the dataframe from wide to long for plotting
auc.df6.long <- auc.df6 %>% 
    tidyr::pivot_longer(-c(type, metric), 
                        names_to = "model",
                        values_to = "auc")
auc.df6.long$model <- factor(auc.df6.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
auc.df6.long$type <- factor(auc.df6.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(auc.df6.long, aes(model, auc, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

In [None]:
brier.df6 <- rbind(res.split.df6 %>% filter(metric == "brier"), 
                   res.cv.df6 %>% filter(metric == "brier"))

# pivot the dataframe from wide to long for plotting
brier.df6.long <- brier.df6 %>% 
    tidyr::pivot_longer(-c(type,metric), 
                        names_to = "model",
                        values_to = "brier")
brier.df6.long$model <- factor(brier.df6.long$model, 
                         levels = c("lr", "cart", "rf", "lgbm", "svm"))
brier.df6.long$type <- factor(brier.df6.long$type, 
                         levels = c("train/test split", "nested cv"))

# Plot the violin plot
ggplot(brier.df6.long, aes(model, brier, fill = model)) + 
    geom_violin(alpha = 0.5, draw_quantiles = c(0.25, 0.5, 0.75))+ 
    facet_wrap(~type)

### Question 6.3: Based on the above violin plots, which model is the best one? Why?
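
A numeric summary can back up what the violin plots suggest. The sketch below builds a small synthetic long-format table shaped like `auc.df6.long` (the numbers are randomly generated, not real results) and computes the median and interquartile range per evaluation type and model with base R. The names `toy_long` and `summary_tbl` are hypothetical.

```r
# Synthetic long-format results, shaped like auc.df6.long (numbers are made up)
set.seed(1)
toy_long <- data.frame(
    type  = rep(c("train/test split", "nested cv"), each = 6),
    model = rep(c("lr", "cart", "rf"), times = 4),
    auc   = round(runif(12, min = 0.7, max = 0.9), 3)
)

# Median and interquartile range of the metric per evaluation type and model
summary_tbl <- aggregate(auc ~ type + model, data = toy_long,
                         FUN = function(v) c(median = median(v), iqr = IQR(v)))

# aggregate() returns a matrix column; flatten it into auc.median and auc.iqr
summary_tbl <- do.call(data.frame, summary_tbl)

# Sort so the highest median comes first within each evaluation type
summary_tbl <- summary_tbl[order(summary_tbl$type, -summary_tbl$auc.median), ]
print(summary_tbl)
```

A model with a high median and a narrow spread under both evaluation schemes is a stronger candidate than one that only wins on the median.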

## Question 7: After working with all the datasets, which models are the best ones across the board? Why?

## Question 8 (bonus): Are you able to make the code more compact to reduce the repetition?

Hint: Can you convert the repetitive parts into functions with the appropriate parameters passed to them?
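
One way to follow the hint is sketched below, under stated assumptions. A named list of fitting functions replaces the `if / else if` chain, one function handles "split, fit, score", and results are stored by column name. To keep the sketch runnable without `sdgm`, two `glm` fits stand in for the `sdgm` fitters, the data are synthetic, and the metric is an unscaled Brier score; `fitters`, `evaluate_split`, and `brier_raw` are hypothetical names.

```r
# Named list of fitting functions replaces the if / else if chain.
# glm fits are hypothetical stand-ins so the sketch runs without sdgm.
fitters <- list(
    lr = function(train, outcome) {
        glm(reformulate(".", response = outcome), data = train, family = binomial())
    },
    baseline = function(train, outcome) {
        glm(reformulate("1", response = outcome), data = train, family = binomial())
    }
)

# One function for "split, fit, score"; the metric is passed in as a function
evaluate_split <- function(data, outcome, fit_fun, metric_fun, p_train = 0.7) {
    idx <- sample(nrow(data), size = floor(p_train * nrow(data)))
    fit <- fit_fun(data[idx, , drop = FALSE], outcome)
    preds <- predict(fit, data[-idx, , drop = FALSE], type = "response")
    metric_fun(preds, data[-idx, outcome])
}

brier_raw <- function(p, y) mean((p - y)^2)   # unscaled Brier score

# Synthetic data: one predictor and a binary outcome
set.seed(42)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(1.5 * toy$x))

# Results table filled by column name instead of one branch per model
n_rep <- 5
results <- data.frame(type = rep("train/test split", n_rep),
                      metric = rep("brier", n_rep))
for (model in names(fitters)) {
    results[[model]] <- sapply(seq_len(n_rep), function(i) {
        evaluate_split(toy, "y", fitters[[model]], brier_raw)
    })
}
print(results)
```

With `sdgm` installed, the list entries could presumably be the package functions themselves, for example `lr = sdgm::lr.bestmodel.bin`, provided they all accept the training data and the outcome name in the same order as in the cells above.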

# Congratulations! You have completed Assignment 2!