# Tutorial 10: Classifiers as an Important Class of Predictive Models

#### Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. Give an example of a research question that requires a predictive model to predict classes on new observations.
2. Write a computer script to perform model selection using ridge and LASSO regressions to fit a logistic regression useful for predictive modeling.
3. List model metrics that are suitable to evaluate predicted classes given by a predictive model with binary responses (e.g., Accuracy, Precision, Sensitivity, Specificity, Cohen's kappa).
4. Write a computer script to compute these model metrics. Interpret and communicate the results from that computer script.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(caret)
library(pROC)
library(boot)
library(glmnet)
source("tests_tutorial_10.R")

For this tutorial, we will keep working with the `breast_cancer` data set. 

In [None]:
# Run this cell before continuing

set.seed(20211130)

breast_cancer <- read_csv("data/breast_cancer.csv") %>%
    select(-c(
        mean_area, area_error, concavity_error, concave_points_error, worst_radius, worst_texture, worst_perimeter,
        worst_area, worst_smoothness, worst_compactness, worst_concavity, worst_concave_points, worst_symmetry,
        worst_fractal_dimension)) %>% 
    mutate(target = if_else(target == "malignant", 1, 0))

breast_cancer_train <- 
    breast_cancer %>% 
    slice_sample(prop = 0.70)

breast_cancer_test <- 
    breast_cancer %>% 
    anti_join(breast_cancer_train, by = "ID")

breast_cancer_train <- 
    breast_cancer_train %>% 
    select(-ID)

breast_cancer_test <- 
    breast_cancer_test %>% 
    select(-ID)

breast_cancer_logistic_model <- 
    glm(
        formula = target ~ .,
        data = breast_cancer_train,
        family = binomial)

ROC_full_log <- 
    roc(
        response = breast_cancer_train$target, 
        predictor = predict(breast_cancer_logistic_model, type = "response"))

#### Regularization in GLMs

In the worksheet, you fitted an ordinary (unpenalized) logistic regression to this data set. However, we can also use *shrinkage methods* for logistic and Poisson regression. We'll omit the mathematical details and focus only on the implementation and interpretation of these methods.

Regularization aims to improve predictive models by introducing some bias in exchange for a reduction in the model's variance. In previous weeks, we introduced two penalty functions to shrink the size of the coefficients:

- $L_2$-penalty used by *Ridge* methods: $\sum_{j=1}^p\beta_j^2$
       
- $L_1$-penalty used by *LASSO* methods: $\sum_{j=1}^p|\beta_j|$
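
As a small illustration, both penalties can be computed directly in R. This is only a sketch: the coefficient vector `beta` below is made up and does not come from the `breast_cancer` data.

```r
# Hypothetical coefficient vector (intercept excluded), for illustration only.
beta <- c(0.5, -2, 0, 1.5)

# Ridge (L2) penalty: sum of squared coefficients.
l2_penalty <- sum(beta^2)

# LASSO (L1) penalty: sum of absolute values of the coefficients.
l1_penalty <- sum(abs(beta))

c(L2 = l2_penalty, L1 = l1_penalty)
```

Note that the intercept is left out of `beta`: `glmnet` does not penalize it.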

These penalties can also be used for logistic and Poisson regressions through the same `glmnet()` and `cv.glmnet()` functions we used before, changing the `family` argument (and, in `cv.glmnet()`, the `type.measure` argument) accordingly.

Let's put these concepts in practice to **build a classifier using logistic regression**!

The package `glmnet` takes variables only as matrices. Therefore, we need to prepare our data before fitting the regularized models using `glmnet`.

**Question 1.0**
<br>{points: 1}

To prepare the model matrix for `glmnet`, we will use the `model.matrix` function, which receives two arguments:

- `object`: which is the formula of your model.
- `data`: which is the data you want to use.

The `model.matrix` function adds a column named `(Intercept)` filled with ones. We do not need this column, so let's remove it. We will also need the response variable to be in a matrix format, so let's create this now. 
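
The following minimal sketch shows the idea on a made-up data frame (`toy`, with response `y` and inputs `x1` and `x2`; these names are assumptions for illustration and are not part of the tutorial data).

```r
# Hypothetical toy data: binary response y and two numeric inputs.
toy <- data.frame(
  y  = c(0, 1, 1, 0),
  x1 = c(1.2, 3.4, 2.2, 0.5),
  x2 = c(10, 20, 15, 5)
)

# model.matrix() builds the design matrix, including an "(Intercept)" column.
X_with_intercept <- model.matrix(object = y ~ ., data = toy)
colnames(X_with_intercept)

# Drop the first column (the intercept) to keep only the inputs.
X <- X_with_intercept[, -1]

# Store the response as a one-column matrix.
Y <- as.matrix(toy$y)

dim(X)
dim(Y)
```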

Save the model matrix in an object named `model_matrix_X_train` and the response matrix in an object named `matrix_Y_train`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# model_matrix_X_train <- 
#     ...

# matrix_Y_train <- 
#     as.matrix(..., ncol = 1)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

## Ridge Logistic Regression

We will first build a classifier using a Ridge logistic regression.

#### Selecting the penalty level (lambda)

An important step in regularized methods is to find an "optimal" level of penalty. In the case of MLR, we selected the level that minimizes the prediction MSE and used **cross-validation** (CV) to estimate it.

However, for logistic regression, we learned alternative metrics to evaluate classification performance. In this tutorial, we'll use AUC and CV to tune the classifier (i.e., estimate an "optimal" level of penalty).

Thus, we'll select the value of $\lambda$ that **maximizes** the cross-validation AUC.

> Alternatively, a validation set can be used to tune the model. However, this requires 3 independent sets: training, validation, and test.
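
For reference, a three-way split could be sketched as follows. The sample size `n` and the 60/20/20 proportions are assumptions chosen for illustration only; this tutorial uses cross-validation instead.

```r
set.seed(2021)

n <- 100  # hypothetical number of observations

# Shuffle the row indices once, then cut them into three disjoint sets.
shuffled  <- sample(seq_len(n))
train_idx <- shuffled[1:60]    # used to fit the models
valid_idx <- shuffled[61:80]   # used to tune the penalty level
test_idx  <- shuffled[81:100]  # used only once, to assess the chosen model

lengths(list(train = train_idx, valid = valid_idx, test = test_idx))
```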

#### Cross-validation AUC

The function `cv.glmnet()` runs a cross-validation for any estimator in the `glmnet` family. 

For each $\lambda$ in the grid:

1. the data is divided into $k$ folds (the same folds are used for every $\lambda$)
2. $k-1$ folds are combined to serve as a *training* set and one fold is left out as a *test* set
3. the model is trained using the data from the $k-1$ combined folds
4. an AUC is computed in the fold left out using the trained model
5. steps 2 to 4 are repeated so that every fold is left out once, which gives $k$ AUC values
6. the $k$ AUC values are averaged to obtain an estimate of the test AUC

Finally, select the value of $\lambda$ in the grid that maximizes the estimated test AUC.
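
The steps above can be sketched in base R for a single, fixed model. This is a simplified illustration under several assumptions: the data are simulated, the model is an unpenalized `glm()` rather than a `glmnet` fit, and `auc_rank()` is a hypothetical helper that computes the AUC from ranks. `cv.glmnet()` performs the same kind of loop for every $\lambda$ in the grid.

```r
set.seed(1)

# Simulated toy data: binary response driven by two inputs.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - x2))
toy <- data.frame(y, x1, x2)

# Hypothetical helper: rank-based (Mann-Whitney) AUC.
auc_rank <- function(obs, score) {
  r  <- rank(score)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Step 1: assign every row to one of k folds.
k <- 5
fold_id  <- sample(rep(seq_len(k), length.out = n))
fold_auc <- numeric(k)

for (f in seq_len(k)) {
  # Step 2: k - 1 folds for training, one fold held out.
  train    <- toy[fold_id != f, ]
  held_out <- toy[fold_id == f, ]

  # Step 3: train the model on the combined folds.
  fit <- glm(y ~ ., data = train, family = binomial)

  # Step 4: compute the AUC on the held-out fold.
  p <- predict(fit, newdata = held_out, type = "response")
  fold_auc[f] <- auc_rank(held_out$y, p)
}

# Step 6: average the k values to estimate the test AUC.
cv_auc <- mean(fold_auc)
cv_auc
```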

**Question 1.1**
<br>{points: 1}

In this first question, we'll estimate the AUC of a Ridge logistic model on a lambda-grid using the function `cv.glmnet()` to compute the cross-validation AUC as explained before.

Use `auc` as the `type.measure` to measure prediction performance, and set the number of folds `nfolds` to 10. 

> Recall that `alpha = 0` gives the ridge penalty, and `family = "binomial"` specifies a logistic regression.

Other arguments are the same as we used before to fit a Ridge linear regression.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(1234) # do not change this!

# breast_cancer_cv_lambda_ridge <- 
#   cv.glmnet(
#        x = ..., 
#        y = ...,
#        alpha = ...,
#        family = ...,
#        type.measure = ...,
#        nfolds = ...)

# breast_cancer_cv_lambda_ridge 

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_cv_lambda_ridge

In [None]:
test_1.1()

**Question 1.2**

The object `breast_cancer_cv_lambda_ridge` from `cv.glmnet()` is a list of different elements, including the estimated CV-AUCs for each value of lambda in the grid.

We can use the function `plot()` and `breast_cancer_cv_lambda_ridge` to visualize these values. 

> Recall that there are $k$ AUC values for each $\lambda$.

The resulting plot will indicate the average AUC (red dot) and error bars (in grey) on the $y$-axis along with the $\lambda$ sequence on the $x$-axis in log-scale. 

The top $x$-axis indicates the number of inputs whose estimated coefficients are different from zero for each value of $\lambda$. Note that for Ridge this number always equals the total number of covariates, since the Ridge penalty never shrinks estimates exactly to zero.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good in your computer.
options(repr.plot.width = 16, repr.plot.height = 8) 

# plot(..., 
#      main = "Cross-Validation with Ridge Regression\n\n")

# your code here
fail() # No Answer - remove if you provide an answer

#### Selected lambda

The plot in **Question 1.2** shows two vertical dotted lines. *Given an `object` coming from `cv.glmnet()`*, these lines correspond to two values of $\lambda$:

- $\lambda_{\text{min}}$ which provides the **maximum average AUC** out of the whole sequence for $\lambda$. We can obtain it with `object$lambda.min`.


- $\lambda_{\text{1SE}}$, the highest $\lambda$ for which the average AUC is **within one standard error** of the maximum. We can obtain it with `object$lambda.1se`.

> Note that the suffix `.min` comes from MLR, where the CV error is minimized; it does not reflect the fact that in this case the AUC is *maximized*.

In some cases, $\lambda_{\text{1SE}}$ is preferable because we can select a considerably simpler model without having a significant reduction of the AUC. 
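
A minimal sketch of how these two values relate, using simulated data (the matrix `X`, the response `y`, and all object names below are assumptions for illustration, not the tutorial's data):

```r
library(glmnet)

set.seed(1)

# Simulated toy data: 5 inputs, only the first two affect the response.
n <- 200
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("x", 1:5)))
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - X[, 2]))

# Cross-validated AUC over the default lambda grid (ridge penalty).
cv_fit <- cv.glmnet(
  x = X, y = y,
  alpha = 0,
  family = "binomial",
  type.measure = "auc",
  nfolds = 5
)

lambda_min <- cv_fit$lambda.min
lambda_1se <- cv_fit$lambda.1se

# Average CV AUC at each of the two selected penalty levels.
auc_at_min <- cv_fit$cvm[cv_fit$lambda == lambda_min]
auc_at_1se <- cv_fit$cvm[cv_fit$lambda == lambda_1se]

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
c(auc_at_min = auc_at_min, auc_at_1se = auc_at_1se)
```

By construction, `lambda_1se` is never smaller than `lambda_min`, and its average AUC is never larger than the one at `lambda_min`.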


**Question 1.3**
<br>{points: 1}

Using `breast_cancer_cv_lambda_ridge`, obtain the level of penalty that maximizes the CV-AUC, i.e., $\lambda_{\text{min}}$ and assign it to the variable `lambda_max_AUC_ridge`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# lambda_max_AUC_ridge <- round(..., 4)


# your code here
fail() # No Answer - remove if you provide an answer

lambda_max_AUC_ridge

In [None]:
test_1.3()

**Question 1.4**

As before, we can visualize the estimated regression coefficients for different values of $\lambda$ in the grid. 

Use `breast_cancer_cv_lambda_ridge$glmnet.fit` as the first argument of `plot()`, along with `"lambda"` as the second argument (named `xvar`).

You will see that the estimated coefficients shrink towards zero as the value of $\lambda$ increases. Moreover, use the `abline()` function to indicate `lambda_max_AUC_ridge` as a vertical dashed line in red **on the natural logarithm scale**.


*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# ...(..., "lambda")
# ...(v = ..., col = "red", lwd = 3, lty = 2)

# your code here
fail() # No Answer - remove if you provide an answer

**Question 1.5**
<br>{points: 1}

Once we have the optimum value for $\lambda$, let us fit the ridge regression model we will compare versus `breast_cancer_logistic_model` (from the worksheet). We will use the function `glmnet()` along with `model_matrix_X_train` and `matrix_Y_train`. Extract the fit for a `lambda` value equal to `lambda_max_AUC_ridge`.

Call the resulting estimated model `breast_cancer_ridge_max_AUC`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(1234) # DO NOT CHANGE!

# breast_cancer_ridge_max_AUC <- 
#   glmnet(
#   x = ..., y = ...,
#   alpha = ...,
#   family = ...,
#   lambda = ...
# )

#coef(breast_cancer_ridge_max_AUC)

# your code here
fail() # No Answer - remove if you provide an answer

coef(breast_cancer_ridge_max_AUC)

In [None]:
test_1.5()

**CV-AUC for a Logistic Regression (without penalization)**

In `worksheet_13`, we computed the CV misclassification error for a classical (non-penalized) logistic regression. Here, let's compute the CV AUC to compare it with that of the penalized models. Read the given code if you want to learn more about CV!

*Run the following cell.*

In [None]:
set.seed(1234)
num.folds <- 10

# Split the training rows into folds.
folds <- createFolds(breast_cancer_train$target, k = num.folds)

# Fit the logistic regression on each set of k - 1 folds.
regr.cv <- NULL
for (fold in 1:num.folds) {
    train.idx <- setdiff(1:nrow(breast_cancer_train), folds[[fold]])
    regr.cv[[fold]] <- glm(target ~ ., data = breast_cancer_train, subset = train.idx,
                           family = "binomial")
}

# Predict on each held-out fold and compute its AUC.
pred.cv <- NULL
auc.cv <- numeric(num.folds)

for (fold in 1:num.folds) {
    test.idx <- folds[[fold]]
    pred.cv[[fold]] <- data.frame(
        obs = breast_cancer_train$target[test.idx],
        pred = predict(regr.cv[[fold]], newdata = breast_cancer_train, type = "response")[test.idx])
    auc.cv[fold] <- roc(obs ~ pred, data = pred.cv[[fold]])$auc
}

# Average the fold AUCs to get the CV estimate.
breast_cancer_cv_ordinary <- round(mean(auc.cv), 7)

cat("Cross-validation AUC for the ordinary logistic model:",
    breast_cancer_cv_ordinary)

**Question 1.6**
<br>{points: 1}

To help us keep track of the AUC for different models, let's create a data frame with the AUC computed by CV for each of our models: (1) ridge logistic regression and (2) ordinary logistic regression (from the worksheet). 

Note that the average AUC values from the CV, one per value of $\lambda$ in the grid, are stored in an element called `cvm` of the object returned by `cv.glmnet()`.

Store the ridge and ordinary models' cross-validation AUCs in a tibble called `breast_cancer_AUC_models` with two columns:

- `model`: The regression model from which we obtained the CV-AUC. This will be a string vector with elements: `"ordinary"` and `"ridge"`.
- `auc`: A numerical vector with AUC corresponding to each model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_AUC_models <- 
#     tibble(
#         model = ...,
#         auc = ...)

# breast_cancer_AUC_models

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_AUC_models

In [None]:
test_1.6()

## LASSO Logistic Regression

**Question 1.7**
<br>{points: 1}

We already prepared our training data with `model_matrix_X_train` and `matrix_Y_train`. Now we need to find the value of $\lambda$ that provides the largest average AUC when a LASSO penalty is used.

Use the function `cv.glmnet()`. 

> Remember that to use a LASSO penalty we need to set `alpha = 1`.

Specify the proper accuracy `type.measure` and number of folds `nfolds` (use $k = 5$) along with the correct argument for `family`.

*Assign the function's output as `breast_cancer_cv_lambda_LASSO`.*

In [None]:
set.seed(1234) # do not change this!

# breast_cancer_cv_lambda_LASSO <- 
#   ...(
#   x = ..., y = ...,
#   alpha = ...,
#   family = ...,
#   type.measure = ...,
#   nfolds = ...)

# breast_cancer_cv_lambda_LASSO

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_cv_lambda_LASSO

In [None]:
test_1.7()

#### Selecting lambda

As before, we can use the function `plot()` to visualize the CV-AUC values for each value of $\lambda$ in the grid.

This time, for LASSO logistic regression, we will see different values on the top $x$-axis since the LASSO penalty shrinks some coefficients to exactly zero.

The following plots compare the Ridge and LASSO cross-validation results used to select $\lambda$. You can see that for LASSO, but not for Ridge, all estimates become zero for large values of $\lambda$.
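
This contrast can also be checked numerically. The sketch below uses simulated data and a deliberately large, arbitrarily chosen penalty (`big_lambda = 1`); all object names are assumptions for illustration.

```r
library(glmnet)

set.seed(1)

# Simulated toy data: 6 inputs, only the first two affect the response.
n <- 200
X <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("x", 1:6)))
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - X[, 2]))

big_lambda <- 1  # arbitrary large penalty, for illustration only

ridge_fit <- glmnet(x = X, y = y, alpha = 0, family = "binomial", lambda = big_lambda)
lasso_fit <- glmnet(x = X, y = y, alpha = 1, family = "binomial", lambda = big_lambda)

# Count the non-zero slope estimates (the first row is the intercept).
n_nonzero_ridge <- sum(as.matrix(coef(ridge_fit))[-1, ] != 0)
n_nonzero_lasso <- sum(as.matrix(coef(lasso_fit))[-1, ] != 0)

c(ridge = n_nonzero_ridge, lasso = n_nonzero_lasso)
```

With this penalty, the ridge estimates are shrunk but all remain different from zero, whereas LASSO sets every slope to exactly zero.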

*Run the cell below.*

In [None]:
options(repr.plot.width = 16, repr.plot.height = 8) # Adjust these numbers so the plot looks good in your desktop.

plot(breast_cancer_cv_lambda_ridge, main = "Cross-Validation with Ridge Regression\n\n")

plot(breast_cancer_cv_lambda_LASSO, main = "Cross-Validation with LASSO\n\n")

**Question 1.8**
<br>{points: 1}

As before, the plot of the output coming from `cv.glmnet()` shows two vertical dotted lines: $\lambda_{\text{min}}$ and $\lambda_{\text{1SE}}$.

Using `breast_cancer_cv_lambda_LASSO`, obtain $\lambda_{\text{1SE}}$ and assign it to the variable `lambda_1se_AUC_LASSO`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# lambda_1se_AUC_LASSO <- round(..., 4)

# your code here
fail() # No Answer - remove if you provide an answer

lambda_1se_AUC_LASSO

In [None]:
test_1.8()

**Question 1.9**
<br>{points: 1}

Let's compare the LASSO logistic model fit at `lambda.1se` with `breast_cancer_logistic_model` and `breast_cancer_ridge_max_AUC`. 

We will use the function `glmnet()` along with `model_matrix_X_train` and `matrix_Y_train`. Extract the estimated model for `lambda` equal to `lambda_1se_AUC_LASSO`. Call the output `breast_cancer_LASSO_1se_AUC`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(1234) # do not change this!

# breast_cancer_LASSO_1se_AUC <- ...(
#   x = ..., y = ...,
#   alpha = ...,
#   family = ...,
#   lambda = ...
# )

# breast_cancer_LASSO_1se_AUC

# your code here
fail() # No Answer - remove if you provide an answer

coef(breast_cancer_LASSO_1se_AUC)

In [None]:
test_1.9()

**Question 1.10**
<br>{points: 1}


In the output above, estimated regression coefficients equal to zero are shown as `.`. Based on these results, which input variables are selected in `breast_cancer_LASSO_1se_AUC`?

**A.** `mean_radius`.

**B.** `mean_texture`.

**C.** `mean_perimeter`.

**D.** `mean_smoothness`.

**E.** `mean_compactness`.

**F.** `mean_concavity`.

**G.** `mean_concave_points`.

**H.** `mean_symmetry`.

**I.** `mean_fractal_dimension`.

**J.** `radius_error`.

**K.** `texture_error`.

**L.** `perimeter_error`.

**M.** `smoothness_error`.

**N.** `compactness_error`.

**O.** `symmetry_error`.

**P.** `fractal_dimension_error`.

*Assign your answers to the object `answer1.10`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes.*

In [None]:
# answer1.10 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.10()

**Question 1.11**
<br>{points: 1}

Let's add a row for the LASSO logistic regression to our `breast_cancer_AUC_models` tibble. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# breast_cancer_AUC_models <- 
#     breast_cancer_AUC_models %>% 
#     add_row(model = ..., 
#             auc = ...)

# your code here
fail() # No Answer - remove if you provide an answer

breast_cancer_AUC_models

In [None]:
test_1.11()

### Model Selection 

Great job! You can now choose a model that you expect will have a good prediction performance based on the CV results, without looking at the test set!! 

We can see that the ridge model is slightly better, although we used $\lambda_{\text{min}}$ for ridge and $\lambda_{\text{1SE}}$ for LASSO. On the other hand, the model selected by LASSO is considerably simpler since it uses only three of the variables while keeping similar performance. 

After choosing the model, you can apply the chosen model to the test set to estimate the model's performance. 

**Question 1.12**
<br>{points: 1}

Suppose you chose the LASSO model. Use the model to predict the `target` variable on the **test** set (`breast_cancer_test`). Then, use the `roc` function to obtain the ROC curve in the test set. Save the result in an object named `ROC_lasso`. 
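
The general pattern (fit on the training set, predict on the test set, then build the ROC curve from the test predictions) can be sketched as follows. The example uses simulated data and an unpenalized `glm()` fit, so the object names and the 70/30 split are assumptions for illustration. For a `glmnet` fit, `predict()` takes the test model matrix through `newx` instead of a data frame through `newdata`.

```r
library(pROC)

set.seed(1)

# Simulated toy data: binary response driven by two inputs.
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(y = rbinom(n, 1, plogis(1.5 * x1 - x2)), x1, x2)

# 70/30 split into training and test rows.
train_rows <- sample(seq_len(n), size = 210)
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]

# Fit on the training set only.
fit <- glm(y ~ ., data = toy_train, family = binomial)

# Predicted probabilities for the test set.
test_scores <- predict(fit, newdata = toy_test, type = "response")

# ROC curve and AUC in the test set (class 1 is treated as the positive class).
roc_test <- roc(
  response  = toy_test$y,
  predictor = test_scores,
  levels    = c(0, 1),
  direction = "<"
)
test_auc <- as.numeric(auc(roc_test))
test_auc
```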


*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# model_matrix_X_test <- 
#     ...(object = ...,
#                  data = ...)[, -1]

# ROC_lasso <- 
#     roc(
#         response = ...,
#         predictor = predict(...,
#                      newx = ...)[,"s0"] ) 

# your code here
fail() # No Answer - remove if you provide an answer

ROC_lasso

In [None]:
test_1.12()

We can use the `plot` function to plot the `ROC_lasso` curve from the LASSO model in the test set. 

*Write your code in the cell below and run it.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

**Out of curiosity**, let's check how the other two models perform in the test set. 

In [None]:
# Run this cell before continuing

ROC_ridge <- roc(
  response = breast_cancer_test$target,
  predictor = predict(breast_cancer_ridge_max_AUC,
                      newx = model_matrix_X_test )[,"s0"] )

ROC_ordinary <- roc(
  response = breast_cancer_test$target,
  predictor = predict(breast_cancer_logistic_model,
                      newdata = breast_cancer_test) )

In [None]:
plot(ROC_lasso,
  print.auc = TRUE, col = "blue", lwd = 3, lty = 2,
  main = "ROC Curves for Breast Cancer Dataset"
)

lines.roc(ROC_ridge, col = "green", lwd = 3, lty = 2, print.auc=TRUE)
lines.roc(ROC_ordinary, col = "red", lwd = 3, lty = 2)

**CAUTION**

From the ROC curves in the test set, the LASSO model performs slightly worse than the other two models, although reasonably close. So you might be tempted to switch models at this point. But changing models at this stage would reintroduce optimization bias: the AUC estimates obtained here would then tend to overestimate the AUC on new, unseen data. 