# Tutorial 08 - Selection methods for generative and predictive models

By the end of this section, students will be able to:

- Give examples of questions that can be answered by generative models and others that can be answered by predictive models.
- Discuss how the research question being asked impacts the statistical modelling procedures.
- Explain how regularized methods, such as lasso and ridge, can be used to estimate a predictive or a generative model.
- Distinguish the selection properties of lasso and ridge penalties.
- Discuss why the model obtained directly from lasso is not the most suitable model for generative modelling and how post-lasso is one way to address this problem.
- Write a computer script to perform lasso/ridge and use it to predict new outcomes.
- Write a computer script to perform post-lasso and use it to estimate a generative model.
- Discuss post-inference problems (e.g., double dipping into the data set) and current practical solutions available to address these (e.g., data-splitting techniques).
- Write a computer script to apply currently available practical solutions to post-inference problems.
- Discuss how the research question being asked impacts the communication of the results.

In [None]:
# Loading packages
library(car)
library(tidyverse)
library(tidymodels)
library(broom)
library(glmnet)
library(leaps)
library(faraway)
library(mltools)
source("tests_tutorial_08.R")

## Model Selection: Ridge

We have learned that *shrinkage methods* can be used to build predictive models. Some of these methods can also be used to select variables. 

In particular, *Ridge regression* does not select variables, since it shrinks the estimated coefficients towards zero but never exactly to zero. However, it is still useful:

- when there are more predictors than observations;

- to address multicollinearity (in fact, that was its original purpose).
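To make the shrinkage behaviour concrete, here is a minimal base-R sketch on simulated data. The variables `x1`, `x2` and the helper `ridge_coef()` are hypothetical and not part of the tutorial. It uses the textbook closed form $(X^\top X + \lambda I)^{-1} X^\top y$ on standardized predictors; `glmnet` scales its objective differently, so its `lambda` values are not directly comparable to the ones used here.

```r
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)    # nearly collinear with x1 (simulated)
X <- scale(cbind(x1, x2))         # standardized predictors
y <- 2 * x1 + rnorm(n)
y <- y - mean(y)                  # centred response, so no intercept is needed

# Hypothetical helper: textbook ridge estimate for a given penalty `lambda`
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 gives OLS
beta_ridge <- ridge_coef(X, y, lambda = 10)   # penalized estimate

rbind(OLS = beta_ols, Ridge = beta_ridge)
```

In this sketch the ridge coefficient vector is shorter (in squared length) than the OLS one, yet neither ridge coefficient is exactly zero, which is why Ridge by itself does not perform variable selection.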

In this tutorial, you will use the dataset `fat` from the library `faraway` to build a Ridge regression and use it to predict the `brozek` value for men in a test set. 

**Recall**: This dataset contains the percentage of body fat and a whole variety of body measurements (continuous variables) of 252 men. You will use the variable `brozek` as the response variable and a subset of 14 variables to build different models. Additional information about the data can be found in [Johnson (1996)](https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910505). 

After reviewing the variable descriptions below, run the code cell that follows them to create the working data frame called `fat_sample` and build the objects needed in the next problems.

The response variable `brozek` is the percent of body fat using Brozek's equation:

$$\texttt{brozek} = \frac{457}{\texttt{density}} - 414.2,$$

where body `density` is measured in $\text{g}/\text{cm}^3$.
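As a quick check of the arithmetic, Brozek's equation can be written as a one-line R function. The helper name `brozek_from_density()` and the density value are hypothetical, chosen only for illustration and not taken from the data set.

```r
# Hypothetical helper implementing Brozek's equation from the text
brozek_from_density <- function(density) {
  457 / density - 414.2
}

# Illustrative body density in g/cm^3 (not a value from `fat`)
brozek_from_density(1.05)
```

For a density of 1.05, this gives roughly 21% body fat; a higher density leads to a lower percentage.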

The 14 input variables are:

- `age`: Age in $\text{years}$.
- `weight`: Weight in $\text{lb}$.
- `height`: Height in $\text{in}$.
- `adipos`: Adiposity index in $\text{kg}/\text{m}^2$, computed with weight converted to $\text{kg}$ and height converted to $\text{m}$:

$$\texttt{adipos} = \frac{\texttt{weight}}{\texttt{height}^2}$$

- `neck`: Neck circumference in $\text{cm}$.
- `chest`: Chest circumference in $\text{cm}$.
- `abdom`: Abdomen circumference at the umbilicus and level with the iliac crest in $\text{cm}$.
- `hip`: Hip circumference in $\text{cm}$.
- `thigh`: Thigh circumference in $\text{cm}$.
- `knee`: Knee circumference in $\text{cm}$.
- `ankle`: Ankle circumference in $\text{cm}$.
- `biceps`: Extended biceps circumference in $\text{cm}$.
- `forearm`: Forearm circumference in $\text{cm}$.
- `wrist`: Wrist circumference distal to the styloid processes in $\text{cm}$.

In [None]:
# Get a sample
fat_sample <- fat %>%
  select(
    brozek, age, weight, height, adipos, neck, chest, abdom,
    hip, thigh, knee, ankle, biceps, forearm, wrist
  )

# Split data into training and test sets
set.seed(123)


# Randomly assign 60% of the rows to the training set
training_fat <- fat_sample %>%
  sample_frac(0.6)

# The remaining rows form the test set
testing_fat <- fat_sample %>%
  setdiff(training_fat)


# Build matrix and vector required by `glmnet`

# Using `model.matrix()` to get the X matrix, the first column 
# corresponds to the intercept and needs to be deleted


fat_X_train <- model.matrix(object = brozek ~ .,
  data = training_fat)[, -1] 
fat_Y_train <- training_fat[, "brozek"]

fat_X_test <- model.matrix(object = brozek ~ .,
  data = testing_fat)[, -1]

fat_Y_test <- testing_fat[, "brozek"]

**Question 1.0**
<br>{points: 1}

Now that we have our training data prepared in `fat_X_train` and `fat_Y_train`, we will use cross-validation to select the value of $\lambda$ that gives the smallest estimated $\text{MSE}_{\text{test}}$. 

We can do this automatically with the function `cv.glmnet()`, where `x` is the matrix of input variables and `y` is the vector of training responses that we prepared. 

> **Heads up**: `glmnet` fits a Ridge regression when `alpha = 0`. 

To select `lambda`, we will use a **sequence** of values that goes from $\lambda = \exp(-5) \approx 0.0067$ to $\lambda = \exp(10) \approx 22026.5$. Internally, `cv.glmnet()` uses cross-validation to estimate the test MSE at each of these values. Assign the function's output to `fat_cv_lambda_ridge`.
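To get a feel for the candidate values, the grid can be inspected on its own before passing it to any fitting function. This is only a sketch; the name `lambda_grid` is hypothetical.

```r
# Candidate penalties, equally spaced on the natural-log scale
lambda_grid <- exp(seq(-5, 10, 0.1))

length(lambda_grid)   # number of candidate values
range(lambda_grid)    # smallest and largest candidate
```

Spacing the grid evenly on the log scale puts many candidates near zero, where small changes in $\lambda$ tend to matter most, while still covering very strong penalties.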

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(1234) # DO NOT CHANGE!

# fat_cv_lambda_ridge <- ...(
#   x = ..., y = ...,
#   alpha = ...,
#   lambda = exp(seq(-5, 10, 0.1))
# )

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

We can visualize the estimated test MSE at each value of lambda in the sequence by using `fat_cv_lambda_ridge` and `plot()`. 

The resulting plot will indicate the $\text{MSE}_{\text{test}}$ on the $y$-axis (error bars show the variation of the test error in the different folds) along with the range of $\lambda$ on the bottom $x$-axis on the natural log-scale. 

> **Heads up**: Ridge regression never shrinks estimates exactly to zero, so the top $x$-axis shows `14` (the number of nonzero coefficients) for every value of $\lambda$. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# plot_data <- ...

# your code here
fail() # No Answer - remove if you provide an answer

plot(plot_data, main = "MSE of Ridge estimated by CV for different lambdas\n\n")

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

The plot in **Question 1.1** also shows two vertical dotted lines. *Given an `object` coming from `cv.glmnet()`*, these lines correspond to two values of $\lambda$:

- $\hat{\lambda}_{\text{min}}$, which minimizes the cross-validated MSE. It can be obtained with `object$lambda.min`.


- $\hat{\lambda}_{\text{1SE}}$, the largest $\lambda$ for which the cross-validated MSE is within one standard error of the minimum. It can be obtained with `object$lambda.1se`.
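The two rules can be sketched by hand on a small, made-up cross-validation summary. All numbers below are hypothetical; the column names `lambda`, `cvm` and `cvsd` mirror components of a `cv.glmnet` object, but no model is fitted here.

```r
# Hypothetical cross-validation summary (made-up numbers for illustration)
cv_table <- data.frame(
  lambda = c(0.01, 0.1, 1, 10, 100),
  cvm    = c(20.5, 20.1, 20.3, 22.0, 30.0),  # mean CV-MSE per lambda
  cvsd   = c(1.0, 0.5, 0.5, 0.6, 0.9)        # standard error of the CV-MSE
)

# lambda.min: the penalty with the smallest mean CV-MSE
i_min <- which.min(cv_table$cvm)
lambda_min <- cv_table$lambda[i_min]

# lambda.1se: the largest penalty whose CV-MSE is at most
# (minimum CV-MSE + its standard error)
threshold <- cv_table$cvm[i_min] + cv_table$cvsd[i_min]
lambda_1se <- max(cv_table$lambda[cv_table$cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

Because it is the largest penalty within the tolerance, $\hat{\lambda}_{\text{1SE}}$ is never smaller than $\hat{\lambda}_{\text{min}}$ and therefore corresponds to a more heavily regularized model.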


Using `fat_cv_lambda_ridge`, obtain the $\hat{\lambda}_{\text{min}}$ and save it as `fat_lambda_min_MSE_ridge`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_lambda_min_MSE_ridge <- round(..., 4)

# your code here
fail() # No Answer - remove if you provide an answer

fat_lambda_min_MSE_ridge

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

Once we have selected a value of $\lambda$, we can extract the estimated Ridge regression coefficients at that level of penalization. Store the estimated coefficients in `fat_ridge_min_coef`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(1234) # DO NOT CHANGE!

# fat_ridge_min_coef <- ...(..., s = ...)

# your code here
fail() # No Answer - remove if you provide an answer

fat_ridge_min_coef

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

Let's compare the estimated regression coefficients in `fat_ridge_min_coef` with those of a full OLS regression, which you will store in `fat_full_OLS`.

Create a data frame called `fat_reg_coef` with two columns:

- `Full_OLS`: The estimated coefficients from `fat_full_OLS` obtained via function `coef()`.
- `Ridge_min`: The estimated coefficients in `fat_ridge_min_coef`. Recall this is the estimated ridge regression with $\hat{\lambda}_{\text{min}}$.
    
*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_full_OLS <- ...

# fat_reg_coef <- cbind(
#   Full_OLS = ...(...),
#   Ridge_min = as.vector(...)) %>%
#       round(4) %>% as.data.frame()


# your code here
fail() # No Answer - remove if you provide an answer

fat_reg_coef

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 1}

Using `predict()` and `fat_full_OLS`, obtain the (out-of-sample) predicted `brozek` values for men in `testing_fat`. Store them in a variable called `fat_test_pred_full_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_test_pred_full_OLS <- ...

# your code here
fail() # No Answer - remove if you provide an answer

head(fat_test_pred_full_OLS)

In [None]:
test_1.5()

**Question 1.6**
<br>{points: 1}

We will now use the **Mean Squared Error (MSE)** on the test set to evaluate the predictive model (the smaller the metric, the better the model).

> **Heads up:** a related measure commonly used is the **Root Mean Squared Error (RMSE) = $\sqrt{\text{MSE}}$**, which summarizes the typical size of the prediction errors $y_i - \hat{y}_i$. This metric has the same units as the response.
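The definition can be checked by hand on a tiny example. The helper `rmse_by_hand()` and the numbers are hypothetical; the helper only mirrors the `preds`/`actuals` argument names used in the exercise.

```r
# Hypothetical helper: RMSE computed directly from its definition
rmse_by_hand <- function(preds, actuals) {
  sqrt(mean((actuals - preds)^2))
}

# Made-up observed and predicted values
actuals <- c(10, 20, 30)
preds   <- c(12, 18, 33)

# Errors are -2, 2 and -3, so MSE = (4 + 4 + 9) / 3
rmse_by_hand(preds, actuals)
```

Here the RMSE is about 2.38, expressed in the same units as the response.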

Use the function `rmse()` from the `mltools` package to compute the $\text{RMSE}_{\text{test}}$ of the *predicted* `brozek` values in `fat_test_pred_full_OLS`.

Store the computed RMSE metric in a tibble called `fat_test_RMSEs` with two columns:

- `Model`: The regression model from which we will obtain the prediction accuracy.
- `RMSE`: The $\text{RMSE}_{\text{test}}$ corresponding to the model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_test_RMSEs <- tibble(
#   Model = "OLS Full Regression",
#   RMSE = ...(
#     preds = ...,
#     actuals = ...
#   )
# )

# your code here
fail() # No Answer - remove if you provide an answer

fat_test_RMSEs

In [None]:
test_1.6()

**Question 1.7**
<br>{points: 1}

Use `fat_cv_lambda_ridge` and the level of penalization that minimizes the CV-MSE to predict the brozek index of men in the test set `testing_fat`, and call the resulting object `fat_test_pred_ridge_min`.

> **Hint:** Use the function `predict()` with the argument `newx` to specify the test set.
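The general pattern of fitting, extracting coefficients and predicting can be sketched on simulated data. Everything below (`X_sim`, `y_sim`, `X_new`, the true coefficients) is hypothetical; the sketch assumes the `glmnet` package is installed and lets `cv.glmnet()` choose its own default sequence of penalties.

```r
library(glmnet)

set.seed(42)
n <- 100
p <- 5

# Simulated training inputs and response (hypothetical data)
X_sim <- matrix(rnorm(n * p), nrow = n,
                dimnames = list(NULL, paste0("x", 1:p)))
y_sim <- drop(X_sim %*% c(3, -2, 0, 0, 1)) + rnorm(n)

# Simulated new observations with the same columns as the training matrix
X_new <- matrix(rnorm(10 * p), nrow = 10,
                dimnames = list(NULL, colnames(X_sim)))

# Cross-validated ridge fit (alpha = 0)
cv_fit <- cv.glmnet(x = X_sim, y = y_sim, alpha = 0)

# Coefficients and predictions at the penalty minimizing the CV-MSE
coef_min <- coef(cv_fit, s = "lambda.min")
pred_new <- predict(cv_fit, newx = X_new, s = "lambda.min")

dim(pred_new)   # one row per new observation, one column
```

Note that `newx` must be a matrix with the same columns, in the same order, as the matrix used for fitting.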

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_test_pred_ridge_min <- predict(...,
#   newx = ..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.7()

**Question 1.8**
<br>{points: 1}

Use the function `rmse()` to compute the $\text{RMSE}_{\text{test}}$ using the predicted values stored in `fat_test_pred_ridge_min`. 

Add this metric as an additional row in the tibble `fat_test_RMSEs`. Use `"Ridge Regression with minimum MSE"` in column `Model` and the corresponding values for $\text{RMSE}_{\text{test}}$ in column `RMSE`.

**NOTE**: the code below binds rows into `fat_test_RMSEs`, so do not re-run this cell on its own. If you need to run it again, restart the kernel and re-run the notebook from the top. Otherwise, this object will have extra (repeated) rows.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_test_RMSEs <- rbind(
#   fat_test_RMSEs,
#   tibble(
#     Model = "Ridge Regression with minimum MSE",
#     RMSE = ...(
#       ...,
#       ...
#     )
#   )
# )


# your code here
fail() # No Answer - remove if you provide an answer

fat_test_RMSEs

In [None]:
test_1.8()

**Question 1.9**
<br>{points: 1}

Based on your results in `fat_test_RMSEs`, which model gives the best predictions? Was this an expected result? Justify your answer.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## Model Selection For Inference

In the second part of this tutorial, you will select a generative model using a real data set and use it for inference. 

**Recall** the [Ames `Housing` dataset](https://www.kaggle.com/c/home-data-for-ml-course/) you worked on in Worksheet 08. Let's refresh our memory: it was compiled by Dean De Cock and has 79 input variables describing different characteristics of residential houses in Ames, Iowa, USA, which can be used to predict the property's final price, `SalePrice`. As in Worksheet 08, we will focus our attention on numerical input variables. In this tutorial, we use the following 19:

- `LotFrontage`: Linear $\text{ft}$ of street connected to the house.
- `LotArea`: Lot size in $\text{ft}^2$.
- `MasVnrArea`: Masonry veneer area in $\text{ft}^2$.
- `TotalBsmtSF`: Total $\text{ft}^2$ of basement area.
- `GrLivArea`: Above grade (ground) living area in $\text{ft}^2$.
- `BsmtFullBath`: Number of full bathrooms in basement.
- `BsmtHalfBath`: Number of half bathrooms in basement.
- `FullBath`: Number of full bathrooms above grade.
- `HalfBath`: Number of half bathrooms above grade.
- `BedroomAbvGr`: Number of bedrooms above grade (it does not include basement bedrooms).
- `KitchenAbvGr`: Number of kitchens above grade.
- `Fireplaces`: Number of fireplaces.
- `GarageArea`: Garage's area in $\text{ft}^2$.
- `WoodDeckSF`: Wood deck area in $\text{ft}^2$.
- `OpenPorchSF`: Open porch area in $\text{ft}^2$.
- `EnclosedPorch`: Enclosed porch area in $\text{ft}^2$.
- `ScreenPorch`: Screen porch area in $\text{ft}^2$.
- `PoolArea`: Pool area in $\text{ft}^2$.
- `ageSold`: Age of the house when sold in $\text{years}$, computed as `YrSold - YearBuilt` in the code below.

Let's start by loading the data set. 

In [None]:
## Load the housing data set
housing_raw <- read_csv("data/Housing.csv", col_types = cols())

# Use `YearBuilt` and `YrSold` to create a variable `ageSold`
housing_raw$ageSold <- housing_raw$YrSold - housing_raw$YearBuilt


# Select subset of input variables
housing_raw <- 
  housing_raw %>%
  select(LotFrontage, LotArea, MasVnrArea, TotalBsmtSF, 
    GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, Fireplaces,
    GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, ScreenPorch, PoolArea, ageSold, SalePrice
  )

# Remove those rows containing `NA`s and some outliers
housing_raw <- 
    drop_na(housing_raw)  %>% 
    filter(LotArea < 20000)

str(housing_raw)

Our objective in this tutorial is to obtain a model for inference. We want to study how the properties' values are affected by the different properties' attributes. We want to be able to:

1. Interpret the parameters of the model;
2. Identify relevant attributes (covariates); 
3. Have a measure of uncertainty of our estimates.

**Question 2.0** 
<br> {points: 1}

Since we do not know which variables are important/relevant, we will need to apply a variable selection technique. Let's start by splitting the data set into two parts: (1) the first part, with 60% of the rows, will be used to select a model; and (2) the second part, with the remaining rows, will be used for inference. 

> Note that the partition is similar to the training-test one used in prediction problems.
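The idea behind such a split can be sketched in base R on simulated data. The data frame `toy_data` and the object names are hypothetical, and this sketch uses a plain random split rather than the stratified split requested in the exercise.

```r
set.seed(2024)

# Simulated data with an `id` column to track rows (hypothetical)
toy_data <- data.frame(id = 1:100, x = rnorm(100), y = rnorm(100))

# Randomly pick 60% of the row indices for model selection
selection_rows <- sample(nrow(toy_data), size = round(0.6 * nrow(toy_data)))

toy_selection <- toy_data[selection_rows, ]    # used to choose variables
toy_inference <- toy_data[-selection_rows, ]   # kept aside for inference

nrow(toy_selection)
nrow(toy_inference)

# No row appears in both parts
length(intersect(toy_selection$id, toy_inference$id))
```

Keeping the two parts disjoint is what allows the inference step to use data that played no role in choosing the variables.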

Your job is to randomly select 60% of the rows and store them in an object called `housing_selection`. Store the remaining rows in an object called `housing_inference`.

The `housing_inference` object is golden! It should not be touched before we select the variables. No peeking!!  

In [None]:
set.seed(20211118) # Do not change this

# Housing_split <- ...(..., prop = ..., strata = SalePrice)
# housing_selection <- training(Housing_split)
# housing_inference <- testing(Housing_split)

# your code here
fail() # No Answer - remove if you provide an answer

head(housing_selection)

In [None]:
test_2.0()

**Question 2.1** 
<br> {points: 1}

As we discussed in the worksheet, there are many possible approaches for model selection. Let's focus on Lasso. Run Lasso on the `housing_selection` tibble and find the value of `lambda` that provides the lowest cross-validation MSE (see the `cv.glmnet()` function).

_Save the result in an object named `lasso_model`._

In [None]:
set.seed(20211118) # do not change this

# lasso_model <-
#     cv.glmnet(x = ... %>% ...(-...)%>% as.matrix(), 
#               y = ..., 
#               alpha = ...)

# your code here
fail() # No Answer - remove if you provide an answer

lasso_model

In [None]:
test_2.1()

**Question 2.2** 
<br> {points: 1}

Write code to extract the coefficients of the best lasso model found in `lasso_model`. By best, we mean the one with the smallest cross-validation MSE. 

_Save the result in an object named `beta_lasso`._

In [None]:
set.seed(20211118) # do not change this


# your code here
fail() # No Answer - remove if you provide an answer

beta_lasso

In [None]:
test_2.2()

**Question 2.3** 
<br> {points: 1}

Extract the names of the covariates selected by Lasso (those with a nonzero estimated coefficient, excluding the intercept) and store them in an object named `lasso_selected_covariates`.  
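The selection rule itself is simple to illustrate with a named vector. The coefficient values in `beta_toy` are made up for illustration and do not come from any fitted model.

```r
# Hypothetical lasso coefficients (made-up values)
beta_toy <- c("(Intercept)" = 1.5, LotArea = 0.8, PoolArea = 0,
              Fireplaces = 2.1, HalfBath = 0)

# Keep covariates with a nonzero coefficient, dropping the intercept
is_selected  <- beta_toy != 0 & names(beta_toy) != "(Intercept)"
selected_toy <- names(beta_toy)[is_selected]

selected_toy
```

In this toy vector only `LotArea` and `Fireplaces` are kept, since the other two covariates have coefficients equal to zero.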

In [None]:
#lasso_selected_covariates <- as_tibble(
        # as.matrix(...),
        # rownames='covariate') %>%
        # filter(covariate != '(Intercept)' & abs(s1) !=0) %>% 
        # pull(...)

# your code here
fail() # No Answer - remove if you provide an answer

lasso_selected_covariates

In [None]:
test_2.3()

**Question 2.4** 
<br> {points: 1}

In **Question 1.3** of this tutorial, you extracted the estimated coefficients obtained using Ridge. In **Question 2.3**, you identified the variables selected by LASSO. 

**What is the main difference between these two methods in terms of variable selection?** Comment on the numbers of variables selected in relation to the variables available in the dataset.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.5** 
<br> {points: 1}

We expect LASSO to drop some of the highly correlated variables. Moreover, LASSO can still fit a linear model on data sets with high levels of multicollinearity, whereas ordinary least squares estimates become unstable in that setting (and are not even unique under perfect collinearity). To be on the safe side, let's check the variance inflation factor (VIF) of the variables selected by LASSO. 
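As a reminder of what the VIF measures, it can be computed by hand as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing covariate $j$ on the other covariates. The sketch below uses simulated data; `X_toy` and `vif_by_hand` are hypothetical names, and the exercise itself relies on the `vif()` function instead.

```r
set.seed(7)
n <- 200

# Simulated covariates: x2 is nearly a copy of x1, x3 is unrelated
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X_toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the others
vif_by_hand <- sapply(names(X_toy), function(v) {
  others <- setdiff(names(X_toy), v)
  fit <- lm(reformulate(others, response = v), data = X_toy)
  1 / (1 - summary(fit)$r.squared)
})

round(vif_by_hand, 1)
```

In this simulation `x1` and `x2` have very large VIFs while `x3` stays close to 1. Common rules of thumb flag VIF values above 5 or 10 as a sign of concerning multicollinearity.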

_Save the output in an object named `lasso_variables_vif`._

In [None]:
#lasso_variables_vif <- 
#    vif(...)

# your code here
fail() # No Answer - remove if you provide an answer

lasso_variables_vif

In [None]:
test_2.5()

**Question 2.6**
<br>{points: 1}

True or false?

The `lasso_variables_vif` does not indicate a very concerning presence of multicollinearity. 

_Assign your answer to an object called `answer2.6`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer2.6 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.6()

**Question 2.7** 
<br> {points: 1}

Finally, let's use the covariates selected by lasso and stored in `lasso_selected_covariates` to fit a linear model using ordinary least squares on `housing_inference`, the part of the data that was not used for selection.

_Save the output in an object named `inference_model`._
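One way to build a model formula from a character vector of selected covariates is `reformulate()`. The sketch below uses simulated data and pretends that a selection step on the other half of the data picked `x_a` and `x_c`; all object names are hypothetical.

```r
set.seed(99)
n <- 120

# Simulated data: y depends on x_a and x_c but not on x_b
toy <- data.frame(x_a = rnorm(n), x_b = rnorm(n), x_c = rnorm(n))
toy$y <- 1 + 2 * toy$x_a - toy$x_c + rnorm(n)

# Hypothetical held-out half, not used when choosing the covariates
toy_inference <- toy[61:n, ]

# Pretend these names came out of a selection step on the other half
selected_toy <- c("x_a", "x_c")

# Build `y ~ x_a + x_c` from the character vector and fit OLS
post_formula <- reformulate(selected_toy, response = "y")
post_fit <- lm(post_formula, data = toy_inference)

summary(post_fit)$coefficients
```

Because the fit uses only the held-out rows, the standard errors and p-values in the summary are not distorted by the selection step.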

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

summary(inference_model)

In [None]:
test_2.7()

**Question 2.8** 
<br> {points: 1}

The model stored in `inference_model` shows 5 variables that are not significant at the 5% significance level. Should we remove these variables and re-fit the model without them? Briefly explain why or why not. 

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.