# Worksheet 08: Selection methods for generative and predictive models

### Learning objectives
By the end of this section, students will be able to:

- Give examples of questions that can be answered by generative models and others that can be answered by predictive models.
- Discuss how the research question being asked impacts the statistical modelling procedures.
- Explain how regularized methods, such as lasso and ridge, can be used to estimate a predictive or a generative model.
- Distinguish the selection properties of lasso and ridge penalties.
- Discuss why the model obtained directly from lasso is not the most suitable model for generative modelling and how post-lasso is one way to address this problem.
- Write a computer script to perform lasso/ridge and use it to predict new outcomes.
- Write a computer script to perform post-lasso and use it to estimate a generative model.
- Discuss post-inference problems (e.g., double dipping into the data set) and current practical solutions available to address these (e.g., data-splitting techniques).
- Write a computer script to apply currently available practical solutions to post inference problems.
- Discuss how the research question being asked impacts the communication of the results.


In [None]:
library(tidyverse)
library(glmnet)
library(broom)
library(leaps)
library(repr)
library(faraway)
library(mltools)

options(repr.plot.width=10, repr.plot.height=8)
source("tests_worksheet_08.R")

# PART I: Regularized Methods

## In worksheet_07: 

### Select a model using *stepwise* algorithms

- these are *greedy* algorithms 

- results depend on the order in which variables are selected 

- variables are either *in* (i.e., estimated coefficient different from zero) or *out* (i.e., estimated coefficient equal to zero)

## In this worksheet:

- can we use regularized methods to select a predictive model? and for a generative model?

- can we use the selected model to make inference about the population?
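
Before starting, it may help to see how the two penalties differ as selection tools. The sketch below is only an illustration on simulated toy data: the names `x_toy`, `y_toy`, `lasso_toy`, `ridge_toy`, and `toy_coefs`, the seed, and the value `lambda = 0.5` are all assumptions made for this example and are not part of the worksheet.

```r
library(glmnet)

set.seed(1)                       # arbitrary seed for this illustration
n_toy <- 200
x_toy <- matrix(rnorm(n_toy * 5), ncol = 5,
                dimnames = list(NULL, paste0("X", 1:5)))
y_toy <- 3 * x_toy[, "X1"] + rnorm(n_toy)   # only X1 is related to the response

# alpha = 1 is the lasso penalty; alpha = 0 is the ridge penalty
lasso_toy <- glmnet(x_toy, y_toy, alpha = 1, lambda = 0.5)
ridge_toy <- glmnet(x_toy, y_toy, alpha = 0, lambda = 0.5)

# Drop the intercept and compare the estimated slopes side by side
toy_coefs <- cbind(lasso = as.matrix(coef(lasso_toy))[-1, 1],
                   ridge = as.matrix(coef(ridge_toy))[-1, 1])
toy_coefs
```

In this toy setting the lasso column contains exact zeros for the irrelevant covariates, whereas ridge shrinks every coefficient without setting any of them exactly to zero. This is why lasso can act as a variable selection method and ridge cannot.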

## PART I. Lasso 

In this first section you will use LASSO to select variables of a generative model. A problem of LASSO, and other regularized methods, is that the resulting estimators are **biased**. 

<font color='darkred'>**Biased estimators** are estimators whose sampling distributions are not centred on the true value of the parameter. </font> 
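
To make the idea of bias concrete, here is a minimal base-R sketch under simplifying assumptions: a single covariate rescaled so that `mean(x^2) = 1`, no intercept, and the lasso objective written as $\frac{1}{2n}\text{RSS} + \lambda|\beta|$. In that special case the lasso estimate is the LS slope after soft-thresholding by $\lambda$. The values `beta_true`, `lambda_toy`, `n_obs`, and `n_rep` are hypothetical and chosen only for illustration.

```r
set.seed(2)                      # arbitrary seed for this illustration
beta_true  <- 2                  # hypothetical true slope
lambda_toy <- 0.5                # hypothetical penalty
n_obs      <- 50                 # observations per simulated sample
n_rep      <- 2000               # number of simulated samples

# Soft-thresholding: move z towards zero by lambda, stopping at zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

estimates <- t(replicate(n_rep, {
  x <- rnorm(n_obs)
  x <- x / sqrt(mean(x^2))             # rescale so that mean(x^2) = 1
  y <- beta_true * x + rnorm(n_obs)
  b_ls <- mean(x * y)                  # no-intercept LS slope when mean(x^2) = 1
  c(ls = b_ls, lasso = soft_threshold(b_ls, lambda_toy))
}))

# Compare the centre of each sampling distribution with beta_true
colMeans(estimates)
```

The LS estimates should centre on `beta_true`, while the soft-thresholded estimates centre on a value pulled towards zero. That shift is the bias studied in the rest of this section.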

To study this problem, we are going to use a simulation to generate variables and a **known** model that relates them.

**Simulation Design**: here's what we are going to do: 

1. We are going to consider a response variable $Y$ and $p=3$ covariates. However, only 2 of the generated covariates will have an effect on $Y$. Hopefully, LASSO will select the 2 relevant ones.  

2. Generate 1,000 observations (`n`) of each variable (the response and the 3 covariates) from a Normal distribution, with means equal to 0 for the covariates and the following mean for the response:

$$
E[Y|X_1, X_2, X_3] = 75X_1 - 5 X_2 + 0 X_3
$$ 

> Therefore, the **true coefficients** are $\beta_1=75$ and $\beta_2=-5$. Note that $\beta_3 = 0$, thus $X_3$ is not a relevant variable.

3. Use LASSO to select a model and store the coefficients of the selected model.

4. Replicate (`rep`) this study 1,000 times. 

> Note that we are generating 1,000 datasets, all at once, and storing them in a tibble called `lasso_sim`. We will use the `map` function to apply a function to each dataset.
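
If the nest-then-`map` pattern is new to you, the sketch below shows it on a stand-in example: one `lm` per group of the built-in `mtcars` data. The grouping variable `cyl`, the formula `mpg ~ wt`, and the names `toy_fits`, `fit`, and `slope` are assumptions chosen for illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)

toy_fits <- 
    mtcars %>% 
    group_by(cyl) %>%            # one group plays the role of one dataset
    nest() %>%                   # rows of each group go into a list-column `data`
    mutate(
        fit   = map(.x = data, ~ lm(mpg ~ wt, data = .x)),   # one lm per group
        slope = map_dbl(.x = fit, ~ coef(.x)[["wt"]]))       # one number per fit

toy_fits %>% select(cyl, slope)
```

`map()` returns a list (here, a list of fitted models), while `map_dbl()` returns a numeric vector, which is convenient when each model contributes a single number.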

In [None]:
# Run this cell before continuing 

set.seed(20211113) # Do not change this.

n <- 1000    # sample size
rep <- 1000 # number of replications

lasso_sim <- 
    tibble(
        X1 = round(
                rnorm(n * rep, 0, 10), 
                2),
        X2 = round(
                rnorm(n * rep, 0, 10), 
                2),
        X3 = round(
                rnorm(n * rep, 0, 20), 
                2),
        Y = round(75 * X1 - 5*X2 + rnorm(n * rep, 0, 400),2)) %>% 
    mutate(replicate = rep(1:rep, n)) %>% 
    arrange(replicate) 


head(lasso_sim)

**Question 1.0**<br>
{points: 1}

Using the `lasso_sim` tibble, fit a LASSO model for each replicate.

> Note: In the `glmnet()` function you can choose `family` like in `glm()`. The default option is `family = "gaussian"`, which is used in this worksheet since the response is continuous. 

For simplicity, we'll use $\lambda=30$ in all models. 

> However, in practice, it is not recommended to fit LASSO at a given lambda value. 

Store the models in a column named `lasso_model`. 

_Assign your data frame to an object called `lasso_study`. Your data frame should have three columns: `replicate`, `data`, and `lasso_model`._

In [None]:
# lasso_study <- 
#     ... %>% 
#     ... %>% 
#     ... %>% 
#     mutate(
#         lasso_model = ...(...,
#                           ~...(.x %>% select(-Y) %>% as.matrix(), 
#                                   .x %>% select(Y) %>% as.matrix(), 
#                                   alpha = ..., 
#                                   lambda = ...)))

# your code here
fail() # No Answer - remove if you provide an answer

head(as.data.frame(lasso_study), 2)

In [None]:
test_1.0()
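
As noted in Question 1.0, fixing $\lambda$ in advance is not recommended in practice. A common alternative is to choose it by cross-validation with `cv.glmnet()`. The sketch below uses simulated toy data. The names `x_cv`, `y_cv`, and `cv_fit`, the seed, and the coefficients are assumptions for illustration and are unrelated to `lasso_sim`.

```r
library(glmnet)

set.seed(3)                                   # arbitrary seed; CV folds are random
n_cv <- 200
x_cv <- matrix(rnorm(n_cv * 3), ncol = 3,
               dimnames = list(NULL, c("X1", "X2", "X3")))
y_cv <- 2 * x_cv[, "X1"] - 1 * x_cv[, "X2"] + rnorm(n_cv)   # X3 is irrelevant

# Cross-validation (10 folds by default) over an automatically chosen lambda grid
cv_fit <- cv.glmnet(x_cv, y_cv, alpha = 1)

cv_fit$lambda.min                 # lambda with the smallest CV mean squared error
cv_fit$lambda.1se                 # largest lambda within 1 SE of that minimum
coef(cv_fit, s = "lambda.1se")    # coefficients at the more parsimonious choice
```

`lambda.1se` is never smaller than `lambda.min`, so it tends to give a sparser model. Which of the two to use is a judgment call that depends on the goal of the analysis.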

**Question 1.1**<br>
{points: 1}

Extract the coefficient for `beta_1` from each `lasso_model` in the `lasso_study` tibble. Store the coefficients in a column named `lasso_beta1` in the same `lasso_study` tibble. 

In [None]:
# lasso_study <- 
#     ... %>% 
#     ...

# your code here
fail() # No Answer - remove if you provide an answer

head(as.data.frame(lasso_study %>% select(-data)))

In [None]:
test_1.1()

**Question 1.2**
<br> {points: 1}

Plot the sampling distribution of $\hat{\beta}_1$ obtained by Lasso.


_Assign your plot to an object called `lasso_beta1_sampling_dist`._

In [None]:
# lasso_beta1_sampling_dist <- 
#     lasso_study %>% 
#     ggplot() + 
#     geom_...(aes(...), color='white') +
#     geom_vline(xintercept = 75, color = 'red') + 
#     geom_text(aes(75, 80), label = "True Value of\n the Parameter", color = 'red', size = 7) +
#     theme(text = element_text(size = 18))

# your code here
fail() # No Answer - remove if you provide an answer

lasso_beta1_sampling_dist

In [None]:
test_1.2()

**Question 1.3**
<br>{points: 1}

True or false?

The sampling distribution of the lasso estimator of $\beta_1$ is centered around the true $\beta_1$.

_Assign your answer to an object called `answer1.3`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer1.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**<br>
{points: 1}

One way to correct the bias of the Lasso's estimated coefficients is to re-fit the model using LS and the variables selected by LASSO. In other words, we use LASSO only to select variables and ignore the estimated coefficients. We then re-estimate the coefficients of the selected variables using regular least squares. 

In the cell below we have done it for you. Here's what we did:

1. We added a new column to the `lasso_study` tibble, named `lasso_selected_covariates`, with the covariates selected by LASSO (i.e., with coefficients different from 0).


2. We fitted a linear model using `lm` (regular least squares) and only the `lasso_selected_covariates`.


3. We extracted the estimate of $\beta_1$ from the `lm` model and saved it in a column called `ls_beta1`.

Your job is to plot the sampling distribution of $\hat{\beta}_1$ obtained by regular least squares, using the variables selected by LASSO.

_Assign your plot to an object called `post_lasso_lm_beta1_sampling_dist`._

In [None]:
# Run this cell before continuing

lasso_study <- 
    lasso_study %>% 
    mutate(
        lasso_selected_covariates = map(.x = lasso_model, 
                                        ~as_tibble(
                                                as.matrix(coef(.x)),
                                                rownames='covariate') %>%
                                                filter(covariate != '(Intercept)' & abs(s0) > 10e-6) %>% 
                                                pull(covariate)),
        ls_fit = map2(.x = data, .y = lasso_selected_covariates,
                     ~lm(Y ~ ., data = .x[,c(.y, 'Y')])),
        ls_beta1 = map_dbl(.x = ls_fit, ~tidy(.x) %>% filter(term == 'X1') %>% pull(estimate)))


lasso_study %>% 
    select(-data, -lasso_model, -ls_fit) %>% 
    head()

In [None]:
# post_lasso_lm_beta1_sampling_dist <- 
#     lasso_study %>% 
#     ggplot() + 
#     geom_...(aes(...), color='white') +
#     geom_vline(xintercept = 75, color = 'red') + 
#     geom_text(aes(75, 80), label = "True Value of\n the Parameter", color = 'red', size = 7) +
#     theme(text = element_text(size = 18))

# your code here
fail() # No Answer - remove if you provide an answer

post_lasso_lm_beta1_sampling_dist

In [None]:
test_1.4()

## PART II: Inference after model selection

### 1. Can we make inference using the selected models?

An important topic learned in this course is how to make inference (e.g., compute confidence intervals and run hypothesis tests) for a fixed model. However, when we apply a model selection algorithm, we are searching for the combination of variables that will give us the best model (according to a given metric). So the variables in our final models are not fixed; instead, they are selected adaptively based on **the data at hand**. 

Two questions arise then: 

1. Do these model selection algorithms affect the inference about the parameters of the model? 

2. Is the way we interpret the models still the same? 

In this section we'll investigate the first question when the forward selection method is used to select a model.

**Simulation Design**<br>

Again, we are going to use a simulation study in order to be in full control of the true model that generates the data. Here's what we are going to do: 

1. We are going to consider a response variable $Y$ and $p=10$ independent variables. However, none of the generated variables will have an effect on $Y$. In other words, in this simulation study, we expect the intercept-only model to be better than any linear regression that includes any of the independent variables.


2. Generate 100 observations (`n`) of each variable (the response and the 10 independent variables) from a Normal distribution.


3. Apply the forward selection algorithm to select at most 3 variables among the 10 available. We restrict the size to 3 to shorten the computation time. Use the adjusted $R^2$ to compare models of different sizes.

4. Use the selected model to make inference about the population. Recall that none of the independent variables is related to the response $Y$. However, due to randomness in the sample used, we may still (incorrectly) find a model that is statistically better than the intercept-only model! 

5. Replicate this study 1,000 times (`rep`) and compute the type I error rate. 

> Note that we are generating 1,000 datasets, all at once!! We will use the `map` function to apply a function to each dataset.

*Run the cells below to generate the datasets.* 

In [None]:
# Run this cell before continuing 
set.seed(20211113)

n <- 100    # sample size
p <- 10     # number of variables
rep <- 1000 # number of replications

means <- runif((p+1), 3, 10) # means for the Normal distribution 
                             # that will be used to generate 
                             # p covariates and a response Y   

dataset <- as_tibble(
  data.frame(
    matrix(
      round(rnorm((p + 1) * n * rep, 
            means, 10), 2), 
      ncol = p+1, 
      byrow = TRUE
    )
  ) %>% 
  rename(Y = X11) %>% 
  mutate(replicate = rep(1:rep, n)) %>% 
  arrange(replicate) 
)

head(dataset)

In [None]:
dim(dataset)

**Question 2.1 - Warm up**<br>
{points: 1}

To help you visualize the code abstraction, let's do a more intuitive exercise. 
Using the `dataset` tibble, fit one `lm` for each replicate using all 10 covariates to explain $Y$. Store the `lm` models in a column named `models`.

_Assign your data frame to an object called `full_models`. Your data frame should have three columns: `replicate`, `data`, and `models`._

In [None]:
# full_models <- 
#     ... %>% 
#     group_by(...) %>% 
#     nest() %>% 
#     mutate(models = map(...))


# your code here
fail() # No Answer - remove if you provide an answer

# Try exploring the columns of your data frame. 
# Check full_models$data[[1]] and full_models$models[[1]]

In [None]:
full_models$models[[1]]

In [None]:
test_2.1()

To help speed things up, we created a function for you that receives a data frame, runs the forward selection algorithm to select at most 3 variables, and fits LS on the selected variables. 

*Read and run the cell below to create such a function.*

In [None]:
forward_selection_function <- function(dataset) {
    # Forward selection over models with at most 3 covariates
    sel_model <- regsubsets(x = Y ~ .,
                            nvmax = 3,
                            data = dataset,
                            method = "forward")

    sel_model_summary <- summary(sel_model)

    # Model size with the largest adjusted R^2
    adj_r2_max <- which.max(sel_model_summary$adjr2)
    selected_var <- names(coef(sel_model, adj_r2_max))[-1]  # drop the intercept
    data_subset <- dataset %>% select(all_of(selected_var), Y)

    # Refit the selected model by least squares
    selected_model <- lm(Y ~ ., data = data_subset)
    return(selected_model)
}

**Question 2.2**<br>
{points: 1}

The function `forward_selection_function` will be used to select and fit a model on a given data set using LS. We will then compare the selected model to the intercept-only model using an $F$-test. Which null hypothesis will be tested?

**A**. The coefficient of the first variable selected equals zero.

**B**. The coefficient(s) of all selected variables equal zero.

**C**. The selected variables equal zero.

**D**. The intercept equals zero.

*Assign your answer to an object called answer2.2. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"`, surrounded by quotes.*

In [None]:
# answer2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

**Question 2.3**<br>
{points: 1}

Using the function `map`, apply the `forward_selection_function` to each generated dataset in the tibble `dataset`, identified by the variable `replicate`. Store each fitted model in a column named `fs_model`. 

> Note that there are 1000 fitted models, one for each dataset generated!! 

Then, use the function `map_dbl` to compare each selected model to the intercept only model and extract the $p$-value of each $F$-test. Store the 1000 $p$-values in a column named `F_pvalue`.

_Assign your data frame to an object called `forward_selection_F`. Your data frame should have four columns: `replicate`, `data`, `fs_model`, and `F_pvalue`._
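
As a hint, here is one way to obtain the $p$-value of an $F$-test that compares a fitted `lm` to the intercept-only model. The sketch uses the built-in `mtcars` data purely as a stand-in. The names `reduced_fit`, `larger_fit`, `p_anova`, and `p_summary` are illustrative and are not part of the worksheet.

```r
# Built-in `mtcars` data used purely as a stand-in
reduced_fit <- lm(mpg ~ 1, data = mtcars)          # intercept-only model
larger_fit  <- lm(mpg ~ wt + hp, data = mtcars)    # model with two covariates

# Option 1: explicit comparison of the two nested models
p_anova <- anova(reduced_fit, larger_fit)$`Pr(>F)`[2]

# Option 2: rebuild the p-value from the F statistic stored in the summary
f_stat <- summary(larger_fit)$fstatistic           # named: value, numdf, dendf
p_summary <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))

c(anova = p_anova, summary = p_summary)
```

Both options give the same number. `broom::glance()` applied to an `lm` fit also reports this quantity in its `p.value` column, which is handy inside `map_dbl()`.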

In [None]:
# forward_selection_F <- 
#     ... %>% 
#     group_by(...) %>% 
#     nest() %>% 
#     mutate(
#         ... = map(...), 
#         ... = ..._dbl(...)
#     )

# your code here
fail() # No Answer - remove if you provide an answer

head(as.data.frame(forward_selection_F), 2)

In [None]:
test_2.3()

**Question 2.4** 
<br> {points: 1}

Knowing that none of the generated independent variables are relevant to model $Y$, what proportion of tests would you expect to wrongly reject the null hypothesis that the coefficients of all the selected variables are zero? Consider a significance level of 5%.

_Assign your answer to an object called `nominal_type_I_error`. Your answer should be a single number._

In [None]:
# nominal_type_I_error <- ...

# your code here
fail() # No Answer - remove if you provide an answer

nominal_type_I_error

In [None]:
test_2.4()

**Question 2.5** 
<br> {points: 1}

Check the proportion of $p$-values computed from the $F$-tests stored in the `forward_selection_F` tibble that are below the 5% significance level. 

_Assign your answer to an object called `forward_selection_type_I_error`. Your answer should be a single number._

In [None]:
# forward_selection_type_I_error <- 
#     ... %>% 
#     ungroup() %>% 
#     summarise(...(... < 0.05)) %>% 
#     pull()

# your code here
fail() # No Answer - remove if you provide an answer

forward_selection_type_I_error

In [None]:
test_2.5()

### 2. The double use of data

The Type I Error rate from the tests run using the fitted selected models was significantly higher than the nominal level of 5%. Why? 

Well, if we are looking for the most relevant covariates in a dataset, it is not surprising that we frequently find these covariates significant. In our case above, the *F-statistic* compares the reduction of SSR after adding the selected variables to an intercept-only model. But the variables selected were those that reduced the SSR (or increased the adjusted $R^2$) **in the sample at hand**. Hence, we have a much higher chance of wrongly rejecting $H_0$.  

**The problem**: we are using the **same sample** to find the variables and fit the model. 

**A possible solution**: what if we split the dataset into two parts, one used for model selection and the other one used for inference? Would that solve the problem? 
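
The splitting idea can be sketched on a single toy data set. Everything here is an assumption made for illustration: the data are simulated noise, the names `toy`, `selection_half`, `inference_half`, `selected`, and `inference_fit` are hypothetical, and the selection rule (keep the covariate most correlated with `Y`) is a simple stand-in for forward selection.

```r
set.seed(4)                                    # arbitrary seed for this illustration
n_split <- 100
toy <- as.data.frame(matrix(rnorm(n_split * 6), ncol = 6))
names(toy) <- c(paste0("X", 1:5), "Y")         # Y is unrelated to every X by construction

selection_half <- toy[1:50, ]                  # used only to choose a covariate
inference_half <- toy[51:100, ]                # used only to test it

# Toy selection rule (an assumption): keep the covariate most correlated with Y
cors <- abs(cor(selection_half[, 1:5], selection_half$Y))[, 1]
selected <- names(which.max(cors))

# Refit the selected model on the held-out half and read off its F-test p-value
inference_fit <- lm(reformulate(selected, response = "Y"), data = inference_half)
f_stat <- summary(inference_fit)$fstatistic
pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE)
```

The key design point is that the rows used to pick `selected` never enter `inference_fit`, so the test is not computed on the same observations that made the covariate look promising.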

Let's investigate! 

**Question 3.0** 
<br> {points: 1}

In this exercise we are again going to use the tibble `dataset`. But this time we are going to split our dataset into two parts. We are going to use one part to select a model, and the other part for inference. Since the sample size has changed, to compare strategies, we'll also fit the selected model and test it on the first set.

> For this exercise, let's split the dataset in half. Note that other options are possible.

Here's what you need to do: 

1. Using the first 50 observations, run the forward selection algorithm to select at most 3 variables and fit an LS model on the selected variables. Store the selected model in the `fs_model` column. 


2. Run an $F$-test to compare the selected model to an intercept only model and extract the $p$-value. Store the 1000 values in a column called `F_fs`.


3. Fit the model selected in Step 1 using the 50 remaining observations and save it in a column named `inference_model`. Also, extract the $p$-value of the $F$-test for the `inference_model` and store it in a column called `F_pvalue`. 

> Note that using the `map` functions you can perform these steps at once for the 1000 datasets.

_Assign your data frame to an object called `fs_error_split`. Your data frame should have 6 columns: `replicate`, `data`, `fs_model`, `F_fs`, `inference_model`, `F_pvalue`._

In [None]:
set.seed(20211113) # Do not change this.

# fs_error_split <- 
#     ... %>% 
#     ... %>% 
#     ... %>% 
#     mutate(
#         fs_model = ...(..., .f = function(d) forward_selection_function(d %>% head(50))), 
#         F_fs = ...,
#         inference_model = map2(.x = ..., .y = ..., ~ update(.y, .~., data = .x %>% tail(50))), 
#         F_pvalue = ...
#     )

# your code here
fail() # No Answer - remove if you provide an answer
        
head(fs_error_split) %>% 
    select(replicate, F_fs, F_pvalue)

In [None]:
test_3.0()

**Question 3.1** 
<br> {points: 1}

Check the proportion of $p$-values of the $F$-tests in the `F_fs` column that are below the 5% significance level. Note that these tests were run using the first split of the data, which was also used to select the model.

> Hint: you've done something similar in **Question 2.5**

_Assign your answer to an object called `fs_split1_type_I_error`. Your answer should be a single number._

In [None]:
# fs_split1_type_I_error <- ... 

# your code here
fail() # No Answer - remove if you provide an answer

fs_split1_type_I_error

In [None]:
test_3.1()

**Question 3.2** 
<br> {points: 1}

Check the proportion of $p$-values of the $F$-tests in the `F_pvalue` column that are below the 5% significance level. Note that these tests were run using the second split of the data, which was *not used* to select the model.


_Assign your answer to an object called `fs_split2_type_I_error`. Your answer should be a single number._

In [None]:
# fs_split2_type_I_error <- ... 

# your code here
fail() # No Answer - remove if you provide an answer

fs_split2_type_I_error

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

True or false?

If we split the data and use different parts for model selection and inference, the type I error rate of the $F$-test after forward selection is close to the significance level. 

_Assign your answer to an object called `answer3.3`. Your answer should be either "true" or "false", surrounded by quotes._

In [None]:
# answer3.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.3()

### Class discussion 
Contrast the `forward_selection_type_I_error` and `nominal_type_I_error`. Are they similar? Why do you think this is happening?

<font color='darkred'>The post-inference problem in models selected through forward selection, as examined in this simulation study, *also occurs with LASSO*. </font>

In the tutorial, you will address the problem by splitting the dataset -- one part for selecting variables using LASSO and the other for fitting the model for inference. 

## Conclusions

#### Lasso has two problems:

- **Biased estimators**: we can take care of this by fitting regular least squares on the variables selected by Lasso. This approach is called **post-lasso**.

- **Post-inference**: when fitting an LS regression after LASSO, we are using the same data to select the variables and to conduct inference. We cannot rely on the inference given by `lm` unless we split the data to take care of this problem.

#### Post-inference problem:

- we cannot use the same data to select variables of the model and to conduct inference ("double dipping"). 

- the inference results given by `lm` are not valid (as seen in Part II of this worksheet). 

- if we split the data, we can use one part to select and the other part to estimate and build tests.

- more sophisticated methods have been proposed to address this problem (beyond the scope of this course).
