# Lecture - Data Splitting
#### STAT 450

## Learning Goals

By the end of this lecture, students will be able to:

- Discuss how the research question being asked impacts the statistical modelling procedures.
- Write a computer script to perform post-lasso and use it to estimate a generative model.
- Discuss post-selection problems (e.g., double dipping into the data set) and current practical solutions available to address these (e.g., data-splitting techniques).
- Write a computer script to apply currently available practical solutions to post-selection problems.

In [None]:
library(tidyverse)
library(broom)

# Part I - Introduction

## 1. Statistical Modelling: Inference vs Prediction

In statistical modelling, the objective is to capture how a response variable, $Y$, is associated with a set of input variables, $\mathbf{X}=\left(X_1, X_2, \ldots, X_p\right)$. Two main reasons motivate us to model the relationship between $\mathbf{X}$ and $Y$: (1) prediction and (2) inference. 

### 1.1 Prediction

- Statistical modelling allows us to model the relationship between the response variable and the covariates associated with it based on data. 

- Statistical modelling also gives you a measure of uncertainty, which is crucial. 

- A few examples of questions that can be answered with prediction: 

  - What will be the selling price of a 1,300 sq. ft. house with three bedrooms in Kitsilano? 
  - Given that my blood albumin level is 44 g/L, do I have a liver problem?
  - What will be my final grade in STAT 450, given that I'm studying 5 hours per week and got 85% on the homework? 
  - Is there a difference in final grades between two sessions of a course: one online and the other in-person? 


#### Linear Regression as a Predictor

- The conditional expectation, $E[Y|\mathbf{X}]$, is the best predictor of $Y$ given a vector of variables $\mathbf{X}$ -- best in the sense of minimizing the mean squared error (MSE). 


- A linear regression (LR) assumes a linear form (linear in the parameters) for the conditional expectation.
    - In general, this is only an *assumed* model, an *approximation* to the real form of $E[Y|\mathbf{X}]$.


- If the conditional expectation is not linear, the LR can still be used as a predictor of $Y$, but it may not be the *best* one!
    - More flexible models (e.g., kNN) could estimate the conditional expectation better.
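
As a minimal sketch of the last point, we can compare a linear regression with a simple kNN average on a held-out set. Everything here is an illustrative assumption rather than course material: the simulated data with $E[Y|X] = \sin(2X)$, the hand-written `knn_predict` helper, and the choice `k = 10`.

```r
set.seed(1)

# Simulated data (assumption): the true conditional expectation is sin(2x)
n <- 400
x <- runif(n, -3, 3)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
train <- 1:200
test <- 201:400

# Linear regression fitted on the training half
lr_fit <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
lr_pred <- predict(lr_fit, newdata = data.frame(x = x[test]))

# Hypothetical helper: average the responses of the k nearest training points
knn_predict <- function(x_train, y_train, x_new, k = 10) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}
knn_pred <- knn_predict(x[train], y[train], x[test])

# Mean squared prediction error on the held-out half
mse_lr <- mean((y[test] - lr_pred)^2)
mse_knn <- mean((y[test] - knn_pred)^2)
c(linear_regression = mse_lr, kNN = mse_knn)
```

In this setting the straight line cannot follow the sine curve, so the kNN average should give the smaller test MSE. With a truly linear conditional expectation, the comparison would typically go the other way.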

### 1.2 Inference

- In some cases, we are more interested in understanding the association between $Y$ and $\mathbf{X}$ than in predicting $Y$ based on $\mathbf{X}$. 
  - We want to know how $Y$ varies when $\mathbf{X}$ changes. 


- For example:
  - How does price affect the sales of iPhones?
  - What affects the price of a house more: the number of bedrooms or the number of bathrooms? 
  - Are $\text{CO}_2$ levels associated with temperature increases? 
  - Does the sex of the applicants influence the chance of admission at the University of British Columbia?
  - Have the social distancing measures influenced the spread of COVID-19 in Canada?


- In all the questions above, we are not primarily interested in making highly accurate predictions. Instead, we want to understand the relation between different variables. 

---------------------------------------

- In many cases, we are interested in both inference and prediction. 
  - For example, one could be interested in answering, "How do social distancing measures influence the spread of COVID-19?" but also in answering, "How many cases can we expect to have if we implement some social distancing policies?".   
  - Linear Models are an excellent initial approach in these cases, as they are highly interpretable and perform reasonably well in many cases. 


- As we move to more complex non-linear models, such as LOESS or even neural networks, we **might** obtain much higher prediction performance, but these models are hard to interpret, if they can be interpreted at all.  
    - These models can be advantageous if we are only interested in accurate predictions. 
    - For example, a bakery manager wants to know how many apple pies will be sold the next day in order to know how many to prepare today. In this case, it doesn't matter how they get the prediction as long as it is accurate: if the forecast is too low, they miss sales; if it is too high, they lose money by throwing out unsold pies. No matter how complex these models are, we can always estimate their prediction performance. 


- When our primary interest is in inference, i.e., in understanding the relationship between the response and the covariates, we are willing to sacrifice some prediction performance for a more interpretable model that correctly depicts the variables' relationship. 
  - We are concerned about obtaining good estimates for the parameters of the models.

- No matter the objective, prediction or inference, model assessment is of fundamental importance. 
    - Key strategy for model assessment: data splitting. 
    - Data splitting is widespread for evaluating a model's prediction performance. But as it turns out, it can also be very useful for inference! 

# Part II - Double Dipping

## 2. Model Selection and Inference

- Many aspects of selecting a model go beyond variable selection; for example,
  - Do we want a parametric or non-parametric approach? 
  - Do we want to assume a functional form for the relationship between $Y$ and $\mathbf{X}$ (e.g., linear, quadratic, exponential, logarithmic)?
  - How important is prediction performance? 
  - Is model interpretability important?


- Let's focus on selecting Multiple Linear Regression models, which comes down to variable selection.


- There are many ways of comparing models: (1) $C_p$; (2) AIC; (3) BIC; (4) F-test; and (5) cross-validation MSE.


- There are also different techniques to select a desired model:
  - Stepwise Algorithms (e.g., Forward Selection, Backward Selection)
  - Lasso
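
As a rough illustration of some of the comparison criteria listed above, the sketch below compares two nested linear models with AIC, BIC, and an F-test. The simulated data and the names `small`, `large`, and `ic_table` are assumptions made for this example; $C_p$ and cross-validation are not shown.

```r
set.seed(450)

# Simulated data (assumption): x1 matters, x2 is irrelevant by construction
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

small <- lm(y ~ x1, data = sim)
large <- lm(y ~ x1 + x2, data = sim)

# Information criteria: lower values indicate the preferred model
ic_table <- data.frame(
  model = c("small", "large"),
  AIC = c(AIC(small), AIC(large)),
  BIC = c(BIC(small), BIC(large))
)
ic_table

# F-test for nested models: does adding x2 significantly reduce the RSS?
f_test <- anova(small, large)
f_test
```

Note that these criteria are meant to compare a small number of models specified in advance. Using them inside a search over many models is exactly the adaptive setting discussed next.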

### 2.1 Can we still make inferences for the selected models?

- You have learned how to make inferences (e.g., calculate confidence intervals and perform hypothesis tests) for a **fixed model**.


- When we apply any of these model selection methods, we are searching for the combination of variables that will give us the best model (according to a given metric). 
  - So, the variables in our final models are not fixed; instead, they are selected adaptively based on **the data at hand**. 
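
For contrast with the adaptive setting, here is a minimal sketch of the usual inference for a **fixed model**, where the covariates are chosen before looking at the data. The simulated data and the object names are assumptions made for this illustration.

```r
set.seed(450)

# Simulated data (assumption): only x1 has a real effect on y
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 + 1.5 * sim$x1 + rnorm(n)

# The covariates are decided in advance, not selected using the data
fixed_model <- lm(y ~ x1 + x2, data = sim)

# Estimates, standard errors, t statistics, and p-values
coef_table <- summary(fixed_model)$coefficients
coef_table

# 95% confidence intervals for the coefficients
conf_int <- confint(fixed_model, level = 0.95)
conf_int
```

The p-values and intervals above have their usual interpretation only because the model was not chosen by looking at `sim`.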

<br>

- Two questions arise then: 
  1. Do these model selection algorithms affect the inference about the parameters of the model? 
  2. Is the way we interpret the models still the same? 


#### 2.1.1 Forward Selection Review
Suppose we have $p$ covariates $X_1, X_2,\ldots,X_p$ to explain our response $Y$. The full model is given by:

$$
Y_i = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_p X_{ip} + \varepsilon_i
$$

We want to find the best subset of variables to explain $Y$. Searching for the best subset by brute force would require us to fit and compare a prohibitive number of models: each covariate is either in or out, giving $2^p$ candidates (see table below).

| Number of covariates (*p*) | Number of possible models ($2^p$) |
| ---------------------------|-------------------:|
| 10 | 1,024 |
| 20 | 1,048,576 |
| 30 | 1,073,741,824 |


The forward selection strategy helps us find good models among the insane number of models shown in the table above. But, unfortunately, it is not guaranteed to find the best model (or even good models). 

It starts with the **null model** (i.e., a model with no covariates, only the intercept $\beta_0$):

$$
Y_i = \beta_0 + \varepsilon_i
$$

Then, among the remaining variables, it searches for the one that improves the model the most and incorporates the variable into the model. It keeps incorporating one variable at a time until there's no variable left that would improve the model. 

But what do we mean by "improves the model the most?" 

- One can use different criteria to "measure" this. Common choices are $C_p$, *AIC*, *BIC*, *F-statistic*.
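
The procedure described above can be sketched with `step()` from base R's `stats` package, which by default uses AIC as the improvement criterion. The simulated data (five covariates, of which only `X1` and `X2` matter) are an assumption made for this illustration.

```r
set.seed(450)

# Simulated data (assumption): columns X1..X5, only X1 and X2 affect Y
n <- 200
sim <- data.frame(matrix(rnorm(n * 5), ncol = 5))
sim$Y <- 2 + 1.5 * sim$X1 - 1 * sim$X2 + rnorm(n)

# Forward selection starts from the null (intercept-only) model
null_model <- lm(Y ~ 1, data = sim)

# At each step, add the covariate that lowers the AIC the most;
# stop when no remaining covariate lowers it
fs_model <- step(
  null_model,
  scope = list(lower = ~ 1, upper = reformulate(paste0("X", 1:5))),
  direction = "forward",
  trace = 0
)

selected <- attr(terms(fs_model), "term.labels")
selected
```

With effects this strong, `X1` and `X2` should be selected. Irrelevant covariates may also slip in, since AIC does not guard against that.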

#### 2.1.2 Simulation Scenario

We will explore whether the forward selection strategy affects inference on the selected model. We use simulation so that we know the truth. Here's what we are going to do: 

1. We will consider a response variable $Y$ and $p=10$ covariates. However, none of the covariates will affect $Y$: all variables are generated independently of each other, so we already know the truth.


2. Generate 100 observations of each variable from a normal distribution.


3. Apply only the first step of forward selection. In other words, we want to add the first variable only among ten potential candidates.
    - The metric we are going to use is the F statistic.


4. `replicate` this study 500 times and measure the errors.

We have already simulated the data for you (Steps 1 and 2 above) in the cell below. 

In [None]:
# Run this cell before continuing 
set.seed(20240214)

n <- 100    # sample size
p <- 10     # number of variables
rep <- 500 # number of replications

means <- runif((p+1), 3, 10) # the mean that will be used in the 
                             # Normal distribution for simulation.
                             # The +1 refers to Y.  

dataset <- as_tibble(
  data.frame(
    matrix(
      round(rnorm((p + 1) * n * rep, 
            means, 10), 2), 
      ncol = p+1, 
      byrow = TRUE
    )
  ) %>%
  rename(Y = X11) %>%
  mutate(replicate = rep(1:rep, n)) %>%
  arrange(replicate) 
)

head(dataset)

To help speed things up, we created a function for you that receives a data frame, performs the first forward selection step, and returns the selected model (from which the F-statistic can be extracted).

In [None]:
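forward_selection_step1 <- function(dataset){
    #' Performs the first step of forward selection and
    #' returns the selected single-covariate model (an `lm` object),
    #' i.e., the one with the largest F-statistic.
    #'
    #' @param dataset a data frame with covariate columns X1, X2, ... and the response Y

    n_covariates <- ncol(dataset) - 1  # every column except Y is a covariate

    # Start with the model that uses X1 only
    selected_model <- lm(Y ~ ., data = dataset[, c("X1", "Y")])
    F_selected <- glance(selected_model) %>% pull(statistic)
    
    # Fit one lm per remaining covariate and keep the one with the largest F-statistic
    for (j in seq_len(n_covariates)[-1]) {
        model <- lm(Y ~ ., data = dataset[, c(paste0("X", j), "Y")])
        F_candidate <- glance(model) %>% pull(statistic)  # avoids masking the built-in `F`
        
        if (F_candidate > F_selected) {
            F_selected <- F_candidate
            selected_model <- model
        }
    }
    return(selected_model)
}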

**Exercise: Obtaining the forward selection model**<br>

Using the `dataset` tibble, obtain the forward selection model and store the model in a column named `fs_model`. Then, extract the F-statistic from the model and store it in a column named `F`.

In [None]:
# forward_selection_F <- 
#     dataset %>% 
#     group_by(...) %>%
#     nest() %>%
#     mutate(
#         ... = map(...), 
#         ... = ..._dbl(...)
#     )

head(forward_selection_F, 2)

**Question** 

Suppose we want to test, at the 5% significance level, whether adding the variable chosen by the forward selection strategy significantly decreases the RSS. In this case, the usual reference distribution under the null hypothesis is $\text{F-statistic}\sim F_{1,98}$. 
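
For reference, the statistic in question compares the null model with the one-covariate model. Writing $\text{RSS}_0$ and $\text{RSS}_1$ for their residual sums of squares (notation introduced here only for this illustration), with $n = 100$ observations:

$$
F = \frac{(\text{RSS}_0 - \text{RSS}_1)/1}{\text{RSS}_1/(n-2)}, \qquad n - 2 = 100 - 2 = 98.
$$

The numerator has one degree of freedom because one coefficient is added. The denominator has $n-2$ because the larger model estimates two coefficients (intercept and slope).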


What value should we compare the F-statistic against? 

In [None]:
# F_critical <- ...

F_critical

**Question** 

Knowing that none of our covariates are relevant to model $Y$, if we use the `F_critical` you calculated in the previous question, in what proportion of replications would you expect to wrongly reject the null hypothesis that the variable's coefficient is zero?

In [None]:
# nominal_type_I_error <- ...

nominal_type_I_error

**Exercise** 

Check the proportion of F-statistics in the `forward_selection_F` tibble that are above the `F_critical` you calculated. 


In [None]:
# forward_selection_type_I_error <-
#   forward_selection_F %>%
#   ungroup() %>%
#   ...

forward_selection_type_I_error

### Class discussion 
Contrast the `forward_selection_type_I_error` and `nominal_type_I_error`. Are they similar? Why do you think this is happening?

### 2.2 The double use of data

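- The Type I error after forward selection was substantially higher than the nominal level of 5%.
  - Well, if we are looking for the most relevant covariate **in the sample at hand**, it is not surprising that we frequently find it significant.
  - With ten candidate covariates and no true effects, the chance that at least one of the ten F-statistics exceeds the 5% critical value is roughly $1 - 0.95^{10} \approx 0.40$ (treating the ten tests as approximately independent).
  - Hence, we have a much higher chance of wrongly rejecting $H_0$.  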


- Ok, we identified the problem: 
  - We use the same sample to find the variable that yields the largest test statistic and then test if the variable is relevant. 
  - You could think like this: "We're looking at our sample to find the most relevant variable. After we find it, we will ask the same sample if the variable is relevant." 


- But what if we split the dataset into two parts, one for model selection and the other for inference? Would that solve the problem? Let's investigate! 



- We are going to use the tibble `dataset`. But this time, we are going to split our dataset into two parts:
  1. one for model selection 
  2. one for inference

Here's what you need to do: 

1. Shuffle the dataset so we know that the observations are in random order. 

2. Using the first 50 observations, apply the first step of forward selection; store the selected model in the `fs_model` column. Also, extract the F-statistic of the `fs_model` and store it in a column called `F_fs`.

3. Fit the model selected in Step 2 using the 50 remaining observations and save it in a column named `inference_model`. Also, extract the F-statistic of the `inference_model` and store it in a column called `F_inference`. 
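
Independently of the exercise, the whole experiment can be sketched in base R. This is only an illustration under stated assumptions: it uses freshly simulated standard-normal data rather than the `dataset` tibble (so the numbers will not match your answers), the helper names `f_stat` and `one_replication` are hypothetical, and it relies on the identity $F = (m-2)\,r^2/(1-r^2)$ for a simple regression on $m$ observations with sample correlation $r$.

```r
set.seed(450)
n <- 100      # sample size
p <- 10       # number of covariates, none related to y
n_rep <- 500  # number of replications

# F-statistic of the simple regression of y on x: F = (m - 2) r^2 / (1 - r^2)
f_stat <- function(x, y) {
  r <- cor(x, y)
  (length(y) - 2) * r^2 / (1 - r^2)
}

one_replication <- function() {
  X <- matrix(rnorm(n * p), ncol = p)
  y <- rnorm(n)        # independent of X, so every null hypothesis is true
  selection <- 1:50    # rows are i.i.d., so no shuffling is needed here
  inference <- 51:100

  # Double dipping: select and test on the same n observations
  F_all <- apply(X, 2, f_stat, y = y)

  # Data splitting: select on one half, test on the other half
  best <- which.max(apply(X[selection, ], 2, f_stat, y = y[selection]))
  c(naive = max(F_all), split = f_stat(X[inference, best], y[inference]))
}

F_values <- replicate(n_rep, one_replication())  # 2 x n_rep matrix

naive_error <- mean(F_values["naive", ] > qf(0.95, df1 = 1, df2 = n - 2))
split_error <- mean(F_values["split", ] > qf(0.95, df1 = 1, df2 = 48))
c(naive = naive_error, split = split_error)
```

The naive rejection rate should land far above 5%, while the split-sample rate should stay close to it. The price of splitting is that each step uses only half of the data.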

In [None]:
set.seed(20240214) # Do not change this.

# fs_error_split <- 
#     dataset %>% 
#     slice_sample(...) %>%
#     ... %>% 
#     ... %>% 
#     mutate(
#         fs_model = ...(..., .f = function(d) forward_selection_step1(d %>% head(50))), 
#         F_fs = ...,
#         inference_model = map2(.x = ..., .y = ..., ~ update(.y, .~., data = .x %>% tail(50))), 
#         F_inference =  ...
#     )
        
head(fs_error_split) %>% 
    select(F_fs, F_inference)

**Question** 

Check the proportion of F-statistics in the `F_inference` column that are above the appropriate critical value. (Hint: in this case $\text{F-statistic}\sim F_{1,48}$, so the critical value must be recomputed.) 

In [None]:
# fs_split_type_I_error <- 
#   fs_error_split %>%
#   ungroup() %>%
#   ...

fs_split_type_I_error

**Question**

True or false?

If we split the data into a model-selection set and an inference set, the type I error of the F-test after forward selection is close to the significance level. 

In [None]:
# answer2.4 <- ...

## 3. Model Selection and Prediction

- A similar problem occurs when we are focused on prediction. 


- We learn to use the sample to estimate quantities of the population. 
  - It's only natural for us to think of using the in-sample MSE to estimate the out-of-sample MSE. 
  


- The problem is that we fit our model to minimize the in-sample MSE.
  - For this reason, the in-sample MSE tends to underestimate the out-of-sample MSE (sometimes by a considerable amount). 
  - This is called "overfitting" -- when the model's in-sample cost is much lower than its out-of-sample cost. 
 


  
- To obtain a reliable estimate of the out-of-sample error, we need to predict observations the model didn't have access to during the fitting process.   
  
 


- To solve this problem, we resort again to data splitting. 
  - We split our sample into two sets, one for model fitting and one for testing the model.
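
The gap between in-sample and out-of-sample MSE can be seen in a small simulation. This is a sketch under stated assumptions: the data-generating model (a straight line with unit-variance noise), the deliberately over-flexible degree-5 polynomial, and the 30/30 split are all illustrative choices.

```r
set.seed(450)

# Each replication: simulate 60 points, fit on 30, evaluate on the other 30
mse_pair <- replicate(200, {
  d <- data.frame(x = runif(60, -2, 2))
  d$y <- 1 + 2 * d$x + rnorm(60)   # true noise variance is 1 (assumption)
  train <- d[1:30, ]
  test <- d[31:60, ]

  # Over-flexible model: degree-5 polynomial for a truly linear relationship
  fit <- lm(y ~ poly(x, 5), data = train)

  c(in_sample = mean(residuals(fit)^2),
    out_of_sample = mean((test$y - predict(fit, newdata = test))^2))
})

# Average over replications
avg_mse <- rowMeans(mse_pair)
avg_mse
```

On average, the in-sample MSE should fall below the true noise variance of 1 and the out-of-sample MSE above it. This is why the training error is an optimistic estimate.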
  
 

------------------


- With the split of the sample into a training set and test set, we can test our model's prediction accuracy. 


- But what about model selection? 


- Imagine you have ten competing models. You want to select the one with the best prediction accuracy. How would you do that?

**Approach 1**

1. We use the training set to fit the models.
2. We select the model with the best prediction accuracy in the training set.
3. We test the model in the test set to obtain an estimate of the out-of-sample error of the selected model. 

<p style="color: red;">What do you think of this approach? </p>

**Approach 2**

1. We use the training set to fit the models;
2. We assess each model's prediction accuracy in the test set; 
3. The most accurate model in the test set wins. 

<p style="color: red;">What do you think of this approach? </p>

#### 3.1 The validation set

- The validation set is a second split of the training data.


- We end up with three data sets: (1) training set; (2) validation set; and (3) test set.
  1. We use the training set to fit as many models as we want;
  2. We compare the models' out-of-sample performance using the validation set;
  3. Once we have a winner, we estimate the out-of-sample performance using the test set.
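
The three steps above can be sketched as follows. The simulated data, the 180/60/60 split sizes, the candidate models (polynomials of degree 1 to 6), and the helper `mse` are all assumptions made for this illustration.

```r
set.seed(450)

# Simulated data (assumption): the true relationship is quadratic
n <- 300
d <- data.frame(x = runif(n, -2, 2))
d$y <- 1 + d$x - 0.5 * d$x^2 + rnorm(n)

# Randomly assign each observation to one of the three sets
set_label <- sample(rep(c("train", "validation", "test"), times = c(180, 60, 60)))
train <- d[set_label == "train", ]
validation <- d[set_label == "validation", ]
test <- d[set_label == "test", ]

# Hypothetical helper: mean squared prediction error of a fit on a data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# 1. Fit all candidate models on the training set
degrees <- 1:6
fits <- lapply(degrees, function(k) {
  lm(as.formula(sprintf("y ~ poly(x, %d)", k)), data = train)
})

# 2. Compare the candidates on the validation set
val_mse <- sapply(fits, mse, data = validation)
best <- which.min(val_mse)

# 3. Estimate the winner's out-of-sample performance on the untouched test set
test_mse <- mse(fits[[best]], test)
c(best_degree = degrees[best], validation_mse = val_mse[best], test_mse = test_mse)
```

The test set is used exactly once, after the winner has been chosen, so it plays no part in the selection.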
  
<p style="color: red;">Why can't we use the out-of-sample performance from the validation set as our estimate?</p>
