# Worksheet 8: Prediction and Model Selection

#### Lecture and Tutorial Learning Goals:

By the end of this section, students will be able to:

- Explain the difference between confidence intervals for prediction and prediction intervals, and which elements need to be estimated to construct these intervals.

- Write a computer script to calculate these intervals. Interpret and communicate the results from that computer script.

- Give an example of a question that can be answered by predictive modelling.

- Explain the algorithms for the following variable selection methods: • Forward selection • Backward selection

- Explain when a linear regression is an appropriate model to predict new outcomes based on new values of the input variables.

- List model metrics that are suitable for evaluation of a statistical model developed for the purpose of predictive modelling (e.g., RMSE), as well as how they are calculated.

- Discuss how different estimation methods can result in different predictions.

# Part I: Uncertainty of prediction

## 1. Prediction Intervals *vs* Confidence Intervals for prediction

Last week we saw that the <font color=blue> estimated LR </font> can be used to predict values of the response variable

> we can predict observations from the training or test sets! 

We have also learned different metrics to evaluate the estimated model. Many of these metrics compared the observed response $y$ with its predicted value using the estimated LR $\hat{y}$. For example:


- **Mean Squared Error**: MSE = $\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$

or 

- $R^2 = cor(y, \hat{y})^2$ (for a model with an intercept estimated by LS)
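As a quick numerical check of both metrics, the sketch below fits an SLR to simulated data and verifies that the $R^2$ reported by `lm` equals the squared correlation between $y$ and $\hat{y}$. The data and the names `x`, `y`, and `fit` are made up for illustration and are not part of the worksheet's datasets.

```r
# Illustrative simulated data (not the Strathcona dataset)
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)
y_hat <- fitted(fit)

# In-sample MSE: average squared difference between observed and predicted
mse <- mean((y - y_hat)^2)

# R^2 reported by lm vs. squared correlation between y and y_hat
r2_lm <- summary(fit)$r.squared
r2_cor <- cor(y, y_hat)^2

c(MSE = mse, R2_lm = r2_lm, R2_cor = r2_cor)
```

The two versions of $R^2$ agree here because the model has an intercept and is estimated by LS.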

**Today, we will learn how to measure the uncertainty of $\hat{y}$**

#### Predictions are random variables

- Since the predictions are functions of the estimated LR, they also depend on the sample used!! 

    
- A different sample would have resulted in a different estimated LR and thus different predictions!! 

    - recall the sample to sample variation in the estimated coefficients?? it translates into variation in the predictions!
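The sample-to-sample variation of a prediction can be seen in a small simulation. This is only a sketch under stated assumptions: the population below is hypothetical (a made-up line $E[Y|X] = 100 + 2.5X$ with normal errors), and the names `pop`, `x0`, and `preds` are introduced for illustration.

```r
set.seed(301)

# Hypothetical population with E[Y | X] = 100 + 2.5 * X
N <- 10000
pop <- data.frame(x = runif(N, 50, 450))
pop$y <- 100 + 2.5 * pop$x + rnorm(N, sd = 80)

# Draw 500 samples of size 100; refit the SLR and predict at x = 250 each time
x0 <- data.frame(x = 250)
preds <- replicate(500, {
  s <- pop[sample(N, 100), ]
  unname(predict(lm(y ~ x, data = s), newdata = x0))
})

# The predictions vary from sample to sample around 100 + 2.5 * 250 = 725
summary(preds)
sd(preds)
```

Each sample gives a different estimated line, so the prediction at the same input value changes from sample to sample.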

    
<font color="blue">**As discussed for the estimation of the regression parameters, we can obtain *confidence intervals* that take into account the sample-to-sample variation of the predictions as well!**</font>
    
There are two types of intervals we can construct, depending on the quantity we want to predict: *confidence intervals for prediction (CIP)* and *prediction intervals (PI)*
    
Let's use a SLR to present concepts 
    
> **NOTE**: these intervals can be constructed for *any* LR  

## Dataset: [2015 Property Tax Assessment from Strathcona County](https://data.strathcona.ca/Housing-Buildings/2015-Property-Tax-Assessment/uexh-8sx8)

In this first part of the worksheet, we'll work with a new dataset on the assessed property tax values of properties in Strathcona County. A valuation date of July 1, 2014 and a property condition date as of December 31, 2014 are provided. 

![](https://github.com/UBC-STAT/stat-301/blob/master/materials/worksheet_08/img/popul_AB.png?raw=true)

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
library(broom)
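# Install latex2exp only if it is not already available
if (!requireNamespace("latex2exp", quietly = TRUE)) install.packages("latex2exp")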
library(latex2exp)
library(tidyverse)
library(repr)
library(digest)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
source("tests_worksheet_08.R")

dat <- read.csv("data/Assessment_2015.csv")
dat <- dat %>% filter(ASSESSCLAS=="Residential")  %>% 
        mutate(assess_val = ASSESSMENT / 1000)

#### NOTE 1: re-scaling ASSESSMENT

The variable `ASSESSMENT` was divided by 1000 to work with smaller numbers. The transformed values are stored in `assess_val` and are used as the response variable. 

#### NOTE 2: population parameters

Unless we work with a simulated dataset, the true population parameters are *unknown*. Instead of simulating data and *just* to illustrate concepts, we'll use all the residences in the dataset to obtain a LR and pretend that this is the *population* line.

> **NOTE**: this is not done in a real data analysis!

We'll use a random sample to estimate the true LR and use it to predict

In [None]:
# A smaller sample
set.seed(561)
dat_s <- sample_n(dat, 100, replace = FALSE)

In [None]:
lm_p <- lm(assess_val ~ BLDG_METRE, dat)
lm_s <- lm(assess_val ~ BLDG_METRE, dat_s)

tidy(lm_s)  %>% mutate_if(is.numeric, round, 3)

In [None]:
cols <- c("Population"="#f04546","LS Estimate"="#3591d1")

plot_sample <- ggplot(data=dat_s,aes(BLDG_METRE, assess_val)) + 
  xlab("building size (mts)")+ 
  ylab("assessed value ($/1000)") +
  xlim(50,450)+
  geom_point(aes(BLDG_METRE, assess_val), color="grey")

### Prediction of the assessed value of a house in Strathcona

The assessed value of a random house in Strathcona can be modelled as the average assessed value of a house with similar characteristics plus some random error

Mathematically:

$$Y_i = E[Y_i|X_{i}] + \varepsilon_i$$

> a random residence won't have a value exactly equal to the average population value of residences of the same size: some have higher values and others have lower values 

In addition, we have assumed that the conditional expectation is linear. Then,

<font color=red> $$ E[Y_i|X_{i}] = \beta_0 + \beta_1 X_{i}$$ </font>

Since we are pretending that we know this population line, we can plot it

> **NOTE**: recall that in practice this line is *unknown*

In [None]:
plot_expect <- plot_sample +
    geom_segment(x = 251, y = predict(lm_p,data.frame(BLDG_METRE = 251)),xend = 251,yend = 534, linetype = "dashed")+
    geom_point(aes(x = 251,y = 534), color = "black", size = 3)+
    geom_text(aes(x = 251,y = 490,label = TeX(r"($y_i$)", output = "character")), size = 5,parse = TRUE)+
    geom_point(aes(x = 251,y = predict(lm_p,data.frame(BLDG_METRE=251))),color = "red", size = 3)+  
    geom_text(aes(x = 280,y = 730,label = TeX(r"($E(Y_i|X_i)$)", output = "character")), color = "red", size = 5,parse = TRUE)+
    geom_text(aes(x = 265,y = 600,label = TeX(r"($e_i$)", output = "character")), size = 5,parse = TRUE)+
    geom_smooth(data = dat,aes(BLDG_METRE, assess_val, color = "Population"),method = lm, 
                linetype = 2, se = FALSE, fullrange=TRUE)+
    scale_colour_manual(name="SLR",values=cols)

In [None]:
plot_expect

#### Estimated Linear Regression:

In practice, we use the random sample to estimate the regression line!

> Based on a random sample of houses from Strathcona, we estimate the relation between the assessed value  of a house and its size. We use the estimated relation to predict the value of any house in the county

 Then we can use the <font color=blue> estimated LR (blue line) </font> to predict.

The prediction of the $i$-th observation is given by:

<font color=blue> $$ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i} $$ </font>

In [None]:
plot_ls <- plot_expect +    
    geom_smooth(data = dat_s,aes(BLDG_METRE, assess_val, color="LS Estimate"),method=lm, se = FALSE,fullrange=TRUE)+
    geom_point(aes(x=251,y=predict(lm_s,data.frame(BLDG_METRE=251))), color="blue", size=3)+
    geom_text(aes(x=251,y=predict(lm_s,data.frame(BLDG_METRE=251))+50,label=TeX(r"($\hat{y}_i$)", output = "character")), color="blue", size=5,parse = TRUE)

In [None]:
plot_ls

### What do we want to predict with <font color=blue> $\hat{Y}_i$ </font>?

- **(A)** the *average* assessed value of a house of *this size*: <font color=red> $E[Y_i|X_i]$ </font>

- **(B)** the *actual* value of a house of *this size*: $Y_i$ (knowing its size $X_i$)

#### Note that we predict both with *uncertainty*!! 

> Which one do you think is more difficult to predict??

### Intervals to describe uncertainty

#### (A) Confidence Intervals for Prediction (CIP)

The uncertainty comes from the estimation 

> only 1 source of variation!

The predicted value <font color=blue> $ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i}$ </font> approximates, with uncertainty, the <font color=red> population $ E[Y_i| X_{i}] = \beta_0 + \beta_1 X_{i}$ </font>

> because the estimated coefficients <font color=blue> $\hat{\beta}_0$ and $\hat{\beta}_1$ </font> are estimates, *approximations*, of the true population coefficients <font color=red> $\beta_0$ and $\beta_1$ </font>, respectively

> if we take a different sample, we get: different estimates, different blue lines, and different predictions!

#### A **95% confidence interval for prediction** is a range that with 95% probability contains the *average* value of a house of *this size* 

> note that once we have estimated values and a numerical range based on the sample we use the word "confidence" (instead of "probability") since nothing else is random
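To make the single source of variation concrete, here is a hedged sketch that builds a 95% CIP by hand and compares it with `predict()`. It assumes the standard textbook formula for an SLR (the worksheet itself omits these details): $\hat{y}_0 \pm t_{0.975,\,n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$, where $s$ is the residual standard error. The data are simulated and all object names are illustrative.

```r
set.seed(2)

# Simulated sample (hypothetical values, not the Strathcona data)
n <- 40
x <- runif(n, 50, 450)
y <- 100 + 2.5 * x + rnorm(n, sd = 80)
fit <- lm(y ~ x)

# Point prediction at a new input value x0
x0 <- 250
y0_hat <- unname(coef(fit)[1] + coef(fit)[2] * x0)

# Standard error of the estimated conditional mean at x0
s <- sigma(fit)  # residual standard error
se_mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
t_crit <- qt(0.975, df = n - 2)

cip_manual <- c(fit = y0_hat,
                lwr = y0_hat - t_crit * se_mean,
                upr = y0_hat + t_crit * se_mean)

# Same interval from predict()
cip_predict <- predict(fit, newdata = data.frame(x = x0),
                       interval = "confidence", level = 0.95)

cip_manual
cip_predict
```

The interval is centered at the fitted value, and its width depends only on how precisely the line is estimated at `x0`.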

**A quick look at data**

Using the sample `dat_s`, let's compute 95% confidence intervals for prediction using the function `predict`. 

- Create a dataframe, called `dat_cip`, that contains the response, the input, the predictions using `lm_s`, and the lower and upper bounds of the intervals for *each* observation

> each row corresponds to one (in-sample) prediction and its confidence interval

In [None]:
dat_cip <- dat_s  %>% 
    select(assess_val,BLDG_METRE) %>% 
    cbind(predict(lm_s,interval="confidence",se.fit=TRUE)$fit)

In [None]:
head(dat_cip,3)

#### Interpretation:

*Row 1*: with 95% confidence, the *expected* value of a house of size 220 mts is between \\$671944 and \\$748198 (rounded)

#### Visualization

In [None]:
ggplot(data=dat_s,aes(BLDG_METRE, assess_val)) + 
        xlab("building size (mts)")+ 
        ylab("assessed value ($/1000)") +
        xlim(50,450)+ 
        geom_smooth(data = dat,aes(BLDG_METRE, assess_val, color="Population"),method=lm, se = FALSE,fullrange=TRUE)+
        geom_smooth(data = dat_s,aes(BLDG_METRE, assess_val, color="LS Estimate"),method=lm, se = TRUE,fullrange=TRUE)+
        scale_colour_manual(name="SLR",values=cols)

**Question 1.0** 
{points: 1}

Using the sample `dat_s`, compute 90% confidence intervals for prediction. Create a dataframe, called `dat_cip_90`, that contains the response, the input, the predictions using `lm_s`, and the lower and upper bounds of the intervals for each observation. Columns in your dataframe should be in this order.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# dat_cip_90 <- ...  %>% 
#    select(...,...) %>% 
#    cbind(...(..., interval = "...", level = ..., se.fit=TRUE)$fit)

# head(dat_cip_90)
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1** 
{points: 1}

Based on the output `dat_cip_90`, which of the following claims is correct?

**A.** with 90% confidence, the *expected* value of a house of size 97 mts is between \\$301274 and \\$363407 (rounded) 

**B.** with 90% confidence, the value of a house of size 97 mts is between \\$301274 and \\$363407 (rounded) 


**C.** with 90% confidence, the *expected* value of a house of size 97 mts is between \\$678167 and \\$741974 (rounded) 


**D.** with 90% confidence, the *expected* value of any house is between \\$678167 and \\$741974 (rounded) 


*Assign your answer to an object called `answer1.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2** 
{points: 1}

True or false?

Based on the outputs `dat_cip_90` and `dat_cip`, CIP are centered at the fitted value  <font color=blue>$\hat{Y}_i$ </font>

*Assign your answer to an object called `answer1.2`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# answer1.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3** 
{points: 1}

True or false?

The 90% confidence intervals for prediction are wider than the 95% confidence intervals for prediction

*Assign your answer to an object called `answer1.3`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# answer1.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

#### (B) Prediction Intervals (PI)

The predicted value <font color=blue> $ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i}$ </font> also approximates, with uncertainty, an actual observation $ Y_i = \beta_0 + \beta_1 X_{i} + \varepsilon_i$

The uncertainty comes from the estimation and from the error term that generates the data!

> 2 sources of variation! more uncertainty!!

> **uncertainty 1**: because the estimated value <font color=blue> $\hat{\beta}_0 + \hat{\beta}_1 X_i$ </font> *approximates* the average (population) value $\beta_0 + \beta_1 X_i$ 

> **uncertainty 2**: because the actual observation $Y_i$ differs from the average (population) value by an error $\varepsilon_i$

- PI are centered at the fitted value  <font color=blue>$\hat{Y}_i$</font>, but they are wider than the CIP (more uncertainty)!


> classical intervals are based on the $t$-distribution (we omit details)

#### **A 95% prediction interval** is a range that with  95% probability contains the *actual value* of a house of *this size* 

> as mentioned before, for a particular interval based on an *observed* random sample, we replace "probability" by "confidence"
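The extra source of variation shows up as an additional "1 +" term under the square root. The sketch below assumes the standard textbook formula for an SLR prediction interval, $\hat{y}_0 \pm t_{0.975,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$, and checks it against `predict()`. The data are simulated and the object names are illustrative.

```r
set.seed(2)

# Simulated sample (hypothetical values, not the Strathcona data)
n <- 40
x <- runif(n, 50, 450)
y <- 100 + 2.5 * x + rnorm(n, sd = 80)
fit <- lm(y ~ x)

x0 <- 250
y0_hat <- unname(coef(fit)[1] + coef(fit)[2] * x0)
s <- sigma(fit)
t_crit <- qt(0.975, df = n - 2)

# Estimation uncertainty only (CIP) vs. estimation + error term (PI)
lev <- 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2)
se_mean <- s * sqrt(lev)
se_pred <- s * sqrt(1 + lev)

pi_manual <- c(fit = y0_hat,
               lwr = y0_hat - t_crit * se_pred,
               upr = y0_hat + t_crit * se_pred)
pi_predict <- predict(fit, newdata = data.frame(x = x0),
                      interval = "prediction", level = 0.95)

# Widths of the two intervals at the same x0
width_cip <- 2 * t_crit * se_mean
width_pi <- 2 * t_crit * se_pred
c(width_cip = width_cip, width_pi = width_pi)
```

Because `se_pred` is always larger than `se_mean`, the PI is wider than the CIP at every input value.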

**A quick look at data**

> each row corresponds to one (in-sample) prediction and its prediction interval

In [None]:
dat_pi <- dat_s  %>% 
    select(assess_val,BLDG_METRE) %>% 
    cbind(predict(lm_s,interval="prediction"))

In [None]:
head(dat_pi,3)

#### Interpretation:

*Row 1*: with 95% confidence, the value of a house of size 220 mts is between \\$454519 and \\$965622 (rounded)

**Question 1.4** 
{points: 1}

Let's use the results in `dat_cip` and `dat_pi` to corroborate that the prediction intervals are wider than the confidence intervals for prediction.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# cipWider <- sum((...$lwr - ...$lwr) > 0) == nrow(...)
# piWider <- sum((...$upr - ...$upr) > 0) == nrow(...)


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()

## Conclusions Part I: Prediction uncertainty

- Confidence intervals for prediction account for the uncertainty given by the estimated LR to predict the conditional expectation of the response


- Prediction intervals account for the uncertainty given by the estimated LR to predict the conditional expectation of the response, *plus* the error that generates the data! 


- PI are wider than CIP, both are centered at the fitted value!

![](https://github.com/UBC-STAT/stat-301/blob/master/materials/worksheet_08/img/pred_error.png?raw=true)

# PART II: Model Selection

#### Do we need all the available input variables in the model?

<font color="blue">**This week we will focus on selecting a subset of variables to be included in the model**</font>

Some datasets contain *many* variables but not all are relevant

> you may want to identify the *most relevant* variables to build a model

But again: 
    
#### <font color=red>  What is your goal?? </font>
    
> inference vs prediction
    
To decide if a variable (or set of variables) is relevant or not we need to choose an evaluation metric
    
As we discussed last week, the evaluation metric used depends on the goal of the analysis!!

## 2. Variable selection for generative models

In this section we will focus on different selection and estimation methods when the goal is to *estimate and make inference about* the model that generated the data. We'll refer to these models as ***generative models***.

<font color="blue">**Last week we learned different ways to evaluate our models when the main goal is estimation and inference**</font>

For a LR with an intercept and estimated by LS:

- The <font color="blue">$R^2$</font>, coefficient of determination, can be used to measure the part of the variation in the response explained by the estimated model


- The <font color="blue">adjusted $R^2$</font> can be used to compare the fit of estimated models of different sizes


- The <font color="blue">MSE</font> (based on in-sample data) can be used to compare the observed values with those predicted by the estimated model  


- The <font color="blue">$F$</font>-tests can be used to select variables by comparing nested models

#### How can we use these measures to select a model when the main goal is estimation and inference?

- ###  The $F$-test

The <font color="blue">$F$</font>-test can be used to test the simultaneous significance of *additional* terms in the full model (that are not in the reduced model)

#### $$H_0: \beta_{q+1} = \beta_{q+2} = \ldots = \beta_s=0$$

versus the alternative

#### $$H_1: \text{ at least one of the coefficients in the questionable subset is different from 0}$$

#### Thus, we can use the results of these tests to establish a *selection rule* to evaluate sets of variables
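As a sketch of how such a rule could look in R, the example below compares a reduced and a full model with `anova()` and reproduces the $F$ statistic by hand. The data are simulated so that the additional variables (`x2`, `x3`) are irrelevant by construction; all names are illustrative.

```r
set.seed(3)

# Simulated data: only x1 is related to the response
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

reduced <- lm(y ~ x1, data = d)
full <- lm(y ~ x1 + x2 + x3, data = d)

# H0: the coefficients of x2 and x3 are both 0
f_test <- anova(reduced, full)
f_test

# Same F statistic by hand: 2 additional terms, 4 coefficients in the full model
rss_r <- sum(resid(reduced)^2)
rss_f <- sum(resid(full)^2)
f_manual <- ((rss_r - rss_f) / 2) / (rss_f / (n - 4))
f_manual
```

A large p-value in the `anova()` table would suggest keeping the reduced model; a small one would favour the full model.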

- ### The $t$- tests:

In previous lectures we evaluated the contribution of individual variables to explain a response using $t$-tests calculated by `lm` and given in the `tidy` table:

$$H_0: \beta_j = 0,  \text{ versus }  H_1: \beta_j \neq 0$$ 

> $H_0$ contains only *one* coefficient

The results of the $t$-tests evaluate the contribution of *each* variable (separately) to explain the variation observed in the response ***with all other variables included and held constant*** in the model!!

> sometimes referred to as: "after controlling for other explanatory variables"

#### Thus, we can use the results of these tests to establish a *selection rule* to evaluate variables one at a time:

> for example: discard variables with *p-values* above a threshold

> *caution note 1*: if there are many variables in the model (i.e., $p$ is large), using individual $t$-tests may result in many false discoveries (i.e., erroneously rejecting a true $H_0$)

> *caution note 2*: the *training* set is used (over and over) to select variables, so it can't be used again to assess the significance of the final model. This problem is known as the *post-selection inference* problem
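The false-discovery issue can be illustrated with pure noise. In this hedged sketch, the response is generated independently of all 40 predictors, so every rejection at the 5% level is a false discovery. The sizes `n` and `p` and all names are arbitrary choices for illustration.

```r
set.seed(4)

# 40 predictors that are unrelated to the response by construction
n <- 200
p <- 40
X <- as.data.frame(matrix(rnorm(n * p), n, p))
X$y <- rnorm(n)

fit <- lm(y ~ ., data = X)

# p-values of the individual t-tests (dropping the intercept row)
p_vals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

# Number of variables that a "p-value < 0.05" rule would keep
n_false <- sum(p_vals < 0.05)
n_false
```

With 40 true null hypotheses tested at the 5% level, about $0.05 \times 40 = 2$ false discoveries are expected on average.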

- ### The $R^2$ (or the RSS) and the adjusted $R^2$:

Can we compare the coefficients of determination of two models to select the best one?

The RSS never increases as more variables are included in the model!! 

> so the $R^2$ of a bigger model is *always* at least as large as the $R^2$ of any model nested in it! <font color=red> *regardless of the relevance of the variables added* </font>
    
To overcome this problem the **adjusted $R^2$** has been proposed:

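$$ \text{adjusted } R^2 = 1- \frac{RSS/(n-p-1)}{TSS/(n-1)} $$ 

where $p$ is the number of input variables (so the model has $p+1$ coefficients, including the intercept).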

    
Thus,

#### You can use the $R^2$ to compare models of equal size (not necessarily nested) 

or 

#### You can use the adjusted $R^2$ to compare models of different sizes (not necessarily nested)   
    
> But we can't have tests based on these quantities because their sampling distribution is unknown.    
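The sketch below contrasts the two measures on simulated data by adding a pure-noise variable to a model. The adjusted $R^2$ is computed by hand, dividing the RSS by the residual degrees of freedom ($n$ minus the number of estimated coefficients, intercept included), and compared with the value reported by `summary()`. The helper `adj_r2()` and the data are hypothetical, introduced only for illustration.

```r
set.seed(5)

# Simulated data: `noise` is unrelated to the response by construction
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

small <- lm(y ~ x1 + x2, data = d)
big <- lm(y ~ x1 + x2 + noise, data = d)

# Adjusted R^2 by hand (illustrative helper)
adj_r2 <- function(fit) {
  y <- model.response(model.frame(fit))
  rss <- sum(resid(fit)^2)
  tss <- sum((y - mean(y))^2)
  1 - (rss / df.residual(fit)) / (tss / (length(y) - 1))
}

data.frame(
  model = c("small", "big"),
  R2 = c(summary(small)$r.squared, summary(big)$r.squared),
  adj_R2_manual = c(adj_r2(small), adj_r2(big)),
  adj_R2_lm = c(summary(small)$adj.r.squared, summary(big)$adj.r.squared)
)
```

The $R^2$ of the bigger model cannot be smaller, whereas the adjusted $R^2$ penalizes the extra coefficient and may go down.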

## 3. Variable selection for predictive models

How do we evaluate the predictive performance of a model?

- **Mean Squared Error**: MSE$_{\text{test}} = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$

> *Test MSE*: where $y_i$ are new responses from the test set and $\hat{y}_i$ are predicted values using the LR estimated with training data

- **Root Mean Squared Error**:

$$\text{RMSE}_{\text{test}} = \sqrt{\frac{1}{n}\sum_{i = 1}^n(y_i - \hat{y}_i)^2}$$

 

- **$R^2$**: $R^2 = 1 - \frac{RSS}{TSS}$

> Can we compute the $R^2$ on the test set??
    
Yes, as we mentioned for the MSE, the $R^2$ can be computed for new responses in a test set $y_{new}$ compared to the predicted values obtained using the trained LR, $\hat{y}_{new}$   
    
> some functions compute the $R^2$ from a validation set or using cross validation (perhaps seen in other courses)   
    
> however, note that it is *no longer the coefficient of determination*. It measures the correlation between the true and the predicted responses *in a test set*   
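These test-set metrics can be computed directly from their definitions. The following is a minimal sketch on simulated data with a 60/40 split; the data, the split, and the object names are illustrative and unrelated to the worksheet's questions.

```r
set.seed(6)

# Simulated data
n <- 200
d <- data.frame(x = runif(n, 0, 10))
d$y <- 5 + 2 * d$x + rnorm(n, sd = 3)

# 60% training, 40% test (sampling without replacement)
train_idx <- sample(n, size = 120)
train <- d[train_idx, ]
test <- d[-train_idx, ]

# Train on the training set, predict on the test set
fit <- lm(y ~ x, data = train)
pred_test <- predict(fit, newdata = test)

mse_test <- mean((test$y - pred_test)^2)
rmse_test <- sqrt(mse_test)
r2_test <- cor(test$y, pred_test)^2  # squared correlation in the test set

c(MSE_test = mse_test, RMSE_test = rmse_test, R2_test = r2_test)
```

The RMSE is on the same scale as the response, which often makes it easier to interpret than the MSE.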

Some metrics, such as $C_p$, AIC and BIC, have been proposed to approximate the *test MSE* but are computed with the training set. 

#### You can use these measures to select variables of predictive models, even without using a test set.

## Dataset

In this section we will work with a real estate dataset, the [Ames `Housing` dataset](https://www.kaggle.com/c/home-data-for-ml-course/), compiled by Dean De Cock. It has 79 input variables on different characteristics of residential houses in Ames, Iowa, USA that can be used to predict the property's final price, `SalePrice`. We will use 19 numerical input variables: the 18 listed below and `ageSold`, which is constructed from two additional variables.

- `LotFrontage`: Linear $\text{ft}$ of street connected to the house.
- `LotArea`: Lot size in $\text{ft}^2$.
- `MasVnrArea`: Masonry veneer area in $\text{ft}^2$.
- `TotalBsmtSF`: Total $\text{ft}^2$ of basement area.
- `GrLivArea`: Above grade (ground) living area in $\text{ft}^2$.
- `BsmtFullBath`: Number of full bathrooms in basement.
- `BsmtHalfBath`: Number of half bathrooms in basement.
- `FullBath`: Number of full bathrooms above grade.
- `HalfBath`: Number of half bathrooms above grade.
- `BedroomAbvGr`: Number of bedrooms above grade (it does not include basement bedrooms).
- `KitchenAbvGr`: Number of kitchens above grade.
- `Fireplaces`: Number of fireplaces.
- `GarageArea`: Garage's area in $\text{ft}^2$.
- `WoodDeckSF`: Wood deck area in $\text{ft}^2$.
- `OpenPorchSF`: Open porch area in $\text{ft}^2$.
- `EnclosedPorch`: Enclosed porch area in $\text{ft}^2$.
- `ScreenPorch`: Screen porch area in $\text{ft}^2$.
- `PoolArea`: Pool area in $\text{ft}^2$.

The following variables will be used to construct a variable `ageSold`
- `YearBuilt`: Original construction date.
- `YrSold`: Year sold.

Run this code to prepare a working dataset

In [None]:
Housing <- read_csv("data/Housing.csv")

# Use `YearBuilt` and `YrSold` to create a variable `ageSold`
Housing$ageSold <- Housing$YrSold - Housing$YearBuilt


# Select subset of input variables
Housing <- Housing %>%
  select(
    LotFrontage, LotArea, MasVnrArea, TotalBsmtSF, 
    GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, Fireplaces,
    GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, ScreenPorch, PoolArea, ageSold, SalePrice
  )

# Remove those rows containing `NA`s and some outliers
Housing <- drop_na(Housing)  %>% 
            filter(LotArea < 20000)

str(Housing)

We'll first split this dataset into a *training* and a *test* set following these steps:

1. Create an `ID` column in `Housing` (i.e., `Housing$ID`) with the row number corresponding to each house in the sample.

2. Use the function `sample_n()` to create `training_Housing` (**sampling WITHOUT replacement**) with 60\% of the observations from `Housing`.

3. Use `anti_join()` with `Housing` and `training_Housing` to create `testing_Housing` by column `ID`.

Run this code to split the dataset `Housing`

In [None]:
set.seed(1234)

Housing$ID <- 1:nrow(Housing)
training_Housing <- sample_n(Housing, size = nrow(Housing) * 0.60,
  replace = FALSE
)

testing_Housing <- anti_join(Housing,
  training_Housing,
  by = "ID"
)

head(training_Housing, 3)
nrow(training_Housing)

head(testing_Housing, 3)
nrow(testing_Housing)

### Estimating a MLR

- Estimate **an additive** MLR using *all* **input** variables in `Housing` using `training_Housing`.

> **Note**: you don't need to enter the input variables manually in `lm`! and remember to drop the ID variable 

Run this code to fit a MLR using `lm` and call it `Housing_full_OLS`.

In [None]:
Housing_full_OLS <- lm(SalePrice ~ ., data = training_Housing[,-21])

tidy(Housing_full_OLS) 

- Using `predict()` and `Housing_full_OLS`, obtain the **out-of-sample predictions** for `testing_Housing`. Store them in a variable called `Housing_test_pred_full_OLS`.

> **Note**: if you enter the input variables manually in `lm`, follow the order in the dataset

Run this code to obtain the out-of-sample predictions.

In [None]:
Housing_test_pred_full_OLS <- predict(Housing_full_OLS, newdata = testing_Housing[, -21])
head(Housing_test_pred_full_OLS)

**Question 3.0**
<br>{points: 1}

Use the function `rmse()` to compute the $\text{RMSE}_{\text{test}}$ for the predictions in `Housing_test_pred_full_OLS` with respect to the observed `SalePrice` in the test set (`testing_Housing$SalePrice`). Put this metric in a tibble called `Housing_R_MSE_models` with two columns:

- `Model`: The regression model from which we will obtain the prediction accuracy.
- `R_MSE`: The $\text{RMSE}_{\text{test}}$ corresponding to the model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Housing_R_MSE_models <- tibble(
#   Model = "OLS Full Regression",
#   R_MSE = ...(
#        preds = ...,
#        actuals = ...)
# )
# Housing_R_MSE_models

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

## 4. An automated procedure for model selection

When we don't have any idea about which variables should be included in the model, ideally, we want to select the best model out of *all possible models* of all possible sizes. 

> For example: if the dataset has 2 explanatory variables $X_1$ and $X_2$, there are 4 models to compare: (1) an intercept-only model, (2) a model with only $X_1$, (3) a model with only $X_2$, and (4) a model with both $X_1$ and $X_2$. 

However, the number of *all possible* models becomes too large rapidly, even for a small set of variables

> there are a total of $2^p$ models from a set of $p$ variables

> if $p = 20$ (20 available explanatory variables) we need to evaluate more than a million models! 
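The growth of $2^p$ is easy to tabulate. The values of `p` below are chosen only for illustration.

```r
# Number of possible models for different numbers of available variables
p <- c(2, 5, 10, 20, 30)
n_models <- 2^p

data.frame(p = p, n_models = n_models)
```

With 30 variables the count already exceeds one billion, so fitting every model quickly becomes impractical.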

There are methods to search more efficiently for a good model (although they may not find the "best" one out of all possible):

- **Forward selection**: 
Image from [ISLR](https://www.statlearning.com)

![](https://github.com/UBC-STAT/stat-301/blob/master/materials/worksheet_07/img/forward.png?raw=true)

1. Start with the intercept only model: $y_i = \beta_0 + \varepsilon_i$

> remember that $\hat{\beta}_0 = \bar{y}$ from the training samples, so $\hat{y}_{0} = \bar{y}$ for any observation (from the training or the test set)

2. Select and add variables sequentially 

**Size 1** Evaluate all models of size 1, choose the best model of size 1 (based on RSS, equal size models), call it $\mathcal{M}_1$. 

**Size 2** *Starting with the best size 1 model*, add 1 variable to create a (expanded) model of size 2. Repeat for all remaining variables and evaluate all expanded models of size 2. Choose the best model of size 2 (based on RSS), call it $\mathcal{M}_2$.

> note that there are more models of size 2 that we are not evaluating since 1 variable has been already chosen in the previous step


$\ldots$ continue until you reach the full model


**Size p** there's only one full model, call it $\mathcal{M}_p$.

> note that we can stop this iteration earlier if we want a model of a predetermined size

3. Now we have to select the best out of the $p$ selected models: $\mathcal{M}_1$ (the best model of size 1), $\mathcal{M}_2$ (the best expanded model of size 2), $\ldots, \mathcal{M}_p$ (the full model of size $p$)

> you can't use the RSS to compare models of different sizes

**Depending on the goal of the study you can use**:
    
 - the adjusted $R^2$ to select *generative models*
    
or

 - the test MSE, $C_p$, AIC or BIC to select *predictive models*
 
You can learn more about these measures in [ISLR](https://www.statlearning.com)
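The forward selection steps above can be sketched by hand in base R. This is an illustrative implementation under stated assumptions, not the code used by `leaps`: it uses the built-in `mtcars` dataset with `mpg` as the response, the helper `rss()` is hypothetical, and the final choice among $\mathcal{M}_1, \ldots, \mathcal{M}_p$ uses the adjusted $R^2$.

```r
# Built-in dataset, used purely for illustration
dat <- mtcars
response <- "mpg"
candidates <- setdiff(names(dat), response)

# RSS of the LS fit using the given input variables (illustrative helper)
rss <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  fit <- lm(as.formula(paste(response, "~", rhs)), data = dat)
  sum(resid(fit)^2)
}

selected <- character(0)
path <- data.frame(size = integer(0), added = character(0), RSS = numeric(0))
n_fits <- 1  # step 1: the intercept-only model

# Step 2: add one variable at a time, keeping those already selected
while (length(candidates) > 0) {
  rss_try <- sapply(candidates, function(v) rss(c(selected, v)))
  n_fits <- n_fits + length(candidates)
  best <- names(which.min(rss_try))  # smallest RSS among equal-size models
  selected <- c(selected, best)
  candidates <- setdiff(candidates, best)
  path <- rbind(path, data.frame(size = length(selected), added = best,
                                 RSS = min(rss_try)))
}

# Step 3: compare the selected models of different sizes (adjusted R^2 here)
adj_r2 <- sapply(seq_along(selected), function(k) {
  f <- as.formula(paste(response, "~", paste(selected[1:k], collapse = " + ")))
  summary(lm(f, data = dat))$adj.r.squared
})
best_size <- which.max(adj_r2)

path
selected[1:best_size]
c(models_fitted = n_fits, all_subsets = 2^(ncol(dat) - 1))
```

With 10 candidate variables, forward selection fits 56 models in total (1 intercept-only model plus 10 + 9 + ... + 1), compared with $2^{10} = 1024$ for an exhaustive search.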



Other selection procedures include:

- **Backward selection**: start with the full model and remove variables, one at a time


- **Hybrid selection**: after adding a variable, the method may also remove variables 


### 4.1 Selecting a smaller model

The OLS model uses all input variables to estimate a generative model. However, as we see from the results table, not all the terms in this regression are statistically significant and this may not be the best predictive model either. 

You may want to **select a smaller subset of variables** that better explains the variation in `SalePrice` or that predicts it better.

In the following questions you will use the forward selection algorithm to select a smaller model. We will compute different metrics so that we can examine different types of models.

#### Algorithms in R

Both the **forward** and **backward** selection algorithms are implemented in R by the function `regsubsets()` from library `leaps`. 

- The argument `x` of `regsubsets()` is analogous to `formula` in `lm()`. 

- The argument `nvmax` indicates the maximum size of the input set to be used in the variable selection.

This function identifies, for each model size, the subset of input variables that provides the best model of that size. Choosing among those models is then done with the metrics returned by `summary()`.

**Forward selection**

Let's select some of the input variables in `Housing` using the **forward selection** algorithm aiming for a strong generative model. 

Create one object using `regsubsets()` with `training_Housing`: `Housing_forward_sel`. This object has to indicate the selected model for each model size, from **1 to 19 input variables** (check argument `nvmax`).

> **Note**: if you enter the input variables manually, follow the order in the dataset. It is easier to use `.`, even if the dataset contains the response variable

*Run the code below to select the best nested models of each size*

In [None]:
Housing_forward_sel <- regsubsets(
  x = SalePrice ~ ., nvmax = 19,
  data = training_Housing[,-21],
  method = "forward"
)

housing_forward_summary <- summary(Housing_forward_sel)
housing_forward_summary

You can see that: 

- variables are selected one at a time.

- once the variable is in the model, it stays and another variable is selected

- the algorithm continues until it builds a model of size `nvmax`

**Final selection**

Out of the 19 possible models obtained with forward selection and stored in `Housing_forward_sel`, we can select the best one in terms of its *goodness of fit*. 

Let's store and examine different evaluation metrics contained in `housing_forward_summary`. Construct a tibble called `housing_forward_summary_df`. This object should contain the following columns:

- `n_input_variables`: the number of input variables in each selected model (from 1 to 19).

- `RSQ`: the $R^2$ of each model

- `RSS`: the RSS of each model

- `ADJ.R2`: the adjusted $R^2$ of each model

- `Cp`: the $C_p$ of each model

- `BIC`: the Bayesian Information Criterion of each model

*Run the following code to evaluate the best models of each size*

In [None]:
housing_forward_summary_df <- tibble(
    n_input_variables = 1:19,
    RSQ = housing_forward_summary$rsq,
    RSS = housing_forward_summary$rss,
    ADJ.R2 = housing_forward_summary$adjr2,
    Cp = housing_forward_summary$cp,
    BIC = housing_forward_summary$bic,
)
housing_forward_summary_df

You can see how the $R^2$ increases with more variables in the model. However, its adjusted version will start decreasing after 15 variables are selected. 

**The forward algorithm would select a generative model with 15 variables using the adjusted $R^2$**

**The forward algorithm would select a predictive model with 13 variables using the $C_p$**
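Rather than reading the best size off the table, it can be extracted programmatically. The sketch below uses the built-in `mtcars` dataset (not the housing data) and assumes the `leaps` package is installed; the object names are illustrative.

```r
library(leaps)

# Forward selection on a built-in dataset, for illustration only
fwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
fwd_summary <- summary(fwd)

# Size of the selected model under each criterion
best_adjr2 <- which.max(fwd_summary$adjr2)  # largest adjusted R^2
best_cp <- which.min(fwd_summary$cp)        # smallest Cp
best_bic <- which.min(fwd_summary$bic)      # smallest BIC

c(adjr2 = best_adjr2, cp = best_cp, bic = best_bic)

# Estimated coefficients of the model chosen by Cp
coef(fwd, id = best_cp)
```

Different criteria need not agree on the size of the model, which is why the goal of the analysis should guide the choice of metric.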

We can **visualize** how these measures change as variables are added to the selected model with the function `plot()`. 

Run this code to plot the $C_p$ of the models selected by the forward selection algorithm 

In [None]:
plot(summary(Housing_forward_sel)$cp,
  main = "Cp for forward selection",
  xlab = "Number of Input Variables", ylab = "Cp", type = "b", pch = 19,
  col = "red"
)

#### Prediction performance of the selected predictive model

Once we have a selected model we can train it using `lm()` with the training dataset and use it to predict values of the residences in the test set. 

Run this code to train the selected model and use it to predict in the test set

In [None]:
# Estimation

Housing_red_OLS <- lm(SalePrice ~ LotArea + MasVnrArea + TotalBsmtSF + GrLivArea + BsmtFullBath +
                        BedroomAbvGr + KitchenAbvGr + Fireplaces + GarageArea + WoodDeckSF +
                        EnclosedPorch + PoolArea + ageSold,
  data = training_Housing
)

# Prediction
Housing_test_pred_red_OLS <- predict(Housing_red_OLS, newdata = testing_Housing[, 1:19])
head(Housing_test_pred_red_OLS)

**Question 4.0**
<br>{points: 1}

Use the function `rmse()` to compute the $\text{RMSE}_{\text{test}}$ of the predictions from the new predictive model, `Housing_test_pred_red_OLS`, using the test set `testing_Housing`. Add this metric as another row in the tibble `Housing_R_MSE_models` with `"OLS Reduced Regression"` in the column `Model` and the corresponding $\text{RMSE}_{\text{test}}$ in column `R_MSE`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Housing_R_MSE_models <- rbind(
#   Housing_R_MSE_models,
#   tibble(
#     Model = ...,
#     R_MSE = ...(
#          preds = ...,
#          actuals = ...
#       )
#     )
#   )
# Housing_R_MSE_models

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_4.0()