# Tutorial 9: Prediction and Model Selection

#### Lecture and Tutorial Learning Goals:

By the end of this section, students will be able to:

- Explain the difference between confidence intervals for prediction and prediction confidence intervals and what elements need to be estimated to construct these intervals.

- Write a computer script to calculate these intervals. Interpret and communicate the results from that computer script.

- Give an example of a question that can be answered by predictive modelling.

- Explain the algorithms for the following variable selection methods:
  - Forward selection
  - Backward selection

- Explain when a linear regression is an appropriate model to predict new outcomes based on new values of the input variables.

- List model metrics that are suitable for evaluation of a statistical model developed for the purpose of predictive modelling (e.g., RMSE), as well as how they are calculated.

- Discuss how different estimation methods can result in different predictions.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(broom)
library(repr)
library(infer)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
source("tests_tutorial_09.R")

## 1. Prediction CI *versus* CI for Prediction

In previous lectures, we learned how to estimate LR models and used them to make inferences about the population parameters. In this lecture, we will learn different concepts related to *prediction*.

> **Heads up**: It is important to distinguish *in-sample* prediction from *out-of-sample* prediction.

We have seen different measures that compare the *in-sample* values of the response with their corresponding predicted values from an LR to evaluate the goodness of fit of the model.

In this first section we are going to recognize and measure the *uncertainty* of these predictions.

Let us start by loading the dataset to be used throughout this tutorial. We will use the dataset `fat` from the library `faraway`. You can find detailed information about it in [Johnson (1996)](https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910505). This dataset contains the percentage of body fat and a whole variety of body measurements (continuous variables) of 252 men. We will use the variable `brozek` as the response variable and a subset of 14 variables to build different models.

Run the code below to create the working data frame called `fat_sample`.

In [None]:
fat_sample <- fat %>%
  select(
    brozek, age, weight, height, adipos, neck, chest, abdom,
    hip, thigh, knee, ankle, biceps, forearm, wrist
  )

head(fat_sample,3)

The response variable `brozek` is the percent of body fat using Brozek's equation:

$$\texttt{brozek} = \frac{457}{\texttt{density}} - 414.2,$$

where body `density` is measured in $\text{g}/\text{cm}^3$.
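
As a quick check of the arithmetic, Brozek's equation can be written as a one-line R function. This is only a sketch: the function name `brozek_from_density` and the density value used below are illustrative assumptions, not objects from the `faraway` package.

```r
# Hypothetical helper (not part of `faraway`): Brozek's equation as defined above.
brozek_from_density <- function(density) {
  457 / density - 414.2
}

# Illustrative body density in g/cm^3, chosen only for this example.
example_density <- 1.07
example_brozek <- brozek_from_density(example_density)

# Percent body fat implied by the example density.
round(example_brozek, 2)
```

Because `density` appears in the denominator, a higher body density gives a lower value of `brozek`.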

The 14 input variables are:

- `age`: Age in $\text{years}$.
- `weight`: Weight in $\text{lb}$.
- `height`: Height in $\text{in}$.
- `adipos`: Adiposity index in $\text{kg}/\text{m}^2$.

$$\texttt{adipos} = \frac{\texttt{weight}}{\texttt{height}^2}$$

- `neck`: Neck circumference in $\text{cm}$.
- `chest`: Chest circumference in $\text{cm}$.
- `abdom`: Abdomen circumference at the umbilicus and level with the iliac crest in $\text{cm}$.
- `hip`: Hip circumference in $\text{cm}$.
- `thigh`: Thigh circumference in $\text{cm}$.
- `knee`: Knee circumference in $\text{cm}$.
- `ankle`: Ankle circumference in $\text{cm}$.
- `biceps`: Extended biceps circumference in $\text{cm}$.
- `forearm`: Forearm circumference in $\text{cm}$.
- `wrist`: Wrist circumference distal to the styloid processes in $\text{cm}$.

**Question 1.0**
<br>{points: 1}

Let's start by building an SLR using only `weight` to predict `brozek`.

Use the `lm()` function to estimate the SLR. Store this estimated model in the variable `SLR_fat`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_fat <- ...(..., ...)
# SLR_fat

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

In previous lectures, we have learned how to obtain and interpret confidence intervals for the regression parameters. 

Since the predictions are functions of the estimated LR, they also depend on the sample used! A different sample would have resulted in a different estimated LR and different predictions! As discussed for the estimation of the regression parameters, we can obtain confidence intervals that take into account the sample-to-sample variation of the predictions as well!
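
To see this sample-to-sample variation, here is a minimal sketch that refits an SLR on many resampled versions of a dataset and predicts at the same input each time. It uses the built-in `mtcars` data rather than `fat`, and resampling with replacement is used here only as a stand-in for drawing new samples from the population. The names `new_car` and `resampled_predictions` are made up for this illustration.

```r
# Sketch on the built-in `mtcars` data (not the tutorial's `fat` data).
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical new input: a car weighing 3 (in 1000 lb).
new_car <- data.frame(wt = 3)

# Refit the SLR on B resamples and predict at the same input each time.
B <- 200
resampled_predictions <- replicate(B, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  fit <- lm(mpg ~ wt, data = mtcars[rows, ])
  unname(predict(fit, newdata = new_car))
})

# The spread of these predictions reflects sample-to-sample variation.
summary(resampled_predictions)
sd(resampled_predictions)
```

Each resample gives a slightly different fitted line, so the prediction at the same input changes from one resample to the next.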

There are two types of intervals we can construct depending on the quantity we want to predict: *confidence intervals for prediction* and *prediction confidence intervals*.

> **Heads up**: Isn't this confusing?? 

Let's start by computing *confidence intervals for prediction*. These are intervals to predict the *average* brozek index for men of different weights. 
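
As an illustration on a different dataset, the sketch below computes 95% confidence intervals for the mean response with `predict()` on the built-in `cars` data (stopping distance versus speed). The object names `cars_fit` and `cars_cip` are assumptions made for this example only.

```r
# Sketch on the built-in `cars` data, not on `fat_sample`.
cars_fit <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the mean response at each observed speed.
cars_cip <- predict(cars_fit, interval = "confidence", level = 0.95)

# Columns returned: fit (prediction), lwr and upr (interval bounds).
head(cbind(cars, cars_cip), 3)
```

Each row gives an interval for the *average* stopping distance of cars travelling at that speed, not for the distance of one particular car.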

Using `SLR_fat` and `predict`, obtain the asymptotic 95% CIP (confidence intervals for prediction). Create a dataframe, called `fat_cip`, that contains the response, the input, the predictions, and the lower and upper bounds of the intervals for each observation **in that order from left-to-right**. 

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_cip <- fat_sample  %>% 
#    select(..., ...) %>% 
#    cbind(predict(...,interval="confidence",se.fit=TRUE)$fit)  %>% 
#    mutate_if(is.numeric, round, 3)
# head(fat_cip)


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

We have just calculated the 95% confidence interval for the mean brozek index for men of different weights in our sample. 

Provide a brief interpretation for the 95% confidence interval for prediction you have calculated in row 1.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.3**
<br>{points: 1}

Let's now compute and interpret *prediction confidence intervals*. These are intervals to predict the (actual) brozek index for men of different weights.  

You can use `SLR_fat` and `predict` again to obtain the asymptotic 95% PI (prediction intervals) changing the argument `interval`. Create a dataframe, called `fat_pi`, that contains the response, the input, the predictions, and the lower and upper bounds of the intervals for each observation, **in that order from left to right**.

> **Heads up**: Read the warning message! Since your goal is to predict an actual value, it is important to note that these predictions are made on the data used to fit the model, not on a test set.
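
For contrast, the sketch below computes prediction intervals at *new* input values by passing `newdata` to `predict()`. It uses the built-in `cars` data, and the speeds in `new_speeds` are hypothetical values picked for illustration. When `newdata` is supplied, `predict()` does not issue the warning about predictions on current data.

```r
# Sketch on the built-in `cars` data, not on `fat_sample`.
cars_fit <- lm(dist ~ speed, data = cars)

# Hypothetical new inputs that play the role of unseen observations.
new_speeds <- data.frame(speed = c(10, 15, 20))

# 95% prediction intervals for the actual response at the new inputs.
cars_pi <- predict(cars_fit,
  newdata = new_speeds,
  interval = "prediction", level = 0.95
)

cbind(new_speeds, cars_pi)
```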

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_pi <- fat_sample  %>% 
#    select(..., ...) %>% 
#    cbind(predict(...,interval="prediction",se.fit=TRUE)$fit)  %>% 
#    mutate_if(is.numeric, round, 3)
# head(fat_pi)


# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

We have just calculated the 95% prediction interval for the brozek index of men of different weights in our sample. 

Provide a brief interpretation for the 95% prediction interval you have calculated in row 1.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.5**
<br>{points: 1}

Compare the confidence intervals computed in **Question 1.1** with those computed in **Question 1.3** (by row). Which intervals are wider? Respond and explain why in one or two sentences.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 2. Predictive Modelling using Linear Regression

In this section you will use the LR as a *predictive model*. Predictive models are built and trained to predict *new* observations. Thus, we need two types of datasets: a *training* set and a *test* set. 

If two independent datasets are not available to build a predictive model, we can:

- approximate the *test* MSE

or 

- use the data in hand and split it to create these datasets.

In this section, you will split the data to build a predictive model on one part using all available variables and test it on the second part of the data.
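
The idea of a random split can be sketched in base R on the built-in `mtcars` data. This is a different route from the `sample_n()` and `anti_join()` steps used in the exercise below; the seed and the names `train_rows`, `training_cars`, and `testing_cars` are assumptions for this illustration.

```r
# Sketch of a 70-30 split using base R indexing on `mtcars`.
set.seed(2024)  # arbitrary seed, for reproducibility of this sketch only

n <- nrow(mtcars)

# Draw 70% of the row indices without replacement.
train_rows <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)

# Rows drawn go to the training set; the remaining rows form the test set.
training_cars <- mtcars[train_rows, ]
testing_cars <- mtcars[-train_rows, ]

c(train = nrow(training_cars), test = nrow(testing_cars))
```

No row appears in both sets, so the test set can be treated as data the model has not seen.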

**Question 2.0**
<br>{points: 1}

Let's start by randomly splitting `fat_sample` into two sets on a 70-30% basis: `training_fat` (70% of the data) and `testing_fat` (the remaining 30%). In the next question, we will train a full LR with all the available input variables on the training set.

You can do the following:

1. Create an `ID` column in `fat_sample` (i.e., `fat_sample$ID`) with the row number corresponding to each man in the sample.

2. Use the function `sample_n()` to create `training_fat` (sampling *without* replacement) with 70\% of the observations coming from `fat_sample`.

3. Use `anti_join()` with `fat_sample` and `training_fat` to create `testing_fat` by column `ID`. 

4. Remove the variable `ID` used to split the data.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(123) # DO NOT CHANGE!

# fat_sample$ID <- rownames(fat_sample)
# training_fat <- ...(..., size = nrow(fat_sample) * 0.70,
#   replace = ...
# )

# testing_fat <- anti_join(...,
#   ...,
#   by = ...
# )

# training_fat <- training_fat %>% select(-"ID")
# testing_fat <- testing_fat %>% select(-"ID")

# head(training_fat)
# nrow(training_fat)

# head(testing_fat)
# nrow(testing_fat)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

Let's start by building a predictive additive LR with *all* **14** inputs. Call this object `fat_full_OLS`. 

Estimate an additive LR with *all* **14** inputs against the response variable `brozek`  using `lm()` and data from `training_fat`. 

> **If you write down the input variables, the order should match the column order from `training_fat` to pass the autograding tests**.

This will be our baseline model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_full_OLS <- lm(...,
#   ...
# )
# fat_full_OLS

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

Using `predict()` and `fat_full_OLS`, obtain the (out-of-sample) predicted brozek values for men in `testing_fat`. 

> `testing_fat` will be used as independent *test data*.

Store them in a variable called `fat_test_pred_full_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_test_pred_full_OLS <- ...(..., newdata = ...)
# head(fat_test_pred_full_OLS)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

**Question 2.3**
<br>{points: 1}

We will now compute the **Root Mean Squared Error (RMSE)** using data from the test set to evaluate the predictive model. This metric has the same units as the response; and the smaller the value, the better the model.
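
For $n$ test observations with observed values $y_i$ and predicted values $\hat{y}_i$, the metric is $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The sketch below computes it by hand on made-up numbers; `rmse_by_hand` is a hypothetical helper written for this illustration and is not the `rmse()` function from `mltools`.

```r
# Hypothetical helper: RMSE computed directly from its definition.
rmse_by_hand <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up observed and predicted values, for illustration only.
observed <- c(10, 12, 15, 20)
predicted <- c(11, 12, 13, 22)

# Errors are -1, 0, 2, -2, so the mean squared error is 9 / 4 = 2.25.
rmse_by_hand(observed, predicted) # 1.5
```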

Use the function `rmse()` from the `mltools` package to compute the $\text{RMSE}_{\text{test}}$ based on the *predicted* brozek values stored in `fat_test_pred_full_OLS` for men in the test set. Note that the observed brozek values for these men are in `testing_fat$brozek`.

Store this metric in a tibble called `fat_RMSE_models` with two columns:

- `Model`: The regression model from which we will obtain the prediction accuracy.
- `RMSE`: The $\text{RMSE}_{\text{test}}$ corresponding to the model.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_RMSE_models <- tibble(
#   Model = "OLS Full Regression",
#   RMSE = ...(
#     ...,
#     ...
#   )
# )
# fat_RMSE_models

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3()

## 3. Selecting a predictive model

The previous model uses all input variables to predict. However, we may want to select a smaller model by using only a subset of the input variables. The *stepwise selection* algorithms presented in worksheet_09 can be used to build predictive models. 

A good predictive model would be one that minimizes the *test* MSE. However, we cannot use the same set to select the model and evaluate its performance.

Metrics such as $C_p$, AIC and BIC are computed with the *training* set and can be used to *approximate* the *test* MSE, without looking at the *test* data. 

The test set will then be used *only* to assess the predictive performance of the selected model.
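
As a small illustration of this workflow, the sketch below compares two candidate models using AIC and BIC computed on a training set only. It uses the built-in `mtcars` data and the `AIC()` and `BIC()` functions from base R's `stats` package; the split, the seed, and the two candidate formulas are arbitrary choices made for this example.

```r
# Sketch on `mtcars`: selection criteria computed on training data only.
set.seed(42)  # arbitrary seed, for reproducibility of this sketch only
train_rows <- sample(nrow(mtcars), size = 22)
training_cars <- mtcars[train_rows, ]

# Two candidate models of different sizes, both fit on the training set.
small_fit <- lm(mpg ~ wt, data = training_cars)
large_fit <- lm(mpg ~ wt + hp + qsec + drat, data = training_cars)

# Smaller values of AIC or BIC indicate the preferred model.
criteria <- data.frame(
  model = c("small", "large"),
  AIC = c(AIC(small_fit), AIC(large_fit)),
  BIC = c(BIC(small_fit), BIC(large_fit))
)
criteria
```

The held-out rows are never touched during this comparison; they would be used once, at the end, to assess the chosen model.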

**Question 3.0**
<br>{points: 1}

Using only the training data in `training_fat`, select a reduced LR using the **forward selection** algorithm. Recall that this method is implemented in the function `regsubsets()` from library `leaps`.

The function `regsubsets()` identifies various subsets of input variables selected for models of different sizes. The argument `x` of `regsubsets()` is analogous to `formula` in `lm()`. 
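
To recall how the forward search works, here is a bare-bones sketch written in base R on the built-in `mtcars` data. It is a teaching illustration of the greedy idea and not the implementation used by `leaps`; the helper `rss_of` and the list of candidate inputs are assumptions made for this example.

```r
# Greedy forward search on `mtcars`; a sketch, not the `leaps` implementation.
response <- "mpg"
candidates <- c("wt", "hp", "qsec", "drat", "disp")  # arbitrary subset of inputs
selected <- character(0)

# Hypothetical helper: residual sum of squares of an LR with the given inputs.
rss_of <- function(inputs) {
  fit <- lm(reformulate(inputs, response = response), data = mtcars)
  sum(residuals(fit)^2)
}

while (length(candidates) > 0) {
  # Try adding each remaining input to the current model.
  rss_if_added <- sapply(candidates, function(v) rss_of(c(selected, v)))
  # Keep the input that gives the smallest RSS.
  best <- names(which.min(rss_if_added))
  selected <- c(selected, best)
  candidates <- setdiff(candidates, best)
  cat(length(selected), "input(s):", paste(selected, collapse = ", "), "\n")
}
```

The search produces one model per size. A criterion such as $C_p$ or BIC is then needed to choose among these sizes, since RSS alone always favours the largest model.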

Create one object using `regsubsets()` with `training_fat` and call it `fat_forward_sel`. We will use `fat_fwd_summary` to check your results.

> **Maintain the order of columns seen in `training_fat`**

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_forward_sel <- ...(
#   ..., ...,
#   ...,
#   ...
# )
# fat_forward_sel

# fat_fwd_summary <- summary(fat_forward_sel)

# fat_fwd_summary <- tibble(
#   n_input_variables = 1:14,
#   RSS = fat_fwd_summary$rss,
#   BIC = fat_fwd_summary$bic,
#   Cp = fat_fwd_summary$cp
# )

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

**Question 3.1**
<br>{points: 1}

Out of the fourteen best models selected for each size by the *forward* selection algorithm and stored in `fat_forward_sel`, we will select the best one in terms of the *out-of-sample* prediction accuracy, estimated by Mallows' $C_p$.

Use the $C_p$ computed for each model, stored in `fat_fwd_summary`, to select the best predictive model and indicate which input variables are in the selected model.

> **Heads up:** The most accurate model will have the smallest $C_p$. 


**A.** `age`.

**B.** `weight`.

**C.** `height`.

**D.** `adipos`.

**E.**  `neck`.

**F.**  `chest`.

**G.**  `abdom`.

**H.**  `hip`.

**I.**  `thigh`.

**J.**  `knee`.

**K.**  `ankle`.

**L.**  `biceps`.

**M.**  `forearm`.

**N.**  `wrist`.

*Assign your answers to the object `answer3.1`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes.*

In [None]:
# Run this cell before continuing.

fat_fwd_summary
summary(fat_forward_sel)

In [None]:
# answer3.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1()

**Question 3.2**
<br>{points: 1}

Use the variables selected by the forward subset algorithm to build a *predictive* model. 

1. Identify the size of the model that minimizes the $C_p$ and call it `cp_min`.

2. Find the name of the variables for the best model of size `cp_min`, selected by the forward algorithm. Store them in an object called `selected_var`. Do not include the intercept with the variable names. 

3. Select only those columns and the response `brozek` from `training_fat`. Call the reduced data frame `training_subset`.

> The previous step allows you to conveniently fit `lm` on all variables in the data, except the response. Note that the test set can include additional variables that won't be used to predict if not included in the model.

4. Train the predictive model using `lm()` and the reduced `training_subset` data. Call it `fat_red_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# cp_min <- which.min(...$Cp)
# selected_var <- names(...(fat_forward_sel, ...))[-1]

# training_subset <- training_fat %>% select(all_of(selected_var),brozek)

# fat_red_OLS <- ...(...,
#   ...
# )

# summary(fat_red_OLS)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

Use the trained model `fat_red_OLS` to predict the responses of the test set `testing_fat`, and call the resulting object `fat_test_pred_red_OLS`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_test_pred_red_OLS <- ...(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.3()

**Question 3.4**
<br>{points: 1}

Use the function `rmse()` to compute the RMSE of predicted brozek values of men in the test set stored in `fat_test_pred_red_OLS`. Add this metric as another row in the tibble `fat_RMSE_models` with `"OLS Reduced Regression"` in the column `Model` and the corresponding $\text{RMSE}_{\text{test}}$ in column `RMSE`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_RMSE_models <- rbind(
#   fat_RMSE_models,
#   tibble(
#     Model = ...,
#     RMSE = ...
#     )
#   )
# fat_RMSE_models

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.4()

**Question 3.5**
<br>{points: 1}

Based on your results in `fat_RMSE_models`, which model has the best *out-of-sample* prediction performance?

**A.** OLS Full Regression.

**B.** OLS Reduced Regression.

*Assign your answer to an object called `answer3.5`. Your answer should be one of `"A"` or `"B"` surrounded by quotes.*

In [None]:
# answer3.5 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.5()