# **Tutorial 09: Prediction Uncertainty**

## **Lecture and Tutorial Learning Goals:**

By the end of this section, students will be able to:

- Explain the difference between confidence intervals for prediction and prediction intervals and what elements need to be estimated to construct these intervals.

- Write a computer script to calculate these intervals. Interpret and communicate the results from that computer script.

- Give an example of a question that can be answered by predictive modelling.

In [None]:
# Run this cell before continuing.
library(broom)
library(faraway)
library(tidyverse)

source("tests_tutorial_09.R")

## **1. Prediction Intervals *versus* Confidence Intervals for Prediction**

We have learned how to estimate regression models and use them to make inferences about population parameters. You will now focus on **prediction** using the estimated model.

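In predictive modelling, it is essential to distinguish **in-sample** prediction (predicting the response for observations that were used to fit the model) from **out-of-sample** prediction (predicting the response for new observations that were not used to fit it).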

In this tutorial, we will identify and quantify the *uncertainty* of these predictions.
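As a rough illustration of the in-sample versus out-of-sample distinction, the sketch below uses R's built-in `cars` data rather than the dataset of this tutorial. The 35/15 split, the seed, and the object names (`cars_train`, `cars_test`, `cars_fit`) are assumptions made only for this example.

```r
# Illustrative sketch on the built-in `cars` data (50 rows), not the tutorial data.
set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical split: 35 rows to fit the model, 15 rows held out.
train_rows <- sample(nrow(cars), size = 35)
cars_train <- cars[train_rows, ]
cars_test  <- cars[-train_rows, ]

cars_fit <- lm(dist ~ speed, data = cars_train)

# In-sample: predictions for the same observations used to fit the model.
in_sample_pred <- predict(cars_fit)

# Out-of-sample: predictions for observations the model has not seen.
out_of_sample_pred <- predict(cars_fit, newdata = cars_test)

length(in_sample_pred)
length(out_of_sample_pred)
```

Calling `predict()` without `newdata` returns the fitted values for the training rows, whereas passing `newdata` evaluates the estimated line at the supplied inputs.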

Let us start by loading the dataset to be used throughout this tutorial. We will use the dataset `fat` from the library `faraway`. You can find detailed information about it in [Johnson (1996)](https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910505). This dataset contains the percentage of body fat and a variety of body measurements (continuous variables) of 252 men. We will use the variable `brozek` as the response variable and a subset of 14 variables to build different models. 

Run the code below to create the working data frame called `fat_sample`.

In [None]:
fat_sample <- 
    fat %>%
    select(brozek, age:adipos, neck:wrist)

head(fat_sample,3)

The response variable `brozek` is the percentage of body fat computed using Brozek's equation:

$$\texttt{brozek} = \frac{457}{\texttt{density}} - 414.2,$$

where body `density` is measured in $\text{g}/\text{cm}^3$.
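Brozek's equation can be turned into a small helper, sketched below. The function name `brozek_from_density` and the density value are made up for illustration; they are not part of `faraway`.

```r
# Hypothetical helper implementing Brozek's equation as given above.
brozek_from_density <- function(density) {
  457 / density - 414.2
}

# Illustrative density in g/cm^3 (not taken from the dataset).
brozek_from_density(1.05)
```

Because `density` appears in the denominator, a higher body density maps to a lower estimated percentage of body fat.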

The 14 input variables are:

- `age`: Age in $\text{years}$.
- `weight`: Weight in $\text{lb}$.
- `height`: Height in $\text{in}$.
- `adipos`: Adiposity index in $\text{kg}/\text{m}^2$, computed with `weight` converted to $\text{kg}$ and `height` converted to $\text{m}$:

$$\texttt{adipos} = \frac{\texttt{weight}}{\texttt{height}^2}$$

- `neck`: Neck circumference in $\text{cm}$.
- `chest`: Chest circumference in $\text{cm}$.
- `abdom`: Abdomen circumference at the umbilicus and level with the iliac crest in $\text{cm}$.
- `hip`: Hip circumference in $\text{cm}$.
- `thigh`: Thigh circumference in $\text{cm}$.
- `knee`: Knee circumference in $\text{cm}$.
- `ankle`: Ankle circumference in $\text{cm}$.
- `biceps`: Extended biceps circumference in $\text{cm}$.
- `forearm`: Forearm circumference in $\text{cm}$.
- `wrist`: Wrist circumference distal to the styloid processes in $\text{cm}$.
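Since `weight` is recorded in $\text{lb}$ and `height` in $\text{in}$ while `adipos` is expressed in $\text{kg}/\text{m}^2$, computing the adiposity index from these columns requires a unit conversion. The helper below is a minimal sketch; the name `adiposity_index` and the input values are hypothetical, and the values stored in the dataset may have been rounded or computed slightly differently.

```r
# Hypothetical helper: adiposity index (kg/m^2) from weight in lb and height in in.
adiposity_index <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * 0.45359237  # 1 lb = 0.45359237 kg
  height_m  <- height_in * 0.0254      # 1 in = 0.0254 m
  weight_kg / height_m^2
}

# Illustrative values, not taken from the dataset.
adiposity_index(weight_lb = 180, height_in = 70)
```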

**Question 1.0**
<br>{points: 1}

Let's start by building an SLR using only `weight` to predict `brozek`.

Use the `lm()` function to estimate the SLR. Store this estimated model in the variable `SLR_fat`.

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# SLR_fat <- ...(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer

SLR_fat

In [None]:
test_1.0()

<hr>

In the first part of the course, we learned how to obtain and interpret confidence intervals for the regression parameters. Since the predictions are functions of the estimated model, they also depend on the sample used! We can therefore construct intervals that account for the uncertainty in the predictions.

There are two types of intervals we can construct, depending on the quantity we want to predict: *confidence intervals for prediction* and *prediction intervals*.

**Recall**:

- **Confidence interval for prediction:** an interval for the conditional average value $E[Y_i |X_i]$.
  
- **Prediction interval:** an interval for the actual value of a response $Y_i$. 
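For an SLR, the two intervals can be written out explicitly. The notation here is an assumption introduced for this sketch: $x_0$ is the input value of interest, $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is the prediction, $\hat{\sigma}$ is the estimated residual standard deviation, and $t_{n-2,\,1-\alpha/2}$ is the corresponding quantile of the $t$ distribution with $n-2$ degrees of freedom. Under the usual SLR assumptions, a $(1-\alpha)$ confidence interval for prediction is

$$\hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}},$$

and a $(1-\alpha)$ prediction interval for a new response at $x_0$ is

$$\hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$

The additional $1$ under the square root reflects the variability of the individual error term around the conditional mean.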

**Question 1.1**
<br>{points: 1}

Let's start by computing confidence intervals for prediction. These intervals predict the average `brozek` index for men of a given weight.

Using `SLR_fat` and `predict`, obtain the asymptotic 95% CIP (confidence intervals for prediction). 

Create a data frame called `fat_cip` that contains the response, the input, the predictions, and the lower and upper bounds of the intervals for each observation **in that order from left-to-right**. 
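To see what `predict()` does with `interval = "confidence"`, the sketch below reproduces one such interval by hand on R's built-in `cars` data (not the `fat` data). It assumes the standard SLR formula for the standard error of the estimated conditional mean; the names `cars_fit`, `x0`, and `manual_cip` are hypothetical.

```r
# Illustrative sketch on the built-in `cars` data, not the tutorial data.
cars_fit <- lm(dist ~ speed, data = cars)

x0 <- 10                                      # input value of interest
n <- nrow(cars)
sigma_hat <- summary(cars_fit)$sigma          # estimated residual standard deviation
sxx <- sum((cars$speed - mean(cars$speed))^2)

# Standard error of the estimated conditional mean at x0.
se_mean <- sigma_hat * sqrt(1 / n + (x0 - mean(cars$speed))^2 / sxx)

t_crit <- qt(0.975, df = n - 2)               # t quantile for a 95% interval
y0_hat <- unname(coef(cars_fit)[1] + coef(cars_fit)[2] * x0)

manual_cip <- c(fit = y0_hat,
                lwr = y0_hat - t_crit * se_mean,
                upr = y0_hat + t_crit * se_mean)

# Same interval from predict(); columns are fit, lwr, upr.
builtin_cip <- predict(cars_fit, newdata = data.frame(speed = x0),
                       interval = "confidence", level = 0.95)

all.equal(unname(manual_cip), unname(builtin_cip[1, ]))
```

The manual computation and `predict()` should agree up to numerical tolerance, which shows that the interval is built from a $t$ quantile and the standard error of the fitted mean.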

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_cip <- 
#    fat_sample %>% 
#    select(..., ...) %>% 
#    cbind(predict(...,interval="confidence",se.fit=TRUE)$fit) %>% 
#    mutate_if(is.numeric, round, 3)

# your code here
fail() # No Answer - remove if you provide an answer

head(fat_cip)

In [None]:
test_1.1()

**Question 1.2**
<br>{points: 1}

We have just calculated the 95% confidence intervals for the mean `brozek` index for men of the different weights in our sample.

Provide a brief interpretation of the 95% confidence interval for prediction you have calculated in row 1.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.3**
<br>{points: 1}

Let's now compute and interpret prediction intervals. These intervals predict the actual `brozek` index for men of different weights.

You can also use `SLR_fat` and `predict` to obtain the asymptotic 95% PI (prediction intervals), changing the argument `interval`. 

Create a data frame called `fat_pi` that contains the response, the input, the predictions, and the lower and upper bounds of the intervals for each observation, *in that order from left to right*.

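> Read the warning message: R notes that predictions on current data refer to *future* responses. Since your goal is to predict an actual value, it is important to note that these intervals are computed at the inputs of the observations used to fit the model, not at the inputs of a separate test set.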
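A prediction interval can also be reproduced by hand. The sketch below does so on R's built-in `cars` data (not the `fat` data), assuming the standard SLR formula for the standard error of a new response; the names `cars_fit`, `x0`, and `manual_pi` are hypothetical. Because `newdata` is supplied explicitly, `predict()` does not issue the warning mentioned above.

```r
# Illustrative sketch on the built-in `cars` data, not the tutorial data.
cars_fit <- lm(dist ~ speed, data = cars)

x0 <- 10                                      # input value for a new observation
n <- nrow(cars)
sigma_hat <- summary(cars_fit)$sigma          # estimated residual standard deviation
sxx <- sum((cars$speed - mean(cars$speed))^2)

# Standard error for a new response at x0: note the extra "1 +" term,
# which accounts for the error term of the new observation.
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x0 - mean(cars$speed))^2 / sxx)

t_crit <- qt(0.975, df = n - 2)               # t quantile for a 95% interval
y0_hat <- unname(coef(cars_fit)[1] + coef(cars_fit)[2] * x0)

manual_pi <- c(fit = y0_hat,
               lwr = y0_hat - t_crit * se_pred,
               upr = y0_hat + t_crit * se_pred)

# Same interval from predict(); columns are fit, lwr, upr.
builtin_pi <- predict(cars_fit, newdata = data.frame(speed = x0),
                      interval = "prediction", level = 0.95)

all.equal(unname(manual_pi), unname(builtin_pi[1, ]))
```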

*Fill out those parts indicated with ..., uncomment the corresponding code in the cell below, and run it.*

In [None]:
# fat_pi <- fat_sample  %>% 
#    select(..., ...) %>% 
#    cbind(predict(...,interval="prediction",se.fit=TRUE)$fit)  %>% 
#    mutate_if(is.numeric, round, 3)

# your code here
fail() # No Answer - remove if you provide an answer

head(fat_pi)

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

We have just calculated the 95% prediction intervals for the `brozek` index of men of the different weights in our sample.

Provide a brief interpretation of the 95% prediction interval you have calculated in row 1.


> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.5**
<br>{points: 1}

Compare the confidence intervals for prediction computed in **Question 1.1** with the prediction intervals computed in **Question 1.3** (by row). Which intervals are wider? Answer and explain why in one or two sentences.

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.