# **Worksheet 09: Prediction Uncertainty**

## **Lecture and Tutorial Learning Goals:**

By the end of this section, students will be able to:

- Explain the difference between confidence intervals for prediction and prediction intervals and what elements need to be estimated to construct these intervals.

- Write a computer script to calculate these intervals. Interpret and communicate the results from that computer script.

- Give an example of a question that can be answered by predictive modelling.


In [None]:
# Loading Libraries

library(broom)
library(latex2exp)
library(tidyverse)

source("tests_worksheet_09.R")

## 1. Prediction Intervals *vs* Confidence Intervals for prediction

In previous lectures you've learned how to estimate three regression models (linear, logistic and Poisson). The estimated models can be used to predict the expected values of the response given additional covariates. 

The estimated model depends on the random sample used. Thus, the predictions also depend on the random sample, which means that they are random variables (functions of a random sample). 

Thus, we can compute the sampling distribution and the standard error of the prediction random variable!
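
As a minimal sketch of this idea, the simulation below draws many samples from a made-up population line and records the prediction at one covariate value each time. The population coefficients, the error SD, and the point `x0 = 250` are illustrative assumptions, not values from the Strathcona data.

```r
# Simulated illustration (all numbers below are assumptions, not real data)
set.seed(301)

beta0 <- 90      # hypothetical population intercept
beta1 <- 2.5     # hypothetical population slope
sigma <- 100     # hypothetical SD of the error term
x0    <- 250     # covariate value at which we predict

# Draw one sample of size n, fit an SLR, and return the prediction at x0
predict_once <- function(n = 100) {
    x <- runif(n, min = 50, max = 450)
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
    fit <- lm(y ~ x)
    unname(predict(fit, newdata = data.frame(x = x0)))
}

# Repeat over many samples to approximate the sampling distribution
preds <- replicate(1000, predict_once())

mean(preds)  # should be close to beta0 + beta1 * x0 = 715
sd(preds)    # simulation-based standard error of the prediction
hist(preds, main = "Sampling distribution of the prediction at x0")
```

Each replicate plays the role of "a different sample", so the spread of `preds` is the sample-to-sample variation of the prediction.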

> A different sample would have resulted in a different estimated model and, thus, different predictions. The sample-to-sample variation in the estimated coefficients translates into variation in the predictions. 
  
Today, we will measure the **uncertainty** of the predictions for MLR using two types of intervals: 

- confidence intervals for prediction (CIP)

- prediction intervals (PI)

> CIP can be constructed for any of the regression models we studied, using `se.fit = TRUE` in the `predict()` function.

> PI are not defined for all GLM models. Although some bootstrapping methods have been proposed, we won't cover PI for logistic and Poisson regression.
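
For a GLM, `predict()` has no `interval` argument, so the standard errors from `se.fit = TRUE` have to be turned into an interval by hand. The sketch below shows one common approximate approach (a Wald-type interval built on the link scale and then back-transformed); it uses simulated Poisson data with made-up coefficients and is not necessarily the only way to do this.

```r
# Simulated Poisson data (coefficients are illustrative assumptions)
set.seed(301)
x <- runif(200, min = 0, max = 2)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))

pois_fit <- glm(y ~ x, family = poisson)

# Predictions and standard errors on the link (log) scale
new_x <- data.frame(x = c(0.5, 1, 1.5))
pred  <- predict(pois_fit, newdata = new_x, type = "link", se.fit = TRUE)

# Approximate 95% interval on the link scale, back-transformed with exp()
z <- qnorm(0.975)
cip <- data.frame(
    x     = new_x$x,
    fit   = exp(pred$fit),
    lower = exp(pred$fit - z * pred$se.fit),
    upper = exp(pred$fit + z * pred$se.fit)
)
cip
```

Building the interval on the link scale keeps the bounds inside the valid range of the mean (positive for Poisson counts).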

## **1.1. <u>Dataset: </u>[<u>2015 Property Tax Assessment from Strathcona County</u>](https://data.strathcona.ca/Housing-Buildings/2015-Property-Tax-Assessment/uexh-8sx8)**

In this worksheet, we'll work with a dataset containing data on property tax-assessed values in Strathcona County. The dataset provides a valuation date of July 1, 2014, and a property condition date of December 31, 2014. 

![](https://github.com/UBC-STAT/stat-301/blob/master/supplementary-material/img/popul_AB.png?raw=true)

Let's start by loading the data, but to work with smaller numbers, we will divide the assessed value by 1000. 

In [None]:
# Load the data and rescale the assessed values.

properties_data <- 
    read.csv("data/Assessment_2015.csv") %>%
    filter(ASSESSCLAS=="Residential")  %>% 
    mutate(assess_val = ASSESSMENT / 1000)

Unfortunately, unless we work with a simulated dataset, the true population parameters are *unknown*. However, to illustrate concepts while working with a real dataset, and to avoid simulating data, we'll use all the residences in the dataset to fit an SLR and **pretend** that this is the population line.

> <font color='darkred'>**THIS IS NOT DONE IN REAL DATA ANALYSIS!** We are doing this here for teaching purposes only. In practice, your entire dataset is your sample. </font>

So, let's select a sample from our `properties_data`.

In [None]:
set.seed(561) # DO NOT CHANGE THIS.

# A sample out of our bigger sample properties_data, 
# which we are PRETENDING to be population (but IT ISN'T)

properties_sample <- 
    properties_data %>%
    slice_sample(n = 100, replace = FALSE)

Next, 

- **estimate** the population SLR using the sample `properties_sample`

- **compute** the population coefficients of the SLR using all properties in `properties_data`

> again, in practice you would not be able to *compute* the population coefficients

In [None]:
lm_pop <- lm(assess_val ~ BLDG_METRE, properties_data)
lm_sample <- lm(assess_val ~ BLDG_METRE, properties_sample)

tidy(lm_sample) %>% 
    mutate_if(is.numeric, round, 3)

In [None]:
cols <- c("Population"="#f04546","LS Estimate"="#3591d1")

plot_sample <- 
    properties_sample %>%
    ggplot(aes(BLDG_METRE, assess_val)) + 
    xlab("building size (mts)") + 
    ylab("assessed value ($/1000)") +
    xlim(50, 450) +
    geom_point(aes(BLDG_METRE, assess_val), color="grey")

### **1.2 Prediction of the assessed value of a house in Strathcona**

Using linear regression, the assessed value of a random house in Strathcona can be modelled as the average assessed value of a house with similar characteristics plus some random error.

Mathematically,

$$\text{value}_i = E[\text{value}_i|\text{size}_{i}] + \varepsilon_i$$

The $\varepsilon_i$ term is necessary because a random residence won't have a value exactly equal to the average population value of residences of the same size; some have higher values, and others have lower values. 

In addition, if we assume that the conditional expectation is linear, then:

$$ E[\text{value}_i|\text{size}_{i}] = \beta_0 + \beta_1 \text{size}_{i}$$

which is the population regression line. 

We use the random sample to estimate the regression line. In this case, we estimate the relation between a house's assessed value and size based on a random sample of houses from Strathcona. 

We use the **estimated** SLR to **predict**:

$$\text{pred.value}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{size}_i$$

**It is important to distinguish the following quantities:**

- actual value: $\text{value}_i$

- population average (or conditional expected) value: <font color='red'>$E[\text{value}_i|\text{size}_{i}]$</font>

- predicted value <font color='blue'>$\text{pred.value}_i$</font>
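
To make the three quantities concrete, here is a small simulated sketch. The "population" coefficients and error SD are assumptions chosen for illustration; with real data the conditional expectation is not available.

```r
# Simulated illustration; the "population" values below are assumptions
set.seed(301)
beta0 <- 90
beta1 <- 2.5
sigma <- 100

# A random sample used to estimate the line
size  <- runif(100, min = 50, max = 450)
value <- beta0 + beta1 * size + rnorm(100, sd = sigma)
fit   <- lm(value ~ size)

# One new house of size 251
new_size <- 251

# Population average value E[value | size]: unknown in practice
cond_expect <- beta0 + beta1 * new_size

# Actual value: conditional expectation plus a random error
actual_value <- cond_expect + rnorm(1, sd = sigma)

# Predicted value: computed from the estimated line
pred_value <- unname(predict(fit, data.frame(size = new_size)))

c(actual = actual_value, expected = cond_expect, predicted = pred_value)
```

The predicted value is typically close to the conditional expectation, while the actual value can be far from both because of the error term.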

Since we are **pretending** that we know this population line, we can plot it. But once again, recall that in practice this line is *unknown*.

In [None]:
# Run this before continuing.

# Don't worry about reading and understanding this code. 
# You can just run and skip it.

options(repr.plot.width=8, repr.plot.height=6)

plot_expect <- 
    plot_sample +
    geom_smooth(data = properties_data, 
                aes(BLDG_METRE, assess_val, color = "Population"),
                method = lm, 
                linetype = 2, 
                se = FALSE, 
                fullrange = TRUE) +
    geom_point(aes(x = 251, y = predict(lm_pop, tibble(BLDG_METRE = 251))), 
               color = "red", 
               size = 3) +  
    annotate('text',
             x = 300,
             y = 715,
             label = "paste('E[',Y[i],' | ', X[i],' ]')", 
             color = "red", 
             size = 7,
             parse = TRUE) +
    geom_point(aes(x = 251,y = 534), color = "black", size = 3) +
    annotate("text",
             x = 265, 
             y = 500, 
             label = 'paste(y[i])', 
             size = 7, 
             parse = TRUE) +
    geom_segment(x = 251, 
                 y = predict(lm_pop, tibble(BLDG_METRE = 251)),
                 xend = 251, 
                 yend = 534, 
                 linetype = "dashed") +
    annotate("text", 
             x = 240,
             y = 630,
             label = 'paste(epsilon[i])',
             size = 7,
             parse = TRUE) +
    scale_colour_manual(name = "SLR", values = cols) + 
    theme(text = element_text(size = 16))

plot_ls <- 
    plot_expect +    
    geom_smooth(data = properties_sample, 
                aes(BLDG_METRE, assess_val, color = "LS Estimate"),
                method = lm,
                se = FALSE,
                fullrange = TRUE) +
    geom_point(aes(x = 251, y = predict(lm_sample, tibble(BLDG_METRE = 251))),
               color = "blue",
               size = 3) +
    annotate('text',
             x = 240, 
             y = predict(lm_sample, tibble(BLDG_METRE = 251)) + 60, 
             label = 'paste(hat(y)[i])', 
             color = "blue", 
             size = 7,
             parse = TRUE)

plot_ls

#### **<u>Confidence Intervals for Prediction (CIP)</u>**

CIP are used when we want to predict $E[\text{value}_i|\text{size}_{i}]$ (conditional expectation)!

The predicted value <font color=blue> $\text{pred.value}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{size}_{i}$ </font> approximates, with uncertainty, the  population average value <font color=red>$E[\text{value}_i|\text{size}_{i}]$ </font> 

> if we take a different sample, we get different estimates (i.e., different blue lines) and, consequently, different predictions

<font color='darkred'>**The only source of variation here is the sample-to-sample variation**</font>

A **95% confidence interval for prediction** is a range computed by a procedure that, across repeated random samples, captures the **population average** value of a house with a given size 95% of the time.

Once we have estimated and predicted values, the range is non-random so we use the word **confidence** (instead of "probability") since nothing else is random!
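
As a sketch of what `predict()` does for a linear model, the interval can be rebuilt by hand as: fitted value ± (t quantile) × (standard error of the fitted value). The example uses R's built-in `cars` data purely for illustration (an assumption, not the worksheet's dataset) and relies on the usual normal-error assumptions of the linear model.

```r
# Built-in `cars` data used purely for illustration
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)

# Fitted value, its standard error, and the residual degrees of freedom
pred <- predict(fit, newdata = new_obs, se.fit = TRUE)

# 95% CIP: fitted value +/- t quantile * standard error of the fit
t_crit <- qt(0.975, df = pred$df)
cip_manual <- c(fit = unname(pred$fit),
                lwr = unname(pred$fit - t_crit * pred$se.fit),
                upr = unname(pred$fit + t_crit * pred$se.fit))

# Same interval requested directly from predict()
cip_builtin <- predict(fit, newdata = new_obs, interval = "confidence")

cip_manual
cip_builtin
```

Only `se.fit`, which reflects the estimation of the coefficients, enters the interval width.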

**A quick look at data**

Using the sample `properties_sample`, let's compute 95% confidence intervals for prediction using the function `predict`. 

- Create a dataframe, called `properties_cip`, that contains the response, the input, the predictions using `lm_sample`, and the lower and upper bounds of the intervals for *each* observation

> each row corresponds to one (in-sample) prediction and its confidence interval

In [None]:
properties_cip <- 
    properties_sample  %>% 
    select(assess_val, BLDG_METRE) %>% 
    cbind(predict(lm_sample, interval="confidence", se.fit=TRUE)$fit)

In [None]:
head(properties_cip,3)

**Interpretation** for row 1: 

With 95% confidence, the <font color=red>*average value*</font> of a house **of size 220 mts** is between $\$671,944$ and $\$748,198$ (rounded)

> note the *conditional statement*: "of size 220 mts", not any house!

**Visualization**

In [None]:
properties_sample %>%
    ggplot(aes(BLDG_METRE, assess_val)) + 
        xlab("building size (mts)") + 
        ylab("assessed value ($/1000)") +
        xlim(50,450) + 
        geom_smooth(aes(color="LS Estimate"), method = lm, se = TRUE, fullrange = TRUE) +
        geom_smooth(data = properties_data, 
                    aes(BLDG_METRE, assess_val, color="Population"),
                    method = lm,
                    se = FALSE,
                    fullrange = TRUE) +        
        scale_colour_manual(name="SLR",values=cols)

**Question 1.0** <br>
{points: 1}

Using the sample `properties_sample`, compute 90% confidence intervals for prediction. Create a data frame called `properties_cip_90` that contains the response, the input, the predictions using `lm_sample`, and the lower and upper bounds of the intervals for each observation. Columns in your data frame should be in this order.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
 # properties_cip_90 <- 
 #    ...  %>% 
 #    select(..., ...) %>% 
 #    cbind(...(..., 
 #              interval = "...", 
 #              level = ..., 
 #              se.fit=TRUE)$fit)

# your code here
fail() # No Answer - remove if you provide an answer

head(properties_cip_90)

In [None]:
test_1.0()

**Question 1.1** 
{points: 1}

Based on the output `properties_cip_90`, which of the following claims is correct?

**A.** with 90% confidence, the *expected* value of a house of size 97 mts is between \\$301274 and \\$363407 (rounded) 

**B.** with 90% confidence, the value of a house of size 97 mts is between \\$301274 and \\$363407 (rounded) 


**C.** with 90% confidence, the *expected* value of a house of size 97 mts is between \\$678167 and \\$741974 (rounded) 


**D.** with 90% confidence, the *expected* value of any house is between \\$678167 and \\$741974 (rounded) 


*Assign your answer to an object called `answer1.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

**Question 1.2** 
{points: 1}

True or false?

Based on the outputs `properties_cip_90` and `properties_cip`, CIP are centered at the fitted value  <font color=blue>$\hat{Y}_i$ </font>

*Assign your answer to an object called `answer1.2`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# answer1.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

**Question 1.3** 
{points: 1}

True or false?

The 90% confidence intervals for prediction are wider than the 95% confidence intervals for prediction

*Assign your answer to an object called `answer1.3`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# answer1.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

#### **<u> Prediction Intervals (PI)</u>**

PI are used when we want to predict the actual response of a new observation $Y_i$!

The predicted value <font color=blue> $\text{pred.value}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{size}_{i}$  </font> also approximates, with uncertainty, the actual response $\text{value}_i = \beta_0 + \beta_1 \text{size}_{i} + \varepsilon_i$. 


However, now the uncertainty comes from the estimation (sample-to-sample variability) and the error term that generates the data, <font color='darkred'>**two sources of uncertainty**</font>!!

A 95% prediction interval is a range within which the value of a new house of a given size is expected to fall with 95% probability. 

Note that this time, the aim is to predict the actual value, which is a random variable, thus we interpret the interval in terms of "probability".
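
The two sources of uncertainty can be seen by rebuilding a prediction interval by hand. In this sketch (again on R's built-in `cars` data, an illustrative assumption), the standard error for a new observation combines the standard error of the fit with the estimated SD of the error term, under the usual normal, constant-variance error assumptions.

```r
# Built-in `cars` data used purely for illustration
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)
pred <- predict(fit, newdata = new_obs, se.fit = TRUE)

# Source 1: estimation uncertainty (se.fit)
# Source 2: error term that generates the data (residual.scale)
se_pred <- sqrt(pred$se.fit^2 + pred$residual.scale^2)

t_crit <- qt(0.975, df = pred$df)
pi_manual <- c(fit = unname(pred$fit),
               lwr = unname(pred$fit - t_crit * se_pred),
               upr = unname(pred$fit + t_crit * se_pred))

# Same interval requested directly from predict()
pi_builtin <- predict(fit, newdata = new_obs, interval = "prediction")

pi_manual
pi_builtin
```

Because `se_pred` is always at least as large as `se.fit`, the prediction interval is never narrower than the corresponding CIP.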

**A quick look at data**

In [None]:
properties_pi <- 
    properties_sample  %>% 
    select(assess_val, BLDG_METRE) %>% 
    cbind(predict(lm_sample, interval="prediction"))

In [None]:
head(properties_pi, 3)

Each row corresponds to one (in-sample) prediction and its prediction interval.

<br>

**Interpretation** for row 1: with 95% probability, the value of a house of size 220 mts is between $\$454,519$ and $\$965,622$ (rounded).

<br>

**Question 1.4** 
{points: 1}

Let's use the results in `properties_cip` and `properties_pi` to corroborate that the prediction intervals are wider than the confidence intervals for prediction. You will:

1. calculate the length of the intervals for `properties_pi` (name the column `len_pi`) and `properties_cip` (name the column `len_cip`). 
2. `left_join` the data from both tibbles by `assess_val` and `BLDG_METRE` (you'll see a warning of many-to-many relationships that you can ignore for our purposes here). 
3. count how many of the `len_cip` are higher than `len_pi` using `summarise`. 
4. pull the value out of the tibble with `pull()` and store it in a variable called `n_cpi_wider`. 

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# properties_pi <-
#     properties_pi %>%
#     mutate(len_pi = ... - ...)

# properties_cip <-
#     ... %>%
#     ...(len_cip = ...)

# n_cpi_wider <-
#     properties_pi %>%
#     left_join(properties_cip, by = join_by(assess_val, BLDG_METRE)) %>%
#     summarise(sum(... <= ...)) %>%
#     pull()

# your code here
fail() # No Answer - remove if you provide an answer

n_cpi_wider

In [None]:
test_1.4()

## **1.3 Conclusions: Prediction uncertainty**

- Confidence intervals for prediction account for the uncertainty given by the estimated LR to predict the conditional expectation of the response


- Prediction intervals account for the uncertainty given by the estimated LR to predict the actual response, i.e., the conditional expectation of the response *plus* the error that generates the data! 


- PIs are wider than CIPs; both are centered at the fitted value!
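
These conclusions can be checked with a small coverage simulation. The sketch below assumes a made-up population line (all parameter values are illustrative assumptions) and estimates how often each interval captures its target across repeated samples.

```r
# Simulated illustration; the population values below are assumptions
set.seed(301)
beta0 <- 90
beta1 <- 2.5
sigma <- 100
x0    <- 250
true_mean <- beta0 + beta1 * x0   # conditional expectation at x0

one_rep <- function(n = 100) {
    x <- runif(n, min = 50, max = 450)
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
    fit <- lm(y ~ x)
    new <- data.frame(x = x0)
    cip  <- predict(fit, new, interval = "confidence")
    pint <- predict(fit, new, interval = "prediction")
    y_new <- true_mean + rnorm(1, sd = sigma)   # a new actual response at x0
    c(cip_covers_mean = cip[1, "lwr"]  <= true_mean && true_mean <= cip[1, "upr"],
      pi_covers_new   = pint[1, "lwr"] <= y_new     && y_new     <= pint[1, "upr"],
      cip_covers_new  = cip[1, "lwr"]  <= y_new     && y_new     <= cip[1, "upr"])
}

# Proportion of replicates in which each interval captures its target
coverage <- rowMeans(replicate(2000, one_rep()))
coverage
```

The first two proportions should be near 0.95, while the CIP should capture a new actual response far less often, since it ignores the error term.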

![](https://github.com/UBC-STAT/stat-301/blob/master/supplementary-material/img/pred_error.png?raw=true)