# Quantifying Total Error Around A Model

## Chapter 6.1-6.3 Overview Notebook

In [None]:
# teacher: enter a class_id here
class_id <- "teacher@myschool.com" # put any unique id here (e.g., teacher email address)

# run this to set up the notebook

# run this code to load the R packages we will use
#remotes::install_version("coursekata", "0.19.0", upgrade="never")
library(coursekata)
library(gridExtra)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# create two data frames: student1 and student2
time_sec <- c(7.4, 6, 7.5, 6.4, 6.7)
trial <- c(1:5)
student1 <- data.frame(trial, time_sec)

time_sec <- c(7.4, 6.7, 7.1, 6.2, 7.2, 6.4, 6.5, 7.2, 6.6, 7.4)
trial <- c(1:10)
student2 <- data.frame(trial, time_sec)

# set some variables for this notebook
condition <- "guess" # set as "timer" or "guess"
user_id <- Sys.getenv("JUPYTERHUB_USER") # this gets the user's Jupyter ID

# pull in gf_resid and gf_squaresid for now
source("https://raw.githubusercontent.com/UCLATALL/stopwatch/refs/heads/main/gf_resid.R")
source("https://raw.githubusercontent.com/UCLATALL/stopwatch/refs/heads/main/gf_square_resid.R")


## 1 Why Do We Build Models? To Reduce Error

When we build models in statistics, one of our big goals is to reduce error. No model is ever going to be perfect. Its predictions will always be a little off from the data. But some models do a better job than others.

So far we have learned to make a very simple model: the empty model (on the left). Later we will learn to make more complex models (like the one on the right). Both of these models are based on the same data points. Just looking at them, you may think that one model has more error than the other. But how much more?

In order to compare the two models accurately, we need a **measure** of how much total error there is around a model. Developing such a measure is the focus of this lesson, starting with the empty model.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.1-6.3-overview-empty-complex.jpg" width="90%">


<div class="discussion-question">
    
### 1.1 Discussion Question: Which model has less error and how can you tell? Why can't we just use the sum of the residuals as a measure of total error? 
    
</div>

## 2 Estimating Elapsed Time

In order to start developing a measure of error, we need some data we can model.  

### 2.1 Let’s collect our own!

In the cell below we have a stopwatch. But this is not a normal stopwatch. On this stopwatch, after you click the **Start** button you won't be able to read the elapsed time until you click **Stop**. Let's see how it works. Try using the stopwatch to guess when exactly 7 seconds are up. Try it a few times. 

In [None]:
# run this cell
IRdisplay::display_html(sprintf(
  '<iframe src="https://uclatall.github.io/stopwatch/stopwatch-v3.html?save_data=false&progress=hide" width="500" height="210" sandbox="allow-scripts allow-same-origin"></iframe>'
))

<div class="discussion-question">
    
### 2.2 Discussion Questions: Let's say someone tried the guessing game five times. Could you use their data to make a model to predict what their guess would be for a sixth time? 

- How would you make the model?
- What does it mean to "fit" a model? 
- What does it mean for a model to be the "best fit" for the data?
    
</div>

### 2.3 Let's collect 5 trials of data from a single student.

We need a volunteer to do 5 trials guessing an elapsed time of 7 seconds. The stopwatch below will collect the data from each trial and save it in a Google sheet.  

**Important: This time we've made the task a little harder.** You won’t get to see how well the volunteer did until all 5 trials are complete and we pull the data into R. This way we’ll have a small dataset we can use to fit the empty model and explore total error.

In [None]:
# then run this code to render the stopwatch
IRdisplay::display_html(sprintf(
  '<iframe src="https://uclatall.github.io/stopwatch/stopwatch-v3.html?condition=%s&class_id=%s&user_id=%s&trials=5&progress=hide&result=hide" width="500" height="210" sandbox="allow-scripts allow-same-origin"></iframe>',
  condition, class_id, user_id
))

### 2.4 How well did they do? Run the code below to import the data into R

The code below imports data for the five trials into an R data frame called `volunteer_student`.

In [None]:
# run this code to import the most recent session from your class
sheet_url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQRi4VBEFYGz_HnJXBsxlawbqM4vg5vgullb-iIqc5-dOpqEBCck12IS7lPD7bEdoxe8VPKYKe86Db4/pub?output=csv"
volunteer_student <- utils::read.csv(sheet_url, header = TRUE, stringsAsFactors = FALSE) %>%
  mutate(date = as.POSIXct(date, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")) %>%
  filter(user_id == !!user_id) %>%
  filter(session_id == session_id[which.max(date)]) %>%
  select(user_id, trial, time_sec, date)

# print out volunteer_student
volunteer_student

<div class="guided-notes">
    
### 2.5 What does each row of the data frame represent? How about each column? (Put an asterisk next to the outcome variable.)
    
</div>

## 3 Fit the Empty Model of `time_sec`

### 3.1 Let's look at data from someone called `student1` and then create a scatter plot.

Let’s look at some data collected from a previous student who tried this task. Run the code below to take a look at the `student1` data frame. 

Then make a scatter plot with `time_sec` on the y-axis (the outcome variable) and `trial` on the x-axis. How well did this person do?


In [None]:
# run this cell
str(student1)

# graph time_sec by trial



<div class="guided-notes">
    
### 3.2 Write code to fit the empty model of `time_sec` and save it as `empty_model1`, and then print out the parameter estimate ($b_0$)
    
</div>

In [None]:
# code here



<div class="guided-notes">
    
### 3.3 Write the best-fitting model in GLM notation. 
</div>

<div class="guided-notes">
    
### 3.4 Draw the best-fitting empty model onto the scatter plot provided
    
</div>


In [None]:
# to see what the drawing should look like, run this 
gf_point(time_sec ~ trial, data = student1) %>%
  gf_model(empty_model1) 

<div class="discussion-question">
    
### 3.5 Discussion Question: How would we use this model to make a prediction about the next trial?
    
</div>

## 4 Measuring Total Error

We have fit the empty model for `student1`. We know there is error around the model prediction, evidenced by the fact that none of the student's guesses is exactly 6.8 seconds. But how much *total error*? How can we measure total error?  

We already know one method that won’t work: just adding up the residuals. That always comes out to zero for the empty model.
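
To see the canceling in action, here is a minimal sketch in base R. It retypes the `student1` guesses from the setup cell and uses the mean as the empty model prediction. The names `time_sec` and `deviations` are only for illustration.

```r
# guesses from student1 (same values as in the setup cell)
time_sec <- c(7.4, 6, 7.5, 6.4, 6.7)

# for the empty model, the prediction is the mean,
# so each residual is a guess minus the mean
deviations <- time_sec - mean(time_sec)

# positive and negative residuals cancel each other out
sum(deviations)

# all.equal() allows for tiny floating-point rounding error
isTRUE(all.equal(sum(deviations), 0))
```

Because the computer stores decimals approximately, the sum may print as a very tiny number rather than exactly 0.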


<div class="discussion-question">
    
### 4.1 Discussion Question: What are some other ideas for how to calculate the total error around a model?
    
</div>

In [None]:
# run this
gf_point(time_sec ~ trial, data = student1) %>%
  gf_model(empty_model1) %>%
  gf_resid(empty_model1, color="red")

### 4.2 One idea: Sum the absolute values of the residuals

A good idea is to add up the absolute values of the residuals: `sum(abs(resid(empty_model1)))`. In effect, this sums up the distances of each data point from the model prediction.

- ✅ Absolute value solves the canceling problem (positive and negative errors both become positive).
- ❌ The problem with this solution is that the sum of absolute residuals is not minimized at the mean (it is minimized at the median instead). The mean would not be the *best-fitting* empty model if this were our measure of error.
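
One way to check the second point is to compare the sum of absolute residuals around the mean with the sum around another candidate, such as the median. This is a rough sketch in base R using the `student1` guesses. The helper `sum_abs_resid()` is a hypothetical name made up for this example, not a function from any package.

```r
# guesses from student1 (same values as in the setup cell)
time_sec <- c(7.4, 6, 7.5, 6.4, 6.7)

# hypothetical helper: sum of absolute residuals around any candidate prediction
sum_abs_resid <- function(prediction) sum(abs(time_sec - prediction))

sum_abs_resid(mean(time_sec))    # around the mean
sum_abs_resid(median(time_sec))  # around the median
```

If the second number comes out smaller than the first, then under this measure of error the mean is not the best-fitting prediction for these data.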

In [None]:
# If you want to explore sum of absolute residuals, 
# try uncommenting each of these lines of code one at a time

# resid(empty_model1)
# abs(resid(empty_model1))
# sum(abs(resid(empty_model1)))

### 4.3 The most common solution: Sum of Squares Total (SST)

Instead of adding residuals or absolute residuals, the most common measure of total error is the sum of the squared residuals. This is calculated by squaring each residual, then adding up the squares.

- ✅ Squaring solves the canceling problem (positive and negative errors both become positive).
- ✅ Squaring also ensures the mean is the best-fitting empty model.

This measure is called the Sum of Squares (SS). When we’re talking about the total error around the empty model, we call it the Sum of Squares Total (SST).
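
To explore the claim that squaring makes the mean the best-fitting empty model, the sketch below tries a range of candidate predictions and finds the one with the smallest sum of squared residuals. It uses base R and the `student1` guesses. The helper `sum_sq_resid()` and the grid of candidates are assumptions made up for this illustration.

```r
# guesses from student1 (same values as in the setup cell)
time_sec <- c(7.4, 6, 7.5, 6.4, 6.7)

# hypothetical helper: sum of squared residuals around any candidate prediction
sum_sq_resid <- function(prediction) sum((time_sec - prediction)^2)

# try candidate predictions from 6.0 to 7.5 in steps of 0.1
candidates <- seq(6, 7.5, by = 0.1)
ss <- sapply(candidates, sum_sq_resid)

# the candidate with the smallest sum of squares...
candidates[which.min(ss)]

# ...compared with the mean
mean(time_sec)
```

A grid like this can only check the candidates listed, so it suggests the pattern rather than proving it.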

<div class="guided-notes">
    
### 4.4 We've learned how to represent residuals on a graph. How would we represent squared residuals on a graph? Draw how you think you could do that.

</div>


In [None]:
# run this after students predict
gf_point(time_sec ~ trial, data = student1) %>%
  gf_model(empty_model1) %>%
  gf_resid(empty_model1) %>%
  gf_squaresid(empty_model1, color = "blue")

<div class="discussion-question">
    
### 4.5 Discussion Question: In sum of squares, what exactly is being summed?
    
As a contrast, also consider, in sum of residuals, what exactly is being summed?
    
</div>

<div class="guided-notes">

### 4.6 Modify the R code below to calculate the sum of squares    

- NOTE: You need to square the residuals before you sum them
- Afterwards, write the R code to calculate the SS using the supernova() function
    
</div>


In [None]:
# modify this code; you should get 1.66
resid(empty_model1)



<div class="discussion-question">
    
### 4.7 Discussion Question: What units is this number in? Is 1.66 a lot or a little total error?
    
</div>

## 5 Mathematical Notation for SST

We don't introduce a lot of formulas in this course, but the formula for sum of squares is important enough that you should learn it. 

Here is the formula: <br><br>

<div style="font-size: 20pt;">$SST=\sum_{i=1}^n (Y_i-\bar{Y})^2$</div>

**This is often abbreviated like this:**<br><br>

<div style="font-size: 20pt;">$SST=\sum (Y_i-\bar{Y})^2$</div>

The $\sum$ symbol stands for sum (for every row i, which goes from 1 to n, the last row of your data).

**And you could also write this to mean the same thing:**<br><br>

<div style="font-size: 20pt;">$SST=\sum (Y_i-\hat{Y})^2$</div>

$\hat{Y}$ (y-hat) is the model prediction, which for the empty model is just the mean of $Y$, or $\bar{Y}$.
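
As a small worked example, take three made-up values (not from the stopwatch data): 2, 4, and 9. Their mean is $\bar{Y} = 5$, so the formula works out like this:

$$SST = (2-5)^2 + (4-5)^2 + (9-5)^2 = 9 + 1 + 16 = 26$$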

<div class="guided-notes">
    
### 5.1 Fill in the missing table cells to connect the R code to the formula
    
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <th style="border: 1px solid black; font-weight: bold; width: 30%; text-align: left">R Code</th>
      <th style="border: 1px solid black; font-weight: bold; width: 10%; text-align: left">Formula Symbol</th>
      <th style="border: 1px solid black; font-weight: bold; width: 60%; text-align: left">Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: #f2f2f2;">
      <td style="border: 1px solid black;text-align: left; font-size: 16px;"><code>time_sec</code> or <code>student1$time_sec</code></td>
      <td style="border: 1px solid black;text-align: left">Y<sub>i</sub></td>
      <td style="border: 1px solid black;text-align: left">Values of the outcome variable (each guess in seconds)</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;text-align: left; font-size: 16px;"><code>mean(student1&#36;time_sec)</code></td>
      <td style="border: 1px solid black;text-align: left;"><span style="text-decoration: overline;">Y</span><br>(Y-bar)</td>
      <td style="border: 1px solid black;text-align: left;"> </td>
    </tr>
    <tr style="background-color: #f2f2f2;">
      <td style="border: 1px solid black;text-align: left; font-size: 16px;"><code>_________(empty_model1)</code></td>
      <td style="border: 1px solid black;text-align: left;">Y<sub>i</sub> − <span style="text-decoration: overline;">Y</span></td>
      <td style="border: 1px solid black;text-align: left;">Residuals from the empty model (differences between each value and the mean)</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;text-align: left; font-size: 16px;"><code>resid(empty_model1)^2</code></td>
      <td style="border: 1px solid black;text-align: left;"> </td>
      <td style="border: 1px solid black;text-align: left;">Squared residuals</td>
    </tr>
    <tr style="background-color: #f2f2f2;">
      <td style="border: 1px solid black;text-align: left; font-size: 16px;"><code>sum(resid(empty_model1)^2)</code></td>
      <td style="border: 1px solid black;text-align: left;"> </td>
      <td style="border: 1px solid black;text-align: left;"> </td>
    </tr>
  </tbody>
</table>

<div class="discussion-question">

### 5.2 Discussion Question: What happens when we run the code in the cell below? Will it add additional rows or additional columns to the `student1` dataset? How do you know?

</div>

In [None]:
# just run this; but let's not print student1 just yet
# we'll print it after our activity in 5.3
student1$empty_predict <- predict(empty_model1)
student1$empty_resid <- resid(empty_model1)
student1$empty_resid_sqrd <- resid(empty_model1)^2

<div class="guided-notes">

### 5.3 Below is the new version of the student1 data table that we created using the predict() and resid() functions. Some of the cells are empty. 
    
- Fill in the top of each column with the R code that was used to produce each of the new columns (e.g., `student1$empty_predict <- <your code here>`)
- Fill in the second row of each column with the appropriate mathematical notation: 
 $Y_i$, $\hat{Y}$, $(Y_i-\hat{Y})$, or $(Y_i-\hat{Y})^2$.
- Finally, fill in any missing values in the data table.

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%">
  <thead>
    <tr>
      <th colspan="2" style="border: 1px solid black; background-color: #7F7F7F; color: white; text-align: left; width: 20%;">
        R code &rarr;
      </th>
      <th style="border: 1px solid black; width: 26%;"></th>
      <th style="border: 1px solid black; width: 27%;"></th>
      <th style="border: 1px solid black; width: 27%;"></th>
    </tr>
      <tr>
      <th style="border: 1px solid black; background-color: #444; color: white; text-align: left; width: 20%;">
        math &rarr;
      </th>
      <th style="border: 1px solid black;text-align: right;"></th>
      <th style="border: 1px solid black; width: 26%;"></th>
      <th style="border: 1px solid black; width: 27%;"></th>
      <th style="border: 1px solid black; width: 27%;"></th>
    </tr>
    <tr>
      <th style="border: 1px solid black;text-align: right;">trial</th>
      <th style="border: 1px solid black;text-align: right;">time_sec</th>
      <th style="border: 1px solid black;text-align: right;">empty_predict</th>
      <th style="border: 1px solid black;text-align: right;">empty_resid</th>
      <th style="border: 1px solid black;text-align: right;">empty_resid_sqrd</th>
    </tr>
  </thead>
  <tbody>
    <tr style="border: 1px solid black;text-align: right;">
      <td style="border: 1px solid black;">1</td>
      <td style="border: 1px solid black;">7.4</td>
      <td style="border: 1px solid black;">6.8</td>
      <td style="border: 1px solid black;">0.6</td>
      <td style="border: 1px solid black;">0.36</td>
    </tr>
      <tr>
      <td style="border: 1px solid black;">2</td>
      <td style="border: 1px solid black;">6.0</td>
      <td style="border: 1px solid black;"> </td>
      <td style="border: 1px solid black;">-0.8</td>
      <td style="border: 1px solid black;">0.64</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">3</td>
      <td style="border: 1px solid black;">7.5</td>
      <td style="border: 1px solid black;">6.8</td>
      <td style="border: 1px solid black;"> </td>
      <td style="border: 1px solid black;">0.49</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">4</td>
      <td style="border: 1px solid black;">6.4</td>
      <td style="border: 1px solid black;">6.8</td>
      <td style="border: 1px solid black;">-0.4</td>
      <td style="border: 1px solid black;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">5</td>
      <td style="border: 1px solid black;">6.7</td>
      <td style="border: 1px solid black;"> </td>
      <td style="border: 1px solid black;">-0.1</td>
      <td style="border: 1px solid black;"> </td>
    </tr>
  </tbody>
</table>

In [None]:
# just run this
student1$empty_predict <- predict(empty_model1)
student1$empty_resid <- resid(empty_model1)
student1$empty_resid_sqrd <- resid(empty_model1)^2

student1

<div class="discussion-question">
    
### 5.4 Now that we have this updated `student1` data frame, what are some different ways we could calculate Sum of Squares Total? 
    
- For each method, let's try to work out the R code to implement it.  
    
</div>

In [None]:
# code here


<div class="discussion-question">
    
### 5.5 The Sum of Squares Total for Student 1's data is 1.66. Why does the first line of code work when the second doesn't?
    
</div>

In [None]:
# this code works
sum( student1$empty_resid_sqrd ) 

# this code does not
# hint: this number is very close to 0
sum( student1$empty_resid )^2 

## 6 How Sample Size Affects SST

We've been working with a small dataset from Student 1. It only has 5 data points. Student 2 generated 10 data points - twice as many as Student 1. Let's see how this difference in sample size affects Sum of Squares Total.

<div class="discussion-question">
    
### 6.1 Discussion: Student 2 happens to be a bit better at this task than Student 1. Student 1’s SST was 1.66. Let's make a prediction. How will Student 2's SST compare to Student 1's?
    
</div>

### 6.2 In the code cell below, fit and save `empty_model2` for `time_sec` in the `student2` data frame. Print it out and interpret the parameter estimate.

In [None]:
# code here



<div class="discussion-question">
    
### 6.3 Discussion: Just looking at the two plots:
- Which one is Student 1 and which one is Student 2? 
- Which model, `empty_model1` or `empty_model2`, looks like it would have the larger SST? Why?
- Is Student 2 more accurate at the estimation task? How can you tell?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.1-6.3-overview-SS-Total-2-students.jpg" alt = "the squares that comprise SST for student 1 and student 2's guesses" width="100%">

### 6.4 Write R code to calculate Sum of Squares Total for `student1` and `student2`. Do the calculations confirm your intuitions?

In [None]:
# code here



## 7 Variance

We just saw that Student 2’s SST was larger than Student 1’s. But this doesn’t mean the empty model fits Student 2’s data worse. It just means we added up more squared residuals (10 squares instead of 5).  

So SST by itself isn’t a fair way to compare across datasets of different sizes.

To get a more comparable indicator of how well the model fits, statisticians use something called **variance**. Variance is simply the SST divided by n-1 (the n stands for the number of data points). In this way, variance corrects for the difference in sample size between the two datasets. (n-1 is called the *degrees of freedom*. We will discuss this more in later parts of the course.)
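
The sketch below shows the division step on three made-up values (not the stopwatch data), using only base R. The names `y`, `sst`, and `n` are chosen just for this illustration.

```r
# made-up values, not from the stopwatch data
y <- c(2, 4, 9)

sst <- sum((y - mean(y))^2)  # sum of squares total
n <- length(y)               # number of data points

sst / (n - 1)  # variance computed by hand
var(y)         # base R's var() also divides by n - 1
```

The two results should match, since `var()` computes the sample variance with the same n-1 divisor.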

<div class="guided-notes">
    
### 7.1 Modify the code we use to calculate SST around `empty_model2` to calculate variance instead for `student2`
    
- Afterwards, write the R code to calculate the SS using the supernova() function
- Finally, write the R code to calculate the variance using the var() function

</div>


In [None]:
# modify this
sum(resid(empty_model2)^2)



<div class="discussion-question">

### 7.2 Discussion: In the `supernova()` table, the column labeled MS stands for "mean square".  That is just another way of saying "variance". Why does it make sense to think of variance as a "mean square"? 
</div>


<div class="guided-notes">

### 7.3 How would you modify the formula for SST to get the formula for variance?   

<div style="font-size: 20pt;text-align: center">$\sum (Y_i-\bar{Y})^2$</div>

</div>


### 7.4 Using variance to compare Student 1 and Student 2

In the code cell below, use the `supernova()` function to calculate the variance of Student 1 and Student 2.

In [None]:
# code here


<div class="discussion-question">

### 7.5 Discussion: Notice that the SS of Student 2 is larger than that of Student 1, but the MS (variance) of Student 2 is smaller. Why do you think this is? How would you interpret these numbers? 
</div>


## 8 Practice What You've Learned
**NOTE: This might be better put in a separate notebook.**

Try collecting data on yourself, fitting the empty model, and then calculating the SST around your model. See how your level of error compares to your classmates, and think about what this comparison means.

### 8.1 Run the code in the cell below to set up your own stopwatch. How long is 16 seconds?

To make it harder, try estimating the duration of 16 seconds instead of 7. And try doing it 10 times. Just push **Start** to begin, and when you're done we'll show you how to download your data. **You won't be able to see your times until you have completed all 10 trials and imported the data into R.**

In [None]:
# run this code to render the stopwatch
IRdisplay::display_html(sprintf(
  '<iframe src="https://uclatall.github.io/stopwatch/stopwatch-v3.html?condition=%s&class_id=%s&user_id=%s&trials=10&progress=hide&result=hide" width="500" height="220" sandbox="allow-scripts allow-same-origin"></iframe>',
  condition, class_id, user_id
))

### 8.2 Run the cell below to get your data in a data frame called `my_data`

In [None]:
# run this code to import your data from your most recent session
sheet_url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQRi4VBEFYGz_HnJXBsxlawbqM4vg5vgullb-iIqc5-dOpqEBCck12IS7lPD7bEdoxe8VPKYKe86Db4/pub?output=csv"
my_data <- utils::read.csv(sheet_url, header = TRUE, stringsAsFactors = FALSE) %>%
  mutate(date = as.POSIXct(date, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")) %>%
  filter(user_id == !!user_id) %>%
  filter(session_id == session_id[which.max(date)]) %>%
  select(user_id, trial, time_sec)

# this prints out my_data
my_data

### 8.3 Create and save `my_empty_model` of `time_sec` in your data. Print out the parameter estimate.

In [None]:
# code here
my_empty_model <- lm(time_sec ~ NULL, data = my_data)

### 8.4 Write your best-fitting empty model in GLM notation and interpret the parameter estimate

Empty model in GLM notation:

Interpretation of parameter estimate:


### 8.5 Plot `time_sec` by `trial`. Then:
- Overlay the empty model. Run it to make sure the code works.
- Overlay the residuals. Run it to make sure the code works.
- Overlay the squared residuals. Run it to make sure the code works.

In [None]:
# code here


### 8.6 Calculate SST and variance for `my_empty_model`

In [None]:
# code here



<div class="discussion-question">
    
### 8.7 Discussion: Compare your performance with your classmates
- Who was most accurate on average?
- Who has the largest SST? Who has the smallest?
- Do people who are more accurate necessarily have smaller SSTs?
    
</div>