# Quantifying Error from Regression Models (COMPLETE)

## Chapter 9.5-9.7 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)
library(gridExtra)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# read in data
study_data <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQjgMnwBMvsmAj9rP2OccPLjGZZUON9ifFqDav0IDo-F1fUqBgpGXoBK0Lhmqf1IApfcjC4LGnW5iaZ/pub?output=csv") %>%
  mutate(active_cat = factor(ntile(active,3), levels = c(1:3), labels = c("low", "medium", "high")))

# add functions directly from github
source("https://raw.githubusercontent.com/coursekata/beta-functions/refs/heads/main/gf_resid.R")
source("https://raw.githubusercontent.com/coursekata/beta-functions/refs/heads/main/gf_square_resid.R")


<div class="teacher-note">
    
<b>Section Goals:</b> In this section, students extend their understanding of how to quantify error from a statistical model to include regression models. As in group models, error is quantified from the residuals around the model predictions. It can be summarized using sums of squares, PRE, and F, and total error can be partitioned into explained and unexplained parts.

Students should understand that, just like in group models:

- Residuals are calculated as the value of the data point minus the predicted value based on the model.
- SSE is calculated by squaring and summing all residuals from a model. 
- SST represents the total variation in the outcome variable (based on the empty model) and can be partitioned into SSM + SSE.
- PRE quantifies how much better the regression model’s predictions are compared to the empty model. In regression contexts with one predictor variable, PRE is the same as R^2.
- The F ratio adjusts for model complexity by using mean squares (MS), which take degrees of freedom into account.

A <a href="https://docs.google.com/document/d/1mzMcrHdNLWQ_JcL9ON9pyiKk5h8R3mgOXNr22S5vKww/edit?tab=t.5y2a0ykmi2fk" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
</div>

## 1 Quantifying Error From A Regression Model

In the previous notebook we compared a three-group model with a regression model. We learned how to fit a regression model, how to interpret the parameter estimates, and how to represent a regression model in GLM notation. In this notebook we will continue this comparison by looking at how we quantify error around a regression model.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/9.1-9.4-studying.png">

## 2 Review: The `study_data` Data Frame

We’ll continue working with the `study_data` dataset we used earlier. These data are similar to a study by <a href="https://pubmed.ncbi.nlm.nih.gov/33444109/">Walck-Shannon, Rowell, and Frey (2021)</a>, in which college students described how they studied for a biology exam.

Each student estimated what percentage of their study time they spent using different techniques (for example, quizzing themselves, re-reading notes, or explaining concepts to others). Researchers grouped these strategies into active and passive categories.

The `study_data` dataset includes data from 60 students and three variables:
- `active`: the percentage of study time spent using active strategies
- `active_cat`: a categorical variable indicating whether a student was low, medium, or high in active study time (20 students per group)
- `exam`: the percent correct on the biology exam

Use the code cell below to view the dataset.

In [None]:
# code here

# complete
str(study_data)

## 3 Review: The Best-Fitting Group Model and the Regression Model


<div class="discussion-question">

### 3.1 Discussion Questions: These models are both the "best-fitting" models. What does this mean? How could they both be "best-fitting"?
</div>

In [None]:
# run this

# this creates and prints out the best-fitting three-group model
threegroup_model <- lm(exam ~ active_cat, data = study_data)
threegroup_model

# this creates and prints out the best-fitting regression model
regression_model <- lm(exam ~ active, data = study_data)
regression_model

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Three-Group Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Regression Model</td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: center;"><b>Relationship</b></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"><code>exam ~ active_cat</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"><code>exam ~ active</code></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: center; height: 80px"><b>Visualization of Model</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: center;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/9.5-9.7-overview-group-model.jpg" alt="group model"></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: center;"><img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/9.5-9.7-overview-regression-model.jpg" alt="regression model"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>Model Specified in GLM Notation</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$Y_i = b_0 + b_1X_{i} + e_i$$</td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top; height: 80px"><b>Best-Fitting Model in GLM Notation</b></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$Y_i = 74.55 + 3.10X_{1i} + 4.85X_{2i} + e_i$$</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;">$$Y_i = 70.7 + 0.13X_{i} + e_i$$</td>
    </tr>
  </tbody>  
</table>

<div class="teacher-note">

**Teacher Note:** When we specify a model we decide what kind of a model we are going to fit: a group model, for example, or a regression model. 
    
Fitting the model means finding the parameter estimates (the values of the $b$s) that reduce error to the lowest possible level given the model that has been specified. It doesn't mean that either the group model or the regression model is "best", but that *if we are going to fit the three-group model*, these are the parameter estimates that reduce error the most.
    
However, *if we are going to fit the regression model*, a different set of parameter estimates will reduce error the most.
    
</div>
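
To make "reduces error the most" concrete, here is a minimal sketch in base R. It assumes a toy data frame, `six`, built from the six students listed in the table in section 4.5, and a hypothetical helper, `sse_for()`, that is not part of the course packages. Because it fits only six students, its estimates will differ from the ones in the table above; the point is only that nudging either estimate away from the fitted value makes the sum of squared errors larger.

```r
# Toy data: the six students from the table in section 4.5 (illustration only)
six <- data.frame(
  active = c(26, 44, 54, 57, 63, 71),
  exam   = c(77, 72, 76, 80, 84, 76)
)

# Hypothetical helper: sum of squared errors for any intercept (b0) and slope (b1)
sse_for <- function(b0, b1, data) {
  predicted <- b0 + b1 * data$active
  sum((data$exam - predicted)^2)
}

# Best-fitting estimates for the regression model on the toy data
b <- unname(coef(lm(exam ~ active, data = six)))

sse_best      <- sse_for(b[1], b[2], six)          # fitted estimates
sse_intercept <- sse_for(b[1] + 1, b[2], six)      # intercept nudged up by 1
sse_slope     <- sse_for(b[1], b[2] + 0.05, six)   # slope nudged up by 0.05

c(best = sse_best, nudged_intercept = sse_intercept, nudged_slope = sse_slope)
```

Any other nudge (up or down) should give the same pattern: the fitted estimates have the smallest sum of squared errors for this model specification.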


## 4 DATA = MODEL + ERROR for Regression Models

Both the group model and the regression model can be expressed as:

<p style="font-size: 20px">$$Y_i = \hat{Y}_i + e_i$$</p>

We just have different models that give us the predicted exam scores, the $\hat{Y}_i$. This connects to our big organizing idea:

<p style="font-size: 20px">$$\text{DATA} = \text{MODEL} + \text{ERROR}$$</p>

In the plot below, the gray points represent data from all 60 students. The six black points are a small sample we’ll use to explore model predictions and error in this section.


<div class="guided-notes">

### 4.1 Draw the model predictions ($\hat{Y}_i$) for these 6 students as dots along the regression line
    
Next to each dot, write a numeric estimate of what the model predicts for that student's exam score.
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/9.5-9.7-overview-6-dots.jpg" alt="scatter plot and regression line with 6 particular dots highlighted" width = 60%>

<div class="guided-notes">
    
### 4.2 For these 6 students, draw the error ($e_i$) as a line representing the difference between their data and the model prediction on the regression line. 
    
Next to each line, mark whether you think the error is positive (+) or negative (–). 

    
</div>

### 4.3 These lines represent residual error

Error from the regression model is just like error from the group model. In both cases, the error (or residual) is the difference between each data point and its predicted value from the model.  

We can express the idea of "residual error" by rearranging our word equation:

<p style="font-size: 20px">$$\text{ERROR} = \text{DATA} - \text{MODEL}$$</p>


<div class="guided-notes">
    
### 4.4 We can also represent that idea with GLM notation. How would you rearrange $Y_i = \hat{Y}_i + e_i$ in a similar way to calculate $e_i$?
    
</div>

<div class="guided-notes">
    
### 4.5 Calculate model predictions and residuals for the 6 students in the table. 
Reminder: The best-fitting regression equation is $\hat{Y}_i = 70.7 + 0.13X_i$ 

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width:20%">active ($X_i$)</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width:20%">exam ($Y_i$)</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width:25%">predicted exam ($\hat{Y}_i$)</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: left; width:25%">residual error ($e_i$)</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">26</td>
      <td style="border: 1px solid black; text-align: left;">77</td>
      <td style="border: 1px solid black; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">44</td>
      <td style="border: 1px solid black; text-align: left;">72</td>
      <td style="border: 1px solid black; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">54</td>
      <td style="border: 1px solid black; text-align: left;">76</td>
      <td style="border: 1px solid black; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">57</td>
      <td style="border: 1px solid black; text-align: left;">80</td>
      <td style="border: 1px solid black; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">63</td>
      <td style="border: 1px solid black; text-align: left;">84</td>
      <td style="border: 1px solid black; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">71</td>
      <td style="border: 1px solid black; text-align: left;">76</td>
      <td style="border: 1px solid black; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"></td>
    </tr>
  </tbody>  
</table>
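
For checking the hand calculations, the table can be filled in with a few lines of base R. This is a sketch that uses the rounded equation from the reminder ($\hat{Y}_i = 70.7 + 0.13X_i$), so the values may differ slightly from what `predict()` and `resid()` return with the unrounded estimates.

```r
# The six students from the table above
active <- c(26, 44, 54, 57, 63, 71)   # X_i
exam   <- c(77, 72, 76, 80, 84, 76)   # Y_i

# MODEL: predicted exam score from the rounded regression equation
predicted <- 70.7 + 0.13 * active

# ERROR = DATA - MODEL
residual <- exam - predicted

data.frame(active, exam, predicted, residual)
```

A positive residual means the student scored above the regression line; a negative residual means the student scored below it.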


<div class="guided-notes">
    
### 4.6 Which columns are DATA, MODEL, and ERROR? Label the columns in the table.

</div>

<div class="guided-notes">
    
### 4.7 Just like we calculated predictions and residuals by hand, R can calculate them for every row in the dataset using `predict()` and `resid()`. Fill in the missing parts of the table.

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Predictions</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">Residuals</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: left;">To calculate (but not save)</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>predict(regression_model)</code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: left;">To save into <code>study_data</code> as a new variable</td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>study_data&dollar;regression_predict &lt;- </code></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"></td>
    </tr>
  </tbody>  
</table>


<div class="discussion-question">

### 4.8 Discussion Question: When we run the two lines of code below, will R print anything? Explain what these lines of code will do. (How can you check what the code did?)

</div>

In [None]:
# run this after you discuss 4.8
study_data$regression_predict <- predict(regression_model)
study_data$regression_resid <- resid(regression_model)

# how can you check what the code did?

# COMPLETE
head(study_data)

<div class="teacher-note">

**Sample Responses:**  
- No, R won't print anything because it is just saving the results.
- This code should create two new variables in the data frame called `regression_predict` and `regression_resid`. We can check this just by running `head()`.
    
    
</div>
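
Beyond `head()`, there are other ways to check what the code did. The sketch below assumes a toy data frame, `six`, made from the six students in the table in section 4.5 (so its model is not the same as `regression_model`), and shows two checks: that the new columns exist, and that DATA = MODEL + ERROR holds in every row.

```r
# Toy data: six students, for illustration only
six <- data.frame(
  active = c(26, 44, 54, 57, 63, 71),
  exam   = c(77, 72, 76, 80, 84, 76)
)
six_model <- lm(exam ~ active, data = six)

# Assignment with <- saves the result silently; nothing is printed
six$regression_predict <- predict(six_model)
six$regression_resid   <- resid(six_model)

# Check 1: the two new variables now appear in the data frame
names(six)

# Check 2: prediction + residual reproduces each exam score
all.equal(six$exam, unname(six$regression_predict + six$regression_resid))
```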


<div class="guided-notes">

### 4.9 In addition to using R to calculate residuals for every student (every row), we can also use R to visualize all of them. We used `gf_resid()` before to display residuals from a group model. Try using it to visualize the residuals from the regression model.

</div>


In [None]:
# add gf_resid to this plot
gf_point(exam ~ active, data = study_data) %>%
  gf_model(regression_model, color = "firebrick")

# COMPLETE
gf_point(exam ~ active, data = study_data) %>%
  gf_model(regression_model, color = "firebrick") %>%
  gf_resid(regression_model, color = "firebrick")

<div class="discussion-question">

### 4.10 Discussion Question: Do the residuals seem to “balance” around the line, just like the residuals from the empty and group models? Is there a way we could write R code to check that idea?

</div>

In [None]:
# code here

# COMPLETE
sum(resid(regression_model))

<div class="teacher-note">

**Sample Responses:**  
- There are residuals both above and below the regression line and overall it looks balanced. But I'm not sure I can know for sure just by looking at it.  
- We could check this by adding up all the residuals; it should be close to zero.

**Teacher Note:**  This question helps students generalize the balancing idea across all the best-fitting models we have explored so far: empty model, group model (made of group means), and the regression model. The regression model may not look like a mean, but it functions as one by being the “middle” of a bivariate distribution, balancing the residuals above and below it.
</div>
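
The balancing idea can be checked for all three kinds of models at once. The sketch below is an illustration on a toy data frame, `six` (the six students from the table in section 4.5), not on `study_data`. The `active_cat` grouping here is an assumption made for the example: two students per group, in order of `active`.

```r
# Toy data: six students, already sorted by active
six <- data.frame(
  active = c(26, 44, 54, 57, 63, 71),
  exam   = c(77, 72, 76, 80, 84, 76)
)
# Assumed grouping for illustration: two students each in low, medium, high
six$active_cat <- factor(rep(c("low", "medium", "high"), each = 2),
                         levels = c("low", "medium", "high"))

empty_six      <- lm(exam ~ 1, data = six)           # empty model (the mean)
group_six      <- lm(exam ~ active_cat, data = six)  # group model (group means)
regression_six <- lm(exam ~ active, data = six)      # regression model (the line)

# Residuals from each best-fitting model sum to zero,
# apart from tiny floating-point rounding error
round(c(
  empty      = sum(resid(empty_six)),
  group      = sum(resid(group_six)),
  regression = sum(resid(regression_six))
), 10)
```

If R prints something like `1e-15` before rounding, that is zero for practical purposes; the tiny leftover comes from how computers store decimals.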


## 5 Partitioning Variation Around the Regression Model

In group models, we saw that the total variation in the data (SST) could be broken down into the part explained by the model (SSM) and the leftover error (SSE).  This same idea applies to regression models: the regression line uses the `active` variable to explain part of the variation in `exam`, and the rest is left as error.

We can apply the DATA = MODEL + ERROR framework to SS like this:

<p style="font-size: 20px">$$\text{SST} = \text{SSM} + \text{SSE}$$</p>
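
As a rough check of this partition, the three sums of squares can be computed by hand in base R. The sketch below assumes a toy data frame, `six`, made from the six students in the table in section 4.5, so its values will not match the ANOVA table for the full `study_data`.

```r
# Toy data: six students, for illustration only
six <- data.frame(
  active = c(26, 44, 54, 57, 63, 71),
  exam   = c(77, 72, 76, 80, 84, 76)
)

empty_six      <- lm(exam ~ 1, data = six)       # predicts the mean for everyone
regression_six <- lm(exam ~ active, data = six)  # predicts from the regression line

# SST: squared residuals around the empty model (the mean)
sst <- sum(resid(empty_six)^2)

# SSE: squared residuals around the regression line
sse <- sum(resid(regression_six)^2)

# SSM: squared distances from the regression predictions to the mean
ssm <- sum((predict(regression_six) - mean(six$exam))^2)

c(SST = sst, SSM = ssm, SSE = sse)

# The partition holds, up to floating-point rounding
all.equal(sst, ssm + sse)
```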

<div class="guided-notes">

### 5.1 Label this Venn diagram to show that SST (the total variation) can be partitioned into SSM (the model’s explained variation) and SSE (the unexplained error).

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.5-7.7-overview-venn2.jpg" width = 40%>

### 5.2 We can use `gf_squaresid()` (or `gf_square_resid()`) to visualize the error that is minimized by the regression model

In [None]:
# add gf_squaresid to this visualization
gf_point(exam ~ active, data = study_data) %>%
  gf_model(regression_model, color = "firebrick") %>%
  gf_resid(regression_model, color = "firebrick") 

# COMPLETE
gf_point(exam ~ active, data = study_data) %>%
  gf_model(regression_model, color = "firebrick") %>%
  gf_resid(regression_model, color = "firebrick") %>%
  gf_squaresid(regression_model)

<div class="guided-notes">

### 5.3 Use `supernova()` to look at the Sums of Squares from the regression model.  Label the Venn diagram with the values for SST, SSM, and SSE that you see in the ANOVA table.

How much error is left from the regression model compared to the error left from the empty model?


</div>

In [None]:
# code here

# COMPLETE
supernova(regression_model)

<div class="guided-notes">

### 5.4 Fill out the table to compare the three models we've seen in the last few chapters: empty, group, and regression. 

- Put a check mark to show which models *balance residuals* and *minimize sum of squares*.  
- Fill in the name of the sum of squares that each model minimizes.

</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:25%">Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:25%">Balances Residuals?</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:25%">Minimizes SS?</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:25%">Name of the SS that is minimized</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: center;">Empty Model</td>
      <td style="border: 1px solid black; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: center;">Group Model</td>
      <td style="border: 1px solid black; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; text-align: center;">Regression Model</td>
      <td style="border: 1px solid black; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;"> </td>
    </tr>
  </tbody>  
</table>


<div class="discussion-question">

### 5.5 Discussion Questions: Run the code below to look at the SSE from the three-group model and regression model. What do you notice about the different sums of squares for the two models? Which is a better model? How can you tell?
    
</div>

In [None]:
# run this
supernova(threegroup_model)
supernova(regression_model)

<div class="teacher-note">

**Sample Responses:**  
- The SSEs are similar (786 versus 772, respectively), and thus the SSMs (241 and 256, respectively) and PREs (0.23 and 0.25, respectively) are similar too.
- The models are actually fairly similar but the regression model is a little bit better.

**Teacher Note:**  
- Some students may have the insight to look at the F statistic, which penalizes the three-group model for being more complicated.
- You may reinforce that the SS Total is the same for both models because they have the same outcome variable (exam score).
- You may want to ask students why it's a good thing that the SSE is smaller for the regression model than for the three-group model. We want students to internalize that when it comes to error, having *less* error is a good thing.
 
</div>
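
To see how PRE and F follow from the sums of squares, the sketch below recomputes them from the rounded values quoted in the sample responses. Because the inputs are rounded, the results will only approximately match what `supernova()` prints. The degrees of freedom are an assumption based on 60 students, with two parameters beyond the empty model for the three-group model and one for the regression model.

```r
n <- 60  # number of students in study_data

# Rounded sums of squares from the sample responses above
ss <- data.frame(
  model    = c("three-group", "regression"),
  ssm      = c(241, 256),
  sse      = c(786, 772),
  df_model = c(2, 1)   # parameters estimated beyond the empty model
)

# Degrees of freedom left for error
ss$df_error <- n - 1 - ss$df_model

# PRE: proportion of total error explained by the model
ss$pre <- ss$ssm / (ss$ssm + ss$sse)

# F: mean square for the model divided by mean square for error
ss$f <- (ss$ssm / ss$df_model) / (ss$sse / ss$df_error)

ss
```

The two PRE values are close, but the F ratio for the regression model is roughly twice as large, because the regression model explains about the same amount of error while using one fewer parameter.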
