# Partitioning Variation
## Chapter 7.9-7.10 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)

# format notebook
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# temporarily add gf_resid and gf_square_resid
source("https://raw.githubusercontent.com/UCLATALL/stopwatch/refs/heads/main/gf_resid.R")
source("https://raw.githubusercontent.com/UCLATALL/stopwatch/refs/heads/main/gf_square_resid.R")

# get the data
class1 <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQmmwF6-kWKW6GotizxD08b0IMh2T0ai-TISV0DdVZyaTaZW2tt57nZ05VR5luKW5r-evnvbB7DOjtv/pub?gid=0&single=true&output=csv")
class2 <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQbLFb-SgUEpog4zN73kZT7YvimGNYnYwbcmEGNVgc9TZalvmyYzmo68a-KzePFUOBc26AmXspnEnK1/pub?gid=0&single=true&output=csv")

# fit models
empty_model1 <- lm(guess ~ NULL, data = class1)
empty_model2 <- lm(guess ~ NULL, data = class2)
condition_model1 <- lm(guess ~ condition, data = class1)
condition_model2 <- lm(guess ~ condition, data = class2)

## 1 Partitioning Variation into Data, Model, and Error

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-venn-diagram1.jpg" style="padding-right: 30px;" width = 70% align="left">

Last time, we used this Venn diagram to show how the total variation (**SS Total**) can be split into two parts:

- **SS Error:** the leftover error from a more complex model
- **Error reduced:** the part of SS Total explained by adding an explanatory variable; this part is also called **SS Model**.

Just as we have been learning that **DATA = MODEL + ERROR**, the same is true for sums of squares: **SS Total = SS Model + SS Error**.

In this notebook, we’ll zoom in on SS Model, how it’s calculated, and why it matters.



## 2 Reviewing the Guess-the-Answer Task

In an earlier notebook, you saw an experiment where students were each shown a card like this:

<br><div style="margin: 0 auto; font-family: Arial, sans-serif; border: 2px solid black; width: 380px; height: 220px; background-color: #F2F2F2; padding: 10px; box-sizing: border-box; font-size: 12pt; line-height: 1.4; font-weight: bold; text-align: left;">
  On the other side of this card is a math problem. When you are told to do so, turn over the card and look at it for 5 seconds. Then turn the card back over and write your estimate of the answer in the space below.<br><br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;___________________________
</div>

Once the timer started, students flipped the card, saw one of these two math problems (ascending, shown on the left; descending, shown on the right) for five seconds, then flipped it back over and guessed the answer.

<table border="1" style="font-size: 24px; margin-left: 0; border-collapse: collapse; table-layout: fixed; width: 100%;">
  <tbody>
    <tr>
      <td style="border: 1px solid black; text-align: center; padding: 60px;">
        1 × 2 × 3 × 4 × 5 × 6 × 7 × 8
      </td>
      <td style="border: 1px solid black; text-align: center; padding: 60px;">
        8 × 7 × 6 × 5 × 4 × 3 × 2 × 1
      </td>
    </tr>
  </tbody>
</table>

### 2.1 New data from two classes: `class1` and `class2`

In this notebook, we’re going to revisit the same experiment, but this time we have data from **two different college classes** stored in two separate data frames:  
- `class1`  
- `class2`
  
There were 60 students in each class, half assigned to each condition. Each data frame has two variables: 
- `condition` : Whether students saw the multiplication question in `ascending` or `descending` order
- `guess` : The answer students guessed after viewing the card for 5 seconds
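
As a rough illustration of that two-column layout, here is a minimal sketch using a made-up data frame called `toy` (a hypothetical name with invented values, not the class data). It uses only base R functions to show the structure, the first rows, the row count, and the number of students per condition.

```r
# Hypothetical toy data with the same two columns as class1 and class2
toy <- data.frame(
  condition = rep(c("ascending", "descending"), each = 3),
  guess = c(500, 1000, 1500, 2000, 3000, 4000)
)

str(toy)    # column names and types
head(toy)   # first few rows

n_students <- nrow(toy)                  # number of rows (students)
n_per_condition <- table(toy$condition)  # students in each condition
n_students
n_per_condition
```

The same functions should work on any data frame with these two columns.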

### 2.2 Write some code to see what's in the two data frames

- Do the rows and columns match what you would expect to see?
- Do both datasets have exactly 60 students? How many students within each class are assigned to each condition?

In [None]:
# code here



## 3 Modeling the Data From Two Classes


### 3.1 Look at the two jitter plots below
The left plot shows data from `class1`, the right one from `class2`. Each plot shows students' guesses by condition, overlaid with the best-fitting condition model (solid red lines) and the empty model (dashed blue line).

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-jitter-with-two-models.png" alt="the jitter plots of class1 and class2 with the best-fitting condition model and empty model for each class overlaid; ascending group mean is about 1300 and descending group mean is about 3400; class 2's data are more variable than class 1 even though they have the same pattern -- lower estimates in the ascending versus descending group">

<div class="discussion-question">

### 3.2 Discussion Questions: 

- What do you notice is similar across the two classes?
- What do you notice is different across the two classes? 
- Do you think the parameter estimates for the condition model would be similar across the two classes, or different? Explain.

</div>


<div class="guided-notes">

### 3.3 Guided Notes: Write the R code to fit the condition model for each class. Save the models as `condition_model1` and `condition_model2`. Print out the two models and record the resulting parameter estimates and label them as b0 or b1.

</div>


In [None]:
# write R code



<div class="discussion-question">

### 3.4 Discussion Questions: 

- The parameter estimates look very similar across the two classes. What does that tell us about the model predictions, and is it what you expected from the graphs?
- If the parameter estimates are similar, does that mean the models fit the two classes equally well? Explain.
</div>


<div class="guided-notes">

### 3.5 Guided Notes: Write the code to fit the empty model to each class and save the models as `empty_model1` and `empty_model2`. Record the resulting parameter estimates.

</div>

In [None]:
# write R code



## 4 Comparing Error From the Condition Model and the Empty Model Across the Two Classes

We have seen that the model predictions are similar for the condition model across the two classes. Let's now look more closely at the error around the model predictions.

<div class="guided-notes">

### 4.1 Let's look again at the plots from above. Draw a few residuals from the empty model and the condition model for each class.

You can just pick a few points to draw; no need to do them all!
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-jitter-with-two-models.png" alt="the jitter plots of class1 and class2 with the best-fitting condition model and empty model for each class overlaid; ascending group mean is about 1300 and descending group mean is about 3400; class 2's data are more variable than class 1 even though they have the same pattern -- lower estimates in the ascending versus descending group">

<div class="discussion-question">

### 4.2 Discussion Questions: Looking at the two plots, does the condition model seem to fit one class better than the other (i.e., does it have less **error** in one class)? If so, which one? And how can you tell?
    
- Which class do you think would have a larger SSE (Sum of Squares Error)? Explain.
- Which class do you think would have a larger SST (Sum of Squares Total)? Explain.

</div>


### 4.3 Run `supernova()` on `condition_model1` and `condition_model2` to get the SSE and SST for the condition model for the two datasets
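
`supernova()` is available once `library(coursekata)` has been run. As a rough stand-in that needs no extra packages, the sketch below uses base R's `anova()` on a made-up data frame called `toy` (hypothetical values, not the class data). The base R table has no Total row, so SST is obtained here by adding up the `Sum Sq` column.

```r
# Hypothetical toy data (not the class data)
toy <- data.frame(
  condition = rep(c("ascending", "descending"), each = 3),
  guess = c(500, 1000, 1500, 2000, 3000, 4000)
)

condition_toy <- lm(guess ~ condition, data = toy)

# Base R stand-in for supernova(): one row per source of variation
ss_table <- anova(condition_toy)
ss_table

# SSE is the "Residuals" row; SST is the sum of all rows in the column
sse <- ss_table["Residuals", "Sum Sq"]  # 2,500,000 for this toy data
sst <- sum(ss_table[["Sum Sq"]])        # 8,500,000 for this toy data
```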

In [None]:
# code here



<div class="guided-notes">

### 4.4 Record the SSE and SST for the two classes' condition models in the table. Round the numbers to the nearest million.
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 50%"><code>guess ~ condition</code></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 1</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 2</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 60px;">
        <td style="border: 1px solid black; text-align: left"><b>SSM (Sum of Squares Model)</b><br>Error Reduced by the condition model</td>
      <td style="border: 1px solid black; text-align: left; width: 50%; vertical-align: top;"><i>don't fill in this cell yet</i></td>
      <td style="border: 1px solid black; text-align: left; vertical-align: top;"><i>don't fill in this cell yet</i></td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SSE (Sum of Squares Error)</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SST (Sum of Squares Total)</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
    </tr>
  </tbody>  
</table>

<div class="discussion-question">

### 4.5 Discussion Questions: 
- Why does Class 2 have both higher SST and SSE? 
- Why is SST higher than SSE in both classes?

</div>


## 5 Error Reduced and Sum of Squares Model (SSM)

When we fit the condition model, we are able to reduce the error from the empty model. We start with SST, which is the sum of squares from the empty model. Fitting the condition model then reduces that error by some amount, leaving us with SSE (the sum of squares from the condition model).
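
To make that sequence concrete, here is a minimal sketch on a made-up data frame called `toy` (hypothetical values, not the class data). It squares and sums the residuals from each model using base R's `resid()`, then subtracts to see how much error was reduced.

```r
# Hypothetical toy data (not the class data)
toy <- data.frame(
  condition = rep(c("ascending", "descending"), each = 3),
  guess = c(500, 1000, 1500, 2000, 3000, 4000)
)

empty_toy <- lm(guess ~ NULL, data = toy)
condition_toy <- lm(guess ~ condition, data = toy)

# SST: squared residuals from the empty model, summed
sst <- sum(resid(empty_toy)^2)       # 8,500,000 for this toy data

# SSE: squared residuals from the condition model, summed
sse <- sum(resid(condition_toy)^2)   # 2,500,000 for this toy data

# Error reduced by going from the empty model to the condition model
error_reduced <- sst - sse           # 6,000,000 for this toy data
```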

<div class="guided-notes">

### 5.1 Guided Notes: For each class, subtract SSE from SST to get the *sum of squares reduced* by fitting the condition model. Again rounding to the nearest million, enter your answers in the first row of the table.
</div>

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 50%"><code>guess ~ condition</code></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 1</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 2</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 60px;">
        <td style="border: 1px solid black; text-align: left"><b>SSM (Sum of Squares Model)</b><br>Error Reduced by the condition model</td>
      <td style="border: 1px solid black; text-align: center"> </td>
      <td style="border: 1px solid black; text-align: center"></td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SSE (Sum of Squares Error)</td>
      <td style="border: 1px solid black; text-align: center">179</td>
      <td style="border: 1px solid black; text-align: center;">654</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SST (Sum of Squares Total)</td>
      <td style="border: 1px solid black; text-align: center">246</td>
      <td style="border: 1px solid black; text-align: center;">723</td>
    </tr>
  </tbody>  
</table>

<div class="discussion-question">

### 5.2 Discussion Question: Check the numbers for Sum of Squares Model in the first row of the ANOVA tables above. Do they match the numbers you came up with (rounded to the nearest million)? Based on that calculation, what does SS Model represent? Why might a researcher want SS Model to be relatively large?

</div>


<div style="font-size: 18px; line-height: 1.4; border: 2px solid black; padding: 10px;">

The <b>Sum of Squares Model (SSM)</b> is the amount of error reduced when we go from the empty model to a more complex model (here, the condition model).

That’s why it’s calculated as:

<div style="font-size: 20pt; text-align: center; margin: 10px 0;"> $$\text{SS Model} = \text{SS Total} - \text{SS Error}$$</div>

Or, rearranged:

<div style="font-size: 20pt; text-align: center; margin: 10px 0;"> $$\text{SS Total} = \text{SS Model} + \text{SS Error}$$</div>

And thus we come back to our core idea:
    
<div style="font-size: 20pt; text-align: center; margin: 10px 0;"> $$\text{DATA} = \text{MODEL} + \text{ERROR}$$</div>

    
</div>


<div class="discussion-question">

### 5.3 Discussion Questions: Take a look at the sums of squares in your table. The SS Model looks pretty similar across both classes. What does that tell us? Does it mean the models are equally good at reducing error?

</div>


### 5.4 SS Total = SS Model + SS Error

In general, the **SS Total** represents all the variation in our outcome variable. That's the entire shaded circle in the figure.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-purple-venn-2.png" width = 40% align="right">Let’s bring back the Venn Diagram to see how all three sums of squares fit together.

When we add an explanatory variable (like `condition`) to our model, that variable might help explain some of the variation. So now, the variation is partitioned:

- The variation that the model explained (or reduced) is called **SS Model**
- The variation that is leftover is called **SS Error**

And just as each data point can be partitioned into model and error (**DATA = MODEL + ERROR**), the total sum of squares can be partitioned the same way:<br><br>

<div style="font-size: 20pt; text-align: center;">
$$\text{SS}_{\text{Total}} = \text{SS}_{\text{Model}} + \text{SS}_{\text{Error}}$$
</div>

<div class="guided-notes">

### 5.5 Guided Notes: Draw an outline around and label the regions of the Venn Diagram that represent SST, SSM, and SSE.

</div>


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-purple-venn-2.png" width=40%>

<div class="guided-notes">

### 5.6 Label the Venn Diagram for Class 1 with SST, SSM, and SSE. Also label these areas with the sums of squares from your table, rounded to the nearest million. 

</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 50%"><code>guess ~ condition</code></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 1</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 2</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 60px;">
        <td style="border: 1px solid black; text-align: left"><b>SSM (Sum of Squares Model)</b><br>Error Reduced by the condition model</td>
      <td style="border: 1px solid black; text-align: center">67</td>
      <td style="border: 1px solid black; text-align: center">69</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SSE (Sum of Squares Error)</td>
      <td style="border: 1px solid black; text-align: center">179</td>
      <td style="border: 1px solid black; text-align: center;">654</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SST (Sum of Squares Total)</td>
      <td style="border: 1px solid black; text-align: center">246</td>
      <td style="border: 1px solid black; text-align: center;">723</td>
    </tr>
  </tbody>  
</table>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-class1-venn.png" width = 100% align="left">

<div class="discussion-question">

### 5.7 Discussion Question: Look at the Venn diagram and ANOVA table provided for `class2`. What's different? What's similar? Why?

</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 50%"><code>guess ~ condition</code></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 1</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 2</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 60px;">
        <td style="border: 1px solid black; text-align: left"><b>SSM (Sum of Squares Model)</b><br>Error Reduced by the condition model</td>
      <td style="border: 1px solid black; text-align: center">67</td>
      <td style="border: 1px solid black; text-align: center">69</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SSE (Sum of Squares Error)</td>
      <td style="border: 1px solid black; text-align: center">179</td>
      <td style="border: 1px solid black; text-align: center;">654</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SST (Sum of Squares Total)</td>
      <td style="border: 1px solid black; text-align: center">246</td>
      <td style="border: 1px solid black; text-align: center;">723</td>
    </tr>
  </tbody>  
</table>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-both-venn.png" width = 100%>

<div class="guided-notes">

### 5.8 Label the Venn Diagram for Class 2 with SST, SSM, and SSE. Enter the sums of squares from the table above, rounded to the nearest million. 

</div>


<div class="discussion-question">

### 5.9 Discussion Question: What do these Venn diagrams show us about how to interpret SSM? Why does having a similar SSM value mean something different in Class 2 than in Class 1?

</div>


## 6 Calculating SSM Directly

SS Total and SS Error are both based on residuals, from either the empty model or the condition model. Both represent the variation left over (still unexplained) after fitting a model.

SS Model represents the variation *explained* by the model. As we have seen, we can calculate SSM by subtracting SSE from SST (*SST - SSE*). It really is the *reduction in error* from the empty model to the condition model. **But we can also calculate SSM directly by squaring and summing the differences in predictions between the two models** across all data points.
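
Here is a minimal sketch of both routes to SSM, using a made-up data frame called `toy` (hypothetical values, not the class data). Base R's `fitted()` returns each model's prediction for every row, so the direct calculation is just the squared differences between the two sets of predictions, summed.

```r
# Hypothetical toy data (not the class data)
toy <- data.frame(
  condition = rep(c("ascending", "descending"), each = 3),
  guess = c(500, 1000, 1500, 2000, 3000, 4000)
)

empty_toy <- lm(guess ~ NULL, data = toy)
condition_toy <- lm(guess ~ condition, data = toy)

# Route 1: difference between the two models' predictions, squared and summed
pred_empty <- fitted(empty_toy)          # grand mean for every row
pred_condition <- fitted(condition_toy)  # group mean for each row
ssm_direct <- sum((pred_condition - pred_empty)^2)

# Route 2: SST minus SSE
ssm_subtract <- sum(resid(empty_toy)^2) - sum(resid(condition_toy)^2)

# The two routes agree (6,000,000 for this toy data)
all.equal(ssm_direct, ssm_subtract)
```

In this toy data every row's prediction moves by 1000 in one direction or the other, so each of the six squared differences is 1000 squared.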


<div class="guided-notes">

### 6.1 Draw each model's prediction. Then draw the difference between the two model predictions.

We’ve highlighted a data point from `class2` in the jitter plot below.
- Draw one dot to show the **empty model’s prediction** for this student.
- Draw another dot to show the **condition model’s prediction** for this student.
- Then draw the **difference** between them. This is the **error reduced** by using the condition model instead of the empty model.

Note that these differences are not called "residuals", because they’re not comparing data to a model prediction. They’re comparing one model’s prediction to another. 

</div>


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-chosen.png" alt="the jitter plots of class1 and class2 with the best-fitting condition model and empty model for each class overlaid; there is a data point in the descending group that is a relatively high guess chosen in each; the prediction for the empty model and condition model for that data point are depicted as dots; there is a vertical line drawn between them to show the difference" width = 60%>



<div class="guided-notes">

### 6.2 Draw the squared difference between predictions

Draw a square to represent the squared difference between the two model predictions you drew in 6.1. The sum of these squared differences across every data point is SS Model.

</div>


<div class="discussion-question">

### 6.3 Discussion Questions: SS Model is the sum of many squared differences between the empty-model and condition-model predictions. How many such squares would you need for Class 2's SS Model? Would all those squares be the same size or different?

</div>


<div class="discussion-question">

### 6.4 Discussion Question: We saw earlier that SSM is similar across the two classes. Referring to the graphs below, can you explain why this is?

</div>


<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-ss-model-square.png" alt="square that makes up SS Model">

## 7 *PRE* (Proportional Reduction in Error)

Even though SSM tells us how much total error (measured as sums of squares) is reduced by a model, it may not be the best way to assess how good the model is. The condition models for Class 1 and Class 2 explained almost the same amount of variation, as measured by the reduction in sums of squares. But it's clear that the Class 1 model explains a larger *proportion* of the total variation.

In general, *PRE* (Proportional Reduction in Error) is a better way to assess how well a model fits the data. Rather than measuring success in raw sums of squares, it measures success as the **proportion** of total variation that is explained by the model. Here are two different formulas for calculating *PRE*. Both yield the same result. The formula on the right is a reminder that SSM represents the **reduction in error** from the empty model to the condition model.<br>

<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="background:#fff !important; text-align:center; font-size:20px; padding:30px; border:1px solid #000;">
      <div style="font-size: 20pt; text-align: center;">
$$\text{PRE} = \frac{\text{SS}_{\text{Model}}}{\text{SS}_{\text{Total}}}$$
</div>
    </th>
    <th style="background:#fff !important; text-align:center; font-size:20px; padding:30px; border:1px solid #000;">
      <div style="font-size: 20pt; text-align: center;">
$$\text{PRE} = \frac{\text{SS}_{\text{Total}} - \text{SS}_{\text{Error}}}{\text{SS}_{\text{Total}}}$$
</div>
    </th>
  </tr>
</table>
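
The sketch below checks that the two formulas agree, using two made-up data frames, `toy1` and `toy2` (hypothetical values, not the class data). Both have the same group means, but `toy2` is more spread out. The helper `pre_two_ways()` is an illustrative function written for this example, not part of any package.

```r
# Hypothetical toy data: same group means (1000 and 3000), different spread
toy1 <- data.frame(
  condition = rep(c("ascending", "descending"), each = 3),
  guess = c(500, 1000, 1500, 2000, 3000, 4000)
)
toy2 <- data.frame(
  condition = rep(c("ascending", "descending"), each = 3),
  guess = c(0, 1000, 2000, 1000, 3000, 5000)
)

# Illustrative helper: compute PRE with each of the two formulas
pre_two_ways <- function(data) {
  empty <- lm(guess ~ NULL, data = data)
  condition <- lm(guess ~ condition, data = data)
  sst <- sum(resid(empty)^2)
  sse <- sum(resid(condition)^2)
  ssm <- sum((fitted(condition) - fitted(empty))^2)
  c(ssm_over_sst = ssm / sst, reduction_over_sst = (sst - sse) / sst)
}

pre1 <- pre_two_ways(toy1)  # both entries are 6 / 8.5, about 0.71
pre2 <- pre_two_ways(toy2)  # both entries are 6 / 16 = 0.375
pre1
pre2
```

SSM is 6,000,000 in both toy data sets, yet PRE differs because SST is 8,500,000 in `toy1` and 16,000,000 in `toy2`.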

<div class="guided-notes">

### 7.1 Estimate PRE for each class's condition model and write it on the Venn diagrams

- Use the table of SSM, SSE, and SST to estimate PRE for `class1` and `class2`. Write the PRE onto the Venn diagrams below.
</div>



<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 50%"><code>guess ~ condition</code></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 1</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width: 25%">Class 2</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 60px;">
        <td style="border: 1px solid black; text-align: left"><b>SSM (Sum of Squares Model)</b><br>Error Reduced by the condition model</td>
      <td style="border: 1px solid black; text-align: center">67</td>
      <td style="border: 1px solid black; text-align: center">69</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SSE (Sum of Squares Error)</td>
      <td style="border: 1px solid black; text-align: center">179</td>
      <td style="border: 1px solid black; text-align: center;">654</td>
    </tr>
    <tr style="height: 60px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">SST (Sum of Squares Total)</td>
      <td style="border: 1px solid black; text-align: center">246</td>
      <td style="border: 1px solid black; text-align: center;">723</td>
    </tr>
  </tbody>  
</table>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/7.9-7.10-overview-both-venn.png" width = 100%>

### 7.2 *PRE* is also included in the `supernova()` tables. Run code to get those tables again. Check those PRE values against the ones you calculated.

We have already saved `condition_model1` and `condition_model2` for you.

In [None]:
# write code




<div class="discussion-question">

### 7.3 Discussion Question: What do these values of PRE mean?  

Note: You can also think of PRE as a percentage (e.g., 0.27 as 27% or 0.095 as 9.5%).

</div>


<div class="discussion-question">

### 7.4 Discussion Question: The condition models for `class1` and `class2` were almost identical. Why did the same model explain a much larger proportion of the error in one class than in the other?
    
You may want to refer to the Venn diagram to help you explain this.
</div>
