# Degrees of Freedom, *F*, and the Dangers of Over-Fitting

## Chapter 8.4 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)
library(gridExtra)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# create the sleep dataset
satisfaction <- c(5.4,4.9,5.1,4.6,5.3,7.2,6.4,6.3,6.2,5.2,7.2,4.7,5.2,6.1,6,7.4,6.7,3.7,2.4,5.3,8,7,3.7,3.5,7.4,6.5,3.8,5.7,4.9,4.5,5.7,6.8,5.5,7,5.3,7.1,6,5.1,6.3,7.1,7.2,4.7,6.6,3.8,5.9,5.3,6.4,4.6,4,4.5,4.7,8.6,5.3,4.2,4.1,6,6.1,7.8,4,3.4)
sleep_quality <- c(5.7,2.7,4.5,7,7.9,5.8,8.4,5.9,5.6,4.9,6.3,4.7,6.1,6.1,3.5,6.9,6.5,4.1,4.3,3.1,6,7.1,3.8,6.6,8.2,4.7,5.9,6.1,6.5,4.8,3.8,6.7,6.1,4.9,9.1,4.7,5.7,4.8,5.5,6.7,7.3,3.4,5.8,2.9,4.5,5,8.2,5.1,5.4,5,5.6,7,5.4,3.1,4.8,5.6,6.8,6.4,7,4)
sleep <- data.frame(satisfaction, sleep_quality)

## 1 Is Reducing Error Our Only Goal?

We have made the point throughout this course that an important goal of statistical modeling is **reducing error**. But this is not our only goal: we are also trying to understand what *causes* variation in data, that is, the Data Generating Process (DGP). In this notebook we introduce the idea that not all reductions in error (as measured, for example, by **PRE**) will help us understand the DGP. In particular, we will introduce the concept of *over-fitting*.

## 2 Life Satisfaction and Sleep Quality

There are many studies that investigate the relationship between sleep quality and life satisfaction. (See, for example, <a href="https://www.nature.com/articles/s41598-025-16017-1?utm_source=chatgpt.com">this study</a>.) Both variables are typically measured using self-ratings, similar to the ratings of happiness we analyzed in the previous notebook.

In this notebook, we analyze the `sleep` dataset. This is a small dataset containing data from 60 college students, each of whom was asked to rate both their life satisfaction and their sleep quality on a 10-point scale. The dataset has two variables:

- `satisfaction` (1=extremely unsatisfied, 10=extremely satisfied)
- `sleep_quality` (1=extremely poor, 10=extremely good)

### 2.1 Take an initial look at the data

Write some code to look at what's in the dataset and to examine the distributions of the two variables one at a time.


In [None]:
# code here



### 2.2 Write a word equation to predict life `satisfaction` from `sleep_quality`

Why do you think people might entertain this hypothesis? 

Word equation:

Explanation:


### 2.3 Make a graph to explore the relationship between `satisfaction` and `sleep_quality`

In [None]:
# code here


<div class="discussion-question">

### 2.4 Discussion Questions: Based on the graph, do you think sleep quality explains some of the variation in life satisfaction? How can you tell?
    
</div>

## 3 Modeling the Relationship with a Three-Group Model

In the next chapter we will model this relationship with a model in which the X variable is quantitative (called a regression model). But for the purposes of this notebook, let's turn `sleep_quality` into a three-group predictor variable, which we will use to predict variation in `satisfaction`. For this we can use the `ntile()` function.

The `ntile()` function takes a quantitative variable, in this case `sleep_quality`, and turns it into a grouping variable that defines three groups of equal size. It works by sorting all 60 values of `sleep_quality`, then labeling the lowest 20 as 1, the next 20 as 2, and the highest 20 as 3. In the code cell below, we've put the code to create this variable in the `sleep` dataset and call it `sleep3`.

We also added a `factor()` function around the `ntile()` to tell R that the 1, 2, and 3 are not quantitative measures but group labels.
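
To make the sort-and-label step concrete, here is a minimal sketch on a made-up vector of six ratings. The name `toy_quality` is hypothetical and not part of the `sleep` data, and the sketch assumes the `dplyr` package (which provides `ntile()`) is installed.

```r
# Toy illustration of how ntile() assigns group labels.
# `toy_quality` is a made-up vector, not part of the sleep dataset.
library(dplyr)  # provides ntile()

toy_quality <- c(5.7, 2.7, 4.5, 7.0, 7.9, 5.8)

# Rank the six values, then label the lowest two 1, the middle two 2,
# and the highest two 3 (labels are returned in the original row order)
toy_groups <- ntile(toy_quality, 3)
toy_groups  # 2 1 1 3 3 2

# Wrapping the result in factor() marks the labels as categories, not quantities
toy_factor <- factor(toy_groups)
levels(toy_factor)  # "1" "2" "3"
```

Notice that the labels depend only on the rank order of the values, not on how far apart the values are.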


<div class="guided-notes">
    
### 3.1 Circle and label (with A, B, or C) the parts of the code that we use to create `sleep3`:

- A. takes a quantitative variable and defines 3 groups of equal sizes based on it   
- B. takes the numbers 1, 2, 3 and tells R that these are group labels (not quantities 1, 2, 3)
- C. saves the result into a variable called `sleep3` in the data frame `sleep`

When you print out `sleep3`, what values will you see?

</div>

In [None]:
# run this code to create the grouping variable sleep3
sleep$sleep3 <- factor(ntile(sleep$sleep_quality, 3))

# write code to print out sleep3



<div class="guided-notes">

### 3.2 Here is a scatter plot of `satisfaction` by `sleep_quality` (the quantitative variable). What will happen if we color the dots by `sleep3`?  

Circle on the graph the dots that you think will have the same color.
    
</div>

In [None]:
# modify this graph to color the dots by their sleep category
gf_point(satisfaction ~ sleep_quality, data = sleep)



### 3.3 Graph `satisfaction` as a function of `sleep3` (instead of `sleep_quality`)

In [None]:
# code here



<div class="discussion-question">

### 3.4 Discussion Question: Does this graph look the way you would expect based on the earlier graph of `satisfaction` by `sleep_quality`? What's different about this graph?

</div>

### 3.5 Create a three-group model of `satisfaction`, save it as `sleep3_model`, and print out the parameter estimates

In [None]:
# code here



<div class="guided-notes">

### 3.6 Re-write the three-group model substituting in the parameter estimates and variable names from the R output:
    
<span style="font-size: 20px">$$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$$</span>  
    
</div>


<div class="guided-notes">

### 3.7 Based on the parameter estimates, fill in the empty cells in the table. 
    
</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: left;" width = 12%><code>sleep3</code> Group</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 5%>$X_{1i}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 5%>$X_{2i}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 23%>Function</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;" width = 13%>Prediction</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;">Interpretation of Prediction</td>
    </tr>
  </thead>
  <tbody>
    <tr style="height: 80px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">sleep3 = 1</td>
      <td style="border: 1px solid black; text-align: left; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: left; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;">$b_0+b_1(\text{__})+b_2(\text{__})$</td>
      <td style="border: 1px solid black; text-align: center;">$b_0=4.8$</td>
      <td style="border: 1px solid black; text-align: center;"> </td>
    </tr>
    <tr style="height: 80px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">sleep3 = 2</td>
        <td style="border: 1px solid black; font-weight: bold; text-align: center"> </td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"> </td>
      <td style="border: 1px solid black; text-align: center;">$b_0+b_1(\text{__})+b_2(\text{__})$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"> </td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
    </tr>
    <tr style="height: 80px;">
      <td style="border: 1px solid black; font-weight: bold; text-align: left">sleep3 = 3</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;">$b_0+b_1(\text{__})+b_2(\text{__})$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center;"></td>
      <td style="border: 1px solid black; text-align: center;"> </td>
    </tr>
  </tbody>  
</table>

<div class="guided-notes">

### 3.8 Based on the predictions of the `sleep3_model`, draw in the model on the scatter plot provided.
    
</div>


In [None]:
# run this code
gf_point(satisfaction ~ sleep3, data = sleep)

### 3.9 Write R code to overlay the `sleep3_model` on the graph of `satisfaction` by `sleep3`
Set the color of the model to "red".

In [None]:
# code here



<div class="discussion-question">

### 3.10 Discussion Question: How does overlaying the model help you see the relationship between `satisfaction` and `sleep3`?
    
</div>

### 3.11 In the cell below, generate the `supernova()` table for the `sleep3_model`

In [None]:
# code here



<div class="discussion-question">

### 3.12 Discussion Question: How well does the `sleep3_model` fit the data?
    
</div>

## 4 How Does Adding Parameters to the Model Affect PRE?

The three-group model has three parameter estimates ($b_0, b_1, b_2$): the mean of group 1, the adjustment to get to group 2, and the adjustment to get to group 3, respectively.
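
As a rough check on this interpretation, the sketch below fits a three-group model to a tiny made-up dataset using base R's `lm()`. The data frame `toy` and its columns `score` and `group` are hypothetical names chosen for illustration; they are not part of the `sleep` data.

```r
# Toy data (hypothetical): six scores in three groups of two
toy <- data.frame(
  score = c(4, 6, 5, 7, 9, 11),
  group = factor(c(1, 1, 2, 2, 3, 3))
)

# Fit the three-group model and pull out the three parameter estimates
toy_model <- lm(score ~ group, data = toy)
b <- unname(coef(toy_model))

# Group means computed directly from the data: 5, 6, 10
group_means <- tapply(toy$score, toy$group, mean)

b[1]          # b0: the mean of group 1
b[1] + b[2]   # b0 + b1: the mean of group 2
b[1] + b[3]   # b0 + b2: the mean of group 3
```

In this toy example the estimates are 5, 1, and 5, so the model's three predictions reproduce the three group means.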

What if we add parameters to the model? For example, a five-group model would have five parameters; a 10-group model, 10. How would adding more parameters affect the fit of the model? Let's do some experiments to find out.

### 4.1 Use the `ntile()` function to create a few more grouping variables for 5, 10, and 15 groups

First let's create the new grouping variables using the approach we used earlier to create `sleep3`. In addition to `sleep3`, make variables that divide the students by `sleep_quality` into 5, 10, and 15 groups.

In [None]:
# modify this code to create the new grouping variables
sleep$sleep3 <- factor(ntile(sleep$sleep_quality, 3))


### 4.2 Create new models of `satisfaction` that use each new grouping variable as the predictor
Modify the code in the cell below to create the new models. Note that as we increase the number of groups, we are increasing the number of model parameters by the same number.

In [None]:
# modify the code to create the new models
sleep3_model <- lm(satisfaction ~ sleep3, data = sleep)



### 4.3 Get the `supernova` tables for all 4 models (the new ones and `sleep3_model`)

In [None]:
# code here



<div class="guided-notes">

### 4.4 Enter the PRE for each model into the table
    
</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">PRE</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">$\text{df}_\text{model}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">$\text{df}_\text{error}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%"><i>F</i></td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep3_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
       <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep5_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep10_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep15_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
  </tbody>  
</table>

<div class="discussion-question">

### 4.5 Discussion Questions: What happens to PRE as the number of groups goes up? Why do you think this is?
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/8.4-8.5-5-v-15.png">

<div class="discussion-question">

### 4.6 This dataset has 60 students in it. What would the model be like if we divided the sample into 60 groups? Would that have a high PRE? Would that be a useful model? Explain your answer.
    
</div>

<div style="font-size: 18px; line-height: 1.4; border: 2px solid black; padding: 10px;">

**Overfitting** happens when a model is too complex for the data. It captures random noise and quirks of the sample rather than the underlying relationship. Overfit models have very high PRE (even perfect fit) on the data they were built from, but they perform poorly when predicting new data.  

</div>
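
The simulation below is a minimal sketch of this idea, not an analysis of the `sleep` data. It assumes a made-up DGP in which the outcome is pure noise, so the predictor has no real relationship to it. The helpers `make_groups()` and `pre()` are hypothetical names written for this sketch: the first is a base-R stand-in for `ntile()` that works when the sample size is divisible by the number of groups, and the second computes PRE relative to the mean of whichever sample is being predicted.

```r
# Simulated sketch: y is pure noise, so x has NO real relationship to y
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible
n <- 60
x <- rnorm(n)
y <- rnorm(n)      # the sample used to fit the models
y_new <- rnorm(n)  # a fresh sample from the same DGP, for the same x values

# Base-R stand-in for ntile() when n is divisible by k:
# rank the values, then cut the ranks into k equal-sized blocks
make_groups <- function(x, k) {
  factor((rank(x, ties.method = "first") - 1) %/% (length(x) / k) + 1)
}

# PRE of a set of predictions, relative to the empty model (the mean)
pre <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

results <- sapply(c(3, 15, 60), function(k) {
  g <- make_groups(x, k)
  predicted <- ave(y, g)  # each group's mean is the group model's prediction
  c(groups = k,
    PRE_fitted = pre(y, predicted),     # fit to the data the model was built from
    PRE_new = pre(y_new, predicted))    # fit to new data
})
round(t(results), 2)
```

Because the 15 groups subdivide the 3 groups, `PRE_fitted` cannot go down as groups are added, and with 60 groups (one per observation) it is exactly 1. `PRE_new` has no such guarantee; in a noise-only setup like this one it will usually be close to zero or negative.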

## 5 PRE, Degrees of Freedom, and the *F* Ratio

As we add more parameters, PRE will typically go up. But this can lead to overfitting, because PRE only tells us how much error is reduced without accounting for the *cost* of that reduction.

The *F* ratio provides a solution to this problem: it indicates how much error a model reduces, adjusted for the number of parameters it takes to achieve that reduction.

### 5.1 Degrees of freedom is like a budget

We measure the cost of a model using **degrees of freedom (df)**, which you can think of as a budget. With a sample of 60 students, we start with 60 df. Fitting the empty model (with one parameter estimate, $b_0$) uses up 1 df, leaving 59. As we add parameters, we spend more of this budget, leaving fewer degrees of freedom to ensure reliable estimates of group means.
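
The budget arithmetic can be sketched in a few lines of R. The numbers here are hypothetical (a sample of 20 and a four-group model), chosen so they do not match any model in this notebook.

```r
# Hypothetical example: 20 observations, a model with 4 groups
n <- 20
n_groups <- 4

df_total <- n - 1               # left after the empty model spends 1 df on b0
df_model <- n_groups - 1        # parameters added beyond b0 (b1, b2, b3)
df_error <- df_total - df_model # what remains in the budget

c(df_total = df_total, df_model = df_model, df_error = df_error)
# df_total = 19, df_model = 3, df_error = 16
```

Each additional group moves one degree of freedom from the error side of the budget to the model side, while the total stays fixed.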

<div class="guided-notes">
    
### 5.2 Look at the bottom row of the `supernova` table for the `sleep5_model`. Why do we have a df of 59? Write an explanation.
    
</div>

In [None]:
# run this cell to get a supernova table
supernova(sleep5_model)

<div class="guided-notes">

### 5.3 Re-inspect the four ANOVA tables for the four different models. Write the df for model (first row of the `supernova` output) and error (the second row) on your guided notes in the columns after PRE.
    
</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">PRE</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">$\text{df}_\text{model}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">$\text{df}_\text{error}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%"><i>F</i></td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep3_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.21</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
       <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep5_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.27</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep10_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.31</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep15_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.42</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
  </tbody>  
</table>

In [None]:
# run this code
supernova(sleep3_model)
supernova(sleep5_model)
supernova(sleep10_model)
supernova(sleep15_model)

<div class="discussion-question">

### 5.4 Why is `df Model = 2` for the `sleep3_model`?  Why is the `df Error = 57`?

Consider whether degrees of freedom might also embody the pattern DATA = MODEL + ERROR.
</div>

<div class="guided-notes">

### 5.5 What do you think is the meaning of df Model and df Error? Explain how they each relate to the parameters estimated. 

</div>


### 5.6 The *F* ratio

The *F* ratio is a way of asking: does the increase in PRE justify the degrees of freedom we spent to get it? **A high *F* ratio suggests that the model improves PRE enough to justify its added complexity.**

It is typically written as a ratio of two variance estimates, the mean squares (MS), each of which already takes degrees of freedom into account ($\text{MS} = \text{SS}/\text{df}$).

<span style="font-size: 20px">$$F = \frac{\text{MS}_\text{model}}{\text{MS}_\text{error}}$$</span>

But it’s easier to see the connection between *F* and PRE using this mathematically equivalent formula:

<span style="font-size: 20px">$$F = \frac{\text{PRE}/\text{df}_\text{model}}{(1-\text{PRE})/\text{df}_\text{error}}$$</span>

- The numerator shows how much PRE we gained per df spent on the model. 
- The denominator shows how much error remains per df left over.

A high *F* ratio means the model reduces more error per degree of freedom spent than we would expect from adding arbitrary parameters that have no real relationship to the outcome.
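
The two formulas agree because dividing both $\text{SS}_\text{model}$ and $\text{SS}_\text{error}$ by $\text{SS}_\text{total}$ turns them into PRE and $1-\text{PRE}$. The sketch below checks this numerically on a tiny made-up dataset using base R only; `toy`, `score`, and `group` are hypothetical names, not part of the `sleep` data.

```r
# Toy data (hypothetical): six scores in three groups of two
toy <- data.frame(
  score = c(4, 6, 5, 7, 9, 11),
  group = factor(c(1, 1, 2, 2, 3, 3))
)
toy_model <- lm(score ~ group, data = toy)

# Sums of squares: DATA = MODEL + ERROR
ss_total <- sum((toy$score - mean(toy$score))^2)
ss_error <- sum(resid(toy_model)^2)
ss_model <- ss_total - ss_error

# Degrees of freedom spent on the model and left for error
df_model <- length(coef(toy_model)) - 1
df_error <- df.residual(toy_model)

pre <- ss_model / ss_total

# F computed three ways
f_from_ms  <- (ss_model / df_model) / (ss_error / df_error)
f_from_pre <- (pre / df_model) / ((1 - pre) / df_error)
f_from_lm  <- unname(summary(toy_model)$fstatistic["value"])

c(f_from_ms, f_from_pre, f_from_lm)  # all three agree: 7
```

Here PRE is 28/34 (about 0.82), and all three calculations give *F* = 7.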

<div class="guided-notes">

### 5.7 Go back to the table above and write the *F* ratio for each model in the last column of the table in your guided notes.
    
</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">Model</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">PRE</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">$\text{df}_\text{model}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%">$\text{df}_\text{error}$</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%"><i>F</i></td>
    </tr>
  </thead>
  <tbody>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep3_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.21</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">2</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">57</td>
       <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep5_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.27</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">4</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">55</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep10_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.31</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">9</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">50</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
    <tr>
        <td style="border: 1px solid black; text-align: left; vertical-align: top;"><code>sleep15_model</code></td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">0.42</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">14</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;">45</td>
      <td style="border: 1px solid black; text-align: center; vertical-align: center;"></td>
    </tr>
  </tbody>  
</table>

In [None]:
# run this code
supernova(sleep3_model)
supernova(sleep5_model)
supernova(sleep10_model)
supernova(sleep15_model)

<div class="discussion-question">

### 5.8 According to the *F* ratio, which models are most worth the degrees of freedom spent? What do you notice about the relationship between PRE, df, and *F*? 
</div>

<div class="guided-notes">

### 5.9 What are the pros and cons of using *PRE* or *F* to evaluate how well a model reduces error?

</div>


<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse; width: 100%; text-align: left;">
  <thead>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:20%"> </td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%">PRE</td>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:40%"><i>F</i> ratio</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:50%">Pros</td>
      <td style="border: 1px solid black; vertical-align: top; text-align: left; padding: 8px;">      </td>
      <td style="border: 1px solid black; vertical-align: top; text-align: left; padding: 8px;">      </td>
    </tr>
    <tr>
      <td style="border: 1px solid black; font-weight: bold; text-align: center; width:50%">Cons</td>
      <td style="border: 1px solid black; vertical-align: top; text-align: left; padding: 8px;">      </td>
      <td style="border: 1px solid black; vertical-align: top; text-align: left; padding: 8px;">      </td>
    </tr>
  </tbody>  
</table>
