## Define the functions gf_resid() and gf_squarederror() 

We are currently working on getting these added to the coursekata R package. https://github.com/coursekata/coursekata-r/pull/5

# The Empty Model With R (COMPLETE)

## Chapter 5.5-5.7 Overview Notebook

In [None]:
# run this to set up the notebook
# install the pinned version first so that library() loads it
remotes::install_version("coursekata", "0.19.0", upgrade="never")
library(coursekata)
# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))
# load data
lungs <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSz3JaTYY0RXoLQU-o1S45MDKudy6QfIQDtGLGoiy54JOydB7iELKxGbPqo_0uEvKUGCpZ_UQxu4PgM/pub?gid=1447217380&single=true&output=csv")
FEV <- c(3.0,3.2,3.8,4.6,3.8,2.3,4.8,4.1)
row <- c(1:8)
FEV_data <- data.frame(row, FEV)

<div class="teacher-note">
<b>Section Goals:</b> In this section students learn to use R to fit the empty model, generate model predictions, and generate the residuals (errors) from the model predictions. 

- Students will learn to use the lm() function in R to fit the empty model
- Students will learn to use the predict() function to generate predictions using the empty model for each row in a dataset, and recognize that the empty model generates the same prediction for every row.
- Students will learn to use the resid() function to generate a residual for each row in a dataset.
- Students will be able to plot the predictions of the empty model onto various graphs (histograms, scatter plots and jitter plots). They will also be able to recognize that the distance between the prediction and the data point in scatter and jitter plots represents the residuals.
- Students will understand that just as DATA = MEAN + ERROR, each individual data point can be thought of as having two components: the model prediction and the error, which, when added together, will equal the original value for that data point.
- Students will explore what it means to plot the residuals in a histogram vs. plotting the original distribution and notice that what changes is the center (now at 0) but not the shape or spread.

A <a href="https://docs.google.com/document/d/1QUu7khjJGwGSdo42QDtnhyYFo51U1yZozG-UhE_2yhY/edit?tab=t.5y2a0ykmi2fk" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
</div>

## 1 Using R to Fit the Empty Model

In the last notebook, we introduced the idea of a statistical model as a *function we can use to make predictions*. We focused on the simplest model, which we call the empty model. The empty model does not use any explanatory variables; it simply predicts the outcome by using the mean.

We’ve already fit this model by just calculating the mean, and written it like this:

**FEV = mean + error**

Now we’re going to do the same thing using R. 

### 1.1 Let's start with the 8 students in `FEV_data` from last time. Before modeling it with the empty model, let's take a look at the data in a visualization. What does each dot represent?

In [None]:
# run this code
gf_point(FEV ~ row, data=FEV_data)

<div class="discussion-question">
    
### 1.2 Discussion (Review): What does it mean to fit a model? What does it mean for a model to be the best fit for the data?
    
</div>

<div class="teacher-note">

**Sample responses:**
    
- *To fit the model* means finding the specific version of the model based on data (e.g., calculating the mean).  
- *The best-fitting model* is the version that minimizes and balances error. That is, this version of the model produces predictions that are closer to the actual data than other versions.
    
</div>
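
A minimal sketch of what "minimizes and balances error" can look like in numbers, using the same eight FEV values. It assumes that error is measured as the sum of squared residuals. The helper `error_summary()` and the candidate values 3.5 and 4.0 are hypothetical and are used only for illustration.

```r
# the eight FEV values used throughout this notebook
FEV <- c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)

# hypothetical helper: summarize the error for one candidate prediction
error_summary <- function(candidate, y) {
  residuals <- y - candidate
  c(candidate = candidate,
    sum_of_residuals = sum(residuals),    # "balance": 0 when errors cancel out
    sum_of_squares = sum(residuals^2))    # "size" of the error
}

# compare the mean with two other candidate predictions
candidates <- c(3.5, mean(FEV), 4.0)
comparison <- t(sapply(candidates, error_summary, y = FEV))
round(comparison, 2)
```

In the printed table, the row for the mean should have a sum of residuals of 0 and the smallest sum of squares of the three candidates.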

### 1.3 About the `lm()` function in R

In this section we are going to introduce a new function we can use to fit the empty model: `lm()`. We already know how to use the `favstats()` function to find the mean; we did it before and got **3.7 liters**. So why should we learn a new function for fitting the empty model?

The reason is that `lm()` (which stands for "linear model") doesn’t just fit the empty model. It also fits more complex models we will learn about later in this course.

The general structure is:

`lm(outcome ~ explanatory, data=dataframe)`

In the case of the empty model, there is no explanatory variable. So we use the special R word `NULL` in its place (all uppercase).

<div class="guided-notes">
    
### 1.4 Write code that uses the `lm()` function to fit the empty model of `FEV` in `FEV_data`.
    
</div>

In [None]:
# sample code
lm(FEV ~ NULL, data = FEV_data)

<div class="teacher-note">

**Teacher Note:** Point out to the students that the 3.7 they get as the coefficient labeled "Intercept" is, in fact, the mean of `FEV`. You can run `favstats()` in the cell above to corroborate that fact. 
</div>
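
One way to corroborate the point in the teacher note is to compare the coefficient from `lm()` with the mean directly. This is a sketch that uses base R's `mean()` rather than `favstats()`, so it runs without any extra packages. The object name `intercept` is an assumption made for this example.

```r
# rebuild the small data frame so this cell runs on its own
FEV_data <- data.frame(
  row = 1:8,
  FEV = c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)
)

# fit the empty model and pull out its single parameter estimate
empty_model <- lm(FEV ~ NULL, data = FEV_data)
intercept <- coef(empty_model)[["(Intercept)"]]

# compare the estimate with the mean of FEV
intercept
mean(FEV_data$FEV)
all.equal(intercept, mean(FEV_data$FEV))
```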

In [None]:
# run this code
gf_point(FEV ~ row, data=FEV_data)

<div class="guided-notes">
    
### 1.5 Draw in your approximation of the best-fitting empty model onto the graph.
    
</div>

<div class="discussion-question">
    <h3>1.6 Discussion Question: The output of the <code>lm()</code> function is called a parameter estimate. (Later, for more complex models, there will be more than one parameter estimate.) Why do you think the parameter estimate for the empty model (3.7) is labeled "Intercept" in the output?</h3>
</div>

<div class="teacher-note">

<b>Sample Response:</b> 
- The term "intercept" (if you remember from algebra) is the value of y when x = 0. 

<b>Teacher Note:</b>
- Refer back to the graph where students drew or saw the model line. Help them see that the intercept (3.7) is the value of y when x = 0. Because there is only one model prediction in the empty model, it is represented in the scatter plot as a horizontal line at the mean of y.
- Students often confuse the vertical border of a graph with the y-axis. This is a good moment to reteach that the **y-axis** is defined as the vertical line where **x = 0**. In the graph below, we’ve highlighted this by plotting a red dot at the point where the empty model (the mean line) crosses the actual y-axis. 

- Students may ask why we call the output of `lm()` a "parameter estimate". For now, it is just the mean, what R gets when it fits the empty model. Later, as we fit more complex models and distinguish between sample and population, we will give a fuller explanation of the term.
</div>


In [None]:
# to clarify "y-axis" and "y-intercept"

gf_point(FEV ~ row, data=FEV_data) %>%
  gf_hline(yintercept = 3.7) %>%
  gf_point(3.7 ~ 0, color = "red") 

<div class="guided-notes">

### 1.7 The model we have specified is **FEV = mean + error**. How would you write this model after you have fit it using `lm()`?
</div>
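
A possible way to build the fitted version of the word equation from the model itself, rather than typing the number by hand. This is only a sketch; the names `b0` and `fitted_equation` are assumptions made for this example.

```r
# rebuild the small data frame and fit the empty model
FEV_data <- data.frame(
  row = 1:8,
  FEV = c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)
)
empty_model <- lm(FEV ~ NULL, data = FEV_data)

# replace the word "mean" with the parameter estimate from lm()
b0 <- coef(empty_model)[["(Intercept)"]]
fitted_equation <- paste0("FEV = ", round(b0, 1), " + error")
fitted_equation
```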

## 2 Visualizing the Empty Model Using R

Now that we’ve fit the empty model using `lm()`, let’s visualize it on the graph using R. To do this, we’ll save the model and then use a new function called `gf_model()` to overlay the model’s predictions on our scatterplot.

<div class="guided-notes">

### 2.1 Write the code to fit the empty model of `FEV` and save it as an R object called `empty_model`.
    
Try running the code as a class. Why doesn't it show any output?
</div>

In [None]:
# code here
empty_model <- lm(FEV ~ NULL, data = FEV_data)

# saving a model prints nothing; to see the parameter estimates,
# type the name of the saved model on its own line
empty_model

### 2.2 About the `gf_model()` function

The `gf_model()` function adds the model’s predictions to a plot. For the empty model, this will show a horizontal line at the mean. We’ll also use this function with more complex models later in the course.

<div class="guided-notes">

### 2.3 Modify the code to overlay the empty model on the scatter plot of FEV by row. 

Does the line appear where you expected it to based on the model's intercept? Does this match the version you drew by hand?
</div>

In [None]:
# modify
gf_point(FEV ~ row, data = FEV_data) %>%
  gf_model(empty_model)

## 3 Calculating and Visualizing Predictions in R

<div class="discussion-question">
    <h3>3.1 Discussion Question: What does it mean to make a prediction with a model?</h3>
</div>

<div class="teacher-note">

<b>Teacher Note:</b> This is a good opportunity to help students connect prediction to everyday reasoning. Emphasize that using a model to make a prediction means generating a **specific numerical guess** for the outcome variable based on the fitted model.

For the empty model, we’re using no additional information, so the best guess we can make is the **mean**. That means we’d predict the same value (e.g., 3.7) for every person in the dataset: row 3, row 5, row 8, etc.
</div>

### 3.2 What do you think the `predict()` function might do? Make a guess, then run the code below to find out. 

Once we have saved the `empty_model`, we can use it to make predictions. But what exactly will R return when we do that?

We have 8 people in our dataset. What will the code below give us? Take a guess, then run the code.

In [None]:
# run this
predict(empty_model)

<div class="guided-notes">

### 3.3 In your own words, explain what `predict(empty_model)` does.

</div>
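
The claim that the empty model generates the same prediction for every row can be checked with a few lines of base R. This is a sketch; the object name `predictions` is an assumption made for this example.

```r
# rebuild the small data frame and fit the empty model
FEV_data <- data.frame(
  row = 1:8,
  FEV = c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)
)
empty_model <- lm(FEV ~ NULL, data = FEV_data)

# one prediction per row of the data frame
predictions <- predict(empty_model)
length(predictions)

# how many different values appear among the predictions?
length(unique(round(predictions, 10)))

# is every prediction equal to the mean of FEV?
all.equal(unname(predictions), rep(mean(FEV_data$FEV), nrow(FEV_data)))
```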

<div class="guided-notes">

### 3.4 How would you write the R code to create a data frame like this?

<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="border: 1px solid black;">row</th>
      <th style="border: 1px solid black;">FEV</th>
      <th style="border: 1px solid black;">prediction</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black;">1</td>
      <td style="border: 1px solid black;">3.0</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">2</td>
      <td style="border: 1px solid black;">3.2</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">3</td>
      <td style="border: 1px solid black;">3.8</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">4</td>
      <td style="border: 1px solid black;">4.6</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">5</td>
      <td style="border: 1px solid black;">3.8</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">6</td>
      <td style="border: 1px solid black;">2.3</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">7</td>
      <td style="border: 1px solid black;">4.8</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">8</td>
      <td style="border: 1px solid black;">4.1</td>
      <td style="border: 1px solid black;">3.7</td>
    </tr>
  </tbody>
</table>

</div>

In [None]:
# write code to test here

# sample response
FEV_data <- FEV_data %>%
  mutate(prediction = predict(empty_model))

# alternative method
#FEV_data$prediction <- predict(empty_model)

<div class="guided-notes">

### 3.5 Try writing code to overlay the model’s predictions on the scatter plot.

Hint: Try using shape = 10 to make the predictions stand out.
</div>

In [None]:
# modify this
gf_point(FEV ~ row, data = FEV_data) 

# sample answer
gf_point(FEV ~ row, data = FEV_data) %>%
  gf_point(prediction ~ row, shape = 10)

<div class="discussion-question">
    
<h3>3.6 Discussion Question: Each prediction goes with a data point in FEV_data. In the scatter plot, how can you tell which prediction belongs to which data point?</h3>

</div>

## 4 Calculating and Visualizing Residuals in R

<div class="guided-notes">

### 4.1 What do you think the `resid()` function will do? Write what you think this line of code will do: `FEV_data$residual <- resid(empty_model)`
</div>

In [None]:
# run this to test your idea
FEV_data$residual <- resid(empty_model)
FEV_data

<div class="guided-notes">

### 4.2 Draw the residuals on the scatter plot from 3.5.

Which residuals are positive? Which are negative?

</div>
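
The idea that each data point has two components, the model prediction and the error, can be checked row by row. The sketch below adds the two pieces back together and labels each residual by its sign. The column names `rebuilt` and `sign` are hypothetical and used only for this check.

```r
# rebuild the small data frame and fit the empty model
FEV_data <- data.frame(
  row = 1:8,
  FEV = c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)
)
empty_model <- lm(FEV ~ NULL, data = FEV_data)

# as.numeric() drops the row names that predict() and resid() attach
FEV_data$prediction <- as.numeric(predict(empty_model))
FEV_data$residual <- as.numeric(resid(empty_model))

# prediction + residual should give back the original FEV value
FEV_data$rebuilt <- FEV_data$prediction + FEV_data$residual
all.equal(FEV_data$rebuilt, FEV_data$FEV)

# label each residual: points above the model line are positive
FEV_data$sign <- ifelse(FEV_data$residual > 0, "positive", "negative")
FEV_data
```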

### 4.3 We can use `gf_resid()` to draw in the residuals from a model.
Run the code below. 

In [None]:
gf_point(FEV ~ row, data = FEV_data) %>%
  gf_model(empty_model) %>%
  gf_point(prediction ~ row, shape = 10) %>%
  gf_resid(empty_model)

<div class="guided-notes">

### 4.4 What function do you use to draw residuals in R?
</div>

<div class="guided-notes">

### 4.5 Write some code to check that the sum of the residuals is 0.
</div>

In [None]:
# code here
sum(resid(empty_model))

<div class="teacher-note">
    
**Teacher Note:** This is a good chance to review scientific notation. Because the decimal point gets moved 16 places to the left, this sum is essentially 0. It is not exactly 0 because of the small rounding errors that occur when a computer does arithmetic with decimal numbers.
</div>
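
Because of these rounding errors, an exact comparison with 0 can be misleading. A sketch of a more forgiving check is shown below. The object name `total` and the cutoff `1e-10` are assumptions chosen for illustration, not fixed rules.

```r
# rebuild the small data frame and fit the empty model
FEV_data <- data.frame(
  row = 1:8,
  FEV = c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)
)
empty_model <- lm(FEV ~ NULL, data = FEV_data)

total <- sum(resid(empty_model))

# exact comparison: may be FALSE even though the sum is "really" 0
total == 0

# comparison that allows for tiny rounding error
isTRUE(all.equal(total, 0))

# the same idea written out by hand with an explicit cutoff
abs(total) < 1e-10
```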

<div class="discussion-question">

<h3>4.6 Discussion Question: What do you think a histogram of residuals will look like? How will it be similar to or different from the histogram of FEV?</h3>

</div>

### 4.7 Run the code below to see the two histograms

In [None]:
# run this code
gf_histogram(~FEV, data = FEV_data)
gf_histogram(~residual, data = FEV_data, fill = "purple")

<div class="teacher-note">
    
**Teacher Note:** Notice that what changes when plotting the residuals (compared with the outcome) is the center of the distribution, which is now 0. This makes sense because the sum of the residuals is 0. The shape and spread of the distribution, on the other hand, do not change.
</div>

<div class="guided-notes">

### 4.8 Compare the two distributions. How are they similar? How are they different?
    
<table border="1" style="font-size: 18px; margin-left: 0; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="border: 1px solid black;">  </th>
      <th style="border: 1px solid black;">distribution of FEV</th>
      <th style="border: 1px solid black;">distribution of residuals</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid black;">shape</td>
      <td style="border: 1px solid black;"> </td>
      <td style="border: 1px solid black;">  </td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">center</td>
      <td style="border: 1px solid black;"> </td>
      <td style="border: 1px solid black;"> </td>
    </tr>
    <tr>
      <td style="border: 1px solid black;">spread</td>
      <td style="border: 1px solid black;"> </td>
      <td style="border: 1px solid black;">  </td>
    </tr>
  </tbody>
</table>
    
</div>


<div class="teacher-note">

<b>Teacher Note:</b> This is a good place to reinforce the idea that the residuals from the empty model always sum to zero. Visually, this is reflected in the histogram being centered around 0. It’s a direct result of how the mean balances the data—just as we introduced in the earlier notebook.

</div>
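
The comparison of center and spread can also be checked numerically. The sketch below uses base R's `mean()` and `sd()` as stand-ins for center and spread. The names `summary_table`, `center`, and `spread` are assumptions made for this example.

```r
# the eight FEV values and the residuals from the empty model
FEV <- c(3.0, 3.2, 3.8, 4.6, 3.8, 2.3, 4.8, 4.1)
empty_model <- lm(FEV ~ NULL)
residual <- as.numeric(resid(empty_model))

# center and spread of each distribution, side by side
summary_table <- data.frame(
  variable = c("FEV", "residual"),
  center = c(mean(FEV), mean(residual)),
  spread = c(sd(FEV), sd(residual))
)
summary_table

# each residual is the FEV value shifted down by the same amount (the mean),
# which is why the shape and spread stay the same
all.equal(residual, FEV - mean(FEV))
```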

## 5 Practice What You've Learned: The `Lungs` Data Frame

The `Lungs` data frame includes spirometry data on a sample of 80 adults aged 20 to 50. Data were originally collected from over 6,500 individuals between 2007 and 2012 as part of the National Health and Nutrition Examination Survey (NHANES). The complete dataset, which includes many more variables and many more people, is available at https://data.mendeley.com/datasets/dwjykg3xww/1.

### 5.1 Run the code below to load and view the larger data frame

In [None]:
# run this code
set.seed(124)
Lungs <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSz3JaTYY0RXoLQU-o1S45MDKudy6QfIQDtGLGoiy54JOydB7iELKxGbPqo_0uEvKUGCpZ_UQxu4PgM/pub?gid=1447217380&single=true&output=csv") %>%
  sample(80)

# show Lungs
head(Lungs)

### 5.2 Visualize the distribution of FEV in this larger dataset


In [None]:
# code here
gf_histogram(~FEV, data = Lungs)

### 5.3 What do you notice? How does it look different from the small eight-student dataset you have been working with?

Response:

### 5.4 Fit and save the empty model of FEV from the Lungs data

- Save it as `EmptyModelLungs` (to make clear it is based on this larger Lungs dataset, not the small one from earlier in this notebook)
- Print out the parameter estimate

In [None]:
# code here
EmptyModelLungs <- lm(FEV ~ NULL, data = Lungs)
EmptyModelLungs

### 5.5 Why does the parameter estimate for the empty model of `FEV` differ between the `Lungs` dataset and the `FEV_data` dataset? Which should you use going forward? Why?

Response: 

### 5.6 Add new columns to the `Lungs` data frame with predictions and residuals from the empty model of `FEV`. Name one variable `Prediction` and the other `Residual`. Check to make sure they appear in `Lungs`.


In [None]:
# code here
Lungs$Prediction <- predict(EmptyModelLungs)
Lungs$Residual <- resid(EmptyModelLungs)
head(Lungs)

### 5.7 See if the sum of the residuals from the empty model is still 0 in this much larger dataset

In [None]:
# code here
sum(Lungs$Residual)

### 5.8 Explore the relationship between `Height` and `FEV`. 
Does `Height` explain some of the variation in `FEV`? Explore the word equation **FEV = HEIGHT + other stuff** with a visualization.

In [None]:
# code here
gf_point(FEV ~ Height, data = Lungs)

### 5.9 Here is a scatter plot depicting the predictions of the empty model as purple dots. How would you draw the empty model of `FEV` on this graph? How about the residuals? (Feel free to make them purple as well.)

In [None]:
# code here
gf_point(FEV ~ Height, data = Lungs) %>%
  gf_point(Prediction ~ Height, color = "purple")

# sample response
gf_point(FEV ~ Height, data = Lungs) %>%
  gf_point(Prediction ~ Height, color = "purple") %>%
  gf_model(EmptyModelLungs, color = "purple") %>%
  gf_resid(EmptyModelLungs, color = "purple")

### 5.10 What do you think would happen to the residuals if we put a slope on this line?

Would the residuals be smaller or larger? More balanced or less balanced? Explain your reasoning.

Response: