# Standard Deviation and Z-Scores (COMPLETE)

## Chapter 6.4-6.6 Overview Notebook

In [None]:
# run this to set up the notebook
library(coursekata)
library(gridExtra)

# set styles
css <- suppressWarnings(readLines("https://raw.githubusercontent.com/jimstigler/jupyter/master/ck_jupyter_styles_v2.css"))
IRdisplay::display_html(sprintf('<style>%s</style>', paste(css, collapse = "\n")))

# add functions directly from github
source("https://raw.githubusercontent.com/coursekata/beta-functions/refs/heads/main/gf_resid.R")
source("https://raw.githubusercontent.com/coursekata/beta-functions/refs/heads/main/gf_square_resid.R")
source("https://raw.githubusercontent.com/coursekata/beta-functions/refs/heads/main/gf_sd_ruler.R")

# code used to generate feedback_data and no_feedback_data
set.seed(141)
feedback_data <- data.frame(
  student_id = c(1:100),
  task = rep("feedback", 100),
  median_time = round(rnorm(100, 13, 3),1)
)

set.seed(154)
no_feedback_data <- data.frame(
  student_id = c(1:100),
  task = rep("no_feedback", 100),
  median_time = round(rnorm(100, 13, 6),1)
)


<div class="teacher-note">
    
<b>Section Goals:</b> In this section students are introduced to the concept of standard deviation and to the z-score, which uses standard deviation as a unit for measuring how far a score is from the mean of its distribution. Students will learn:

- that standard deviation is the square root of variance, and understand that it represents the spread of a distribution in more interpretable units than variance.
- how to calculate a z-score using R, and understand that a z-score represents how far, in standard deviations, a score is from the mean of its distribution.
- that both residuals and z-scores express, in different units, how far a score is above or below the mean of its distribution. Z-scores are sometimes referred to as “standardized scores.”
- that z-scores are a way to compare scores across different measures in different units based on their position relative to the mean, and will learn about when z-scores are useful and how they are used.
        
A <a href="https://docs.google.com/document/d/1oeJYSOL1geQGdul-BhMr0kAy0pfVKZ91FSXsrRv2TCI/edit?tab=t.5y2a0ykmi2fk" target="_blank">printable student guided-notes worksheet</a> is available to go with this Jupyter notebook, as well as a student version of this notebook.
    
</div>

## 1 Estimating Elapsed Time: More Data

We’re going to compare two versions of the stopwatch task. In both, a participant presses Start, then Stop when they think 13 seconds have elapsed. Each participant gets five trials; we record their guess.

- **feedback:** after each trial, the participant sees the actual elapsed time.
- **no_feedback:** the participant does not see the actual time until after all trials are done.

Use the stopwatches below to try each version. As you try it, think about the data generating process: How might feedback change people's behavior from trial to trial?

### 1.1 Try the `feedback` and `no_feedback` versions of the stopwatch

In [None]:
IRdisplay::display_html('
  <iframe src="https://uclatall.github.io/stopwatch/comparison-table.html" 
          width="100%" height="300" style="border: none;" 
          sandbox="allow-scripts allow-same-origin"></iframe>
')

<div class="teacher-note">

**Teacher Note:**
Project the timer on a screen. Let students come up one at a time and try to use the two stopwatches to guess when exactly 13 seconds are up. Alternatively, if students are working in their own notebooks, they can all try the stopwatch game.
    
</div>

<div class="discussion-question">
    
### 1.2 Discussion: What do you think about the data generating process? How might feedback change people's behavior from trial to trial? How will that affect total error or mean square error from the empty model?
    
</div>

<div class="teacher-note">
    
**Sample Responses:**
- With feedback I’d learn and get better, so the errors should get smaller. The total error should be smaller in feedback data.
- Feedback should make me more accurate as I do more trials. So overall it will have smaller average error.
    
</div>

### 1.3 Write some code to take a look at the dataframes `feedback_data` and `no_feedback_data`

Instead of collecting data from your class, this time we are going to analyze two larger datasets collected from college students across the USA. 

- `feedback_data`: 100 students who saw the actual elapsed time after each trial. Each row represents one student who did 5 trials; the median of their 5 times is what appears in the dataset.

- `no_feedback_data`: 100 students who did not see their elapsed time until after all trials were finished. Again, each row is the median of 5 trials for one student.

Each student is labeled with a `student_id` from 1 to 100 (assigned randomly).

In [None]:
# code here

# COMPLETE
head(feedback_data)
head(no_feedback_data)

## 2 Exploring the distributions of `feedback_data` and `no_feedback_data`

Let's examine the distributions of `median_time`. 

<div class="discussion-question">
    
### 2.1 We provided a scatter plot that shows `median_time` on the y-axis (and just the random `student_id` on the x-axis) for students who got feedback.  Modify this code to make a similar scatter plot for students who did not get feedback. How do these two groups compare?

Notice: if we let R scale the y-axis automatically, it will stretch or shrink the range to fit each dataset. That makes it hard to compare the spread of times across the two conditions. To make the comparison fair, we’ve added `gf_lims(y = c(0,30))` to keep the y-axis the same (0–30 seconds).

</div>

In [None]:
# modify this
gf_point(median_time ~ student_id, data = feedback_data) %>%
  gf_lims(y = c(0,30))

In [None]:
# modify this

# COMPLETE
gf_point(median_time ~ student_id, data = no_feedback_data) %>%
  gf_lims(y = c(0,30))
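
To see why a shared y-axis matters, it can help to look at the range R would otherwise fit to each dataset. The sketch below is a base R illustration, not part of the original notebook: it rebuilds both data frames with the same code as the setup cell so it runs on its own, prints each range, and then draws the two scatter plots side by side with the same 0 to 30 limits. The name `shared_limits` is introduced here only for illustration.

```r
# rebuild both datasets exactly as in the setup cell
set.seed(141)
feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("feedback", 100),
  median_time = round(rnorm(100, 13, 3), 1)
)
set.seed(154)
no_feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("no_feedback", 100),
  median_time = round(rnorm(100, 13, 6), 1)
)

# the range an automatic axis would be fitted to, one per dataset
range(feedback_data$median_time)
range(no_feedback_data$median_time)

# shared limits put both plots on the same scale
shared_limits <- c(0, 30)
old_par <- par(mfrow = c(1, 2))  # two panels side by side
plot(median_time ~ student_id, data = feedback_data,
     ylim = shared_limits, main = "feedback")
plot(median_time ~ student_id, data = no_feedback_data,
     ylim = shared_limits, main = "no_feedback")
par(old_par)  # restore the previous plotting layout
```

With the limits fixed, a taller cloud of points reflects more spread in the data rather than a different axis scale.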

<div class="guided-notes">

### 2.2 On each visualization provided, draw the best-fitting empty model for each set of data. Label the $b_0$ on each graph. Also draw in a few residuals to remind yourself what error from the model looks like.

Run the code below to see the best-fitting parameter estimates.
</div>

In [None]:
# this saves empty models of both distributions
feedback_empty <- lm(median_time ~ NULL, data = feedback_data)
no_feedback_empty <- lm(median_time ~ NULL, data = no_feedback_data)

# this prints out the empty models
feedback_empty
no_feedback_empty
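
For the empty model, the single parameter estimate $b_0$ is the mean of the outcome. A minimal sketch in base R that checks this, assuming the same simulated `feedback_data` as the setup cell (rebuilt here so the block runs on its own). It writes the empty model as `median_time ~ 1`, the usual base R spelling of an intercept-only model; the names `b0` and `mean_time` are introduced only for this illustration.

```r
# rebuild feedback_data exactly as in the setup cell
set.seed(141)
feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("feedback", 100),
  median_time = round(rnorm(100, 13, 3), 1)
)

# fit the empty model (intercept only)
feedback_empty <- lm(median_time ~ 1, data = feedback_data)

# b0 is the one coefficient the empty model estimates
b0 <- unname(coef(feedback_empty)[1])

# the plain mean of the outcome
mean_time <- mean(feedback_data$median_time)

b0
mean_time
```

The two printed values should agree, which is why the empty model's line sits at the mean of `median_time`.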

<div class="discussion-question">
    
### 2.3 Discussion Questions: 
    
- For which task is the empty model a better model? How can you tell?
- Which would have a lower SST? Which would have a lower variance?
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-feedback-no-feedback.jpg" alt="the two scatter plots for feedback and no feedback data with their empty models, respectively">

<div class="teacher-note">

**Sample Responses:**
    
- The empty model looks better for the feedback data because the dots are closer to the line; the residuals are smaller. 
- So feedback should have a smaller SST. 
- And since variance is SST adjusted for sample size (the mean square), feedback should also have a smaller variance.
  
**Teacher Note:**

- Both groups have an empty model around 13, which makes sense since they were all aiming for 13 seconds. Because of this, some students may argue the models are “equally good.” Remind them that the goal isn’t just hitting the target value, but also reducing error overall. Both models are wrong (almost all the time), but one has consistently *less* error.  

    
</div>

## 3 Quantifying Error: SST, Variance, and Standard Deviation

We have a visual sense that there is more error around the empty model for `no_feedback` than for `feedback`. Let's quantify that visual sense. We'll start with Sum of Squares Total (SST) and variance, then introduce a new measure, standard deviation.


<div class="guided-notes">
    
### 3.1 Write R code to calculate SST and variance for the two empty models. Enter the missing formulas, code, and results in the first two rows of the table below.  
    
If you use the super shortcut for one of the rows, write in at least one other way to find that measure of error in the other row.
    
</div>

In [None]:
# code here

# COMPLETE
supernova(feedback_empty)
supernova(no_feedback_empty)

sum(resid(feedback_empty)^2)
sum(resid(no_feedback_empty)^2)

var(~ median_time, data = feedback_data)
var(~ median_time, data = no_feedback_data)
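
The numbers from `supernova()` and `var()` can also be rebuilt step by step. A minimal sketch in base R, assuming the same simulated `no_feedback_data` as the setup cell (regenerated here so the block is self-contained): square each residual from the mean, add the squares to get SST, then divide by n - 1 to get the variance. The variable names are chosen for this illustration only.

```r
# rebuild no_feedback_data exactly as in the setup cell
set.seed(154)
no_feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("no_feedback", 100),
  median_time = round(rnorm(100, 13, 6), 1)
)

y <- no_feedback_data$median_time
n <- length(y)

# residuals from the empty model are distances from the mean
residuals_from_mean <- y - mean(y)

# SST: the sum of the squared residuals
sst <- sum(residuals_from_mean^2)

# variance: SST divided by n - 1 (the mean square)
variance_by_hand <- sst / (n - 1)

sst
variance_by_hand
var(y)  # base R's var() uses the same n - 1 denominator
```

Because SST grows with every added data point while variance is averaged, variance is the easier of the two to compare across samples of different sizes.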

<table style="width: 100%; font-size: 18px; border-collapse: collapse;">
  <thead style="background-color: white;">
    <tr>
      <th style="border: 1px solid black; font-weight: bold; width: 110px; text-align: left; background-color: white;">Name</th>
      <th style="border: 1px solid black; font-weight: bold; width: 160px; text-align: left; background-color: white;">Formula</th>
      <th style="border: 1px solid black; font-weight: bold; width: 310px; text-align: left; background-color: white;">R Code</th>
      <th style="border: 1px solid black; font-weight: bold; width: 120px; text-align: left; background-color: white;">feedback</th>
      <th style="border: 1px solid black; font-weight: bold; width: 120px; text-align: left; background-color: white;">no_feedback</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: white;">
      <td style="border: 1px solid black; height: 100px; text-align: left; background-color: white;">Total Sum of Squares</td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
    </tr>
    <tr style="background-color: white;">
      <td style="border: 1px solid black; height: 100px; text-align: left; background-color: white;">Variance</td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
    </tr>
    <tr style="background-color: white;">
      <td style="border: 1px solid black; height: 100px; text-align: left; background-color: white;">Standard Deviation</td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
      <td style="border: 1px solid black; text-align: left; background-color: white;"></td>
    </tr>
  </tbody>
</table>

<div class="discussion-question">
    
### 3.2 Discussion Questions: 
- Based on your calculated SST and variance, for which dataset is the empty model a better model? Why?    
- Variance for the `no_feedback` data is about 38.19 seconds squared. Is “seconds squared” a useful way to describe how far off the model is? What might be a better way to express how far off the guesses are from the model?

</div>

<div class="teacher-note">

**Sample Responses:**
- The empty model looks better for the feedback group, because the SST and variance are both smaller.  
- Seconds squared is not that helpful; you can’t really be off by ‘38 seconds squared.’ It would make more sense to go back to regular seconds somehow. (This will segue to standard deviation.)
    
</div>

<div class="guided-notes">
    
### 3.3 Look back at the variance row in the table from 3.1. How would you modify that formula so the result is in seconds instead of seconds squared?  Write your new formula in the row marked Standard Deviation.
    
</div>

<div class="teacher-note">

**Sample Response:**  
- Take the square root of the variance formula.

</div>

<div class="guided-notes">
    
### 3.4 In mathematical notation, the sample standard deviation is usually called $s$. Thus the sample variance is called $s^2$. Go back to your table from 3.1 and add this notation in the Formula column for those two rows.  
    
</div>

<div class="guided-notes">

### 3.5 There is an R function called `sd()` that works just like `var()`.  Look back at the R code you wrote in the variance row of the table. How could you modify it to calculate the standard deviation instead? Write your R code in the Standard Deviation row.  
    
Try it out in the code cell below to check.

</div>

In [None]:
# code here

# COMPLETE (only for no_feedback_data)
sd(~ median_time, data = no_feedback_data)

# you can use this to make sure it is indeed the square root of variance
sqrt(var(~ median_time, data = no_feedback_data))

# note that favstats also calculates standard deviation
favstats(~ median_time, data = no_feedback_data)

<div class="guided-notes">
    
### 3.6 Variance and standard deviation are both measures of how well the empty model fits the data. They tell us about the spread of a distribution around the prediction of the empty model. On the `no_feedback_data` plot:

- Try to draw a representation of the variance, or mean square (38.19 seconds squared), on the graph.
- Then, try to draw a representation of one standard deviation (6.18 seconds) on the graph.  
- Hint: One of these will be a lot harder than the other.
    
</div>


<div class="discussion-question">
    
### 3.7 Discussion Question: Which was harder to draw onto the graph? Why?
    
</div>

<div class="teacher-note">

**Sample Response:**  
- Variance is harder. But once you draw one standard deviation (a residual about 6 seconds long, measured from the empty model), you can draw the variance as a square whose sides are each one standard deviation long.

</div>
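
The link between the two drawings can be checked numerically. The sketch below, in base R and assuming the same simulated `no_feedback_data` as the setup cell, computes the standard deviation by hand as the square root of SST / (n - 1), then squares it to recover the variance: the area of a square whose sides are one SD long. If the data match the notebook's, the results should land near the 6.18 seconds and 38.19 seconds squared quoted above.

```r
# rebuild no_feedback_data exactly as in the setup cell
set.seed(154)
no_feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("no_feedback", 100),
  median_time = round(rnorm(100, 13, 6), 1)
)

y <- no_feedback_data$median_time
n <- length(y)

# variance by hand: sum of squared residuals over n - 1
sst <- sum((y - mean(y))^2)
variance <- sst / (n - 1)

# standard deviation: back to plain seconds
s <- sqrt(variance)

s       # length of one SD ruler, in seconds
sd(y)   # same value from the built-in function
s^2     # area of a square with SD-length sides, in seconds squared
```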

In [None]:
# run this to see a depiction of standard deviation
gf_point(median_time ~ student_id, data = no_feedback_data) %>%
  gf_labs(title = "no_feedback_data") %>%
  gf_model(no_feedback_empty) %>%
  gf_sd_ruler(color = "red", size = 2)
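
For readers curious what the ruler depicts, here is a rough base R sketch of the same idea. It is not how `gf_sd_ruler()` is implemented; it simply draws a line at the mean and a vertical segment one SD long starting from that line. The data are rebuilt as in the setup cell, and `ruler_x` is an arbitrary horizontal position chosen for illustration.

```r
# rebuild no_feedback_data exactly as in the setup cell
set.seed(154)
no_feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("no_feedback", 100),
  median_time = round(rnorm(100, 13, 6), 1)
)

y <- no_feedback_data$median_time
b0 <- mean(y)  # prediction of the empty model
s <- sd(y)     # length of the ruler, in seconds

ruler_x <- 50  # arbitrary x position for the ruler

plot(no_feedback_data$student_id, y, ylim = c(0, 30),
     xlab = "student_id", ylab = "median_time",
     main = "no_feedback_data")
abline(h = b0, col = "blue")  # the empty model
segments(x0 = ruler_x, y0 = b0, x1 = ruler_x, y1 = b0 + s,
         col = "red", lwd = 3)  # one SD above the mean
```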

### 3.8 Standard deviations are plotted on the graphs below
    
Now you can clearly see that although both models predict a guess of about 13 seconds:
- the `feedback_empty` model's SD shows that the prediction is typically off by about 3 seconds.
- the `no_feedback_empty` model's SD shows that the prediction is typically off by about 6 seconds.

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-sd-rulers.jpg" alt="the two scatter plots for feedback and no feedback data with their empty models and standard deviation rulers, respectively">

## 4 Developing Intuitions About Standard Deviation

Standard deviation (the typical size of a residual) is a useful metric for thinking about how well a model fits the data. It is worth spending some time getting familiar with estimating it visually.

<div class="guided-notes">
    
### 4.1 Given an SD ruler, estimate the SD value.

In these four scatter plots, the SD ruler is drawn. What’s the approximate SD? Any patterns you notice so far?
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-estimate-sd-value.jpg" alt="four scatter plots in which the SD ruler is drawn but the SD value is not given" width = 100%> 

<div class="guided-notes">
    
### 4.2 Given the value of SD, draw an SD ruler.

Here, the value of the SD is given. Draw in the ruler. Any patterns you notice so far?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-draw-sd.jpg" alt="four scatter plots in which the value of SD is given but there is no sd ruler" width = 100%> 

<div class="guided-notes">
    
### 4.3 Training wheels come off! Nothing given. Draw in the SD ruler and estimate the value. Explain your strategy.
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-estimate-and-draw-sd.jpg" alt="four scatter plots in which there is no sd ruler and no sd value" width = 100%> 

## 5 What statisticians commonly use SD for: To measure distance

Now that you have developed a visual sense of SD, we'll show you how it is commonly used in statistics.

Let's learn about two students. One, whom we will refer to as the **square** student (we're going to depict their data as a square), was in the `feedback` condition. 

The second student, **triangle**, was in the `no_feedback` condition. 

Both students were trying to stop the stopwatch at exactly 13 seconds, and both actually stopped it at 18 seconds.


<div class="discussion-question">

### 5.1 Discussion: Would you say that both students are equally skilled at estimating an elapsed time of 13 seconds? Why or why not? 
- Which one would you predict would have been more accurate if both had been placed in the `no_feedback` condition?
    
</div>

<div class="teacher-note">

**Sample Responses:**  
- They’re the same. Both stopped at 18s, both were 5s too late.
- They’re different, because the no-feedback group had a wider spread. (Some students may vaguely recall this, but often can’t yet explain why it matters. This is the nugget you’ll build on in 5.2 when we show the square and triangle students in context of their respective distributions.)  

**Teacher Note:**  
- At this point it is okay to validate both kinds of answers. We want students to realize there is more than one way to judge the data.

</div>


<div class="discussion-question">

### 5.2 Discussion: Take a look at these two students (depicted as a square and a triangle) in their respective distributions. Do they seem equally skilled? 
- How does looking at each student **in the context of their group’s spread** change your answer?
- Can the SD ruler help you explain your reasoning?

</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-square-triangle.jpg" alt="the square student is outside 1 SD of their distribution while the triangle student is within 1 SD of their distribution" width = 100%>

<div class="teacher-note">

**Sample Responses:**  
- The square looks much worse. It's past the typical range of the feedback group. 
- The triangle is less than 1 SD above the mean, so it doesn’t seem that unusual.
- The SD ruler helps show that one student is outside what’s typical, the other is still inside.

**Teacher Note:**  
- Encourage students to use the ruler in their explanations.
- Highlight that using SD as a standard unit gives us a common way to compare across groups with different spreads.

</div>


<div class="guided-notes">
    
### 5.3 Fill in the approximate value of the residual for the student we call Square and the one we call Triangle. (We rounded everything to the nearest whole number.)
    
</div>

<table style="width: 100%; font-size: 18px; border-collapse: collapse;">
  <thead style="background-color: white;">
    <tr>
      <th style="border: 1px solid black; font-weight: bold; width: 150px; text-align: left; background-color: white;">Task</th>
      <th style="border: 1px solid black; font-weight: bold; width: 100px; text-align: left; background-color: white;">Student</th>
      <th style="border: 1px solid black; font-weight: bold; width: 100px; text-align: right; background-color: white;">median_time</th>
      <th style="border: 1px solid black; font-weight: bold; width: 100px; text-align: right; background-color: white;">Residual</th>
      <th style="border: 1px solid black; font-weight: bold; width: 100px; text-align: right; background-color: white;">Z-Score</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: white;">
      <td style="border: 1px solid black; height: 80px; text-align: left; background-color: white;">feedback<br>
(mean = 13, sd = 3)
</td>
      <td style="border: 1px solid black; text-align: left; background-color: white;">Square</td>
      <td style="border: 1px solid black; background-color: white;">18</td>
      <td style="border: 1px solid black; background-color: white;"></td>
      <td style="border: 1px solid black; background-color: white;"></td>
    </tr>
    <tr style="background-color: white;">
      <td style="border: 1px solid black; height: 80px; text-align: left; background-color: white;">no_feedback<br>
(mean = 13, sd = 6)
</td>
      <td style="border: 1px solid black; text-align: left; background-color: white;">Triangle</td>
      <td style="border: 1px solid black; background-color: white;">18</td>
      <td style="border: 1px solid black; background-color: white;"></td>
      <td style="border: 1px solid black; background-color: white;"></td>
    </tr>
  </tbody>
</table>

### 5.4 Z-Scores  

Whether we look at `median_time` or at the residuals, Square and Triangle appear the same.

A **z-score** gives us a way to describe how far a student is from the mean in the context of the spread of their group. It is *standardized*; that is, it's expressed in *standard deviation units*, not in raw seconds.

We calculate it by taking the residual and dividing by the standard deviation of the group: 

<div style="font-size: 20pt; text-align: center;">
$$z_i = \frac{Y_i - \bar{Y}}{s}$$
</div>
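
The same formula can be applied to every score in a distribution at once. A minimal sketch in base R, assuming the simulated `feedback_data` from the setup cell (rebuilt here so the block runs on its own). It also compares the result with base R's `scale()`, which centers on the mean and divides by the standard deviation by default.

```r
# rebuild feedback_data exactly as in the setup cell
set.seed(141)
feedback_data <- data.frame(
  student_id = 1:100,
  task = rep("feedback", 100),
  median_time = round(rnorm(100, 13, 3), 1)
)

y <- feedback_data$median_time

# z-score for every student: residual divided by the SD
z <- (y - mean(y)) / sd(y)

head(round(z, 2))
mean(z)  # essentially 0, up to floating-point error
sd(z)    # 1: the scores are now in SD units

# scale() returns a one-column matrix, so flatten it before comparing
z_scaled <- as.numeric(scale(y))
all.equal(z, z_scaled)
```

After standardizing, the whole distribution has a mean of 0 and an SD of 1, which is what makes z-scores from different groups comparable.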

<div class="guided-notes">
    
### 5.5 Calculate the z-score for the student we call Square and the one we call Triangle. Enter these z-scores into the table (in 5.3) on your guided notes.
    
</div>

In [None]:
# code here (optional)

# z-score for Square
(18 - 13) / 3

# z-score for Triangle
(18 - 13) / 6

<div class="discussion-question">
    
### 5.6 Discussion Questions: 
    
- Why is the z-score for Square greater than the z-score for Triangle?
- How do the z-scores relate to the idea that one of these students is outside the typical range and one is within the typical range?
    
</div>

<img src="https://coursekata-course-assets.s3.us-west-1.amazonaws.com/UCLATALL/czi-stats-course/6.4-6.6-overview-square-triangle.jpg" alt="the square student is outside 1 SD of their distribution while the triangle student is within 1 SD of their distribution" width = 100%>

<div class="teacher-note">

**Sample Responses:**  
- Square’s z-score is larger because the feedback group had a smaller SD, so being 5 seconds off was more extreme there. In SD units, Square is farther from the mean.  
- Triangle’s group had more spread, so being 5 seconds off isn’t as unusual. In SD units, Triangle isn't too far off from the mean.
- The z-scores (1.67 versus 0.83) show that Square is outside the 1 SD range, while Triangle is within it.

**Teacher Note:**  
- Connect back to the visual from 5.2: Square was beyond the SD ruler, while Triangle was still inside it.  

</div>


## 6 Z-Scores Help You Compare Scores Even When Measured on Different Scales

Imagine a different game where students are told to **brush their teeth for exactly 2 minutes**. Each student has to stop brushing when they think 2 minutes have passed, and then we check the actual time.  

Now the scores are in **minutes** instead of seconds, and the game is entirely different. Suppose a large group of 70 students averages right around 2 minutes. One student (let’s call them Circle!) stops at 1.4 minutes.  

<div class="discussion-question">
    
### 6.1 Is the Circle student better or worse at predicting a target time than Triangle? 
  
</div>

<div class="teacher-note">

**Sample Responses:**  
- By raw numbers alone, the two games are just completely different, so it seems impossible to compare.  
- It’s not helpful to compare the elapsed values (1.4 minutes vs. 18 seconds); even if we convert 1.4 minutes into seconds (84 seconds), it doesn’t help.
- Similarly, comparing the residuals (-0.6 minutes vs. 5 seconds) doesn’t really tell us who was more accurate.  
- Some students may begin to wonder: Exactly how good are people at this toothbrushing task? (Maybe some might say, "Okay fine, let's use z-score.")

</div>

<div class="guided-notes">
    
### 6.2 Let’s try converting the toothbrush score into a z-score

To fairly compare Circle and Triangle, we need to put both scores in the context of their group’s distribution (even without seeing the full distribution).  

Here are two facts about the toothbrushing task distribution: mean = 2 minutes; SD = 1 minute
    
Add this information in the third row of your guided notes table (from 5.3). Then calculate the z-score for Circle.
    
</div>

<div class="discussion-question">

### 6.3 Discussion: 
- What does Circle’s z-score tell us about how good they are at predicting elapsed time? 
- How does Circle compare to Triangle once we use z-scores?  
- Does it matter that Circle's z-score is negative and Triangle's z-score is positive? 

</div>

<div class="teacher-note">

**Sample Responses:**  
- Circle’s z-score (–0.6) shows they were a little below average, but not far outside the typical range.  
- Compared to Triangle (about 0.83), Circle is a similar distance from the mean (slightly closer), just on the opposite side.
- The negative sign tells us the direction of their error (below vs. above the mean). The size of the z-score tells us how unusual the score is. If we care about accuracy of elapsed time, we might just care about the magnitude of the z-score.

</div>
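
The three comparisons can be collected in one place. The sketch below uses a small helper, `z_score()`, which is a hypothetical name introduced for this illustration and is not a function from the notebook's packages. It takes the rounded means and SDs given in the tables and text above.

```r
# hypothetical helper: residual divided by the group's SD
z_score <- function(score, group_mean, group_sd) {
  (score - group_mean) / group_sd
}

z_square   <- z_score(18, 13, 3)   # feedback task, in seconds
z_triangle <- z_score(18, 13, 6)   # no_feedback task, in seconds
z_circle   <- z_score(1.4, 2, 1)   # toothbrushing task, in minutes

z_all <- c(square = z_square, triangle = z_triangle, circle = z_circle)

round(z_all, 2)       # 1.67, 0.83, -0.60
round(abs(z_all), 2)  # magnitudes only: 1.67, 0.83, 0.60
```

The units (seconds or minutes) cancel when the residual is divided by the SD, so the three results sit on the same scale. Taking `abs()` drops the direction of the error and leaves only its size in SD units.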
