In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
def standard_units(x):
    """Converts an array x to standard units"""
    return (x - np.mean(x)) / np.std(x)

def linear_fit(t, x, y):
    """Returns the regression-line predictions of column y from column x in table t"""
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    r = np.mean(x_su * y_su)
    slope = r * np.std(t.column(y)) / np.std(t.column(x))
    intercept = np.mean(t.column(y)) - slope * np.mean(t.column(x))
    return slope * t.column(x) + intercept

# Module 6.2 Part 2: Residuals

Residuals describe how far off estimates are from the observed values. In this notebook, you'll learn how to calculate residuals and use them to assess regression lines.

This notebook contains 4 videos, with a total runtime of 21:00.

1. [Residuals and Regression Diagnostics](#section1) *2 videos, total runtime 11:06*
2. [Properties of Residuals](#section2) *2 videos, total runtime 9:54*
3. [Check for Understanding](#section3)

Textbook readings:
- [Chapter 15.5: Visual Diagnostics](https://www.inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html)
- [Chapter 15.6: Numerical Diagnostics](https://www.inferentialthinking.com/chapters/15/6/Numerical_Diagnostics.html)

<a id='section1'></a>
## 1. Residuals and Regression Diagnostics



In the following videos, you'll learn how to use residuals to visualize prediction error and to check whether a linear model is appropriate.
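
As a minimal sketch of the idea, a residual is the observed value minus the value predicted by the regression line. The example below uses plain NumPy arrays with made-up numbers (the arrays `x` and `y` are hypothetical, not the SAT data) and the same slope and intercept formulas as the `linear_fit` helper above.

```python
import numpy as np

def standard_units(a):
    """Converts an array a to standard units"""
    return (a - np.mean(a)) / np.std(a)

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Regression line built from the correlation coefficient r
r = np.mean(standard_units(x) * standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

predicted = slope * x + intercept
residuals = y - predicted  # observed minus predicted

print(residuals)
```

A positive residual means the line underestimated that point, and a negative residual means it overestimated.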

### Residuals

In [None]:
YouTubeVideo('JlYcyQaxltc')

### Regression Diagnostics

In [None]:
YouTubeVideo("CiSYeEO-CBs")

In the cell below, load `sat`, a table that contains the participation rate and average combined SAT score for each state.

In [None]:
sat = Table.read_table('https://www.inferentialthinking.com/data/sat2014.csv').select(0, 1, 5)
sat.show(5)

In the cell below, create a scatterplot with `Participation Rate` on the x-axis and `Combined` on the y-axis.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    sat.scatter("Participation Rate", "Combined")
</details>
<br>

Add a column `Predicted Combined` to `sat` that contains the predicted value of `Combined` for each value of `Participation Rate`.

In [None]:
sat_with_predictions = ...
sat_with_predictions

<details>
    <summary>Solution</summary>
    
    sat_with_predictions = sat.with_column("Predicted Combined", linear_fit(sat, "Participation Rate", "Combined"))
</details>
<br>

Create a scatterplot to compare the `Combined` scores and the `Predicted Combined` scores for each value of `Participation Rate`.

In [None]:
...

<details>
    <summary>Solution</summary>
    
    sat_with_predictions.drop(0).scatter(0)
</details>
<br>

Add a column `Residuals` to the table `sat_with_predictions` that contains the difference between the observed combined scores and the predicted combined scores. Then generate a scatterplot of the residuals against `Participation Rate`.

In [None]:
sat_with_residuals = ...
...

<details>
    <summary>Solution</summary>
    
    sat_with_residuals = sat_with_predictions.with_column("Residuals", sat_with_predictions.column("Combined") - sat_with_predictions.column("Predicted Combined"))
    sat_with_residuals.select("Residuals", "Participation Rate").scatter("Participation Rate", "Residuals")
</details>
<br>

What trends (if any) do you see in the residual plot? Does the regression line sufficiently describe the relationship between Participation Rate and Combined Score?

<details>
    <summary>Solution</summary>
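For participation rates from 20 to 60 and from 90 to 100, there are only negative residuals. Residuals from a good linear fit would be scattered above and below 0 with no pattern, so this indicates that the relationship between Participation Rate and Combined Score may be non-linear.
</details>
<br>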
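
<details>
<summary>Worked example (toy data)</summary>

The sketch below illustrates, under simple assumptions, how a curved relationship shows up in the residuals. The data are hypothetical (points on the curve y = x squared, not the SAT data), and the line is fitted with the same slope and intercept formulas used in `linear_fit`.

```python
import numpy as np

# Hypothetical data that follows a curve, not a line
x = np.arange(5.0)  # 0, 1, 2, 3, 4
y = x ** 2

# Regression line: slope = r * SD(y) / SD(x)
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

residuals = y - (slope * x + intercept)
print(np.round(residuals, 2))
```

Here the residuals are approximately 2, -1, -2, -1, 2: positive at both ends and negative in the middle. A systematic pattern like this, rather than an even scatter around 0, is the kind of signal that suggests a straight line is not the right model.

</details>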

<a id='section2'></a>

## 2. Properties of Residuals

In the next videos, you'll learn about the properties of residuals and how to calculate their standard deviation.

### Properties of Residuals

In [None]:
YouTubeVideo('X7DLRL7JzMM')

### Standard Deviation of Residuals

In [None]:
YouTubeVideo("vYQ2EWvySV0")

In the cell below, calculate the standard deviation of the residuals when predicting `Combined` from `Participation Rate` in the `sat` table.

In [None]:
x_su = ...
y_su = ...
r = ...
residual_std = ...
residual_std

<details>
    <summary>Solution</summary>

    x_su = standard_units(sat.column("Participation Rate"))
    y_su = standard_units(sat.column("Combined"))
    r = np.mean(x_su * y_su)
    residual_std = (1 - r**2) ** 0.5 * np.std(sat.column("Combined"))
    residual_std
</details>
<br>
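
<details>
<summary>Worked example (toy data)</summary>

As a rough check of the shortcut used in the solution, the sketch below compares the standard deviation computed directly from the residuals with the value given by sqrt(1 - r²) times the SD of y. The arrays are hypothetical toy data, not the SAT table.

```python
import numpy as np

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.0, 4.0, 8.0, 7.0, 9.0])

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (slope * x + intercept)

sd_direct = np.std(residuals)                 # SD computed from the residuals
sd_formula = np.sqrt(1 - r ** 2) * np.std(y)  # shortcut from the solution

print(np.mean(residuals), sd_direct, sd_formula)
```

Up to floating-point rounding, the mean of the residuals is 0 and the two standard deviations agree.

</details>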

<a id='section3'></a>
## 3. Check for Understanding

**A. When the correlation coefficient is 1, what is the standard deviation of the residuals?**

<details>
    <summary>Solution</summary>
    The standard deviation of the residuals would be 0. When we have a perfect linear relationship, the predicted and observed values all fall on the same line. All the residuals are equal to 0, so there's no variation in their values. Thus, the standard deviation of the residuals is 0.
</details>
<br>
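
<details>
<summary>Worked example (toy data)</summary>

A small numerical illustration of this answer, assuming hypothetical points that lie exactly on a line (the line y = 3x + 2 is chosen arbitrarily):

```python
import numpy as np

# Hypothetical points that lie exactly on the line y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 2

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

residuals = y - (slope * x + intercept)
print(r, np.std(residuals))
```

The correlation is 1 and the residual SD is 0, apart from tiny floating-point rounding error.

</details>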

**B. Consider the residual plot below. What does this say about the accuracy of the regression estimates for the data it describes?**
<img src="residual_question.png" width=300 height=300 />

<details>
<summary>Solution</summary>
The variability in the size of the errors is greater for low values of acceleration than for high values. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.
<br><br>
Since the residual plot shows uneven variation about the horizontal line at 0, the regression estimates are not equally accurate across the range of the predictor variable.
</details>
<br>
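
<details>
<summary>Worked example (toy data)</summary>

One way to put a number on uneven variation is to compare the spread of the residuals in different parts of the x range. The sketch below does this on simulated data; the data, the seed, and the split point at x = 5 are all assumptions made for illustration and are unrelated to the plot above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: the noise is large for small x and shrinks as x grows
x = np.linspace(0, 10, 400)
noise_scale = 6 - 0.5 * x  # goes from 6 down to 1
y = 2 * x + 1 + rng.normal(0, 1, x.size) * noise_scale

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (slope * x + intercept)

# Spread of the residuals in the lower and upper halves of the x range
low_spread = np.std(residuals[x < 5])
high_spread = np.std(residuals[x >= 5])
print(low_spread, high_spread)
```

The residuals are noticeably more spread out for small x than for large x, so predictions from this line are less reliable at the low end of the range.

</details>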