<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 35: Residuals

Associated Textbook Sections: [15.5 - 15.6](https://inferentialthinking.com/chapters/15/5/Visual_Diagnostics.html)

## Outline

* [Residuals](#Residuals)
* [Regression Diagnostics](#Regression-Diagnostics)
* [A Measure of Clustering](#A-Measure-of-Clustering)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Functions defined in previous lectures

def standard_units(arr):
    """ Converts an array to standard units """
    return (arr - np.average(arr))/np.std(arr)

def correlation(t, x, y):
    """ Computes correlation: t is a table, and x and y are column names """
    x_standard = standard_units(t.column(x))
    y_standard = standard_units(t.column(y))
    return np.average(x_standard * y_standard)

def slope(t, x, y):
    """ Computes the slope of the regression line: t is a table, and x and y are column names """
    r = correlation(t, x, y)
    y_sd = np.std(t.column(y))
    x_sd = np.std(t.column(x))
    return r * y_sd / x_sd

def intercept(t, x, y):
    """ Computes the intercept of the regression line: t is a table, and x and y are column names """
    x_mean = np.mean(t.column(x))
    y_mean = np.mean(t.column(y))
    return y_mean - slope(t, x, y)*x_mean

def fitted_values(t, x, y):
    """Return an array of the regression estimates (predictions) at all the x values"""
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a*t.column(x) + b

---

## Residuals

### Residuals

* Error in regression estimate
* One residual corresponding to each point (x, y)
* residual
    * = observed y - regression estimate of y
    * = observed y - height of regression line at x
    * = signed vertical distance between the point and the regression line (positive when the point lies above the line)
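
As a minimal sketch of the definition above, the following computes one residual per point for a small made-up data set. The arrays `x` and `y` are hypothetical values chosen for illustration, not taken from any of the lecture's data files, and the slope and intercept follow the same formulas as the `slope` and `intercept` functions defined earlier.

```python
import numpy as np

# Hypothetical toy data (an assumption for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Correlation: average of the product of x and y in standard units
x_su = (x - np.mean(x)) / np.std(x)
y_su = (y - np.mean(y)) / np.std(y)
r = np.mean(x_su * y_su)

# Regression line in original units
a = r * np.std(y) / np.std(x)      # slope
b = np.mean(y) - a * np.mean(x)    # intercept

fitted = a * x + b                 # height of the regression line at each x
resid = y - fitted                 # observed y minus regression estimate of y

print(resid)
```

Points above the line get positive residuals and points below it get negative ones, so there is exactly one residual for each (x, y) pair.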


### Demo: Residuals

Calculate and visualize the residuals associated with linear regression estimates for `Median Income` values based on `College%` in the `district_demographics2016.csv` data. 

In [None]:
demographics = Table.read_table('data/district_demographics2016.csv')
demographics

In [None]:
predict_income = demographics.select('College%', 'Median Income')
predict_income = predict_income.with_columns('Fitted',
    fitted_values(demographics, 'College%', 'Median Income'))
predict_income.scatter('College%')

In [None]:
demographics = demographics.drop(
    'State', 'District')
demographics.show(5)

In [None]:
def residuals(t, x, y):
    """Return an array of residuals: observed y values minus the regression estimates"""
    predictions = fitted_values(t, x, y)
    return t.column(y) - predictions

In [None]:
demographics = demographics.with_columns(
    'Fitted Value', fitted_values(demographics, 'College%', 'Median Income'),
    'Residual', residuals(demographics, 'College%', 'Median Income')
)
demographics

In [None]:
demographics.scatter('College%')

In [None]:
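def plot_residuals(t, x, y):
    """Draw the scatter plot with the fitted values overlaid, then the residual plot"""
    tbl = t.with_columns(
        'Fitted', fitted_values(t, x, y),
        'Residual', residuals(t, x, y)
    )
    # Original data and fitted values, with x on the horizontal axis
    tbl.select(x, y, 'Fitted').scatter(0)
    # Residuals against x; a shapeless cloud around 0 suggests a linear fit is reasonable
    tbl.scatter(x, 'Residual')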

In [None]:
plot_residuals(demographics, 'College%', 'Median Income')

Additionally, visualize the residuals for the `family_heights.csv` data set when predicting `Child` values from `Parent Average` values using linear regression.

In [None]:
family_heights = Table.read_table('data/family_heights.csv')
parents = (family_heights.column('father') + family_heights.column('mother'))/2
heights = Table().with_columns(
    'Parent Average', parents,
    'Child', family_heights.column('childHeight')
    )
plot_residuals(heights, 'Parent Average', 'Child')

---

## Regression Diagnostics

### Example: Dugongs

<img src="img/lec32_dugong_OSU.jpeg" width=50%>

Image Source: [OSU Geospatial Ecology of Marine Megafauna Laboratory](https://blogs.oregonstate.edu/gemmlab/2021/09/27/let-me-introduce-you-to-dugongs/)

### Demo: Dugongs

Visualize the relationship between a dugong's length and age based on the `dugong.csv` dataset. Although the relationship is not linear, calculate the correlation coefficient anyway.

In [None]:
dugong = Table.read_table('data/dugong.csv')
dugong.show(5)

In [None]:
dugong.scatter('Length', 'Age')

In [None]:
correlation(dugong, 'Length', 'Age')

Visualize the residuals associated with the linear regression prediction of a dugong's age based on its length.

In [None]:
plot_residuals(dugong, 'Length', 'Age')

### Demo: US Women

Repeat the same diagnostics for the `us_women.csv` data, predicting `ave weight` from `height`.

In [None]:
us_women = Table.read_table('data/us_women.csv')
us_women.show(5)

In [None]:
us_women.scatter('height')

In [None]:
correlation(us_women, 'height', 'ave weight')

In [None]:
plot_residuals(us_women, 'height', 'ave weight')

### Residual Plot

A scatter diagram of residuals
* Should look like an unassociated blob for linear relations
* But will show patterns for non-linear relations
* Used to check whether linear regression is appropriate
* Look for curves, trends, changes in spread, outliers, or any other patterns
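
To illustrate why a residual plot can reveal what the correlation coefficient hides, here is a small sketch using simulated data (an assumption for illustration, not one of the lecture's data sets). The points lie exactly on a curve, yet the correlation is high; the residuals from the regression line show a clear U-shaped pattern instead of a shapeless blob. The helper name `fit_line` is hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Return (r, slope, intercept) of the regression line for arrays x and y."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    r = np.mean(x_su * y_su)
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return r, slope, intercept

# Simulated non-linear relation: y is exactly x squared
x = np.arange(0, 11).astype(float)
y = x ** 2

r, a, b = fit_line(x, y)
resid = y - (a * x + b)

print("correlation:", r)
print("residuals:  ", resid)
```

The residuals are positive at both ends of the range and negative in the middle. Plotting them against `x` would show a curve, which is the kind of pattern that signals linear regression is not appropriate even though the correlation is close to 1.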


### Properties of Residuals

Residuals from a linear regression always have
* Zero mean (so rmse = SD of residuals)
* Zero correlation with $x$
* Zero correlation with the fitted values
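
The parenthetical about rmse follows from the zero mean: the SD of the residuals is the root mean square of their deviations from their mean, and when that mean is 0 the deviations are the residuals themselves. The sketch below checks all three properties, plus the rmse claim, on simulated data. The data, the seed, and the helper name `corr` are assumptions made for illustration.

```python
import numpy as np

def corr(u, v):
    """Correlation of two arrays: average product in standard units."""
    u_su = (u - np.mean(u)) / np.std(u)
    v_su = (v - np.mean(v)) / np.std(v)
    return np.mean(u_su * v_su)

# Simulated, roughly linear data (hypothetical, not from the lecture)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 4, 200)

# Regression line and residuals
r = corr(x, y)
a = r * np.std(y) / np.std(x)
b = np.mean(y) - a * np.mean(x)
fitted = a * x + b
resid = y - fitted

# Root mean squared error of the regression estimates
rmse = np.sqrt(np.mean(resid ** 2))

print("mean of residuals:          ", round(np.mean(resid), 6))
print("correlation with x:         ", round(corr(x, resid), 6))
print("correlation with fitted:    ", round(corr(fitted, resid), 6))
print("rmse equals SD of residuals:", np.isclose(rmse, np.std(resid)))
```

Up to floating-point rounding, the first three values are 0 regardless of how well the line fits, so these properties describe the regression method itself rather than the quality of the fit.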

#### Demo: Properties of Residuals

In [None]:
round(np.average(residuals(dugong, 'Length', 'Age')), 6)

In [None]:
round(np.average(residuals(heights, 'Parent Average', 'Child')), 6)

In [None]:
round(np.average(residuals(demographics, 'College%', 'Median Income')), 6)

In [None]:
heights = heights.with_columns(
    'Residual', residuals(heights, 'Parent Average', 'Child'),
    'Fitted Value', fitted_values(heights, 'Parent Average', 'Child')
)

In [None]:
round(correlation(heights, 'Parent Average', 'Residual'), 6)

In [None]:
round(correlation(heights, 'Fitted Value', 'Residual'), 6)

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>