<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 29: Correlation

Associated Textbook Sections: [15.0, 15.1](https://inferentialthinking.com/chapters/15/1/Correlation.html)

## Outline

* [Regression Roadmap](#Regression-Roadmap)
* [Prediction](#Prediction)
* [Association](#Association)
* [Correlation Coefficient](#Correlation-Coefficient)
* [Care in Interpretation](#Care-in-Interpretation)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

def r_scatter(r):
    "Generate a scatter plot with a correlation approximately r"
    plots.figure(figsize=(5,5))
    x = np.random.normal(0, 1, 1000)
    z = np.random.normal(0, 1, 1000)
    y = r*x + (np.sqrt(1-r**2))*z
    plots.scatter(x, y, color='darkblue', s=20)
    plots.xlim(-4, 4)
    plots.ylim(-4, 4)

---

## Regression Roadmap

* Association and correlation
* Prediction, scatterplots and lines
* Least squares: finding the "best" line for a dataset
* Residuals: analyzing mistakes and errors
* Regression inference: understanding uncertainty


---

## Prediction

### Guessing the Future

* Based on incomplete information
* One way of making predictions: 
    * To predict an outcome for an individual, 
    * find others who are like that individual
    * and whose outcomes you know. 
    * Use those outcomes as the basis of your prediction.


### Demo: Prediction

Load the `galton.csv` data, and visualize the relationship between `MidParent` and `Child` heights.

In [None]:
galton = Table.read_table('data/galton.csv')

In [None]:
heights = Table().with_columns(
    'MidParent', galton.column('midparentHeight'),
    'Child', galton.column('childHeight')
    )

In [None]:
heights

In [None]:
heights.scatter('MidParent')

Use the `predict_child` function below to predict a child's height based on a midparent height.

In [None]:
def predict_child(h):
    """Return a prediction of the height of a child 
    whose parents have a midparent height of h.
    
    The prediction is the average height of the children 
    whose midparent height is in the range h plus or minus 0.5 inches.
    """
    
    close_points = heights.where('MidParent', are.between(h-0.5, h + 0.5))
    return close_points.column('Child').mean()   

In [None]:
heights_with_predictions = heights.with_column(
    'Prediction', heights.apply(predict_child, 'MidParent')
    )

In [None]:
heights_with_predictions.scatter('MidParent')

---

## Association

### Two Numerical Variables

* Trend
    * Positive association
    * Negative association
* Pattern
    * Any discernible "shape" in the scatter
    * Linear vs. Non-linear
* Visualize, then quantify


### Demo: Association

Load the `hybrid.csv` data on several older hybrid cars.

In [None]:
hybrid = Table.read_table('data/hybrid.csv')
hybrid

In [None]:
hybrid.sort('msrp', descending=True)

Visualize the relationship between several of the numerical variables.

In [None]:
hybrid.scatter('mpg', 'msrp')

In [None]:
hybrid.scatter('acceleration', 'msrp')

In [None]:
suv = hybrid.where('class', 'SUV')
suv.num_rows

In [None]:
suv.scatter('acceleration', 'msrp')

In [None]:
suv.scatter('mpg', 'msrp')

Use the `standard_units` function below to visualize the relationship between `mpg` and `msrp` in standard units.

In [None]:
def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)
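
As a quick sanity check, the sketch below applies `standard_units` to a small array and confirms that the result has mean 0 and SD 1. The sample values are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from `hybrid.csv`.

```python
import numpy as np

def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)

# Hypothetical sample values, chosen only for illustration (mean 5, SD 2).
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = standard_units(values)

print(z)                              # each value as "SDs above or below the mean"
print(np.isclose(np.mean(z), 0.0))    # standard units always average to 0
print(np.isclose(np.std(z), 1.0))     # and always have SD 1
```

Because every variable ends up on the same unitless scale, scatter plots in standard units can be compared across variables measured in different units, such as miles per gallon and dollars.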

In [None]:
Table().with_columns(
    'mpg (standard units)',  standard_units(suv.column('mpg')), 
    'msrp (standard units)', standard_units(suv.column('msrp'))
).scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);

Use the `standard_units` function above to visualize the relationship between `acceleration` and `msrp` in standard units.

In [None]:
Table().with_columns(
    'acceleration (standard units)', standard_units(suv.column('acceleration')), 
    'msrp (standard units)',         standard_units(suv.column('msrp'))
).scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);

---

## Correlation Coefficient

### The Correlation Coefficient

* Measures linear association
* Based on standard units
* $-1 \leq r \leq 1$
    * $r =  1$: scatter is perfect straight line sloping up
    * $r = -1$: scatter is perfect straight line sloping down
    * $r = 0$: No linear association; *uncorrelated*


### Definition of $r$

* Correlation Coefficient ($r$) = 
    * average of
    * product of
    * $x$ in standard units
    * and
    * $y$ in standard units
* Measures how clustered the scatter is around a straight line
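
The verbal definition above can also be written as a single formula. The symbols are introduced here for illustration and are not part of the original slides: $n$ is the number of points, $\bar{x}$ and $\bar{y}$ are the means, and $\sigma_x$ and $\sigma_y$ are the SDs as computed by `np.std`.

$$
r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)
$$

Each factor in parentheses is one variable in standard units, so the sum divided by $n$ is the average of the products.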

### Demo: Correlation

Use the `r_scatter` function to generate scatter plots for a chosen correlation coefficient.

In [None]:
r_scatter(-1)
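
The construction inside `r_scatter` can be checked numerically. When `x` and `z` are independent standard normal samples, `y = r*x + sqrt(1 - r**2)*z` has variance $r^2 + (1 - r^2) = 1$ and covariance $r$ with `x`, so the correlation between `x` and `y` should land close to `r`. The sketch below is an illustration under those assumptions; the helper name `simulated_r`, the fixed seed, and the sample size are choices made here, and `np.corrcoef` is NumPy's built-in correlation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def simulated_r(r, n=10_000):
    """Build y = r*x + sqrt(1 - r**2)*z and return the observed correlation."""
    x = rng.normal(0, 1, n)
    z = rng.normal(0, 1, n)
    y = r * x + np.sqrt(1 - r**2) * z
    return np.corrcoef(x, y)[0, 1]

# Compare each target value with the correlation actually observed.
for target in [-0.9, -0.5, 0.0, 0.5, 0.9]:
    print(target, round(simulated_r(target), 3))
```

The observed values differ slightly from the targets because the samples are random, which is why the docstring of `r_scatter` says "approximately".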

### Calculating $r$

Explore the concept of calculating the $r$ value. 

In [None]:
x = np.arange(1, 7, 1)
y = make_array(2, 3, 1, 5, 2, 7)
t = Table().with_columns(
        'x', x,
        'y', y
    )
t

In [None]:
t.scatter('x', 'y', s=30, color='red')

In [None]:
t = t.with_columns(
        'x (standard units)', standard_units(x),
        'y (standard units)', standard_units(y)
    )
t

In [None]:
t = t.with_columns('product of standard units', t.column(2) * t.column(3))
t

Notice that $r$ is the average of the products of the standard units.

In [None]:
r = np.average(t.column(2) * t.column(3))
r

Define `correlation` as a function for a given table and x, y column labels.

In [None]:
def correlation(t, x, y):
    """t is a table; x and y are column labels"""
    x_in_standard_units = ...
    y_in_standard_units = ...
    return ...

In [None]:
correlation(t, 'x', 'y')

In [None]:
correlation(suv, 'mpg', 'msrp')

In [None]:
correlation(suv, 'acceleration', 'msrp')
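
For reference, here is one possible array-based version of the calculation, written with NumPy only so that it runs outside the notebook. It is a sketch, not the official solution to the skeleton above: the name `correlation_of_arrays` is hypothetical, and in the notebook the two arrays would come from `t.column(x)` and `t.column(y)`.

```python
import numpy as np

def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)

def correlation_of_arrays(x, y):
    """Hypothetical helper: average of the products of x and y in standard units."""
    return np.average(standard_units(x) * standard_units(y))

# The same small dataset used in the step-by-step calculation.
x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])

r = correlation_of_arrays(x, y)
print(round(r, 4))                          # 0.6174

# NumPy's built-in correlation, shown only as a cross-check.
print(round(np.corrcoef(x, y)[0, 1], 4))
```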

---

## Care in Interpretation

### Watch Out For ...

* False conclusions of causation
* Nonlinearity
* Outliers
* Ecological Correlations

### Demo: Switching Axes

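Notice that `correlation(t, 'x', 'y') == correlation(t, 'y', 'x')`. Because $r$ is symmetric in its two variables, it cannot indicate which variable, if either, influences the other. Reading a direction into it can lead to a false conclusion of causation.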

In [None]:
correlation(t, 'x', 'y')

In [None]:
t.scatter('x', 'y', s=30, color='red')

In [None]:
t.scatter('y', 'x', s=30, color='red')

In [None]:
correlation(t, 'y', 'x')

### Demo: Nonlinearity

Explore the correlation calculation for symmetrical non-linear data.

In [None]:
new_x = np.arange(-4, 4.1, 0.5)
nonlinear = Table().with_columns(
        'x', new_x,
        'y', new_x**2
    )
nonlinear.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(nonlinear, 'x', 'y')
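
The reason $r$ comes out as zero here can be seen from the products of standard units. Since the `x` values are symmetric about 0 and `y` depends only on `x**2`, the product at `-a` is the exact negative of the product at `a`, and the products cancel in the average. The sketch below, using NumPy only, illustrates this cancellation.

```python
import numpy as np

def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)

x = np.arange(-4, 4.1, 0.5)
y = x**2

products = standard_units(x) * standard_units(y)

# Reversing the array pairs each point a with its mirror image -a.
print(np.allclose(products, -products[::-1]))

# The average of the products is r; it is zero up to floating-point rounding.
r = np.average(products)
print(abs(r) < 1e-12)
```

Here `y` is completely determined by `x`, yet $r$ is essentially zero: $r$ measures only linear association.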

### Demo: Outliers

Notice how a single outlier can strengthen or weaken $r$, depending on where it falls.

In [None]:
line = Table().with_columns(
        'x', make_array(1, 2, 3, 4),
        'y', make_array(1, 2, 3, 4)
    )
line.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(line, 'x', 'y')

In [None]:
outlier = Table().with_columns(
        'x', make_array(1, 2, 3, 4, 5),
        'y', make_array(1, 2, 3, 4, 0)
    )
outlier.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(outlier, 'x', 'y')
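
The two results can be reproduced with NumPy alone. In this sketch the helper name `r_of` is hypothetical; it computes $r$ as the average product of standard units, and the data are the same two small tables used above.

```python
import numpy as np

def r_of(x, y):
    """Hypothetical helper: r as the average product of standard units."""
    x_su = (x - np.average(x)) / np.std(x)
    y_su = (y - np.average(y)) / np.std(y)
    return np.average(x_su * y_su)

# Four points exactly on a line: perfect correlation.
r_line = r_of(np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]))

# The same four points plus the single outlier (5, 0).
r_outlier = r_of(np.array([1, 2, 3, 4, 5]), np.array([1, 2, 3, 4, 0]))

print(np.isclose(r_line, 1.0))      # the line alone gives r = 1
print(np.isclose(r_outlier, 0.0))   # one added point pulls r down to 0
```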

### Demo: Ecological Correlations

Explore state-level SAT scores (`sat2014.csv`) as an example of an ecological correlation: a correlation computed from group averages rather than from individuals.

In [None]:
sat2014 = Table.read_table('./data/sat2014.csv').sort('State')
sat2014

In [None]:
sat2014.scatter('Critical Reading', 'Math')

In [None]:
correlation(sat2014, 'Critical Reading', 'Math')
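
Correlations based on group averages tend to overstate how strongly the variables are related for individuals, because averaging removes most of the variation within each group. The simulation below is a sketch of that effect on synthetic data only; it does not use the SAT data, and the group counts, spreads, and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_groups, group_size = 50, 200

# Synthetic data: each group has its own center, and individuals
# scatter widely around that center in both variables.
centers = rng.normal(0, 1, n_groups)
x = centers[:, None] + rng.normal(0, 3, (n_groups, group_size))
y = centers[:, None] + rng.normal(0, 3, (n_groups, group_size))

# Correlation across all individuals versus across the group averages.
r_individuals = np.corrcoef(x.ravel(), y.ravel())[0, 1]
r_group_means = np.corrcoef(x.mean(axis=1), y.mean(axis=1))[0, 1]

print("individuals:", round(r_individuals, 2))
print("group means:", round(r_group_means, 2))
```

In this setup the correlation between group means is much closer to 1 than the correlation between individuals, which is why a strong state-level correlation should not be read as a statement about individual students.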

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>