<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 30: Correlation

Associated Textbook Sections: [15.0, 15.1](https://inferentialthinking.com/chapters/15/1/Correlation.html)

---

## Outline

* [Prediction](#Prediction)
* [Association](#Association)
* [Correlation Coefficient](#Correlation-Coefficient)
* [Care in Interpretation](#Care-in-Interpretation)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

def r_scatter(r):
    "Generate a scatter plot with a correlation approximately r"
    plt.figure(figsize=(5,5))
    x = np.random.normal(0, 1, 1000)
    z = np.random.normal(0, 1, 1000)
    y = r*x + (np.sqrt(1-r**2))*z
    plt.scatter(x, y, color='darkblue', s=20)
    plt.xlim(-4, 4)
    plt.ylim(-4, 4)

---

## Prediction

---

### Guessing the Future

* Based on incomplete information
* One way of making predictions: 
    * To predict an outcome for an individual, 
    * find others who are like that individual
    * and whose outcomes you know. 
    * Use those outcomes as the basis of your prediction.
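The approach above can be sketched with plain NumPy arrays. This is only an illustration: the heights are made-up numbers, and `predict_from_neighbors` and its `window` parameter are hypothetical names, not part of the course materials.

```python
import numpy as np

# Hypothetical (made-up) midparent and child heights, in inches.
midparent = np.array([64.0, 66.5, 67.0, 67.2, 68.0, 69.5, 70.0, 72.0])
child = np.array([63.0, 66.0, 68.0, 67.0, 69.0, 70.0, 71.0, 73.0])

def predict_from_neighbors(h, window=0.5):
    """Average the known outcomes of individuals within `window` of h.

    Assumption: both ends of the window are inclusive here.
    """
    is_close = np.abs(midparent - h) <= window
    return child[is_close].mean()

# Families with midparent heights 66.5, 67.0, and 67.2 are "like" h = 67,
# so the prediction is the average of their children's heights.
prediction = predict_from_neighbors(67.0)
```

The width of the window is a design choice: a narrow window uses only very similar individuals but may leave few points to average, while a wide window averages over less similar ones.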


---

### Demo: Prediction

* Load the `galton.csv` data, and visualize the relationship between `midparentHeight` and `childHeight`.
* Use the `predict_child` function below to predict a child's height based on a midparent height, and visualize how these predictions relate to the `midparentHeight` and `childHeight` values.

In [None]:
heights = (Table.read_table('galton.csv')
                .select('midparentHeight', 'childHeight'))
heights

In [None]:
...

In [None]:
def predict_child(h):
    """Return a prediction of the height of a child 
    whose parents have a midparent height of h.
    
    The prediction is the average height of the children 
    whose midparent height is in the range h plus or minus 0.5 inches.
    """
    
    close_points = heights.where('midparentHeight', are.between(h-0.5, h + 0.5))
    return close_points.column('childHeight').mean()   

In [None]:
predicted_child_heights = ...
heights_with_predictions = heights.with_column(
    'predicted_ChildHeight', predicted_child_heights)

In [None]:
heights_with_predictions.scatter('midparentHeight')

---

## Association

---

### Two Numerical Variables

* Trend
    * Positive association
    * Negative association
* Pattern
    * Any discernible "shape" in the scatter
    * Linear vs. Non-linear
* **Visualize, then quantify!**


---

### Demo: Association

* Load the `hybrid.csv` data on several older hybrid cars.
* Visualize the relationship between several of the numerical variables.
* Use the `standard_units` function below to visualize the relationship between `mpg` and `msrp` in standard units.
* Use the `standard_units` function below to visualize the relationship between `acceleration` and `msrp` in standard units.

In [None]:
hybrid = Table.read_table('hybrid.csv')
hybrid

In [None]:
hybrid.sort('msrp', descending=True)

In [None]:
hybrid.scatter('mpg', 'msrp')

In [None]:
hybrid.scatter('acceleration', 'msrp')

In [None]:
suv = hybrid.where('class', 'SUV')
print(f'There are {suv.num_rows} SUVs in this dataset.')

In [None]:
suv.scatter('acceleration', 'msrp')

In [None]:
suv.scatter('mpg', 'msrp')

In [None]:
def standard_units(x):
    "Convert any array of numbers to standard units."
    ...
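
One possible way to complete a function like this is sketched below. It assumes that "standard units" means subtracting the mean and dividing by the standard deviation, with the SD computed by dividing by the number of values (the default of `np.std`). The name `standard_units_sketch` is hypothetical and is used so it does not stand in for the lecture's own definition.

```python
import numpy as np

def standard_units_sketch(x):
    """Convert an array of numbers to standard units (assumed definition)."""
    x = np.asarray(x, dtype=float)
    # Distance from the mean, measured in standard deviations.
    return (x - np.mean(x)) / np.std(x)

# Small made-up example: these values have mean 5 and SD 2.
su = standard_units_sketch([2, 4, 4, 4, 5, 5, 7, 9])
```

After conversion, the values have mean 0 and SD 1, which is why the standardized scatter plots below can share the same axis limits.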

In [None]:
suv_standardized = suv
for variable in ['msrp', 'acceleration', 'mpg']:
    standardized_variable = standard_units(suv.column(variable))
    suv_standardized = suv_standardized.with_column(variable, standardized_variable)
suv_standardized

In [None]:
suv.scatter('mpg', 'msrp')
plt.title('Original Units')
suv_standardized.scatter('mpg', 'msrp')
plt.title('Standardized Units')
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.show()

In [None]:
suv.scatter('acceleration', 'msrp')
plt.title('Original Units')
suv_standardized.scatter('acceleration', 'msrp')
plt.title('Standardized Units')
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.show()

---

## Correlation Coefficient

---

### The Correlation Coefficient

* Measures linear association
* Based on standard units
* $-1 \leq r \leq 1$
    * $r =  1$: scatter is a perfect straight line sloping up
    * $r = -1$: scatter is a perfect straight line sloping down
    * $r = 0$: no linear association; *uncorrelated*


---

### Definition of $r$

* Correlation Coefficient ($r$) = 
    * average of
    * product of
    * $x$ in standard units
    * and
    * $y$ in standard units
* Measures how clustered the scatter is around a straight line
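
Written as a formula, the definition above reads as follows. The notation is introduced here for illustration and is an assumption, not taken from the lecture: $n$ paired observations $(x_i, y_i)$, means $\bar{x}$ and $\bar{y}$, and standard deviations $\sigma_x$ and $\sigma_y$ computed by dividing by $n$.

$$
r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right)
$$

Each factor in parentheses is one variable in standard units, so the sum divided by $n$ is the average of the products.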

---

### Demo: Correlation

Use the `r_scatter` function to generate scatter plots for various values of the correlation coefficient $r$.

In [None]:
...
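
The construction inside `r_scatter`, `y = r*x + sqrt(1 - r**2)*z` with independent standard normal `x` and `z`, is designed so that the correlation between `x` and `y` comes out close to `r`. The sketch below checks this numerically without plotting. The seed, the sample size, and the use of `np.corrcoef` are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the check is repeatable
r = 0.6                          # target correlation, any value in [-1, 1]
n = 100_000                      # large sample so the estimate is tight

x = rng.normal(0, 1, n)
z = rng.normal(0, 1, n)          # noise, independent of x
y = r * x + np.sqrt(1 - r**2) * z

# Off-diagonal entry of the 2x2 correlation matrix is the correlation of x and y.
observed_r = np.corrcoef(x, y)[0, 1]
```

With a smaller sample, such as the 1000 points used by `r_scatter`, the observed value will wander further from the target, which is why the docstring says "approximately".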

---

### Demo: Calculating $r$

* Explore how the value of $r$ is calculated.
* Notice that $r$ is the average of the products of $x$ and $y$ in standard units.
* Define `correlation` as a function of a table and the labels of its x and y columns.

In [None]:
x = np.arange(1, 7, 1)
y = make_array(2, 3, 1, 5, 2, 7)
t = Table().with_columns(
        'x', x,
        'y', y)
t

In [None]:
t.scatter('x', 'y', s=30, color='red')

In [None]:
t = t.with_columns(
        'x (standard units)', ...,
        'y (standard units)', ...)
t

In [None]:
t = t.with_columns('product of standard units', 
                   ...)
t

In [None]:
r = ...
r

In [None]:
def correlation(t, x, y):
    '''t is a table; x and y are column labels'''
    x_in_standard_units = ...
    y_in_standard_units = ...
    return ...
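
As a rough sketch of the same calculation outside the `datascience` library, the version below works on plain NumPy arrays instead of a table and column labels. The name `array_correlation` is hypothetical, and standard units are assumed to use the SD that divides by the number of values.

```python
import numpy as np

def array_correlation(a, b):
    """Average of the products of a and b in standard units."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_su = (a - a.mean()) / a.std()
    b_su = (b - b.mean()) / b.std()
    return np.mean(a_su * b_su)

# Same small data set as the table t in this demo.
x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])
r = array_correlation(x, y)      # roughly 0.617
```

A table-based `correlation(t, x, y)` would follow the same three steps after pulling the two columns out of the table.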

In [None]:
correlation(t, 'x', 'y')

In [None]:
correlation(suv, 'mpg', 'msrp')

In [None]:
correlation(suv, 'acceleration', 'msrp')

---

## Care in Interpretation

---

### Watch Out For ...

* False conclusions of causation
* Nonlinearity
* Outliers
* Ecological Correlations

---

### Demo: Switching Axes

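Notice that `correlation(t, 'x', 'y') == correlation(t, 'y', 'x')`. Because $r$ is symmetric in the two variables, it cannot indicate which variable, if either, influences the other. Reading a direction into it can lead to a false conclusion of causation.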

In [None]:
t.scatter('x', 'y', s=30, color='red')
plt.title('y vs x')
plt.show()

In [None]:
correlation(t, 'x', 'y')

In [None]:
t.scatter('y', 'x', s=30, color='red')
plt.title('x vs y')
plt.show()

In [None]:
correlation(t, 'y', 'x')
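
The symmetry follows from the definition: the product of two standard-unit values does not depend on the order of the factors. A minimal array-based sketch, where `array_correlation` is a hypothetical helper rather than the lecture's `correlation`:

```python
import numpy as np

def array_correlation(a, b):
    """Average of the products of a and b in standard units."""
    a_su = (a - a.mean()) / a.std()
    b_su = (b - b.mean()) / b.std()
    return np.mean(a_su * b_su)

x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])

r_xy = array_correlation(x, y)
r_yx = array_correlation(y, x)   # axes switched; same products, same average
```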

---

### Demo: Nonlinearity

Explore the correlation calculation for symmetrical non-linear data.

In [None]:
new_x = np.arange(-4, 4.1, 0.5)
nonlinear = Table().with_columns(
        'x', new_x,
        'y', new_x**2)
nonlinear.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(nonlinear, 'x', 'y')
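
The same data can be checked with plain arrays. In this sketch, `array_correlation` is a hypothetical helper. The point is that $y$ is completely determined by $x$, yet $r$ comes out at essentially zero because the positive and negative products cancel by symmetry.

```python
import numpy as np

def array_correlation(a, b):
    """Average of the products of a and b in standard units."""
    a_su = (a - a.mean()) / a.std()
    b_su = (b - b.mean()) / b.std()
    return np.mean(a_su * b_su)

x = np.arange(-4, 4.1, 0.5)      # symmetric around 0
y = x ** 2                       # perfect, but non-linear, relationship

r = array_correlation(x, y)      # zero, up to floating-point rounding
```

A value of $r$ near 0 therefore rules out a linear association only, not an association of any kind.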

---

### Demo: Outliers

Notice how a single outlier can strengthen or weaken an $r$ value, depending on where it falls.

In [None]:
line = Table().with_columns(
        'x', make_array(1, 2, 3, 4),
        'y', make_array(1, 2, 3, 4))
line.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(line, 'x', 'y')

In [None]:
outlier = Table().with_columns(
        'x', make_array(1, 2, 3, 4, 5),
        'y', make_array(1, 2, 3, 4, 0)
    )
outlier.scatter('x', 'y', s=30, color='r')

In [None]:
correlation(outlier, 'x', 'y')
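
The arithmetic behind this demo can be reproduced with arrays. In this sketch, `array_correlation` is a hypothetical helper. The four points on the line give $r = 1$, and adding the single point $(5, 0)$ makes the products of standard units cancel, so $r$ drops to 0.

```python
import numpy as np

def array_correlation(a, b):
    """Average of the products of a and b in standard units."""
    a_su = (a - a.mean()) / a.std()
    b_su = (b - b.mean()) / b.std()
    return np.mean(a_su * b_su)

# Four points exactly on a line.
r_line = array_correlation(np.array([1, 2, 3, 4]),
                           np.array([1, 2, 3, 4]))

# The same points plus one outlier at (5, 0).
r_outlier = array_correlation(np.array([1, 2, 3, 4, 5]),
                              np.array([1, 2, 3, 4, 0]))
```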

---

### Demo: Ecological Correlations

Explore an example of SAT scores (`sat2014.csv`) in connection to ecological correlations.

In [None]:
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014

In [None]:
sat2014.scatter('Critical Reading', 'Math')

In [None]:
correlation(sat2014, 'Critical Reading', 'Math')

* The correlation between aggregated data (e.g., after grouping) may be much higher than the correlation between the underlying variables.
* The correlation between these scores at the state level does not translate to the same correlation between the variables at the individual level.
* You may see a linear pattern in your data, but you need to consider the factors contributing to the formation of that line.
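
The first point can be illustrated with a small simulation. Everything in it is made up for illustration: the number of groups, the group sizes, and the noise levels are assumptions, and the data have nothing to do with the SAT file. Individuals in a group share a common level, and averaging within groups removes most of the individual noise.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded so the sketch is repeatable
n_groups, group_size = 50, 200       # hypothetical "states" and "students"

# Each group has its own level that shifts both scores together.
group_level = rng.normal(0, 1, n_groups)

# Individual scores: shared group level plus large independent noise.
level = np.repeat(group_level, group_size)
x = level + rng.normal(0, 3, n_groups * group_size)
y = level + rng.normal(0, 3, n_groups * group_size)

def array_correlation(a, b):
    """Average of the products of a and b in standard units."""
    a_su = (a - a.mean()) / a.std()
    b_su = (b - b.mean()) / b.std()
    return np.mean(a_su * b_su)

r_individuals = array_correlation(x, y)

# Aggregate: one average per group, as with state-level averages.
x_means = x.reshape(n_groups, group_size).mean(axis=1)
y_means = y.reshape(n_groups, group_size).mean(axis=1)
r_groups = array_correlation(x_means, y_means)
```

In this setup the correlation between group averages is far stronger than the correlation between individuals, even though both are computed from the same underlying scores.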

In [None]:
import plotly.express as px

px.scatter(sat2014.to_df(), x = 'Critical Reading', y = 'Math', 
           hover_name = 'State', size = 'Participation Rate')

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>