In [None]:
# Imports
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Demonstration code
def r_scatter(r):
    "Generate a scatter plot with a correlation approximately r"
    x = np.random.normal(0, 1, 1000)
    z = np.random.normal(0, 1, 1000)
    y = r * x + (np.sqrt(1 - r ** 2)) * z
    plt.scatter(x, y)
    plt.xlim(-4, 4)
    plt.ylim(-4, 4)
    
def show_scatter_grid():
    plt.subplots(1, 4, figsize=(16, 4))
    for i, r in enumerate([-1, -2/3, -1/3, 0]):
        plt.subplot(1, 4, i+1)
        r_scatter(r)
        plt.title(f'r = {np.round(r, 2)}')
    plt.show()
    plt.subplots(1, 4, figsize=(16, 4))
    for i, r in enumerate([1, 2/3, 1/3]):
        plt.subplot(1, 4, i+1)
        r_scatter(r)
        plt.title(f'r = {np.round(r, 2)}')
    plt.subplot(1, 4, 4)
    plt.axis('off')
    plt.show()

# Lecture 24 – Prediction and Correlation

## DSC 10, Winter 2022

### Announcements

- Homework 7 is due **tomorrow at 11:59pm**.
- The Final Project is due on **Wednesday 3/9 at 11:59pm** ‼️
    - Start if you haven't already!
- Lab 8 is due on **Thursday 3/10 at 11:59pm**.
- The **Final Exam** is on **Saturday 3/12 from 3-6PM**.
    - Same format as the Midterm Exam (remote via Gradescope; mix of MC, short-answer, and code; open internet but no collaboration).
    - You will be receiving an email from Suraj this weekend if you asked to take an alternate.
    - Refer to the past exams on the [Resources](https://dsc10.com/resources) tab of the course website.
    - More logistics to come.
- **Important:** lecture today will be Zoom-only, but lectures next week will be back to in-person + Zoom.

### Agenda

- High-level overview of statistical inference.
- Prediction.
- Correlation.
- Regression.

## Overview of statistical inference

> I have collected some data. What can I learn about how my data was generated?

At a high level, the second half of this class has been about understanding where data comes from.

### Populations and samples

Sometimes, our observed data is in the form of a sample, and we want to use our sample to infer something about the population from which it was drawn. Some example questions:
- What is the value of this population parameter, e.g. the population mean?
    - **Strategy:** Create a confidence interval for the population parameter, using either the bootstrap or CLT (if it applies).
- Is the value of this population parameter equal to $x$?
    - **Strategy:** For a p% significance level, compute a (100-p)% confidence interval. Reject the null (that the parameter is equal to $x$) if $x$ is not in the interval.
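
As a rough illustration of both strategies, here is a minimal sketch that uses plain `numpy` rather than `babypandas`. The sample is synthetic, and the names `sample`, `boot_means`, and `reject_null` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: 200 draws from a population whose mean we pretend not to know.
sample = rng.normal(loc=10, scale=3, size=200)

# Bootstrap: resample the sample with replacement and record each resample's mean.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(2000)
])

# The middle 95% of the bootstrapped means is an approximate 95% confidence interval.
left, right = np.percentile(boot_means, [2.5, 97.5])

def reject_null(x, left, right):
    """Reject 'the parameter equals x' at the 5% level if x is outside the 95% interval."""
    return not (left <= x <= right)
```

Calling `reject_null(x, left, right)` with a hypothesized value `x` then carries out the second strategy at the 5% significance level.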

### Models

Other times, we want to test the validity of a **model**, which is a set of assumptions about how data were generated.
Some example questions:

- Is the data in this sample consistent with what was expected?
    - **Strategy:** Perform a hypothesis test, with a mean, proportion, or absolute difference as a test statistic.
- Was this sample drawn from this specific categorical distribution?
    - **Strategy:** Perform a hypothesis test, with the total variation distance as a test statistic.
- Are these two samples from the same population?
    - **Strategy:** Perform a permutation test.
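
As a reminder of what a permutation test looks like in code, here is a minimal sketch on synthetic data. It assumes the absolute difference in group means as the test statistic. The names `group_a`, `group_b`, and `abs_mean_diff` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two groups of numeric measurements.
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=1.0, scale=1.0, size=50)

def abs_mean_diff(values, n_first):
    """Absolute difference in means between the first n_first values and the rest."""
    return abs(values[:n_first].mean() - values[n_first:].mean())

pooled = np.concatenate([group_a, group_b])
observed = abs_mean_diff(pooled, len(group_a))

# Under the null, the group labels are arbitrary, so shuffling the pooled
# values simulates statistics that are comparable to the observed one.
simulated = np.array([
    abs_mean_diff(rng.permutation(pooled), len(group_a))
    for _ in range(2000)
])

# p-value: proportion of simulated statistics at least as extreme as the observed one.
p_value = np.mean(simulated >= observed)
```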

### Now what?

- So far, we've given you an introduction to statistical inference.
- However, we have not yet spoken much about **prediction** – given a sample, what can I predict about data not in that sample?
    - Example from earlier in the quarter: Galton's method for predicting the heights of children given their parents' heights.
- Starting today, we'll focus on **linear regression**, a prediction technique that tries to find the best "linear relationship" between two numeric variables.
    - Along the way, we'll address another idea – **correlation**.
    - You will see linear regression several more times throughout your time at UCSD – it is one of the most important tools to have in your data science toolkit.

## Prediction

### Prediction

- Suppose we have a dataset with at least two variables, e.g. education level and income.
- We're interested in **predicting** the future – predicting one variable based on another:
    - Given my education level, what is my income?
    - Given my height, how tall will my kid be as an adult?
    - Given my income, how much does my car cost?
- To do this, we need to first observe a pattern between the two variables.

### Association

- An **association** is any link or relationship between two variables in a scatter plot.
- Associations can be linear or non-linear.
- If two variables have a positive association, then as one variable increases, the other tends to increase.
- If two variables have a negative association, then as one variable increases, the other tends to decrease.
- As we saw earlier in the quarter, association $\neq$ causation!

### Example: hybrid cars

In [None]:
hybrid = bpd.read_csv('data/hybrid.csv')
hybrid

### `'acceleration'` and `'msrp'`
- Is there an association?
- What kind of association?

In [None]:
hybrid.plot(kind='scatter', x='acceleration', y='msrp', figsize=(10, 5));

### `'mpg'` and `'msrp'`

- Is there an association?
- What kind of association?

In [None]:
hybrid.plot(kind='scatter', x='mpg', y='msrp', figsize=(10, 5));

**Observations:**
- There is an association – cars with better fuel economy tended to be cheaper.
    - Why is that? 🤔
- The association looks more curved than linear. 
    - It may roughly follow $y \approx \frac{1}{x}$.
   

### Understanding units
- A linear change in units doesn't change the shape of the plot; it only changes the scale of the plot.
    - Linear change means adding or subtracting a constant, and multiplying or dividing by a constant.
- In other words, instead of plotting price in _dollars_ and fuel economy in _MPG_, we can plot price in _Euros_ and fuel economy in _kilometers per liter_ and the plot would look the same, just with different axes:

In [None]:
hybrid.assign(
        km_per_liter=hybrid.get('mpg') * 0.425144,
        eur=hybrid.get('msrp') * 0.84 
).plot(kind='scatter', x='km_per_liter', y='eur', figsize=(10, 5));

### Converting columns to standard units
- Recall: to convert $x$ to standard units, we compute

$$z(x) = \frac{x - \text{mean of all $x$s}}{\text{SD of all $x$s}}$$

- Converting columns to standard units makes different scatter plots comparable, by making the $x$ and $y$ axes "similarly scaled."
    - Both axes measure the number of standard deviations a data point is above or below its mean.
- Converting columns to standard units doesn't change the shape of the scatter plot (the conversion is linear).

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    any_numbers = np.array(any_numbers)
    return (any_numbers - any_numbers.mean()) / np.std(any_numbers)
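
As a quick sanity check, the sketch below applies the same formula to a small array of made-up heights. It re-defines `standard_units` so that it runs on its own. The values in `heights` are hypothetical.

```python
import numpy as np

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    any_numbers = np.array(any_numbers)
    return (any_numbers - any_numbers.mean()) / np.std(any_numbers)

heights = np.array([60, 62, 65, 70, 73])  # hypothetical heights, in inches

heights_su = standard_units(heights)

# Converting inches to centimeters first is a linear change of units,
# so the values in standard units should come out the same.
heights_cm_su = standard_units(heights * 2.54)
```

After conversion, `heights_su` has mean 0 and SD 1 (up to floating-point error), and `heights_cm_su` matches `heights_su`.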

In [None]:
def standardize(df):
    """Return a DataFrame in which all columns of df are converted to standard units."""
    df_su = bpd.DataFrame()
    for column in df.columns:
        df_su = df_su.assign(**{column + ' (su)': standard_units(df.get(column))})
    return df_su

### Standard units for hybrid cars
For a given pair of variables:
- Which cars are average from both perspectives?
- Which cars are both well above/below average?

In [None]:
hybrid_su = standardize(hybrid.get(['msrp', 'acceleration', 'mpg'])).assign(vehicle=hybrid.get('vehicle'))
hybrid_su

### `'acceleration'` and `'msrp'`

In [None]:
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='msrp (su)', figsize=(10, 5));

Which cars have `'acceleration'`s and `'msrp'`s that are more than 2 SDs above average?

In [None]:
hybrid_su[(hybrid_su.get('acceleration (su)') > 2) &
          (hybrid_su.get('msrp (su)') > 2)]

### `'mpg'` and `'msrp'`

In [None]:
hybrid_su.plot(kind='scatter', x='mpg (su)', y='msrp (su)', figsize=(10, 5));

Which cars have close to average `'mpg'`s and close to average `'msrp'`s?

In [None]:
hybrid_su[(hybrid_su.get('mpg (su)') <= 0.3) &
          (hybrid_su.get('mpg (su)') >= -0.3) &
          (hybrid_su.get('msrp (su)') <= 0.3) &
          (hybrid_su.get('msrp (su)') >= -0.3)]

### Observation on associations in standard units
- If two variables are positively associated,
    - their high, positive values in standard units are typically seen together, and
    - their low, negative values are seen together as well.
- If two variables are negatively associated,
    - high, positive values of one are typically coupled with low, negative values of the other.
- If two variables aren't associated, there should be no such pattern.
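
One way to make this pattern concrete is to count how often the two variables fall on the same side of their means, that is, how often the product of their standard units is positive. The sketch below uses synthetic data with a positive association. Every name in it is hypothetical.

```python
import numpy as np

def standard_units(values):
    values = np.array(values)
    return (values - values.mean()) / np.std(values)

rng = np.random.default_rng(7)

# Synthetic, positively associated data (not the hybrid cars dataset).
x = rng.normal(0, 1, 1000)
y = 0.8 * x + 0.6 * rng.normal(0, 1, 1000)

# A positive product means both values are above average or both are below average.
same_side = np.mean(standard_units(x) * standard_units(y) > 0)

# Negating y makes the association negative, so the pattern should flip.
same_side_negative = np.mean(standard_units(x) * standard_units(-y) > 0)
```

For positively associated data, `same_side` should be well above one half, while `same_side_negative` should be well below it.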

In [None]:
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='msrp (su)', figsize=(10, 5))
plt.axvline(0, color='black');
plt.axhline(0, color='black');

In [None]:
hybrid_su.plot(kind='scatter', x='mpg (su)', y='msrp (su)', figsize=(10, 5))
plt.axvline(0, color='black');
plt.axhline(0, color='black');

## Correlation

### Definition: correlation coefficient

**Definition**: The correlation coefficient $r$ of two variables $x$ and $y$ is the 
- **average** value of the 
- **product** of $x$ and $y$
- when both are measured in **standard units**.

If `x` and `y` are two Series or arrays, 
```py
r = (x_su * y_su).mean()
```
where `x_su` and `y_su` are `x` and `y` converted to standard units.
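
To see the definition in action on a tiny example, the sketch below computes $r$ by hand for two short, made-up arrays. The helper `correlation` is a name chosen for this example. NumPy's built-in `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

def standard_units(values):
    values = np.array(values)
    return (values - values.mean()) / np.std(values)

def correlation(x, y):
    """Average of the product of x and y, both measured in standard units."""
    return (standard_units(x) * standard_units(y)).mean()

# Hypothetical data.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 6])

r = correlation(x, y)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(x, y)[0, 1]
```

The two values agree up to floating-point error.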

Let's calculate $r$ for `'acceleration'` and `'msrp'`.

In [None]:
hybrid_su

In [None]:
r_acc_price = (hybrid_su.get('acceleration (su)') * hybrid_su.get('msrp (su)')).mean()
r_acc_price

In [None]:
hybrid_su.plot(kind='scatter', x='acceleration (su)', y='msrp (su)', figsize=(10, 5))
plt.axvline(0, color='black');
plt.axhline(0, color='black');

### The correlation coefficient: $r$

- $r$ measures how clustered points are around a straight line – **it measures linear association**.
    - If two variables are correlated, it means they are linearly associated.
- $r$ is always between $-1$ and $1$.
    - If $r = 1$, the points fall exactly on a line with positive slope (slope 1 in standard units).
    - If $r = -1$, the points fall exactly on a line with negative slope (slope -1 in standard units).
    - If $r = 0$, there is no linear association (_uncorrelated_).
- $r$ is computed based on standard units.
    - The correlation between price in _dollars_ and fuel economy in _MPG_ is the same as the correlation between price in _Euros_ and fuel economy in _kilometers per liter_.
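
The unit-invariance claim can be checked numerically. The sketch below uses synthetic prices and fuel economies (not the hybrid cars data) and reuses the conversion factors from earlier in the lecture. All variable names are hypothetical.

```python
import numpy as np

def standard_units(values):
    values = np.array(values)
    return (values - values.mean()) / np.std(values)

def correlation(x, y):
    return (standard_units(x) * standard_units(y)).mean()

rng = np.random.default_rng(1)

# Synthetic data: price tends to fall as fuel economy rises.
mpg = rng.uniform(20, 60, size=100)
msrp = 60000 - 700 * mpg + rng.normal(0, 5000, size=100)

r_original = correlation(mpg, msrp)

# Multiplying by positive constants (MPG -> km/L, dollars -> Euros) leaves r unchanged.
r_converted = correlation(mpg * 0.425144, msrp * 0.84)

# Adding or subtracting constants leaves r unchanged too.
r_shifted = correlation(mpg + 10, msrp - 1000)
```

All three values of $r$ agree up to floating-point error.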

Let's now calculate $r$ for `'mpg'` and `'msrp'`.

In [None]:
hybrid_su

In [None]:
r_mpg_price = (hybrid_su.get('mpg (su)') * hybrid_su.get('msrp (su)')).mean()
r_mpg_price

In [None]:
hybrid_su.plot(kind='scatter', x='mpg (su)', y='msrp (su)', figsize=(10, 5));
plt.axvline(0, color='black');
plt.axhline(0, color='black');

### Scatter plots with different correlation coefficients

In [None]:
show_scatter_grid()

### Discussion Question

In [None]:
x2 = bpd.DataFrame().assign(
    x=np.arange(-6, 6.1, 0.5), 
    y=np.arange(-6, 6.1, 0.5) ** 2
)
x2.plot(kind='scatter', x='x', y='y', figsize=(10, 5));

Does the above scatter plot show:

- A. Association and correlation?
- B. Association but not correlation?
- C. Correlation but not association?
- D. Neither association nor correlation?

### To answer, go to [menti.com](https://menti.com) and enter the code 6382 7990.

### Answer

In [None]:
products = standard_units(x2.get('x')) * standard_units(x2.get('y'))
products

In [None]:
np.mean(products)

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(products, bins=np.arange(-3.5, 3.6), ec='w');
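
The computation above points to option B. The points follow a clear pattern, $y = x^2$, so there is an association. But the products of standard units on the left half of the parabola cancel those on the right half, so $r$ is essentially 0. Here is a self-contained version of that check using plain `numpy`, offered as a sketch rather than as part of the original demo.

```python
import numpy as np

def standard_units(values):
    values = np.array(values)
    return (values - values.mean()) / np.std(values)

# Same x and y values as the scatter plot above.
x = np.arange(-6, 6.1, 0.5)
y = x ** 2

# The x values are symmetric around 0, so positive and negative products cancel.
r = (standard_units(x) * standard_units(y)).mean()
```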

## Regression

### Goal: predict a child's height from the height of their parents

- Earlier in the quarter, we looked at Galton's method of predicting heights.
    - A child's "midparent" height is a weighted average of the heights of their parents.
- Observation: Children of shorter parents tend to be shorter!

In [None]:
galton = bpd.read_csv('data/galton.csv')
galton.plot(kind='scatter', x='midparentHeight', y='childHeight', figsize=(10, 5));

### Goal: predict a child's height from the height of their parents
Galton's method involved predicting a child's height by
- looking at all midparents within $\pm 0.5$ inches of the child's midparent height and
- averaging the heights of all children of those midparents.

In [None]:
def predict_child(parent_height):
    """Return a prediction of the height of a child 
    whose parents have a midparent height of parent_height.
    """
    close_points = galton[(galton.get('midparentHeight') <= parent_height + 0.5) & 
                          (galton.get('midparentHeight') >= parent_height - 0.5)]
    return close_points.get('childHeight').mean()

with_predictions = galton.assign(
    Prediction=galton.get('midparentHeight').apply(predict_child)
)
with_predictions

In [None]:
ax = with_predictions.plot(kind='scatter', x='midparentHeight', y='childHeight')
with_predictions.plot(kind='scatter', x='midparentHeight', y='Prediction', ax=ax, color='C2', label="graph of averages", figsize=(10, 5));
plt.legend();

- This is a **graph of averages**.
- We grouped each $x$ value with nearby $x$ values, and averaged the corresponding $y$ values for each group.
- Each gold point is the predicted $y$ value for one group.
- Notice: the graph of averages looks like a straight line! **Let's try and find that line.**

### Correlation

Let's calculate the correlation between `'midparentHeight'` and `'childHeight'`.

In [None]:
heights_su = standardize(galton.get(['midparentHeight', 'childHeight']))
heights_su

In [None]:
r_mid_child = (heights_su.get('midparentHeight (su)') * heights_su.get('childHeight (su)')).mean()
r_mid_child

### The regression line

Suppose **$x$ and $y$ are in standard units**, and $r$ is the correlation coefficient between $x$ and $y$. Then, the regression line is defined as follows:

<center><img src='data/regression-line.png' width=400></center>

- The regression line is the line through $(0,0)$ with slope $r$.
- If $x$ and $y$ are linearly associated, then the graph of averages will be very similar to the regression line.
- If the regression line is given by $f(x) = mx + b$, then the predicted $y$ for a given $x$ is $f(x)$.

In [None]:
heights_su.plot(kind='scatter', x='midparentHeight (su)', y='childHeight (su)', figsize=(10, 5))
plt.plot(np.arange(-3, 3, 0.1), np.arange(-3, 3, 0.1) * r_mid_child, color='purple', label='regression line');
plt.legend();

### Making predictions in standard units

- If $r = 0.32$, and the given $x$ is 2 in standard units, then:
    - The prediction for $y$ is 0.64 standard units.
    - The regression line predicts that parents whose midparent height is 2 SDs above average have children whose heights are 0.64 SDs above average.

- **Note:** We predict that a child will be somewhat closer to average than their parents.
    - This is a consequence of the slope ($r$, in this case) having magnitude less than 1.
    - This effect is called **regression to the mean**.
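
The arithmetic above can be written as a one-line function. This is only a sketch, and `predict_su` is a hypothetical name. It also shows regression to the mean: since $r$ is between $-1$ and $1$, the prediction is never further from average (in SDs) than the input.

```python
def predict_su(r, x_su):
    """Predict y in standard units from x in standard units, using slope r."""
    return r * x_su

# The example from the bullets above: r = 0.32, x is 2 SDs above average.
prediction = predict_su(0.32, 2)  # 0.64 standard units

# Regression to the mean: the prediction is closer to 0 than the input was.
closer_to_average = abs(prediction) <= abs(2)
```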

### Making predictions in original units

Of course, we'd like to be able to predict a child's height in original units, not just in standard units. Here's how we'll approach this problem:
1. Convert `'midparentHeight'` to standard units.
2. Use the correlation coefficient to predict `'childHeight'` in standard units.
3. Scale the predicted `'childHeight'` from standard units back to inches.

In [None]:
parent_mean = galton.get('midparentHeight').mean()
parent_sd = np.std(galton.get('midparentHeight'))
child_mean = galton.get('childHeight').mean()
child_sd = np.std(galton.get('childHeight'))

In [None]:
def predict_with_r(parent):
    """Return a prediction of the height of a child 
    whose parents have a midparent height of parent, 
    using linear regression.
    """
    parent_su = (parent - parent_mean) / parent_sd
    child_su = r_mid_child * parent_su
    return child_su * child_sd + child_mean
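
The three steps above can be collapsed into a single line in original units. Substituting step 1 into step 2 and then into step 3 gives a line with slope $r \cdot \frac{\text{SD of } y}{\text{SD of } x}$ and intercept $\text{mean of } y - \text{slope} \cdot \text{mean of } x$. The sketch below checks that the two approaches agree. The summary statistics in it are made-up placeholders, not values computed from the Galton data.

```python
import numpy as np

# Hypothetical summary statistics, for illustration only.
r = 0.32
parent_mean, parent_sd = 69.0, 1.8
child_mean, child_sd = 66.7, 3.6

def predict_three_steps(parent):
    parent_su = (parent - parent_mean) / parent_sd   # step 1: to standard units
    child_su = r * parent_su                         # step 2: predict in standard units
    return child_su * child_sd + child_mean          # step 3: back to original units

# The same prediction, written as a line in original units.
slope = r * child_sd / parent_sd
intercept = child_mean - slope * parent_mean

def predict_line(parent):
    return slope * parent + intercept
```

Both functions return the same prediction for any input, and both predict `child_mean` when the input is `parent_mean`.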

In [None]:
predict_with_r(56)

In [None]:
preds = with_predictions.assign(
    Prediction_r=galton.get('midparentHeight').apply(predict_with_r)
)
ax = preds.plot(kind='scatter', x='midparentHeight', y='childHeight', figsize=(10, 5))
preds.plot(kind='scatter', x='midparentHeight', y='Prediction', ax=ax, color='C2', label='graph of averages')
preds.plot(kind='line', x='midparentHeight', y='Prediction_r', ax=ax, color='purple', label='regression line');
plt.legend();

As you can see, the graph of averages and the regression line are pretty similar!

### Discussion Question

A course has a midterm (mean 80, standard deviation 15) and a really hard final (mean 50, standard deviation 12).

If the scatter diagram comparing midterm & final scores for students looks linearly associated with correlation 0.75, then what is the predicted final exam score for a student who received a 90 on the midterm?

- A. 54
- B. 56
- C. 58
- D. 60
- E. 62

### To answer, go to [menti.com](https://menti.com) and enter the code 6382 7990.
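
One way to check an answer is to follow the same three steps used for the height predictions: convert the midterm score to standard units, multiply by $r$, and convert back to final exam points. The sketch below does this with the numbers from the question. The variable names are chosen for this example.

```python
# Numbers given in the question.
midterm_mean, midterm_sd = 80, 15
final_mean, final_sd = 50, 12
r = 0.75

midterm_score = 90

# Step 1: a 90 on the midterm is (90 - 80) / 15 = 2/3 of an SD above average.
midterm_su = (midterm_score - midterm_mean) / midterm_sd

# Step 2: predict the final in standard units: 0.75 * 2/3 = 0.5.
final_su = r * midterm_su

# Step 3: convert back to points: 0.5 * 12 + 50 = 56.
predicted_final = final_su * final_sd + final_mean
```

Under these steps the prediction is 56, which corresponds to option B.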

## Summary

### Summary

- The correlation coefficient, $r$, measures the linear association between two variables $x$ and $y$.
    - It ranges between -1 and 1.
- When $x$ and $y$ are in standard units, the regression line is the straight line passing through $(0, 0)$ with slope $r$. We can use it to make predictions for a $y$ value (e.g. child's height) given an $x$ value (e.g. midparent height).
- **Next time:** more on regression.