In [None]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Animations
import ipywidgets as widgets
from IPython.display import display, HTML

# Lecture 17 – Standardization and the Normal Distribution

## DSC 10, Fall 2023

### Announcements

- Lab 4 is due **tomorrow at 11:59PM**.
    - It's fine if `grader.check_all()` fails for Questions 1.5 and 1.6; see this [Ed post](https://edstem.org/us/courses/48101/discussion/3791071).
- Homework 4 is due **Saturday at 11:59PM**.
- Please fill out the [**Mid-Quarter Survey**](https://docs.google.com/forms/d/e/1FAIpQLSenMue3wGwX7OVIE0RMJ4OFzMtg0YG3T2PqXikcB7594ij5kg/viewform), where you can provide course feedback anonymously. **If at least 80% of the class fills it out by Saturday at 11:59PM, everyone will earn 2 additional points on the Midterm Exam!**
- Suraj's 1PM lecture is meeting on Zoom today; [here's the link](https://ucsd.zoom.us/my/rampure).
- Friday is a holiday, so there is no lecture and no office hours.

### Agenda

- Recap: Standard deviation and Chebyshev's inequality.
- Standardization.
- The normal distribution.

## Recap: Standard deviation and Chebyshev's inequality

### Variance and standard deviation

$$\begin{align*}\text{variance} &= \text{average squared deviation from the mean}\\
&= \frac{(\text{value}_1 - \text{mean})^2 + ... + (\text{value}_n - \text{mean})^2}{n}\\
\text{standard deviation} &= \sqrt{\text{variance}}
\end{align*}$$

where $n$ is the number of observations.
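As a quick sanity check, the definition above can be computed step by step. This is a minimal sketch on a small made-up array (the values are an assumption chosen so the arithmetic comes out cleanly, not data from the lecture).

```python
import numpy as np

# Made-up data, chosen only so the numbers are easy to check by hand.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.sum() / len(values)
deviations = values - mean                         # deviation of each value from the mean
variance = (deviations ** 2).sum() / len(values)   # average squared deviation
sd = variance ** 0.5                               # square root of the variance

print(mean, variance, sd)

# np.std divides by n by default, so it agrees with the definition above.
print(np.std(values))
```

Here the mean is 5, the variance is 4, and the SD is 2.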

### What can we do with the standard deviation?

It turns out, in **any** numerical distribution, the bulk of the data are in the range “mean ± a few SDs”.

Let's make this more precise.

### Chebyshev’s inequality

**Fact**: In **any** numerical distribution, the proportion of values in the range “mean ± $z$ SDs” is at least 

$$1 - \frac{1}{z^2}
$$

|Range|Proportion|
|---|---|
|mean ± 2 SDs|	at least $1 - \frac{1}{4}$   (75%)|
|mean ± 3 SDs|	at least $1 - \frac{1}{9}$   (88.88..%)|
|mean ± 4 SDs|	at least $1 - \frac{1}{16}$ (93.75%)|
|mean ± 5 SDs|	at least $1 - \frac{1}{25}$  (96%)|
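
The rows of this table come directly from the formula. Here is a small sketch that recomputes them; the function name `chebyshev_lower_bound` is made up for illustration.

```python
def chebyshev_lower_bound(z):
    """Lower bound on the proportion of values within z SDs of the mean."""
    return 1 - 1 / z ** 2

for z in [2, 3, 4, 5]:
    print(f"mean ± {z} SDs: at least {chebyshev_lower_bound(z):.2%}")
```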


### Flight delays, revisited ✈️

In [None]:
delays = bpd.read_csv('data/united_summer2015.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');

In [None]:
delay_mean = delays.get('Delay').mean()
delay_mean

In [None]:
delay_std = np.std(delays.get('Delay')) # There is no .std() method in babypandas!
delay_std

### Mean and standard deviation

Chebyshev's inequality tells us that

- **At least** 75% of delays are in the following interval:

In [None]:
delay_mean - 2 * delay_std, delay_mean + 2 * delay_std

- **At least** 88.88% of delays are in the following interval:

In [None]:
delay_mean - 3 * delay_std, delay_mean + 3 * delay_std

Let's visualize these intervals!

In [None]:
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, alpha=0.65, ec='w', figsize=(10, 5), title='Flight Delays')
plt.axvline(delay_mean - 2 * delay_std, color='maroon', label='± 2 SD')
plt.axvline(delay_mean + 2 * delay_std, color='maroon')

plt.axvline(delay_mean + 3 * delay_std, color='blue',  label='± 3 SD')
plt.axvline(delay_mean - 3 * delay_std, color='blue')

plt.axvline(delay_mean, color='green', label='Mean')
plt.scatter([delay_mean], [-0.0017], color='green', marker='^', s=250)
plt.ylim(-0.0038, 0.06)
plt.legend();

### Chebyshev's inequality provides _lower_ bounds!

Remember, Chebyshev's inequality states that **at least** $1 - \frac{1}{z^2}$ of values are within $z$ SDs of the mean, for any numerical distribution.

For instance, it tells us that **at least** 75% of delays are in the following interval:

In [None]:
delay_mean - 2 * delay_std, delay_mean + 2 * delay_std

However, in this case, a much larger fraction of delays are in that interval.

In [None]:
within_2_sds = delays[(delays.get('Delay') >= delay_mean - 2 * delay_std) & 
                      (delays.get('Delay') <= delay_mean + 2 * delay_std)]

within_2_sds.shape[0] / delays.shape[0]

If we know more about the shape of the distribution, we can provide better guarantees for the proportion of values within $z$ SDs of the mean.
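
To see that the bound holds even for a skewed distribution, here is a sketch on synthetic data. The exponential sample and the helper name `proportion_within` are assumptions for illustration; this is not the flight delays data.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic right-skewed data (an assumption, standing in for something like delays).
data = rng.exponential(scale=10, size=10_000)

def proportion_within(values, z):
    """Proportion of values within z SDs of the mean."""
    lower = values.mean() - z * np.std(values)
    upper = values.mean() + z * np.std(values)
    return np.mean((values >= lower) & (values <= upper))

for z in [2, 3]:
    actual = proportion_within(data, z)
    bound = 1 - 1 / z ** 2
    print(f"z = {z}: actual proportion {actual:.4f}, Chebyshev bound {bound:.4f}")
```

The actual proportions can never fall below the bounds, but they are often well above them.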

### Activity

For a particular set of data points, Chebyshev's inequality states that at least $\frac{8}{9}$ of the data points are between $-20$ and $40$. What is the standard deviation of the data?


<details><summary>✅ Click here to see the answer <b>after</b> you've tried it yourself.</summary>

- Chebyshev's inequality states that at least $1 - \frac{1}{z^2}$ of values are within $z$ standard deviations of the mean.
- When $z = 3$, $1 - \frac{1}{z^2} = \frac{8}{9}$.
- So, $-20$ is $3$ standard deviations below the mean, and $40$ is $3$ standard deviations above the mean.
- $10$ is in the middle of $-20$ and $40$, so the mean is $10$.
- $3$ standard deviations are between $10$ and $40$, so $1$ standard deviation is $\frac{30}{3} = 10$.
</details>

## Standardization

### Heights and weights 📏

We'll work with a data set containing the heights and weights of 5000 adult males.

In [None]:
height_and_weight = bpd.read_csv('data/height_and_weight.csv')
height_and_weight

### Distributions of height and weight

Let's look at the distributions of both numerical variables.

In [None]:
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=30, alpha=0.8, figsize=(10, 5));

In [None]:
height_and_weight.plot(kind='hist', y='Weight', density=True, ec='w', bins=30, alpha=0.8, color='C1', figsize=(10, 5));

In [None]:
height_and_weight.plot(kind='hist', density=True, ec='w', bins=60, alpha=0.8, figsize=(10, 5));

**Observation**: The two distributions look like shifted and stretched versions of the same basic shape, called a bell curve 🔔.

### Standard units

Suppose $x$ is a numerical variable, and $x_i$ is one value of that variable. Then, $$x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$$

represents $x_i$ in **standard units** – the number of standard deviations $x_i$ is above the mean.

**Example**: Suppose someone weighs 225 pounds. What is their weight in standard units?

In [None]:
weights = height_and_weight.get('Weight')
(225 - weights.mean()) / np.std(weights)

- Interpretation: 225 is 1.92 standard deviations above the mean weight.
- 225 becomes 1.92 in **standard units**.
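
The same conversion can be checked by hand on a tiny example. This sketch uses made-up weights (an assumption, chosen so the mean is 180 and the SD is 20), and the helper name `to_standard_units` is hypothetical.

```python
import numpy as np

# Made-up weights in pounds, not the lecture's data set.
sample_weights = np.array([150, 170, 170, 170, 180, 180, 200, 220])

sample_mean = sample_weights.mean()   # 180.0
sample_sd = np.std(sample_weights)    # 20.0

def to_standard_units(value):
    """Number of SDs that value is above (positive) or below (negative) the mean."""
    return (value - sample_mean) / sample_sd

print(to_standard_units(220))   # 2 SDs above the mean
print(to_standard_units(150))   # 1.5 SDs below the mean
```

A negative result means the value is below the mean.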

### Standardization

The process of converting all values of a variable (i.e. a column) to standard units is known as standardization, and the resulting values are considered to be **standardized**.

In [None]:
def standard_units(col):
    return (col - col.mean()) / np.std(col)

In [None]:
standardized_height = standard_units(height_and_weight.get('Height'))
standardized_height

In [None]:
standardized_weight = standard_units(height_and_weight.get('Weight'))
standardized_weight

### The effect of standardization

Standardized variables have:
- A mean of 0.
- An SD of 1.

We often standardize variables to bring them to the same scale.
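
One way to see what "the same scale" means: standardized values don't depend on the units of measurement. The sketch below uses made-up heights (an assumption) recorded in inches and in centimeters.

```python
import numpy as np

def standard_units(values):
    return (values - values.mean()) / np.std(values)

# Made-up heights, recorded in two different units.
heights_in = np.array([64.0, 66.5, 68.0, 69.5, 71.0, 74.0])
heights_cm = heights_in * 2.54

su_in = standard_units(heights_in)
su_cm = standard_units(heights_cm)

print(su_in.mean(), np.std(su_in))   # approximately 0 and 1
print(np.allclose(su_in, su_cm))     # the units cancel out
```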

In [None]:
# e-15 means 10^(-15), which is a very small number, effectively zero.
standardized_height.describe()

In [None]:
standardized_weight.describe()

Let's look at how the process of standardization works visually.

In [None]:
HTML('data/height_anim.html')

In [None]:
HTML('data/weight_anim.html')

### Standardized histograms

Now that we've standardized the distributions of height and weight, let's see how they look on the same set of axes.

In [None]:
standardized_height_and_weight = bpd.DataFrame().assign(
    Height=standardized_height,
    Weight=standardized_weight
)

In [None]:
standardized_height_and_weight.plot(kind='hist', density=True, ec='w',bins=30, alpha=0.8, figsize=(10, 5));

These both look pretty similar!

## The standard normal distribution

### The standard normal distribution

- The distributions we've seen look essentially the same once standardized.
- This distribution is called the **standard normal distribution**. The shape is called the **standard normal curve**.

$$
\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^2}
$$


- You don't need to know the formula – just the shape!
    - We'll just use the formula today to make plots.

### The standard normal curve

In [None]:
def normal_curve(z):
    return 1 / np.sqrt(2 * np.pi) * np.exp((-z**2)/2)

In [None]:
x = np.linspace(-4, 4, 1000)
y = normal_curve(x)

plt.figure(figsize=(10, 5))
plt.plot(x, y, color='black');
plt.xlabel('$z$');
plt.title(r'$\phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}z^2}$');

### Heights/weights are roughly normal

If a distribution follows this shape, we say it is roughly normal.

In [None]:
standardized_height_and_weight.plot(kind='hist', density=True, ec='w', bins=120, alpha=0.8, figsize=(10, 5));
plt.plot(x, y, color='black', linestyle='--', label='Normal', linewidth=5)
plt.legend(loc='upper right');

### The standard normal distribution

- Think of the normal distribution as a "continuous histogram".

- Its mean and median are both 0 – it is symmetric.

- It has inflection points at $\pm 1$.
    - More on this later.

- Like a histogram:
    - The **area** between $a$ and $b$ is the **proportion** of values between $a$ and $b$.
    - The total area underneath the normal curve is 1.
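
These area claims can be checked numerically. The sketch below approximates areas with thin trapezoids; the helper `area_under_curve` and the range $[-8, 8]$ used for the "total" area are assumptions made for illustration.

```python
import numpy as np

def normal_curve(z):
    return 1 / np.sqrt(2 * np.pi) * np.exp(-z ** 2 / 2)

def area_under_curve(a, b, num_points=10_001):
    """Approximate the area under the standard normal curve between a and b."""
    z = np.linspace(a, b, num_points)
    heights = normal_curve(z)
    width = z[1] - z[0]
    # Trapezoid rule: average adjacent heights, multiply by the strip width.
    return np.sum((heights[:-1] + heights[1:]) / 2) * width

total = area_under_curve(-8, 8)    # the curve is essentially 0 beyond ±8
middle = area_under_curve(-1, 1)

print(total, middle)
```

The total comes out very close to 1, and the area between $-1$ and $1$ is roughly 0.68.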

In [None]:
def normal_area(a, b, bars=False):
    x = np.linspace(-4, 4, 1000)
    y = normal_curve(x)
    ix = (x >= a) & (x <= b)
    plt.figure(figsize=(10, 5))
    plt.plot(x, y, color='black')
    plt.fill_between(x[ix], y[ix], color='gold')
    if bars:
        plt.axvline(a, color='red')
        plt.axvline(b, color='red')
    plt.title(f'Area between {np.round(a, 2)} and {np.round(b, 2)}')
    plt.show()

In [None]:
a = widgets.FloatSlider(value=0, min=-4,max=3,step=0.25, description='a')
b = widgets.FloatSlider(value=1, min=-4,max=4,step=0.25, description='b')
bars = widgets.Checkbox(value=False, description='bars')
ui = widgets.HBox([a, b, bars])
out = widgets.interactive_output(normal_area, {'a': a, 'b': b, 'bars': bars})
display(ui, out)

### Cumulative distribution functions

- The _cumulative distribution function_ (CDF) of a distribution is a function that takes in a value $z$ and returns the proportion of values in the distribution that are less than or equal to $z$, i.e. **the area under the curve to the left of $z$**.

In [None]:
# cdf(0) should give us the gold area below.
normal_area(-np.inf, 0)

- To find areas under curves, we typically use integration (calculus). However, the standard normal curve has no closed-form integral.

- Often, people refer to [tables](https://www.math.arizona.edu/~jwatkins/normal-table.pdf) that contain approximations of the CDF of the standard normal distribution.

- We'll use an approximation built into the `scipy` module in Python. The function `scipy.stats.norm.cdf(z)` computes the **area under the standard normal curve to the left of `z`**.
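
If you're curious how such an approximation can be computed without `scipy`, one option is the standard identity $\Phi(z) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$, where $\text{erf}$ is the error function from Python's `math` module. This is a sketch for cross-checking only; the name `standard_normal_cdf` is hypothetical, and we are not claiming this is how `scipy` computes its values.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(standard_normal_cdf(0))    # half of the area is to the left of 0
print(standard_normal_cdf(2))
print(standard_normal_cdf(-1) + standard_normal_cdf(1))   # symmetric, so these sum to 1
```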

### Areas under the standard normal curve

What does `scipy.stats.norm.cdf(0)` evaluate to? Why?

In [None]:
normal_area(-np.inf, 0)

In [None]:
from scipy import stats
stats.norm.cdf(0)

### Areas under the standard normal curve

Suppose we want to find the area to the **right** of 2 under the standard normal curve.

In [None]:
normal_area(2, np.inf)

The following expression gives us the area to the **left** of 2.

In [None]:
stats.norm.cdf(2)

In [None]:
normal_area(-np.inf, 2)

However, since the total area under the standard normal curve is 1:

$$\text{area right of $2$} = 1 - (\text{area left of $2$})$$

In [None]:
1 - stats.norm.cdf(2)

### Areas under the standard normal curve

How might we use `stats.norm.cdf` to compute the area between -1 and 0?

In [None]:
normal_area(-1, 0)

Strategy:

$$\text{area from $-1$ to $0$} = (\text{area left of $0$}) - (\text{area left of $-1$})$$

In [None]:
stats.norm.cdf(0) - stats.norm.cdf(-1)

### General strategy for finding area

The area under the standard normal curve in the interval $[a, b]$ is 

```py
stats.norm.cdf(b) - stats.norm.cdf(a)
```

What can we do with this? We're about to see!

### Using the normal distribution

Let's return to our data set of heights and weights.

In [None]:
height_and_weight

As we saw before, both variables are roughly normal. What _benefit_ is there to knowing that the two distributions are roughly normal?

### Standard units and the normal distribution

- **Key idea: The $x$-axis in a plot of the <u>standard</u> normal distribution is in <u>standard</u> units.**
    - For instance, the area between -1 and 1 is the proportion of values within 1 standard deviation of the mean.

- Suppose a distribution is (roughly) normal. Then, these two quantities are approximately equal:
    - The proportion of values in the distribution between $a$ and $b$.
    - The area between $a_{\: \text{(su)}}$ and $b_{\: \text{(su)}}$ under the standard normal curve.
        - Recall, $x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$.
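
The sketch below illustrates why converting to standard units works. The mean, SD, and endpoints are hypothetical, and it uses `NormalDist` from Python's standard library in place of `scipy`. Standardizing the endpoints and using the standard normal curve gives the same area as working with a normal curve that has the original mean and SD.

```python
from statistics import NormalDist

# Hypothetical summary statistics and endpoints.
mean, sd = 100, 15
a, b = 85, 130

a_su = (a - mean) / sd   # -1.0
b_su = (b - mean) / sd   #  2.0

# Area between the standardized endpoints under the standard normal curve.
via_standard_units = NormalDist().cdf(b_su) - NormalDist().cdf(a_su)

# Same area, computed directly under a normal curve with the original mean and SD.
original_scale = NormalDist(mu=mean, sigma=sd)
direct = original_scale.cdf(b) - original_scale.cdf(a)

print(via_standard_units, direct)
```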

### Example: Proportion of weights between 200 and 225 pounds

Let's suppose, as is often the case, that we don't have access to the entire distribution of weights, but just the mean and SD.

In [None]:
weight_mean = weights.mean()
weight_mean

In [None]:
weight_std = np.std(weights)
weight_std

Using just this information, we can estimate the proportion of weights between 200 and 225 pounds:

1. Convert 200 to standard units.
2. Convert 225 to standard units.
3. Use `stats.norm.cdf` to find the area between (1) and (2).

In [None]:
left = (200 - weight_mean) / weight_std
left

In [None]:
right = (225 - weight_mean) / weight_std
right

In [None]:
normal_area(left, right)

In [None]:
approximation = stats.norm.cdf(right) - stats.norm.cdf(left)
approximation

### Checking the approximation

Since we have access to the entire set of weights, we can compute the true proportion of weights between 200 and 225 pounds.

In [None]:
# True proportion of values between 200 and 225.
height_and_weight[
    (height_and_weight.get('Weight') >= 200) &
    (height_and_weight.get('Weight') <= 225)
].shape[0] / height_and_weight.shape[0]

In [None]:
# Approximation using the standard normal curve.
approximation

Pretty good for an approximation! 🤩

### Warning: Standardization doesn't make a distribution normal!

Consider the distribution of delays from earlier in the lecture.

In [None]:
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5), title='Flight Delays')
plt.xlabel('Delay (minutes)');

The distribution above does not look normal, and it won't look normal even if we standardize it. Standardizing only shifts a distribution horizontally and rescales it (the histogram is stretched or squeezed horizontally, and its height adjusts so that the total area remains 1) – the shape itself doesn't change.

In [None]:
HTML('data/delay_anim.html')
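
Here is a numerical version of the same point, sketched on synthetic right-skewed data (an exponential sample, which is an assumption and not the delays data). After standardizing, the mean is 0 and the SD is 1, but the median is still well below the mean, so the distribution is still not symmetric like the normal curve.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed data, used only for illustration.
skewed = rng.exponential(scale=10, size=100_000)

standardized = (skewed - skewed.mean()) / np.std(skewed)

print(standardized.mean())      # approximately 0
print(np.std(standardized))     # approximately 1
print(np.median(standardized))  # noticeably negative; it would be near 0 for a normal distribution
```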

### Chebyshev's inequality and the normal distribution

- Recall that Chebyshev's inequality states that the proportion of values within $z$ SDs of the mean is **at least** $1-\frac{1}{z^2}$.
    - This works for **any** distribution, and is a lower bound.

- If we know that the distribution is normal, we can be even more specific!

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution|
|---|---|---|
| mean $\pm \ 1$ SD | $\geq 0\%$ |$\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

### 68% of values are within 1 SD of the mean

Remember, the values on the $x$-axis for the standard normal curve are in standard units. So, the proportion of values within 1 SD of the mean is the area under the standard normal curve between -1 and 1.

In [None]:
normal_area(-1, 1, bars=True)

In [None]:
stats.norm.cdf(1) - stats.norm.cdf(-1)

This means that if a variable follows a normal distribution, approximately 68% of values will be within 1 SD of the mean.

### 95% of values are within 2 SDs of the mean

In [None]:
normal_area(-2, 2, bars=True)

In [None]:
stats.norm.cdf(2) - stats.norm.cdf(-2)

- If a variable follows a normal distribution, approximately 95% of values will be within 2 SDs of the mean.
- Consequently, approximately 5% of values will be outside this range.
- Since the normal curve is symmetric, 
    - approximately 2.5% of values will be more than 2 SDs above the mean, and
    - approximately 2.5% of values will be more than 2 SDs below the mean.
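
These percentages are rounded. As a sketch, the tail areas can be computed directly; this uses `NormalDist` from the standard library rather than `scipy`, and the variable names are made up.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1

within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)
upper_tail = 1 - standard_normal.cdf(2)   # more than 2 SDs above the mean
lower_tail = standard_normal.cdf(-2)      # more than 2 SDs below the mean

print(within_2, upper_tail, lower_tail)
```

The area within 2 SDs is closer to 95.45%, and each tail is about 2.275%, which is why we say "approximately".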

### Recap: Proportion of values within $z$ SDs of the mean

| Range | All Distributions (via Chebyshev's inequality) | Normal Distribution|
|---|---|---|
| mean $\pm \ 1$ SD | $\geq 0\%$ |$\approx 68\%$ |
| mean $\pm \ 2$ SDs | $\geq 75\%$ | $\approx 95\%$ |
| mean $\pm \ 3$ SDs | $\geq 88.8\%$ | $\approx 99.73\%$ |

The percentages you see for normal distributions above are approximate, but are not lower bounds.

**Important**: They apply to all normal distributions, standardized or not. This is because all normal distributions are just stretched and shifted versions of the standard normal distribution.
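
To illustrate, the sketch below recomputes the table for a normal distribution that is not standardized. The mean of 70 and SD of 3 are hypothetical, and `NormalDist` from the standard library is used in place of `scipy`.

```python
from statistics import NormalDist

# A hypothetical, non-standard normal distribution.
mean, sd = 70, 3
dist = NormalDist(mu=mean, sigma=sd)

rows = {}
for z in [1, 2, 3]:
    chebyshev = 1 - 1 / z ** 2
    normal = dist.cdf(mean + z * sd) - dist.cdf(mean - z * sd)
    rows[z] = (chebyshev, normal)
    print(f"mean ± {z} SDs: Chebyshev >= {chebyshev:.2%}, normal ≈ {normal:.2%}")
```

The normal column matches the table even though this distribution's mean is not 0 and its SD is not 1.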

### Inflection points

- We mentioned that the standard normal curve has inflection points at $z = \pm 1$.
    - An inflection point is where a curve goes from "opening down" 🙁 to "opening up" 🙂, or vice versa.

In [None]:
normal_area(-1, 1)

- We know that the $x$-axis of the standard normal curve represents standard units, so the inflection points are at 1 standard deviation above and below the mean.

- This means that if a distribution is roughly normal, we can determine its standard deviation by finding the distance between each inflection point and the mean.
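
As a rough numerical check, the inflection points of the standard normal curve can be located by looking for where the curve's bend changes direction. This is a sketch under stated assumptions: the grid and the use of second differences (via `np.diff`) as a stand-in for "opening up" versus "opening down" are illustrative choices.

```python
import numpy as np

def normal_curve(z):
    return 1 / np.sqrt(2 * np.pi) * np.exp(-z ** 2 / 2)

z = np.linspace(-4, 4, 8000)

# Second differences are positive where the curve opens up, negative where it opens down.
second_diff = np.diff(normal_curve(z), n=2)

# Positions where the sign of the second difference flips.
sign_change = np.where(np.sign(second_diff[:-1]) != np.sign(second_diff[1:]))[0]
inflection_points = z[1:-1][sign_change]

print(inflection_points)   # two values, close to -1 and 1
```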

### Example: Inflection points

Remember: The distribution of heights is roughly normal, but it is _not_ a _standard_ normal distribution. 

In [None]:
height_and_weight.plot(kind='hist', y='Height', density=True, ec='w', bins=40, alpha=0.8, figsize=(10, 5));
plt.xticks(np.arange(60, 78, 2));

- The center appears to be around 69.
- The inflection points appear to be around 66 and 72.
- So, the standard deviation is roughly 72 - 69 = 3.

In [None]:
np.std(height_and_weight.get('Height'))

### Activity: SAT scores

SAT scores range from 400 to 1600. The distribution of SAT scores has a mean of 950 and a standard deviation of 300. Your friend tells you that their SAT score, in standard units, is 2.5. What do you conclude?

## Summary, next time

### Summary: Spread and Chebyshev's inequality

- Variance and standard deviation (SD) quantify how spread out data points are.
    - Standard deviation is the square root of variance.
    - Roughly speaking, the standard deviation describes how far values in a dataset typically are from the mean.
- Chebyshev's inequality states that, in any numerical distribution, the proportion of values within $z$ SDs of the mean is at least $1 - \frac{1}{z^2}$.
    - The true proportion of values within $z$ SDs of the mean may be larger than $1 - \frac{1}{z^2}$, depending on the distribution, but it cannot be smaller.

### Summary: Standard units and the normal distribution

- To convert a value $x_i$ from a column $x$ to standard units, use $x_{i \: \text{(su)}} = \frac{x_i - \text{mean of $x$}}{\text{SD of $x$}}$.
    - A value in standard units measures the number of SDs the value is above the mean.
- The normal distribution is bell-shaped, and arises often in nature.
- The $x$-axis of the **standard** normal distribution is in **standard** units.
- If we know a distribution is roughly normal, and we know its mean and SD, then we can use the standard normal distribution's curve to approximate the proportion of values within a given range without needing access to all of the data.
    - If a variable is roughly normally distributed, then approximately 68% of its values are within 1 SD of the mean, and approximately 95% of its values are within 2 SDs of the mean.

### Next time

- The Central Limit Theorem.
- Confidence intervals, revisited.