In [None]:
# Imports
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Animations
from IPython.display import display, IFrame

def show_confidence_interval_slides():
    src="https://docs.google.com/presentation/d/e/2PACX-1vTaPZsueXI6fey_5cj2Y1TevkR1joBvpwaWVsZNvgBlnJSrw1EiBLHJywkFH_QNLU5Tdr6JZgDrhFxG/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 989
    display(IFrame(src, width, height))

# Lecture 20 – Confidence Intervals, Center and Spread

## DSC 10, Winter 2022

### Announcements

- Homework 6 is due on **Saturday 2/26 at 11:59pm**.
- Lab 6 is due on **Tuesday 3/1 at 11:59pm**.
- The Final Project is released, and is due on **Wednesday 3/9 at 11:59pm** ‼️
    - Start early and come to office hours.
    - You can work with a partner – use [this sheet](https://docs.google.com/spreadsheets/d/1m5eDcFdYTQq5bu9VRYINZBFgckCyJEOXZFZGZ9bQqKY/edit#gid=00) to find one, and follow the [pair programming guidelines](https://dsc10.com/pair-programming) if you do work with one.
- Consider taking Suraj's [History of Data Science](https://dsc-courses.github.io/dsc90-2022-wi/) seminar next quarter!

### Agenda

- Confidence intervals for hypothesis testing.
- Mean and median.
- Standard deviation.

## Confidence intervals and hypothesis testing

In [None]:
show_confidence_interval_slides()

### Using a confidence interval for hypothesis testing

It turns out that we can use bootstrapped confidence intervals for hypothesis testing!

- **Null hypothesis**: The population parameter is equal to some value, $x$.
- **Alternative hypothesis**: The population parameter is not equal to $x$.
- Cutoff for p-value: p%.
- Strategy:
    - Construct a (100-p)% confidence interval for the population parameter.
    - If $x$ is not in the interval, reject the null hypothesis.
    - If $x$ is in the interval, our results are consistent with the null (fail to reject it).
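
The strategy above can be sketched as a single function. This is only a minimal sketch under stated assumptions, not the course's implementation: the name `bootstrap_ci_test`, its parameters, and the made-up sample are all hypothetical, and it uses plain `numpy` instead of `babypandas`.

```python
import numpy as np

def bootstrap_ci_test(sample, null_value, statistic=np.median,
                      p_cutoff_percent=1, n_resamples=1000, seed=0):
    """Hypothetical helper: test 'parameter == null_value' with a bootstrapped CI."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)

    # Bootstrap: resample with replacement, same size as the original sample
    boot_stats = np.array([
        statistic(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_resamples)
    ])

    # Endpoints of the middle (100 - p)% of the bootstrapped statistics
    left = np.percentile(boot_stats, p_cutoff_percent / 2)
    right = np.percentile(boot_stats, 100 - p_cutoff_percent / 2)

    # Reject the null only if the hypothesized value falls outside the interval
    reject_null = not (left <= null_value <= right)
    return left, right, reject_null

# Made-up sample, for illustration only
rng = np.random.default_rng(42)
made_up_sample = rng.normal(loc=100, scale=10, size=300)

left, right, reject_null = bootstrap_ci_test(made_up_sample, null_value=150)
print(left, right, reject_null)
```

Here the hypothesized value of 150 is far from the made-up sample's median of roughly 100, so it falls outside the 99% interval and the null is rejected.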

### Example: Fire-Rescue Department 🚒

In [None]:
population = bpd.read_csv('data/2020_salaries.csv')
fire_rescue_population = population[population.get('DepartmentOrSubdivision') == 'Fire-Rescue']
fire_rescue_population

### Just a sample
- Suppose we only have access to a sample of 300 Fire-Rescue Department workers.
- We want to understand the median salary of *all* Fire-Rescue Department workers.
- **Null hypothesis**: The median salary of Fire-Rescue Department workers is the same as the median of all San Diego city employee salaries, \$71,354.50.
- **Alternative hypothesis**: The median salary of Fire-Rescue Department workers is not the same as the median of all San Diego city employee salaries.
- We'll use a significance level (i.e. p-value cutoff) of 0.01, to be *really* sure.

In [None]:
# The median salary of all City of SD employees, in all departments
population_median = population.get('TotalWages').median()
population_median

In [None]:
fire_rescue_sample = fire_rescue_population.sample(300, replace=False)
fire_rescue_sample

### Testing the hypotheses
- Since we're using a significance level of 0.01 = 1%, we need a 99% confidence interval for the median salary of Fire-Rescue Department workers.
- To construct a confidence interval, we'll bootstrap to compute many sample medians, and we'll find the middle 99% of the distribution of bootstrapped medians using `np.percentile`.

In [None]:
n_resamples = 500
fire_rescue_medians = np.array([])
for i in range(n_resamples):
    # Resample from fire_rescue_sample
    resample = fire_rescue_sample.sample(300, replace=True)
    
    # Compute the median
    median = resample.get('TotalWages').median()
    
    # Add it to our array of bootstrapped medians
    fire_rescue_medians = np.append(fire_rescue_medians, median)

In [None]:
fire_rescue_medians

### Finding the interval

In [None]:
fire_left = np.percentile(fire_rescue_medians, 0.5)
fire_left

In [None]:
fire_right = np.percentile(fire_rescue_medians, 99.5)
fire_right

In [None]:
# Resulting interval
[fire_left, fire_right]
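
The percentiles 0.5 and 99.5 come from splitting the leftover 1% evenly between the two tails. A small sketch of that arithmetic follows; the helper name `ci_percentiles` is hypothetical and is not part of `numpy` or `babypandas`.

```python
def ci_percentiles(confidence_level):
    """Hypothetical helper: percentiles that bound the middle `confidence_level`%."""
    tail = (100 - confidence_level) / 2  # percent left out on each side
    return tail, 100 - tail

print(ci_percentiles(99))  # (0.5, 99.5)
print(ci_percentiles(95))  # (2.5, 97.5)
```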

Is \$71,354.50 in this interval? ❌

### Conclusion of the hypothesis test
- Since our 99% confidence interval did not contain the median of all San Diego city employee salaries, \$71,354.50, we reject the null.
- The evidence suggests that the median salary of Fire-Rescue Department workers is not the same as the median salary of all San Diego city employees.
- The test doesn't tell us *why* the medians differ.

In [None]:
bpd.DataFrame().assign(FireRescueBootstrapMedians=fire_rescue_medians).plot(kind='hist', density=True, bins=np.arange(75000, 125000, 1000), ec='w', figsize=(10, 5))
plt.plot([fire_left, fire_right], [0, 0], color='lime', linewidth=12, label='99% confidence interval');
plt.legend();

In [None]:
# Actual population median of Fire-Rescue Department salaries
fire_rescue_population.get('TotalWages').median()

## Mean and median

<center><h3>What is the "center" of a distribution?</h3></center>

### The mean (i.e. average)

The mean of the data

$$2, 3, 3, 9$$

is

$$\text{mean} = \frac{2 + 3 + 3 + 9}{4} = 4.25$$

### The mean (i.e. average)

The mean of a set of data points can be thought of as a "one-number summary". It... 

- doesn't have to be equal to one of the data points.
- doesn't have to be an integer, even if all of the data points are integers.
- will be somewhere between the min and max, but not necessarily halfway in between.
- has the same units as the data.
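
These properties can be checked on the data from the previous slide. This is just an illustrative sketch using `numpy`.

```python
import numpy as np

data = np.array([2, 3, 3, 9])
mean = np.mean(data)

print(mean)                              # 4.25
print(mean in data)                      # False: not one of the data points
print(float(mean).is_integer())          # False: not an integer
print(data.min() <= mean <= data.max())  # True: between the min and max
print((data.min() + data.max()) / 2)     # 5.5: the halfway point, not the mean
```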

### The median

- Like the mean, the median is a "one-number summary" of a set of data points.
- The median is a number in the "middle" of the data.
- To calculate it, **sort the data points and pick the number in the middle**.
    - If there are two middle numbers, we usually pick the number halfway between (i.e. the mean of the middle two).

- **Example:** $\text{Median}(1, 4, 7, 12, 32) = 7$

- **Example:** $\text{Median}(1, 4, 7, 12) = 5.5$
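
The "sort and pick the middle" recipe can be sketched in plain Python. The function name `median_by_sorting` is hypothetical; in practice `np.median` does this for us.

```python
def median_by_sorting(values):
    """Hypothetical helper: median computed by sorting and taking the middle."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of points: a single middle value
        return ordered[middle]
    # Even number of points: mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_by_sorting([1, 4, 7, 12, 32]))  # 7
print(median_by_sorting([12, 1, 7, 4]))      # 5.5 (input order doesn't matter)
```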

### Example

Create a data set that has this histogram. (You can do it with a short list of whole numbers.)

<br>
<center><img src='data/df_A.png' width=450></center>
<br>

What are its mean and median?

In [None]:
numbers = ...
bpd.DataFrame().assign(data=numbers).plot(kind='hist', bins=np.arange(0.5, 12.5), density=True, ec='w');

In [None]:
np.mean(numbers)

In [None]:
np.median(numbers)

### Discussion Question

<br>
<center>
    <img src='data/df_A.png' width=300>
    <img src='data/df_B.png' width=300>
</center>
<br>

Are the means of these two distributions the same or different? What about the medians?

- A. Both are the same
- B. Means are different, medians are same
- C. Means are same, medians are different
- D. Both are different

### To answer, go to [menti.com](https://menti.com) and enter the code 6968 5594.

### Answer

In [None]:
df_A = bpd.DataFrame().assign(data=...)
df_B = bpd.DataFrame().assign(data=...)

In [None]:
print(
    'median #1:\t%f' % df_A.get('data').median(),
    'median #2:\t%f' % df_B.get('data').median(),
    'mean #1:\t%f' % df_A.get('data').mean(),
    'mean #2:\t%f' % df_B.get('data').mean(),
    sep='\n'
)

### Example: flight delays

In [None]:
delays = bpd.read_csv('data/delays.csv')
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', figsize=(10, 5))
plt.title('Flight Delays')
plt.xlabel('Delay (minutes)');

**Question:** Which is larger – the mean or the median?

In [None]:
delays.get('Delay').mean()

In [None]:
delays.get('Delay').median()

In [None]:
delays.plot(kind='hist', y='Delay', bins=np.arange(-20.5, 210, 5), density=True, ec='w', alpha=0.65, figsize=(10, 5))
plt.plot([delays.get('Delay').mean(), delays.get('Delay').mean()], [0, 1], color='green', label='Mean')
plt.scatter([delays.get('Delay').mean()], [-0.0017], color='green', marker='^', s=250)
plt.plot([delays.get('Delay').median(), delays.get('Delay').median()], [0, 1], color='purple', label='Median')
plt.title('Flight Delays')
plt.xlabel('Delay (minutes)')
plt.ylim(-0.005, 0.065)
plt.legend();

### Comparing the mean and median
- **Mean:** Balance point of the histogram.
    - Numerically: the sum of the differences between all data points and the mean is 0.
    - Physically: Think of a see-saw.
- **Median:** Half-way point of the data.
    - Half of the area of a histogram is to the left of the median, and half is to the right.
- If the distribution is symmetric about a value, then that value is both the mean and the median.
- If the distribution is **skewed**, then the mean is pulled away from the median in the direction of the tail.

- **Key property:** The median is more **robust** (less **sensitive**) to **outliers**.
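
A quick sketch of these facts on two tiny made-up data sets (the numbers are chosen only for illustration):

```python
import numpy as np

symmetric = np.array([1, 2, 3, 4, 5])
right_skewed = np.array([1, 2, 3, 4, 100])  # one large value in the right tail

print(np.mean(symmetric), np.median(symmetric))        # 3.0 3.0
print(np.mean(right_skewed), np.median(right_skewed))  # 22.0 3.0

# The deviations from the mean sum to 0 (the "balance point" property)
print(np.sum(right_skewed - np.mean(right_skewed)))    # 0.0
```

Replacing 5 with 100 pulls the mean toward the right tail, while the median stays at 3.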

### Example: net worths

Suppose we have the net worth of all UCSD students.

In [None]:
worths = np.random.lognormal(5.5, 1.25, 20_000)
plt.hist(worths, bins=np.arange(0, 2_000, 100), density=True, ec='w');

In [None]:
np.mean(worths)

In [None]:
np.median(worths)

Suppose that LeBron James enrolls at UCSD as a Data Science major. Per [Forbes](https://bleacherreport.com/articles/10014995-forbes-lebron-james-net-worth-close-to-1b-to-break-nba-earnings-record-this-year), his net worth is close to one billion dollars.

In [None]:
new_worths = np.append(worths, 10 ** 9)

In [None]:
np.mean(new_worths)

In [None]:
np.median(new_worths)

The mean net worth is now over 50 thousand dollars, while the median is essentially unchanged!
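
The size of the jump in the mean is just arithmetic: one value of $10^9$ shared among 20,001 people adds roughly $10^9 / 20{,}001 \approx 49{,}997.50$ dollars to the mean. The sketch below checks this; it assumes a seeded `np.random.default_rng` generator (not used in the cells above) so that the numbers are reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded so the sketch is reproducible
worths = rng.lognormal(5.5, 1.25, 20_000)
new_worths = np.append(worths, 10 ** 9)

# One value of 10^9 spread over 20,001 people adds about $50,000 to the mean
print(10 ** 9 / len(new_worths))

print(np.mean(new_worths) - np.mean(worths))      # close to the number above
print(np.median(new_worths) - np.median(worths))  # tiny by comparison
```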

## Standard deviation

### Question: How "wide" is a distribution? 

- **One idea:** “biggest value - smallest value” (known as the **range**).
    - Issue: this doesn’t tell us much about the shape of the distribution.

- **Another idea:** "standard deviation".
    - The mean is at the center. 
    - The standard deviation quantifies how far the data points typically are from the center.
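
To see why the range alone can be misleading, here is a small sketch with two made-up data sets that share the same range but are spread out very differently. It uses `np.std`, which computes the standard deviation defined later in this lecture.

```python
import numpy as np

clustered = np.array([0, 5, 5, 5, 5, 10])    # most values near the center
polarized = np.array([0, 0, 0, 10, 10, 10])  # all values at the extremes

# Same range...
print(clustered.max() - clustered.min())  # 10
print(polarized.max() - polarized.min())  # 10

# ...but different standard deviations
print(np.std(clustered))
print(np.std(polarized))  # 5.0
```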

### Deviations from the mean

In [None]:
data = np.array([2, 3, 3, 9])

In [None]:
np.mean(data)

In [None]:
deviations = data - np.mean(data)
deviations

### What is the average deviation?

In [None]:
deviations

In [None]:
np.mean(deviations)

- As was mentioned before, this is true of any dataset – the average deviation from the mean is 0.
- This implies that the average deviation itself is not useful in measuring the spread of data.
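
To see why, write the data as $x_1, x_2, \dots, x_n$ and their mean as $\bar{x}$ (notation introduced here only for this derivation):

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i - \frac{1}{n} \cdot n\bar{x} = \bar{x} - \bar{x} = 0$$

The positive and negative deviations always cancel out.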

### Average **squared** deviation

In [None]:
# Square all the deviations
deviations ** 2

In [None]:
variance = np.mean(deviations ** 2)
variance

This quantity, the average squared deviation from the mean, is called the **variance**.

### Standard deviation

- Our data usually has units, e.g. dollars.
- The variance is in "squared" units, e.g. $\text{dollars}^2$.
- To account for this, we can take the square root of the variance, and the result is called the **standard deviation**.

In [None]:
# Standard Deviation (SD) is the square root of the variance
sd = variance ** 0.5
sd

### Standard deviation

- The standard deviation (SD) measures roughly how far the data are from their average.
    - Larger SD = more spread out.
- Standard deviation has the same units as the original data.
- `numpy` has a function, `np.std`, that calculates the standard deviation for us.

In [None]:
np.std(data)

### Variance and standard deviation

To summarize:

$$\text{variance} = \text{average squared deviation from the mean} = \frac{(\text{value}_1 - \text{average})^2 + (\text{value}_2 - \text{average})^2 + ... + (\text{value}_n - \text{average})^2}{n}$$

$$\text{standard deviation} = \sqrt{\text{variance}}$$

where $n$ is the number of observations.
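
As a sanity check, the formulas above can be computed term by term and compared with `numpy`'s built-in functions. By default, `np.var` and `np.std` divide by $n$, which matches the definition used here.

```python
import numpy as np

data = np.array([2, 3, 3, 9])
n = len(data)
average = np.sum(data) / n

# Variance: average squared deviation from the mean
variance = np.sum((data - average) ** 2) / n
# Standard deviation: square root of the variance
sd = np.sqrt(variance)

print(variance)                            # 7.6875
print(np.isclose(variance, np.var(data)))  # True
print(np.isclose(sd, np.std(data)))        # True
```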

### What can we do with the standard deviation?

It turns out, no matter what the shape of the distribution is, the bulk of the data are in the range “average ± a few SDs”.

### Chebyshev’s inequality

**Fact:** No matter what the shape of the distribution is, the proportion of values in the range “average ± z SDs” is at least 

$$1 - \frac{1}{z^2}$$

### Chebyshev's inequality

|Range|Proportion|
|---|---|
|average ± 2 SDs|at least $1 - \frac{1}{4}$ (75%)|
|average ± 3 SDs|at least $1 - \frac{1}{9}$ (about 88.9%)|
|average ± 4 SDs|at least $1 - \frac{1}{16}$ (93.75%)|
|average ± 5 SDs|at least $1 - \frac{1}{25}$ (96%)|

No matter what the distribution is! 🤯
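
One way to build intuition is to check the bound empirically on a heavily skewed data set. The sketch below uses made-up exponential data and a hypothetical helper, `proportion_within`; neither comes from the lecture's data files.

```python
import numpy as np

def proportion_within(data, z):
    """Hypothetical helper: fraction of values within z SDs of the mean."""
    mean, sd = np.mean(data), np.std(data)
    return np.mean(np.abs(data - mean) <= z * sd)

# Made-up, heavily right-skewed data
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10, size=10_000)

for z in [2, 3, 4, 5]:
    bound = 1 - 1 / z ** 2
    print(z, proportion_within(skewed, z), '>=', bound)
```

The observed proportions are typically well above the bound: Chebyshev's inequality is a guarantee, not an estimate.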

### Example: NBA data

In [None]:
nba = bpd.read_csv('data/nba.csv')
nba.plot(kind='hist', y='Height', density=True, bins=np.arange(69.5, 90, 1), ec='w', figsize=(10, 5), alpha=0.65);

In [None]:
mean = nba.get('Height').mean()
sd = np.std(nba.get('Height'))
print('mean:\t\t\t%f' % mean, 
      '\nstandard deviation:\t%f' % sd)

### Mean and standard deviation

For this data, the mean is about 79 and the SD is about 3.5. By Chebyshev's inequality...

- **At least** 75% of the data is in $$[\text{Mean} - 2 \cdot \text{SD}, \text{Mean} + 2 \cdot \text{SD}] = [79 - 2\cdot 3.5,\, 79 + 2\cdot 3.5] = [72, 86]$$
- **At least** 88% of the data is in $$[\text{Mean} - 3 \cdot \text{SD}, \text{Mean} + 3 \cdot \text{SD}] = [79 - 3\cdot 3.5,\, 79 +3\cdot 3.5] = [68.5, 89.5]$$
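
The same intervals can be reproduced with a few lines of arithmetic. This sketch uses the rounded mean and SD quoted above rather than the exact values computed from `nba`.

```python
mean, sd = 79, 3.5  # rounded values quoted above

for z in [2, 3]:
    low, high = mean - z * sd, mean + z * sd
    at_least = 1 - 1 / z ** 2
    print(f'z = {z}: at least {at_least:.1%} of the data is in [{low}, {high}]')
```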

In [None]:
nba.plot(kind='hist', y='Height', density=True, bins=np.arange(69.5, 90, 1), ec='w', figsize=(10, 5), alpha=0.25);
plt.plot([mean-2*sd, mean-2*sd], [0, 0.13], color='maroon', label="± 2SD")
plt.plot([mean+2*sd, mean+2*sd], [0, 0.13], color='maroon')
plt.plot([mean-3*sd, mean-3*sd], [0, 0.13], color='blue', label="± 3SD")
plt.plot([mean+3*sd, mean+3*sd], [0, 0.13], color='blue')
plt.plot([mean, mean], [0, 1], color='green', label='Mean')
plt.scatter([mean], [-0.003], color='green', marker='^', s=250)
plt.ylim(-0.005, 0.13)
plt.legend();

### Chebyshev's inequality

- Chebyshev's inequality works for **any** distribution.
- But if we know the type of the distribution (e.g., normal), we can say more!
    - More on this next time.

### Discussion Question

For a particular set of data points, Chebyshev's inequality states that at least $\frac{3}{4}$ of the data points are between 30 and 90. What is the standard deviation of the data?

- A. 10
- B. 15
- C. 30
- D. 60
- E. Impossible to tell


### To answer, go to [menti.com](https://menti.com) and enter the code 6968 5594.

## Summary

### Summary: confidence intervals and hypothesis testing

- **Null hypothesis**: The population parameter is equal to some value, $x$.
- **Alternative hypothesis**: The population parameter is not equal to $x$.
- Cutoff for p-value: p%.
- Strategy:
    - Construct a (100-p)% confidence interval for the population parameter.
    - If $x$ is not in the interval, reject the null hypothesis.
    - If $x$ is in the interval, our results are consistent with the null (can't reject it).

### Summary: center and spread

- The mean and median are both ways to quantify the "center" of a set of data points.
    - The mean is less robust to outliers than the median.
- Standard deviation is a way of quantifying how spread out data points are.
    - Roughly speaking, it measures how far data points are from their mean, on average.
    - $\text{standard deviation} = \sqrt{\frac{(\text{value}_1 - \text{average})^2 + (\text{value}_2 - \text{average})^2 + ... + (\text{value}_n - \text{average})^2}{n}}$, where $n$ is the number of observations.
- Chebyshev's inequality states that regardless of the distribution, the proportion of values in the range “average ± z SDs” is at least 

$$1 - \frac{1}{z^2}$$

- **Next time:** the "normal" distribution, and how it relates to some of the distributions we've already seen.