In [None]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
%reload_ext pandas_tutor
%set_pandas_tutor_options {'projectorMode': True}
set_matplotlib_formats("svg")
plt.style.use('fivethirtyeight')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import display, IFrame, HTML, YouTubeVideo
def show_permutation_testing_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vSovXDonR6EmjrT45h4pY1mwmcKFMWVSdgpbKHC5HNTm9sbG7dojvvCDEQCjuk2dk1oA4gmwMogr8ZL/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 569
    display(IFrame(src, width, height))
    
def show_bootstrapping_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vS_iYHJYXSVMMZ-YQVFwMEFR6EFN3FDSAvaMyUm-YJfLQgRMTHm3vI-wWJJ5999eFJq70nWp2hyItZg/embed?start=false&loop=false&delayms=3000"
    width = 960
    height = 509
    display(IFrame(src, width, height))

# Lecture 18 – More Permutation Testing, Causality, and Bootstrapping

## DSC 10, Summer 2022

### Announcements

- Homework 5 is due **Sat at 11:59pm**.
- Lab 6 is due **Tue at 11:59pm**.

### Agenda

- Permutation test examples:
    - Baby weights
    - Deflategate
- Using permutation tests to show causality.
- Bootstrapping.

### A note about pacing

- These slides have extra content to be flexible
- Understanding material > going fast
- Permutation testing is important, so we'll go through several more examples today
- You'll have to write code like the You Try activities many times in your assignments.

## Permutation Testing: Baby Weights



### Permutation testing (a.k.a. A/B testing)
- Given two samples, are they drawn from the same population?
- We can use permutation tests to answer questions like:
    - "Do smoking moms and nonsmoking moms have babies that weigh the same?"
    - "Were COVID-19 rates the same in Republican states and Democratic states?"
    - **More generally:** are *these things* like *those things*?

In [None]:
baby = bpd.read_csv('data/baby.csv')
smoking_and_birthweight = baby.get(['Maternal Smoker', 'Birth Weight'])
smoking_and_birthweight

### Shuffling the birth weights

In [None]:
original_and_shuffled = smoking_and_birthweight.assign(
    shuffled=np.random.permutation(smoking_and_birthweight.get('Birth Weight'))
)
original_and_shuffled

In [None]:
# Our test statistic
def difference_in_mean_weights(weights_df):
    group_means = weights_df.groupby('Maternal Smoker').mean().get('shuffled')
    return group_means.loc[False] - group_means.loc[True]

difference_in_mean_weights(original_and_shuffled)

### You Try: Is there a difference between smoking and non-smoking birth weights?

Let's finish the activity from the end of the last lecture:

1. Simulate drawing a sample by shuffling the birth weights. Compute the test statistic with `difference_in_mean_weights` (mean weight for non-smokers minus mean weight for smokers).
1. Repeat Step 1 500 times. Store the results in an array called `differences`.
1. Plot the simulated differences in a histogram. Plot the observed test statistic as a vertical line.
1. Compute the p-value, and make a decision.
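
The steps above can be sketched as follows. This is only an outline under stated assumptions: the weights are made-up numbers generated with NumPy rather than the values in `baby.csv`, the helper `diff_in_means` is a hypothetical stand-in for `difference_in_mean_weights`, and the plotting step is left out.

```python
import numpy as np

np.random.seed(0)

# Made-up data for illustration only (not the values in baby.csv)
is_smoker = np.array([True] * 40 + [False] * 60)
weights = np.append(np.random.normal(114, 18, 40), np.random.normal(123, 17, 60))

def diff_in_means(weights, is_smoker):
    # Mean weight for non-smokers minus mean weight for smokers
    return weights[~is_smoker].mean() - weights[is_smoker].mean()

observed = diff_in_means(weights, is_smoker)

# Steps 1 and 2: shuffle the weights 500 times and record the statistic each time
differences = np.array([])
for i in np.arange(500):
    shuffled = np.random.permutation(weights)
    differences = np.append(differences, diff_in_means(shuffled, is_smoker))

# Step 4: proportion of simulated differences at least as large as the observed one
p_value = np.count_nonzero(differences >= observed) / 500
```

Here large positive differences favor the alternative that babies of non-smokers weigh more, which is why the p-value counts simulated values on the `>=` side of `observed`.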

In [None]:
# Copy your code from lec17 here:


### Caution! ⚠️

- We **cannot** conclude that smoking *causes* lower birth weight!
- This was an observational study; there may be confounding factors.
    - Maybe smokers are more likely to drink caffeine, and caffeine causes lower birth weight.
- Still, the result suggests that the relationship may be causal.

In [None]:
show_permutation_testing_slides()

## Another Permutation Testing Example: Deflategate 🏈

### Did the New England Patriots cheat?

<center><img width="40%" src="./data/deflate.jpg"></center>

- On January 18, 2015, the New England Patriots played the Indianapolis Colts for a spot in the Super Bowl.
- The Patriots won, 45-7. They went on to win the Super Bowl.
- After the game, it was alleged that the Patriots intentionally deflated footballs, making them easier to catch.

### Background

- Each team brings 12 footballs to the game. Teams use their own footballs while on offense.
- NFL rules stipulate that **each ball must be inflated to between 12.5 and 13.5 pounds per square inch (psi)**.
- Before the game, officials found that all of the Patriots' footballs were at about 12.5 psi, and that all of the Colts' footballs were at about 13.0 psi.
    - This pre-game data was not written down.
- In the second quarter, the Colts intercepted a Patriots ball and notified officials that it felt under-inflated.
- At halftime, two officials (Clete Blakeman and Dyrol Prioleau) independently measured the pressures of as many of the 24 footballs as they could.
    - They ran out of time before they could finish.

### The measurements

In [None]:
footballs = bpd.read_csv('data/deflategate.csv')
footballs

There are only 15 rows (11 for Patriots footballs, 4 for Colts footballs) since the officials weren't able to record the pressures of every ball.

### Combining the measurements

- Both officials measured each ball.
- Their measurements are slightly different, so we'll average them to get a combined pressure for each ball.

In [None]:
footballs = footballs.assign(
    psi=(footballs.get('Blakeman') + footballs.get('Prioleau')) / 2
).drop(columns=['Blakeman', 'Prioleau'])
footballs

### Differences in average pressure

- At first glance, it looks as though the Patriots' footballs are at a lower pressure.
- We could do a permutation test for the difference in mean pressure, but that wouldn't point towards cheating.
    - The Patriots' balls *started* at a lower psi (which is not an issue on its own).
- The allegations were that the Patriots **deflated** their balls during the game.
    - We want to check to see if the Patriots' footballs lost more pressure than the Colts' footballs from the start of the game to halftime, when these measurements were taken.

In [None]:
# Mean pressure for each team's footballs
footballs.groupby('Team').mean()

### Calculating the pressure drop

- Let's calculate the drop in pressure for each ball in `footballs`.
- The Patriots' footballs started at around 12.5 psi, while the Colts' footballs started at around 13 psi.
- **Strategy**: we'll make an array with starting pressure for each ball, and from that subtract the halftime pressure of each ball.
    - Note that the first 11 rows correspond to Patriots balls and the last 4 rows correspond to Colts balls.
    - Thus, we need an array with 11 `12.5`s followed by 4 `13`s.
    - We can use `np.ones` to help us.
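
One possible way to build the starting-pressure array described above is sketched here. The names `start_psi` and `halftime_psi` are hypothetical, and the halftime values are placeholders chosen for illustration, not the measurements in `deflategate.csv`.

```python
import numpy as np

# 11 Patriots balls at 12.5 psi, followed by 4 Colts balls at 13 psi
patriots_start = np.ones(11) * 12.5
colts_start = np.ones(4) * 13
start_psi = np.append(patriots_start, colts_start)

# Placeholder halftime pressures, for illustration only
halftime_psi = np.append(np.ones(11) * 11.5, np.ones(4) * 12.5)

# Pressure drop = starting pressure minus halftime pressure, ball by ball
example_drop = start_psi - halftime_psi
```

Because the rows of `footballs` are ordered with the Patriots first, an array built this way lines up with the DataFrame's rows.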

In [None]:
footballs

### Calculating the pressure drop

In [None]:
footballs = footballs.assign(
    psi_drop=...
)
footballs

### The question

- Did the Patriots' footballs drop in pressure more than the Colts'?
    - We want to test whether two samples came from the same distribution – this calls for a permutation test.

### You Try: Permutation Test

1. State null and alternative hypotheses.
1. Pick an appropriate test statistic.
1. Simulate drawing a sample by shuffling the pressure drops (`psi_drop`). Then, calculate your test statistic on the sample.
1. Repeat step 3 5,000 times. Store your results in an array called `differences`.
1. Plot `differences` on a histogram, then plot the observed statistic as a vertical line.
1. Calculate the p-value, and draw a conclusion.
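
A minimal sketch of these steps, assuming the null hypothesis that both teams' pressure drops come from the same distribution and the alternative that the Patriots' drops are larger. The pressure drops below are made up for illustration (the real ones come from the `psi_drop` column), and `diff_in_mean_drop` is a hypothetical helper name.

```python
import numpy as np

np.random.seed(1)

# Made-up pressure drops for illustration; row order matches footballs (Patriots first)
teams = np.array(['Patriots'] * 11 + ['Colts'] * 4)
psi_drop = np.append(np.random.normal(1.2, 0.4, 11), np.random.normal(0.5, 0.2, 4))

def diff_in_mean_drop(drops, teams):
    # Test statistic: mean Patriots drop minus mean Colts drop
    return drops[teams == 'Patriots'].mean() - drops[teams == 'Colts'].mean()

observed = diff_in_mean_drop(psi_drop, teams)

# Shuffle the drops among the 15 balls and recompute the statistic 5,000 times
n_repetitions = 5000
differences = np.array([])
for i in np.arange(n_repetitions):
    shuffled = np.random.permutation(psi_drop)
    differences = np.append(differences, diff_in_mean_drop(shuffled, teams))

# Larger differences favor the alternative, so count values >= observed
p_value = np.count_nonzero(differences >= observed) / n_repetitions
```

Shuffling the drops while keeping the team labels fixed is what simulates the null hypothesis: if team doesn't matter, any assignment of drops to balls is equally likely.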

### Caution! ⚠️

- We conclude that it is unlikely that the difference in mean pressure drop is due to chance alone.
- But this doesn't establish *causation*.
- That is, we can't conclude that the Patriots **intentionally** deflated their footballs.
- This was an *observational* study; to establish causation, we'd need an RCT (Randomized Controlled Trial).

### Aftermath

- Quote from an investigative report commissioned by the NFL:

> “[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls.”

- Many different methods, including physics, were used to determine whether the drop in pressure was due to chance.
    - We computed an observed difference of 0.7335, which is in line with the findings of the report.
- In the end, Tom Brady (quarterback for the Patriots at the time) was suspended for 4 games and the team was fined $1 million.
- The [Deflategate Wikipedia article](https://en.wikipedia.org/wiki/Deflategate) is extremely thorough; give it a read if you're curious!

## Causality example: chronic back pain

### Causality and permutation tests

- Permutation tests can be used to establish **causality** in a randomized controlled trial!
- If the only difference between two groups is that one was given the treatment, and there is a statistically significant difference between the two groups, then we can conclude the treatment had some effect.

### Using Botulinum toxin A (Botox) to treat lower back pain


> [Botulinum neurotoxins (BoNTs) are the most potent toxins known.](https://febs.onlinelibrary.wiley.com/doi/10.1002/1873-3468.13446)

- Botox is commonly used for treating muscle disorders, migraines, and for cosmetic purposes.
- A randomized controlled trial examined the use of Botox in the treatment of lower back pain.
    - 31 patients with pain were randomly assigned to control and treatment groups.
    - The control group received a placebo (saline injection).
        - Placebos are used when we don't want individuals to know which group they are in.
    - The treatment group received Botox.
    - After eight weeks, the number of people who experienced relief in each group was counted.

### The data

- 1 means "experienced relief".
- 0 means "no relief".

In [None]:
back = bpd.read_csv('data/bta.csv')
back

In [None]:
back.groupby('Group').count()

### The results

In [None]:
# This evaluates to the proportion experiencing relief in each group
back.groupby('Group').mean()

- 60% of the treatment group experienced relief, compared to 12.5% of the control group.
- But what if the people in the treatment group would have gotten better without the treatment, by chance?
    - If this were the case, then the treatment would look like it had an impact even if it didn't.
    - To account for this possibility, we should conduct a hypothesis test.

### A permutation test

- Here, we have two numerical samples – the results for the control group, and the results for the treatment group.
- **Null hypothesis**: Results for both groups come from the same distribution. 
    - In other words, Botox does not do anything different than saline, and the results we saw are due to chance. 
- **Alternative hypothesis**: More people in the treatment group experience relief.
    - In other words, Botox helped with relief more than saline.
- **Test statistic**: difference in proportion experiencing relief.
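
The test described above might look like the sketch below. The group sizes are an inference rather than something read from `bta.csv`: with 31 patients, 60% relief in one group and 12.5% in the other, the only whole-number split that fits is 9 of 15 in treatment and 2 of 16 in control. The helper `diff_in_proportions` and the choice of 5,000 repetitions are also assumptions.

```python
import numpy as np

np.random.seed(2)

# Reconstructed from the summary figures: 9 of 15 treated and 2 of 16 controls
# experienced relief (1 = relief, 0 = no relief). Row order is arbitrary.
group = np.array(['Treatment'] * 15 + ['Control'] * 16)
result = np.array([1] * 9 + [0] * 6 + [1] * 2 + [0] * 14)

def diff_in_proportions(result, group):
    # Proportion with relief in treatment minus proportion with relief in control
    return result[group == 'Treatment'].mean() - result[group == 'Control'].mean()

observed = diff_in_proportions(result, group)

# Shuffle the results among the 31 patients and recompute the statistic
n_repetitions = 5000
differences = np.array([])
for i in np.arange(n_repetitions):
    shuffled = np.random.permutation(result)
    differences = np.append(differences, diff_in_proportions(shuffled, group))

# Proportion of shuffles with a difference at least as large as the observed one
p_value = np.count_nonzero(differences >= observed) / n_repetitions
```

Since the results are 0s and 1s, the mean of each group is exactly the proportion experiencing relief, so the same shuffling approach used for means applies here.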

### Conclusion

- We reject the null hypothesis with a high degree of confidence.
- This is evidence that the treatment **caused** improvement.
    - **Only because** this was a **randomized controlled trial**.
    - In earlier examples (e.g. birth weights of babies from smoking moms and nonsmoking moms), we could not establish causality because there could have been other differences between the two groups.
- Read more about this example in [CIT 12.2](https://inferentialthinking.com/chapters/12/2/Causality.html?highlight=randomized%20control#potential-outcomes).

## Bootstrapping 🥾

### City of San Diego employee salary data

All City of San Diego employee salary data [is public](https://publicpay.ca.gov/Reports/Cities/City.aspx?entityid=405&year=2020&rpt=1). We are using the latest available data.

In [None]:
population = bpd.read_csv('data/2020_salaries.csv')
population

When you load in a dataset that has so many columns that you can't see them all, it's a good idea to look at the column names.

In [None]:
population.columns

### We only need the total wages...

In [None]:
population = population.get(['TotalWages'])
population

In [None]:
population.plot(kind='hist', bins=np.arange(0, 325000, 10000), density=True, ec='w', figsize=(10, 5));

### The median salary

- We can use `.median()` to find the median salary of all city employees.
- This is **not** a random quantity.

In [None]:
population_median = population.get('TotalWages').median()
population_median

### Let's be realistic...

- In practice, it is costly and time-consuming to survey **all** 12,000+ employees.
    - More generally, we can't expect to survey all members of the population we care about.
- Instead, we gather salaries for a random sample of, say, 500 people.
- Hopefully, the median of the sample is close to the median of the population.

### In the language of statistics

- The full DataFrame of salaries is the **population**.
- We observe a **sample** of 500 salaries from the population.
- We want to determine the **population median (a parameter)**, but we don't have the whole population, so instead we use the **sample median (a statistic) as an estimate**.
- Hopefully the sample median is close to the population median.

### The sample median

- Let's survey 500 employees at random.
- We can use `.sample()`:

In [None]:
np.random.seed(23) # Magic to ensure that we get the same results every time this code is run

# Take a sample of size 500
my_sample = population.sample(500)
my_sample

We won't reassign `my_sample` at any point in this notebook, so it will always refer to this particular sample.

In [None]:
# Compute the sample median
sample_median = my_sample.get('TotalWages').median()
sample_median

### How confident are we that this is a good estimate?

- Our estimate depended on a random sample.
- If our sample had been different, our estimate might have been different, too.
- **How different could our estimate have been?**
- Our confidence in the estimate depends on the answer to this question.

### The sample median is random

- The sample median is a random number.
- It comes from some distribution, which we don't know.
- How different could our estimate have been, if we drew a different sample?
    - "Narrow" distribution $\Rightarrow$ not too different.
    - "Wide" distribution $\Rightarrow$ quite different.
- **What is the distribution of the sample median?**

### An impractical approach

- One idea: repeatedly collect random samples of 500 **from the population** and compute each sample's median.
    - This is what we did in Lecture 14 to compute an empirical distribution of the sample mean of flight delays.
- We can plot the empirical distribution of the sample median with a histogram.
- This is an approximation of the true distribution of the sample median, using 1000 samples.

In [None]:
sample_medians = np.array([])
for i in np.arange(1_000):
    median = population.sample(500).get('TotalWages').median()
    sample_medians = np.append(sample_medians, median)
sample_medians

In [None]:
(bpd.DataFrame()
 .assign(SampleMedians=sample_medians)
 .plot(kind='hist', density=True,
       bins=30, ec='w', figsize=(8, 5))
);

### The problem

- Drawing new samples like this is impractical.
    - If we were able to do this, why not just collect more data in the first place?
- Often, we can't ask for new samples from the population.
- **Key insight:** our original sample, `my_sample`, looks a lot like the population.
    - Their distributions are similar.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
bins=np.arange(10_000, 300_000, 10_000)
population.plot(kind='hist', y='TotalWages', ax=ax, density=True, alpha=.75, bins=bins, ec='w')
my_sample.plot(kind='hist', y='TotalWages', ax=ax, density=True, alpha=.75, bins=bins, ec='w')
plt.legend(['Population', 'My Sample']);

Note that unlike the previous histogram we saw, this is depicting the distribution of the population and of one particular sample (`my_sample`), **not** the distribution of sample medians for 1,000 samples.

### The bootstrap

- **Big idea:** Use the sample to simulate more samples.
    - The sample itself looks like the population.
    - So, resampling from the sample is like sampling from the population.
    - The act of resampling from a sample is called **bootstrapping** or "**the bootstrap**" method.

- In our case specifically:
    - We have a sample of 500 salaries.
    - We want another sample of 500 salaries, but we can't draw from the population.
    - However, the original sample looks like the population.
    - So, let's just **resample from the sample!**

In [None]:
show_bootstrapping_slides()

### Resampling with replacement

When bootstrapping, we resample **with** replacement. Why? 🤔

### Resampling with replacement

- Our goal when bootstrapping is to create a sample of the same size as our original sample.
- If we were to resample without replacement $n$ times from an original sample of size $n$, our resample would look exactly the same as the original sample.
    - For instance, if we sample 5 elements without replacement from `['A', 'B', 'C', 'D', 'E']`, our sample will contain the same 5 characters, just in a different order.
- So, we need to sample **with replacement** to ensure that our resamples can be different from the original sample.
- Why does this work? If the population is large, sampling from it with replacement is approximately the same as sampling from it without replacement.
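
A small illustration of the difference, using `np.random.choice` on the five letters from the example above. The numeric array `values` is made up to show what happens to the median.

```python
import numpy as np

np.random.seed(3)

letters = np.array(['A', 'B', 'C', 'D', 'E'])

# Without replacement: the same 5 letters, only in a different order
without_replacement = np.random.choice(letters, 5, replace=False)
same_letters = sorted(without_replacement) == sorted(letters)

# With replacement: letters can repeat, so resamples can differ from the original
with_replacement = np.random.choice(letters, 5, replace=True)

# The same holds for numbers, so a statistic like the median would never change
values = np.array([3, 1, 4, 1, 5, 9, 2, 6])
reordered = np.random.choice(values, len(values), replace=False)
median_unchanged = np.median(reordered) == np.median(values)
```

Both `same_letters` and `median_unchanged` are always true, no matter the seed, which is why resampling without replacement tells us nothing about how the median varies.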

### Running the bootstrap

- We can simulate the act of collecting new samples by **sampling with replacement from our original sample, `my_sample`**.
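
A rough sketch of the idea, using plain NumPy on made-up salaries instead of `my_sample`. The array `sample_wages` and the choice of 1,000 resamples are assumptions for illustration. In the notebook, the analogous resampling step would presumably be `my_sample.sample(500, replace=True)`.

```python
import numpy as np

np.random.seed(4)

# Stand-in for the TotalWages column of my_sample: 500 made-up salaries
sample_wages = np.random.lognormal(mean=11, sigma=0.5, size=500)

# Resample 500 salaries with replacement from the sample, many times over,
# recording the median of each resample
n_resamples = 1000
boot_medians = np.array([])
for i in np.arange(n_resamples):
    resample = np.random.choice(sample_wages, 500, replace=True)
    boot_medians = np.append(boot_medians, np.median(resample))
```

Each resample has the same size as the original sample, and only the sample is ever touched, never the population.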

In [None]:
# Note that the population DataFrame doesn't appear anywhere here!

In [None]:
boot_medians

### Bootstrap distribution of the sample median

In [None]:
bpd.DataFrame().assign(BootstrapMedians=boot_medians).plot(kind='hist', density=True, bins=np.arange(60000, 85000, 1000), ec='w', figsize=(10, 5))
plt.scatter(population_median, 0.000004, color='orange', s=100, label='population median').set_zorder(2)
plt.legend();

- The population median (orange dot) is near the middle.
    - **In reality, we'd never get to see this!**

## What's the bootstrap useful for?

- We have a sample median wage:

In [None]:
my_sample.get('TotalWages').median()

- And now we can say: the population median wage is approximately \\$69616.
    - But how approximate?

In [None]:
(bpd.DataFrame()
 .assign(BootstrapMedians=boot_medians)
 .plot(kind='hist', density=True, bins=np.arange(60000, 85000, 1000), ec='w', figsize=(10, 5))
)
# plt.scatter(population_median, 0.000004, color='orange', s=100, label='population median').set_zorder(2)
plt.legend();

- So now we can say: my guess for the population median wage is that it's between \\$65,000 and \\$75,000.
- Next time, we'll talk about how to set this range precisely.

## Why does it matter?

- Now, we're learning estimation techniques that are more applicable to real life.
- Real life: no population, only a sample!
- Using the bootstrap lets us **quantify uncertainty**.
    - With one sample, I might think the population median wage is between \\$65,000 and \\$75,000.
    - With another, I might think it's between \\$68,000 and \\$71,000.
    - In the second case, I'm more certain about my estimate.
- Next time: we'll make this rigorous and say exactly what uncertainty means.