In [None]:
# Imports
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

# Lecture 17 – Permutation Testing

## DSC 10, Winter 2022

### Announcements

- Lab 5 is due **Thursday 2/17 at 11:59pm**.
- Homework 5 is due **Saturday 2/19 at 11:59pm**.
- Take a look at the "Grade Report" on Gradescope.

### Agenda

- Recap: Total Variation Distance.
    - A test statistic that allows us to compare how "different" two categorical distributions are.
- A new type of hypothesis test.
    - So far, we've been assessing models given a single random sample.
        - We flip a coin 400 times. Are the flips consistent with the coin being fair?
        - Did the jury panel in the Swain case look like a random sample from the eligible population?
        - Are the test scores for the TA's section a random sample from the class's scores?
    - But we often have **two** random samples we wish to compare.
        - Example: birth weights of babies born to smoking mothers vs. non-smoking mothers 👶.
        - Example: drops in pressure for footballs from two different teams (Deflategate) 🏈.
    - **Permutation testing** (i.e. **A/B testing**) will help us decide whether two random samples come from the same distribution.

## Comparing distributions

### ACLU study
- The ACLU (American Civil Liberties Union) of Northern California studied the ethnic composition of jury panels in 11 felony trials in Alameda County between 2009 and 2010.
    - [Here's a link](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) to the study.
- 1453 people reported for jury duty in total (we will call them "panelists").
- The following DataFrame shows the distribution of ethnicities for both the eligible population and for the panelists who were studied.

In [None]:
jury = bpd.DataFrame().assign(
    Ethnicity=['Asian', 'Black', 'Latino', 'White', 'Other'],
    Eligible=[0.15, 0.18, 0.12, 0.54, 0.01],
    Panels=[0.26, 0.08, 0.08, 0.54, 0.04]
)
jury

### Are the differences in representation meaningful?
- Model: Panelists were selected at random from the eligible population.
    - Alternative viewpoint: no, they weren't.
- Observation: 1453 panelists and the distribution of their ethnicities.
- Test statistic: ???
    - How do we deal with multiple categories?

In [None]:
jury.plot(kind='barh', x='Ethnicity');

### The distance between two distributions
- Panelists are categorized into one of 5 ethnicities.
- The distribution of ethnicities is **categorical**.
- To see whether the distribution of ethnicities for the panelists is similar to that of the eligible population, we have to measure the distance between two categorical distributions.

### Statistic: Total Variation Distance
- The **Total Variation Distance (TVD)** of two categorical distributions is **the sum of the absolute differences of their proportions, all divided by 2**.
    - We divide by 2 so that, for example, the distribution [0.51, 0.49] is 0.01 off from [0.50, 0.50].
    - This way, TVD quantifies the **total overrepresentation** across all categories.
    - It would also be valid not to divide by 2. We just wouldn't call that statistic TVD anymore.
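
To make the "total overrepresentation" interpretation concrete, here is a minimal sketch using the jury proportions from above, written with plain NumPy arrays (the variable names are illustrative). Because both distributions sum to 1, the amount by which some categories are overrepresented must equal the amount by which the others are underrepresented, and each of those amounts equals the TVD.

```python
import numpy as np

# Proportions copied from the jury DataFrame (Asian, Black, Latino, White, Other)
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panels = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

diffs = panels - eligible

# TVD: half the sum of the absolute differences
tvd = np.abs(diffs).sum() / 2

# Total overrepresentation: sum of the positive differences only
overrepresentation = diffs[diffs > 0].sum()

# Total underrepresentation: sum of the negative differences, with the sign flipped
underrepresentation = -diffs[diffs < 0].sum()

# All three values are 0.14, up to floating-point rounding
print(tvd, overrepresentation, underrepresentation)
```
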

In [None]:
def total_variation_distance(dist1, dist2):
    '''Computes the TVD between two categorical distributions, 
       assuming the categories appear in the same order.'''
    return np.abs((dist1 - dist2)).sum() / 2

In [None]:
total_variation_distance(jury.get('Eligible'), jury.get('Panels'))

### Discussion Question

What is the TVD between the distributions of class standing in DSC 10 and DSC 40A?

| **Class Standing** | **DSC 10** | **DSC 40A** |
| --- | --- | --- |
| Freshman | 0.45 | 0.15 |
| Sophomore | 0.35 | 0.35 |
| Junior | 0.15 | 0.35 |
| Senior+ | 0.05 | 0.15 |

- A. 0.2
- B. 0.3
- C. 0.5
- D. 0.6
- E. None of the above

### To answer, go to [menti.com](https://menti.com) and enter the code 3290 1539.

### TVD as a test statistic

In [None]:
# Calculate the TVD between the distribution of ethnicities in the eligible population
# and the distribution of ethnicities in the observed panelists

total_variation_distance(jury.get('Eligible'), jury.get('Panels'))

- The closer the TVD is to 0, the closer the two distributions are to one another.
- But is 0.14 a very small value? A typical value? A very large value?

### Simulate drawing jury panels
- Model: Panels are drawn at random from the eligible population.
- Statistic: TVD between the random panel's ethnicity distribution and the eligible population's ethnicity distribution.
- Repeat many times to generate many TVDs, and see where the TVD of the observed panelists lies.

_Note: `np.random.multinomial` creates samples drawn with replacement, even though real jury panels would be drawn without replacement. However, when the sample size is small relative to the population, the resulting distributions will be roughly the same whether we sample with or without replacement._
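
The note above can be checked with a rough sketch. The population size of 100,000 used here is a hypothetical figure chosen for illustration, not a number from the study; the point is only that, when the sample is a small fraction of the population, sampling with and without replacement both give distributions that sit close to the population's.

```python
import numpy as np

rng = np.random.default_rng(10)  # seeded so the sketch is reproducible

probs = np.array([0.15, 0.18, 0.12, 0.54, 0.01])  # eligible distribution
sample_size = 1453
population_size = 100_000  # hypothetical; NOT a figure from the study

# Build an explicit finite population whose category counts match probs.
# Each person is represented by the index (0 to 4) of their category.
counts = np.round(probs * population_size).astype(int)
population = np.repeat(np.arange(len(probs)), counts)

# Without replacement: choose 1453 distinct individuals
drawn = rng.choice(population, size=sample_size, replace=False)
without_replacement = np.bincount(drawn, minlength=len(probs)) / sample_size

# With replacement: this is what the multinomial model does
with_replacement = rng.multinomial(sample_size, probs) / sample_size

def tvd(a, b):
    return np.abs(a - b).sum() / 2

print(tvd(without_replacement, probs), tvd(with_replacement, probs))
```
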

### The simulation

Let's start by simulating one jury panel and computing the TVD between the simulated panel's ethnic distribution and Alameda County's ethnic distribution.

In [None]:
eligible = jury.get('Eligible')
simulated_distribution = np.random.multinomial(1453, eligible) / 1453 
simulated_distribution

In [None]:
panels_and_sample = jury.assign(RandomSample=simulated_distribution) 
panels_and_sample

In [None]:
panels_and_sample.plot(kind='barh', x='Ethnicity');

In [None]:
total_variation_distance(panels_and_sample.get('RandomSample'), eligible)

### Repeating the experiment

In [None]:
tvds = np.array([])
repetitions = 10000
for i in np.arange(repetitions):
    simulated_distribution = np.random.multinomial(1453, eligible) / 1453
    new_tvd = total_variation_distance(simulated_distribution, eligible)
    tvds = np.append(tvds, new_tvd)
    
tvds

In [None]:
observed_tvd = total_variation_distance(jury.get('Panels'), eligible)

bpd.DataFrame().assign(tvds=tvds).plot(kind='hist', density=True, bins=20, ec='w', figsize=(10, 5))
plt.axvline(observed_tvd, color='red', label='observed TVD')
plt.legend();

### Calculating the p-value

Recall:
- The p-value is **the probability, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative**.
- Its formal name is the _observed significance level_.
- In the previous visualization, it is the area to the right of the red line (i.e. the area in the right tail, starting at the observed statistic).

In [None]:
np.count_nonzero(tvds >= observed_tvd) / repetitions

### Are the jury panels representative?
- Likely not! The distributions of ethnicities in our random samples are not like the distribution of ethnicities in our observed panelists.
- This doesn't say *why* the distributions are different!
    - Juries are drawn from voter registration lists and DMV records. Certain populations are less likely to be registered to vote or have a driver's license due to historical biases.
    - The county rarely enacts penalties for those who don't appear for jury duty; certain populations are less likely to be able to miss work to appear for jury duty.
    - [See the report](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) for more reasons.

### Summary of the method

To assess whether a sample was drawn randomly from a known categorical distribution:
- Use TVD as the test statistic because it measures the distance between categorical distributions.
- Sample at random from the population and compute the TVD between the random sample and the population; repeat numerous times.
- Compare:
    - The empirical distribution of simulated TVDs, and
    - The actual TVD from the sample in the study.
- See Question 6 on Homework 5 for an example.
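
The steps above can be collected into one function. This is a sketch under stated assumptions rather than code from the lecture: the name `simulated_tvd_p_value` and its parameters are hypothetical, and it uses plain NumPy arrays instead of DataFrame columns. It relies on the `size` argument of `multinomial` to draw all of the simulated samples at once instead of looping.

```python
import numpy as np

def simulated_tvd_p_value(observed_dist, population_dist, sample_size,
                          repetitions=10000, rng=None):
    """Hypothetical helper: estimates the p-value for the null hypothesis that
    a sample of sample_size was drawn at random from population_dist."""
    rng = np.random.default_rng() if rng is None else rng
    observed_dist = np.asarray(observed_dist, dtype=float)
    population_dist = np.asarray(population_dist, dtype=float)

    # The actual TVD from the sample in the study
    observed_tvd = np.abs(observed_dist - population_dist).sum() / 2

    # Each row is one simulated sample, converted from counts to proportions
    samples = rng.multinomial(sample_size, population_dist, size=repetitions) / sample_size

    # One simulated TVD per row
    simulated_tvds = np.abs(samples - population_dist).sum(axis=1) / 2

    # Proportion of simulated TVDs at least as large as the observed TVD
    return np.count_nonzero(simulated_tvds >= observed_tvd) / repetitions

eligible = [0.15, 0.18, 0.12, 0.54, 0.01]
panels = [0.26, 0.08, 0.08, 0.54, 0.04]
p_value = simulated_tvd_p_value(panels, eligible, sample_size=1453,
                                rng=np.random.default_rng(0))
print(p_value)
```

Passing in a seeded generator, as in the usage lines, makes the result reproducible; leaving `rng` out gives a fresh random simulation each time.
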

## Motivating A/B testing

### 2008 Obama Campaign

<center><img src='data/obama.png' width=600>

- In 2008, the Obama campaign tested several different versions of a pop-up box on their website. Each visitor to the site would view a random version of the site, with different buttons and images.
- For each version, they recorded the number of people who signed up to be a donor.
- Read more at [this blog post](https://blog.optimizely.com/2010/11/29/how-obama-raised-60-million-by-running-a-simple-experiment/).

### Button choices

- Here are the four different buttons they used.
- Intuitively, which one do you think would lead to the most signups?

<center><img src='data/buttons.png' width=400>

### The winner

<center><img src='data/winner.png' width=600>

It is estimated that this combination of image and button brought in an **additional 60 million dollars** in donations versus the original version of the site.

## Example 1: Smoking and birth weight 👶

### Smoking and birth weight

- **Question:** Is there a significant difference in the weight of babies born to mothers who smoked vs. babies born to mothers who didn't smoke?
- We'll load in data from an **observational study**.
    - Each row corresponds to a baby.
    - There are two groups of babies: those whose mothers smoked, and those whose mothers didn't smoke.

In [None]:
baby = bpd.read_csv('data/baby.csv')
baby

In [None]:
# Selecting only the columns that are relevant
smoking_and_birthweight = baby.get(['Maternal Smoker', 'Birth Weight'])
smoking_and_birthweight

### Visualizing the distribution of each group

In [None]:
smokers = smoking_and_birthweight[smoking_and_birthweight.get('Maternal Smoker')]
# Compare to False rather than negating the Boolean column with a unary minus
non_smokers = smoking_and_birthweight[smoking_and_birthweight.get('Maternal Smoker') == False]

In [None]:
fig, ax = plt.subplots()
baby_bins = np.arange(50, 200, 5)
non_smokers.plot(kind='hist', density=True, ax=ax, alpha=0.75, bins=baby_bins, ec='w', figsize=(10, 5))
smokers.plot(kind='hist', density=True, ax=ax, alpha=0.75, bins=baby_bins, ec='w')
plt.legend(['Maternal Smoker = False', 'Maternal Smoker = True'])
plt.xlabel('Birth Weight');

### The question

- It appears that babies born to smokers typically have lower birth weights than babies born to non-smokers.
- Does the difference we see reflect a real difference in the population?
- Or is it just due to random chance?

### Testing hypotheses

- **Null hypothesis**: In the population, birth weights of babies born to smokers and non-smokers have the same distribution.
    - In other words, the difference we saw was due to random chance.
- **Alternative hypothesis**: In the population, babies born to smokers have lower birth weights than babies born to non-smokers, on average.
    - In other words, what we saw cannot be explained by random chance alone, and there is instead a meaningful difference in these distributions.

### Discussion Question

Recall, last time we introduced the Total Variation Distance (TVD) as a test statistic. Why **can't** we use the TVD as our test statistic here? Any ideas for a test statistic that we could use?

Discuss with your peers.

### Test statistic: the difference between means

The test statistic we'll use is

$$\text{mean birth weight of babies born to non-smokers} - \text{mean birth weight of babies born to smokers}$$

Note that **large values of this test statistic favor the alternative hypothesis**.

In [None]:
means_table = smoking_and_birthweight.groupby('Maternal Smoker').mean()
means_table

In [None]:
# The difference between the mean birth weight for non-smokers and smokers
means = means_table.get('Birth Weight')
observed_difference = means.loc[False] - means.loc[True]
observed_difference

### Hypothesis testing through simulation

- **Null hypothesis**: The two groups are sampled from the same distribution.
- **Test statistic**: The difference between the mean non-smoker weight and the mean smoker weight.
- Note that the null hypothesis doesn't say *what* the distribution is.
    - This is different from earlier examples (jury panels, fair coins, etc.) where we specified exactly what the distribution under the null is.
    - We can't draw directly from the distribution!
- We have to do something a bit more clever.

### Implications of the null hypothesis

- Under the null hypothesis, both groups are sampled from the same population distribution.
- If that's true, then whether `'Maternal Smoker'` is `True` or `False` should have no impact on the `'Birth Weight'` variable.
- **Idea:** the distribution of birth weights for smokers and non-smokers should remain the same if we shuffle one of the two columns in our DataFrame.

In [None]:
# What if we shuffle one of these columns?
smoking_and_birthweight

### Permutation tests

- Perhaps the difference in means we saw is due to random chance in assignment.
- **The key idea behind permutation tests**: Shuffle the group labels (i.e. `True`s and `False`s) many, many times, and compute the difference in the group means each time. 
    - **How often do we see a difference in means this extreme?**
    - If we **rarely** see a difference in means as extreme as the one in our observed samples, then the null hypothesis doesn't look likely.
- Randomly permuting labels is equivalent to randomly assigning birth weights to groups, without changing group sizes.
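
As a small illustration of the last point, the sketch below shuffles the labels once for a toy dataset. The six weights and labels are made up for illustration and are not rows from the baby data. After the shuffle, every weight is still present and each group still has three members; only the pairing of weights with labels has changed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 6 birth weights (in ounces) and their smoker labels
weights = np.array([120, 113, 128, 108, 136, 99])
is_smoker = np.array([False, True, False, True, False, True])

def difference_in_means(values, labels):
    """Mean of the False (non-smoker) group minus mean of the True (smoker) group."""
    return values[~labels].mean() - values[labels].mean()

# One shuffle of the group labels; the weights stay where they are
shuffled_labels = rng.permutation(is_smoker)

print('group sizes before:', np.count_nonzero(is_smoker), np.count_nonzero(~is_smoker))
print('group sizes after: ', np.count_nonzero(shuffled_labels), np.count_nonzero(~shuffled_labels))

observed = difference_in_means(weights, is_smoker)
one_simulated = difference_in_means(weights, shuffled_labels)
print(observed, one_simulated)
```
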


### Permutation tests with DataFrames

- We want to randomly shuffle one of the columns in our DataFrame.
- `df.sample` returns a random sample of the rows in a DataFrame, but we want to shuffle one column independently.
- **Solution:** Use `np.random.permutation`, which takes in an array and returns a shuffled version of it.

In [None]:
data = bpd.DataFrame().assign(x=['a', 'b', 'c', 'd'], y=[1, 2, 3, 4])
data

In [None]:
# Random!
np.random.permutation(data.get('y'))

In [None]:
data.assign(shuffled_y=np.random.permutation(data.get('y')))

### Shuffling one column

- Note that it doesn't matter which of the two columns we shuffle; the end result will be a random pairing of labels (`True` and `False`) and weights.
- We've chosen to shuffle `'Birth Weight'`, but we could have shuffled `'Maternal Smoker'` instead.

In [None]:
original_and_shuffled = smoking_and_birthweight.assign(
    ShuffledBirthWeight=np.random.permutation(smoking_and_birthweight.get('Birth Weight'))
)
original_and_shuffled

In [None]:
fig, ax = plt.subplots()
smokers = original_and_shuffled[original_and_shuffled.get('Maternal Smoker')]
non_smokers = original_and_shuffled[original_and_shuffled.get('Maternal Smoker') == False]
non_smokers.plot(kind='hist', y='ShuffledBirthWeight', density=True, ax=ax, alpha=0.75, bins=baby_bins, ec='w', figsize=(10, 5))
smokers.plot(kind='hist', y='ShuffledBirthWeight', density=True, ax=ax, alpha=0.75, bins=baby_bins)
plt.legend(['Maternal Smoker = False', 'Maternal Smoker = True'])
plt.xlabel('Birth Weight');

### How close are the means of the shuffled groups?

In [None]:
original_and_shuffled.groupby('Maternal Smoker').mean()

In [None]:
group_means = original_and_shuffled.groupby('Maternal Smoker').mean().get('ShuffledBirthWeight')

group_means.loc[False] - group_means.loc[True]

- This is the test statistic for one experiment (one "shuffle").
- Let's write a function that can compute this test statistic for any shuffle.

In [None]:
def difference_in_mean_weights(weights_df):
    group_means = weights_df.groupby('Maternal Smoker').mean().get('ShuffledBirthWeight')
    return group_means.loc[False] - group_means.loc[True]

difference_in_mean_weights(original_and_shuffled)

### Simulation

- This was just one random shuffle.
- How likely is it that a random shuffle results in a 9+ ounce difference in means?
- We have to repeat the shuffling a bunch of times. On each iteration:
    1. Shuffle the weights.
    2. Put them in a DataFrame.
    3. Compute the difference in group means.

In [None]:
n_repetitions = 500 # The dataset is large, so it takes too long to run if we use 5000 or 10000

differences = np.array([])
for i in np.arange(n_repetitions):
    # Step 1: Shuffle the weights
    shuffled_weights = np.random.permutation(
        smoking_and_birthweight.get('Birth Weight')
    )
    
    # Step 2: Put them in a DataFrame
    shuffled = smoking_and_birthweight.assign(
        ShuffledBirthWeight=shuffled_weights
    )
    
    # Step 3: Compute the difference in group means and add the result to the differences array
    difference = difference_in_mean_weights(shuffled)
    
    differences = np.append(differences, difference)
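
Most of the running time in the loop above likely comes from building a DataFrame and grouping it on every iteration. As a hedged alternative, the same three steps can be carried out on NumPy arrays with a Boolean mask, which is usually fast enough to allow more repetitions. The data below is a synthetic stand-in, not the real baby dataset: both groups are drawn from the same distribution, so the null hypothesis is true by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the baby data (NOT the real dataset)
weights = rng.normal(120, 18, size=1000)
is_smoker = np.arange(1000) < 400  # the first 400 rows are labeled as smokers

def difference_in_means(values, labels):
    """Mean of the non-smoker group minus mean of the smoker group."""
    return values[~labels].mean() - values[labels].mean()

n_repetitions = 5000
differences = np.empty(n_repetitions)  # preallocated instead of repeatedly appending
for i in range(n_repetitions):
    # Step 1: shuffle the weights
    shuffled_weights = rng.permutation(weights)
    # Steps 2 and 3: pair them with the original labels and compute the statistic
    differences[i] = difference_in_means(shuffled_weights, is_smoker)

observed = difference_in_means(weights, is_smoker)
p_value = np.count_nonzero(differences >= observed) / n_repetitions
print(differences.mean(), p_value)
```

With real data, `weights` and `is_smoker` would be replaced by arrays taken from the corresponding columns.
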

In [None]:
bpd.DataFrame().assign(DifferenceInMeans=differences).plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5));

- Note that the empirical distribution of the test statistic (difference in means) is centered around 0.
- This matches our intuition – if the null hypothesis is true, there should be no difference in the group means on average.

### Conclusion of the test

Where does our observed statistic lie?

In [None]:
bpd.DataFrame().assign(SimulatedDifferenceInMeans=differences).plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
plt.axvline(observed_difference, color='red', label='observed difference in means')
plt.legend();

In [None]:
# Divide by n_repetitions rather than a hard-coded 500 so the two stay in sync
smoker_p_value = np.count_nonzero(differences >= observed_difference) / n_repetitions
smoker_p_value

### Conclusion

- Under the null hypothesis, we rarely see differences as large as this.
- Therefore, we reject the null hypothesis: the two groups do not come from the same distribution.

### Caution! ⚠️

- We **cannot** conclude that smoking *causes* lower birth weight!
- This was an observational study; there may be confounding factors.
    - Maybe smokers are more likely to drink caffeine, and caffeine causes lower birth weight.
- Still, the result suggests that the relationship may be causal.

## Example 2: Deflategate 🏈

### Did the New England Patriots cheat?

<center><img width="40%" src="./data/deflate.jpg"></center>

- On January 18, 2015, the New England Patriots played the Indianapolis Colts for a spot in the Super Bowl.
- The Patriots won, 45-7. They went on to win the Super Bowl.
- After the game, it was alleged that the Patriots intentionally deflated footballs, making them easier to catch.

### Background

- Each team brings 12 footballs to the game. Teams use their own footballs while on offense.
- NFL rules stipulate that **each ball must be inflated to between 12.5 and 13.5 pounds per square inch (psi)**.
- Before the game, officials found that all of the Patriots' footballs were at about 12.5 psi, and that all of the Colts' footballs were at about 13.0 psi.
    - This pre-game data was not written down.
- In the second quarter, the Colts intercepted a Patriots ball and notified officials that it felt under-inflated.
- At halftime, two officials (Clete Blakeman and Dyrol Prioleau) independently measured the pressures of as many of the 24 footballs as they could.
    - They ran out of time before they could finish.

### The measurements

In [None]:
footballs = bpd.read_csv('data/deflategate.csv')
footballs

There are only 15 rows (11 for Patriots footballs, 4 for Colts footballs) since the officials weren't able to record the pressures of every ball.

### Combining the measurements

- Both officials measured each ball.
- Their measurements are slightly different, so we'll average them to get a combined pressure for each ball.

In [None]:
footballs = footballs.assign(
    Pressure=(footballs.get('Blakeman') + footballs.get('Prioleau')) / 2
).drop(columns=['Blakeman', 'Prioleau'])
footballs

### Differences in average pressure

- At first glance, it looks as though the Patriots' footballs are at a lower pressure.
- We could do a permutation test for the difference in mean pressure, but that wouldn't point towards cheating.
    - The Patriots' footballs *started* at a lower psi (which is not an issue on its own).
- The allegations were that the Patriots **deflated** their balls, during the game.
    - We want to check to see if the Patriots' footballs lost more pressure than the Colts' footballs from the start of the game to halftime, when these measurements were taken.

In [None]:
# Mean pressure for each team's footballs
footballs.groupby('Team').mean()

### Calculating the pressure drop

- Let's calculate the drop in pressure for each ball in `footballs`.
- The Patriots' footballs started at around 12.5 psi, while the Colts' footballs started at around 13 psi.
- **Strategy**: we'll make an array with starting pressure for each ball, and from that subtract the halftime pressure of each ball.
    - Note that the first 11 rows correspond to Patriots balls and the last 4 rows correspond to Colts balls.
    - Thus, we need an array with 11 `12.5`s followed by 4 `13`s.
    - We can use `np.ones` to help us.

In [None]:
footballs

In [None]:
# np.ones(n) returns an array of n 1s
np.ones(11)

In [None]:
starting_pressure = np.ones(11) * 12.5
starting_pressure = np.append(starting_pressure, np.ones(4) * 13)
starting_pressure
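
An equivalent way to build the same array in one step is `np.repeat`, which repeats each value a given number of times. This is offered as an alternative sketch; the `np.ones` approach above works just as well.

```python
import numpy as np

# 11 Patriots balls at 12.5 psi followed by 4 Colts balls at 13 psi
starting_pressure = np.repeat([12.5, 13.0], [11, 4])
print(starting_pressure)
```
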

### Calculating the pressure drop

In [None]:
footballs = footballs.assign(
    PressureDrop=(starting_pressure - footballs.get('Pressure'))
)
footballs

### The question

- Did the Patriots' footballs drop in pressure more than the Colts'?
    - We want to test whether two samples came from the same distribution – this calls for a permutation test.
- **Null hypothesis**: The drop in pressures for both teams came from the same distribution.
    - By chance, the Patriots' footballs deflated more.
- **Alternative hypothesis**: No, the Patriots' footballs deflated more than one would expect due to random chance alone.

### The test statistic

As in the baby weights example, our test statistic will be the difference between the teams' average pressure drops.

In [None]:
means = footballs.groupby('Team').mean().get('PressureDrop')
means

In [None]:
# The observed statistic
observed_difference = means.loc['Patriots'] - means.loc['Colts']
observed_difference

The average pressure drop for the Patriots' footballs was about 0.73 psi greater than that for the Colts' footballs.

### Permutation test

- We run a permutation test to see if this is a significant difference.
- Permute the drop in pressure many, many times, and compute the difference in the mean pressure drops for the two teams.
    - Alternatively, we could permute the team names.

In [None]:
def difference_in_mean_pressure_drops(pressures_df):
    team_means = pressures_df.groupby('Team').mean().get('ShuffledPressureDrop')
    return team_means.loc['Patriots'] - team_means.loc['Colts']

In [None]:
n_repetitions = 5000 # The dataset is much smaller than in the baby weights example, so a larger number of repetitions will still run quickly

differences = np.array([])
for i in np.arange(n_repetitions):
    # Step 1: Shuffle the pressure drops
    shuffled_drops = np.random.permutation(footballs.get('PressureDrop'))
    
    # Step 2: Put them in a DataFrame
    shuffled = footballs.assign(
        ShuffledPressureDrop=shuffled_drops
    )
    
    # Step 3: Compute the difference in group means and add the result to the differences array
    difference = difference_in_mean_pressure_drops(shuffled)

    differences = np.append(differences, difference)
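
With only 15 footballs, random shuffling is not the only option. There are just 1365 ways to choose which 4 of the 15 balls carry the Colts label, so every possible grouping can be listed and the statistic computed for each one. The sketch below illustrates this under loud assumptions: the pressure drops are hypothetical values invented for the example, not the halftime measurements, and the helper name `difference_for` is made up.

```python
import numpy as np
from itertools import combinations
from math import comb

# Hypothetical pressure drops in psi, for illustration only (NOT the measured data).
# The first 11 values play the role of Patriots balls, the last 4 of Colts balls.
drops = np.array([1.2, 1.0, 1.4, 0.9, 1.3, 1.1, 1.5, 0.8, 1.2, 1.0, 1.3,
                  0.5, 0.3, 0.6, 0.4])
n = len(drops)

def difference_for(colts_indices):
    """Mean drop of the balls outside colts_indices minus the mean drop of those inside."""
    is_colts = np.zeros(n, dtype=bool)
    is_colts[list(colts_indices)] = True
    return drops[~is_colts].mean() - drops[is_colts].mean()

# The grouping that was actually observed: the last 4 balls are the Colts balls
observed = difference_for(range(11, 15))

# Every possible way to choose which 4 balls are labeled as Colts balls
all_differences = np.array([difference_for(c) for c in combinations(range(n), 4)])

exact_p_value = np.count_nonzero(all_differences >= observed) / len(all_differences)
print(len(all_differences), comb(15, 4), exact_p_value)
```

Listing every grouping removes the randomness from the simulation, but it is only practical for small samples; for data the size of the baby weights example, random shuffles remain the workable approach.
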

### Conclusion

In [None]:
bpd.DataFrame().assign(SimulatedDifferenceInMeans=differences).plot(kind='hist', bins=20, density=True, ec='w', figsize=(10, 5))
plt.axvline(observed_difference, color='red', label='observed difference in means')
plt.legend();

- It doesn't look good for the Patriots. What is the p-value?
    - Recall, the p-value is the **probability of seeing a result as or more extreme than the observation under the null hypothesis**.
    - In this case, that's the probability of the difference in mean pressure drops being greater than or equal to 0.7335.

In [None]:
observed_difference

In [None]:
# Calculating the p-value
np.count_nonzero(differences >= observed_difference) / n_repetitions

This p-value is low enough to consider this result to be highly statistically significant ($p<0.01$).

### Caution! ⚠️

- We conclude that it is unlikely that the difference in mean pressure drop is due to chance alone.
- But this doesn't establish *causation*.
- That is, we can't conclude that the Patriots **intentionally** deflated their footballs.
- This was an *observational* study; to establish causation, we'd need an RCT (Randomized Controlled Trial).

### Aftermath

- Quote from an investigative report commissioned by the NFL:

> “[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls.”

- Many different methods were used to determine whether the drops in pressure were due to chance, including physics. 
    - We computed an observed difference of 0.7335, which is in line with the findings of the report. 
- In the end, Tom Brady (quarterback for the Patriots at the time) was suspended for 4 games and the team was fined $1 million.
- The [Deflategate Wikipedia article](https://en.wikipedia.org/wiki/Deflategate) is extremely thorough; give it a read if you're curious!

## Summary

### Summary

- The total variation distance is a test statistic that measures the difference between two categorical distributions.
    - Note: the TVD is **not** used for permutation tests!
- Permutation tests help us determine if two numerical samples came from the same population.
- We can answer questions like:
    - "Do smoking moms and nonsmoking moms have babies that weigh the same?"
    - "Did the Colts' footballs and Patriots' footballs have the same amount of pressure drop?"
    - More generally: are *these things* like *those things*?

### A/B testing

- Permutation tests are one way to perform **A/B tests**.
    - These are both also hypothesis tests.
- An A/B test aims to determine if two samples are from the same population (the name comes from giving names to the samples – sample A and sample B).
- We implemented A/B tests by using permutations. Outside of this class, permutation tests can be used for other purposes, and A/B tests can be done without permutations. 
- **For us, they mean the same thing, so if you see A/B test anywhere in the class, that is referring to a permutation test.**

### Next time

- On Wednesday, we'll see how we can use permutation tests to try and establish causality.
- We'll also introduce bootstrapping, a procedure that will allow us to quantify the variation in parameter estimates by using just a single sample.