In [None]:
# Imports
import numpy as np
import babypandas as bpd

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Lecture 16 – Hypothesis Testing, Continued

## DSC 10, Winter 2022

### Announcements

- The Midterm Project is due **tomorrow at 11:59pm**.
    - Check the [Calendar](https://dsc10.com/calendar) for the most up-to-date office hours schedule (we've added a few more over the weekend).
- Midterm Exam grades are available on Gradescope.
    - **If you didn't do as well as you'd hoped, remember that the Midterm Exam is only worth 10% of your overall grade!**
- Lab 5 is due on **Thursday 2/17 at 11:59pm**.
- Homework 5 will be released on Sunday and will be due on **Saturday 2/19 at 11:59pm**.

### Agenda

- Example: Is our coin fair?
- Decisions and uncertainty.
- Example: Midterm scores.
    - p-values.
- Example: Jury Panels.
    - Total variation distance.

## Example: Is our coin fair?

Recall that last class we looked at two pairs of viewpoints regarding the flips of a coin.

### (1) “This coin is fair.” OR “No, it’s not.”

* Large or small values of the number of heads suggest that the coin is "not fair".
    - **Test statistic: $| \text{number of heads} - 200 |$.**
    - Large values of the statistic suggest that the coin is "not fair".
    - If we used the number of heads, then both large values of the statistic **and** small values of the statistic suggest that the coin is "not fair", which makes the calculation more complicated.

### (2) “This coin is fair.” OR “No, it’s biased towards heads.”

* Only large values of the number of heads suggest that the coin is "biased toward heads".
    - **Test statistic: number of heads.**
    - Statistic (1) wouldn't work here because its large values don't all point the same way: it is large when there are many more than 200 heads (which supports "biased towards heads"), but it is also large when there are many fewer than 200 heads (which doesn't).
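
To make the difference between the two statistics concrete, here is a small sketch that evaluates both on two hypothetical outcomes of 400 flips, one with many heads and one with few. The outcome counts (230 and 170 heads) and the helper names are chosen only for illustration.

```python
import numpy as np

def abs_dist_statistic(heads):
    # Statistic for viewpoint pair (1): distance between the number of heads and 200.
    return np.abs(heads - 200)

def heads_statistic(heads):
    # Statistic for viewpoint pair (2): the number of heads itself.
    return heads

many_heads = 230   # hypothetical outcome: 230 heads, 170 tails
few_heads = 170    # hypothetical outcome: 170 heads, 230 tails

# Statistic (1) cannot tell the two outcomes apart: both are 30 away from 200.
print(abs_dist_statistic(many_heads), abs_dist_statistic(few_heads))

# Statistic (2) separates them: only the first outcome has a large value.
print(heads_statistic(many_heads), heads_statistic(few_heads))
```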

### Pick a (possibly) biased coin
- We'll randomly select the chance that the coin flips heads from the list `[0.4, 0.5, 0.6]`, but we will **not** look at this chance directly.
- We'll then flip this coin 400 times and look at the resulting number of heads and tails. 
    - This is our "observation", i.e. it's the equivalent of seeing a jury panel with 8 Black men on it, or seeing 705 plants with purple flowers out of 929.
    - This is **not** our simulation!

In [None]:
# By setting a seed, we ensure that we get the same results every time this cell is run.
np.random.seed(42)

# Pick a possibly biased coin
prob = np.random.choice([0.4, 0.5, 0.6])

# Flip this coin 400 times; flips = [Num Heads, Num Tails]
flips = np.random.multinomial(400, [prob, 1 - prob])
flips

### Compute the statistics
- Is the coin biased?
- Is the coin biased toward heads?
- Let's write functions that compute the two relevant statistics (the number of heads, and the distance between the number of heads and 200) given an array of flips.
    - Why 200? That's the number of heads we'd expect if we flip a fair coin ($0.5 \cdot 400 = 200$).
    - Instead of the number of heads and 200, we could look at the proportion of heads and 0.5.

In [None]:
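def num_heads(arr):
    # arr is [Num Heads, Num Tails], so the number of heads is the first element.
    return arr[0]

def dist_from_200(arr):
    # Distance between the number of heads and 200 (the expected number for a fair coin).
    return np.abs(num_heads(arr) - 200)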

In [None]:
num_heads(flips)

In [None]:
dist_from_200(flips)
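
As noted earlier, the same two statistics could be expressed with proportions instead of counts. A minimal sketch follows, where `prop_heads`, `dist_from_half`, and the example array are hypothetical names and values introduced here for illustration:

```python
import numpy as np

def prop_heads(arr):
    # arr is [Num Heads, Num Tails]; divide heads by the total number of flips.
    return arr[0] / arr.sum()

def dist_from_half(arr):
    # Distance between the proportion of heads and 0.5 (the fair-coin proportion).
    return np.abs(prop_heads(arr) - 0.5)

example_flips = np.array([230, 170])  # hypothetical outcome of 400 flips

print(prop_heads(example_flips))      # proportion of heads
print(dist_from_half(example_flips))  # same information as |heads - 200| / 400
```

Because the proportion version is just the count version divided by 400, both versions lead to the same conclusions.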

### What do these statistics look like for a fair coin?

For each pair of viewpoints:
- Define the model for a fair coin (done).
- Define the test statistic (done).
- Run the simulation: flip the coin 400 times, calculate the test statistic, and add it to a `results` array. Repeat this process many, many times.
- Plot a histogram of the `results`.

### (1) “This coin is fair.” OR “No, it’s not.”

For this pair of viewpoints, a good test statistic is

$$ | \text{number of heads} - 200 |$$

This is calculated by our function `dist_from_200`.

In [None]:
model = [0.5, 0.5]

repetitions = 10000
results = np.array([])
for i in np.arange(repetitions):
    coins = np.random.multinomial(400, model)
    result = dist_from_200(coins)
    results = np.append(results, result)

results

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', density=True, bins=np.arange(0, 40, 2), ec='w',
                                             figsize=(10, 5));
plt.axvline(dist_from_200(flips), color='red');

- The distance between the number of heads in our observed sample (236) and 200 is 36, which is much, much larger than a typical distance **under the assumption that the coin is fair**.
- As such, the coin is probably not fair, and we side with the viewpoint "No, it's not fair."

### (2) “This coin is fair.” OR “No, it’s biased towards heads.”

A good test statistic here is the number of heads, which is calculated by our function `num_heads`.

In [None]:
# Note that we can re-use most of our code from above!
# The only part that will change is how we calculate our statistic.

model = [0.5, 0.5]

repetitions = 10000
results = np.array([])
for i in np.arange(repetitions):
    coins = np.random.multinomial(400, model)
    result = num_heads(coins)
    results = np.append(results, result)

results

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', density=True, bins=np.arange(160, 240, 5), ec='w',
                                             figsize=(10, 5));
plt.axvline(num_heads(flips), color='red');

- The number of heads in our observed sample is 236.
- Under the assumption that the coin is fair, we essentially never saw 236 or more heads in 400 flips.
- As such, the coin is probably not fair, and we side with the viewpoint "No, it's biased towards heads."
    - Note that if the vertical red line was somewhere near 160, we'd side with the viewpoint "This coin is fair." since a small number of heads is not evidence in favor of the viewpoint "No, it's biased towards heads."
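
The phrase "essentially never" can be quantified by counting how often a fair coin produces at least as many heads as we observed. The sketch below reruns the fair-coin simulation in one vectorized call (using the `size` argument of `np.random.multinomial`, which the loop above does not use) and takes the observed count of 236 from the text. The seed is arbitrary.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is reproducible

observed_heads = 236   # observed number of heads quoted in the text
repetitions = 10000

# Each row is [Num Heads, Num Tails] for 400 flips of a fair coin.
simulated = np.random.multinomial(400, [0.5, 0.5], size=repetitions)
simulated_heads = simulated[:, 0]

# Proportion of simulated fair-coin experiments with at least 236 heads.
prop_at_least_observed = np.count_nonzero(simulated_heads >= observed_heads) / repetitions
print(prop_at_least_observed)
```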

### Was the coin biased?

In [None]:
prob 

### What if the coin is only *slightly* biased?

Let's suppose our coin flips heads with probability 0.51 and tails with probability 0.49.

In [None]:
np.random.seed(23)
prob = 0.51
flips = np.random.multinomial(400, [prob, 1-prob])
flips

Let's again simulate the number of heads in 400 flips of a fair coin, 10000 times.

In [None]:
model = [0.5, 0.5]
repetitions = 10000
results = np.array([])
for i in np.arange(repetitions):
    coins = np.random.multinomial(400, model)
    result = num_heads(coins)
    results = np.append(results, result)
    
results

In [None]:
bpd.DataFrame().assign(num_of_heads=results).plot(kind='hist', density=True, ec='w', bins=20, figsize=(10, 5));
plt.axvline(num_heads(flips), color='red', label='observed statistic')
plt.legend();

- Even though this new coin is still biased, the resulting number of heads in our observed sample is not all that atypical according to the model that the coin is fair.
    - In other words: the observed number of heads **looks like** it could have come from a fair coin.
- As such, given the data we have, we'd still side with the viewpoint that the coin is fair, even though it truly isn't.

### Discussion Question
If the coin were biased towards heads with probability 0.51, how can we change the experiment to detect the bias?

|Option|Answer|
|---|---|
|A.|Increase the number of experiments|
|B.|Increase the number of coin flips per experiment|
|C.|Find a totally different statistic|
|D.|There's no way to detect a bias this small|

### To answer, go to [menti.com](https://menti.com) and enter the code 4235 1662.

### Answer

In [None]:
# Design the experiment
def run_experiments(number_of_flips, number_of_repetitions):
    # Observed sample: flip a slightly biased coin (P(heads) = 0.51) number_of_flips times.
    prob = 0.51
    flips = np.random.multinomial(number_of_flips, [prob, 1-prob])

    # Simulate the number of heads under the model that the coin is fair.
    model = [0.5, 0.5]
    results = np.array([])
    for i in np.arange(number_of_repetitions):
        coins = np.random.multinomial(number_of_flips, model)
        result = num_heads(coins)
        results = np.append(results, result)
    return results, flips

In [None]:
results, flips = run_experiments(number_of_flips=40000, number_of_repetitions=10000)

bpd.DataFrame().assign(results=results).plot(kind='hist', density=True, ec='w', bins=20, figsize=(10, 5))
plt.axvline(num_heads(flips), color='red');
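
One way to see why more flips per experiment helps: the gap between the expected number of heads for the 0.51 coin and for a fair coin grows in proportion to the number of flips, while the spread of the fair-coin simulation grows more slowly. The sketch below measures both empirically. The helper name `gap_to_spread` and the seed are assumptions made for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed for reproducibility

def gap_to_spread(number_of_flips, repetitions=10000):
    # Expected extra heads from a coin with P(heads) = 0.51 compared to a fair coin.
    gap = 0.51 * number_of_flips - 0.5 * number_of_flips
    # Spread (standard deviation) of the number of heads for a simulated fair coin.
    fair_heads = np.random.multinomial(number_of_flips, [0.5, 0.5], size=repetitions)[:, 0]
    spread = np.std(fair_heads)
    return gap / spread

ratio_400 = gap_to_spread(400)
ratio_40000 = gap_to_spread(40000)
print(ratio_400, ratio_40000)
```

A ratio well below 1 means the bias is buried in random variation, while a ratio of several spreads means the observed count tends to land far in the right tail of the fair-coin histogram.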

## Decisions and uncertainty

### Incomplete information

- We try to choose between two views of the world, based on data in a sample.
- It's not always clear whether the data are consistent with one view or the other.
- Random samples can turn out quite extreme. It is unlikely, but possible.

### Testing hypotheses
- A test chooses between two views of how data were generated.
- The views are called **hypotheses**.
- The test picks the hypothesis that is better supported by the observed data.
    - We will formalize this notion now.

### Null and alternative hypotheses
- This method only works if we can simulate data under one of the hypotheses.
- **Null hypothesis**: A well-defined probability model about how the data were generated.
    - We can simulate data under the assumptions of this model – “under the null hypothesis”.
- **Alternative hypothesis**: A different view about the origin of the data.

### Test statistics, revisited
- Recall, we compute the test statistic on each of our samples in our simulation.
- Its goal is to give us information that will help us in determining which hypothesis to side with.
- The test statistic evaluated on our observed data is called the **observed statistic**.

Questions before choosing the statistic:
- What values of the statistic will make us lean towards the null hypothesis?
- What values will make us lean towards the alternative?
    - The answer should be just “high” or just “low”.
    - Try to avoid “both high and low”: this is why, for example, we used the statistic $| \text{number of heads} - 200|$ when our alternative hypothesis was "the coin is not fair."

### Empirical distribution of the test statistic
- When performing a hypothesis test, we **simulate** the test statistic **under the null hypothesis** and draw a histogram of the simulated values.
- This shows us the **empirical distribution of the test statistic under the null hypothesis**.
    - It shows all of the likely values of the test statistic.
    - It also shows how likely they are, under the assumption that the null hypothesis is true.
    - The probabilities are approximate, because we can’t generate all possible random samples.
- We side with the null only if the observed statistic is **consistent** with the empirical distribution of the test statistic.
- **Question:** is there a formal definition for what we mean by "consistent"?
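
To illustrate the point that simulated probabilities are only approximate, the sketch below estimates the chance that a fair coin lands at least 20 heads away from 200 in 400 flips, using a small and a large number of repetitions, and compares both to the exact probability. The cutoff of 20, the repetition counts, and the seed are arbitrary choices made for this illustration. The exact value comes from counting outcomes with `math.comb`, which is not part of the lecture.

```python
import math
import numpy as np

np.random.seed(0)  # arbitrary seed for reproducibility

def estimate(repetitions):
    # Simulate |number of heads - 200| for a fair coin, `repetitions` times.
    heads = np.random.multinomial(400, [0.5, 0.5], size=repetitions)[:, 0]
    stats = np.abs(heads - 200)
    return np.count_nonzero(stats >= 20) / repetitions

# Exact probability: count the equally likely flip sequences with |heads - 200| >= 20.
favorable = sum(math.comb(400, k) for k in range(401) if abs(k - 200) >= 20)
exact = favorable / 2 ** 400

estimate_small = estimate(100)
estimate_large = estimate(100000)
print(exact, estimate_small, estimate_large)
```

With more repetitions, the simulated proportion tends to settle closer to the exact value.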

## Example: Midterm scores

### The problem

- Consider a large Biology course divided into 12 discussion sections. 
- Each student is in exactly one discussion section.
- TAs lead the sections.
- After the midterm exam, students in Section 3 notice that the average score in their section is lower than in others.

### The TA's defense

- **The TA's position (Null Hypothesis)**: It's chance. If students were divided into sections randomly, we'd probably see at least one section with a score this low.
- **Alternative Hypothesis**: No, the average score is too low. Randomness is not the only reason for the low scores.
- Let's perform a hypothesis test!

In [None]:
scores = bpd.read_csv('data/scores_by_section.csv')
scores

In [None]:
scores.groupby('Section').count()

In [None]:
# Calculate the average midterm score per section
scores.groupby('Section').mean()

### What are the observed characteristics of section 3?

In [None]:
section_size = scores.groupby('Section').count().get('Midterm').loc[3]
observed_avg = scores.groupby('Section').mean().get('Midterm').loc[3]
print(f'Section 3 had {section_size} students and an average midterm score of {observed_avg}.')

### Simulating under the null hypothesis
- Model: There is no significant difference between the exam scores in different sections, and observed differences are purely due to chance.
    - To simulate: sample 27 students uniformly at random without replacement from the class.
- Test statistic: The average midterm score of a section.
    - The observed statistic is the average midterm score of Section 3 (13.6666).

In [None]:
scores.sample(int(section_size), replace=False).get('Midterm').mean()

In [None]:
averages = np.array([])
repetitions = 10000
for i in np.arange(repetitions):
    random_sample = scores.sample(int(section_size), replace=False)
    new_average = random_sample.get('Midterm').mean()
    averages = np.append(averages, new_average)
    
averages

In [None]:
bpd.DataFrame().assign(RandomSampleAverage=averages).plot(kind='hist', bins=25, ec='w', figsize=(10, 5))
plt.axvline(observed_avg, color='red', label='observed statistic')
plt.legend();

### What's the verdict? 🤔
- This is not as obvious as in previous examples, where it was clear whether the observed statistic was consistent with the empirical distribution of the test statistic.
- We need a precise way of capturing the uncertainty of the conclusion.

## Statistical significance

**Question:** What is the probability that under the null hypothesis, a result *at least* as extreme as our observation occurs?
- In this example, what is the probability that under the null hypothesis, a section of 27 students has an average exam score of about 13.67 (the observed average) or lower?
- This quantity is called a **p-value**.

In [None]:
observed_avg

In [None]:
np.count_nonzero(averages <= observed_avg) / repetitions

In [None]:
bpd.DataFrame().assign(RandomSampleAverage=averages).plot(kind='hist', bins=25, ec='w', figsize=(10, 5))
plt.axvline(observed_avg, color='red', label='observed statistic')
plt.legend();

### Definition of the p-value

- The p-value is **the probability, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative**.
- Its formal name is the _observed significance level_.
- In the previous visualization, it is the area to the left of the red line (i.e. the area in the left tail, starting at the observed statistic).
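
As a small illustration of the definition, the sketch below computes a left-tail p-value on a tiny made-up array of simulated statistics. The function name `left_tail_p_value` and the numbers are hypothetical. The calculation mirrors the `np.count_nonzero` expression used above.

```python
import numpy as np

def left_tail_p_value(simulated_stats, observed_stat):
    # Proportion of simulated statistics that are equal to the observed statistic
    # or further in the direction of the alternative (here: lower).
    return np.count_nonzero(simulated_stats <= observed_stat) / len(simulated_stats)

# Ten made-up simulated averages and a made-up observed average.
toy_simulated = np.array([10, 12, 13, 14, 15, 15, 16, 17, 18, 20])
toy_observed = 13

toy_p_value = left_tail_p_value(toy_simulated, toy_observed)
print(toy_p_value)  # 3 of the 10 simulated values are <= 13
```

When large values of the statistic favor the alternative instead, the comparison flips to `>=`.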

### Conventions about inconsistency

- If the p-value is sufficiently large, we say the data is **consistent** with the null hypothesis and so we "**fail to reject the null hypothesis**".
    - We never say that we "accept" the null hypothesis!
- If the p-value is below some cutoff, we say the data is **inconsistent** with the null hypothesis, and we **"reject the null hypothesis"**.
    - p-values correspond to the "tail areas" of a histogram, starting at the observed statistic.
    - If a p-value is less than 0.05, the result is said to be "statistically significant".
    - If a p-value is less than 0.01, the result is said to be "highly statistically significant".
    - These conventions are historical!

### Error probabilities

The cutoff for the p-value is an error probability. If:

- your cutoff is 0.05, and
- the null hypothesis happens to be true

then there is about a 0.05 chance that your test will (incorrectly) reject the null hypothesis.

- In other words, if the same TA teaches 20 discussion sections for the same course and the null hypothesis is true for every one of them, they should still expect about one of those sections to have a "statistically significantly low" average purely by chance ($20 \cdot 0.05 = 1$).
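
The claim about the cutoff can be checked by simulation: generate many data sets for which the null hypothesis really is true, run the test on each, and count how often the p-value falls below 0.05. The sketch below does this for the fair-coin test with the statistic $|\text{number of heads} - 200|$. The number of trials and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed for reproducibility

cutoff = 0.05
repetitions = 10000
trials = 2000

# Empirical distribution of |heads - 200| under the null (fair coin, 400 flips).
null_stats = np.abs(np.random.multinomial(400, [0.5, 0.5], size=repetitions)[:, 0] - 200)

# "Observed" samples that also come from a fair coin, so the null is true every time.
observed_stats = np.abs(np.random.multinomial(400, [0.5, 0.5], size=trials)[:, 0] - 200)

rejections = 0
for observed in observed_stats:
    p_value = np.count_nonzero(null_stats >= observed) / repetitions
    if p_value < cutoff:
        rejections = rejections + 1

false_rejection_rate = rejections / trials
print(false_rejection_rate)
```

Because the number of heads only takes whole-number values, the rate typically lands near 0.05 rather than exactly at it.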

## Comparing distributions

### Jury Selection in Alameda County

<br>

<center><img src='data/aclu.png' width=500></center>

### Jury panels

$$\text{eligible population} \rightarrow \text{list of eligible jurors} \rightarrow \text{jury panel} \rightarrow \text{actual jury}$$

Section 197 of California's Code of Civil Procedure says, 
> "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."

### ACLU study:
- The ACLU (American Civil Liberties Union) of Northern California studied the ethnic composition of jury panels in 11 felony trials in Alameda County between 2009 and 2010.
    - [Here's a link](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) to the study.
- 1453 people reported for jury duty in total (we will call them "panelists").
- The following DataFrame shows the distribution of ethnicities for both the eligible population and the panelists who were studied.

In [None]:
jury = bpd.DataFrame().assign(
    Ethnicity=['Asian', 'Black', 'Latino', 'White', 'Other'],
    Eligible=[0.15, 0.18, 0.12, 0.54, 0.01],
    Panels=[0.26, 0.08, 0.08, 0.54, 0.04]
)
jury

What do you notice? 🤔

### Are the differences in representation meaningful?
- Model: Panelists were selected at random from the eligible population.
    - Alternative viewpoint: no, they weren't.
- Observation: 1453 panelists and the distribution of their ethnicities.
- Test statistic: ???
    - How do we deal with multiple categories?

In [None]:
jury.plot(kind='barh', x='Ethnicity');

### The distance between two distributions
- Panelists are categorized into one of 5 ethnicities.
- The distribution of ethnicities is **categorical**.
- To see whether the distribution of ethnicities for the panelists is similar to that of the eligible population, we have to measure the distance between two categorical distributions.

### The distance between two distributions
- Let's start by considering the difference in proportions for each category.

In [None]:
with_diffs = jury.assign(Difference=(jury.get('Panels') - jury.get('Eligible')))
with_diffs

- Note that if we sum these differences, the result is 0.
- So that the positive and negative differences don't "cancel", we can take the absolute value of these differences.

In [None]:
with_abs_diffs = with_diffs.assign(AbsoluteDifference=np.abs(with_diffs.get('Difference')))
with_abs_diffs

### Statistic: Total Variation Distance
- The **Total Variation Distance (TVD)** of two categorical distributions is **the sum of the absolute differences of their proportions, all divided by 2**.
    - We divide by 2 so that, for example, the distribution [0.51, 0.49] is 0.01 off from [0.50, 0.50].
    - This way, TVD quantifies the **total overrepresentation** across all categories.
    - It would also be valid not to divide by 2. We just wouldn't call that statistic TVD anymore.

In [None]:
with_abs_diffs

In [None]:
with_abs_diffs.get('AbsoluteDifference').sum() / 2
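
The "total overrepresentation" interpretation can be checked directly: since the differences sum to 0, the positive differences and the negative differences have the same total size, and each equals the TVD. The sketch below verifies this using the proportions from the `jury` DataFrame, re-entered as plain NumPy arrays (with the hypothetical names `eligible_props` and `panel_props`) so that it runs on its own.

```python
import numpy as np

# Proportions copied from the jury DataFrame above.
eligible_props = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panel_props = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

differences = panel_props - eligible_props

# TVD: sum of absolute differences, divided by 2.
tvd = np.abs(differences).sum() / 2

# Total overrepresentation: add up only the positive differences.
overrepresentation = differences[differences > 0].sum()

# The coin example from the text: [0.51, 0.49] versus [0.50, 0.50].
coin_tvd = np.abs(np.array([0.51, 0.49]) - np.array([0.50, 0.50])).sum() / 2

print(tvd, overrepresentation, coin_tvd)
```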

### Discussion Question

What is the TVD between the distributions of class standing in DSC 10 and DSC 40A?

| **Class Standing** | **DSC 10** | **DSC 40A** |
| --- | --- | --- |
| Freshman | 0.45 | 0.15 |
| Sophomore | 0.35 | 0.35 |
| Junior | 0.15 | 0.35 |
| Senior+ | 0.05 | 0.15 |

- A. 0.2
- B. 0.3
- C. 0.5
- D. 0.6
- E. None of the above

### To answer, go to [menti.com](https://menti.com) and enter the code 4235 1662.

### Statistic: Total Variation Distance

In [None]:
def total_variation_distance(dist1, dist2):
    '''Computes the TVD between two categorical distributions, 
       assuming the categories appear in the same order.'''
    return np.abs((dist1 - dist2)).sum() / 2

In [None]:
# Calculate the TVD between the distribution of ethnicities in the eligible population
# and the distribution of ethnicities in the observed panelists

total_variation_distance(jury.get('Eligible'), jury.get('Panels'))

- The closer the TVD is to 0, the closer the two distributions are to one another.
- But is 0.14 a very small value? A typical value? A very large value?

### Simulate drawing jury panels
- Model: Panels are drawn at random from the eligible population.
- Statistic: TVD between the random panel's ethnicity distribution and the eligible population's ethnicity distribution.
- Repeat many times to generate many TVDs, and see where the TVD of the observed panelists lies.

_Note: `np.random.multinomial` creates samples drawn with replacement, even though real jury panels would be drawn without replacement. However, when the sample size is small relative to the population, the resulting distributions will be roughly the same whether we sample with or without replacement._
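
The note above can be checked with a sketch that compares the two sampling schemes. It assumes a hypothetical eligible population of 1,000,000 people split according to the `Eligible` proportions (the real population size is not given here), and it uses NumPy's `Generator.multivariate_hypergeometric` to draw panels without replacement. Neither the population size nor that function appears elsewhere in this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

eligible_props = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
population_size = 1_000_000  # hypothetical; not taken from the ACLU study
population_counts = np.round(eligible_props * population_size).astype(int)
panel_size = 1453
repetitions = 2000

def tvds_from_counts(counts):
    # Convert each row of counts to proportions, then compute its TVD from eligible_props.
    return np.abs(counts / panel_size - eligible_props).sum(axis=1) / 2

# With replacement: multinomial draws using the eligible proportions.
with_replacement = rng.multinomial(panel_size, eligible_props, size=repetitions)

# Without replacement: draws from the finite hypothetical population.
without_replacement = rng.multivariate_hypergeometric(population_counts, panel_size, size=repetitions)

mean_tvd_with = tvds_from_counts(with_replacement).mean()
mean_tvd_without = tvds_from_counts(without_replacement).mean()
print(mean_tvd_with, mean_tvd_without)
```

Under these assumptions the two average TVDs should come out very close, which is why sampling with replacement is a reasonable stand-in here.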

### The simulation

In [None]:
eligible = jury.get('Eligible')
sample_distribution = np.random.multinomial(1453, eligible) / 1453 
sample_distribution

In [None]:
panels_and_sample = jury.assign(RandomSample=sample_distribution) 
panels_and_sample

In [None]:
panels_and_sample.plot(kind='barh', x='Ethnicity');

In [None]:
total_variation_distance(panels_and_sample.get('RandomSample'), eligible)

### Repeating the experiment

In [None]:
tvds = np.array([])
repetitions = 10000
for i in np.arange(repetitions):
    sample_distribution = np.random.multinomial(1453, eligible) / 1453
    new_tvd = total_variation_distance(sample_distribution, eligible)
    tvds = np.append(tvds, new_tvd)

In [None]:
observed_tvd = total_variation_distance(jury.get('Panels'), eligible)

bpd.DataFrame().assign(tvds=tvds).plot(kind='hist', density=True, bins=20, ec='w', figsize=(10, 5))
plt.axvline(observed_tvd, color='red', label='observed statistic')
plt.legend();

### Calculating the p-value

In [None]:
np.count_nonzero(tvds >= observed_tvd) / repetitions

### Are the jury panels representative?
- Likely not! The distributions of ethnicities in our random samples are not like the distribution of ethnicities in our observed panelists.
- This doesn't say *why* the distributions are different!
    - Juries are drawn from voter registration lists and DMV records. Certain populations are less likely to be registered to vote or have a driver's license due to historical biases.
    - The county rarely enacts penalties for those who don't appear for jury duty; certain populations are less likely to be able to miss work to appear for jury duty.
    - [See the report](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) for more reasons.

### Summary of the method

To assess whether a sample was drawn randomly from a known categorical distribution:
- Use TVD as the test statistic because it measures the distance between categorical distributions.
- Sample at random from the population and compute the TVD between the random sample and the population; repeat numerous times.
- Compare:
    - The empirical distribution of simulated TVDs, and
    - The actual TVD from the sample in the study.
- See Question 6 on Homework 5 for an example.
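
The steps above can be collected into a single helper. This is a sketch rather than code from the lecture: the function name `tvd_p_value`, its parameters, and the vectorized use of `np.random.multinomial` with `size` are choices made here for illustration. The jury proportions are re-entered as NumPy arrays so the block runs on its own.

```python
import numpy as np

def tvd_p_value(observed_dist, population_dist, sample_size, repetitions=10000):
    # Observed statistic: TVD between the observed sample and the population.
    observed_tvd = np.abs(observed_dist - population_dist).sum() / 2

    # Simulate under the null: draw random samples from the population distribution.
    # Each row of `samples` is one simulated categorical distribution.
    samples = np.random.multinomial(sample_size, population_dist, size=repetitions) / sample_size
    simulated_tvds = np.abs(samples - population_dist).sum(axis=1) / 2

    # p-value: proportion of simulated TVDs at least as large as the observed TVD.
    return np.count_nonzero(simulated_tvds >= observed_tvd) / repetitions

np.random.seed(0)  # arbitrary seed for reproducibility

eligible_props = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panel_props = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

jury_p_value = tvd_p_value(panel_props, eligible_props, sample_size=1453)
print(jury_p_value)
```

The comparison uses `>=` because larger TVDs point towards the alternative that the sample was not drawn at random from the population.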

## Summary

### Summary

- A hypothesis test helps us decide between two hypotheses – a "null" hypothesis and an "alternative" hypothesis. Framework:
    - Collect some real-world data (e.g. 1453 panelists, students in a real Biology course).
    - Specify a null and alternative hypothesis.
    - Specify a test statistic.
    - Simulate data under the assumption the null hypothesis is true and compute the test statistic on each one. This creates an empirical distribution of the test statistic.
    - Check if the observed statistic is consistent with the empirical distribution of the test statistic.
- To conclude whether an observed statistic is consistent with an empirical distribution of that test statistic, we compute a p-value, which is the probability, under the null hypothesis, that the test statistic is equal to the observed statistic or is even further in the direction of the alternative hypothesis.
    - There are conventional cutoffs for significance. 0.05 is the most common.
- The total variation distance is a test statistic that measures the difference between two categorical distributions.
- **Next time**: How do we test whether two samples are from the same numerical distribution?