In [None]:
# Run this cell to set up packages for lecture.
from lec14_imports import *

# Lecture 14, Part 1: Hypothesis Testing and Total Variation Distance

## DSC 10, Summer 2025

### Agenda

- Test statistics, one-tailed tests, and two-tailed tests (Again)
- Example: Jury selection in Alameda County.
    - Total variation distance.

## Example: My winrate in Dominion

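**One-tailed tests**: Only one direction of deviation from the null hypothesis counts as evidence for the alternative hypothesis (for example, "the coin is biased towards tails").

**Two-tailed tests**: A deviation in either direction counts as evidence for the alternative hypothesis (for example, "the coin is not fair").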

### Stating the hypotheses

- The **hypotheses** describe two views of how our data was generated.
    - Remember, the null hypothesis needs to be a well-defined probability model about how the data was generated, so that we can use it for simulation. 

- One pair of hypotheses is:
    - **Null Hypothesis:** The coin is fair.
    - **Alternative Hypothesis:** The coin is not fair.

- A different pair is:
    - **Null Hypothesis:** The coin is fair.
    - **Alternative Hypothesis:** The coin is biased towards tails.

### Scenario

I've been playing the card game Dominion with two of my friends. If all three of us have the same skill level, then I should win about 1/3 (~33.33%) of the games.

**Null Hypothesis:** ????

**Alternative Hypothesis:** ????

**Null Hypothesis:** ????

**Alternative Hypothesis:** ????
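
As a rough sketch of how a simulation for this scenario could look, assume the null model is that each game is won with probability 1/3, independently of the other games. The observed count of 45 wins is a made-up placeholder, not real data, and the names `simulate_wins`, `observed_wins`, `one_tailed_p`, and `two_tailed_p` are hypothetical. The one-tailed version only treats unusually many wins as evidence against the null; the two-tailed version treats unusually many or unusually few wins as evidence.

```python
import numpy as np

rng = np.random.default_rng(10)  # seeded only so the sketch is reproducible

n_games = 100
p_win = 1 / 3        # assumed null model: all three players are equally skilled
observed_wins = 45   # hypothetical observation, made up for illustration

def simulate_wins(n_games, p_win, rng):
    """Number of wins in n_games, each won independently with probability p_win."""
    outcomes = rng.multinomial(n_games, [p_win, 1 - p_win])
    return outcomes[0]

repetitions = 10_000
results = np.array([simulate_wins(n_games, p_win, rng) for _ in range(repetitions)])

# One-tailed: only unusually MANY wins count as evidence for the alternative.
one_tailed_p = np.count_nonzero(results >= observed_wins) / repetitions

# Two-tailed: unusually many OR unusually few wins count as evidence.
expected_wins = n_games * p_win
observed_distance = abs(observed_wins - expected_wins)
two_tailed_p = np.count_nonzero(
    np.abs(results - expected_wins) >= observed_distance
) / repetitions
```

For the same observation, the two-tailed p-value is never smaller than the one-tailed one, because every simulation that is extreme in the one-tailed sense is also extreme in the two-tailed sense.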

In [1]:
## Generate how many wins I would get under my model if I play 100 games of Dominion. 


In [2]:
## Simulate under the null hypothesis

In [None]:
# Bins cover win counts around the expected 100/3 ≈ 33 wins out of 100 games.
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(10, 60, 2), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical distribution of my winrate');
plt.legend();

## Comparing distributions

### Jury selection in Alameda County

<br>

<center><img src='images/aclu.png' width=500></center>

### Jury panels

Recall the jury panel selection process:

<center>$\substack{\text{eligible} \\ \text{population}}
\xrightarrow{\substack{\text{representative} \\ \text{sample}}} 
\substack{\text{jury} \\ \text{panel}}
\xrightarrow{\substack{\text{selection by} \\ \text{judge/attorneys}}} 
\substack{\text{actual} \\ \text{jury}}$</center>

Section 197 of California's Code of Civil Procedure says, 
> "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."

### Racial and Ethnic Disparities in Alameda County Jury Pools

- The American Civil Liberties Union (ACLU) of Northern California [studied](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) the ethnic composition of jury panels in 11 felony trials in Alameda County between 2009 and 2010.

- 1453 people reported for jury duty in total (we will call them "panelists").

- The following DataFrame shows the distribution of ethnicities for both the eligible population and the panelists who were studied.

In [None]:
jury = bpd.DataFrame().assign(
    Ethnicity=['Asian', 'Black', 'Latino', 'White', 'Other'],
    Eligible=[0.15, 0.18, 0.12, 0.54, 0.01],
    Panels=[0.26, 0.08, 0.08, 0.54, 0.04]
)
jury

What do you notice? 👀

### Are the differences in representation meaningful?

- **Null Hypothesis:** Panelists were selected at random from the eligible population.

- **Alternative Hypothesis:** Panelists were _not_ selected at random from the eligible population.

- Observation: 1453 panelists and the distribution of their ethnicities.

- Test statistic: ???
    - How do we deal with multiple categories?

In [None]:
jury.plot(kind='barh', x='Ethnicity', figsize=(10, 5));

### The distance between two distributions

- Panelists are categorized into one of 5 ethnicities. In other words, ethnicity is a **categorical** variable.

- To see whether the distribution of ethnicities for the panelists is similar to that of the eligible population, we have to measure the distance between two categorical distributions.
    - We've done this for distributions with just two categories – heads and tails, for instance – but not when there are more than two categories.

### The distance between two distributions

- Let's start by considering the difference in proportions for each category.

In [None]:
with_diffs = jury.assign(Difference=(jury.get('Panels') - jury.get('Eligible')))
with_diffs

- Note that if we sum these differences, the result is 0 (you'll see the proof in DSC 40A).
- To avoid cancellation of positive and negative differences, we can take the absolute value of these differences.
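
A short derivation of why the differences cancel, using notation introduced here for illustration: write $p_i$ for the proportion of panelists in category $i$ and $q_i$ for the proportion of the eligible population in category $i$. Each set of proportions is a distribution, so each sums to 1:

$$\sum_i (p_i - q_i) = \sum_i p_i - \sum_i q_i = 1 - 1 = 0$$

Taking absolute values before summing prevents this cancellation.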

In [None]:
with_abs_diffs = with_diffs.assign(AbsoluteDifference=np.abs(with_diffs.get('Difference')))
with_abs_diffs

### Statistic: Total Variation Distance

The **Total Variation Distance (TVD)** of two categorical distributions is **the sum of the absolute differences of their proportions, all divided by 2**.

- We divide by 2 so that, for example, the distribution [0.51, 0.49] is 0.01 away from [0.50, 0.50].

- It would also be valid not to divide by 2. We just wouldn't call that statistic TVD anymore.
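
In symbols (the notation $p$ and $q$ is introduced here for illustration and is not from the lecture), for two distributions $p$ and $q$ over the same categories:

$$\text{TVD}(p, q) = \frac{1}{2} \sum_i |p_i - q_i|$$

For the two-category example above:

$$\frac{1}{2}\left(|0.51 - 0.50| + |0.49 - 0.50|\right) = \frac{1}{2}(0.01 + 0.01) = 0.01$$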

In [None]:
with_abs_diffs

In [None]:
with_abs_diffs.get('AbsoluteDifference').sum() / 2

In [None]:
def total_variation_distance(dist1, dist2):
    '''Computes the TVD between two categorical distributions, 
       assuming the categories appear in the same order.'''
    return np.abs((dist1 - dist2)).sum() / 2

jury.plot(kind='barh', x='Ethnicity', figsize=(10, 5))
plt.annotate('If you add up the total amount by which the blue bars\n are longer than the red bars, you get 0.14.', (0.08, 3.9), bbox=dict(boxstyle="larrow,pad=0.3", fc="#e5e5e5", ec="black", lw=2));
plt.annotate('If you add up the total amount by which the red bars\n are longer than the blue bars, you also get 0.14!', (0.23, 0.9), bbox=dict(boxstyle="larrow,pad=0.3", fc="#e5e5e5", ec="black", lw=2));

- TVD quantifies the **total overrepresentation** across all categories. 
    - Equivalently, it also quantifies total underrepresentation across all categories.
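
The equivalence can be checked numerically. The sketch below uses plain NumPy arrays holding the same proportions as the `jury` DataFrame; the names `over` and `under` are hypothetical.

```python
import numpy as np

# Same proportions as the Eligible and Panels columns of `jury`.
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panels = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

diffs = panels - eligible

over = diffs[diffs > 0].sum()     # total overrepresentation on the panels
under = -diffs[diffs < 0].sum()   # total underrepresentation on the panels
tvd = np.abs(diffs).sum() / 2     # total variation distance

print(over, under, tvd)
```

All three values agree (up to floating-point rounding) because the positive and negative differences must cancel, so each group accounts for exactly half of the sum of absolute differences.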

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

What is the TVD between the distributions of class standing in DSC 10 and DSC 40A?

| **Class Standing** | **DSC 10** | **DSC 40A** |
| --- | --- | --- |
| Freshman | 0.45 | 0.15 |
| Sophomore | 0.35 | 0.35 |
| Junior | 0.15 | 0.35 |
| Senior+ | 0.05 | 0.15 |

- A. 0.2
- B. 0.3
- C. 0.5
- D. 0.6
- E. None of the above

### Simulate drawing jury panels

- Model: Panels are drawn at random from the eligible population.

- Statistic: TVD between the random panel's ethnicity distribution and the eligible population's ethnicity distribution.

- Repeat many times to generate many TVDs, and see where the TVD of the observed panelists lies.

_Note_: `np.random.multinomial` creates samples drawn with replacement, even though real jury panels would be drawn without replacement. However, when the sample size (1453) is small relative to the population (number of people in Alameda County), the resulting distributions will be roughly the same whether we sample with or without replacement.
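
One way to sanity-check this claim is to compare the two sampling schemes directly. The sketch below assumes a hypothetical eligible population of 1,000,000 people (a placeholder, not the real figure for Alameda County) and uses NumPy's `Generator.multivariate_hypergeometric` to draw panels without replacement. The helper `tvd_rows` is a name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(14)  # seeded only so the sketch is reproducible

proportions = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
population_size = 1_000_000  # hypothetical, NOT the real county figure
counts = np.round(proportions * population_size).astype(int)  # people per category
panel_size = 1453
repetitions = 2_000

def tvd_rows(samples, target):
    """TVD between each row of `samples` (proportions) and `target`."""
    return np.abs(samples - target).sum(axis=1) / 2

# With replacement: each row is one simulated panel's ethnicity proportions.
with_repl = rng.multinomial(panel_size, proportions, size=repetitions) / panel_size

# Without replacement: draw panel_size people from the finite population.
without_repl = rng.multivariate_hypergeometric(
    counts, panel_size, size=repetitions
) / panel_size

mean_with = tvd_rows(with_repl, proportions).mean()
mean_without = tvd_rows(without_repl, proportions).mean()
print(mean_with, mean_without)
```

Under these assumptions the two average TVDs should be nearly identical, which is why sampling with replacement is a reasonable stand-in here.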

### The simulation

In [None]:
eligible = jury.get('Eligible')
sample_distribution = np.random.multinomial(1453, eligible) / 1453 
sample_distribution

In [None]:
total_variation_distance(sample_distribution, eligible)

### Repeating the experiment

In [None]:
tvds = np.array([])
repetitions = 10000
for i in np.arange(repetitions):
    sample_distribution = np.random.multinomial(1453, eligible) / 1453
    new_tvd = total_variation_distance(sample_distribution, eligible)
    tvds = np.append(tvds, new_tvd)

In [None]:
observed_tvd = total_variation_distance(jury.get('Panels'), eligible)

bpd.DataFrame().assign(tvds=tvds).plot(kind='hist', density=True, bins=20, ec='w', figsize=(10, 5),
                                      title='Empirical Distribution of TVD Between Eligible Population and Random Sample')
plt.axvline(observed_tvd, color='black', linewidth=4, label='Observed Statistic')
plt.legend();

### Calculating the p-value

In [None]:
np.count_nonzero(tvds >= observed_tvd) / repetitions

- Random samples from the eligible population are typically **much more similar** to the eligible population than our observed data. 
- We see this in the empirical distribution, which consists of **small TVDs** (much smaller than our observed TVD).

In [None]:
jury.assign(RandomSample=sample_distribution).plot(kind='barh', x='Ethnicity', figsize=(10, 5),
                                                   title = "A Random Sample is Usually Similar to the Eligible Population");

### Are the jury panels representative?

- Likely not! The distributions of ethnicities in our random samples are not like the distribution of ethnicities in our observed panelists.

- This doesn't say *why* the distributions are different!
    - Jury panels are drawn from voter registration lists and DMV records. Certain populations are less likely to be registered to vote or have a driver's license due to historical biases.
    - The county rarely enacts penalties for those who don't appear for jury duty; certain populations are less likely to be able to miss work to appear for jury duty.
    - [See the report](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) for more reasons.

## Summary, next time

### The hypothesis testing "recipe"

1. **State hypotheses**: State the null and alternative hypotheses. We must be able to simulate data under the null hypothesis.
1. **Choose test statistic**: Choose something that allows you to distinguish between the two hypotheses based on whether its value is high or low.
1. **Simulate**: Draw samples under the null hypothesis, and calculate the test statistic on each one.
1. **Visualize**: Plot the simulated values of the test statistic in a histogram, and compare this to the observed statistic (black line).
1. **Calculate p-value**: Find the proportion of simulations for which the test statistic was at least as extreme as the one observed.
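
The recipe can be sketched as a single function for the case where the test statistic is the TVD. This is an illustration rather than the lecture's own code: the function name `tvd_p_value` and its parameters are hypothetical, and it replaces the `for` loop with the `size` argument of `multinomial`, which draws all simulated samples at once. The visualization step is left out.

```python
import numpy as np

def tvd_p_value(observed, null_dist, sample_size, repetitions=10_000, seed=None):
    """Simulated p-value for a TVD test against a categorical null distribution."""
    rng = np.random.default_rng(seed)
    null_dist = np.asarray(null_dist, dtype=float)
    observed = np.asarray(observed, dtype=float)

    # Simulate: each row holds one simulated sample's category proportions.
    samples = rng.multinomial(sample_size, null_dist, size=repetitions) / sample_size

    # Test statistic for every simulated sample, and for the observed data.
    simulated = np.abs(samples - null_dist).sum(axis=1) / 2
    observed_stat = np.abs(observed - null_dist).sum() / 2

    # p-value: proportion of simulated statistics at least as large as observed.
    p_value = np.count_nonzero(simulated >= observed_stat) / repetitions
    return observed_stat, simulated, p_value

# Usage with the jury proportions from this lecture.
observed_stat, simulated, p_value = tvd_p_value(
    observed=[0.26, 0.08, 0.08, 0.54, 0.04],
    null_dist=[0.15, 0.18, 0.12, 0.54, 0.01],
    sample_size=1453,
    seed=10,
)
print(observed_stat, p_value)
```

Larger TVDs are the "extreme" direction here, so the p-value counts simulated statistics that are greater than or equal to the observed one.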

### Why does it matter?

- Hypothesis testing is used all the time to make decisions in science and business.
- Hypothesis testing quantifies how "weird" a result is.
    - Instead of saying, "I think that's unusual," people say, "This has a p-value of 0.001."
- Through simulation, we have the tools to estimate how _likely_ something is, without needing to know a ton about probability theory.