
# Hypothesis Testing

Now armed with a solid understanding of populations and samples, we can start to question beliefs *about* populations and samples.

Here's a basic motivating example: You flip a coin 10 times and only observe 3 Heads -- do you still believe the coin is fair? What if you flip it 100 times and observe 30 Heads? Or 1000 flips and 300 Heads? At what point is our observation *so weird* that we start to question our assumptions about the population these coin flips are coming from?

In this chapter we'll introduce a framework to help us quantify how 'weird' an observation is, and help us decide if we should stick with or abandon our current set of beliefs about a population.

However, as useful as coins are for teaching examples, we're not in the business of checking the fairness of coins. Data and statistics can be used to check assumptions in other, extraordinarily impactful fields, too...

## Swain v. Alabama

In the local court of Talladega County, 1964, a black man by the name of Robert Swain was convicted by an all-white jury and sentenced to death. Swain appealed his sentence, arguing that the jury was not representative of his community and revealed a biased selection process. The case was soon elevated to the United States Supreme Court as *Swain v. Alabama*.

Under the U.S. Constitution, all people accused of a crime have the right to a fair jury, with a well-defined *fair* selection process. First, a pool of potential jurors is to be drawn representatively from the local population of eligible jurors -- as we know by now, "representative" implies *random*. After the initial, fair pool is drawn, the final panel of jurors is chosen, non-randomly, with direct intervention from the involved parties.

At the time, the population of eligible jurors in Talladega County was about $26\%$ black, yet in the initial pool of $100$ potential jurors, only $8$ selected were black -- whereas we would have expected that roughly $26$ members of the pool would be black. Such a discrepancy between the *observed* and the *expected* number of black candidates naturally gives rise to the question: *isn't that strange?*

### A need to quantify unlikely events

Unfortunately, we haven't yet learned a way to definitively state how 'strange' this observation was. We might intuitively know that it's unlikely -- but *how unlikely*? In the court, that judgement was left to each person's own beliefs about how random chance works. In the end, the Supreme Court decided that the disparity wasn't too unlikely, and didn't support the claim that there was a biased jury selection process. The claim was denied, and Swain's appeal along with it.

Sure, we wouldn't expect to randomly select exactly twenty-six black members for the pool every single time, but *eight*? We may start to hypothesize that the sample was not in fact drawn from the entire population -- perhaps due to poor, or unjust sampling practices, the pool was selected from a subset of the population.

## Competing hypotheses about the population

On one hand, we have an underlying assumption that the sample was drawn from the total population of Talladega County, where the proportion of eligible jurors who were black was $26\%$. On the other hand, we have a suspicion that the sampling might be biased, and that the jury pool was in fact drawn from a population with an artificially lower proportion of black members, below $26\%$.

These beliefs can be thought of as two competing *hypotheses* about the population which our data was generated from.

### The null hypothesis
In the language of statistics, the first hypothesis is considered the {dterm}`null hypothesis` ($H_0$). The 'null' is a hypothesis about some population parameter which our initial assumptions and models are built on. The null is assumed to be true at the outset of our experiment, but it is our job to be critical of the null hypothesis, and ask whether or not we should stick with our assumptions, or find new ones.

When we were presented with the problem, we operated under the *model* that our sample was generated randomly from a population with the proportion of black members equal to $26\%$. These assumptions lead us to define our null hypothesis about the sampled population as:

$$
H_0: P(\text{juror is black}) = 0.26
$$

Under this hypothesis, any difference between our observed sample statistic and the assumed population parameter is due to nothing other than random chance in our sampling.

This is the official proposal put forth by the Supreme Court -- that we collected a sample of $100$ individuals from a population that was $26\%$ black, and managed to only observe $8$ black members *purely by random chance*.

### The alternate hypothesis
Since we're critical of the null hypothesis, we propose an {dterm}`alternate hypothesis` ($H_1$) which contradicts the null. The alternate hypothesis is essentially the set of beliefs we'll choose to adopt if it appears that our null hypothesis is too unlikely to be true. 

The alternate hypothesis proposes a model where differences between the observed sample statistic and the assumed population parameter are due to some reason *other than randomness*.

In examining Swain v. Alabama, we contradict the null by proposing that the jury pool was sampled from a population in which the proportion of black members was artificially lower than $26\%$:

$$
H_1: P(\text{juror is black}) < 0.26
$$

If it appears that random chance alone couldn't have produced the observed number of black jury pool members that we saw in the case, then we'll move to this belief -- that we managed to observe only $8$ black members because the population we sampled from was less than fair.

## Testing our assumptions

As data scientists, we don't just tell the ~myths~ hypotheses, we put them to the test!

Given an observed sample statistic -- like 8 black members out of 100 -- and a potential population the sample came from -- like a population that is 26% black -- we can test whether or not the sample likely came from that population. Here's the general framework:

1. By conducting a {dterm}`hypothesis test`, we can decide whether or not to abandon the beliefs of our null hypothesis. Simply, this is done by calculating how unlikely our *observed sample statistic* is under the assumptions of the null.

Note the significance of the null hypothesis in this context -- it provides a model *under which we can generate data*.

2. Under the null hypothesis, we can use our simulation know-how to create new random samples with the same conditions that the original jury pool supposedly followed.

3. Using our probability know-how, we estimate the probability that this null model could produce a sample statistic as 'strange' as the one we originally observed.

4. If it's too unlikely, we reject the null hypothesis and move our beliefs over to the alternate hypothesis. Otherwise, if it's decently likely then we have no reason to abandon the assumptions of the null.
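
The steps above can be sketched in code for the jury pool. This is only a rough sketch under stated assumptions, not an official implementation: the function name `simulate_black_jurors`, the fixed random seed, and the choice of 10,000 repetitions are all made up for illustration. The pool size, the $26\%$ proportion, and the observed count of $8$ come from the case description.

```python
import numpy as np

np.random.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

POOL_SIZE = 100         # size of the initial jury pool
NULL_PROPORTION = 0.26  # proportion of black eligible jurors under the null
OBSERVED = 8            # black members actually observed in the pool

def simulate_black_jurors():
    """Draw one jury pool under the null and count its black members."""
    pool = np.random.choice(
        ['black', 'not black'], size=POOL_SIZE, replace=True,
        p=[NULL_PROPORTION, 1 - NULL_PROPORTION]
    )
    return (pool == 'black').sum()

# Step 2: generate many jury pools under the null hypothesis
simulated_counts = np.array([simulate_black_jurors() for _ in range(10_000)])

# Step 3: estimate how often the null model produces a count
# at least as low as the one observed
estimated_probability = (simulated_counts <= OBSERVED).mean()
print(estimated_probability)
```

Step 4 then amounts to comparing `estimated_probability` against whatever threshold we consider "too unlikely".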

### Generating data under the null

- it *could have happened*, but it's *so unlikely* to happen that we choose to believe there's something besides random chance involved
- that *so unlikely* threshold is the level of significance

- we don't have access to the population
- need a way to see which hypothesis we should embrace
- not necessarily which one is true, just which we'll choose to believe as the result of our experiment

- [HT procedures]
    - one-sided? two-sided?
    - introduction of a 'test statistic'
    - basically, comparing the sampling distribution of the test statistic from the hypothesized (null) population to the observed statistic from our sample, which *may perhaps* suggest that our population is potentially-different (alternate) from what we thought
    - generalize the procedures at the end

We define a sample statistic as 'the number of Heads received in 100 coin flips' and simulate the sampling distribution under the assumption that our coin is fair. 


Among the Supreme Court's official statements on *Swain v. Alabama*, which you can find [here](https://tile.loc.gov/storage-services/service/ll/usrep/usrep380/usrep380202/usrep380202.pdf) if you are so inclined, the following are notable:
> "The overall percentage disparity has been small, and reflects no studied attempt to include or exclude a specified number of Negroes." (p. 209)

> "Even if a State's systematic striking of Negroes in selecting trial juries raises a prima facie case of discrimination under the Fourteenth Amendment the record here is insufficient to establish such systematic striking in the county." (p. 

```python
import babypandas as bpd
import numpy as np
```

```python
# Define a function for a single trial
def run_sample_statistic():

    # Sample (with replacement) from the probability distribution of a fair coin
    flips = np.random.choice(['Heads', 'Tails'], replace=True, size=100)

    # And compute the statistic: number of Heads
    statistic = (flips == 'Heads').sum()

    return statistic
```

```python
# Conduct an experiment, keeping track of the sampling distribution
statistics = []

for i in range(10_000):
    statistics.append(run_sample_statistic())

num_heads = np.array(statistics)

# And plot the sampling distribution
bpd.Series(data=num_heads).plot(
    kind='hist', bins=np.arange(25, 76), density=True,
    title='Sampling Distribution of # Heads in 100 Fair Coin Flips'
)
```

```python
# Proportion of simulated experiments with exactly 50 Heads
(num_heads == 50).mean()
```
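
To connect this back to the motivating question from the start of the chapter, the same kind of simulation can estimate how often a fair coin gives 30 or fewer Heads in 100 flips. This is a sketch that assumes we only care about the lower tail (a one-sided comparison). The names `observed_heads`, `simulated_heads`, and `tail_proportion` and the fixed seed are illustrative choices, not part of the code above.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only so the sketch is reproducible

observed_heads = 30  # the observation from the motivating example

# Each row is one experiment of 100 fair coin flips; True means Heads
flips = np.random.choice([True, False], size=(10_000, 100))
simulated_heads = flips.sum(axis=1)

# Proportion of simulated experiments with at most as many Heads as observed
tail_proportion = (simulated_heads <= observed_heads).mean()
print(tail_proportion)
```

A very small `tail_proportion` would suggest that 30 Heads in 100 flips is hard to explain by random chance alone if the coin is fair.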

## The birthday problem revisited

We're now ready to return to an earlier assumption -- *are all birthdays in the U.S. actually equally likely?* Again, we ignore leap years for simplicity.

- to put this in terms of populations, we're hypothesizing that the true population distribution of all birthdays in the United States is *uniform*.
- ![hypothesized population distribution]
- but, we don't have data from the entire U.S. populace. Time to look at our sample
    - maybe some issues with the sample -- only recentish years, isn't random like what we're used to...
- the sample looks decidedly not uniform
    - point out some of the interesting bits


- two hypotheses:
    - 0: Population is uniform
    - 1: Population is not uniform
- need a way to capture the difference of a distribution

### A new test statistic: *total variation distance*

- [rewrite hypotheses with tvd]
- [conduct HT]
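
As a rough sketch of what this test statistic computes: the total variation distance between two categorical distributions is half the sum of the absolute differences between their proportions. The function name `total_variation_distance` and the two four-category distributions below are hypothetical, chosen only to illustrate the calculation. They are not the birthday data.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two categorical distributions."""
    dist1 = np.asarray(dist1, dtype=float)
    dist2 = np.asarray(dist2, dtype=float)
    return np.abs(dist1 - dist2).sum() / 2

# Hypothetical example: a uniform distribution over 4 categories
# compared with a made-up, non-uniform one
uniform = np.array([0.25, 0.25, 0.25, 0.25])
made_up = np.array([0.40, 0.30, 0.20, 0.10])

tvd = total_variation_distance(uniform, made_up)
print(tvd)
```

Two identical distributions have a distance of 0, and the distance grows as the distributions move apart, which is what makes it usable as a measure of how far a sample is from a hypothesized population.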

## Uncovering racial bias with the total variation distance

- the hypothesis testing framework works with *any test statistic*, and the total variation distance is a great test statistic for categorical distributions 
- [jury selection]

- remember: poor sampling techniques essentially alter the population which the data represents -- so we're curious if the 'population' of selected jurors actually matches the population of the U.S., or if their selection process artificially excluded certain members from being part of the population

---
# Staging area