In [None]:
import pandas as pd
import numpy as np
import os

import seaborn as sns
import plotly.express as px
pd.options.plotting.backend = 'plotly'

from ipywidgets import interact

# Lecture 9 – Hypothesis Testing

## DSC 80, Spring 2023

### Agenda

We'll look at many examples, and cover the necessary theory along the way.

- Coin flipping.
- Total variation distance.
- Penguin bill lengths 🐧.

### "Standard" hypothesis testing

"Standard" hypothesis testing helps us answer questions of the form:

> I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

- Sample: 59 heads and 41 tails. Population: A fair coin.

- Sample: Ethnic distribution of UCSD. Population: Ethnic distribution of California. (Comparing categorical distributions with the TVD.)

- Sample: Sample of Torgersen Island penguins. Population: All 333 penguins.

## Example: Coin flipping

### Recap: Coin flipping

Let's recap the example we saw last time.

- **Observation**: We flipped a coin 100 times, and saw 59 heads and 41 tails.

- **Null Hypothesis**: The coin is fair.

- **Alternative Hypothesis**: The coin is biased in favor of heads.

- **Test Statistic**: Number of heads, $N_H$.

### Generating the null distribution

- Now that we've chosen a test statistic, we need to generate the distribution of the test statistic under the assumption the null hypothesis is true, i.e. the **null distribution**.

- This distribution will give us, for instance:
    - The probability of seeing 4 heads in 100 flips of a fair coin.
    - The probability of seeing at most 46 heads in 100 flips of a fair coin.
    - **The probability of seeing at least 59 heads in 100 flips of a fair coin.**

### Generating the null distribution, using math

The number of heads in 100 flips of a fair coin follows the $\text{Binomial(100, 0.5)}$ distribution, in which

$$P(\text{# heads} = k) = {100 \choose k} (0.5)^k{(1-0.5)^{100-k}} = {100 \choose k} 0.5^{100}$$

In [None]:
from scipy.special import comb

def p_k_heads(k):
    return comb(100, k) * (0.5) ** 100

The probability that we see at least 59 heads is then:

In [None]:
sum([p_k_heads(k) for k in range(59, 101)])

Let's look at this distribution visually.

In [None]:
plot_df = pd.DataFrame().assign(k = range(101))
plot_df['p_k'] = p_k_heads(plot_df['k'])
plot_df['color'] = plot_df['k'].apply(lambda k: 'orange' if k >= 59 else 'blue')

fig = plot_df.plot(kind='bar', x='k', y='p_k', color='color', width=1000)
fig.add_annotation(text='This red area is called the p-value!', x=77, y=0.008, showarrow=False)

### Making a decision

We saw that, in 100 flips of a fair coin, $P(\text{# heads} \geq 59)$  is only ~4.4%.

- This is quite low – it suggests that our observed result is quite unlikely **under** the null.

- As such, we will **reject** the null hypothesis – our observation is **not consistent** with the hypothesis that the coin is fair.

- The null still may be true – it's possible that the coin we flipped was fair, and we just happened to see a rare result. For the same reason, we also **cannot "accept"** the alternative.

- This probability – **the probability of seeing a result at least as extreme as the observed, under the null hypothesis** – is called the p-value.
    - If the p-value is below a pre-defined cutoff (often 5%), we reject the null.
    - Otherwise, we fail to reject the null.

### ⚠️ We can't "accept" the null!

- Note that we are very careful in saying that we either **reject the null** or **fail to reject the null**.

- Just because we fail to reject the null, it doesn't mean the null is true – we cannot "accept" it.

- Example:
    - Suppose there is a coin that is truly biased towards heads, landing heads with probability 0.55.
    - We flip it 10 times and see 5 heads and 5 tails.
    - If we conduct a hypothesis test where the null is that the coin is fair, we will fail to reject the null.
    - But the null isn't true.
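
To make the example above concrete, here is a small sketch (standard library only) that computes the exact p-value for seeing 5 heads in 10 flips. It assumes the number of heads as the test statistic and "biased towards heads" as the alternative. The names `upper_tail_prob` and `p_value_10_flips` are illustrative and not from the lecture.

```python
from math import comb

def upper_tail_prob(n, k, p=0.5):
    '''P(# heads >= k) in n flips of a coin with P(heads) = p.'''
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

# p-value for 5 heads in 10 flips, under the null that the coin is fair.
p_value_10_flips = upper_tail_prob(10, 5)
p_value_10_flips
```

The result is roughly 0.62, far above a 0.05 cutoff, so we fail to reject the null even though the coin in this scenario is biased.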

### Generating the null distribution, using simulation

- In the most recent example, we computed the **true probability distribution** of the test statistic under the null hypothesis.

- We could only do this because we know that the number of heads in $N$ flips of a fair coin follows the $\text{Binomial}(N, 0.5)$ distribution.

- Often, we'll pick test statistics for which we don't know the true probability distribution. In such cases, we'll have to **simulate, as we did in DSC 10**.

- Simulations provide us with **empirical distributions of test statistics**; if we simulate with a large (>= 10,000) number of repetitions, the empirical distribution of the test statistic should look similar to the true probability distribution of the test statistic.

### Generating the null distribution, using simulation

First, let's figure out how to perform one instance of the experiment – that is, how to flip a coin 100 times. Recall, to sample from a categorical distribution, we use `np.random.multinomial`.

In [None]:
# Flipping a fair coin 100 times.
# Interpret the result as [Heads, Tails].
np.random.multinomial(100, [0.5, 0.5])

Then, we can repeat it a large number of times.

In [None]:
# 100,000 times, we want to flip a coin 100 times.
results = []

for _ in range(100_000):
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)

Each entry in `results` is the number of heads in 100 simulated coin flips.

In [None]:
results[:10]

### Visualizing the empirical distribution of the test statistic

In [None]:
fig = px.histogram(pd.DataFrame(results), x=0, nbins=50, histnorm='probability', 
                   title='Empirical Distribution of the Number of Heads in 100 Flips of a Fair Coin')
fig.add_vline(x=59, line_color='red')
fig.update_layout(xaxis_range=[0, 100])

Again, we can compute the p-value, which is the **probability of seeing a result as or more extreme than the observed, under the null**.

In [None]:
(np.array(results) >= 59).mean()

Note that this number is close, but not identical, to the true p-value we found before. That's because we computed this p-value from a simulation, so it is only an approximation of the true p-value.
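
As a rough guide to how close the simulated p-value should be: each repetition either lands in the tail or it doesn't, so the simulated p-value is a sample proportion, and its standard deviation is about $\sqrt{p(1-p)/\text{repetitions}}$. The sketch below plugs in the exact p-value from before (approximately 0.0443). The variable names are illustrative.

```python
import numpy as np

true_p = 0.0443     # Exact p-value computed earlier, rounded.
num_reps = 100_000  # Number of simulated experiments.

# Standard deviation of a sample proportion based on num_reps independent repetitions.
simulation_sd = np.sqrt(true_p * (1 - true_p) / num_reps)
simulation_sd
```

This works out to well under 0.001, so with 100,000 repetitions the simulated p-value should rarely differ from the exact one by more than about 0.002.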

## Reflection

### Can we make things faster? 🏃

A mantra so far in this course has been **avoid `for`-loops whenever possible**. That applies here, too.

`np.random.multinomial` and `np.random.choice` both accept a `size` argument. By providing `size=100_000`, we can tell `numpy` to draw 100 elements from a uniform distribution, `100_000` times, **without needing a `for`-loop!**

In [None]:
# An array with 100000 rows and 2 columns.
np.random.multinomial(100, [0.5, 0.5], size=100_000)

In [None]:
# Just the first column of the above array. Note the iloc-like syntax.
np.random.multinomial(100, [0.5, 0.5], size=100_000)[:, 0]

In [None]:
%%time

faster_results = np.random.multinomial(100, [0.5, 0.5], size=100_000)[:, 0]

The above approach is orders of magnitude faster than the `for`-loop approach! With that said, you are still _allowed_ to use `for`-loops for hypothesis (and permutation) tests on assignments.

In [None]:
%%time

# 100,000 times, we want to flip a coin 100 times.
results = []

for _ in range(100_000):
    num_heads = np.random.multinomial(100, [0.5, 0.5])[0]
    results.append(num_heads)
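
Since the test statistic here is just the number of heads, another vectorized option is to draw it directly from a binomial distribution. This is a sketch using NumPy's `Generator` API (`np.random.default_rng`), which the lecture code does not use. The seed value and the name `binomial_results` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(80)  # Seeded so the results are reproducible.

# Each entry is the number of heads in 100 flips of a fair coin.
binomial_results = rng.binomial(100, 0.5, size=100_000)

# Simulated p-value: proportion of repetitions with at least 59 heads.
(binomial_results >= 59).mean()
```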

### Choosing alternative hypotheses and test statistics

- The alternative hypothesis we chose was **the coin is biased in favor of heads**, and the test statistic we chose was the number of heads, $N_H$.

- We could've also chosen one of the following options; each of them has the quality that **large values point to one hypothesis, and small values point to the other**:
    - $\frac{N_H}{100}$, the proportion of heads.
    - $N_H - 50$, the difference from the expected number of heads.

- What if our alternative hypothesis was **the coin is biased**?

### Absolute test statistics

For the alternative hypothesis "the coin is biased", one test statistic we could use is $|N_H - \frac{N}{2}|$, the absolute difference from the expected number of heads.

- **If this test statistic is large, it means that there were many more heads than expected, or many fewer heads than expected. If this test statistic is small, it means that the number of heads was close to expected.**

- For instance, suppose we flip 100 coins, and I tell you the absolute difference from the expected number of heads is 20.

- Then, either we flipped 70 heads or 30 heads. 

- If our alternative hypothesis is that the coin was biased, then it doesn't matter in which direction it was biased, and this test statistic works.


- But if our alternative hypothesis is that the coin was biased towards heads, then this is not helpful, because we don't know whether there were 70 heads (evidence for the alternative) or 30 heads (not evidence for the alternative).
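
To see how the choice of test statistic changes the answer, this sketch computes exact p-values for the observed 59 heads under both statistics, using only the standard library. `fair_coin_pmf` is a hypothetical helper name.

```python
from math import comb

def fair_coin_pmf(k, n=100):
    '''P(# heads = k) in n flips of a fair coin.'''
    return comb(n, k) * 0.5 ** n

observed_heads = 59

# Alternative "biased towards heads": statistic N_H, large values favor the alternative.
p_one_sided = sum(fair_coin_pmf(k) for k in range(observed_heads, 101))

# Alternative "biased": statistic |N_H - 50|, so both 59+ heads and 41- heads count as extreme.
observed_abs_diff = abs(observed_heads - 50)
p_absolute = sum(fair_coin_pmf(k) for k in range(101) if abs(k - 50) >= observed_abs_diff)

p_one_sided, p_absolute
```

The second p-value is twice the first, because the fair-coin distribution is symmetric around 50. At a 0.05 cutoff, the same 59 heads would lead us to reject the null in favor of "biased towards heads", but not in favor of "biased".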

### Important

We'd like to choose a test statistic such that large values of the test statistic correspond to one hypothesis, and small values correspond to the other. 

**In other words, we'll try to avoid "two-tailed tests".** Rough rule of thumb:

- If the alternative hypothesis is "A > B", then the test statistic should measure differences and **should not** contain an absolute value.

- If the alternative hypothesis is "A and B are different", then the test statistic should measure distances and **should** contain an absolute value.

### Fun fact

- One researcher found that coin flips aren't 50/50, but rather are closer to 51/49, biased towards whichever side started facing up.
- [Read this](https://www.smithsonianmag.com/science-nature/gamblers-take-note-the-odds-in-a-coin-flip-arent-quite-5050-145465423) for more details.

## Example: Total variation distance

### Ethnic distribution of California vs. UCSD

The DataFrame below contains the ethnic breakdown of the state as a whole ([source](https://www.ppic.org/publication/californias-population/)) and UCSD as of 2016 ([source](https://ir.ucsd.edu/_files/stats-data/enrollment/ugethnic.pdf)).

In [None]:
eth = pd.DataFrame([['Asian', 0.15, 0.51],
                    ['Black', 0.05, 0.02],
                    ['Latino', 0.39, 0.16],
                    ['White', 0.35, 0.2],
                    ['Other', 0.06, 0.11]],
                   columns=['Ethnicity', 'California', 'UCSD']).set_index('Ethnicity')

eth

- We want to decide whether UCSD students were drawn at random from the state of California.
- The two **categorical distributions** above are clearly different. But how different are they?

### Is the difference between the two distributions significant?

Let's establish our hypotheses.
- **Null Hypothesis**: UCSD students **were** selected at random from the population of California residents.
- **Alternative Hypothesis**: UCSD students **were not** selected at random from the population of California residents.
- **Observation**: Ethnic distribution of UCSD students.
- **Test Statistic**: We need a way of quantifying **how different** two categorical distributions are.

In [None]:
eth.plot(kind='barh', title='Ethnic Distribution of California and UCSD', barmode='group')

### Total variation distance

The total variation distance (TVD) is a test statistic that describes the **distance between two categorical distributions**.

If $A = [a_1, a_2, ..., a_k]$ and $B = [b_1, b_2, ..., b_k]$ are both categorical distributions, then the TVD between $A$ and $B$ is

$$\text{TVD}(A, B) = \frac{1}{2} \sum_{i = 1}^k |a_i - b_i|$$

In [None]:
def total_variation_distance(dist1, dist2):
    '''Given two categorical distributions, 
    both sorted with same categories, calculates the TVD'''
    return np.sum(np.abs(dist1 - dist2)) / 2

Let's compute the TVD between UCSD's ethnic distribution and California's ethnic distribution.

In [None]:
observed_tvd = total_variation_distance(eth['UCSD'], eth['California'])
observed_tvd
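
As a check on the function, here is the same computation written out term by term. The arrays repeat the values in `eth`, in the same category order, so that the sketch runs on its own.

```python
import numpy as np

# Proportions in the order Asian, Black, Latino, White, Other (copied from eth).
california = np.array([0.15, 0.05, 0.39, 0.35, 0.06])
ucsd = np.array([0.51, 0.02, 0.16, 0.20, 0.11])

# Absolute differences: 0.36, 0.03, 0.23, 0.15, 0.05 (up to floating-point rounding).
abs_diffs = np.abs(ucsd - california)

# Half of their sum: 0.82 / 2 = 0.41.
tvd_by_hand = abs_diffs.sum() / 2
tvd_by_hand
```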

The issue is we don't know whether this is a large value or a small value – we don't know where it lies in the **distribution of TVDs under the null**.

### The plan

To conduct our hypothesis test, we will:

- Repeatedly generate samples of size 30,000 (number of UCSD students) from the ethnic distribution of all of California.

- Each time, compute the TVD between the simulated distribution and California's distribution.

- **This will generate an empirical distribution of TVDs, under the null.**

- Finally, determine whether the observed TVD is consistent with the empirical distribution of TVDs.

### Generating one random sample

Again, to sample from a categorical distribution, we use `np.random.multinomial`.

**Important**: We must sample from the "population" distribution here, which is the ethnic distribution of everyone in California.

In [None]:
# Number of students at UCSD in this example.
N_STUDENTS = 30_000

In [None]:
eth['California']

In [None]:
np.random.multinomial(N_STUDENTS, eth['California'])

In [None]:
np.random.multinomial(N_STUDENTS, eth['California']) / N_STUDENTS

### Generating many random samples

We _could_ write a `for`-loop to repeat the process on the previous slide many times (and you _can_ in labs and projects). However, we now know about the `size` argument in `np.random.multinomial`, so let's use that here.

In [None]:
num_reps = 100_000
eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=num_reps) / N_STUDENTS
eth_draws

In [None]:
eth_draws.shape

Notice that each row of `eth_draws` sums to 1, because each row is a simulated categorical distribution.

### Computing many TVDs, without a `for`-loop

One issue is that the `total_variation_distance` function we've defined won't work with `eth_draws` (unless we use a `for`-loop), since it is written for one pair of distributions at a time. So, we'll compute the TVDs directly with array operations.

In [None]:
tvds = np.sum(np.abs(eth_draws - eth['California'].to_numpy()), axis=1) / 2
tvds

Just to make sure we did things correctly, we can compute the TVD between the first row of `eth_draws` and `eth['California']` using our previous function.

In [None]:
# Note that this is the same as the first element in tvds.
total_variation_distance(eth_draws[0], eth['California'])
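
The vectorized computation relies on broadcasting: subtracting a length-5 array from an array with 5 columns subtracts it from every row. This self-contained sketch checks the vectorized result against a row-by-row loop on a small simulation. The seed and the smaller number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(80)
california = np.array([0.15, 0.05, 0.39, 0.35, 0.06])
n_students, small_reps = 30_000, 1_000

# Shape (1000, 5): each row is one simulated ethnic distribution.
draws = rng.multinomial(n_students, california, size=small_reps) / n_students

# Vectorized: broadcasting subtracts california from every row, then we sum across columns.
tvds_vectorized = np.abs(draws - california).sum(axis=1) / 2

# Loop version, one row at a time.
tvds_loop = np.array([np.abs(row - california).sum() / 2 for row in draws])

np.allclose(tvds_vectorized, tvds_loop)
```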

### Visualizing the empirical distribution of the test statistic

In [None]:
fig = px.histogram(pd.DataFrame(tvds), x=0, nbins=20, histnorm='probability', 
                   title='Empirical Distribution of the TVD')
fig.add_vline(x=observed_tvd, line_color='red')
fig

In [None]:
(np.array(tvds) >= observed_tvd).mean()

No, there's not a mistake in our code! None of the simulated TVDs was as large as the observed TVD.

### Conclusion

- If the null were true, the chance of seeing a TVD at least as large as the observed TVD is essentially 0.
- This matches our intuition from the start – the two distributions looked very different to begin with. But now we're quite sure the difference can't be explained solely due to chance.

### Summary of the method

To assess whether an "observed sample" was drawn randomly from a known categorical distribution:
* Use the TVD as the test statistic because it measures the distance between two categorical distributions.
* Sample at random from the population. Compute the TVD between each random sample and the known distribution to get an idea of what reasonable deviations from the population look like. Repeat this process many, many times.
* Compare:
    - the empirical distribution of TVDs, with
    - the observed TVD from the sample.

### Aside

- It was probably obvious that the difference is significant even before running a hypothesis test.

- Why? There are 30,000 students. Such a difference in proportion is unlikely to be due to random chance (something more systematic is at play).

- But what if `N_STUDENTS = 300`, `N_STUDENTS = 30`, or `N_STUDENTS=3`?

### Discussion Question

At what value of `N_STUDENTS` would we fail to reject the null (at a 0.05 p-value cutoff)?

In [None]:
def p_value_given_n_students(N_STUDENTS):
    eth_draws = np.random.multinomial(N_STUDENTS, eth['California'], size=num_reps) / N_STUDENTS
    tvds = np.sum(np.abs(eth_draws - eth['California'].to_numpy()), axis=1) / 2
    p_value = (tvds >= observed_tvd).mean()
    return p_value

In [None]:
interact(p_value_given_n_students, N_STUDENTS=(1, 300));

To fail to reject the null, our sample size (that is, the number of students at UCSD) would have to be in the single digits.
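
One non-interactive way to explore this question is to compute the simulated p-value for a handful of sample sizes. This sketch re-declares California's distribution and the observed TVD (0.41) so that it runs on its own, and uses fewer repetitions than the lecture to keep it quick. `simulated_tvd_p_value` is a hypothetical name, and the list of sample sizes is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(80)
california = np.array([0.15, 0.05, 0.39, 0.35, 0.06])
observed = 0.41  # Observed TVD between UCSD and California, computed earlier.

def simulated_tvd_p_value(n_students, reps=20_000):
    '''Simulated p-value of the observed TVD for samples of size n_students.'''
    draws = rng.multinomial(n_students, california, size=reps) / n_students
    tvds = np.abs(draws - california).sum(axis=1) / 2
    return (tvds >= observed).mean()

# p-values for a range of sample sizes.
p_values_by_n = {n: simulated_tvd_p_value(n) for n in [1, 2, 5, 10, 20, 50, 300]}
p_values_by_n
```

With a sample of size 1, every simulated distribution puts all of its mass on one category, which is always at least 0.41 away from California's distribution, so the p-value is 1. As the sample size grows, the p-value falls towards 0.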

## Example: Penguins (again!)

<center><img src='imgs/lter_penguins.png' width=60%></center>

([source](https://allisonhorst.github.io/palmerpenguins/articles/intro.html))

Consider the `penguins` dataset from a few lectures ago.

In [None]:
penguins = sns.load_dataset('penguins').dropna()
penguins.head()

### Average bill length by island

In [None]:
penguins.groupby('island')['bill_length_mm'].agg(['mean', 'count'])

It appears that penguins on Torgersen Island have shorter bills on average than penguins on other islands.

- This could've happened due to random chance. If island and bill length were unrelated, and we randomly assigned all 333 penguins an island, **one of the three islands would have to have the lowest average bill length** (unless multiple islands are tied for the lowest).

- But, is the average bill length of penguins on Torgersen Island lower than we'd expect due to chance? Let's perform a hypothesis test!

### Setup

- **Null Hypothesis**: Island and bill length **are not** related – the low average bill length of Torgersen Island penguins is due to chance alone.
    - In other words, if we picked 47 penguins randomly from the population of 333 penguins, it is reasonable to see an average this low.

- **Alternative Hypothesis**: Island and bill length **are** related – the low average bill length of Torgersen Island penguins is not due to chance alone.

### The plan

- The null hypothesis states that the 47 bill lengths of Torgersen Island penguins were drawn uniformly at random from the 333 bill lengths in the population.

- That is, if we repeatedly sampled groups of 47 penguins from the population and computed their mean bill length, it would not be uncommon to see an average bill length this low.

- **Plan**: Repeatedly sample (without replacement) 47 penguins from the population and **compute their average bill length**, and see where the Torgersen Island average bill length lies in this distribution.
    - Average bill length is our **test statistic**.
    - This is not a test statistic we've used in this lecture yet (and this is what separates this example from previous examples).

### Simulation

Again, while you could do this with a `for`-loop (and you _can_ use a `for`-loop for hypothesis tests in labs and projects), we'll use the faster `size` approach here.

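Instead of using `np.random.multinomial`, which samples from a categorical distribution, we'll use `np.random.choice`, which samples from a known sequence of values. Note that `np.random.choice` samples **with** replacement by default, whereas the plan above describes sampling without replacement; with a population of 333 and samples of size 47, the two approaches give similar, but not identical, distributions.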

In [None]:
# Draws two samples of size 47 from penguins['bill_length_mm'].
# Question: Why must we sample with replacement here (or, more specifically, in the next cell)?
np.random.choice(penguins['bill_length_mm'], size=(2, 47))

In [None]:
# Draws 100000 samples of size 47 from penguins['bill_length_mm'].
num_reps = 100_000
averages = np.random.choice(penguins['bill_length_mm'], size=(num_reps, 47)).mean(axis=1)
averages
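
The plan calls for sampling without replacement, while the cell above samples with replacement. Below is one possible sketch of a vectorized without-replacement version: rank a row of random numbers and keep the positions of the 47 smallest, which gives 47 distinct positions per row. To keep the sketch self-contained, it uses 333 synthetic values as a stand-in for `penguins['bill_length_mm']` (an assumption, not the real data), fewer repetitions than the lecture, and illustrative names throughout.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for penguins['bill_length_mm'].to_numpy(): 333 synthetic values (an assumption).
bill_lengths = rng.normal(loc=44, scale=5.5, size=333)

num_reps = 10_000
sample_size = 47

# Each row of idx holds 47 distinct positions, chosen uniformly at random:
# sort a row of random numbers and keep where the 47 smallest came from.
idx = rng.random((num_reps, bill_lengths.size)).argsort(axis=1)[:, :sample_size]

# Mean of each without-replacement sample.
averages_no_replacement = bill_lengths[idx].mean(axis=1)
averages_no_replacement
```

A `for`-loop that calls `np.random.choice(..., size=47, replace=False)` once per repetition would be a simpler, slower alternative.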

### Visualizing the empirical distribution of the test statistic

In [None]:
fig = px.histogram(pd.DataFrame(averages), x=0, nbins=50, histnorm='probability', 
                   title='Empirical Distribution of the Average Bill Length in Samples of Size 47')
fig.add_vline(x=penguins.loc[penguins['island'] == 'Torgersen', 'bill_length_mm'].mean(), line_color='red')

It doesn't look like the average bill length of penguins on Torgersen Island came from the distribution of average bill lengths of random samples of 47 penguins from our dataset.
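
To turn this visual impression into a number, we can compute a p-value. Since the alternative here points towards *low* averages, "at least as extreme" means "at most the observed average". A minimal sketch, with a hypothetical helper name:

```python
import numpy as np

def left_tail_p_value(simulated_stats, observed_stat):
    '''Proportion of simulated statistics that are at most the observed statistic.'''
    return (np.asarray(simulated_stats) <= observed_stat).mean()

# Toy check: 2 of these 4 simulated values are <= 2.
left_tail_p_value([1, 2, 3, 4], 2)
```

In the notebook, this could be called with `averages` as the simulated statistics and the Torgersen Island mean bill length as the observed statistic.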

### Discussion Question

There **is** a statistical tool you've learned about that would allow us to find the **true probability distribution** of the test statistic in this case. What is it?

<br>

<details>
    <summary>➡️ Click here to see the answer <b>after</b> you've thought about it.</summary>
    <b>The Central Limit Theorem (CLT).</b>
    
Recall, the CLT tells us that for any population distribution, the distribution of the sample mean is roughly normal, with the same mean as the population mean. Furthermore, it tells us that the standard deviation of the distribution of the sample mean is $\frac{\text{Population SD}}{\sqrt{\text{sample size}}}$.
    
So, the distribution of sample means of samples of size 47 drawn from <code>penguins['bill_length_mm']</code> is roughly normal with mean <code>penguins['bill_length_mm'].mean()</code> and standard deviation <code>penguins['bill_length_mm'].std(ddof=0) / np.sqrt(47)</code>.
    
</details>

## Summary

### The hypothesis testing "recipe"

Faced with a question about the data raised by an observation...
1. Carefully pose the question as a testable "yes or no" hypothesis.
2. Decide on a **test statistic** that helps differentiate between instances that would affirm or reject the hypothesis.
3. Create a probability model for the data generating process that reflects the "known behavior" of the process.
4. Simulate the data generating process using this probability model (the "**null hypothesis**").
5. Assess if the observation is consistent with the simulations by computing a **p-value**.
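
The steps above can be sketched as one generic function. This is only an illustration under some assumptions: large values of the test statistic favor the alternative, and the null model is supplied as a function that returns many simulated test statistics at once. The names `simulated_p_value` and `coin_null` are hypothetical.

```python
import numpy as np

def simulated_p_value(simulate_null_stats, observed_stat, num_reps=100_000, seed=None):
    '''Simulates test statistics under the null and returns the proportion
    that are at least as large as the observed statistic.'''
    rng = np.random.default_rng(seed)
    null_stats = simulate_null_stats(rng, num_reps)
    return (null_stats >= observed_stat).mean()

def coin_null(rng, num_reps):
    '''Number of heads in 100 flips of a fair coin, repeated num_reps times.'''
    return rng.binomial(100, 0.5, size=num_reps)

# The coin example from the start of the lecture: 59 heads observed.
coin_p_value = simulated_p_value(coin_null, observed_stat=59, seed=80)
coin_p_value
```

Swapping in a different null model and test statistic, such as simulated TVDs, only requires writing a different `simulate_null_stats` function.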

### Hypothesis testing vs. permutation testing

"Standard" hypothesis testing helps us answer questions of the form:

> I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

- Sample: 59 heads and 41 tails. Population: A fair coin.

- Sample: Ethnic distribution of UCSD. Population: Ethnic distribution of California. (Comparing categorical distributions with the TVD.)

- Sample: Sample of Torgersen Island penguins. Population: All 333 penguins. (Comparing a subgroup statistic to a population parameter.) 

It **does not** help us answer questions of the form:

> I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

That's where permutation testing comes in.

## Additional reading

Here are a few more slides with examples that we won't cover in lecture.

### Null hypothesis

- Recall, a **null hypothesis** is an initial or default belief as to how data were generated.
    - The null hypothesis must be a **probability model**, i.e. something that we can simulate under.

- Often, but not always, the null hypothesis states there is no association or difference between variables or subpopulations, and that any observed differences were due to random chance. 
- Examples:
    * The coin was fair.
    * The music preferences of Americans and Canadians are the same.
    * The median number of Instagram followers of DSC majors is equal to the median number of Instagram followers of all students at UCSD.

### Alternative hypothesis

- An **alternative hypothesis** is a different viewpoint as to how data were generated.
- The alternative hypothesis typically states that a difference between variables or subpopulations exists and is not due to random chance.
- Examples:
    - The coin is biased towards heads.
    - The coin is biased.
    - The music preferences of Americans and Canadians are different.
    - The median number of Instagram followers of DSC majors is greater than the median number of Instagram followers of all students at UCSD.

### P-values and cutoffs

- The **p-value**, or **observed significance level**, is the probability, under the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative.
- If the p-value is below a pre-determined **cutoff**, or **significance level**, we say that our observation is inconsistent with the null hypothesis and we **reject the null**.
    - 0.05 (significant) and 0.01 (highly significant) are common cutoffs.
    - If the p-value is above the cutoff, we **fail to reject the null**.
- Note that the cutoff is an **error probability**.
    - If your cutoff is 0.05, then in cases where the null hypothesis is actually true, you will incorrectly reject it roughly 5% of the time.
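
To illustrate the cutoff as an error probability, the sketch below simulates many coin-flipping experiments in which the null is true (the coin really is fair) and counts how often a test with a 0.05 cutoff rejects anyway. It reuses the one-sided coin test from the lecture. The seed and variable names are arbitrary.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n_flips, cutoff = 100, 0.05

# Exact null probabilities P(N_H = k) for a fair coin, k = 0, ..., 100.
pmf = np.array([comb(n_flips, k) * 0.5 ** n_flips for k in range(n_flips + 1)])

# upper_tail[k] = P(N_H >= k) under the null, i.e. the p-value when k heads are observed.
upper_tail = pmf[::-1].cumsum()[::-1]

# Simulate 100,000 experiments in which the null really is true.
heads = rng.binomial(n_flips, 0.5, size=100_000)

# Proportion of those experiments in which we (incorrectly) reject the null.
false_rejection_rate = (upper_tail[heads] < cutoff).mean()
false_rejection_rate
```

Because the number of heads is discrete, the rejection rate here comes out slightly below 0.05: the test rejects exactly when there are at least 59 heads, which happens about 4.4% of the time for a fair coin.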