# Discussion 7 - Hypothesis Testing, Total Variation Distance, and Permutation Testing

## DSC 10, Summer 2024

### Agenda
- Review of concepts:
    - Hypothesis testing.
    - Total variation distance.
- Work in groups of 2-4 on practice problems covering these topics.
    - Available at [practice.dsc10.com](https://practice.dsc10.com/).
- At the end, go over the problems people had the most trouble with all together.

## Hypothesis Testing

### 1) Defining hypothesis testing

A hypothesis test chooses between two views (**hypotheses**) of how data were generated, based on data in a sample.
- Suppose you flip a coin 400 times and see 230 heads. You want to test between two viewpoints:
    - "The coin is fair" or "The coin is not fair"

The test picks the hypothesis that is better supported by the observed data.

### 2) Null and alternative hypotheses
- **Null hypothesis**: a well-defined probability model about how the data was generated.
    - "The coin is fair" - 0.5 probability for both heads and tails.
- **Alternative hypothesis**: a different view about how the data was generated.
    - "The coin is not fair" - the probability of heads is something other than 0.5.

### 3) Test statistics

**Test statistic**: a single number that we will use to test which viewpoint the data better supports.
- Large values should side with one hypothesis.
- Small values should side with the other hypothesis.

Suppose we simulate flipping a coin 400 times. The quantity $|\text{number of heads} - 200|$ is a valid test statistic:

- Smaller values (# heads is close to 200) support the hypothesis that the coin is fair.
- Larger values (# heads is far from 200) support the hypothesis that the coin is not fair. 

**Observed statistic**: the test statistic evaluated on our observed data.

In [None]:
import numpy as np

# 230 heads were observed; a fair coin would give 200 heads on average.
observed_stat = np.abs(230 - 200)
observed_stat

Our hypothesis test boils down to checking whether the observed statistic is a "typical value" to encounter if the null is true.

### 4) Simulation

Once we have established our null and alternative hypotheses and test statistic, we have the following structure for running a simulation for a hypothesis test:
1. Repeatedly generate samples under the assumption that the **null is true**.
1. For each sample, calculate a **test statistic**.

In [None]:
import numpy as np
np.random.multinomial(400, [0.5, 0.5])

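In [None]:
# Simulate 10,000 sets of 400 fair coin flips and store each test statistic.
results = np.array([])
for i in np.arange(10000):
    # flips[0] is the number of heads, flips[1] the number of tails.
    flips = np.random.multinomial(400, [0.5, 0.5])
    num_heads = flips[0]
    # Distance of the simulated number of heads from the expected 200.
    statistic = np.abs(num_heads - 200)
    results = np.append(results, statistic)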

### 5) p-values

**p-value**: the probability under the null hypothesis that the test statistic is equal to the observed statistic or is even further in the direction of the alternative.

- After simulating many sample test statistics, we want to check whether the **observed statistic is a typical value** in the distribution of our test statistic.
- If the null is true, what is the chance of seeing an outcome at least as extreme as the observed statistic?

In [None]:
# It is rare to flip a fair coin 400 times and see 230 or more heads, or 170 or fewer heads.
np.count_nonzero(results >= observed_stat) / len(results)

In [None]:
# It is common to flip a fair coin 400 times and see 205 or more heads, or 195 or fewer heads.
np.count_nonzero(results >= 5) / len(results)

Conventionally, we use a significance level of 0.05, meaning:
- if the p-value is **above** the 0.05 cutoff, we **fail to reject the null hypothesis**.
- if the p-value is **below** the 0.05 cutoff, we **reject the null hypothesis**.
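
The decision rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `simulate_p_value` and its parameters are hypothetical, and `np.random.binomial` is swapped in for `np.random.multinomial` so that all head counts are drawn at once instead of in a loop.

```python
import numpy as np

def simulate_p_value(num_heads_observed, num_flips=400, repetitions=10000):
    """Hypothetical helper: estimate the p-value for the fair-coin test by simulation."""
    expected = num_flips / 2
    observed_stat = np.abs(num_heads_observed - expected)
    # Draw `repetitions` head counts under the null hypothesis (fair coin).
    simulated_heads = np.random.binomial(num_flips, 0.5, size=repetitions)
    simulated_stats = np.abs(simulated_heads - expected)
    # Proportion of simulated statistics at least as extreme as the observed one.
    return np.count_nonzero(simulated_stats >= observed_stat) / repetitions

p_value = simulate_p_value(230)
# Compare the p-value to the conventional 0.05 cutoff.
decision = "reject the null" if p_value < 0.05 else "fail to reject the null"
print(p_value, decision)
```

Because the p-value is estimated by simulation, it will vary slightly from run to run.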

## Total Variation Distance

**Total Variation Distance (TVD)**: a test statistic that measures the distance between two categorical distributions. It is computed as:
- the sum of the absolute differences of their proportions, all divided by 2
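
Written out as a formula, assuming the two distributions are given as proportions $a_1, \dots, a_k$ and $b_1, \dots, b_k$ over the same $k$ categories (these symbols are introduced here only for illustration):

$$\text{TVD}(a, b) = \frac{1}{2} \sum_{i=1}^{k} |a_i - b_i|$$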

In [None]:
fair_die = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
observed = np.array([10/60, 12/60, 8/60, 11/60, 9/60, 10/60])

abs_diff = np.abs(fair_die - observed)
tvd = np.sum(abs_diff) / 2
tvd

- TVD will be a number between 0 and 1; it is 0 only when the two distributions are identical.
- The further the TVD is from 0, the more the observed distribution deviates from the expected one, which here would suggest the die may be biased.
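
Whether a particular TVD counts as "far from 0" is best judged against TVDs simulated under the null hypothesis, just as with the coin. The following is a minimal sketch, assuming the observed proportions come from 60 rolls (as the denominators suggest); the helper name `total_variation_distance` is hypothetical.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Hypothetical helper: TVD between two categorical distributions."""
    return np.sum(np.abs(dist1 - dist2)) / 2

fair_die = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
observed = np.array([10, 12, 8, 11, 9, 10]) / 60
observed_tvd = total_variation_distance(observed, fair_die)

# Simulate 60 rolls of a fair die many times and record each TVD.
tvds = np.array([])
for i in np.arange(10000):
    simulated = np.random.multinomial(60, fair_die) / 60
    tvds = np.append(tvds, total_variation_distance(simulated, fair_die))

# Proportion of simulated TVDs at least as large as the observed TVD.
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
print(observed_tvd, p_value)
```

Only large TVDs favor the alternative, so the p-value counts simulated TVDs that are greater than or equal to the observed one. Here the observed TVD works out to 0.05, and the simulation indicates how often a fair die produces a TVD at least that large.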