# Hypothesis Testing

## Motivating example

- something we know intuitively -- how do we back it up with math

## An assumption, a sample, and an important question

We have a set of assumptions about the population which specify the shape of its distribution, and therefore also assumptions about various population parameters.

In contrast to the intangible assumed population and parameter, we have *actual data* collected from an observed sample and corresponding sample statistic. Remember that samples will generally resemble the population they were drawn from. But, once we actually visualize our sample it might so happen that our observed sample statistic differs substantially from the assumed parameter...

![](../images/assumed-vs-observed.png)

Our question is simple: *could this observed sample statistic have really come from the assumed population?*

If not, there could be some serious repercussions -- chances are our previous set of assumptions about the population this sample came from is wrong! We would need to move to a new set of beliefs about the population.

Otherwise, if it looks like the observed statistic could have feasibly come from the assumed population, then we don't have enough evidence to abandon the assumptions. We can continue operating under those beliefs (for now at least).

```{note}

In statistics, these two competing answers to our question give way to the formal definitions of the *null hypothesis* and *alternate hypothesis*:

- The {dterm}`null hypothesis`, denoted $H_0$, is the hypothesis that our assumptions about the population are valid. The null states that the population parameter is truly equal to the assumed value.
    $$H_0: \text{population parameter} = value$$
    Under this hypothesis, the fact that we observed a sample whose statistic wasn't equal to this parameter value is *wholly due to random chance*.
    
- The {dterm}`alternate hypothesis`, denoted $H_1$, contradicts the null by stating that the true population parameter is *not* equal to the assumed value.
    $$H_1: \text{population parameter} \neq value$$
    or
    $$H_1: \text{population parameter} < value$$
    or
    $$H_1: \text{population parameter} > value$$
    This hypothesis stipulates that there was some reason *other than randomness* that we observed a sample whose statistic wasn't equal to the parameter value -- namely, the population the sample was drawn from doesn't actually have that parameter value!
    
```

But how do we decide whether or not to abandon the null while relying on quantitative methods, rather than loose judgements? By harnessing the power of probabilities, of course.

## Simulating what our sample could have looked like

As you know by now, the way we data scientists answer uncertain questions is by answering with a {dterm}`probability`. To answer this question, we want to find a way to say "there's a $p\%$ chance that such an extreme sample statistic could have been generated by sampling from this population." We're defining 'extreme' as being far away from the assumed parameter value in the direction of the alternative hypothesis (more on this later).

How do we find $p$? We already know how to do this! We can use our simulation skills to literally just simulate a ton of random samples from the assumed population, then calculate the probability that these samples exhibit statistics like the one we observed.

Since we're generating these new sample statistics for the purpose of using them in a hypothesis test, they're called {dterm}`test statistics`. And, since we're operating under the population assumptions in order to create these samples, this entire process is called **generating test statistics *under the null***.

![We draw a bunch of new samples from the assumed population. Each sample in turn generates a sample statistic.](../images/sample-from-population.png)

If our assumed population is the true population that our sample came from, then we should expect that some of these randomly simulated samples will exhibit statistics that look like the one we originally observed.
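
To make the simulation step concrete, here is a minimal sketch using only Python's standard library. The fair-coin population, the sample size of 100 flips, and the names `simulate_statistics_under_null` and `flip_fair_coin_100_times` are all assumptions made up for illustration; swap in whatever population and statistic your own null specifies.

```python
import random

def simulate_statistics_under_null(draw_sample, statistic, num_simulations, seed=0):
    """Draw many samples from the null-assumed population and
    return the statistic computed on each one."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [statistic(draw_sample(rng)) for _ in range(num_simulations)]

# Hypothetical null population: a fair coin, sampled 100 flips at a time.
def flip_fair_coin_100_times(rng):
    return [rng.random() < 0.5 for _ in range(100)]

# Statistic: number of heads in the sample (summing booleans counts the Trues).
null_statistics = simulate_statistics_under_null(flip_fair_coin_100_times, sum, 5000)

print(min(null_statistics), max(null_statistics))
```

Each entry of `null_statistics` is one test statistic generated under the null, so the list as a whole approximates the distribution we compare our observed statistic against.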

Let's take a look at the different values of test statistics that resulted from our simulated samples -- this {dterm}`sampling distribution` is called the distribution of test statistics under the null.

![](../images/simulated-statistics.png)

Do the simulated test statistics ever look as extreme as the one we originally observed? In the illustration above, yes -- but with what probability? Using the frequentist definition of probabilities, we define our question about $p$ from above as the *proportion of test statistics that are equal to the observed value or even more extreme (in the direction of the alternative hypothesis)*.

![](../images/p-value.png)

```{note}

Once again, we give way to the formal language of statistics. This proportion $p$ has a name: the {dterm}`p-value`. It can be mathematically defined as the probability that a statistic generated under the population assumptions of the null hypothesis will be at least as extreme as the observed statistic.

The extremity of a test statistic depends on our alternate hypothesis. If $H_1: \text{param} < value$, then lower statistics are more extreme; if $H_1$ proposes $>$, then higher statistics are more extreme; and if $\neq$ is stipulated, then a statistic far from the assumed parameter in either direction is considered extreme.

In the illustration above, we're clearly interested in the alternate hypothesis suggesting the true parameter is greater than the assumed value, so the mathematical notation for our p-value is
$$\text{p-value} = P(\text{test statistic generated under the null} \geq \text{observed statistic})$$

```
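
The proportion described above can be computed directly from a list of simulated statistics. The sketch below is one possible way to do it; the function name `empirical_p_value`, its `alternative` labels, and the tiny hand-written list of statistics are hypothetical choices for illustration, not part of any library.

```python
def empirical_p_value(null_statistics, observed, alternative, null_value=None):
    """Proportion of simulated statistics at least as extreme as the observed one.

    alternative: "greater", "less", or "two-sided" (the direction of H1).
    null_value:  the assumed parameter value; only needed for "two-sided".
    """
    if alternative == "greater":
        extreme = [s >= observed for s in null_statistics]
    elif alternative == "less":
        extreme = [s <= observed for s in null_statistics]
    elif alternative == "two-sided":
        if null_value is None:
            raise ValueError("two-sided tests need the assumed parameter value")
        distance = abs(observed - null_value)
        extreme = [abs(s - null_value) >= distance for s in null_statistics]
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return sum(extreme) / len(extreme)

# Ten made-up statistics "generated under the null", with assumed parameter 50.
simulated = [48, 50, 52, 55, 41, 50, 61, 49, 51, 58]

print(empirical_p_value(simulated, 58, "greater"))                   # 0.2
print(empirical_p_value(simulated, 58, "less"))                      # 0.9
print(empirical_p_value(simulated, 58, "two-sided", null_value=50))  # 0.3
```

The same observed statistic yields different p-values depending on which direction the alternate hypothesis treats as extreme, which is why the alternative has to be fixed before computing $p$.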

## Making judgements with the p-value

Once we've found the p-value -- the chance our null population could have produced a sample statistic as extreme as the one we originally observed -- it's time to get judgemental.

Are those null assumptions still holding up? Depends on how big our p-value is, and how much uncertainty we're willing to accept in our lives.

- **If the p-value is really small** it suggests that it's really unlikely for the assumed population to produce such extreme sample statistics -- not impossible, just so rare that we think there are flaws with our assumptions.

    When what we observed would be super unlikely if the null were true, we make the judgement to **"reject the null"** in favor of the beliefs set forth by the alternative hypothesis.

- **If the p-value isn't so small** then it's still conceivable that random chance could have been the sole reason for observing a sample statistic different from the population parameter -- after all, we did observe it $p\%$ of the time!

    Because our test doesn't seem to suggest anything wrong with our observed statistic meshing with the beliefs set forth by the null, we don't have justification to change those beliefs, and we **"fail to reject"** the null.

```{admonition} A look forward
Even if we reject the null in favor of the alternative, we don't yet have a solid guess for what population we think our original sample truly came from! All that the alternative says is that the underlying population parameter is less-than/greater-than/not-equal-to a value, but shouldn't we take a next step and estimate what value it actually is?

We'll learn to do exactly this in the chapter on Estimation.
```

So, we abandon the null hypothesis if the p-value is really small, and don't if the p-value isn't really small.

But how small *is* 'really small'?

## The threshold for $p$

The cutoff for when a p-value is considered small enough to reject the null should be set ahead of time -- before the test is conducted. This threshold is generally referred to as the {dterm}`level of significance`, denoted $\alpha$, and its value should be chosen depending on conventions, tolerance of risk (what if your test is wrong!?), and even your personal belief about how random the universe is.

When you set the level of significance, you're essentially saying:  
"*If a statistic at least as extreme as the one I observed had a chance smaller than $\alpha$ of coming from the null-assumed population, then I'll reject the null.*"
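
Written as code, the judgement is a single comparison. This is only a sketch: the function name `judge_null` and the example p-values are made up for illustration, and the strict `<` follows the "$p < \alpha$" convention.

```python
def judge_null(p_value, alpha):
    """Compare a p-value to a level of significance chosen ahead of time."""
    if not 0 < alpha < 1:
        raise ValueError("alpha should be a probability strictly between 0 and 1")
    return "reject the null" if p_value < alpha else "fail to reject the null"

ALPHA = 0.05  # set before running the test, not after seeing the p-value

print(judge_null(0.012, ALPHA))  # reject the null
print(judge_null(0.20, ALPHA))   # fail to reject the null
```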

In the history of hypothesis testing, setting this value has been a somewhat touchy subject, and has sparked great controversy! If you've ever seen a hypothesis test before this point, you've likely seen the phrase "$p < \alpha = 0.05$" as the justification that the results of some paper are significant, since they were able to quash the prior beliefs of the null. Why is $0.05$ used so commonly, though? Simply because it is popular.

Remember, if we set $\alpha=0.05$ then we'll choose to abandon an entire set of beliefs just because we observed something that had a less-than $5\%$ chance of occurring. What if the null is actually true, though? Then about $5\%$ of the time that we run our test, we'll randomly make an observation extreme enough to incorrectly reject the null -- that's a one-in-twenty event!

The level of significance is directly linked with the chance that we incorrectly reject the null hypothesis when in fact it is true. A "false positive" such as this could lead to disastrous effects depending on your field, which is why many fields like medicine choose much more stringent levels of significance, like $\alpha=0.001$.
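
We can check this link between $\alpha$ and false positives by simulation. In the sketch below the null is *true by construction*: every "observed" sample really is drawn from the assumed population. The standard normal population, the sample size of 20, the sample mean as the statistic, and all function names are assumptions chosen purely for illustration.

```python
import random

rng = random.Random(42)  # seeded so the sketch is reproducible
ALPHA = 0.05
SAMPLE_SIZE = 20

def sample_mean_under_null():
    """Mean of one sample drawn from the hypothetical null population, Normal(0, 1)."""
    return sum(rng.gauss(0, 1) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE

# Distribution of test statistics under the null.
null_statistics = [sample_mean_under_null() for _ in range(2000)]

def p_value_greater(observed):
    """Proportion of null statistics at least as large as the observed one."""
    return sum(s >= observed for s in null_statistics) / len(null_statistics)

# Run many tests where the "observed" sample truly comes from the null,
# and count how often we (incorrectly) reject.
num_tests = 1000
false_positives = sum(
    p_value_greater(sample_mean_under_null()) < ALPHA for _ in range(num_tests)
)
false_positive_rate = false_positives / num_tests

print(false_positive_rate)  # should land in the neighborhood of ALPHA
```

Lowering `ALPHA` shrinks the false positive rate, at the cost of demanding stronger evidence before any null is rejected.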

## Applying the framework to diverse scenarios

The hypothesis testing framework we've developed works well on lots of different things you throw at it.

To summarize this framework, we:
1. Observe a sample and statistic
2. Formulate a null hypothesis about the underlying population and parameter, and set a level of significance
3. Formulate an alternative hypothesis
4. Generate new test statistics under the assumptions of the null
5. Calculate the probability of a test statistic being at least as extreme as the observed statistic
6. Make a judgement about the null hypothesis using the p-value and level of significance

Notice how little this approach to hypothesis testing demands of us. We can run tests on continuous data or discrete data, and it doesn't matter what the true population looks like, as long as we're able to simulate samples under the null and compute a statistic from each one. We can test using practically *any* statistic we choose, such as a mean, max, variance, or even count.
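
The sketch below strings the six steps together in one function that accepts any statistic. Everything specific in it is a made-up assumption for illustration: the fair-die null, the 30 hand-written "observed" rolls, the "greater than" alternative, and the name `run_hypothesis_test`.

```python
import random

def run_hypothesis_test(observed_sample, draw_null_sample, statistic,
                        alpha=0.05, num_simulations=5000, seed=0):
    """Simulation-based test with a 'greater than' alternative hypothesis."""
    rng = random.Random(seed)
    observed = statistic(observed_sample)
    # Generate test statistics under the assumptions of the null.
    null_statistics = [statistic(draw_null_sample(rng)) for _ in range(num_simulations)]
    # p-value: proportion of simulated statistics at least as extreme as observed.
    p_value = sum(s >= observed for s in null_statistics) / num_simulations
    decision = "reject the null" if p_value < alpha else "fail to reject the null"
    return {"observed": observed, "p_value": p_value, "decision": decision}

# Hypothetical null: the die is fair, and a sample is 30 rolls.
def roll_fair_die_30_times(rng):
    return [rng.randint(1, 6) for _ in range(30)]

def mean(values):
    return sum(values) / len(values)

def count_sixes(values):
    return sum(v == 6 for v in values)

# Made-up observed sample of 30 rolls that looks suspiciously high.
observed_rolls = [6, 5, 6, 4, 6, 3, 5, 6, 6, 2, 5, 6, 4, 6, 5,
                  6, 3, 6, 5, 4, 6, 6, 5, 2, 6, 4, 6, 5, 6, 6]

for statistic in (mean, count_sixes, max):
    result = run_hypothesis_test(observed_rolls, roll_fair_die_30_times, statistic)
    print(statistic.__name__, result)
```

Only the `statistic` argument changes between runs. Note that the choice still matters: the max of 30 fair rolls is almost always 6, so a test based on `max` has little power to notice a die that favors high numbers, while `mean` and `count_sixes` pick it up easily.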

This framework is so incredibly flexible because it follows a simple, intuitive principle: *if we have a set of assumptions, then we'd better expect data generated from those assumptions to look like what we observed in real life.* We now have a solid way to quantify the degree to which an observation challenges a set of assumptions.

## Follow-up questions

### Why don't we calculate the probability of equalling the observed value?

- When there are lots of possible outcomes, the probability of any *one specific outcome* is assuredly low!
- When working with continuous data, the probability of observing or generating any single number is (mathematically) zero
- Besides, if you were to see a result *more* extreme than the observed statistic, surely it should also count as evidence against the null, right?
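
A quick calculation illustrates the first and third points. The setup is hypothetical: 1000 flips of a fair coin, with 530 heads standing in for an observed statistic. The probabilities are computed exactly with `math.comb` rather than simulated.

```python
from math import comb

FLIPS = 1000  # hypothetical experiment: 1000 flips of a fair coin

def prob_exactly(heads):
    """Exact probability of seeing this many heads in FLIPS fair flips."""
    return comb(FLIPS, heads) / 2**FLIPS

def prob_at_least(heads):
    """Exact probability of seeing this many heads or more."""
    return sum(prob_exactly(k) for k in range(heads, FLIPS + 1))

print(prob_exactly(500))   # the single most likely outcome, yet only about 0.025
print(prob_exactly(530))   # "exactly what we observed": tiny for every outcome
print(prob_at_least(530))  # "at least as extreme": a usable measure of surprise
```

Even the most typical result, exactly 500 heads, has a probability of only a few percent, so "the probability of exactly this outcome" can't distinguish ordinary observations from surprising ones. The tail probability can.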

### What if I reject the null but it's actually true?

- This does happen! It's called a "false positive" or "Type I Error" (type one)
- We know the probability of incorrectly rejecting the null when it is actually true -- it's $\alpha$, our level of significance!