# Hypothesis Testing

## An assumption, a sample, and an important question

We have a set of assumptions about the population. With these assumptions, we suppose the population has a certain shape. The assumed population must then also have a certain parameter.

In contrast to the intangible assumed population and parameter, we have *actual data* from an observed sample and its corresponding sample statistic. Remember that samples will generally resemble the population they were drawn from. But once we actually visualize our sample, it might so happen that our observed sample statistic differs substantially from the assumed parameter...

![](../images/assumed-vs-observed.png)

Our question is simple: *could this observed sample statistic have really come from the assumed population?*

If not, there could be some serious repercussions -- chances are our previous set of assumptions about the population this sample came from is wrong! We would need to move to a new set of beliefs about the population.

Otherwise, if it looks like the observed statistic could have feasibly come from the assumed population, then we don't have enough evidence to abandon the assumptions. We can continue operating under those beliefs (for now at least).

\`\`\`{note}

In statistics, these two competing answers to our question give rise to the formal definitions of the *null hypothesis* and *alternate hypothesis*:

- The {dterm}`null hypothesis`, denoted $H_0$, is the hypothesis that our assumptions about the population are valid. The null states that the population parameter is truly equal to the assumed value.
    $$H_0: \text{population parameter} = value$$
    Under the belief of this hypothesis, the only reason we managed to observe a sample whose statistic wasn't equal to this parameter value is *wholly due to random chance*.
    
- The {dterm}`alternate hypothesis`, denoted $H_1$, contradicts the null by stating that the true population parameter is *not* equal to the assumed value.
    $$H_1: \text{population parameter} \neq value$$
    or
    $$H_1: \text{population parameter} < value$$
    or
    $$H_1: \text{population parameter} > value$$
    This hypothesis stipulates that there was some reason *other than randomness* that we observed a sample whose statistic wasn't equal to the parameter value -- namely, the population the sample was drawn from doesn't actually have that parameter value!
\`\`\`

But how do we decide whether or not to abandon the null while relying on quantitative methods, rather than loose judgements? By harnessing the power of probabilities, of course.

## Simulating what our sample could have looked like

As you know by now, the way we data scientists answer uncertain questions is by answering with a {dterm}`probability`. To answer this question, we want to find a way to say "there's a $p\%$ chance that such an extreme sample statistic could have been generated by sampling from this population." We're defining 'extreme' as being far away from the assumed parameter value in the direction of the alternative hypothesis (more on this later).

How do we find $p$? We already know how to do this! We can use our simulation skills to literally just simulate a ton of random samples from the assumed population, then calculate the probability that these samples exhibit statistics like the one we observed.

Since we're generating these new sample statistics for the purpose of using them in a hypothesis test, they're called {dterm}`test statistics`. And, since we're operating under the population assumptions in order to create these samples, this entire process is called **generating test statistics *under the null***.

![We draw a bunch of new samples from the assumed population. Each sample in turn generates a sample statistic.](../images/sample-from-population.png)

If our assumed population is the true population that our sample came from, then we should expect that some of these randomly simulated samples will exhibit statistics that look like the one we originally observed.
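Here is a minimal sketch of what generating test statistics under the null could look like in code. Everything specific in it is a hypothetical choice made for illustration, not something fixed by the method: the assumed population is a fair coin (parameter `assumed_p_heads = 0.5`), each sample is 100 flips, and the statistic is the number of heads.

```python
import numpy as np

# Hypothetical setup: the null assumes a fair coin, and each sample is 100 flips.
rng = np.random.default_rng(seed=0)
assumed_p_heads = 0.5      # assumed population parameter (illustrative)
sample_size = 100          # size of each simulated sample (illustrative)
num_simulations = 10_000   # how many samples we draw under the null


def simulate_one_statistic():
    """Draw one sample from the assumed population and return its statistic."""
    flips = rng.random(sample_size) < assumed_p_heads  # True means heads
    return flips.sum()  # test statistic: number of heads in this sample


# One test statistic per simulated sample, all generated under the null.
test_statistics = np.array(
    [simulate_one_statistic() for _ in range(num_simulations)]
)
```

Each entry of `test_statistics` is the statistic of one simulated sample, so a histogram of this array would be an approximation of the distribution of test statistics under the null.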

Let's take a look at the different values of test statistics that resulted from our simulated samples -- this {dterm}`sampling distribution` is called the distribution of test statistics under the null.

![](../images/simulated-statistics.png)

Do the simulated test statistics ever look as extreme as the one we originally observed? In the illustration above, yes -- but with what probability? Using the frequentist definition of probabilities, we define $p$ from above as the *proportion of simulated test statistics that are equal to the observed value or even more extreme (in the direction of the alternative hypothesis)*.

![](../images/p-value.png)

\`\`\`{note}

Once again, we give way to the formal language of statistics. This proportion $p$ has a name: the {dterm}`p-value`. It can be mathematically defined as the probability that a statistic generated under the population assumptions of the null hypothesis will be at least as extreme as the observed statistic.

The extremity of a test statistic is based on our alternate hypothesis. If $H_1: \text{param} < value$, then lower statistics are more extreme; if $H_1$ proposes $>$, then higher statistics are more extreme; and if $\neq$ is stipulated, then a statistic far away from the assumed parameter in either direction is considered extreme.

In the illustration above, we're clearly interested in the alternate hypothesis suggesting that the true parameter is greater than the assumed value, so the mathematical notation for our p-value is
$$\text{p-value} = P(\text{test statistic generated under the null} \geq \text{observed statistic})$$

\`\`\`
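The three notions of "extreme" can be sketched as a small helper that turns simulated test statistics into a p-value. The function name `empirical_p_value`, its parameters, and the toy numbers are all hypothetical. The two-sided case also assumes that the statistic and the assumed value are on the same scale, so that distance from the assumed value is meaningful.

```python
import numpy as np


def empirical_p_value(test_statistics, observed, assumed_value, alternative):
    """Proportion of simulated statistics at least as extreme as `observed`.

    `alternative` is one of "less", "greater", or "two-sided" and plays
    the role of the direction stated by the alternate hypothesis.
    """
    test_statistics = np.asarray(test_statistics)
    if alternative == "greater":
        extreme = test_statistics >= observed
    elif alternative == "less":
        extreme = test_statistics <= observed
    elif alternative == "two-sided":
        # Extreme means at least as far from the assumed value, either way.
        extreme = np.abs(test_statistics - assumed_value) >= abs(
            observed - assumed_value
        )
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return extreme.mean()  # the mean of True/False values is a proportion


# Toy numbers, made up for illustration only.
simulated = [48, 50, 52, 55, 60, 45, 50, 51, 49, 58]
p_greater = empirical_p_value(simulated, 55, 50, "greater")      # 0.3
p_less = empirical_p_value(simulated, 55, 50, "less")            # 0.8
p_two_sided = empirical_p_value(simulated, 55, 50, "two-sided")  # 0.4
```

In practice the array of simulated statistics would hold thousands of values generated under the null rather than ten hand-picked ones.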

## Making judgements with the p-value

Once we've found the p-value -- the probability that a statistic at least as extreme as the one we originally observed could have been generated by a population fitting our null assumptions -- it's time to be judgemental.

Are those null assumptions still holding up? Depends on how big our p-value is, and how much uncertainty we're willing to accept in our lives.

- **If the p-value is really small** it suggests that it's really unlikely for the assumed population to produce such extreme sample statistics -- not impossible, just so rare that we think there are flaws with our assumptions.

- **If the p-value isn't so small** then it's still conceivable that random chance could have been the sole reason for observing a sample statistic different from the population parameter -- after all, statistics that extreme showed up in $p\%$ of our simulated samples!

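- In the language of statistics, a really small p-value leads us to *reject the null*. Otherwise, we *fail to reject the null*, which only means we lack the evidence to abandon our assumptions, not that we have proven them true.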

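But how 'small' is 'really small'?
- The cutoff is set ahead of time, before we look at the p-value. Where we put it depends on conventions (5% and 1% are common choices), on how risky a wrong decision would be, and on our personal belief about how much can be blamed on randomness.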
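As a rough sketch, the decision itself is a single comparison between the p-value and a cutoff chosen ahead of time. The function name `decide`, the default cutoff of `0.05`, and the choice to reject when the p-value is less than or equal to the cutoff are assumptions for illustration. Conventions differ, and some sources use a strict inequality.

```python
def decide(p_value, cutoff=0.05):
    """Compare a p-value against a cutoff that was fixed ahead of time.

    The default of 0.05 is only a common convention, not a rule.
    """
    if not 0 <= cutoff <= 1:
        raise ValueError("cutoff must be a proportion between 0 and 1")
    if p_value <= cutoff:
        return "reject the null"
    return "fail to reject the null"


small_p_decision = decide(0.003)               # a really small p-value
large_p_decision = decide(0.30)                # not so small
strict_decision = decide(0.03, cutoff=0.01)    # a stricter cutoff
```

Notice that the same p-value of 0.03 would be rejected at a 5% cutoff but not at a 1% cutoff, which is why the cutoff has to be settled before we look at the result.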

## Using different test statistics

- The hypothesis test framework works in general on really anything you throw at it.
- It handles different types of data (continuous, discrete) and different statistics -- we could check assumptions about the *max* of a population, the *mean*, the *number of sampled individuals that equal a certain value* -- really any statistic (think: number) that can be distilled from a sample.
- It all follows the same principle: if we generate data using those assumptions, we should expect the simulated data to look like what we actually observed in real life -- otherwise those assumptions should be called into question.
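To make the "any statistic" idea concrete, here is a small sketch where the simulation loop stays the same and only the statistic changes. The helper `simulate_under_null`, the fair six-sided die population, and the sample size of 20 are hypothetical choices for illustration.

```python
import numpy as np


def simulate_under_null(draw_sample, statistic, num_simulations=5_000):
    """Draw many samples from the assumed population and apply `statistic`
    to each one, returning one test statistic per simulated sample."""
    return np.array([statistic(draw_sample()) for _ in range(num_simulations)])


rng = np.random.default_rng(seed=1)


def draw_sample():
    """Hypothetical null: 20 rolls of a fair six-sided die."""
    return rng.integers(1, 7, size=20)  # upper bound 7 is exclusive


# Same framework, three different statistics distilled from each sample.
max_stats = simulate_under_null(draw_sample, np.max)
mean_stats = simulate_under_null(draw_sample, np.mean)
six_counts = simulate_under_null(draw_sample, lambda rolls: np.sum(rolls == 6))
```

Each resulting array is a distribution of test statistics under the null, and an observed statistic can be compared against it in exactly the same way, whichever statistic we picked.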

## Follow-up questions

### Why don't we calculate the probability of equalling the observed value?

### What if I reject the null but it's actually true?