<img src="./img/HWNI_logo.svg"/>

# Tutorial - Hypothesis Testing

This tutorial covers the basics of hypothesis testing, including null hypotheses, error types, error rates, and $p$-values. These will require us to develop some fundamental ideas from probability, including joint, marginal, and conditional probabilities.

### Hypothesis Testing is a Kind of Decision-Making Under Uncertainty

In the lab on inferential statistics, we discussed one class of inference problems: trying to infer the value of a parameter, like the average value or spread of some random variable in a population.

We stated that the purpose of such inferences was to guide decision making. For example, we might measure the average response of a sick population to a candidate drug and then use that value to determine whether to prescribe the drug or not. We left unsaid, however, just exactly how statistical inferences are to be used.

In this tutorial, we'll work through how to use statistical inferences to guide the simplest kinds of decisions: yes-or-no decisions, also known as *binary* decisions, since there are two choices. We'll focus on the yes-or-no decision of most interest to scientists: is my hypothesis true or not?

### Binary Hypothesis Testing

In an experimental science context, a binary hypothesis is an answer to a question that usually looks something like: "does this intervention have an effect?" Examples include:

- Do neurons subjected to trauma express different levels of protein XYZ?
- Does adding distractors increase reaction time in healthy human subjects?
- Does optogenetically stimulating neural circuit A change the activity of neural circuit B?

Each of these questions has a yes or no answer: "yes, the intervention has an effect" or "no, the intervention has no effect". We call these answers *hypotheses*. The hypothesis that the intervention has no effect is called the *null hypothesis*, while the hypothesis that the intervention has an effect is called the *alternative hypothesis*. These are frequently written as $H_0$ and $H_A$, with $0$ and $A$ standing for "null" and "alternative".

When we answer a binary question, there are two possible answers: yes and no, which we call the "positive" and "negative" answer. Furthermore, either the alternative or the null hypothesis could be true. Therefore, there are four possibilities, which appear in the table below:

|                    |  $H_A$ is True | $H_0$ is True  |
|:------------------:|:--------------:|:--------------:|
| **We claim $H_A$** |  True Positive | False Positive |
| **We claim $H_0$** | False Negative | True Negative  |

The nomenclature for each of these four events is intuitive: the first word is "true" or "false" depending on whether our answer was correct or incorrect (not, e.g., whether the alternative hypothesis is true or false) and the second word is "positive" or "negative" depending on what we claimed.
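
The naming rule can be written out as a small function. This is only an illustrative sketch: the function name `classify_outcome` and its boolean arguments are hypothetical, chosen here to mirror the rows and columns of the table above.

```python
def classify_outcome(claim_alternative: bool, alternative_is_true: bool) -> str:
    """Name one of the four events in the table above."""
    # First word: was our claim correct?
    first = "True" if claim_alternative == alternative_is_true else "False"
    # Second word: what did we claim?
    second = "Positive" if claim_alternative else "Negative"
    return f"{first} {second}"


for claim in (True, False):
    for truth in (True, False):
        print(f"claim H_A={claim}, H_A true={truth}: {classify_outcome(claim, truth)}")
```

Note that the first word depends on both arguments, while the second depends only on the claim.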

In statistical hypothesis testing, we take some data and use that to determine whether we claim $H_A$ or $H_0$. The key insight that lets us think rigorously about statistical hypothesis testing is to treat both the correct answer (columns in the above table) and the answer we give (rows in the above table) as uncertain quantities, as *random variables*.

It may seem strange to think of our claims about the world as being random, since random is used colloquially to mean "arbitrary" or "without structure or meaning". But recall that anything we calculate from randomly-sampled data "inherits" some randomness from that data. Put another way, the result of calculating something based on data is a statistic, and statistics have sampling distributions.
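
To see how a claim "inherits" randomness, here is a minimal simulation sketch. Everything specific in it is an assumption made for illustration and does not come from the text: the Gaussian data, the effect size, the sample size, and the toy decision rule "claim $H_A$ when the sample mean exceeds a threshold".

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def run_experiment(n_samples=25, true_effect=0.3, threshold=0.3):
    """Draw one random sample and return True if we would claim H_A.

    The decision rule (sample mean above a threshold) is a toy stand-in
    for a real statistical test.
    """
    sample = [random.gauss(true_effect, 1.0) for _ in range(n_samples)]
    return statistics.mean(sample) > threshold


# Repeat the same experiment many times: the underlying truth never changes,
# but the claim we make varies from sample to sample.
claims = [run_experiment() for _ in range(1000)]
n_positive = sum(claims)
print(f"claimed H_A in {n_positive} of {len(claims)} repeated experiments")
```

The claim is a statistic computed from the sample, so it has a sampling distribution of its own: here, a distribution over the two values "claim $H_A$" and "claim $H_0$".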

The "correct answer" also isn't random in the sense that most people think of randomness. However, it's also not random in the sense described above: a statement like "this intervention has an effect" is either true or false, and it doesn't change whether it's true or false depending on data. Instead, we recognize that we aren't entirely certain whether "this intervention has an effect" is true or not, and we instead write down a number that captures the degree to which we believe that the statement is true. We call this number the probability that the statement is true.

The view that randomness arises from sampling is a core component of the *frequentist* view of statistics. The view that randomness arises from uncertainty is a core component of the *Bayesian* view of statistics. We'll avoid being dogmatic in this course, and switch freely between the views whenever one or the other is simpler.

### Joint Probabilities

Now that we have two different random variables, the outcome of our testing procedure and the ground truth, we can think of the probability that any pair of events occurs, where the first element of a pair comes from the first random variable and the second element comes from the second random variable.

If we shorten the events to $+$ and $-$ for the outcome and $T$ and $F$ for the correct answer (where $T$ means the alternative hypothesis is true), we write a table just like the one above to store the probabilities of pairs of events:

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td>&nbsp; $p(+,T)$ &nbsp;</td>
      <td>&nbsp; $p(+,F)$ &nbsp;</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td>&nbsp; $p(-,T)$ &nbsp;</td>
      <td>&nbsp; $p(-,F)$ &nbsp;</td>
    </tr>
  </tbody>
</table>


where $p(+,T)$ should be read as "the probability the test is positive and the alternative hypothesis is true". Since these probabilities tell us the chance that two events *both occur*, we call them *joint probabilities*. The table above is called a *joint probability table*. The information it stores is called a *joint probability distribution*. In this case, the distribution is a mass function.

A joint probability distribution is a powerful thing -- if we had access to all of the numbers in the joint probability tables for our experiments, it'd make statistical test design much easier! Unfortunately, constructing these tables can be very difficult and involves a degree of subjectivity. 

For now, let's assume a God's eye view, where we know all of these numbers, and see what we can figure out. We'll use a running example with the joint probability table below. Notice that these numbers add up to 1 -- that's what makes the values a valid probability distribution.

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td>$0.5$</td>
      <td>$0.15$</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td>$0.05$</td>
      <td>$0.3$</td>
    </tr>
  </tbody>
</table>
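
As a minimal sketch, the running example can be stored as a Python dictionary keyed by (test outcome, truth value) pairs and checked for validity. The variable names here are illustrative choices, not part of the tutorial.

```python
import math

# Joint probability table from the running example.
# Keys are (test outcome, truth value of the alternative hypothesis).
joint = {
    ("+", "T"): 0.50,
    ("+", "F"): 0.15,
    ("-", "T"): 0.05,
    ("-", "F"): 0.30,
}

total = sum(joint.values())
# A valid distribution has non-negative entries that sum to 1.
# math.isclose guards against floating-point rounding in the sum.
is_valid = all(p >= 0 for p in joint.values()) and math.isclose(total, 1.0)
print(f"total probability: {total:.2f}, valid: {is_valid}")
```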


### Marginal Probabilities

First, we can use the joint probability table to figure out the probabilities of the individual random variables that make up the table. For example, we can figure out the probability that the alternative hypothesis is true.

We do this by simply adding up the probabilities of all events in which the alternative hypothesis is true. In this case, there are two such events: the alternative hypothesis is true and we claim it is true $(+,T)$ and the alternative hypothesis is true and we claim it is false $(-,T)$. These events correspond to the cells in the first column. Similarly, we can calculate the probability that the alternative hypothesis is false by adding up the values in the second column or calculate the probability of each outcome of our test by adding up the appropriate row.

Below, you'll find these values worked out for our example table. There, as is traditional, the probability of an event is written at the end of the column or row corresponding to that event. These areas are called the *margins* and so these probabilities are called *marginal probabilities*. Notice that if we add up the marginal probabilities along a row or column (the numbers with a particular background color), the result is 1. That means these are probability distributions -- the marginal probability distributions of the test outcome and the alternative hypothesis.

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
      <th style="background-color: rgb(255,204,204);"> &nbsp; $p(\text{test})$ &nbsp; </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td>$0.5$</td>
      <td>$0.15$</td>
      <td style="background-color: rgb(255,204,204);">$0.65$</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td>$0.05$</td>
      <td>$0.3$</td>
      <td style="background-color: rgb(255,204,204);">$0.35$</td>
    </tr>
     <tr>
      <td style="background-color: rgb(204,204,255);"> &nbsp; $p(H_A)$ &nbsp; </td>
      <td style="background-color: rgb(204,204,255);">$0.55$</td>
      <td style="background-color: rgb(204,204,255);">$0.45$</td>
    </tr>
  </tbody>
</table>
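
The same bookkeeping can be sketched in a few lines of Python. The dictionary layout and variable names are assumptions made for illustration.

```python
# Joint probability table from the running example,
# keyed by (test outcome, truth value of the alternative hypothesis).
joint = {
    ("+", "T"): 0.50,
    ("+", "F"): 0.15,
    ("-", "T"): 0.05,
    ("-", "F"): 0.30,
}

p_test = {"+": 0.0, "-": 0.0}    # marginal over test outcomes (row sums)
p_truth = {"T": 0.0, "F": 0.0}   # marginal over truth values (column sums)

# Each joint entry contributes to exactly one row sum and one column sum.
for (outcome, truth), p in joint.items():
    p_test[outcome] += p
    p_truth[truth] += p

print({k: round(v, 2) for k, v in p_test.items()})   # {'+': 0.65, '-': 0.35}
print({k: round(v, 2) for k, v in p_truth.items()})  # {'T': 0.55, 'F': 0.45}
```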


There's nothing magical happening here. We could've made the rows correspond to a different random variable, like whether it's raining in Kansas or whether Mercury is in retrograde. After all, the total chance that the alternative hypothesis is true is equal to the chance that it is true and Mercury is in retrograde plus the chance that it is true and Mercury is not in retrograde (it might be helpful to think in terms of frequencies here). Calculating marginal probabilities from a joint probability table is just an accounting trick to make more obvious the information that's already in the table.



The marginal probability distribution of the alternative hypothesis is of particular importance. It tells us what we believe about the world *before* we take into account the result of our statistical test. For this reason, it's called the *prior* probability of the hypothesis. There is some degree of subjectivity in setting prior probabilities, since they arise from complex, fuzzy factors related to our fundamental beliefs about the world (are our educated guesses usually right or usually wrong? is nature usually simple or usually messy?) and the combined results in the literature.

### Conditional Probabilities

The joint probability table told us the chance that any particular pair of events occurs, while the marginal probability distributions told us the chance that any individual event occurs, irrespective of the other variable in the pair.

Often, however, we know the value of one of the variables. For example, once we've run the statistical test, we know what the outcome turned out to be. Other times, we would like to assume that one of the variables takes on a particular value (e.g. that the null hypothesis is true) and determine how the probability of each outcome for the other variable has changed. Do we need to throw out our old, possibly hard-won joint probability table and start over?

Luckily, the answer is no. Using the joint probability table, we can construct two new tables, which tell us the probability distribution of one of the random variables for fixed values of the other. Because the probabilities in these tables only pertain when a certain *condition* is satisfied (these probabilities are *conditional* on the other random variable having a certain value), they are called *conditional probability tables*.

How do we determine the values in these tables? Consider the right column of the joint probability table above, corresponding to all cases where the alternative hypothesis is false. This column *almost* tells us the conditional probabilities. For example, we can readily see that when the alternative hypothesis is false, the probability that the test comes up negative is twice the probability that it comes up positive ($0.3\ =\ 2\cdot0.15$).

However, $0.3 + 0.15$ doesn't equal $1$ -- it's equal to $0.45$ -- so we can't use those numbers directly as conditional probabilities. If we divide them by their sum, though, they'll add up to $1$:

$$
\frac{0.3}{0.3+0.15} + \frac{0.15}{0.3+0.15} = \frac{0.3+0.15}{0.3+0.15} = 1
$$

Put another way, the rows and columns of the joint probability table are like *un-normalized* conditional distributions -- distributions that don't add up to one. To make them into proper distributions, we need to normalize them by dividing by their sums, which happen to correspond to the marginal probabilities.
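
In symbols, the normalization for the right column of the running example can be written as follows, using a vertical bar to mean "conditioned on" and writing $p(F)$ for the marginal probability that the alternative hypothesis is false (the column sum):

$$
p(-\vert F) = \frac{p(-,F)}{p(F)} = \frac{0.3}{0.45} \approx 0.67, \qquad p(+\vert F) = \frac{p(+,F)}{p(F)} = \frac{0.15}{0.45} \approx 0.33
$$

The ratio between the two entries is unchanged by the division: the negative outcome is still twice as probable as the positive one.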

The two conditional probability tables for our running example appear below. One corresponds to viewing the rows of the joint table as un-normalized distributions (and so conditioning on test outcome, the random variable in the rows of the table) while the other corresponds to viewing the columns as un-normalized distributions (and so conditioning on the truth value of the alternative hypothesis, the random variable in the columns of the table).

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td style="background-color: rgb(255,204,204);">0.77</td>
      <td style="background-color: rgb(255,204,204);">0.23</td>
      <td style="background-color: rgb(255,204,204);">$p(H_A\lvert +)$</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td style="background-color: rgb(204,204,255);">0.14</td>
      <td style="background-color: rgb(204,204,255);">0.86</td>
      <td style="background-color: rgb(204,204,255);">$p(H_A\lvert -)$</td>
    </tr>
  </tbody>
</table>

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td style="background-color: rgb(204,204,255);">0.91</td>
      <td style="background-color: rgb(255,204,204);">0.33</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td style="background-color: rgb(204,204,255);">0.09</td>
      <td style="background-color: rgb(255,204,204);">0.67</td>
    </tr>
    <tr>
      <td ></td>
      <td style="background-color: rgb(204,204,255);">&nbsp;$p(\text{test}\lvert T)$&nbsp;</td>
      <td style="background-color: rgb(255,204,204);">&nbsp;$p(\text{test}\lvert F)$&nbsp;</td>
    </tr>
  </tbody>
</table>



The vertical bar, $\vert$, is pronounced "conditioned on". One would read the expression $p(\text{test}\vert T)$ as "the conditional probability distribution of the test outcome given that the alternative hypothesis is true".

Note one important difference between a conditional probability table and a joint probability table: while the latter is a distribution, the former is NOT. For example, the entries of a conditional probability table don't add up to 1. Instead, each row or column of a conditional probability table adds up to 1. A conditional probability table is a collection of distributions, with one distribution for each value of the variable being conditioned on.
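
This distinction can be checked numerically. The sketch below builds the row-wise conditional table for the running example and compares the sum of each row with the sum of the whole table. The dictionary layout and names are illustrative assumptions.

```python
import math

# Joint probability table from the running example,
# keyed by (test outcome, truth value of the alternative hypothesis).
joint = {
    ("+", "T"): 0.50,
    ("+", "F"): 0.15,
    ("-", "T"): 0.05,
    ("-", "F"): 0.30,
}

# Row-wise conditional table: p(truth | outcome), keyed by (truth, outcome).
truth_given_test = {}
for outcome in ("+", "-"):
    row_total = sum(joint[(outcome, truth)] for truth in ("T", "F"))
    for truth in ("T", "F"):
        truth_given_test[(truth, outcome)] = joint[(outcome, truth)] / row_total

# Each row is its own distribution and sums to 1 ...
row_sums = {
    outcome: sum(truth_given_test[(truth, outcome)] for truth in ("T", "F"))
    for outcome in ("+", "-")
}
# ... so the table as a whole sums to 2, not 1.
table_sum = sum(truth_given_test.values())

print({k: round(v, 2) for k, v in truth_given_test.items()})
print("row sums:", {k: round(v, 2) for k, v in row_sums.items()})
print("table sum:", round(table_sum, 2))
```

The column-wise table is built the same way, dividing each entry by its column sum instead of its row sum.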

Because of this distinction, there are several entities that end up getting called "the conditional probability". For example, the first table above is "the conditional probability of the alternative hypothesis given the test outcome". The first row in that table is "the conditional probability of the alternative hypothesis given that the test is positive". The first cell in that row is "the conditional probability that the alternative hypothesis is true given that the test is positive".

In English, this distinction is clear enough, but unfortunately the usual mathematical notation for all three of the above is $p(x\vert y)$, with the meaning depending on which of $x$, $y$, or both are outcomes (e.g. "test is positive") and which are random variables (e.g. "the outcome of the test").

### Conditional Probabilities and Hypothesis Testing

We introduced joint probabilities in order to understand hypothesis testing. Now that we are armed with the two conditional probability tables associated with the joint probability table, we can start to dive deeper. Let's start with the row-wise conditional probability distributions.

#### The "Test-Interpretation" Table

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td style="background-color: rgb(255,204,204);">&nbsp;$p(T\lvert +)$&nbsp;</td>
      <td style="background-color: rgb(255,204,204);">&nbsp;$p(F\lvert +)$&nbsp;</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td style="background-color: rgb(204,204,255);">&nbsp;$p(T\lvert -)$&nbsp;</td>
      <td style="background-color: rgb(204,204,255);">&nbsp;$p(F\lvert -)$&nbsp;</td>
    </tr>
  </tbody>
</table>

Because these probabilities are conditioned on the outcome of a test, they tell us how to interpret the results of a test that we have performed.

Consider the top row of this table. This row tells us the conditional probability of the alternative hypothesis when we've gotten a positive test result. Notice that $p(T\vert +)$ isn't $1$ -- a positive test result doesn't mean that we are now 100% certain that the alternative hypothesis is true. In the case of our running example, it's $0.77$.

Recall that we already had a number that reflected our belief that the alternative hypothesis is true: the prior probability of the alternative hypothesis. The conditional probability table above tells us how we should update that belief when we see the result of the test. Since these probabilities come after we collect data and perform a test, they are called *posterior* probabilities.
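
For the running example, the update can be worked explicitly. The prior is the marginal probability $p(T) = 0.55$, and the posteriors come from normalizing the rows of the joint table by the marginal probabilities of the test outcomes:

$$
p(T\vert +) = \frac{p(+,T)}{p(+)} = \frac{0.5}{0.65} \approx 0.77, \qquad p(T\vert -) = \frac{p(-,T)}{p(-)} = \frac{0.05}{0.35} \approx 0.14
$$

So a positive result raises our belief in the alternative hypothesis from $0.55$ to about $0.77$, while a negative result lowers it to about $0.14$.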

Posterior and conditional probabilities are very general concepts. Because of the importance of binary hypothesis testing, the conditional probabilities in the table above have special names that capture their role in interpreting tests. Those names are:

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td style="background-color: rgb(255,204,204);">Positive Predictive Value</td>
      <td style="background-color: rgb(255,204,204);">False Discovery Rate</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td style="background-color: rgb(204,204,255);">False Omission Rate</td>
      <td style="background-color: rgb(204,204,255);">Negative Predictive Value</td>
    </tr>
  </tbody>
</table>

where coloring, as above, indicates conditioning on the same value. Because of this, two named quantities with the same background color must add up to 1, and so knowing one tells you the other. Depending on the context of the problem, one will be easier or harder to think about.

The *positive predictive value* was described above. The term *false discovery rate* arises from considering what would happen if we ran a particular test many times. The false discovery rate tells us the fraction of our positives that would be false positives (or false *discoveries*, when the alternative hypothesis being true means the discovery of a new phenomenon or drug).

The values in the second row mirror the values in the first row. The *negative predictive value* tells us the posterior probability that a negative test result reflects the truth -- higher values mean that negative results on the test are more meaningful. The *false omission rate* is akin to the false discovery rate, but it tells us the fraction of our negatives that are false negatives, or incorrect omissions of certain phenomena or candidate drugs from our list of real or effective ones.
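
As a sketch, the four named quantities can be computed for the running example directly from the joint table. The dictionary layout and variable names are illustrative assumptions.

```python
# Joint probability table from the running example,
# keyed by (test outcome, truth value of the alternative hypothesis).
joint = {
    ("+", "T"): 0.50,
    ("+", "F"): 0.15,
    ("-", "T"): 0.05,
    ("-", "F"): 0.30,
}

# Marginal probabilities of the test outcomes (row sums).
p_pos = joint[("+", "T")] + joint[("+", "F")]
p_neg = joint[("-", "T")] + joint[("-", "F")]

# Every quantity here is conditioned on the test outcome.
interpretation = {
    "positive predictive value": joint[("+", "T")] / p_pos,
    "false discovery rate": joint[("+", "F")] / p_pos,
    "false omission rate": joint[("-", "T")] / p_neg,
    "negative predictive value": joint[("-", "F")] / p_neg,
}

for name, value in interpretation.items():
    print(f"{name}: {value:.2f}")
```

Quantities conditioned on the same outcome sum to 1, so the false discovery rate is one minus the positive predictive value, and the false omission rate is one minus the negative predictive value.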

Next, let's consider the column-wise conditional probability distributions.

#### The "Test-Design" Table

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td style="background-color: rgb(204,204,255);">&nbsp;$p(+\lvert T)$&nbsp;</td>
      <td style="background-color: rgb(255,204,204);">&nbsp;$p(+\lvert F)$&nbsp;</td>
    </tr>
     <tr>
      <td >$-$</td>
      <td style="background-color: rgb(204,204,255);">&nbsp;$p(-\lvert T)$&nbsp;</td>
      <td style="background-color: rgb(255,204,204);">&nbsp;$p(-\lvert F)$&nbsp;</td>
    </tr>
  </tbody>
</table>

Because these probabilities are conditioned on whether the alternative hypothesis is true or false, they don't tell us how to interpret the results of a test. Instead, they tell us how the test will perform in situations where the hypothesis is true and where it is false. These probabilities are useful for folks who design statistical tests: without having to worry about whether the alternative hypothesis is likely to be true or false, they can confirm that their test is useful by ensuring that the diagonal elements of the table are large.

As above, these quantities have special names to distinguish them from run-of-the-mill conditional probabilities. Because they are more commonly used and used by a wide array of disciplines, they have multiple names, the most common of which appear below.

<table>
  <tbody>
    <tr>
      <th> </th>
      <th > $T$ </th>
      <th > $F$ </th>
    </tr>
    <tr>
      <td >$+$</td>
      <td style="background-color: rgb(204,204,255);"> True Positive Rate, Power, Sensitivity </td>
      <td style="background-color: rgb(255,204,204);"> False Positive Rate, $\alpha$ </td>
    </tr>
     <tr>
      <td >$-$</td>
      <td style="background-color: rgb(204,204,255);"> False Negative Rate, $\beta$</td>
      <td style="background-color: rgb(255,204,204);"> True Negative Rate, Specificity</td>
    </tr>
  </tbody>
</table>

Take care when interpreting the terms that end in rate, like "true positive rate" and "false negative rate". The temptation is to interpret them as referring to the fraction of your tests that are true positives/false negatives. This is incorrect. Instead, these rates tell you the fractions of such tests *in situations where the alternative hypothesis is true*. That is, if we know the alternative hypothesis is true, then we can use the true positive rate to tell us what fraction of our tests should be true positives. Likewise, the false positive rate and true negative rate describe tests in situations where the null hypothesis is true.
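
The difference is easy to see in the running example. The sketch below computes the test-design quantities from the joint table and sets the true positive rate next to the fraction of all tests that are true positives. The dictionary layout and variable names are illustrative assumptions.

```python
# Joint probability table from the running example,
# keyed by (test outcome, truth value of the alternative hypothesis).
joint = {
    ("+", "T"): 0.50,
    ("+", "F"): 0.15,
    ("-", "T"): 0.05,
    ("-", "F"): 0.30,
}

# Marginal probabilities of the truth values (column sums).
p_true = joint[("+", "T")] + joint[("-", "T")]
p_false = joint[("+", "F")] + joint[("-", "F")]

# Every quantity here is conditioned on the truth value of H_A.
design = {
    "true positive rate (power)": joint[("+", "T")] / p_true,
    "false negative rate (beta)": joint[("-", "T")] / p_true,
    "false positive rate (alpha)": joint[("+", "F")] / p_false,
    "true negative rate (specificity)": joint[("-", "F")] / p_false,
}

for name, value in design.items():
    print(f"{name}: {value:.2f}")

# The joint probability is the fraction of ALL tests that are true positives.
fraction_true_positive = joint[("+", "T")]
print(f"fraction of all tests that are true positives: {fraction_true_positive:.2f}")
```

In this example the true positive rate is about $0.91$, while only half of all tests are true positives.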

At first, these numbers seem to be of limited use for scientists, since we certainly wouldn't be doing experiments if we knew whether the hypothesis was true or false!

The utility of this table is that it doesn't require us to specify $p(H_A)$, so we can avoid all of the difficulties described above for figuring out prior probabilities. The rightmost column of this table is particularly easy to calculate, so it has long dominated the design of hypothesis tests. To make things a bit more concrete, we now turn to how we calculate the values in this column and use them to design a test.
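
To illustrate why this separation matters, the sketch below holds the test-design quantities of the running example fixed and varies the prior. It relies on the fact that a joint probability is a conditional probability times the corresponding marginal, e.g. $p(+,T) = p(+\vert T)\,p(T)$. The helper name `ppv_from_design` and the priors $0.1$ and $0.9$ are hypothetical values chosen for illustration; only $0.55$ comes from the running example.

```python
def ppv_from_design(power: float, alpha: float, prior: float) -> float:
    """Positive predictive value implied by a test design and a prior p(H_A)."""
    true_positive = power * prior          # p(+, T) = p(+ | T) * p(T)
    false_positive = alpha * (1 - prior)   # p(+, F) = p(+ | F) * p(F)
    return true_positive / (true_positive + false_positive)


# Test-design quantities from the running example.
power = 0.50 / 0.55   # p(+ | T), about 0.91
alpha = 0.15 / 0.45   # p(+ | F), about 0.33

priors = [0.1, 0.55, 0.9]
ppvs = [ppv_from_design(power, alpha, prior) for prior in priors]

for prior, ppv in zip(priors, ppvs):
    print(f"prior p(H_A) = {prior:.2f} -> positive predictive value = {ppv:.2f}")
```

The power and $\alpha$ stay the same throughout, but the interpretation of a positive result changes with the prior. A test can be designed without a prior, but its results cannot be interpreted without one.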

### Designing a Hypothesis Test: The Easy Way