<img src="../../shared/img/slides_banner.svg" width=2560></img>

# 05a - Null Hypothesis Significance Testing

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import random

import daft
from IPython.display import Image, YouTubeVideo
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

This week, we put on our frequentist hats:
we are going to study the most common method for frequentist inference,
_null hypothesis significance testing_.

Next week, we'll take a Bayesian approach to similar problems.

Warning: as indicated below, these ideas are notoriously unintuitive and easy to confuse yourself with.

So don't be discouraged if you find this material confusing.
There is a lot of other tutorial material online
which might be helpful.

## Null Hypothesis Significance Testing

0. Collect data from an experiment and compute a statistic on that data.

1. Come up with a model of "nothing interesting is happening in my data".
    - This model can be resampling-based, mathematical, or in `pyMC`
    - It is called the _null model_. The hypothesis that it is true is the _null hypothesis_.

Mathematical models are the classical approach.
Bootstrapping is a type of resampling, but it won't be the type we use here.
In general, a `pyMC` model can be replaced with any _generative_ model:
any model from which you can generate samples.

2. Obtain the sampling distribution of the statistic from the model
    - by literally sampling from it,
        - either resampling or with `pm.sample`
    - or by clever mathematical manipulation

3. Compare the value of the statistic observed on the data to the sampling distribution.
   - If the observed value is "too extreme", the test is positive and the result is "statistically significant".

## The Intellectual Heritage of Null Hypothesis Significance Testing

The formal apparatus of null hypothesis significance testing dates to the early-to-mid 20th century.

This was an important moment in the history of science:
for the preceding century, major discoveries in the laws of nature were made regularly,
and they were enabling rapid technological progress.

And furthermore, disciplines like physics had gone through several iterations of discovering "laws of nature"
and then realizing they needed to be amended:
what was once absolute truth was replaced with a new absolute truth, which was itself replaced.

This inspired philosophers, and philosophically-minded scientists,
to try to formalize and explain the process by which science produced knowledge,
since the kind of knowledge it produced was clearly not absolute.

> Since we can never know anything for sure, it is simply not worth searching for certainty; but it is well worth searching for truth; and we do this chiefly by searching for mistakes, so that we have to correct them

Karl Popper, _In Search of a Better World_, 1984

A hypothesis is _falsifiable_ if it can be proven false.

A hypothesis is _verifiable_ if it can be proven true.

In this presentation of the scientific method,
hypotheses must be _falsifiable_ but need not be _verifiable_,
and the goal is to _falsify_ rather than _verify_.

### Falsifiability

> In so far as a scientific statement speaks about reality, it must be falsifiable: and in so far as it is not falsifiable, it does not speak about reality.

Karl Popper, _The Logic of Scientific Discovery_, 1934

#### Unfalsifiable Example 1: Russell's Teapot

Bertrand Russell proposed, as an example of an unfalsifiable statement, the following:

> If I were to suggest that between the Earth and Mars there is a china teapot revolving about the sun in an elliptical orbit, nobody would be able to disprove my assertion provided I were careful to add that the teapot is too small to be revealed even by our most powerful telescopes.

Bertrand Russell, unpublished article for _Illustrated_ magazine, 1952.

In [None]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/7/70/Teapot_in_space.jpg", width=400)

#### Unfalsifiable Example 2: The Ptolemaic Model

For 1500 years, the cosmological model that was
most widely believed by astronomers and astrologers was the
[Ptolemaic model](https://en.wikipedia.org/wiki/Geocentric_model#Ptolemaic_model).

In [None]:
YouTubeVideo("EpSy0Lkm3zM", width=600, height=450)

In this model, planets move in "double orbits": 
a "larger orbit" around (a point close to) the Earth
and a "smaller orbit" around the path of that orbit.
See the animation above.
The latter is called an _epicycle_:
a cycle on top of a cycle.

Imagine that each planet is being rotated by a gear
which is itself on a rotating gear.
This is how [some mechanical planetaria operate](https://en.wikipedia.org/wiki/Planetarium#Traditional_electromechanical/optical_projectors).

This model explains the orbits of the planets
well enough to match data based on observation by eye.

This was the only data available until the revolution in optical technology
around the turn of the 17th century.
Finding a simple model to fit to this data
led to the theories of [Kepler](https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion)
and Newton.

However, it can be shown that,
if one merely adds enough additional epicycles
(epiepicycles, epiepiepicycles, and so on), 
the shape of _any orbit_ can be explained,
and so this version of the Ptolemaic model cannot be falsified.

In [None]:
YouTubeVideo("QVuU2YCwHjw", start=20, rel=0, autoplay=1, width=600, height=450)

NB: this isn't exactly how the Ptolemaic model worked
(there was only one epicycle),
but it would be a way to "extend" the Ptolemaic model to fit any data.

For more, check out [jezzamon.com](http://www.jezzamon.com/fourier/), which has a nice intuitive and visual introduction to the Fourier transform, or this [Mathologer video](https://www.youtube.com/watch?v=qS4H6PEcCCA) explaining the Homer Simpson example.

#### Falsifiable Example 1: Swans

As an example of a falsifiable claim, consider the claim that "All Swans are White".

Over roughly the same 1500 years that the Ptolemaic model held sway,
this claim was widely believed in Europe, even a proverbial truth, like $2+2=4$.

In [None]:
Image("https://upload.wikimedia.org/wikipedia/commons/c/ce/Black_Swan_at_Martin_Mere.JPG", width=250)

The [arrival of Dutch explorer Willem de Vlamingh in Australia in 1697](https://en.wikipedia.org/wiki/Black_swan_emblems_and_popular_culture#European_myth_and_metaphor),
where he observed the [black swan](https://en.wikipedia.org/wiki/Black_swan),
falsified that belief.

### Statistical Claims are not Logically Falsifiable

The claims we make with statistics are less like "All Swans are White"
and more like "It is Unlikely we would Observe a Black Swan".

The observation of a single black swan obviously does not invalidate this claim at the level of logic.

Importantly, the observation of hundreds of black swans and no white swans also does not _logically_ invalidate this claim.

Even if the claim is true, it could be the case that, by some cosmic coincidence, we observed only the uncommon black swans.

In Tom Stoppard's play/film [_Rosencrantz & Guildenstern Are Dead_](https://en.wikipedia.org/wiki/Rosencrantz_and_Guildenstern_Are_Dead),
the titular characters (from _Hamlet_)
flip a coin 92 times, getting a heads every single time [(video link)](https://www.youtube.com/watch?v=gOwLEVQGbrM).

While this event is not _impossible_, it is fabulously _unlikely_:
it has a chance of $1$ in $2^{92}$.

In [None]:
1 / 2 ** 92

## Binary Hypothesis Testing Brings Falsifiability to Statistics

We've already done some binary hypothesis testing in previous labs,
and versions of it appear in `data8`.

We're going through it in detail here to review and in order to frame it in terms of falsification,
which will be useful for understanding null hypothesis significance testing.

Simple version:

1. Write down a hypothesis.
2. Determine the chance your results would occur under that hypothesis.
3. If the chance that your results would occur is sufficiently low, the _hypothesis is rejected_ as falsified.

#### Caution: how do you pick _sufficiently low_?

> It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result.

Fisher, _The Principles of Experimentation_, 1935

But most people use $0.05$.

#### Caution: How do you define _the chance your results would occur_?

Consider the case of Rosencrantz tossing the coin: the chance of the sequence of 92 heads is $1$ in $2^{92}$, but so is the chance of _any_ given particular sequence.

And so you would either always reject the hypothesis that the coin was fair or always accept it,
depending on where you set your threshold.

#### 1. Define a statistic of the observed data and calculate the chance of observing that statistic.

Usually, we pick a statistic such that many datasets get mapped to the same value:
summarizing with a single number usually does the trick.

But we also want datasets that are _meaningfully_ different to get mapped to different values.
Determining what differences are meaningful is the art of model selection.

For this example, consider the _count_ of the number of heads,
rather than the specific sequence.
Many specific sequences have the same number of heads in them,
and so their probabilities are added together.

As discussed previously, the distribution of the count is a `Binomial`.
For Rosencrantz's experiment, the parameters are $n=92$ and $p=0.5$,
since it is hypothesized that the coin is fair.
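As a minimal sketch of where that `Binomial` comes from, the cell below adds up the probabilities of the individual sequences by counting them, using only the standard library. The helper name `count_pmf` is my own and is not part of the course code.

```python
from math import comb


def count_pmf(k, n=92):
    """Chance of exactly k heads in n flips of a fair coin.

    Each specific sequence has chance 1 / 2**n, and comb(n, k)
    different sequences contain exactly k heads.
    """
    return comb(n, k) / 2 ** n


print(count_pmf(46))  # the single most likely count
print(count_pmf(92))  # only one sequence is all heads
```

For a fair coin this should agree with `scipy.stats.binom.pmf(k, n=92, p=0.5)`.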

Now, if we observe exactly 46 heads and 46 tails, and use 0.05 as our cutoff,
we would not reject the hypothesis that the coin is fair.

In [None]:
scipy.stats.binom.pmf(46, n=92, p=0.5)

But if we observed 92 heads, we would:

In [None]:
scipy.stats.binom.pmf(92, n=92, p=0.5)

But that's the chance that _the specific outcome we observed would occur_.

We want to control instead the chance that _we reject the hypothesis_ when it is true,
and this procedure won't do that.

In [None]:
scipy.stats.binom.pmf(46, n=92, p=0.5)

Adjust the number above, starting from 46 and working first upwards and then downwards.
At first, the numbers will be above `0.05`, and so we won't reject the hypothesis that the coin is fair.
Stop when the number goes below `0.05`.

The probability drops below `0.05` at `51` heads or at `41` heads.
But the chance of observing _either_ `51` _or_ `41` heads is above `0.05`:

In [None]:
scipy.stats.binom.pmf(41, n=92, p=0.5) + scipy.stats.binom.pmf(51, n=92, p=0.5)

#### 2. Consider the value of your statistic on all possible datasets.

That is, compute, approximate, or estimate the sampling distribution of the statistic.

The chance you reject the hypothesis when it is true is the sum of the chances of observing each of the datasets that would cause you to reject the hypothesis, given that the hypothesis is true.

In [None]:
observations_that_would_cause_rejection = []

for k in range(0, 93):
    if scipy.stats.binom.pmf(k, n=92, p=0.5) < 0.05:
        observations_that_would_cause_rejection.append(k)

In [None]:
sum(scipy.stats.binom.pmf(k, n=92, p=0.5)
    for k in observations_that_would_cause_rejection)

The above cell demonstrates that if we were to reject the hypothesis that the coin was fair
whenever the chance of observing the statistic was less than 0.05,
the chance we'd reject the hypothesis when working with a fair coin would be about 35%!
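One way to sanity-check that figure is to simulate it: flip a fair coin 92 times, apply the naive rule, and repeat. This is only a sketch under my own assumptions (the seed, the number of repeats, and the helper names are arbitrary), and it uses the standard library rather than `pyMC`.

```python
import random
from math import comb

random.seed(0)  # arbitrary seed, only so the sketch is repeatable
n_flips, n_experiments = 92, 5000


def count_pmf(k, n=n_flips):
    # chance of exactly k heads in n fair flips
    return comb(n, k) / 2 ** n


def flip_fair_coins(n=n_flips):
    # each of the n random bits is one fair flip; count the 1s as heads
    return bin(random.getrandbits(n)).count("1")


# naive rule: reject whenever the observed count has probability below 0.05
rejections = sum(count_pmf(flip_fair_coins()) < 0.05
                 for _ in range(n_experiments))
naive_false_rejection_rate = rejections / n_experiments
print(naive_false_rejection_rate)
```

The simulated rate should land near the value computed in the cell above, up to sampling noise.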

#### 3. Set a threshold for the statistic such that the chance you incorrectly reject is _sufficiently low_.

In [None]:
rejection_threshold = 0.01
observations_that_would_cause_rejection = []

for k in range(0, 93):
    if scipy.stats.binom.pmf(k, n=92, p=0.5) < rejection_threshold:
        observations_that_would_cause_rejection.append(k)

sum(scipy.stats.binom.pmf(k, n=92, p=0.5)
    for k in observations_that_would_cause_rejection)

Adjust the value of `rejection_threshold` until the value produced by the cell,
which represents the chance of rejecting the null,
drops below `0.05`.

I find that `0.01` is sufficient, but `0.015` is not.

Once this has been performed for a given statistic and set of parameters,
we can give a decision rule that's in terms not of the probability,
but in terms of the original values.

For this distribution, and many others, the `observations_that_would_cause_rejection`
are in the _tails_ of the distribution: the values at least some distance from the center.

In this case, that distance is 10:

In [None]:
num_reject = len(observations_that_would_cause_rejection)
observations_that_would_cause_rejection[num_reject // 2 - 3 : num_reject // 2 + 3]

So if our "count" statistic is above 55 (is 56 or higher) or below 37 (is 36 or lower),
we reject the hypothesis that the coin is fair.

Put another way, if the _distance from the value predicted by the hypothesis is 10 or more_,
we reject the hypothesis.

In [None]:
(observations_that_would_cause_rejection[num_reject // 2 - 1],
observations_that_would_cause_rejection[num_reject // 2])

These are called _critical values_ of the statistic.
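The trial-and-error search above can also be written as a direct search over distances. The sketch below is a hypothetical helper, `critical_distance`, that finds the smallest distance from the center whose total tail probability is at or below a chosen cutoff; it assumes a fair coin and uses only the standard library.

```python
from math import comb


def critical_distance(n=92, alpha=0.05):
    """Smallest distance d such that a fair coin's count lands at least
    d away from n / 2 with probability at most alpha."""
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    center = n / 2
    for d in range(n // 2 + 2):
        # total chance of all counts at least d away from the center
        tail = sum(p for k, p in enumerate(pmf) if abs(k - center) >= d)
        if tail <= alpha:
            return d


print(critical_distance())
```

For 92 flips and a cutoff of `0.05`, this agrees with the distance of 10 found above.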

Generic hypothesis testing, final version:

#### 1. Define a statistic of the observed data and calculate the chance of observing that statistic.
#### 2. Determine the chance your results would occur _under that hypothesis_.
#### 3. If the chance that your results _or others like them_ would occur is sufficiently low, the _hypothesis is rejected_ as falsified.

### We Can Approximate the Distribution of our Statistic with pyMC

In [None]:
with pm.Model() as coins_model:
    coins = pm.Bernoulli("coins", p=0.5, shape=(92))

In [None]:
coins_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    coins_model, draws=2500, chains=4, progressbar=True))

In [None]:
k_pymc = coins_samples["coins"].apply(sum)

Two notes on modeling choices:

1. **Why don't we calculate the `sum` inside of the model, with `pm.math.sum`?** It's typically a bad idea to include too many `Deterministic` components in a model. It slows down the sampling process and can make it unstable.
2. **Why don't we just use a `Binomial` for the count directly?** This depends on how detailed you want your model to be. Whenever possible, I prefer to design models that generate raw data, or data in as raw a state as possible, and then apply my analyses to the samples _as though they were real data_. This emphasizes the role of the model as a _simulation of the real data generating process_.

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
sns.distplot(k_pymc, bins=range(0, 93), kde=False, norm_hist=True,
             hist_kws={"align": "left"}, label="sampled with pyMC");
plt.plot(range(0, 93), scipy.stats.binom.pmf(range(0, 93), n=92, p=0.5),
         lw=2, label="mathematically-derived", marker=".", markersize=10);
plt.ylim(1.7 * np.array(plt.ylim())); plt.xlabel("k"); plt.legend(); plt.tight_layout();

But remember that our final decision rule was in terms of the deviation from the value given by the hypothesis.
We can compute that statistic on our samples by using the following function:

In [None]:
def deviation_statistic(observed_values, expected_value):
    return np.abs(expected_value - observed_values)

In [None]:
deviation_samp_dist_samples = deviation_statistic(k_pymc, 46)

With sufficient cleverness and by applying the rules of probability,
we can obtain a sampling distribution for this statistic from the sampling distribution for the old one.

In [None]:
def deviation_statistic_pmf(k, n=92, p=0.5):
    binom_pmf = scipy.stats.binom(n=n, p=p).pmf
    if k > 0:
        return 2 * binom_pmf(n * p + k)
    else:
        return binom_pmf(n * p + k)

This is included only for comparison purposes;
the goal of this class is to learn tools, like pyMC,
that avoid using mathematical manipulations like these.

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
sns.distplot(deviation_samp_dist_samples, bins=np.arange(-0.5, 47.5), kde=False, norm_hist=True,
             hist_kws={"align": "mid"}, label="Estimated with pyMC");
plt.ylim(1.7 * np.array(plt.ylim())); ax.set_xlabel("Deviation from Expected");
ax.plot([deviation_statistic_pmf(val) for val in np.arange(46)], color="C1",
        lw=2, label="Derived Mathematically", marker=".", markersize=10);
ax.set_title("Sampling Distribution"); ax.legend(); plt.tight_layout();

Notice the advantage in simplicity for the pyMC method:
I didn't need to know any rules of probability to convert the sampling distribution
of the first statistic into the sampling distribution of the second;
I only needed to know how to compute the statistic on a sample.

This is a generic benefit of the sampling approach,
to balance the drawback of getting imprecise answers.

## Null Hypotheses Put the Focus on Falsification

One of the motivations for the falsification model of science is the recognition that our track record for hypotheses
is extremely poor.

From the Ptolemaic model and the humor theory of medicine to Newton's laws and the wave theory of light,
even useful and widely-believed hypotheses have turned out to be false.

So instead of claiming to "prove" any hypothesis was true,
early statisticians took the line that we should only _disprove_ theories:
that every experiment was to be a miniature version of Galileo dethroning Ptolemy,
or Einstein disproving Newton,
but respecting the fact that the theories of Galileo and Einstein were or will be themselves falsified.

To that end, we propose **null hypotheses**:
hypotheses we wish to disprove or falsify with an experiment.

> Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

R.A. Fisher, _The Design of Experiments_, 1935.

### A Null Hypothesis is the Hypothesis of a Skeptic

Here are a few generic null hypotheses, rendered in plain English:

- There is nothing interesting happening in this data
- There is no effect of Foo on Bar
- The size of the effect of Foo on Bar is 0
- Foo and Bar are unrelated

### Common Null Hypotheses

Here are some null hypotheses rendered slightly more technically:

- The true value of this parameter is 0
- The true difference in the value of this parameter between two groups is 0
- The true value of this statistic is 0
- The true value of this statistic is 1

### Proposing Null Hypotheses

Give a null hypothesis for each of the claims below:

- Individuals whose attention is not divided will perform worse at a task
- Reaction time decreases when the dose of caffeine is increased
- Smoking increases the risk of lung cancer

Note: there is not one single answer for any of these.
Proposing a null model often means making the same kinds of assumptions that are needed when making a real model.
This makes it somewhat subjective, or at the very least driven by the specific problem.

### Null Hypothesis Testing Cannot be Done with Basic Bootstrapping

Bootstrapping generates an estimate of the _true_ sampling distribution of a statistic.

This is not the same as the sampling distribution under the null hypothesis unless the null hypothesis is true.
But that's exactly what we're trying to test!
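To make the mismatch concrete, here is a small sketch with made-up data: 57 heads in 92 flips. Resampling that data with replacement produces counts centered on what we observed, not on the 46 that a fair-coin null predicts. The data, seed, and helper name are assumptions for illustration only.

```python
import random

random.seed(0)  # arbitrary seed for repeatability
observed_flips = [1] * 57 + [0] * 35  # hypothetical data: 57 heads in 92 flips


def bootstrap_count(data):
    # resample the data with replacement and recompute the statistic
    return sum(random.choices(data, k=len(data)))


bootstrap_counts = [bootstrap_count(observed_flips) for _ in range(2000)]
bootstrap_mean = sum(bootstrap_counts) / len(bootstrap_counts)
print(bootstrap_mean)  # centered near the observed 57, not the null's 46
```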

Bootstrapping involves resampling from our data.

In order to perform an equivalent of bootstrapping for null hypothesis testing,
we'll need to come up with a way to resample from our data _as though the null hypothesis were true_.

More on that in the next lecture.

### We Capture the Degree of Evidence Against the Null with the $p$-value

To compute the $p$ value,
we look at the sampling distribution of the statistic under the null
and compare our observed value to it.

We then add up the probability of every value of the statistic that is
as or more extreme.

For a statistic that gets larger as the data gets less likely under
the null hypothesis,
this can be achieved with a `>=`:

In [None]:
def one_tailed_p_value_from_samples(observed_value, null_sampling_dist_samples):
    return (null_sampling_dist_samples >= observed_value).mean()

Let's presume we observed `57` heads in our experiment
and take as our null hypothesis that the coin is fair,
allowing us to use the samples from before as our null samples.

In [None]:
observed_count = 57
observed_statistic = deviation_statistic(observed_count, 46)
null_sampling_dist_samples = deviation_statistic(k_pymc, 46)

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
sns.distplot(null_sampling_dist_samples, bins=np.arange(-0.5, 47.5), kde=False, norm_hist=True, ax=ax,
             label="Sampling Distribution\nUnder Null Hypothesis");
ax.vlines(observed_statistic, 0, 0.01, lw=6, label="observed");
ax.set_xlabel("Deviation from Expected"); plt.ylim(1.5 * np.array(plt.ylim())); ax.legend();

In [None]:
bar_heights, _ = np.histogram(null_sampling_dist_samples, bins=range(48), density=True)
extreme_values = range(observed_statistic, 47)  # all values at least as extreme as the observed one

In [None]:
f, ax = plt.subplots(figsize=(14, 8))
sns.distplot(null_sampling_dist_samples, bins=np.arange(-0.5, 47.5), kde=False, norm_hist=True, ax=ax,
             label="Sampling Distribution\nUnder Null Hypothesis");
ax.vlines(observed_statistic, 0, 0.01, lw=6, label="Observed");
ax.bar(extreme_values, bar_heights[extreme_values], width=1, color="C1", label="Contributes to $p$")
ax.set_xlabel("Deviation from Expected"); plt.ylim(1.5 * np.array(plt.ylim())); ax.legend();

The area of the region in gold is the _p_ value for our observation of 57 heads,
a deviation of 11 from the 46 predicted under the null.

It gives us the probability that we'd observe a deviation at least that large,
just by chance, under the null. The value is calculated two different ways in the cell below.

In [None]:
observed_p = sum(bar_heights[extreme_values])
assert np.isclose(observed_p, one_tailed_p_value_from_samples(observed_statistic, null_sampling_dist_samples))

observed_p

This value will be close to the one obtained by using the values from the mathematically-derived pmf:

In [None]:
null_sampling_dist_pmf_vals = pd.Series(deviation_statistic_pmf(val) for val in np.arange(46))
f, ax = plt.subplots(figsize=(16, 6))
ax.plot(null_sampling_dist_pmf_vals, color="C1",
        lw=2, label="Sampling Distribution, pmf", marker=".", markersize=10);
ax.plot(null_sampling_dist_pmf_vals[null_sampling_dist_pmf_vals.index >= observed_statistic],
        linestyle="none", markeredgecolor="C3", markerfacecolor="none", marker=".", markersize=10,
        label="Contributes to $p$"); ax.set_xlabel("Deviation from Expected"); ax.legend();

In [None]:
def tail_p_value_from_pmf(observed_value, null_sampling_dist_pmf, vals):
    vals = [val for val in vals if val >= observed_value]
    return sum(null_sampling_dist_pmf(val) for val in vals)

In [None]:
tail_p_value_from_pmf(observed_statistic, deviation_statistic_pmf, range(93))

### When the Value of $p$ is Below a Threshold, we Reject the Null Hypothesis

This is a special case of the generic hypothesis testing described above.

By definition, this will happen sometimes when the null hypothesis is true.
It will happen at a rate given by the threshold we apply to $p$,
known as the _false positive rate_ or $\alpha$.

We try to design hypothesis tests where it's _more likely_ that
$p$ is small when the null hypothesis is false.
The chance that $p$ is small when the null is false is known as the
_true positive rate_ or _power_.
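For the coin example, both rates can be computed directly once we commit to a decision rule and to a specific way for the null to be false. The sketch below assumes the rule "reject when the count is 10 or more away from 46" and, purely for illustration, a biased coin with a 0.65 chance of heads. The function names are my own.

```python
from math import comb


def binom_pmf(k, n, p):
    # chance of exactly k heads in n flips with heads-probability p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def rejection_rate(p_heads, n=92, critical_distance=10):
    """Chance that the count lands at least critical_distance from n / 2."""
    return sum(binom_pmf(k, n, p_heads) for k in range(n + 1)
               if abs(k - n / 2) >= critical_distance)


alpha = rejection_rate(0.5)   # null is true: false positive rate
power = rejection_rate(0.65)  # null is false: assumed coin with p = 0.65
print(alpha, power)
```

Note that the power depends on which alternative we pick: a coin that is only slightly unfair would give a much lower value.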

### If we Reject the Null Hypothesis, We Say our Observed Statistic was _Statistically Significant_

Do not read any more into that statement than its definition:
it literally only means that the null hypothesis was rejected.

"Statistically significant" does not mean that any particular non-null hypothesis is true, or even likely.

"Statistically significant" does not guarantee that the null hypothesis is false, it only suggests it.

"Statistically significant" does not mean that the effect is large, or _significant_ in any practical sense.

## $p$-Values are Notoriously Misleading

The following are all **incorrect** interpretations of the $p$-value:

### "$p < 0.05$, therefore ..."

"...the chance that the null hypothesis is true is less than 1 in 20."

"...the chance my original hypothesis is wrong is less than 1 in 20."

"...the chance a repeat of this experiment would get a negative result is less than 1 in 20."

"...a repeat of this experiment would also find $p < 0.05$."

### $p$ _IS NOT_ the Posterior Probability of the Null Hypothesis

The posterior represents our beliefs after observing data,
e.g. the value of $p$.

In order to determine our beliefs after observing the data,
we need to specify our beliefs before observing the data,
our prior on the null hypothesis,
and we need to determine the likelihood of our data.

Estimating this posterior with `pyMC` is a focus of this week's lab.

The $p$ value gives the strength of evidence against the null hypothesis,
but even very strong evidence is insufficient to disprove a very likely null hypothesis.
If I'm claiming to overturn the Standard Model of physics,
I'll require much stronger evidence than if I'm studying a new phenomenon in psychology.
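A back-of-the-envelope version of this argument treats the test outcome as simply positive or negative and applies Bayes' rule by hand. The numbers below (the priors, $\alpha = 0.05$, and a power of 0.8) are made up for illustration, and the function name is my own.

```python
def posterior_null_given_positive(prior_null, alpha, power):
    """P(null is true | test was positive), by Bayes' rule."""
    positive_and_null = prior_null * alpha
    positive_and_not_null = (1 - prior_null) * power
    return positive_and_null / (positive_and_null + positive_and_not_null)


# same test, two different hypothetical priors on the null
print(posterior_null_given_positive(prior_null=0.5, alpha=0.05, power=0.8))
print(posterior_null_given_positive(prior_null=0.99, alpha=0.05, power=0.8))
```

With a prior of 0.99 on the null, a positive result still leaves the null more likely than not, even though the test's false positive rate is only 0.05.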

### $1 - p$ _IS NOT_ the Probability Another Run of the Same Experiment would Report the Same Finding

This corresponds to the third and fourth statements above.
Such an experiment is called a _replication experiment_.
This quantity requires us to also specify, or assume, or estimate, the prior probability of the null hypothesis
and examine what happens when the null hypothesis is wrong.
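As a rough sketch of that calculation, suppose the two experiments are independent given the state of the null, and that the prior, $\alpha$, and power are all known. Every number here is hypothetical, as is the function name.

```python
def replication_probability(prior_null, alpha, power):
    """P(second test is positive | first test was positive), assuming the
    two experiments are independent given the state of the null."""
    p_positive = prior_null * alpha + (1 - prior_null) * power
    posterior_null = prior_null * alpha / p_positive
    return posterior_null * alpha + (1 - posterior_null) * power


print(replication_probability(prior_null=0.5, alpha=0.05, power=0.8))
print(replication_probability(prior_null=0.5, alpha=0.05, power=0.3))
```

Neither answer looks anything like $1 - p$: the chance of a successful replication is driven mostly by the power of the test.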

### $p$ _IS_ the Conditional Probability of Such an Extreme Statistic, Given that the Null Hypothesis is True

And remember we introduced the null hypothesis with the hope of falsifying it!

### $p$ _IS_ itself a Statistic

Like any statistic, it has a sampling distribution,
and so is different from experiment to experiment,
just like the mean or the standard deviation.

Just as we think of a "sample mean" and a "sample standard deviation",
which vary from dataset to dataset,
and which we distinguish from the "true mean" and "true standard deviation",
we should recognize the $p$ value as a sample-dependent quantity.
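A quick simulation makes the point: repeat the 92-flip experiment many times with a genuinely fair coin and compute $p$ each time. The sketch below uses the deviation-from-46 statistic; the seed, the number of repeats, and the names are my own choices.

```python
import random
from math import comb

random.seed(0)  # arbitrary seed for repeatability
N_FLIPS = 92
PMF = [comb(N_FLIPS, k) / 2 ** N_FLIPS for k in range(N_FLIPS + 1)]


def p_value(count):
    """Chance, under a fair coin, of a deviation from 46 at least as large."""
    deviation = abs(count - N_FLIPS / 2)
    return sum(p for k, p in enumerate(PMF)
               if abs(k - N_FLIPS / 2) >= deviation)


# repeat the same experiment many times with a fair coin
counts = [bin(random.getrandbits(N_FLIPS)).count("1") for _ in range(2000)]
p_values = [p_value(count) for count in counts]
fraction_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(min(p_values), max(p_values))
print(fraction_significant)
```

The values of $p$ range widely from run to run even though nothing about the coin changes, and a small fraction fall below `0.05` purely by chance.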

## There is Nothing Intellectually Necessary About NHST

The choice to binarize things into "falsified" and "not falsified"
is unnecessary:
we don't ever need to consider any hypothesis falsified,
just less and less likely.

Falsification is just one way of formalizing the scientific method.

Like the processes that science studies, science itself defies simplistic formal models:
just as all hypotheses are only ever partially true,
the falsification model of science is itself incomplete at best.

The famous dictum that "all models are wrong, but some are useful" applies to falsification as a model of science.

Besides being a convenient apparatus for implementing falsification,
null hypotheses are chosen for simplicity's sake.
There's nothing stopping you from testing,
and choosing to reject,
a non-null hypothesis.

Methods like confidence intervals, bootstrapping, and sampling from posteriors all allow you to assign probabilities to null _and_ non-null hypotheses.

`data8` avoids NHST because it's so unintuitive and, frankly, unnecessary.

## But NHST is Extremely Common

For historical, technological, and psychological reasons,
NHST based on $p$-values derived from classic statistical tests
is an extremely common form of inference in research science,
common enough that there is no choice but to include it in this course.

By psychological reasons, I mean that
it allows researchers to "sweep uncertainty under the rug"
and pretend that their results are set in stone,
so long as they have passed a null hypothesis significance test.

As might be obvious, this is an opinion,
rather than a mathematical fact,
but the distaste for NHST with $p$ values is
[widespread among the current generation of statisticians](https://amstat.tandfonline.com/toc/utas20/73/sup1).

## Null Hypothesis Significance Testing

0. Collect data from an experiment and compute a statistic on that data.

1. Come up with a model of "nothing interesting is happening in my data".
    - This model can be resampling-based, mathematical, or in `pyMC`
    - It is called the _null model_. The hypothesis that it is true is the _null hypothesis_.

Mathematical models are the classical approach.
Bootstrapping is a type of resampling, but it won't be the type we use here.
In general, a `pyMC` model can be replaced with any _generative_ model:
any model from which you can generate samples.

2. Obtain the sampling distribution of the statistic from the model
    - by literally sampling from it,
        - either resampling or with `pm.sample`
    - or by clever mathematical manipulation

3. Compare the value of the statistic observed on the data to the sampling distribution.
   - If the observed value is "too extreme", the test is positive and the result is "statistically significant".
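Here is the whole procedure in one place for the coin example, as a minimal sketch that swaps `pyMC` for plain Python sampling. The observed count of 57, the seed, and the number of samples are assumptions for illustration.

```python
import random

random.seed(0)  # arbitrary seed for repeatability

# 0. data and statistic: deviation of the observed head count from 46
observed_statistic = abs(57 - 46)


# 1. null model: 92 flips of a fair coin
def sample_null_statistic(n=92):
    count = sum(random.random() < 0.5 for _ in range(n))
    return abs(count - n / 2)


# 2. sampling distribution of the statistic, by literally sampling
null_samples = [sample_null_statistic() for _ in range(20000)]

# 3. compare: how often is the null statistic at least as extreme?
p_value = sum(s >= observed_statistic for s in null_samples) / len(null_samples)
print(p_value, p_value < 0.05)
```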

## Modeling NHST

### To Model NHST, we need to specify a prior and a likelihood

```python
with pm.Model() as nhst_model:
    # prior
    null_true = pm.?("null_true", ?)
    # likelihood (will be a function of null_true)
    positive_result = pm.?("positive_result", ?)
```

#### Prior: What do we believe about the null?

As with any `pyMC` model, we must specify a prior.
In this case, the prior determines our faith in the null hypothesis.

There is no generic objective way to do this.
The necessary subjectivity of this step makes it
one of the most controversial aspects of the generative/Bayesian modeling approach,
the one we focus on in this course.

Because the null is either true or false,
we can model it as a discrete variable with two states,
a `Bernoulli` or a `Categorical`.

#### Likelihood: What happens when the null is true and when it is false?

The outcome of the test is also binary.

We either reject the null, and say that the test was positive ($+$),
or we fail to reject the null, and say that the test was negative ($-$).

Therefore our likelihood is going to be `Bernoulli` as well.

The parameter `p` of that `Bernoulli` will take on one of two different values,
depending on whether the null is true or false.

If you were writing a pure Python model of null hypothesis significance testing,
you'd use an `if`/`else` construct to handle this:
`if null_true`, use one value, `else` use a different one.
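A sketch of what that pure Python version might look like is below. The prior, $\alpha$, and power are made-up values, and `simulate_nhst` is a hypothetical name rather than anything from the lab.

```python
import random

random.seed(0)  # arbitrary seed for repeatability


def simulate_nhst(prior_null=0.5, alpha=0.05, power=0.8):
    """One draw from a generative model of a significance test.

    All three default values are made up for illustration.
    """
    null_true = random.random() < prior_null        # prior
    if null_true:
        p_positive = alpha   # false positive rate
    else:
        p_positive = power   # true positive rate
    positive_result = random.random() < p_positive  # likelihood
    return null_true, positive_result


draws = [simulate_nhst() for _ in range(10000)]
nulls_among_positives = [null_true for null_true, positive in draws if positive]
fraction_null = sum(nulls_among_positives) / len(nulls_among_positives)
print(fraction_null)  # estimate of P(null is true | positive result)
```

Filtering the draws down to the positive results is a crude way of conditioning on the observed outcome.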

In pyMC, the role of `if` is played by `pm.math.switch`.
The lab contains a description of how it is used.

We can visualize the likelihood for this model directly as a table,
since it is discrete and the number of possible values is small: $2 \times 2 = 4$.

<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
      <th > $$F$$ </th>
      <th > $$T$$ </th>
    </tr>
    <tr>
      <td >$$+$$</td>
      <td style="background-color: rgb(0,50,98); color: white"> True Positive Rate, Power, Sensitivity </td>
      <td style="background-color: rgb(253,181,21);"> False Positive Rate, &#945; </td>
    </tr>
     <tr>
      <td >$$-$$</td>
      <td style="background-color: rgb(0,50,98); color: white"> False Negative Rate, &#946;</td>
      <td style="background-color: rgb(253,181,21);"> True Negative Rate, Specificity</td>
    </tr>
  </tbody>
</table>


A given cell in this table represents a particular combination of the state of the null hypothesis,
$F$alse or $T$rue, and the outcome of the test, either positive ($+$) or negative ($-$).
Inside the cell are the names that these particular conditional probabilities go by in different disciplines.

The columns represent the conditional distributions of the test outcomes given the state of the null hypothesis.

For more on where this table comes from, see
[this blog post](https://charlesfrye.github.io/stats/2018/06/09/hypothesis-testing.html).

Computing the power is more difficult than computing the false positive rate, $\alpha$,
and typically requires the prior be specified in more detail than we have done here.

For the lab this week, we will assume that the power is known.
We will see power computations in future lectures, labs, and homeworks.