<img src="../../shared/img/banner.svg" width=2560></img>

# Null Models 02 - Null Models for Means: _t_-Tests and Randomization

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import random

import daft
from IPython.display import Image, YouTubeVideo
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
def t_plot(df, kde=False, print_t=True, print_p=False, style="kde"):
    if print_t:
        print(f"t = {compute_t_df(df)}")
    if print_p:
        print(f"p = {compute_p_df(df)}")
        
    group0_selector = df["variety"] == 0
    group1_selector = df["variety"] == 1
    group0_values = df["yield"][group0_selector]
    group1_values = df["yield"][group1_selector]
    
    if style == "dist":
        f, ax = plt.subplots(figsize=(10, 10))
        sns.distplot(group0_values, label="Variety 0", kde=kde, norm_hist=True)
        sns.distplot(group1_values, label="Variety 1", kde=kde, norm_hist=True)
        
    elif style == "kde":
        f, ax = plt.subplots(figsize=(8, 8))
        sns.kdeplot(group0_values, shade=True, lw=4, label="Variety 0");
        sns.kdeplot(group1_values, shade=True, lw=4, label="Variety 1");
        ax.set_ylim(1.2 * np.array(ax.get_ylim()));
    ax.legend();

    
def display_t_test_model(
    mean_observed=False, sd_observed=False, value_observed=True, title=""):

    aspect = 1.5
    scale = 2

    offset = 0
    delta = 2
    y = 1.5

    group_node = daft.Node("group", "group",
                           offset + delta, y, scale=scale, aspect=aspect, observed=True)
    # Raw strings keep the LaTeX backslashes from being read as (invalid) escape sequences
    mean_node = daft.Node(r"$\mu$", r"$\mu$",
                          offset + 2 * delta, y, scale=scale, aspect=1, observed=mean_observed)
    value_node = daft.Node("value", "value",
                           offset + 3 * delta, y, scale=scale, aspect=aspect, observed=value_observed)
    sd_node = daft.Node(r"$\sigma$", r"$\sigma$",
                        offset + 2 * delta, y + delta, scale=scale, aspect=1, observed=sd_observed)

    nodes = [sd_node, mean_node, value_node, group_node]
    edges = [(r"$\sigma$", "value"), ("group", r"$\mu$"), (r"$\mu$", "value")]

    gap = 1
    left_edge = offset + delta - gap
    width = (len(nodes) - 2) * delta + 2 * gap

    plate = daft.Plate((left_edge, 0, width, y + 1), shift=0.25,
                       label=r"$2 \cdot N_g$", position="bottom right")

    plates = [plate]

    model = daft.PGM(shape=(8, 4 + 0.5), line_width=2)

    [model.add_node(node) for node in nodes]
    [model.add_plate(plate) for plate in plates]
    [model.add_edge(*edge, head_width=0.25) for edge in edges]

    # Avoid fill with blue in newer versions of matplotlib
    for plate in plates:
        plate.bbox["fc"] = "white"

    model.render();
    
    plt.title(title);

In [None]:
import shared.src.utils.util as shared_util

This week, we put on our frequentist hats:
we are going to study the most common method for frequentist inference,
_null hypothesis significance testing_.

Today, we will work through one of the most common tests:
testing for a difference in means.
We'll do this parametrically, using
the $t$-test, specifically the two-sided unpaired Student's $t$-test,
and non-parametrically, using a resampling scheme called the _permutation test_.

Next week, we'll take a Bayesian approach to similar problems.

## Null Hypothesis Significance Testing Proceeds by Applying Thresholds to $p$-Values

A $p$-value for a test is obtained by comparing the value of a statistic,
called the _test statistic_,
to the sampling distribution of that same statistic under the null hypothesis.
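
As a minimal sketch of that comparison (not part of the lecture's own code): the helper below, whose name `monte_carlo_p_value` is hypothetical, takes an observed statistic and a collection of draws of the same statistic under the null, and reports the fraction of draws at least as large in magnitude. The standard `Normal` null used here is only a stand-in, assumed for illustration.

```python
import random


def monte_carlo_p_value(observed, null_draws):
    """Two-sided p-value estimate: share of null draws at least as extreme as `observed`."""
    n_extreme = sum(abs(draw) >= abs(observed) for draw in null_draws)
    return n_extreme / len(null_draws)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Stand-in null distribution (an assumption for illustration): standard Normal.
null_draws = [rng.gauss(0, 1) for _ in range(100_000)]

p_estimate = monte_carlo_p_value(1.96, null_draws)
print(f"estimated p = {p_estimate:.3f}")  # should land near 0.05
```

The rest of this lecture follows the same recipe, with more interesting statistics and null distributions.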

In [None]:
Image(url=
      "https://4.bp.blogspot.com/-XvlZhew0MqY/TeL3zIYTvcI/AAAAAAAAANk/i4vN6nVPMnE/s640/Hypothesistestingfigure.png",
     width=1024)

## In this lecture, we work through a specific example: $t$

It was one of the very first test statistics ever developed,
in 1908 by
[William Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset),
working pseudonymously under the name "Student"
for Guinness & Co. Brewing.

Gosset was interested in determining which varieties of barley provided the highest yields,
but he was plagued by the fact that he was only able to observe a small number of plantings.

If a barley variety A produced on average 3 bushels, averaged over 5 plantings,
and another B produced 5, averaged over the same number,
was it reasonable to conclude that variety B has a higher average yield?

In [None]:
barley_A_yield = pd.Series([3, 5, 1, 2, 4])
barley_B_yield = pd.Series([7, 5, 3, 4, 6])

yields = pd.concat([barley_A_yield, barley_B_yield])

In [None]:
barley_df = pd.DataFrame({"yield": yields, "variety": [0] * 5 + [1] * 5})

print(barley_df)

Note: these numbers are entirely fictional, though the example is not.

In [None]:
t_plot(barley_df, print_t=False, style="kde")

If this is our estimate of our uncertainty in the yields
of the two varieties, is it reasonable to conclude that the means are different,
or might this have occurred due to chance?

## Defining $t$

Gosset pioneered the approach of defining a test statistic and determining the distribution of that statistic under the null.

He chose the following statistic:

$$
t = \frac{\mu_A - \mu_B}{\sigma\sqrt{\frac{2}{N_g}}}
$$

where $\mu_A$ for a pandas series `A` is `A.mean()`,
and $N_g$ is `len(A)`, which is presumed equal to `len(B)`.
The other value in the denominator, $\sigma$, is
the estimate of the group standard deviation
and is given by:

$$
\sigma^2 = \frac{\sigma^2_A + \sigma^2_B}{2}
$$

where $\sigma_A^2$ is `A.var()`.
That is, we average the estimated variances of each group.

It is called $t$ because it was the $t$est statistic
for one of the first modern statistical tests.
We will work through the rationale for this statistic below,
but for now,
take it as a given.
Notice that its magnitude will be large
if the observed means of the groups are very different,
relative to their spread.
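
As a worked check, using the fictional barley yields above:
both groups have sample variance $2.5$, the observed means are $3$ and $5$, and $N_g = 5$, so

$$
\sigma = \sqrt{\frac{2.5 + 2.5}{2}} = \sqrt{2.5}, \qquad
t = \frac{3 - 5}{\sqrt{2.5}\sqrt{\frac{2}{5}}} = \frac{-2}{\sqrt{1}} = -2
$$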

In [None]:
def compute_t(a, b):
    sp = compute_sigma_pooled(a, b)
    t = (a.mean() - b.mean()) / (sp * np.sqrt(2 / len(a)))
    return t


def compute_sigma_pooled(a, b):
    sigma_pooled = np.sqrt((a.var() + b.var()) / 2)
    return sigma_pooled

And now, a convenient function that lets us apply `compute_t` directly to our dataframe:

In [None]:
def compute_t_df(df):
    group0_selector = df["variety"] == 0
    group1_selector = df["variety"] == 1
    
    return compute_t(df["yield"][group0_selector], df["yield"][group1_selector])

In [None]:
compute_t_df(barley_df)

But is that large? Is that small?

Just as we have seen in previous instances of inferential thinking,
we need to compare a value to a distribution in order to get a sense of scale.
Previously, the question was how uncertain we were in a given value,
and we obtained an answer by _bootstrapping_.
Now, the question is how likely a value like the above might occur by chance,
and we will obtain the answer by _comparison to the null_.

## Defining the $t$ Test

The usual null hypothesis associated with this statistic is
that

$$
\mu_A = \mu_B
$$

and that the sample means have a `Normal` distribution.

This will be exactly true if the individual observations are also `Normal`.
It will be approximately true, even if that is not the case,
in many cases, so long as the number of samples is sufficiently high.
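
Here is a small sketch of that approximate normality, assuming (purely for illustration) that individual observations follow a strongly skewed Exponential distribution. The helper names `skewness` and `simulate_sample_means` are hypothetical. A `Normal` distribution has skewness 0, so the skewness of the sample means should shrink toward 0 as the sample size grows.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_sample_means(sample_size, n_experiments, rng):
    """Means of `n_experiments` samples, each made of `sample_size` Exponential(1) draws."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_experiments)
    ]


rng = random.Random(0)
skew_small = skewness(simulate_sample_means(2, 5_000, rng))
skew_large = skewness(simulate_sample_means(50, 5_000, rng))
print(f"skewness of sample means: n=2 -> {skew_small:.2f}, n=50 -> {skew_large:.2f}")
```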

In the example of Gosset,
this is the statement that the two varieties of barley have the same average yield.

Note that we've also assumed that the true standard deviation of the groups is the same
in writing down the definition of $t$.

Graphically, that looks like:

In [None]:
display_t_test_model(
    mean_observed=True, sd_observed=True, value_observed=False,
    title="Graphical Representation of $t$-Test Null")

where _group_ is a binary variable that determines which group an observation is in,
$\mu$ is a list of means for each group,
_value_ is the value that is being observed,
$N_g$ is the number of observations in each group,
and $\sigma$ is the standard deviation, which is presumed the same in each group,
and

$$
\text{value} \sim \text{Normal}(\mu\left[\text{group}\right], \sigma)
$$

Remember that the rectangle, called a "plate",
means that we observe the contents multiple times,
with the number of observations given in the small box in the bottom right.
For the barley example, $N_g = 5$.

In [None]:
def make_t_model(true_sigma, group_size, group0mean=0, delta_mean=0):
    
    group_labels = [0] * group_size + [1] * group_size
    
    with pm.Model() as t_model:
        group = pm.Categorical("varieties", p=[1/2, 1/2],
                               observed=group_labels)
        group_means = shared_util.to_pymc(
            [group0mean, group0mean + delta_mean])
        
        value = pm.Normal("yields",
                          mu=shared_util.to_pymc(group_means)[group],
                          sd=true_sigma, shape=len(group_labels))
        
    return t_model, group_labels

In the barley example, the "group" variable is _variety_
and the observed "value" is _yield_.

The values of $\mu$ are determined by our null hypothesis:
they are equal, and equal to the mean we observed.

In [None]:
mean_pooled = barley_df["yield"].mean()
mean_pooled

In [None]:
sigma_pooled = compute_sigma_pooled(barley_A_yield, barley_B_yield)
sigma_pooled

In [None]:
group_size = 5

Having fully specified our null model,
we can now draw samples from it.

In [None]:
null_t, group_labels = make_t_model(sigma_pooled, group_size, group0mean=mean_pooled, delta_mean=0)
samples = shared_util.samples_to_dataframe(shared_util.sample_from(null_t, draws=2500, chains=4))

Each resulting sample is like the data from a fictitious experiment:
a window into what we might observe in an alternate world where the null hypothesis is true.

But each row doesn't yet _look_ like the data from our barley experiment:

In [None]:
samples.iloc[0].head()

In [None]:
print(barley_df.head())

and so we cannot use the analysis tools we applied to our real data,
like `compute_t_df` or `t_plot`.

To allow their use,
we need to transform our samples into something that looks like our data:

In [None]:
def sample_to_experiment_df(sample, group_labels):
    return pd.DataFrame({"yield": sample, "variety": group_labels})

This isn't just a programming convenience.
This is synecdochal of the entire approach:
the samples from the null are _plausible alternative datasets_,
given that the null is true,
and so our analysis should respect and reflect that fact.

In [None]:
sample_to_experiment_df(samples["yields"].iloc[0], group_labels).head()

Now, we can get a sense for the sampling distribution of the data under the null hypothesis.

In [None]:
null_t_dfs = [sample_to_experiment_df(sample["yields"], group_labels) for _, sample in samples.iterrows()]

In [None]:
target_df = random.choice(null_t_dfs)
t_plot(target_df, print_t=True, style="kde")

Notice that on many of these samples, especially for a small group size,
there appears to be an obvious pattern:
the variance of one is much larger,
the mean of one variety is much smaller.
But because we are sampling from the null,
we know that those patterns are illusory:
the groups have the same mean and the same variance.

It is the job of statistics to inform us when the pattern we claim is liable to be illusory,
because it occurs sufficiently often under the null.

The pattern we are claiming is meaningful in our data is that the $t$ value is large in magnitude.

In order to test that pattern,
we must compute `t` on all of our samples from the null,
and then compare the magnitudes we sampled to the magnitude we observed.

In [None]:
null_distribution_t = pd.Series(
    [compute_t_df(null_t_df) for null_t_df in null_t_dfs], name="$t$")

In [None]:
null_f, null_ax = plt.subplots(figsize=(10, 6))
sns.distplot(null_distribution_t, bins=100, ax=null_ax, label="MC Estimate");
plt.xlim(-10, 10); plt.ylim(1.5 * np.array(plt.ylim())); plt.legend();

Now, to perform our $t$ test,
we need to compare the value we obtained to the values under the null hypothesis.

In [None]:
barley_t = compute_t_df(barley_df)
barley_t

In [None]:
null_ax.vlines(barley_t, 0, null_ax.get_ylim()[1] * 0.25, lw=6, label="Observed")
null_ax.legend(); null_f

From the histogram,
we add up the areas of all of the bars
corresponding to values larger in magnitude than the one we actually observed.

In [None]:
(np.abs(null_distribution_t) >
     np.abs(compute_t(barley_A_yield, barley_B_yield))).mean()

The traditional way to compute $p$ is that we compare the value of $t$ obtained to a table of values:

In [None]:
Image("https://www.ruf.rice.edu/~bioslabs/tools/stats/ttable.gif")

"Degrees of freedom" is the _only_ parameter for the null distribution of $t$,
and it is determined by the group sizes you choose.
It is also unaffected by the group means or by the pooled standard deviation,
as you can confirm for yourself by changing
those numbers in the definition of the pyMC model `null_t`.

Degrees of freedom is, in general,
equal to the number of data points from the dataset I would
need to tell you,
along with the value of the statistic,
before you could guess the remaining values.

In our case, it's $2\cdot N_g - 2$:
we calculate $t$ from the pooled standard deviations of the two groups,
and each has $N_g - 1$ degrees of freedom.

For the barley example,
this value is 8,
so we can say that our $p$ value is $>0.05$.

We don't get a continuous value for $p$ out of this method.
For this reason, it was customary up until very recently to only report
whether the $p$ value was below a certain threshold.

So why was this ever done?

Because before there were ubiquitous computers,
it was infeasible to calculate the sampling distribution of $t$ yourself,
either by estimation or by means of a formula.

In order to get those values,
we need to know a mathematical or analytical form for the shape of the
distribution.

In [None]:
null_f

The shape looks very much like a `Normal`,
but it is in fact ever so slightly different.

In [None]:
norm_estimate_params = null_distribution_t.mean(), null_distribution_t.std()
norm_estimate_pdf = scipy.stats.norm(*norm_estimate_params).pdf
ts = np.linspace(min(null_distribution_t), max(null_distribution_t), num=1000)
null_ax.plot(ts, norm_estimate_pdf(ts), lw=4, label="Normal Estimate")
null_ax.legend(); null_f

As William Gosset showed,
this distribution is actually given by
a curve that now bears his pseudonym:
[_Student's t distribution_](https://en.wikipedia.org/wiki/Student%27s_t-distribution).
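
To see how a formula for the density turns into a continuous $p$ value, here is a rough sketch that uses only the standard library. The function names are hypothetical, and the numerical integration (Simpson's rule) is a stand-in chosen for simplicity, not how statistical libraries actually compute it. The input is the barley example's $t$ of $-2$ with 8 degrees of freedom.

```python
import math


def student_t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    norm = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return norm * (1 + x * x / dof) ** (-(dof + 1) / 2)


def two_sided_p(t_value, dof, n_steps=2_000):
    """P(|T| >= |t_value|), via Simpson's rule on the density over [0, |t_value|]."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / n_steps  # n_steps must be even for Simpson's rule
    total = student_t_pdf(0, dof) + student_t_pdf(upper, dof)
    for i in range(1, n_steps):
        weight = 4 if i % 2 else 2
        total += weight * student_t_pdf(i * h, dof)
    central_half = total * h / 3  # probability mass between 0 and |t_value|
    return 1 - 2 * central_half


p_barley = two_sided_p(-2.0, dof=8)
print(f"p = {p_barley:.4f}")  # between 0.05 and 0.10, consistent with the table lookup
```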

In [None]:
student_true_pdf = scipy.stats.t(df=2 * group_size - 2).pdf

null_ax.plot(ts, student_true_pdf(ts), lw=4, label="Analytical Student")
null_ax.legend(); null_f

Compared to a `Normal` with the same standard deviation, it has more values close to $0$ and more values very far from $0$.

pyMC allows you to sample directly from the $t$ distribution with `pm.StudentT`.

Nowadays, for tests that used to have tables,
we can use computers to calculate the values of $p$ instead:

In [None]:
scipy.stats.ttest_ind(barley_A_yield, barley_B_yield)

In [None]:
def compute_p_df(df):
    group0_selector = df["variety"] == 0
    group1_selector = df["variety"] == 1
    t, p = scipy.stats.ttest_ind(df["yield"][group0_selector], df["yield"][group1_selector])
    return p

The possession of a formula allows us to demonstrate something interesting:
the sampling distribution of the $p$ statistic under the null hypothesis.

We simply calculate $p$ for each of our samples:

In [None]:
null_distribution_p = pd.Series(
    [compute_p_df(null_t_df) for null_t_df in null_t_dfs], name="$p$")

In [None]:
p_null_f, ax = plt.subplots(figsize=(6, 12))
sns.distplot(null_distribution_p, color="C0",
             bins=20, kde=False, norm_hist=True, label="MC Estimate", ax=ax);
ax.hlines(1, 0, 1, lw=4, label="Uniform($0$, $1$)")
ax.set_ylim(0, 1.7);
ax.set_title("Null Distribution of $p$"); ax.legend();

Notice that it is approximately uniform.
Apply bootstrap sampling to `null_distribution_p` if you're skeptical.

From this, we can also see
why the false positive rate is directly the value of the threshold we apply to $p$.

In [None]:
p_null_f, ax = plt.subplots(figsize=(6, 12))
sns.distplot(null_distribution_p, bins=20, kde=False, norm_hist=True, label="True Negatives");
# note: this is only true negatives because the positives are hidden behind the bar plotted below
ax.hlines(1, 0, 1, lw=4, label="Uniform($0$, $1$)")

bar_heights, _ = np.histogram(null_distribution_p, bins=20, density=True)
ax.bar(0.025, bar_heights[0], width=0.05, color="C1", label="False Positives");
ax.set_ylim(1.7 * np.array(ax.get_ylim()));
ax.set_title("Null Distribution of $p$"); ax.legend();

ax.legend();

In [None]:
print(bar_heights[0] * 0.05)

The fraction of observations, under the null hypothesis,
whose $p$ values will be below a value $\alpha$,
is exactly equal to $\alpha$,
because $p$ is uniformly distributed when the null is true.

The false positive rate is the chance that we incorrectly reject the null hypothesis
given that it is true:
the chance that the value of $p$ is below our threshold of $\alpha$ if our data
is generated according to the null hypothesis.
But as can be seen above, this is just equal to $\alpha$.
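
The same claim can be checked without pyMC. The sketch below is an illustration under stated assumptions, not the lecture's pipeline: it simulates many experiments in which the null is true, using plain `random.gauss` draws, and counts how often $|t|$ exceeds the tabulated 5% critical value for 8 degrees of freedom. The helper names `t_statistic` and `rejection_rate` are hypothetical.

```python
import math
import random
import statistics

T_CRITICAL = 2.306  # two-sided 5% critical value for 8 degrees of freedom (standard t table)


def t_statistic(a, b):
    """Equal-group-size Student's t, following the definition used in this lecture."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / (pooled_sd * math.sqrt(2 / len(a)))


def rejection_rate(delta_mean, sigma, group_size, n_experiments, rng):
    """Fraction of simulated experiments in which |t| exceeds the critical value."""
    rejections = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0, sigma) for _ in range(group_size)]
        b = [rng.gauss(delta_mean, sigma) for _ in range(group_size)]
        rejections += abs(t_statistic(a, b)) > T_CRITICAL
    return rejections / n_experiments


rng = random.Random(0)
# delta_mean=0 means the null is true, so every rejection is a false positive.
false_positive_rate = rejection_rate(0.0, sigma=1.0, group_size=5, n_experiments=20_000, rng=rng)
print(f"false positive rate at alpha = 0.05: {false_positive_rate:.3f}")
```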

## But Why $t$?

Let's review the definition of $t$:

$$
t = \frac{\mu_A - \mu_B}{\sigma\sqrt{\frac{2}{N_g}}}
$$

The quantity we were actually interested in was just the difference in means.

The confusing part, in the denominator,
was something of a distraction.

Let's instead calculate the difference in means.

In [None]:
def compute_delta_mu_from_df(df):
    group0_selector = df["variety"] == 0
    group1_selector = df["variety"] == 1
    
    delta_mu = (df["yield"][group0_selector].mean() - df["yield"][group1_selector].mean())
    return delta_mu

In [None]:
null_distribution_delta_mus = [compute_delta_mu_from_df(df) for df in null_t_dfs]
# non_null_distribution_delta_mus = [compute_delta_mu_from_df(df) for df in power_t_dfs]

In [None]:
barley_delta_mu = compute_delta_mu_from_df(barley_df)
barley_delta_mu

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(null_distribution_delta_mus, label="Null, MC Est.", ax=ax)
# Scale the marker to this plot's own y-axis rather than to the earlier figure's
ax.vlines(barley_delta_mu, 0, ax.get_ylim()[1] * 0.25, lw=6, label="Observed")
ax.set_xlabel(r"$\mu_A - \mu_B$"); ax.legend();

And the result is approximately the same $p$-value:

In [None]:
(np.abs(null_distribution_delta_mus) > np.abs(barley_delta_mu)).mean()

Side Note: sometimes, this value is below the mystical threshold of `0.05`,
even though the value above is not.
This is an indication of the inadequacy of the NHST method:
it requires a hard threshold,
but whether your data passes that threshold can be up to fairly arbitrary decisions,
like exactly which statistic to use,
and can be buffeted about by random forces.
Even if one answer is correct and the other is incorrect,
the choice to apply a hard threshold magnifies small errors:
the null hypothesis is rejected in one case and not in the other,
and one of those must be incorrect.
If we retain a continuous value,
the outcome of this experiment is the same for both statistics:
there is moderately strong evidence against the null hypothesis
that the two varieties have the same average yield.

Now, notice this distribution is `Normal`, and so
it seems like a promising target for a nice statistical test based on the `Normal` distribution.

The trouble is,
the width of that distribution is unknown ahead of time.
It is different _depending on the value of $\sigma$_, which is unknown.
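
To make that dependence explicit, here is a sketch of the standard calculation,
writing $\sigma$ for the true (unknown) common standard deviation rather than its estimate.
Each group mean is an average of $N_g$ independent `Normal` observations,
so it has variance $\sigma^2 / N_g$.
The two groups are independent, so the variances add, and under the null

$$
\mu_A - \mu_B \sim \text{Normal}\left(0, \sigma\sqrt{\frac{2}{N_g}}\right)
$$

The width of this distribution has the same form as the denominator in the definition of $t$.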

This was bad for the days when statistical testing was done
by reading tables out of books:
instead of a single table,
with a printed out set of $t$ values for each choice of group size,
you'd need a table for every possible value of $\sigma$,
and there are infinitely many values!

If you try to estimate the width from the data
and then divide it out,
so that the distribution of `delta_mu` is always
the same width,
then you end up with the $t$ distribution!

Your attempt to estimate the width adds additional variability,
which results in a distribution with bigger tails.
As the size of the groups increases,
your error is smaller and so the distribution becomes closer to a `Normal`
with mean 0 and standard deviation 1.

And so the definition of $t$ was chosen for convenience and to handle very small samples:
it means someone just had to calculate one big table, once,
and then anyone could get a $p$ value by consulting that table,
e.g. this one:

In [None]:
Image("https://www.ruf.rice.edu/~bioslabs/tools/stats/ttable.gif")

But nowadays, we can estimate the null distribution of $t$,
or _almost any other statistic_
in seconds, by sampling with MC methods.

This is known as a _parametric method_ of sampling,
since our models have parameters.

In contrast, bootstrapping is _non-parametric_:
there are no fixed parameters and so there exists no parametric model of the dataset.

We can also sample from the null non-parametrically, by means of _randomization_.

## Randomization Tests

Under the null hypothesis,
there is no difference between the groups
in terms of the statistic,
and so from the perspective of the test,
the labels are irrelevant.

Put another way,
they _might as well be random_.

For example,
if William Gosset had accidentally mixed up the labels on his
barley seeds,
so that some of Variety A were labeled "Variety B" and vice versa,
it would change the exact value of $t$ that he calculated,
but, if the null hypothesis were true, we wouldn't expect it to change $t$ in any specific way:
increasing or decreasing it, for example.

Or, in another example,
consider two salt shakers.
Any grain of salt is exchangeable with any other grain of salt,
at least in my view.
And so, if someone were to mix the contents of the two shakers together
and then pour the resulting mixture back out,
I would be happy using either shaker,
and the experience would be the same as if they'd never been mixed at all.

The situation is very different for a salt and a pepper shaker.
If the contents of the two are mixed,
the result of applying the mixture to your food is quite different
from the result of applying either of the original shakers.

The idea of a randomization test is to apply such a procedure to your data
and see whether your data behaves more like two salt shakers
or more like a salt and a pepper shaker.

That is, loosely:
we strip the labels off of the data,
we put all of it into a bag,
and then shuffle it around.
Then, we pull it back out,
sticking the labels back on randomly as we go.

Finally, we treat the resulting data just like the data we observed,
and compute our statistic of choice on it.

To perform such a randomization, also known as a _permutation_,
in pandas, we simply shuffle the Series of labels, that is, sample all of it without replacement,
and put it back into the DataFrame.
The resulting DataFrame then goes into our statistical computation pipeline,
just as though it were the original data.

We repeat this process over and over again,
obtaining lots of different values for the statistic.

Importantly, these values _will be distributed exactly as though the labels were meaningless_.
That is, they represent the distribution of the statistic under the hypothesis
that the groups are identical in every way, except for which label was chosen to put on them.

Thus the null of a randomization test is much broader
than the null of a similar parametric test:
we no longer assume that the data was generated according to any particular model,
as we did in the traditional and the pyMC approaches.

The fact that it involves returning to your data and sampling from it
makes it kin to bootstrapping.
Both methods are known as _resampling_ methods,
and they are furthermore _nonparametric_
because they do not require the specification of a model with parameters.
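
The difference between the two resampling schemes comes down to whether we draw with replacement. A minimal sketch using only the standard library, with the fictional barley yields as input:

```python
import random

yields = [3, 5, 1, 2, 4, 7, 5, 3, 4, 6]  # the fictional barley yields
rng = random.Random(0)

# Permutation: draw every observation exactly once, in a random order
# (sampling WITHOUT replacement), so the same values end up under shuffled labels.
permuted = rng.sample(yields, k=len(yields))

# Bootstrap: draw the same number of observations WITH replacement,
# so some values can repeat and others can be left out.
bootstrapped = rng.choices(yields, k=len(yields))

print("permuted:    ", permuted)
print("bootstrapped:", bootstrapped)
```

The permuted list always contains exactly the original values; the bootstrapped list usually does not.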

Check out
[this slick visualization](https://www.jwilber.me/permutationtest/)
for a thorough introduction to this idea,
in the context of alpaca shampooing.

In [None]:
def randomize_group(df):
    rs_df = pd.DataFrame()
    
    rs_df["yield"] = df["yield"]
    rs_df["variety"] = df["variety"].sample(frac=1).values
    
    return rs_df

In [None]:
resample_dfs = [randomize_group(barley_df) for _ in range(10000)]

In [None]:
resample_ts = [compute_t_df(df) for df in resample_dfs]

In [None]:
sns.distplot(
    resample_ts, ax=null_ax, label="Resample Estimate", bins=25);
null_ax.legend(ncol=2); null_f

When sample sizes are small,
as in this case,
the distribution under the permutation test can look a bit wonky,
as this one does,
but it becomes smoother as the sample size gets bigger.
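
One reason for the wonkiness: with two groups of 5, there are only 252 ways to split the 10 observations, so the permutation distribution is coarse and discrete. The sketch below (an illustration, not the lecture's pipeline) enumerates every split exactly instead of sampling them, using the difference in group sums, which is proportional to the difference in means. It also counts ties with the observed value separately, since those are common in a distribution this coarse.

```python
from itertools import combinations

yields = [3, 5, 1, 2, 4, 7, 5, 3, 4, 6]  # variety A first, then variety B
group_size = 5
total = sum(yields)


def sum_difference(group_a_sum):
    """Sum of group A minus sum of group B; proportional to the difference in means."""
    return group_a_sum - (total - group_a_sum)


observed = abs(sum_difference(sum(yields[:group_size])))

# Every way of choosing which 5 of the 10 plantings are labeled "variety A".
splits = list(combinations(range(len(yields)), group_size))
differences = [abs(sum_difference(sum(yields[i] for i in split))) for split in splits]

n_at_least = sum(d >= observed for d in differences)
n_strictly_more = sum(d > observed for d in differences)
print(f"{len(splits)} splits; >= observed: {n_at_least}; > observed: {n_strictly_more}")
print(f"p (>=) = {n_at_least / len(splits):.3f}, p (>) = {n_strictly_more / len(splits):.3f}")
```

For these data, 32 of the 252 splits are at least as extreme as the observed one, but only 12 are strictly more extreme, so the choice between `>=` and `>` moves the result across the 0.05 line.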

In [None]:
# ">=" so that permutations tying with the observed value count as "at least as extreme"
(np.abs(resample_ts) >= np.abs(barley_t)).mean()

We can do the same with the `delta_mu`, or "difference in means", statistic:

In [None]:
resample_delta_mus = [compute_delta_mu_from_df(df) for df in resample_dfs]

In [None]:
# ">=" so that permutations tying with the observed value count as "at least as extreme"
(np.abs(resample_delta_mus) >= np.abs(barley_delta_mu)).mean()

So if you must perform null hypothesis testing,
rather than spending your time
agonizing over the definition of the statistic and where it comes from,
the assumptions of the associated test, etc.,
spend your time coming up with a natural statistic for your data
and apply null-sampling methods to it.

If you can come up with a detailed null model,
you can write a pyMC model to draw the samples.
This same model can also be used to draw posterior samples.

If you cannot come up with such a model,
because you don't have the requisite knowledge
to apply assumptions to your data,
then you can apply randomization tests.

In neither case is it absolutely necessary
to use an existing statistic and its associated test!

This view is argued forcefully
in the blogpost
["There is Only One Test"](http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html),
by Allen Downey,
and elaborated on in a follow-up,
["There is Still Only One Test"](http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html),
from which the below graphic is derived.
That post also includes links to short video lectures,
by the author and by others.
One of them is [a tutorial](https://youtu.be/-7I7MWTX0gA)
by Jake Vanderplas, author of
[_The Python Data Science Handbook_](https://jakevdp.github.io/PythonDataScienceHandbook/)
and
[prolific blogger](http://jakevdp.github.io/)
on all things data science and Python.

In [None]:
Image(url=
      "https://lh4.googleusercontent.com/Bud31guq0w0FvylY57VMR0zHkYqxIpYAfOqgZietyvv1n2ToNEHwHKZWYix8pwct8kDKsZKiwvOWm6PIFEL3gBIQmbakQYHwVT02nn9_H8Fht_zaSBlrRNcqwZa950Vb5nt-5B84",
     width=1024)

The process of modeling and simulating data can be done
with parametric methods, e.g. in pyMC,
or with non-parametric methods, i.e. by applying an appropriate randomization procedure.

## Sampling from Non-Null Models to Calculate Power

The advantage of using a generative pyMC model
is that we can, by writing it down correctly,
also determine what happens when the null is false.

This is often very difficult in traditional approaches,
and impossible in a fully non-parametric approach,
which does not specify any alternative hypothesis.

The first question to ask about the behavior of a significance test when the null hypothesis is false is "what is the chance that I correctly reject the null?"

This chance is known as the _power_ of the test,
and it critically depends on what we assume the true value of the effect is.

That is,
just as we were able to calculate the chance of rejecting the null hypothesis when it was true
by specifying the null hypothesis,
we can calculate the chance of rejecting the null hypothesis when an alternate is true
by specifying an alternate hypothesis.

### Calculating Power _post hoc_

We wrote `make_t_model` in such a way that it can be used to model
the behavior of the $t$ statistic under non-null hypotheses:
just set `delta_mean` to a nonzero value.

And therefore we can easily calculate power according to one of the most common methods:
use the results of an experiment to make a _post hoc_ estimate of the experiment's power.

That is, we take as our alternative to the null the hypothesis that the effect we observed was the correct value,
and then we draw samples from a model where that is the case.

Graphically, this model looks like:

In [None]:
display_t_test_model(
    mean_observed=True, sd_observed=True, value_observed=False,
    title="Graphical Representation of\nPost-Hoc $t$-Test Power Calculation")

Everything is observed except the value,
just as in the null,
but now the values in $\mu$ are different:
they correspond to the values under the alternative,
rather than the null, hypothesis.

The cells below specify and then sample from this model.

In [None]:
barley_A_mean = barley_A_yield.mean()
observed_delta_mean = barley_B_yield.mean() - barley_A_mean

non_null_t, group_labels = make_t_model(sigma_pooled, group_size, barley_A_mean, delta_mean=observed_delta_mean)
samples = shared_util.samples_to_dataframe(shared_util.sample_from(non_null_t, draws=1000, chains=4))

sample_to_experiment_df(samples["yields"].iloc[0], group_labels).head()

non_null_t_dfs = [sample_to_experiment_df(sample["yields"], group_labels) for _, sample in samples.iterrows()]

Once again,
we estimate the distributions of the statistics of interest,
$t$ and $p$,
by applying our statistic-calculating functions to the samples.

In [None]:
non_null_distribution_t = pd.Series(
    [compute_t_df(non_null_t_df) for non_null_t_df in non_null_t_dfs], name="$t$")

non_null_distribution_p = pd.Series(
    [compute_p_df(non_null_t_df) for non_null_t_df in non_null_t_dfs], name="$p$")

In [None]:
f, ax = plt.subplots(figsize=(10, 10))
sns.distplot(null_distribution_t, label="Null, MC Est.");
sns.distplot(non_null_distribution_t, label="Post-Hoc Alt, MC Est.");
ax.set_title("Distributions of $t$"); ax.legend();

The fact that the distribution of $t$ under the alternative differs from its distribution under the null
is what makes this hypothesis test work:
when we observe large negative values of $t$,
the alternative hypothesis is a much more likely explanation than the null is.

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.distplot(null_distribution_p, kde=False, norm_hist=True, label="Null, MC Est.");
sns.distplot(non_null_distribution_p, kde=False, norm_hist=True, label="Post-Hoc Alt, MC Est.");
ax.set_title("Distributions of $p$"); ax.legend();

This difference in the distribution of $t$ is then passed forward
into a difference in the distribution of $p$.

Though the distribution of $p$ is always uniform under the null hypothesis,
the distribution of $p$ under the alternative is generally not.
In a good statistical test,
this distribution is shifted towards 0.
It would certainly be bad if it were shifted towards 1!

From this plot, we can obtain the power:
it is the fraction of $p$s in our samples that are below the threshold,
aka the area of the bar corresponding to $p$ values less than $\alpha$.

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.distplot(non_null_distribution_p, kde=False, norm_hist=True, bins=20, label="False Negatives", color="C1");
bar_heights, _ = np.histogram(non_null_distribution_p, bins=20, density=True)
# Use a different color from the histogram so the true positives are distinguishable
ax.bar(0.025, bar_heights[0], width=0.05, color="C0", label="True Positives")

ax.set_title("Distribution of $p$ under Post-Hoc Alternate, MC Estimates"); ax.legend();

In [None]:
print(bar_heights[0] * 0.05)

The value above is (an estimate of) the power of the $t$-test for this choice of settings of all the other relevant parameters:
- the true value of the pooled standard deviation
- the true value of the difference in means
- the sample size
- the choice of $\alpha$/threshold on $p$

### Calculating Power _a priori_

The sample size and choice of $\alpha$ are known to us, and even under our control.
If we wish to calculate the power for a different setting of those parameters,
we just repeat the above analysis with those parameters changed.

In general, we don't know what the true value of the pooled standard deviation is
and we don't know what the true value of the difference in means is,
or else we wouldn't need statistics!
In running the above power calculation,
we assumed that the values we observed were equal to the true values.

Firstly, this is typically not true, nor even close to true,
unless the sample sizes are very large.
Secondly, this is of little help in choosing whether to run an experiment,
since it requires the experiment to be run first.

One way to compute the power of a test before running an experiment
is to fix these values at some reasonable level:
either _a priori_ or by taking a small amount of "pilot data".
Instead of picking a single value,
you might also pick a series of values and get out a series of possible powers:
the lowest power we might have is this, the highest is that.
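
A minimal sketch of the "series of values" approach, under stated assumptions: it simulates experiments with NumPy and `scipy.stats.ttest_ind` instead of the PyMC3 models used in this notebook, and the helper `simulate_power`, the candidate effect sizes, and the group size are all hypothetical choices made for illustration.

```python
import numpy as np
import scipy.stats


def simulate_power(delta_mean, sd, group_size, alpha=0.05, n_sims=2000, rng=None):
    """Monte Carlo estimate of two-sample t-test power for one fixed effect size."""
    rng = np.random.default_rng() if rng is None else rng
    group0 = rng.normal(0.0, sd, size=(n_sims, group_size))
    group1 = rng.normal(delta_mean, sd, size=(n_sims, group_size))
    _, p_values = scipy.stats.ttest_ind(group0, group1, axis=1)
    return (p_values < alpha).mean()


rng = np.random.default_rng(0)
candidate_deltas = [0.5, 1.0, 2.0]  # plausible differences in means (assumed)
powers = {delta: simulate_power(delta, sd=1.0, group_size=10, rng=rng)
          for delta in candidate_deltas}

print(f"lowest power: {min(powers.values()):.2f}, "
      f"highest power: {max(powers.values()):.2f}")
```

The spread between the lowest and highest value gives a rough sense of how sensitive the power is to the assumed effect size.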

But not knowing the value of the mean is just a special case of
not knowing the value of a parameter in a model.
That is, the problem of trying to determine the power
is just a very important example of an inference problem.

To compute the power of the $t$-test without assuming some fixed value,
we therefore must place a prior over the difference in means.
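
The idea can be sketched without PyMC3: draw a difference in means from the prior for each simulated experiment, generate data, run the test, and average. This is only an illustration under assumed settings. The function `prior_power` is hypothetical, and its defaults (unit standard deviation, 10 observations per group) are not taken from the notebook.

```python
import numpy as np
import scipy.stats


def prior_power(delta_mean_sd, sd=1.0, group_size=10, alpha=0.05,
                n_sims=4000, seed=0):
    """Average two-sample t-test power over a Normal(0, delta_mean_sd) prior."""
    rng = np.random.default_rng(seed)
    # one draw of the difference in means per simulated experiment
    delta_mean = rng.normal(0.0, delta_mean_sd, size=(n_sims, 1))
    group0 = rng.normal(0.0, sd, size=(n_sims, group_size))
    group1 = delta_mean + rng.normal(0.0, sd, size=(n_sims, group_size))
    _, p_values = scipy.stats.ttest_ind(group0, group1, axis=1)
    return (p_values < alpha).mean()


print(prior_power(delta_mean_sd=2.0))
```

A wider prior puts more mass on large effects and so yields a higher averaged power. As the prior width shrinks towards 0, the result approaches $\alpha$, the rejection rate under the null.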

In [None]:
display_t_test_model(
    mean_observed=False, sd_observed=True, value_observed=False,
    title="Graphical Representation of\nPrior $t$-Test Power Calculation")

Now, the values and the means have not been observed,
and so they will be free random variables in our model:

In [None]:
def make_t_model_w_prior(estimated_sd, group_size, group1mean=0, delta_mean_sd=2):
    
    group_labels = [0] * group_size + [1] * group_size
    
    with pm.Model() as t_model:
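        # prior over the difference in means, symmetric around 0
        delta_mean = pm.Normal("delta_mean", mu=0, sd=delta_mean_sd)
        # group labels are fixed by the experimental design, so they are observed
        group = pm.Categorical("varieties", p=[1/2, 1/2],
                               observed=group_labels)
        # yields are unobserved: variety 0 has mean group1mean,
        # variety 1 has mean group1mean + delta_mean
        value = pm.Normal("yields",
                          mu=group1mean + pm.math.switch(group, delta_mean, 0),
                          sd=estimated_sd, shape=len(group_labels))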
        
    return t_model, group_labels

Notice that we now need to provide a parameter for our prior over the difference in means.
In general, setting this prior requires us to think about what kinds of effect sizes we are likely to see.

In this case, I chose a `Normal` prior centered at 0,
so the only parameter was the width of that distribution, `delta_mean_sd`.
I chose a symmetric prior because we weren't sure, _a priori_,
which variety was more likely to give a larger yield.
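
One way to get a feel for what a given `delta_mean_sd` implies is to look at the range of effect sizes the prior considers plausible. The sketch below uses `scipy.stats.norm` to print the central 95% interval for a few candidate widths. The candidate values are arbitrary, apart from `2`, which is the default in `make_t_model_w_prior`.

```python
import scipy.stats

for delta_mean_sd in [0.5, 1, 2, 4]:  # candidate prior widths (illustrative)
    prior = scipy.stats.norm(loc=0, scale=delta_mean_sd)
    low, high = prior.interval(0.95)  # central 95% of the prior mass
    print(f"delta_mean_sd = {delta_mean_sd}: "
          f"95% of prior mass in [{low:.2f}, {high:.2f}]")
```

If the printed interval includes differences in means that seem implausibly large for the problem at hand, that is a hint to narrow the prior.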

In [None]:
power_t, group_labels = make_t_model_w_prior(1, group_size, 0, 2)
power_t_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    power_t, draws=2500, chains=4, progressbar=True))

In [None]:
power_t_dfs = [sample_to_experiment_df(sample["yields"], group_labels)
               for _, sample in power_t_samples.iterrows()]

In [None]:
target_df = random.choice(power_t_dfs)
t_plot(target_df, print_p=True, style="kde")

Now, we can apply our functions that compute $t$ and $p$
to get our prior distributions over those statistics.
These tell us, before seeing the data,
how likely we believe it is that we will correctly reject the null when it is false.

In [None]:
# use a new name so the DataFrame of model samples, power_t_samples, is not overwritten
power_t_values = pd.Series([compute_t_df(power_t_df) for power_t_df in power_t_dfs], name="$t$")

power_p_samples = pd.Series([compute_p_df(power_t_df) for power_t_df in power_t_dfs], name="$p$")

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(power_p_samples, kde=False, norm_hist=True, bins=20, label="MC Estimate");
ax.set_title("Prior Distribution of $p$"); ax.legend();

In [None]:
power_estimate = (power_p_samples < 0.05).mean()
power_estimate

#### Suggested Exercises with this Notebook

Try increasing the `group_size` up to `10` or `30` or `100`.
What happens to the distribution of $p$ under the null?
What about under the post-hoc alternate?
What effect does this have on the power?

Try decreasing the `delta_mean` in the post-hoc alternate model (`non_null_t`).
What does this do to the distributions of $p$ and $t$ under that model?
What does this do to the power?
Do the same with increasing `true_sigma` (the first argument to `make_t_model`).