<img src="../../shared/img/banner.svg" width=2560></img>

# Lab 05 - Modeling Science

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from client.api.notebook import Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

import shared.src.utils.util as shared_util
import utils

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

## Learning Objectives

1. Connect the quantities in null hypothesis significance testing (NHST) to the components of a model of NHST.
2. Recognize the tentative nature of research results from small numbers of low-powered studies.
3. Practice using pyMC to draw from priors and posteriors.

JPA Ioannidis argued, in
a 2005 article provocatively titled
["Why Most Published Research Findings Are False"](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124),
that statistical factors, like low prior probabilities and low power,
and human factors, like conscious and unconscious bias,
suggest that the aura of finality attached to scientific claims
that are the result of hypothesis testing is misplaced.

In this lab, we will apply our pyMC modeling skills to the process of running a scientific experiment
and applying null hypothesis significance testing.
This will allow us to obtain posterior probabilities of the null hypothesis,
instead of just claims about results being "positive" or "negative"
and the hypothesis being "rejected" or "not rejected".
We will observe the impact of low power on interpreting results.

## Problem Setup: Replication Failure

As a budding young research psychologist,
you've just published your first paper:
_Exposure to Foo Reduces Bar_.
In it, you performed a statistical test
of the null hypothesis that Foo does not have any impact on the effect of Bar,
found that $p<0.05$, and so concluded that this null hypothesis could be rejected.

Only one year on from publication, a contradicting paper comes out:
_Exposure to Foo Does Not Reduce Bar_.
Another research group has repeated your experiment
and gotten a different result:
their $p \geq 0.05$.
An experiment that repeats another is called a _replication experiment_,
and when the results are different,
we say that the first experiment _failed to replicate_.
The authors of the paper claim that
the null hypothesis should therefore not be rejected.

You are devastated:
you've suffered a _replication failure_.
Someone has contradicted your work,
and surely that means that your hypotheses about Foo and Bar have been falsified,
and you need to burn your dissertation and start over.

But should you be?
Let's make a pyMC model of the process of running a science experiment
and obtaining a result using NHST.
Then, we can draw from the posterior of the model
to determine what your beliefs about the null hypothesis should be,
given the results you've seen.

## Writing Down our Model

In [None]:
utils.plot.make_science_model_graph();

To write down a model, we need to combine a prior and a likelihood.

The prior will be determined by what we thought about the null hypothesis before we saw the experiment.

The likelihood will be determined by the parameters of our statistical test.

### Setting the Prior

The prior for this model is straightforward:
the null hypothesis is either true or false,
and so we have a `Bernoulli` distribution.

That `Bernoulli` distribution has a parameter, `p`,
which determines the probability that it is equal to `1`.
Let's interpret this variable being equal to `1` to mean that the null hypothesis was true.
For that reason, we'll call this variable `null_true`.

Do not confuse this `p`, or any of the `p` parameters
in the pyMC model,
for the $p$ value calculated from a statistic (note the difference in typesetting).
The $p$-value is a random variable, derived from our data,
while `p` here is a fixed parameter.

To represent the state of maximum uncertainty about the null hypothesis,
we set this parameter to be equal to `0.5`:
the null hypothesis is equally likely to be true as false.
Given the track record of scientific claims,
it would perhaps be more appropriate to be more conservative
and set the prior probability that the null hypothesis is true higher than `0.5`.

In [None]:
prior = 0.5

### Setting the Likelihood

Since the result of our hypothesis test is similarly binary,
either positive (we reject the null)
or negative (we fail to reject the null),
the likelihood will also be `Bernoulli`.

The table below breaks down the English-language terms for the likelihood component of our model.
The columns correspond to whether the null hypothesis is $F$alse or $T$rue,
while the rows correspond to whether the outcome of the test is positive ($+$) or negative ($-$).
Because our prior was over the truth value of the null hypothesis,
the columns (colored) are the conditional distributions.

<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
      <th > $$F$$ </th>
      <th > $$T$$ </th>
    </tr>
    <tr>
      <td >$$+$$</td>
      <td style="background-color: rgb(0,50,98); color: white"> True Positive Rate, Power, Sensitivity </td>
      <td style="background-color: rgb(253,181,21);"> False Positive Rate, &#945; </td>
    </tr>
     <tr>
      <td >$$-$$</td>
      <td style="background-color: rgb(0,50,98); color: white"> False Negative Rate, &#946;</td>
      <td style="background-color: rgb(253,181,21);"> True Negative Rate, Specificity</td>
    </tr>
  </tbody>
</table>


For our pyMC version of this model, let's say that a value of `1`
for the variable representing the test outcome corresponds to a positive outcome,
and give it the name `positive_result`.
Then, the two values for the `p` parameter of the `Bernoulli` likelihood are
the "False Positive Rate", $\alpha$, for when the null is true,
and the "Power", $1 - \beta$, for when the null is false.

If we reject the null whenever $p$ is below a threshold $x$,
then our false positive rate is $x$.
This follows from the definition of $p$.
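This claim can also be checked by simulation. The sketch below is one way to do so, under assumptions that are not part of the lab: a two-sided z-test with known standard deviation stands in for "our statistical test", and the helper name `z_test_p_value`, the number of experiments, and the sample size are all illustrative choices.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is reproducible

threshold = 0.05        # the rejection threshold x
n_experiments = 20000   # number of simulated experiments (arbitrary)
n_samples = 20          # observations per experiment (arbitrary)
standard_normal = NormalDist()


def z_test_p_value(data, sigma=1.0):
    """Two-sided p-value for H0: mean == 0, assuming a known standard deviation."""
    sample_mean = sum(data) / len(data)
    z = sample_mean / (sigma / len(data) ** 0.5)
    return 2 * (1 - standard_normal.cdf(abs(z)))


# Every data set is generated with the null hypothesis true (the mean is 0).
p_values = [
    z_test_p_value([random.gauss(0.0, 1.0) for _ in range(n_samples)])
    for _ in range(n_experiments)
]

# Fraction of null experiments that would be (wrongly) declared positive.
false_positive_rate = sum(p < threshold for p in p_values) / n_experiments
print(f"fraction of null experiments with p < {threshold}: {false_positive_rate:.3f}")
```

When the null is true, the fraction of experiments with $p$ below the threshold should come out close to the threshold itself, whatever value of `threshold` is chosen.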

The choice of $x$ is determined by tradition: `0.05`.

In [None]:
alpha = 0.05

The `power` is much harder to measure.
Once you've completed an experiment,
you can estimate the power of a follow-up experiment of any size,
but this relies on assuming that the effect you measured was accurate.
Measured effects are only reliably accurate for large sample sizes,
which are uncommon in many branches of science.
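To make the dependence on an assumed effect concrete, here is a minimal sketch of a power calculation. It assumes a two-sided z-test with unit variance rather than any particular test used in practice; the function `z_test_power` and the value of `assumed_effect` are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def z_test_power(effect_size, n_samples, alpha=0.05):
    """Power of a two-sided z-test, assuming the true standardized effect is `effect_size`."""
    standard_normal = NormalDist()
    critical_value = standard_normal.inv_cdf(1 - alpha / 2)
    # Mean of the z statistic when the assumed effect is real.
    shift = effect_size * n_samples ** 0.5
    # Probability that the z statistic lands in either rejection region.
    upper_tail = 1 - standard_normal.cdf(critical_value - shift)
    lower_tail = standard_normal.cdf(-critical_value - shift)
    return upper_tail + lower_tail


assumed_effect = 0.3  # hypothetical standardized effect size
for n in (10, 20, 50, 100):
    print(n, round(z_test_power(assumed_effect, n), 2))
```

The estimate is only as good as `assumed_effect`: if a small first experiment overestimated the effect, the computed power of the follow-up is overestimated too.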

By combining results from many papers,
a process known as _meta-analysis_,
we can improve the accuracy of our power estimations.

By doing [a meta-analysis of meta-analyses](https://www.nature.com/articles/nrn3475),
Button et al. were able to estimate the median power level of experiments in neuroscience to be `0.3`.
Power levels are low in neuroscience because of ethical concerns
(each element in a sample often requires the death of an animal)
and the expense of experiments.

In psychology, power levels are sometimes higher,
but they are limited by ethical concerns of their own
(human experimentation is subjected, rightly, to substantial oversight)
and by the tremendous degree of uncontrolled variability in experiments with humans.

So for the power of our experiment, we will take the pessimistic value of $0.3$.

In [None]:
power = 0.3

The value of `p` in the likelihood depends on whether,
on a given sample, the null was true or not:
if it was, the false positive rate determines the behavior of our simulated test,
and if it was not, the power does.
To choose between the two, we use `pm.math.switch`.

This function is a sort of `if`/`else` for pyMC variables.
Read the section on it under **Tips** at the bottom of the lab for an explanation of how to use it,
or check out the slides for the previous week for an example of it in use.

### Including the Observations

Lastly, we need to set the `observed` value.

We have not directly observed whether the null is true or not,
so we do not place the `observed` keyword in the variable
representing the truth value of the null hypothesis.
Instead, we've observed the outcome of a test,
so we put it in the `positive_result` variable.

We will first model our beliefs after seeing the positive result
that led us to publish our paper, so `observed=1`.

### Specifying the Model in pyMC and Drawing Samples

Now, combine all of these components together in a model,
`science_model_positive_result`,
based on the template below.

Put `prior`, `power` and `alpha` in the right places.
Use `pm.math.switch` to switch between using `power` and `alpha`
depending on the value of `null_true`.
Again, check out the section under **Tips** for more on how to use `pm.math.switch`.

```python
with pm.Model() as science_model_positive_result:
    null_true = pm.Bernoulli("null_true", p=?)
    positive_result = pm.Bernoulli("positive_result",
                                     p=pm.math.switch(?, ?, ?), observed=1,
                                     dtype="int64")
```

Note: ignore the `dtype="int64"` argument. It is there to work around a small bug in the version of pyMC used in this class.

## Checking the Model

Use `sample_prior_predictive` to draw `10000` samples,
saved to a dataframe called `prior_samples`.

In [None]:
ok.grade("q1")

**Now, produce two histograms representing the likelihood**.

Make a histogram for each of the two conditional distributions of the `positive_result` variable,
one given that the null hypothesis is true (`prior_samples["null_true"] == 1`)
and one given that it is false.

For convenience, the function `utils.plot.compare_bernoullis` is provided to help you make the histograms.
If you'd rather do it yourself,
for practice or because you find that function more confusing than helpful,
you may do so.
Check out the documentation string by running a cell containing the line `utils.plot.compare_bernoullis??`
and see the example under the **Tips** section at the bottom of the lab.

Once you have the histograms, use them to answer the question below.

#### Q Which bar's height is equal to `alpha`? Which bar's height is equal to `power`?

## Approximating Our Posterior Beliefs with pyMC

Next, let's use `pm.sample` to draw samples from the posterior,
which represents our beliefs about the null hypothesis after observing the
results of the first experiment.

Draw at least `10000` samples and put them in a dataframe called `posterior_samples`.

In [None]:
ok.grade("q2")

**Now, produce two histograms representing the prior and the posterior for the null hypothesis**.

The former will be based on the `prior_samples` and the latter on the `posterior_samples`.

Again, `utils.plot.compare_bernoullis` might be of help.

Use the histograms to answer the question below.

#### Q What, approximately, is the probability that the null hypothesis is true under the posterior? Is it larger or smaller than the probability under the prior? Intuitively, why did the probability change in this direction?

The questions below ask about what might happen if certain parameters differed. If you've got a strong grasp on how conditional probabilities work, you can answer them directly.

If you're struggling, just try simulating what happens instead! Change the values of the relevant parameters,
`alpha` and `prior`.
Except for values of `alpha` and `prior` very close to `0` or `1`, this won't interfere with the autograder.

#### Q If you wanted to increase the posterior probability of the null hypothesis being false, given a positive result, would you increase or decrease the value of `alpha`? Give an intuitive explanation for your answer.

#### Q If the `prior` probability of the null hypothesis being true were larger, would the posterior probability of the null hypothesis being false increase or decrease?

## Incorporating the Failed Replication Experiment

Now, let's see how our beliefs change once the replication failure occurs.
Folks usually interpret a replication failure to mean
that the null hypothesis is extremely likely to be true.

To check that interpretation, let's specify a new model, `science_model_replication_failure`.
Again, we'll need observations, a likelihood, and a prior.

The observed value will be different, since we are now observing a negative result: `observed=0`.

The likelihood will be the same as in the first model,
because the experiment and statistical test being performed are exactly the same.

The prior, however, will be different.
We can now _incorporate the evidence from the previous experiment into our prior_.
The new prior probability of the null hypothesis,
under this new model of the failed replication experiment,
is equal to the posterior probability of the null hypothesis,
under the old model and given the observation (in this case, of a positive result).
That is, we _update our beliefs_ in response to the evidence.
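For a model this small, the same update can be written out exactly with Bayes' rule, which gives a way to sanity-check sampled results. The sketch below is not part of the lab's code: the function `posterior_null_true` is a hypothetical helper, and the example values are illustrative and deliberately different from the lab's settings.

```python
def posterior_null_true(prior_null_true, alpha, power, positive_result):
    """Posterior probability that the null is true after one test result, by Bayes' rule."""
    if positive_result:
        likelihood_if_true, likelihood_if_false = alpha, power
    else:
        likelihood_if_true, likelihood_if_false = 1 - alpha, 1 - power
    numerator = likelihood_if_true * prior_null_true
    evidence = numerator + likelihood_if_false * (1 - prior_null_true)
    return numerator / evidence


# Illustrative values only, not the values used in this lab.
example_prior, example_alpha, example_power = 0.5, 0.01, 0.8

after_first = posterior_null_true(
    example_prior, example_alpha, example_power, positive_result=True
)
# The posterior from the first experiment becomes the prior for the second.
after_second = posterior_null_true(
    after_first, example_alpha, example_power, positive_result=False
)
print(after_first, after_second)
```

Chaining the two calls mirrors the structure of the two pyMC models: the output of the first update is passed in as the prior of the second.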

Calculate the `new_prior` probability of the null hypothesis
from the `posterior_samples` and use it,
along with the values of `alpha` and `power`,
to specify a model based on the template below.

Then, draw at least `10000` samples with `pm.sample` or `shared_util.sample_from`,
and put them in a dataframe called `replication_failure_samples`.

```python
with pm.Model() as science_model_replication_failure:
    null_true = pm.Bernoulli("null_true", p=?)
    positive_result = pm.Bernoulli("positive_result",
                                     p=pm.math.switch(?, ?, ?), observed=0,
                                     dtype="int64")
```

In [None]:
ok.grade("q3")

**Now, plot the histograms of the posterior after the first experiment
and the posterior after the failed replication experiment.**

Use them to answer the questions below.

#### Q What is the final posterior probability of the null hypothesis, given the results of the two experiments?

#### Q How different are the posteriors before and after observing the result of the failed replication experiment? Does this surprise you? Why or why not?

The original scientific question was about whether the null hypothesis could be rejected: does Foo, in fact, reduce the effect of Bar? 

#### Q Would you consider this question settled? Explain how you came to this conclusion based on the final posterior probability. Note: there is an element of subjectivity in answering this question, so don't be afraid to disagree with your classmates.

In [None]:
ok.score()

## Tips

### Using `pm.math.switch`

`pm.math.switch` acts like an `if`/`else` statement, but for a pyMC model.

The arguments for `pm.math.switch` are:
0. The "trigger" variable whose value determines which argument to return. This acts like the expression after an `if`.
1. The value to return when the "trigger" is `1`.
2. The value to return when the "trigger" is `0`.

```python
p = pm.math.switch(arg0, arg1, arg2)
```

is the pyMC equivalent of:

```python
if arg0:
    p = arg1
else:
    p = arg2
```
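When the trigger takes many values at once, the selection happens separately for each one. A rough NumPy analogue is `np.where`, assuming `pm.math.switch` behaves element-wise in the same way; the trigger values and probabilities below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical trigger values and probabilities, not taken from the lab.
trigger = np.array([1, 0, 0, 1])
value_when_one, value_when_zero = 0.9, 0.2

# Element-wise selection: value_when_one wherever trigger is nonzero,
# value_when_zero everywhere else.
selected = np.where(trigger, value_when_one, value_when_zero)
print(selected)
```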

### Using `utils.plot.compare_bernoullis`

If you'd like, you may use the `utils.plot.compare_bernoullis` function
to plot your histograms.

I encourage you to use your own plotting code, if you'd like the practice!

Read the documentation for this function by uncommenting and then running the cell directly below this one.

In [None]:
# utils.plot.compare_bernoullis??

Run the cell beneath this one to see an example.

In [None]:
example_series = [pd.Series([1, 1, 1, 1]), pd.Series([1, 1, 0, 0]), pd.Series([0, 0, 0, 0])]
utils.plot.compare_bernoullis(example_series,
                   colors=["C0", "C1", "C2"],
                   titles=["All 1s", "Half and Half", "All 0s"]);