<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Bayesian Inference 01

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

import daft
from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

In [None]:
def compare_bernoullis(bernoulli_samples, colors=None, titles=None):
    """Given a list of Series representing samples from a Bernoulli variable,
    plot histograms of each Series in the list.
    Optionally, provide a color and/or title for each histogram.
    
    Parameters
    ==========
    bernoulli_samples: list of Series or list of arrays. Each element in
                       this list is passed to sns.distplot
    colors : list of strings or None. If not None, use to color the histograms.
    titles : list of strings or None. If not None, use to title the axes.
    
    Returns
    =======
    f : matplotlib Figure containing axs with histograms plotted in
    axs : array of matplotlib Axes
    """
    n_bernoullis = len(bernoulli_samples)
    
    f, axs = plt.subplots(figsize=(6 * n_bernoullis, 6),
                          ncols=n_bernoullis,  sharex=True, sharey=True)
    if n_bernoullis == 1:
        axs = np.array([axs])
    if colors is None:
        colors = [None] * n_bernoullis
    if titles is None:
        titles = [""] * n_bernoullis
        
    assert len(colors) == n_bernoullis, f"provide the same number of colors as bernoulli_samples: {n_bernoullis}"
    assert len(titles) == n_bernoullis, f"provide the same number of titles as bernoulli_samples: {n_bernoullis}"
    
    bins = [-0.5, 0.5, 1.5]
    kwargs = {"kde": False, "bins": bins,
              "norm_hist": True, "hist_kws": {"alpha": 1, "ec": "k", "lw": 4}}
    for ax, bernoulli, color, title in zip(axs, bernoulli_samples, colors, titles):
        sns.distplot(bernoulli, **kwargs, color=color, ax=ax);
        ax.set_xlabel("")
        ax.set_title(title)
        ax.set_ylim(0, 1.1)

    for ax in axs:
        ax.set_xticks([0, 1])
        ax.set_xticklabels(["0", "1"])
    return f, axs

In [None]:
def add_arrow_chain(prior1, ax):
    posterior1 = posterior_given_passing(prior1)
    ax.arrow(prior1, posterior1, (posterior1 - prior1) * 0.67, 0, lw=4, head_width=0.02, color="k")
    
    prior2 = posterior1
    posterior2 = posterior_given_passing(prior2)
    
    ax.vlines([prior1, prior2], [prior1, prior2], [posterior1, posterior2], lw=4, color="C3");

# Bayes' Rule

Previously, we derived Bayes' Rule,
for relating conditional probabilities to one another,
for "inverting" a conditional probability statement:

$$
p(A \vert B) \ \ \ \overleftrightarrow{\text{Bayes}} \ \ \ p(B \vert A)
$$

The important special case of this rule that was considered last week
was the relationship between a hypothesis and data that provided evidence about that hypothesis:

$$
p(\text{hypothesis}\vert \text{data}) = \frac{p(\text{data}\vert \text{hypothesis}) p(\text{hypothesis})}{p(\text{data})}
$$

The left term in the numerator is the _likelihood_:
typically, our data is fixed, and we vary the hypothesis,
obtaining the probability we would observe the data we did observe,
for each hypothesis we consider.

When we build a model,
this is the piece that relates unknown quantities,
like the true mean of the population,
to quantities we can observe, like the value observed in a sample.

Note how much easier it is to specify this direction of conditional probability
than the other way around.

For example: if I know that an animal is a cat, rather than a dog,
I can guess its weight.
But if I know an animal's weight, I need to think quite a bit harder
to determine whether it's a dog or a cat.

The right term in the numerator is the _prior_:
the probability we assign to the hypothesis,
having not seen any data.

When we build a model,
this is the piece that captures the knowledge we bring to the problem
from our experience, from the scientific literature, or by assumption.

Recall that, for pyMC to work,
we don't need to specify the denominator:
we only need to know the probability of the hypothesis
"up to a proportionality constant".

$$
p(\text{hypothesis}\vert \text{data}) \propto p(\text{data}\vert \text{hypothesis}) p(\text{hypothesis})
$$

# Bayes' Rule and Binary Hypothesis Testing

In last week's lab, we were even more specific:
we focused in on the case where the "data" we observe is just the result of a statistical test:

$$
p(\text{hypothesis}\vert \text{test result}) = \frac{p(\text{test result}\vert \text{hypothesis}) p(\text{hypothesis})}{p(\text{test result})}
$$

## $ p(\text{hypothesis})$

This is the _prior_ component of our model.

In [None]:
prior_on_null = 0.5

In the lab, we said that we thought there was a 50% chance that the null was true.

## $ p(\text{test result}\vert \text{hypothesis})$

This is the _likelihood_ component of our model.

The state of the null hypothesis determines which column of this table we are in.
Remember that this is generally unknown, even unknowable!

The output of the statistical test determines which row.
This is the component that we actually know.


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+2"> $$ p(\text{result}\vert \text{hypothesis}=F)$$ </font></th>
        <th > <font size="+2"> $$ p(\text{result}\vert \text{hypothesis}=T)$$ </font></th>
    </tr>
    <tr>
        <td ><font size="+2"> $$+$$ </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> True Positive Rate, Power, Sensitivity </font> </td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> False Positive Rate, &#945; </font> </td>
    </tr>
     <tr>
         <td ><font size="+2"> $$-$$</font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> False Negative Rate, &#946; </font></td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> True Negative Rate, Specificity </font> </td>
    </tr>
  </tbody>
</table>


Color indicates the components that are probability distributions:
the columns of this table each add up to 1.

The rows of this table for our model do not add up to 1,
and they shouldn't, in general.

Since we think of our data
(in this case, the output of the statistical test)
as being fixed and compare the choices of the value of the unknown
(in this case, the truth value of the null hypothesis),
we are usually in the case where we are moving within a row of this table,
rather than within a column.

That's why this component has the name _likelihood_,
more specifically _data likelihood_,
rather than anything involving the word "probability",
as in _prior probability_ and _posterior probability_:
probabilities add up to 1.

These are both binary variables, so we can model them as `Bernoulli`s and need to specify a value for `p` for each:

In [None]:
alpha = 0.05; power = 0.3

These values were used in the lab.

The value of `alpha`, remember, is determined by our cutoff on the $p$-value.

The value of `power` is trickier to determine, and ranges from `0.3` for small, noisy studies,
as are common in many branches of biology, including neuroscience,
up to `0.8` or `0.9` or higher for large, well-controlled studies,
like clinical trials.
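As a quick check on the claim that the columns of the likelihood table add up to 1 while the rows do not, the sketch below rebuilds the table from the `alpha` and `power` values above. The dictionary layout and the name `likelihood_table` are illustrative choices, not part of the lab code.

```python
import math

alpha, power = 0.05, 0.3  # values from the lab

# likelihood_table[null_is_true][result] = p(result | state of the null)
likelihood_table = {
    True: {"+": alpha, "-": 1 - alpha},   # null true: false positive rate, specificity
    False: {"+": power, "-": 1 - power},  # null false: power, false negative rate
}

# each column (fixed hypothesis) is a probability distribution over results
column_sums = {h: sum(column.values()) for h, column in likelihood_table.items()}

# each row (fixed result) is a likelihood function over hypotheses
row_sums = {r: likelihood_table[True][r] + likelihood_table[False][r] for r in ("+", "-")}

print(column_sums)
print(row_sums)
```

Changing `alpha` or `power` changes the row sums freely, but the column sums stay at 1 by construction.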

## We Can Almost Never Write Down a Table, But We Can Almost Always Write Down a Model

In [None]:
with pm.Model() as science_model_positive_result:
    null_true = pm.Bernoulli("null_true", p=prior_on_null)
    positive_result = pm.Bernoulli("positive_result",
                                   p=pm.math.switch(null_true, alpha, power), observed=1,
                                   dtype="int64")

Depending on the context, this will be called a
- _`pyMC` model_, when we want to emphasize the concrete implementation
- _generative model_, when we want to emphasize the abstract concept

Some terms you might hear elsewhere:
- _Bayesian model_, _Bayesian network_, _graphical model_, _probabilistic program_, _directed acyclic graph_ (DAG)

When we want to ask questions of the model,
we draw samples from it.

Any technique for statistical estimation based on simulating random samples is a _Monte Carlo_ technique.

For example, bootstrapping is also a Monte Carlo technique.
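To make "asking questions by drawing samples" concrete, here is a minimal sketch that simulates the same two-variable model using only Python's standard library. It is not how pyMC samples internally: it uses simple rejection (simulate everything, keep the draws that match the observation), and the helper name `simulate_study` is hypothetical.

```python
import random

prior_on_null, alpha, power = 0.5, 0.05, 0.3  # values from the lab


def simulate_study(rng):
    """Draw one (null_true, positive_result) pair from the generative model."""
    null_true = rng.random() < prior_on_null
    p_positive = alpha if null_true else power
    positive_result = rng.random() < p_positive
    return null_true, positive_result


rng = random.Random(0)  # fixed seed, so the sketch is reproducible
draws = [simulate_study(rng) for _ in range(100_000)]

# keep only the simulated studies that match what we observed: a positive result
null_given_positive = [null_true for null_true, positive in draws if positive]
estimate = sum(null_given_positive) / len(null_given_positive)

# exact answer from Bayes' Rule, for comparison
exact = alpha * prior_on_null / (alpha * prior_on_null + power * (1 - prior_on_null))
print(f"Monte Carlo: {estimate:.3f}, exact: {exact:.3f}")
```

Rejection works here because a positive result is common enough to keep plenty of draws. With rarer observations most draws are thrown away, which is one reason real samplers use smarter strategies.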

In [None]:
samples_from_prior = pm.sample_prior_predictive(model=science_model_positive_result)

In [None]:
samples_from_prior

In [None]:
samples_from_prior_df = shared_util.samples_to_dataframe(samples_from_prior)

`sample_prior_predictive` produces a dictionary,
as does `sample_posterior_predictive`,
while `pm.sample` produces something else called a `MultiTrace`.

For most of our analysis, we want to think of these all as _samples_,
so we convert to one datatype, a pandas `DataFrame`.

In [None]:
print(samples_from_prior_df.head())

In [None]:
null_true_selector = samples_from_prior_df["null_true"].astype(bool)
null_false_selector = ~null_true_selector  # ~ is logical NOT for boolean Series

Note: pyMC works exclusively with numbers.
Variable values cannot be `bool`eans or `str`ings,
only things like `int`s and `float`s.

`pandas`, on the other hand, uses `bool`eans and `str`ings quite a lot,
and so you'll need to interconvert between the two.

This will come up when using data from `pandas` as
the observed values in pyMC,
e.g. in this week's lab.

In [None]:
compare_bernoullis(
    [samples_from_prior_df["positive_result"].loc[null_true_selector],
     samples_from_prior_df["positive_result"].loc[null_false_selector]],
    colors=["C0", "C1"], titles=[
        "Null Hypothesis True:\n $P(R\\vert H_0=$True$)$",
        "Null Hypothesis False:\n $P(R\\vert H_0=$False$)$"]);


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+1"> $$F$$ </font> </th>
        <th > <font size="+1"> $$T$$ </font> </th>
    </tr>
    <tr>
        <td > <font size="+2"> $$+$$ </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> True Positive Rate, Power, Sensitivity </font> </td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> False Positive Rate, &#945; </font> </td>
    </tr>
     <tr>
         <td ><font size="+2"> $$-$$ </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> False Negative Rate, &#946; </font></td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> True Negative Rate, Specificity </font> </td>
    </tr>
  </tbody>
</table>


## Bayes and Bugs

Now, let's do another example, also with binary variables.

In [None]:
HTML(filename="data/debug_tweet.html")

When we write code, we aim for it to have no bugs.

To try and ensure this,
we write tests and check whether the code passes those tests.

If a chunk of code fails a test,
then we know there's a bug.

Presuming our tests don't have bugs! Remember [Cromwell's Rule](https://en.wikipedia.org/wiki/Cromwell%27s_rule).

But in Python, even if we've got really good tests,
there's still a chance a bug slips through.

Some other languages can make guarantees, of a sort,
that certain kinds of bugs are not present.

So the inferential question here is:
if I write code that passes all my tests,
what's the chance that it has no bugs?

### $$ p(\text{no bugs}\vert \text{pass tests}) = 🤔$$ 

That is, what should I _believe_ about the bugginess of my code
_after_ I've passed the tests?

It's intuitive that this depends on what kinds of code I tend to write
and how good my tests are.

As we'll see,
if we just write out Bayes' Rule
and start filling it in,
those two intuitions will pop out.

$$
p(\text{no bugs}\vert \text{pass tests}) = \frac{p(\text{pass tests}\vert \text{no bugs}) p(\text{no bugs})}{p(\text{pass tests})}
$$

Afterwards, we'll compare our answer to pyMC's results.

The "direct" method we use here won't scale
to bigger, more complicated problems,
but pyMC will.

At least to a certain extent.

There are, of course, problems too big for any approach.
For Bayesian Monte Carlo methods like pyMC,
some of those problems come from practical applications,
like modeling images and video.

But for statistical inference
of the kind most often done in research psychology,
Bayesian Monte Carlo will scale well enough to do the job.

This example comes from
[Chapter 1](https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_PyMC3.ipynb)
of the GitHub textbook
[Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers),
one of the core inspirations for this class.

### Prior: $ p(\text{no bugs}) $

This is the chance that the code I have written, without testing it,
has no bugs in it.

It expresses my beliefs about my code, before I have observed the results of tests.

In real life, this prior wouldn't be so simple:
if the code were more complicated,
the prior probability would be lower,
while if it were less complicated,
or I had worked on it with a friend,
the prior probability might be higher.

You might even start to think of it as a function of other variables,
some of which you can measure and some of which you can't.

In [None]:
prior_no_bugs = 0.2

Below, this will be denoted $p_n$.

### Likelihood: $ p(\text{pass tests} \vert \text{no bugs}) $

This component relates one variable's value to the distribution over another variable's possible values.

Most often, it relates the value of a variable we _cannot_ observe
to a distribution over the values of a variable we _can_ observe.


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+1"> No Bugs </font> </th>
      <th > <font size="+1"> Some Bugs </font> </th>
    </tr>
    <tr>
        <td > <font size="+1"> Pass </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> 1 </font> </td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> 0.5 </font> </td>
    </tr>
     <tr>
      <td ><font size="+1"> Fail </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> 0 </font></td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> 0.5 </font> </td>
    </tr>
  </tbody>
</table>


In [None]:
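def tests_likelihood(observation, truth_about_code):
    """Look up p(observation | truth_about_code) in the table above."""
    if observation == "pass tests":
        if truth_about_code == "no bugs":
            return 1
        elif truth_about_code == "some bugs":
            return 0.5

    if observation == "fail tests":
        if truth_about_code == "no bugs":
            return 0
        elif truth_about_code == "some bugs":
            return 0.5

    # fail loudly on unrecognized inputs instead of silently returning None
    raise ValueError(f"unrecognized inputs: {observation!r}, {truth_about_code!r}")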

This code implements a "look-up table"
for the likelihood above.

You might think of all of the likelihoods in all of our models as being
just big, fast versions of a look-up table like this one:
given a value for the parameters (bugs or no bugs)
and the observations (pass or fail),
they return a probability.

We are only interested in what happens if the code passes the tests,
so we can make a simple function that only looks at a "row" of the table.

In [None]:
def likelihood_test_passed(truth_about_code):
    return tests_likelihood("pass tests", truth_about_code)

In [None]:
likelihood_test_passed("no bugs") + likelihood_test_passed("some bugs")

### Normalizing Factor: $p(\text{pass tests})$

This is also called the _marginal probability of the observations_.

$$
p(\text{pass tests}) = p(\text{pass tests}\ ,\ \text{no bugs}) + p(\text{pass tests}\ ,\ \text{some bugs})
$$

The chance that we pass the tests is

- the chance we pass the tests and there are no bugs PLUS
- the chance we pass the tests and there are some bugs

We could have instead written it as

- the chance we pass the tests and [Mercury is in retrograde](https://www.ismercuryinretrograde.com/) PLUS
- the chance we pass the tests and Mercury is not in retrograde

since the only thing that matters is that we break the probability down
in terms of two mutually exclusive events.

But using the behavior of Mercury wouldn't give us
a useful way of breaking down the number we're trying to calculate,
whereas the way we chose gives us,
by applying the rule $p(x,y) = p(x\vert y)p(y)$ twice:

$$
p(\text{pass tests}) = p(\text{pass tests}\vert \text{no bugs}) p(\text{no bugs}) + p(\text{pass tests}\vert \text{some bugs}) p(\text{some bugs})
$$

These are all numbers from our likelihood and prior!

And indeed, once you have a likelihood and a prior,
nothing in principle is stopping you from calculating
this normalization factor.

For us,
the numbers are:

$$
p(\text{pass tests}) = 1 \cdot p_n + 0.5 \cdot (1 - p_n)
$$

Which we calculate in Python as

In [None]:
prior_some_bugs = 1 - prior_no_bugs

normalizing_factor = likelihood_test_passed("no bugs") * prior_no_bugs\
    + likelihood_test_passed("some bugs") * prior_some_bugs

In [None]:
normalizing_factor

This example is a bit deceptive:
with more complicated discrete models, the number of things we need to add together grows very rapidly.
With continuous variables, the sum becomes an integral, and those are very hard in general.
With complicated continuous models, the result is a high dimensional integral,
for which we have very limited mathematical tools.

So even though it's possible to do, in theory,
it is impractical, and so pyMC is built specifically to avoid computing it.
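To see how quickly the sum grows, consider a hypothetical extension of the bugs example (an assumption made only for this sketch): the code is split into several modules, each independently bug-free with probability `prior_no_bugs`, and the tests pass with probability 1 if every module is bug-free and 0.5 otherwise. Computing the normalizing factor by brute force means visiting every configuration of modules.

```python
from itertools import product

prior_no_bugs = 0.2


def normalizing_factor_by_enumeration(n_modules):
    """Sum p(pass tests, configuration) over every bug/no-bug configuration."""
    total, n_terms = 0.0, 0
    for config in product([True, False], repeat=n_modules):  # True means bug-free
        p_config = 1.0
        for bug_free in config:
            p_config *= prior_no_bugs if bug_free else 1 - prior_no_bugs
        p_pass = 1.0 if all(config) else 0.5  # hypothetical likelihood
        total += p_pass * p_config
        n_terms += 1
    return total, n_terms


for n_modules in (1, 2, 10):
    total, n_terms = normalizing_factor_by_enumeration(n_modules)
    print(n_modules, n_terms, round(total, 4))
```

The number of terms doubles with every added binary variable, so even this toy model needs over a million terms at 20 modules.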

### Posterior: $p(\text{no bugs}\vert \text{pass tests})$

This is what we were actually interested in.

### $$ p(\text{no bugs}\vert \text{pass tests}) = 🤔$$ 

Unlike the majority of cases,
in this one we can actually calculate the posterior directly by hand,
using the quantities above.

We begin by writing out Bayes' Rule:

$$\begin{align}
p(\text{no bugs}\vert \text{pass tests}) &= \frac{p(\text{pass tests}\vert \text{no bugs})p(\text{no bugs})}{ p(\text{pass tests})}
\end{align}$$

then we plug in the numbers

$$\begin{align}
p(\text{no bugs}\vert \text{pass tests}) &= \frac{1 \cdot p_n}{ 1 \cdot p_n + 0.5 \cdot (1 - p_n)}
\end{align}$$

and simplify:

$$\begin{align}
p(\text{no bugs}\vert \text{pass tests}) &= \frac{2 p_n}{2 p_n +(1 - p_n)}\\
&= \frac{2 p_n}{ p_n + 1}
\end{align}$$

And so our posterior probability is no longer a mystery.

It's just this simple function of our prior:

### $$ p(\text{no bugs}\vert \text{pass tests}) \neq 🤔$$ 

### $$ p(\text{no bugs}\vert \text{pass tests}) = \frac{2 p_n}{p_n + 1}$$ 

First, let's use this function to get our posterior,
given all of the assumptions we made:

In [None]:
# note that this assumes that our likelihood of passing with some bugs was 0.5!
# if you change the definition of likelihood above, you also need to rederive the values for this function

def posterior_given_passing(p_no_bugs):
    return 2 * p_no_bugs / (p_no_bugs + 1)

In [None]:
print(prior_no_bugs, posterior_given_passing(prior_no_bugs))

This makes sense: once you've seen the tests pass,
the chance there are no bugs increases, but it doesn't go to 1.

Clearly, the posterior depends on our prior,
which we set fairly arbitrarily.

Let's take a look at what the posterior probability looks like for 
a bunch of different values of the prior probability.

In [None]:
f, ax = plt.subplots(figsize=(10, 10)); ps = np.linspace(0, 1, num=100)
ax.plot([0, 1],  [0, 1], color="gray", lw=4, ls="--", label="Prior");
ax.plot(ps, posterior_given_passing(ps), lw=4, label="Posterior");
ax.vlines(ps, ps, posterior_given_passing(ps), color="C3", zorder=0, label="Change in Beliefs");
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.legend();
ax.set_xlabel("$p_n$, prior $p$ of bug-free code");
ax.set_ylabel("$p($no bugs $\\vert$ passed tests$)$\nposterior $p$ of bug-free code");

1. Observing that the tests were passed never increases the chance that there are bugs, and it strictly decreases that chance whenever the prior is not 0 or 1.
2. The only time that observing that the tests were passed doesn't change the posterior from the prior
is when the prior is 0 or 1: if you're certain, you have no need of evidence.
3. The change is largest for intermediate priors (for this likelihood, at $p_n = \sqrt{2} - 1 \approx 0.41$): evidence is most useful when you are uncertain.
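These observations about the plot can be checked numerically. The sketch below scans a grid of priors, using the same closed form as `posterior_given_passing`, and reports where a passed test leaves beliefs unchanged and where it moves them the most. The grid resolution and variable names are arbitrary choices.

```python
import math


def posterior_given_passing(p_no_bugs):
    # same closed form as above; assumes p(pass tests | some bugs) = 0.5
    return 2 * p_no_bugs / (p_no_bugs + 1)


priors = [i / 1000 for i in range(1001)]
changes = [posterior_given_passing(p) - p for p in priors]

# the update never lowers the probability of bug-free code...
never_decreases = all(change >= 0 for change in changes)
# ...and it vanishes only at the two certain priors, 0 and 1
unchanged_priors = [p for p, change in zip(priors, changes) if change == 0]
# the prior at which a passed test moves our beliefs the most
largest_change, best_prior = max(zip(changes, priors))

print(never_decreases, unchanged_priors)
print(best_prior, round(largest_change, 4), round(math.sqrt(2) - 1, 4))
```

The change in belief is $p_n (1 - p_n) / (1 + p_n)$, which peaks at $p_n = \sqrt{2} - 1$, a little below 0.5. The exact location depends on the likelihood, so a different test quality would move the peak.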

### For Comparison, the pyMC Way

Returning to the original problem,
let's remind ourselves of how we'd use pyMC to solve it.

### $$ p(\text{no bugs}\vert \text{pass tests}) = 🤔$$ 

The pyMC approach to solving this problem is to _approximate the posterior_
by drawing samples from it.

In [None]:
with pm.Model() as bugs_model:
    bug_free = pm.Bernoulli("bug_free", p=prior_no_bugs)  # prior
    
    pass_test_no_bugs = likelihood_test_passed("no bugs")
    pass_test_some_bugs = likelihood_test_passed("some bugs")
    
    pass_tests = pm.Bernoulli(  # likelihood
        "pass_tests",
        p=pm.math.switch(
            bug_free, pass_test_no_bugs, pass_test_some_bugs),
        observed=1)

In [None]:
posterior_samples_trace = shared_util.sample_from(bugs_model)
posterior_samples_df = shared_util.samples_to_dataframe(posterior_samples_trace)

posterior_ps = posterior_samples_df["bug_free"].value_counts().sort_index() / len(posterior_samples_df)
posterior_ps

In [None]:
f, ax = plt.subplots(figsize=(6, 6))
ax.bar([0, 1], posterior_ps, color="C2");
ax.set_xticks([0, 1]); ax.set_xticklabels(["some bugs", "bug free"]);
ax.set_ylabel("Posterior probability");

## Iteratively Applying Bayes' Rule

Let's say we write another set of tests, and the code passes again.

What is our new answer?

### $$ p(\text{no bugs}\vert \text{pass both sets of tests}) = 🤔$$ 

Remember that our prior represented our beliefs before we observed the first tests had been passed.

The posterior represents our beliefs after we observed the first tests had been passed.

_After_ the first tests have been passed is also _before_ the second tests have been passed.

Therefore the posterior for the first set of tests is the prior for the second set of tests.

> Nach dem Spiel ist vor dem Spiel

Sepp Herberger, German football coach

In English: "After the game is before the game".

In [None]:
print(posterior_given_passing(prior_no_bugs), # prior for second tests
      posterior_given_passing(posterior_given_passing(prior_no_bugs)) ) # posterior for second tests

In [None]:
f, ax = plt.subplots(figsize=(10, 10)); ps = np.linspace(0, 1, num=100)
ax.plot([0, 1],  [0, 1], color="gray", lw=4, ls="--", label="Prior");
ax.plot(ps, posterior_given_passing(ps), lw=4, label="Posterior");
ax.vlines(ps, ps, posterior_given_passing(ps), color="C3", zorder=0, label="Change in Beliefs");
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.legend();
add_arrow_chain(prior_no_bugs, ax);
ax.set_xlabel("$p_n$, prior $p$ of bug-free code"); #plt.axis("equal");
ax.set_ylabel("$p($no bugs $\\vert$ passed tests$)$\nposterior $p$ of bug-free code");

The thick red vertical lines above represent the updating of our beliefs from the prior to the posterior.

The one going from `0.2` to approximately `0.33` represents the update to our beliefs after the first set of tests.

The arrow indicates that, when we apply another test to the same unknown variable,
the posterior (position on y-axis) becomes the prior (position on x-axis).
Our final beliefs about the unknown variable,
whether the code has bugs,
are obtained by applying Bayes' Rule to get a posterior from that prior
(graphically, by following the thick red line from `0.33` to `0.5`).
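The same chaining can be carried further. The sketch below applies the update four times in a row with exact fractions, assuming every new set of tests has the same likelihood table and is independent of the others given whether the code has bugs (an assumption the example above also makes implicitly).

```python
from fractions import Fraction


def posterior_given_passing(p_no_bugs):
    # same closed form as above; assumes p(pass tests | some bugs) = 0.5
    return 2 * p_no_bugs / (p_no_bugs + 1)


belief = Fraction(1, 5)  # prior_no_bugs = 0.2, written as an exact fraction
beliefs = [belief]
for _ in range(4):  # four passed test suites in a row
    belief = posterior_given_passing(belief)  # the old posterior is the new prior
    beliefs.append(belief)

odds = [b / (1 - b) for b in beliefs]  # odds of bug-free code
print([str(b) for b in beliefs])
print([str(o) for o in odds])
```

In terms of odds the pattern is simple: each passed set of tests doubles the odds that the code is bug-free, because passing is twice as likely without bugs as with them.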

A fun exercise:
rewrite `posterior_given_passing` so that it also takes in the `likelihood_test_passed`
and computes the posterior using Bayes' Rule,
then edit the parameters of the likelihood and see how the posterior changes as a function of the prior.
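One possible starting point for this exercise is sketched below. Rather than passing in a function, it takes the two entries of the "pass" row of the likelihood table as keyword arguments. The name `posterior_no_bugs_given_pass` and its parameters are illustrative, not part of the course code.

```python
def posterior_no_bugs_given_pass(p_no_bugs, pass_given_no_bugs=1.0, pass_given_some_bugs=0.5):
    """Bayes' Rule for p(no bugs | pass tests), with the likelihood as parameters."""
    numerator = pass_given_no_bugs * p_no_bugs
    normalizing_factor = numerator + pass_given_some_bugs * (1 - p_no_bugs)
    return numerator / normalizing_factor


print(posterior_no_bugs_given_pass(0.2))  # defaults reproduce the likelihood table above
print(posterior_no_bugs_given_pass(0.2, pass_given_some_bugs=0.1))  # stricter tests
print(posterior_no_bugs_given_pass(0.2, pass_given_some_bugs=1.0))  # uninformative tests
```

Note that the function divides by the normalizing factor, so it fails when passing the tests has probability zero under the model.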

## Bayes' Rule Flips the Table Around

Returning to the picture in terms of tables representing conditional distributions,
we can see that Bayes' Rule has taken this table,
representing the likelihood:


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+1"> No Bugs </font> </th>
      <th > <font size="+1"> Some Bugs </font> </th>
    </tr>
    <tr>
        <td > <font size="+1"> Pass </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> 1 </font> </td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> 0.5 </font> </td>
    </tr>
     <tr>
      <td ><font size="+1"> Fail </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> 0 </font></td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> 0.5 </font> </td>
    </tr>
  </tbody>
</table>


and, by combining it with the prior, turned it into this table,
representing the posterior:


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+1"> No Bugs </font> </th>
      <th > <font size="+1"> Some Bugs </font> </th>
    </tr>
    <tr style="background-color: rgb(0,50,98); color: white">
        <td > <font size="+1"> Pass </font></td>
      <td> <font size="+2"> 1/3 </font> </td>
      <td> <font size="+2"> 2/3 </font> </td>
    </tr>
     <tr style="background-color: rgb(253,181,21);">
      <td><font size="+1"> Fail </font></td>
      <td> <font size="+2"> 0 </font></td>
      <td> <font size="+2"> 1 </font> </td>
    </tr>
  </tbody>
</table>


### For NHST, these probabilities have special names


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+1"> Null Hypothesis False </font> </th>
      <th > <font size="+1"> Null Hypothesis True </font> </th>
    </tr>
    <tr style="background-color: rgb(0,50,98); color: white">
        <td > <font size="+2"> + </font></td>
      <td> <font size="+2"> Positive Predictive Value </font> </td>
      <td> <font size="+2"> False Discovery Rate</font> </td>
    </tr>
     <tr style="background-color: rgb(253,181,21);">
      <td><font size="+2"> - </font></td>
      <td> <font size="+2"> False Omission Rate </font></td>
      <td> <font size="+2"> Negative Predictive Value </font> </td>
    </tr>
  </tbody>
</table>


Let's think about these terms for the "bugs" example,
where the null hypothesis is that you have some bugs.

- Tests have a very high negative predictive value: if you fail a test, you have a bug,
so failing a test is very informative

- Tests have a much lower positive predictive value: if you pass the tests,
you still have a good chance of having a bug.

The need for prior probabilities to "flip the table around" with Bayes' Rule
is one reason why most statistical treatments only focus on the likelihood,
the "forward model" table.

But when you're trying to interpret the outputs of a test,
this table is the one you want.
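The "flip" itself is mechanical once a prior is available. As a sketch, the code below takes the likelihood table for the bugs example and the prior of 0.2, then produces the posterior table row by row. The nested-dictionary layout is just one convenient representation.

```python
from fractions import Fraction

prior = {"no bugs": Fraction(1, 5), "some bugs": Fraction(4, 5)}
likelihood = {  # likelihood[truth][observation] = p(observation | truth)
    "no bugs": {"pass": Fraction(1), "fail": Fraction(0)},
    "some bugs": {"pass": Fraction(1, 2), "fail": Fraction(1, 2)},
}

posterior = {}  # posterior[observation][truth] = p(truth | observation)
for observation in ("pass", "fail"):
    joint = {truth: likelihood[truth][observation] * prior[truth] for truth in prior}
    normalizing_factor = sum(joint.values())
    posterior[observation] = {truth: p / normalizing_factor for truth, p in joint.items()}

for observation, row in posterior.items():
    print(observation, {truth: str(p) for truth, p in row.items()})
```

In the likelihood table the columns are distributions. In the posterior table the rows are, which is why the coloring switches from columns to rows.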

## Example: Medical Testing

Someone I know had a medical test (below, Test 2) done in an attempt to confirm a diagnosis strongly suggested by an earlier test (Test 1).

Test 2 came back negative, out of alignment with both Test 1 and with some other clinical evidence.
In light of this, the physician recommended more expensive and invasive testing to rule out alternatives.

The reported numbers for $p(+\vert \text{Disease})$ and
$p(-\vert \text{No Disease})$ looked quite good:
the former, the _sensitivity_ (or power) was around 70%
and the latter, the _specificity_, was around 90%.


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+1"> Disease </font> </th>
      <th > <font size="+1"> No Disease </font> </th>
    </tr>
    <tr>
        <td > <font size="+1"> Test 2 + </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> 0.7 </font> </td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> 0.1 </font> </td>
    </tr>
     <tr>
      <td ><font size="+1"> Test 2 - </font></td>
      <td style="background-color: rgb(0,50,98); color: white"> <font size="+2"> 0.3 </font></td>
      <td style="background-color: rgb(253,181,21);"> <font size="+2"> 0.9 </font> </td>
    </tr>
  </tbody>
</table>


When you hear someone say “90% of folks without the disease tested negative”,
the immediate gut reaction is to infer “someone with a negative result probably doesn’t have the disease”
and even "someone with a positive result probably does have the disease."

But remember that we need to consider the chance that the person given the test had the disease in the first place:
the _prior_ probability of the person having the disease.

I checked out the paper that was used to design Test 2:
they noted that over 90% of individuals who tested positive on Test 1 turned out to, in fact, have the disease.
This is the _positive predictive value_ of Test 1,
and it is also the posterior probability that an individual has the disease,
given that they got a positive result on that test.


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+2"> Disease </font> </th>
      <th > <font size="+2"> No Disease </font> </th>
    </tr>
    <tr style="background-color: rgb(0,50,98); color: white">
        <td > <font size="+2"> Test 1 + </font></td>
      <td> <font size="+2"> 0.9</font> </td>
      <td> <font size="+2"> 0.1 </font> </td>
    </tr>
     <tr style="background-color: rgb(253,181,21);">
      <td><font size="+2"> Test 1 - </font></td>
      <td> <font size="+2"> ? </font></td>
      <td> <font size="+2"> ? </font> </td>
    </tr>
  </tbody>
</table>


Once someone has gotten a positive result on Test 1,
that posterior becomes our new prior probability.

Combining that information with the likelihood table for Test 2,
we have


<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+2"> Disease </font> </th>
      <th > <font size="+2"> No Disease </font> </th>
    </tr>
    <tr style="background-color: rgb(0,50,98); color: white">
        <td > <font size="+2"> Test 2 + </font></td>
      <td> <font size="+2"> 0.98</font> </td>
      <td> <font size="+2"> 0.02 </font> </td>
    </tr>
     <tr style="background-color: rgb(253,181,21);">
      <td><font size="+2"> Test 2 - </font></td>
      <td> <font size="+2"> 0.75 </font></td>
      <td> <font size="+2"> 0.25 </font> </td>
    </tr>
  </tbody>
</table>


This is our final posterior: what we think _after_ seeing the results of Test 1 and Test 2.

Seeing a negative result on the second test does not have nearly the effect that one might expect: it changes 9:1 odds into 3:1 odds. A positive result similarly takes 9:1 odds to 63:1 odds (roughly 50:1 if computed from the rounded values in the table).
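The odds arithmetic can be written out using the odds form of Bayes' Rule: posterior odds equal prior odds times a likelihood ratio. The sketch below uses the approximate sensitivity, specificity, and Test 1 predictive value quoted above. The variable names are illustrative.

```python
from fractions import Fraction

sensitivity = Fraction(7, 10)  # p(Test 2 + | Disease)
specificity = Fraction(9, 10)  # p(Test 2 - | No Disease)
prior_disease = Fraction(9, 10)  # positive predictive value of Test 1

prior_odds = prior_disease / (1 - prior_disease)  # 9:1 in favor of disease

# likelihood ratios: how much each result multiplies the odds of disease
lr_positive = sensitivity / (1 - specificity)
lr_negative = (1 - sensitivity) / specificity

odds_after_positive = prior_odds * lr_positive
odds_after_negative = prior_odds * lr_negative


def odds_to_probability(odds):
    return odds / (1 + odds)


print(odds_after_positive, float(odds_to_probability(odds_after_positive)))
print(odds_after_negative, float(odds_to_probability(odds_after_negative)))
```

A positive result multiplies the odds by 7, while a negative result only divides them by 3. That asymmetry is why the negative Test 2 result leaves disease the more probable explanation.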

Think back to the example with waking up in your room to find it dark:
the sensitivity of this test is high,
since the room is likely to be dark if the sun has gone out,
but that doesn't mean it's right to infer the sun has gone out just because your room is dark!

This example is covered in slightly more detail in
[this blog post](https://charlesfrye.github.io/stats/2018/01/09/hypothesis-test-example.html).

As a note:
the second test was based on machine learning,
while the first test was based on biology.
If there's another takeaway from this,
besides "Be Bayesian About Evidence",
it's: "Take Care Incorporating ML Into Your Decision-Making".

## Last Bit: The Joint Probability Table

The tables we considered were the _conditional probabilities_:

if I assume that some condition is true,
what will the other variables look like?

Instead, we can consider what the chance is of observing any pair
of outcomes, one for each variable,
and so consider the _joint probability_:

<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+2"> $$H_0 \text{ False}$$ </font></th>
        <th > <font size="+2"> $$H_0 \text{ True}$$  </font></th>
    </tr>
    <tr style="background-color: rgb(255,255,255);">
        <td ><font size="+2"> $$+$$ </font></td>
      <td> <font size="+2"> $$p(F, +)$$</font> </td>
      <td> <font size="+2"> $$p(T, +)$$</font> </td>
    </tr>
     <tr style="background-color: rgb(255,255,255);">
         <td ><font size="+2"> $$-$$</font></td>
      <td> <font size="+2"> $$p(F, -)$$ </font></td>
      <td> <font size="+2"> $$p(T, -)$$ </font> </td>
    </tr>
  </tbody>
</table>
</font>

These events have special names for the case of hypothesis testing:

<table class="center">
  <tbody>
    <tr>
      <th class="border-less"></th>
        <th > <font size="+2"> $$H_0 \text{ False}$$ </font></th>
        <th > <font size="+2"> $$H_0 \text{ True}$$  </font></th>
    </tr>
    <tr style="background-color: rgb(255,255,255);">
        <td ><font size="+2"> $$+$$ </font></td>
      <td> <font size="+2"> True Positive </font> </td>
      <td> <font size="+2"> False Positive, Type I Error </font> </td>
    </tr>
     <tr style="background-color: rgb(255,255,255);">
         <td ><font size="+2"> $$-$$</font></td>
      <td> <font size="+2"> False Negative, Type II Error </font></td>
      <td> <font size="+2"> True Negative </font> </td>
    </tr>
  </tbody>
</table>
</font>

Note: these are not the names of the _probabilities_ but the names of the _events_.
The previous tables had names for the probabilities.
The names of the probabilities, the equivalents of "power" and "positive predictive value",
would be "_Chance of_ True Positive", "_Chance of_ Type I Error".
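
To make the relationship between the conditional and joint tables concrete, here is a small sketch. All three input numbers are hypothetical, chosen only for illustration. Each joint probability is a conditional probability multiplied by the probability of its condition.

```python
# Hypothetical inputs, not taken from any example in this lecture
p_h0_true = 0.5  # prior probability that the null hypothesis is true
p_pos_given_true = 0.05  # chance of a positive result when H0 is true
p_pos_given_false = 0.8  # chance of a positive result when H0 is false

# joint probability = conditional probability * probability of the condition
joint = {
    ("F", "+"): p_pos_given_false * (1 - p_h0_true),  # chance of True Positive
    ("T", "+"): p_pos_given_true * p_h0_true,  # chance of False Positive
    ("F", "-"): (1 - p_pos_given_false) * (1 - p_h0_true),  # chance of False Negative
    ("T", "-"): (1 - p_pos_given_true) * p_h0_true,  # chance of True Negative
}

# the four joint probabilities cover every possible outcome, so they sum to 1
total = sum(joint.values())

# conditioning the other way: p(H0 false | +) is one cell divided by its row total
p_positive = joint[("F", "+")] + joint[("T", "+")]
p_false_given_pos = joint[("F", "+")] / p_positive

print(joint, total, p_false_given_pos)
```

Dividing a cell by its column total recovers the conditional probabilities we started with, while dividing by its row total gives the posterior.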

Note: the names "Type I Error" and "Type II Error" should be
[banished from language](https://en.wikipedia.org/wiki/Damnatio_memoriae).

Their literal etymology is as follows:
while writing about errors that can occur during hypothesis testing,
in 1933 Jerzy Neyman and Egon Pearson wrote

> these errors will be of two kinds:
<br>(I) we reject $H_0$ ... when it is true,
<br>(II) we fail to reject $H_0$ when some alternative hypothesis $H_A$ or $H_1$ is true.

Sourced from [Wikipedia](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Etymology),
though the original quote is from their paper _The testing of statistical hypotheses in relation to probabilities a priori_.

This labeling stuck, in part because the field of traditional statistics is so hidebound
and taught in a way that encourages practitioners to treat it as a series of magical incantations and fixed recipes.

These labels exist only to tell the two kinds of errors apart;
they carry no meaning of their own.
They could have written

> these errors will be of two kinds:
<br>(😢) we reject $H_0$ ... when it is true,
<br>(😡) we fail to reject $H_0$ when some alternative hypothesis $H_A$ or $H_1$ is true.

and we'd probably be talking about Type 😡 Errors.

# Bayesian Inference

Rather than

$$
p(\text{hypothesis}\vert \text{test result}) = \frac{p(\text{test result}\vert \text{hypothesis}) p(\text{hypothesis})}{p(\text{test result})}
$$

go back to

$$
p(\text{hypothesis}\vert \text{data}) = \frac{p(\text{data}\vert \text{hypothesis}) p(\text{hypothesis})}{p(\text{data})}
$$

that is, use data more complicated than just a binary test result to determine our posterior beliefs.

Binary hypothesis testing, including its special case of null hypothesis significance testing (NHST),
is a specific form of inferential thinking.

NHST has, for the past century or so,
been the dominant method for inferential thinking in science,
for the essentially historical and technological reasons
outlined last week.

With Bayesian inference, each hypothesis will be a concrete choice for the parameters of our model,
and this will lead to a much simpler approach to understanding how to interpret our results.
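
As a minimal sketch of what "each hypothesis is a choice of parameter" means, consider a coin with three candidate values for its chance of heads. The data (7 heads and 3 tails) and the uniform prior are hypothetical and are used here only to show the formula at work.

```python
hypotheses = [0.25, 0.5, 0.75]  # candidate values of the parameter
prior = [1 / 3, 1 / 3, 1 / 3]  # p(hypothesis), uniform for illustration
heads, tails = 7, 3  # hypothetical data

# p(data | hypothesis) for each candidate parameter value
likelihood = [p ** heads * (1 - p) ** tails for p in hypotheses]

# numerator of Bayes' rule, then p(data) as the sum over all hypotheses
unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
p_data = sum(unnormalized)

# p(hypothesis | data)
posterior = [u / p_data for u in unnormalized]

for hypothesis, probability in zip(hypotheses, posterior):
    print(f"p(heads) = {hypothesis}: posterior probability {probability:.3f}")
```

The models below work the same way, except that the parameters are continuous and the sum over hypotheses is replaced by sampling.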

## Bayesian Inference for Differences in Means: Guinness and Barley

Problem setup:
William Gosset, alias "Student",
is interested in determining which variety of barley produces a higher yield when planted.

In [None]:
barley_A_yield = pd.Series([3, 1, 4, 5, 2])
barley_B_yield = pd.Series([7, 5, 3, 4, 6])

yields = pd.concat([barley_A_yield, barley_B_yield])

In [None]:
barley_df = pd.DataFrame({"yield": yields, "variety": ["A"] * 5 + ["B"] * 5})

Since it's clear that sometimes Variety A produces more,
while sometimes Variety B produces more,
we have to frame the question in terms of some statistic or parameter.

The typical choice is the _mean_.

### First Model

For our first model, let's use some ideas from the $t$-test:

1. Both groups are normally-distributed
2. The two groups have the same standard deviation

The first statement means our likelihood will be `Normal`.

The `Normal` has two parameters, `mu` and `sd`.
The second statement means that `sd` is shared between the groups.

And so our model has three latent, or hidden, variables:
the standard deviation parameter of the likelihood
and the two mean parameters of the likelihood,
one for each group.

### Means: `pm.Normal`

The most objective way to set this prior
would be to look at past data about barley yields,
allowing us to get a sense for what's likely
for these novel barley varieties.

But that's usually not possible:
we're often working with new data,
for which there isn't a large database.
The closest thing we have is
the data we have collected:

In [None]:
barley_A_yield.mean(), barley_B_yield.mean()

In [None]:
np.std([barley_A_yield.mean(), barley_B_yield.mean()], ddof=1)

It's somewhat cheating to use your data in setting your prior.

To remedy this somewhat,
we will just increase the standard deviation by a factor of about 2.

This reduces the impact of our prior on our posterior
by spreading out the distribution.
More widely-spread priors have less impact on posteriors,
as you saw in the lab on parameterized models.

If you're concerned about "double-dipping", you can always just increase the standard deviation further.

We'll see below that this choice of parameters has a fairly modest impact on inference.
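
The arithmetic behind the prior on the means can be sketched with the standard library alone. It assumes the same yields as above: center the prior on the average of the two group means, then roughly double the spread between them.

```python
import statistics

barley_A_yield = [3, 1, 4, 5, 2]
barley_B_yield = [7, 5, 3, 4, 6]

group_means = [statistics.mean(barley_A_yield), statistics.mean(barley_B_yield)]

center = statistics.mean(group_means)  # where to center the prior
spread = statistics.stdev(group_means)  # sample standard deviation, same as ddof=1
widened = 2 * spread  # doubled to soften the "double-dipping"

print(f"center: {center}, spread: {spread:.2f}, widened spread: {widened:.2f}")
```

Rounding the widened spread gives the `mu=4, sd=3` prior used in this section.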

In [None]:
with pm.Model() as barley_model:
    # priors on parameters
    means = pm.Normal("means", mu=4, sd=3, shape=2)

### Standard Deviation: `pm.Exponential`

The `Exponential` distribution was originally introduced back in the second lecture on random variables as the "time in between events in a memoryless process".

A _memoryless process_ is one where events have no influence on each other.

Examples: raindrops, Amazon orders.

Counterexamples: [buses](http://jakevdp.github.io/blog/2018/09/13/waiting-time-paradox/), parliamentary elections, bedtimes. 

The `Exponential` is also a common choice whenever we want to express the belief:

#### This variable is positive, and larger values get less likely fairly quickly

In [None]:
with barley_model:
    # priors on parameters
    pooled_sd = pm.Exponential(r"$\sigma$", lam=0.5)

It has one parameter, `lam` or $\lambda$, which is 1 / mean.
For waiting times, that is 1 / (average time between events),
aka the average rate at which events occur.

In this model, we are saying that the standard deviation is positive,
and that very large values of the standard deviation are very unlikely:
we don't expect, with any appreciable probability,
that a variety will sometimes produce 1 bushel and other times 100.
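
To see what the `Exponential` prior with `lam=0.5` commits us to, we can evaluate it in closed form. The sketch below uses the fact that, for this distribution, the chance of exceeding $x$ is $e^{-\lambda x}$. The helper name `chance_above` is introduced here only for illustration.

```python
import math

lam = 0.5  # same rate parameter as in the model


def chance_above(x):
    """Probability that an Exponential(lam) variable exceeds x."""
    return math.exp(-lam * x)


prior_mean = 1 / lam  # the mean of an Exponential is 1 / lam
p_below_1 = 1 - chance_above(1)
p_above_4 = chance_above(4)
p_above_20 = chance_above(20)

print(f"prior mean of sigma: {prior_mean}")
print(f"p(sigma < 1) = {p_below_1:.3f}, p(sigma > 4) = {p_above_4:.3f}")
print(f"p(sigma > 20) = {p_above_20:.6f}")
```

Moderate values get most of the probability, while a standard deviation above 20 is given only a tiny chance.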

Look up `pm.HalfNormal`, `pm.HalfStudentT`, and `pm.Lognormal` for three more distributions
that express a similar belief.
They are also positive-only:
the first two are "positive-only" versions of the `Normal` and the `StudentT` distributions.

The `HalfNormal` most strongly discounts large values,
while the `HalfStudentT` is somewhere in between `HalfNormal` and `Exponential`,
depending on its parameter.

The distribution `pm.Lognormal` says that we can guess the order of magnitude
of the variable, plus or minus some spread.
This can be a very weak prior if the spread is large.

### Finishing the Model with a Likelihood

In [None]:
with barley_model:
    # likelihood to relate parameters to data
    varieties = pd.Series(barley_df["variety"] == "B", dtype=int)
    yields = pm.Normal("yields", mu=means[varieties], sd=pooled_sd,
                       observed=barley_df["yield"])
    delta_means = pm.Deterministic("$\mu_1 - \mu_0$", means[1] - means[0])

### Now, let's take a look at our prior by sampling with `sample_prior_predictive`

In [None]:
barley_model_prior_samples = shared_util.samples_to_dataframe(pm.sample_prior_predictive(
    model=barley_model, samples=5000))

In [None]:
shared_scales = True
f, axs = plt.subplots(nrows=3, figsize=(12, 12), sharex=shared_scales, sharey=shared_scales)
sns.distplot(barley_model_prior_samples[r"$\sigma$"], ax=axs[0]);
sns.distplot(barley_model_prior_samples["means"].apply(lambda xs: xs[0]), ax=axs[1], axlabel=r"$\mu_0$");
sns.distplot(barley_model_prior_samples["$\mu_1 - \mu_0$"], ax=axs[2]);
plt.tight_layout();

Note that we didn't _explicitly_ specify a prior on the difference in means:
our prior on the value of "$\mu_1 - \mu_0$" is a consequence of our other priors.

This is something like the sampling distribution of the "difference in means"
statistic under the prior.

If we observe a value of this variable that is very unlikely under our prior distribution,
that suggests our prior might be wrong,
just as observing a value of a statistic that is very unlikely under
the sampling distribution of the null hypothesis (a low $p$ value)
suggests that the null hypothesis might be wrong.
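
The implied prior on the difference can be checked without pyMC. This sketch draws both means from the `Normal` prior with `mu=4` and `sd=3` using the standard library and looks at their difference. The difference of two independent Normals is also Normal, with standard deviation $3\sqrt{2}$.

```python
import math
import random
import statistics

random.seed(0)
n_draws = 20000

deltas = []
for _ in range(n_draws):
    mu_0 = random.gauss(4, 3)  # prior draw for the mean of Variety A
    mu_1 = random.gauss(4, 3)  # prior draw for the mean of Variety B
    deltas.append(mu_1 - mu_0)

implied_sd = statistics.stdev(deltas)
theoretical_sd = 3 * math.sqrt(2)
frac_positive = sum(d > 0 for d in deltas) / n_draws

print(f"sd of sampled differences: {implied_sd:.2f}, theory: {theoretical_sd:.2f}")
print(f"fraction of draws with mu_1 > mu_0: {frac_positive:.3f}")
```

The prior is centered on no difference and gives each variety about an even chance of having the higher mean.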

### And then look at the posterior given the data with `sample`

In [None]:
barley_model_trace = shared_util.sample_from(barley_model)
barley_model_samples = shared_util.samples_to_dataframe(barley_model_trace)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(barley_model_samples["$\sigma$"], color="C2");

First, the posterior for the standard deviation.

It's somewhat hard to interpret without comparing to the prior.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(barley_model_prior_samples["$\sigma$"], color="C0", label="prior");
sns.distplot(barley_model_samples["$\sigma$"], color="C2", label="posterior");
plt.legend();

Our posterior is much tighter than our prior:
where before, we thought there was about a 50% chance
that the standard deviation was below 1 or above 4,
we now put a vanishingly small chance on that being true.

In [None]:
(barley_model_prior_samples["$\sigma$"] < 1).mean() + (barley_model_prior_samples["$\sigma$"] > 4).mean()

In [None]:
(barley_model_samples["$\sigma$"] < 1).mean() + (barley_model_samples["$\sigma$"] > 4).mean()

Note: determining something like this while running a $t$-test would have required
the elaboration of _another_ statistical test,
likely with additional assumptions.

But we were more interested in the difference of means.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))

sns.distplot(barley_model_prior_samples["$\mu_1 - \mu_0$"], color="C0", label="prior");
sns.distplot(barley_model_samples["$\mu_1 - \mu_0$"], color="C2", label="posterior");
plt.legend();

Again,
even though we only observed a relatively small amount of data,
it's enough to move our beliefs a long way from the prior,
since the prior reflected a state of extreme ignorance.

If we'd like to infer whether the
mean of Variety B is higher,
we just need to check what the probability
of that claim is, under the posterior.

In [None]:
(barley_model_samples["$\mu_1 - \mu_0$"] > 0).mean()

Note that the resulting value is dramatically different from
what we had in the prior:
about 50-50 odds.

This experiment was very informative,
even if it wasn't definitive.

In [None]:
(barley_model_prior_samples["$\mu_1 - \mu_0$"] > 0).mean()

Notice what is being done here,
along with what was being done above:
we are checking whether the inference we wanted to draw
was true on each sample,
and then calculating the fraction of samples on which it was true.

This can be generalized to all kinds of different inferences,
without any need to do more than change what we calculate on our samples.

In [None]:
def compute_posterior_p(posterior_samples, check_inference):
    inference_true_booleans = []
    for _, sample in posterior_samples.iterrows():
        inference_true_booleans.append(check_inference(sample))
    return pd.Series(inference_true_booleans).mean()

In [None]:
# the inference we were interested in
def mu1_greater(sample):
    return sample["$\mu_1 - \mu_0$"] > 0

# what's the chance that these varieties have a low value of sigma
def sigma_under_3(sample):  
    return sample["$\sigma$"] < 3

# a wacky inference, but one we can ask
def mu0_less_than_sigma(sample):
    return sample["$\sigma$"] > sample["means"][0]

print(compute_posterior_p(barley_model_samples, mu1_greater),
      compute_posterior_p(barley_model_samples, sigma_under_3),
      compute_posterior_p(barley_model_samples, mu0_less_than_sigma))
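
The same fraction-of-samples idea can also be written with vectorized comparisons, which is usually faster than looping over rows. The sketch below uses hypothetical stand-in arrays (`delta` and `sigma`, drawn from arbitrary distributions) rather than the real posterior samples, so the numbers it prints are not results for the barley model.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for columns of posterior samples
delta = rng.normal(2.0, 1.0, size=4000)
sigma = rng.gamma(shape=9.0, scale=0.2, size=4000)

# a boolean array averages to the fraction of samples where the claim holds
p_mu1_greater = (delta > 0).mean()
p_sigma_under_3 = (sigma < 3).mean()

# claims can be combined before averaging
p_both = ((delta > 0) & (sigma < 3)).mean()

# the explicit loop gives the same answer as the vectorized version
looped = sum(1 for d in delta if d > 0) / len(delta)

print(p_mu1_greater, p_sigma_under_3, p_both, looped)
```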

### Credible Intervals: Confidence Intervals for Bayesians

The Confidence Interval was intended to give an estimate of what values of a variable were plausible or likely.

But remember, that's not what a Confidence Interval really is:
it is merely an interval produced by a procedure that,
on 95% of repeated samples, covers the true value.

Credible Intervals are the Bayesian equivalent of Confidence Intervals.

A **Bayesian Credible Interval** is _any_ interval that contains 95% of the posterior probability.

#### Highest Posterior Density Intervals

The _Highest Posterior Density Interval_ is the shortest credible interval at a given level.

It is computed with `pm.stats.hpd`.

In [None]:
pm.stats.hpd(barley_model_samples["$\mu_1 - \mu_0$"])
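
To show the idea behind "shortest interval", here is a minimal sketch that finds it directly from samples. The function name `shortest_interval` is hypothetical, and this is not a description of how pyMC computes its interval. The approach is to sort the samples, slide a window that holds 95% of them, and keep the narrowest window. The samples are drawn from an arbitrary right-skewed distribution, not from the barley model.

```python
import numpy as np


def shortest_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (assumes one mode)."""
    sorted_samples = np.sort(np.asarray(samples))
    n = len(sorted_samples)
    n_inside = int(np.ceil(mass * n))

    # each window of n_inside consecutive sorted samples is a candidate interval
    lows = sorted_samples[: n - n_inside + 1]
    highs = sorted_samples[n_inside - 1:]

    best = np.argmin(highs - lows)  # index of the narrowest window
    return lows[best], highs[best]


rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=10000)  # hypothetical skewed samples

low, high = shortest_interval(skewed)
coverage = ((skewed >= low) & (skewed <= high)).mean()
print(f"interval: ({low:.3f}, {high:.3f}), coverage: {coverage:.3f}")
```

For a skewed distribution like this one, the shortest interval hugs the side where the density is highest.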

#### `plot_posterior`

Given the output of `pm.sample` or `shared_util.sample_from`
(not a `DataFrame`, aka the output of `shared_util.samples_to_dataframe`),
pyMC can make a convenient plot of the posterior and the Highest Posterior Density Interval.

In [None]:
pm.plot_posterior(barley_model_trace, varnames=["$\mu_1 - \mu_0$"], figsize=(12, 6), text_size=24,
                  color="C2");

This is a histogram, just like in `sns.distplot`.

The black bar covers, by default, the 95% HPD.
The endpoints are indicated by hovering text,
as is the mean.

#### Quantiles: Equal Tail Intervals

The _Equal Tail Interval_ is the credible interval with equal total probability
above it and below it.

In [None]:
pm.stats.quantiles(barley_model_samples["$\mu_1 - \mu_0$"])

Notice that the Equal Tail Interval covering 95% of the posterior
is not the same as the Highest Posterior Density Interval.
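
The reason the two can differ is easiest to see on a skewed distribution. This sketch uses hypothetical right-skewed samples (not the barley posterior) and compares the equal tail interval with another interval that also holds 95% of the samples but puts all of the excluded 5% in the upper tail.

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=10000)  # hypothetical skewed samples

# equal tail: 2.5% of the samples below, 2.5% above
equal_tail = np.percentile(skewed, [2.5, 97.5])

# another 95% interval: nothing excluded below, 5% excluded above
lower_only = np.percentile(skewed, [0, 95])

width_equal_tail = equal_tail[1] - equal_tail[0]
width_lower_only = lower_only[1] - lower_only[0]

print(f"equal tail: {equal_tail}, width {width_equal_tail:.2f}")
print(f"lower-only: {lower_only}, width {width_lower_only:.2f}")
```

Both are valid credible intervals, but the equal tail one is noticeably wider here. For a symmetric posterior, the two kinds of interval would be close to each other.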

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(barley_model_samples["$\mu_1 - \mu_0$"], width=0.2, linewidth=4);

We display quantile information using a _box plot_,
aka a _box-and-whisker plot_
accessible with `sns.boxplot`.

The middle half of the data is indicated with the box:
its left edge is at the 25th percentile
and its right edge is at the 75th percentile.
The width of this box is called the "interquartile range".
The median is indicated with a bar through the box.

The "whiskers" extend to cover all data points up to a maximum length equal to some number
times the width of the box in the middle.
The keyword argument in seaborn is `whis` and the default value is `1.5`,
which is standard.

Any points outside of this range are plotted individually.
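
The pieces of a box plot can be computed by hand. This sketch uses a small hypothetical dataset and `np.percentile` with its default interpolation, so the exact quartiles may differ slightly from other conventions.

```python
import numpy as np

# hypothetical data with one value far from the rest
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the width of the box

whis = 1.5  # same meaning as the seaborn keyword argument
lower_limit = q1 - whis * iqr
upper_limit = q3 + whis * iqr

# whiskers reach the most extreme data points that are still within the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# everything else is drawn as an individual point
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"box: ({q1}, {q3}), median: {median}")
print(f"whiskers: ({whisker_low}, {whisker_high}), outliers: {outliers}")
```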

### Comparing the Prior and Posterior with `boxplot`

The cell below combines the samples from the posterior with the samples from the prior into a single dataframe.

In [None]:
posterior_prior_comparison_df = pd.concat([barley_model_samples, barley_model_prior_samples])

posterior_prior_comparison_df["distribution"] = \
    ["posterior"] * len(barley_model_samples) + ["prior"] * len(barley_model_prior_samples)

In [None]:
posterior_prior_comparison_df.sample(10)

Side note: you might notice a column $\sigma$`_log__`. Internally, pyMC works with logarithms for positive-only variables.

The additional `distribution` column identifies whether a given sample came from the `prior` or the `posterior`.

We can use this column for `groupby` operations:

In [None]:
posterior_prior_comparison_df.groupby("distribution")["$\sigma$"].mean()

And for hooking into seaborn.

Many seaborn plotting functions, including `boxplot`,
can use columns of the dataframe to split up the data and automatically produce
the same visualization for multiple subsets of the data.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(x="$\mu_1 - \mu_0$", data=posterior_prior_comparison_df,
            y="distribution", hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.legend([], frameon=False);

For `boxplot`, the `y` argument determines which variable sets the vertical position of the boxes,
while the `hue` argument determines which variable sets their color.

If you learn to use these features of seaborn,
you can make very rich and informative plots in just a few lines!

### A Model with Weaker Priors: `agnostic`

Sometimes, we want to bring even less prior information to bear on our modeling problem.

Our previous model very strongly discounted the possibility that the mean number of bushels
would be in the hundreds or the hundreds of thousands.

But perhaps that was too strong of an assumption?

There are several "go-to" choices of prior that are common when trying to make as few assumptions as possible.

### `pm.HalfCauchy` and `pm.Cauchy`

These two distributions have very "long tails":
the chance of producing a value very far away from their center is relatively small,
but substantially higher than for the `Exponential` or `Normal` distributions.

They are used when we want to say that even extremely large values aren't too surprising.

The `HalfCauchy` is the positive-only version of the `Cauchy`,
just as the `HalfNormal`, the `HalfStudentT`, and the `HalfFlat`
are positive-only versions of the `Normal`, the `StudentT`, and the `Flat`.

These distributions are so broad that sampling from them is difficult,
so instead of showing what they look like by drawing samples,
the code below plots their distribution functions directly.

In [None]:
import theano.tensor as tt

def make_probability(distribution, **params):
    """Constructs a function that evaluates the exponential of
    distribution's logp function for a given set of parameters,
    provided as kwargs.
    
    For continuous distributions, this is a probability density.
    For discrete distributions, this is a probability mass.
    """
    logp = distribution.dist(**params).logp
    
    def probability(vals):
        return np.exp(logp(shared_util.to_pymc(vals)).eval())
    
    return probability

In [None]:
half_cauchy_probability = make_probability(pm.HalfCauchy, beta=100)

In [None]:
exponential_probability = make_probability(pm.Exponential, lam=0.01)

In [None]:
sigmas = np.logspace(-5, 5, num=1000)
half_cauchy_ps = half_cauchy_probability(sigmas)
exponential_ps = exponential_probability(sigmas)

In [None]:
plt.plot(sigmas, exponential_ps, lw=4); plt.xlim([100, 1000]);

plt.plot(sigmas, half_cauchy_ps, lw=4); plt.xlim([100, 1000]);

As you can see, over most of this range the `HalfCauchy` sits only slightly above the `Exponential`.

This difference is much easier to see if we log-transform the probabilities:

In [None]:
plt.semilogy(sigmas, exponential_ps, lw=4); plt.xlim([0, 100000]);

plt.semilogy(sigmas, half_cauchy_ps, lw=4); plt.xlim([0, 100000]);

The probabilities are exponentially decreasing for the `Exponential` distribution,
as indicated by the fact that the log probabilities are decreasing in a straight line.

The probabilities are decreasing much more slowly than exponentially for the `HalfCauchy` distribution:
even though they are small, they are not dropping nearly as low as for the `Exponential`.
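
The same comparison can be made numerically from the textbook density formulas, using the parameters from the plots (`lam=0.01` and `beta=100`). The function names below are introduced only for this sketch.

```python
import math


def exponential_density(x, lam):
    """Density of the Exponential distribution at x >= 0."""
    return lam * math.exp(-lam * x)


def half_cauchy_density(x, beta):
    """Density of the HalfCauchy distribution at x >= 0."""
    return 2 / (math.pi * beta * (1 + (x / beta) ** 2))


densities = {}
for x in [100, 1000, 10000]:
    densities[x] = (exponential_density(x, lam=0.01),
                    half_cauchy_density(x, beta=100))
    print(f"x = {x:>5}: Exponential {densities[x][0]:.3e}, "
          f"HalfCauchy {densities[x][1]:.3e}")
```

Multiplying `x` by 10 divides the `HalfCauchy` density by roughly 100, while the `Exponential` density collapses by many orders of magnitude.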

The difference is much easier to see
if we just look at a `rugplot` of the samples.

In [None]:
with pm.Model() as barley_model_agnostic:
    pooled_sd = pm.HalfCauchy(r"$\sigma$", beta=10)
    means = pm.Cauchy("means",
                      alpha=4,  # center
                      beta=1,  # spread
                      shape=2)
    
    varieties = pd.Series(barley_df["variety"] == "B", dtype=int)
    yields = pm.Normal("yields", mu=means[varieties], sd=pooled_sd,
                       observed=barley_df["yield"])
    delta_means = pm.Deterministic("$\mu_1 - \mu_0$", means[1] - means[0])

In [None]:
barley_model_agnostic_prior_samples = shared_util.samples_to_dataframe(pm.sample_prior_predictive(
    model=barley_model_agnostic, samples=5000))

In [None]:
shared_scales = True
f, axs = plt.subplots(nrows=3, ncols=2, figsize=(12, 12), sharex=shared_scales, sharey=shared_scales)

sns.distplot(barley_model_prior_samples[r"$\sigma$"],
             ax=axs[0, 0], rug=True);
sns.distplot(barley_model_prior_samples["means"].apply(lambda xs: xs[0]),
             ax=axs[1, 0], rug=True, axlabel=r"$\mu_0$");
sns.distplot(barley_model_prior_samples["$\mu_1 - \mu_0$"],
             ax=axs[2, 0], rug=True);

sns.distplot(barley_model_agnostic_prior_samples[r"$\sigma$"],
             ax=axs[0, 1], kde=False, rug=True, norm_hist=True, bins=1000, color="C1");
sns.distplot(barley_model_agnostic_prior_samples["means"].apply(lambda xs: xs[0]),
             ax=axs[1, 1], kde=False, rug=True, norm_hist=True, bins=1000, color="C1", axlabel=r"$\mu_0$");
sns.distplot(barley_model_agnostic_prior_samples["$\mu_1 - \mu_0$"],
             ax=axs[2, 1], kde=False, rug=True, norm_hist=True, bins=1000, color="C1");
axs[0, 0].set_title("Model A:\nExponential-Normal Prior")
axs[0, 1].set_title("Model B:\nHalfCauchy-Cauchy Prior")
axs[-1, -1].set_xlim([-100, 100]); plt.tight_layout();

While the draws from the original model are fairly tightly concentrated in a narrow region near 0,
the draws from the agnostic model, with the `HalfCauchy` and `Cauchy` prior,
are much more broadly distributed.

Intuitively, this model is much less opinionated than the other about the data.

Of course, the chance of a variety of barley having a yield
that is an order of magnitude or more higher than all the others is quite small,
and so the `Exponential`-`Normal` model is very reasonable.

In [None]:
shared_scales = True
f, axs = plt.subplots(nrows=3, figsize=(12, 12), sharex=shared_scales, sharey=shared_scales)

sns.distplot(barley_model_agnostic_prior_samples[r"$\sigma$"],
             ax=axs[0], kde=False, norm_hist=True, bins=1000, color="C1");
sns.distplot(barley_model_agnostic_prior_samples["means"].apply(lambda xs: xs[0]),
             ax=axs[1], kde=False, norm_hist=True, bins=1000, color="C1", axlabel=r"$\mu_0$");
sns.distplot(barley_model_agnostic_prior_samples["$\mu_1 - \mu_0$"],
             ax=axs[2], kde=False, norm_hist=True, bins=1000, color="C1");
axs[-1].set_xlim([-100, 100]); plt.tight_layout();

These plots don't quite do these distributions justice.

Once again, we draw some samples,
package them into a dataframe, and visualize the posterior
with a box-and-whisker plot.

In [None]:
barley_model_agnostic_trace = shared_util.sample_from(barley_model_agnostic)
barley_model_agnostic_samples = shared_util.samples_to_dataframe(barley_model_agnostic_trace)

In [None]:
agnostic_posterior_prior_comparison_df = pd.concat(
    [barley_model_agnostic_samples, barley_model_agnostic_prior_samples])

agnostic_posterior_prior_comparison_df["distribution"] = \
    ["posterior"] * len(barley_model_agnostic_samples) + ["prior"] * len(barley_model_agnostic_prior_samples)

In [None]:
agnostic_posterior_prior_comparison_df.sample(10)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(x="$\mu_1 - \mu_0$", y="distribution", data=agnostic_posterior_prior_comparison_df, hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.legend([], frameon=False);
ax.set_xlim([-20, 20]);

If we combine the samples from both of our models into a single dataframe,
including samples from the prior and the posterior for both,
we can plot the original and updated beliefs for both models.

In [None]:
model_comparison_df = pd.concat([posterior_prior_comparison_df, agnostic_posterior_prior_comparison_df])
model_comparison_df["model"] = \
    ["original"] * len(posterior_prior_comparison_df) + ["agnostic"] * len(agnostic_posterior_prior_comparison_df)

In [None]:
model_comparison_df.sample(10)

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(x="$\mu_1 - \mu_0$", y="model", data=model_comparison_df, hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.set_xlim([-20, 20]);

This plot separates out the original and the agnostic model on the y-axis
and then uses color to indicate which distribution the samples are drawn from.

Directly from this plot,
we can see that though the centers of the two priors are similar,
the prior of the agnostic model is much more widely distributed.

We can also see that the difference in posteriors is much smaller.
At least on the scale of the priors,
the centers are fairly close,
and the widths are about the same.

### A Model with the Weakest Priors: `improper`

What if we wanted to make no assumption about what values of the standard deviation or the mean were more or less likely?

### `pm.Flat` and `pm.HalfFlat`

pyMC provides access to two distributions that aren't really probability distributions at all.

In [None]:
flat_probability = make_probability(pm.Flat)

In [None]:
half_flat_probability = make_probability(pm.HalfFlat)

In [None]:
xs = np.arange(-10, 10)
plt.step(xs, half_flat_probability(xs), lw=4);

In [None]:
xs = np.arange(-10, 10)
plt.step(xs, flat_probability(xs), lw=4);
plt.ylim([0, 1.1]);

Because the values are the same everywhere
(except where they are 0, in the case of `HalfFlat`),
they have no effect on the posterior
except to say that some values are impossible.

Because they aren't probability distributions
(their total area is infinite, so they can't be normalized to 1),
they can't be sampled from with `sample_prior_predictive`.

But they still result in a valid posterior,
so we draw samples with `pm.sample`.
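
To see why a flat prior still gives a sensible posterior, here is a grid sketch for the mean of Variety B. It is simplified on purpose: the standard deviation is fixed at an assumed value of 1.5 rather than inferred, and the helper names are hypothetical. With a flat prior the posterior is just the normalized likelihood, so its mean lands on the sample mean. The `Normal` prior with `mu=4` and `sd=3` pulls it slightly toward 4.

```python
import math

data = [7, 5, 3, 4, 6]  # Variety B yields
sigma = 1.5  # assumed known here, purely for illustration
grid = [i / 100 for i in range(1001)]  # candidate means from 0 to 10


def likelihood(mu):
    """Normal likelihood of the data for a candidate mean, up to a constant."""
    return math.exp(sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data))


def posterior_mean(prior_weight):
    """Grid-approximate posterior mean for an unnormalized prior on the mean."""
    weights = [likelihood(mu) * prior_weight(mu) for mu in grid]
    total = sum(weights)
    return sum(mu * w for mu, w in zip(grid, weights)) / total


flat_mean = posterior_mean(lambda mu: 1.0)  # flat prior: same weight everywhere
normal_mean = posterior_mean(lambda mu: math.exp(-0.5 * ((mu - 4) / 3) ** 2))
sample_mean = sum(data) / len(data)

print(f"sample mean: {sample_mean}, flat prior: {flat_mean:.3f}, "
      f"Normal prior: {normal_mean:.3f}")
```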

In [None]:
with pm.Model() as barley_model_improper:
    pooled_sd = pm.HalfFlat(r"$\sigma$")
    means = pm.Flat("means", shape=2)
    
    varieties = pd.Series(barley_df["variety"] == "B", dtype=int)
    yields = pm.Normal("yields", mu=means[varieties], sd=pooled_sd,
                       observed=barley_df["yield"])
    delta_means = pm.Deterministic("$\mu_1 - \mu_0$", means[1] - means[0])

In [None]:
barley_model_improper_trace = shared_util.sample_from(barley_model_improper)
barley_model_improper_samples = shared_util.samples_to_dataframe(barley_model_improper_trace)

In [None]:
barley_model_improper_samples["model"] = "improper"
barley_model_improper_samples["distribution"] = "posterior"

In [None]:
model_comparison_df = pd.concat([model_comparison_df, barley_model_improper_samples])

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(x="$\mu_1 - \mu_0$", y="model", data=model_comparison_df, hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.set_xlim([-20, 20]);

## What would you do?

My opinion: the evidence is somewhat ambiguous.
If this is a low-downside decision,
plant Variety B and then reassess after the next harvest.

If this is a high-downside decision,
either repeat the experiment,
using some combination of these posteriors as a prior,
or maybe plant some of each!