<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Bayesian Inference 02 - Modeling Differences of Means

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

import daft
from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

In [None]:
import theano.tensor as tt

def make_probability(distribution, **params):
    """Constructs a function that evaluates the exponential of
    distribution's logp function for a given set of parameters,
    provided as kwargs.
    
    For continuous distributions, this is a probability density.
    For discrete distributions, this is a probability mass.
    """
    logp = distribution.dist(**params).logp
    
    def probability(vals):
        return np.exp(logp(shared_util.to_pymc(vals)).eval())
    
    return probability

## Bayesian Inference

Rather than

$$
p(\text{hypothesis}\vert \text{test result}) = \frac{p(\text{test result}\vert \text{hypothesis}) p(\text{hypothesis})}{p(\text{test result})}
$$

go back to

$$
p(\text{hypothesis}\vert \text{data}) = \frac{p(\text{data}\vert \text{hypothesis}) p(\text{hypothesis})}{p(\text{data})}
$$

that is, use data more complicated than just a binary test result to determine our posterior beliefs.

Binary hypothesis testing, and its special case of null hypothesis significance testing,
are a specific form of inferential thinking.

NHST has, for the past century or so,
been the dominant method for inferential thinking in science,
for the essentially historical and technological reasons
outlined last week.

With Bayesian inference, each hypothesis will be a concrete choice for the parameters of our model,
and this will lead to a much simpler approach to understanding how to interpret our results.
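
As a rough sketch of what "each hypothesis is a concrete choice for the parameters" means,
the cell below applies Bayes' rule on a grid of candidate means.
The data are the Variety A yields used later in this notebook;
the fixed `sigma = 1.5`, the grid spacing, and the uniform prior are assumptions made only for this illustration.

```python
import math

# Yields for Variety A, copied from the barley example in this notebook.
data = [3, 1, 4, 5, 2]

# Each hypothesis is one concrete value for the mean.
# sigma is held fixed purely for illustration (an assumption, not inferred here).
hypotheses = [h / 10 for h in range(0, 81)]  # candidate means 0.0, 0.1, ..., 8.0
sigma = 1.5


def normal_pdf(x, mu, sd):
    """Density of a Normal(mu, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def likelihood(mu):
    """p(data | hypothesis): product of Normal densities over the data."""
    return math.prod(normal_pdf(y, mu, sigma) for y in data)


prior = [1 / len(hypotheses)] * len(hypotheses)            # p(hypothesis), uniform
unnormalized = [likelihood(h) * p for h, p in zip(hypotheses, prior)]
evidence = sum(unnormalized)                               # p(data)
posterior = [u / evidence for u in unnormalized]           # p(hypothesis | data)

best = hypotheses[posterior.index(max(posterior))]
print(f"most probable mean on the grid: {best}")
```

Grids like this stop being practical once there are more than a couple of parameters,
which is one reason to hand the job to a sampler.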

## Bayesian Inference for Differences in Means: Guinness and Barley

Problem setup:
William Gosset, alias "Student",
is interested in determining which variety of barley produces a higher yield when planted.

In [None]:
barley_A_yield = pd.Series([3, 1, 4, 5, 2])
barley_B_yield = pd.Series([7, 5, 3, 4, 6])

yields = pd.concat([barley_A_yield, barley_B_yield])

In [None]:
barley_df = pd.DataFrame({"yield": yields, "variety": ["A"] * 5 + ["B"] * 5})

Since it's clear that sometimes Variety A produces more,
while sometimes Variety B produces more,
we have to frame the question in terms of some statistic or parameter.

The typical choice is the _mean_.

### First model: `Normal` and `Exponential`

Setting the priors for a model is an unsolved problem.

It's something of an art to transform qualitative knowledge
into quantitative statements about probabilities.

For example, we run into numerical and mathematical issues when we specify our priors badly.

See [this GitHub wiki page](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations)
by the authors of Stan, another MCMC library, for some generic tips.
These are world experts on Bayesian inference, and note how informal and loose these recommendations are.
This is an unsolved problem, and the presence of these subjective unsolved problems in Bayesian inference
workflows is a major impediment to its adoption,
since often the goal of quantitative methods is to eliminate subjectivity.

For our first model, let's use some ideas from the $t$-test:

1. Both groups are normally-distributed
2. The two groups have the same standard deviation

The first statement means our likelihood will be `Normal`.

The `Normal` has two parameters, `mu` and `sd`.
The second statement means that `sd` is shared between the groups.

And so our model has three latent, or hidden, variables:
the standard deviation parameter of the likelihood
and the two mean parameters of the likelihood,
one for each group.
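
For reference, here is a minimal sketch of the statistic a classical pooled-variance $t$-test
would compute under those same two assumptions.
The helper name `pooled_t_statistic` is made up for this example;
the data are the barley yields from above.

```python
import math
import statistics

barley_A = [3, 1, 4, 5, 2]
barley_B = [7, 5, 3, 4, 6]


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming a shared (pooled) standard deviation."""
    n_a, n_b = len(a), len(b)
    # statistics.variance uses the n - 1 denominator
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(b) - statistics.mean(a)) / standard_error


t_stat = pooled_t_statistic(barley_A, barley_B)
degrees_of_freedom = len(barley_A) + len(barley_B) - 2
print(t_stat, degrees_of_freedom)
```

The Bayesian model keeps the same assumptions but treats the two means and the shared standard deviation
as unknowns with their own distributions, rather than collapsing everything into one statistic.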

### Means: `pm.Normal`

The most objective way to set this prior
would be to look at past data about barley yields,
allowing us to get a sense for what's likely
for these novel barley varieties.

But that's usually not possible:
we're often working with new data,
for which there isn't a large database.
The closest thing we have is
the data we have collected:

In [None]:
barley_A_yield.mean(), barley_B_yield.mean()

In [None]:
np.std([barley_A_yield.mean(), barley_B_yield.mean()], ddof=1)

It's somewhat cheating to use your data in setting your prior.

To remedy this somewhat,
we will just increase the standard deviation by a factor of about 2.

This reduces the impact of our prior on our posterior
by spreading out the distribution.
More widely-spread priors have less impact on posteriors,
as you saw in the lab on parameterized models.

If you're concerned about "double-dipping", you can always just increase the standard deviation further.

We'll see below that this choice of parameters has a fairly modest impact on inference.
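
To get a feel for why wider priors matter less, here is a small sketch using the textbook
conjugate Normal-Normal update for a single mean.
It treats the noise standard deviation as known (`noise_sd = 1.5`),
a simplification that the model in this notebook does not make,
and the function name is hypothetical.

```python
def posterior_mean_known_sd(data, prior_mu, prior_sd, noise_sd):
    """Posterior mean of a Normal mean under a Normal prior, noise sd known."""
    n = len(data)
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / noise_sd ** 2
    sample_mean = sum(data) / n
    # precision-weighted average of the prior mean and the sample mean
    return ((prior_precision * prior_mu + data_precision * sample_mean)
            / (prior_precision + data_precision))


barley_B = [7, 5, 3, 4, 6]  # sample mean is 5
prior_sds = (1.5, 3, 30)
posterior_means = [
    posterior_mean_known_sd(barley_B, prior_mu=4, prior_sd=sd, noise_sd=1.5)
    for sd in prior_sds
]
for sd, mean in zip(prior_sds, posterior_means):
    print(f"prior sd {sd:>4}: posterior mean {mean:.3f}")
```

As the prior standard deviation grows, the posterior mean moves away from the prior center of 4
and towards the sample mean of 5.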

In [None]:
with pm.Model() as barley_model:
    # priors on parameters
    means = pm.Normal("means", mu=4, sd=3, shape=2)

### Standard Deviation: `pm.Exponential`

The `Exponential` was originally introduced back in the second lecture on random variables as the "time in between events in a memoryless process".

A _memoryless process_ is one where events have no influence on each other.

Examples: raindrops, Amazon orders.

Counterexamples: [buses](http://jakevdp.github.io/blog/2018/09/13/waiting-time-paradox/), parliamentary elections, bedtimes. 

The `Exponential` is also a common choice whenever we want to express the belief:

#### This variable is positive, and larger values get less likely fairly quickly

In [None]:
with barley_model:
    # priors on parameters
    pooled_sd = pm.Exponential(r"$\sigma$", lam=0.5)

It has one parameter, `lam` or $\lambda$, which is 1 / mean:
the reciprocal of the average time between events,
aka the average rate at which events occur.

In this model, we are saying that the standard deviation is positive,
and that very large values of the standard deviation are very unlikely:
we don't expect a single variety to produce 1 bushel on some plantings
and 100 bushels on others.
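
A quick way to check the relationship between the rate and the mean is to draw from an exponential directly.
The sketch below uses the standard library's `random.expovariate`, not pyMC,
with the same rate of 0.5 used for the prior above; the seed and number of draws are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(0)
lam = 0.5  # same rate as the prior on the standard deviation

draws = [rng.expovariate(lam) for _ in range(100_000)]

sample_mean = statistics.fmean(draws)                     # should be close to 1 / lam
frac_above_10 = sum(d > 10 for d in draws) / len(draws)   # empirical tail probability
exact_above_10 = math.exp(-lam * 10)                      # exact tail: exp(-lam * x)

print(f"sample mean: {sample_mean:.3f} (1 / lam = {1 / lam})")
print(f"P(sigma > 10): sampled {frac_above_10:.4f}, exact {exact_above_10:.4f}")
```

All draws are positive, and values above 10 are already rare under this prior.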

Look up `pm.HalfNormal`, `pm.HalfStudentT`, and `pm.Lognormal` for three more distributions
that express a similar belief.
They are also positive-only:
the first two are "positive-only" versions of the `Normal` and the `StudentT` distribution.

The `HalfNormal` most strongly discounts large values,
while how strongly the `HalfStudentT` discounts them depends on its degrees-of-freedom parameter `nu`:
for large `nu` it behaves like the `HalfNormal`,
and for small `nu` its tails are heavier than those of the `Exponential`.

The distribution `pm.Lognormal` says that we can guess the order of magnitude
of the variable, plus or minus some spread.
This can be a very weak prior if the spread is large.
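
To make "order of magnitude, plus or minus some spread" concrete,
the sketch below computes the central 95% interval of a lognormal variable,
i.e. one whose logarithm is Normal.
The median of 2 and the spreads are arbitrary illustration values,
and the function is a hypothetical helper, not part of pyMC.

```python
import math
from statistics import NormalDist


def lognormal_central_interval(median, sigma, mass=0.95):
    """Central interval of a variable whose log is Normal(log(median), sigma)."""
    z = NormalDist().inv_cdf(0.5 + mass / 2)  # about 1.96 for mass=0.95
    return median * math.exp(-z * sigma), median * math.exp(z * sigma)


intervals = {}
for sigma in (0.5, 1.0, 2.0):
    intervals[sigma] = lognormal_central_interval(median=2.0, sigma=sigma)
    low, high = intervals[sigma]
    print(f"sigma={sigma}: 95% of the mass lies between {low:.3g} and {high:.3g}")
```

The interval is symmetric on a multiplicative scale:
the median is divided and multiplied by the same factor,
and that factor grows quickly with the spread.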

### Finishing the model with a likelihood

In [None]:
with barley_model:
    # likelihood to relate parameters to data
    varieties = pd.Series(barley_df["variety"] == "B", dtype=int)
    yields = pm.Normal("yields", mu=means[varieties], sd=pooled_sd,
                       observed=barley_df["yield"])
    delta_means = pm.Deterministic("$\mu_1 - \mu_0$", means[1] - means[0])
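
Putting the three cells together, the model defined above can be summarized as follows.
The second argument of each `Normal` is a standard deviation, matching the `sd` keyword,
and $g[i]$ is notation introduced here for the group index stored in `varieties`.

$$
\begin{aligned}
\mu_0, \mu_1 &\sim \text{Normal}(4, 3) \\
\sigma &\sim \text{Exponential}(\lambda = 0.5) \\
y_i &\sim \text{Normal}(\mu_{g[i]}, \sigma), \qquad g[i] = \begin{cases} 0 & \text{Variety A} \\ 1 & \text{Variety B} \end{cases} \\
\mu_1 - \mu_0 &\quad \text{is computed deterministically from the means}
\end{aligned}
$$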

### Now, let's take a look at our prior by sampling with `sample_prior_predictive`

In [None]:
barley_model_prior_samples = shared_util.samples_to_dataframe(pm.sample_prior_predictive(
    model=barley_model, samples=5000))

In [None]:
shared_scales = True
f, axs = plt.subplots(nrows=3, figsize=(12, 12), sharex=shared_scales, sharey=shared_scales)
sns.distplot(barley_model_prior_samples[r"$\sigma$"], ax=axs[0]);
sns.distplot(barley_model_prior_samples["means"].apply(lambda xs: xs[0]), ax=axs[1], axlabel=r"$\mu_0$");
sns.distplot(barley_model_prior_samples["$\mu_1 - \mu_0$"], ax=axs[2]);
plt.tight_layout();

Note that we didn't _explicitly_ specify a prior on the latter:
our prior on the value of "$\mu_1 - \mu_0$" is a consequence of our other priors.

This is something like the sampling distribution of the "difference in means"
statistic under the prior.
It's not precisely the same because this is the distribution of the _true difference in means_.
To get the sampling distribution of the difference in means,
we'd need to compute the difference in means on the samples of yields. 
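
Here is a rough sketch of that computation done by hand with the standard library rather than pyMC.
It draws parameters from the same priors (`Normal(4, 3)` for each mean, `Exponential(0.5)` for the standard deviation),
simulates five yields per variety, and records both the true and the observed difference in means.
The seed and number of draws are arbitrary.

```python
import random
import statistics

rng = random.Random(0)
n_draws, n_per_group = 20_000, 5

true_diffs, observed_diffs = [], []
for _ in range(n_draws):
    # draw one set of parameters from the priors
    mu_0, mu_1 = rng.gauss(4, 3), rng.gauss(4, 3)
    sigma = rng.expovariate(0.5)
    # simulate one experiment's worth of yields for each variety
    yields_0 = [rng.gauss(mu_0, sigma) for _ in range(n_per_group)]
    yields_1 = [rng.gauss(mu_1, sigma) for _ in range(n_per_group)]
    true_diffs.append(mu_1 - mu_0)
    observed_diffs.append(statistics.fmean(yields_1) - statistics.fmean(yields_0))

true_spread = statistics.stdev(true_diffs)
observed_spread = statistics.stdev(observed_diffs)
print(f"spread of true differences:     {true_spread:.2f}")
print(f"spread of observed differences: {observed_spread:.2f}")
```

The observed differences are more spread out than the true ones,
because they include sampling noise on top of the uncertainty about the parameters.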

If we observe a value of this variable that is very unlikely under our prior distribution,
that suggests our prior might be wrong,
just as observing a value of a statistic that is very unlikely under
the sampling distribution of the null hypothesis (a low $p$ value)
suggests that the null hypothesis might be wrong.

### And then look at the posterior given the data with `sample`

In [None]:
barley_model_trace = shared_util.sample_from(barley_model, draws=2500, chains=4, progressbar=True)
barley_model_samples = shared_util.samples_to_dataframe(barley_model_trace)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(barley_model_samples["$\sigma$"], color="C2");

First, the posterior for the standard deviation.

It's somewhat hard to interpret without comparing to the prior.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(barley_model_prior_samples["$\sigma$"], color="C0", label="prior");
sns.distplot(barley_model_samples["$\sigma$"], color="C2", label="posterior");
plt.legend();

Our posterior is much tighter than our prior:
where before, we thought there was about a 50% chance
that the standard deviation was below 1 or above 4,
we now put a vanishingly small chance on that being true.

The cells below compute these probabilities more exactly.
See the discussion around `compute_posterior_p`
for more on how/why this is done in this way.

In [None]:
(barley_model_prior_samples["$\sigma$"] < 1).mean() + (barley_model_prior_samples["$\sigma$"] > 4).mean()

In [None]:
(barley_model_samples["$\sigma$"] < 1).mean() + (barley_model_samples["$\sigma$"] > 4).mean()

Note: determining something like this while running a $t$-test would have required
the elaboration of _another_ statistical test,
likely with additional assumptions.

But we were more interested in the difference of means.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))

sns.distplot(barley_model_prior_samples["$\mu_1 - \mu_0$"], color="C0", label="prior");
sns.distplot(barley_model_samples["$\mu_1 - \mu_0$"], color="C2", label="posterior");
plt.legend();

Again,
even though we only observed a relatively small amount of data,
it's enough to move our beliefs far from the prior,
since the prior reflected our state of very extreme ignorance.

Just from visually inspecting these distributions,
we can draw some inferences.
For example, the difference in means is
very unlikely to be extremely large.

If we want to be more quantitative in our inferences,
e.g. if we'd like to infer whether the
mean of Variety B is higher, in order to drive a business decision,
we just need to check what the probability
of that claim is, under the posterior.

In [None]:
(barley_model_samples["$\mu_1 - \mu_0$"] > 0).mean()

Note that the resulting value is dramatically different from
what we had in the prior:
about 50-50 odds.

This experiment was very informative,
even if it wasn't definitive.

In [None]:
(barley_model_prior_samples["$\mu_1 - \mu_0$"] > 0).mean()

Notice what is being done here,
and how it is similar to the qualitative form of inference:
we are checking whether the inference we wanted to draw
was true on each sample,
and then calculating the fraction of samples on which it was true.

This can be generalized to all kinds of different inferences,
without any need to do more than change what we calculate on our samples.
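
Because the answer is a fraction of samples, it comes with its own Monte Carlo error.
The sketch below attaches a binomial standard error to the fraction.
It assumes the draws are independent, which MCMC draws are not, so treat the error bar as optimistic.
The samples are stand-in values generated on the spot, not the posterior from this notebook,
and `fraction_true` is a hypothetical helper.

```python
import math
import random


def fraction_true(samples, check_inference):
    """Fraction of samples where the inference holds, plus a rough standard error."""
    hits = [bool(check_inference(sample)) for sample in samples]
    p = sum(hits) / len(hits)
    # binomial standard error; assumes independent draws
    standard_error = math.sqrt(p * (1 - p) / len(hits))
    return p, standard_error


# Stand-in draws so the sketch runs on its own (NOT the barley posterior).
rng = random.Random(0)
fake_samples = [{"delta": rng.gauss(2, 1.2)} for _ in range(10_000)]

p, se = fraction_true(fake_samples, lambda sample: sample["delta"] > 0)
print(f"P(delta > 0) is about {p:.3f} +/- {se:.3f}")
```

Quadrupling the number of draws halves this error, which is a cheap way to firm up a borderline probability.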

In [None]:
def compute_posterior_p(posterior_samples, check_inference):
    """Given some posterior samples and function that checks
    whether an inference is true or false, returns the probability
    under the posterior given by the samples that the inference is true.
    """
    inference_true_booleans = []
    for _, sample in posterior_samples.iterrows():
        inference_true_booleans.append(check_inference(sample))
    return pd.Series(inference_true_booleans).mean()

In [None]:
# the inference we were interested in
def mu1_greater(sample):
    return sample["$\mu_1 - \mu_0$"] > 0

# what's the chance that these varieties have a low value of sigma
def sigma_under_3(sample):  
    return sample["$\sigma$"] < 3

# a wacky inference, but that's no barrier to computing it
def mu0_less_than_sigma(sample):
    return sample["$\sigma$"] > sample["means"][0]

print(compute_posterior_p(barley_model_samples, mu1_greater),
      compute_posterior_p(barley_model_samples, sigma_under_3),
      compute_posterior_p(barley_model_samples, mu0_less_than_sigma))

### Credible Intervals: Confidence Intervals for Bayesians

The Confidence Interval was intended to give an estimate of what values of a variable were plausible or likely.

But remember, that's not what a Confidence Interval really is:
it is merely an interval-valued statistic that,
on 95% of samples, covers the true value.

Credible Intervals are the Bayesian equivalent of Confidence Intervals.

A **Bayesian Credible Interval** is _any_ interval that covers some given percentage of the posterior density.

For example, a **95% Credible Interval** covers 95% of the posterior.

That is, if the 95% Credible Interval is `(-1, 1)` we believe there is 95% chance that the value lies between `-1` and `1`.

Just as there were many ways to construct Confidence Intervals,
there are many ways to construct Credible Intervals.
However, because it's easier to construct and compute a variety of Bayesian Credible Intervals,
a wider variety get used.

The two most common varieties are **Highest Posterior Density Intervals**
and **Equal Tail Intervals**.

#### Highest Posterior Density Intervals

The _Highest Posterior Density Interval_ is the shortest credible interval with a given coverage.

It can be computed from a `Series`, list, or array with `pm.stats.hpd`.

In [None]:
pm.stats.hpd(barley_model_samples["$\mu_1 - \mu_0$"])
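
One simple way to find the shortest interval from samples is to sort them
and slide a window containing the required fraction of the draws.
The sketch below does exactly that; it is an illustration of the idea,
not a copy of how `pm.stats.hpd` is implemented,
and it is run on stand-in exponential draws rather than the barley posterior.

```python
import math
import random


def shortest_interval(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    n_inside = math.ceil(mass * n)  # number of samples each window must contain
    # try every window of n_inside consecutive sorted samples, keep the narrowest
    best_start = min(
        range(n - n_inside + 1),
        key=lambda i: ordered[i + n_inside - 1] - ordered[i],
    )
    return ordered[best_start], ordered[best_start + n_inside - 1]


# Stand-in skewed samples so the sketch runs on its own.
rng = random.Random(0)
skewed_samples = [rng.expovariate(1.0) for _ in range(20_000)]

low, high = shortest_interval(skewed_samples)
coverage = sum(low <= s <= high for s in skewed_samples) / len(skewed_samples)
print(f"shortest 95% interval: ({low:.3f}, {high:.3f}), coverage {coverage:.3f}")
```

This approach assumes the posterior has a single peak;
with several well-separated peaks, the highest-density region may not be a single interval.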

#### `plot_posterior`

Given the output of `pm.sample` or `shared_util.sample_from`
(not a `DataFrame`, aka the output of `shared_util.samples_to_dataframe`),
pyMC can make a convenient plot of the posterior and the Highest Posterior Density Interval.

In [None]:
pm.plot_posterior(barley_model_trace, varnames=["$\mu_1 - \mu_0$"], figsize=(12, 6), text_size=24,
                  color="C2");

This is a histogram, just like in `sns.distplot`.

The black bar covers, by default, the 95% HPD.
The endpoints are indicated by hovering text,
as is the mean.

From this 95% Credible Interval,
we can determine that the difference in means has a 95% chance of being somewhere between roughly -0.3 and roughly 4.
The exact boundaries of the interval will depend on the samples drawn, and can be read off from the text hovering above the black bar.

#### Quantiles: Equal Tail Intervals

The _Equal Tail Interval_ is the credible interval with equal total probability
above it and below it.

It is given by computing the percentiles of the posterior samples,
e.g. with `pm.stats.quantiles`:

In [None]:
pm.stats.quantiles(barley_model_samples["$\mu_1 - \mu_0$"])

The endpoints of the 50% Equal Tail Interval are given by the values at
`25` and `75` in the dictionary returned above,
roughly `1.1` and `2.6`.

The 95% Equal Tail Interval's endpoints
are given by the values at `2.5` and `97.5`
and should be roughly `-0.4` to `4`.

Notice that the Equal Tail Interval covering 95% of the posterior
is not the same as the Highest Posterior Density Interval covering 95% of the posterior.

They will be different whenever the posterior, and so the samples from it, is asymmetric.
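
A strongly skewed distribution makes the gap between the two intervals easy to see.
The sketch below uses an `Exponential` with rate 1, chosen only because both intervals
can be written down exactly from its inverse CDF; it is not the barley posterior.

```python
import math


def exponential_quantile(p, lam=1.0):
    """Inverse CDF of the Exponential distribution with rate lam."""
    return -math.log(1 - p) / lam


# Equal Tail Interval: 2.5% of the probability below it, 2.5% above it
equal_tail = (exponential_quantile(0.025), exponential_quantile(0.975))

# The density is highest at 0 and decreases from there,
# so the shortest 95% interval starts at 0
highest_density = (0.0, exponential_quantile(0.95))

equal_tail_length = equal_tail[1] - equal_tail[0]
highest_density_length = highest_density[1] - highest_density[0]
print(f"equal tail:      ({equal_tail[0]:.3f}, {equal_tail[1]:.3f})")
print(f"highest density: ({highest_density[0]:.3f}, {highest_density[1]:.3f})")
```

Both intervals cover 95% of the probability, but the Equal Tail Interval is longer
and leaves out the most probable values near 0.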

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
sns.boxplot(barley_model_samples["$\mu_1 - \mu_0$"], width=0.2, linewidth=4);

We display quantile information using a _box plot_,
aka a _box-and-whisker plot_,
accessible with `sns.boxplot`.

The middle half of the data is indicated with the box:
its left edge is at the 25th percentile
and its right edge is at the 75th percentile.
The width of this box is called the "interquartile range".
The median is indicated with a bar through the box.

The "whiskers" extend to cover all data points up to a maximum length equal to some number
times the width of the box in the middle.
The keyword argument in seaborn is `whis` and the default value is `1.5`,
which is standard.

Any points outside of this range are plotted individually.
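
The pieces of a box plot can be computed by hand from those definitions.
The sketch below uses the standard library on a tiny made-up dataset;
`box_plot_summary` is a hypothetical helper, and quartile conventions differ slightly between libraries,
so the numbers may not match a plotting library exactly on other data.

```python
import statistics


def box_plot_summary(values, whis=1.5):
    """Quartiles, whisker ends, and outliers, following the description above."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the width of the box
    low_limit, high_limit = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # whiskers stop at the most extreme data points within the limits
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Here the value 30 lies beyond the upper limit, so it would be drawn as an individual point,
and the upper whisker stops at 9.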

### Comparing the prior and posterior with `boxplot`

The cell below combines the samples from the posterior with the samples from the prior into a single dataframe.

In [None]:
posterior_prior_comparison_df = pd.concat(
    [barley_model_samples, barley_model_prior_samples])

posterior_prior_comparison_df["distribution"] = \
    ["posterior"] * len(barley_model_samples) + ["prior"] * len(barley_model_prior_samples)

In [None]:
posterior_prior_comparison_df.sample(10)

Side note: you might notice a column $\sigma$`_log__`. Internally, pyMC works with logarithms for positive-only variables.

The additional `distribution` column identifies whether a given sample was from the `prior` or the `posterior`.

We can use this column for `groupby` operations:

In [None]:
posterior_prior_comparison_df.groupby("distribution")["$\mu_1 - \mu_0$"].mean()

And for hooking into seaborn.

Many seaborn plotting functions, including `boxplot`,
can use columns of the dataframe to split up the data and automatically produce
the same visualization for multiple subsets of the data.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(x="$\mu_1 - \mu_0$", data=posterior_prior_comparison_df,
            y="distribution", hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.legend([], frameon=False);

For `boxplot` the `y` argument determines which variable sets the vertical position of the boxes,
while the `hue` argument determines which variable sets their color.

If you learn to use these features of seaborn,
you can make very rich and informative plots in just a few lines!

### A model with weaker priors: `agnostic`

Sometimes, we want to bring even less prior information to bear on our modeling problem.

Our previous model very strongly discounted the possibility that the mean number of bushels
would be in the hundreds or the hundreds of thousands.

But perhaps that was too strong of an assumption?

There are several "go-to" choices of prior that are common when trying to make as few assumptions as possible.

### `pm.HalfCauchy` and `pm.Cauchy`

These two distributions have very "long tails":
the chance of producing a value very far away from their center is relatively small,
but substantially higher than for the `Exponential` or `Normal` distributions.

They are used when we want to say that even extremely large values aren't too surprising.

The `HalfCauchy` is the positive-only version of the `Cauchy`,
just as the `HalfNormal`, the `HalfStudentT`, and the `HalfFlat` are positive-only versions of their namesakes.

These distributions are so broad that sampling from them is difficult,
so instead of showing what they look like by drawing samples,
the code below plots their distribution functions directly.

In [None]:
half_cauchy_probability = make_probability(pm.HalfCauchy, beta=0.5)

exponential_probability = make_probability(pm.Exponential, lam=2)

sigmas = np.logspace(-5, 5, num=1000)
half_cauchy_ps = half_cauchy_probability(sigmas)
exponential_ps = exponential_probability(sigmas)

In [None]:
f, ax = plt.subplots(figsize=(16, 8))

ax.plot(sigmas, exponential_ps, lw=4); plt.xlim([0, 5]);
ax.plot(sigmas, half_cauchy_ps, lw=4); plt.xlim([0, 5]);

As you can see, at larger values the `HalfCauchy` is ever so slightly above the `Exponential`.

This difference is much easier to see if we log-transform the probabilities:

In [None]:
f, ax = plt.subplots(figsize=(16, 8))
ax.semilogy(sigmas, exponential_ps, lw=4); plt.xlim([0, 40]);

ax.semilogy(sigmas, half_cauchy_ps, lw=4); plt.xlim([0, 40]);
ax.set_ylim([1e-25, 1e1]);

The probabilities are exponentially decreasing for the `Exponential` distribution,
as indicated by the fact that the log probabilities are decreasing in a straight line.

The probabilities are decreasing much more slowly than exponentially for the `HalfCauchy` distribution:
even though they are small, they are not dropping nearly as low as for the `Exponential`.
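
The same comparison can be made with tail probabilities, the chance of exceeding a given value.
The sketch below uses the closed-form tails of the two distributions
with the same parameters as the plots above (`lam=2` and `beta=0.5`);
the function names are made up for this example.

```python
import math


def exponential_tail(x, lam):
    """P(X > x) for an Exponential with rate lam."""
    return math.exp(-lam * x)


def half_cauchy_tail(x, beta):
    """P(X > x) for a HalfCauchy with scale beta, for x > 0."""
    # equals 1 - (2 / pi) * atan(x / beta), written in a numerically stabler form
    return (2 / math.pi) * math.atan(beta / x)


for x in (1, 5, 40):
    print(f"x = {x:>2}: "
          f"Exponential {exponential_tail(x, lam=2):.3g}, "
          f"HalfCauchy {half_cauchy_tail(x, beta=0.5):.3g}")
```

By `x = 40` the `Exponential` tail is astronomically small,
while the `HalfCauchy` still gives a chance of a bit under 1%.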

The difference is much easier to see
if we just look at a `rugplot` of the samples.

In [None]:
with pm.Model() as barley_model_agnostic:
    pooled_sd = pm.HalfCauchy(r"$\sigma$", beta=10)
    means = pm.Cauchy("means",
                      alpha=4,  # center
                      beta=1,  # spread
                      shape=2)
    
    varieties = pd.Series(barley_df["variety"] == "B", dtype=int)
    yields = pm.Normal("yields", mu=means[varieties], sd=pooled_sd,
                       observed=barley_df["yield"])
    delta_means = pm.Deterministic("$\mu_1 - \mu_0$", means[1] - means[0])

In [None]:
barley_model_agnostic_prior_samples = shared_util.samples_to_dataframe(pm.sample_prior_predictive(
    model=barley_model_agnostic, samples=5000))

In [None]:
shared_scales = True
f, axs = plt.subplots(nrows=3, ncols=2, figsize=(12, 12), sharex=shared_scales, sharey=shared_scales)

sns.distplot(barley_model_prior_samples[r"$\sigma$"], ax=axs[0, 0], rug=True);
sns.distplot(barley_model_prior_samples["means"].apply(lambda xs: xs[0]), ax=axs[1, 0], rug=True, axlabel=r"$\mu_0$");
sns.distplot(barley_model_prior_samples["$\mu_1 - \mu_0$"], ax=axs[2, 0], rug=True);

sns.distplot(barley_model_agnostic_prior_samples[r"$\sigma$"],
             ax=axs[0, 1], kde=False, rug=True, norm_hist=True, bins=1000, color="C1");
sns.distplot(barley_model_agnostic_prior_samples["means"].apply(lambda xs: xs[0]),
             ax=axs[1, 1], kde=False, rug=True, norm_hist=True, bins=1000, color="C1", axlabel=r"$\mu_0$");
sns.distplot(barley_model_agnostic_prior_samples["$\mu_1 - \mu_0$"],
             ax=axs[2, 1], kde=False, rug=True, norm_hist=True, bins=1000, color="C1");
axs[0, 0].set_title("Model A:\nExponential-Normal Prior"); axs[0, 1].set_title("Model B:\nHalfCauchy-Cauchy Prior")
axs[-1, -1].set_xlim([-100, 100]); plt.tight_layout();

While the draws from the original model are fairly tightly concentrated,
the draws from the agnostic model, with the `HalfCauchy` and `Cauchy` prior,
are much more broadly distributed.

Intuitively, this model is much less opinionated than the other about the data.

Of course, the chance of a variety of barley having a yield
that is an order of magnitude or more higher than all the others is quite small,
and so the `Exponential`-`Normal` model is very reasonable.

In [None]:
shared_scales = True
f, axs = plt.subplots(nrows=3, figsize=(12, 12), sharex=shared_scales, sharey=shared_scales)

sns.distplot(barley_model_agnostic_prior_samples[r"$\sigma$"],
             ax=axs[0], kde=False, norm_hist=True, bins=1000, color="C1");
sns.distplot(barley_model_agnostic_prior_samples["means"].apply(lambda xs: xs[0]),
             ax=axs[1], kde=False, norm_hist=True, bins=1000, color="C1", axlabel=r"$\mu_0$");
sns.distplot(barley_model_agnostic_prior_samples["$\mu_1 - \mu_0$"],
             ax=axs[2], kde=False, norm_hist=True, bins=1000, color="C1");
axs[-1].set_xlim([-100, 100]); plt.tight_layout();

Note: kernel density estimates don't quite do these distributions justice,
which is why the histograms above are drawn with `kde=False`.

Once again, we draw some samples,
package them into a dataframe, and visualize the posterior
with a box-and-whisker plot.

In [None]:
barley_model_agnostic_trace = shared_util.sample_from(barley_model_agnostic, draws=2500, chains=4, progressbar=True)
barley_model_agnostic_samples = shared_util.samples_to_dataframe(barley_model_agnostic_trace)

In [None]:
agnostic_posterior_prior_comparison_df = pd.concat(
    [barley_model_agnostic_samples, barley_model_agnostic_prior_samples])

agnostic_posterior_prior_comparison_df["distribution"] = \
    ["posterior"] * len(barley_model_agnostic_samples) + ["prior"] * len(barley_model_agnostic_prior_samples)

In [None]:
agnostic_posterior_prior_comparison_df.sample(10)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(x="$\mu_1 - \mu_0$", y="distribution",
            data=agnostic_posterior_prior_comparison_df, hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.legend([], frameon=False);
ax.set_xlim([-20, 20]);

If we combine the samples from both of our models into a single dataframe,
including samples from the prior and the posterior for both,
we can plot the original and updated beliefs for both models
together and with minimal fuss.

In [None]:
model_comparison_df = pd.concat([posterior_prior_comparison_df,  # stack the two dataframes on top of each other
                                 agnostic_posterior_prior_comparison_df])
# add a column indicating which model each sample came from
model_comparison_df["model"] = \
    ["original"] * len(posterior_prior_comparison_df) + ["agnostic"] * len(agnostic_posterior_prior_comparison_df)

In [None]:
model_comparison_df.sample(10)

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(x="$\mu_1 - \mu_0$", y="model", data=model_comparison_df, hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.set_xlim([-20, 20]);

This plot separates out the original and the agnostic model on the y-axis, with `y="model"`,
and then uses color to indicate which distribution the samples are drawn from, with `hue="distribution"`.

Directly from this plot,
we can see that though the centers of the two priors are similar,
the prior of the agnostic model is much more widely distributed.

We can also see that the difference in posteriors is much smaller
than the difference in priors.
At least on the scale of the priors,
the centers are fairly close,
and the widths are about the same.

### Model with the weakest possible priors: `improper`

What if we wanted to make no assumption about what values of the standard deviation or the mean were more or less likely?

### `pm.Flat` and `pm.HalfFlat`

pyMC provides access to two distributions that aren't really probability distributions at all.

In [None]:
flat_probability = make_probability(pm.Flat)

half_flat_probability = make_probability(pm.HalfFlat)

In [None]:
xs = np.arange(-10, 10)
plt.step(xs, half_flat_probability(xs), lw=4);

In [None]:
xs = np.arange(-10, 10)
plt.step(xs, flat_probability(xs), lw=4);
plt.ylim([0, 1.1]);

Because the values are the same everywhere,
except where they are 0,
in the case of `HalfFlat`,
they have no effect on the posterior
except to say that some values are impossible.
Whenever they show up in e.g. a Bayes' rule calculation,
the constant cancels out of the result.

Because they aren't probability distributions,
as they don't integrate to 1, they can't be sampled from
with `sample_prior_predictive`.

But they still result in a valid posterior,
so we draw samples with `pm.sample`.
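
The claim that a flat prior drops out of Bayes' rule can be checked on a grid.
In the sketch below the noise standard deviation is fixed at 1.5 and the means are restricted to a grid from 0 to 10,
both assumptions made for illustration; the grid only mimics `pm.Flat`, which has no bounds.
The data are the Variety B yields.

```python
import math

data = [7, 5, 3, 4, 6]   # Variety B yields from this notebook
noise_sd = 1.5           # held fixed for illustration (an assumption)
grid = [m / 100 for m in range(0, 1001)]  # candidate means 0.00, 0.01, ..., 10.00


def likelihood(mu):
    """Normal likelihood of the data, up to a constant that does not depend on mu."""
    return math.prod(math.exp(-0.5 * ((y - mu) / noise_sd) ** 2) for y in data)


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


# A flat "prior" is the same constant everywhere; try two different constants.
posterior_c1 = normalize([1.0 * likelihood(mu) for mu in grid])
posterior_c1000 = normalize([1000.0 * likelihood(mu) for mu in grid])

mode = grid[posterior_c1.index(max(posterior_c1))]
print(f"posterior mode: {mode}")
```

Both constants give the same normalized posterior, which peaks at the sample mean:
with a flat prior, the posterior is just the likelihood rescaled to add up to 1.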

In [None]:
with pm.Model() as barley_model_improper:
    pooled_sd = pm.HalfFlat(r"$\sigma$")
    means = pm.Flat("means", shape=2)
    
    varieties = pd.Series(barley_df["variety"] == "B", dtype=int)
    yields = pm.Normal("yields", mu=means[varieties], sd=pooled_sd,
                       observed=barley_df["yield"])
    delta_means = pm.Deterministic("$\mu_1 - \mu_0$", means[1] - means[0])

In [None]:
barley_model_improper_trace = shared_util.sample_from(barley_model_improper, draws=2500, chains=4)
barley_model_improper_samples = shared_util.samples_to_dataframe(barley_model_improper_trace)

And then add them to our model comparison dataframe.

In [None]:
barley_model_improper_samples["model"] = "improper"
barley_model_improper_samples["distribution"] = "posterior"

In [None]:
model_comparison_df = pd.concat([model_comparison_df, barley_model_improper_samples])

And, thanks to the power of seaborn,
we automatically get a comparative visualization
of the three models using the same code as before.

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(x="$\mu_1 - \mu_0$", y="model", data=model_comparison_df, hue="distribution",
            palette=["C2", "C0"], linewidth=4);
ax.set_xlim([-20, 20]);

Notice that the improper model has no prior distribution shown,
since its prior cannot be sampled from.

Notice also that the width of the posterior, the uncertainty, is greater than for the other models.
This will generically be the case, as the width of the prior in the improper model is effectively infinite.

Lastly, the center of the posterior for the `Cauchy`-`HalfCauchy` model, the agnostic model,
is closer to 0 than that of either of the other models.
This is because the peak of the prior on the difference is at 0,
like the original model and unlike the improper model,
and the greater width of the prior over the standard deviation relative to the original model
means it is updated less aggressively
in response to the data.

This makes the `Cauchy`-`HalfCauchy` model a good _conservative_ choice.

## What would you do?

In [None]:
print("Normal-Exponential:\t", compute_posterior_p(barley_model_samples, mu1_greater),
      "\nCauchy-HalfCauchy:\t", compute_posterior_p(barley_model_agnostic_samples, mu1_greater),
      "\nFlat-HalfFlat:\t\t", compute_posterior_p(barley_model_improper_samples, mu1_greater))

Each of these models produces a slightly different posterior,
and none of the posteriors results in a completely unambiguous inference about
the difference in the group means.

So what is to be done?

There is still a decent chance that Variety A
has a slightly larger yield,
even though it's unlikely to have a much larger yield,
on the order of several bushels.

If the downside to choosing the wrong variety is minimal,
especially when the difference is on the order of a single bushel per planting,
as is likely the case,
then plant Variety B and reassess after the next harvest
brings in more data.

If this is a decision with a large downside,
where even a small difference in yields
could result in a major loss,
either repeat the experiment,
using some combination of these posteriors as a prior,
or maybe plant some of each!