<img src="../../shared/img/banner.svg"></img>

# Lab 04 - Comparing Bootstrap and Model-Based Sampling

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from client.api.notebook import Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

import utils.util as util
import shared.src.utils.util as shared_util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

In [None]:
#### SOL'N

def compare_inferential_distributions(samples, labels=None, true_value=None, xlabel=None, ax=None):
    """Overlay normalized histograms of each sample, optionally marking the true value."""
    if ax is None:
        f, ax = plt.subplots(figsize=(12, 4))

    if labels is None:
        labels = [None] * len(samples)

    # `density=True` replaces the deprecated `normed` argument of `hist`
    for sample, label in zip(samples, labels):
        ax.hist(sample, label=label, density=True, alpha=0.5)

    if true_value is not None:
        ax.vlines(true_value, 0, ax.get_ylim()[1] * 0.25)
    ax.set_xlabel(xlabel)
    ax.legend()

## Learning Objectives

1. Recognize the similarities between sampling from a `pyMC` model posterior and bootstrap sampling and note the differences.
1. Understand the benefits and drawbacks of incorporating strong and weak prior information into a model.
1. Learn to work with the following `pyMC` tools: the `observed` kwarg, `sample_posterior_predictive`, and `sample_prior_predictive`.

## Introduction

In this lab, we'll look at a variety of different models of the same data.
The data will be Gaussian, or Normal, data,
and the models will have different priors about the parameters of that data.

The first model will be a bootstrapping-based model, of the type used in data8.
The remaining models will be `pyMC` models.
They are defined below, using the function `define_normal_model`.
We will compare and contrast the bootstrapping-based model to the `pyMC` models,
in terms of their prior distributions, their posterior distributions, and the inferences drawn from them.

## Generating Data

First, we generate the data with a small utility, `util.generate_normal_data`.
The true mean and standard deviation of the data are set by the variables `true_mu` and `true_sigma`.
The resulting sample will almost surely have a different mean and standard deviation.
For convenience, the `seed` argument to this utility sets the state of Python's random number generator
and ensures that the data generated will be the same each time it is run.

In a real scenario, we wouldn't know the values of `true_mu` and `true_sigma`.
We make models to try to represent and quantify our uncertainty about what those values might be.
Working with a case where we know what the real answer is, even if it's a bit artificial, helps us gain a better understanding of our models.

In [None]:
true_mu = 2.
true_sigma = 1.

N = 10

#### SOL'N

Teacher Note: the values for the bounds on a number of the auto-graded questions below
were estimated using simple Monte Carlo methods based on the parameters above.
If these parameters are adjusted, the bounds will need to be adjusted as well.

In [None]:
df = util.generate_normal_data(true_mu, true_sigma, N, seed=seed.SHARED_SEED)

As always, before we start modeling data, even with bootstrapping,
we should calculate descriptive statistics and visualize the data.

In [None]:
df[0].mean(), df[0].std()

In [None]:
sns.distplot(df[0]);

## Bootstrap Sampling

As we've done previously, we can use bootstrap sampling to estimate our uncertainty about the values of parameters.

In the cells below, run bootstrapping on the data to estimate the uncertainty in the value of the true mean.
First, produce a `Series` of bootstrap sample means, called `boot_mus`,
and then produce a histogram of the bootstrapped values with `distplot`.

Don't be afraid to reuse plotting code from previous labs/homeworks.

The first cell produces a list of length `num_boots` of pandas `Series`, each of which is a bootstrap sample.

In [None]:
num_boots = 1000

boots = [df[0].sample(frac=1, replace=True) for _ in range(num_boots)]
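For comparison, the same resampling idea can be written with `numpy` alone. This is a minimal sketch, not the lab's solution: the data are a stand-in drawn from Normal(2, 1), and the seed and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

# Stand-in for the lab's data: 10 draws from Normal(2, 1).
data = rng.normal(loc=2.0, scale=1.0, size=10)

num_boots = 1000
# Each row is one bootstrap sample: N draws from the data, with replacement.
boot_samples = rng.choice(data, size=(num_boots, data.size), replace=True)

# One mean per bootstrap sample approximates the sampling distribution of the mean.
boot_means = boot_samples.mean(axis=1)

# A 95% percentile interval summarizes the uncertainty about the mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
```

The pandas version above, `df[0].sample(frac=1, replace=True)`, does the same thing one sample at a time.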

In [None]:
#### SOL'N

boot_mus = pd.Series([boot.mean() for boot in boots])

compare_inferential_distributions([boot_mus],
                                  ["Bootstrapping Model"],
                                  true_value=true_mu, xlabel=r"$\mu$")

In [None]:
ok.grade("q1")

## Defining Normal Models

Our models will all have a graph as below:

In [None]:
g = util.make_normal_model_graph()

That is, the random variable x, which we observe $N$ times,
is parameterized by a random variable we don't observe, $\mu$,
which is presumed to have the same, unknown value for each x.

This variable sets the "true" mean of our data.

The function `define_normal_model` creates a model of this kind,
where $\mu$ is a Normal random variable whose mean and standard deviation are set by
the first and second arguments, `mu_prior_mean` and `mu_prior_sd`.

The random variable $\mu$ represents our state of belief about what the mean of our data might be,
before we have observed any data.
On average, this variable will have the value `mu_prior_mean` and samples will be spread around that value
with standard deviation `mu_prior_sd`.

In [None]:
def define_normal_model(mu_prior_mean, mu_prior_sd, observed_data, true_sigma=true_sigma):
    """Defines a model with Normal prior over mu, parameters determined by first two arguments,
    a Normal likelihood with mean mu and standard deviation equal to the optional arg true_sigma,
    and observed values equal to the argument observed_data
    """
    with pm.Model() as normal_model:
        mu = pm.Normal("mu", mu=mu_prior_mean, sd=mu_prior_sd)
        
        pm.Normal("xs", mu=mu, sd=true_sigma, observed=observed_data)
    return normal_model

We define three models: `agnostic_model`, `right_model`, and `wrong_model`.

Agnostic means, loosely, ["knowing nothing"](https://en.wiktionary.org/wiki/agnostic).
This model makes much weaker assumptions about what the value of $\mu$ probably is.
This is the model we have when we know little about our data, aside from the form of the model.

The `right_model`, on the other hand, makes a much stronger assumption on what the value of $\mu$ is,
and it has the right guess.
This is the kind of model we have when we are well-informed about our data.

The `wrong_model` makes just as strong an assumption as the right model does,
but its assumption is wrong. It expresses the belief that the mean of the data is within ±0.5 of 0, but it is not.
This is the kind of model we have when we are mis-informed about our data.

In [None]:
agnostic_model_prior_mu = wrong_model_prior_mu = 0
agnostic_model_prior_sd = 10
wrong_model_prior_sd = right_model_prior_sd = 0.25

agnostic_model = define_normal_model(agnostic_model_prior_mu, agnostic_model_prior_sd, df)

wrong_model = define_normal_model(wrong_model_prior_mu, wrong_model_prior_sd, df)

right_model = define_normal_model(true_mu, right_model_prior_sd, df)

## Sample from Prior

Let's look at our priors on $\mu$ for each model. Remember that the prior is what we believe about our data before looking at the particular values we observed.

Use `sample_prior_predictive` to draw `num_prior_samples` samples of `mu` from the prior of each of the three `pyMC` models.

The cell below shows how to do this for the `right_model`.
Name the other two series of samples `agnostic_prior_mus` and `wrong_prior_mus`.


In [None]:
num_prior_samples = 10000

right_prior = pm.sample_prior_predictive(samples=num_prior_samples,
                                         model=right_model)
right_prior_mus = pd.Series(right_prior["mu"])

In [None]:
#### SOL'N
agnostic_prior = pm.sample_prior_predictive(samples=num_prior_samples,
                                            model=agnostic_model)

wrong_prior = pm.sample_prior_predictive(samples=num_prior_samples,
                                         model=wrong_model)

agnostic_prior_mus = pd.Series(agnostic_prior["mu"])
wrong_prior_mus = pd.Series(wrong_prior["mu"])

In [None]:
ok.grade("q2")

Plot a histogram for the samples of `mu` from each model and answer the questions below.

In [None]:
#### SOL'N
compare_inferential_distributions([right_prior_mus, wrong_prior_mus],
                                  labels=["Right Model", "Wrong Model"],
                                  true_value=true_mu, xlabel=r"$\mu$")

In [None]:
#### SOL'N
compare_inferential_distributions([agnostic_prior_mus], labels=["Agnostic Model"],
                                  true_value=true_mu, xlabel=r"$\mu$")

#### Q Why is it fair to say that the agnostic model makes weaker assumptions about the value of $\mu$ than do the other models?  What parameter choice, in the definition of the model, is responsible for this?

#### SOL'N

The samples from the agnostic model's prior cover a wider range of values for $\mu$:
samples with values between -30 and +30 are fairly common.
This is a much broader range of values than the other models produce.

This width is determined by `mu_prior_sd`,
the standard deviation of the Normal random variable that represents our state of belief about $\mu$.

#### Q Above, it was claimed that the wrong model encodes the belief that $\mu$ is within about ±0.5 of 0. How would you determine that from the histogram? In these terms, what belief does the right model represent?

#### SOL'N

This fact can be read off by looking at what values appear in the samples from the prior:
sampled values of $\mu$ outside the range of 0 ± 0.5 are rare for the wrong model.

The right model represents the belief that the value of $\mu$ is within ±0.5 of 2.
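The "within ±0.5" reading can also be checked numerically. The sketch below assumes the priors are exactly the Normal distributions defined above (standard deviations 0.25 and 10) and samples them directly with `numpy` rather than through `pyMC`; the helper name `fraction_within` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def fraction_within(prior_mean, prior_sd, half_width, num_samples=10_000):
    """Fraction of Normal(prior_mean, prior_sd) draws within half_width of prior_mean."""
    samples = rng.normal(loc=prior_mean, scale=prior_sd, size=num_samples)
    return np.mean(np.abs(samples - prior_mean) <= half_width)

# A half-width of 0.5 is two standard deviations for the strong priors.
wrong_fraction = fraction_within(0.0, 0.25, half_width=0.5)
right_fraction = fraction_within(2.0, 0.25, half_width=0.5)
# For the agnostic prior the same window is a tiny slice of the distribution.
agnostic_fraction = fraction_within(0.0, 10.0, half_width=0.5)

print(wrong_fraction, right_fraction, agnostic_fraction)
```

Because ±0.5 is two standard deviations for the right and wrong priors, roughly 95% of their samples land inside that window, while only a small fraction of the agnostic prior's samples do.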

#### Q Why don't we plot a prior for our bootstrapping-based model?

#### SOL'N

Our bootstrapping model does not have a prior! Bootstrapping can only represent the information contained in the data, not anything else.

## Sampling from the Posteriors of the Parameters

The posterior represents what we believe about our data after we've looked at the values we've observed.

First, let's look at what each of our models believes about the true mean of the variable $X$
after having observed the data.
Plot histograms of the samples of $\mu$ from each of our models
(the three `pyMC` models and the bootstrapping model)
and then answer the questions below.
Name the samples from the models `agnostic_samples` and `wrong_samples`.

Remember that, for unobserved variables in `pyMC` models, like $\mu$,
we sample from the posterior by running `pm.sample`.
Make sure to keep around the `trace`, which is returned by `pm.sample`, since we'll need that to look at our posterior predictions for future values of the data, the `posterior_predictive`.
This is done by the `get_trace_and_samples` function.

For unobserved variables in a bootstrapping model, we need to estimate their values on all of the bootstrapped samples. You've already done this above, in the computation of `boot_mus`.
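For this particular model (a Normal prior on $\mu$ and a Normal likelihood with known standard deviation), the posterior over $\mu$ is itself Normal and has a standard closed form. The sketch below is not part of the lab's utilities; the function name `normal_posterior` is an assumption, and it is offered only as a sanity check against the histograms produced by `pm.sample`.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, data, sigma):
    """Closed-form posterior over mu: Normal prior, Normal likelihood with known sigma.

    Returns the posterior mean and posterior standard deviation.
    """
    data = np.asarray(data, dtype=float)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = data.size / sigma ** 2
    post_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average of prior mean and sample mean.
    post_mean = (prior_precision * prior_mean + data_precision * data.mean()) / post_precision
    return post_mean, np.sqrt(1.0 / post_precision)

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.normal(loc=2.0, scale=1.0, size=10)  # stand-in for the lab's data

priors = {"agnostic": (0.0, 10.0), "wrong": (0.0, 0.25), "right": (2.0, 0.25)}
for name, (prior_mean, prior_sd) in priors.items():
    post_mean, post_sd = normal_posterior(prior_mean, prior_sd, data, sigma=1.0)
    print(f"{name:>8}: posterior mean={post_mean:.3f}, sd={post_sd:.3f}")
```

The precision weighting shows why strong priors barely move: with a prior standard deviation of 0.25, the prior carries a weight of 16 against the data's weight of 10.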

In [None]:
def get_trace_and_samples(model):
    with model:
        trace = pm.sample(draws=1000, chains=4)
        samples = shared_util.samples_to_dataframe(trace)
    return trace, samples

In [None]:
right_trace, right_samples = get_trace_and_samples(right_model)

In [None]:
#### SOL'N

agnostic_trace, agnostic_samples = get_trace_and_samples(agnostic_model)

wrong_trace, wrong_samples = get_trace_and_samples(wrong_model)

In [None]:
ok.grade("q3")

In [None]:
#### SOL'N

compare_inferential_distributions([wrong_samples.mu, right_samples.mu, agnostic_samples.mu],
                                  ["Wrong Model", "Right Model", "Agnostic Model"],
                                  true_value=true_mu, xlabel=r"$\mu$")
plt.gca().set_xlim([0, 3.5]);

In [None]:
#### SOL'N

compare_inferential_distributions([agnostic_samples.mu, boot_mus],
                                  ["Agnostic Model", "Bootstrapping Model"],
                                  true_value=true_mu, xlabel=r"$\mu$")
plt.gca().set_xlim([0, 3.5]);

#### Q For the `pyMC` models, how have the beliefs about $\mu$ changed? That is, compare the posteriors over $\mu$ to the priors. Do they differ more for some of the models than for others? If so, which one(s)?

#### SOL'N

For the right and wrong models, the prior and posterior look about the same,
though the posterior for the wrong model might have scooted upwards a bit.
Intuitively, these models were very sure in their beliefs, and 10 data points weren't enough to change them.

#### Q Can you give an intuitive explanation, in your own words, for why each model's beliefs changed the way they did?

#### SOL'N 

The agnostic model's posterior is very different from its prior:
instead of being broadly distributed across values from -30 to +30,
it's distributed across values between about 1 and 3.
Intuitively, this model had weakly-held beliefs about the values of the parameter,
and 10 data points were sufficient to change them dramatically.

#### Q Directly compare the posterior for the agnostic model to that for the bootstrapping model. If they are different, why are they different? If they are similar, why are they similar?

#### SOL'N

Usually, the agnostic model and the bootstrapping model have fairly similar posteriors,
and they should for the dataset that was drawn.

They are similar because the bootstrapping model makes no assumptions about the likely values of $\mu$,
while the agnostic model only makes weak assumptions about the likely values of $\mu$.

However, they can be different, depending on the data sample,
which might happen if a student changes around the data sampling code.
In that case, a student might answer that they are different because
the bootstrap model has no prior, and so is like a model with a `Flat` distribution for `mu`.

#### Q What do you suspect would happen to the posteriors of the wrong and agnostic models if the dataset contained 100 or 1000 observations instead of just 10? Explain your reasoning.

#### SOL'N

They would move towards the true value of $\mu$ and become narrower.
As the number of observations increases, the likelihood outweighs the prior,
so the posterior of any model whose prior gives the true value nonzero probability should concentrate around that value.
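One way to see this without rerunning the sampler is a grid approximation of the posterior. This is a sketch under stated assumptions: the data are simulated stand-ins with true mean 2 and known standard deviation 1, the grid range of -5 to 5 is chosen by hand, and the helper `grid_posterior` is hypothetical rather than part of the lab.

```python
import numpy as np

def grid_posterior(prior_mean, prior_sd, data, sigma, grid):
    """Approximate the posterior over mu on a grid: prior times likelihood, normalized."""
    log_prior = -0.5 * ((grid - prior_mean) / prior_sd) ** 2
    # Sum the Normal log-likelihood over all data points for each candidate mu.
    log_like = -0.5 * (((data[:, None] - grid[None, :]) / sigma) ** 2).sum(axis=0)
    log_post = log_prior + log_like
    weights = np.exp(log_post - log_post.max())  # subtract the max to avoid underflow
    return weights / weights.sum()

rng = np.random.default_rng(0)  # arbitrary seed
grid = np.linspace(-5.0, 5.0, 2001)
priors = {"wrong": (0.0, 0.25), "agnostic": (0.0, 10.0)}

summaries = {}
for n in (10, 100, 1000):
    data = rng.normal(loc=2.0, scale=1.0, size=n)  # stand-in data, true mu = 2
    for name, (prior_mean, prior_sd) in priors.items():
        weights = grid_posterior(prior_mean, prior_sd, data, 1.0, grid)
        post_mean = np.sum(grid * weights)
        post_sd = np.sqrt(np.sum(weights * (grid - post_mean) ** 2))
        summaries[(name, n)] = (post_mean, post_sd)
        print(f"{name:>8}, N={n:>4}: mean={post_mean:.3f}, sd={post_sd:.3f}")
```

With more observations, both posteriors narrow, and the wrong model's posterior mean is pulled away from its prior mean of 0 towards the true value.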

# Model Comparison

## Sampling from the Posterior of the Observed Variable

In this example, we know the correct value of $\mu$, since we're simulating the entire process.
In a real experiment, we wouldn't know the true value, so we wouldn't know which model to trust.
That is, in our simulation, we know which model is the right model and which model is the wrong model,
but we wouldn't know that in most real life situations.

So how do we determine which model to believe?

The first thing to check is whether the model's posterior over the observed variable matches
the data we observed. If it doesn't, then the model doesn't consider the data we observed very likely.
And while it's possible that the data we drew was unrepresentative, the larger a sample, the less likely that is,
and so the more evidence we have that the model is incorrect.

Because it can be used to predict future values of the observed variable, based on the values we've already observed, this distribution is known as the "posterior predictive" distribution.
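Conceptually, a posterior predictive sample is built in two steps: draw a value of $\mu$ from the posterior, then draw data from the likelihood with that $\mu$. The sketch below illustrates the idea with `numpy` only. The posterior samples are a made-up stand-in (in the lab they would come from the trace), and this is not a description of how `pyMC` implements `sample_posterior_predictive` internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Stand-in for posterior samples of mu; the values 1.9 and 0.3 are illustrative only.
posterior_mus = rng.normal(loc=1.9, scale=0.3, size=4000)
sigma = 1.0  # known standard deviation of the likelihood
n_obs = 10   # size of each simulated dataset, matching the observed data

num_posterior_samples = 1000
# Step 1: pick posterior draws of mu.
chosen_mus = rng.choice(posterior_mus, size=num_posterior_samples)
# Step 2: for each chosen mu, simulate a dataset of the same size as the observed one.
postpred_xs = rng.normal(loc=chosen_mus[:, None], scale=sigma,
                         size=(num_posterior_samples, n_obs))

print(postpred_xs.shape, postpred_xs.mean(), postpred_xs.std())
```

The spread of the simulated data combines two sources of uncertainty: the noise in the likelihood and the remaining uncertainty about $\mu$.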

The quantitative version of this approach to model comparison is called _maximum likelihood modeling_,
and it's especially common in cases where an uninformative prior is chosen.

The code in the cell below will generate samples from the posterior over the observed variable x for the right model using `sample_posterior_predictive`. Note that it requires a copy of a `trace` from the same model.

The cells beneath that will compare the posterior predictive to the observed data using a histogram and rugplot.
They will run as written, producing only a plot of the posterior samples from the `right_model`.

Use `sample_posterior_predictive` to draw samples according to the posteriors of the wrong and agnostic models.
Then, add the results to the `postpreds` list, as in the example code below (order is important!),
and then run the cell containing `compare_multiple_postpreds_to_observed`
to plot the comparisons.
Then answer the questions below.

Example Code:
```python
wrong_postpred = pm.sample_posterior_predictive(?)
agnostic_postpred = pm.sample_posterior_predictive(?)
postpreds = [right_postpred["xs"], agnostic_postpred["xs"], wrong_postpred["xs"]]
```

In [None]:
num_posterior_samples = 1000

right_postpred = pm.sample_posterior_predictive(
    right_trace, samples=num_posterior_samples, model=right_model)

In [None]:
#### SOL'N

wrong_postpred = pm.sample_posterior_predictive(
    wrong_trace, samples=num_posterior_samples, model=wrong_model)

agnostic_postpred = pm.sample_posterior_predictive(
    agnostic_trace, samples=num_posterior_samples, model=agnostic_model)

In [None]:
postpreds = [right_postpred["xs"]]

In [None]:
#### SOL'N

postpreds = [right_postpred["xs"], agnostic_postpred["xs"], wrong_postpred["xs"]]

In [None]:
f, axs = util.compare_multiple_postpreds_to_observed(
    postpreds, df, titles=["Right Model", "Agnostic Model", "Wrong Model"])

#### Q How well does the posterior predictive distribution match the data for each model?

#### SOL'N

The predictions of the right and agnostic model match the data very closely, while the predictions of the wrong model are way off.

#### Q Which model is doing worst according to this criterion? How do its predictions differ from the observed data? Why might this be?

#### SOL'N

The wrong model is doing the worst. Its predictions differ from the data by underestimating the mean, since its posterior over the mean is down at 0.

### Limits to Sampling from the Posterior

This method of picking models by how well their samples match the data has its flaws.
For example, the bootstrap always comes out looking better than any other model possibly can.
Let's see why this is.

The observed values for 100 samples from the posterior predictive distribution of each model will be plotted below.

Note that "bootstrapping" is the same as "sampling from the posterior predictive" for a bootstrap-based model.
Sample values will appear as black tick marks, with the observed values underneath in blue.

In [None]:
boots_np = np.asarray([np.asarray(boot) for boot in boots])

f, axs = util.elementwise_compare_multiple_postpreds_to_observed(
    [boots_np] + postpreds, df,
    titles=["Bootstraps", "Right Model", "Agnostic Model", "Wrong Model"])

#### Q Explain, in your own words, why model comparison on the basis of posterior predictive similarity to the data would always pick the bootstrap model.

#### SOL'N

The bootstrap model will only ever produce values from the dataset, so its samples will always perfectly match the data. Other models will typically produce values not from the dataset, so their samples won't perfectly match the data.
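This can be checked directly. The following sketch uses stand-in data and a stand-in Normal model (both assumptions, not the lab's objects) and asks whether each sampled value appears in the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.normal(loc=2.0, scale=1.0, size=10)  # stand-in for the observed data

# A bootstrap sample reuses observed values; a model-based sample draws fresh ones.
boot = rng.choice(data, size=data.size, replace=True)
model_draw = rng.normal(loc=2.0, scale=1.0, size=data.size)

boot_in_data = np.isin(boot, data).all()         # every bootstrap value is an observed value
model_in_data = np.isin(model_draw, data).any()  # fresh continuous draws essentially never repeat one

print(boot_in_data, model_in_data)
```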

## Model Comparison by Surprise

In order to quantitatively characterize which models are best supported by the data,
we need to consider not just how well the models predict the data,
but also how well the prior characterizes the data.

The full details of how this is done are outside of the scope of this class, since

1) comparing models via sampling is an area of active development in `pyMC` and in research

2) existing methods tend to use ideas outside of the core concepts of this class.

The curious can look up [Bayes factors](https://en.wikipedia.org/wiki/Bayes_factor).

But in short, we can perform a form of model comparison by computing a statistic called
the average [surprise](https://charlesfrye.github.io/stats/2016/03/29/info-theory-surprise-entropy.html),
or the negative log-likelihood, on samples from the prior.
This number will be high if the choice of parameter makes the data unlikely
and low if the choice of parameter makes the data likely.
A model with lower surprise, averaged over samples from the prior, is better.
This method is known as _marginal likelihood comparison_.
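To make the statistic concrete, here is a `numpy`-only sketch of the average surprise for a Normal likelihood with known standard deviation. It assumes stand-in data and samples the priors directly; the function names are hypothetical, and the lab's own `calculate_surprise` relies on the `pyMC` model's `logp` instead.

```python
import numpy as np

def normal_neg_log_likelihood(data, mu, sigma):
    """Surprise of the whole dataset under Normal(mu, sigma): the negative log-likelihood."""
    data = np.asarray(data, dtype=float)
    squared_error_term = 0.5 * np.sum(((data - mu) / sigma) ** 2)
    normalizer_term = data.size * np.log(sigma * np.sqrt(2.0 * np.pi))
    return squared_error_term + normalizer_term

def average_surprise(prior_mean, prior_sd, data, sigma, rng, num_samples=2000):
    """Average the surprise of the data over values of mu sampled from the prior."""
    prior_mus = rng.normal(loc=prior_mean, scale=prior_sd, size=num_samples)
    return np.mean([normal_neg_log_likelihood(data, mu, sigma) for mu in prior_mus])

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.normal(loc=2.0, scale=1.0, size=10)  # stand-in for the lab's data

right = average_surprise(2.0, 0.25, data, 1.0, rng)
wrong = average_surprise(0.0, 0.25, data, 1.0, rng)
agnostic = average_surprise(0.0, 10.0, data, 1.0, rng)
print(f"right={right:.1f}, wrong={wrong:.1f}, agnostic={agnostic:.1f}")
```

A prior that places most of its samples far from the data pays a large penalty, because the squared distance between $\mu$ and the data enters the surprise directly.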

In [None]:
def calculate_surprise(model, prior):
    model_logp = model.observed_RVs[0].logp
    surprises = [-model_logp(mu=mu) for mu in prior["mu"]]
    return np.mean(surprises)

In [None]:
right_surprise = calculate_surprise(right_model, right_prior)
right_surprise

As done above for the `right_model`,
calculate the surprise for the agnostic and wrong models,
name them `agnostic_surprise` and `wrong_surprise`,
and use their values to answer the questions below.

In [None]:
#### SOL'N

agnostic_surprise = calculate_surprise(agnostic_model, agnostic_prior)
agnostic_surprise

In [None]:
#### SOL'N

wrong_surprise = calculate_surprise(wrong_model, wrong_prior)
wrong_surprise

In [None]:
ok.grade("q4")

#### Q According to this "surprise" criterion, which model is best?

#### SOL'N

The right model performs best here.

#### Q How does the agnostic model perform? Why? 

#### SOL'N

The agnostic model performs poorly. Many of the samples from the prior will have means very far away from the true value (-30 to +30), so the data will look very unlikely.

_Hint for why: what do samples from the agnostic prior look like?_ Try plotting the distribution of `agnostic_prior["xs"].flatten()` compared to the observed data (with `sns.rugplot`).

In [None]:
#### SOL'N
compare_inferential_distributions([agnostic_prior["xs"].flatten()], labels=["agnostic"], xlabel="x");
sns.rugplot(df, ax=plt.gca());

In [None]:
ok.score()