<img src="../../shared/img/banner.svg"></img>

# Lab 04 - Comparing Bootstrap and Model-Based Sampling

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

import utils.util as util
import shared.src.utils.util as shared_util

## Learning Objectives

1. Recognize the similarity between sampling from a model posterior and bootstrap sampling and note the differences.
1. Understand the benefits and drawbacks of incorporating strong and weak prior information into a model.
1. Learn to work with the following `pyMC` tools: the `observed` kwarg, `sample_posterior_predictive`, and `sample_prior_predictive`.

## Introduction

In this lab, we'll look at a variety of different models of the same data.
The data will be Gaussian, or Normal, data, with varying numbers of samples.

The first model will be a bootstrapping-based model, of the type used in data8.
The remaining models will be `pyMC` models.
They are defined below, using the function `define_normal_model`.
We will compare and contrast the bootstrapping-based model to the `pyMC` models,
in terms of their prior distributions, their posterior distributions, and the inferences drawn from them.

## Generating Data

First, we generate the data.
The true mean and standard deviation of the data are set by the variables `true_mu` and `true_sigma`.
The resulting sample will almost surely have a different mean and standard deviation.

In a real scenario, we wouldn't know the values of `true_mu` and `true_sigma`.
We make models to try to represent and quantify our uncertainty about what those values might be.
Working with a case where we know what the real answer is, even if it's a bit artificial, helps us gain a better understanding of our models.
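
To make the gap between the true parameters and the sample statistics concrete, here is a minimal sketch that draws a small Normal sample with NumPy directly. It is not the lab's `util.generate_normal_data`, whose implementation isn't shown here; the generator, the seed, and the `sketch_` names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-in for the lab's data generator (assumption: plain NumPy draws).
sketch_rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

sketch_true_mu, sketch_true_sigma, sketch_N = 2., 1., 10
sketch_sample = sketch_rng.normal(sketch_true_mu, sketch_true_sigma, size=sketch_N)

# The sample statistics land near, but not exactly on, the true values.
print("sample mean:", sketch_sample.mean())
print("sample sd:  ", sketch_sample.std(ddof=1))
```

With only 10 datapoints, the sample mean can easily sit a few tenths away from the true mean, which is exactly the uncertainty the models below try to quantify.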

In [None]:
true_mu = 2.
true_sigma = 1.

N = 10

In [None]:
df = util.generate_normal_data(true_mu, true_sigma, N)

As always, before we start modeling data, even with bootstrapping,
we should look at it, first with summary statistics and then with a plot.

In [None]:
df.mean()[0], df.std()[0]

In [None]:
sns.boxplot(df, width=0.25); sns.rugplot(df);

## Bootstrap Sampling

As we've done previously, we can use bootstrap sampling to estimate our uncertainty about the values of parameters.

In the cells below, run bootstrapping on the data to estimate the uncertainty in the value of $\mu$
by producing a histogram of the bootstrapped values.
Don't be afraid to reuse code from previous labs/homeworks.

The first cell produces a list of length `num_boots` of pandas `Series`, each of which is a bootstrap sample.

In [None]:
num_boots = 1000

boots = [df[0].sample(frac=1, replace=True) for _ in range(num_boots)]
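
As a reminder of the general recipe, here is a minimal sketch of bootstrapping the mean using plain NumPy arrays rather than pandas `Series`. The toy data, the seed, and the helper name `bootstrap_means` are hypothetical and are not part of the lab's `util` module.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch

# Hypothetical toy data standing in for the lab's observed column.
toy_data = rng.normal(2., 1., size=10)

def bootstrap_means(data, num_boots, rng):
    """Resample `data` with replacement `num_boots` times; return each resample's mean."""
    n = len(data)
    # Each row of `idx` indexes one bootstrap sample of the same size as the data.
    idx = rng.integers(0, n, size=(num_boots, n))
    return data[idx].mean(axis=1)

toy_boot_means = bootstrap_means(toy_data, num_boots=1000, rng=rng)

# The spread of the bootstrapped means estimates our uncertainty about mu.
print(toy_boot_means.mean(), toy_boot_means.std())
```

With the `boots` list from the cell above, the analogous step would be to compute the mean of each `Series` and plot a histogram of the results.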

## Defining Normal Models

Our normal models will all have a graph as below:

In [None]:
util.make_normal_model_graph()

That is, the variable x, which we observe $N$ times,
is parameterized by a variable we don't observe: $\mu$.

This variable sets the mean of our data.

The function `define_normal_model` creates a model of this kind,
where $\mu$ is a Normal random variable whose mean and standard deviation are set by
the first and second arguments, `mu_prior_mean` and `mu_prior_sd`.

The random variable $\mu$ represents our state of belief about what the mean of our data might be,
before we have observed any data.
On average, this variable will have the value `mu_prior_mean` and samples will be spread around that value
with standard deviation `mu_prior_sd`.
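
To see what such a prior looks like in practice, the sketch below draws samples of $\mu$ from a Normal prior using NumPy as a stand-in for `pyMC`. The function name `sample_normal_prior`, the seed, and the example values are assumptions chosen for illustration.

```python
import numpy as np

def sample_normal_prior(prior_mean, prior_sd, num_samples, seed=0):
    """Draw samples of mu from a Normal(prior_mean, prior_sd) prior (NumPy stand-in)."""
    rng = np.random.default_rng(seed)
    return rng.normal(prior_mean, prior_sd, size=num_samples)

# Example values chosen for illustration only.
prior_draws = sample_normal_prior(prior_mean=0., prior_sd=0.25, num_samples=10000)

print("mean of draws:", prior_draws.mean())  # close to prior_mean
print("sd of draws:  ", prior_draws.std())   # close to prior_sd
# Fraction of draws within three standard deviations (0.75) of the prior mean.
print("within 3 sd:  ", np.mean(np.abs(prior_draws) < 0.75))
```

Nearly all of the draws fall within three standard deviations of the prior mean, which is one way to read a Normal prior as a statement about where $\mu$ probably lies.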

In [None]:
def define_normal_model(mu_prior_mean, mu_prior_sd, observed_data):
    with pm.Model() as normal_model:
        mu = pm.Normal("mu", mu=mu_prior_mean, sd=mu_prior_sd)
        
        pm.Normal("xs", mu=mu, sd=true_sigma, observed=observed_data)
    return normal_model

We define three models: `agnostic_model`, `right_model`, and `wrong_model`.

Agnostic means, loosely, ["knowing nothing"](https://en.wiktionary.org/wiki/agnostic).
This model makes much weaker assumptions about what the value of $\mu$ probably is.
This is the model we have when we know little about our data, aside from the form of the model.

The `right_model`, on the other hand, makes a much stronger assumption on what the value of $\mu$ is,
and it has the right guess.
This is the kind of model we have when we are well-informed about our data.

The `wrong_model` makes just as strong an assumption as the right model does,
but its assumption is wrong. It expresses the belief that the mean of the data is within ±0.75 of 0, but it is not.
This is the kind of model we have when we are mis-informed about our data.

In [None]:
agnostic_model = define_normal_model(0, 10, df)

wrong_model = define_normal_model(0, 0.25, df)

right_model = define_normal_model(true_mu, 0.25, df)

## Sample from Prior

Let's look at our priors on $\mu$ for each model. Remember that the prior is what we believe about our data before looking at the particular values we observed.

Draw samples of `mu` from the prior using `sample_prior_predictive` for each of the three `pyMC` models.
The cell below shows how to do this for the `right_model`.

Plot a histogram for the samples from each model and answer the questions below.

Why is it fair to say that the agnostic model makes weaker assumptions about the value of $\mu$ than do the other models?
What parameter choice, in the definition of the model, is responsible for this?
Above, it was claimed that the wrong model encodes the belief that $\mu$ is within about ±0.75 of 0.
How would you determine that from the histogram?
In these terms, what belief does the right model represent?
Why don't we plot a prior for our bootstrapping-based model?

In [None]:
num_prior_samples = 10000

right_prior = pm.sample_prior_predictive(samples=num_prior_samples,
                                         model=right_model)

## Sampling from the Posteriors of the Parameters

The posterior represents what we believe about our data after we've looked at the values we've observed.

First, let's look at what each of our models believes about the true mean of the data
after having observed the values.
Plot histograms of the samples of $\mu$ from each of our models
(the three `pyMC` models and the bootstrapping model)
and then answer the questions below.

Remember that, for unobserved variables in `pyMC` models, we sample from the posterior by running `pm.sample`.
Make sure to keep the `trace` around, since we'll need that to look at our posterior over future values of the data.
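
Because these models combine a Normal prior on $\mu$ with a Normal likelihood whose standard deviation is fixed, the posterior over $\mu$ also has a closed form. This is a standard conjugacy result rather than something the lab's code computes. The sketch below, with hypothetical names and made-up data, can serve as a sanity check on the histogram of a sampled trace.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, data, sigma):
    """Closed-form posterior over mu for a Normal prior and Normal likelihood with known sigma."""
    data = np.asarray(data, dtype=float)
    prior_precision = 1. / prior_sd ** 2
    data_precision = len(data) / sigma ** 2
    post_precision = prior_precision + data_precision
    # Posterior mean is a precision-weighted average of the prior mean and the sample mean.
    post_mean = (prior_precision * prior_mean + data_precision * data.mean()) / post_precision
    return post_mean, np.sqrt(1. / post_precision)

toy_data = np.array([1.2, 2.9, 2.1, 1.7, 2.4])  # made-up values for illustration
print(normal_posterior(0., 10., toy_data, sigma=1.))   # weak prior
print(normal_posterior(0., 0.25, toy_data, sigma=1.))  # strong prior centered at 0
```

The function returns the posterior mean and posterior standard deviation of $\mu$, so the two printed pairs show how much a strong prior can pull the posterior away from the sample mean.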

For the unobserved variables in the bootstrapping model, we need to calculate their values on all of the bootstrapped samples.


For the `pyMC` models, how have the beliefs about $\mu$ changed?
Do they differ more for some of the models than for others? If so, which one(s)?
Can you give an intuitive explanation, in your own words, for why each model's beliefs changed the way they did?
Directly compare the posterior for the agnostic model to that for the bootstrapping model.
If they are different, why are they different? If they are similar, why are they similar?

# Model Comparison

## Sampling from the Posterior of the Observed Variable

In this example, we know the correct value of $\mu$, since we're simulating the entire process.
In a real experiment, we wouldn't know the true value, so we wouldn't know which model to trust.
That is, in our simulation, we know which model is the right model and which model is the wrong model,
but we wouldn't know that in most real life situations.

So how do we determine which model to believe?

The first thing to check is whether the model's posterior over the observed variable matches
the data we observed. If it doesn't, then the model doesn't consider the data we observed very likely.
And while it's possible that the data we drew was unrepresentative, the larger a sample, the less likely that is,
and so the more evidence we have that the model is incorrect.

Because it can be used to predict future values of the observed variable, this distribution is known as the "posterior predictive" distribution.

The quantitative version of this approach to model comparison is called _maximum likelihood modeling_,
and it's especially common in cases where an uninformative prior is chosen.
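
Conceptually, a draw from the posterior predictive distribution can be generated in two steps: take a posterior sample of $\mu$, then simulate values of x from the likelihood using that $\mu$. The sketch below does this by hand with NumPy. It illustrates the idea only and makes no claim about how `pyMC` implements it; the function name and the made-up posterior draws are hypothetical.

```python
import numpy as np

def sample_posterior_predictive_by_hand(mu_samples, sigma, num_obs, seed=0):
    """For each posterior draw of mu, simulate one dataset of `num_obs` values of x."""
    rng = np.random.default_rng(seed)
    mu_samples = np.asarray(mu_samples, dtype=float)
    # Broadcasting: row i holds a dataset generated using mu_samples[i] as the mean.
    return rng.normal(mu_samples[:, None], sigma, size=(len(mu_samples), num_obs))

# Made-up posterior draws of mu, standing in for values taken from a trace.
fake_mu_samples = np.random.default_rng(1).normal(2., 0.3, size=1000)
fake_postpred = sample_posterior_predictive_by_hand(fake_mu_samples, sigma=1., num_obs=10)
print(fake_postpred.shape)
```

The spread of the simulated values combines two sources of uncertainty: the noise in the data (`sigma`) and the remaining uncertainty about $\mu$.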

The code in the cell below will generate samples from the posterior over the observed variable x for the right model using `sample_posterior_predictive`. Note that it requires a copy of a `trace` from the same model.

The cells beneath that will compare the posterior predictive to the observed data using a histogram and rugplot.

Use `sample_posterior_predictive` to draw samples according to the posteriors of the wrong and agnostic models.
Then, add the results to the `postpreds` list, as in the example code below (order is important!),
and then run the cell containing `compare_multiple_postpreds_to_observed`
to plot the comparisons.
Then answer the questions below.

Example Code:
```python
postpreds = [right_postpred["xs"], agnostic_postpred["xs"], wrong_postpred["xs"]]
```

How well does the posterior predictive distribution match the data for each model?
Which model is doing worst according to this criterion?
How do its predictions differ from the observed data? Why might this be?

In [None]:
num_posterior_samples = 1000

right_postpred = pm.sample_posterior_predictive(
    right_trace, samples=num_posterior_samples, model=right_model)

In [None]:
postpreds = [right_postpred["xs"]]

In [None]:
util.compare_multiple_postpreds_to_observed(
    postpreds, df, titles=["Right Model", "Agnostic Model", "Wrong Model"])

This method of picking models by how well their samples match the data has its flaws.
For example, the bootstrap always comes out looking better than any other model possibly can.
Let's see why this is.

The observed values for 100 samples from the posterior predictive distribution of each model will be plotted below.

Note that "bootstrapping" is the same as "sampling from the posterior predictive" for a bootstrap-based model.
Sample values will appear as black tick marks, with the observed values overlaid in blue.

In [None]:
boots_np = np.asarray([np.asarray(boot) for boot in boots])

util.elementwise_compare_multiple_postpreds_to_observed(
    [boots_np] + postpreds, df,
    titles=["Bootstraps", "Right Model", "Agnostic Model", "Wrong Model"])

Explain, in your own words, why model comparison on the basis of posterior predictive similarity to the data would always pick the bootstrap model.

## Model Comparison by Surprise

In order to quantitatively characterize which models are best supported by the data,
we need to consider not just how well the models predict the data,
but also how well the prior characterizes the data.

The full details of how this is done are outside of the scope of this class, since
1) comparing models via sampling is an area of active development in `pyMC` and in research
2) existing methods tend to use ideas outside of the core concepts of this class.

The curious can look up [Bayes factors](https://en.wikipedia.org/wiki/Bayes_factor).

But in short, we can perform a form of model comparison by computing a statistic called
the [surprise](https://charlesfrye.github.io/stats/2016/03/29/info-theory-surprise-entropy.html),
or the negative log-likelihood, on samples from the prior.
A model with lower surprise, averaged over samples from the prior, is better.
This method is known as marginal likelihood comparison.

In [None]:
def calculate_surprise(model, prior):
    model_logp = model.observed_RVs[0].logp
    surprises = [-model_logp(mu=mu) for mu in prior["mu"]]
    return np.mean(surprises)
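
For intuition about what this statistic measures, the sketch below computes the same kind of quantity without `pyMC`: the negative log-likelihood of the data under a Normal model with known standard deviation, averaged over prior draws of $\mu$. The function names, toy data, and prior draws are hypothetical.

```python
import numpy as np

def normal_neg_log_likelihood(data, mu, sigma):
    """Negative log-likelihood of `data` under independent Normal(mu, sigma) observations."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum((data - mu) ** 2) / (2 * sigma ** 2)

def average_surprise(data, mu_prior_samples, sigma):
    """Mean negative log-likelihood of the data, averaged over prior draws of mu."""
    return np.mean([normal_neg_log_likelihood(data, mu, sigma) for mu in mu_prior_samples])

toy_data = np.array([1.2, 2.9, 2.1, 1.7, 2.4])  # made-up values for illustration
rng = np.random.default_rng(0)
near = average_surprise(toy_data, rng.normal(2., 0.25, size=2000), sigma=1.)  # prior near the data
far = average_surprise(toy_data, rng.normal(0., 0.25, size=2000), sigma=1.)   # prior far from the data
print(near, far)
```

A prior that puts its draws of $\mu$ far from the data produces a larger average surprise than one that puts them close to it.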

In [None]:
calculate_surprise(right_model, right_prior)

In [None]:
calculate_surprise(agnostic_model, agnostic_prior)

In [None]:
calculate_surprise(wrong_model, wrong_prior)

The value of the surprise for the bootstrap model is independent of the data.
It's just the number of datapoints times the log of the number of datapoints.
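
A short derivation of that value, assuming the bootstrap model is read as the empirical distribution that assigns probability $1/N$ to each of the $N$ observed datapoints:

$$
-\log \prod_{i=1}^{N} \frac{1}{N} = -N \log \frac{1}{N} = N \log N
$$

For $N = 10$ and the natural logarithm, this gives $10 \log 10 \approx 23.03$.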

In [None]:
calculate_surprise(right_model, right_prior)

In [None]:
# Surprise (negative log-probability) of the data under the bootstrap model.
bootstrap_surprise = N * np.log(N)

bootstrap_surprise

According to this "surprise" criterion, which model is best?
How does the agnostic model perform? Why? _Hint: what do samples from the agnostic prior look like?_

## What happens with more data?

As the amount of data increases, the effect of the prior decreases, and the posteriors of all models become about the same.
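
One way to quantify this, assuming the standard closed-form result for a Normal prior with a Normal likelihood of known standard deviation, is to compute how much weight the prior mean receives in the posterior mean. The function name `prior_weight` is hypothetical.

```python
def prior_weight(prior_sd, sigma, n):
    """Share of the posterior mean contributed by the prior mean (known-sigma Normal model)."""
    prior_precision = 1. / prior_sd ** 2
    data_precision = n / sigma ** 2
    return prior_precision / (prior_precision + data_precision)

# Strong prior (sd 0.25) and weak prior (sd 10), with sigma fixed at 1.
for n in (10, 500):
    print(n, prior_weight(0.25, 1., n), prior_weight(10., 1., n))
```

The weight on the prior shrinks as `n` grows, and it shrinks from a much higher starting point when the prior is strong.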

Return to the top of the lab and set the variable `N` to `500` instead of `10`.
Re-run the entire lab and answer the questions below.
All of the code should run correctly with this change.
If you're working with a partner, have one person change the value of `N` and the other keep it the same,
as the questions involve comparing the results across values of `N`.

Do the priors change? Why or why not?
What's different about the posteriors over $\mu$? How about the posterior predictive distributions over x?
Keep an eye on the scale of the x-axis.
What happens to the surprises? Does the ranking of the models by surprise change or not?