<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Parameters, Priors, and Posteriors 02

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import random

import daft
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

In [None]:
import utils.daft
import utils.plot as plot

## What if we don't know the value of the parameters?

Often, our models are incomplete: the parameters of some random variables are, themselves, random variables.

### For example, we considered a model of the process of measurement

In [None]:
utils.daft.make_bivariate_graph("S", "M", observed=False);

We looked at the joint distribution of both the signal and the measurement.

In [None]:
with pm.Model() as measurement_model:
    signal = pm.Normal("signal", mu=0, sd=1)
    measurement = pm.Normal("measurement", mu=signal, sd=1)
    
measurement_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    measurement_model, draws=10000, progressbar=True))

In [None]:
print(measurement_samples.head())

## Once we have observed the measurement, we want to know likely values of the signal

In [None]:
utils.daft.make_bivariate_graph("S", "M", observed=True);

The signal is what we _really_ want to know --
the sound of someone's voice speaking,
an episode of _Game of Thrones_,
the location of another car on the highway.

But the measurement --
the output of a radio,
the pixel values on our TV screen,
the voltages in our LIDAR detector --
is what we _actually get_, and so we must **work backwards**.

$$
p(S \lvert M = m)
$$

### According to one view, this is also the problem of perception

_Working backwards_
is precisely the problem faced by the mind when it tries to turn sensation into perception.

For example, the eyes detect the fluctuations of electromagnetic fields (a measurement),
but they are intended to do things like identify the presence or absence of predators (a signal).

Hence the idea that the brain solves this problem using the tools of probability and statistics
that we learn in this class, which dates back
to [the 19th century](https://en.wikipedia.org/wiki/Unconscious_inference).

Normally, this proceeds automatically, and we don't notice,
but sometimes we can catch it in action:

In [None]:
Image("img/necker_cube.png", width=320)

By BenFrantzDale - Own work, [CC BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/), [Link](https://commons.wikimedia.org/w/index.php?curid=2040007)

Having observed this image, a two-dimensional pattern of dark and bright pixels,
the mind seeks to determine what causes or factors in the world
might have given rise to it.

Two 3-dimensional objects could have given rise to this image:
a cube whose front face points down and to the left or up and to the right.
See [Wikipedia](https://en.wikipedia.org/wiki/Necker_cube) for details.

In [None]:
utils.daft.make_bivariate_graph("Cube", "Image", scale=3.7, observed=True);

This idea is sometimes called the
[Bayesian model of perception](https://www.sciencedirect.com/science/article/pii/S0022249615000061).
An even more recent view has it that the brain performs
[MCMC sampling](http://www.cnbc.cmu.edu/~tai/papers/lee_mumford_josa.pdf),
just like pyMC!
One piece of evidence is the fact that
your perceptions switch back and forth,
[as though your mind were "drawing samples"](https://link.springer.com/article/10.1186/1471-2202-12-S1-P320).
Bayesian inference with MCMC has also been considered as a model for
[Charles Bonnet Syndrome](https://papers.nips.cc/paper/4097-hallucinations-in-charles-bonnet-syndrome-induced-by-homeostasis-a-deep-boltzmann-machine-model),
where individuals who go blind late in life have complex visual hallucinations.

Of course, the brain has to deal with models much more complicated
than just two one-dimensional normal distributions or the cube example above.

## Once we have observed the measurement, we want to know likely values of the signal

In [None]:
utils.daft.make_bivariate_graph("S", "M", observed=True);

Note that this is different from the case of observed or known values of the _parameters_.

In [None]:
utils.daft.make_parameters_graph(observed=True);

Here, we could just set the values of the parameters to be equal to the known or observed values:
```python
X = pm.Foo("X", beta_0=beta_0_known, beta_1=beta_1_known, beta_2=beta_2_known)
```

This strategy won't work for the model of signals and measurements,
because the measurement is not a parameter for the signal.

Remember that we typically draw our arrows coming _from_ a variable that is used to determine
the values of another variable.
These represent the "natural order" for thinking about our model.

Often, this comes from a loosely mechanistic model of our data:
here, that imprecise mechanism is the argument from the Central Limit Theorem.

### Incorporating observations into models is done by the `observed` keyword argument.

Presume we have taken a measurement and observed the value `1`.

In [None]:
observations = 1
# N = 10  # uncomment for multiple observations
# observations = pm.Normal.dist(mu=1, sd=1).random(size=N)  # uncomment for multiple observations

with pm.Model() as measurement_observed:
    signal = pm.Normal("signal", mu=0, sd=1)
    measurement = pm.Normal("measurement", mu=signal, sd=1, observed=observations)

If we add that information to our model
by passing `observed=1` to the `measurement` variable,
then calling `pm.sample` will draw from the conditional distribution
$p(S \lvert M = 1)$ instead of the marginal distribution $p(S)$.

Put another way:
the original model specified our uncertainty about the values of the signal and the measurement,
and samples from it were used to numerically and visually represent that uncertainty.

Once we've observed the value of one of the random variables,
the state of our knowledge changes,
and so we'd like the samples to change to reflect that.

pyMC does this for us!

More generally, if we have observed $K$ random variables in our model,
`pm.sample` will produce values from the conditional distribution of the $J$ remaining variables:

$$
p(\beta_1, \beta_2, \dots, \beta_J \lvert X_1=x_1, X_2=x_2, \dots, X_K=x_K)
$$

In [None]:
measurement_observed_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    measurement_observed, draws=500, chains=4))

In [None]:
print(measurement_observed_samples.head())

Note that the `measurement` doesn't show up among our samples!

In [None]:
f, ax = plt.subplots(figsize=(8, 4))
sns.distplot(measurement_samples["signal"], ax=ax, label="before obs")  # notice: from model _without_ observed
sns.distplot(measurement_observed_samples["signal"], ax=ax, label="after obs", color="C2");
ax.set_xlim([-4, 4]); plt.legend();

Change `observations` to something else, like `5` or `-2`, and then re-run the model definition, sampling, and plotting code.

You'll see the `after obs` distribution move around:
for values larger than `1`, it will move further to the right;
for values less than `1`, it will move to the left.
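
For this particular model, the posterior can also be written down in closed form, which gives a check on where the samples should land. What follows is a minimal sketch, assuming the standard conjugate result for a Normal prior combined with a Normal likelihood of known spread; `normal_posterior` is a hypothetical helper, not part of pyMC or `shared_util`.

```python
def normal_posterior(observations, prior_mu=0.0, prior_sd=1.0, noise_sd=1.0):
    """Posterior mean and sd of the signal for a Normal prior and Normal likelihood."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(observations) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mu * prior_precision + sum(observations) / noise_sd ** 2) / post_precision
    return post_mean, post_precision ** -0.5


# the same values of `observations` suggested above
for m in [1, 5, -2]:
    mean, sd = normal_posterior([m])
    print(f"observed {m:>2}: posterior mean {mean:+.2f}, posterior sd {sd:.3f}")
```

With a single observation and equal spreads, the posterior mean sits halfway between the prior mean and the observation, so the `after obs` histogram should be centered near half of whatever value you chose.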

This example is simple to the point of perhaps being uninteresting.
But if you apply this principle to images,
and treat the pixels as dependent, Normally-distributed signals,
you are en route to discovering
[JPEG](http://nautil.us/blog/the-math-trick-behind-mp3s-jpegs-and-homer-simpsons-face).

In Latin, "from before" and "from after" are _a priori_ and _a posteriori_,
and so these distributions are typically called the **prior** and the **posterior** on $S$.

In [None]:
f, ax = plt.subplots(figsize=(8, 4))
sns.distplot(measurement_samples["signal"], ax=ax, label=r"Prior: $p(S)$")  # notice: from model _without_ observed
sns.distplot(measurement_observed_samples["signal"], ax=ax, label="""Posterior:\n$p(S \\vert M=m)$""", color="C2");
ax.set_xlim([-4, 4]); plt.legend();

## This is the _killer feature_ of pyMC and other MCMC libraries.

You write down a _forwards model_
that

1. describes <font color="#003262"> **uncertainty about unknown quantities** </font> like parameters,

This is the <font color="#003262">**prior**</font>.

and then

2. explains what <font color="#FDB515">**the distribution of the data _would be_**</font>,
if you did know those unknown quantities.

This is the <font color="#FDB515">**likelihood**</font>.

pyMC can then "work backwards" and tell you

3. what <font color="#2ca02c"> **your new beliefs
about the unknown quantities are**</font>,
once you've seen that data.

This is the <font color="#2ca02c">**posterior**</font>.

These new beliefs are expressed as _samples_, drawn by `pm.sample`.

### Sampling from Models with `observed` variables

```python
with pm.Model() as model:
    # random variable to model uncertainty about the truth before seeing data: the _prior_
    parameter_var = pm.SomeRandomVariable(
        "prior_variable", parameter=value)  
    # random variable to model uncertainty in data, given parameters: the _likelihood_
    observed_var = pm.OtherRandomVariable(
        "data_variable", parameter=parameter_var,
        observed=data_we_saw)
    
# samples to estimate uncertainty about the truth after seeing data: the _posterior_
samples = shared_util.sample_from(model)
```

### In contrast, frequentist model fitting:

One only writes down a forwards model for going from parameters to data:
there is only the <font color="#FDB515">**likelihood**</font>.

There is no mention of uncertainty in parameters.
After you observe data, you search for parameters that make the observed data most likely.

Your answer is then a single number for each parameter,
rather than a collection of samples or,
more generally, a distribution.

This is called _maximum likelihood estimation_.

In the previous lecture, when I picked the "best" binomial fits for the conditional distributions of $S$,
I used maximum likelihood methods.
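
As a rough illustration of what "searching for the parameters that make the data most likely" means, here is a sketch for a binomial model. The data (`k` successes in `n` attempts) are made up, and the brute-force grid search is only for illustration: real libraries use smarter optimizers or closed-form answers.

```python
import math


def binomial_log_likelihood(p, k, n):
    """Log-probability of k successes in n attempts, if the success probability is p."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


k, n = 7, 20  # made-up data: 7 successes in 20 attempts
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p, excluding 0 and 1
p_hat = max(grid, key=lambda p: binomial_log_likelihood(p, k, n))
print(p_hat)  # a single number: the grid point closest to k / n
```

The answer is one value of `p`, with no statement of how uncertain we are about it.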

## The _working backwards_ is done via Bayes' Rule

In [None]:
utils.daft.make_bivariate_graph(r"$\beta$", "X", observed=True);

Consider the above model, where

$$
X \sim \text{Foo}(\beta) \\
\beta \sim \text{Bar}(\lambda)
$$

meaning that $\beta$ is the parameter for the distribution of $X$
and we know the distribution $p(\beta)$, which has fixed parameter $\lambda$.

$$
p(X, \beta) = \color{green}{p(\beta\lvert X)} * p(X) = \color{darkgoldenrod}{p(X\lvert\beta)} * \color{darkblue}{p(\beta)} $$

The two expressions in the center and right are the two
equivalent expressions for the joint distribution of $X, \beta$
(the leftmost expression).

The colored components are known as:

<p style="text-align: center;">
the $\color{green}{\text{posterior on } \beta}$,
the $\color{darkgoldenrod}{\text{likelihood}}$,
and the $\color{darkblue}{\text{prior on } \beta}$,
</p>

The likelihood and the prior are part of our model:
they are $\text{Foo}$ and $\text{Bar}$.

But the posterior is what we really want:
it tells us what we believe about $\beta$,
once we know something about $X$.

So we rearrange to get the posterior alone:

$$\color{green}{p(\beta\lvert X)} = \color{darkgoldenrod}{p(X\lvert\beta)} * \color{darkblue}{p(\beta)}\ /\ p(X) $$

This equation is known as
[Bayes' Rule](https://charlesfrye.github.io/stats/2016/02/04/bayes-rule.html).

And it looks like the value $p(X)$ is the last piece we need:
it's not specified in our model, but it's showing up in this equation.
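
When $\beta$ can only take a small number of values, $p(X)$ can be computed by brute force: add up likelihood times prior over every possible $\beta$. The sketch below does this for a hypothetical example that is not part of this lecture's models: $\beta$ is the bias of a coin, restricted to a grid of 101 values, and $X$ is the number of heads in ten flips.

```python
import math

# hypothetical example: beta is the bias of a coin, X is the number of heads in 10 flips
betas = [i / 100 for i in range(101)]  # grid of candidate values for beta
prior = [1 / len(betas)] * len(betas)  # p(beta): uniform over the grid
heads, flips = 8, 10                   # made-up observation


def likelihood(beta):
    """p(X = heads | beta) for a Binomial likelihood."""
    return math.comb(flips, heads) * beta ** heads * (1 - beta) ** (flips - heads)


joint = [likelihood(b) * p for b, p in zip(betas, prior)]  # p(X | beta) * p(beta)
p_X = sum(joint)                                           # p(X): sum over every beta
posterior = [j / p_X for j in joint]                       # Bayes' Rule

best = betas[posterior.index(max(posterior))]
print(f"p(X) = {p_X:.4f}, most probable beta = {best}")
```

The sum over `betas` is cheap here because there is one parameter and 101 grid points. With many parameters, the number of terms grows exponentially.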

In statistical physics, it is known as the
[partition function](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics));
in mathematical statistics, it is a logarithm away from the
[cumulant-generating function](https://en.wikipedia.org/wiki/Cumulant).
In these disciplines,
it is a holy grail to be quested after.
Computing it is, for large problems,
insanely difficult, except in special cases.
This problem is "definitely hard" in the sense that,
if it were generically possible to solve easily,
many other hard problems would turn out to be easy as well.

Side note: if you've heard of NP problems
([simple explanation](https://simple.wikipedia.org/wiki/P_versus_NP),
[less simple explanation](https://stackoverflow.com/questions/1857244/what-are-the-differences-between-np-np-complete-and-np-hard)),
perhaps from the show
[_Numb3rs_](https://en.wikipedia.org/wiki/Uncertainty_Principle_(Numbers)#Writing),
know that this comes from a class of harder problems,
[#P](https://en.wikipedia.org/wiki/%E2%99%AFP).

And so it is something like
the greatest book that could ever be written in 410 pages,
lost inside an almost-infinite library,
in a short story by Jorge Luis Borges,
[_La Biblioteca de Babel_](https://en.wikipedia.org/wiki/The_Library_of_Babel):
we know it exists, is one of a finite number of objects,
and has all the answers we seek,
but we have no hope of finding it.

One of the reasons for the success of MCMC methods,
like those used in pyMC,
is that they can avoid the requirement to know, compute, or estimate $p(X)$.

$$\color{green}{p(\beta\lvert X)} \propto \color{darkgoldenrod}{p(X\lvert\beta)} * \color{darkblue}{p(\beta)} $$

We won't go into detail here about how this is achieved,
but the "price" is that though we had mathematical forms for the distributions
$p(X\vert\beta)$ and $p(\beta)$,
we don't have a mathematical form for the distribution
$p(\beta\vert X)$.
Instead, we only have samples from that distribution,
which we can use to estimate or approximate it.
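
To give a feel for how samples can be drawn without ever computing $p(X)$, here is a sketch of the Metropolis algorithm, one of the simplest MCMC methods, applied to the signal and measurement model with $m = 1$. This is a stand-in chosen for brevity, not what pyMC runs by default (for continuous variables it uses a more sophisticated sampler, NUTS). The step size and number of draws are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)
m = 1.0  # the observed measurement


def log_unnormalized_posterior(s):
    """log p(m | s) + log p(s), dropping constants: p(X) never appears."""
    return -0.5 * s ** 2 - 0.5 * (m - s) ** 2


samples, s = [], 0.0
for _ in range(20000):
    proposal = s + random.gauss(0, 1)  # propose a nearby value
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(s)
    if random.random() < math.exp(min(0.0, log_ratio)):  # accept with probability min(1, ratio)
        s = proposal
    samples.append(s)

print(statistics.mean(samples), statistics.variance(samples))
```

Only the ratio of posterior values at two points is ever needed, and any constant factor, including $1/p(X)$, cancels in that ratio. For this model the exact posterior is Normal with mean and variance both equal to `0.5`, so the printed values should be close to those.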

That is why this class
emphasizes the view of distributions as "what our samples approximate"
rather than the view of distributions as "mathematical formulas that tell us probabilities".

We will always be able to draw samples,
but we will not always have formulas.

There are many reasons why frequentist modeling won out for the first century or so of statistical modeling.
One is that, for a long time, calculating $p(X)$ was a major bottleneck for all except certain special models, just as computing the form of the sampling distribution was a bottleneck for frequentist statistics.

The advent of powerful computers and easy-to-use MCMC libraries changed that,
and so is changing the practice of statistics.

But another reason was the intense hostility of one of the founding fathers of statistics,
Ronald A. Fisher, to what was then called _inverse probability_:

> This is not the place to enter into the subtleties of a prolonged controversy;
it will be sufficient in this general outline of the scope of Statistical Science
to reaffirm my personal conviction, [which I have sustained elsewhere](https://projecteuclid.org/download/pdf_1/euclid.ba/1340370565),
that **the theory of inverse probability is founded upon an error,
and must be wholly rejected**.

- R. A. Fisher, _Statistical Methods for Research Workers_, 1925.

This "error" was, essentially, to define probability subjectively, in terms of beliefs,
instead of objectively, in terms of frequencies and populations.

## When we want to sample from other distributions, we use other pyMC methods:

```python
with pm.Model() as model:
    # random variable to model uncertainty about the truth before seeing data: the _prior_
    parameter_var = pm.SomeRandomVariable("prior_variable", parameter=value)  
    # random variable to model uncertainty in data, given parameters: the _likelihood_
    observed_var = pm.OtherRandomVariable("data_variable", parameter=parameter_var, observed=data_we_saw)
    
# samples to estimate uncertainty before seeing data
pripred_samples = pm.sample_prior_predictive(model=model)

with model:
    # samples to estimate uncertainty about the parameters after seeing data: the _posterior_
    post_samples_trace = pm.sample()

# samples to estimate uncertainty in what future data we might see:
postpred_samples = pm.sample_posterior_predictive(post_samples_trace, model=model)
```

## Priors

The distribution we place on the parameter is called a _prior_: it represents our beliefs before, or prior to, observing any data.

The random variables representing our parameters will themselves have parameters.
It is common to call these _hyperparameters_, when it is important to be clear.

If I am uncertain about the parameters of my parameters,
I might replace them with random variables:

```python
with pm.Model() as deep_model:
    # random variable to model uncertainty about the hyperparameter
    hyperparameter_var = pm.SomeOtherRandomVariable("hyperprior_variable", parameter=value)
    # random variable to model uncertainty about the parameter
    # before seeing data, given hyperparameter: the _prior_
    parameter_var = pm.SomeRandomVariable("prior_variable", parameter=hyperparameter_var)  
    # random variable to model uncertainty in data, given parameters: the _likelihood_
    observed_var = pm.OtherRandomVariable("data_variable", parameter=parameter_var, observed=data_we_saw)
```

Somewhat less commonly, the distribution of this random variable is called a _hyperprior_.
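
To make the three layers concrete, here is a sketch of drawing forward through such a model by hand. The specific distributions (an Exponential hyperprior on the spread, Normal prior, Normal likelihood) are arbitrary choices for illustration, and `sample_deep_model` is a hypothetical helper.

```python
import random

random.seed(0)


def sample_deep_model():
    """One forward draw: hyperprior -> prior -> likelihood."""
    spread = random.expovariate(1.0)       # hyperprior: uncertainty about the prior's spread
    parameter = random.gauss(0.0, spread)  # prior: uncertainty about the parameter
    datum = random.gauss(parameter, 1.0)   # likelihood: uncertainty about the data
    return spread, parameter, datum


draws = [sample_deep_model() for _ in range(5)]
for spread, parameter, datum in draws:
    print(f"spread={spread:.2f}  parameter={parameter:+.2f}  datum={datum:+.2f}")
```

Each layer's draw becomes a parameter of the layer below it, which is exactly what the arrows in the graph express.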

Note the possibility of infinite regress:
the hyperprior will have parameters,
about which we might be uncertain,
and so use random variables,
which themselves will have parameters,
about which we might be uncertain,
and so on.

At a certain point,
you must stop and say that there is at least one distribution you're willing to assume, _a priori_.
Compare this to the
[Münchhausen trilemma](https://en.wikipedia.org/wiki/M%C3%BCnchhausen_trilemma),
the famous problem in epistemology.

### Priors are broader than just the distributions of parameters

For example, every time we called `pm.sample` on a model with no `observed` variables,
the samples were drawn according to our prior over the variables.

In general: a prior expresses what we know about a quantity we don't know the exact value of.

#### Examples:

- The $2^{2^{100}}$th digit of $\pi$ is equally likely to be odd or even
- I will die at [age 76, ±16 years](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3285408/)

Which of these do you believe?

- A news program announces that "a study has shown" that smoking does not increase risk of lung cancer.

- A news program announces that "a study has shown" that people
[do not read instruction manuals](https://academic.oup.com/iwc/article/28/1/27/2363584).

- A news program announces that "a study has shown" that vaping leads to [lipoid pneumonia](https://www.healthline.com/health/lipoid-pneumonia).

What are some priors you have?

### Priors prevent us from being at the mercy of data

Bootstrapping avoided the need to specify a prior.

With bootstrapping, we could estimate a posterior
if we could frame our inference problem in terms of a statistic.

Frequentist techniques in general avoid priors.

But this flexibility comes at a price:
- for some problems, there is no statistic of the data that supports our inference
- if we just look at likelihoods, we can end up making very silly inferences.

### What we do in the shadows

You awaken in your bed to find your room dark.

There are two explanations for this:

1. You have awoken in the night.

2. The sun has gone out.

In [None]:
utils.daft.make_bivariate_graph("Sun\nGone", "Room\nDark", scale=3.8);

### Specifying the Model

#### <font color="#003262"> Prior </font>

We put 100:1 odds against the sun having gone out.
```python
sun_gone_out = pm.Categorical("sun_gone", p=[100, 1])
```

#### <font color="#FDB515"> Likelihood </font>

If the sun is gone, then the room will be dark:
```python
pm.Categorical("room_dark", p=[0, 1])
```

If the sun is not gone, then the room may be dark or light, with equal probability.
```python
pm.Categorical("room_dark", p=[0.5, 0.5])
```

In [None]:
with pm.Model() as sun_model:
    sun_gone_out = pm.Categorical("sun_gone", p=[100, 1])  # our prior beliefs: sun is probably not gone
    ps = [[0.5, 0.5], [0, 1]]   # likelihood: if sun is gone, it's definitely dark, otherwise 50/50
    is_dark = pm.Categorical("room_dark", p=pm.math.switch(sun_gone_out, ps[1], ps[0]))

In [None]:
sun_samples = shared_util.samples_to_dataframe(shared_util.sample_from(sun_model, draws=10000))
N_sun = len(sun_samples)

In [None]:
f, ax = plt.subplots();
ax.bar([0, 1], sun_samples["room_dark"].value_counts() / N_sun);
ax.set_xticks([0, 1.]); ax.set_xticklabels(["room bright", "room dark"]);

First, our prior for the room being dark:
both observations have roughly equal prior probability.

In [None]:
f, ax = plt.subplots();
ax.bar([0, 1], sun_samples["sun_gone"].value_counts() / N_sun);
ax.set_xticks([0, 1.]); ax.set_xticklabels(["sun still here", "sun gone"]);

And our prior for the sun going out:
the sun is much more likely to be present than absent.

In [None]:
posts_f, ax = plt.subplots(figsize=(8, 8));
plt.bar([0, 1], np.log(sun_samples["sun_gone"].value_counts()), 0.25, label="before observing", lw=4, ec="k");
plt.ylabel("log probabilities"); plt.legend();
ax.set_xticks([0.25, 1.25]); ax.set_xticklabels(["sun still here", "sun gone"]); plt.tight_layout();
ax.set_ylim(0, 25); ax.set_xlim(-0.25, 1.75);

To make it easier to compare the values, we look at the logarithms of the probabilities (up to a constant).

In [None]:
utils.daft.make_bivariate_graph("Sun\nGone", "Room\nDark", observed=True, scale=3.8);

Now, to model how we update our beliefs having observed the brightness level of the room.

First, let's assume that we have seen, one time, that the room is dark.

In [None]:
with pm.Model() as sun_model_observed:
    sun_gone_out = pm.Categorical("sun_gone", p=[100, 1])
    ps = [[0.5, 0.5], [0, 1]]
    is_dark = pm.Categorical("room_dark", p=pm.math.switch(sun_gone_out, ps[1], ps[0]), observed=1)

In [None]:
sun_post_samples = shared_util.samples_to_dataframe(shared_util.sample_from(sun_model_observed, draws=10000))

In [None]:
ax.bar([0.25, 1.25], np.log(sun_post_samples["sun_gone"].value_counts()),
       0.25, label="after observing\ndark", color="C2", lw=4, ec="k");
ax.legend(); plt.tight_layout(); posts_f

In [None]:
utils.daft.make_bivariate_graph("Sun\nGone", "Room\nDark", observed=True, scale=3.8, plate=True);

Next, let's assume that we have observed this same fact $N>1$ times (below, `1000`).

It is common to make $N$ independent observations of the same random variable (or variables),
while the parameters stay fixed.
To prevent the need to draw $N$ copies,
we instead put a square around the node (or nodes) we're repeatedly observing.
This is called a _plate_.
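
Because this model has only two possible states for the sun, the posterior for one or for many dark observations can also be worked out exactly with Bayes' Rule, as a check on the sampled results. This sketch assumes, as the plate does, that the observations are independent given the state of the sun; `posterior_sun_gone` is a hypothetical helper.

```python
from fractions import Fraction

prior = {"sun gone": Fraction(1, 101), "sun still here": Fraction(100, 101)}  # 100:1 odds against
p_dark = {"sun gone": Fraction(1), "sun still here": Fraction(1, 2)}          # p(room dark | sun)


def posterior_sun_gone(n_dark):
    """p(sun gone | room seen dark n_dark times), by Bayes' Rule."""
    joint = {k: prior[k] * p_dark[k] ** n_dark for k in prior}
    return joint["sun gone"] / sum(joint.values())


print(posterior_sun_gone(0))            # 1/101, the prior
print(posterior_sun_gone(1))            # 1/51
print(float(posterior_sun_gone(1000)))  # 1.0 to floating-point precision
```

One dark observation only doubles the odds that the sun is gone, from 1:100 to 1:50. A thousand of them multiply the odds by $2^{1000}$, which overwhelms any prior that is not exactly zero.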

In [None]:
with pm.Model() as sun_model_many_observed:
    sun_gone_out = pm.Categorical("sun_gone", p=[100, 1])
    ps = [[0.5, 0.5], [0, 1]]
    is_dark = pm.Categorical("room_dark", p=pm.math.switch(sun_gone_out, ps[1], ps[0]), observed=[1]*1000)

In [None]:
sun_many_post_samples = shared_util.samples_to_dataframe(
    shared_util.sample_from(sun_model_many_observed, draws=10000))

In [None]:
ax.bar([0.5, 1.5], np.log(sun_many_post_samples["sun_gone"].value_counts()),
       0.25, label="after observing\ndark 1000 times", color="C4", lw=4, ec="k");
ax.legend(); plt.tight_layout(); posts_f

### We can sample from priors even in models with `observed` data

We just use **`sample_prior_predictive`**:

In [None]:
prior_samples = shared_util.samples_to_dataframe(pm.sample_prior_predictive(
    samples=10000, model=sun_model_observed))

The output of `pm.sample_prior_predictive`
is still compatible with `shared_util.samples_to_dataframe`,
but it's slightly different from the output of `pm.sample`:
it is literally a dictionary,
rather than being a `MultiTrace`.

In [None]:
prior_samples.head()

In [None]:
f, ax = plt.subplots(figsize=(8, 8));
plt.bar([0, 1], sun_samples["room_dark"].value_counts() / len(sun_samples),
        width=0.4, label="pm.sample, no observed");
plt.bar([0.4, 1.4], prior_samples["room_dark"].value_counts() / len(prior_samples),
        width=0.4, label="pm.sample_prior_predictive"); plt.ylim(0, 1.); plt.legend();
ax.set_xticks([0.2, 1.2]); ax.set_xticklabels(["room bright", "room dark"]); plt.tight_layout();

### Choosing Your Priors

**Common Priors, and what they imply about your beliefs**

`Categorical`: These are my beliefs about each possible value

(`Discrete`)`Uniform`: The value is between these two numbers

`Normal`: I know the value approximately, up to some spread

`LogNormal`: I know the order of magnitude approximately, up to some spread

`Cauchy`: I know almost nothing about this variable

### Cromwell's Rule: Never Say Never

When setting priors, take the advice of Oliver Cromwell, Lord Protector of the Commonwealth of England, Scotland, and Ireland, 1653 - 1658.

In [None]:
Image("img/oliver_cromwell.jpg", width=250)

From [Wikipedia](https://en.wikipedia.org/wiki/Oliver_Cromwell).

> Is it therefore infallibly agreeable to the Word of God, all that you say?
**I beseech you, in the Bowels of Christ, consider the Possibility that you may be mistaken**.

-Oliver Cromwell, [Letter to the Kirk of Scotland](http://www.olivercromwell.org/Letters_and_speeches/letters/Letter_129.pdf), August 3rd, 1650

The rule was coined by Bayesian statistician [Dennis Lindley](https://en.wikipedia.org/wiki/Dennis_Lindley), whose quote about it is illuminating for the "has the sun gone out?" example above:

> \[L\]eave a little probability for the moon being made of green cheese; it can be as small as 1 in a million, but have it there since otherwise an army of astronauts returning with samples of the said cheese will leave you unmoved.

This also applies to likelihoods:
it was probably unwise to say the room would _certainly_ be dark if the sun had gone out.
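
The arithmetic behind Cromwell's Rule is short: a prior probability of exactly zero is multiplied by the likelihood and stays zero, no matter how strong the evidence. A sketch with made-up numbers for Lindley's example (`update` is a hypothetical helper):

```python
def update(prior_h, p_data_if_h, p_data_if_not_h):
    """Posterior probability of hypothesis h after seeing the data, by Bayes' Rule."""
    numerator = prior_h * p_data_if_h
    return numerator / (numerator + (1 - prior_h) * p_data_if_not_h)


# hypothetical numbers: the data are 1000 times more probable if the moon is cheese
for prior_h in [0.0, 1e-6]:
    belief = prior_h
    for _ in range(3):  # three independent cheese-laden astronauts
        belief = update(belief, p_data_if_h=1.0, p_data_if_not_h=0.001)
    print(f"prior {prior_h:g} -> posterior {belief:.4f}")
```

The one-in-a-million prior is moved almost all the way to certainty, while the prior of zero leaves you unmoved.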

### Flatter priors imply less knowledge

In [None]:
superwide_normal_samples = pd.Series(
    pm.Normal.dist(mu=0, sd=10).random(size=10000))  # biggest scale parameter, widest
wide_normal_samples = pd.Series(
    pm.Normal.dist(mu=0, sd=3).random(size=10000))  # intermediate scale parameter
narrow_normal_samples = pd.Series(
    pm.Normal.dist(mu=0, sd=1).random(size=10000))  # smallest scale parameter, skinniest

In [None]:
plt.figure(figsize=(8, 8))
sns.distplot(superwide_normal_samples, label=r"Very Uncertain Prior")
sns.distplot(wide_normal_samples, label=r"Intermediate Prior")
sns.distplot(narrow_normal_samples, color="C3", label=r"More Certain Prior")
plt.xlabel("value"); plt.legend(loc=[0.6, 0.8], framealpha=1); plt.xlim(-10, 10); plt.tight_layout();

In general: parameters that change spread change degree of uncertainty,
while parameters that change location change estimated value.
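
One way to see what "less knowledge" means in practice is to check how much each prior resists the data. The sketch below uses the closed-form posterior for a Normal prior on the mean of a single Normal observation (a standard conjugate result). The observation value `2.0` and the helper name are made up for illustration.

```python
def posterior_mean_and_sd(observation, prior_sd, prior_mu=0.0, noise_sd=1.0):
    """Closed-form posterior for a Normal prior on the mean of one Normal observation."""
    prior_weight = 1.0 / prior_sd ** 2  # precision (1 / variance) of the prior
    data_weight = 1.0 / noise_sd ** 2   # precision of the observation
    total = prior_weight + data_weight
    mean = (prior_weight * prior_mu + data_weight * observation) / total
    return mean, total ** -0.5


for prior_sd in [1, 3, 10]:  # the three prior widths plotted above
    mean, sd = posterior_mean_and_sd(observation=2.0, prior_sd=prior_sd)
    print(f"prior sd {prior_sd:>2}: posterior mean {mean:.3f}, posterior sd {sd:.3f}")
```

The flatter the prior, the closer the posterior mean sits to the observation: a prior that claims little knowledge gives way to the data quickly.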

### For variables with infinitely-many possibilities, improper priors are the flattest priors

They make the most limited assumptions about the value of the variable:
they only assume its domain.

**Improper Priors**

`Flat`: I know nothing about this variable, except it is a number (improper!)

`HalfFlat`: I know nothing about this variable, except that it's positive (improper!)

This allows us to recapitulate the results of frequentist procedures:

In [None]:
observed_vals = pd.Series(pm.Normal.dist(mu=0, sd=1).random(size=10))

In [None]:
with pm.Model() as improper_model:
    mu = pm.Flat("mu"); sigma = pm.HalfFlat("sigma")
    X = pm.Normal("X", mu=mu, sd=sigma, observed=observed_vals)

In [None]:
improper_trace = shared_util.sample_from(improper_model, target_accept=0.9)
improper_samples = shared_util.samples_to_dataframe(improper_trace)

In [None]:
f, ax = plt.subplots(figsize=(8, 8))
sns.distplot(improper_samples["mu"], label="model with\nflat prior");
sns.distplot([np.mean(observed_vals.sample(frac=1, replace=True)) for _ in range(100)], label="bootstrapping");
plt.legend();

The above plot demonstrates that the results of applying a model with a flat prior to the data
are very similar to applying bootstrapping.

But at a cost: improper distributions lose some of the good things about generative models.
For example, we can't draw samples from an improper prior:

In [None]:
try:
    improper_pripred = pm.sample_prior_predictive(model=improper_model)
except ValueError as e:
    print(e)

### Important: the prior must match any constraints on the variable.

Example:

In [None]:
utils.daft.make_bivariate_graph("sigma", "X", observed=True)

Here, $\sigma$ sets the standard deviation of $X$.
The standard deviation cannot be negative.

Say we choose as our prior a very wide `Normal` centered at `0`,
attempting to choose a relatively flat, unopinionated prior:

In [None]:
observed_vals = 0

with pm.Model() as prior_model:
    sigma = pm.Normal("sigma", 0, 1000)
    X = pm.Normal("X", mu=0, sd=sigma, observed=observed_vals)

Attempting to sample from that model can cause problems: 

In [None]:
try:
    shared_util.sample_from(prior_model, draws=1000)
except pm.parallel_sampling.ParallelSamplingError as e:
    print(e)

Depending on the exact sample drawn, you will either get an error or just warnings:
black text against a red background.

The message `Bad initial energy`, accompanied by `X   -inf`,
indicates that one of the samples had a likelihood of 0:
something impossible occurred.

In this case, it happens when the standard deviation is negative.
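
The sketch below mimics the problem outside of pyMC, under the assumption that a Normal log-density is treated as `-inf` whenever its standard deviation is not positive. `normal_logp` is a hand-written stand-in, not pyMC's implementation.

```python
import math
import random

random.seed(0)


def normal_logp(x, mu, sd):
    """Log-density of a Normal; -inf when sd is not a valid standard deviation."""
    if sd <= 0:
        return -math.inf
    return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * ((x - mu) / sd) ** 2


# draws from the too-permissive prior on sigma used above
sigma_draws = [random.gauss(0, 1000) for _ in range(10000)]
impossible = sum(normal_logp(0, mu=0, sd=s) == -math.inf for s in sigma_draws)
print(f"{impossible / len(sigma_draws):.0%} of prior draws give log-probability -inf")

# a prior restricted to positive values, here the absolute value of a Normal draw
# (the distribution pyMC calls HalfNormal), avoids the problem
positive_draws = [abs(s) for s in sigma_draws]
```

About half of the draws from the wide `Normal` prior are negative and therefore impossible as standard deviations. Swapping in a prior whose domain matches the constraint, such as `pm.HalfNormal`, removes them.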

## <font color="#FDB515"> Likelihoods </font>

Likelihoods say:
if I could observe the _parameters_ but not the _data_,
then this distribution would express my uncertainty about what data I might see.

### Common Likelihoods, and how they relate the parameters to data

`Normal`: values get more unlikely as they get further from `mu`, at a rate determined by `1/sd**2` (aka `tau`)

`Binomial`: data is the number of successes in `N` independent attempts, each with probability `p` of succeeding

`Poisson`: data is the number of successes in independent attempts where `N` is large or infinite and `p` is small or infinitesimal, with `mu=N * p`.

`Exponential`: values get more unlikely as they get larger, at a rate determined by `lam`
OR
data is the time in between events in a memoryless process, with events occurring about `lam` times per unit time

`Laplace`: values get more unlikely as they get further from `mu` in absolute difference, at a rate determined by `1/b`

Note that some are mechanistic, others are not:
the `Normal` is not particularly mechanistic,
but the `Poisson` and `Binomial` are.

The `Exponential` might be mechanistic in some models,
but not in others.
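
The mechanistic link between the `Binomial` and the `Poisson` can be checked numerically: hold `N * p` fixed, let `N` grow, and the two sets of probabilities converge. A sketch with an arbitrary choice of `mu = 3` (the two `pmf` helpers are written by hand here):

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent attempts."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, mu):
    """Probability of k events when mu are expected."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


mu = 3.0
for n in [10, 100, 10000]:
    p = mu / n  # keep N * p fixed while N grows and p shrinks
    gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(11))
    print(f"N={n:>5}: largest difference in probability = {gap:.5f}")
```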

We don't typically sample directly from a likelihood in the course of modeling, but if you wanted to:

```python
pm.Foo.dist(parameter_0=parameter_0_fixed, ... ).random(size=N)
```

## <font color="#2ca02c"> Posteriors </font>

Posteriors represent our beliefs after we have observed data for a random variable in our model.

We do not specify them directly:
instead, they are computed from the combination of
our <font color="#003262">**prior**</font>,
our <font color="#FDB515">**likelihood**</font>,
and the data we observed.

So we don't interact with them as pyMC random variables with distributions;
instead we interact with them as _collections of samples_.

### Sampling from posteriors works differently for unobserved and observed variables.

### For unobserved variables, use `pm.sample`

We've covered this fairly extensively, so moving on.

### For observed variables, use `pm.sample_posterior_predictive`

The output of `pm.sample_posterior_predictive`
is still compatible with `shared_util.samples_to_dataframe`,
but it's slightly different from the output of `pm.sample`:
it is literally a dictionary,
rather than being a `MultiTrace`.

In [None]:
mu = 0; N = 50
observed_vals = pm.Normal.dist(mu=mu, sd=1).random(size=N)

with pm.Model() as observed_normals:
    sd = pm.HalfNormal("sd", sd=10)
    X = pm.Normal("X", mu=0, sd=sd, observed=observed_vals)

In [None]:
with observed_normals:
    # pm.sample returns samples from posterior of unobserved variables as a trace
    posterior_trace = pm.sample(target_accept=0.99)
    
# pm.sample_posterior_predictive returns samples from the posterior of observed variables as a dict
postpreds_dict = pm.sample_posterior_predictive(posterior_trace, model=observed_normals)
postpreds_df = shared_util.samples_to_dataframe(postpreds_dict)

In [None]:
print(postpreds_df.head())

In [None]:
print(postpreds_df.iloc[0]["X"])

Notice: each row of `postpreds_df` has a sample
the same size as the `observed` dataset.

Each one is a sample drawn from the likelihood of that observed variable
with the parameters given by a sample from the posterior, one from each of the values in the `posterior_trace`.
Each one is a dataset that might have been observed if the parameter had been equal to that value.
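
The procedure just described can be sketched by hand. The three values in `posterior_sd_samples` are made-up stand-ins for draws of `sd` from a real trace, and the loop is only a rough picture of the idea, not pyMC's implementation.

```python
import random

random.seed(0)

N = 50                                  # size of the observed dataset
posterior_sd_samples = [0.9, 1.0, 1.1]  # stand-ins for posterior samples of `sd`

postpred_rows = []
for sd in posterior_sd_samples:
    # one simulated dataset per posterior sample, drawn from the likelihood Normal(0, sd)
    postpred_rows.append([random.gauss(0, sd) for _ in range(N)])

print(len(postpred_rows), len(postpred_rows[0]))
```

There is one row per posterior sample, and each row is as long as the observed dataset, which is the shape seen in `postpreds_df`.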

In [None]:
plt.figure(figsize=(8, 6));
[sns.distplot(row["X"], color="k") for ii, row in postpreds_df.sample(frac=1).iloc[:5].iterrows()]
sns.distplot(observed_vals * 1, color="C2", rug=True, label="observed");
plt.ylim(*(1.5 * np.array(plt.ylim()))); plt.legend(); plt.tight_layout();

This plot compares individual rows of the `postpreds_df` to the observed data.

When the posterior is tight around the correct value of the parameter,
the samples from `posterior_predictive` will look like the observed data.

If the samples from `posterior_predictive` don't look like your data,
that doesn't necessarily mean your model is wrong:
it may just mean that there's residual uncertainty in your posteriors of the parameters
that's large enough to show up in the predictions.

There might be another model that does better,
one whose posterior predictions, given that data,
would look like the observed data.
But it could also be that the model just needs more observations:
set `N` to `5`, re-run the cells afterwards, and compare the results with those for `N = 50`.

Also, just because the samples look like the data provided via `observed`,
that doesn't mean that the model is correct.
You don't get points for being able to predict data you've already seen.

Resolving this issue and figuring out which model for a given dataset is best
is one of the core problems of statistics and data science.
We'll look at some basic methods for model comparison in lab this week,
and then we'll see more sophisticated methods in the future.

## Summary of Sampling Methods

```python
with pm.Model() as model:
    # random variable to model uncertainty about the truth before seeing data: the _prior_
    parameter_var = pm.SomeRandomVariable("prior_variable", parameter=value)  
    # random variable to model uncertainty in data, given parameters: the _likelihood_
    observed_var = pm.OtherRandomVariable("data_variable", parameter=parameter_var, observed=data_we_saw)
    
# samples to estimate uncertainty before seeing data
pripred_samples = pm.sample_prior_predictive(model=model)

with model:
    # samples to estimate uncertainty about the parameters after seeing data: the _posterior_
    post_samples_trace = pm.sample()

# samples to estimate uncertainty in what future data we might see:
postpred_samples = pm.sample_posterior_predictive(post_samples_trace, model=model)
```