In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Introduction

What happens when we hit the Inference Button (tm)? Is it gradient descent that is happening underneath the hood? What is this "sampler" we speak about, and what exactly is it doing?

As we take a detour away from PyMC3 for a moment,
those fundamental questions are the questions
that we are going to go through in this chapter.
It's my hope that you'll enjoy peeling back the covers
on _some_ of what happens underneath the hood.

## In the beginning...

First off, we must remember that with Bayesian statistical inference,
we are most concerned with computing the posterior distribution of parameters
conditioned on data, $P(H|D)$.
Here, $H$ refers to the parameter set and the model,
while $D$ refers to the data.

Because it is a conditional distribution,
by invoking the rules of probability, where

$$P(H,D)=P(H|D)P(D)=P(D|H)P(H)$$

if you were to treat each of the $P$s as algebraic elements,
then a simple rearrangement gives us:

$$P(H|D)=\frac{P(D|H)P(H)}{P(D)}$$

This, then, is Bayes' rule as applied to joint distributions
of data and model parameters.

One hiccup shows up here, though:
we generally cannot calculate $P(D)$ analytically.
The reason for this is that we don't have an analytical form
for the probability distribution of how data could have been configured.
In practice, we treat that as a normalizing constant,
since philosophically, data are considered constant
while parameters are random variables. 
Hence, our posterior distribution
is calculated as a proportionality term:

$$P(H|D) \propto P(D|H)P(H)$$
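
To see the proportionality at work, here is a minimal sketch on a toy problem of my own choosing (it is not the model used later in this chapter): we infer the mean of Gaussian data with a known standard deviation of 1. The observations, the $N(0, 2)$ prior, and the grid are all assumptions made for illustration. The point is that we only ever evaluate $P(D|H)P(H)$, and recover a proper density by normalizing at the end.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data and prior, chosen only for illustration.
observations = np.array([1.8, 2.3, 2.1, 1.6, 2.4])
mu_grid = np.linspace(-5, 5, 1001)

# Unnormalized log posterior: log P(D|H) + log P(H) at each grid point.
log_prior = norm(loc=0, scale=2).logpdf(mu_grid)
log_lik = norm(loc=mu_grid[:, None], scale=1).logpdf(observations).sum(axis=1)
log_unnorm = log_prior + log_lik

# Normalizing over the grid plays the role of dividing by P(D).
unnorm = np.exp(log_unnorm - log_unnorm.max())  # subtract max for stability
grid_step = mu_grid[1] - mu_grid[0]
posterior = unnorm / (unnorm.sum() * grid_step)

print("posterior mode:", mu_grid[np.argmax(posterior)])
```

A grid like this only works for one or two parameters, which is one reason we turn to sampling later on.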

### Let's look at an illustration

To make everything a bit more concrete, let's look at what I call
a "simplest complex example",
one that is not too hard to "grok" (geek slang for _understand_),
but one that is complex enough to be interesting.

We're going to inspect this particular model:


$$\mu_d \sim N(\mu=8, \sigma=1)$$
$$\sigma_d \sim Exp(\lambda=\frac{1}{3})$$
$$d \sim N(\mu=\mu_d, \sigma=\sigma_d)$$

We have Gaussian-distributed data,
where the mean of the data distribution, $\mu_d$,
is a Gaussian-distributed random variable
whose configuration specifies our prior belief about it
before having seen any data,
while the standard deviation of the data distribution, $\sigma_d$,
is an Exponentially-distributed random variable,
also configured in a way that specifies our prior before having seen any data.

The model's probabilistic graphical model (PGM) looks something like this:

In [None]:
from daft import PGM

G = PGM()
G.add_node("mu", content=r"$\mu$")
G.add_node("sigma", content=r"$\sigma$", x=1)
G.add_node("d", content="d", x=0.5, y=-1)

G.add_edge("mu", "d")
G.add_edge("sigma", "d")
G.show()

### Exercise: Generative Model

In [None]:
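from scipy.stats import norm, expon


def generative_model():
    # Draw the parameters from their priors, then draw one data point.
    mu = norm(loc=8, scale=1).rvs()  # same prior on mu as in log_like below
    sigma = expon(scale=3).rvs()
    data = norm(loc=mu, scale=sigma).rvs()
    return data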

### Exercise: Joint log-likelihood

Write down the joint log-likelihood between data and the model parameters under the pre-specified priors.

In [None]:
from scipy.stats import norm, expon
def log_like(mu, sigma, data):
    mu_like = norm(loc=8, scale=1).logpdf(mu)
    sigma_like = expon(scale=3).logpdf(sigma)
    data_like = norm(loc=mu, scale=sigma).logpdf(data).sum() # sum is important!
    return mu_like + sigma_like + data_like

In [None]:
def generate_data(n):
    for i in range(n):
        yield generative_model()

Now, I'm going to give you some _actual_ data,
and I'd like you to propose a $\mu$ and a $\sigma$,
and then evaluate their joint log-likelihood with the data.

In [None]:
import numpy as np

true_mu = 2
true_sigma = 1
data = norm(true_mu, true_sigma).rvs(150)
log_like(-2, 1, data)

Let's plot how the log likelihood varies with $\mu$ and $\sigma$.
This will give us a great way to visualize the posterior distribution space.

In [None]:
from ipywidgets import interact, FloatSlider, IntSlider
import matplotlib.pyplot as plt
import seaborn as sns

min_mu, max_mu = -3, 15
mu = FloatSlider(min=min_mu, max=max_mu, value=0, step=0.1, description=r"$\mu$")
min_sigma, max_sigma = 0.1, 5
sigma = FloatSlider(min=min_sigma, max=max_sigma, value=1, step=0.1, description=r"$\sigma$")
num_data = IntSlider(min=0, max=100, step=1, value=20, description="Num Data Points")


@interact(mu=mu, sigma=sigma, num_data=num_data)
def plot_univariate_posterior(mu, sigma, num_data):
    mu_range = np.linspace(min_mu, max_mu, 100)
    sigma_range = np.linspace(min_sigma, max_sigma, 100)
    d = data[0:num_data]
    ll_sigma = [log_like(mu, s, d) for s in sigma_range]
    ll_mu = [log_like(m, sigma, d) for m in mu_range]
    fig, ax = plt.subplots(figsize=(8,4), nrows=1, ncols=2)
    ax[0].plot(sigma_range, np.exp(ll_sigma))
    ax[0].set_xlabel(r"$\sigma$")
    ax[0].set_title(rf"$\mu$ fixed at {mu}")
    ax[1].plot(mu_range, np.exp(ll_mu))
    ax[1].set_xlabel(r"$\mu$")
    ax[1].set_title(rf"$\sigma$ fixed at {sigma}")
    # We plot exp(log likelihood), so the y-axis is on the likelihood scale.
    ax[0].set_ylabel("joint likelihood")
    sns.despine()
    plt.tight_layout()

If you're on the online version of this notebook, you'll have some sliders that you can play with.

The first slider, $\mu$, allows you to fix $\mu$ at a particular value,
while letting $\sigma$ vary.
The second slider, $\sigma$, allows you to fix $\sigma$ at a particular value,
while letting $\mu$ vary.
On the y-axis of both plots is the joint likelihood of the model and data
(that is, the exponentiated joint log likelihood),
with $\mu$ and the data held fixed on the $\sigma$ plot,
and with $\sigma$ and the data held fixed on the $\mu$ plot.

There are a few things you might want to note.

Firstly, if you set Num Data Points to 0, you should see the joint likelihood with no data, which is just the prior.

Secondly, the parameter sliders for $\mu$ and $\sigma$ will allow you to "fix" a parameter value.
The plot on the left has $\mu$ fixed, while the plot on the right has $\sigma$ fixed.
In both cases, the data are also fixed at what was observed.

Thirdly, as you increase the number of data points used to evaluate the log likelihood,
you should see the peak of each curve converge towards the true value of its parameter,
provided that you have set the "fixed" parameter value (using the sliders) to its true value too.
When the "fixed" parameter is wrong, inference about the other parameter will be wrong too.
This shows the importance of _jointly_ inferring the values.

## Sampling: Metropolis-Hastings

An easy-to-understand sampler that we can start with
is the Metropolis-Hastings sampler.
I first learned it in a grad-level computational biology class,
but I expect most statistics undergrads should have
a good working knowledge of the algorithm.

Here's how the algorithm works,
shamelessly copied (and modified)
from the [Wikipedia article](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm):

- For each parameter $p$, do the following.
- Initialize an arbitrary point for the parameter (this is $p_t$, or $p$ at step $t$).
- Define a probability density $P(p_t)$ from which we will draw new values of the parameters. Here, we will use $P(p_t) = Normal(p_{t-1}, 1)$.
- For each iteration:
    - Generate a new candidate $p_t$ drawn from $P(p_t)$.
    - Calculate the likelihood of the data under the previous parameter value(s) $p_{t-1}$: $L(p_{t-1})$
    - Calculate the likelihood of the data under the proposed parameter value(s) $p_t$: $L(p_t)$
    - Calculate acceptance ratio $r = \frac{L(p_t)}{L(p_{t-1})}$.
    - Generate a new random number on the unit interval: $s \sim U(0, 1)$.
    - Compare $s$ to $r$.
        - If $s \leq r$, accept $p_t$.
        - If $s \gt r$, reject $p_t$ and continue sampling again with $p_{t-1}$.
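
The accept/reject step above can be sketched in log space, which is how we usually compute it in practice to avoid numerical underflow. This is a minimal sketch, not a full sampler: the function name `accept_proposal` and its arguments are my own, and it assumes you already have the two log likelihoods in hand.

```python
import numpy as np


def accept_proposal(log_l_new, log_l_prev, rng):
    """Return True if the proposed value should be accepted."""
    # log(r) = log L(p_t) - log L(p_{t-1})
    log_r = log_l_new - log_l_prev
    # Comparing s <= r is the same as comparing log(s) <= log(r).
    log_s = np.log(rng.uniform(0, 1))
    return log_s <= log_r


rng = np.random.default_rng(0)
# A proposal with higher likelihood than the previous value is always accepted.
print(accept_proposal(log_l_new=-10.0, log_l_prev=-12.0, rng=rng))
```

Proposals that lower the likelihood are still accepted some of the time, with probability equal to the ratio $r$; that is what lets the sampler explore rather than just climb.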
        
Because of a desire for convenience,
we could choose to use a normal distribution to sample all values.
However, that distribution choice is possibly going to bite us during sampling,
because the values that we could possibly sample for the $\sigma$ parameter
can take on negatives,
but when a negative $\sigma$ is passed
into the normally-distributed likelihood,
we are going to get computation errors!
This is because the scale parameter of a normal distribution
can only be positive, and cannot be negative or zero.
(If it were zero, there would be no randomness.)
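
Here is a small sketch of what goes wrong. To the best of my knowledge, SciPy does not raise an exception when handed a non-positive scale; it quietly returns `nan`, which then poisons any acceptance ratio computed from it. The observation value here is arbitrary.

```python
import numpy as np
from scipy.stats import norm

observation = 0.5  # arbitrary value, for illustration only
log_densities = {}
for scale in (1.0, 0.0, -1.0):
    # Silence NumPy's floating-point warnings so that only the results show.
    with np.errstate(all="ignore"):
        log_densities[scale] = float(norm(loc=0, scale=scale).logpdf(observation))

# Only the positive scale gives a usable log density.
print(log_densities)
```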

### Transformations as a hack

The key problem here is that the support of the Exponential distribution
is restricted to the positive real numbers.
That said, we can get around this problem
simply by sampling in the unbounded real number space $(-\infty, +\infty)$,
and then transforming the number by a math function to be in the bounded space.

One way we can transform numbers from an unbounded space
to a positive-bounded space
is to use the exponential transform:

$$y = e^x$$

For any given value $x$, $y$ will be guaranteed to be positive.
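
A quick sketch to convince ourselves: draw some unbounded numbers (the normal distribution and its scale here are arbitrary choices), push them through the exponential, and check that every result is positive and that the logarithm takes us back.

```python
import numpy as np

rng = np.random.default_rng(42)

# Values in unbounded space: these can be negative, zero, or positive.
x = rng.normal(loc=0.0, scale=3.0, size=10_000)

# Values in positive-bounded space.
y = np.exp(x)

# The natural log is the inverse transform, taking us back to unbounded space.
x_back = np.log(y)

print("any negative x?", (x < 0).any())
print("all y positive?", (y > 0).all())
print("round trip ok? ", np.allclose(x, x_back))
```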

And _voila_!
The key trick here was to **sample in unbounded space**,
but **evaluate log-likelihood in bounded space**.
We call the "unbounded" space the _transformed_ space,
while the "bounded" space is the _original_ or _untransformed_ space.
We have implemented the necessary components
to compute posterior distributions on parameters!

This is something that a probabilistic programming language, such as PyMC3, will do for you,
so you don't have to worry about transforming your random variables for sampling.
(That said, you still have to worry about transforming your RVs for model building,
as we have seen in the chapter on hierarchical modelling.)

### Putting things all together

Let's see how we can implement a simple MCMC sampler,
one that uses the Metropolis-Hastings algorithm.

Follow along the block of code below:
_look at the comments for notes on the important bits!_

In [None]:
from tqdm.autonotebook import tqdm
##### Read the code and comments carefully! #####

# Here, we initialize our "guesses" for mu and sigma.
mu_prev = np.random.normal()
sigma_prev = np.random.normal()

# Keep a history of the parameter values and ratio.
mu_history = dict()
sigma_history = dict()
ratio_history = dict()

for i in tqdm(range(1000)):
    # We record the history of values on each loop.
    mu_history[i] = mu_prev
    sigma_history[i] = sigma_prev
    
    # Now, we propose new values centered on the previous values
    mu_t = np.random.normal(mu_prev, 0.1)
    sigma_t = np.random.normal(sigma_prev, 0.1)

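    # Compute joint log likelihood
    # NOTE: because sigma has to be positive, we sample it in unbounded space
    # and apply an exponential transform before evaluating the likelihood.
    # (A fully rigorous sampler would also add the log-Jacobian of the transform,
    # here sigma_t or sigma_prev, to the matching term; we omit it for simplicity.)
    LL_t = log_like(mu_t, np.exp(sigma_t), data)
    LL_prev = log_like(mu_prev, np.exp(sigma_prev), data)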

    # Calculate the difference in log-likelihoods
    # (or a.k.a. ratio of likelihoods)
    diff_log_like = LL_t - LL_prev
    if diff_log_like > 0:
        ratio = 1
    else:
        # We need to exponentiate to get the correct ratio,
        # since all of our calculations were in log-space
        ratio = np.exp(diff_log_like)

    # Defensive programming check
    if np.isinf(ratio) or np.isnan(ratio):
        raise ValueError(f"LL_t: {LL_t}, LL_prev: {LL_prev}")

    # Ratio comparison step
    ratio_history[i] = ratio
    p = np.random.uniform(0, 1)
    if ratio >= p:
        mu_prev = mu_t
        sigma_prev = sigma_t

Now, let's visualize how our sampling went.

In [None]:
import janitor
import pandas as pd

trace = (
    pd.DataFrame(
        pd.Series(sigma_history, name="sigma")
    )
    .join(
        pd.Series(mu_history, name="mu")
    )
    # We have to transform the sampled values into the correct space.
    .transform_column("sigma", np.exp)
)

trace.plot();

It looks like we were able to recover the true $\mu$ and $\sigma$,
as well as estimate the uncertainty around their values!

Notice how it took a few dozen steps before the trace became **stationary**,
that is, before it settled into fluctuating around a flat trend-line.
If we prune the trace to just the values after the 200th iteration,
we get the following trace:


In [None]:
trace.loc[200:].plot();

## The Important Lessons

This short, short interlude should have given you
a flavour of how MCMC sampling happens.
Here are some of the main things that I hope you've taken away from this chapter.

### Sampling, not optimization

Note here that we didn't _optimize_ the values of $\mu$ and $\sigma$
to maximize likelihood.
Rather, we were simulating the full joint likelihood space via sampling,
while also using the sampler to get us within the _region_ of the true values.
With this, we are able to obtain a _distribution_ rather than a _point estimate_.
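
Once we have samples, summarizing the distribution is just arithmetic on them. Below is a minimal sketch; the `summarize` helper and the burn-in of 200 are my own choices, and the draws are synthetic stand-ins for a real trace (they are not output from the sampler above).

```python
import numpy as np


def summarize(samples, burn_in=200):
    """Posterior mean and central 95% interval, after dropping burn-in."""
    kept = np.asarray(samples)[burn_in:]
    lower, upper = np.percentile(kept, [2.5, 97.5])
    return {"mean": kept.mean(), "lower": lower, "upper": upper}


# Synthetic stand-in for a trace of mu; NOT real sampler output.
rng = np.random.default_rng(0)
fake_mu_samples = rng.normal(loc=2.0, scale=0.08, size=1000)

summary = summarize(fake_mu_samples)
print(summary)
```

With a real trace, you would pass in the sampled values for each parameter in its original (untransformed) space.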

### Log-likelihood is all you need

I've written that line multiple times in this series,
but hopefully this time round, the reason why is clear:
in order to use sampling to estimate the values of the parameters conditioned on data,
we need the joint log-likelihood of our data and model.
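
To make that concrete, here is a sketch of a one-parameter Metropolis sampler that knows nothing about the model except a function returning its log density (up to an additive constant). The function names, the step size, and the toy target (a Normal with mean 1 and standard deviation 0.5) are all assumptions for illustration.

```python
import numpy as np


def metropolis(log_density, init, n_steps, step_size, rng):
    """Random-walk Metropolis for a single unbounded parameter."""
    current = float(init)
    log_p_current = log_density(current)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        # Propose a new value centered on the current one.
        proposal = rng.normal(current, step_size)
        log_p_proposal = log_density(proposal)
        # Accept with probability min(1, ratio), computed in log space.
        if np.log(rng.uniform()) <= log_p_proposal - log_p_current:
            current, log_p_current = proposal, log_p_proposal
        samples[i] = current
    return samples


def toy_log_density(x):
    # Log density of Normal(1, 0.5), dropping the normalizing constant.
    return -0.5 * ((x - 1.0) / 0.5) ** 2


rng = np.random.default_rng(1)
samples = metropolis(toy_log_density, init=0.0, n_steps=20_000, step_size=1.0, rng=rng)
kept = samples[2_000:]  # discard early samples
print(kept.mean(), kept.std())
```

Swapping in a different log density function is all it takes to sample from a different model, which is the sense in which the log-likelihood is all you need.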

### Joint sampling

As the interactive example should have revealed,
_jointly_ estimating the correct values is very important
if we don't want to have biases in the estimates.
As such, we must perform sampling on both values simultaneously,
rather than independently.

### Sane defaults

The act of implementing a sampler is educational
in the same way that reinventing the wheel teaches you about the wheel.
But for a large fraction of problems,
a custom sampler isn't necessary,
and reimplementing one for every new problem is tedious!

That's where a probabilistic programming language (PPL) steps in!
Any probabilistic programming language should provide you with

1. a library of probability distributions with a uniform API,
2. a library of MCMC samplers, also with a uniform API, and
3. a method for automatically constructing a joint log-likelihood function from a model specification and data.

For the samplers, a well-designed PPL will provide you with _very sane defaults_
on which sampler to use, and how they are to be configured.
For example, in PyMC3, the No-U-Turn Sampler (NUTS),
one that leverages the gradient of the log likelihood w.r.t. the parameters
to help propose new values to sample,
is the default for continuous random variables,
because it converges in fewer steps (especially for high dimensional problems).