(econmt-bayesian)=
# Bayesian Inference

In this chapter, we'll look at how to perform analysis and regressions using Bayesian techniques.

Let's import a few of the packages we'll need first. Two key packages that we'll be using that you might not have seen before are [**pymc3**](https://docs.pymc.io/), a Bayesian inference package, and [**Bambi**](https://bambinos.github.io/), which stands for *BAyesian Model-Building Interface*. We'll also use [**arviz**](https://arviz-devs.github.io/) for visualising some Bayesian inference results, but this will get installed when you install **pymc3**. You should follow the install instructions for these carefully and, if you're confident with using different Python environments, it's a good idea to spin up a new 'bayes' environment to try them out in. The chapter on {ref}`code-preliminaries` covers basic information on how to install new packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import pymc3 as pm
import arviz as az

# Plot settings
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
az.style.use("arviz-darkgrid")

# Pandas: Set max rows displayed for readability
pd.set_option("display.max_rows", 6)

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng=pseudo-random number generator
# Turn off warnings
warnings.filterwarnings('ignore')

## Introduction

The biggest difference between the Bayesian and frequentist approaches (that we've seen in the other chapters) is probably that, in Bayesian models, the parameters are not assumed to be fixed but instead are treated as random variables whose uncertainty is described using probability distributions. The data are considered fixed. You might see the 'inverse probability' formulation of a Bayesian model written as $p(\theta | y)$ where the $y$ are the data, and the $\theta$ are the model parameters. An interesting aspect of Bayesianism is that there is just one estimator: Bayes' theorem.

This is in contrast with the frequentist view, which holds that the data are random but the model parameters are fixed, and in which models are often expressed as functions, for example $f(y | \theta)$. Frequentist inference typically involves deriving estimators for the model parameters, and these are usually created to minimise the bias, minimise the variance, or maximise the efficiency.

As a reminder, Bayes' theorem says that $P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$, where $A$ and $B$ are distinct events and $P(A)$ is the probability of event $A$ happening. When dealing with data and model parameters, and ignoring a rescaling factor, this can be written as:

$$
p({\boldsymbol{\theta }}|{\boldsymbol{y}})\propto p({\boldsymbol{y}}|{\boldsymbol{\theta }})p({\boldsymbol{\theta }}).
$$

In these equations:

1. $p({\boldsymbol{\theta}})$ is the prior put on model parameters—what we think the distribution will look like.
2. $p({\boldsymbol{y}}|{\boldsymbol{\theta }})$ is the likelihood of this data given a particular set of model parameters.
3. $p({\boldsymbol{\theta }}|{\boldsymbol{y}})$ is the posterior probability of those model parameters given the observed data.
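To make these three ingredients concrete, here is a minimal sketch that evaluates the proportionality above on a grid of parameter values. The coin-toss data (6 heads in 9 tosses), the flat prior, and all of the variable names are assumptions chosen purely for illustration; they are not used elsewhere in this chapter.

```python
import numpy as np

# Hypothetical data: 6 heads observed in 9 coin tosses
heads, tosses = 6, 9

# Grid of candidate values for theta, the probability of heads
theta_grid = np.linspace(0, 1, 101)

# Prior p(theta): flat over [0, 1]
prior = np.ones_like(theta_grid)

# Likelihood p(y | theta): binomial kernel (the constant factor is omitted)
likelihood = theta_grid**heads * (1 - theta_grid) ** (tosses - heads)

# Posterior p(theta | y) is proportional to likelihood x prior;
# dividing by the sum plays the role of the rescaling factor
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()

theta_mode = theta_grid[np.argmax(posterior)]
print(f"Posterior mode on the grid: {theta_mode:.2f}")
```

With a flat prior the posterior has the same shape as the likelihood, so the mode sits next to the sample proportion of heads. Grid evaluation like this only works for a handful of parameters, which is why sampling methods are needed for larger models.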

Bayesian modelling proceeds as highlighted in the review article, *Bayesian statistics and modelling* {cite}`van2021bayesian`:

![The Bayesian research cycle.](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs43586-020-00001-2/MediaObjects/43586_2020_1_Fig1_HTML.png?as=webp)

One key strength of the Bayesian approach is that it preserves uncertainty: by construction, it includes the degree of belief you have in a parameter. This makes it especially useful in cases where uncertainty is important. One disadvantage of the Bayesian approach is that it is often slower to compute than the frequentist equivalent.

Actually estimating $p({\boldsymbol{y}}|{\boldsymbol{\theta }})p({\boldsymbol{\theta }})$ is done computationally, at least for all but the most trivial examples.

The methods used to do this are all some variation on the *Monte Carlo* method. Monte Carlo techniques use random numbers to choose samples that allow you to estimate properties of interest. They can be used to perform numerical integration and are used for discrete particle simulations in fields like physics and chemistry.

![An illustration of Monte Carlo integration](https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/MonteCarloIntegrationCircle.svg/440px-MonteCarloIntegrationCircle.svg.png)

*Yoderj, Mysid; Wikipedia*

Say we wanted to compute the integral of a 2D shape, a circle, to get an area. In the figure above, random pairs of numbers (say, $x, y$) are chosen within a square that encloses a circle of radius 1, so the square has sides of length 2 and an area of 4. By taking the ratio of the number of points that land within the circle (40) to all samples (50), we can get an approximation of the area of the circle by computing (area of square) $\times$ (ratio of samples) $= 4 \times 40/50 = 3.2 \approx \pi$. This type of algorithm is a form of rejection sampling.
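The same calculation can be sketched in a few lines of code. This is a minimal illustration, assuming a circle of radius 1 enclosed by a square of side 2 (and so of area 4); the seed and the number of samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n_samples = 100_000

# Draw points uniformly in the square [-1, 1] x [-1, 1], which has area 4
points = rng.uniform(-1, 1, size=(n_samples, 2))

# A point lies inside the circle of radius 1 if x^2 + y^2 <= 1
inside = (points**2).sum(axis=1) <= 1

# Area of circle ~ (area of square) x (share of points that land inside)
pi_estimate = 4 * inside.mean()
print(f"Estimate of pi from {n_samples} samples: {pi_estimate:.3f}")
```

With many more samples than the 50 in the figure, the estimate gets much closer to $\pi$; the error of a Monte Carlo estimate like this shrinks in proportion to one over the square root of the number of samples.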

Markov Chain Monte Carlo takes this a step further. A Markov Chain is a stochastic process where the chance of being in a particular state (for example, taking a particular sample) depends only on the previous state. These have equilibrium distributions; essentially a long-run average distribution over states. Markov Chain Monte Carlo finds the distribution you're after (the posterior) by recording states from the chain; the more steps, the more closely the distribution of samples will match the distribution you're trying to find out about. There's a nice walkthrough of MCMC available [here](https://people.duke.edu/~ccc14/sta-663/MCMC.html#markov-chain-monte-carlo-mcmc).
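To see what an equilibrium distribution looks like in practice, here is a small sketch of a two-state Markov chain. The transition probabilities are made up for illustration. The code records the share of time the chain spends in each state and compares it with the equilibrium distribution computed directly from the transition matrix.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical two-state chain: row i holds P(next state | current state i)
transition = np.array([[0.9, 0.1],
                       [0.5, 0.5]])

n_steps = 20_000
state = 0
visits = np.zeros(2)
for _ in range(n_steps):
    # The next state depends only on the current state (the Markov property)
    state = rng.choice(2, p=transition[state])
    visits[state] += 1

empirical = visits / n_steps

# Equilibrium distribution: the left eigenvector of the transition matrix
# associated with the eigenvalue 1, rescaled to sum to 1
eigenvalues, eigenvectors = np.linalg.eig(transition.T)
stationary = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
stationary = stationary / stationary.sum()

print("Long-run share of time in each state:", empirical.round(3))
print("Equilibrium distribution:            ", stationary.round(3))
```

The two printed vectors should be close to one another. MCMC turns this idea around: it constructs a chain whose equilibrium distribution is the posterior, then reads the posterior off the recorded states.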

There are many algorithms for undertaking MCMC, with perhaps the most famous being the Metropolis-Hastings algorithm. For this, we have a probability density function to sample from, a function $Q$ that quantifies where to make a transition, and $\theta_0$, the first guess at the parameters. Starting from $\theta = \theta_0$, the algorithm boils down to:
1. starting from this point $\theta$, determine its match with the target distribution by estimating $p = p({\boldsymbol{y}}|{\boldsymbol{\theta}})p({\boldsymbol{\theta}})$
2. propose a new sample, $\theta'$, using some transition function, and check its match to the target (let this match be $p'$)
3. compute the ratio of densities, $\frac{p'}{p}$, as a metric of favourability of the new point
4. generate a random number $r \thicksim \mathcal{U}[0, 1]$ and compare it to the ratio; accept the new point if $r < \frac{p'}{p}$, otherwise stay at the current point
5. repeat

The clever trick here is that, over time and with more samples, the points you draw will be from the target (posterior) distribution because, at each loop, we are forcing the samples to stick to the posterior distribution. The set of samples is often called a trace. You can find a nice walkthrough of the Metropolis-Hastings algorithm, with code, [here](https://towardsdatascience.com/from-scratch-bayesian-inference-markov-chain-monte-carlo-and-metropolis-hastings-in-python-ef21a29e25a).
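The steps of the algorithm can be sketched in plain **numpy**. This is a minimal illustration rather than a production sampler: the data, the prior $\theta \thicksim \mathcal{N}(0, 10)$, the proposal scale, and the burn-in length are all assumptions. It uses a symmetric random-walk proposal (the special case usually just called the Metropolis algorithm) and compares densities on the log scale to avoid numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# Hypothetical data: 50 draws from a normal with unknown mean (truly 3) and known sd 1
y = rng.normal(loc=3.0, scale=1.0, size=50)


def log_unnormalised_posterior(theta):
    """Log of p(y | theta) p(theta), dropping additive constants."""
    log_prior = -0.5 * (theta / 10.0) ** 2  # theta ~ N(0, 10)
    log_likelihood = -0.5 * np.sum((y - theta) ** 2)  # y_i ~ N(theta, 1)
    return log_prior + log_likelihood


n_steps = 5_000
theta = 0.0  # theta_0, the first guess
log_p = log_unnormalised_posterior(theta)
samples = np.empty(n_steps)

for i in range(n_steps):
    # Propose a new point using a symmetric random-walk transition
    theta_new = theta + rng.normal(scale=0.5)
    log_p_new = log_unnormalised_posterior(theta_new)
    # Accept if r < p'/p, with both sides compared on the log scale
    if np.log(rng.uniform()) < log_p_new - log_p:
        theta, log_p = theta_new, log_p_new
    # If rejected, the chain stays where it is; either way, record the state
    samples[i] = theta

burn_in = 1_000  # discard early samples from before the chain settles
posterior_mean = samples[burn_in:].mean()
print(f"Posterior mean of theta: {posterior_mean:.2f}")
```

Because this toy model is conjugate, the answer can be checked by hand: the exact posterior mean is the sum of the data divided by $n + 1/10^2$, which the sampled mean should sit very close to.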

The Metropolis-Hastings algorithm is a good place to start to understand what's going on in computational Bayesian inference, but there are more modern and powerful algorithms available today, most notably Hamiltonian Monte Carlo (HMC), which takes its inspiration from physics (where the Hamiltonian is an operator that returns the total energy of the system). Instead of sampling completely randomly, which can be inefficient, HMC sees the sampler get 'rolled' (launched in a smooth trajectory) along the estimates of the posterior. There's a very good explanation of this approach [here](https://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/).
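As a rough sketch of the idea, the code below implements a basic HMC sampler for a one-dimensional standard normal target, using leapfrog integration to 'roll' the sampler along a trajectory. The target, the step size, and the number of leapfrog steps are assumptions made for illustration, and real implementations tune these settings automatically.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed


def potential(q):
    """Potential energy: the negative log density of the target (a standard normal)."""
    return 0.5 * q**2


def grad_potential(q):
    """Gradient of the potential energy with respect to q."""
    return q


def hmc_step(q, step_size=0.2, n_leapfrog=20):
    """One HMC transition using leapfrog integration."""
    p = rng.normal()  # draw a random momentum to launch the trajectory
    q_new, p_new = q, p

    # Leapfrog: half step in momentum, alternating full steps, half step in momentum
    p_new -= 0.5 * step_size * grad_potential(q_new)
    for step in range(n_leapfrog):
        q_new += step_size * p_new
        if step < n_leapfrog - 1:
            p_new -= step_size * grad_potential(q_new)
    p_new -= 0.5 * step_size * grad_potential(q_new)

    # Metropolis correction for the numerical error in the total energy
    energy_old = potential(q) + 0.5 * p**2
    energy_new = potential(q_new) + 0.5 * p_new**2
    if np.log(rng.uniform()) < energy_old - energy_new:
        return q_new
    return q


n_samples = 2_000
samples = np.empty(n_samples)
q = 0.0
for i in range(n_samples):
    q = hmc_step(q)
    samples[i] = q

print(f"Sample mean: {samples.mean():.2f}, sample sd: {samples.std():.2f}")
```

The sample mean and standard deviation should be close to 0 and 1, the values for the target. Note that each proposal travels a long way from the current point yet is still accepted with high probability, which is where the efficiency gain over a random walk comes from.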

The most commonly used HMC sampler is the No U-Turn Sampler (NUTS), and this is what we'll be using for most of our examples.

## Bayesian Modelling: A Simple Example

We're going to set up a very simple example using synthetic data and a known data generating process:

$$
y = \alpha + \beta x + \varepsilon
$$

where $\varepsilon \thicksim \mathcal{N}(0, \gamma)$. For our priors, we will assume:

- $\alpha \thicksim \mathcal{N}(0, 10)$
- $\beta \thicksim \mathcal{N}(0, 10)$
- $\gamma \thicksim \mathcal{N_{+}}(\sigma=1)$

These are known as *weakly informative priors*, because they are quite diffuse: the 10th and 90th percentiles of a normal distribution with $\sigma=10$ are at, roughly, -13 and 13. With a weakly informative prior, a reasonably large amount of data will ensure that the prior will not be important and the posterior will home in on a sensible range. We're also implicitly assuming that the data are normalised so that 1 is a meaningful change and zero a sensible reference point. You may want to, for example, scale by the standard deviation of the data or take logs of (positive) predictors, so that coefficients can be interpreted as percent changes.
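The percentile figures quoted above are easy to check. Here is a quick sketch using only the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# The prior placed on alpha and beta: a normal with mean 0 and sd 10
prior = NormalDist(mu=0, sigma=10)

# The inverse CDF gives the value below which a given share of the mass lies
p10 = prior.inv_cdf(0.10)
p90 = prior.inv_cdf(0.90)
print(f"10th percentile: {p10:.1f}, 90th percentile: {p90:.1f}")
```

This gives roughly -12.8 and 12.8, so 80% of the prior mass is spread over a range that is very wide relative to parameter values of order 1.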

We know the 'true' values of the parameters, and will set those first, and then simulate some outcome data. Next, we'll enter the Bayesian modelling part of the code, which starts `with pm.Model() as linear_model:`. First, we set our priors, giving the parameters meaningful names. Second, we build our model by writing down our expected outcome and the likelihood of observations according to that model. Finally, we run the sampler to produce the trace: this is the part that estimates the posterior distribution.

In [None]:
# True parameter values
alpha_true, beta_true, gamma_true = 1, 2.5, 1.5

# Size of dataset
size = 100

# Predictor variable
X = prng.standard_normal(size)

# Simulate outcome variable
Y = alpha_true + beta_true * X + prng.standard_normal(size) * gamma_true

with pm.Model() as linear_model:
    # Priors for unknown model parameters
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10)
    gamma = pm.HalfNormal("sigma", sigma=1)  # appears under the name "sigma" in the trace

    # Expected value of outcome
    mu = alpha + beta * X

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=gamma, observed=Y)

    trace = pm.sample(1000, return_inferencedata=True)

So what just happened? The 'NUTS sampler' was auto-assigned. NUTS stands for the 'No-U-turn sampler'. This is just a fancy way of picking the points at which to sample $p({\boldsymbol{y}}|{\boldsymbol{\theta }})p({\boldsymbol{\theta }})$ to build up a picture of $p({\boldsymbol{\theta }}|{\boldsymbol{y}})$. It's a good default option for sampling.

Let's take a quick look at the model we just estimated using the `model_to_graphviz` function, though note that this is a very simple model, and this function comes into its own in more complex cases.

In [None]:
pm.model_to_graphviz(linear_model)

As discussed in the Introduction, the set of samples forms a trace, and this trace got returned by the `pm.sample` method above. We can now look at the trace that we created in the 2 separate chains we ran.

In [None]:
az.plot_trace(trace);

This looks like fairly meaningless noise, so let's walk through what we're seeing here. On the left-hand side, we can see the results of the two chains we ran for each of the three parameters. The x-axis is the parameter value, the y-axis is the probability density (no values shown). We're actually looking at the posterior distributions here! We'll come on to whether they make sense or not later, but for now we just know we've got some estimates of the posterior.

On the right-hand side, we have some traces, with the sample number on the x-axis and the parameter value on the y-axis. Just looking at these gives our first sense of whether the sampling has gone well or not. In general, we want to see a trace that is a random scatter around a mean value. It's bad news if you see chains that do not converge to a small area and do not mix together.

From the above, it looks like we successfully created some samples and have what appears to be a good estimate of the posterior. 
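Judging mixing by eye can be backed up with a number. One common choice is the Gelman-Rubin statistic, which compares the variance between chains with the variance within them. The sketch below implements the classic version of this statistic on made-up chains; the function name and the synthetic data are assumptions, and **arviz** reports a more robust, rank-normalised variant, so its values will not match this simple formula exactly.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed


def gelman_rubin(chains):
    """Classic Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    n_draws = chains.shape[1]
    # W: average of the within-chain variances
    within = chains.var(axis=1, ddof=1).mean()
    # B: variance of the chain means, scaled up by the number of draws
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)


# Two hypothetical chains exploring the same distribution: well mixed
good_chains = rng.normal(loc=1.0, scale=0.15, size=(2, 1000))

# The same chains, but with the second one stuck around a different value
bad_chains = good_chains + np.array([[0.0], [1.0]])

r_hat_good = gelman_rubin(good_chains)
r_hat_bad = gelman_rubin(bad_chains)
print(f"Well-mixed chains:   {r_hat_good:.2f}")
print(f"Poorly mixed chains: {r_hat_bad:.2f}")
```

Values very close to 1 indicate that the chains agree with one another; values well above 1 signal that the chains are exploring different regions and the samples should not be trusted.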

Now let's look at our actual estimates of these parameters by running the `summary` function:

In [None]:
az.summary(trace, round_to=2)

In Bayesian inference, the coefficient or parameter estimates are extremely easy to interpret. For $\alpha$, we have a mean of 1.03 with a standard deviation of 0.15, and a 94% highest density interval (HDI) between 0.74 and 1.32. The HDI is the narrowest interval that contains the given share of the posterior probability, and you can choose a different percentage using the `hdi_prob=` keyword. We can produce a visualisation of the parameter estimates:

In [None]:
az.plot_posterior(trace, var_names=['alpha']);

This is telling us exactly what it looks like; the estimate is, with good confidence, around 1.
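To show what the interval on the plot represents, here is a sketch of how a highest density interval can be computed directly from posterior samples: sort the draws, then find the narrowest window that contains the required share of them. The synthetic draws stand in for the samples of one parameter, and the function is an illustrative implementation for a unimodal posterior, not the routine that **arviz** uses internally.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed

# Hypothetical posterior draws, standing in for one parameter's samples in a trace
draws = rng.normal(loc=1.0, scale=0.15, size=4000)


def highest_density_interval(samples, prob=0.94):
    """Narrowest interval containing a share `prob` of the samples."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    n_in_interval = int(np.ceil(prob * n))
    # Width of every window that spans n_in_interval consecutive sorted samples
    widths = sorted_samples[n_in_interval - 1 :] - sorted_samples[: n - n_in_interval + 1]
    start = np.argmin(widths)
    return sorted_samples[start], sorted_samples[start + n_in_interval - 1]


lower, upper = highest_density_interval(draws)
print(f"94% HDI: [{lower:.2f}, {upper:.2f}]")
```

For a symmetric posterior like this one, the HDI is close to the interval between the 3rd and 97th percentiles. The two differ for skewed posteriors, where the HDI shifts towards the region of highest density.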

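There were a bunch of other numbers in the trace summary; what did they mean? The `mcse` columns give the Monte Carlo standard error of the estimates, the `ess` columns give the effective sample size (roughly, the number of independent draws that the autocorrelated samples are worth), and `r_hat` compares the variation between chains with the variation within them: values close to 1 suggest that the chains have converged to the same distribution.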

You can further explore the trace (including interactively) by running it in a code block alone.

In [None]:
trace

It would be good to compare the posteriors to the priors, to see how far we got away from those weakly informative normal (and, in one case, half-normal) priors we used. Remember, it's not a good sign if our posterior is very much influenced by our prior (unless we have a reason to have a strong prior).

To do this, we need to assemble all of the information on our model, priors, and posteriors in one object, for which there are a couple of extra steps (we'll put it in `pm_data`).

In [None]:
pm_data = az.from_pymc3(
    prior=pm.sample_prior_predictive(model=linear_model),
    posterior_predictive=pm.sample_posterior_predictive(trace, model=linear_model),
    model=linear_model,
)
# Add in our traces:
pm_data.extend(trace)

In [None]:
az.plot_dist_comparison(pm_data, var_names=["sigma"])

In [None]:
az.plot_ppc(pm_data, group="posterior", figsize=(12, 6))