[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/bayes_course_dec_2023/blob/master/notebooks/Section2_2-Model_Checking.ipynb)

# MCMC Output Processing and Model Checking with ArviZ

ArviZ is a Python package for exploratory analysis of Bayesian models. It includes functions for posterior analysis, model checking, comparison and diagnostics. ArviZ is designed to work with output from a wide range of Bayesian inference libraries, including PyMC, emcee, Stan, Pyro, and TensorFlow Probability.

ArviZ is built on top of the popular libraries xarray and matplotlib. It is also built with the same design principles as PyMC, so if you are familiar with PyMC, you will find ArviZ easy to use.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as st
import pymc as pm
import pytensor.tensor as pt
import arviz as az
import warnings
warnings.simplefilter("ignore")

RANDOM_SEED = 20090425

### Example: Effect of coaching on SAT scores

This example was taken from Gelman *et al.* (2013):

> A study was performed for the Educational Testing Service to analyze the effects of special coaching programs on test scores. Separate randomized experiments were performed to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test- Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other.

We are given the estimated coaching effects (`y`) and their standard errors (`s`). The estimates were obtained by independent experiments, with relatively large sample sizes (over thirty students in each school), so it can be assumed that they have approximately normal sampling distributions with known variances.

In [None]:
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
s = np.array([15, 10, 16, 11, 9, 11, 10, 18])
schools = np.array(
    [
        "Choate",
        "Deerfield",
        "Phillips Andover",
        "Phillips Exeter",
        "Hotchkiss",
        "Lawrenceville",
        "St. Paul's",
        "Mt. Hermon",
    ]
)

with pm.Model(coords={'school': schools}) as schools:
    
    mu = pm.Normal("mu", 0, sigma=1e6)
    tau = pm.HalfCauchy("tau", 5)

    theta = pm.Normal("theta", mu, sigma=tau, dims='school')

    obs = pm.Normal("obs", theta, sigma=s, observed=y)

In [None]:
with schools:
    # Model fitting
    schools_trace = pm.sample(500, tune=0, random_seed=RANDOM_SEED)
    # Posterior predictive sampling
    pm.sample_posterior_predictive(schools_trace, extend_inferencedata=True)

After running an MCMC simulation, `sample` returns an `arviz.InferenceData` object containing the samples for all the stochastic and named deterministic random variables. 

Data corresponding to each type of sampling is available as an `InferenceData` attribute.

In [None]:
post = schools_trace.posterior
post

Arbitrary values can be added to a group of an `InferenceData` object (here, the posterior) using dictionary assignment syntax.

In [None]:
post["log_tau"] = np.log(post["tau"])
schools_trace.posterior

### Combining chains and draws

`arviz.extract` is a convenience function aimed at taking care of the most common subsetting operations with MCMC samples. It can:
- Combine chains and draws
- Return a subset of variables (with optional filtering with regular expressions or string matching)
- Return a subset of samples. By default the subset is drawn at random, which guards against non-representative samples caused by poor mixing.
- Access any group

In [None]:
az.extract(post)


If we only want a subset of the samples, we can use `extract` to resample them.

> Use a random seed to get the same subset from multiple groups: `az.extract(idata, num_samples=100, rng=3)` and `az.extract(idata, group="log_likelihood", num_samples=100, rng=3)` will return matching samples.

In [None]:
az.extract(post, num_samples=100)

Sometimes we want to strip away the `InferenceData` overhead and access the raw values. We can extract values as a NumPy array using the `values` attribute.

In [None]:
az.extract(post).mu.values


Computing the mean and other summary statistics can be performed directly. By default, this will summarize across all dimensions.

In [None]:
post.mean()

Often, however, we may just want to summarize across chains and draws, especially for multivariate quantities.

In [None]:
post.theta.mean(dim=["chain", "draw"])
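
For intuition, the same reduction can be sketched with a plain NumPy array. This is a minimal illustration on synthetic draws, not output from the model above: the array `samples` and its `(chain, draw, school)` axis order are assumptions made for the example. Stacking the first two axes is similar in spirit to what `az.extract` does when it combines chains and draws.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for posterior draws: 4 chains, 500 draws, 8 schools
samples = rng.normal(size=(4, 500, 8))

# Average over the chain and draw axes, keeping one value per school
school_means = samples.mean(axis=(0, 1))

# Stack chains and draws into a single sample axis
stacked = samples.reshape(-1, samples.shape[-1])

print(school_means.shape, stacked.shape)
```

Averaging the stacked array over its sample axis gives the same per-school means, which is why the order of chains and draws does not matter for this kind of summary.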

Finally, subsets of the output can be extracted using the `sel` method. For example, let's look at the posterior distribution of the coaching effects for just one school:

In [None]:
post["theta"].sel(school="Choate")

Or, we can extract a single chain, for all variables:

In [None]:
post.sel(chain=0)

## Model Checking

The final step in Bayesian computation is model checking, in order to ensure that inferences derived from your sample are valid.

There are **two components** to model checking:

1. Convergence diagnostics
2. Goodness of fit

Convergence diagnostics are intended to detect **lack of convergence** in the Markov chain Monte Carlo sample; they are used to ensure that you have not halted your sampling too early. However, a converged model is not guaranteed to be a good model.

The second component of model checking, goodness of fit, is used to check the **internal validity** of the model, by comparing predictions from the model to the data used to fit the model. 

## Convergence Diagnostics

Valid inferences from sequences of MCMC samples are based on the
assumption that the samples are derived from the true posterior
distribution of interest. Theory guarantees this condition as the number
of iterations approaches infinity. It is important, therefore, to
determine the **minimum number of samples** required to ensure a reasonable
approximation to the target posterior density. Unfortunately, no
universal threshold exists across all problems, so convergence must be
assessed independently each time MCMC estimation is performed. The
procedures for verifying convergence are collectively known as
*convergence diagnostics*.

There are a handful of easy-to-use methods for checking convergence. Since you cannot prove convergence, but only show lack of convergence, there is no single method that is foolproof. So, it's best to look at a suite of diagnostics together.

We will cover the canonical set of checks:

- Sampler statistics
- Variable plotting
- Divergences
- R-hat
- Effective Sample Size


## Sampler Statistics

When checking for convergence or when debugging a badly behaving sampler, it is often helpful to take a closer look at what the sampler is doing. For this purpose some samplers export statistics for each generated sample.

NUTS provides several metrics related to the performance of the sampler.

In [None]:
schools_trace.sample_stats

The sample statistics variables are defined as follows:

- `process_time_diff`: The time it took to draw the sample, as defined by the python standard library time.process_time. This counts all the CPU time, including worker processes in BLAS and OpenMP.

- `step_size`: The current integration step size.

- `diverging`: (boolean) Indicates the presence of leapfrog transitions with large energy deviation from starting and subsequent termination of the trajectory. “large” is defined as `max_energy_error` going over a threshold.

- `lp`: The joint log posterior density for the model (up to an additive constant).

- `energy`: The value of the Hamiltonian energy for the accepted proposal (up to an additive constant).

- `energy_error`: The difference in the Hamiltonian energy between the initial point and the accepted proposal.

- `perf_counter_diff`: The time it took to draw the sample, as defined by the python standard library time.perf_counter (wall time).

- `perf_counter_start`: The value of time.perf_counter at the beginning of the computation of the draw.

- `n_steps`: The number of leapfrog steps computed. It is related to `tree_depth` with `n_steps <= 2^tree_depth`.

- `max_energy_error`: The maximum absolute difference in Hamiltonian energy between the initial point and all possible samples in the proposed tree.

- `acceptance_rate`: The average acceptance probabilities of all possible samples in the proposed tree.

- `step_size_bar`: The current best known step-size. After the tuning samples, the step size is set to this value. This should converge during tuning.

- `tree_depth`: The number of tree doublings in the balanced binary tree.

It can be helpful to plot some of these variables, rather than staring at vectors of numbers!

In [None]:
schools_trace.sample_stats["tree_depth"].plot(col="chain", ls="none", marker=".", alpha=0.3);

In [None]:
schools_trace.sample_stats["acceptance_rate"].plot.hist(bins=20, density=True);

## Output Visualization with ArviZ

[ArviZ](https://arviz-devs.github.io/arviz/) is a Python package for exploratory analysis of Bayesian models. It includes functions for posterior analysis, model checking, comparison and diagnostics and is designed to work with a range of Bayesian inference libraries (not just PyMC).

ArviZ accommodates many different input data types, even simple NumPy arrays.

In [None]:
az.plot_posterior(np.random.randn(100_000));

When plotting a dictionary of arrays, ArviZ interprets each key as the name of a different random variable. Each row of an array is treated as an independent series of draws from the variable, called a _chain_. Below, we have 10 chains of 50 draws each, for four different distributions.

In [None]:
size = (10, 50)
az.plot_forest(
    {
        "normal": np.random.randn(*size),
        "gumbel": np.random.gumbel(size=size),
        "student t": np.random.standard_t(df=6, size=size),
        "exponential": np.random.exponential(size=size),
    }, 
);

## Traceplot 

Perhaps the most-used ArviZ plot is the traceplot, obtained via the `plot_trace` function. This is a simple plot that is a good quick check to make sure nothing is obviously wrong, and is usually the first diagnostic step you will take. You've seen these already: just the time series of samples for an individual variable.



By default, the `plot_trace` function from ArviZ generates a kernel density plot and a trace plot, with a different color for each chain of the simulation.

In [None]:
az.plot_trace(schools_trace, var_names=['mu', 'tau']);
plt.tight_layout();

This sample is deliberately inadequate. Looking at the trace plot, the problems should be apparent.

Can you identify the issues, based on what you learned in the previous section?

### Exercise: Take a quiz!

[See how well you can identify sampling problems by looking at their traceplots](https://canyon289.github.io/bayesian-model-evaluation/lessonplans/mcmc_basics/#/14)

The slides will show you a trace, and you have to guess whether the sampling is from one of:

- MCMC with step size too small
- MCMC with step size too large
- MCMC with adequate step size
- Independent samples from distribution

In [None]:
az.plot_pair(
    schools_trace.posterior.theta,
    coords={"school": ["Choate", "Deerfield", "Phillips Andover"]},
);

## Divergences

As we have seen, Hamiltonian Monte Carlo (and NUTS) performs numerical integration in order to explore the posterior distribution of a model. When the integration goes wrong, it can go dramatically wrong. 

For example, here are some Hamiltonian trajectories on the distribution of two correlated variables. Can you spot the divergent path?

![diverging HMC](images/diverging_hmc.png)

The reason that this happens is that there may be parts of the posterior which are **hard to explore** for geometric reasons. Two ways of solving divergences are

1. **Set a higher "target accept" rate**: As with Metropolis-Hastings (though the mechanism is not identical), larger integrator steps lead to lower acceptance rates. A higher `target_accept` will generally result in a smaller step size and more accurate integration.
2. **Reparametrize**: If you can write your model in a different way that has the same joint probability density, you might do that. A lot of work is being done to automate this, since it requires careful work, and one goal of a probabilistic programming language is to iterate quickly. See [Hoffmann, Johnson, Tran (2018)](https://arxiv.org/abs/1811.11926), [Gorinova, Moore, Hoffmann (2019)](https://arxiv.org/abs/1906.03028).

You should be wary of a trace that contains many divergences (particularly those clustered in particular regions of the parameter space), and give thought to how to fix them.
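
The link between step size and integration accuracy can be seen in a toy setting. The following is a minimal sketch, not the NUTS implementation used by PyMC: it runs a plain leapfrog integrator on a one-dimensional standard normal target, where the Hamiltonian is $H = q^2/2 + p^2/2$, and records the largest energy error along the trajectory. The function name `max_energy_error` and the chosen step sizes are assumptions made for illustration.

```python
def max_energy_error(step_size, n_steps=50, q0=1.0, p0=0.0):
    """Largest |H - H0| along a leapfrog trajectory for H = q**2/2 + p**2/2."""
    q, p = q0, p0
    h0 = 0.5 * (q**2 + p**2)
    worst = 0.0
    for _ in range(n_steps):
        p -= 0.5 * step_size * q   # half step for momentum (gradient of q**2/2 is q)
        q += step_size * p         # full step for position
        p -= 0.5 * step_size * q   # second half step for momentum
        worst = max(worst, abs(0.5 * (q**2 + p**2) - h0))
    return worst

errors = {eps: max_energy_error(eps) for eps in (0.1, 1.5, 2.5)}
for eps, err in errors.items():
    print(f"step size {eps}: max energy error {err:.3g}")
```

For this target the integrator is only stable for step sizes below 2, so the largest step size produces an energy error that grows without bound, which is the toy analogue of a divergent transition. Smaller steps keep the error small, at the cost of more gradient evaluations per trajectory.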

### Divergence example

The trajectories above are from a famous example of a difficult geometry: Neal's funnel. It is problematic because the geometry is very different in some regions of the state space relative to others. Specifically, for hierarchical models, as the scale parameter changes in size so do the values of the parameters it is constraining. When the variance is close to zero, the parameter space is very constrained relative to the majority of the support.
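
To get a feel for how strongly the scale changes across the funnel, one can draw from it directly with NumPy, since the distribution is defined by forward sampling. This is a small sketch that assumes the same one-dimensional specification as the model in the next cell ($v \sim N(0, 3)$ and $x \mid v$ normal with variance $2e^{v}$); the cut-offs at $\pm 3$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw directly from a one-dimensional funnel:
# v ~ Normal(0, 3), x | v ~ Normal(0, sd = sqrt(2 * exp(v)))
v = rng.normal(0.0, 3.0, size=20_000)
x = rng.normal(0.0, np.sqrt(2.0 * np.exp(v)))

# Compare the spread of x in the neck (small v) and the mouth (large v)
neck_sd = x[v < -3].std()
mouth_sd = x[v > 3].std()
print(f"sd of x when v < -3: {neck_sd:.3f}")
print(f"sd of x when v >  3: {mouth_sd:.3f}")
```

A single step size cannot suit both regions: a step that is efficient in the mouth is far too large for the neck, which is where divergences tend to appear.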

In [None]:
def neals_funnel(dims=1):
    with pm.Model() as funnel:
        v = pm.Normal('v', 0, 3)
        x_vec = pm.MvNormal('x_vec', mu=pt.zeros(dims), cov=2 * pt.exp(v) * pt.eye(dims), shape=dims)
    return funnel

with neals_funnel():
    funnel_trace = pm.sample(random_seed=RANDOM_SEED)

PyMC provides us feedback on divergences, including a count and a recommendation on how to address them. 

In [None]:
funnel_trace

In [None]:
diverging_ind = funnel_trace.sample_stats['diverging'].values[0].nonzero()
diverging_ind

In [None]:
ax = az.plot_pair(funnel_trace)
ax.plot(funnel_trace.posterior['v'].sel(chain=0).values[diverging_ind], funnel_trace.posterior['x_vec'].sel(chain=0).values[diverging_ind].squeeze(), 'y.');

The `plot_parallel` function in the ArviZ library is a convenient way to identify patterns in divergent traces:

In [None]:
az.plot_parallel(schools_trace);

We have already seen this phenomenon in the radon example from the previous section. Let's re-run the random-slopes model, which has a hierarchical model for the basement effect on radon measurements.

In [None]:
# Import radon data
radon_data = pd.read_csv('../data/radon.csv', index_col=0)

counties = radon_data.county.unique()
n_counties = counties.shape[0]
county = radon_data.county_code.values
log_radon = radon_data.log_radon.values
floor_measure = radon_data.floor.values
log_uranium = np.log(radon_data.Uppm.values)
county_lookup = dict(zip(counties, np.arange(n_counties)))

In [None]:
with pm.Model() as varying_slope:
    
    # Priors
    mu_b = pm.Normal('mu_b', mu=0., sigma=10)
    sigma_b = pm.HalfCauchy('sigma_b', 5)
    
    # Common intercepts
    a = pm.Normal('a', mu=0., sigma=10)
    # Random slopes
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=n_counties)
    
    # Model error
    sigma_y = pm.HalfCauchy('sigma_y',5)
    
    # Expected value
    y_hat = a + b[county] * floor_measure
    
    # Data likelihood
    y_like = pm.Normal('y_like', mu=y_hat, sigma=sigma_y, observed=log_radon)

In [None]:
with varying_slope:
    varying_slope_trace = pm.sample(cores=2, random_seed=RANDOM_SEED)

If we plot the locations of the divergences, we can see that they are located near the tip of the funnel.

In [None]:
x = pd.Series(varying_slope_trace.posterior['b'].sel(chain=0)[:, 10], name='slope')
y = pd.Series(varying_slope_trace.posterior['sigma_b'].sel(chain=0), name='slope group variance')
diverging = varying_slope_trace.sample_stats['diverging'].sel(chain=0)

jp = sns.jointplot(x=x, y=y, ylim=(0, .7), alpha=0.3)
jp.ax_joint.plot(x.values[diverging], y.values[diverging], 'yo');

When the group variance is small, this implies that the individual random slopes are themselves close to the group mean. In itself, this is not a problem, since this is the behavior we expect. However, if the sampler is tuned for the wider (unconstrained) part of the parameter space, it has trouble in the areas of higher curvature. The consequence of this is that the neighborhood close to the lower bound of $\sigma_b$ is sampled poorly; indeed, in our chain it is not sampled at all below 0.1. The result of this will be biased inference.

Now that we've spotted the problem, what can we do about it? The best way to deal with this issue is to reparameterize our model.

### Solution: Non-centered Parameterization

As we saw in the previous section, these divergences are due to the use of a **centered** parameterization of the slope random effect. That is, the individual county effects are distributed around a group mean, with a spread controlled by the hierarchical standard deviation parameter.

Here is the DAG of this centered model:

In [None]:
pm.model_to_graphviz(varying_slope)

We can remove the issue with sampling geometry by **reparameterizing** our model:

In [None]:
with pm.Model() as varying_slope_noncentered:
    
    # Priors
    mu_b = pm.Normal('mu_b', mu=0., sigma=10)
    sigma_b = pm.HalfCauchy('sigma_b', 5)
    
    # Common intercepts
    a = pm.Normal('a', mu=0., sigma=10)
    
    # Non-centered random slopes
    # Centered: b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=n_counties)
    z = pm.Normal('z', mu=0, sigma=1, shape=n_counties)
    b = pm.Deterministic("b", mu_b + z * sigma_b)
    
    # Model error
    sigma_y = pm.HalfCauchy('sigma_y', 5)
    
    # Expected value
    y_hat = a + b[county] * floor_measure
    
    # Data likelihood
    y_like = pm.Normal('y_like', mu=y_hat, sigma=sigma_y, observed=log_radon)
    

This is a **non-centered** parameterization. By this, we mean that the random deviates are no longer explicitly modeled as being centered on $\mu_b$. Instead, they are independent standard normals $z$, which are then scaled by the appropriate value of $\sigma_b$, before being location-transformed by the mean.
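
The two parameterizations describe the same distribution for the slopes; only the quantities the sampler works with change. A quick NumPy check, using hypothetical values for `mu_b` and `sigma_b` chosen only for illustration, makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_b, sigma_b = -0.6, 0.3   # hypothetical values, not estimates from the radon data
n = 200_000

# Centered: draw b directly from Normal(mu_b, sigma_b)
b_centered = rng.normal(mu_b, sigma_b, size=n)

# Non-centered: draw a standard normal, then scale and shift
z = rng.normal(0.0, 1.0, size=n)
b_noncentered = mu_b + z * sigma_b

print(b_centered.mean(), b_centered.std())
print(b_noncentered.mean(), b_noncentered.std())
```

The benefit for sampling comes from the fact that the standard normal deviates are a priori independent of `sigma_b`, so the sampler no longer has to follow the funnel-shaped dependence between the slopes and their scale.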

In [None]:
pm.model_to_graphviz(varying_slope_noncentered)

This model samples much better.

In [None]:
with varying_slope_noncentered:
    noncentered_trace = pm.sample(2000, tune=1000, cores=2, random_seed=RANDOM_SEED)

The non-centered parameterization can fully explore the support of the posterior, and divergences are very rare.

In [None]:
x = pd.Series(noncentered_trace.posterior['b'].sel(chain=0)[:, 75], name='slope')
y = pd.Series(noncentered_trace.posterior['sigma_b'].sel(chain=0), name='slope group variance')

sns.jointplot(x=x, y=y, ylim=(0, .7));

## Potential Scale Reduction: $\hat{R}$

Roughly, $\hat{R}$ (*R-Hat*, or the *Gelman-Rubin statistic*) compares the variance between chains with the variance within chains. This diagnostic uses multiple chains to
check for lack of convergence, and is based on the notion that if
multiple chains have converged, by definition they should appear very
similar to one another; if not, one or more of the chains has failed to
converge.

$\hat{R}$ uses an analysis of variance approach to
assessing convergence. That is, it calculates both the between-chain
variance ($B$) and within-chain variance ($W$), and assesses whether they
are different enough to worry about convergence. Assuming $m$ chains,
each of length $n$, quantities are calculated by:

$$\begin{align}B &= \frac{n}{m-1} \sum_{j=1}^m (\bar{\theta}_{.j} - \bar{\theta}_{..})^2 \\
W &= \frac{1}{m} \sum_{j=1}^m \left[ \frac{1}{n-1} \sum_{i=1}^n (\theta_{ij} - \bar{\theta}_{.j})^2 \right]
\end{align}$$

for each scalar estimand $\theta$. Using these values, an estimate of
the marginal posterior variance of $\theta$ can be calculated:

$$\hat{\text{Var}}(\theta | y) = \frac{n-1}{n} W + \frac{1}{n} B$$

Assuming $\theta$ was initialized to arbitrary (overdispersed) starting
points in each chain, this quantity will overestimate the true marginal
posterior variance. At the same time, $W$ will tend to underestimate the
marginal posterior variance early in the sampling run, because
individual chains have not yet had time to explore the whole target
distribution. However, in the limit as $n \rightarrow \infty$, both
quantities will converge to the true variance of $\theta$.
In light of this, $\hat{R}$ monitors convergence using
the ratio:

$$\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta | y)}{W}}$$

This is called the **potential scale reduction**, since it is an estimate of
the potential reduction in the scale of $\theta$ as the number of
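simulations tends to infinity. In practice, we look for values of
$\hat{R}$ close to one (say, less than 1.1) to be confident that a
particular estimand has converged. Vehtari *et al.* (2019) recommend
the stricter threshold of 1.01 for their rank-normalized version of the
statistic.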
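
The formulas above translate almost line for line into NumPy. The sketch below implements this classic version of the statistic for a single scalar quantity stored as an array of shape `(m, n)`; the function name `rhat_classic` and the synthetic chains are assumptions for illustration. It is not the estimator ArviZ reports by default, which is a rank-normalized split-$\hat{R}$, so the numbers will not match `az.summary` exactly.

```python
import numpy as np

def rhat_classic(chains):
    """Potential scale reduction for an array of shape (m chains, n draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)              # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(2)
mixed = rng.normal(size=(4, 500))                        # chains sharing one distribution
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains offset by 3

print(f"well-mixed chains: {rhat_classic(mixed):.3f}")
print(f"offset chains:     {rhat_classic(stuck):.3f}")
```

When two of the four chains sit in a different region, the between-chain term dominates the pooled variance and the ratio moves well above one.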

In ArviZ, the `summary` table, or a `plot_forest` with the `r_hat` flag set, will calculate $\hat{R}$ for each stochastic node in the trace.

In [None]:
az.summary(schools_trace)

### Exercise

Clearly the model above has not yet converged (we only ran it for 500 iterations without tuning, after all). Try running the `schools` model for a larger number of iterations, and see when $\hat{R}$ converges to 1.0.

In [None]:
with schools:
    trace = pm.sample(1000, tune=2000)

In [None]:
az.summary(trace)

## Effective Sample Size

In general, samples drawn from MCMC algorithms will be autocorrelated. Unless the autocorrelation is very severe, this is not a big deal, other than the fact that autocorrelated chains may require longer sampling in order to adequately characterize posterior quantities of interest. The calculation of autocorrelation is performed for each lag $i=1,2,\ldots,k$ (the correlation at lag 0 is, of course, 1) by: 

$$\hat{\rho}_i = 1 - \frac{V_i}{2\hat{\text{Var}}(\theta | y)}$$

where $\hat{\text{Var}}(\theta | y)$ is the same estimated variance as calculated for the Gelman-Rubin statistic, and $V_i$ is the variogram at lag $i$ for $\theta$:

$$V_i = \frac{1}{m(n-i)}\sum_{j=1}^m \sum_{k=i+1}^n (\theta_{kj} - \theta_{(k-i)j})^2$$

This autocorrelation can be visualized using the `plot_autocorr` function in ArviZ:

In [None]:
az.plot_autocorr(schools_trace, var_names=['mu', 'tau'], combined=True);

You can see very severe autocorrelation in `mu`, which is not surprising given the trace that we observed earlier.

The amount of correlation in an MCMC sample influences the **effective sample size** (ESS) of the sample. The ESS estimates how many *independent* draws contain the same amount of information as the *dependent* sample obtained by MCMC sampling.

Given a series of samples $x_j$, the empirical mean is

$$
\hat{\mu} = \frac{1}{n}\sum_{j=1}^n x_j
$$

and the variance of the estimate of the empirical mean is 

$$
\operatorname{Var}(\hat{\mu}) = \frac{\sigma^2}{n},
$$
where $\sigma^2$ is the true variance of the underlying distribution.

When the draws are autocorrelated, this relationship no longer holds with $n$ in the denominator. The effective sample size is defined as the denominator that restores it:

$$
\operatorname{Var}(\hat{\mu}) = \frac{\sigma^2}{n_{\text{eff}}}.
$$

The effective sample size is estimated using the partial sum:

$$\hat{n}_{eff} = \frac{mn}{1 + 2\sum_{i=1}^T \hat{\rho}_i}$$

where $T$ is the first odd integer such that $\hat{\rho}_{T+1} + \hat{\rho}_{T+2}$ is negative.
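
The estimator can be sketched directly from these formulas. The code below is a plain NumPy illustration for one scalar quantity stored as an `(m, n)` array; `ess_estimate` is a hypothetical helper, and the result will differ from what ArviZ reports because ArviZ uses a refined (rank-normalized, split-chain) estimator. Two synthetic inputs are compared: independent draws and AR(1) chains with lag-one autocorrelation 0.9.

```python
import numpy as np

def ess_estimate(chains):
    """Effective sample size for an array of shape (m chains, n draws)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)
    W = chains.var(axis=1, ddof=1).mean()
    var_hat = (n - 1) / n * W + B / n

    def rho(lag):
        # Variogram-based autocorrelation estimate at the given lag
        diffs = chains[:, lag:] - chains[:, :-lag]
        variogram = (diffs**2).sum() / (m * (n - lag))
        return 1.0 - variogram / (2.0 * var_hat)

    # T: first odd integer for which rho(T + 1) + rho(T + 2) is negative
    T = 1
    while T + 2 < n and rho(T + 1) + rho(T + 2) >= 0:
        T += 2
    return m * n / (1.0 + 2.0 * sum(rho(t) for t in range(1, T + 1)))

rng = np.random.default_rng(3)
m, n, phi = 4, 1000, 0.9

iid = rng.normal(size=(m, n))

# AR(1) chains with lag-one autocorrelation phi, started from stationarity
ar1 = np.empty((m, n))
ar1[:, 0] = rng.normal(scale=1 / np.sqrt(1 - phi**2), size=m)
for t in range(1, n):
    ar1[:, t] = phi * ar1[:, t - 1] + rng.normal(size=m)

print(f"independent draws: ESS ~ {ess_estimate(iid):.0f} of {m * n}")
print(f"AR(1), phi = {phi}:  ESS ~ {ess_estimate(ar1):.0f} of {m * n}")
```

For an AR(1) process the theoretical ratio of effective to nominal sample size is $(1-\phi)/(1+\phi)$, so with $\phi = 0.9$ only about one draw in nineteen counts as independent information.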

Because the effective sample size is itself **estimated** from the fit output, the estimate can be unreliable when sampling is very inefficient. Values of $n_{eff} / n_{iter} < 0.001$ indicate a biased estimator, resulting in an overestimate of the true effective sample size.

Vehtari *et al.* (2019) recommend an ESS of at least 400 to ensure reliable estimates of variances and autocorrelations. They also suggest running at least 4 chains before calculating any diagnostics.

It's important to note that ESS can vary across the quantiles of the MCMC chain being sampled.

In [None]:
az.plot_ess(schools_trace, var_names=['mu'])
plt.tight_layout();

Using ArviZ, we can visualize the evolution of ESS as the MCMC sample accumulates. When the model is converging properly, both lines in this plot should be approximately linear.

The standard ESS estimate, which mainly assesses how well the centre of the distribution is resolved, is referred to as **bulk-ESS**. In order to estimate intervals reliably, it is also important to consider the **tail-ESS**.

In [None]:
az.plot_ess(schools_trace, var_names=['mu'], kind='evolution');

ESS statistics can also be tabulated, by generating a `summary` of the parameters of interest.

In [None]:
az.summary(trace)

It is tempting to want to **thin** the chain to eliminate the autocorrelation (*e.g.* taking every 20th sample from traces with autocorrelation at lags as high as 20), but this is a waste of time. Since thinning deliberately throws out the majority of the samples, no efficiency is gained; you ultimately require more samples to achieve a particular desired effective sample size.
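
A small simulation illustrates the point. This is a sketch on synthetic AR(1) chains rather than on output from the models above, and the settings (`reps`, `n`, `phi`, `thin`) are arbitrary assumptions: it estimates the mean from each full chain and from every 20th draw of the same chain, then compares how much the two estimates vary across replicate chains.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n, phi, thin = 500, 2000, 0.9, 20

# Simulate many independent AR(1) chains (rows) with lag-one autocorrelation phi
chains = np.empty((reps, n))
chains[:, 0] = rng.normal(scale=1 / np.sqrt(1 - phi**2), size=reps)
for t in range(1, n):
    chains[:, t] = phi * chains[:, t - 1] + rng.normal(size=reps)

# Estimates of the mean from the full chain and from every 20th draw
full_means = chains.mean(axis=1)
thinned_means = chains[:, ::thin].mean(axis=1)

var_full = full_means.var()
var_thinned = thinned_means.var()
print(f"variance of the mean, all {n} draws:       {var_full:.4f}")
print(f"variance of the mean, {n // thin} thinned draws: {var_thinned:.4f}")
```

The thinned draws are closer to independent, but there are far fewer of them, so the estimate based on the full chain is at least as precise. Thinning is mainly useful for saving memory or storage.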

## Bayesian Fraction of Missing Information

The Bayesian fraction of missing information (BFMI) is a measure of how hard it is to
sample level sets of the posterior at each iteration. Specifically, it quantifies **how well momentum resampling matches the marginal energy distribution**. 

$$\text{BFMI} = \frac{\mathbb{E}_{\pi}[\text{Var}_{\pi_{E|q}}(E|q)]}{\text{Var}_{\pi_{E}}(E)}$$

$$\widehat{\text{BFMI}} = \frac{\sum_{n=1}^N (E_n - E_{n-1})^2}{\sum_{n=0}^N (E_n - \bar{E})^2}$$

A small value indicates that the adaptation phase of the sampler was unsuccessful, and invoking the central limit theorem may not be valid. It indicates whether the sampler is able to *efficiently* explore the posterior distribution.

Though there is not an established rule of thumb for an adequate threshold, values close to one are optimal. Reparameterizing the model is sometimes helpful for improving this statistic.

BFMI calculation is only available in samples that were simulated using HMC or NUTS.
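
The empirical estimator is simple enough to write out by hand. The sketch below applies it to two synthetic energy series, which are stand-ins invented for illustration and not energies from an actual HMC run: one that changes freely from draw to draw, and one that drifts slowly. The helper name `bfmi_estimate` is an assumption.

```python
import numpy as np

def bfmi_estimate(energy):
    """Empirical BFMI for a one-dimensional array of per-draw energies."""
    energy = np.asarray(energy, dtype=float)
    return np.sum(np.diff(energy) ** 2) / np.sum((energy - energy.mean()) ** 2)

rng = np.random.default_rng(5)
n = 5000

# Hypothetical energy series that changes freely between draws
fast_energy = rng.normal(size=n)

# Hypothetical energy series that drifts slowly (strongly autocorrelated), as when
# momentum resampling moves the energy only a little per iteration
slow_energy = np.empty(n)
slow_energy[0] = rng.normal()
for t in range(1, n):
    slow_energy[t] = 0.95 * slow_energy[t - 1] + rng.normal(scale=np.sqrt(1 - 0.95**2))

print(f"freely varying energy:  {bfmi_estimate(fast_energy):.2f}")
print(f"slowly drifting energy: {bfmi_estimate(slow_energy):.2f}")
```

When successive energies differ by only a small fraction of the overall energy spread, the numerator is small relative to the denominator and the statistic approaches zero, signalling that many iterations are needed to traverse the energy distribution.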

In [None]:
az.bfmi(trace)

Another way of diagnosing this phenomenon is by comparing the overall distribution of 
energy levels with the *change* of energy between successive samples. Ideally, they should be very similar.

If the distribution of energy transitions is narrow relative to the marginal energy distribution, this is a sign of inefficient sampling, as many transitions are required to completely explore the posterior. On the other hand, if the energy transition distribution is similar to that of the marginal energy, this is evidence of efficient sampling, resulting in near-independent samples from the posterior.

In [None]:
az.plot_energy(trace);

In [None]:
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
s = np.array([15, 10, 16, 11, 9, 11, 10, 18])
schools = np.array(
    [
        "Choate",
        "Deerfield",
        "Phillips Andover",
        "Phillips Exeter",
        "Hotchkiss",
        "Lawrenceville",
        "St. Paul's",
        "Mt. Hermon",
    ]
)

with pm.Model(coords={'school': schools}) as schools_uncentered:
    
    mu = pm.Normal("mu", 0, sigma=1e6)
    tau = pm.HalfCauchy("tau", 5)

    z = pm.Normal('z', dims='school')
    theta = pm.Deterministic("theta", mu + tau*z, dims='school')

    obs = pm.Normal("obs", theta, sigma=s, observed=y)

    trace_uncentered = pm.sample()

In [None]:
az.plot_energy(trace_uncentered)

## Goodness of Fit

As noted at the beginning of this section, convergence diagnostics are only the first step in the evaluation
of MCMC model outputs. It is possible for an entirely unsuitable model to converge, so additional steps are needed to ensure that the estimated model adequately fits the data. 

One intuitive way of evaluating model fit is to compare model predictions with the observations used to fit
the model. In other words, the fitted model can be used to simulate data, and the distribution of the simulated data should resemble the distribution of the actual data.

Fortunately, simulating data from the model is a natural component of the Bayesian modelling framework. Recall, from the discussion on prediction, the posterior predictive distribution:

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta) f(\theta|y) d\theta$$

Here, $\tilde{y}$ represents some hypothetical new data that would be expected, taking into account the posterior uncertainty in the model parameters. 
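
In practice the integral is approximated by simulation: for each posterior draw of $\theta$, draw one $\tilde{y}$ from the likelihood. The following NumPy sketch does this for a single normal observation with a known standard error. The posterior draws `theta_draws` are synthetic and the value of `s_j` is assumed, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical posterior draws of one school's effect (not output from the model above)
theta_draws = rng.normal(loc=8.0, scale=5.0, size=50_000)
s_j = 10.0  # assumed known standard error for that school

# One replicated observation per posterior draw: y_tilde ~ Normal(theta, s_j)
y_tilde = rng.normal(loc=theta_draws, scale=s_j)

print(f"sd of theta draws:      {theta_draws.std():.2f}")
print(f"sd of predictive draws: {y_tilde.std():.2f}")
print(f"sqrt(5^2 + 10^2):       {np.sqrt(5.0**2 + 10.0**2):.2f}")
```

The predictive draws are more dispersed than the posterior draws because they combine two sources of uncertainty: uncertainty about $\theta$ and sampling variability in the data given $\theta$.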

Sampling from the posterior predictive distribution is easy in PyMC. The `sample_posterior_predictive` function draws posterior predictive samples from all of the observed variables in the model. Consider the non-centered eight schools model above, 
where each observed coaching effect is modeled as a Gaussian random variable centered on the corresponding school-level effect.

The posterior predictive distribution of the coaching effects uses the same functional
form as the data likelihood, in this case a normal distribution. Here is
the corresponding sample from the posterior predictive distribution (we typically need very few samples relative to the MCMC sample):

In [None]:
with schools_uncentered:
    pm.sample_posterior_predictive(trace_uncentered, extend_inferencedata=True)

The degree to which simulated data correspond to observations can be evaluated visually. This allows for a qualitative comparison of model-based replicates and observations. If there is poor fit, the true value of the data may appear in the tails of the histogram of replicated data, while a good fit will tend to show the true data in high-probability regions of the posterior predictive distribution.

In [None]:
az.plot_ppc(trace_uncentered);

In [None]:
az.plot_ppc(schools_trace, kind='cumulative');

A quantitative approach is to calculate quantiles of each observed data point relative to the corresponding distribution of posterior-simulated values. For an adequate fit, there should not be severe peaks in the histogram near zero and one.

In [None]:
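from scipy.stats import percentileofscore

# Posterior predictive draws from chain 0, transposed to shape (school, draw)
ppc_obs = schools_trace.posterior_predictive['obs'].sel(chain=0).values.T
# Quantile of each observed effect within its own posterior predictive draws
plt.hist([np.round(percentileofscore(_x, _y)/100, 2) for _x, _y in zip(ppc_obs, y)], bins=25);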


---

## References

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.

[Vehtari, Gelman, Simpson, Carpenter, Bürkner (2019)](https://arxiv.org/abs/1903.08008) Rank-normalization, folding, and localization: An improved $\hat{R}$ for assessing convergence of MCMC


[Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.](http://doi.org/10.1007/s11222-013-9416-2)
