# The Bayesian Workflow


In [None]:
import numpy as np
import pymc as pm
import arviz as az
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import matplotlib.pyplot as plt
import load_covid_data

# Visual style
az.style.use('arviz-doc')
pio.templates.default = 'plotly_white'

# Plotly defaults
px.defaults.template = 'plotly_white'
px.defaults.width = 900
px.defaults.height = 500
_base = pio.templates['plotly_white']
_tmpl = go.layout.Template(_base)
_tmpl.layout.hovermode = 'x unified'
pio.templates['hoverx'] = _tmpl
pio.templates.default = 'plotly_white+hoverx'

# Reproducibility
RANDOM_SEED = 20090425
RNG = np.random.default_rng(RANDOM_SEED)


In this session, we will learn to evaluate the quality of our models using statistical and visual diagnostics. We'll also discuss a comprehensive **Bayesian workflow** that promotes model development through an iterative process. The Bayesian workflow is a process for developing and testing Bayesian models, intended to detect problems and produce reliable results.

Here are the steps we'll follow:

1. Explore data: find patterns the model should capture, identify anomalies

2. Build a model: translate ideas into code

3. Prior predictive check: confirm that the priors make sense and the data-generating process is appropriate

4. Fit the model: run the sampler

5. Convergence diagnostics: confirm that the results are a sample from the posterior distribution

6. Posterior predictive check: confirm that the model includes the features it needs to reproduce the important patterns in the data

7. Model improvement: modify the model until fit is satisfactory

Each step of this process is iterative, and at every step you have to decide whether to go on or loop back.


Strengths of Bayesian statistics that are critical here:
* Great flexibility to quickly and iteratively build statistical models
* A principled way of dealing with uncertainty
* A distribution over all possible outcomes, not just the most likely one
* Expert information can guide the model through informative priors

## COVID-19 Dataset

First we'll load data on COVID-19 cases from the WHO. To ease analysis, we will remove any days where confirmed cases were below 100 (as reporting is often very noisy in this time frame). This also allows us to align countries with each other for easier comparison.

In [None]:
df = load_covid_data.load_data(drop_states=True, filter_n_days_100=2)
countries = df['country'].unique()
n_countries = len(countries)
df = df.filter(pl.col('days_since_100') >= 0)
df.head()

Let's step through the workflow!

## 1. Explore the data

We will look at German COVID-19 cases. At first, we will only look at the first 30 days after Germany crossed 100 cases; later we will look at the full data.

In [None]:
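country = 'Germany'
date = '2020-07-31'  # cutoff date used throughout the notebook
# Keep the first 30 days after the country crossed 100 cases
df_country = (
    df.filter(pl.col('country') == country)
      .filter(pl.col('date') <= pl.lit(date).str.to_datetime())
      .head(30)
)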

fig = go.Figure()

# Add confirmed cases line
fig.add_trace(
    go.Scatter(
        x=df_country['date'],
        y=df_country['confirmed'],
        mode='lines+markers',
        name='Confirmed cases',
        line=dict(color='royalblue', width=2),
        marker=dict(size=6)
    )
)

# Update layout
fig.update_layout(
    title=dict(
        text=f'COVID-19 Cases in {country}',
        x=0.5
    ),
    xaxis=dict(
        title='Date',
        tickformat='%b %d\n%Y'
    ),
    yaxis=dict(
        title='Confirmed cases',
        gridcolor='lightgray'
    ),
    width=1000,
    height=800,
    plot_bgcolor='white'
)

fig.show()

Look at the above plot and think of what type of model you would build to model the data.

## 2. Build model

The above line looks roughly exponential. This matches knowledge from epidemiology: early in an epidemic, case counts grow exponentially.
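To make the exponential form concrete, here is a minimal NumPy sketch of the curve $a(1+b)^t$ used in the model that follows. The values `a=100` and `b=0.3` are illustrative assumptions, not estimates from the data, and `exponential_growth` is a hypothetical helper name.

```python
import numpy as np

def exponential_growth(t, a, b):
    """Expected cases after t days: start at `a`, grow by a fraction `b` per day."""
    return a * (1 + b) ** np.asarray(t, dtype=float)

# Illustrative (assumed) values: 100 initial cases, 30% daily growth
days = np.arange(10)
expected_cases = exponential_growth(days, a=100, b=0.3)

# Under this form, cases double every log(2) / log(1 + b) days
doubling_time = np.log(2) / np.log(1 + 0.3)

print(expected_cases.round(0))
print(f"Doubling time: {doubling_time:.2f} days")
```

Reading `b` as a daily growth fraction makes it easier to judge whether a prior on it is plausible.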

In [None]:
# Get time-range of days since 100 cases were crossed
t = df_country['days_since_100'].to_numpy()
# Get number of confirmed cases for Germany
confirmed = df_country['confirmed'].to_numpy()

with pm.Model() as model_exp1:
    # Intercept
    a = pm.Normal('a', mu=0, sigma=100)

    # Slope
    b = pm.Normal('b', mu=0.3, sigma=0.3)

    # Exponential regression
    growth = a * (1 + b) ** t

    # Error term
    eps = pm.HalfNormal('eps', 100)

    # Likelihood
    pm.Normal('obs',
              mu=growth,
              sigma=eps,
              observed=confirmed)

In [None]:
pm.model_to_graphviz(model_exp1)

Just looking at the above model, what do you think? Is there anything you would have done differently?

## 3. Run prior predictive check

Without even fitting the model to our data, we generate new potential data from our priors. Usually we have less intuition about the parameter space, where we define our priors, and more intuition about what data we might expect to see. A prior predictive check thus allows us to make sure the model can generate the types of data we expect to see.

The process works as follows:

1. Pick a point from the prior $\theta_i$
2. Generate data set $x_i \sim f(\theta_i)$ where $f$ is our likelihood function (e.g. normal).
3. Rinse and repeat $n$ times.
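The three steps above can be sketched directly in NumPy for the first model. This is only an illustration of the idea, not how PyMC implements `sample_prior_predictive`; the number of draws and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, t = 500, np.arange(30)

# Step 1: draw parameters from the priors of the first model
a = rng.normal(0, 100, size=n_draws)            # intercept ~ Normal(0, 100)
b = rng.normal(0.3, 0.3, size=n_draws)          # growth rate ~ Normal(0.3, 0.3)
eps = np.abs(rng.normal(0, 100, size=n_draws))  # noise scale ~ HalfNormal(100)

# Step 2: push each draw through the likelihood to simulate one data set
mu = a[:, None] * (1 + b[:, None]) ** t
prior_predictive = rng.normal(mu, eps[:, None])

# Step 3: repeating over all draws gives n_draws simulated series to inspect
share_negative = (prior_predictive < 0).mean()
print(prior_predictive.shape)
print(f"Share of simulated values below zero: {share_negative:.2f}")
```

Each row of `prior_predictive` is one simulated epidemic curve, which is what gets plotted in a prior predictive check.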

In [None]:
with model_exp1:
    prior_pred = pm.sample_prior_predictive()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(prior_pred.prior_predictive['obs'].values.squeeze().T, color="0.5", alpha=.1)
ax.set(ylim=(-1000, 1000),
       xlim=(0, 10),
       title="Prior predictive",
       xlabel="Days since 100 cases",
       ylabel="Positive cases");

Does this look sensible? Why or why not? What is the prior predictive sample telling us?

### What's wrong with this model?

Above you hopefully identified a few issues with this model:
1. Cases can't be negative
2. Cases cannot start at 0, because we sliced the data to start at 100 or more cases.
3. Case counts can't go down

Let's improve our model. The presence of negative cases is due to our use of a Normal likelihood. Instead, let's use a `NegativeBinomial`, which is similar to the `Poisson` distribution commonly used for count data but has an extra dispersion parameter that allows more flexibility in modeling the variance of the data.
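A small simulation can illustrate what the extra dispersion parameter buys us. The sketch below assumes the mean/dispersion parameterization in which the variance is $\mu + \mu^2/\alpha$, and maps it onto NumPy's `(n, p)` parameterization; the values of `mu` and `alpha` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, alpha, size = 200.0, 5.0, 100_000  # assumed illustrative values

poisson_draws = rng.poisson(mu, size=size)

# NumPy parameterizes the negative binomial by (n, p); with n = alpha and
# p = alpha / (alpha + mu) the mean is mu and the variance is mu + mu**2 / alpha
nb_draws = rng.negative_binomial(alpha, alpha / (alpha + mu), size=size)

print(f"Poisson: mean={poisson_draws.mean():.1f}, var={poisson_draws.var():.1f}")
print(f"NegBin:  mean={nb_draws.mean():.1f}, var={nb_draws.var():.1f}")
print(f"Theoretical NegBin variance: {mu + mu**2 / alpha:.1f}")
```

Both distributions share the same mean, but the Poisson variance is tied to the mean while the negative binomial variance can be much larger.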

We will also change the prior of the intercept to be centered at 100 and tighten the prior of the slope.

The negative binomial distribution uses an overdispersion parameter, which we will describe using a gamma distribution. A companion package called `preliz`, a library for prior distribution elicitation, has a nice utility called `maxent` that will help us parameterize this prior, as the gamma distribution is not as intuitive to work with as the normal distribution.

In [None]:
import preliz as pz

gamma_params = pz.maxent(
    pz.Gamma(),
    lower=0.1, upper=20,
    mass=0.95
)
gamma_params

In [None]:
plt.hist(pm.draw(pm.Gamma.dist(alpha=2, beta=0.2), 1000), bins=20);

In [None]:
t = df_country['days_since_100'].to_numpy()
confirmed = df_country['confirmed'].to_numpy()

with pm.Model() as model_exp2:
    # Intercept
    a = pm.Normal('a', mu=100, sigma=25)

    # Slope
    b = pm.Normal('b', mu=0.3, sigma=0.1)

    # Exponential regression
    growth = a * (1 + b) ** t

    alpha = pz.maxent(
        pz.Gamma(),
        lower=0.1, upper=20,
        mass=0.95,
        plot=False
    ).to_pymc("alpha")

    # Likelihood
    pm.NegativeBinomial('obs',
                 growth,
                 alpha=alpha,
                 observed=confirmed)

In [None]:
with model_exp2:
    prior_pred = pm.sample_prior_predictive()

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(prior_pred.prior_predictive['obs'].values.squeeze().T, color="0.5", alpha=.1)
ax.set(ylim=(-100, 1000),
       xlim=(0, 10),
       title="Prior predictive",
       xlabel="Days since 100 cases",
       ylabel="Positive cases");

In [None]:
with model_exp2:
    trace_exp2 = pm.sample(random_seed=RANDOM_SEED)

That looks much better. However, we can include even more prior information. For example, we know that the intercept *cannot* be below 100 because of how we sliced the data. We can thus create a prior that does not have probability mass below 100. For this, we use the PyMC `HalfNormal` distribution, shifted up by 100; we can also use a `HalfNormal` for the slope, which we know is not going to be negative.
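One way to build a prior with no mass below 100 is to add 100 to a half-normal variable, which is what the `a0 + 100` construction in the next model does. A minimal NumPy sketch of that shifted prior, with an arbitrary seed and number of draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# HalfNormal(sigma) draws are the absolute values of Normal(0, sigma) draws
a0 = np.abs(rng.normal(0, 25, size=10_000))
a = a0 + 100  # shift so that all prior mass lies at or above 100

print(f"min={a.min():.1f}, mean={a.mean():.1f}")
```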

In [None]:
t = df_country['days_since_100'].to_numpy()
confirmed = df_country['confirmed'].to_numpy()

with pm.Model() as model_exp3:
    # Intercept
    a0 = pm.HalfNormal('a0', sigma=25)
    a = pm.Deterministic('a', a0 + 100)

    # Slope
    b = pm.HalfNormal('b', sigma=0.2)

    # Exponential regression
    growth = a * (1 + b) ** t

    gamma_params = pm.find_constrained_prior(
        pm.Gamma,
        lower=0.1, upper=20,
        init_guess={'alpha':6, 'beta':1},
        mass=0.95
    )
    alpha = pm.Gamma("alpha", **gamma_params)

    # Likelihood
    pm.NegativeBinomial('obs',
                 growth,
                 alpha=alpha,
                 observed=confirmed)
    
    prior_pred = pm.sample_prior_predictive()

In [None]:
az.plot_posterior(prior_pred, var_names=['a', 'b'], group='prior');

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

ax.plot(az.extract(prior_pred.prior_predictive)['obs'], color="0.5", alpha=.1)
ax.set(ylim=(0, 1000),
       xlim=(0, 10),
       title="Prior predictive",
       xlabel="Days since 100 cases",
       ylabel="Positive cases");

Note that even though the intercept parameter cannot be below 100 now, we still see data generated below 100. Why?

## 4. Fit model

In [None]:
with model_exp3:
    # Inference button (TM)
    trace_exp3 = pm.sample(random_seed=RANDOM_SEED)

## 5. Assess convergence

There are **two components** to model checking:

1. Convergence diagnostics
2. Goodness of fit

Convergence diagnostics are intended to detect **lack of convergence** in the Markov chain Monte Carlo sample; they are used to ensure that you have not halted your sampling too early. However, a converged model is not guaranteed to be a good model.

### Convergence Diagnostics

Valid inferences from sequences of MCMC samples are based on the
assumption that the samples are derived from the true posterior
distribution of interest. Theory guarantees this condition as the number
of iterations approaches infinity. It is important, therefore, to
determine the **minimum number of samples** required to ensure a reasonable
approximation to the target posterior density. Unfortunately, no
universal threshold exists across all problems, so convergence must be
assessed independently each time MCMC estimation is performed. The
procedures for verifying convergence are collectively known as
*convergence diagnostics*.

There are a handful of easy-to-use methods for checking convergence. Since you cannot prove convergence, but only show lack of convergence, no single method is foolproof. So, it's best to look at a suite of diagnostics together.

We will cover the canonical set of checks:

- Variable plotting
- Divergences
- R-hat
- Effective Sample Size

### Traceplot 

Perhaps the most-used ArviZ plot is the traceplot, obtained via the `plot_trace` function. This is a simple plot that is a good quick check to make sure nothing is obviously wrong, and is usually the first diagnostic step you will take. You've seen these already: just the time series of samples for an individual variable.

The `plot_trace` function from ArviZ by default generates a kernel density plot and a trace plot, with a different line style for each chain of the simulation.


In [None]:
az.plot_trace(trace_exp3, var_names=['a', 'b', 'alpha']);

This is a good example of a traceplot; the chains are mixing well and there is no evidence of non-convergence. Notice that the chains are almost identical, indicating that they have converged to the same distribution.

This is not surprising since the model is fairly simple and we tuned the sampler for 1000 iterations (the default in PyMC).

Let's look at a trace from the model sampled with deliberately inadequate tuning:

In [None]:
with model_exp3:
    trace_exp3_bad = pm.sample(tune=10, random_seed=RANDOM_SEED)

Notice that we now have some warnings from the sampler that we did not see before.

Looking at the traceplot, we see that the chains are not mixing well, and you can visually discriminate between the chains.

In [None]:
az.plot_trace(trace_exp3_bad, var_names=['a', 'b', 'alpha']);

### Potential Scale Reduction: $\hat{R}$

Roughly, $\hat{R}$ (*R-Hat*, or the *Gelman-Rubin statistic*) compares the variance between chains to the variance within chains. This diagnostic uses multiple chains to
check for lack of convergence, and is based on the notion that if
multiple chains have converged, by definition they should appear very
similar to one another; if not, one or more of the chains has failed to
converge.

$\hat{R}$ uses an analysis of variance approach to
assessing convergence. That is, it calculates both the between-chain
variance (B) and within-chain variance (W), and assesses whether they
are different enough to worry about convergence. Assuming $m$ chains,
each of length $n$, quantities are calculated by:

$$\begin{align}B &= \frac{n}{m-1} \sum_{j=1}^m (\bar{\theta}_{.j} - \bar{\theta}_{..})^2 \\
W &= \frac{1}{m} \sum_{j=1}^m \left[ \frac{1}{n-1} \sum_{i=1}^n (\theta_{ij} - \bar{\theta}_{.j})^2 \right]
\end{align}$$

for each scalar estimand $\theta$. Using these values, an estimate of
the marginal posterior variance of $\theta$ can be calculated:

$$\hat{\text{Var}}(\theta | y) = \frac{n-1}{n} W + \frac{1}{n} B$$

Assuming $\theta$ was initialized to arbitrary starting points in each
chain, this quantity will overestimate the true marginal posterior
variance. At the same time, $W$ will tend to underestimate the
marginal posterior variance early in the sampling run, because the
individual chains have not yet had time to explore the whole target
distribution. However, in the limit as $n \rightarrow \infty$, both
quantities will converge to the true variance of $\theta$.
In light of this, $\hat{R}$ monitors convergence using
the ratio:

$$\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta | y)}{W}}$$

This is called the **potential scale reduction**, since it is an estimate of
the potential reduction in the scale of $\theta$ as the number of
simulations tends to infinity. In practice, we look for values of
$\hat{R}$ close to one (say, less than 1.1) to be confident that a
particular estimand has converged. ArviZ reports a rank-normalized
version of $\hat{R}$ by default, for which a stricter threshold of 1.01
is commonly recommended.
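The formulas for $B$, $W$ and $\hat{R}$ translate almost line for line into NumPy. The sketch below implements the classic statistic exactly as written above, applied to synthetic chains (an assumption made for illustration); it is not the rank-normalized variant that ArviZ computes, so values from `az.summary` can differ slightly. The function name is hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Classic (non-rank-normalized) R-hat for an array of shape (m chains, n draws)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(3)
mixed = rng.normal(size=(4, 1000))      # four chains sampling the same target
offset = mixed + np.arange(4)[:, None]  # same chains, shifted so they disagree

rhat_mixed = potential_scale_reduction(mixed)
rhat_offset = potential_scale_reduction(offset)
print(f"well-mixed chains: {rhat_mixed:.3f}")
print(f"offset chains:     {rhat_offset:.3f}")
```

Chains that agree give a value very close to one, while chains stuck in different regions inflate the between-chain term and push the statistic well above the threshold.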

In [None]:
az.summary(trace_exp3, var_names=['a', 'b', 'alpha'])

### Effective Sample Size

In general, samples drawn from MCMC algorithms will be autocorrelated. Unless the autocorrelation is very severe, this is not a big deal, other than the fact that autocorrelated chains may require longer sampling in order to adequately characterize posterior quantities of interest. The calculation of autocorrelation is performed for each lag $i=1,2,\ldots,k$ (the correlation at lag 0 is, of course, 1) by: 

$$\hat{\rho}_i = 1 - \frac{V_i}{2\hat{\text{Var}}(\theta | y)}$$

where $\hat{\text{Var}}(\theta | y)$ is the same estimated variance as calculated for the Gelman-Rubin statistic, and $V_i$ is the variogram at lag $i$ for $\theta$:

$$V_i = \frac{1}{m(n-i)}\sum_{j=1}^m \sum_{k=i+1}^n (\theta_{kj} - \theta_{(k-i)j})^2$$

The amount of correlation in an MCMC sample influences the **effective sample size** (ESS) of the sample. The ESS estimates how many *independent* draws contain the same amount of information as the *dependent* sample obtained by MCMC sampling.

Given a series of samples $x_j$, the empirical mean is

$$
\hat{\mu} = \frac{1}{n}\sum_{j=1}^n x_j
$$

and the variance of the estimate of the empirical mean is 

$$
\operatorname{Var}(\hat{\mu}) = \frac{\sigma^2}{n},
$$
where $\sigma^2$ is the true variance of the underlying distribution.

Then the effective sample size is defined as the denominator that makes this relationship still be true:

$$
\operatorname{Var}(\hat{\mu}) = \frac{\sigma^2}{n_{\text{eff}}}.
$$
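This definition can be checked by simulation. The sketch below assumes a stationary AR(1) process with unit marginal variance as a stand-in for an autocorrelated MCMC chain; for that process the large-$n$ effective sample size is known to be $n(1-\phi)/(1+\phi)$. All sizes and the value of `phi` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
phi, n, n_reps = 0.8, 1000, 2000  # assumed AR(1) coefficient and sizes

# Simulate n_reps independent stationary AR(1) series with marginal variance 1
x = np.empty((n_reps, n))
x[:, 0] = rng.normal(size=n_reps)
innovations = rng.normal(scale=np.sqrt(1 - phi**2), size=(n_reps, n))
for i in range(1, n):
    x[:, i] = phi * x[:, i - 1] + innovations[:, i]

var_of_mean = x.mean(axis=1).var()   # empirical Var(mu_hat) across replications
n_eff_empirical = 1.0 / var_of_mean  # sigma^2 = 1, so n_eff = sigma^2 / Var(mu_hat)
n_eff_theory = n * (1 - phi) / (1 + phi)

print(f"n={n}, empirical n_eff={n_eff_empirical:.0f}, AR(1) theory={n_eff_theory:.0f}")
```

With `phi = 0.8`, a thousand correlated draws carry roughly as much information about the mean as about a hundred independent ones.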

With $m$ chains of length $n$, the effective sample size is estimated using the partial sum:

$$\hat{n}_{\text{eff}} = \frac{mn}{1 + 2\sum_{i=1}^T \hat{\rho}_i}$$

where $T$ is the first odd integer such that $\hat{\rho}_{T+1} + \hat{\rho}_{T+2}$ is negative.
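Putting the variogram, the autocorrelation estimate and the truncation rule together gives a short estimator. This is a sketch under two assumptions: the total number of draws $mn$ over $m$ chains of length $n$ is used in the numerator, and synthetic chains stand in for sampler output. ArviZ uses a more refined estimator, so its ESS values will not match this one exactly.

```python
import numpy as np

def effective_sample_size(chains):
    """ESS from variogram-based autocorrelations; chains has shape (m, n)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()
    B = n * chains.mean(axis=1).var(ddof=1)
    var_hat = (n - 1) / n * W + B / n

    def rho(lag):
        # Variogram at this lag, averaged over chains and draws
        V = np.mean((chains[:, lag:] - chains[:, :-lag]) ** 2)
        return 1 - V / (2 * var_hat)

    # Sum rho_1..rho_T, stopping at the first odd T where the next pair is negative
    total, T = rho(1), 1
    while T + 2 < n:
        pair = rho(T + 1) + rho(T + 2)
        if pair < 0:
            break
        total += pair
        T += 2
    return m * n / (1 + 2 * total)

rng = np.random.default_rng(11)
m, n, phi = 4, 1000, 0.9
independent = rng.normal(size=(m, n))
correlated = np.empty((m, n))
correlated[:, 0] = rng.normal(size=m)
for i in range(1, n):
    noise = rng.normal(scale=np.sqrt(1 - phi**2), size=m)
    correlated[:, i] = phi * correlated[:, i - 1] + noise

ess_independent = effective_sample_size(independent)
ess_correlated = effective_sample_size(correlated)
print(f"independent draws: {ess_independent:.0f} of {m * n}")
print(f"correlated draws:  {ess_correlated:.0f} of {m * n}")
```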

In [None]:
az.summary(trace_exp3_bad, var_names=['a', 'b', 'alpha'])

### Bayesian Fraction of Missing Information

The Bayesian fraction of missing information (BFMI) is a measure of how hard it is to
sample level sets of the posterior at each iteration. Specifically, it quantifies **how well momentum resampling matches the marginal energy distribution**. 

$$\text{BFMI} = \frac{\mathbb{E}_{\pi}[\text{Var}_{\pi_{E|q}}(E|q)]}{\text{Var}_{\pi_{E}}(E)}$$

$$\widehat{\text{BFMI}} = \frac{\sum_{n=1}^N (E_n - E_{n-1})^2}{\sum_{n=1}^N (E_n - \bar{E})^2}$$

BFMI is essentially a measure of the association between the energy of a state and the energy of the next state, or more precisely, it compares the average squared change in energy between successive samples to the overall variance of the energy across all samples. The "missing information" refers to the information that the sampler fails to gain about the posterior because it cannot efficiently traverse the energy landscape.

A small value indicates that the adaptation phase of the sampler was unsuccessful, and invoking the central limit theorem may not be valid. It indicates whether the sampler is able to *efficiently* explore the posterior distribution.

Though there is not an established rule of thumb for an adequate threshold, values close to one are optimal. Reparameterizing the model is sometimes helpful for improving this statistic.
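The estimator is simple enough to compute by hand from a series of energies. In the sketch below, the energy series are synthetic AR(1) sequences, an assumption used only to mimic a sampler whose energy changes quickly versus slowly between iterations; `bfmi` and `simulate_energy` are hypothetical helper names.

```python
import numpy as np

def bfmi(energy):
    """Estimated BFMI for one chain: mean squared energy change over energy variance."""
    energy = np.asarray(energy, dtype=float)
    return np.square(np.diff(energy)).mean() / energy.var()

def simulate_energy(phi, n, rng):
    """Synthetic AR(1) energy series (an assumed stand-in for sampler output)."""
    energy = np.empty(n)
    energy[0] = rng.normal()
    steps = rng.normal(scale=np.sqrt(1 - phi**2), size=n)
    for i in range(1, n):
        energy[i] = phi * energy[i - 1] + steps[i]
    return energy

rng = np.random.default_rng(5)
bfmi_healthy = bfmi(simulate_energy(phi=0.5, n=5000, rng=rng))
bfmi_sluggish = bfmi(simulate_energy(phi=0.95, n=5000, rng=rng))

print(f"energy changes freely: BFMI={bfmi_healthy:.2f}")
print(f"energy changes slowly: BFMI={bfmi_sluggish:.2f}")
```

When successive energies are nearly identical, the numerator shrinks relative to the overall energy variance and the estimate drops toward zero.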

In [None]:
az.bfmi(trace_exp3)

Another way of diagnosing this phenomenon is by comparing the overall distribution of
energy levels with the *change* of energy between successive samples. Ideally, they should be very similar.

If the distribution of energy transitions is narrow relative to the marginal energy distribution, this is a sign of inefficient sampling, as many transitions are required to completely explore the posterior. On the other hand, if the energy transition distribution is similar to that of the marginal energy, this is evidence of efficient sampling, resulting in near-independent samples from the posterior.

In an energy plot, low BFMI values show up as poor overlap between the two energy distributions, which suggests that the sampler is having trouble exploring different energy levels. Here is the energy plot for our model:

In [None]:
az.plot_energy(trace_exp3);

## 6. Check Model Fit

The second component of model checking, goodness of fit, is used to check the **internal validity** of the model, by comparing predictions from the model to the data used to fit the model.

Convergence diagnostics are only the first step in the evaluation
of MCMC model outputs. It is possible for an entirely unsuitable model to converge, so additional steps are needed to ensure that the estimated model adequately fits the data.

One intuitive way of evaluating model fit is to compare model predictions with the observations used to fit
the model. In other words, the fitted model can be used to simulate data, and the distribution of the simulated data should resemble the distribution of the actual data.

Fortunately, simulating data from the model is a natural component of the Bayesian modelling framework. Recall, from the discussion on prediction, the posterior predictive distribution:

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta) f(\theta|y) d\theta$$

Here, $\tilde{y}$ represents some hypothetical new data that would be expected, taking into account the posterior uncertainty in the model parameters. 
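In practice the integral is approximated by simulation: draw $\theta$ from the posterior, then draw $\tilde{y}$ given that $\theta$. The sketch below assumes a toy Poisson model whose posterior for the rate is a Gamma distribution, so Gamma draws stand in for MCMC output; the shape and rate values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)

# Assumed stand-in for MCMC output: posterior draws of a Poisson rate
shape, rate = 50.0, 10.0
theta_draws = rng.gamma(shape, 1.0 / rate, size=50_000)

# One simulated observation per posterior draw approximates the integral
y_tilde = rng.poisson(theta_draws)

print(f"predictive mean={y_tilde.mean():.2f}, variance={y_tilde.var():.2f}")
print(f"Poisson variance at the posterior mean rate={theta_draws.mean():.2f}")
```

The predictive variance exceeds what a single plug-in rate would give, because it also carries the posterior uncertainty in $\theta$.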

Sampling from the posterior predictive distribution is easy in PyMC. The `sample_posterior_predictive` function draws posterior predictive samples from all of the observed variables in the model. 

In [None]:
with model_exp3:
    # Draw samples from posterior predictive
    post_pred = pm.sample_posterior_predictive(trace_exp3.posterior)

The degree to which simulated data correspond to observations can be evaluated visually. This allows for a qualitative comparison of model-based replicates and observations. If there is poor fit, the true value of the data may appear in the tails of the histogram of replicated data, while a good fit will tend to show the true data in high-probability regions of the posterior predictive distribution.
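Alongside the visual check, the position of each observation within its replicated values can be summarized as a number. The sketch below uses simulated Poisson counts as stand-ins for posterior predictive replicates and observed data (an assumption for illustration); `tail_fraction` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(21)

# Assumed stand-ins: 2000 replicated data sets of 30 counts, plus two "observed" series
replicates = rng.poisson(100, size=(2000, 30))
observed_ok = rng.poisson(100, size=30)   # generated by the same process
observed_bad = rng.poisson(140, size=30)  # generated by a different process

def tail_fraction(replicates, observed):
    """Per observation, the share of replicates at or above the observed value."""
    return (replicates >= observed).mean(axis=0)

frac_ok = tail_fraction(replicates, observed_ok)
frac_bad = tail_fraction(replicates, observed_bad)

# Values near 0 or 1 mean the observation sits in a tail of the replicated data
print("consistent data:  ", frac_ok.round(2)[:5])
print("inconsistent data:", frac_bad.round(2)[:5])
```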

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(post_pred.posterior_predictive['obs'].sel(chain=0).values.squeeze().T, color='0.5', alpha=.05)
ax.plot(confirmed, color='r', label='data')
ax.set(xlabel="Days since 100 cases", 
       ylabel="Confirmed cases (log scale)",
       # ylim=(0, 100_000), 
       title=country, 
       yscale="log");

OK, that does not look terrible; the data is at least inside of what the model can produce. Let's look at residuals for systematic errors:

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
resid = post_pred.posterior_predictive["obs"].sel(chain=0) - confirmed
ax.plot(resid.T, color="0.5", alpha=.01);
ax.set(ylim=(-50_000, 200_000), ylabel="Residual",
       xlabel="Days since 100 cases");

What can you see?

### Prediction and forecasting

We might also be interested in predicting on unseen data or, in the case of time-series data like here, in forecasting. In PyMC you can do so easily using `pm.Data` nodes. These let you register data with a PyMC model that you can later switch out for other data. That way, when you do posterior predictive sampling, for example, it will generate samples into the future.

Let's change our model to use `pm.Data` instead.

In [None]:
with pm.Model() as model_exp4:
    # pm.Data needs to be in the model context so that we can
    # keep track of it.
    # We can then use it like any other array.
    t_data = pm.Data('t', df_country['days_since_100'].to_numpy())
    confirmed_data = pm.Data('confirmed', df_country['confirmed'].to_numpy())

    # Intercept
    a0 = pm.HalfNormal('a0', sigma=25)
    a = pm.Deterministic('a', a0 + 100)

    # Slope
    b = pm.HalfNormal('b', sigma=0.2)

    # Exponential regression
    growth = a * (1 + b) ** t_data

    # Likelihood
    pm.NegativeBinomial('obs',
                 growth,
                 alpha=pm.Gamma("alpha", mu=6, sigma=1),
                 observed=confirmed_data)
    
    trace_exp4 = pm.sample(random_seed=RANDOM_SEED)

In [None]:
with model_exp4:
    # Update our data containers.
    # Because confirmed is observed, its values are only needed
    # during inference, so placeholder zeros suffice here. We do
    # have to update it, though, so that its shape matches t.
    pm.set_data({'t': np.arange(60),
                 'confirmed': np.zeros(60, dtype='int')})

    post_pred = pm.sample_posterior_predictive(trace_exp4.posterior)

As we held data back before, we can now see how the predictions of the model compare with the out-of-sample observations:

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(post_pred.posterior_predictive['obs'].sel(chain=0).squeeze().values.T, color='0.5', alpha=.05)
ax.plot(df_country["confirmed"].to_numpy(), color='r', label="in-sample")
df_confirmed = df.filter((pl.col('country') == country) & (pl.col('date') <= pl.lit(date).str.to_datetime()))['confirmed']
ax.plot(np.arange(29, len(df_confirmed)), df_confirmed[29:].to_numpy(),
        color='b', label="out-of-sample")
ax.set(xlabel='Days since 100 cases', ylabel='Confirmed cases',
       title=country, yscale="log");
ax.legend();

## 7. Improve model

As an alternative to exponential growth, let's consider a logistic growth model:

$$
y(t) = \frac{c}{1 + a e^{-b t}}
$$

where $c$ is the carrying capacity, $a$ controls the initial value, and $b$ is the growth rate.
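A short NumPy sketch shows how the three parameters shape the curve. It uses the same reparameterization as the model code in this section, $a = c / y_0 - 1$, so that the curve starts at a chosen intercept $y_0$; the numeric values are illustrative assumptions, not estimates.

```python
import numpy as np

def logistic_growth(t, carrying_capacity, intercept, b):
    """Logistic curve that starts at `intercept` and levels off at `carrying_capacity`."""
    a = carrying_capacity / intercept - 1  # chosen so that y(0) equals the intercept
    return carrying_capacity / (1 + a * np.exp(-b * np.asarray(t, dtype=float)))

# Illustrative (assumed) values, not estimates from the data
t = np.arange(0, 200)
curve = logistic_growth(t, carrying_capacity=200_000, intercept=100, b=0.15)

print(f"y(0)={curve[0]:.0f}, y(199)={curve[-1]:.0f}")
```

Early on the curve is close to exponential growth at rate $b$; it then bends and saturates at the carrying capacity.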

### Logistic model

<img src="https://s3-us-west-2.amazonaws.com/courses-images-archive-read-only/wp-content/uploads/sites/924/2015/11/25202016/CNX_Precalc_Figure_04_07_0062.jpg"/>

In [None]:
df_country = df.filter((pl.col('country') == country) & (pl.col('date') <= pl.lit(date).str.to_datetime()))

with pm.Model() as logistic_model:
    t_data = pm.Data('t', df_country["days_since_100"].to_numpy())
    confirmed_data = pm.Data('confirmed', df_country["confirmed"].to_numpy())

    # Intercept
    a0 = pm.HalfNormal('a0', sigma=25)
    intercept = pm.Deterministic('intercept', a0 + 100)

    # Slope
    b = pm.HalfNormal('b', sigma=0.2)
    
    carrying_capacity = pm.Uniform('carrying_capacity',
                                   lower=1_000,
                                   upper=80_000_000)
    # Transform carrying_capacity to a
    a = carrying_capacity / intercept - 1

    # Logistic
    growth = carrying_capacity / (1 + a * pm.math.exp(-b * t_data))

    # Likelihood
    pm.NegativeBinomial('obs',
                 growth,
                 alpha=pm.Gamma("alpha", mu=6, sigma=1),
                 observed=confirmed_data)
    
    prior_pred = pm.sample_prior_predictive()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(prior_pred.prior_predictive['obs'].squeeze().values.T, color="0.5", alpha=.1)
ax.set(title="Prior predictive",
       xlabel="Days since 100 cases",
       ylabel="Positive cases",
       yscale="log",
);

In [None]:
with logistic_model:
    # Inference
    trace_logistic = pm.sample(random_seed=RANDOM_SEED, target_accept=0.9)
    
    # Sample posterior predictive
    pm.sample_posterior_predictive(trace_logistic, extend_inferencedata=True)

In [None]:
az.plot_trace(trace_logistic);

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(trace_logistic.posterior_predictive['obs'].sel(chain=0).squeeze().values.T, color='0.5', alpha=.05)
ax.plot(df_confirmed.to_numpy(), color='r')
ax.set(xlabel='Days since 100 cases', ylabel='Confirmed cases',
       title=country);

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
resid = trace_logistic.posterior_predictive["obs"].sel(chain=0).squeeze().values - df_confirmed.to_numpy()
ax.plot(resid.T, color="0.5", alpha=.01);
ax.set(ylabel="Residual",
       xlabel="Days since 100 cases");

### Model comparison

While it seems clear that the logistic model is a better fit to the data, we can also use one of several model comparison techniques to make this more rigorous.

#### Leave-one-out Cross-validation (LOO)

LOO cross-validation is an estimate of the **out-of-sample predictive fit**. In cross-validation, the data are repeatedly partitioned into training and holdout sets, iteratively fitting the model with the former and evaluating the fit with the holdout data. 

The estimate of out-of-sample predictive fit from applying LOO cross-validation to a Bayesian model is:

$$lppd_{loo} = \sum_{i=1}^N \log p_{post(-i)}(y_i) =  \sum_{i=1}^N \log \left(\frac{1}{S} \sum_{s=1}^S p(y_i| \theta^{(is)})\right)$$

so each prediction is conditioned on $N-1$ data points, which induces an **underestimation of the predictive fit** for smaller $N$. The resulting estimate of the effective number of parameters is:

$$p_{loo} = lppd - lppd_{loo}$$

However, applying cross-validation to a Bayesian model by fitting $N$ copies of the model to different subsets of the data is computationally expensive. Vehtari *et al.* (2016) introduced an efficient computation of LOO from MCMC samples, which are corrected using **Pareto-smoothed importance sampling (PSIS)** to provide an estimate of point-wise out-of-sample prediction accuracy.

This involves estimating the importance sampling LOO predictive distribution

$$p(\tilde{y}_i | y_{-i}) \approx \frac{\sum_{s=1}^S w_i(\theta^{(s)}) p(\tilde{y}_i|\theta^{(s)})}{\sum_{s=1}^S w_i(\theta^{(s)})}$$

where the importance weights are:

$$w_i(\theta^{(s)}) = \frac{1}{p(y_i | \theta^{(s)})} \propto \frac{p(\theta^{(s)}|y_{-i})}{p(\theta^{(s)}|y)}$$

The predictive distribution evaluated at the held-out point is then:

$$p(y_i | y_{-i}) \approx \frac{1}{\frac{1}{S} \sum_{s=1}^S \frac{1}{p(y_i | \theta^{(s)})}}$$

However, the posterior is likely to have a *smaller variance and thinner tails* than the LOO posteriors, so this approximation induces instability due to the fact that the importance ratios can have high or infinite variance.

To deal with this instability, a generalized **Pareto distribution** fit to the upper tail of the distribution of the importance ratios can be used to construct a test for a finite importance ratio variance. If the test suggests the variance is infinite then importance sampling is halted.
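The raw importance-sampling estimate (without the Pareto smoothing step) can be sketched on a toy problem where the exact answer is available. The example assumes a conjugate model, $y_i \sim \text{Normal}(\theta, 1)$ with a $\text{Normal}(0, 10^2)$ prior, so that posterior draws and exact leave-one-out densities have closed forms; all names and values are hypothetical.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(13)
y = rng.normal(1.0, 1.0, size=20)  # assumed toy data
prior_var, S = 10.0**2, 20_000     # assumed prior variance and number of draws

# Posterior draws of theta (closed form for this toy model)
post_prec = 1 / prior_var + len(y)
theta = rng.normal(y.sum() / post_prec, np.sqrt(1 / post_prec), size=S)

# Pointwise log-likelihood matrix, shape (S, N)
log_lik = normal_logpdf(y[None, :], theta[:, None], 1.0)

# Raw importance-sampling LOO: log of the harmonic mean of p(y_i | theta_s)
log_w = -log_lik
shift = log_w.max(axis=0)
lppd_loo_is = -(shift + np.log(np.mean(np.exp(log_w - shift), axis=0)))

# Exact LOO predictive densities from refitting without each point
loo_prec = 1 / prior_var + len(y) - 1
loo_mean = (y.sum() - y) / loo_prec
lppd_loo_exact = normal_logpdf(y, loo_mean, 1.0 + 1 / loo_prec)

# Within-sample fit and the implied effective number of parameters
shift_ll = log_lik.max(axis=0)
lppd = shift_ll + np.log(np.mean(np.exp(log_lik - shift_ll), axis=0))
p_loo = (lppd - lppd_loo_is).sum()

print(f"IS-LOO: {lppd_loo_is.sum():.2f}, exact LOO: {lppd_loo_exact.sum():.2f}")
print(f"p_loo: {p_loo:.2f}")
```

In this well-behaved case the importance ratios have finite variance and the estimate tracks the exact value closely; with influential observations the largest ratios dominate the sum, which is the situation the Pareto fit is meant to detect.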

LOO using Pareto-smoothed importance sampling is implemented in ArviZ in the `loo` function.

> Model comparison requires the log-likelihoods of the respective models. For efficiency, these are not computed automatically, so we need to manually calculate them.

In [None]:
with logistic_model:
    pm.compute_log_likelihood(trace_logistic)

az.loo(trace_logistic)

Is this a good model? It is really only interpretable relative to another model fit to the same data.

First we need to sample from `model_exp4` with the full dataset, and compute the log-likelihood.

In [None]:
with model_exp4:
    pm.set_data({"t": df_country["days_since_100"].to_numpy(),
                 "confirmed": df_country["confirmed"].to_numpy()})
    
    trace_exp4_full = pm.sample(random_seed=RANDOM_SEED)
    pm.compute_log_likelihood(trace_exp4_full)

Now we can use the ArviZ `compare` function:

In [None]:
az.plot_compare(az.compare({"exp4": trace_exp4_full, 
            "logistic": trace_logistic}))

As you can see, the logistic model provides a much better fit to the data. 

There is still some small bias in the residuals, but overall we might think our model is quite good. Let's see how it does on a different country.

In [None]:
country = 'US'
df_country = df.filter(pl.col('country') == country).filter(pl.col('date') <= pl.lit(date).str.to_datetime())
df_confirmed = df_country["confirmed"]
px.line(
    x=df_country["days_since_100"].to_numpy(),
    y=df_country["confirmed"].to_numpy(),
    title=f'{country} - Confirmed Cases',
    labels={'x': 'Days since 100 cases', 'y': 'Confirmed cases'}
).update_layout(width=800, height=600)

As you can see, the data looks quite different. Let's see how our logistic model fits this.

In [None]:
# Observed case counts up to the cutoff date, used for plotting below
df_confirmed = df.filter(pl.col('country') == country).filter(pl.col('date') <= pl.lit(date).str.to_datetime())['confirmed']

with pm.Model() as logistic_model:
    t_data = pm.Data('t', df_country["days_since_100"].to_numpy())
    confirmed_data = pm.Data('confirmed', df_country["confirmed"].to_numpy())

    # Intercept
    a0 = pm.HalfNormal('a0', sigma=25)
    intercept = pm.Deterministic('intercept', a0 + 100)

    # Slope
    b = pm.HalfNormal('b', sigma=0.2)
    
    carrying_capacity = pm.Uniform('carrying_capacity',
                                   lower=1_000,
                                   upper=100_000_000)
    # Transform carrying_capacity to a
    a = carrying_capacity / intercept - 1

    # Logistic
    growth = carrying_capacity / (1 + a * pm.math.exp(-b * t_data))

    # Likelihood
    pm.NegativeBinomial('obs',
                 growth,
                 alpha=pm.Gamma("alpha", mu=6, sigma=1),
                 observed=confirmed_data)

In [None]:
with logistic_model:
    trace_logistic_us = pm.sample(random_seed=RANDOM_SEED)

Already we see some problems with sampling which should make us suspicious that this model might not be the best for this data.

In [None]:
az.plot_trace(trace_logistic_us);

In [None]:
with logistic_model:
    pm.sample_posterior_predictive(
        trace_logistic_us, extend_inferencedata=True)

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(trace_logistic_us.posterior_predictive['obs'].sel(chain=0).squeeze().values.T, color='0.5', alpha=.05)
ax.plot(df_confirmed.to_numpy(), color='r')
ax.set(xlabel='Days since 100 cases', ylabel='Confirmed cases',
       title=country);

As you can see, the model is not a great fit to this data. Why? What assumptions does the model make about the spread of COVID-19?

In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w