# Building Models with PyMC and MCMC Fundamentals

In the previous session, we explored the foundations of Bayesian inference, including Bayes' theorem, conjugate priors, and the mechanics of Bayesian updating for simple models. 

This session introduces PyMC, a powerful probabilistic programming framework that enables us to build and analyze complex Bayesian models. We'll learn how to specify models using PyMC's intuitive API, understand the theoretical foundations of Markov Chain Monte Carlo (MCMC) methods that make modern Bayesian computation possible, and work through practical examples that demonstrate the complete modeling workflow.

By the end of this session, you will be able to:

1. **Build probabilistic models in PyMC**: Understand PyMC's core components including distributions, random variables, and the model context
2. **Specify model structure**: Learn how to encode assumptions about data generating processes using priors and likelihoods
3. **Understand MCMC fundamentals**: Grasp why we need MCMC, how it works conceptually, and what makes it powerful for Bayesian inference
4. **Implement complete Bayesian analyses**: Build, fit, and interpret results from real-world models including linear regression

## Why PyMC?

PyMC provides several key advantages for Bayesian modeling:

- **Expressive model specification**: Write models that look like their mathematical notation
- **Automatic differentiation**: No need to derive gradients by hand
- **State-of-the-art samplers**: Access to efficient MCMC algorithms like NUTS (No-U-Turn Sampler)
- **Comprehensive diagnostics**: Built-in tools for assessing convergence and model quality
- **Integration with the PyData ecosystem**: Works seamlessly with NumPy, Pandas, and visualization libraries

Let's begin by setting up our environment and exploring PyMC's fundamental concepts.

In [None]:
import itertools
import numpy as np
import pymc as pm
import arviz as az
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import matplotlib.pyplot as plt
import scipy.stats as st

# Visual style
az.style.use('arviz-doc')
pio.templates.default = 'plotly_white'

# Plotly defaults
px.defaults.template = 'plotly_white'
px.defaults.width = 450
px.defaults.height = 250
_base = pio.templates['plotly_white']
_tmpl = go.layout.Template(_base)
_tmpl.layout.hovermode = 'x unified'
pio.templates['hoverx'] = _tmpl
pio.templates.default = 'plotly_white+hoverx'

# Reproducibility
RANDOM_SEED = 20090425
RNG = np.random.default_rng(RANDOM_SEED)

## Building Models in PyMC

Probabilistic programming represents a paradigm shift in how we approach statistical modeling. Instead of deriving update equations or coding samplers by hand, we declare the structure of our model and let the framework handle the computational details. PyMC exemplifies this approach by providing an intuitive interface that closely mirrors mathematical notation while leveraging sophisticated algorithms under the hood.

### The Philosophy of Probabilistic Programming

Probabilistic programming fundamentally changes how we think about statistical modeling. Rather than starting with computational constraints and working toward a simplified model, we begin with our actual beliefs about the world and let the appropriate algorithm handle the computational challenges. This approach allows us to build models that truly reflect our understanding of the problem domain, incorporating complex dependencies, hierarchical structures, and realistic uncertainty without being limited by what we can solve analytically or implement by hand.

We start by specifying what we know:
- **Prior knowledge** about parameters before seeing data
- **The data generating process** that connects parameters to observations
- **The observed data** itself

From this specification, a probabilistic programming framework like PyMC automatically constructs the computational graph needed for inference, applies appropriate transformations for constrained parameters, and selects suitable sampling algorithms.

### PyMC's Core Abstractions

PyMC organizes probabilistic models around several key concepts:

1. **The Model Context**: Every PyMC model exists within a context that tracks relationships between variables. This context manager pattern ensures that all model components are properly registered and connected.

2. **Random Variables**: These represent quantities with uncertainty. In Bayesian modeling, parameters are random variables with prior distributions, and data are random variables with likelihood distributions.

3. **Distributions**: PyMC provides a comprehensive library of probability distributions. Each distribution can create random variables when used within a model context.

4. **Deterministic Transformations**: Often we need to transform parameters or compute derived quantities. PyMC tracks these deterministic relationships to maintain the full model structure.

5. **Observed Data**: By marking random variables as observed, we condition the model on actual data, transforming prior distributions into posterior distributions.

Let's explore these concepts through hands-on examples.

## Model Contexts and Random Variables

As we have seen, the canonical way to specify PyMC models is using a `Model` context manager. Generally speaking, a context manager is a Python idiom that defines what happens when entering and exiting a `with` statement. Context managers provide a clean, reliable way to set up and tear down resources.

As an analogy, `Model` is a tape machine that records what is being added to the model; it keeps track of the random variables (observed or unobserved) and other model components. The model context then computes some simple model properties and builds a **bijection** mapping that transforms between Python dictionaries and numpy/Pytensor ndarrays.

More importantly, a `Model` contains methods to compile Pytensor functions that take Random Variables--that are also
initialised within the same model--as input.
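
As a rough illustration of the tape-machine analogy, the sketch below uses a hypothetical `ToyModel` class and `toy_variable` helper. Neither is part of PyMC, and PyMC's real implementation is more involved; the sketch only shows how a context manager can record everything created inside its `with` block.

```python
class ToyModel:
    """Hypothetical stand-in for a model context (not PyMC's implementation)."""

    _stack = []  # active contexts, innermost last

    def __init__(self):
        self.named_vars = {}

    def __enter__(self):
        ToyModel._stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyModel._stack.pop()
        return False  # do not swallow exceptions

    @classmethod
    def current(cls):
        if not cls._stack:
            raise RuntimeError("no model on the context stack")
        return cls._stack[-1]


def toy_variable(name):
    """Register a named placeholder with whichever ToyModel is active."""
    model = ToyModel.current()
    model.named_vars[name] = f"<variable {name}>"
    return model.named_vars[name]


with ToyModel() as toy:
    toy_variable("z")
    toy_variable("sigma")

registered = list(toy.named_vars)
print(registered)
```

Calling `toy_variable` outside a `with ToyModel()` block raises an error, which mirrors why PyMC distributions need an enclosing model context.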

In [None]:
with pm.Model() as model:
    z = pm.Normal('z', mu=0., sigma=5.)

In [None]:
model.named_vars

In [None]:
model.compile_logp()({'z': 2.5})  
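
As a cross-check on the value returned by `compile_logp` here, the log-density of a Normal(0, 5) at 2.5 can be worked out by hand. The helper `normal_logpdf` below is hypothetical and written only for this comparison. Because `z` is unconstrained, the model's log-probability at `z = 2.5` should agree with it up to floating-point error.

```python
import math


def normal_logpdf(value, mu, sigma):
    """Log-density of Normal(mu, sigma) at `value`; sigma is a standard deviation."""
    z = (value - mu) / sigma
    return -0.5 * z**2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


by_hand = normal_logpdf(2.5, mu=0.0, sigma=5.0)
print(by_hand)
```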

## The PyMC API

Bayesian inference begins with a probability model that relates unknown parameters to observed data. PyMC provides high-level building blocks for constructing these models:

### Core Components

- **Stochastic Random Variables**: Variables whose values are not completely determined by their parents. These represent uncertainty in the model parameters or data generating process.
    - **Prior distributions** for model parameters
    - **Likelihood distributions** for observed data

    $$x \sim \text{Normal}(\mu, \sigma)$$

- **Deterministic Variables**: Variables whose values are completely determined by their parents through a mathematical operation; no additional randomness is added by the operation. These represent transformations or combinations of other variables.

    $$y = \mu + \beta x$$

- **Observed Variables**: Variables whose values are observed and used to update the posterior distribution.

    $$y_{obs} \sim \text{Normal}(\mu, \sigma)$$

    Observed variables are declared like stochastic variables, but their values are fixed at the data rather than sampled. They enter the model through the likelihood and are what update the prior into the posterior.


### The Distribution Class

A stochastic variable is represented in PyMC by a `Distribution` class. This structure adds functionality to Pytensor's `pytensor.tensor.random.op.RandomVariable` class, mainly by registering it with an associated PyMC `Model` -- so `Distribution` objects are only usable inside of a `Model` context.

`Distribution` subclasses (i.e. implementations of specific statistical distributions) will accept several arguments when constructed. Some of the most important are:

`name`
: Name for the new model variable. This argument is **required**, and is used as a label and index value for the variable.


In [None]:
with pm.Model():
    x = pm.Normal(name="x")

`shape`
: The variable's shape.


In [None]:
with pm.Model():
    x_matrix = pm.Normal("x_matrix", shape=(3, 3))

pm.draw(x_matrix)

`dims`
: A tuple of dimension names known to the model.


In [None]:
city_names = ["Vancouver", "Calgary", "Toronto", "Montreal", "Halifax"]

with pm.Model(coords={'city': city_names}) as model:
    x_city = pm.Normal("x_city", dims="city")

In [None]:
with model:
    samples = pm.sample_prior_predictive(1000)

az.plot_forest(samples.prior)

`initval`
: Numeric or symbolic untransformed initial value of matching shape, or one of the following initial value strategies: "moment", "prior". Depending on the sampler's settings, a random jitter may be added to numeric, symbolic or moment-based initial values in the transformed space.

In [None]:
with pm.Model() as model:
    x = pm.Normal('x', initval=-2)

In [None]:
model.rvs_to_initial_values

PyMC includes most of the **probability density functions** (for continuous variables) and **probability mass functions** (for discrete variables) used in statistical modeling. These distributions are divided into five distinct categories:

* Univariate continuous
* Univariate discrete
* Multivariate
* Mixture
* Timeseries

Probability distributions are all subclasses of `Distribution`, which in turn has two major subclasses: `Discrete` and `Continuous`. In terms of data types, a `Continuous` random variable is given whichever floating point type is defined by `pytensor.config.floatX`, while `Discrete` variables are given `int16` types when `pytensor.config.floatX` is `float32`, and `int64` otherwise.

In [None]:
x.dtype

Multivariate and Timeseries random variables are vector-valued, rather than scalar (though `Continuous` and `Discrete` variables may have non-scalar values).

In [None]:
x.shape.eval()

In [None]:
x_city.shape.eval()

All of the `Distribution` subclasses included in PyMC will have two key methods, `rng_fn()` and `logp()`, which are used to generate random values and compute the log-probability of a value, respectively.

```python
class SomeDistribution(Continuous):
    def __init__(...):
        ...
    @classmethod
    def rng_fn(cls, rng, size=None, ...):
        ...
        return random_samples

    def logp(value, *params):
        ...
        return log_prob
```

PyMC expects the `logp()` method to return a log-probability evaluated at the passed `value` argument. This method is used internally by all of the inference methods to calculate the model log-probability that is used for fitting models.

Users do not call this method directly; it is used internally by PyMC to implement the user-facing `logp()` function.

In [None]:
pm.logp(x, value=0).eval()

In [None]:
pm.logp(x_city, value=RNG.standard_normal(5)).eval()

The `rng_fn()` method is used to simulate values from the variable, and is used internally for predictive sampling.

Users call the `pm.draw()` function to simulate values from a random variable.

In [None]:
pm.draw(x, draws=10)

In [None]:
pm.draw(x_city, draws=5)

Distributions will optionally have `cdf` and `icdf` methods, representing the cumulative distribution function and inverse cumulative distribution functions, respectively.

Sometimes we wish to use a particular statistical distribution, without using it as a variable in a model; for example, to generate random numbers from the distribution. For this purpose, `Distribution` objects have a method `dist` that returns a **stateless** probability distribution of that type; that is, without being wrapped in a PyMC random variable object.

In [None]:
x = pm.Exponential.dist(10)
samples = pm.draw(x, draws=1000)

px.histogram(
    samples, title="Exponential Distribution Samples"
).update_layout(
    xaxis_title="Value", yaxis_title="Count", showlegend=False
)
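
For comparison, here is a NumPy-only sketch of the same kind of draw. It assumes that the first positional argument of `pm.Exponential.dist` is the rate `lam` (as in current PyMC releases), whereas NumPy's `exponential` takes the scale, which is `1 / lam`. Under that assumption the draws should average about 0.1.

```python
import numpy as np

rng = np.random.default_rng(20090425)

rate = 10.0  # assumed to play the role of `lam` in pm.Exponential
draws = rng.exponential(scale=1.0 / rate, size=100_000)

sample_mean = draws.mean()
print(sample_mean, 1.0 / rate)
```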

### Custom Distributions

If you have a well-behaved density function, you can use it in a model to build a model log-likelihood function. Almost any Pytensor function can be turned into a distribution using the `CustomDist` function. For example, a **uniformly-distributed** stochastic variable could be created manually from a function that computes its log-probability as follows:

In [None]:
def uniform_logp(value, lower, upper):
    # Continuous uniform on [lower, upper]: log-density is -log(upper - lower) inside, -inf outside
    return pm.math.switch(
        (value > upper) | (value < lower),
        -np.inf,
        -pm.math.log(upper - lower)
    )

with pm.Model():

    u = pm.CustomDist('u', 0, 10, logp=uniform_logp, dtype='float32')

In [None]:
pm.logp(u, 3.5).eval()

Passing values outside the support of the distribution to `logp()` will return `-inf`, since the value has no probability.

In [None]:
pm.logp(u, -4).eval()

To emphasize, the Python function passed to `CustomDist` should compute the *log*-density or *log*-probability of the variable. That is why the return value in the example above is `-log(upper - lower)` rather than `1/(upper - lower)`.

In [None]:
pm.logp(u, 1).eval()
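
One practical reason for working on the log scale can be sketched with plain NumPy. The example uses synthetic standard-normal values rather than anything from this notebook: the product of many densities underflows to zero in floating point, while the sum of their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(20090425)
values = rng.standard_normal(1000)

# Standard normal density and log-density at each value
density = np.exp(-0.5 * values**2) / np.sqrt(2.0 * np.pi)
log_density = -0.5 * values**2 - 0.5 * np.log(2.0 * np.pi)

joint_density = np.prod(density)         # underflows to 0.0
joint_log_density = np.sum(log_density)  # finite, usable by a sampler

print(joint_density, joint_log_density)
```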


## Building a Complete Model in PyMC

Now that we understand the basic building blocks of PyMC models, let's see how to combine them to build a complete model.

### Fish Weight Prediction

Let's imagine we are working in the data science team of an e-commerce company. In particular, we sell really good and fresh fish to our clients (mainly fancy restaurants). 

When we ship our products, there is a very important piece of information we need: the weight of the fish. This is important for two reasons: 

1. Because we _bill_ our clients according to weight. 

2. Because the company that delivers the fish to our clients has different price tiers for weights, and those tiers can get _really_ expensive. So we want to know the probability of an item being above that line. In other words, estimating uncertainty is important here!

![](images/weighingfish.jpg)


The problem we face is that we purchase our fish in bulk. This means we only know the total weight of our entire order, but we don't have the weights of the individual fish. You might think the obvious solution is simply to weigh each fish one by one.

However, this approach has significant drawbacks. Manually weighing each fish is costly, requires a lot of time, and demands substantial labor. This process is inefficient and impractical for our needs.

Given these challenges, we need to explore alternative solutions. 

### A solution

While researching the problem, we discovered that our wholesale supplier has detailed information on the size of each individual fish, including their length, height, and width. Since it is infeasible to weigh individual fish, the supplier uses a **camera** to record the size of each fish. 

However, the company used to weigh each fish manually until costs became prohibitive. As a result, we have a valuable **training dataset** consisting of different types of fish with their accurately measured weights.

![](images/fishvideo.png)

### Linear regression analysis

Let's import the data and take a look at it.

In [None]:
fish_market = pl.read_csv("../data/fish-market.csv")
fish_market.schema

We have collected 159 measurements, and all columns in our dataset have the appropriate data types.

For each observation, the dataset includes the following information: the species of the fish, its weight, height, and width, as well as three distinct length measurements. You might be wondering why we have three different measurements for the fish's length. Let's delve into some summary statistics to better understand the data and its significance.

In [None]:
fish_market.describe()

Things to note:

- Though there are no missing data, there are some zero-weight fish! Either the fish was below the minimum weight for the scale, or there was a mistake during data collection.
- The standard deviations of the columns are very high, especially for weight.
- There are three columns for length, which is interesting. We will explore this further.

In [None]:
numeric_data = fish_market.drop("Species")
corr_matrix = numeric_data.corr().to_numpy().round(2)

fig = px.imshow(
    corr_matrix,
    x=numeric_data.columns,
    y=numeric_data.columns,
    zmin=-1,
    zmax=1,
    color_continuous_scale='RdBu_r',  
    aspect='auto'
)

for i in range(len(numeric_data.columns)):
    for j in range(len(numeric_data.columns)):
        fig.add_annotation(
            x=i,
            y=j,
            text=str(corr_matrix[j, i]),
            showarrow=False,
            font=dict(size=16, color='black')
        )

fig.update_layout(
    coloraxis_colorbar_title='Correlation',
    width=800,
    height=800
)


The three length measurements are highly correlated with each other. This means they essentially carry the same information. Without additional details to distinguish among them, we should arbitrarily choose one measurement and discard the other two. Keeping all three would be redundant and unnecessary since they do not provide unique information.

There is nothing inherently Bayesian about this step. The concept of *multicollinearity* is a fundamental concern in both Bayesian and frequentist statistics. In essence, if you include multiple variables that convey similar information in your regression model, you will end up with very unstable parameter estimates. This redundancy does not improve your model's predictive power and can, in fact, lead to misleading results. Thus, it is crucial to identify and address multicollinearity to maintain the robustness and reliability of your model.
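
The instability can be made concrete with a small synthetic example (made-up data, not the fish dataset). When a second predictor is almost a copy of the first, the condition number of the design matrix grows sharply, which means small changes in the data can produce large changes in least-squares coefficients.

```python
import numpy as np

rng = np.random.default_rng(20090425)
n = 200

length_a = rng.normal(size=n)                    # synthetic predictor
length_b = length_a + 0.01 * rng.normal(size=n)  # near-copy of length_a

# Design matrices with an intercept column
X_single = np.column_stack([np.ones(n), length_a])
X_both = np.column_stack([np.ones(n), length_a, length_b])

cond_single = np.linalg.cond(X_single)
cond_both = np.linalg.cond(X_both)

print(cond_single, cond_both)
```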

In [None]:
fish_market = fish_market.drop(["Length2", "Length3"])

## Visual data exploration

It's always a good idea to plot your data! Plotly's `scatter_matrix` function is a great way to visualize the relationships between variables in your dataset. This function creates a matrix of scatterplots, with each variable plotted against every other variable. 

In [None]:
fig = px.scatter_matrix(
    fish_market,
    dimensions=["Length1", "Height", "Width", "Weight"],
    color="Species",
    opacity=0.7,
    height=1000,
    width=1000
)

fig.update_traces(diagonal_visible=True, showupperhalf=True, showlowerhalf=True)
fig.update_layout(
    dragmode='select',
    hovermode='closest'
)

Thus, it is clear that any model we build must account for the differences in the relationships between variables across species. This is where Bayesian linear regression comes in handy. By incorporating **domain knowledge** about the relationships between variables and the differences across species, we can build a more robust and reliable model.

In [None]:
variables = ["Length1", "Height", "Width", "Weight"]
fig = make_subplots(rows=2, cols=2, subplot_titles=variables)

for i, var in enumerate(variables):
    row = i // 2 + 1
    col = i % 2 + 1
    
    for species in fish_market["Species"].unique():
        species_data = fish_market.filter(pl.col("Species") == species).to_pandas()
        
        fig.add_trace(
            go.Box(
                y=species_data[var],
                name=species,
                boxpoints='all',
                jitter=0.5,
                pointpos=0,
                marker=dict(opacity=0.5),
                line=dict(width=1),
                showlegend=(i == 0)
            ),
            row=row,
            col=col
        )

fig.update_layout(
    height=800,
    width=800,
)

The most diverse species are Bream, Whitefish, Perch, and Pike. This diversity likely makes them more versatile for sale and cooking because they come in a wide range of sizes, including different weights, widths, and heights. This variety allows for more options in preparation methods and recipes, catering to various culinary needs.

On the other hand, the Smelt is a very small fish that is typically used in specialized recipes. Its smaller size and specific preparation methods make it less versatile than the more diverse species like Bream, Whitefish, Perch, and Pike. A quick internet search will show you that they are usually fried and served as appetizers, at least in Europe.

## A non-Bayesian linear regression

Now that we have a clearer understanding of the data we're working with, let's move on to developing a predictive model. Our specific task is to **predict the weight of a fish based on its width, height, and length**. While we've chosen these particular variables for our analysis, it's important to note that different combinations of independent and dependent variables could also be used, depending on the specific requirements of the study.

The most promising approach for our task is to develop a **physical model**. This involves leveraging the inherent relationships between height, width, and weight, which are governed by physical proportions that impose natural lower and upper bounds on these variables. In a professional context, such a model would likely yield the most accurate and reliable predictions due to its basis in the physical characteristics of fish.

However, creating a detailed physical model can be quite complex. Therefore, for our initial attempt, we can use a simple **ordinary least squares (OLS)** regression to establish a relationship between the dependent variable (weight) and the independent variables (width, height, and length).

From our data exploration, we observed that weight is not linearly related to the other variables. This non-linear relationship suggests that a direct application of linear regression may not be effective. To address this issue, we often need to apply some form of data transformation to better fit the model to the data.

In this scenario, a **logarithmic transformation** of the data appears to be a suitable choice. Weight grows roughly as a power of a fish's width, height, and length (volume scales with the product of the three dimensions), so taking logs of both the response and the predictors turns that multiplicative relationship into an additive one that linear regression can fit.

### Taking the log of all covariates

In [None]:
fish_market = fish_market.with_columns([
    pl.col("Width").log().alias("log_width"),
    pl.col("Height").log().alias("log_height"), 
    pl.col("Length1").log().alias("log_length"),
    pl.col("Weight").log().alias("log_weight")
])

In [None]:
# Display the first few rows to see the transformed data
fish_market.head()
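
To see why logging both sides helps, consider a synthetic example with an assumed cubic relationship between weight and length. The constant `0.01`, the noise level, and the length range are all made up for illustration. In log-log space the relationship is a straight line whose slope is the exponent.

```python
import numpy as np

rng = np.random.default_rng(20090425)

length = rng.uniform(10.0, 50.0, size=200)   # synthetic lengths
noise = rng.normal(0.0, 0.1, size=200)       # noise on the log scale
weight = 0.01 * length**3 * np.exp(noise)    # assumed power law: weight = 0.01 * length^3

# Straight-line fit in log-log space; the slope estimates the exponent
slope, intercept = np.polyfit(np.log(length), np.log(weight), deg=1)

print(slope, intercept, np.log(0.01))
```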


### Simple OLS regression

An easy way to perform OLS regression is via the `seaborn` graphics library. The `lmplot` function creates a scatterplot of the data and fits a regression line to the data.

In [None]:
import seaborn as sns

fish_complete = fish_market.filter(pl.col("Weight") != 0)

sns.lmplot(
    data=fish_complete.to_pandas(),  # convert to pandas, which seaborn supports natively
    x="log_height",
    y="log_weight",
    hue="Species",
    col="Species",
    height=3,
    col_wrap=4,
);

The output here is purely visual, but in log space, our input variables seem linearly related to weight, so there is good reason to believe that a linear model is appropriate here.

Let's go ahead and fit a linear model to the data using PyMC.

## Multiple linear regression

We can specify a linear model to predict the weight of a fish based on its width, height, and length.

$$
\begin{aligned}
\text{priors}\\
\mu[s] &\sim \mathrm{Normal}(0, 1)\\
\beta[s, k] &\sim \mathrm{Normal}(0, 1)\\
\sigma &\sim \mathrm{HalfNormal}(1)\\
\text{linear model}\\
\mu_i &= \mu[s_i]\\
        & \quad + \beta[s_i, 0] \times \log(\text{width}_i)\\
        & \quad + \beta[s_i, 1] \times \log(\text{height}_i)\\
        & \quad + \beta[s_i, 2] \times \log(\text{length}_i)\\
\text{likelihood}\\
\log(\text{weight}_i) &\sim \mathrm{Normal}(\mu_i, \sigma)\\
\end{aligned}
$$


where $s_i$ is the species index corresponding to fish _i_:


$$
s_i \in \{ 0, 1, \ldots, {S-1} \}.
$$


In Wilkinson notation, the model can be written as:


`log(weight) ~ 0 + species + log(width):species + log(height):species + log(length):species`. 


The `0 + species` component means that we just have $S$ intercept terms, one for each species, with no global intercept. 

The remaining terms (e.g. `log(width):species`) represent an interaction between the predictor and the `species` category. So there will be one coefficient for the $\log(width)$ slope (in this case) for each species.

So, each species has its own intercept and slopes for width, height, and length. This is an **unpooled model** because we are essentially fitting a separate regression for each species!

In order to make this work, we need to encode the species as a categorical variable. We can do this using the `Categorical` type in `polars`.

In [None]:
pl.Series(fish_complete["Species"]).cast(pl.Categorical).to_physical().sort()
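
The encoding idea can also be sketched with NumPy alone. This is an alternative shown for illustration, using hypothetical labels, not the approach taken in this notebook. `np.unique` with `return_inverse=True` returns the labels and the integer codes together, so the two are guaranteed to line up. Note that the labels come out in alphabetical order, which may differ from the code order a categorical cast produces.

```python
import numpy as np

# Hypothetical species labels, one per fish
species_col = np.array(["Bream", "Pike", "Bream", "Smelt", "Pike"])

# labels[k] is the name for code k; codes[i] is the code for fish i
labels, codes = np.unique(species_col, return_inverse=True)

print(labels, codes)
```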

### Define dimensions & coordinates

Having encoded species as a categorical column, we also extract the unique species values, to be used as coordinates (labels) for the parameters in our model.

In [None]:
species_idx = pl.Series(fish_complete["Species"]).cast(pl.Categorical).to_physical().to_numpy()
species = pl.Series(fish_complete["Species"]).cast(pl.Categorical).unique().sort()

### Prior Specification

The first step in specifying a PyMC model is defining the prior distributions for each unknown parameter.

- what are the unknown parameters?
- which distributions should we use to characterize our beliefs about their values?

Let's start by specifying a prior for the intercept. Looking at the notation above, we see that we need a parameter for the intercept for each species. So we need $S$ parameters, one for each species.

We will create a vector of length $S$ for the intercepts, and specify a normal prior for each element of the vector.

In [None]:
with pm.Model() as fish_unpooled:

    mu = pm.Normal('mu', mu=0, sigma=1, shape=len(species))

Better yet, we can use **named dimensions** to specify the shape of the random variable.

In [None]:
coords = {"species": species}

with pm.Model(coords=coords) as fish_unpooled:

    mu = pm.Normal('mu', mu=0, sigma=1, dims='species')


This endows the `mu` random variable with a `species` dimension, which we can use to index into the intercepts for each species.

In [None]:
mu.shape.eval()


In [None]:
fish_unpooled.named_vars_to_dims

Next, we need to specify priors for the slopes. Looking at the notation above, we see that we need a parameter for the slope for each species and each predictor. So we need a matrix of $S \times 3$ parameters. This means that the `dims` argument should be a tuple with two elements: the name of the species dimension and the name of the predictor dimension.

We also need to specify predictor variable coordinates, which will be used to index into the slope parameters.

In [None]:
coords = {
    "species": species,
    "slopes": ["width_effect", "height_effect", "length_effect"]    
}

with pm.Model(coords=coords) as fish_unpooled:

    mu = pm.Normal('mu', mu=0, sigma=1, dims='species')

    beta = pm.Normal('beta', mu=0, sigma=1, dims=('species', 'slopes'))

beta.shape.eval()

In [None]:
fish_unpooled.coords

Finally, we need to specify a prior for the standard deviation. We will use a half-normal distribution, which is the positive half of a zero-mean normal distribution.

Since `sigma` is a scalar, it does not require `dims`.

In [None]:
with fish_unpooled:

    sigma = pm.HalfNormal('sigma', sigma=1)
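
The description of the half-normal as the positive half of a zero-mean normal can be checked with a short NumPy sketch (independent of PyMC): taking absolute values of normal draws gives non-negative values whose mean is `scale * sqrt(2 / pi)`.

```python
import numpy as np

rng = np.random.default_rng(20090425)

scale = 1.0
half_normal_draws = np.abs(rng.normal(0.0, scale, size=100_000))

sample_mean = half_normal_draws.mean()
theoretical_mean = scale * np.sqrt(2.0 / np.pi)

print(sample_mean, theoretical_mean)
```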

### Deterministic Variables

A deterministic variable is one whose values are **completely determined** by the values of its parents.

In our model, the expected weight is a deterministic function of the intercept, slopes, and their associated predictor variables.

In [None]:
log_width = fish_complete.get_column("log_width").to_numpy()
log_height = fish_complete.get_column("log_height").to_numpy()
log_length = fish_complete.get_column("log_length").to_numpy()

with fish_unpooled:

    expected_weight = (
        mu[species_idx]
        + beta[species_idx, 0] * log_width
        + beta[species_idx, 1] * log_height
        + beta[species_idx, 2] * log_length
    )
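
The indexing in this expression expands per-species parameters into per-fish values. A NumPy sketch with made-up numbers (the `toy_` names are hypothetical and unrelated to the model variables) shows the mechanics for the intercept and the width slope:

```python
import numpy as np

toy_mu = np.array([1.0, 2.0, 3.0])           # one intercept per species
toy_beta = np.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6],
                     [0.7, 0.8, 0.9]])        # rows: species, columns: predictors
toy_species_idx = np.array([0, 2, 2, 1])     # species code of each fish
toy_log_width = np.array([1.0, 1.0, 2.0, 0.5])

per_fish_intercept = toy_mu[toy_species_idx]         # one intercept per fish
per_fish_width_slope = toy_beta[toy_species_idx, 0]  # one width slope per fish
partial_mean = per_fish_intercept + per_fish_width_slope * toy_log_width

print(partial_mean)
```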

There are two types of deterministic variables in PyMC:

#### Anonymous deterministic variables

The easiest way to create a deterministic variable is to operate on or transform one or more variables in a model directly, as we have done above for `expected_weight`.

These are called *anonymous* variables because they are not given a name, as `beta` was above. We simply specified the variable as a Python (or Pytensor) expression. This is therefore the simplest way to construct a deterministic variable. The only caveat is that the values generated by anonymous deterministics are not recorded to the model output during model fitting. So, this approach is only appropriate for intermediate values in your model for which you do not need posterior estimates.

#### Named deterministic variables

To ensure that deterministic variables' values are accumulated during sampling, they should be instantiated using the **named deterministic** interface; this uses the `Deterministic` function to create the variable. Two things happen when a variable is created this way:

1. The variable is given a name (passed as the first argument)
2. The variable is registered with the model, which ensures that its values are recorded during sampling.

If we had wanted to track the expected weight, we could have done so by specifying a named deterministic variable:

```python
with fish_unpooled:

    expected_weight = pm.Deterministic(
        'expected_weight',
        mu[species_idx]
        + beta[species_idx, 0] * log_width
        + beta[species_idx, 1] * log_height
        + beta[species_idx, 2] * log_length
    )
```




### Observed Random Variables

Stochastic random variables whose values are observed are specified in the same way as unobserved ones, with the data passed as the `observed` argument. PyMC then registers the variable as observed (it appears in the model's `observed_RVs` list) and conditions on the data instead of sampling the variable.

In our model, the observed fish weights (on the log scale) are represented by an observed normal random variable.

In [None]:
log_weight = fish_complete.get_column("log_weight").to_numpy()

with fish_unpooled:

    pm.Normal(
        "log_obs",
        mu=expected_weight,
        sigma=sigma,
        observed=log_weight,
    )

## Parameter Transformation

To support efficient sampling by PyMC's MCMC algorithms, any continuous variables that are **constrained** to a sub-interval of the real line are **automatically transformed** so that their support is unconstrained. This frees sampling algorithms from having to deal with boundary constraints.

For example, if we look at the variables we have created in the model so far:

In [None]:
print(fish_unpooled.value_vars)

The model's `value_vars` attribute stores the values of each random variable actually used by the model's log-likelihood.

As the name `sigma_log__` suggests, the variable `sigma` has been log-transformed, and this is the space over which posterior sampling takes place. When a sample is drawn, the value of the transformed variable is simply back-transformed to recover the original variable.

By default, auto-transformed variables are ignored when summarizing and plotting model output, since they are not generally of interest to the user.
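
The log transform can be sketched numerically. The helper `half_normal_pdf` below is hypothetical and the calculation uses only NumPy. Every real value of `y = log(sigma)` maps back to a positive `sigma`, and the density on the unconstrained scale picks up the standard change-of-variables factor `exp(y)`. With that factor included, the transformed density still integrates to one.

```python
import numpy as np


def half_normal_pdf(sigma, scale=1.0):
    """Density of HalfNormal(scale) for sigma >= 0."""
    return np.sqrt(2.0 / np.pi) / scale * np.exp(-0.5 * (sigma / scale) ** 2)


# Unconstrained variable y = log(sigma); the back-transform is sigma = exp(y)
y = np.linspace(-20.0, 5.0, 200_001)
dy = y[1] - y[0]
sigma = np.exp(y)

# Change of variables: p_y(y) = p_sigma(exp(y)) * |d sigma / d y| = p_sigma(exp(y)) * exp(y)
density_y = half_normal_pdf(sigma) * sigma
total_mass = np.sum(density_y) * dy  # Riemann-sum approximation of the integral

print(total_mass)
```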


### The Model DAG

Having specified the model, we can visualize it using the `to_graphviz` method.

This displays the model as a directed acyclic graph (DAG). The nodes represent the variables, and the edges represent the dependencies between them. Any model specified in PyMC can be drawn this way.

In [None]:
fish_unpooled.to_graphviz()

## Markov Chain Monte Carlo Methods

Now that we understand how to specify models in PyMC, we need to address a fundamental question: how do we actually fit these models? For all but the simplest cases, the posterior distribution cannot be derived analytically. This is where Markov Chain Monte Carlo (MCMC) methods become essential.

To understand how MCMC works in practice, let's explore the fundamental concepts through concrete examples. We'll start with basic Monte Carlo integration and build up to understanding why sophisticated algorithms like those in PyMC are necessary.

### The Challenge of Bayesian Computation

Recall Bayes' theorem:

$$P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)}$$

The denominator, $P(x) = \int P(x|\theta) P(\theta) \, d\theta$, is called the marginal likelihood or evidence. For most models, this integral is intractable:

1. **High dimensionality**: With multiple parameters, we need to integrate over many dimensions
2. **Complex dependencies**: Parameters often have intricate relationships
3. **Non-standard distributions**: The posterior rarely has a recognizable form

### Monte Carlo Integration

The foundation of all Monte Carlo methods is a simple but powerful idea: we can approximate integrals using random samples. Consider estimating the expected value of a function $h(\theta)$ under a probability distribution $p(\theta)$:

$$E[h(\theta)] = \int h(\theta) p(\theta) d\theta$$

If we can draw samples $\theta_1, \theta_2, ..., \theta_n$ from $p(\theta)$, then by the Law of Large Numbers:

$$E[h(\theta)] \approx \frac{1}{n} \sum_{i=1}^n h(\theta_i)$$

This approximation becomes exact as $n \to \infty$, and we can quantify the uncertainty in our estimate using the Central Limit Theorem.

The key challenge is: **how do we draw samples from a distribution when we only know it up to a normalizing constant?**

While Monte Carlo integration is powerful, it assumes we can directly sample from the distribution of interest. Markov Chain Monte Carlo (MCMC) methods provide an elegant solution to this problem.
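
As a small illustration of the estimator above, the sketch below approximates $E[\theta^2]$ for a standard normal $\theta$, where direct sampling is possible and the exact answer is 1. The helper name `monte_carlo_expectation` and the seed are assumptions made for this example; the standard error comes from the Central Limit Theorem.

```python
import math
import random

def monte_carlo_expectation(h, sampler, n):
    """Return the sample mean of h(theta) and its CLT-based standard error."""
    values = [h(sampler()) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

estimate, std_error = monte_carlo_expectation(
    h=lambda theta: theta ** 2,
    sampler=lambda: rng.gauss(0.0, 1.0),
    n=20_000,
)

# The exact value is E[theta^2] = 1 for a standard normal.
print(f"estimate = {estimate:.3f} +/- {std_error:.3f}")
```

The standard error shrinks like $1/\sqrt{n}$, so halving it requires four times as many samples.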

### The Markov Chain Monte Carlo Solution

MCMC methods construct a Markov chain whose stationary distribution is the posterior distribution we want to sample from. The "Markov" property means that each sample depends only on the previous sample, not the entire history. The "Monte Carlo" aspect refers to the use of random sampling.

> A Markov chain is a sequence of random variables $\theta_1, \theta_2, ..., \theta_n$ that satisfy the Markov property:
>
> $$P(\theta_t | \theta_1, \theta_2, ..., \theta_{t-1}) = P(\theta_t | \theta_{t-1})$$
> 
> This means that the probability of the next state depends only on the current state, not the entire history.
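
The idea of a stationary distribution is easiest to see on a tiny discrete chain. The sketch below uses a hypothetical two-state transition matrix (the numbers are made up for illustration) and repeatedly applies it to a starting distribution; whatever the starting point, the distribution settles on the same limit.

```python
def step(distribution, transition):
    """One step of a finite Markov chain: new_j = sum_i distribution_i * P[i][j]."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]

# Hypothetical two-state chain; row i holds P(next state | current state i).
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]

distribution = [0.0, 1.0]  # start with all probability mass in the second state
for _ in range(50):
    distribution = step(distribution, transition)

# Solving pi = pi P by hand gives the stationary distribution (5/6, 1/6).
stationary = [5 / 6, 1 / 6]
print(distribution, stationary)
```

MCMC turns this around: instead of finding the stationary distribution of a given chain, it builds a chain whose stationary distribution is the posterior.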

The general MCMC algorithm follows this pattern:

1. Start at some initial parameter values $\theta^{(0)}$
2. For iteration $t = 1, 2, ...$:
   - Propose new parameter values $\theta^*$ based on current values $\theta^{(t-1)}$
   - Accept or reject the proposal based on the posterior probability
   - Set $\theta^{(t)} = \theta^*$ if accepted, otherwise $\theta^{(t)} = \theta^{(t-1)}$

Different MCMC algorithms vary in how they propose new values and decide whether to accept them. The art lies in designing proposals that efficiently explore the posterior distribution.


#### The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm generates a Markov chain whose stationary distribution is the target posterior distribution. The key insight is the **Metropolis acceptance criterion**, which ensures detailed balance:

$$A(\theta^* | \theta) = \min\left\{1, \frac{\tilde{p}(\theta^*)}{\tilde{p}(\theta)} \cdot \frac{q(\theta|\theta^*)}{q(\theta^*|\theta)}\right\}$$

where:
- $\tilde{p}(\theta)$ is the unnormalized posterior
- $q(\theta^*|\theta)$ is the proposal distribution
- $A(\theta^* | \theta)$ is the acceptance probability

This criterion automatically adjusts for asymmetric proposals and ensures the chain converges to the correct distribution.
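
The reason an unnormalized posterior is enough is that any constant factor cancels in the ratio. The following sketch checks this numerically; the function names and the deliberately asymmetric "drifting" proposal are assumptions made for illustration only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def acceptance_probability(p_tilde, q_pdf, current, proposal):
    """Metropolis-Hastings acceptance probability A(proposal | current).

    p_tilde(theta) is the unnormalized target density and
    q_pdf(to, given) is the proposal density q(to | given).
    """
    target_ratio = p_tilde(proposal) / p_tilde(current)
    hastings_ratio = q_pdf(current, proposal) / q_pdf(proposal, current)
    return min(1.0, target_ratio * hastings_ratio)

def unnormalized(theta):
    """Standard normal density without its normalizing constant."""
    return math.exp(-0.5 * theta ** 2)

def rescaled(theta):
    """The same target multiplied by an arbitrary constant."""
    return 123.0 * unnormalized(theta)

def drifting_q(to, given):
    """Hypothetical asymmetric proposal: a normal centered at given + 0.5."""
    return normal_pdf(to, given + 0.5, 1.0)

a_unnormalized = acceptance_probability(unnormalized, drifting_q, current=0.3, proposal=1.4)
a_rescaled = acceptance_probability(rescaled, drifting_q, current=0.3, proposal=1.4)
print(a_unnormalized, a_rescaled)
```

Both calls return the same acceptance probability, which is why the intractable evidence $P(x)$ never has to be computed.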

Let's implement Metropolis-Hastings with a fixed proposal distribution that does not depend on the current state (often called an independence sampler):

In [None]:
import scipy.stats as st  # used for the proposal and target densities below

def metropolis_hastings(pdf, prop_dist, init=0):
    """General Metropolis-Hastings sampler.
    
    Args:
        pdf: Target probability density function (unnormalized is ok)
        prop_dist: Proposal distribution (scipy.stats distribution)
        init: Initial value
        
    Yields:
        (sample, accepted): Current sample and whether it was accepted
    """
    current = init
    while True:
        
        # Propose a new state; the draw does not depend on `current`
        prop = prop_dist.rvs()
        
        # Acceptance ratio, including the Hastings correction q(current) / q(prop)
        p_accept = min(1, pdf(prop) / pdf(current) * 
                      prop_dist.pdf(current) / prop_dist.pdf(prop))
        
        # Accept or reject
        accept = np.random.rand() < p_accept
        if accept:
            current = prop
        yield current, accept
        
def gen_samples(draws, sampler):
    """Draw `draws` samples from a sampler and count the accepted proposals."""
    samples = np.empty(draws)
    accepts = 0
    # islice stops after exactly `draws` items without consuming an extra one
    for idx, (z, accept) in enumerate(itertools.islice(sampler, draws)):
        accepts += accept
        samples[idx] = z
    return samples, accepts

# Example: Sample from a mixture of Gaussians
def target_pdf(x):
    """Mixture of two Gaussians"""
    return 0.3 * st.norm.pdf(x, -2, 0.8) + 0.7 * st.norm.pdf(x, 3, 1.2)

# Use a wide normal as proposal
proposal_dist = st.norm(0, 10)

# Generate samples
samples, accepts = gen_samples(10_000, metropolis_hastings(target_pdf, proposal_dist))

# Visualize results
t = np.linspace(-6, 8, 500)
pdf_values = [target_pdf(x) for x in t]

hist = go.Histogram(
    x=samples,
    histnorm='probability density',
    name='Samples',
    marker=dict(color='rgba(0, 0, 255, 0.7)')
)

pdf_curve = go.Scatter(
    x=t, y=pdf_values,
    mode='lines',
    name='True PDF',
    line=dict(color='orange', width=2)
)

go.Figure(
    data=[hist, pdf_curve]
).update_layout(
    title=f'Metropolis-Hastings: {samples.size:,d} samples with {100 * accepts / samples.size:.1f}% acceptance rate',
    xaxis_title='Value',
    yaxis_title='Density',
    width=750,
    height=400
)

### Random Walk Metropolis-Hastings

The general Metropolis-Hastings algorithm can be inefficient if the proposal distribution is poorly chosen. A popular special case is **Random Walk Metropolis**, where the proposal is centered at the current position:

$$\theta^* \sim \mathcal{N}(\theta, \sigma^2)$$

This symmetric proposal simplifies the acceptance ratio because the proposal terms cancel out:

$$A(\theta^* | \theta) = \min\left\{1, \frac{\tilde{p}(\theta^*)}{\tilde{p}(\theta)}\right\}$$

The key hyperparameter is the step size $\sigma$:
- **Too small**: Chain takes tiny steps and explores slowly (high acceptance, slow mixing)
- **Too large**: Many proposals are rejected (low acceptance, slow mixing)
- **Just right**: Balance between acceptance rate and step size (typically 20-50% acceptance)

Let's implement and compare different step sizes:

In [None]:
from plotly.subplots import make_subplots  # used for the comparison figures below

def random_walk_metropolis(pdf, step_size, init=0):
    """Random walk Metropolis algorithm.
    
    Args:
        pdf: Target probability density function (unnormalized is ok)
        step_size: Standard deviation of proposal distribution
        init: Initial value
        
    Yields:
        (sample, accepted): Current sample and whether it was accepted
    """
    current = init
    while True:
        # Random walk proposal
        prop = current + np.random.normal(0, step_size)
        
        # Simple acceptance ratio (proposal terms cancel)
        p_accept = min(1, pdf(prop) / pdf(current))
        
        # Accept or reject
        accept = np.random.rand() < p_accept
        if accept:
            current = prop
        yield current, accept

# Compare different step sizes
fig = make_subplots(rows=3, cols=1, 
                    subplot_titles=['Small Step Size (σ=0.1)', 
                                   'Medium Step Size (σ=8.0)', 
                                   'Large Step Size (σ=70.0)'])

step_sizes = [0.1, 8.0, 70.0]

for i, step_size in enumerate(step_sizes, 1):
    # Generate samples
    samples, accepts = gen_samples(10_000, random_walk_metropolis(target_pdf, step_size))
    
    # Calculate t and pdf values for the line
    t = np.linspace(samples.min(), samples.max(), 500)
    pdf_values = [target_pdf(x) for x in t]
    
    # Add histogram
    fig.add_trace(
        go.Histogram(
            x=samples,
            histnorm='probability density',
            marker=dict(color='rgba(0, 0, 255, 0.7)'),
            name=f"Samples (σ={step_size})",
            showlegend=False
        ),
        row=i, col=1
    )
    
    # Add PDF line
    fig.add_trace(
        go.Scatter(
            x=t,
            y=pdf_values,
            mode='lines',
            line=dict(color='orange', width=2),
            name=f"True PDF",
            showlegend=False
        ),
        row=i, col=1
    )
    
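    # Add annotation with acceptance rate, positioned in the subplot's
    # domain coordinates (0 to 1) rather than in data coordinates
    axis_id = "" if i == 1 else str(i)
    fig.add_annotation(
        text=f"Acceptance rate: {100 * accepts / samples.size:.1f}%",
        xref=f"x{axis_id} domain", yref=f"y{axis_id} domain",
        x=0.95, y=0.95,
        xanchor="right", yanchor="top",
        showarrow=False,
        bgcolor="white",
        bordercolor="black",
        borderwidth=1
    )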

fig.update_layout(height=900, width=700, showlegend=False)
fig.update_xaxes(title_text="Value", row=3, col=1)
fig.update_yaxes(title_text="Density")
fig.show()

### Visualizing MCMC Behavior

To better understand how different step sizes affect the sampler's behavior, let's look at trace plots that show the evolution of the chain over time:

In [None]:
n_steps = 1000

fig = make_subplots(rows=3, cols=1, 
                    subplot_titles=['Small Step Size (σ=0.1): Slow Exploration', 
                                   'Medium Step Size (σ=8.0): Good Mixing', 
                                   'Large Step Size (σ=70.0): Many Rejections'])

for i, step_size in enumerate(step_sizes, 1):
    sampler = random_walk_metropolis(target_pdf, step_size)
    trace = []
    for _ in range(n_steps):
        sample, _ = next(sampler)
        trace.append(sample)
    
    fig.add_trace(
        go.Scatter(
            y=trace,
            mode='lines',
            line=dict(width=1),
            name=f"σ={step_size}",
            showlegend=False
        ),
        row=i, col=1
    )
    
    fig.add_hline(y=-2, line_dash="dash", line_color="red", 
                  row=i, col=1, annotation_text="Mode 1")
    fig.add_hline(y=3, line_dash="dash", line_color="red", 
                  row=i, col=1, annotation_text="Mode 2")

fig.update_layout(height=900, width=800, showlegend=False)
fig.update_xaxes(title_text="Iteration", row=3, col=1)
fig.update_yaxes(title_text="Value")
fig.show()

### Modern MCMC: The No-U-Turn Sampler (NUTS)

While Metropolis-Hastings and its variants are foundational, they suffer from a critical limitation: they explore parameter space through a **random walk**. This means proposals are made blindly, without any information about where high-probability regions might lie. It's like searching for treasure in a dark room by taking random steps—you'll eventually find it, but it's terribly inefficient.

#### The Random Walk Problem

Random walk samplers face several challenges:
- **Inefficient exploration**: Most proposals in high dimensions get rejected
- **Slow mixing**: Takes many steps to move between distant regions of high probability  
- **Correlation issues**: When parameters are correlated, the sampler zigzags inefficiently through parameter space
- **Tuning nightmare**: Step size requires careful manual tuning—too small and you explore slowly, too large and most proposals get rejected
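
The "slow mixing" point can be made quantitative with the lag-1 autocorrelation of a chain: values near 1 mean that consecutive draws carry almost no new information. The sketch below is a simplified stand-in for the sampler used earlier; it targets a standard normal instead of the mixture, uses only the standard library, and the step sizes are arbitrary choices for illustration.

```python
import math
import random

def random_walk_chain(step_size, n_draws, seed=0):
    """Random walk Metropolis targeting a standard normal density."""
    rng = random.Random(seed)
    current = 0.0
    chain = []
    for _ in range(n_draws):
        proposal = current + rng.gauss(0.0, step_size)
        # log of the density ratio p(proposal) / p(current)
        log_ratio = 0.5 * (current ** 2 - proposal ** 2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        chain.append(current)
    return chain

def lag1_autocorrelation(chain):
    """Sample correlation between consecutive draws."""
    n = len(chain)
    mean = sum(chain) / n
    total_var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1))
    return cov / total_var

autocorr = {
    step_size: lag1_autocorrelation(random_walk_chain(step_size, 5_000))
    for step_size in (0.1, 2.5, 50.0)
}
for step_size, value in autocorr.items():
    print(f"step size {step_size:>5}: lag-1 autocorrelation = {value:.3f}")
```

Both the very small and the very large step size leave consecutive draws almost perfectly correlated, for opposite reasons: tiny moves in one case, repeated rejections in the other.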

#### Enter Gradient Information: A Guided Search

Modern algorithms like Hamiltonian Monte Carlo (HMC) and its extension, the No-U-Turn Sampler (NUTS), revolutionize MCMC by using **gradient information**—the derivative of the log-posterior with respect to parameters. This is like having a compass that always points toward regions of higher probability.

Instead of stumbling randomly, NUTS:
1. **Follows the gradient contours**: Uses derivatives to understand the "shape" of the posterior
2. **Makes intelligent proposals**: Proposes distant points along paths of high probability
3. **Maintains detailed balance**: Still guarantees correct sampling from the posterior

The fundamental difference is stark: random walk methods explore blindly through trial and error, accepting or rejecting moves based solely on posterior probability. Gradient-based methods, by contrast, use the gradient of the log-posterior to simulate a trajectory that sweeps through high-probability regions, so a single proposal can land far from the current point and still be accepted with high probability.
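
To show what using gradient information looks like in code, here is a minimal sketch of plain Hamiltonian Monte Carlo on a one-dimensional standard normal target. It is not NUTS (which additionally chooses the trajectory length adaptively) and it is not how PyMC implements its sampler; the step size, number of leapfrog steps, and function names are assumptions picked for this toy problem.

```python
import math
import random

def potential(theta):
    """Negative log density of a standard normal, up to a constant."""
    return 0.5 * theta ** 2

def potential_gradient(theta):
    """Derivative of the potential with respect to theta."""
    return theta

def leapfrog(theta, momentum, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    momentum -= 0.5 * step_size * potential_gradient(theta)  # half step
    for i in range(n_steps):
        theta += step_size * momentum                         # full position step
        if i < n_steps - 1:
            momentum -= step_size * potential_gradient(theta) # full momentum step
    momentum -= 0.5 * step_size * potential_gradient(theta)  # final half step
    return theta, momentum

def hamiltonian_monte_carlo(n_draws, step_size=0.2, n_steps=10, seed=0):
    """Return the draws and the fraction of accepted trajectories."""
    rng = random.Random(seed)
    theta = 0.0
    draws, accepted = [], 0
    for _ in range(n_draws):
        momentum = rng.gauss(0.0, 1.0)  # fresh momentum for every trajectory
        energy_start = potential(theta) + 0.5 * momentum ** 2
        theta_new, momentum_new = leapfrog(theta, momentum, step_size, n_steps)
        energy_end = potential(theta_new) + 0.5 * momentum_new ** 2
        # Metropolis correction for the integrator's discretization error
        if rng.random() < math.exp(min(0.0, energy_start - energy_end)):
            theta = theta_new
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_draws

hmc_draws, hmc_accept_rate = hamiltonian_monte_carlo(5_000)
print(f"acceptance rate: {hmc_accept_rate:.2f}")
```

Each proposal is the end point of a simulated trajectory, so it can be far from the current value while the acceptance rate stays high, which is the behaviour the random walk sampler cannot achieve.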

#### Key Advantages of NUTS

1. **Automatic tuning**: NUTS adapts its parameters during warmup, eliminating most manual tuning
2. **Efficient exploration**: A single trajectory can carry the sampler across much of the posterior, rather than relying on thousands of small random steps
3. **Handles correlations naturally**: Trajectories follow the shape of the posterior instead of fighting against its correlation structure
4. **Scales to high dimensions**: Performance doesn't degrade catastrophically as dimensionality increases
5. **Fewer samples needed**: A few hundred NUTS draws can provide as large an effective sample size as tens of thousands of random walk draws

#### When NUTS Dominates

- **High-dimensional parameter spaces** (dozens to thousands of parameters)
- **Complex posterior geometries** with strong correlations, ridges, or funnel shapes
- **Hierarchical models** where parameters at different levels interact
- **Any model with continuous parameters** where gradients can be computed

#### The Computational Trade-off

NUTS requires computing gradients at each step, making individual iterations more expensive than simple random walk methods. However, this cost is usually more than offset by needing far fewer draws to reach the same effective sample size. It's like paying more for a sports car that gets you there far sooner than walking.

#### Implementation in PyMC

PyMC uses NUTS as the default sampler for continuous variables. When you call `pm.sample()`, PyMC:

1. **Automatically differentiates** your model to compute gradients (via automatic differentiation)
2. **Runs adaptation** to learn the geometry of your posterior (mass matrix and step size)
3. **Generates efficient samples** using momentum to explore the posterior
4. **Provides diagnostics** to assess convergence and sampling quality

The beauty of PyMC is that all this sophisticated machinery is hidden behind a simple interface. However, understanding why gradient-based methods are superior helps us appreciate why modern Bayesian computation is so powerful and why NUTS has become the workhorse of practical Bayesian analysis.

### Sampling with NUTS

The `pm.sample()` function is the main interface for sampling from a model. It has a number of optional arguments that allow us to customize the sampling process.

The most important arguments are:

- `draws`: The number of samples to draw
- `tune`: The number of samples to use for tuning
- `chains`: The number of chains to run

We will use the default values for these arguments, which are:

- `draws`: 1000
- `tune`: 1000
- `chains`: max(2, `cores`), where `cores` defaults to min(4, number of CPU cores)

In [None]:
with fish_unpooled:
    trace_unpooled = pm.sample()

The `plot_trace` function in ArviZ creates diagnostic plots showing both the trace (time series of samples) and marginal posterior distributions for each parameter.

Traceplots are useful for evaluating the performance of our MCMC sampling. In these plots, we aim to see a "fuzzy caterpillar" pattern on the right side, which indicates that the chains are **mixing well** and exploring the parameter space effectively. This is evidence to suggest the chains have converged (to something!) and are providing a reasonable representation of the posterior distribution.

In [None]:
az.plot_trace(trace_unpooled, var_names='sigma');

Let's look at the posterior distributions for the intercepts. Vector-valued parameters such as these are conveniently summarized with a forest plot.

In [None]:
az.plot_forest(trace_unpooled, var_names='mu', transform=np.exp);

The intercepts look small (even on the natural scale), which seems odd. But recall how intercepts are interpreted: they are the expected value of the outcome when all predictors are zero. In this case, that means when the logs of the width, height, and length are all zero, that is, when each measurement equals one unit. This is awkward from an interpretive standpoint.

How could we improve this?

Give it a try, and re-run the improved model.

In [None]:
log_width = (fish_complete.get_column("log_width") - fish_complete.get_column("log_width").mean()).to_numpy()
log_height = (fish_complete.get_column("log_height") - fish_complete.get_column("log_height").mean()).to_numpy()
log_length = (fish_complete.get_column("log_length") - fish_complete.get_column("log_length").mean()).to_numpy()

with pm.Model(coords=coords) as fish_unpooled:

    # priors
    mu = pm.Normal("mu", sigma=1.0, dims="species")
    beta = pm.Normal("beta", sigma=0.5, dims=("slopes", "species"))

    # linear regression
    expected_weight = (
        mu[species_idx]
        + beta[0, species_idx] * log_width
        + beta[1, species_idx] * log_height
        + beta[2, species_idx] * log_length
    )
    # observational noise
    sigma = pm.HalfNormal("sigma", 1.0)

    # likelihood
    log_obs = pm.Normal(
        "log_obs", mu=expected_weight, sigma=sigma, observed=log_weight
    )

    # sampling
    trace_unpooled = pm.sample()

Now we have meaningful intercepts: for each species, the expected log-weight of a fish whose log-width, log-height, and log-length are at their sample means.
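
The effect of centering can be checked on a toy least-squares fit. The sketch below uses made-up log-measurements (not the fish data) and a hypothetical helper `fit_line`; it shows that centering the predictor leaves the slope unchanged and moves the intercept to the mean of the outcome.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Made-up values for illustration only.
toy_log_length = [2.9, 3.1, 3.3, 3.4, 3.6, 3.8]
toy_log_weight = [4.6, 5.2, 5.7, 6.1, 6.6, 7.3]

raw_intercept, raw_slope = fit_line(toy_log_length, toy_log_weight)

toy_mean = sum(toy_log_length) / len(toy_log_length)
toy_centered = [value - toy_mean for value in toy_log_length]
centered_intercept, centered_slope = fit_line(toy_centered, toy_log_weight)

print(f"raw:      intercept = {raw_intercept:.2f}, slope = {raw_slope:.2f}")
print(f"centered: intercept = {centered_intercept:.2f}, slope = {centered_slope:.2f}")
```

The raw intercept is an extrapolation to a log-length of zero, far outside the data, whereas the centered intercept describes a typical observation.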

In [None]:
az.plot_trace(trace_unpooled, var_names='mu', transform=np.exp);

When we have vector-valued parameters, a forest plot is convenient for visualizing them.

In [None]:
# The slopes are exponents on the log-log scale, so they are shown untransformed
az.plot_forest(trace_unpooled, var_names="beta");

In [None]:
# sigma is already stored on its original (back-transformed) scale
az.plot_trace(trace_unpooled, var_names="sigma");

## Predicting out-of-sample

In statistical workflows, a common task is to make predictions using new, unseen data, often referred to as "out-of-sample" data. In PyMC, the most straightforward way to do this is the `Data` container. It registers the data used to fit the model as a named part of the model and lets you replace that data later on.

### Splitting Data into Training and Test Sets

To illustrate this functionality, let's randomly select 90% of our data as the training dataset for the model, while reserving the remaining 10% as the test data. This test data will be unseen by the model during the training process, allowing us to evaluate its performance on new, previously unseen data when making predictions.

By following this approach, you can effectively train your model on a subset of the available data and then assess its predictive capabilities on the held-out test data, mimicking real-world scenarios where predictions need to be made on new, unobserved data points.

In [None]:
# Add the row index before sampling so test rows keep their original indices
fish_test = (
    fish_complete.with_row_index()
    .sample(fraction=0.1, seed=1)
)
test_idx = fish_test.get_column("index")
fish_train = (
    fish_complete.with_row_index()
    .filter(~pl.col("index").is_in(test_idx))
)

Since the dataset changed compared to the previous model, we also have to redefine our coordinates:

In [None]:
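# Sorted species labels define the encoding, so index k always refers to species[k]
species = fish_train.get_column("Species").unique().sort()
species_lookup = {name: k for k, name in enumerate(species.to_list())}
species_idx = np.array(
    [species_lookup[name] for name in fish_train.get_column("Species").to_list()]
)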

coords = {
    "slopes": ["width_effect", "height_effect", "length_effect"],
    "species": species,
    "obs_idx": range(fish_train.height),
}

In PyMC, we can use `pm.Data` objects for our data. It allows you to define data as a symbolic node in the model that you can later switch out for other data. 

Let's re-write our model to use `pm.Data` objects for our data.

In [None]:
with pm.Model(coords=coords) as fish_unpooled_oos:
    # data
    log_width = pm.Data(
        "log_width", 
        (fish_train.get_column("log_width") - fish_train.get_column("log_width").mean()).to_numpy(),
        dims="obs_idx"
    )
    log_height = pm.Data(
        "log_height",
        (fish_train.get_column("log_height") - fish_train.get_column("log_height").mean()).to_numpy(),
        dims="obs_idx"
    )
    log_length = pm.Data(
        "log_length",
        (fish_train.get_column("log_length") - fish_train.get_column("log_length").mean()).to_numpy(),
        dims="obs_idx"
    )
    log_weight = pm.Data(
        "log_weight",
        fish_train.get_column("log_weight").to_numpy(),
        dims="obs_idx"
    )
    s = pm.Data("species_idx", species_idx, dims="obs_idx")

    # priors
    mu = pm.Normal("mu", sigma=1.0, dims="species")
    beta = pm.Normal("beta", sigma=0.5, dims=("slopes", "species"))

    # linear regression
    expected_weight = (
        mu[s]
        + beta[0, s] * log_width
        + beta[1, s] * log_height
        + beta[2, s] * log_length
    )
    # observational noise
    sigma = pm.HalfNormal("sigma", 1.0)

    # likelihood
    log_obs = pm.Normal(
        "log_obs", mu=expected_weight, sigma=sigma, observed=log_weight, dims="obs_idx"
    )

In [None]:
pm.model_to_graphviz(fish_unpooled_oos)

In [None]:
with fish_unpooled_oos:
    trace_unpooled_oos = pm.sample()

Checking the traceplots:

In [None]:
az.plot_trace(trace_unpooled_oos, transform=np.exp);

Now we want to see how this model would work in production: given some fish morphometrics, can we accurately predict the weight of the fish?

To do this, we use `set_data` to change the inputs from the training set to the test set. First, let's query our test data:

In [None]:
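# Encode the species with the same label order as the training coordinates
species_lookup = {name: k for k, name in enumerate(species.to_list())}
species_idx_test = pl.Series(
    "species_idx",
    [species_lookup[name] for name in fish_test.get_column("Species").to_list()],
    dtype=pl.Int64,
)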

species_idx_test

Now we apply these values to the `Data` nodes in the model.

Note that we are shifting the input variables using the training set means. You always want to use the same transformation on the test set as you did on the training set!

In [None]:
with fish_unpooled_oos:
    pm.set_data(
        coords={"obs_idx": range(len(fish_test))},
        new_data={
            "log_height": fish_test.get_column("log_height").to_numpy() - fish_train.get_column("log_height").mean(),
            "log_length": fish_test.get_column("log_length").to_numpy() - fish_train.get_column("log_length").mean(), 
            "log_width": fish_test.get_column("log_width").to_numpy() - fish_train.get_column("log_width").mean(),
            "log_weight": np.zeros(len(fish_test)),
            "species_idx": species_idx_test.cast(pl.UInt32).to_numpy(),
        },
    )

### Use updated values to predict outcomes

We want to use our fitted model to predict the weight of the fish in the test set. This involves simulating observations from the model for each fish in the test set using the posterior distribution of the parameters.

Fortunately, simulating data from the model is a natural component of the Bayesian modelling framework. Recall, from the discussion on prediction, the posterior predictive distribution:

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta) \, p(\theta|y) \, d\theta$$

Here, $\tilde{y}$ represents some hypothetical new data that would be expected, taking into account the posterior uncertainty in the model parameters. 
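
The integral is approximated by simulation: for every posterior draw of the parameters, simulate one new observation from the likelihood. The sketch below does this by hand for a normal likelihood, using hypothetical posterior draws generated on the spot (in practice they would come from the fitted trace); all names and numbers are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical posterior draws for a mean and a noise scale.
posterior_mu = [rng.gauss(5.0, 0.1) for _ in range(4_000)]
posterior_sigma = [abs(rng.gauss(0.3, 0.02)) for _ in range(4_000)]

# One simulated observation per posterior draw: y_tilde ~ Normal(mu, sigma).
y_tilde = [rng.gauss(m, s) for m, s in zip(posterior_mu, posterior_sigma)]

parameter_sd = statistics.stdev(posterior_mu)
predictive_sd = statistics.stdev(y_tilde)
print(f"sd of posterior mean draws: {parameter_sd:.3f}")
print(f"sd of predictive draws:     {predictive_sd:.3f}")
```

The predictive draws are more spread out than the draws of the mean alone, because they combine parameter uncertainty with observation noise.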

Sampling from the posterior predictive distribution is easy in PyMC. The `sample_posterior_predictive` function draws posterior predictive samples from all of the observed variables in the model.

In [None]:
with fish_unpooled_oos:
    pm.sample_posterior_predictive(
        trace_unpooled_oos,
        predictions=True,
        extend_inferencedata=True,
    )

How good are these predictions? Glad you asked. Remember that our data are not _really_ out-of-sample; we just held them out from our original dataset, so we can compare our predictions to the true weights. This takes a single call in ArviZ (we exponentiate the predicted log weights to compare them to the true weights):

In [None]:
az.plot_posterior(
    trace_unpooled_oos.predictions,
    ref_val=fish_test.get_column("Weight").to_list(),
    transform=np.exp,
);

So the true weights all fell within the predictive distributions. Not all of them are inside the 94% highest density interval (the ArviZ default), but none of the predictions is wildly off.

## From predictions to business insights

Recall from the introduction that there are different price tiers for weights, and those tiers can get _really_ expensive, so we want to know the probability of an item exceeding each threshold:

- $> 250$
- $> 500$
- $> 750$
- $> 1000$

Since we have calculated posterior distributions, we have the ability to compute these probabilities for any new fish we observe.
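
The calculation itself is just a proportion: the probability of exceeding a threshold is estimated by the fraction of predictive draws at or above it. The sketch below illustrates this with hypothetical predictive draws for a single fish (simulated here, not taken from the model); the helper name `exceedance_probabilities` is an assumption for this example.

```python
import math
import random

def exceedance_probabilities(draws, tiers):
    """Fraction of predictive draws at or above each threshold."""
    n = len(draws)
    return {tier: sum(d >= tier for d in draws) / n for tier in tiers}

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical predictive draws of log-weight for one fish, converted to grams.
weight_draws = [math.exp(rng.gauss(math.log(600.0), 0.15)) for _ in range(5_000)]

price_tiers = [250, 500, 750, 1000]
tier_probs = exceedance_probabilities(weight_draws, price_tiers)
for tier, prob in tier_probs.items():
    print(f"P(weight >= {tier}) = {prob:.2f}")
```

The cells that follow apply the same idea to the model's predictions for every fish in the test set.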


In [None]:
# Extract predictions to a NumPy array and convert from log-weight to weight
predictions = (
    np.exp(
        az.extract(trace_unpooled_oos.predictions)
        .to_array()
        .to_numpy()
        .squeeze()
    )
)

Now we can see what proportion are above $250$ grams.

In [None]:
threshold = 250
(predictions >= threshold).mean(axis=1).round(2)

If we take something like a 0.5 probability as being "above", we can make a decision about each:

In [None]:
(predictions >= threshold).mean(axis=1).round(2) > 0.5


But remember that there are four thresholds $(250, 500, 750, 1000)$, so let's generalize this approach to the other three thresholds. We'll also plot these probabilities of being above thresholds.

In [None]:
import matplotlib.pyplot as plt  # used for the figure title below

predictions = np.exp(trace_unpooled_oos.predictions)

axes = az.plot_posterior(predictions, color="k")

for k, threshold in enumerate([250, 500, 750, 1000]):
    probs_above_threshold = (predictions >= threshold).mean(dim=("chain", "draw"))

    for i, ax in enumerate(axes.ravel()):
        ax.axvline(threshold, color=f"C{k}")
        _, pdf = az.kde(
            predictions["log_obs"].sel(obs_idx=i).stack(sample=("chain", "draw")).data
        )
        ax.text(
            x=threshold - 35,
            y=pdf.max() / 2,
            s=f">={threshold}",
            color=f"C{k}",
            fontsize="16",
            fontweight="bold",
        )
        ax.text(
            x=threshold - 20,
            y=pdf.max() / 2.3,
            s=f"{probs_above_threshold.sel(obs_idx=i)['log_obs'].data * 100:.0f}%",
            color=f"C{k}",
            fontsize="16",
            fontweight="bold",
        )
        ax.set_title(f"New fish\n{i}", fontsize=16)
        ax.set(xlabel="Weight\n", ylabel="Plausible values")
plt.suptitle(
    "Probability of weighing more than thresholds", fontsize=26, fontweight="bold"
);

---




In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w