# From Linear Regression to Hierarchical Models

## Comprehensive Overview

In this session, we explore regression modeling within the Bayesian framework, progressing from foundational linear regression through generalized linear models to hierarchical models. Each step adds mathematical complexity, but more importantly it adds flexibility in how we can model data.

Statistical modeling is fundamentally about learning from data—extracting patterns, quantifying relationships, and making predictions. The Bayesian approach transforms how we think about these tasks by treating all unknown quantities as random variables with probability distributions. This shift from point estimates to probability distributions provides a richer, more nuanced understanding of uncertainty that is essential for good decision-making.

We begin with linear regression, the workhorse of statistical modeling. Despite its apparent simplicity, linear regression in the Bayesian framework offers profound insights into parameter uncertainty, prior specification, and model checking. Through a detailed case study of predicting fish weights for an e-commerce company, we'll explore real-world complications like multicollinearity, non-linear relationships, and the need for robust inference. This example has been carefully chosen to illustrate common challenges that arise in practice.

Building on this foundation, we expand to generalized linear models (GLMs), which extend the linear regression framework to handle outcomes that don't follow normal distributions. Count data, binary outcomes, proportions, and other non-normal responses are ubiquitous in real applications. GLMs provide a unified framework for modeling these diverse data types through the elegant machinery of link functions and exponential family distributions. We'll explore this through a detailed analysis of count data, learning about Poisson regression, diagnosing overdispersion, and implementing solutions through negative binomial models.

The session culminates with hierarchical models, arguably one of the most important developments in modern statistical practice. Real-world data often has natural grouping structures—students within schools, patients within hospitals, measurements within individuals, products within categories. Hierarchical models respect these structures while sharing information across groups in a principled way. This 'partial pooling' approach represents a sophisticated compromise between treating all data as identical (complete pooling) and treating each group as completely independent (no pooling). Through an extensive analysis of radon contamination in Minnesota homes, we'll see how hierarchical models automatically adapt to varying sample sizes, providing better estimates for data-poor groups while respecting the individuality of data-rich groups.

Throughout this session, we emphasize not just the mechanics of model fitting, but the art of model building—how to start simple, diagnose problems, and incrementally build complexity. We'll see how posterior predictive checks help us identify model inadequacies, how to interpret parameters in different model formulations, and how to make principled decisions about model complexity.

By the end of this session, you'll have not just knowledge of these techniques, but an understanding of when and how to apply them effectively to your own data analysis challenges.

In [None]:
import numpy as np
import pymc as pm
import arviz as az
import polars as pl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import matplotlib.pyplot as plt
import seaborn as sns
import platform

# Visual style
az.style.use('arviz-doc')
pio.templates.default = 'plotly_white'

# Plotly defaults
px.defaults.template = 'plotly_white'
px.defaults.width = 900
px.defaults.height = 500
_base = pio.templates['plotly_white']
_tmpl = go.layout.Template(_base)
_tmpl.layout.hovermode = 'x unified'
pio.templates['hoverx'] = _tmpl
pio.templates.default = 'plotly_white+hoverx'

# Reproducibility
RANDOM_SEED = 20090425
RNG = np.random.default_rng(RANDOM_SEED)


## Part 1: Comprehensive Bayesian Linear Regression

Linear regression is perhaps the most fundamental tool in statistical modeling. At its core, it seeks to understand how one or more predictor variables relate to an outcome variable. While conceptually simple, linear regression provides a rich framework for exploring fundamental concepts in Bayesian inference.

In the classical (frequentist) approach, we find the 'best' line through our data, typically using least squares. The Bayesian approach fundamentally changes how we think about this problem. Instead of seeking a single best line, we consider all possible lines, weighted by their plausibility given the data and our prior beliefs.

This shift in perspective—from point estimates to probability distributions—has profound implications:

1. **Uncertainty Quantification**: We don't just estimate that a slope is 0.73; we can say there's a 95% probability it's between 0.65 and 0.81. This is crucial when decisions depend on understanding the range of plausible values.

2. **Prior Information**: We can incorporate existing knowledge. If previous studies suggest an effect is small, we can encode this through an informative prior, leading to more stable estimates.

3. **Probabilistic Predictions**: Predictions come with full probability distributions. Instead of saying 'the predicted weight is 250g ± 20g', we have the entire predictive distribution, allowing us to answer questions like 'what's the probability the weight exceeds 300g?'

4. **Natural Regularization**: Priors act as regularizers, automatically balancing model complexity with data fit. This helps prevent overfitting, especially with limited data.

5. **Hierarchical Extensions**: The Bayesian framework naturally extends to hierarchical models, as we'll see later in this session.
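
To make the first and third points concrete, here is a minimal sketch that uses simulated draws in place of a real posterior. The numbers (a slope centered at 0.73, a predictive distribution centered at 250 g) are illustrative assumptions that echo the examples above, not results from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for posterior draws of a slope (assumed, not from a fitted model)
slope_draws = rng.normal(loc=0.73, scale=0.04, size=4000)

# Central 95% credible interval: the 2.5th and 97.5th percentiles of the draws
lower, upper = np.percentile(slope_draws, [2.5, 97.5])
print(f"95% credible interval for the slope: [{lower:.2f}, {upper:.2f}]")

# Stand-in for posterior predictive draws of a weight in grams (assumed)
weight_draws = rng.normal(loc=250.0, scale=20.0, size=4000)

# Probability that the weight exceeds 300 g: the fraction of draws above it
prob_above_300 = (weight_draws > 300.0).mean()
print(f"P(weight > 300 g) = {prob_above_300:.3f}")
```

Once we have draws, interval and tail-probability questions both reduce to simple summaries of an array. That is the pattern we will reuse with real PyMC output later on.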

## Example: Fish Weight Prediction

In this lesson, we'll imagine we are working in the data science team of an e-commerce company. In particular, we sell really good and fresh fish to our clients (mainly fancy restaurants). 

When we ship our products, there is a very important piece of information we need: the weight of the fish. This is important for two reasons: 

1. Because we _bill_ our clients according to weight. 

2. Because the company that delivers the fish to our clients has different price tiers for weights, and those tiers can get _really_ expensive. So we want to know the probability of an item being above that line. In other words, estimating uncertainty is important here!

![](images/weighingfish.jpg)


The problem we face is that we purchase our fish in bulk. This means we only know the total weight of our entire order, but we don't have the weights of the individual fish. You might think the obvious solution is simply to weigh each fish one by one.

However, this approach has significant drawbacks. Manually weighing each fish is costly, requires a lot of time, and demands substantial labor. This process is inefficient and impractical for our needs.

Given these challenges, we need to explore alternative solutions. 

### A solution

While researching the problem, we discovered that our wholesale supplier has detailed information on the size of each individual fish, including their length, height, and width. Since it is infeasible to weigh individual fish, the supplier uses a **camera** to record the size of each fish. 

However, the company used to weigh each fish manually, until the costs became prohibitive. As a result, we have a valuable **training dataset** consisting of different types of fish with their accurately measured weights.

![](images/fishvideo.png)

### Exploratory data analysis

Let's import the data and take a look at it.

In [None]:
try:
    fish_market = pl.read_csv("../data/fish-market.csv")
except FileNotFoundError:
    DATA_URL = "https://raw.githubusercontent.com/pymc-labs/ccc-workshop/main/data/"
    fish_market = pl.read_csv(DATA_URL + "fish-market.csv")
fish_market.schema

We have collected 159 measurements, and all columns in our dataset have the appropriate data types.

For each observation, the dataset includes the following information: the species of the fish, its weight, height, and width, as well as three distinct length measurements. You might be wondering why we have three different measurements for the fish's length. Let's delve into some summary statistics to better understand the data and its significance.

In [None]:
fish_market.null_count()

No missing values, which is nice.

Next let's peek at some summary statistics:

In [None]:
fish_market.describe()

Things to note:

- Although there are no missing data, there are some zero-weight fish! Either the fish was below the minimum weight for the scale, or there was a mistake during data collection.
- The standard deviations of the columns are very high, especially for weight.
- There are three columns for length, which is interesting. We will explore this further.

In [None]:
numeric_data = fish_market.drop("Species")
corr_matrix = numeric_data.corr().to_numpy().round(2)

fig = px.imshow(
    corr_matrix,
    x=numeric_data.columns,
    y=numeric_data.columns,
    zmin=-1,
    zmax=1,
    color_continuous_scale='RdBu_r',  
    aspect='auto'
)

for i in range(len(numeric_data.columns)):
    for j in range(len(numeric_data.columns)):
        fig.add_annotation(
            x=i,
            y=j,
            text=str(corr_matrix[j, i]),
            showarrow=False,
            font=dict(size=16, color='black')
        )

fig.update_layout(
    coloraxis_colorbar_title='Correlation',
    width=800,
    height=800
)


The three length measurements are highly correlated with each other. This means they essentially carry the same information. Without additional details to distinguish among them, we should arbitrarily choose one measurement and discard the other two. Keeping all three would be redundant and unnecessary since they do not provide unique information.

There is nothing inherently Bayesian about this step. The concept of *multicollinearity* is a fundamental concern in both Bayesian and frequentist statistics. In essence, if you include multiple variables that convey similar information in your regression model, you will end up with very unstable parameter estimates. This redundancy does not improve your model's predictive power and can, in fact, lead to misleading results. Thus, it is crucial to identify and address multicollinearity to maintain the robustness and reliability of your model.
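
The instability is easy to see in a small simulation. The sketch below is an illustration on made-up data, not the fish dataset: it repeatedly fits ordinary least squares with two predictors and records how much the first slope varies from one simulated dataset to the next. The helper `slope_spread` and all parameter values are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_reps = 100, 500

def slope_spread(noise_sd):
    """Std. dev. of the first OLS slope across simulated datasets."""
    slopes = []
    for _ in range(n_reps):
        x1 = rng.normal(size=n_obs)
        # x2 is x1 plus noise: a small noise_sd means strong collinearity
        x2 = x1 + rng.normal(scale=noise_sd, size=n_obs)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n_obs)
        X = np.column_stack([np.ones(n_obs), x1, x2])
        coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes.append(coefs[1])
    return np.std(slopes)

spread_collinear = slope_spread(noise_sd=0.05)  # corr(x1, x2) close to 1
spread_distinct = slope_spread(noise_sd=2.0)    # x2 carries its own information
print(f"slope spread, near-duplicate predictors: {spread_collinear:.2f}")
print(f"slope spread, distinct predictors:       {spread_distinct:.2f}")
```

With near-duplicate predictors the data cannot tell the two slopes apart, so the individual estimates swing widely even though their sum is well determined.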

In [6]:
fish_market = fish_market.drop(["Length2", "Length3"])

## Visual data exploration

It's always a good idea to plot your data! Plotly's `scatter_matrix` function is a great way to visualize the relationships between variables in your dataset. This function creates a matrix of scatterplots, with each variable plotted against every other variable. 

In [None]:
fig = px.scatter_matrix(
    fish_market,
    dimensions=fish_market.columns,
    color=None,
    opacity=0.7,
    height=1000,
    width=1000
)

fig.update_traces(diagonal_visible=True, showupperhalf=True, showlowerhalf=True)
fig.update_layout(
    dragmode='select',
    hovermode='closest'
)

All variables exhibit linear relationships with each other, with one notable exception: weight. Weight appears to increase exponentially in relation to the other variables. However, this exponential growth is not limitless; it plateaus due to a natural upper limit on weight.

Additionally, we observe several trends within the data that may indicate differences in how these variables interact across various species. These trends suggest that the relationships between the variables are not uniform across all species, potentially due to unique biological or ecological factors influencing each species.

So, let's break down the data by species and see if we can identify any patterns.

In [None]:
fig = px.scatter_matrix(
    fish_market,
    dimensions=["Length1", "Height", "Width", "Weight"],
    color="Species",
    opacity=0.7,
    height=1000,
    width=1000
)

fig.update_traces(diagonal_visible=True, showupperhalf=True, showlowerhalf=True)
fig.update_layout(
    dragmode='select',
    hovermode='closest'
)

Thus, it is clear that any model we build must account for the differences in the relationships between variables across species. This is where Bayesian linear regression comes in handy. By incorporating **domain knowledge** about the relationships between variables and the differences across species, we can build a more robust and reliable model.

In [None]:
variables = ["Length1", "Height", "Width", "Weight"]
fig = make_subplots(rows=2, cols=2, subplot_titles=variables)

for i, var in enumerate(variables):
    row = i // 2 + 1
    col = i % 2 + 1
    
    for species in fish_market["Species"].unique():
        species_data = fish_market.filter(pl.col("Species") == species).to_pandas()
        
        fig.add_trace(
            go.Box(
                y=species_data[var],
                name=species,
                boxpoints='all',
                jitter=0.5,
                pointpos=0,
                marker=dict(opacity=0.5),
                line=dict(width=1),
                showlegend=(i == 0)
            ),
            row=row,
            col=col
        )

fig.update_layout(
    height=800,
    width=800,
)

The most diverse species are Bream, Whitefish, Perch, and Pike. This diversity likely makes them more versatile for sale and cooking because they come in a wide range of sizes, including different weights, widths, and heights. This variety allows for more options in preparation methods and recipes, catering to various culinary needs.

On the other hand, the Smelt is a very small fish that is typically used in specialized recipes. Its smaller size and specific preparation methods make it less versatile than the more diverse species like Bream, Whitefish, Perch, and Pike. A quick internet search will show you that they are usually fried and served as appetizers, at least in Europe.

## A non-Bayesian linear regression

Now that we have a clearer understanding of the data we're working with, let's move on to developing a predictive model. Our specific task is to **predict the weight of a fish based on its width, height, and length**. While we've chosen these particular variables for our analysis, it's important to note that different combinations of independent and dependent variables could also be used, depending on the specific requirements of the study.

The most promising approach for our task is to develop a **physical model**. This involves leveraging the inherent relationships between height, width, and weight, which are governed by physical proportions that impose natural lower and upper bounds on these variables. In a professional context, such a model would likely yield the most accurate and reliable predictions due to its basis in the physical characteristics of fish.

However, creating a detailed physical model can be quite complex. Therefore, for our initial attempt, we can use a simple **ordinary least squares (OLS)** regression to establish a relationship between the dependent variable (weight) and the independent variables (width, height, and length).

From our data exploration, we observed that weight is not linearly related to the other variables. This non-linear relationship suggests that a direct application of linear regression may not be effective. To address this issue, we often need to apply some form of data transformation to better fit the model to the data.

In this scenario, a **logarithmic transformation** of the data is a suitable choice. A fish's weight scales roughly with its volume, so it grows approximately as a power of its width, height, and length rather than linearly. Taking the log of both weight and the size measurements turns such a power-law relationship into a straight line, making it appropriate for linear regression analysis.
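
As a quick check of why logging both sides helps, here is a sketch on simulated data. It assumes a relationship of the form weight = a · length^b with multiplicative noise, and the values of `a_true` and `b_true` are made up. In log-log space that relationship becomes log(weight) = log(a) + b · log(length), which is a straight line.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated fish: weight = a * length**b with multiplicative noise (assumed values)
a_true, b_true = 0.01, 3.0
length = rng.uniform(10.0, 60.0, size=200)
weight = a_true * length**b_true * np.exp(rng.normal(scale=0.1, size=200))

# A straight-line fit in log-log space recovers the exponent as the slope
slope, intercept = np.polyfit(np.log(length), np.log(weight), deg=1)
print(f"estimated exponent b: {slope:.2f}")
print(f"estimated scale a:    {np.exp(intercept):.4f}")
```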

### Taking the log of all variables

In [10]:
fish_market = fish_market.with_columns([
    pl.col("Width").log().alias("log_width"),
    pl.col("Height").log().alias("log_height"), 
    pl.col("Length1").log().alias("log_length"),
    pl.col("Weight").log().alias("log_weight")
])

In [None]:
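# Display the first few rows to see the transformed data
print(fish_market.head())

# Check for nulls, and for -inf values produced by log(0) on zero weights
print(fish_market.null_count())
fish_market.select(pl.col("log_weight").is_infinite().sum().alias("n_infinite_log_weight"))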


### Simple OLS regression

An easy way to perform OLS regression is via the `seaborn` graphics library. The `lmplot` function creates a scatterplot of the data and fits a regression line to the data.

In [None]:
import seaborn as sns

fish_complete = fish_market.filter(pl.col("Weight") != 0)

sns.lmplot(
    data=fish_complete,
    x="log_height",
    y="log_weight",
    hue="Species",
    col="Species",
    height=3,
    col_wrap=4,
);

The output here is purely visual, but in log space, our input variables seem linearly related to weight, so there is good reason to believe that a linear model is appropriate here.

Let's go ahead and fit a linear model to the data using PyMC.

### Baseline Model

Let's start with a very simple "null" model: just a global mean with no predictors. 
$$
\begin{aligned}
\log(\text{weight}) &\sim \mathrm{Normal}(\mu, \sigma)\\
\mu &\sim \mathrm{Normal}(0, 1)\\
\sigma &\sim \mathrm{HalfNormal}(1)\\
\end{aligned}
$$

This corresponds to `log(weight) ~ 1` in [Wilkinson notation](https://uk.mathworks.com/help/stats/wilkinson-notation.html).

In [None]:
with pm.Model() as fish_simple:

    # Prior
    mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma", 1.0)

    # Likelihood
    pm.Normal(
        "log_weight",
        mu=mu,
        sigma=sigma,
        observed=fish_complete["log_weight"].to_numpy(),
    )

pm.model_to_graphviz(fish_simple)

Now to fit the model:

In [None]:
with fish_simple:
    trace_simple = pm.sample()

In [None]:
az.summary(trace_simple, round_to=2)

We will dig into model checking and diagnostics in a later section (which will explain most of the values in the `summary` table), but for now we can plot the posterior distribution of the model parameters and do some informal, visual checks.

In [None]:
az.plot_trace(trace_simple);

Traceplots are useful for evaluating the performance of our MCMC sampling. In these plots, we aim to see a "fuzzy caterpillar" pattern on the right side, which indicates that the chains are **mixing well** and exploring the parameter space effectively. This is evidence to suggest the chains have converged (to something!) and are providing a reasonable representation of the posterior distribution.

In addition to traceplots, **rank plots** serve as another diagnostic tool for assessing the quality of your MCMC samples. Rank plots display the ranks of sampled values for each parameter, and we look for the histograms in these plots to be approximately uniform. If one chain samples some values significantly more than the other chains, then the ranks of its samples will be markedly higher or lower than other chains and the histograms won't be uniform. Uniform histograms suggest that the sampler has performed well, meaning the samples are not exhibiting degeneracies. 
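
To build intuition for what a rank plot shows, here is a hand-rolled NumPy sketch of the idea on simulated chains. It is a simplified illustration, not ArviZ's implementation, and the helper `rank_histograms` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_chains, n_draws, n_bins = 4, 1000, 10

def rank_histograms(chains):
    """Rank all draws jointly, then bin each chain's ranks."""
    ranks = chains.ravel().argsort().argsort().reshape(chains.shape)
    edges = np.linspace(0, chains.size, n_bins + 1)
    return np.array([np.histogram(r, bins=edges)[0] for r in ranks])

# Four chains drawn from the same distribution (a stand-in for good mixing)
good = rng.normal(size=(n_chains, n_draws))
# Same draws, but the first chain is stuck in a shifted region
bad = good.copy()
bad[0] += 1.5

hist_good = rank_histograms(good)
hist_bad = rank_histograms(bad)
print("well-mixed chain 0:", hist_good[0])
print("shifted chain 0:   ", hist_bad[0])
```

For the well-mixed chains, each bin holds roughly the same number of draws. For the shifted chain, the counts pile up in the highest ranks, which is the non-uniform pattern to look out for.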

In [None]:
az.plot_rank(trace_simple);

## Interpreting parameters

This model is very simple, so the posterior means of `mu` and `sigma` are approximately the sample mean and standard deviation of the log-weights.

Looking back at our trace plot, the posterior uncertainty doesn't seem large. But remember that we're on the log scale, so the results are easier to interpret on the original scale.

Fortunately, `plot_trace` has a `transform` argument we can use.

In [None]:
az.plot_trace(trace_simple, transform=np.exp, var_names="mu");

So there is a reasonable amount of uncertainty in our estimate of the mean weight of a fish, which is perhaps surprising given we have pooled all the data. 
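
One caveat when back-transforming: exponentiating the mean of the log-weights gives the geometric mean (which is the median under a log-normal model), and for right-skewed data this is smaller than the arithmetic mean. The sketch below illustrates the difference on simulated weights. The values of `mu_log` and `sigma_log` are assumptions, not estimates from the fish data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated right-skewed weights in grams (assumed log-scale mean and sd)
mu_log, sigma_log = 5.5, 1.0
weights = np.exp(rng.normal(mu_log, sigma_log, size=100_000))

geometric_mean = np.exp(np.log(weights).mean())  # what exp(mu) estimates
median = np.median(weights)
arithmetic_mean = weights.mean()
lognormal_mean = np.exp(mu_log + sigma_log**2 / 2)  # log-normal mean formula

print(f"exp(mean of logs):   {geometric_mean:.0f}")
print(f"median:              {median:.0f}")
print(f"arithmetic mean:     {arithmetic_mean:.0f}")
print(f"exp(mu + sigma^2/2): {lognormal_mean:.0f}")
```

So `exp(mu)` is best read as a typical (median) weight. If the arithmetic mean is what you need, it depends on `sigma` as well.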

Now let's look at **model fit**. We will explore model checking in detail later in the course, but for now, we can use a simple technique that we have already seen in the previous section: posterior predictive checks.

Posterior predictive checks (PPCs) are a great way to validate a model. The idea is to generate data from the model using parameters from the posterior distribution and compare these samples to the observed data.
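
Before using PyMC's built-in tooling, it may help to see the mechanics in plain NumPy. The sketch below is a rough illustration under stated assumptions: the "observed" data are a simulated two-group mixture, and the "posterior draws" of `mu` and `sigma` are crude stand-ins rather than MCMC output.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for observed data: a two-group mixture (assumed, not the fish data)
observed = np.concatenate([rng.normal(3.0, 0.3, 60), rng.normal(6.5, 0.5, 100)])

# Stand-in for posterior draws of (mu, sigma) from a single-Normal model
n_draws = 500
mu_draws = rng.normal(observed.mean(), observed.std() / np.sqrt(observed.size), n_draws)
sigma_draws = np.abs(rng.normal(observed.std(), 0.1, n_draws))

# One simulated dataset per posterior draw, each the same size as the data
simulated = rng.normal(mu_draws[:, None], sigma_draws[:, None], (n_draws, observed.size))

# Summary statistic: share of values near the overall mean, where the
# bimodal data have a gap but a single Normal puts most of its mass
def near_center(x):
    return np.mean(np.abs(x - observed.mean()) < 0.5, axis=-1)

obs_stat = near_center(observed)
sim_stat = near_center(simulated)
print(f"observed share near center:  {obs_stat:.2f}")
print(f"simulated share near center: {sim_stat.mean():.2f} (mean over draws)")
print(f"P(simulated <= observed):    {(sim_stat <= obs_stat).mean():.3f}")
```

When the simulated statistic almost never looks like the observed one, the model is missing a feature of the data, in this case the two separate groups.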

Let's generate these simulated datasets now.

In [None]:
with fish_simple:
    ppc = pm.sample_posterior_predictive(trace_simple, extend_inferencedata=True)

ax = az.plot_ppc(trace_simple)
ax.set_xlabel("log_weight");

The data are clearly heterogeneous, as evidenced by the multiple peaks in the log-weight, but our model fails to capture them accurately. This discrepancy suggests that the model is struggling to fit the data properly. Consequently, the model resorts to increasing the posterior uncertainty and observational noise (`sigma`). Essentially, the model compensates for its inability to accurately represent the data by broadening its predictions, sacrificing precision for coverage.

### Adding predictors to our model

It is time to introduce predictors to our model, and see how much they improve prediction.


$$
\begin{aligned}
\text{priors}\\
\mu[s] &\sim \mathrm{Normal}(0, 1)\\
\beta[s, k] &\sim \mathrm{Normal}(0, 0.5)\\
\sigma &\sim \mathrm{HalfNormal}(1)\\
\text{linear model}\\
\mu_i &= \mu[s_i]\\
        & \quad + \beta[s_i, 0] \times \log(\text{width}_i)\\
        & \quad + \beta[s_i, 1] \times \log(\text{height}_i)\\
        & \quad + \beta[s_i, 2] \times \log(\text{length}_i)\\
\text{likelihood}\\
\log(\text{weight}_i) &\sim \mathrm{Normal}(\mu_i, \sigma)\\
\end{aligned}
$$


where $s_i$ is the species index corresponding to fish _i_:


$$
s_i \in \{ 0, 1, \ldots, {S-1} \}.
$$


In Wilkinson notation, the model can be written as:


`log(weight) ~ 0 + species + log(width):species + log(height):species + log(length):species`. 


The `0 + species` component means that we just have $S$ intercept terms, one for each species, with no global intercept. 

The remaining terms (e.g. `log(width):species`) represent an interaction between the predictor and the `species` category. So there will be one coefficient for the $\log(width)$ slope (in this case) for each species.

So, each species has its own intercept and slopes for width, height, and length. This is an **unpooled model** because we are essentially fitting a separate regression for each species!
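
The indexing notation $\mu[s_i]$ and $\beta[s_i, k]$ maps directly onto integer-array indexing. Here is a small NumPy sketch with made-up parameter values and four hypothetical fish, showing how each observation picks up its own species' intercept and slopes:

```python
import numpy as np

# Hypothetical values for 3 species and 3 slopes (width, height, length)
mu = np.array([5.0, 4.0, 6.0])                 # one intercept per species
beta = np.array([[1.0, 0.5, 1.5],              # shape (species, slopes)
                 [0.8, 0.7, 1.4],
                 [1.2, 0.4, 1.6]])

# Four fish: species index and (made-up) log-measurements
s = np.array([0, 2, 1, 0])
log_width = np.array([0.1, -0.2, 0.0, 0.3])
log_height = np.array([0.2, 0.1, -0.1, 0.0])
log_length = np.array([-0.1, 0.0, 0.2, 0.1])

# Integer-array indexing picks each fish's own intercept and slopes
expected_log_weight = (
    mu[s]
    + beta[s, 0] * log_width
    + beta[s, 1] * log_height
    + beta[s, 2] * log_length
)
print(expected_log_weight)
```

The same expression works unchanged when `mu` and `beta` are PyMC random variables instead of fixed arrays.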

In order to make this work, we need to encode the species as a categorical variable. We can do this using the `Categorical` type in `polars`.

In [None]:
pl.Series(fish_complete["Species"]).cast(pl.Categorical).to_physical().sort()

### Define dimensions & coordinates

Having encoded species as a categorical column, we also extract the unique species values, to be used as coordinates (labels) for the parameters in our model.

In [21]:
species_idx = pl.Series(fish_complete["Species"]).cast(pl.Categorical).to_physical().to_numpy()
species = pl.Series(fish_complete["Species"]).cast(pl.Categorical).unique().sort()
coords = {
    "slopes": ["width_effect", "height_effect", "length_effect"],
    "species": species,
}

We will also make use of `Data` containers to include the data explicitly in the model. This will be useful later, when we want to predict out-of-sample.

In [22]:
with pm.Model(coords=coords) as fish_unpooled:
    # data
    log_width = pm.Data("log_width", fish_complete.get_column("log_width").to_numpy())
    log_height = pm.Data("log_height", fish_complete.get_column("log_height").to_numpy())
    log_length = pm.Data("log_length", fish_complete.get_column("log_length").to_numpy())
    log_weight = pm.Data("log_weight", fish_complete.get_column("log_weight").to_numpy())
    s = pm.Data("species_idx", species_idx)

    # priors
    mu = pm.Normal("mu", sigma=1.0, dims="species")

    # each species gets a slope for each predictor thx to `dims`:
    beta = pm.Normal("beta", sigma=0.5, dims=("species", "slopes"))

    # linear regression
    expected_weight = (
        mu[s]
        + beta[s, 0] * log_width
        + beta[s, 1] * log_height
        + beta[s, 2] * log_length
    )

    # observational noise
    sigma = pm.HalfNormal("sigma", 1.0)

    # likelihood
    pm.Normal(
        "log_obs",
        mu=expected_weight,
        sigma=sigma,
        observed=log_weight,
    )

It's always helpful to plot the model before fitting it. This can help you catch errors in the model specification, and also give you a sense of what the model is doing.

In [None]:
pm.model_to_graphviz(fish_unpooled)

In [None]:
with fish_unpooled:
    trace_unpooled = pm.sample()
    # Posterior predictive
    pm.sample_posterior_predictive(trace_unpooled, extend_inferencedata=True)

Inspecting the posterior parameter estimates, notably the intercepts:

In [None]:
az.plot_trace(trace_unpooled, var_names='mu', transform=np.exp);

The intercepts look small (even on the original scale), which seems odd. But recall how intercepts are interpreted: they are the expected value of the outcome when all predictors are zero. In this case, that means when the logs of the width, height, and length are all zero, i.e., for a fish whose width, height, and length all equal 1 in their measurement units. This is awkward from an interpretive standpoint.

How could we improve this?

Give it a try, and re-run the improved model.

In [26]:
# Write your answer here

Now we have meaningful intercepts -- the expected weight of a fish with average width, height, and length for each species.

In [None]:
az.plot_trace(trace_unpooled, var_names='mu', transform=np.exp);

When we have vector-valued parameters, a forest plot is convenient for visualizing them.

In [None]:
az.plot_forest(trace_unpooled, var_names="beta", transform=np.exp);

In [None]:
az.plot_trace(trace_unpooled, var_names="sigma", transform=np.exp);

There is a good sign here: the posterior uncertainty around `sigma` is much lower than before, i.e., we picked up much more information about the fish weights. But did this improve our posterior predictions?

In [None]:
with fish_unpooled:
    pm.sample_posterior_predictive(trace_unpooled, extend_inferencedata=True)
ax = az.plot_ppc(trace_unpooled)
ax.set_xlabel("log_obs");

## Predicting out-of-sample

In statistical workflows, a common task is to make predictions using new, unseen data, often referred to as "out-of-sample" data. In PyMC, the most straightforward way to do this is with the `Data` container. It registers the data used to fit the model as named variables in the model, so that you can swap in new values later.

#### Splitting Data into Training and Test Sets

To illustrate this functionality, let's randomly select 90% of our data as the training dataset for the model, while reserving the remaining 10% as the test data. This test data will be unseen by the model during the training process, allowing us to evaluate its performance on new, previously unseen data when making predictions.

By following this approach, you can effectively train your model on a subset of the available data and then assess its predictive capabilities on the held-out test data, mimicking real-world scenarios where predictions need to be made on new, unobserved data points.

In [31]:
# Add the row index *before* sampling so that it identifies rows of fish_complete
fish_indexed = fish_complete.with_row_index()
fish_test = fish_indexed.sample(fraction=0.1, seed=1)
test_idx = fish_test.get_column("index").to_list()
fish_train = fish_indexed.filter(~pl.col("index").is_in(test_idx))

Since the dataset changed compared to the previous model, we also have to redefine our coordinates:

In [32]:
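# Sort the species names and index each fish by its position in that sorted list,
# so that the integer codes line up with the coordinate labels
species = fish_train.get_column("Species").unique().sort()
species_to_idx = {name: i for i, name in enumerate(species.to_list())}
species_idx = np.array([species_to_idx[name] for name in fish_train.get_column("Species")])
coords["species"] = species.to_list()
coords["obs_idx"] = range(fish_train.height)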

In [None]:
with pm.Model(coords=coords) as fish_unpooled_oos:
    # data
    log_width = pm.Data(
        "log_width", 
        (fish_train.get_column("log_width") - fish_train.get_column("log_width").mean()).to_numpy(),
        dims="obs_idx"
    )
    log_height = pm.Data(
        "log_height",
        (fish_train.get_column("log_height") - fish_train.get_column("log_height").mean()).to_numpy(),
        dims="obs_idx"
    )
    log_length = pm.Data(
        "log_length",
        (fish_train.get_column("log_length") - fish_train.get_column("log_length").mean()).to_numpy(),
        dims="obs_idx"
    )
    log_weight = pm.Data(
        "log_weight",
        fish_train.get_column("log_weight").to_numpy(),
        dims="obs_idx"
    )
    species_idx_ = pm.Data("species_idx", species_idx, dims="obs_idx")

    # priors
    mu = pm.Normal("mu", sigma=1.0, dims="species")
    beta = pm.Normal("beta", sigma=0.5, dims=("slopes", "species"))

    # linear regression
    expected_weight = (
        mu[species_idx_]
        + beta[0, species_idx_] * log_width
        + beta[1, species_idx_] * log_height
        + beta[2, species_idx_] * log_length
    )
    # observational noise
    sigma = pm.HalfNormal("sigma", 1.0)

    # likelihood
    log_obs = pm.Normal(
        "log_obs", mu=expected_weight, sigma=sigma, observed=log_weight, dims="obs_idx"
    )

    # sampling
    trace_unpooled_oos = pm.sample()
    pm.sample_posterior_predictive(trace_unpooled_oos, extend_inferencedata=True)

In [None]:
pm.model_to_graphviz(fish_unpooled_oos)

Checking the traceplots:

In [None]:
az.plot_trace(trace_unpooled_oos, transform=np.exp);

Now we want to see how this model would work in production: given some fish morphometrics, can we accurately predict the weight of the fish?

To do this, we use `set_data` to change the inputs from the training set to the test set. First, let's query our test data:

In [None]:
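# Encode the test species with the same name-to-index mapping used for training
species_to_idx = dict(zip(fish_train.get_column("Species").to_list(), species_idx.tolist()))
species_idx_test = pl.Series([species_to_idx[name] for name in fish_test.get_column("Species")], dtype=pl.Int64)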

species_idx_test

Now we apply these values to the `Data` nodes in the model.

Note that we are centering the input variables using the training-set means. You always want to apply the same transformation to the test set as you did to the training set!
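
A small NumPy sketch shows why the source of the centering constant matters. The data are simulated and the variable names are hypothetical. Centering the test inputs on their own mean would shift them by a constant relative to what the model saw during training.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up log-lengths for a training set and a smaller test set
log_length_train = rng.normal(3.2, 0.4, size=140)
log_length_test = rng.normal(3.2, 0.4, size=15)

# Statistics are computed once, on the training set only
train_mean = log_length_train.mean()

centered_train = log_length_train - train_mean
centered_test = log_length_test - train_mean                 # reuse the training mean
wrongly_centered = log_length_test - log_length_test.mean()  # shifts the inputs

# The two versions differ by a constant offset, which would bias predictions
offset = centered_test - wrongly_centered
print(f"offset introduced by re-centering on the test set: {offset[0]:.3f}")
```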

In [37]:
with fish_unpooled_oos:
    pm.set_data(
        coords={"obs_idx": range(len(fish_test))},
        new_data={
            "log_height": fish_test.get_column("log_height").to_numpy() - fish_train.get_column("log_height").mean(),
            "log_length": fish_test.get_column("log_length").to_numpy() - fish_train.get_column("log_length").mean(), 
            "log_width": fish_test.get_column("log_width").to_numpy() - fish_train.get_column("log_width").mean(),
            "log_weight": np.zeros(len(fish_test)),
            "species_idx": species_idx_test.cast(pl.UInt32).to_numpy(),
        },
    )

We now call `sample_posterior_predictive` once again, but this time we specify `predictions=True`, since these are not posterior predictive checks. The results will be stored in a separate `predictions` group on the trace.

### Use updated values to predict outcomes

In [None]:
with fish_unpooled_oos:
    pm.sample_posterior_predictive(
        trace_unpooled_oos,
        predictions=True,
        extend_inferencedata=True,
    )

How good are these predictions? Glad you asked. Remember that our data are not _really_ out-of-sample; we just cut them out from our original dataset, so we can compare our predictions to the true weights. This is a simple line of code in ArviZ (we just exponentiate the predicted log weights to compare them to the true weights):

In [None]:
az.plot_posterior(
    trace_unpooled_oos.predictions,
    ref_val=fish_test.get_column("Weight").to_list(),
    transform=np.exp,
);

So the true weights all fell within the predictive distributions: not all within the plotted HDI (94% by default in ArviZ), but none of the predictions was extremely far off.

## Exercise: Refitting the model

Given the success of the model, you go back and try to fit it to data collected by another vendor, only to find that the predictions aren't nearly as good!

Frustrated, you go back to the drawing board... they deal with the same type of fish, but what's wrong with their data?

One of their colleagues mentions something about not having used the same equipment to weigh the fish, because the "old manager always tried to cut costs".
They used a much cheaper scale ...

With this information in hand, make the appropriate modifications to the model to accommodate the new data.

Here is the data:

In [None]:
try:
    new_fish = pl.read_csv("../data/new_fish.csv")
except FileNotFoundError:
    DATA_URL = "https://raw.githubusercontent.com/pymc-labs/ccc-workshop/main/data/"
    new_fish = pl.read_csv(DATA_URL + "new_fish.csv")
new_fish.describe()

Try to diagnose the issue and propose a new model (a slight variation) that may help in dealing with the properties of this new dataset better!

In [None]:
# Write your answer here

## From predictions to business insights

Recall from the introduction that there are different price tiers for weights, and those tiers can get _really_ expensive, so we want to know the probability of an item being above any threshold.

- $> 250$
- $> 500$
- $> 750$
- $> 1000$

Since we have calculated posterior distributions, we have the ability to compute these probabilities for any new fish we observe.


In [42]:
# Extract predictions to a NumPy array, back on the weight scale
predictions = (
    np.exp(
        az.extract(trace_unpooled_oos.predictions)
        .to_array()
        .to_numpy()
        .squeeze()
    )
)

Now we can see what proportion are above $250$ grams.

In [None]:
threshold = 250
(predictions >= threshold).mean(axis=1).round(2)

If we take something like a 0.5 probability as being "above", we can make a decision about each:

In [None]:
(predictions >= threshold).mean(axis=1).round(2) > 0.5


But remember that there are four thresholds $(250, 500, 750, 1000)$, so let's generalize this approach to the other three thresholds. We'll also plot these probabilities of being above thresholds.

In [None]:
predictions = np.exp(trace_unpooled_oos.predictions)

axes = az.plot_posterior(predictions, color="k")

for k, threshold in enumerate([250, 500, 750, 1000]):
    probs_above_threshold = (predictions >= threshold).mean(dim=("chain", "draw"))

    for i, ax in enumerate(axes.ravel()):
        ax.axvline(threshold, color=f"C{k}")
        _, pdf = az.kde(
            predictions["log_obs"].sel(obs_idx=i).stack(sample=("chain", "draw")).data
        )
        ax.text(
            x=threshold - 35,
            y=pdf.max() / 2,
            s=f">={threshold}",
            color=f"C{k}",
            fontsize="16",
            fontweight="bold",
        )
        ax.text(
            x=threshold - 20,
            y=pdf.max() / 2.3,
            s=f"{probs_above_threshold.sel(obs_idx=i)['log_obs'].data * 100:.0f}%",
            color=f"C{k}",
            fontsize="16",
            fontweight="bold",
        )
        ax.set_title(f"New fish\n{i}", fontsize=16)
        ax.set(xlabel="Weight\n", ylabel="Plausible values")
plt.suptitle(
    "Probability of weighing more than thresholds", fontsize=26, fontweight="bold"
);
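
For a purely numeric view of the same quantities, the threshold probabilities can be collected into a single array by broadcasting. The sketch below is only an illustration: `fake_predictions` is a synthetic stand-in for the `(n_fish, n_samples)` array of predicted weights built earlier, and the three hypothetical fish and their weights are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the posterior predictive draws: 3 hypothetical fish,
# 4000 draws each, on the weight scale (grams). These values are made up.
log_centers = np.log([200.0, 600.0, 1100.0])[:, None]
fake_predictions = np.exp(rng.normal(loc=log_centers, scale=0.2, size=(3, 4000)))

thresholds = np.array([250, 500, 750, 1000])

# Broadcast to shape (n_fish, n_thresholds, n_samples), then average over samples
prob_above = (fake_predictions[:, None, :] >= thresholds[None, :, None]).mean(axis=-1)

# One row per fish, one column per threshold
print(prob_above.round(2))
```

Each row is non-increasing from left to right, since a fish that exceeds a higher threshold necessarily exceeds every lower one.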

---
## Part 2: Generalized Linear Models (GLMs)

In the previous section, we worked with linear regression where the outcome variable was continuous and could be reasonably modeled with a normal distribution. However, many real-world outcomes don't fit this comfortable framework. Consider these common scenarios:

- **Count data**: Number of accidents at an intersection, disease cases in a population, customer complaints per day, goals scored in a match
- **Binary outcomes**: Customer churn (yes/no), disease presence (positive/negative), loan default (yes/no), email clicked (yes/no)
- **Proportions**: Market share, exam pass rates, survey response rates, survival probabilities
- **Strictly positive continuous**: Waiting times, insurance claim amounts, reaction times, rainfall amounts
- **Ordinal data**: Customer satisfaction (1-5 stars), pain levels, educational attainment levels

Trying to force these into a linear regression framework with normal errors often fails spectacularly:
- Predictions can be nonsensical (negative counts, probabilities > 1)
- Variance assumptions are violated (variance often depends on mean)
- Inference is invalid (confidence intervals include impossible values)
- Relationships are misspecified (effects are often multiplicative, not additive)

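Generalized Linear Models (GLMs) elegantly solve these problems by extending the linear regression framework in two key ways: the likelihood no longer has to be normal, and a link function connects the linear predictor to the mean of that likelihood.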

## Poisson regression: Unbounded count data

This model is inspired by [a project by Ian Ozsvald](http://ianozsvald.com/2016/05/07/statistically-solving-sneezes-and-sniffles-a-work-in-progress-report-at-pydatalondon-2016/), which is concerned with understanding the various effects of external environmental factors upon the allergic sneezing of a test subject.

We're going to work with simpler data than the original study, which will allow you to clearly see the modeling stakes.

In [None]:
try:
    sneezes = pl.read_csv("../data/poisson_sneeze.csv")
except FileNotFoundError:
    DATA_URL = "https://raw.githubusercontent.com/pymc-labs/ccc-workshop/main/data/"
    sneezes = pl.read_csv(DATA_URL + "poisson_sneeze.csv")
sneezes

+ The subject sneezes N times per day, recorded as `nsneeze (int)`. The data are aggregated per day, to yield a total count of sneezes on that day.
+ The subject may or may not drink alcohol during that day, recorded as `alcohol (boolean)`
+ The subject may or may not take an antihistamine medication during that day, recorded as `meds (boolean)`

We assume that sneezing occurs at some baseline rate, which increases if an antihistamine is not taken, and further increases if alcohol is consumed.

### Visualize the data and set up the model

In [None]:
fig = make_subplots(rows=2, cols=2, 
                   subplot_titles=["meds=0, alcohol=0", "meds=1, alcohol=0",
                                  "meds=0, alcohol=1", "meds=1, alcohol=1"])

for i, alc in enumerate([0, 1]):
    for j, med in enumerate([0, 1]):
        filtered_data = sneezes.filter((pl.col('alcohol') == alc) & (pl.col('meds') == med))
        counts = filtered_data.get_column('nsneeze').value_counts().sort('nsneeze')
        
        fig.add_trace(
            go.Bar(x=counts.get_column('nsneeze'), y=counts.get_column('count'), marker_color='blue'),
            row=i+1, col=j+1
        )

fig.update_layout(
    height=600, 
    width=800,
    showlegend=False,
    title_text="Distribution of Sneezes by Medication and Alcohol"
)
fig.update_xaxes(title_text="Number of Sneezes")
fig.update_yaxes(title_text="Count")


The usual way of thinking about data coming from a Poisson distribution is as the number of occurrences of an event in a *given timeframe*; here, the *number of sneezes per day*. The intensity parameter $\lambda$ specifies how many occurrences we expect.

A nice property of the Poisson is that it is defined by a single parameter, $\lambda$, which is both its mean and its variance and can be interpreted as the rate of events per unit of time. Here, if we inferred $\lambda = 2.8$, that would mean the subject is thought to sneeze about 2.8 times per day (which also implies a variance of $2.8$).

In statistical terms, that means our likelihood is 

$$ Y_{\text{sneeze}} \sim \mathrm{Poisson}(\lambda)$$

> The Poisson probability mass function is:
>
> $$ P(Y=k|\lambda) = \frac{\lambda^k e^{-\lambda}}{k!} $$
>
> where $k$ is the number of occurrences (sneezes in our case) and $\lambda$ is the rate parameter.
>
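
As a quick sanity check of the mass function and of the "mean equals variance" property, here is a minimal sketch using only the standard library. The helper `poisson_pmf` is a name introduced for this illustration, and the sum is truncated at $k = 59$ on the assumption that the remaining mass is negligible for $\lambda = 2.8$.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Poisson probability mass function, written exactly as in the formula above."""
    return lam**k * exp(-lam) / factorial(k)


lam = 2.8  # the example rate used in the text
ks = range(60)  # truncation point; the tail beyond it is negligible here
probs = [poisson_pmf(k, lam) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

# The probabilities sum to ~1, and mean and variance both come out ~lam
print(round(total, 6), round(mean, 6), round(var, 6))
```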


Now, we need a prior on $\lambda$ ...

This is where the regression component comes in: remember that we want to infer the effect of meds and alcohol on the number of sneezes. So, we can use the usual linear regression formula we are familiar with: 

$$\lambda = \alpha + \beta_{\text{meds}} * \text{meds} + \beta_{\text{alcohol}} * \text{alcohol}$$

We will specify **weakly-informative priors** on the model latent variables.

$$
\begin{aligned}
\alpha &\sim \mathrm{Normal}(0, 5)\\
\beta_{\text{meds}}, \beta_{\text{alcohol}} &\sim \mathrm{Normal}(0, 1)\\
\lambda &= \alpha + \beta_{\text{meds}} * \text{meds} + \beta_{\text{alcohol}} * \text{alcohol}\\
Y_{\text{sneeze}} &\sim \mathrm{Poisson}(\lambda)\\
\end{aligned}
$$

In [None]:
COORDS = {
    "regressor": ["meds", "alcohol"], 
    "obs_idx": range(len(sneezes))
}

M, A, S = sneezes.select(["meds", "alcohol", "nsneeze"]).to_numpy().T

with pm.Model(coords=COORDS) as m_sneeze:
    # weakly informative Normal Priors
    a = pm.Normal("intercept", mu=0, sigma=5)
    b = pm.Normal("slopes", mu=0, sigma=1, dims="regressor")

    # define linear model
    mu = pm.Deterministic("mu", a + b[0] * M + b[1] * A, dims="obs_idx")

    ## Define Poisson likelihood
    y = pm.Poisson("y", mu=mu, observed=S, dims="obs_idx")

pm.model_to_graphviz(m_sneeze)

> ### Model coordinates
> 
> The model coordinates in PyMC are used to define dimensions and labels for the variables in the model. They provide a way to organize and manipulate the model's data with dimension _names_ instead of raw _shapes_, through the `coords` and `dims` keyword arguments. In the model above, the coordinates are defined as follows:
> 
> - `regressor`: This coordinate represents the regressor variables in the model, which are `'meds'` and `'alcohol'`.
> - `obs_idx`: This coordinate represents the index of the observations in the model. It is a `range` object, `range(len(sneezes))`, here running from 0 to 4000.
> 
> These coordinates allow for more intuitive and readable model specification, as well as easier manipulation and analysis of the model's data.


### Model Fitting

Most GLMs will be fit using the NUTS step method.

In [None]:
with m_sneeze:
    trace_sneeze = pm.sample()

In [None]:
az.summary(trace_sneeze, var_names=["intercept", "slopes"])

While the model sampled to completion, close inspection of the model reveals an issue that could have caused the sampler to fail. Can you spot it? In previous versions of PyMC, this model would have failed to sample.

Propose a more robust version below.

In [7]:
# Write your answer here

Before looking at the results, let's take a step back: the trick we used with the exponential is actually exactly how generalized linear models are defined. The exponential here plays the role of **a link function** (strictly speaking, it is the inverse of the log link): it maps the **output of our linear model**, which a priori is allowed to live in $(-\infty, \infty)$, to the **parameter space of the likelihood**, here $\lambda$, which must be positive.
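
To see why this matters, the following sketch draws from the priors of the sneeze model with plain NumPy (no PyMC involved) and checks how often the linear predictor would give an invalid Poisson rate, with and without the exponential. The choice of a day with `meds=1` and `alcohol=1` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 10_000

# Draws from the priors used in the model: Normal(0, 5) and Normal(0, 1)
intercept = rng.normal(0, 5, size=n_draws)
slopes = rng.normal(0, 1, size=(n_draws, 2))

# Linear predictor for a day with meds=1 and alcohol=1
eta = intercept + slopes[:, 0] * 1 + slopes[:, 1] * 1

# Without the exponential, the rate is eta itself: non-positive about half the time
frac_invalid_identity = (eta <= 0).mean()

# With the exponential, the rate is exp(eta), which is always positive
frac_invalid_log = (np.exp(eta) <= 0).mean()

print(frac_invalid_identity, frac_invalid_log)
```

Roughly half of the prior draws give a rate that a Poisson likelihood cannot accept, whereas the exponentiated version never does.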

### Link functions

In Generalized Linear Models (GLMs), link functions play a crucial role in connecting the linear predictor to the response variable. The concept of link functions arises from the need to model the relationship between the mean of the response variable and the linear predictor, which can take any real value.

The link function transforms the linear predictor to ensure that the predicted values of the response variable are within a valid range and satisfy the distributional assumptions of the response variable. It maps the linear predictor to the space of the relevant parameter of the response distribution.

Different types of GLMs use different link functions based on the nature of the response variable and the desired distribution. Some commonly used link functions include:

1. Identity Link: This link function is used for continuous response variables and maintains a linear relationship between the linear predictor and the mean of the response variable.

2. Logit Link: This link function is used for binary response variables and maps the linear predictor to the probability of success in a logistic regression model. It ensures that the predicted probabilities are between 0 and 1.

3. Log Link: This link function is used for count data and maps the linear predictor to the mean of a Poisson distribution. It ensures that the predicted mean is positive.

4. Inverse Link: This link function is used for modeling the mean of a Gamma distribution and maps the linear predictor to the inverse of the mean.

The choice of the link function depends on the nature of the response variable and the assumptions of the distribution. The link function allows for flexible modeling of the relationship between the linear predictor and the response variable, enabling the estimation of regression coefficients and making predictions within the appropriate range of the response distribution.
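
The four links listed above can be written down in a few lines. This is a sketch for building intuition rather than a reference implementation: the dictionary `LINKS` and the grid `eta` are names and values chosen for this example.

```python
import numpy as np

# Each entry maps a link name to (link g, inverse link g^{-1})
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda p: np.log(p / (1 - p)), lambda eta: 1 / (1 + np.exp(-eta))),
    "log": (np.log, np.exp),
    "inverse": (lambda mu: 1 / mu, lambda eta: 1 / eta),
}

eta = np.array([-2.0, -0.5, 0.5, 2.0])  # arbitrary linear-predictor values

means = {}
for name, (link, inv_link) in LINKS.items():
    # The inverse link takes the linear predictor to the mean parameter
    means[name] = inv_link(eta)
    # Applying the link to that mean recovers the linear predictor
    assert np.allclose(link(means[name]), eta)
    print(f"{name:>8}: {means[name].round(3)}")
```

Note that the inverse link returns negative means whenever the linear predictor is negative, so unlike the log and logit links it does not by itself guarantee a valid parameter.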

### Checking our inferences

Now is the time to check if our model's results are credible, via posterior predictive checks, to which we were introduced in the previous section.

In [None]:
with m_sneeze:

    # Get posterior predictive samples, and add them to the InferenceData object
    trace_sneeze.extend(pm.sample_posterior_predictive(trace_sneeze))

Let's plot the posterior predictive checks for all the groups.

In [9]:
def plot_sneeze_predictions(idata, color="C0"):
    fig, axs = plt.subplots(2, 2, figsize=(12, 8))

    az.plot_ppc(
        idata,
        ax=axs[0, 0],
        coords={
            "obs_idx": np.where(np.logical_and(sneezes.get_column("alcohol").to_numpy() == 0, sneezes.get_column("meds").to_numpy() == 0))
        },
        color=color,
    )
    az.plot_ppc(
        idata,
        ax=axs[0, 1],
        coords={
            "obs_idx": np.where(np.logical_and(sneezes.get_column("alcohol").to_numpy() == 0, sneezes.get_column("meds").to_numpy() == 1))
        },
        color=color,
    )
    az.plot_ppc(
        idata,
        ax=axs[1, 0],
        coords={
            "obs_idx": np.where(np.logical_and(sneezes.get_column("alcohol").to_numpy() == 1, sneezes.get_column("meds").to_numpy() == 0))
        },
        color=color,
    )
    az.plot_ppc(
        idata,
        ax=axs[1, 1],
        coords={
            "obs_idx": np.where(np.logical_and(sneezes.get_column("alcohol").to_numpy() == 1, sneezes.get_column("meds").to_numpy() == 1))
        },
        color=color,
    )
    axs[0, 0].set_title("No alcohol : No meds")
    axs[0, 1].set_title("No alcohol : Meds")
    axs[1, 0].set_title("Alcohol : No meds")
    axs[1, 1].set_title("Alcohol : Meds")
    return fig, axs

In [None]:
plot_sneeze_predictions(trace_sneeze);

While the model is in the ballpark of the actual data, it is not perfect: it underestimates the variance in the data.

This behavior is quite common with Poisson regression, simply because real data are often more dispersed than the regression expects; in these cases, data are said to be "overdispersed". The problem is particularly acute with the Poisson because, as we have seen, its variance is mathematically constrained to be equal to its mean. So, when the data's mean and variance are not similar, a Poisson regression will get the variance wrong compared to the variance observed in the data.

To convince ourselves, let's compare our data's mean and variance:

In [None]:
sneezes.group_by(["meds", "alcohol"]).agg([
    pl.col("nsneeze").mean().alias("mean"),
    pl.col("nsneeze").var().alias("var")
]).sort(["meds", "alcohol"])

Notice that for each combination of `alcohol` and `meds`, the variance of `nsneeze` is higher than the mean!

## Gamma-Poisson Model

Gamma-Poisson (aka [Negative binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution)) regression is used to model overdispersion in count data. The Gamma-Poisson distribution can be thought of as a Poisson distribution whose rate parameter is gamma distributed, so that rate parameter can be adjusted to account for the increased variance. If you want more details about these models (a.k.a. "continuous mixture models"), I refer you to chapter 12 of [Richard McElreath's excellent _Statistical Rethinking_](https://nbviewer.jupyter.org/github/pymc-devs/resources/blob/master/Rethinking_2/Chp_12.ipynb).

In addition to the Poisson rate, $\lambda$, Gamma-Poisson distributions are parametrized by an additional overdispersion parameter, $\alpha$, which controls the shape of the Gamma distribution.

_The Gamma-Poisson distribution_

We start with a random variable $Y$ that follows a Poisson distribution with rate $\lambda$. Now suppose $\lambda$ is itself random and follows a gamma distribution with mean $\mu$ and shape $\alpha$ (equivalently, shape $\alpha$ and rate $\alpha / \mu$), written $\text{Gamma}(\mu, \alpha)$ here:

$$
\begin{aligned}
Y &\sim \text{Poisson}(\lambda) \\
\lambda &\sim \text{Gamma}\left(\mu, \alpha\right)
\end{aligned}
$$


We can marginalize over $\lambda$


$$
\begin{aligned}
p(y \mid \mu, \alpha) &= \int_0^{\infty}{p(y \mid \lambda) p(\lambda \mid \mu, \alpha) d\lambda} \\
&= \binom{y + \alpha - 1}{y}{\left(\frac{\alpha}{\mu + \alpha}\right)}^\alpha {\left(\frac{\mu}{\mu + \alpha}\right)}^y
\end{aligned}
$$


The above is the probability mass function of a Gamma-Poisson distribution, so we can write

$$
Y \sim \text{GammaPoisson}(\mu, \alpha)
$$


### Why is it useful?

Well, it frees us from the constraint of the Poisson distribution that the **variance** must equal the **mean**.
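
A small simulation makes this concrete. The sketch below builds the mixture by hand with NumPy, assuming the gamma distribution has mean $\mu$ and shape $\alpha$ (which is the parameterization that yields the mass function above); the values of `mu` and `alpha` are illustrative, not estimates from the sneeze data. Under that assumption the variance of the mixture is $\mu + \mu^2 / \alpha$, which always exceeds the mean.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, alpha = 3.0, 2.0  # illustrative values only

# Gamma with mean mu and shape alpha; NumPy takes (shape, scale), so scale = mu / alpha
lam = rng.gamma(shape=alpha, scale=mu / alpha, size=200_000)

# A Poisson draw for each gamma-distributed rate gives Gamma-Poisson draws
y = rng.poisson(lam)

print("mean:", y.mean(), "theory:", mu)
print("var: ", y.var(), "theory:", mu + mu**2 / alpha)
```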

<br> </br>

<center>
  <img src="images/poisson_gamma_poisson_drake.png" style="width:500px"; />
</center>

The common name for this type of model is [negative binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution) regression. The negative binomial distribution has two parameters: the mean $\mu$ and the overdispersion parameter $\alpha$. The mean $\mu$ is the expected value of the Poisson rate, and $\alpha$ controls how much the variance exceeds the mean: the variance is $\mu + \mu^2 / \alpha$, so smaller values of $\alpha$ mean more overdispersion.

We'll use the following model...


$$
\begin{aligned}
\beta_{\text{intercept}} &\sim \mathrm{Normal}(0, 5)\\
\beta_{\text{alcohol}} &\sim \mathrm{Normal}(0, 1)\\
\beta_{\text{meds}} &\sim \mathrm{Normal}(0, 1) \\
\alpha &\sim \mathrm{Exponential}(1) \\
\mu_i &= \exp(\beta_{\text{intercept}} + \beta_{\text{meds}} \text{meds}_i + \beta_{\text{alcohol}} \text{alcohol}_i) \\
Y_i \mid \mu_i, \alpha &\sim \mathrm{NegativeBinomial}(\mu_i, \alpha) \\
\end{aligned}
$$

Let's use this likelihood in PyMC:

In [None]:
with pm.Model(coords=COORDS) as m_sneeze_gp:
    # weakly informative priors
    a = pm.Normal("intercept", mu=0, sigma=5)
    b = pm.Normal("slopes", mu=0, sigma=1, dims="regressor")
    alpha = pm.Exponential("alpha", 1.0)

    # define linear model
    mu = pm.math.exp(a + b[0] * M + b[1] * A)

    ## likelihood
    y = pm.NegativeBinomial("y", mu=mu, alpha=alpha, observed=S, dims="obs_idx")

    trace_sneeze_gp = pm.sample()

In [None]:
az.plot_trace(trace_sneeze_gp);

Sampling went well; let's quickly check the fit.

In [None]:
with m_sneeze_gp:
    trace_sneeze_gp.extend(pm.sample_posterior_predictive(trace_sneeze_gp))

plot_sneeze_predictions(trace_sneeze_gp);

### Exercise: Interaction effect

The predictions look much better than before, **but** we can see there is bias relative to the observations in some groups. For example, on days without alcohol, the model tends to predict fewer sneezes than we actually observe in the medication condition, and more than we observe in the no-medication condition.

This suggests that we are missing some sort of **interaction effect** between medication and alcohol consumption. As it stands, the model forces the effect of medication to be the same with or without alcohol, so it can only capture the average effect across both conditions.

Try your hand at adding an interaction term to the model:

In [None]:
COORDS = {
    "regressor": ["meds", "alcohol", "meds : alcohol"],
    "obs_idx": range(len(sneezes))
}
with pm.Model(coords=COORDS) as m_sneeze_inter:

    # weakly informative priors
    a = pm.Normal("intercept", mu=0, sigma=5)
    b = pm.Normal("slopes", mu=0, sigma=1, dims="regressor")
    alpha = pm.Exponential("alpha", 1.0)

    # define linear model
    mu = pm.math.exp(a + b[0] * M + b[1] * A + b[2] * M * A)

    ## likelihood
    y = pm.NegativeBinomial("y", mu=mu, alpha=alpha, observed=S, dims="obs_idx")

    trace_sneeze_inter = pm.sample()

We see that the slope for the interaction is reliably negative, meaning that taking meds when drinking alcohol will still tame the effects of the latter on sneezing and thus decrease the number of sneezes compared to not taking meds.

In [None]:
with m_sneeze_inter:
    trace_sneeze_inter.extend(pm.sample_posterior_predictive(trace_sneeze_inter))

plot_sneeze_predictions(trace_sneeze_inter);

We can see that the interaction term appears to have removed the biases, and the predictions look much better across the board.
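
Because of the log link, coefficients act multiplicatively on the expected count, which is what lets the interaction term give each of the four groups its own mean. The sketch below uses hypothetical coefficient values (they are NOT results from the model above) to show how the multiplier for medication changes with alcohol.

```python
from math import exp

# Hypothetical coefficients on the log scale, chosen only for illustration
intercept, b_meds, b_alcohol, b_inter = 1.8, -1.2, 0.8, -0.3

expected = {}
for meds in (0, 1):
    for alcohol in (0, 1):
        eta = intercept + b_meds * meds + b_alcohol * alcohol + b_inter * meds * alcohol
        expected[(meds, alcohol)] = exp(eta)
        print(f"meds={meds}, alcohol={alcohol}: expected sneezes = {exp(eta):.2f}")

# Multiplicative effect of taking meds, without and with alcohol
ratio_no_alcohol = expected[(1, 0)] / expected[(0, 0)]  # equals exp(b_meds)
ratio_alcohol = expected[(1, 1)] / expected[(0, 1)]  # equals exp(b_meds + b_inter)
print(round(ratio_no_alcohol, 3), round(ratio_alcohol, 3))
```

Without the interaction term (`b_inter = 0`), the two ratios would be identical.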

## Generalized Linear Models 

Poisson and negative binomial regressions are particular types of **Generalized Linear Model**. 

### Linear Models

We assume the conditional distribution of the response variable is a normal distribution. We model the mean of that normal distribution with a linear combination of the predictors. Mathematically, we have

$$
\begin{aligned}
\pmb{\beta} &\sim \mathcal{P}_{\pmb{\beta}} \\
\sigma &\sim \mathcal{P}_\sigma \\
\mu_i &= \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \\
Y_i \mid \mu_i, \sigma &\sim \text{Normal}(\mu_i, \sigma)
\end{aligned}
$$

where $\mathcal{P}_{\pmb{\beta}}$ is the joint prior for the regression coefficients and $\mathcal{P}_\sigma$ is the prior on the residual standard deviation.

### Generalized Linear Models

In Generalized Linear Models, we are not restricted to normal likelihoods and we model a function of the mean with a linear combination of the predictors.

$$
\begin{aligned}
\pmb{\beta} &\sim \mathcal{P}_{\pmb{\beta}} \\
\pmb{\theta} &\sim \mathcal{P}_{\pmb{\theta}} \\
g(\mu_i) &= \eta_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \\
Y_i \mid \mu_i, \pmb{\theta} &\sim \mathcal{D}(\mu_i, \pmb{\theta})
\end{aligned}
$$

Which consists of:

* $\eta_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}$ is the **linear predictor**
* $g$ is the **link function**
    * In the Poisson regression model $g$ is the $\log$ function.
    * This point often causes confusion, because in code we work directly with the inverse link function $g^{-1}$ ($\exp$ in the previous case)
    * $g: \Omega \to \mathbb{R}$
    * $g^{-1}: \mathbb{R} \to \Omega$
    * $\Omega$ is the space of the mean parameter
* $\mathcal{D}$ is the **sampling distribution**
    * The conditional distribution of the response variable $Y$
    * Is not necessarily normal

Linear models are a specific case of generalized linear models where $\mathcal{D} \equiv \mathcal{N}$ and $g = I$ (*i.e.* the identity function).

## Binomial regression

<img src="https://media.giphy.com/media/C9b7PXxqueVxK/giphy.gif" style="margin:auto" width="450"/>

In France, presidents are elected for five years. In the intervening time, polls are conducted to try and gauge the president's popularity, which often drives his re-election chances. The resulting *survey data* is complicated: measurements can be biased, the polling houses' sampling methods can be poor, response rates can be very low, etc. This provides many challenges for statisticians.

Generally, the polls are trying to answer one question: if an election were held today, would the president win? In other words, polls are a noisy estimate of the president's true, latent popularity.

### French Polling Data

Let's import some polling data for Emmanuel Macron, the current French president.

In [None]:
try:
    polls = pl.read_csv("../data/macron_popularity.csv")
except FileNotFoundError:
    DATA_URL = "https://raw.githubusercontent.com/pymc-labs/ccc-workshop/main/data/"
    polls = pl.read_csv(DATA_URL + "macron_popularity.csv")
polls

The most important column here is `N_approve`, the number of people who approve of Macron. What we want is a model that takes these polls as raw inputs and infers the true **latent proportion** of people who do approve of the president.

### The Model

The polls can be thought of as realizations of a Binomial sampling process: for each poll, a number $N_{\text{total}}$ of people are surveyed, and $N_{\text{approve}}$ of them say they approve of the president's job. Statistically speaking, we have 


$$N_{\text{approve}} \sim \mathrm{Binomial}(N_{\text{total}}, p_{\text{approve}})$$


where $p_{\text{approve}}$ equals the proportion of people supporting the president. So the simplest Bayesian model involves simply assigning a prior distribution to $p_{\text{approve}}$.

A natural prior for $p_{\text{approve}}$ is the beta distribution, which is the conjugate prior for the Binomial distribution. 

Beta distributions are parametrized by two positive reals, $\alpha$ and $\beta$, so-called shape parameters. It can be difficult to gain an intuition about various combinations of $\alpha$ and $\beta$, so we can adopt the alternative parameterization of the beta distribution in terms of $\mu$ and $\sigma$ (mean and standard deviation), as we have seen in a previous section. Thus, we can more easily specify a prior based on how much prior data we base our beliefs on.

French presidents are usually not that popular, at least compared to the US, so a mean approval around 40% seems reasonable. For $\sigma$, we _know_ presidential approval never goes below 10% or above 90%, so something like $\sigma = 0.15$ seems to fit the bill. You can play around with the code below to get a sense of how the Beta family behaves:

In [None]:
mu, sigma = 0.4, 0.15

ax = az.plot_dist(
    pm.draw(pm.Beta.dist(mu=mu, sigma=sigma), draws=10_000)
)
ax.set(
    title=f"Random draws from $Beta(\\mu={mu}, \\sigma={sigma})$",
    xlabel="Prior Popularity",
    ylabel="Plausibility",
);

The mathematical link between $(\mu, \sigma)$ and $(\alpha, \beta)$ is as follows:

$$
\begin{aligned}
\kappa &= \frac{\mu * (1 - \mu)}{\sigma^2} - 1 \\
\alpha &= \mu \times \kappa \\
\beta &= (1 - \mu) \times \kappa 
\end{aligned}
$$

which for our model translates to $\alpha = 3.9$ and $\beta = 5.8$.
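
The conversion is easy to check by hand. Below is a minimal sketch in plain Python; the helper name `beta_shape_from_mean_sd` is made up for this example and is not part of PyMC.

```python
def beta_shape_from_mean_sd(mu, sigma):
    """Convert a Beta mean and standard deviation to the shape parameters."""
    kappa = mu * (1 - mu) / sigma**2 - 1
    if kappa <= 0:
        # sigma is too large for a Beta distribution with this mean
        raise ValueError("sigma**2 must be smaller than mu * (1 - mu)")
    return mu * kappa, (1 - mu) * kappa


alpha, beta = beta_shape_from_mean_sd(0.4, 0.15)
print(round(alpha, 1), round(beta, 1))  # 3.9 5.8

# Round trip: recover the mean and standard deviation from the shapes
mean = alpha / (alpha + beta)
sd = (alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))) ** 0.5
print(round(mean, 3), round(sd, 3))
```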

Our model is then:

$$
\begin{aligned}
p_{\text{approve}} &\sim \mathrm{Beta(\mu=0.4, \sigma=0.15)}\\
N_{\text{approve}} &\sim \mathrm{Binomial}(N_{\text{total}}, p_{\text{approve}})\\
\end{aligned}
$$


Let's code that up and run it in PyMC:

In [None]:
with pm.Model() as m_raw:
    p_approve = pm.Beta("p_approve", mu=0.4, sigma=0.15)
    n_approve = pm.Binomial(
        "n_approve",
        n=polls.get_column("N_total").to_numpy(),
        p=p_approve,
        observed=polls.get_column("N_approve").to_numpy(),
    )

    trace_raw = pm.sample()
az.summary(trace_raw, round_to=3)

In [None]:
az.plot_trace(trace_raw);

The posterior distribution seems confident (conditional on this model) that the president's approval rating is in the range of 38.1% to 38.4%, which is rather narrow!

It's because we have more than 300 polls, which is a lot. 
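
Since the beta prior is conjugate to the Binomial likelihood, the posterior of this simple model is also available in closed form, which makes the narrowness easy to reproduce. The sketch below uses synthetic polls as a stand-in for the real data (300 polls of 1000 respondents each and a true approval of 0.38 are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the polling data; sizes and true approval are made up
n_total = np.full(300, 1000)
n_approve = rng.binomial(n_total, 0.38)

# Beta prior with mu=0.4, sigma=0.15, converted to shape parameters
kappa = 0.4 * 0.6 / 0.15**2 - 1
alpha_prior, beta_prior = 0.4 * kappa, 0.6 * kappa

# Conjugate update: add approvals to alpha and non-approvals to beta
alpha_post = alpha_prior + n_approve.sum()
beta_post = beta_prior + (n_total - n_approve).sum()

s = alpha_post + beta_post
post_mean = alpha_post / s
post_sd = np.sqrt(alpha_post * beta_post / (s**2 * (s + 1)))
print(round(post_mean, 4), round(post_sd, 5))
```

With several hundred thousand respondents pooled together, the posterior standard deviation falls below a tenth of a percentage point, and the prior has almost no influence.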

Of course, not all of the polling data should be treated equally: those from 2017 should probably be discarded or down-weighted when trying to estimate 2021 popularity. Also, simple Binomial regressions tend to give overconfident estimates, for reasons similar to the Poisson, that we outlined above. If you add to that the fact that polls are noisy observations, which we did not account for in the model, there are reasons to doubt our model's confidence.

Actually, let's visualize posterior predictions of our model and see whether our skepticism is warranted:

In [None]:
with m_raw:
    trace_raw.extend(pm.sample_posterior_predictive(trace_raw))
az.plot_ppc(trace_raw);

This is somewhat confusing because the default visualization returned by ArviZ is using different bins for the different posterior draws, the posterior predictive mean, and the observed data. Before making any conclusions about the model let's customize the visualization a bit. 

In [22]:
def adjust_lightness(color, amount=0.5):
    import colorsys

    import matplotlib.colors as mc

    try:
        c = mc.cnames[color]
    except (KeyError, TypeError):
        c = color
    c = colorsys.rgb_to_hls(*mc.to_rgb(c))
    return colorsys.hls_to_rgb(c[0], c[1] * amount, c[2])


# Generate some graded colors
colors = [adjust_lightness("C0", x) for x in [1.8, 1.6, 1.4, 1.2, 0.9]]

# Quantiles corresponding to color gradings (used in plot below)
quantile_bands = [(0.05, 0.95), (0.15, 0.85), (0.25, 0.75), (0.35, 0.65)]


def plot_ppc_discrete(idata, bins, ax):
    # Add uncertainty bands around a line in plot
    def add_discrete_bands(x, lower, upper, ax, **kwargs):
        for i, (l, u) in enumerate(zip(lower, upper)):
            s = slice(i, i + 2)
            ax.fill_between(x[s], [l, l], [u, u], **kwargs)

    # Get data variable name
    var_name = list(idata.observed_data.data_vars)[0]
    # Extract observations from idata
    y_obs = idata.observed_data[var_name].to_numpy()

    counts_list = []
    for draw_values in az.extract(idata, "posterior_predictive")[var_name].to_numpy().T:
        counts, _ = np.histogram(draw_values, bins=bins)
        counts_list.append(counts)
    counts_arr = np.stack(counts_list)

    # Add quantile bands and median to plot
    for i in range(len(quantile_bands)):
        qts_tmp = np.quantile(counts_arr, quantile_bands[i], axis=0)
        add_discrete_bands(bins, qts_tmp[0], qts_tmp[1], ax=ax, color=colors[i])

    median = np.quantile(counts_arr, 0.5, axis=0)

    ax.step(bins[:-1], median, color=colors[4], lw=2, where="post")

    # Add ground truth to plot
    ax.hist(y_obs, bins=bins, histtype="step", lw=2, color="black", align="mid")

    # Add legend
    handles = [
        Line2D([], [], label="Observed data", color="black", lw=2),
        Line2D([], [], label="Posterior predictive median", color=colors[4], lw=2),
    ]
    ax.legend(handles=handles)

    return ax

In [None]:
fig, ax = plt.subplots()
_, bins = np.histogram(polls.get_column("N_approve"), bins=50)
plot_ppc_discrete(trace_raw, bins, ax)
ax.set(xlabel="Number of approvals");

So, the model is clearly not a great fit to the data. It's not _that_ surprising, since this model is very simple, while approval ratings are the results of complex socio-economic interactions that evolve with time.

### Adding a predictor

A first thing we can do is help our model by adding a variable that is thought to be correlated with presidential approval. For example, the unemployment rate, which we actually already have in our dataset.

More precisely, we will use the _logarithm_ of the unemployment rate rather than the raw rate, because it is easier to work with.

$$
\begin{aligned}
\text{baseline} &\sim \mathrm{Normal}(-0.7, 0.5)\\
\beta_{\text{unemp}} &\sim \mathrm{Normal}(0, 0.2)\\
p_{\text{approve}} &= \text{logit}^{-1}(\text{baseline} + \beta_{\text{unemp}} \times \log(\text{unemp}))\\
N_{\text{approve}} &\sim \mathrm{Binomial}(N_{\text{total}}, p_{\text{approve}})\\
\end{aligned}
$$

Note that here we are using a **logit link** function. Its inverse plays the same role as the exponential in the Poisson regression, except that it maps the real line (here, the parameter space, where $\text{baseline}$ and $\beta_{\text{unemp}}$ live) to the unit interval (here, the outcome space, where $p_{\text{approve}}$ lives).

$$x = \text{logit}(p) = \log\left[\frac{p}{1-p}\right]$$

$$p = \text{logit}^{-1}(x) = \frac{1}{1 + e^{-x}}$$

Let's visualize this function to get a better understanding:

In [24]:
def invlogit(x):
    return 1 / (1 + np.exp(-x))

In [None]:
xvals = np.linspace(-8, 8)
plt.plot(xvals, invlogit(xvals))
plt.xlabel("parameter space")
plt.ylabel("probability space");

To gain an intuition, notice that 0 on the logit space translates to a probability of 50%. Similarly, anything below about -5 is close to 0%, and anything above 5 is close to 100%.

This means that the logistic strongly distorts the parameter space: only values in roughly $[-6, 6]$ map to probabilities noticeably different from 0 or 1. So priors that you _think_ are reasonable on the parameter space might turn out to be completely unreasonable on the probability space. So, with GLMs you have to be careful when choosing priors, because of the distortion of space caused by the link function.

#### Selecting priors

Our regression intercept is the baseline popularity. Suppose we behaved like this was an ordinary least squares regression and chose something like a $\mathrm{Normal}(0, 10)$ prior, which usually works well for the intercept of a linear regression. What does this mean on the outcome space?

In [None]:
from scipy.special import expit as logistic

ax = az.plot_kde(
    logistic(pm.draw(pm.Normal.dist(mu=0, sigma=10), draws=20_000)),
    label="baseline ~ $Normal(0, 10)$",
)
ax.set_xlim((0, 1))
ax.set_xlabel("Baseline presidential popularity")
ax.set_ylabel("Density")
ax.set_title("Baseline prior");

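This encodes exactly the opposite of our domain knowledge: most of the prior mass piles up near 0% and 100% approval, leaving little for the moderate values in between.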
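
To put a number on how badly this prior behaves, we can compute how much of its mass ends up at the extremes of the probability scale. This is a rough sketch using only the standard library; the 1% and 99% cut-offs are arbitrary thresholds chosen for illustration.

```python
import math
from statistics import NormalDist

prior = NormalDist(mu=0.0, sigma=10.0)  # the Normal(0, 10) prior on the logit scale

# Log-odds thresholds that correspond to 1% and 99% approval.
low = math.log(0.01 / 0.99)
high = math.log(0.99 / 0.01)

# Prior mass that lands below 1% or above 99% once pushed through the inverse logit.
extreme_mass = prior.cdf(low) + (1.0 - prior.cdf(high))
print(f"Prior mass outside [1%, 99%]: {extreme_mass:.2f}")
```

Roughly two thirds of the prior mass sits outside the 1% to 99% range, so this "vague" prior is in fact a strong statement that approval is almost always near-unanimous one way or the other.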

We can do better: we expect most presidents to have a baseline popularity between 20% and 50% -- in other words, French people rarely love their presidents but often _really_ dislike them. $\mathrm{Normal}(-0.7, 0.5)$ looks reasonable in that regard: it places about 95% of the probability mass between -1.7 and 0.3 on the logit scale, i.e. between $\text{logistic}(-1.7) = 15\%$ and $\text{logistic}(0.3) = 57\%$ approval, centered on $\text{logistic}(-0.7) = 33\%$.


> Remember our lecture on choosing priors. If you want, e.g., $95\%$ of the prior mass on $p_{\text{approve}} = \text{logit}^{-1}(\text{baseline})$ to lie between $15\%$ and $57\%$, you can work backwards: apply the logit function to those two bounds to get the corresponding interval on the parameter scale ($-1.7$ and $0.3$ from above).
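
The backwards calculation can be sketched by hand. The function `normal_from_interval` is a hypothetical helper written for this example, not part of any library; it assumes a symmetric central interval, so the mean is the midpoint of the logit-scale bounds and the standard deviation follows from the Normal quantile.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds of a probability in (0, 1)."""
    return math.log(p / (1.0 - p))


def normal_from_interval(p_low, p_high, mass=0.95):
    """Normal(mu, sigma) on the logit scale whose central `mass` interval
    maps onto [p_low, p_high] on the probability scale."""
    lo, hi = logit(p_low), logit(p_high)
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)  # about 1.96 for mass = 0.95
    mu = (lo + hi) / 2.0
    sigma = (hi - lo) / (2.0 * z)
    return mu, sigma


mu, sigma = normal_from_interval(0.15, 0.57)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```

The result lands close to the $\mathrm{Normal}(-0.7, 0.5)$ prior used in the model, which was obtained from the rounded bounds $-1.7$ and $0.3$.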

## Choosing priors computationally

While we can use simulation to construct priors that meet our requirements, it takes some trial and error.

PreliZ is a Python package designed to assist practitioners in the process of prior elicitation, which is a crucial aspect of Bayesian statistics. It provides a set of tools that help transform domain knowledge into well-defined probability distributions. This is particularly useful when working with PyMC, a probabilistic programming library in Python, as it allows for the construction of informative priors that can regularize inferences and make models more robust.

One of the key features of PreliZ is its compatibility with PyMC, enabling users to convert PreliZ distributions directly into PyMC distributions. This seamless integration simplifies the process of specifying priors in PyMC models, ensuring that the priors are both appropriate and informative based on the available domain knowledge.


In [None]:
import preliz as pz

MASS = 0.95  # probability mass we want in an interval
LOWER = -1.7  # lower bound of the interval
UPPER = 0.3  # upper bound of the interval

constrained_normal, _ = pz.maxent(
    pz.Normal(), lower=LOWER, upper=UPPER, mass=MASS
)

constrained_normal

We can now plug these hyperparameters into PyMC's `Normal` distribution:

In [None]:
baseline_prior_samples = pm.draw(pm.Normal.dist(-0.7, 0.5), draws=20_000)

ax = az.plot_kde(
    invlogit(baseline_prior_samples),
    label="baseline ~ $Normal(-0.7, 0.5)$",
)
ax.set_xlim((0, 1))
ax.set_xlabel("Baseline presidential popularity")
ax.set_ylabel("Density")
ax.set_title("Baseline prior");

Much better!

What about the effect of unemployment rate? Some domain knowledge suggests we should expect a mild effect. Unemployment is one of several factors influencing voters' opinion of the president, and partisanship makes movements in popularity less responsive to unemployment -- if you really don't like the president, you probably need to see a very low unemployment rate before starting to give them credit.

All in all, we expect unemployment to have a small negative effect, but it's difficult to quantify. So, let's center our prior on $0$ (i.e. no expected effect) and use a weakly regularizing $\sigma$ (in log-odds space): $\beta_{\text{unemp}} \sim \mathrm{Normal}(0, 0.2)$. To see the effect of this prior, we have to plug it into the formula for our model:

$$p_{\text{approve}} = \text{logit}^{-1}(\text{baseline} + \beta_{\text{unemp}} \times \log(\text{unemp}))$$

To do so, we need some unemployment values to plug in. For convenience, we will standardize the log of the real unemployment data (*i.e.* rescale it to have mean 0 and standard deviation 1), which makes it easier to set our priors and, as a further benefit, can help the sampler converge.

In [29]:
def standardize(series):
    """Standardize a polars series"""
    return (series - series.mean()) / series.std()


stdz_log_unemployment = standardize(np.log(polls.get_column("unemployment"))).to_numpy()

Because the data are standardized, a grid of values between -3 and 3 is more than enough to cover the plausible range of the data:

In [None]:
unemp_effect_prior_samples = pm.draw(pm.Normal.dist(0.0, 0.2), draws=20_000)
unemp_grid = np.linspace(-3, 3, 200)

prior_approval = invlogit(
    baseline_prior_samples[:, None] + unemp_effect_prior_samples[:, None] * unemp_grid
)

for i in range(100):
    plt.plot(unemp_grid, prior_approval[i], "k", alpha=0.2)
plt.xlabel("$log(unemp)$ (standardized)")
plt.ylabel("prior approval")
plt.title("$\\text{baseline} \\sim Normal(-0.7, 0.5), \\beta_{\\text{unemp}} \\sim Normal(0, 0.2)$");

Each line is one possible relationship between unemployment and latent popularity under our model's assumptions, factoring in the baseline effect. All of these relationships are smooth, and because the slope prior is centered on zero, both increasing and decreasing trends are allowed.

We could add a constraint that popularity only decreases with increasing unemployment, but we will not do that here.

#### Exercise

Implement the covariate model in PyMC and sample from it. Assign the output from `pm.sample` to an object called `trace_unemp`. 

In [None]:
with pm.Model() as m_unemp:
    
    # Write your code here

    trace_unemp = pm.sample()

In [None]:
az.plot_trace(trace_unemp);

Let's check out the posterior predictive distribution of approvals, and compare it to the naive model:

In [None]:
with m_unemp:
    trace_unemp.extend(pm.sample_posterior_predictive(trace_unemp))

In [None]:
_, axes = plt.subplots(1, 2, figsize=(12, 5), constrained_layout=True, sharey=True)
_, bins = np.histogram(polls.get_column("N_approve"), bins=50)

plot_ppc_discrete(trace_raw, bins, axes[0])
plot_ppc_discrete(trace_unemp, bins, axes[1])

axes[0].set(title="Model 0: No predictors", xlabel="Number of approvals")
axes[1].set(title="Model 1: log(unemployment)", xlabel="Number of approvals");

The model shows modest improvement in the overall shape of the predictions, but it still has trouble with low approval ratings in particular. It is also quite confident in its predictions, which is not ideal.


## Beta-Binomial regression

As noted, binomial regressions can give overconfident estimates, for reasons similar to the Poisson: the binomial distribution only has one parameter, which closely ties the variance to the expected value:

$$\text{E}(Y) = N_{\text{total}} \times p_{\text{approve}}$$
$$\text{Var}(Y) = N_{\text{total}} \times p_{\text{approve}} \times (1 - p_{\text{approve}})$$

So, as we used the Gamma-Poisson distribution to give more flexibility to the Poisson, we can similarly allow the binomial probability to vary according to some distribution as well. Canonically, we use the [Beta-Binomial distribution](https://en.wikipedia.org/wiki/Beta-binomial_distribution), which handles data that is overdispersed relative to the binomial distribution.

Under this model, the binomial probabilities are no longer fixed, but are rather random variables drawn from a common beta distribution. So, in addition to the number of trials, $n$, Beta-Binomial distributions are parametrized by the beta distribution's shape parameters, $\alpha$ and $\beta$. 

We will specify a covariate model for the mean $\mu$ (which is $p_{\text{approve}}$ in our model) and place a prior on the concentration $\kappa$ rather than on the standard deviation $\sigma$, since $\sigma$ is awkwardly constrained to the interval $[0, \sqrt{\mu (1 - \mu)}]$.

> ### Mean and sample size parameterization of the beta distribution
>
> An alternative way of thinking about the beta distribution is in terms of mean and sample size parameters. 
>
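> The sample size is defined here as:
>
> $$ n = \alpha + \beta - 2 $$
>
> It is usually expressed through a parameter $\kappa$, called the **concentration**, defined as:
>
> $$ \kappa = n + 2 = \alpha + \beta $$
>
> while the mean is defined as:
>
> $$ \mu = \frac{\alpha}{\alpha + \beta} $$
>
> Inverting these definitions gives $\alpha = \mu \kappa$ and $\beta = (1 - \mu) \kappa$, which is how the mean and concentration enter the Beta-Binomial likelihood.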


Here, we will specify:

$$
\begin{aligned}
\text{baseline} &\sim \mathrm{Normal}(-0.7, 0.5)\\
\beta_{\text{unemp}} &\sim \mathrm{Normal}(0, 0.2)\\
p_{\text{approve}} &= \text{logit}^{-1}(\text{baseline} + \beta_{\text{unemp}} \times \log(\text{unemp}))\\
\kappa &\sim \mathrm{Exponential}(1) + 10\\
N_{\text{approve}} &\sim \mathrm{BetaBinomial}(\alpha=p_{\text{approve}} \times \kappa, \beta = (1 - p_{\text{approve}}) \times \kappa, \: N_{\text{total}})
\end{aligned}
$$

So we are assuming that $\kappa$ is *at least* 10, which caps how overdispersed the counts can be while still allowing the variance to exceed the binomial variance if necessary. To that end, we can use a trick and define $\kappa = \tilde{\kappa} + 10$, where $\tilde{\kappa} \sim \mathrm{Exponential}(1)$, which works because exponential distributions have a minimum of zero.
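
To get a feel for what $\kappa$ does, the sketch below converts a mean and concentration into the Beta shape parameters and compares the resulting Beta-Binomial variance with the plain binomial variance. The poll size and approval probability are hypothetical values chosen for illustration, and the helper functions are local to this example.

```python
def beta_shapes(mean, kappa):
    """Convert a (mean, concentration) pair into the Beta shape parameters."""
    alpha = mean * kappa
    beta = (1.0 - mean) * kappa
    return alpha, beta


def binomial_var(n, p):
    """Variance of a Binomial(n, p) count."""
    return n * p * (1.0 - p)


def beta_binomial_var(n, alpha, beta):
    """Variance of a Beta-Binomial(n, alpha, beta) count."""
    kappa = alpha + beta
    p = alpha / kappa
    return n * p * (1.0 - p) * (kappa + n) / (kappa + 1.0)


n, p = 1000, 0.33  # hypothetical poll size and approval probability
for kappa in (10, 100, 10_000):
    a, b = beta_shapes(p, kappa)
    ratio = beta_binomial_var(n, a, b) / binomial_var(n, p)
    print(f"kappa = {kappa:>6}: alpha = {a:.1f}, beta = {b:.1f}, variance ratio = {ratio:.1f}")
```

The variance ratio is $(\kappa + n)/(\kappa + 1)$: small concentrations inflate the variance substantially, and as $\kappa$ grows the Beta-Binomial approaches the binomial.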

This translates easily into PyMC:

In [None]:
U = stdz_log_unemployment
N, Y = polls.select(["N_total", "N_approve"]).to_numpy().T

with pm.Model() as m_betabin:

    # intercept on logit scale
    baseline = pm.Normal("baseline", -0.7, 0.5)
    # log unemployment slope
    log_unemp_effect = pm.Normal("log_unemp_effect", 0.0, 0.2)

    # inverse logit maps the linear predictor onto the probability scale
    p_approve = pm.Deterministic("p_approve", pm.math.invlogit(baseline + log_unemp_effect * U))

    # concentration parameter: Exponential(1) offset shifted so that kappa >= 10
    kappa = pm.Exponential("kappa_offset", 1.0) + 10.0

    n_approve = pm.BetaBinomial(
        "n_approve",
        alpha=p_approve * kappa,
        beta=(1.0 - p_approve) * kappa,
        n=N,
        observed=Y,
    )

    trace_betabin = pm.sample()

az.summary(trace_betabin, var_names="~p_approve", round_to=2)

In [None]:
az.plot_trace(trace_betabin, var_names="~p_approve");

Note how the posterior distributions of `baseline` and `log_unemp_effect` are much wider than before -- that's already a good sign.

Let's see how this propagates to the posterior predictions:

In [None]:
with m_betabin:
    trace_betabin.extend(pm.sample_posterior_predictive(trace_betabin))

In [None]:
_, axes = plt.subplots(1, 3, figsize=(16, 6), constrained_layout=True, sharey=True)
_, bins = np.histogram(polls.get_column("N_approve"), bins=50)

plot_ppc_discrete(trace_raw, bins, axes[0])
plot_ppc_discrete(trace_unemp, bins, axes[1])
plot_ppc_discrete(trace_betabin, bins, axes[2])

axes[0].set(title="Model 0: No predictors", xlabel="Number of approvals")
axes[1].set(title="Model 1: log(unemployment)", xlabel="Number of approvals")
axes[2].set(
    title="Model 2: log(unemployment) + BetaBinomial", xlabel="Number of approvals"
);

That's the best we've seen so far! 

This is still not a production-grade model, but considering how simple it still is, it does not perform badly. Polls really are notoriously noisy, and the beta-binomial seems to characterize this noise pretty well.

Our predictions still aren't very useful though: the model is a reasonable description of the data-generating process, but largely because we have cranked up its variance.

### How to improve this model?

We could add more structure to the model to make our predictions more precise. For example, we could think about adding relevant predictor variables, using polling data from previous presidents (approval moves in cycles, no matter who's president, so knowing about previous presidents would help), and adding a time component (we already noticed that polls from 2017 shouldn't weigh as much as polls from 2021). This last option would also allow us to do time series predictions, which is intrinsically interesting here.


### Evaluating the unemployment effect

Clearly, the biggest improvement in our model was the result of using a beta-binomial likelihood in place of the simpler binomial. So, it's not clear how much the unemployment rate actually helped.

One way to evaluate this is to look at the implications of varying unemployment values on the model predictions. We can do this without fitting a new model, using the samples from the posterior distribution in the trace object.

In [None]:
trace_betabin.posterior

Let's use our grid of unemployment values as a *counterfactual* and see how the model's predictions change as we move from low to high unemployment rates, had they been observed in the data.

In [40]:
unemp_grid = xr.DataArray(unemp_grid, dims="counterfactual")

We push these values through the model, using the values sampled for the model parameters in the trace.

In [41]:
post_approval = invlogit(trace_betabin.posterior["baseline"] + trace_betabin.posterior["log_unemp_effect"] * unemp_grid)

In [None]:
_, ax = plt.subplots(1, 1, figsize=(12, 5))

az.plot_hdi(unemp_grid, post_approval, ax=ax, fill_kwargs={"label": "Posterior HDI"})
ax.plot(unemp_grid, post_approval.mean(("chain", "draw")), ls="--", lw=3, label="Posterior mean")
ax.scatter(stdz_log_unemployment, polls["N_approve"] / polls["N_total"], c="k", alpha=0.1, label="Observed")

ax.set(xlabel="$log(unemp)$ (standardized)", ylabel="Mean approval", title="Conditional Adjusted Predictions")
ax.legend();

The estimated effect of unemployment on approval is positive: approval rises as unemployment increases! This seems like a suspicious result.

### Exercise

Code the model without the unemployment data, but with the Beta-Binomial noise distribution. Assign the output from `pm.sample` to an object called `trace_betabin_no_unemp`.

In [None]:
with pm.Model() as m_betabin_no_unemp:
    
    # Write your code here

    trace_betabin_no_unemp = pm.sample()

az.summary(trace_betabin_no_unemp, round_to=2)

In [None]:
az.plot_trace(trace_betabin_no_unemp);

In [None]:
with m_betabin_no_unemp:
    trace_betabin_no_unemp.extend(
        pm.sample_posterior_predictive(trace_betabin_no_unemp)
    )

In [None]:
_, axes = plt.subplots(2, 2, figsize=(16, 6), constrained_layout=True, sharey=True)
_, bins = np.histogram(polls.get_column("N_approve"), bins=50)

plot_ppc_discrete(trace_raw, bins, axes[0, 0])
plot_ppc_discrete(trace_unemp, bins, axes[1, 0])
plot_ppc_discrete(trace_betabin, bins, axes[0, 1])
plot_ppc_discrete(trace_betabin_no_unemp, bins, axes[1, 1])


axes[0, 0].set(title="Model 0: No predictors", xlabel="Number of approvals")
axes[1, 0].set(title="Model 1: log(unemployment)", xlabel="Number of approvals")
axes[0, 1].set(
    title="Model 2: log(unemployment) + BetaBinomial", xlabel="Number of approvals"
)
axes[1, 1].set(title="Model 3: BetaBinomial", xlabel="Number of approvals");

Another look at the data shows that something else must be going on ...

In [None]:
time = polls.with_columns(pl.col("date").str.to_datetime()).get_column("date")

_, (up, down) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)

up.scatter(time, stdz_log_unemployment, c="C1", alpha=0.5)
up.set(ylabel="Unemployment")

down.scatter(time, polls.get_column("N_approve") / polls.get_column("N_total"), c="C0", alpha=0.5)
down.set(ylabel="Approval");

## Imputation of Missing Data

As with most textbook examples, the models we have examined so far assume that the associated data are complete. That is, there are no **missing values** corresponding to any observations in the dataset. However, many real-world datasets have missing observations, usually due to some logistical problem during the data collection process. The easiest way of dealing with observations that contain missing values is simply to exclude them from the analysis. However, this results in loss of information if an excluded observation contains valid values for other quantities, and can bias results. An alternative is to impute the missing values, based on information in the rest of the model.

For example, consider a survey dataset for some wildlife species:

    Count   Site   Observer   Temperature
    ------- ------ ---------- -------------
    15      1      1          15
    10      1      2          NA
    6       1      1          11

Each row contains the number of individuals seen during the survey, along with three covariates: the site on which the survey was conducted, the observer that collected the data, and the temperature during the survey. If we are interested in modelling, say, population size as a function of the count and the associated covariates, it is difficult to accommodate the second observation because the temperature is missing (perhaps the thermometer was broken that day). Ignoring this observation will allow us to fit the model, but it wastes information that is contained in the other covariates.

In a Bayesian modelling framework, missing data are accommodated simply by treating them as **unknown model parameters**. Values for the missing data $\tilde{y}$ are estimated naturally, using the posterior predictive distribution:

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta) f(\theta|y) d\theta$$

This describes additional data $\tilde{y}$, which may either be considered unobserved data or potential future observations. We can use the posterior predictive distribution to model the likely values of missing data.
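
In practice the integral is approximated by sampling: draw $\theta$ from the posterior, then draw $\tilde{y}$ given that $\theta$. The sketch below does this for a single missing Poisson count. It assumes a constant rate with an $\mathrm{Exponential}(1)$ prior, so that the posterior is available in closed form as a Gamma distribution; the observed counts are a short illustrative series, and `sample_poisson` is a helper written for this example. PyMC performs the same two steps with MCMC draws instead of a closed-form posterior.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method
    (adequate for the small rates used here)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(42)

# Illustrative observed counts and an Exponential(1) = Gamma(1, 1) prior on the rate.
observed = [4, 5, 4, 0, 1, 4, 3, 4, 0, 6]
post_shape = 1 + sum(observed)  # conjugate Gamma posterior: shape
post_rate = 1 + len(observed)   # conjugate Gamma posterior: rate

# Step 1: draw theta from the posterior; step 2: draw y_tilde given that theta.
draws = []
for _ in range(20_000):
    theta = rng.gammavariate(post_shape, 1.0 / post_rate)  # second argument is a scale
    draws.append(sample_poisson(theta, rng))

predictive_mean = sum(draws) / len(draws)
print(f"Posterior predictive mean of the missing count: {predictive_mean:.2f}")
```

The collection `draws` is a sample from $p(\tilde{y}|y)$, so its histogram describes the plausible values of the missing observation, including the uncertainty about $\theta$.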

Consider the coal mining disasters data introduced previously. Assume that two years of data are missing from the time series; we indicate this in the data array by the use of an arbitrary placeholder value, `-999`:

In [None]:
disasters_missing = np.array([ 4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
2, 2, 3, 4, 2, 1, 3, -999, 2, 1, 1, 1, 1, 3, 0, 0,
1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
3, 3, 1, -999, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

N = len(disasters_missing)

To estimate these values in PyMC, we need to convert these placeholder values to `np.nan` (or `None`) values so that they can be handled by the model as missing values:

In [None]:
disasters_missing = np.where(disasters_missing == -999, np.nan, disasters_missing)
disasters_missing

This array can then be passed to the model likelihood, which recognizes the `nan` values as missing and replaces them with stochastic variables of the desired type. For the coal mining disasters problem, recall that disaster events were modeled as Poisson variates:

```python
disasters = pm.Poisson('disasters', mu=rate, observed=disasters_missing)
```

Each element in `disasters` is a Poisson random variable, irrespective of whether the observation was missing or not. The difference is that actual observations are assumed to be data stochastics, while the missing values are unobserved stochastics. The latter are considered unknown, rather than fixed, and therefore estimated by the fitting algorithm, just as unknown model parameters are.

The model is otherwise unchanged from the complete data case.

In [None]:
with pm.Model() as missing_data_model:

    # Prior for distribution of switchpoint location
    switchpoint = pm.DiscreteUniform('switchpoint', lower=0, upper=N)
    # Priors for pre- and post-switch mean number of disasters
    early_mean = pm.Exponential('early_mean', lam=1.)
    late_mean = pm.Exponential('late_mean', lam=1.)

    # Allocate appropriate Poisson rates to years before and after current
    # switchpoint location
    idx = np.arange(N)
    rate = pm.math.switch(switchpoint >= idx, early_mean, late_mean)

    # Data likelihood
    disasters = pm.Poisson('disasters', rate, observed=disasters_missing)

In [None]:
pm.model_to_graphviz(missing_data_model)

In [None]:
with missing_data_model:
    # This step assignment is a work-around for a bug in PyMC 5.20
    trace_missing = pm.sample(step=pm.Metropolis())

In [None]:
az.plot_posterior(trace_missing, var_names=['disasters_unobserved'])

## Hierarchical Models

Hierarchical or multilevel modeling is a generalization of regression modeling.

*Multilevel models* are regression models in which the constituent model parameters are given **probability models**. This implies that model parameters are allowed to **vary by group**.

Observational units are often naturally **clustered**. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters.

A *hierarchical model* is a particular multilevel model where parameters are nested within one another.

Some multilevel structures are not hierarchical. 

* e.g. "country" and "year" are not nested, but may represent separate, but overlapping, clusters of parameters
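
Whether two grouping variables are nested or crossed can be checked directly from the data. The sketch below uses a hypothetical helper, `is_nested`, and made-up labels: a grouping is nested when every inner unit belongs to exactly one outer unit.

```python
def is_nested(inner, outer):
    """Return True if every inner unit belongs to exactly one outer unit."""
    parents = {}
    for child, parent in zip(inner, outer):
        if parents.setdefault(child, parent) != parent:
            return False
    return True


# Hypothetical household-in-county data: each household sits in one county.
households = ["h1", "h2", "h3", "h4"]
counties = ["AITKIN", "AITKIN", "ANOKA", "ANOKA"]

# Hypothetical country-by-year panel: every country appears in several years.
countries = ["FR", "FR", "DE", "DE"]
years = [2020, 2021, 2020, 2021]

print(is_nested(households, counties))  # households are nested in counties
print(is_nested(countries, years))      # countries and years are crossed
```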

For this topic, let's revisit the radon dataset from the first section.


### Example: Radon contamination (Gelman and Hill 2006)

Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.

![radon](images/how_radon_enters.jpg)

The EPA did a study of radon levels in 80,000 houses. Two important predictors:

* measurement in basement or first floor (radon higher in basements)
* county uranium level (positive correlation with radon levels)

We will focus on modeling radon levels in Minnesota.

The hierarchy in this example is households within county. 

### Data organization

First, we import the data and extract Minnesota's data.

In [None]:
import numpy as np
import polars as pl
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import seaborn as sns
import xarray as xr

sns.set_context("notebook")
import warnings

warnings.filterwarnings("ignore", module="mkl_fft")
warnings.filterwarnings("ignore", module="matplotlib")
warnings.filterwarnings("ignore", category=RuntimeWarning)

In [None]:
DATA_URL = "https://raw.githubusercontent.com/pymc-labs/ccc-workshop/main/data/"

try:
    srrs2 = pl.read_csv("../data/srrs2.dat")
except FileNotFoundError:
    srrs2 = pl.read_csv(DATA_URL + "srrs2.dat")

# Strip whitespace from the column names and keep only the Minnesota records

srrs2.columns = [col.strip() for col in srrs2.columns]
srrs_mn = srrs2.filter(pl.col("state") == "MN")

RANDOM_SEED = 20090425

Next, obtain the county-level predictor, uranium, by combining two variables.

In [None]:
try:
    cty = pl.read_csv("../data/cty.dat")
except FileNotFoundError:
    cty = pl.read_csv(DATA_URL + "cty.dat")

srrs_mn = srrs_mn.with_columns(
    (
        pl.col("stfips").cast(pl.Float32) * 1000
        + pl.col("cntyfips").str.strip_chars().cast(pl.Float32)
    ).alias("fips")
)

cty_mn = cty.filter(pl.col("st") == "MN")
cty_mn = cty_mn.with_columns(
    (
        pl.col("stfips").cast(pl.Float32) * 1000 + pl.col("ctfips").cast(pl.Float32)
    ).alias("fips")
)

Use the `join` method to combine home- and county-level information in a single DataFrame.

In [None]:
srrs_mn = srrs_mn.join(cty_mn[["fips", "Uppm"]], on="fips")
srrs_mn = srrs_mn.unique(subset="idnum")
u = np.log(srrs_mn["Uppm"]).unique().to_numpy()

srrs_mn.shape

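We also need an integer index for each unique county, along with a lookup of county names, for indexing county-level parameters. Casting the column to a categorical gives us both.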

In [None]:
srrs_mn = srrs_mn.with_columns(pl.col("county").str.strip_chars().cast(pl.Categorical))
county = srrs_mn["county"].to_physical().to_numpy()
mn_counties = srrs_mn["county"].cat.get_categories()
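
A plain-Python analogue of this indexing step may make the idea clearer. The county labels below are made up for illustration, and the codes are assigned in order of first appearance; this is a sketch of the concept, not a description of how polars stores categoricals internally.

```python
# Hypothetical county labels, one per household, in data order.
labels = ["AITKIN", "ANOKA", "AITKIN", "BECKER", "ANOKA", "AITKIN"]

# Lookup table: county name -> integer code, assigned in order of first appearance.
county_lookup = {}
for name in labels:
    county_lookup.setdefault(name, len(county_lookup))

# One integer code per household; this is what indexes a vector of county-level parameters.
county_idx = [county_lookup[name] for name in labels]

# Ordered list of names, so that code k maps back to county_names[k].
county_names = list(county_lookup)

print(county_lookup)
print(county_idx)
```

The integer codes play the role of `county` in the models that follow, and the ordered names play the role of `mn_counties`.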

Finally, create local copies of variables.

In [None]:
srrs_mn = srrs_mn.with_columns(
    pl.col("activity").str.strip_chars().cast(pl.Float32).alias("radon")
)
srrs_mn = srrs_mn.with_columns(np.log(pl.col("radon") + 0.1).alias("log_radon"))
floor_measure = srrs_mn["floor"].to_numpy()

Distribution of radon levels in MN (log scale):

In [None]:
srrs_mn.head()

In [None]:
srrs_mn["log_radon"].plot.hist()

In [None]:
srrs_mn["floor"].plot.hist()

## Conventional approaches

The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff:

***Complete pooling***: 

Treat all counties the same, and estimate a single radon level.

$$y_i = \alpha + \beta x_i + \epsilon_i$$

***No pooling***:

Model radon in each county independently.

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

where $j[i]$ is the county index of observation $i$.

The errors $\epsilon_i$ may represent measurement error, temporal within-house variation, or variation among houses.
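
The contrast between the two extremes is easiest to see without the predictor. The sketch below drops the `floor` term (an intercept-only simplification) and uses made-up measurements: complete pooling reduces to the overall mean, and no pooling to a separate mean per county.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log-radon measurements with a county index j[i] for each house i.
log_radon_obs = [1.1, 0.9, 1.4, 1.2, 0.8, 2.9, 1.0, 1.3]
county_idx = [0, 0, 0, 0, 1, 2, 1, 1]  # county 2 contributes a single house

# Complete pooling: one estimate shared by every county.
pooled_estimate = mean(log_radon_obs)

# No pooling: a separate estimate per county, using only that county's houses.
by_county = defaultdict(list)
for y, j in zip(log_radon_obs, county_idx):
    by_county[j].append(y)
county_means = {j: mean(ys) for j, ys in sorted(by_county.items())}

print(f"pooled: {pooled_estimate:.2f}")
for j, est in county_means.items():
    print(f"county {j}: n = {len(by_county[j])}, unpooled = {est:.2f}")
```

County 2 has a single house, so its unpooled estimate is just that one measurement, while the pooled estimate ignores county membership entirely.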

### Complete pooling

Here are the point estimates of the slope and intercept for the complete pooling model:

In [10]:
log_radon = srrs_mn["log_radon"].to_numpy()
floor = srrs_mn["floor"].to_numpy()

In [11]:
with pm.Model() as pooled_model:

    mu = pm.Normal("mu", 0, sigma=1e5)
    beta = pm.Normal("beta", mu=0, sigma=1e5)
    sigma = pm.HalfCauchy("sigma", 5)

    theta = mu + beta * floor

    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon)

In [None]:
pm.model_to_graphviz(pooled_model)

In [None]:
with pooled_model:
    pooled_trace = pm.sample(random_seed=RANDOM_SEED)

In [None]:
pooled_trace

In [15]:
means = pooled_trace.posterior.mean(dim=("chain", "draw"))
mu_mean = means['mu'].values
beta_mean = means['beta'].values

In [None]:
plt.scatter(srrs_mn["floor"], srrs_mn["log_radon"])
xvals = np.linspace(-0.2, 1.2)
yvals = mu_mean + beta_mean * xvals
plt.plot(xvals, yvals, "r--");

### No pooling

Estimates of county radon levels for the unpooled model.
Notice the use of `coords` to avoid managing dimensions by hand.

In [17]:
coords = {"county": mn_counties}

with pm.Model(coords=coords) as unpooled_model:

    mu = pm.Normal("mu", 0, sigma=1e5, dims="county")
    beta = pm.Normal("beta", 0, sigma=1e5)
    sigma = pm.HalfCauchy("sigma", 5)

    theta = mu[county] + beta * floor

    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon)

In [None]:
pm.model_to_graphviz(unpooled_model)

In [None]:
with unpooled_model:
    unpooled_trace = pm.sample(random_seed=RANDOM_SEED)

In [None]:
az.plot_forest(
    unpooled_trace,
    var_names=["mu"],
    combined=True,
    figsize=(6, 18),
);

In [21]:
unpooled_estimates = unpooled_trace.posterior.mean(dim=("chain", "draw"))['mu']
unpooled_se = unpooled_trace.posterior.std(dim=("chain", "draw"))['mu']

We can plot the ordered estimates to identify counties with high radon levels:

In [None]:
unpooled_means = unpooled_trace.posterior.mean(dim=("chain", "draw"))
unpooled_hdi = az.hdi(unpooled_trace)

unpooled_means_iter = unpooled_means.sortby("mu")
unpooled_hdi_iter = unpooled_hdi.sortby(unpooled_means_iter.mu)

_, ax = plt.subplots(figsize=(10, 6))
xticks = np.arange(0, 86, 6)
unpooled_means_iter.plot.scatter(x="county", y="mu", ax=ax, alpha=0.8)
ax.vlines(
    np.arange(mn_counties.shape[0]),
    unpooled_hdi_iter.mu.sel(hdi="lower"),
    unpooled_hdi_iter.mu.sel(hdi="higher"),
    color="orange",
    alpha=0.6,
)
ax.set(ylabel="Radon estimate", ylim=(-2, 4.5))
ax.set_xticks(xticks)
ax.set_xticklabels(unpooled_means_iter.county.values[xticks])
ax.tick_params(rotation=45)
sns.despine(trim=True);

Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.

In [None]:
sample_counties = (
    "LAC QUI PARLE",
    "AITKIN",
    "KOOCHICHING",
    "DOUGLAS",
    "CLAY",
    "STEARNS",
    "RAMSEY",
    "ST LOUIS",
)

fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
axes = axes.ravel()
m = unpooled_trace.posterior.mean(dim=("chain", "draw")).beta
for i, c in enumerate(sample_counties):
    y, x = srrs_mn.filter(pl.col("county") == c)[["log_radon", "floor"]].to_numpy().T
    axes[i].scatter(x + np.random.randn(len(x)) * 0.01, y, alpha=0.4)

    # No pooling model
    b = unpooled_estimates.sel(county=c)

    # Plot both models and data
    xvals = np.linspace(0, 1)
    axes[i].plot(xvals, m.values * xvals + b.values)
    axes[i].plot(xvals, beta_mean * xvals + mu_mean, "r--")
    axes[i].set_xticks([0, 1])
    axes[i].set_xticklabels(["basement", "floor"])
    axes[i].set_ylim(-1, 3)
    axes[i].set_title(c)
    if not i % 4:  # label only the left-most panel of each row
        axes[i].set_ylabel("log radon level")

Neither of these models is satisfactory:

* if we are trying to identify high-radon counties, complete pooling is useless, since it returns the same estimate for every county
* we do not trust extreme unpooled estimates produced by models using few observations

## Multilevel and hierarchical models

When we pool our data, we imply that they are sampled from the same model. This ignores any variation among sampling units (other than sampling variance):

![pooled](images/pooled_model.png)

When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are too large to combine them:

![unpooled](images/unpooled_model.png)

In a hierarchical model, parameters are viewed as a sample from a population distribution of parameters. Thus, we view them as being neither entirely different nor exactly the same. This is ***partial pooling***.

![hierarchical](images/partial_pooled_model.png)

We can use PyMC to easily specify multilevel models, and fit them using Markov chain Monte Carlo.

## Partial pooling model

The simplest partial pooling model for the household radon dataset is one which simply estimates radon levels, without any predictors at any level. A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates.

$$\hat{\alpha}_j \approx \frac{(n_j/\sigma_y^2)\bar{y}_j + (1/\sigma_{\alpha}^2)\bar{y}}{(n_j/\sigma_y^2) + (1/\sigma_{\alpha}^2)}$$

Estimates for counties with smaller sample sizes will shrink towards the state-wide average.

Estimates for counties with larger sample sizes will be closer to the unpooled county estimates.
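
The weighted-average formula above can be evaluated directly to see the shrinkage. The function `partial_pool` and all numeric values below are hypothetical, chosen only to illustrate how the estimate moves as the county sample size changes.

```python
def partial_pool(ybar_j, n_j, ybar_all, sigma_y, sigma_alpha):
    """Approximate multilevel estimate of a county mean: a precision-weighted
    average of the county's own mean and the overall mean."""
    w_county = n_j / sigma_y**2
    w_all = 1.0 / sigma_alpha**2
    return (w_county * ybar_j + w_all * ybar_all) / (w_county + w_all)


# Hypothetical values: every county has the same raw mean, only n_j differs.
ybar_all, sigma_y, sigma_alpha = 1.3, 0.8, 0.3
for n_j in (1, 5, 50, 500):
    est = partial_pool(ybar_j=2.5, n_j=n_j, ybar_all=ybar_all,
                       sigma_y=sigma_y, sigma_alpha=sigma_alpha)
    print(f"n_j = {n_j:>3}: partially pooled estimate = {est:.2f}")
```

With a single observation the estimate stays close to the overall mean of 1.3; with hundreds of observations it is almost equal to the county's own mean of 2.5.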

In [24]:
with pm.Model(coords=coords) as partial_pooling:

    # Priors
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=1e5)
    sigma_a = pm.HalfCauchy("sigma_a", 5)

    # Random intercepts
    mu = pm.Normal("mu", mu=mu_a, sigma=sigma_a, dims="county")

    # Model error
    sigma_y = pm.HalfCauchy("sigma_y", 5)

    # Expected value
    y_hat = mu[county]

    # Data likelihood
    y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon)

In [None]:
pm.model_to_graphviz(partial_pooling)

In [None]:
with partial_pooling:
    partial_pooling_trace = pm.sample(tune=2000, random_seed=21)

In [None]:
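# number of observations per county, ordered by the integer county codes used to index `mu`
# (a plain group_by does not guarantee that its output follows this order)
N_county = np.bincount(county, minlength=len(mn_counties))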

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, trace, level in zip(
    axes,
    (unpooled_trace, partial_pooling_trace),
    ("no pooling", "partial pooling"),
):

    # add variable with x values to xarray dataset
    trace.posterior = trace.posterior.assign_coords({"N_county": ("county", N_county)})
    # plot means
    trace.posterior.mean(dim=("chain", "draw")).plot.scatter(
        x="N_county", y="mu", ax=ax, alpha=0.9
    )
    ax.hlines(
        partial_pooling_trace.posterior.mu.mean(),
        0.9,
        max(N_county) + 1,
        alpha=0.4,
        ls="--",
        label="Est. population mean",
    )

    # plot hdi
    hdi = az.hdi(trace).mu
    ax.vlines(
        N_county, hdi.sel(hdi="lower"), hdi.sel(hdi="higher"), color="orange", alpha=0.5
    )

    ax.set(
        title=f"{level.title()} Estimates",
        xlabel="Nbr obs in county (log scale)",
        xscale="log",
        ylabel="Log radon",
    )
    ax.legend(fontsize=10)

Notice the difference between the unpooled and partially-pooled estimates, particularly at smaller sample sizes. The former are both more extreme and more imprecise.

## Varying intercept model

This model allows intercepts to vary across county, according to a random effect.

$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$

where

$$\epsilon_i \sim N(0, \sigma_y^2)$$

and the intercept random effect:

$$\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$

As with the “no-pooling” model, we set a separate intercept for each county, but rather than fitting separate least squares regression models for each county, multilevel modeling **shares strength** among counties, allowing for more reasonable inference in counties with little data.

### Exercise

Modify the previous model to include a shared slope, `beta`.

In [34]:
with pm.Model(coords=coords) as varying_intercept:

    # Write your code here

In [None]:
pm.model_to_graphviz(varying_intercept)

In [None]:
with varying_intercept:
    varying_intercept_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED)

In [None]:
az.plot_forest(
    varying_intercept_trace,
    var_names=["mu"],
    figsize=(6, 18),
    combined=True,
);

The estimate for the `floor` coefficient is approximately -0.66. Because the response is log radon, this means that measurements taken on the first floor, rather than in a basement, show about half ($\exp(-0.66) \approx 0.52$) the radon level after accounting for county.
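
The back-transformation from the log scale is a one-line calculation. A minimal sketch, taking the approximate value -0.66 quoted above as given:

```python
import math

beta_floor = -0.66  # approximate posterior mean of the floor coefficient

# On the log scale the effect is additive, so on the original scale it is multiplicative
ratio = math.exp(beta_floor)
print(f"floor / basement radon ratio: {ratio:.2f}")
print(f"percent reduction: {100 * (1 - ratio):.0f}%")
```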

In [None]:
az.summary(varying_intercept_trace, var_names=["beta"])

In [None]:
xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]})
post = varying_intercept_trace.posterior  # alias for readability
theta = (
    (post.mu + post.beta * xvals)
    .mean(dim=("chain", "draw"))
    .to_dataset(name="Mean log radon")
)

_, ax = plt.subplots()
theta.plot.scatter(
    x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax
)  # scatter
ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2)
# add lines too
ax.set_title("MEAN LOG RADON BY COUNTY");

It is easy to show that the partial pooling model provides more reasonable estimates than either the pooled or unpooled models, at least for counties with small sample sizes.

In [None]:
sample_counties = (
    "LAC QUI PARLE",
    "AITKIN",
    "KOOCHICHING",
    "DOUGLAS",
    "CLAY",
    "STEARNS",
    "RAMSEY",
    "ST LOUIS",
)

fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
axes = axes.ravel()
m = unpooled_trace.posterior.mean(dim=("chain", "draw")).beta
for i, c in enumerate(sample_counties):
    y, x = srrs_mn.filter(pl.col("county") == c)[["log_radon", "floor"]].to_numpy().T
    axes[i].scatter(x + np.random.randn(len(x)) * 0.01, y, alpha=0.4)

    # No pooling model
    b = unpooled_estimates.sel(county=c)

    # Plot both models and data
    xvals = np.linspace(0, 1)
    axes[i].plot(xvals, m.values * xvals + b.values)
    axes[i].plot(xvals, beta_mean * xvals + mu_mean, "r--")
    post = varying_intercept_trace.posterior.sel(county=c).mean(dim=("chain", "draw"))
    theta = post.mu.values + post.beta.values * xvals
    axes[i].plot(xvals, theta, "k:")
    axes[i].set_xticks([0, 1])
    axes[i].set_xticklabels(["basement", "floor"])
    axes[i].set_ylim(-1, 3)
    axes[i].set_title(c)
    if i % 4 == 0:  # label only the left-most panel of each row
        axes[i].set_ylabel("log radon level")

### Exercise: Varying slope model

Alternatively, we can posit a model that allows the counties to vary according to how the location of measurement (basement or floor) influences the radon reading.

$$y_i = \alpha + \beta_{j[i]} x_{i} + \epsilon_i$$

Construct a model called `varying_slope` that implements this alternative model.


In [41]:
with pm.Model(coords=coords) as varying_slope:

    # Write your code here

In [None]:
with varying_slope:
    varying_slope_trace = pm.sample(tune=3000, random_seed=RANDOM_SEED)

## Non-centered Parameterization

The partial pooling model specified above uses a **centered** parameterization of the slope random effect. That is, the individual county effects are distributed around a group mean, with a spread controlled by the hierarchical standard deviation parameter. As the preceding plot reveals, this constraint serves to **shrink** county estimates toward the overall mean, with stronger shrinkage for counties with smaller sample sizes. This is exactly what we want, and the model appears to fit well: the Gelman-Rubin statistics are very close to 1.

But, on closer inspection, there are signs of trouble. Specifically, let's look at the trace of the random effects, and their corresponding standard deviation:

In [None]:
fig, axs = plt.subplots(nrows=2)
axs[0].plot(varying_slope_trace.posterior.sel(chain=0)["sigma_b"], alpha=0.5)
axs[0].set(ylabel="sigma_b")
axs[1].plot(varying_slope_trace.posterior.sel(chain=0)["beta"], alpha=0.05)
axs[1].set(ylabel="beta");

Notice that when the chain reaches the lower end of the parameter space for $\sigma_b$, it appears to get "stuck" and the entire sampler, including the random slopes `beta`, mixes poorly.

Jointly plotting the random effect standard deviation and one of the individual random slopes demonstrates what is going on.

In [None]:
x = varying_slope_trace.posterior["beta"].sel(chain=0, county="AITKIN").to_series()
x.name = "slope"
y = varying_slope_trace.posterior["sigma_b"].sel(chain=0).to_series()
y.name = "slope group standard deviation"

jp = sns.jointplot(x=x, y=y, ylim=(0, 0.7));

When the group standard deviation is small, the individual random slopes are themselves close to the group mean. This results in a *funnel*-shaped relationship between the samples of the group standard deviation and any of the slopes (particularly those with a smaller sample size).

In itself, this is not a problem, since this is the behavior we expect. However, if the sampler is tuned for the wider (unconstrained) part of the parameter space, it has trouble in the areas of higher curvature. The consequence of this is that the neighborhood close to the lower bound of $\sigma_b$ is sampled poorly; indeed, in our chain it is not sampled at all below 0.1. The result of this will be biased inference.
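
The funnel geometry can be reproduced without any MCMC. The sketch below draws a group scale and then a slope conditional on that scale, and compares the spread of the slopes in the narrow and wide parts of the funnel. The half-normal distribution for the scale and all numeric values are assumptions made for illustration, not the priors used in the model above.

```python
import random
import statistics

random.seed(0)

draws = []
for _ in range(20000):
    sigma_b = abs(random.gauss(0.0, 0.5))   # hypothetical half-normal group scale
    beta_j = random.gauss(0.0, sigma_b)     # centered: the slope's scale depends on sigma_b
    draws.append((sigma_b, beta_j))

# Spread of the slope in the neck of the funnel versus the mouth
narrow = [b for s, b in draws if s < 0.1]
wide = [b for s, b in draws if s > 0.5]

sd_narrow = statistics.pstdev(narrow)
sd_wide = statistics.pstdev(wide)
print(f"sd of slope when sigma_b < 0.1: {sd_narrow:.3f}")
print(f"sd of slope when sigma_b > 0.5: {sd_wide:.3f}")
```

A sampler whose step size suits the wide region takes steps that are far too large for the narrow one, which is why the neck of the funnel is visited so rarely.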

Now that we've spotted the problem, what can we do about it? The best way to deal with this issue is to reparameterize our model. Notice the random slopes in this version:

In [45]:
with pm.Model(coords=coords) as varying_slope_noncentered:

    # Priors
    mu_b = pm.Normal("mu_b", mu=0.0, sigma=1e5)
    sigma_b = pm.HalfCauchy("sigma_b", 5)

    # Common intercepts
    mu = pm.Normal("mu", mu=0.0, sigma=1e5)

    # Non-centered random slopes
    # Centered: b = pm.Normal('b', mu_b, sigma=sigma_b, shape=counties)
    z = pm.Normal("z", mu=0, sigma=1, dims="county")
    beta = pm.Deterministic("beta", mu_b + z * sigma_b, dims="county")

    # Model error
    sigma_y = pm.HalfCauchy("sigma_y", 5)

    # Expected value
    y_hat = mu + beta[county] * floor_measure

    # Data likelihood
    y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon)

In [None]:
pm.model_to_graphviz(varying_slope_noncentered)

This is a **non-centered** parameterization. By this, we mean that the random deviates are no longer explicitly modeled as being centered on $\mu_b$. Instead, they are independent standard normals $z$, which are scaled by $\sigma_b$ and then shifted by the mean $\mu_b$.
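
The two parameterizations describe the same distribution for the slopes; only the quantity the sampler works with changes. A minimal check of that equivalence, using the standard library and hypothetical values for $\mu_b$ and $\sigma_b$:

```python
import random
import statistics

random.seed(1)

mu_b, sigma_b = -0.6, 0.3  # hypothetical values for illustration
n = 50000

# Centered: draw the slope directly from N(mu_b, sigma_b)
centered = [random.gauss(mu_b, sigma_b) for _ in range(n)]

# Non-centered: draw a standard normal, then scale and shift it
z = [random.gauss(0.0, 1.0) for _ in range(n)]
noncentered = [mu_b + sigma_b * z_i for z_i in z]

for name, draws in (("centered", centered), ("non-centered", noncentered)):
    print(f"{name:>12}: mean = {statistics.mean(draws):.3f}, sd = {statistics.pstdev(draws):.3f}")
```

In the non-centered form the sampled variable `z` has unit scale regardless of $\sigma_b$, so the funnel-shaped dependence between the slopes and their scale is removed from the space the sampler explores.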

This model samples much better.

In [None]:
with varying_slope_noncentered:
    noncentered_trace = pm.sample(
        tune=3000, target_accept=0.99, random_seed=RANDOM_SEED
    )

Notice that the bottlenecks in the traces are (mostly) gone.

In [None]:
fig, axs = plt.subplots(nrows=2)
axs[0].plot(noncentered_trace.posterior.sel(chain=0)["sigma_b"], alpha=0.5)
axs[0].set(ylabel="sigma_b")
axs[1].plot(noncentered_trace.posterior.sel(chain=0)["beta"], alpha=0.05)
axs[1].set(ylabel="beta");

We are also now exploring the support of the posterior more fully.

In [None]:
x = noncentered_trace.posterior["beta"].sel(chain=0, county="AITKIN").to_series()
x.name = "slope"
y = noncentered_trace.posterior["sigma_b"].sel(chain=0).to_series()
y.name = "slope group standard deviation"

jp = sns.jointplot(x=x, y=y, ylim=(0, 0.7));

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, constrained_layout=True)
az.plot_posterior(varying_slope_trace, var_names=["sigma_b"], ax=ax1)
az.plot_posterior(noncentered_trace, var_names=["sigma_b"], ax=ax2)
ax1.set_title("Centered (top) and non-centered (bottom)");

## Varying intercept and slope model

The most general model allows both the intercept and slope to vary by county:

$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_{i} + \epsilon_i$$


In [51]:
with pm.Model(coords=coords) as varying_intercept_slope:

    # Priors
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=1e5)
    sigma_a = pm.HalfCauchy("sigma_a", 5)

    mu_b = pm.Normal("mu_b", mu=0.0, sigma=1e5)
    sigma_b = pm.HalfCauchy("sigma_b", 5)

    # Random intercepts
    mu = pm.Normal("mu", mu=mu_a, sigma=sigma_a, dims="county")
    
    # Random slopes
    beta = pm.Normal("beta", mu=mu_b, sigma=sigma_b, dims="county")

    # Model error
    sigma_y = pm.Uniform("sigma_y", lower=0, upper=100)

    # Expected value
    y_hat = mu[county] + beta[county] * floor_measure

    # Data likelihood
    y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon)

In [None]:
with varying_intercept_slope:
    varying_intercept_slope_trace = pm.sample(tune=4000, random_seed=RANDOM_SEED)

In [None]:
az.plot_forest(
    varying_intercept_slope_trace,
    var_names=["mu", "beta"],
    figsize=(6, 24),
    combined=True,
    ess=True,
    r_hat=True,
);

In [None]:
xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]})
post = varying_intercept_slope_trace.posterior  # alias for readability
theta = (
    (post.mu + post.beta * xvals)
    .mean(dim=("chain", "draw"))
    .to_dataset(name="Mean log radon")
)

_, ax = plt.subplots()
theta.plot.scatter(
    x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax
)  # scatter
ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2)
# add lines too
ax.set_title("MEAN LOG RADON BY COUNTY");

### Exercise

Reparameterize the `varying_intercept_slope` model to be non-centered, and compare the resulting parameter estimates.

In [55]:
with pm.Model(coords=coords) as varying_intercept_slope_noncentered:

    # Write your code here

## Adding group-level predictors

A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above:

$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$

we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading $u_j$, which is thought to be related to radon levels:

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$

$$\zeta_j \sim N(0, \sigma_{\alpha}^2)$$

Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).

Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.

Group-level predictors also serve to reduce group-level variation $\sigma_{\alpha}$. An important implication of this is that the group-level estimate induces stronger pooling.
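
The reduction in group-level variation can be illustrated with a small simulation. The sketch below generates county intercepts from the group-level regression and compares their total spread with the spread left over once the county covariate is accounted for. All parameter values, and the distribution used for the simulated uranium values, are hypothetical.

```python
import random
import statistics

random.seed(7)

gamma_0, gamma_1, sigma_alpha = 1.5, 0.7, 0.15  # hypothetical values
n_counties = 85

# Simulated county-level covariate and intercepts
u = [random.gauss(0.0, 0.4) for _ in range(n_counties)]
alpha = [gamma_0 + gamma_1 * u_j + random.gauss(0.0, sigma_alpha) for u_j in u]

# Least-squares fit of the intercepts on the covariate
u_bar, a_bar = statistics.mean(u), statistics.mean(alpha)
slope = sum((x - u_bar) * (a - a_bar) for x, a in zip(u, alpha)) / sum(
    (x - u_bar) ** 2 for x in u
)
intercept = a_bar - slope * u_bar
residuals = [a - (intercept + slope * x) for x, a in zip(u, alpha)]

sd_total = statistics.pstdev(alpha)      # spread with no group-level predictor
sd_resid = statistics.pstdev(residuals)  # spread left after accounting for u
print(f"sd of intercepts: {sd_total:.3f}, sd after removing uranium trend: {sd_resid:.3f}")
```

The smaller residual spread plays the role of a smaller $\sigma_{\alpha}$, which is what pulls county intercepts more strongly toward the group-level regression line.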

In [56]:
with pm.Model(coords=coords) as hierarchical_intercept:

    # Priors
    sigma_a = pm.HalfCauchy("sigma_a", 5)

    # County uranium model
    gamma_0 = pm.Normal("gamma_0", mu=0.0, sigma=1e5)
    gamma_1 = pm.Normal("gamma_1", mu=0.0, sigma=1e5)

    # Uranium model for intercept
    mu_a = pm.Deterministic("mu_a", gamma_0 + gamma_1 * u)
    
    # County variation not explained by uranium
    epsilon_a = pm.Normal("epsilon_a", mu=0, sigma=1, dims="county")
    mu = pm.Deterministic("mu", mu_a + sigma_a * epsilon_a)

    # Common slope
    beta = pm.Normal("beta", mu=0.0, sigma=1e5)

    # Model error
    sigma_y = pm.Uniform("sigma_y", lower=0, upper=100)

    # Expected value
    y_hat = mu[county] + beta * floor_measure

    # Data likelihood
    y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon)

In [None]:
with hierarchical_intercept:
    hierarchical_intercept_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED)

In [None]:
az.plot_trace(hierarchical_intercept_trace, var_names="gamma_1")

In [None]:
uranium = u
post = hierarchical_intercept_trace.posterior.assign_coords(uranium=uranium)
avg_a = post["mu_a"].mean(dim=("chain", "draw")).values[np.argsort(uranium)]
avg_a_county = post["mu"].mean(dim=("chain", "draw"))
avg_a_county_hdi = az.hdi(post, var_names="mu")["mu"]

_, ax = plt.subplots()
ax.plot(uranium[np.argsort(uranium)], avg_a, "k--", alpha=0.6, label="Mean intercept")
az.plot_hdi(
    uranium,
    post["mu"],
    fill_kwargs={"alpha": 0.1, "color": "k", "label": "Mean intercept HDI"},
    ax=ax,
)
ax.scatter(uranium, avg_a_county, alpha=0.8, label="Mean county-intercept")
ax.vlines(
    uranium,
    avg_a_county_hdi.sel(hdi="lower"),
    avg_a_county_hdi.sel(hdi="higher"),
    alpha=0.5,
    color="orange",
)
plt.xlabel("County-level uranium")
plt.ylabel("Intercept estimate")
plt.legend(fontsize=9);

The posterior intervals for the intercepts are narrower than for the partial-pooling model without a county-level covariate.

### Correlations among levels

In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept.

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \gamma_2 \bar{x}_j + \zeta_j$$

These are broadly referred to as ***contextual effects***.

In [60]:
# County-level mean of the floor indicator (proportion of first-floor measurements)
xbar = srrs_mn.group_by("county").agg(pl.col("floor").mean())["floor"].to_numpy()

In [61]:
with pm.Model(coords=coords) as contextual_effect:

    floor_idx = pm.Data("floor_idx", floor_measure.astype(int))
    county_idx = pm.Data("county_idx", county.astype(int))
    radon_data = pm.Data("radon_data", log_radon)

    # Priors
    sigma_a = pm.HalfCauchy("sigma_a", 5)

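    # Group-level model for the county intercepts.
    # Simplified here to a single scalar `gamma`; the commented-out lines show
    # the full version with uranium `u` and the county floor mean `xbar`.
    # gamma = pm.Normal('gamma', mu=0., sigma=1e5, shape=3)
    gamma = pm.Normal("gamma", mu=0.0, sigma=1e5)

    # Expected county intercept
    # mu_a = pm.Deterministic('mu_a', gamma[0] + gamma[1]*u + gamma[2]*xbar)
    mu_a = pm.Deterministic("mu_a", gamma)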

    # County variation not explained by uranium
    epsilon_a = pm.Normal("epsilon_a", mu=0, sigma=1, dims="county")
    mu = pm.Deterministic("mu", mu_a + sigma_a * epsilon_a)

    # Common slope
    beta = pm.Normal("beta", mu=0.0, sigma=1e5)

    # Model error
    sigma_y = pm.Uniform("sigma_y", lower=0, upper=100)

    # Expected value
    y_hat = mu[county_idx] + beta * floor_idx

    # Data likelihood
    y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=radon_data)

In [None]:
pm.model_to_graphviz(contextual_effect)

In [None]:
with contextual_effect:
    contextual_effect_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED)

In [None]:
az.plot_forest(
    contextual_effect_trace, var_names=["gamma"], combined=True, ess=True, r_hat=True
);

In [None]:
az.summary(contextual_effect_trace, var_names=["gamma"])

So, we might infer from this that counties with higher proportions of houses without basements tend to have higher baseline levels of radon. Perhaps this is related to the soil type, which in turn might influence what type of structures are built.

### Prediction

Gelman (2006) used cross-validation tests to check the prediction error of the unpooled, pooled, and partially-pooled models.

**root mean squared cross-validation prediction errors**:

* unpooled = 0.86
* pooled = 0.84
* multilevel = 0.79

There are two types of prediction that can be made in a multilevel model:

1. a new individual within an existing group
2. a new individual within a new group

For example, if we wanted to make a prediction for a new house with no basement in St. Louis and Kanabec counties, we just need to sample from the radon model with the appropriate intercept.

That is, 

$$\tilde{y}_i \sim N(\alpha_{69} + \beta (x_i=1), \sigma_y^2)$$
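
To see what this sampling statement amounts to, it can be carried out by hand: for every posterior draw of the parameters, draw one new observation. The sketch below is independent of PyMC and uses made-up numbers as stand-ins for posterior draws. In practice those draws would come from the trace.

```python
import random
import statistics

random.seed(3)

# Hypothetical stand-ins for posterior draws (not values from the radon fit)
n_draws = 4000
alpha_69 = [random.gauss(0.9, 0.07) for _ in range(n_draws)]       # county intercept
beta = [random.gauss(-0.66, 0.07) for _ in range(n_draws)]         # floor effect
sigma_y = [abs(random.gauss(0.75, 0.02)) for _ in range(n_draws)]  # house-level noise

x_new = 1  # first-floor measurement

# One predictive draw per posterior draw: parameter uncertainty and
# observation noise are both propagated into the prediction
y_tilde = [
    random.gauss(a + b * x_new, s) for a, b, s in zip(alpha_69, beta, sigma_y)
]

print(f"predictive mean: {statistics.mean(y_tilde):.2f}")
print(f"predictive sd:   {statistics.pstdev(y_tilde):.2f}")
```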

Because we registered the county index and floor values as `pm.Data` containers earlier, we can set them directly to the desired values (69 and 1 respectively) and sample the corresponding posterior predictions without having to redefine and recompile the model. Using the model just above:

In [None]:
prediction_coords = {"obs_id": ["ST LOUIS", "KANABEC"]}
with contextual_effect:
    pm.set_data(
        {
            "county_idx": np.array([69, 31]),
            "floor_idx": np.array([1, 1]),
            "radon_data": np.ones(2),
        }
    )
    stl_pred = pm.sample_posterior_predictive(contextual_effect_trace.posterior)

contextual_effect_trace.extend(stl_pred)

In [None]:
contextual_effect_trace

In [None]:
az.plot_posterior(contextual_effect_trace, group="posterior_predictive");

Prediction for a house within a new county is a little trickier. It is actually easier to create a new model to work with, **but use the trace from the original model for posterior predictive sampling**. 

How can this work?

First, consider how posterior predictive sampling works in PyMC: samples are drawn not from the distributions themselves, but from the set of samples in the trace. Therefore, we can take the trace from the original model, and use it to sample posterior predictions from a new model that has the same variables.

The variables in the new model need only have the same name as the original -- to reinforce this, I will use `pm.Flat` variables as placeholders in this example. The only variables we actually need are the ones that need to be resampled for a new county.

We don't even need `Data` here; we can use raw data, since we are just creating this model to get posterior predictions for houses in this notional new county.

In [None]:
with pm.Model() as new_county_house:

    # New data
    # u_new = np.array([-0.2, 0.3])
    # xbar = np.array([0.5, 0.8])
    floor_idx = np.array([1, 0])

    # Placeholders for variables already in the trace
    sigma_a = pm.Flat("sigma_a")
    gamma = pm.Flat("gamma")  # , shape=3)
    beta = pm.Flat("beta")
    sigma_y = pm.Flat("sigma_y")

    # Calculate new county expected value
    mu_a_new = pm.Deterministic(
        "mu_a_new", gamma
    )  # [0] + gamma[1]*u_new + gamma[2]*xbar)

    # Sample from the county intercept distribution
    mu_new = pm.Normal("mu_new", mu_a_new, sigma_a)

    # Expected value for houses in new county
    y_hat_new = mu_new + beta * floor_idx

    y_new = pm.Normal("y_new", mu=y_hat_new, sigma=sigma_y)

    pp_new = pm.sample_posterior_predictive(
        contextual_effect_trace, var_names=["y_new"]
    )

In [None]:
pm.model_to_graphviz(new_county_house)
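
What the placeholder model computes can also be sketched by hand. For each posterior draw, a fresh intercept is drawn for the unseen county from the group-level distribution, and a house-level observation is then drawn around it. The numbers below are made-up stand-ins for posterior draws, used only to show the mechanics.

```python
import random
import statistics

random.seed(11)

# Hypothetical stand-ins for posterior draws (not values from the radon fit)
n_draws = 4000
gamma = [random.gauss(1.4, 0.05) for _ in range(n_draws)]          # group-level mean
sigma_a = [abs(random.gauss(0.3, 0.03)) for _ in range(n_draws)]   # between-county sd
beta = [random.gauss(-0.66, 0.07) for _ in range(n_draws)]         # floor effect
sigma_y = [abs(random.gauss(0.75, 0.02)) for _ in range(n_draws)]  # house-level noise

floor_new = 1  # first-floor measurement in the new county

y_new = []
for g, sa, b, sy in zip(gamma, sigma_a, beta, sigma_y):
    alpha_new = random.gauss(g, sa)  # intercept for the unseen county
    y_new.append(random.gauss(alpha_new + b * floor_new, sy))

print(f"predictive mean: {statistics.mean(y_new):.2f}")
print(f"predictive sd:   {statistics.pstdev(y_new):.2f}")
```

Because the new county's intercept is itself uncertain, the predictive spread is wider than the house-level noise alone.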

## Benefits of Multilevel Models

- Accounting for natural hierarchical structure of observational data
- Estimation of coefficients for (under-represented) groups
- Incorporating individual- and group-level information when estimating group-level coefficients
- Allowing for variation among individual-level coefficients across groups


---
## References

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.

Betancourt, M. J., & Girolami, M. (2013). Hamiltonian Monte Carlo for Hierarchical Models.

Gelman, A. (2006). Multilevel (Hierarchical) modeling: what it can and cannot do. Technometrics, 48(3), 432–435.

In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w

---
## Comprehensive Summary and Key Insights

### The Journey We've Taken

In this extensive session, we've progressed through three fundamental approaches to regression modeling in the Bayesian framework:

1. **Linear Regression**: We started with the foundation, learning how to quantify uncertainty, handle multicollinearity, transform variables for better fit, and implement robust methods for outlier-resistant inference.

2. **Generalized Linear Models**: We extended our toolkit to handle non-normal outcomes, understanding how link functions and appropriate distributions allow us to model counts, proportions, and other constrained outcomes.

3. **Hierarchical Models**: We culminated with partial pooling approaches that respect natural grouping structures in data, automatically balancing between complete and no pooling based on the information available.

### Key Conceptual Insights

#### The Power of Probabilistic Thinking
Throughout this session, we've seen how treating parameters as probability distributions rather than point estimates provides richer inference. This isn't just mathematical sophistication—it's practical. When a shipping company needs to know the probability that a package exceeds a weight threshold, or when a public health official needs to understand the range of plausible contamination levels, probability distributions provide direct answers.

#### The Importance of Model Checking
We've consistently used posterior predictive checks to identify model inadequacies. This iterative process—propose, check, revise—is at the heart of practical Bayesian modeling. No model is perfect, but through careful checking, we can identify where our models fail and how to improve them.

#### The Value of Hierarchical Thinking
The hierarchical modeling framework provides a principled way to handle one of the most common features of real data: grouping structure. By modeling groups as similar but not identical, we get better predictions, especially for groups with limited data. The automatic shrinkage toward the population mean is not a bug—it's a feature that improves our estimates by incorporating information from the broader population.

### Practical Guidelines for Your Own Work

When approaching a new regression problem:

1. **Start Simple**: Begin with the simplest model that could plausibly work. A simple linear regression often reveals important patterns and problems.

2. **Check Thoroughly**: Use posterior predictive checks, residual plots, and other diagnostics to identify where your model fails.

3. **Build Incrementally**: Add complexity based on identified inadequacies, not speculation. If residuals show non-linearity, add transformations. If variance is non-constant, consider robust methods or GLMs.

4. **Respect Structure**: If your data has grouping, use it. Hierarchical models often provide dramatic improvements over pooled or unpooled alternatives.

5. **Quantify Uncertainty**: Report full posteriors or credible intervals, not just point estimates. Decision-makers need to understand uncertainty.

6. **Validate Honestly**: Always check performance on held-out data when possible. In-sample fit can be misleading.

### Connecting to the Broader Bayesian Workflow

The models we've studied fit into a broader workflow:

1. **Problem Formulation**: Understand the scientific question and data structure
2. **Model Building**: Start simple, respect structure, choose appropriate distributions
3. **Prior Specification**: Encode knowledge while ensuring computational stability
4. **Model Fitting**: Use MCMC (or variational inference for larger problems)
5. **Diagnostic Checking**: Ensure chains converged and model fits reasonably
6. **Model Comparison**: Use LOO-CV or WAIC to compare alternatives
7. **Posterior Analysis**: Extract insights, make predictions, inform decisions

### Where to Go From Here

This session provides a solid foundation, but there's always more to learn:

- **Advanced GLMs**: Zero-inflated models, hurdle models, ordinal regression
- **Gaussian Processes**: Non-parametric regression for complex relationships
- **Time Series**: State-space models, dynamic linear models
- **Causal Inference**: Using Bayesian methods for causal questions
- **Computational Methods**: Variational inference, Hamiltonian Monte Carlo tuning

### Final Thoughts

Bayesian regression modeling is as much art as science. The framework provides powerful tools, but applying them effectively requires practice, intuition, and domain knowledge. Every dataset tells a story—our job as modelers is to listen carefully and translate that story into mathematical form.

Remember: all models are wrong, but some are useful. The Bayesian framework helps us build useful models by quantifying uncertainty, incorporating prior knowledge, and respecting the structure in our data. Use these tools thoughtfully, check your work carefully, and always remember that the goal is insight, not just good fit.

In the next session, we'll dive deeper into model checking and the complete Bayesian workflow, building on the foundation we've established here.