[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/vi_pydata_virginia_2025/blob/master/notebooks/Pathfinder.ipynb)

# Pathfinder Variational Inference

Pathfinder (Zhang et al. 2022) offers an elegant solution to many of the common challenges encountered in traditional variational inference. Rather than directly optimizing a variational distribution to match the posterior, Pathfinder constructs a geometric path through the probability space that connects a simple initial distribution to the target posterior. It offers several advantages:

1. **Speed**: Requires far fewer gradient evaluations than ADVI or MCMC warmup
2. **Scalability**: Performs well on larger problems and high-dimensional models
3. **Parallelization**: Can compute ELBO estimates in parallel, unlike ADVI

On large problems, it should scale better than most MCMC algorithms, including gradient-based methods like NUTS, and requires 1-2 orders of magnitude fewer log density and gradient evaluations than ADVI and the MCMC warmup phase. Moreover, Pathfinder can perform the Monte Carlo KL divergence estimates used to compute ELBO in parallel, providing a major advantage over ADVI, which must evaluate the ELBO sequentially.

Like ADVI, Pathfinder's computational efficiency may come at the cost of a more biased estimate of the posterior, but this bias can be managed through the algorithm's settings.

In this tutorial, we'll look at how Pathfinder works conceptually, how to use it in PyMC, and examine its performance across several example problems.

## How Pathfinder Works

At a high level, Pathfinder works like this:

1. **Optimization Path**: Uses L-BFGS optimization to find a good path through the parameter space
2. **Local Approximations**: Creates normal (Gaussian) approximations at different points along this path
3. **Sample From Approximations**: Draws Monte Carlo samples from these approximations
4. **ELBO Evaluation**: Selects the best approximation using the Evidence Lower Bound (ELBO)
5. **Final Sampling**: Draws samples from the best approximation for inference

The name "Pathfinder" comes from the fact that it finds a good path through parameter space to locate a high-quality approximation.

### The Optimization Path

Pathfinder starts by using a mathematical optimization technique called L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) to move through the parameter space. 

Think of this like a hiker trying to climb to the top of a mountain by always moving uphill. L-BFGS efficiently tracks the path upward, using information about how steep the slope is (gradients) and how the terrain changes (approximate Hessian matrix).

Starting from a random point at $\theta^{(0)}$, which should be at the tail-region of the posterior distribution, L-BFGS moves through the body of the distribution and towards a local maximum.

As the optimization proceeds, Pathfinder records both the positions (parameters) and the local landscape information (gradients) at each step along the path.
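The recording step can be sketched with SciPy's L-BFGS-B optimizer and a callback. This is only an illustration under stated assumptions: the toy target (a correlated 2-D normal), the starting point, and the helper names `log_density`, `grad_log_density`, and `record` are all hypothetical, and PyMC's Pathfinder implementation does not necessarily work this way internally.

```python
import numpy as np
from scipy.optimize import minimize

# Toy target (an assumption for illustration): a correlated 2-D normal.
COV = np.array([[1.0, 0.8], [0.8, 1.0]])
PRECISION = np.linalg.inv(COV)


def log_density(theta):
    """Unnormalized log density of the toy target."""
    return -0.5 * theta @ PRECISION @ theta


def grad_log_density(theta):
    """Gradient of the log density with respect to theta."""
    return -PRECISION @ theta


# Start in the tail of the distribution.
theta_init = np.array([4.0, -3.0])
positions = [theta_init.copy()]
gradients = [grad_log_density(theta_init)]


def record(xk):
    """Store the position and gradient after each L-BFGS iteration."""
    positions.append(np.array(xk, dtype=float))
    gradients.append(grad_log_density(xk))


# SciPy minimizes, so we pass the negative log density and its gradient.
result = minimize(
    lambda t: -log_density(t),
    theta_init,
    jac=lambda t: -grad_log_density(t),
    method="L-BFGS-B",
    callback=record,
    options={"maxcor": 5},  # L-BFGS history size
)

positions = np.array(positions)
gradients = np.array(gradients)
log_densities = np.array([log_density(p) for p in positions])
```

Each row of `positions` and `gradients` corresponds to one point on the optimization path, which is the information the later steps build on.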

### Creating Local Approximations

At each point $\theta^{(l)}$ along this optimization path, Pathfinder creates a normal (Gaussian) approximation to the posterior distribution. This is based on a second-order Taylor expansion of the log posterior around $\theta^{(l)}$:

$$ \log p(\theta | x) \approx \log p(\theta^{(l)} | x) + (g^{(l)})^T(\theta - \theta^{(l)}) - \frac{1}{2}(\theta - \theta^{(l)})^T \text{H}(\theta^{(l)}) (\theta - \theta^{(l)}) $$

These approximations have:

- A **mean** representing the central estimate of the parameters
- A **covariance matrix** capturing the uncertainty and relationships between parameters

$$
\begin{align*}
\mu^{(l)} &= \theta^{(l)} + \text{H}^{-1}(\theta^{(l)}) \cdot g^{(l)} \\
\Sigma^{(l)} &= \text{H}^{-1}(\theta^{(l)})
\end{align*}
$$

where $g^{(l)} = \nabla \log f(\theta^{(l)})$ is the gradient and $\text{H}(\theta^{(l)}) = - \nabla^2 \log f(\theta^{(l)})$ is the approximate negative Hessian matrix of the log unnormalized density $f$ at $\theta^{(l)}$. The mean and covariance follow from completing the square in the expansion.

Unlike ADVI, where the covariance matrix is either diagonal (mean-field ADVI) or full rank (full-rank ADVI), Pathfinder uses a low-rank plus diagonal factorization of the inverse Hessian, where the rank of the resulting covariance estimate can be controlled by the user.

The covariance matrix is constructed efficiently using information from the optimization path, capturing local curvature of the posterior distribution.
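As a small numerical check of the local approximation, the sketch below uses a toy normal target whose mean and covariance are known. It takes $g$ to be the gradient of the log density and $\text{H}$ to be the negative Hessian of the log density, so the approximation has covariance $\text{H}^{-1}$ and mean $\theta + \text{H}^{-1} g$. The exact Hessian is used here as a simplifying assumption; Pathfinder instead uses the L-BFGS estimate of the inverse Hessian. All names are hypothetical.

```python
import numpy as np

# Toy target (assumed for illustration): a normal with known mean and covariance.
m = np.array([1.0, -2.0])
S = np.array([[2.0, 0.6], [0.6, 1.0]])
P = np.linalg.inv(S)  # precision matrix


def grad_log_f(theta):
    """Gradient of the log density at theta."""
    return -P @ (theta - m)


def neg_hessian_log_f(theta):
    """Negative Hessian of the log density (constant for a normal target)."""
    return P


theta_l = np.array([3.0, 0.5])  # an arbitrary point on the path
g = grad_log_f(theta_l)
H = neg_hessian_log_f(theta_l)

Sigma = np.linalg.inv(H)   # covariance of the local approximation
mu = theta_l + Sigma @ g   # mean of the local approximation
```

Because the target is exactly normal, the local approximation built at any point recovers the true mean `m` and covariance `S`. For non-normal posteriors the approximation changes along the path, which is why Pathfinder compares several of them.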

### Sampling from Approximations

After constructing the inverse Hessian factors along the optimization path, Pathfinder needs to first sample from the resulting normal approximations and then evaluate the log density of these samples. This is necessary to compute the evidence lower bound (ELBO) in the next step, which will allow us to select the best normal approximation.

The **BFGS-Sample** algorithm generates samples from a local normal approximation to the target distribution. It takes optimization trajectory points and their gradients, along with inverse Hessian factors, and produces $K$ samples from the corresponding multivariate normal distribution. The algorithm handles two cases: 

1. when the rank is greater than or equal to the parameter dimension, it uses a direct Cholesky decomposition approach
2. otherwise, it employs a more computationally-efficient thin QR factorization method. 
    
In both cases, the algorithm transforms standard normal random variables using the covariance structure encoded in the inverse Hessian factors to produce properly distributed samples, while also computing their log density values for subsequent ELBO calculations.
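The dense (Cholesky) case can be sketched in a few lines of NumPy. This is a simplified illustration that assumes the full covariance matrix is already available; it does not reproduce the low-rank bookkeeping or the thin QR branch of BFGS-Sample, and the function name `sample_mvn_with_logq` is hypothetical.

```python
import numpy as np


def sample_mvn_with_logq(mu, Sigma, K, rng):
    """Draw K samples from N(mu, Sigma) and return them with their log densities."""
    L = np.linalg.cholesky(Sigma)              # Sigma = L @ L.T
    z = rng.standard_normal((K, mu.size))      # standard normal variates
    draws = mu + z @ L.T                       # transform to N(mu, Sigma)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    # (x - mu)^T Sigma^{-1} (x - mu) equals z^T z for these draws.
    logq = -0.5 * (mu.size * np.log(2 * np.pi) + log_det + np.sum(z**2, axis=1))
    return draws, logq


rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
draws, logq = sample_mvn_with_logq(mu, Sigma, K=5000, rng=rng)
```

Reusing the standard normal variates `z` for the log density avoids a second linear solve, which is part of what makes this step cheap.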

![](images/pathfinder.png)

### Selecting the Best Approximation

Once Pathfinder has created these local normal approximations, it needs to decide which one is best. It does this by computing the **Evidence Lower Bound (ELBO)** for each approximation.

The ELBO measures how well each approximation matches the true posterior. The higher the ELBO, the better the approximation. This calculation involves:

1. Drawing Monte Carlo samples from each normal approximation (from above)
2. Evaluating the log density of these samples under both the target distribution and the approximation
3. Computing the average difference

The approximation with the highest ELBO is selected as the best one.
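The selection step can be sketched as follows. The example assumes a one-dimensional standard normal target and four hypothetical candidate approximations whose means move toward the mode, as they might along an optimization path; the names `log_p`, `log_q`, and `candidates` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1000  # Monte Carlo draws per candidate


def log_p(x):
    """Log density of the target (a standard normal, assumed for illustration)."""
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)


def log_q(x, mu, sigma):
    """Log density of a normal approximation."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


# Hypothetical (mean, sd) pairs for approximations built along a path.
candidates = [(3.0, 1.0), (2.0, 1.0), (1.0, 1.0), (0.1, 1.0)]

elbos = []
for mu, sigma in candidates:
    draws = rng.normal(mu, sigma, size=K)
    # ELBO estimate: average of log p(draw) - log q(draw) over draws from q.
    elbos.append(np.mean(log_p(draws) - log_q(draws, mu, sigma)))

elbos = np.array(elbos)
best = int(np.argmax(elbos))  # index of the selected approximation
```

Because the ELBO for each candidate depends only on that candidate's own draws, the loop body can be evaluated in parallel across candidates.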

### Final Sampling

Once the best approximation is selected, Pathfinder draws samples from this approximation, re-using the BFGS-Sample function for this. Then, importance sampling is applied to correct for approximation bias.

These samples can then be used for posterior inference, just like samples from MCMC methods, but typically at a fraction of the computational cost.

### Importance Sampling

Like all VI methods, Pathfinder approximates the posterior $p(\theta|y)$ with a tractable probability density $q(\theta)$. The posterior expectation of any function $h(\theta)$ can be rewritten as a ratio of expectations under $q$:

$$E[h(\theta) | y] = \frac{\int h(\theta) \frac{p(\theta|y)}{q(\theta)} q(\theta) d\theta}{\int \frac{p(\theta|y)}{q(\theta)} q(\theta) d\theta}$$

Expressed this way, $w(\theta) = p(\theta|y) / q(\theta)$ can be regarded as *weights* for the $M$ values of $\theta$ sampled from $q$, which we can use to correct the sample so that it approximates $E[h(\theta) | y]$. Specifically, the **importance sampling estimate** of $E[h(\theta) | y]$ is:

$$\hat{h}_{is} = \frac{\sum_{i=1}^{M} h(\theta^{(i)})w(\theta^{(i)})}{\sum_{i=1}^{M} w(\theta^{(i)})}$$

where $\theta^{(i)}$ is the $i^{th}$ sample simulated from $q(\theta)$. The standard error for the importance sampling estimate is:

$$\text{SE}_{is} = \frac{\sqrt{\sum_{i=1}^{M} [(h(\theta^{(i)}) - \hat{h}_{is}) w(\theta^{(i)})]^2}}{\sum_{i=1}^{M} w(\theta^{(i)})}$$

The efficiency of importance sampling is related to the selection of the importance sampling distribution $q$.
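One way to see this dependence is to compare the effective sample size (ESS), $1 / \sum_i \tilde{w}_i^2$ for normalized weights $\tilde{w}_i$, under two proposals. The sketch below assumes a standard normal target and two hypothetical proposals, one slightly wider than the target and one shifted away from it. Weights are computed on the log scale for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5000  # number of draws from each proposal


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


def effective_sample_size(log_w):
    """ESS from unnormalized log weights."""
    w = np.exp(log_w - np.max(log_w))  # subtract the max to avoid overflow
    w /= w.sum()
    return 1.0 / np.sum(w**2)


def ess_for_proposal(mu_q, sigma_q):
    """ESS when estimating under a standard normal target from N(mu_q, sigma_q)."""
    draws = rng.normal(mu_q, sigma_q, size=M)
    log_w = normal_logpdf(draws, 0.0, 1.0) - normal_logpdf(draws, mu_q, sigma_q)
    return effective_sample_size(log_w)


ess_good = ess_for_proposal(0.0, 1.5)  # covers the target with heavier tails
ess_poor = ess_for_proposal(3.0, 1.0)  # shifted away from the target
```

A proposal that covers the target yields weights that are nearly uniform and an ESS close to $M$, while a poorly matched proposal concentrates the weight on a handful of draws.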

### Example: Gamma distribution

As a simple illustration of importance sampling, let's estimate the mean of a *gamma distribution* using draws from a normal distribution.

The goal will be to estimate $E[X]$ where $X \sim Gamma(3, 2)$ (shape 3, scale 2), so the true value is $3 \times 2 = 6$.



In [None]:
shape_true, scale_true = 3.0, 2.0
target_mean = shape_true * scale_true

We'll center the normal proposal near the true mean with appropriate variance:

In [None]:
mu_proposal, sigma_proposal = 6.0, 3.0

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.constrained_layout.use": True,
        "figure.figsize": (10, 6),
        "font.size": 12,
    }
)

%config InlineBackend.figure_format = 'retina'

np.random.seed(SEED:=42)

x = np.linspace(0, 15, 1000)
gamma_pdf = stats.gamma.pdf(x, shape_true, scale=scale_true)
normal_pdf = stats.norm.pdf(x, mu_proposal, sigma_proposal)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, gamma_pdf, 'r-', lw=2, label=f'Target: Gamma({shape_true}, {scale_true})')
ax.plot(x, normal_pdf, 'b--', lw=2, label=f'Proposal: Normal({mu_proposal}, {sigma_proposal})')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_title('Target vs. Proposal Distributions')
ax.legend()
ax.grid(True, alpha=0.3)

First we generate candidate samples from the normal distribution:

In [None]:
M = 10000
samples = np.random.normal(mu_proposal, sigma_proposal, M)

Calculate importance weights:

`weights = target_pdf(samples) / proposal_pdf(samples)`

which are normalized to sum to 1.

In [None]:
target_pdf = stats.gamma.pdf(samples, shape_true, scale=scale_true)
proposal_pdf = stats.norm.pdf(samples, mu_proposal, sigma_proposal)
weights = target_pdf / proposal_pdf

# Remove invalid weights (from draws where the proposal has mass but the target density is zero, i.e. x < 0)
valid_idx = (weights > 0) & np.isfinite(weights)
samples = samples[valid_idx]
weights = weights[valid_idx]

# Normalize weights to sum to 1
normalized_weights = weights / np.sum(weights)

These weights are then used to reweight the normal draws so that weighted averages approximate expectations under the gamma target:

In [None]:
# Calculate importance sampling estimate of the mean
is_mean = np.sum(samples * normalized_weights)

# Calculate standard error of the estimate
is_se = np.sqrt(np.sum(((samples - is_mean) * normalized_weights)**2))

# Calculate effective sample size
ess = 1 / np.sum(normalized_weights**2)

print(f"Importance Sampling Estimate: {is_mean:.4f}")
print(f"Standard Error: {is_se:.4f}")
print(f"True Mean: {target_mean}")
print(f"Effective Sample Size: {ess:.1f} (out of {len(samples)} samples)")

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

bins = np.linspace(0, 15, 50)
hist_weights = normalized_weights * len(normalized_weights)  
ax.hist(samples, bins=bins, weights=hist_weights, alpha=0.7, 
         color='skyblue', edgecolor='black', density=True)

ax.plot(x, gamma_pdf, 'r-', lw=2, label=f'Target: Gamma({shape_true}, {scale_true})')

ax.axvline(target_mean, color='r', linestyle='--', lw=2, label=f'True Mean: {target_mean}')
ax.axvline(is_mean, color='g', linestyle='-', lw=2, label=f'IS Mean: {is_mean:.4f}')
ax.set_title('Importance Sampling Results')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.legend()

### Multi-path Enhancement

The basic Pathfinder algorithm described above is called "Single-path Pathfinder." To improve the quality of approximation, especially for complex posteriors, we can run multiple independent paths and combine their results.

This **Multi-path Pathfinder** approach:

1. Runs multiple independent Pathfinder instances from different starting points
2. Generates $M$ draws from each path's ELBO-maximizing normal approximation
3. Uses importance sampling to combine all draws from all paths
4. Selects the final $R$ samples based on their importance weights

This is particularly useful for complex posteriors that may have multiple modes or non-normal shapes. By combining samples from different paths, we can better capture the true posterior distribution.
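The combination step can be sketched with plain importance resampling. This is a simplification under stated assumptions: the target is a standard normal, the two per-path approximations are hypothetical, each draw is weighted against the approximation that generated it, and the resampling is done with replacement and without the Pareto smoothing that Pathfinder implementations typically apply.

```python
import numpy as np

rng = np.random.default_rng(2)
M, R = 2000, 2000  # draws per path, final number of samples


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


# Hypothetical ELBO-maximizing approximations (mean, sd) from two paths.
path_approximations = [(-0.5, 1.3), (0.8, 1.3)]

all_draws, all_log_w = [], []
for mu_q, sigma_q in path_approximations:
    d = rng.normal(mu_q, sigma_q, size=M)
    all_draws.append(d)
    # Log importance weight: log target density minus log approximation density.
    all_log_w.append(normal_logpdf(d, 0.0, 1.0) - normal_logpdf(d, mu_q, sigma_q))

draws = np.concatenate(all_draws)
log_w = np.concatenate(all_log_w)

# Normalize the weights on the log scale, then resample R draws.
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()
resampled = rng.choice(draws, size=R, replace=True, p=weights)
```

Neither path's approximation is centered on the target, but the reweighted pool is, which is the sense in which combining paths can improve on any single path.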

## Using Pathfinder in PyMC

Let's see how to use Pathfinder for practical Bayesian modeling in PyMC. We'll need to import the necessary libraries:

In [None]:
import pandas as pd
import pymc as pm
import pymc_extras as pmx
import pytensor.tensor as pt
import arviz as az
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
az.style.use("arviz-darkgrid")

## Example 1: The Eight Schools Problem

The Eight Schools problem is a classic example in Bayesian statistics, originally analyzed by Rubin (1981). It involves estimating the effects of coaching programs on SAT scores across eight schools.

This is a good starting example because:
1. It's small enough to understand easily
2. It has a hierarchical structure that creates some posterior complexity
3. It's commonly used as a benchmark in Bayesian methods

In [None]:
J = 8  # number of schools
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])  # observed effects
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])  # standard errors

Now let's build a hierarchical model for this problem:

In [None]:
with pm.Model(coords={"school": np.arange(J)}) as model:
    baseline = pm.Normal("baseline", mu=0.0, sigma=10.0)
    school_sd = pm.HalfCauchy("school_sd", 5.0)

    school_offset = pm.Normal("school_offset", mu=0, sigma=1, dims="school")
    _ = pm.Normal(
        "obs",
        mu=baseline + school_sd * school_offset,
        sigma=sigma,
        observed=y,
        dims="school",
    )

Now let's fit this model using different methods for comparison:

In [None]:
with model:
    idata_nuts = pm.sample(random_seed=SEED)
    idata_advi = pm.fit(n=30_000).sample(1000, random_seed=SEED)
    idata_pf = pmx.fit(method="pathfinder", num_paths=1, random_seed=SEED)

Let's create a helper function to visualize and compare our results:

In [None]:
def compare_methods(results, var_names=None):
    """Plot posterior distributions from different inference methods"""
    az.plot_forest(
        list(results.values()),
        model_names=list(results.keys()),
        var_names=var_names,
        combined=True,
        figsize=(10, 10),
        kind="ridgeplot",
        ridgeplot_alpha=0.5,
    )
    plt.tight_layout()

In [None]:
var_names = [RV.name for RV in model.free_RVs]
res = {"NUTS": idata_nuts, "ADVI": idata_advi, "Pathfinder": idata_pf}

compare_methods(res, var_names=var_names)

### Improving Pathfinder Results

Pathfinder has several tuning parameters we can adjust to improve its performance:

1. **jitter**: Controls how far the starting points are from the initial position
2. **num_paths**: Number of independent Pathfinder runs for Multi-path Pathfinder
3. **maxcor**: History size for L-BFGS optimization

Let's try using Multi-path Pathfinder with more exploratory settings:

In [None]:
with model:
    idata_jitter_pf = pmx.fit(
        method="pathfinder",
        jitter=10.0,
        random_seed=SEED,
    )

    idata_jitter_paths_pf = pmx.fit(
        method="pathfinder",
        num_paths=50,
        jitter=10.0,
        random_seed=SEED,
    )

In [None]:
res["Pathfinder, jitter=10"] = idata_jitter_pf
res["Pathfinder, jitter=10, num_paths=50"] = idata_jitter_paths_pf
compare_methods(res, var_names=var_names)

Another approach for improving model fit is to reparameterize the model. It turns out there is actually room for improvement:

1. It is over-parametrized: we have 9 parameters to estimate schools' behavior (one `baseline` + eight `school_offset` values), when there are only 8 schools. We should either constrain one school offset to 0 (aka reference encoding) or constrain the sum of school offsets to 0 (which is enforced by `ZeroSumNormal`).

2. The HalfCauchy is a very wide prior on the standard deviation of the population of schools, basically encoding that it's possible that schools are so different from each other that knowing about one doesn't tell us much about the others. This is rarely the case, especially in the social sciences, so using a prior that places less density on very high standard deviations (while still avoiding the problematic value of 0) makes more sense a priori. The following plot shows such a prior:

In [None]:
import preliz as pz

fig, axes = plt.subplots(1, 2, constrained_layout=True, figsize=(12, 4))

pz.HalfCauchy(beta=5).plot_pdf(ax=axes[0], legend="title", pointinterval=True)
axes[0].set(xlabel="Value", ylabel="Density")

pz.Gamma(alpha=2, beta=2).plot_pdf(ax=axes[1], legend="title", pointinterval=True)
axes[1].set(xlabel="Value", ylabel="Density");

Here's the revised model:

In [None]:
with pm.Model(coords={"school": np.arange(J)}) as model:
    baseline = pm.Normal("baseline", mu=0.0, sigma=10.0)
    school_sd = pm.Gamma("school_sd", 2, 2)
    school_offset = pm.ZeroSumNormal("school_offset", dims="school")

    _ = pm.Normal(
        "obs",
        mu=baseline + school_sd * school_offset,
        sigma=sigma,
        observed=y,
        dims="school",
    )

    idata_nuts = pm.sample(nuts_sampler="nutpie", random_seed=SEED)
    idata_advi = pm.fit(n=30_000).sample(1000, random_seed=SEED)
    idata_pf = pmx.fit(method="pathfinder", num_paths=1, random_seed=SEED)

A few notes about this model:

1. We've used `ZeroSumNormal` for the school offsets, which constrains them to sum to zero. This is a better parametrization than the traditional one, which is over-parametrized and can lead to sampling difficulties.

2. We're using a `Gamma(2, 2)` prior for `school_sd` instead of a `HalfCauchy(5)`. The Gamma has less density at very high values, reflecting that schools typically aren't extremely different from each other.

In [None]:
var_names = [RV.name for RV in model.free_RVs]
res = {"NUTS": idata_nuts, "ADVI": idata_advi, "Pathfinder": idata_pf}

compare_methods(res, var_names=var_names)

## Example 2: Modeling Rugby Match Scores

Now let's try Pathfinder on a more complex model. We'll look at a Poisson regression model for rugby match scores, which helps us estimate team attack and defense strengths.

The league is made up of a total of $T = 6$ teams, playing each other once in a season. We denote the number of points scored by the home and the away team in the $g$-th game of the season (15 games) by $y_{g1}$ and $y_{g2}$, respectively.

In [None]:
try:
    df_all = pd.read_csv("../data/rugby.csv", index_col=0)
except Exception:
    df_all = pd.read_csv(pm.get_data("rugby.csv"), index_col=0)

df_all.head()

In [None]:
home_idx, teams = pd.factorize(df_all["home_team"], sort=True)
away_idx, _ = pd.factorize(df_all["away_team"], sort=True)
coords = {"match": df_all.index, "team": teams}

The vector of observed counts $\mathbf{y} = (y_{g1}, y_{g2})$ is modelled as independent Poisson:

$$y_{gj} | \theta_{gj} \sim \text{Poisson}(\theta_{gj})$$

where the $\theta$ parameters represent the scoring intensity in the $g$-th game for the team playing at home ($j=1$) and away ($j=2$), respectively.

We model these parameters according to a formulation that has been used widely in the statistical literature, assuming a log-linear random effect model:
$$\log \theta_{g1} = \text{home} + \text{att}_{h(g)} + \text{def}_{a(g)} $$
$$\log \theta_{g2} = \text{att}_{a(g)} + \text{def}_{h(g)}$$


* The parameter home represents the advantage for the team hosting the game; we assume that this effect is constant for all teams and throughout the season.
* The scoring intensity is determined jointly by the attack and defense abilities of the two teams involved, represented by the parameters att and def, respectively.
* For each $t = 1, \ldots, T$, the team-specific effects are modelled as exchangeable draws from common distributions: $att_{t} \sim \text{Normal}(\mu_{att},\tau_{att})$ and $def_{t} \sim \text{Normal}(\mu_{def},\tau_{def})$.
* We did some munging of the data above (factorizing team names into integer indices) to make it **tidier** for our model.
* Using a log link for the home and away scoring intensities is a standard trick in the sports analytics literature.
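To make the log-linear structure concrete, the sketch below evaluates the two scoring intensities for three games among three teams. All parameter values and index arrays are made up for illustration, and the sketch follows the equations as written, without the intercept term that the PyMC model adds.

```python
import numpy as np

# Hypothetical parameter values for three teams (each effect sums to zero).
home = 0.2
att = np.array([0.3, -0.1, -0.2])
defs = np.array([-0.2, 0.1, 0.1])

# Hypothetical schedule: team indices for the home and away side of each game.
home_idx = np.array([0, 1, 2])
away_idx = np.array([1, 2, 0])

# log(theta_g1) = home + att[h(g)] + def[a(g)]
log_theta_home = home + att[home_idx] + defs[away_idx]
# log(theta_g2) = att[a(g)] + def[h(g)]
log_theta_away = att[away_idx] + defs[home_idx]

# Exponentiating guarantees positive Poisson rates.
theta_home = np.exp(log_theta_home)
theta_away = np.exp(log_theta_away)
```

Integer-array indexing is what maps team-level parameters to game-level rates, which is why the team names were converted to indices beforehand.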

In [None]:
with pm.Model(coords=coords) as rugby_model:

    home_team = pm.Data("home_team", home_idx, dims="match")
    away_team = pm.Data("away_team", away_idx, dims="match")

    home = pm.Normal("home", mu=0, sigma=1)        # home advantage
    sd_att = pm.HalfNormal("sd_att", sigma=2)      # variability in attack
    sd_def = pm.HalfNormal("sd_def", sigma=2)      # variability in defense
    intercept = pm.Normal("intercept", mu=3, sigma=1)  # baseline scoring rate

    atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, dims="team")
    defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, dims="team")

    atts = pm.Deterministic("atts", atts_star - pt.mean(atts_star), dims="team")
    defs = pm.Deterministic("defs", defs_star - pt.mean(defs_star), dims="team")
    
    home_theta = pt.exp(intercept + home + atts[home_idx] + defs[away_idx])
    away_theta = pt.exp(intercept + atts[away_idx] + defs[home_idx])

    home_points = pm.Poisson(
        "home_points",
        mu=home_theta,
        observed=df_all["home_score"],
        dims="match",
    )
    away_points = pm.Poisson(
        "away_points",
        mu=away_theta,
        observed=df_all["away_score"],
        dims="match",
    )

This model has more parameters than the Eight Schools example, making it a better test for scalability. Let's fit it using different methods:

In [None]:
with rugby_model:
    idata_nuts = pm.sample(random_seed=SEED)
    idata_advi = pm.fit(n=30_000).sample(1000, random_seed=SEED)
    idata_pf = pmx.fit(method="pathfinder", random_seed=SEED)

In [None]:
var_names = [RV.name for RV in rugby_model.free_RVs]
res = {"NUTS": idata_nuts, "ADVI": idata_advi, "Pathfinder": idata_pf}

compare_methods(res, var_names=var_names)

In [None]:
with rugby_model:
    idata_jitter_paths_pf = pmx.fit(
        method="pathfinder", jitter=20.0, num_paths=50, random_seed=SEED
    )

res["pf-jitter-paths"] = idata_jitter_paths_pf

In [None]:
compare_methods(res, var_names=var_names)   

## Example 3: The MLB "Sticky Stuff" Incident

Since the earliest days of the sport, baseball pitchers have applied foreign substances to the ball to help them throw better pitches. While this has always been against the rules, the rule was rarely enforced. In the early days, pine tar was used to allow for a firmer grip on the ball, which in turn allows the ball to be spun at a higher rate. Eventually, this evolved to a sticky blend of rosin (powder derived from pine tree sap) and sunscreen. The resulting high spin rates resulted in fewer hits and more strikeouts, and finally led MLB to a mid-season crackdown in 2021, handing out 10-game suspensions to any pitcher caught using "sticky stuff":

> Any pitcher who possesses or applies foreign substances will be subject to immediate ejection from the game and suspended automatically in accordance with the rules. If a player other than the pitcher is found to have applied a foreign substance to the ball, both the position player and pitcher will be ejected.

![](images/sticky_stuff_scherzer.jpg)

With the advent of remote sensing data, it is possible to track the spin rates (and other metrics) of pitched balls. This data is freely available from the [MLB Advanced Media website](https://baseballsavant.mlb.com/). 

Can we formulate a model to detect any changes in spin rate that might be attributable to stepped-up enforcement of sticky stuff?

> ... word of its arrival trickled out around June 3, as MLB made it known it planned to increase scrutiny amid record-high strikeout rates. (Washington Post)

The dataset below contains per-game average fastball spin rates for pitchers during the 2021 season.

In [79]:
try:
    spin_rates = pd.read_csv(
        "../data/fastball_spin_rates.csv", index_col=0, parse_dates=["game_date"]
    )
except FileNotFoundError:
    spin_rates = pd.read_csv(
        pm.get_data("fastball_spin_rates.csv"), index_col=0, parse_dates=["game_date"]
    )
spin_rates.head()

| pitcher_name | game_date | avg_spin_rate | n_pitches |
|---|---|---|---|
| Wainwright, Adam | 2021-04-03 | 2127.415 | 12 |
| Wainwright, Adam | 2021-04-08 | 2179.723 | 11 |
| Wainwright, Adam | 2021-04-14 | 2297.968571 | 7 |
| Wainwright, Adam | 2021-04-20 | 2159.15 | 13 |
| Wainwright, Adam | 2021-04-26 | 2314.515455 | 11 |


In [None]:
kopech_fb_spin = spin_rates.assign(day_of_year=spin_rates.game_date.dt.day_of_year).loc['Kopech, Michael'].copy()
kopech_fb_spin.plot.scatter(x='game_date', y='avg_spin_rate', title='Kopech Fastball Spin Rate', ylabel='Spin Rate (rpm)', figsize=(10,4));

In [None]:
n_outputs = 5  
top_pitchers = spin_rates.groupby("pitcher_name").size().nlargest(n_outputs).reset_index()
top_pitchers = top_pitchers.reset_index().rename(columns={"index": "output_idx", 0: "games"})

# Plot average spin rates of top pitchers
fig, ax = plt.subplots(1, 1, figsize=(14, 6))
legends = []
for pitcher in top_pitchers["pitcher_name"]:
    pitcher_data = spin_rates.assign(day_of_year=spin_rates.game_date.dt.day_of_year).loc[pitcher]
    ax.scatter(pitcher_data["day_of_year"], pitcher_data["avg_spin_rate"])
    legends.append(pitcher)
plt.xlabel("Day of year")
plt.ylabel("Average spin rate (rpm)")
plt.legend(legends, loc="upper center");

In [None]:
analysis_subset = spin_rates.assign(day_of_year=spin_rates.game_date.dt.day_of_year).reset_index().merge(top_pitchers, on='pitcher_name', how='right')
X = analysis_subset[['day_of_year', 'output_idx']].values
y = analysis_subset['avg_spin_rate'].values

To model this set of time series, we will use a Gaussian Process (GP) regression model. GPs are a powerful tool for modeling complex, non-linear relationships in data, and they can provide uncertainty estimates for predictions.

### What is a Gaussian Process?

A Gaussian Process (GP) is a powerful and flexible statistical tool that extends the concept of multivariate normal distributions to infinite dimensions. At its core, a Gaussian Process defines a probability distribution over functions, rather than just over finite-dimensional vectors. Think of it as a way to specify a "random function" where any finite collection of points from this function follows a multivariate normal distribution.

Formally, a Gaussian Process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. We write:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$

Where $m(x)$ is the mean function that defines the expected value of $f$ at position $x$:

$$m(x) = \mathbb{E}[f(x)]$$

And $k(x, x')$ is the kernel or covariance function that specifies the covariance between any two points:

$$k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$$

#### Why Are Gaussian Processes Useful?

Gaussian Processes are particularly valuable for regression and modeling problems because they provide:

- Function approximation with uncertainty estimates
- Smooth interpolation between observed data points
- Principled uncertainty quantification that grows in regions far from observations
- Flexibility through different kernel choices that encode prior beliefs about function behavior

#### How Do Gaussian Processes Work in Practice?

When working with GPs for regression, we typically have observed data points $(x_i, y_i)$ where $y_i = f(x_i) + \epsilon_i$ and $\epsilon_i$ is noise. The power of GPs lies in predicting function values $f(x_*)$ at new points $x_*$.

Given a set of training points $X$ with observations $y$, and test points $X_*$, the joint distribution of observed values and predictions is:

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K(X,X) + \sigma^2_n I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right)$$

After conditioning on the observed data, the predictive distribution becomes:

$$f_* | X, y, X_* \sim \mathcal{N}(\bar{f}_*, \text{cov}(f_*))$$

Where:

$$\bar{f}_* = m(X_*) + K(X_*, X)[K(X,X) + \sigma^2_n I]^{-1}(y - m(X))$$

$$\text{cov}(f_*) = K(X_*,X_*) - K(X_*,X)[K(X,X) + \sigma^2_n I]^{-1}K(X,X_*)$$
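The conditioning formulas can be checked numerically with a few lines of NumPy. The sketch below assumes a zero mean function, a squared exponential kernel with unit variance and length-scale, and a small made-up dataset; the helper name `rbf` and all values are hypothetical.

```python
import numpy as np


def rbf(xa, xb, sigma=1.0, ell=1.0):
    """Squared exponential kernel matrix between two 1-D input arrays."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * d2 / ell**2)


# Made-up training data and test locations.
X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.sin(X)
X_star = np.linspace(0.0, 4.0, 9)
noise_var = 1e-4  # observation noise variance

K = rbf(X, X) + noise_var * np.eye(len(X))  # K(X, X) + noise on the diagonal
K_s = rbf(X_star, X)                        # K(X_*, X)
K_ss = rbf(X_star, X_star)                  # K(X_*, X_*)

# Predictive mean and covariance with a zero mean function.
mean_star = K_s @ np.linalg.solve(K, y)
cov_star = K_ss - K_s @ np.linalg.solve(K, K_s.T)
var_star = np.diag(cov_star)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual choice for numerical stability. With small noise, the predictive mean passes close to the observations and the predictive variance shrinks near them.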

#### Choosing a Kernel
The kernel function defines the shape and smoothness of functions drawn from the GP. Some common kernels include:

- Squared Exponential (RBF): $k(x, x') = \sigma^2 \exp\left(-\frac{||x-x'||^2}{2\ell^2}\right)$ for very smooth functions
- Matérn: For functions with different degrees of smoothness
- Periodic: $k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right)$ for repeating patterns

The kernel parameters (like length-scale $\ell$ or output variance $\sigma^2$) control characteristics of the resulting functions and are typically learned from the data.
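The effect of these parameters can be seen by evaluating the two kernel formulas directly. The sketch below is a plain NumPy transcription of the squared exponential and periodic kernels for scalar inputs; the function names and the parameter values are chosen only for illustration.

```python
import numpy as np


def squared_exponential(x, x_prime, sigma=1.0, ell=1.0):
    """Squared exponential (RBF) kernel for scalar inputs."""
    return sigma**2 * np.exp(-((x - x_prime) ** 2) / (2 * ell**2))


def periodic(x, x_prime, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel with period p for scalar inputs."""
    return sigma**2 * np.exp(-2 * np.sin(np.pi * np.abs(x - x_prime) / p) ** 2 / ell**2)


# A longer length-scale keeps points at the same distance more correlated.
k_short = squared_exponential(0.0, 1.0, ell=0.5)
k_long = squared_exponential(0.0, 1.0, ell=2.0)

# Points exactly one period apart are perfectly correlated (covariance sigma**2),
# while points half a period apart are the least correlated.
k_one_period = periodic(0.0, 7.0, sigma=2.0, p=7.0)
k_half_period = periodic(0.0, 3.5, sigma=2.0, p=7.0)
```

In a fitted model these parameters are not fixed by hand as they are here; they receive priors and are inferred along with everything else.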

In [None]:
with pm.Model() as multi_spin_rate_model:
    # Priors
    ell = pm.Gamma("ell", alpha=2, beta=0.5)
    eta = pm.Gamma("eta", alpha=3, beta=1)
    K = eta**2 * pm.gp.cov.ExpQuad(input_dim=2, ls=ell, active_dims=[0])

    # Get the ICM kernel
    W = pm.Normal("W", mu=0, sigma=3, shape=(n_outputs, 2), initval=np.random.randn(n_outputs, 2))
    kappa = pm.Gamma("kappa", alpha=1.5, beta=1, shape=n_outputs)
    B = pm.Deterministic("B", pt.dot(W, W.T) + pt.diag(kappa))
    coreg = pm.gp.cov.Coregion(input_dim=2, B=B, active_dims=[1])
    cov_icm = K * coreg  # Use Hadamard Product for separate inputs

    # Define a Multi-output GP
    gp = pm.gp.Marginal(cov_func=cov_icm)
    sigma = pm.HalfNormal("sigma", sigma=3)
    f = gp.marginal_likelihood("f", X, y, sigma=sigma)

In [None]:
pm.model_to_graphviz(multi_spin_rate_model)

Let's try fitting this model using MCMC:

In [None]:
with multi_spin_rate_model:
    multi_trace = pm.sample()

Notice that this model will take a while to run, as Gaussian processes are computationally expensive. The time complexity of GP regression is $O(n^3)$, where $n$ is the number of data points. This is due to the need to invert the covariance matrix, which scales cubically with the number of points.

Let's try single-path Pathfinder:

In [None]:
with multi_spin_rate_model:
    multi_trace = pmx.fit(method="pathfinder", num_paths=1, random_seed=SEED)

To make predictions with a GP, we need to compute the posterior distribution of the function values at new points given the observed data. This involves conditioning the joint Gaussian distribution on the observed data.

In [None]:
days_pred = np.arange(analysis_subset.day_of_year.min(), analysis_subset.day_of_year.max())
pitcher_ind = np.repeat(np.arange(n_outputs), len(days_pred))
X_new = np.column_stack((np.tile(days_pred, n_outputs), pitcher_ind))

In [None]:
with multi_spin_rate_model:
    preds = gp.conditional("preds", X_new)
    gp_samples = pm.sample_posterior_predictive(multi_trace, var_names=["preds"], random_seed=SEED)

In [None]:
from pymc.gp.util import plot_gp_dist

f_pred = gp_samples.posterior_predictive["preds"].squeeze()

fig, axes = plt.subplots(n_outputs, 1, figsize=(12, 15), sharey=True)

M = len(days_pred)

for idx, pitcher in enumerate(top_pitchers["pitcher_name"]):
    # Prediction
    ax = axes[idx]
    plot_gp_dist(
        ax,
        f_pred[:, M * idx : M * (idx + 1)],
        X_new[M * idx : M * (idx + 1), 0],
        palette="Blues",
        fill_alpha=0.1,
        samples_alpha=0.1,
    )
    # Training data points
    cond = analysis_subset["pitcher_name"] == pitcher
    ax.scatter(analysis_subset.loc[cond, "day_of_year"], analysis_subset.loc[cond, "avg_spin_rate"], color="r")
    ax.set_title(pitcher)

fig.supxlabel("Day of year")
fig.supylabel("Average spin rate (rpm)")
plt.tight_layout()

## When to Use Pathfinder

Based on our examples, here are some guidelines for when Pathfinder might be the right choice:

### Advantages of Pathfinder:

1. **Speed**: Pathfinder is typically much faster than MCMC methods, especially for large models.
2. **Scalability**: Works well with high-dimensional problems where MCMC might struggle.
3. **Initialization**: Can be used to initialize MCMC for faster convergence.
4. **Reasonable approximations**: For many models, the approximation quality is good enough for practical use.

### When to be cautious:

1. **Complex posteriors**: Multi-modal or highly skewed distributions may not be captured well by a small number of normal approximations.
2. **High precision requirements**: When you need the most accurate posterior possible and compute time isn't a concern, MCMC might still be preferable.
3. **Model diagnostics**: For new or complex models, it's a good idea to compare Pathfinder results with MCMC to ensure good approximation.

## Tuning Pathfinder

If you're using Pathfinder and want to improve its performance, here are some parameters to consider tuning:

1. **num_paths**: Increasing this runs multiple independent paths and can help capture more complex posteriors. Try values from 4 (default) to 50 or more for complex problems.

2. **jitter**: Controls how far initial points are from each other. Higher values explore more of the parameter space. Default is 2.0, but try 5.0-20.0 for complex models.

3. **maxcor**: The history size for L-BFGS optimization. Larger values can help with complex curvature. Default is min(model dimension, 10), but you can try higher values for complex models.

4. **num_draws_per_path**: Number of samples drawn from each approximation. Default is 1000, but you might increase for more stable results.

5. **importance_sampling**: Method used for combining results across paths. Options are "psis" (default) or "psir". PSIS generally works better in practice.

## Conclusion

Pathfinder offers a compelling approach to variational inference that can significantly speed up Bayesian modeling workflows. It works by finding an optimization path through parameter space, creating normal approximations along the way, and selecting the best one using the ELBO.

Key takeaways:

1. Pathfinder is generally much faster than MCMC methods like NUTS.
2. The quality of approximation is often good enough for many practical applications.
3. Multi-path Pathfinder with tuned settings can substantially improve results.
4. For critical applications, it's good practice to verify Pathfinder results with MCMC.

This makes Pathfinder a valuable addition to your Bayesian modeling toolkit, especially when you need quick results for exploratory analysis or with large models where MCMC might be prohibitively slow.

---
## References

1. Zhang, Lu, et al. "Pathfinder: Parallel quasi-Newton variational inference." Journal of Machine Learning Research 23.306 (2022): 1-49. arXiv:2108.03782.
2. Rubin, Donald B. "Estimation in parallel randomized experiments." Journal of Educational Statistics 6.4 (1981): 377-401.

In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w