# PyMC Best Practices

PyMC is a powerful library for probabilistic programming in Python, allowing you to define and fit Bayesian statistical models. This document outlines some best practices to help you write efficient, readable, and reliable PyMC code.



## Using Context Managers for Model Definition

Defining a model inside a context manager (<span style="background-color: gray;">with pm.Model() as model:</span>) ensures that every variable and distribution created in the block is registered with that model. This avoids confusion when working with multiple models and keeps your code cleaner.

In [None]:
import pymc as pm

with pm.Model() as model:
    # Define priors, likelihood, etc.
    pass  # Your model code here

## Using Data Containers (<span style="background-color: gray;">pm.Data</span>)

Data containers, created with <span style="background-color: gray;">pm.Data</span>, register your data as named variables in the model, so you can swap in new values later without redefining the model. This is particularly useful when performing posterior predictive checks or updating a model with new data.

In [None]:
import pymc as pm
import numpy as np

# Generate synthetic data
X = np.linspace(0, 1, 100)
Y = 2 * X + np.random.normal(0, 0.1, size=100)

with pm.Model() as model:
    # Define data containers
    x_shared = pm.Data("x_shared", X)
    y_shared = pm.Data("y_shared", Y)
    
    # Define priors
    slope = pm.Normal("slope", mu=0, sigma=1)
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    sigma = pm.HalfNormal("sigma", sigma=1)
    
    # Define likelihood
    mu = intercept + slope * x_shared
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_shared)
    
    # Sample from the posterior
    trace = pm.sample(1000)


## Utilizing the <span style="background-color: gray;">shape</span> Parameter

The <span style="background-color: gray;">shape</span> parameter allows you to define arrays of random variables efficiently, avoiding repetitive code. This is particularly helpful when dealing with multivariate models or models with multiple parameters of the same kind.

In [None]:
import pymc as pm

# Assume we have 3 predictors
with pm.Model() as model:
    coefficients = pm.Normal("coefficients", mu=0, sigma=1, shape=3)


## Using PyMC's Built-in Dot Product (<span style="background-color: gray;">pm.math.dot</span>)

For linear algebra operations, <span style="background-color: gray;">pm.math.dot</span> provides a clean and readable way to perform dot products within your model. This is especially useful when utilizing the shape parameter for vectorized operations.

In [None]:
import pymc as pm
import numpy as np

# Synthetic data
X = np.random.randn(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + np.random.randn(100)

with pm.Model() as model:
    # Data containers
    X_shared = pm.Data("X_shared", X)
    y_shared = pm.Data("y_shared", y)
    
    # Priors
    coefficients = pm.Normal("coefficients", mu=0, sigma=1, shape=3)
    sigma = pm.HalfNormal("sigma", sigma=1)
    
    # Linear algebra with pm.math.dot
    mu = pm.math.dot(X_shared, coefficients)
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_shared)
    
    # Sampling
    trace = pm.sample(1000)


## Using PyMC's Observe Functionality (<span style="background-color: gray;">pm.observe</span>)

The <span style="background-color: gray;">pm.observe</span> function returns a copy of a model in which the variables you name are conditioned on observed values, leaving the original model unchanged. This makes it easier to test multiple models or datasets without redefining the model each time.

In [None]:
import pymc as pm

# Model 1: very diffuse prior on x
with pm.Model() as m1:
    x1 = pm.Normal("x", mu=0, sigma=10000)
    y1 = pm.Normal.dist(x1, shape=(5,))
    y_censored_1 = pm.Deterministic("y_censored", pm.math.clip(y1, -1, 1))

# Model 2: tighter, informative prior on x
with pm.Model() as m2:
    x2 = pm.Normal("x", mu=10, sigma=2)
    y2 = pm.Normal.dist(x2, shape=(5,))
    y_censored_2 = pm.Deterministic("y_censored", pm.math.clip(y2, -1, 1))

# Each key passed to pm.observe must be a variable that belongs to the model
# being observed, so the two models keep separate Python handles.
data = [0.9, 0.5, 0.3, 1, 1]

# Test 1
new_m1 = pm.observe(m1, {y_censored_1: data})

# Test 2
new_m2 = pm.observe(m2, {y_censored_2: data})


## Vectorization for Efficiency

This is general advice for this class and future classes. 

Using vectorized operations can significantly speed up your model, especially with large datasets. Vectorization reduces the overhead of Python loops and leverages optimized numerical libraries.

NumPy is the general-purpose library for vectorized array operations. JAX offers a NumPy-like API with automatic differentiation (it grew out of the Autograd project) and compiles array code to run on GPUs and TPUs, which can significantly reduce compute time.

In [None]:
import pymc as pm
import numpy as np

# Large dataset
X = np.random.randn(10000, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + np.random.randn(10000)

with pm.Model() as model:
    coefficients = pm.Normal("coefficients", mu=0, sigma=1, shape=3)
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = pm.math.dot(X, coefficients)
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample()
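
To make the contrast with Python loops concrete, the sketch below computes the same linear predictor twice in plain NumPy: once element by element and once with a single matrix-vector product. The array sizes and names are illustrative; only the vectorized form is something you would want inside a model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
beta = np.array([1.5, -2.0, 0.5])

# Loop version: one Python-level multiply-add per element.
mu_loop = np.empty(X.shape[0])
for i in range(X.shape[0]):
    total = 0.0
    for j in range(X.shape[1]):
        total += X[i, j] * beta[j]
    mu_loop[i] = total

# Vectorized version: a single call into optimized numerical code.
mu_vec = X @ beta

# Both approaches give the same numbers.
same = np.allclose(mu_loop, mu_vec)
print(same)  # True
```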
