# Lecture 7: Bayesian Inference

**Data 145, Spring 2026: Evidence and Uncertainty**  
**Instructors:** Ani Adhikari, William Fithian

---

**Please run the setup cell below before reading.**

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

plt.style.use('fivethirtyeight')
%matplotlib inline

# Color scheme for Bayesian lectures
# Black = data (likelihood)
# Blue = beliefs (prior and posterior; prior dashed, posterior solid)
# Red = asymptotic approximations (BvM normal approx, dashed)
COLOR_LIKELIHOOD = 'black'       # Data / likelihood
COLOR_PRIOR = 'steelblue'        # Prior belief (dashed)
COLOR_POSTERIOR = 'steelblue'    # Posterior belief (solid)
COLOR_APPROX = 'firebrick'      # Asymptotic approximations (dashed)
COLOR_TRUE = '#000000'           # True parameter value

# Shades of blue for comparing multiple posteriors (prior washout plot)
MULTI_BLUES = ['#1b4f72', '#2e86c1', '#5dade2', '#85c1e9']

## Introduction: From Decision Theory to Bayesian Inference

In Lecture 6, we studied the **decision theory** framework for evaluating estimators:

- A **loss function** $L(\theta, a)$ measures estimation error
- The **risk function** $R(\theta; T) = E_\theta[L(\theta, T(X))]$ averages loss over the sampling distribution
- No single estimator minimizes risk for all $\theta$

A key result was:

> The estimator minimizing **average-case risk** $\int_0^1 \text{MSE}_p(T) \, dp$ for the binomial model turned out to be the **posterior mean** $E[p \mid X] = (X+1)/(n+2)$ — Laplace's estimator!

We arrived at Bayesian statistics through a purely frequentist argument: "which estimator has the smallest average MSE?"

### From unweighted to weighted averages

But why did we take an *unweighted* average over $p \in [0,1]$? Maybe we care more about performance when $p$ is near 0.5 than when it's near 0 or 1 — after all, values near 0.5 come up more often in practice. So we might prefer a *weighted* average: $\int_0^1 \text{MSE}_p(T) \cdot \pi(p)\,dp$ for some weight function $\pi(p)$.

Even more fundamentally: for a parameter like an exponential rate $\lambda > 0$ — which could be any positive number — there's no such thing as an unweighted average over $(0, \infty)$. We *need* a weight function. This weight function $\pi$ is exactly the **prior distribution**.
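The weighted-average idea can be checked numerically. The sketch below compares the MLE $X/n$ with Laplace's estimator $(X+1)/(n+2)$ using their closed-form binomial MSEs, first with an unweighted average and then with a weight concentrated near $p = 0.5$. The function names (`mse_mle`, `mse_laplace`, `average_risk`), the choice $n = 16$, and the Beta(5,5)-shaped weight are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy import integrate, stats

def mse_mle(p, n):
    """MSE of X/n under Binomial(n, p): unbiased, so MSE equals the variance."""
    return p * (1 - p) / n

def mse_laplace(p, n):
    """MSE of (X+1)/(n+2): variance plus squared bias."""
    variance = n * p * (1 - p) / (n + 2) ** 2
    bias = (1 - 2 * p) / (n + 2)
    return variance + bias ** 2

def average_risk(mse, n, weight=lambda p: 1.0):
    """Integrate MSE_p(T) * weight(p) over p in [0, 1]."""
    value, _ = integrate.quad(lambda p: mse(p, n) * weight(p), 0, 1)
    return value

n = 16  # illustrative sample size

# Unweighted average over p in [0, 1]
uniform_mle = average_risk(mse_mle, n)
uniform_laplace = average_risk(mse_laplace, n)

# Weighted average with more weight near p = 0.5 (arbitrary illustrative weight)
weight = lambda p: stats.beta.pdf(p, 5, 5)
weighted_mle = average_risk(mse_mle, n, weight)
weighted_laplace = average_risk(mse_laplace, n, weight)

print(f"Unweighted average MSE: MLE {uniform_mle:.5f}, Laplace {uniform_laplace:.5f}")
print(f"Weighted average MSE:   MLE {weighted_mle:.5f}, Laplace {weighted_laplace:.5f}")
```

Swapping in a different `weight` function changes which estimator looks best, which is exactly the role the prior plays in the rest of the lecture.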

### Today's roadmap

Today, we develop the **Bayesian perspective** more fully:

1. The general framework: prior, likelihood, and posterior
2. Concrete examples: Beta-Binomial, Gamma-Exponential, and Normal-Normal
3. Conjugate priors and their interpretations
4. Why the likelihood is all that matters
5. Large-sample behavior: the prior washes out

### Why go Bayesian?

The Bayesian framework gives us the full **posterior distribution**, not just a point estimate. With the posterior in hand, we can:

- Report a **credible interval**: an interval $[a, b]$ such that $P(\theta \in [a, b] \mid X) = 0.95$. This is a direct probability statement about $\theta$ — "given the data and our prior, there is a 95% probability that $\theta$ lies in this interval" — unlike a confidence interval, which is a statement about the procedure.
- Make probability statements about $\theta$ (e.g., $P(p > 0.5 \mid X)$)
- Make predictions about future data
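As a small illustration of these three uses, the sketch below takes the Lecture 6 uniform-prior result (posterior $\text{Beta}(X+1, n-X+1)$) with the illustrative values $n = 16$, $X = 12$. The equal-tailed interval is one common choice of credible interval; other choices exist.

```python
from scipy import stats

n, x = 16, 12                               # illustrative data
posterior = stats.beta(x + 1, n - x + 1)    # Beta(13, 5) under a uniform prior

# Equal-tailed 95% credible interval: cut 2.5% of posterior mass from each tail
ci_lo, ci_hi = posterior.ppf([0.025, 0.975])

# Direct probability statement about the parameter
prob_gt_half = posterior.sf(0.5)            # P(p > 0.5 | X)

# Prediction: probability that the next trial is a success is the posterior mean
prob_next_success = posterior.mean()

print(f"95% credible interval: [{ci_lo:.3f}, {ci_hi:.3f}]")
print(f"P(p > 0.5 | X) = {prob_gt_half:.4f}")
print(f"P(next trial is a success | X) = {prob_next_success:.4f}")
```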

### A word of caution

Going Bayesian means layering on additional assumptions. The prior $\pi(\theta)$ is typically difficult to check — unlike a likelihood model (which we can assess with goodness-of-fit tests, residual plots, etc.), we usually only observe $\theta$ indirectly through the data, so there's no direct way to verify whether our prior is reasonable. When the sample size isn't very large, the choice of prior can seriously impact our inferences. We'll come back to these questions — where do priors come from, and what does it even mean for $\theta$ to "have a distribution"? — in Lecture 8.

---

## 1. The Bayesian Framework

### The Setup

The ingredients:
- **Prior**: $\theta \sim \pi(\theta)$ — our beliefs about $\theta$ before seeing data
- **Likelihood**: $X \mid \theta \sim f_\theta(x)$ — the data-generating process, given $\theta$
- **Posterior**: $\theta \mid X \sim \pi(\theta \mid X)$ — our updated beliefs after seeing data

### Bayes' Rule

The posterior is computed via Bayes' rule:

$$\pi(\theta \mid x) = \frac{f_\theta(x) \cdot \pi(\theta)}{\int_{\Theta} f_u(x) \cdot \pi(u) \, du}$$

The denominator $m(x) = \int_{\Theta} f_u(x) \cdot \pi(u) \, du$ is the **marginal likelihood**. It doesn't depend on $\theta$, so:

$$\boxed{\text{Posterior} \propto \text{Likelihood} \times \text{Prior}: \qquad \pi(\theta \mid x) \propto_\theta f_\theta(x) \cdot \pi(\theta)}$$

This proportionality (in $\theta$) is the key trick: we can identify the posterior just by recognizing the functional form in $\theta$, without computing the normalizing constant.
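When no familiar functional form is available, the same proportionality can be used numerically: evaluate likelihood times prior on a grid and normalize. The sketch below does this for the binomial model with a uniform prior and illustrative data ($n = 16$, $x = 12$), then compares against the $\text{Beta}(x+1, n-x+1)$ posterior from Lecture 6. The grid size and variable names are arbitrary choices.

```python
import numpy as np
from scipy import stats

n, x = 16, 12                                   # illustrative data
p_grid = np.linspace(0.0005, 0.9995, 2000)
dp = p_grid[1] - p_grid[0]

prior = np.ones_like(p_grid)                    # Uniform(0, 1) density
likelihood = stats.binom.pmf(x, n, p_grid)      # f_p(x), viewed as a function of p
unnormalized = likelihood * prior

# Riemann-sum stand-in for the marginal likelihood m(x)
posterior_grid = unnormalized / (unnormalized.sum() * dp)
grid_mean = (p_grid * posterior_grid).sum() * dp

# Compare with the known closed form
exact = stats.beta.pdf(p_grid, x + 1, n - x + 1)
max_abs_diff = np.max(np.abs(posterior_grid - exact))

print(f"Grid posterior mean: {grid_mean:.4f}  (exact: {(x + 1) / (n + 2):.4f})")
print(f"Max |grid - exact| density difference: {max_abs_diff:.2e}")
```

The binomial coefficient inside `binom.pmf` never needs special handling: it is constant in $p$ and cancels in the normalization.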

### The Bayes Estimator

The **Bayes risk** of an estimator $T(X)$ with respect to prior $\pi$ is:
$$r_\pi(T) = \int_{\Theta} R(\theta; T) \, \pi(\theta) \, d\theta = E[L(\theta, T(X))]$$

Recall from Lecture 6: we wanted an estimator that was best for every $\theta$, but that's impossible. The Bayes approach averages over $\theta$ instead. And it turns out to have a very nice property: **we can optimize for every $X$ value separately.**

For squared error loss $L(\theta, a) = (\theta - a)^2$:

$$r_\pi(T) = E\left[E[(\theta - T(X))^2 \mid X]\right]$$

The inner expectation depends on $T$ only through $T(X)$ at the observed $X$, so we can minimize it **separately for each $X$**, and this automatically minimizes the overall Bayes risk.

By the bias-variance decomposition (with $\theta$ as the random quantity):
$$E[(\theta - T(X))^2 \mid X] = \text{Var}(\theta \mid X) + (E[\theta \mid X] - T(X))^2$$

The first term doesn't depend on $T$; the second is minimized when:

$$\boxed{T^*(X) = E[\theta \mid X] \qquad \text{(the posterior mean)}}$$

This is remarkable: we couldn't find an estimator best for every $\theta$, but we *can* find one that's best for every $X$.
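A Monte Carlo check of this result, as a rough sketch: draw from a posterior, estimate the posterior expected squared error for a range of candidate estimates $a$, and see where it is smallest. The posterior here is the $\text{Beta}(13, 5)$ from the Lecture 6 uniform-prior case with $n = 16$, $X = 12$; the seed, sample size, and candidate grid are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
post_a, post_b = 13, 5                        # Beta(13, 5) posterior (illustrative)
theta = rng.beta(post_a, post_b, size=200_000)

# Monte Carlo estimate of E[(theta - a)^2 | X] for each candidate a
candidates = np.linspace(0.5, 0.9, 401)
expected_loss = np.array([np.mean((theta - a) ** 2) for a in candidates])
best = candidates[np.argmin(expected_loss)]

post_mean = post_a / (post_a + post_b)
post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))

print(f"Minimizer of estimated loss: {best:.3f}")
print(f"Posterior mean:              {post_mean:.3f}")
print(f"Minimum loss {expected_loss.min():.5f} vs. posterior variance {post_var:.5f}")
```

The minimum loss should be close to $\text{Var}(\theta \mid X)$, the term in the decomposition that no choice of $T$ can remove.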

What if we use a different loss function instead of squared error? The Bayes estimator will be a different summary of the posterior — on the homework, you'll work out what it is for absolute error loss and other examples.

---

## 2. Beta-Binomial Conjugacy

### Recap: The Uniform Prior (Lecture 6)

With $p \sim \text{Uniform}(0,1) = \text{Beta}(1,1)$ and $X \mid p \sim \text{Binomial}(n, p)$:
- Posterior: $p \mid X \sim \text{Beta}(X+1, n-X+1)$
- Bayes estimator: $E[p \mid X] = (X+1)/(n+2)$

### General Beta Prior

Now suppose $p \sim \text{Beta}(\alpha, \beta)$ with density $\pi(p) \propto p^{\alpha-1}(1-p)^{\beta-1}$.

The posterior is:
$$\pi(p \mid x) \propto_p f_p(x) \cdot \pi(p) = \binom{n}{x} p^x (1-p)^{n-x} \cdot p^{\alpha-1}(1-p)^{\beta-1} \propto_p p^{x + \alpha - 1} (1-p)^{n - x + \beta - 1}$$

This is proportional (in $p$) to a $\text{Beta}(x + \alpha, n - x + \beta)$ density. Since two different densities can't be proportional to each other (both integrate to 1, so the proportionality constant must be 1), we conclude:

$$\boxed{p \mid X \sim \text{Beta}(X + \alpha, \; n - X + \beta)}$$

Notice what happened: we started with a Beta prior and ended up with a Beta posterior. When this occurs — when the posterior belongs to the same family as the prior — we say the prior is **conjugate** to the likelihood. We'll see more examples of this shortly.

### Posterior Mean as Weighted Average

Recall the **pseudodata interpretation** from Lecture 6: the $\text{Beta}(\alpha, \beta)$ prior is like imagining we already observed $\alpha - 1$ successes and $\beta - 1$ failures before collecting any data. The posterior adds the real data on top of this pseudodata.

The posterior mean is:
$$E[p \mid X] = \frac{X + \alpha}{n + \alpha + \beta} = \underbrace{\frac{n}{n + \alpha + \beta}}_{w} \cdot \underbrace{\frac{X}{n}}_{\hat{p}_{\text{MLE}}} + \underbrace{\frac{\alpha + \beta}{n + \alpha + \beta}}_{1 - w} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}}$$

The posterior mean is a **weighted average** of the MLE and the prior mean. The quantity $\alpha + \beta$ acts like a **prior sample size**.

- As $n \to \infty$: $w \to 1$, so the posterior mean $\to$ MLE (data overwhelms the prior)
- As $\alpha + \beta \to \infty$: $w \to 0$, so the posterior mean $\to$ prior mean (strong prior dominates)
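The weighted-average identity is easy to confirm with numbers. In the sketch below, `beta_binomial_posterior_mean` is a hypothetical helper name, and the inputs ($n = 16$, $X = 12$, Beta(5,5) prior) are illustrative.

```python
def beta_binomial_posterior_mean(x, n, alpha, beta):
    """Return the posterior mean of p and the weight w placed on the MLE x/n."""
    w = n / (n + alpha + beta)
    mle = x / n
    prior_mean = alpha / (alpha + beta)
    return w * mle + (1 - w) * prior_mean, w

post_mean, w = beta_binomial_posterior_mean(x=12, n=16, alpha=5, beta=5)
direct = (12 + 5) / (16 + 5 + 5)      # (X + alpha) / (n + alpha + beta)

print(f"Weighted average: {post_mean:.4f}, direct formula: {direct:.4f}, w = {w:.4f}")

# The weight on the MLE approaches 1 as n grows (same prior, same success fraction)
weights = []
for n in (16, 160, 1600):
    _, w_n = beta_binomial_posterior_mean(x=3 * n // 4, n=n, alpha=5, beta=5)
    weights.append(w_n)
    print(f"n = {n:5d}: w = {w_n:.4f}")
```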

### Visualizing Prior, Likelihood, and Posterior

In each plot below, we show the prior, the likelihood (rescaled so its peak matches the posterior, for visual comparison), and the posterior. Notice how the posterior always sits between the prior and likelihood, closer to whichever carries more information.

In [None]:
def plot_beta_binomial(ax, n, x, alpha, beta, title=None):
    """Plot prior, likelihood, and posterior for Beta-Binomial model."""
    p_grid = np.linspace(0.001, 0.999, 500)
    
    # Prior: Beta(alpha, beta)
    prior = stats.beta.pdf(p_grid, alpha, beta)
    
    # Likelihood: p^x (1-p)^(n-x), rescaled
    log_lik = x * np.log(p_grid) + (n - x) * np.log(1 - p_grid)
    lik = np.exp(log_lik - np.max(log_lik))  # normalize peak to 1
    
    # Posterior: Beta(x + alpha, n - x + beta)
    post_a, post_b = x + alpha, n - x + beta
    posterior = stats.beta.pdf(p_grid, post_a, post_b)
    
    # Rescale likelihood to match posterior peak height
    lik_scaled = lik * np.max(posterior)
    
    # Prior: blue dashed.  Likelihood: black solid.  Posterior: blue solid.
    ax.plot(p_grid, prior, color=COLOR_PRIOR, linewidth=2.5, linestyle='--',
            label=f'Prior: Beta({alpha}, {beta})')
    ax.plot(p_grid, lik_scaled, color=COLOR_LIKELIHOOD, linewidth=2.5,
            label='Likelihood (rescaled)')
    ax.plot(p_grid, posterior, color=COLOR_POSTERIOR, linewidth=2.5,
            label=f'Posterior: Beta({post_a}, {post_b})')
    
    # Mark MLE and posterior mean
    mle = x / n
    post_mean = post_a / (post_a + post_b)
    ax.axvline(mle, color=COLOR_LIKELIHOOD, linestyle=':', alpha=0.6, linewidth=1.5)
    ax.axvline(post_mean, color=COLOR_POSTERIOR, linestyle=':', alpha=0.6, linewidth=1.5)
    
    if title:
        ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_xlabel('$p$', fontsize=11)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, None)
    ax.legend(fontsize=9, loc='upper left')


fig, axes = plt.subplots(2, 2, figsize=(14, 10))

plot_beta_binomial(axes[0, 0], n=16, x=12, alpha=1, beta=1,
                   title='Uniform prior, n=16, X=12')
plot_beta_binomial(axes[0, 1], n=16, x=12, alpha=5, beta=5,
                   title='Beta(5,5) prior, n=16, X=12')
plot_beta_binomial(axes[1, 0], n=100, x=75, alpha=1, beta=1,
                   title='Uniform prior, n=100, X=75')
plot_beta_binomial(axes[1, 1], n=16, x=12, alpha=20, beta=20,
                   title='Strong Beta(20,20) prior, n=16, X=12')

plt.suptitle('Beta-Binomial: Prior, Likelihood, and Posterior',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

*__Figure 1.__ Prior (blue dashed), likelihood (black solid, rescaled), and posterior (blue solid) for the Beta-Binomial model under four scenarios. Top left: a weak uniform prior with moderate data — the posterior tracks the likelihood. Top right: a moderate Beta(5,5) prior pulls the posterior toward 0.5. Bottom left: with $n = 100$ observations, the posterior concentrates tightly around the MLE regardless of the prior. Bottom right: a strong Beta(20,20) prior (prior "sample size" 40) dominates the $n = 16$ data, pulling the posterior toward 0.5. Dotted vertical lines show the MLE (black) and posterior mean (blue).*

### Observations

1. **Weak prior, moderate data** (top left): The uniform prior ($\alpha + \beta = 2$) is exactly flat, so $\text{posterior} \propto \text{likelihood} \times \text{prior} = \text{likelihood} \times \text{const}$. The posterior is *proportional to* the likelihood, not just close to it.

2. **Moderate prior, moderate data** (top right): The Beta(5,5) prior pulls the posterior toward 0.5. The posterior mean compromises between the MLE (0.75) and the prior mean (0.5), weighted by their respective "sample sizes" ($n = 16$ vs. $\alpha + \beta = 10$).

3. **Weak prior, lots of data** (bottom left): With $n = 100$ observations, the posterior is tightly concentrated around the MLE. The prior barely matters.

4. **Strong prior at wrong location** (bottom right): The Beta(20,20) prior has "prior sample size" 40, exceeding the actual sample size of 16. The prior pulls the posterior strongly toward 0.5, away from the MLE.

---

## 3. Gamma-Exponential: Back to the Earthquake Data

In Lecture 1, we modeled the interarrival times of California M $\geq$ 4.0 earthquakes as $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Exponential}(\lambda)$, where $\lambda$ is the daily rate. The MLE was $\hat{\lambda} = n / \sum X_i = 1/\bar{X}$.

Recall that we checked this modeling assumption in Lecture 1 by plotting the histogram of interarrival times and comparing it to the best-fitting exponential density. The exponential model looked reasonable — so let's take it as given and put a prior on $\lambda$.

### The Gamma Distribution

The **Gamma$(\alpha, \beta)$ distribution** (shape $\alpha > 0$, rate $\beta > 0$) has density
$$f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda}, \qquad \lambda > 0$$

with mean $\alpha/\beta$ and variance $\alpha/\beta^2$. The special case $\alpha = 1$ gives the $\text{Exponential}(\beta)$ distribution.
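One practical caveat when coding this up: `scipy.stats.gamma` is parameterized by shape `a` and `scale`, so the rate $\beta$ enters as `scale=1/beta`. The sketch below checks the mean and variance formulas and the Exponential special case; the values of `alpha` and `beta` are arbitrary.

```python
from scipy import stats

alpha, beta = 3.0, 20.0                     # arbitrary shape and rate

# SciPy uses scale = 1 / rate
gamma_dist = stats.gamma(a=alpha, scale=1 / beta)
mean, var = gamma_dist.mean(), gamma_dist.var()
print(f"mean = {mean:.4f} (alpha/beta   = {alpha / beta:.4f})")
print(f"var  = {var:.6f} (alpha/beta^2 = {alpha / beta**2:.6f})")

# Special case alpha = 1: Gamma(1, beta) is Exponential(rate = beta)
gamma_one = stats.gamma(a=1, scale=1 / beta)
expon_dist = stats.expon(scale=1 / beta)
print(gamma_one.pdf(0.03), expon_dist.pdf(0.03))
```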

### Choosing a Prior

Suppose before looking at the data, we believe earthquakes of this magnitude happen roughly once every 20 days, but we're not very confident. A natural prior is $\lambda \sim \text{Gamma}(1, 20) = \text{Exponential}(\text{rate} = 20)$, which has mean $1/20 = 0.05$ per day.

This is a very weak prior: the "prior sample size" is just $\alpha = 1$ (like one prior observation). Unlike the likelihood model, which we could check against the data, there's no way to directly verify whether the prior is reasonable — we only observe $\lambda$ indirectly. But as we'll see, with this much data it hardly matters.

### The Posterior

The likelihood for $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda)$ is:
$$f_\lambda(x_1, \ldots, x_n) = \lambda^n e^{-\lambda \sum x_i}$$

The $\text{Gamma}(\alpha, \beta)$ prior has density $\pi(\lambda) \propto \lambda^{\alpha-1} e^{-\beta\lambda}$, so:
$$\pi(\lambda \mid x) \propto_\lambda f_\lambda(x) \cdot \pi(\lambda) = \lambda^n e^{-\lambda \sum x_i} \cdot \lambda^{\alpha-1} e^{-\beta\lambda} = \lambda^{(n+\alpha)-1} e^{-(\beta + \sum x_i)\lambda}$$

This is proportional to a $\text{Gamma}(n + \alpha, \;\beta + \sum x_i)$ density:

$$\boxed{\lambda \mid X_1, \ldots, X_n \sim \text{Gamma}(n + \alpha, \;\beta + \sum X_i)}$$

The Gamma prior is **conjugate** to the Exponential likelihood — the same pattern as Beta-Binomial.

### Posterior Mean as Weighted Average

The posterior mean is:
$$E[\lambda \mid X] = \frac{n + \alpha}{\beta + \sum X_i} = \frac{n + \alpha}{\beta + n\bar{X}}$$

We can rewrite this as a weighted average of the MLE $\hat{\lambda} = 1/\bar{X}$ and the prior mean $\alpha/\beta$. Splitting the numerator into its data part $n$ and its prior part $\alpha$:

$$E[\lambda \mid X] = \frac{n}{\beta + n\bar{X}} + \frac{\alpha}{\beta + n\bar{X}} = \underbrace{\frac{n\bar{X}}{\beta + n\bar{X}}}_{w} \cdot \underbrace{\frac{1}{\bar{X}}}_{\hat{\lambda}_{\text{MLE}}} + \underbrace{\frac{\beta}{\beta + n\bar{X}}}_{1 - w} \cdot \underbrace{\frac{\alpha}{\beta}}_{\text{prior mean}}$$

That is, defining $w = n\bar{X}/(\beta + n\bar{X})$, the identity is exact:

$$E[\lambda \mid X] = w \cdot \hat{\lambda}_{\text{MLE}} + (1 - w) \cdot \frac{\alpha}{\beta}$$

where $w \to 1$ as $n \to \infty$ (data overwhelms the prior). Here $\beta$ plays the role of the "prior sample size," measured in the same units as $\sum X_i$ (total observation time).
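The conjugate update and the weighted-average identity can be checked on simulated data. This is a sketch on synthetic interarrival times, not the earthquake data: the true rate `0.04`, the sample size, and the seed are assumptions made only for the simulation.

```python
import numpy as np

rng = np.random.default_rng(145)
true_rate = 0.04                                     # hypothetical, simulation only
x = rng.exponential(scale=1 / true_rate, size=50)    # synthetic interarrival times

alpha, beta = 1.0, 20.0                              # Gamma(1, 20) prior
n, total = len(x), x.sum()

# Conjugate update: Gamma(n + alpha, beta + sum x_i)
post_alpha, post_beta = n + alpha, beta + total
post_mean = post_alpha / post_beta

# Weighted-average form with w = n*xbar / (beta + n*xbar)
w = total / (beta + total)
mle = n / total
weighted = w * mle + (1 - w) * (alpha / beta)

print(f"Posterior mean: {post_mean:.5f}")
print(f"Weighted form:  {weighted:.5f}  (w = {w:.4f}, MLE = {mle:.5f})")
```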

### Pseudodata Interpretation

The prior $\text{Gamma}(\alpha, \beta)$ is like having already observed $\alpha$ pseudo-events over a total pseudo-time of $\beta$, giving a prior rate of $\alpha/\beta$. The posterior adds the real data: $n$ observed events over total time $\sum X_i$.

### Derived Quantities

A key advantage of the Bayesian approach is that once we have the posterior for $\lambda$, we can compute posterior distributions for *any* function of $\lambda$ — for example:
- **Probability of $\geq 1$ earthquake in a week**: $1 - e^{-7\lambda}$
- **90th percentile of earthquakes in a year**: the 90th percentile of $\text{Poisson}(365\lambda)$

Uncertainty propagates automatically through nonlinear transformations, without needing the delta method!

In [None]:
# Load earthquake data from Lecture 1
eq_data = pd.read_csv('../../demos/lec01_earthquakes/data/california_earthquakes_declustered.csv')

# Filter to mainshocks and compute interarrival times
mainshocks = eq_data[eq_data['is_mainshock']].sort_values('time').reset_index(drop=True)

# Parse timestamps (format='ISO8601' handles mixed fractional seconds)
timestamps = pd.to_datetime(mainshocks['time'], format='ISO8601')
interarrivals = timestamps.diff().dt.total_seconds().dropna().values / 86400

n_eq = len(interarrivals)
sum_x = np.sum(interarrivals)
xbar = np.mean(interarrivals)
mle_lambda = 1 / xbar

print(f"Number of mainshocks: {len(mainshocks)}")
print(f"Number of interarrival times: {n_eq}")
print(f"Mean interarrival time: {xbar:.2f} days")
print(f"MLE rate: {mle_lambda:.4f} per day")

# Prior: Gamma(alpha=1, beta=20) — rate parameterization
alpha_prior_eq = 1
beta_prior_eq = 20

# Posterior: Gamma(n + alpha, beta + sum_x)
post_alpha = n_eq + alpha_prior_eq
post_beta = beta_prior_eq + sum_x
post_mean_lambda = post_alpha / post_beta

# Weighted average check
w = n_eq * xbar / (beta_prior_eq + n_eq * xbar)
print(f"\nPosterior: Gamma({post_alpha}, {post_beta:.1f})")
print(f"Posterior mean: {post_mean_lambda:.4f} per day")
print(f"MLE: {mle_lambda:.4f} per day")
print(f"Prior mean: {alpha_prior_eq/beta_prior_eq:.4f} per day")
print(f"Weight on MLE: w = {w:.4f}")

# --- Three-panel figure ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Posterior distribution object
post_dist = stats.gamma(a=post_alpha, scale=1/post_beta)

# We still need MC samples for the 90th-percentile panel (discrete quantity)
np.random.seed(42)
post_samples = np.random.gamma(post_alpha, 1/post_beta, size=50000)

# Panel 1: Posterior for lambda
ax = axes[0]
lam_grid = np.linspace(0.025, 0.055, 500)
prior_pdf = stats.gamma.pdf(lam_grid, a=alpha_prior_eq, scale=1/beta_prior_eq)
posterior_pdf = post_dist.pdf(lam_grid)

# Prior is very spread out — rescale for visual comparison
prior_rescaled = prior_pdf * (np.max(posterior_pdf) / np.max(prior_pdf)) * 0.3

ax.plot(lam_grid, prior_rescaled, color=COLOR_PRIOR, linewidth=2, linestyle='--',
        label='Prior (rescaled)', alpha=0.7)
ax.plot(lam_grid, posterior_pdf, color=COLOR_POSTERIOR, linewidth=2.5,
        label=f'Posterior: Gamma({post_alpha}, {post_beta:.0f})')
ax.axvline(mle_lambda, color=COLOR_LIKELIHOOD, linestyle=':', linewidth=1.5,
           label=f'MLE: {mle_lambda:.4f}')

# 95% credible interval
ci_lo_lam = post_dist.ppf(0.025)
ci_hi_lam = post_dist.ppf(0.975)
mask = (lam_grid >= ci_lo_lam) & (lam_grid <= ci_hi_lam)
ax.fill_between(lam_grid[mask], posterior_pdf[mask], alpha=0.15, color=COLOR_POSTERIOR)
ax.set_xlabel(r'$\lambda$ (earthquakes per day)', fontsize=11)
ax.set_ylabel('Density', fontsize=11)
ax.set_title(r'Posterior for rate $\lambda$', fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.text(0.05, 0.92, f'95% credible interval: [{ci_lo_lam:.4f}, {ci_hi_lam:.4f}]',
        transform=ax.transAxes, fontsize=9, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))


# Panel 2: P(earthquake in a week) = 1 - exp(-7*lambda)
# Analytic PDF via change of variables: p = 1 - exp(-7*lam), lam = -log(1-p)/7
# f_P(p) = f_Lambda(-log(1-p)/7) * 1/(7*(1-p))
ax = axes[1]
prob_week_mle = 1 - np.exp(-7 * mle_lambda)

prob_grid = np.linspace(0.001, 0.999, 500)
lam_of_p = -np.log(1 - prob_grid) / 7
dlam_dp = 1 / (7 * (1 - prob_grid))
prob_week_pdf = post_dist.pdf(lam_of_p) * dlam_dp

# Focus on the region where the density is nontrivial
pw_support = prob_grid[(prob_week_pdf > 1e-3)]
pw_lo, pw_hi = pw_support[0], pw_support[-1]
pw_mask = (prob_grid >= pw_lo) & (prob_grid <= pw_hi)

ax.plot(prob_grid[pw_mask], prob_week_pdf[pw_mask], color=COLOR_POSTERIOR, linewidth=2.5,
        label='Posterior density')
ax.axvline(prob_week_mle, color=COLOR_LIKELIHOOD, linestyle=':', linewidth=1.5,
           label=f'MLE: {prob_week_mle:.3f}')

# 95% credible interval from posterior quantiles of lambda, transformed
# Monotone increasing transformation: larger lambda -> larger p
ci_lo_pw = 1 - np.exp(-7 * ci_lo_lam)
ci_hi_pw = 1 - np.exp(-7 * ci_hi_lam)

ci_mask = pw_mask & (prob_grid >= ci_lo_pw) & (prob_grid <= ci_hi_pw)
ax.fill_between(prob_grid[ci_mask], prob_week_pdf[ci_mask], alpha=0.15,
                color=COLOR_POSTERIOR)

ax.set_xlabel('$P(\\geq 1$ earthquake in a week$)$', fontsize=11)
ax.set_ylabel('Density', fontsize=11)
ax.set_title('Posterior for $P$(EQ in a week)', fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.text(0.05, 0.92, f'95% credible interval: [{ci_lo_pw:.3f}, {ci_hi_pw:.3f}]',
        transform=ax.transAxes, fontsize=9, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Panel 3: 90th percentile of Poisson(365*lambda) — discrete, needs MC
ax = axes[2]
# poisson.ppf broadcasts over the array of rates, so no Python loop is needed
pct90_annual = stats.poisson.ppf(0.9, 365 * post_samples)
pct90_mle = stats.poisson.ppf(0.9, 365 * mle_lambda)

# Since this is discrete, use a bar chart of the posterior PMF
vals, counts = np.unique(pct90_annual, return_counts=True)
pmf = counts / len(pct90_annual)
ax.bar(vals, pmf, color=COLOR_POSTERIOR, alpha=0.6, edgecolor='white', width=0.8,
       label='Posterior PMF')
ax.axvline(pct90_mle, color=COLOR_LIKELIHOOD, linestyle=':', linewidth=1.5,
           label=f'MLE: {pct90_mle:.0f}')

ci_lo_p90 = np.percentile(pct90_annual, 2.5)
ci_hi_p90 = np.percentile(pct90_annual, 97.5)
ax.set_xlabel('90th percentile of annual EQ count', fontsize=11)
ax.set_ylabel('Posterior probability', fontsize=11)
ax.set_title("Posterior for 90th pctile\nof annual count", fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.text(0.05, 0.92, f'95% credible interval: [{ci_lo_p90:.0f}, {ci_hi_p90:.0f}]',
        transform=ax.transAxes, fontsize=9, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Gamma-Exponential: Earthquake Rate and Derived Quantities',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

*__Figure 2.__ Posterior inference for California earthquake rates using the Gamma-Exponential conjugate model. Left: the posterior for the daily rate $\lambda$ with 95% credible interval (shaded). Center: the posterior density of $P(\geq 1 \text{ EQ in a week}) = 1 - e^{-7\lambda}$, obtained via the change-of-variables formula. Right: the posterior distribution of the 90th percentile of the annual earthquake count (a discrete quantity, computed by Monte Carlo). With $n = 614$ interarrival times, the weak prior has virtually no effect.*

## 4. Normal-Normal Conjugacy

### Setup

Suppose $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\theta, \sigma^2)$ with $\sigma^2$ known.

**Prior**: $\theta \sim N(\mu_0, \tau_0^2)$

### Deriving the Posterior

The likelihood is:
$$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right) \propto_\theta \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2\right)$$

Expanding the sum: $\sum_{i=1}^n (x_i - \theta)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2$. The first term doesn't involve $\theta$, so:

$$f_\theta(x_1, \ldots, x_n) \propto_\theta \exp\!\left(-\frac{n(\bar{x} - \theta)^2}{2\sigma^2}\right)$$

(This tells us the likelihood depends on the data only through $\bar{x}$ — we'll come back to this point later.)

Now applying Bayes' rule:
$$\pi(\theta \mid x_1, \ldots, x_n) \propto_\theta \exp\!\left(-\frac{n(\bar{x} - \theta)^2}{2\sigma^2}\right) \cdot \exp\!\left(-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right)$$

Both factors are Gaussian in $\theta$, so the product is also Gaussian. We need to combine the two quadratics in $\theta$ and complete the square.

<details>
<summary><b>Show algebra: completing the square</b></summary>

The exponent (ignoring the $-1/2$) is:
$$\frac{n}{\sigma^2}(\bar{x} - \theta)^2 + \frac{1}{\tau_0^2}(\theta - \mu_0)^2$$

Expanding:
$$= \frac{n}{\sigma^2}\left(\theta^2 - 2\bar{x}\theta + \bar{x}^2\right) + \frac{1}{\tau_0^2}\left(\theta^2 - 2\mu_0\theta + \mu_0^2\right)$$

Collecting terms in $\theta$:
$$= \left(\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\right)\theta^2 - 2\left(\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\right)\theta + \text{const}$$

Define the **posterior precision** $\frac{1}{\tau_1^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2}$ and complete the square:

$$= \frac{1}{\tau_1^2}\left(\theta - \mu_1\right)^2 + \text{const}$$

where $\mu_1 = \tau_1^2\left(\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\right)$.
</details>

The result:

$$\boxed{\theta \mid X_1, \ldots, X_n \sim N(\mu_1, \tau_1^2)}$$

### The Precision Formulation

The result is cleanest in terms of **precision** ($= 1/\text{variance}$):

$$\text{Posterior precision} = \text{Prior precision} + \text{Data precision}$$
$$\frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$$

The posterior mean is a **precision-weighted average**:

$$\mu_1 = \underbrace{\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau_0^2}}_{w} \cdot \bar{X} \;+\; \underbrace{\frac{1/\tau_0^2}{n/\sigma^2 + 1/\tau_0^2}}_{1-w} \cdot \mu_0$$

This has **exactly the same structure** as the Beta-Binomial:
- The posterior mean is a weighted average of the MLE ($\bar{X}$) and the prior mean ($\mu_0$)
- More data (larger $n$) $\Rightarrow$ more weight on $\bar{X}$
- More precise prior (smaller $\tau_0^2$) $\Rightarrow$ more weight on $\mu_0$

The Normal prior is conjugate to the Normal likelihood, just as the Beta prior is conjugate to the Binomial. This weighted-average pattern recurs across all the conjugate families covered in this lecture.
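The precision formulas translate directly into a few lines of code. In the sketch below, `normal_normal_update` is a hypothetical helper name. The result is checked against a brute-force grid normalization of likelihood times prior, using illustrative values ($\bar{x} = 3$, $n = 10$, $\sigma^2 = 4$, $\mu_0 = 0$, $\tau_0^2 = 1$).

```python
import numpy as np
from scipy import stats

def normal_normal_update(xbar, n, sigma2, mu0, tau0_sq):
    """Posterior mean and variance for a Normal mean with known variance sigma2."""
    post_prec = n / sigma2 + 1 / tau0_sq          # precisions add
    tau1_sq = 1 / post_prec
    mu1 = tau1_sq * (n * xbar / sigma2 + mu0 / tau0_sq)
    return mu1, tau1_sq

mu1, tau1_sq = normal_normal_update(xbar=3.0, n=10, sigma2=4.0, mu0=0.0, tau0_sq=1.0)

# Independent check: multiply likelihood and prior on a grid, then normalize
theta = np.linspace(-6, 8, 20001)
dtheta = theta[1] - theta[0]
unnorm = (stats.norm.pdf(theta, 3.0, np.sqrt(4.0 / 10))
          * stats.norm.pdf(theta, 0.0, 1.0))
grid_post = unnorm / (unnorm.sum() * dtheta)
grid_mean = (theta * grid_post).sum() * dtheta
grid_var = ((theta - grid_mean) ** 2 * grid_post).sum() * dtheta

print(f"Closed form: mean {mu1:.4f}, variance {tau1_sq:.4f}")
print(f"Grid check:  mean {grid_mean:.4f}, variance {grid_var:.4f}")
```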

In [None]:
def plot_normal_normal(ax, xbar, n, sigma2, mu0, tau0_sq, title=None):
    """Plot prior, likelihood, and posterior for Normal-Normal model."""
    # Posterior parameters
    data_prec = n / sigma2
    prior_prec = 1 / tau0_sq
    post_prec = data_prec + prior_prec
    tau1_sq = 1 / post_prec
    mu1 = tau1_sq * (xbar * data_prec + mu0 * prior_prec)
    
    # Grid for plotting
    all_means = [xbar, mu0, mu1]
    all_sds = [np.sqrt(sigma2 / n), np.sqrt(tau0_sq), np.sqrt(tau1_sq)]
    lo = min(m - 4 * s for m, s in zip(all_means, all_sds))
    hi = max(m + 4 * s for m, s in zip(all_means, all_sds))
    theta_grid = np.linspace(lo, hi, 500)
    
    prior = stats.norm.pdf(theta_grid, mu0, np.sqrt(tau0_sq))
    lik = stats.norm.pdf(theta_grid, xbar, np.sqrt(sigma2 / n))
    posterior = stats.norm.pdf(theta_grid, mu1, np.sqrt(tau1_sq))
    
    # Prior: blue dashed.  Likelihood: black solid.  Posterior: blue solid.
    ax.plot(theta_grid, prior, color=COLOR_PRIOR, linewidth=2.5, linestyle='--',
            label=f'Prior: N({mu0}, {tau0_sq})')
    ax.plot(theta_grid, lik, color=COLOR_LIKELIHOOD, linewidth=2.5,
            label=f'Likelihood: N({xbar}, {sigma2/n:.2g})')
    ax.plot(theta_grid, posterior, color=COLOR_POSTERIOR, linewidth=2.5,
            label=f'Posterior: N({mu1:.2f}, {tau1_sq:.3f})')
    
    ax.axvline(xbar, color=COLOR_LIKELIHOOD, linestyle=':', alpha=0.5, linewidth=1.5)
    ax.axvline(mu1, color=COLOR_POSTERIOR, linestyle=':', alpha=0.5, linewidth=1.5)
    
    if title:
        ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_xlabel(r'$\theta$', fontsize=11)
    ax.set_ylim(0, None)
    ax.legend(fontsize=9)


fig, axes = plt.subplots(1, 3, figsize=(16, 5))

plot_normal_normal(axes[0], xbar=3.0, n=10, sigma2=4.0, mu0=0.0, tau0_sq=1.0,
                   title='Balanced: prior and data\ncomparable precision')
plot_normal_normal(axes[1], xbar=3.0, n=100, sigma2=4.0, mu0=0.0, tau0_sq=1.0,
                   title='Lots of data (n=100):\nposterior near MLE')
plot_normal_normal(axes[2], xbar=3.0, n=10, sigma2=4.0, mu0=0.0, tau0_sq=0.1,
                   title=r'Precise prior ($\tau_0^2=0.1$):' + '\nposterior near prior')

plt.suptitle('Normal-Normal: Prior, Likelihood, and Posterior',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

*__Figure 3.__ Prior (blue dashed), likelihood (black solid), and posterior (blue solid) for the Normal-Normal model. Left: when the prior and data have comparable precision, the posterior sits between them. Center: with $n = 100$ the data precision overwhelms the prior, so the posterior nearly coincides with the likelihood. Right: a very precise prior ($\tau_0^2 = 0.1$) dominates even $n = 10$ observations, keeping the posterior near the prior mean. In all panels, the posterior is narrower than both the prior and the likelihood — combining information always reduces uncertainty.*

### Observations

The same qualitative pattern as the Beta-Binomial:

- The posterior always sits **between** the prior and the likelihood, closer to whichever is more precise (narrower).
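- In the Normal-Normal model, the posterior is always **narrower** than both the prior and the likelihood, because precisions add. (This guarantee is specific to the Normal model: a Beta posterior can be wider than its prior when the data are surprising.)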
- With lots of data, the posterior is essentially the likelihood.
- With a very precise prior, the posterior stays near the prior mean even if the data disagree.

---

### Conjugate Priors: The Common Pattern

All three examples above share the same structure: the posterior belongs to the **same family** as the prior, with updated parameters. When this happens, we say the prior family is **conjugate** to the likelihood.

| Likelihood | Conjugate prior | Posterior |
|-----------|----------------|-----------|
| Binomial$(n, p)$ | Beta$(\alpha, \beta)$ | Beta$(X + \alpha, \; n - X + \beta)$ |
| Exponential$(\lambda)$ | Gamma$(\alpha, \beta)$ | Gamma$(n + \alpha, \; \beta + \sum X_i)$ |
| Normal$(\theta, \sigma^2)$ | Normal$(\mu_0, \tau_0^2)$ | Normal$(\mu_1, \tau_1^2)$ |

In every case, the posterior mean is a weighted average of the MLE and the prior mean, and the prior parameters have a **pseudodata** interpretation: imaginary observations that get combined with the real data. You'll see another conjugate pair — Gamma-Poisson — on the homework.

---

## 5. The Likelihood Is All That Matters

### A Key Observation

In the posterior formula $\pi(\theta \mid x) \propto_\theta f_\theta(x) \cdot \pi(\theta)$, the data $x$ enter **only through the likelihood function** $f_\theta(x)$ (viewed as a function of $\theta$ for fixed $x$).

If two data sets $x$ and $x'$ produce the same likelihood function — that is, $f_\theta(x) = c \cdot f_\theta(x')$ for all $\theta$ and some constant $c$ — they lead to the **same posterior distribution**, regardless of the prior.

### Sufficient Statistics

Sometimes a single function of the data — a statistic $T(X)$ — is all we need to compute the likelihood function (up to a multiplicative constant not depending on $\theta$). When that happens, $T(X)$ carries all the information in the data about $\theta$, and we call it **sufficient**.

For example, suppose $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Exponential}(\lambda)$. The likelihood is:
$$f_\lambda(x_1, \ldots, x_n) = \lambda^n e^{-\lambda \sum x_i}$$

The only feature of the data that matters (as a function of $\lambda$) is $\sum x_i$, or equivalently, for fixed $n$, the sample mean $\bar{x}$. Two data sets of the same size $n$ with the same sample mean produce the same likelihood and therefore the same posterior.
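To see this concretely, the sketch below builds two small made-up data sets with the same $n$ and the same sum but different individual values, and computes the posterior on a grid under a deliberately non-conjugate prior. The log-normal prior, the grid, and the helper name `grid_posterior` are illustrative assumptions. The likelihood is computed point by point, so the code does not presuppose sufficiency.

```python
import numpy as np
from scipy import stats

def grid_posterior(x, lam_grid, prior_pdf):
    """Normalize likelihood * prior on a grid for iid Exponential(lambda) data."""
    # Sum the log-density of every observation at every grid value of lambda
    log_lik = stats.expon.logpdf(x[:, None], scale=1 / lam_grid[None, :]).sum(axis=0)
    log_post = log_lik + np.log(prior_pdf(lam_grid))
    post = np.exp(log_post - log_post.max())      # subtract max for stability
    return post / (post.sum() * (lam_grid[1] - lam_grid[0]))

x_a = np.array([5.0, 10.0, 15.0, 30.0])
x_b = np.array([14.0, 16.0, 15.0, 15.0])          # different values, same n and sum

lam_grid = np.linspace(0.001, 0.5, 5000)
prior_pdf = lambda lam: stats.lognorm.pdf(lam, s=1.0, scale=0.05)  # non-conjugate

post_a = grid_posterior(x_a, lam_grid, prior_pdf)
post_b = grid_posterior(x_b, lam_grid, prior_pdf)

print("Same sum:", x_a.sum() == x_b.sum())
print("Max posterior difference:", np.max(np.abs(post_a - post_b)))
```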

Similarly, for the Normal model the likelihood depends on the data only through $\bar{x}$.

We'll study sufficiency more formally in Stat 210A. For now, the key point is that for all of our conjugate models, there's a simple sufficient statistic, and the posterior depends on the data only through it.

Although we've introduced sufficiency in a Bayesian context, it's equally important for frequentist statistics. There's a *sufficiency principle* that says good estimators and other inference procedures should depend on the data only through a sufficient statistic. If you look back at some of the failed estimators from the last several weeks, a common theme is that they don't follow this principle.

---

## 6. Large-Sample Behavior: The Prior Washes Out

### The Main Idea

As the sample size $n$ grows:
- The likelihood $f_\theta(x)$ becomes more and more **concentrated** around the MLE $\hat{\theta}$
- The prior $\pi(\theta)$ is a fixed function that doesn't change with $n$. In any small neighborhood around the MLE, the prior is roughly constant (just some number $\pi(\hat{\theta})$), while the likelihood, viewed relative to its maximum, has a peak whose width shrinks like $1/\sqrt{n}$
- So the posterior is dominated by the likelihood: **the prior washes out**

### Asymptotic Normality of the Posterior

We can make this precise using the quadratic approximation to the log-likelihood from Lectures 4–5:
$$\ell(\theta) = \log f_\theta(x) \approx \ell(\hat{\theta}) - \frac{nI(\hat{\theta})}{2}(\theta - \hat{\theta})^2$$

So the likelihood looks like a Gaussian:
$$f_\theta(x) \approx e^{\ell(\hat{\theta})} \cdot \exp\left(-\frac{nI(\hat{\theta})}{2}(\theta - \hat{\theta})^2\right)$$

Since $\pi(\theta) \approx \pi(\hat{\theta})$ near $\hat{\theta}$:
$$\pi(\theta \mid x) \propto_\theta f_\theta(x) \cdot \pi(\theta) \approx \text{const} \cdot \exp\left(-\frac{nI(\hat{\theta})}{2}(\theta - \hat{\theta})^2\right)$$

This is a Normal density:

$$\boxed{\pi(\theta \mid X) \approx N\!\left(\hat{\theta}_{\text{MLE}}, \; \frac{1}{nI(\hat{\theta}_{\text{MLE}})}\right)}$$

This is called the **Bernstein–von Mises theorem** (stated here informally). The posterior variance $1/(nI(\hat{\theta}))$ is the same as the asymptotic variance of the MLE!
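As a worked instance of the boxed formula (a sketch assuming $X \sim \text{Binomial}(n, p)$, so that $I(p)$ denotes the Fisher information of a single Bernoulli trial), the variance can be computed explicitly:

$$\ell(p) = X \log p + (n - X)\log(1-p) + \text{const}, \qquad -E_p[\ell''(p)] = \frac{np}{p^2} + \frac{n(1-p)}{(1-p)^2} = \frac{n}{p(1-p)} = nI(p)$$

$$\frac{1}{nI(\hat{p})} = \frac{\hat{p}(1-\hat{p})}{n}, \qquad \hat{p} = X/n$$

So in the binomial model the approximate posterior is $N(\hat{p}, \; \hat{p}(1-\hat{p})/n)$.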

For large samples, **Bayesian and frequentist inference agree**:
- Posterior mean $\approx$ MLE
- Posterior standard deviation $\approx$ standard error of MLE
- 95% posterior credible interval $\approx$ asymptotic normal 95% confidence interval $\hat{\theta} \pm 1.96/\sqrt{nI(\hat{\theta})}$
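The three approximations listed above can be checked against an exact conjugate posterior. The sketch below assumes the Gamma-Exponential model, an arbitrary Gamma(3, 1) prior, and a sample mean held fixed at 0.5 so that the MLE is $\hat{\lambda} = 2$ for every $n$. These choices are illustrative, not taken from the lecture. For the Exponential model, $I(\lambda) = 1/\lambda^2$, so the approximate posterior standard deviation is $\hat{\lambda}/\sqrt{n}$.

```python
import numpy as np
from scipy import stats

def compare_to_bvm(n, xbar=0.5, alpha=3.0, beta=1.0):
    """Compare the exact Gamma posterior with the N(mle, mle^2 / n) approximation.

    xbar, alpha, and beta are illustrative assumptions, not values from the lecture.
    """
    total = n * xbar                      # sum of the observations
    mle = n / total                       # MLE of the rate lambda
    bvm_sd = mle / np.sqrt(n)             # 1 / sqrt(n * I(mle)) with I(lam) = 1 / lam^2
    post = stats.gamma(a=n + alpha, scale=1.0 / (beta + total))
    return {"mle": mle, "post_mean": post.mean(),
            "post_sd": post.std(), "bvm_sd": bvm_sd}

for n in [10, 100, 1000]:
    r = compare_to_bvm(n)
    print(f"n={n:5d}  MLE={r['mle']:.4f}  post mean={r['post_mean']:.4f}  "
          f"post sd={r['post_sd']:.4f}  BvM sd={r['bvm_sd']:.4f}")
```

As $n$ grows, the posterior mean should approach the MLE and the ratio of the two standard deviations should approach 1.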

Let's see this in action.

In [None]:
# Demonstrate prior washing out: multiple priors converging as n grows
true_p = 0.7
sample_sizes = [5, 20, 100, 500]

priors = [
    (1, 1, 'Uniform'),
    (2, 2, 'Beta(2,2)'),
    (0.5, 0.5, 'Jeffreys'),
    (10, 2, 'Beta(10,2)'),
]

fig, axes = plt.subplots(1, 4, figsize=(18, 4.5))
np.random.seed(42)

for ax, n in zip(axes, sample_sizes):
    x = np.random.binomial(n, true_p)
    mle = x / n
    p_grid = np.linspace(0.001, 0.999, 500)
    
    # Posteriors from different priors in shades of blue
    for i, (a, b, label) in enumerate(priors):
        posterior = stats.beta.pdf(p_grid, x + a, n - x + b)
        ax.plot(p_grid, posterior, color=MULTI_BLUES[i], linewidth=2,
                label=label)
    
    # BvM Normal approximation in red (dashed)
    if n >= 20 and 0 < mle < 1:
        fisher_var = mle * (1 - mle) / n
        normal_approx = stats.norm.pdf(p_grid, mle, np.sqrt(fisher_var))
        ax.plot(p_grid, normal_approx, color=COLOR_APPROX, linewidth=2,
                linestyle='--', label='BvM approx', alpha=0.8)
    
    # MLE vertical line (black, dotted)
    ax.axvline(mle, color=COLOR_LIKELIHOOD, linestyle=':', linewidth=1.5,
               alpha=0.7, label=f'MLE = {mle:.2f}')
    
    # True value (black, dashed)
    ax.axvline(true_p, color=COLOR_TRUE, linestyle='--', linewidth=1.5, alpha=0.4,
               label=f'True p = {true_p}')
    ax.set_title(f'n = {n}, X = {x}', fontsize=12, fontweight='bold')
    ax.set_xlabel('$p$', fontsize=11)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, None)
    if n == 5:
        ax.legend(fontsize=7, loc='upper left')

plt.suptitle('Posteriors from Different Priors Converge as n Grows',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

*__Figure 4.__ Posterior distributions from four different priors (shades of blue), shown for increasing sample sizes. The true parameter $p = 0.7$ is marked by the black dashed line and the MLE $\hat{p} = X/n$ by the black dotted line. For small $n$ (left), the posteriors differ substantially, so the choice of prior matters. As $n$ grows (right), all posteriors converge to the same shape, concentrated around the MLE. For $n \geq 20$, the red dashed curve shows the Bernstein–von Mises Normal approximation $N(\hat{p}, \hat{p}(1-\hat{p})/n)$, which matches the posteriors increasingly closely as $n$ grows.*

### Observations

- **Small $n$** (left): The posteriors differ substantially depending on the prior. The choice of prior matters!

- **Moderate $n$** (second panel): The posteriors are starting to converge. Differences are visible but shrinking.

- **Large $n$** (right panels): All posteriors are nearly identical — concentrated around the MLE, and well-approximated by the Bernstein–von Mises Normal distribution $N(\hat{p}, \hat{p}(1-\hat{p})/n)$ (red dashed line).

This is the **Bernstein–von Mises phenomenon**: for any prior with positive, continuous density near the true parameter, the posterior converges to the same Normal distribution centered at the MLE.

**For large samples**, we don't need to get the prior exactly right — any reasonable prior leads to essentially the same inference. **For small samples**, the prior matters, and this is a *feature*: when data are scarce, it makes sense for prior knowledge to influence our conclusions.
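One rough way to quantify the washout is to track how far apart the posterior means from the four priors in Figure 4 are as $n$ grows. The sketch below is an idealization: instead of simulating data, it holds the observed proportion at exactly 0.7, and the helper name `posterior_mean_spread` is introduced here only for illustration.

```python
# The same four Beta(a, b) priors used in Figure 4.
priors = {"Uniform": (1, 1), "Beta(2,2)": (2, 2),
          "Jeffreys": (0.5, 0.5), "Beta(10,2)": (10, 2)}

def posterior_mean_spread(n, p_hat=0.7):
    """Largest minus smallest Beta posterior mean across the priors when X = p_hat * n."""
    x = p_hat * n  # idealized, non-random data: exactly 70% successes
    means = [(x + a) / (n + a + b) for a, b in priors.values()]
    return max(means) - min(means)

for n in [10, 100, 1000, 10000]:
    print(f"n={n:6d}  spread of posterior means = {posterior_mean_spread(n):.5f}")
```

Because each posterior mean is $(X + a)/(n + a + b)$, the disagreement between priors shrinks at rate $1/n$, faster than the posterior standard deviation, which shrinks at rate $1/\sqrt{n}$.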

---

## 7. Credible Intervals

The posterior distribution gives us more than a point estimate — it provides a complete description of our uncertainty about $\theta$.

A **$100(1-\alpha)\%$ credible interval** is any interval $[a, b]$ such that $P(\theta \in [a, b] \mid X) = 1 - \alpha$. The most common choice is the **equal-tailed credible interval**: the $\alpha/2$ and $1-\alpha/2$ quantiles of the posterior.

For the Beta-Binomial model with $p \mid X \sim \text{Beta}(X + \alpha, n - X + \beta)$, where $\alpha$ and $\beta$ are the prior parameters rather than the interval level, a 95% credible interval runs from the 2.5th to the 97.5th percentile of this Beta distribution.

By the Bernstein–von Mises theorem, for large $n$ the posterior is approximately $N(\hat{p}, \hat{p}(1-\hat{p})/n)$, so the 95% credible interval is approximately $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ — the same formula as the asymptotic normal 95% confidence interval. Let's see how close these two intervals are.

In [None]:
# Posterior with credible interval — larger sample
n, x = 80, 60
alpha_prior, beta_prior = 1, 1
post_a, post_b = x + alpha_prior, n - x + beta_prior
post_dist = stats.beta(post_a, post_b)

p_grid = np.linspace(0.001, 0.999, 500)
posterior = post_dist.pdf(p_grid)

# Exact Bayesian credible interval (equal-tailed)
ci_lo, ci_hi = post_dist.ppf(0.025), post_dist.ppf(0.975)
post_mean = post_a / (post_a + post_b)

# Asymptotic normal confidence interval
mle = x / n
se = np.sqrt(mle * (1 - mle) / n)
wald_lo, wald_hi = mle - 1.96 * se, mle + 1.96 * se

# BvM Normal approximation to the posterior
bvm_approx = stats.norm.pdf(p_grid, mle, se)

fig, ax = plt.subplots(figsize=(10, 6))

# Posterior (blue) with shaded credible interval
ax.plot(p_grid, posterior, color=COLOR_POSTERIOR, linewidth=2.5,
        label=f'Posterior: Beta({post_a}, {post_b})')
mask = (p_grid >= ci_lo) & (p_grid <= ci_hi)
ax.fill_between(p_grid[mask], posterior[mask], alpha=0.2, color=COLOR_POSTERIOR,
                label=f'95% credible interval: [{ci_lo:.3f}, {ci_hi:.3f}]')

# BvM Normal approximation (red dashed)
ax.plot(p_grid, bvm_approx, color=COLOR_APPROX, linewidth=2, linestyle='--',
        label=f'BvM approx: N({mle:.3f}, {se**2:.4f})')

# Mark point estimates (posterior mean and MLE)
ax.axvline(post_mean, color=COLOR_POSTERIOR, linestyle=':', linewidth=1.5,
           label=f'Posterior mean: {post_mean:.3f}')
ax.axvline(mle, color=COLOR_LIKELIHOOD, linestyle=':', linewidth=1.5,
           label=f'MLE: {mle:.3f}')

# Show asymptotic normal confidence interval as a horizontal bracket
bracket_y = ax.get_ylim()[1] * 0.05 if ax.get_ylim()[1] > 0 else 0.5
ax.plot([wald_lo, wald_hi], [bracket_y, bracket_y], color=COLOR_APPROX,
        linewidth=3, solid_capstyle='butt',
        label=f'Asymp. normal 95% conf. int.: [{wald_lo:.3f}, {wald_hi:.3f}]')
ax.plot([wald_lo, wald_lo], [bracket_y - 0.2, bracket_y + 0.2],
        color=COLOR_APPROX, linewidth=2)
ax.plot([wald_hi, wald_hi], [bracket_y - 0.2, bracket_y + 0.2],
        color=COLOR_APPROX, linewidth=2)

ax.set_xlabel('$p$', fontsize=12)
ax.set_ylabel('Posterior density', fontsize=12)
ax.set_title(f'Posterior Distribution with 95% Credible Interval\n'
             f'(n={n}, X={x}, uniform prior)', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.set_xlim(0.55, 0.9)
ax.set_ylim(0, None)

plt.tight_layout()
plt.show()

*__Figure 5.__ Posterior distribution $p \mid X \sim \text{Beta}(61, 21)$ (blue solid) after observing $X = 60$ successes in $n = 80$ trials with a uniform prior, alongside the Bernstein–von Mises Normal approximation (red dashed). The shaded blue region is the exact 95% Bayesian credible interval; the red bracket near the x-axis is the asymptotic normal 95% confidence interval $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$. With $n = 80$, the two intervals are nearly identical.*

**Interpretation**: Given the data ($X = 60$ successes in $n = 80$ trials) and a uniform prior, there is a 95% posterior probability that $p$ lies in the shaded interval. This is a direct probability statement about $p$ itself.

Notice how close the Bayesian credible interval and the asymptotic normal confidence interval are. This is the Bernstein–von Mises theorem in action: for large $n$, the two approaches give essentially the same answer. The BvM Normal approximation (red dashed) is nearly indistinguishable from the exact posterior (blue solid).
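To see how the agreement depends on sample size, the comparison can be repeated for several values of $n$. This is a sketch under stated assumptions: a uniform prior, the observed proportion held at 0.75 as in Figure 5, and hypothetical helper names `compare_intervals` and `interval_gap`.

```python
import numpy as np
from scipy import stats

def compare_intervals(n, x, a=1.0, b=1.0, level=0.95):
    """Return the equal-tailed Beta credible interval and the Wald confidence interval."""
    tail = (1 - level) / 2
    post = stats.beta(x + a, n - x + b)
    credible = (post.ppf(tail), post.ppf(1 - tail))
    p_hat = x / n
    z = stats.norm.ppf(1 - tail)              # about 1.96 for a 95% interval
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    wald = (p_hat - z * se, p_hat + z * se)
    return credible, wald

def interval_gap(n):
    """Largest endpoint difference between the two intervals when X = 0.75 * n."""
    credible, wald = compare_intervals(n, 3 * n // 4)
    return max(abs(credible[0] - wald[0]), abs(credible[1] - wald[1]))

for n in [20, 80, 320, 1280]:
    credible, wald = compare_intervals(n, 3 * n // 4)
    print(f"n={n:5d}  credible=[{credible[0]:.3f}, {credible[1]:.3f}]  "
          f"Wald=[{wald[0]:.3f}, {wald[1]:.3f}]  max gap={interval_gap(n):.4f}")
```

For small $n$ the skewness of the Beta posterior and the pull of the prior keep the intervals visibly apart; both effects fade as $n$ increases.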

---

## 8. Summary

### Key Concepts

| Concept | Definition |
|---------|------------|
| **Prior** $\pi(\theta)$ | Distribution representing beliefs about $\theta$ before seeing data |
| **Likelihood** $f_\theta(x)$ | Probability (or density) of data $x$ given parameter $\theta$, viewed as a function of $\theta$ |
| **Posterior** $\pi(\theta \mid x)$ | Updated beliefs: $\propto f_\theta(x) \cdot \pi(\theta)$ |
| **Bayes estimator** | Posterior mean $E[\theta \mid X]$ (for squared error loss) |
| **Conjugate prior** | Posterior stays in the same family as the prior |
| **Sufficient statistic** | A statistic that determines the likelihood (up to a constant) |
| **Credible interval** | Interval with specified posterior probability |

### What We Learned

1. **The Bayesian framework** models $\theta$ as random with a prior, and updates to a posterior via Bayes' rule: Posterior $\propto$ Likelihood $\times$ Prior.

2. **The Bayes estimator** for squared error loss is the posterior mean. We can optimize for every $X$ value simultaneously, even though we can't optimize for every $\theta$. Other loss functions lead to other summaries of the posterior (homework).

3. **Conjugate priors** yield closed-form posteriors. In each of our three examples (Beta-Binomial, Gamma-Exponential, and Normal-Normal), the posterior mean is a weighted average of the MLE and the prior mean, with weights determined by relative "sample sizes."

4. **The likelihood carries all information** from the data. A statistic that determines the likelihood is sufficient — it contains everything the data have to say about $\theta$.

5. **For large samples**, the posterior is approximately $N(\hat{\theta}_{\text{MLE}}, 1/(nI(\hat{\theta})))$ regardless of the prior — Bayesian and frequentist inference converge. The 95% credible interval $\approx$ asymptotic normal 95% confidence interval.

### Next Time (Lecture 8)

- How should we choose a prior? What does it mean for $\theta$ to "have a probability"?
- Objective Bayes: Jeffreys prior and other non-informative priors
- Hierarchical Bayes: when even the prior has unknown parameters
- Computational methods for Bayesian inference