In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

plt.style.use('fivethirtyeight')
%matplotlib inline

## 1. Intervals and Bayesian Estimation

In the Bayesian world, an unknown numerical parameter is not a fixed number. You think of it as a random variable and express your uncertainty about it as a probability distribution. In Data 140 (or EECS 126) you learned [the related terminology](http://prob140.org/textbook/content/Chapter_20/03_Prior_and_Posterior.html), as well as [results](http://prob140.org/textbook/content/Chapter_21/00_The_Beta_and_the_Binomial.html) when the parameter is the probability of heads in a toss of a coin.

You start with a prior distribution on the parameter. The posterior distribution of the parameter, given the data, describes the entirety of your updated opinion about the parameter once you've seen the data. But distributions are sometimes considered to be far too much information. People are accustomed to interval estimates, so intervals are sometimes used even in the Bayesian setting. 

Let $\mathbf{X}$ represent the data and let $\Theta$ be the random parameter. A 95\% *credible interval* for $\Theta$ is any fixed interval $C_{95}$ such that $P(\Theta \in C_{95} \mid \mathbf{X}) = 0.95$. Keep in mind that in this setting, the parameter $\Theta$ is random and the interval $C_{95}$ is fixed.
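
As a minimal sketch of the definition, the posterior probability of any fixed interval is a difference of posterior cdf values. The Beta(3, 5) posterior and the endpoints below are hypothetical, chosen only for illustration; they do not come from any problem on this worksheet.

```python
from scipy import stats

# Hypothetical posterior for illustration only: Theta | X ~ Beta(3, 5)
posterior = stats.beta(3, 5)

def posterior_prob(lo, hi):
    """P(lo <= Theta <= hi | X) under the assumed posterior."""
    return posterior.cdf(hi) - posterior.cdf(lo)

# A candidate fixed interval; its posterior probability is a cdf difference
lo, hi = 0.10, 0.70
prob = posterior_prob(lo, hi)
print(f"P({lo} <= Theta <= {hi} | X) = {prob:.3f}")
```

An interval is a 95\% credible interval exactly when this probability equals 0.95.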

Assume that the posterior density is smooth and unimodal. Note that we are not assuming we have a lot of data.

**(a)** How many 95\% credible intervals are there for $\Theta$?

**(b)** Set up notation for the posterior cdf and use it to express the endpoints of a 95\% *equal-tailed* credible interval (ETI) for $\Theta$. *Equal-tailed* means that the posterior probability that $\Theta$ lies below the interval equals the posterior probability that it lies above the interval.

**(c)** Suppose you have a 95\% ETI for $\Theta$, but now you want to estimate the parameter $g(\Theta)$ where $g$ is an increasing function. How would you construct a 95\% credible interval for $g(\Theta)$? Is it an ETI for $g(\Theta)$?

**(d)** How would you (a Bayesian) compare two online sellers who don't have much of a track record, if all you know about them are the ratings below? Construct an appropriate Bayesian model (see Lecture 7 for ideas) and use MAP estimation to compare the underlying satisfaction rates.
- Seller A: 9 satisfied, 1 unsatisfied
- Seller B: 5 satisfied, 0 unsatisfied

**(e)** The case of Seller B suggests a different approach to finding a 95\% credible interval even when the MAP estimate is not $0$ or $1$. The approach is to find an interval of "most probable" values of $\Theta$, with the posterior probability of the interval equal to 95\%.
    
To execute this, look at the density curve and find two points where it has the same height $h$. Find the chance that $\Theta$ is in the interval between the two points. Then adjust $h$ as needed to get the chance closer to 95\%.

Formally, let $f(\theta \mid \mathbf{X})$ denote the posterior density function of $\Theta$ evaluated at the point $\theta$. You are trying to find a height $h^*$ such that
$$
\int_{\theta: f(\theta \mid \mathbf{X}) > h^*} f(\theta \mid \mathbf{X})\,\mathrm{d}{\theta} = 0.95
$$
The resulting interval is called a 95\% Highest Posterior Density Interval (HPDI).
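
The height-adjustment procedure above can be sketched numerically. This is one possible implementation, not a prescribed one: the Beta(4, 8) posterior is hypothetical, and using `scipy.optimize.brentq` both to find the two crossing points and to search over $h$ is an assumption of this sketch. It relies on the density being unimodal and equal to zero at both ends of $[0, 1]$.

```python
from scipy import stats, optimize

# Hypothetical smooth, unimodal posterior (illustration only): Beta(4, 8)
a_post, b_post = 4, 8
posterior = stats.beta(a_post, b_post)
mode = (a_post - 1) / (a_post + b_post - 2)   # valid because both parameters exceed 1

def interval_at_height(h):
    """Points where the density crosses height h, one on each side of the mode."""
    left = optimize.brentq(lambda t: posterior.pdf(t) - h, 0.0, mode)
    right = optimize.brentq(lambda t: posterior.pdf(t) - h, mode, 1.0)
    return left, right

def coverage(h):
    """Posterior probability of the interval between the two crossing points."""
    left, right = interval_at_height(h)
    return posterior.cdf(right) - posterior.cdf(left)

# Coverage shrinks as h grows, so search for the height with coverage 0.95
h_lo, h_hi = 1e-8, posterior.pdf(mode) * (1 - 1e-8)
h_star = optimize.brentq(lambda h: coverage(h) - 0.95, h_lo, h_hi)
hpd_left, hpd_right = interval_at_height(h_star)
print(f"h* = {h_star:.4f}, interval = ({hpd_left:.4f}, {hpd_right:.4f})")
```

The same search could be done by hand, raising or lowering $h$ and recomputing the coverage each time; the root finder only automates that loop.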

Discuss pros and cons of the two kinds of credible intervals: the ETI from Part **(b)** and the HPDI from Part **(e)**. You could think about key values the intervals must contain, their shape, computational convenience, transformation properties, or a number of other things.

## 2. Gibbs Sampling for a Poisson Model of Soccer Goals

We model the number of goals scored by $n$ soccer teams, where team $i$ scores $X_{ij}$ goals in match $j$:

$$X_{ij} \mid \theta_i \stackrel{\text{ind}}{\sim} \text{Poisson}(\theta_i), \quad i = 1, \ldots, n, \quad j = 1,\ldots, m$$

We use the Gamma distribution in its **rate parameterization**: $\text{Gamma}(\alpha, \beta)$ has density

$$f(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha - 1} e^{-\beta\theta}, \quad \theta > 0$$

with mean $\alpha/\beta$. We place a hierarchical prior:

$$\theta_i \mid \beta \stackrel{\text{i.i.d.}}{\sim} \text{Gamma}(\alpha, \beta), \qquad \beta \sim \text{Gamma}(a, b)$$

where $\alpha, a, b > 0$ are known hyperparameters.

---
### (a) Gamma-Poisson Conjugacy

Suppose $\beta$ is known and we observe a single count $X \mid \theta \sim \text{Poisson}(\theta)$ with prior $\theta \sim \text{Gamma}(\alpha, \beta)$. Show that the posterior is

$$\theta \mid X = x \sim \text{Gamma}(\alpha + x, \; \beta + 1).$$
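
Before writing the proof, you may find it reassuring to check the claim numerically. The sketch below uses hypothetical values of $\alpha$, $\beta$, and $x$. It verifies that the prior density times the likelihood is a constant multiple of the claimed posterior density. It is a sanity check, not a derivation.

```python
import numpy as np
from scipy import stats

# Hypothetical values for illustration only
alpha, beta, x = 3.0, 2.0, 4

theta = np.linspace(0.01, 10, 2000)
# Unnormalized posterior: Poisson likelihood times Gamma prior (scipy uses scale = 1/rate)
unnormalized = stats.poisson.pmf(x, theta) * stats.gamma.pdf(theta, a=alpha, scale=1 / beta)
claimed = stats.gamma.pdf(theta, a=alpha + x, scale=1 / (beta + 1))

# If the claim holds, the ratio is the same constant at every theta
ratio = unnormalized / claimed
print(f"ratio ranges from {ratio.min():.6g} to {ratio.max():.6g}")
```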

---
### (b) Gibbs Sampler Updates

In the hierarchical model above, write down the two conditional posteriors needed for a Gibbs sampler:

1. $\theta_i \mid \beta, X$ for each $i$ (use part (a), noting that team $i$ has $m$ independent observations)
2. $\beta \mid \theta_1, \ldots, \theta_n$ (think of the $\theta_i$'s as observed data and $\beta$ as the unknown parameter, then combine the Gamma prior on $\beta$ with the Gamma likelihood for the $\theta_i$'s)

---
### (c) Implementation and Shrinkage

The data below are goals scored per match by 20 Premier League teams in the **first 10 matches** of the 2023-24 season. Implement the Gibbs sampler from part (b) with $\alpha = 15$, $a = 1$, $b = 1/10$, burn-in $B = 200$, and $N = 5000$ post-burn-in samples. The value $\alpha = 15$ was chosen based on data from other Premier League seasons.

For each team, compute two estimates of $\theta_i$:
- The **MLE**: $\hat\theta_i = \bar{X}_i$ (sample mean of the 10 matches)
- The **posterior mean** from the Gibbs sampler

Make a trace plot of $\beta$. Then report the MLE and posterior mean for three teams: **Man City** (most goals), **Man United** (near the middle), and **Sheffield United** (fewest goals). In which direction does the posterior mean shift relative to the MLE, and why?

*Note: `np.random.gamma(shape, scale)` in NumPy uses the scale parameterization, where scale $= 1/\text{rate}$.*
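
A short sketch of the parameterization note: to draw from a Gamma distribution specified by shape and rate, pass `1 / rate` as the scale. The shape, rate, and seed below are hypothetical, and the `np.random.default_rng` generator is used here as an alternative to `np.random.gamma`; both take the scale, not the rate.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility

# Hypothetical shape and rate, for illustration only
shape, rate = 15.0, 10.0

# NumPy expects scale, so pass 1 / rate to draw from Gamma(shape, rate)
draws = rng.gamma(shape, 1 / rate, size=100_000)
print(f"sample mean = {draws.mean():.3f}, shape / rate = {shape / rate:.3f}")
```

The sample mean should be close to shape divided by rate. Passing the rate directly as the second argument would instead give a mean near shape times rate.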

In [None]:
# Goals scored per match, 20 Premier League teams, first 10 matches of 2023-24
# Source: football-data.co.uk. Rows: teams (alphabetical), Columns: matches (chronological)
teams = ['Arsenal', 'Aston Villa', 'Bournemouth', 'Brentford', 'Brighton',
         'Burnley', 'Chelsea', 'Crystal Palace', 'Everton', 'Fulham',
         'Liverpool', 'Luton', 'Man City', 'Man United', 'Newcastle',
         'Nott\'m Forest', 'Sheffield United', 'Tottenham', 'West Ham', 'Wolves']

goals = np.array([
    [2, 2, 3, 3, 6, 3, 0, 4, 3, 1],  # Arsenal
    [3, 5, 1, 0, 2, 0, 0, 3, 1, 1],  # Aston Villa
    [1, 1, 2, 2, 2, 1, 0, 1, 1, 2],  # Bournemouth
    [1, 2, 2, 3, 0, 0, 3, 1, 3, 1],  # Brentford
    [0, 0, 3, 4, 0, 2, 1, 1, 0, 2],  # Brighton
    [1, 2, 5, 2, 0, 2, 1, 0, 0, 0],  # Burnley
    [2, 2, 0, 2, 3, 2, 4, 5, 4, 1],  # Chelsea
    [1, 0, 1, 3, 1, 2, 2, 4, 0, 0],  # Crystal Palace
    [1, 1, 2, 1, 2, 1, 1, 1, 3, 3],  # Everton
    [3, 1, 1, 0, 2, 3, 0, 0, 0, 5],  # Fulham
    [4, 1, 3, 4, 1, 3, 4, 1, 2, 2],  # Liverpool
    [1, 2, 1, 4, 0, 1, 1, 1, 3, 2],  # Luton
    [5, 3, 4, 3, 5, 6, 3, 4, 0, 0],  # Man City
    [4, 0, 1, 1, 3, 3, 1, 0, 2, 2],  # Man United
    [2, 3, 1, 1, 1, 4, 4, 1, 1, 0],  # Newcastle
    [1, 0, 3, 1, 0, 1, 3, 2, 0, 1],  # Nott'm Forest
    [2, 0, 0, 0, 1, 1, 2, 0, 2, 1],  # Sheffield United
    [3, 1, 0, 5, 2, 3, 2, 1, 3, 1],  # Tottenham
    [1, 2, 0, 3, 1, 1, 0, 2, 0, 2],  # West Ham
    [3, 0, 1, 1, 2, 4, 1, 1, 1, 1],  # Wolves
])

n, m = goals.shape
T = goals.sum(axis=1)  # total goals per team
print(f"{n} teams, {m} matches each")
print(f"Sample means: {goals.mean(axis=1).round(2)}")

In [None]:
...

---
### (d) Prediction on Held-Out Data

The array `goals_rest28` below contains goals scored by the same 20 teams in the **remaining 28 matches** of the season. We did not use this data to fit the model.

For each team, use the MLE and the posterior mean from part (c) to predict the team's average goals per match in the remaining 28 games. Compare the two estimators by computing the mean squared error (MSE) across the 20 teams:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat\theta_i - \bar{Y}_i\right)^2$$

where $\bar{Y}_i$ is team $i$'s average goals in the held-out 28 matches. Which estimator predicts better? Relate your answer to the bias-variance tradeoff.
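
The MSE formula translates directly into NumPy. The helper name `mse` and the toy numbers below are hypothetical and only show the shape of the computation; they are not the Premier League data.

```python
import numpy as np

def mse(estimates, targets):
    """Average squared difference between per-team estimates and held-out means."""
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((estimates - targets) ** 2)

# Toy numbers (not the Premier League data) to show the call
toy_estimates = [1.0, 2.0, 3.0]
toy_targets = [1.5, 2.0, 2.0]
toy_mse = mse(toy_estimates, toy_targets)
print(toy_mse)
```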

In [None]:
# Remaining 28 matches of the 2023-24 season (held out from fitting)
goals_rest28 = np.array([
    [2, 0, 6, 3, 1, 2, 0, 5, 1, 2, 2, 5, 2, 1, 2, 5, 1, 4, 2, 1, 2, 3, 5, 0, 2, 4, 0, 1],  # Arsenal
    [1, 0, 1, 1, 3, 3, 0, 2, 3, 2, 1, 2, 0, 4, 3, 4, 1, 4, 1, 2, 2, 2, 3, 3, 1, 2, 6, 3],  # Aston Villa
    [0, 2, 3, 1, 1, 2, 1, 4, 2, 2, 0, 1, 1, 0, 1, 1, 3, 0, 1, 1, 3, 0, 3, 3, 2, 2, 0, 1],  # Bournemouth
    [1, 1, 0, 2, 2, 0, 2, 2, 1, 0, 1, 1, 2, 3, 3, 0, 5, 3, 1, 0, 2, 1, 0, 1, 2, 1, 1, 2],  # Brentford
    [2, 1, 1, 1, 1, 4, 1, 1, 1, 3, 0, 5, 0, 4, 1, 1, 0, 1, 3, 0, 3, 1, 0, 4, 1, 0, 1, 1],  # Brighton
    [1, 1, 1, 2, 1, 0, 1, 1, 1, 2, 0, 0, 1, 1, 4, 0, 0, 2, 0, 1, 0, 1, 1, 1, 2, 0, 2, 1],  # Burnley
    [2, 4, 0, 3, 3, 3, 4, 1, 1, 6, 2, 2, 1, 0, 2, 1, 2, 0, 0, 1, 3, 1, 2, 2, 0, 2, 3, 1],  # Chelsea
    [1, 1, 3, 2, 1, 1, 1, 1, 2, 1, 5, 0, 5, 0, 0, 1, 0, 3, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3],  # Crystal Palace
    [0, 0, 2, 1, 3, 0, 0, 0, 2, 0, 1, 1, 0, 2, 0, 3, 1, 1, 2, 0, 0, 1, 1, 1, 0, 1, 1, 0],  # Everton
    [3, 1, 3, 5, 0, 1, 1, 0, 2, 3, 1, 0, 1, 4, 0, 1, 0, 0, 0, 2, 2, 0, 1, 3, 1, 0, 3, 2],  # Fulham
    [2, 2, 3, 1, 3, 3, 1, 0, 3, 4, 0, 2, 3, 4, 4, 3, 2, 1, 0, 3, 1, 2, 2, 2, 3, 1, 4, 2],  # Liverpool
    [0, 1, 1, 1, 1, 0, 1, 1, 3, 1, 1, 0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 3, 1, 1, 4, 1, 2, 2],  # Luton
    [2, 1, 2, 4, 3, 4, 3, 5, 2, 3, 2, 1, 3, 1, 1, 2, 2, 1, 4, 1, 2, 3, 2, 3, 1, 2, 3, 0],  # Man City
    [2, 2, 0, 2, 1, 0, 2, 2, 1, 3, 1, 0, 2, 2, 0, 2, 1, 0, 1, 4, 3, 3, 3, 1, 0, 1, 0, 1],  # Man United
    [2, 3, 1, 2, 1, 0, 5, 2, 4, 2, 1, 3, 2, 4, 0, 4, 0, 1, 0, 8, 4, 1, 5, 1, 2, 3, 4, 2],  # Newcastle
    [0, 1, 2, 0, 2, 1, 2, 2, 0, 1, 2, 2, 1, 2, 2, 0, 2, 0, 2, 2, 2, 2, 3, 0, 0, 1, 1, 2],  # Nott'm Forest
    [2, 1, 3, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1, 2, 0, 0, 1, 2, 1, 1, 0, 2, 3, 0, 0],  # Sheffield United
    [1, 2, 4, 4, 2, 1, 0, 2, 2, 0, 2, 0, 2, 1, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 3, 3],  # Tottenham
    [2, 2, 2, 0, 0, 3, 1, 3, 0, 1, 0, 1, 3, 1, 3, 2, 2, 1, 2, 1, 2, 4, 3, 2, 2, 0, 3, 2],  # West Ham
    [1, 2, 1, 0, 1, 2, 2, 0, 1, 2, 0, 0, 1, 0, 2, 0, 1, 0, 2, 1, 1, 2, 2, 4, 2, 0, 2, 3],  # Wolves
])

holdout_means = goals_rest28.mean(axis=1)
print(f"Held-out sample means (28 games): {holdout_means.round(2)}")

In [None]:
...

---
---

## 3. Bayes Estimators Under Different Loss Functions

Let $\theta$ be a parameter with posterior distribution $\pi(\theta \mid X = x)$ given observed data $x$. The **Bayes estimator** under loss $L$ is

$$T^*(x) = \arg\min_{t} \; E\left[L(\theta, t) \mid X = x\right].$$

You know from class that under squared error loss $L(\theta, t)=(\theta - t)^2$, the Bayes estimator is the posterior mean.
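
The squared error result can be illustrated by Monte Carlo. The sketch below assumes a hypothetical set of posterior draws (a Gamma distribution picked arbitrarily), estimates the expected loss on a grid of candidate values $t$, and compares the grid minimizer with the mean of the draws.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical posterior draws (illustration only): Gamma with shape 3, scale 0.5
theta_draws = rng.gamma(3.0, 0.5, size=20_000)

# Monte Carlo estimate of E[(theta - t)^2 | X = x] on a grid of candidate t
t_grid = np.linspace(0.0, 4.0, 401)
expected_loss = np.array([np.mean((theta_draws - t) ** 2) for t in t_grid])
t_best = t_grid[np.argmin(expected_loss)]
print(f"grid minimizer = {t_best:.2f}, mean of draws = {theta_draws.mean():.3f}")
```

Replacing the squared difference with another loss in the `expected_loss` line gives a quick numerical check for the parts that follow.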

---
### (a) Absolute Error Loss

Suppose $\theta$ is continuous with posterior density $\pi(\theta \mid X = x)$. Show that under absolute error loss $L(\theta, t) = |\theta - t|$, the Bayes estimator is the **posterior median**.

*Hint: Split $E[|\theta - t| \mid X = x]$ into two integrals at $t$, differentiate with respect to $t$, and set equal to zero.*

---
### (b) 0-1 Loss

Now suppose $\theta$ takes values in a discrete (finite or countable) set $\Theta$, with posterior PMF $\pi(\theta \mid X = x)$. Under **0-1 loss**

$$L(\theta, t) = \begin{cases} 0 & \theta = t \\ 1 & \theta \ne t \end{cases}$$

show that the Bayes estimator is the **posterior mode** (the value of $\theta$ with the highest posterior probability).

---
---

## 4. Is the MLE Admissible?

Let $X_1, X_2, \ldots, X_n$ be i.i.d. exponential with rate $\lambda$. In Worksheet 3 you found $\hat{\lambda}_n$, the MLE of $\lambda$, and showed that it was biased.
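
The bias can be seen in a quick simulation. This sketch assumes the MLE takes the familiar form $1/\bar{X}_n$; the rate, sample size, number of replications, and seed are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 2.0, 5, 200_000   # hypothetical rate, sample size, number of replications

# NumPy's exponential takes scale = 1 / rate
samples = rng.exponential(1 / lam, size=(reps, n))
mle = 1 / samples.mean(axis=1)   # assumed form of the MLE of the rate

print(f"true rate = {lam}, average of the MLE over replications = {mle.mean():.3f}")
```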

**(a)** Create an unbiased estimator of $\lambda$ by multiplying $\hat{\lambda}_n$ by an appropriate factor.

**(b)** Is $\hat{\lambda}_n$ admissible? Explain.

**(c)** List some pros and cons of using $\hat{\lambda}_n$ and the unbiased estimator you derived in Part **(a)**.

---
---

## 5. Mean Squared Error in a Bayesian Setting

Let a random variable $X$ be the data you will observe, and let the parameter $\theta$ be a random variable that has a joint distribution with $X$. We get to observe $X$ but not $\theta$. In this exercise you will examine estimators of $\theta$. Some definitions:

- An *estimator* is a function of $X$ alone. That is, an estimator is a random variable $T = g(X)$ for some function $g$.
- An estimator $T$ is *unbiased* if $E(T \mid \theta) = \theta$. Remember that $\theta$ is a random variable.
- The *mean squared error* of an estimator $T$ is $MSE(T) = E\left( (T - \theta)^2\right)$.
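
To make the definitions concrete, here is a Monte Carlo sketch of $MSE(T)$ in which the expectation is taken over the joint distribution of $X$ and $\theta$. The normal model, the two estimators, and the helper name `mc_mse` are hypothetical choices for illustration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 200_000

# Hypothetical joint model (illustration only): theta ~ N(0, 1), X | theta ~ N(theta, 1)
theta = rng.normal(0.0, 1.0, size=reps)
X = theta + rng.normal(0.0, 1.0, size=reps)

def mc_mse(T, theta):
    """Monte Carlo estimate of E[(T - theta)^2] over the joint distribution."""
    return np.mean((T - theta) ** 2)

mse_identity = mc_mse(X, theta)        # T = X, which satisfies E(T | theta) = theta here
mse_shrunk = mc_mse(0.5 * X, theta)    # T = X / 2, an arbitrary shrunken alternative
print(f"MSE(X) = {mse_identity:.3f}, MSE(X/2) = {mse_shrunk:.3f}")
```

Both estimators are functions of $X$ alone, as the definition requires; $\theta$ appears only when scoring them.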

**(a)** Let $T$ be an unbiased estimator. By conditioning on $\theta$, find $MSE(T)$ in terms of moments of $T$ and moments of $\theta$.

**(b)** Now let $T$ be the least squares estimator of $\theta$ based on $X$. By conditioning on $X$, find $MSE(T)$ in terms of moments of $T$ and moments of $\theta$.

**(c)** Make the realistic assumption that it is not possible to know $\theta$ exactly by observing $X$. Under this assumption, can the least squares estimator be unbiased?