This note works an example of partial pooling from equation 12.1 of Gelman and Hill, "Data Analysis Using Regression and Multilevel/Hierarchical Models", Cambridge 2007.

## Introduction

Formula 12.1 shows how to partially pool data by hand, a task that is usually delegated to a hierarchical model solver.

The idea is: we want to estimate the ideal mean of a value from a few observations at a location. The best linear unbiased estimator is simply the average of the observed values, which we hope is close to the ideal unobserved mean.

The wrinkle in partial pooling is: if we have data from other locations, can we use that to improve our estimate?

Let's work this example with a few simplifying assumptions, using the Python sympy package to do the algebra.

In [1]:
# import packages
import sympy
from sympy.stats import E, Normal

Our idea is:

  * Each location has an unobserved mean value of examples drawn from this location, call this `LocationValue_j`.
  * The locations are related, in that the `LocationValue_j`s are all drawn from some common distribution; this is why we think pooling data could be useful.
  * For a given location, some combination of the observations from that location and the observations from other locations may give an estimate with lower expected error than one that uses only observations from that location.
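The generative story in the list above can be sketched as a small simulation. This is a minimal sketch using only the Python standard library; the function name `simulate_locations`, the assumption that every location has the same number of observations, and all numeric values are illustrative assumptions, not part of the original note.

```python
import random
import statistics


def simulate_locations(n_locations, n_per_location, mean_location_value,
                       between_locations_sd, per_observation_sd, seed=0):
    """Draw hypothetical data from the two-level model described above."""
    rng = random.Random(seed)
    locations = []
    for _ in range(n_locations):
        # unobserved: this location's true mean, drawn from the common distribution
        location_value = rng.gauss(mean_location_value, between_locations_sd)
        # observed: noisy observations centered on the location's true mean
        observations = [rng.gauss(location_value, per_observation_sd)
                        for _ in range(n_per_location)]
        locations.append({"value": location_value, "observations": observations})
    return locations


# illustrative parameter values only
locations = simulate_locations(
    n_locations=5, n_per_location=3, mean_location_value=10.0,
    between_locations_sd=2.0, per_observation_sd=4.0)
for loc in locations:
    # true (unobserved) location value next to the observed per-location mean
    print(round(loc["value"], 2), round(statistics.mean(loc["observations"]), 2))
```

With only a few noisy observations per location, the observed per-location means can land far from the true location values, which is the situation partial pooling is meant to help with.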

To execute this idea we need to define a number of variables and their relations, as follows.

In [2]:
# define initial variables
MeanLocationValue = sympy.Symbol("MeanLocationValue")  # center of distribution generating locations, unobserved
# NOTE: this symbol is also named "n_j", so sympy treats n_obs as identical to n_j defined below; the later results rely on this
n_obs = sympy.Symbol("n_j", positive=True)  # total number of observations across all locations, observed
BetweenLocationsSD = sympy.Symbol("BetweenLocationsSD")  # how locations vary from each other in expected behavior, unobserved
ObservedMean = Normal("ObservedMean", mean=MeanLocationValue, std = BetweenLocationsSD / sympy.sqrt(n_obs))  # mean of all observations, observed
LocationValue_j = sympy.Symbol("LocationValue_j")  # actual expected behavior of a given location, unobserved and the goal to estimate
LocationDistFactor_j = Normal("LocationDistFactor_j", mean=0, std=BetweenLocationsSD)  # how locations differ from each other, unobserved
def_LocationValue_j = MeanLocationValue + LocationDistFactor_j  # generative definition of LocationValue_j
PerObservationSD = sympy.Symbol("PerObservationSD", positive=True)  # sd of distribution generating observations, unobserved
n_j = sympy.Symbol("n_j", positive=True)  # number of observations at the j-th location, observed
LocationMean_j = sympy.Symbol("LocationMean_j")  # mean of all observations at location j, observed (a plain Symbol; its distribution is given by def_LocationMean_j)
LocationCenterNoise_ji = Normal("LocationCenterNoise_ji", mean=0, std = PerObservationSD / sympy.sqrt(n_j))  # noise in the mean of the observations at a given location, unobserved
def_LocationMean_j = LocationValue_j + LocationCenterNoise_ji  # generative definition of LocationMean_j
w = sympy.Symbol("w", positive=True)  # our weighting term picking how to pool specific and general observations, to solve for
estimate_j = sympy.Symbol("estimate_j")  # our estimate of the behavior of the j-th location, to solve for
def_estimate_j = w * LocationMean_j + (1-w) * ObservedMean  # definition of our estimate
expected_error_term = LocationValue_j - estimate_j  # error of our estimate, to minimize square of

Formula 12.1 from Gelman and Hill is as follows.

<img src="IMG_1323.png">

This is actually a detailed form of the `w` in the following expression for our estimate.

In [3]:
def_estimate_j

LocationMean_j*w + (1 - w)*ObservedMean

That is, our estimate is `w` times the observed per-location mean (`LocationMean_j`, the obvious estimate) plus `1-w` times the observed mean of all observations from all locations. Setting `w = 1` gives us the traditional "use only observations from the chosen location" solution. The trick is to find a `w` between `0` and `1` that might have lower expected square error.
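The weighted combination above is simple enough to write as a plain function. This is a minimal sketch; the function name `pooled_estimate` and the numeric values are illustrative assumptions.

```python
def pooled_estimate(w, location_mean, observed_mean):
    """Blend the per-location mean with the all-locations mean using weight w."""
    return w * location_mean + (1 - w) * observed_mean


location_mean = 12.0   # hypothetical mean of the observations at one location
observed_mean = 10.0   # hypothetical mean of the observations across all locations

print(pooled_estimate(1.0, location_mean, observed_mean))   # 12.0: no pooling
print(pooled_estimate(0.0, location_mean, observed_mean))   # 10.0: complete pooling
print(pooled_estimate(0.25, location_mean, observed_mean))  # 10.5: partial pooling
```

Any `w` strictly between `0` and `1` pulls the per-location estimate part of the way toward the overall mean.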

Formula 12.1 picks `w` as follows.

In [4]:
neat_soln_approx = 1 / (1 + PerObservationSD**2 / (n_j * BetweenLocationsSD**2))

neat_soln_approx

1/(1 + PerObservationSD**2/(BetweenLocationsSD**2*n_j))
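To connect this expression to the book, here is formula 12.1 written out in LaTeX, followed by the algebra that isolates the weight on the per-location mean. This assumes the formula is transcribed faithfully from the text, and it assumes the following mapping between the book's notation and this note's names: $\sigma_y$ is `PerObservationSD`, $\sigma_\alpha$ is `BetweenLocationsSD`, $\bar{y}_j$ is `LocationMean_j`, and $\bar{y}_{\text{all}}$ is `ObservedMean`.

```latex
\hat{\alpha}_j^{\text{multilevel}} \approx
  \frac{\frac{n_j}{\sigma_y^2}\,\bar{y}_j + \frac{1}{\sigma_\alpha^2}\,\bar{y}_{\text{all}}}
       {\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}}
  = w\,\bar{y}_j + (1 - w)\,\bar{y}_{\text{all}},
\qquad
w = \frac{\frac{n_j}{\sigma_y^2}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}}
  = \frac{1}{1 + \frac{\sigma_y^2}{n_j\,\sigma_\alpha^2}}
```

The last step multiplies the numerator and denominator by $\sigma_y^2 / n_j$.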

This solution for `w` has some nice properties.

  * `w` goes to `1` (the standard simple solution) as `PerObservationSD` goes to zero. This can be read as: "there is no point in pooling if there is already little uncertainty in the obvious estimate."
  * `w` goes to `1` (the standard simple solution) as `n_j` goes to infinity. This can be read as: "there is no point in pooling if we already have a lot of data for the obvious estimate."
  * `w` goes to `0` (combining all the data) as `PerObservationSD` goes to infinity. This can be read as: "combine all the data if the per-location uncertainty is very high."
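The limiting behaviors listed above can be checked numerically. This is a minimal sketch in plain Python; the function name `pooling_weight` and the specific extreme values are illustrative assumptions.

```python
def pooling_weight(per_observation_sd, between_locations_sd, n_j):
    """Weight w on the per-location mean, in the formula 12.1 form."""
    return 1 / (1 + per_observation_sd**2 / (n_j * between_locations_sd**2))


# very little per-observation noise: w is close to 1
w_low_noise = pooling_weight(per_observation_sd=0.01, between_locations_sd=1.0, n_j=5)
# very many observations at the location: w is close to 1
w_many_obs = pooling_weight(per_observation_sd=4.0, between_locations_sd=1.0, n_j=10**6)
# very large per-observation noise: w is close to 0
w_high_noise = pooling_weight(per_observation_sd=1000.0, between_locations_sd=1.0, n_j=5)

print(w_low_noise, w_many_obs, w_high_noise)
```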

The goal is then to derive this solution. First we will derive a similar solution, and then the identical solution.


## The exact solution

We can solve for `w` by performing some substitutions into our error term. The goal is to minimize the square of this error term, which counts negative errors as also being bad.

The error term can be written as `A - B - C` where `A`, `B`, and `C` are as follows.

In [5]:
A = (1-w) * LocationDistFactor_j
B = w * LocationCenterNoise_ji
C = (1 - w) * (ObservedMean - MeanLocationValue)

We can confirm this as follows.

In [6]:
error_term_exact = (
    expected_error_term
        .subs(estimate_j, def_estimate_j)  # definition of estimate_j
        .subs(LocationMean_j, def_LocationMean_j)
        .subs(LocationValue_j, def_LocationValue_j)
).expand().simplify()
assert (error_term_exact - (A - B - C)).expand() == 0
assert (E(C*C) - (1 - w)**2 * BetweenLocationsSD**2 / n_j).expand() == 0
assert E(A).expand() == 0
assert E(B).expand() == 0
assert E(C).expand() == 0

We are assuming the random variables `A`, `B`, and `C` are mutually independent and that each has expected value `0`, so the cross terms vanish in expectation. This means `E[(A - B - C)**2] = E[A**2] + E[B**2] + E[C**2]`. We can then solve for where the derivative of this is zero to get the optimal value for `w`.
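For readers who want the intermediate step, the expansion can be written out as follows. This is a sketch that uses the shorthand $\sigma_b$ for `BetweenLocationsSD` and $\sigma_o$ for `PerObservationSD` (my abbreviations, not the note's), and takes the variances from the definitions and assertions above.

```latex
\begin{aligned}
E[(A - B - C)^2]
  &= E[A^2] + E[B^2] + E[C^2] - 2E[AB] - 2E[AC] + 2E[BC] \\
  &= E[A^2] + E[B^2] + E[C^2]
     && \text{(independence and zero means give } E[AB] = E[A]E[B] = 0 \text{, etc.)} \\
  &= (1 - w)^2 \sigma_b^2 + w^2 \frac{\sigma_o^2}{n_j} + (1 - w)^2 \frac{\sigma_b^2}{n_j}
\end{aligned}
```

The first two terms trade off against each other as `w` varies; the third term is the extra cost of `ObservedMean` not being exactly `MeanLocationValue`.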

In [7]:
soln_exact = sympy.solve(sympy.diff((E(A * A) + E(B * B) + E(C * C)).expand(), w), w)[0]

soln_exact

BetweenLocationsSD**2*(n_j + 1)/(BetweenLocationsSD**2*n_j + BetweenLocationsSD**2 + PerObservationSD**2)

We can tidy up this solution for `w` as follows.

In [8]:
neat_soln_exact = 1 / (1 + PerObservationSD**2 / ((n_j + 1) * BetweenLocationsSD**2))
assert (soln_exact - neat_soln_exact).together().expand() == 0

neat_soln_exact

1/(1 + PerObservationSD**2/(BetweenLocationsSD**2*(n_j + 1)))

Some algebra shows that this differs from the Gelman and Hill solution only in that we have an `(n_j + 1)` where they have an `n_j`.
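To get a feel for how much the `(n_j + 1)` versus `n_j` difference matters, we can evaluate both weights at a few sample sizes. This is a minimal sketch in plain Python; the function names `w_approx` and `w_exact` and the standard deviation values are illustrative assumptions.

```python
def w_approx(per_observation_sd, between_locations_sd, n_j):
    """Weight in the formula 12.1 form (uses n_j)."""
    return 1 / (1 + per_observation_sd**2 / (n_j * between_locations_sd**2))


def w_exact(per_observation_sd, between_locations_sd, n_j):
    """Weight from the derivation above (uses n_j + 1)."""
    return 1 / (1 + per_observation_sd**2 / ((n_j + 1) * between_locations_sd**2))


# hypothetical standard deviations, chosen only for illustration
per_observation_sd = 4.0
between_locations_sd = 1.0
for n in (1, 5, 50, 500):
    approx = w_approx(per_observation_sd, between_locations_sd, n)
    exact = w_exact(per_observation_sd, between_locations_sd, n)
    # columns: n_j, formula 12.1 weight, derived weight, difference
    print(n, round(approx, 4), round(exact, 4), round(exact - approx, 4))
```

The derived weight is always slightly larger than the formula 12.1 weight, and the gap shrinks as `n_j` grows.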

## Reproducing the Gelman and Hill solution

We can match the Gelman and Hill solution by, during the solving, replacing the observed `ObservedMean` with the unobserved `MeanLocationValue` (ignoring the small difference between them).

When we solve in that manner we get the Gelman and Hill `w` as follows.

In [9]:
error_term_approx = (
    expected_error_term
        .subs(estimate_j, def_estimate_j)  # definition of estimate_j
        .subs(ObservedMean, MeanLocationValue)  # this step is an approximation, treating the observed ObservedMean as if it equals the unobserved MeanLocationValue
        .subs(LocationMean_j, def_LocationMean_j)
        .subs(LocationValue_j, def_LocationValue_j)
).expand().simplify()


In [10]:
A = (1-w) * LocationDistFactor_j
B = w * LocationCenterNoise_ji


In [11]:
assert (error_term_approx - (A - B)).simplify() == 0
assert E(A).expand() == 0
assert E(B).expand() == 0

Again, we can expand `E[(A - B)**2]` as `E[A**2] + E[B**2]` (using the independence and mean-zero properties).

In [12]:
soln_approx = sympy.solve(sympy.diff(E(A**2) + E(B**2), w), w)[0]

soln_approx

BetweenLocationsSD**2*n_j/(BetweenLocationsSD**2*n_j + PerObservationSD**2)
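The same result can be checked by hand. This is a sketch of the calculus, again using the shorthand $\sigma_b$ for `BetweenLocationsSD` and $\sigma_o$ for `PerObservationSD` (my abbreviations, not the note's).

```latex
\begin{aligned}
f(w) &= E[A^2] + E[B^2] = (1 - w)^2 \sigma_b^2 + w^2 \frac{\sigma_o^2}{n_j} \\
f'(w) &= -2 (1 - w) \sigma_b^2 + 2 w \frac{\sigma_o^2}{n_j} = 0 \\
w \left( \sigma_b^2 + \frac{\sigma_o^2}{n_j} \right) &= \sigma_b^2 \\
w &= \frac{\sigma_b^2 \, n_j}{\sigma_b^2 \, n_j + \sigma_o^2}
   = \frac{1}{1 + \frac{\sigma_o^2}{n_j \, \sigma_b^2}}
\end{aligned}
```

Since $f''(w) = 2\sigma_b^2 + 2\sigma_o^2 / n_j > 0$, this stationary point is a minimum.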

In [13]:
assert (soln_approx - neat_soln_approx).together().expand() == 0

neat_soln_approx

1/(1 + PerObservationSD**2/(BetweenLocationsSD**2*n_j))

And this, as promised, matches the textbook. I would suggest a slight preference for the exact solution over this one, though the differences are small.

## Conclusion

The partial pooling improvement for estimating an unseen value from noisy observations depends on a single parameter `PerObservationSD**2 / ((n_j + 1) * BetweenLocationsSD**2)`. This parameter compares the uncertainty in the observations from a single location to the uncertainty between locations, scaled by how many observations we have at the location in question. When this ratio is small, we don't pool data; we just estimate the average value using data from one location. When this ratio is large, pooling is likely a useful variance-reducing procedure.
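A quick Monte Carlo run illustrates the variance reduction in a case where the ratio is large. This is a minimal sketch using only the Python standard library, under assumptions that are mine rather than the note's: every location has the same `n_j`, the all-locations mean is computed from many simulated locations (so it sits close to `MeanLocationValue`), and the function name `simulate_squared_errors` and all numeric values are illustrative.

```python
import random
import statistics


def simulate_squared_errors(n_locations=2000, n_j=4, mean_location_value=0.0,
                            between_locations_sd=1.0, per_observation_sd=4.0,
                            seed=2024):
    """Return (unpooled, pooled) mean squared error over simulated locations."""
    rng = random.Random(seed)
    location_values = []  # unobserved true per-location means
    location_means = []   # observed per-location means
    for _ in range(n_locations):
        value = rng.gauss(mean_location_value, between_locations_sd)
        observations = [rng.gauss(value, per_observation_sd) for _ in range(n_j)]
        location_values.append(value)
        location_means.append(statistics.mean(observations))
    # every location has n_j observations, so this equals the mean of all observations
    observed_mean = statistics.mean(location_means)
    # weight from the derivation above, with (n_j + 1)
    w = 1 / (1 + per_observation_sd**2 / ((n_j + 1) * between_locations_sd**2))
    unpooled = statistics.mean(
        [(v - m) ** 2 for v, m in zip(location_values, location_means)])
    pooled = statistics.mean(
        [(v - (w * m + (1 - w) * observed_mean)) ** 2
         for v, m in zip(location_values, location_means)])
    return unpooled, pooled


unpooled_mse, pooled_mse = simulate_squared_errors()
print("unpooled mean squared error:", round(unpooled_mse, 3))
print("pooled mean squared error:  ", round(pooled_mse, 3))
```

With these illustrative settings the ratio is `16 / 5 = 3.2`. Theory suggests an unpooled squared error near `PerObservationSD**2 / n_j = 4` and a pooled squared error of roughly `0.8`, so the simulated pooled error should come out well below the unpooled one.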

In practice the above inference is made inside a hierarchical model solver. However, it is good to see the expected form of the pooling strategy.