# Bayesian Analysis with (Py)STAN: The Eight Schools Problem

The "eight schools problem" is a very famous example of a TODO finish. It was originally introduced in a 1981 paper by TODO ref, and it is included as an example in chapter 5 of the fantastic third edition of Bayesian Data Analysis (TODO reference), which I will refer to as "BDA3".  The model that is commonly used to analyze this problem (and which is described below) is very general and powerful, and an understanding of this model and the XXXXXX.  There are several example STAN implementations of the Eight Schools Problem available online (TODO ref), but none of them seem to provide much context about the problem or the hierarchical model commonly used to analyze it.  The **purpose of this document** is to both describe the hierarchical model (somewhat) rigorously and provide a PyStan implementation thereof, all in one place.

The document is structured as follows:

- Introduction of the problem and associated data
- Definition of the hierarchical model used to do analyses
- Incomplete analytical derivation of the full model posterior
- PySTAN implementation of the model



## Data

#TODO


## Hierarchical Model

We have that $y_{ij}|\theta_j, \sigma^2 \sim_{iid} N(\theta_j, \sigma^2)$.  This implies:

$$
\bar{y}_{.j} | \theta_j \sim N(\theta_j, \sigma_j^2)
$$

Where:

$$
\bar{y}_{.j} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}
$$

$$
\sigma_j^2 = \frac{\sigma^2}{n_j}
$$


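The "dot" notation used to denote the within-group sample mean is classically used in analysis of variance.  What this model is saying in words is that the observations within group $j$ are scattered around that group's true mean $\theta_j$ with a common variance $\sigma^2$, so each group's sample mean is an unbiased estimate of $\theta_j$ whose variance shrinks as the group size $n_j$ grows.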

Here comes the hierarchical part - we assume that the $\theta_j$ values are themselves sampled independently from a Normal distribution with shared hyperparameters $\mu$ and $\tau$:

$$
\theta_j \sim_{iid} N(\mu, \tau^2)
$$
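To make the two-level structure concrete, the sketch below simulates group means from the model using only the Python standard library. The values of `mu`, `tau`, and `sigmas` are hypothetical, chosen purely for illustration. Averaging over the unobserved $\theta_j$, the simulated variance of each $\bar{y}_{.j}$ should come out close to $\sigma_j^2 + \tau^2$, a quantity that reappears in the posterior derivation.

```python
import random
import statistics

def simulate_group_means(mu, tau, sigmas, rng):
    """Draw theta_j ~ N(mu, tau^2), then ybar_j ~ N(theta_j, sigma_j^2)."""
    thetas = [rng.gauss(mu, tau) for _ in sigmas]
    ybars = [rng.gauss(theta, sigma) for theta, sigma in zip(thetas, sigmas)]
    return thetas, ybars

rng = random.Random(0)       # fixed seed so the run is reproducible
mu, tau = 5.0, 3.0           # hypothetical hyperparameters
sigmas = [4.0, 2.0]          # hypothetical standard errors sigma_j

# Simulate many data sets and keep only the observable group means.
simulated = [simulate_group_means(mu, tau, sigmas, rng)[1] for _ in range(20000)]

sim_vars = [
    statistics.pvariance([ybars[j] for ybars in simulated])
    for j in range(len(sigmas))
]
for sigma, var in zip(sigmas, sim_vars):
    print(f"simulated var = {var:.2f}, sigma_j^2 + tau^2 = {sigma**2 + tau**2:.2f}")
```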



## Posterior Distribution over Model Parameters

Using a Bayesian approach with the model specified above, the full posterior distribution over all parameters can be factorized using the probability chain rule:

$$
p(\theta, \mu, \tau | y) = p(\theta|\mu, \tau, y) \, p(\mu|\tau, y) \, p(\tau|y)
$$


Specifying $p(\mu, \tau) = p(\mu | \tau)p(\tau)$ - the prior distribution over our hyperparameters - will give us enough information to analytically solve for the posterior.  We will do this below as we derive the second and third terms of the factorization.  We don't need any information from the prior in order to derive the first term $p(\theta|\mu, \tau, y)$, since we are conditioning on both $\mu$ and $\tau$.  Our model assumptions give us a prior for each $\theta_j$, and a sampling distribution of $y_{ij}$ given $\theta_j$ (a likelihood).  Using these to solve for the posterior on $\theta$ yields:

$$
\theta_j|\mu, \tau, y \sim N(\hat{\theta}_j, V_j)
$$

Where:

$$
\hat{\theta}_j = \frac{\frac{1}{\sigma_j^2} \bar{y}_{.j} + \frac{1}{\tau^2} \mu  }{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}
$$

$$
V_j = \frac{1}{ \frac{1}{\sigma_j^2} + \frac{1}{\tau^2}  }
$$
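The two formulas above describe a precision-weighted average of the group mean and the population mean. A minimal sketch of that calculation follows; the function name `theta_conditional` is made up for this example, the inputs 28 and 15 are the estimate and standard error usually quoted for the first school in BDA3, and `mu = 8`, `tau = 5` are hypothetical values chosen for illustration.

```python
def theta_conditional(ybar_j, sigma_j, mu, tau):
    """Return (theta_hat_j, V_j) for theta_j | mu, tau, y, assuming tau > 0."""
    prec_data = 1.0 / sigma_j**2     # precision of the group mean
    prec_prior = 1.0 / tau**2        # precision of the population distribution
    v_j = 1.0 / (prec_data + prec_prior)
    theta_hat_j = v_j * (prec_data * ybar_j + prec_prior * mu)
    return theta_hat_j, v_j

theta_hat, v = theta_conditional(ybar_j=28.0, sigma_j=15.0, mu=8.0, tau=5.0)
print(f"theta_hat = {theta_hat:.2f}, V = {v:.2f}")  # theta_hat = 10.00, V = 22.50
```

With these inputs the estimate is pulled most of the way from 28 toward 8, because the prior precision $1/25$ is nine times the data precision $1/225$.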

Now, we **assume a uniform conditional prior** on $\mu|\tau$, or in other words assume that $p(\mu | \tau) = c$.  This allows us to factor the joint prior over these hyperparameters as $p(\mu, \tau) = p(\mu|\tau)p(\tau) \propto p(\tau)$.  The assignment of a "noninformative" prior over $\mu|\tau$ is a **modeling decision** that is motivated by the fact that the data we observe will provide us with a lot of information about $\mu$ (pg. 115).  This decision implies the following conditional distribution for $\mu|\tau, y$, the second term in the factorization of the posterior:

$$
\mu|\tau, y \sim N(\hat{\mu}, V_\mu)
$$

Where:

$$
\hat{\mu} = \frac{  \sum_{j=1}^J \frac{1}{\sigma^2_j + \tau^2} \bar{y}_{.j}    }{  \sum_{j=1}^J \frac{1}{\sigma^2_j + \tau^2}  }
$$

$$
V_\mu^{-1} = \sum_{j=1}^J \frac{1}{\sigma_j^2 + \tau^2}
$$
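As a rough illustration, $\hat{\mu}$ and $V_\mu$ can be computed for any fixed $\tau$ with a few lines of Python. The data values below are the eight estimates and standard errors usually quoted from BDA3; they are an assumption here, since this document does not list the data yet, and `mu_conditional` is a name invented for this sketch.

```python
# Eight schools estimates and standard errors as usually quoted from BDA3 (assumed here).
Y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def mu_conditional(tau, ybars=Y, sigmas=SIGMA):
    """Return (mu_hat, V_mu) for mu | tau, y under a flat prior on mu."""
    weights = [1.0 / (s**2 + tau**2) for s in sigmas]   # 1 / (sigma_j^2 + tau^2)
    v_mu = 1.0 / sum(weights)
    mu_hat = v_mu * sum(w * y for w, y in zip(weights, ybars))
    return mu_hat, v_mu

for tau in (0.0, 10.0):
    mu_hat, v_mu = mu_conditional(tau)
    print(f"tau = {tau:4.1f}: mu_hat = {mu_hat:.2f}, V_mu = {v_mu:.2f}")
```

At $\tau = 0$ this gives approximately $\hat{\mu} = 7.69$ and $V_\mu = 16.58$. Larger values of $\tau$ flatten the weights, so $\hat{\mu}$ moves toward the unweighted mean of the group means.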


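Finally, we determine the third term of the factorization, the marginal posterior of $\tau|y$.  We can express this as the ratio below, which holds for any value of $\mu$; evaluating it at $\mu = \hat{\mu}$ simplifies the denominator: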

$$
p(\tau|y) = \frac{p(\mu, \tau|y)}{p(\mu|\tau, y)}
$$



$$
\propto \frac{p(\tau) \prod_{j=1}^J N(\bar{y}_{.j}|\hat{\mu}, \sigma_j^2 + \tau^2) }{N(\hat{\mu}|\hat{\mu}, V_\mu)}
$$

$$
\propto p(\tau) V_\mu^{1/2} \prod_{j=1}^J (\sigma_j^2 + \tau^2)^{-1/2} \exp\left(  - \frac{(\bar{y}_{.j} - \hat{\mu})^2} {2(\sigma_j^2 + \tau^2)} \right)
$$

We **assume a uniform prior** on $\tau$: $p(\tau) \propto c$ for $\tau > 0$.  This implies:

$$
p(\tau|y) \propto V_\mu^{1/2} \prod_{j=1}^J (\sigma_j^2 + \tau^2)^{-1/2} \exp\left(  - \frac{(\bar{y}_{.j} - \hat{\mu})^2} {2(\sigma_j^2 + \tau^2)} \right)
$$
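The expression for $p(\tau|y)$ is easiest to evaluate on the log scale, which avoids underflow in the product. The sketch below does this for a handful of $\tau$ values. The data are the eight schools values usually quoted from BDA3 (an assumption, since the data section of this document is not filled in), and `log_p_tau` is a name chosen for this example.

```python
import math

# Eight schools estimates and standard errors as usually quoted from BDA3 (assumed here).
Y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def log_p_tau(tau, ybars=Y, sigmas=SIGMA):
    """Unnormalized log p(tau | y) under a uniform prior on tau."""
    weights = [1.0 / (s**2 + tau**2) for s in sigmas]   # 1 / (sigma_j^2 + tau^2)
    v_mu = 1.0 / sum(weights)
    mu_hat = v_mu * sum(w * y for w, y in zip(weights, ybars))
    log_p = 0.5 * math.log(v_mu)                        # V_mu^{1/2}
    for w, y in zip(weights, ybars):
        # (sigma_j^2 + tau^2)^{-1/2} times the Gaussian kernel, on the log scale
        log_p += 0.5 * math.log(w) - 0.5 * w * (y - mu_hat) ** 2
    return log_p

for tau in (0.5, 5.0, 15.0, 30.0):
    print(f"tau = {tau:5.1f}   log p(tau|y) = {log_p_tau(tau):8.3f} + const")
```

Only differences between these values are meaningful, because the normalizing constant is unknown.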

We now have an analytical expression (up to constant scaling) of our posterior $p(\theta, \mu, \tau | y)$.  We see that the posterior is quite a complicated function, and instead of attempting to integrate it (e.g. to normalize it) we will prefer to analyze it numerically.  This can be done in a straightforward way - we first sample $\tau$, which can be done by computing $p(\tau|y)$ for a uniformly spaced grid of $\tau$ values and using the inverse CDF method (see [here](https://stephens999.github.io/fiveMinuteStats/inverse_transform_sampling.html)) on this discretized distribution of $\tau$.  We then sample $\mu$ and then $\theta$ from their associated Normal distributions.
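The three sampling steps just described can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a reference implementation: the data are the eight schools values usually quoted from BDA3, the grid is truncated at $\tau = 30$ (an arbitrary upper limit chosen here), and all function names are invented for the example.

```python
import bisect
import itertools
import math
import random
import statistics

# Eight schools estimates and standard errors as usually quoted from BDA3 (assumed here).
Y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def mu_conditional(tau):
    """Return (mu_hat, V_mu) for mu | tau, y."""
    weights = [1.0 / (s**2 + tau**2) for s in SIGMA]
    v_mu = 1.0 / sum(weights)
    return v_mu * sum(w * y for w, y in zip(weights, Y)), v_mu

def log_p_tau(tau):
    """Unnormalized log p(tau | y) under a uniform prior on tau."""
    mu_hat, v_mu = mu_conditional(tau)
    log_p = 0.5 * math.log(v_mu)
    for y, s in zip(Y, SIGMA):
        total_var = s**2 + tau**2
        log_p -= 0.5 * math.log(total_var) + (y - mu_hat) ** 2 / (2.0 * total_var)
    return log_p

def sample_posterior(n_draws, tau_grid, rng):
    """Draw (tau, mu, thetas): tau on a grid, then mu | tau, then theta_j | mu, tau."""
    log_p = [log_p_tau(t) for t in tau_grid]
    peak = max(log_p)  # subtract the peak before exponentiating to avoid underflow
    cdf = list(itertools.accumulate(math.exp(lp - peak) for lp in log_p))
    draws = []
    for _ in range(n_draws):
        # Inverse CDF step: first grid cell whose cumulative mass reaches u.
        u = rng.random() * cdf[-1]
        idx = min(bisect.bisect_left(cdf, u), len(tau_grid) - 1)
        tau = tau_grid[idx]
        mu_hat, v_mu = mu_conditional(tau)
        mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        thetas = []
        for y, s in zip(Y, SIGMA):
            v_j = 1.0 / (1.0 / s**2 + 1.0 / tau**2)
            theta_hat = v_j * (y / s**2 + mu / tau**2)
            thetas.append(rng.gauss(theta_hat, math.sqrt(v_j)))
        draws.append((tau, mu, thetas))
    return draws

# Midpoints of 600 equal-width cells on (0, 30); the upper limit is an assumed truncation.
tau_grid = [(i + 0.5) * 0.05 for i in range(600)]
draws = sample_posterior(2000, tau_grid, random.Random(1))

post_mean_mu = statistics.mean(mu for _, mu, _ in draws)
post_mean_theta = [statistics.mean(th[j] for _, _, th in draws) for j in range(len(Y))]
print(f"posterior mean of mu: {post_mean_mu:.1f}")
print("posterior means of theta_j:", [round(m, 1) for m in post_mean_theta])
```

Using cell midpoints keeps $\tau$ strictly positive, so the $1/\tau^2$ terms in the conditional for $\theta_j$ are always defined.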

--------------------------------------------

To summarize, we:

- Expressed the posterior distribution for the parameters of our hierarchical model as a factorization of three conditional posteriors
- Assumed noninformative uniform priors for $p(\mu|\tau)$ and $p(\tau)$
- Used the assumed priors to analytically express the posterior factors


The full details of the posterior derivation are omitted from this document.  The reader is encouraged to verify them in BDA3 at ...reference TODO pg TODO.

## Tau and the Pooled vs. Individual Estimates

The procedure described above allows us to do inference on the $\tau$ parameter of the model after specifying a prior distribution over it.  However, if we fix $\tau = 0$ we will find that the estimates for $\theta_j$ given by our Bayesian procedure will all be equal to $\hat{\mu}$, the precision-weighted average of the group means (approximately 7.7 for the eight schools data), which corresponds to the first "simple" strategy for estimating $\theta_j$ proposed above.  ...TODO finish.  If instead we take $\tau \rightarrow \infty$, we find that the posterior distributions over the $\theta_j$ parameters are centered at the individual group means $\bar{y}_{.j}$.  These results make intuitive sense when we consider the role that $\tau$ plays in our model.


In this way we can think of $\tau$ as a sort of knob that we can turn to push our model towards the pooled parameter estimates (first simple strategy), or the individual estimates for $\theta_j$.  This is mentioned as a curiosity - it shows that the hierarchical model we've constructed can be thought of as defining a set of estimation strategies, of which the two simple strategies initially provided are special cases.
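The "knob" behaviour can be checked numerically. Because $\hat{\theta}_j$ is linear in $\mu$, the conditional mean $E[\theta_j | \tau, y]$ is $\hat{\theta}_j$ with $\mu$ replaced by $\hat{\mu}$. The sketch below evaluates it at a very small, a moderate, and a very large $\tau$, using the eight schools values usually quoted from BDA3 (an assumption here); `conditional_means` is a name invented for the example.

```python
# Eight schools estimates and standard errors as usually quoted from BDA3 (assumed here).
Y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def conditional_means(tau):
    """Return mu_hat and the list of E[theta_j | tau, y] for a fixed tau > 0."""
    weights = [1.0 / (s**2 + tau**2) for s in SIGMA]
    mu_hat = sum(w * y for w, y in zip(weights, Y)) / sum(weights)
    theta_hats = [
        (y / s**2 + mu_hat / tau**2) / (1.0 / s**2 + 1.0 / tau**2)
        for y, s in zip(Y, SIGMA)
    ]
    return mu_hat, theta_hats

for tau in (0.01, 5.0, 1000.0):
    mu_hat, theta_hats = conditional_means(tau)
    rounded = [round(t, 1) for t in theta_hats]
    print(f"tau = {tau:7.2f}  mu_hat = {mu_hat:.2f}  theta_hats = {rounded}")
```

For the smallest $\tau$ every entry collapses onto $\hat{\mu}$ (complete pooling), and for the largest $\tau$ each entry is essentially the corresponding group mean (no pooling).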


## Sampling from the Posterior with STAN

In BDA3, Gelman et al. mention the sampling method described above as a way to numerically analyze the posterior distribution of the model (pg. 118).  However, we can alternatively use a software package like STAN for this if we want to avoid the laborious derivations referenced above.  We can specify a data model to STAN by declaring a likelihood and prior, and when we provide data the software will automatically sample from and analyze the posterior distribution.  STAN is used below to do inference on the parameters of the eight schools model.  Notice the declarative syntax that is used to specify the model structure and the API that is used to pass data to the model.


The PyStan GitHub repository can be found [here](https://github.com/stan-dev/pystan).
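As a hedged sketch of what such a specification might look like, the block below writes the hierarchical model as a STAN program and wraps the PyStan calls in a function. It assumes PyStan 3, which is imported as `stan`; the older PyStan 2 interface differs. The data are the eight schools values usually quoted from BDA3 (an assumption, since this document does not list them yet), and `fit_eight_schools` is a helper name invented here.

```python
# Eight schools estimates and standard errors as usually quoted from BDA3 (assumed here).
schools_data = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
}

# mu and tau get no explicit prior statement, so they are uniform over their
# declared support, mirroring the flat priors assumed in the derivation.
schools_code = """
data {
  int<lower=0> J;              // number of schools
  vector[J] y;                 // estimated effect for each school
  vector<lower=0>[J] sigma;    // standard error of each estimate
}
parameters {
  real mu;                     // population mean
  real<lower=0> tau;           // population standard deviation
  vector[J] theta;             // true effect for each school
}
model {
  theta ~ normal(mu, tau);     // hierarchical prior on the school effects
  y ~ normal(theta, sigma);    // sampling distribution of the observed means
}
"""

def fit_eight_schools(num_chains=4, num_samples=1000, seed=1):
    """Compile the program and draw posterior samples (assumes PyStan 3 is installed)."""
    import stan  # imported lazily so the definitions above load without PyStan

    posterior = stan.build(schools_code, data=schools_data, random_seed=seed)
    return posterior.sample(num_chains=num_chains, num_samples=num_samples)
```

Calling `fit = fit_eight_schools()` should compile the model and run the sampler, after which `fit["theta"]` holds the draws for the school effects. This centered form follows the model exactly as written in this document; it may produce divergence warnings when $\tau$ is close to zero, and a non-centered reparameterization is a common remedy.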

```python
# TODO!

print("hi mom")
```

```
hi mom
```


## References

TODO!