# Hierarchical modelling

Hierarchical structures are commonly found in both natural data and statistical models. These hierarchies can represent various levels of organization or grouping within the data, and incorporating them into Bayesian inference can provide more accurate and insightful results. Such an approach to modelling allows us to account for different sources of variation in the data.

This hierarchical structure usually arises in multiparameter models whose parameters are related or connected in some way, in models of complex phenomena where the data have several levels of hierarchy, and in models with covariance structure, where the covariance is governed by further parameters. Bayesian hierarchical modeling is an excellent example of propagating uncertainty among different quantities in complex models with conditional dependencies. Furthermore, it allows us to infer certain unknown quantities from the data rather than fixing them at guessed values, in contrast to classical statistics.

The Bayesian framework performs hierarchical modeling by defining conditional dependencies among quantities, which allows us to specify powerful models with complex structures.

The basic idea of hierarchical modeling, also known as *multilevel modeling*, is to organize the model as a set of conditional probability statements that encode the dependencies among quantities and the model assumptions. The joint probability should reflect these dependencies. Generally, a hierarchical structure can be written whenever the joint distribution of the parameters can be decomposed into a series of conditional distributions. The term *hierarchical models* refers to a general set of modeling principles rather than to a specific family of models.

Priors of model parameters are the first level of hierarchy. These priors can depend on other parameters (prior parameters), and new priors (hyperpriors) are defined over these prior parameters. In this case, the hyperpriors are the second level of hierarchy. Following this scheme, the structure can be extended to more levels. In principle, there is no limit on the number of levels.

Let's illustrate with an equation the posterior distribution in a conditional structure with two levels of hierarchy. The prior distribution of the model parameter $\theta$, $p(\theta|a)$, depends on the prior parameter $a$, whose hyperprior distribution $p(a|b)$ depends on a fixed value $b$ (the hyperparameter). Because $a$ is itself unknown, the posterior is a joint distribution over $\theta$ and $a$; the marginal posterior of $\theta$ follows by integrating over $a$. Notice that the posterior is written as proportional to the product of the likelihood and the priors, which avoids computing the normalizing constant (the marginal likelihood).

$$
p(\theta, a|y) \propto p(y|\theta)p(\theta|a)p(a|b).
$$

```{tikz}
% Nodes
\node (b) {$b$};
\node (a) [right of=b] {$a$};
\node (theta) [right of=a] {$\theta$};
\node (y) [right of=theta] {$y$};


% Arrows
\draw[->] (b) -- (a);
\draw[->] (a) -- (theta);
\draw[->] (theta) -- (y);
```
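
The right-hand side of the factorization above can be evaluated term by term. The following is a minimal sketch in plain Python, assuming (purely for illustration) that every level is Gaussian with unit standard deviation; the text does not specify these distributions, and the helper names `normal_logpdf` and `log_unnormalized_posterior` are hypothetical.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2


def log_unnormalized_posterior(theta, a, y, b):
    """log p(y | theta) + log p(theta | a) + log p(a | b), up to a constant.

    Assumption: all three levels are Gaussian with standard deviation 1.
    """
    log_lik = sum(normal_logpdf(y_n, theta, 1.0) for y_n in y)  # p(y | theta)
    log_prior = normal_logpdf(theta, a, 1.0)                    # p(theta | a)
    log_hyperprior = normal_logpdf(a, b, 1.0)                   # p(a | b)
    return log_lik + log_prior + log_hyperprior


# Evaluate the unnormalized log posterior at one point (theta, a).
y = [10.0, 12.0, 9.0, 11.0, 8.0]
print(log_unnormalized_posterior(theta=10.0, a=9.0, y=y, b=0.0))
```

Working on the log scale turns the product of densities into a sum, which is also what samplers such as NUTS operate on internally.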

The simplest and most common hierarchical structure arises when model parameters come from a common population distribution. For example, a model in which the mean parameters $\mu_i$ of a Normal observation model share a common population distribution $\mathcal{N}(0, \sigma^2_\mu)$ can be expressed as follows:

$$
\begin{aligned}
\text{likelihood: } \quad &p(y_i | \mu_i, \sigma) &&= \mathcal{N}(y_i | \mu_i, \sigma^2), \\
\text{prior: } \quad &p(\mu_i | \sigma_\mu) &&= \mathcal{N}(\mu_i | 0, \sigma^2_\mu), \\
\text{prior: } \quad &p(\sigma | d) &&= D_1(\sigma | d), \\
\text{hyperprior: } \quad &p(\sigma_\mu | b) &&= D_1(\sigma_\mu | b).
\end{aligned}
$$
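
Reading the model from the bottom up gives a recipe for simulating data: draw the scales, then the means, then the observations. The sketch below does this with NumPy, assuming that $D_1$ is an Exponential distribution whose rate is the given hyperparameter. The text leaves $D_1$ unspecified, so this choice, the default values of `d` and `b`, and the function name `simulate_hierarchical_normal` are illustrative assumptions.

```python
import numpy as np


def simulate_hierarchical_normal(n_groups, d=1.0, b=1.0, seed=0):
    """Ancestral sampling: hyperprior -> priors -> likelihood."""
    rng = np.random.default_rng(seed)
    # Assumption: D_1 is Exponential with rate b (resp. d).
    # NumPy parameterizes the Exponential by scale = 1 / rate.
    sigma_mu = rng.exponential(scale=1.0 / b)  # sigma_mu ~ D_1(. | b)
    sigma = rng.exponential(scale=1.0 / d)     # sigma ~ D_1(. | d)
    # mu_i ~ N(0, sigma_mu^2), one mean per group
    mu = rng.normal(loc=0.0, scale=sigma_mu, size=n_groups)
    # y_i ~ N(mu_i, sigma^2), one observation per mean
    y = rng.normal(loc=mu, scale=sigma)
    return {"sigma_mu": sigma_mu, "sigma": sigma, "mu": mu, "y": y}


draw = simulate_hierarchical_normal(n_groups=5)
print(draw["y"])
```

All `mu` values depend on the same draw of `sigma_mu`, which is what ties the groups together in the hierarchy.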


Hierarchical models can be considered a large class of stochastic formulations that includes many popular models, such as random-effects, multilevel, generalized linear mixed, spatial and temporal models, and Gaussian processes. Models with random effects follow a hierarchical structure, since different parameters share a common distribution with a common variance; the generalized random-effects model is one example. Hierarchical models are also widely used in meta-analyses.

Spatial, temporal and spatio-temporal models also follow a hierarchical structure, since common random effects are shared among neighboring variables. Examples include the first-order random walk model, kriging, and Conditional Auto-Regressive models for areal data (we will see those later in the course). Models with Gaussian process priors for functions follow a hierarchical structure as well, since the function values depend on the parameters of the covariance function.


## Levels of pooling

There are typically three ways to account for hierarchies in Bayesian inference: no pooling, complete pooling, and partial pooling. Let's explore each of these approaches and provide NumPyro code examples for each case.

## No Pooling

In the "no pooling" approach, each data point is treated independently without any grouping or hierarchical structure. This approach assumes that there is no shared information between data points, which can be overly simplistic when there is underlying structure or dependencies in the data.

```python
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

from jax import random
import jax.numpy as jnp

rng_key = random.PRNGKey(678)
```

```python
# Data
data = jnp.array([10, 12, 9, 11, 8]) # remember to turn data into a jnp array

# Model: every observation gets its own mean and its own scale
def no_pooling_model(data):
    for i, y_i in enumerate(data):
        mu_i = numpyro.sample(f"mu_{i}", dist.Normal(0, 10))
        sigma_i = numpyro.sample(f"sigma_{i}", dist.Exponential(1))
        numpyro.sample(f"obs_{i}", dist.Normal(mu_i, sigma_i), obs=y_i)

# Inference
nuts_kernel = NUTS(no_pooling_model)
mcmc = MCMC(nuts_kernel, num_samples=1000, num_warmup=500, progress_bar=False)
mcmc.run(rng_key, data)

# Note how many mu-s and sigma-s are estimated
mcmc.print_summary()
```

```text
sample: 100%|██████████| 1500/1500 [00:03<00:00, 400.58it/s, 31 steps of size 1.02e-01. acc. prob=0.74]

                mean       std    median      5.0%     95.0%     n_eff     r_hat
      mu_0      9.98      1.06     10.00      8.17     11.35    338.50      1.00
      mu_1     11.71      1.77     11.97      8.72     14.16    304.16      1.00
      mu_2      8.85      1.35      8.92      6.53     10.57    419.85      1.00
      mu_3     10.81      1.37     10.92      8.91     12.81    212.94      1.00
      mu_4      7.87      1.23      7.96      6.34      9.82    115.95      1.00
   sigma_0      0.94      0.82      0.67      0.08      2.02    259.38      1.00
   sigma_1      1.15      1.11      0.81      0.06      2.53    117.35      1.00
   sigma_2      1.10      0.99      0.85      0.07      2.29    242.58      1.01
   sigma_3      1.06      0.89      0.84      0.08      2.20    102.34      1.00
   sigma_4      1.01      0.91      0.80      0.05      2.12    147.39      1.00

Number of divergences: 120
```


## Complete Pooling

In the "complete pooling" approach, all data points are treated as if they belong to a single group or population, and the model estimates a single set of parameters for the entire dataset. This approach assumes that all data points share the same parameters, that is, that there is no variation between groups, which can be overly restrictive when there is actual heterogeneity in the data.

```python
# Model: a single mean and a single scale shared by all observations
def complete_pooling_model(data):
    mu = numpyro.sample("mu", dist.Normal(0, 10))
    sigma = numpyro.sample("sigma", dist.Exponential(1))
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=data)

# Inference
nuts_kernel = NUTS(complete_pooling_model)
mcmc = MCMC(nuts_kernel, num_samples=1000, num_warmup=500, progress_bar=False)
mcmc.run(rng_key, data)

# Note how many mu-s and sigma-s are estimated
mcmc.print_summary()
```

```text
sample: 100%|██████████| 1500/1500 [00:01<00:00, 786.76it/s, 3 steps of size 6.05e-01. acc. prob=0.93]

                mean       std    median      5.0%     95.0%     n_eff     r_hat
        mu      9.92      0.77      9.94      8.64     11.13    460.42      1.00
     sigma      1.65      0.53      1.55      0.93      2.37    388.83      1.00

Number of divergences: 0
```
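
As a quick sanity check, the complete pooling estimates can be compared with the plain sample statistics of the five observations. This sketch uses only the standard library; it is a rough comparison rather than part of the model, since the posterior also reflects the priors.

```python
import statistics

data = [10, 12, 9, 11, 8]

sample_mean = statistics.mean(data)  # 10
sample_sd = statistics.stdev(data)   # sqrt(10 / 4), about 1.58

print(sample_mean, round(sample_sd, 2))  # prints: 10 1.58
```

The posterior means reported in the summary (`mu` about 9.92, `sigma` about 1.65) are close to these values.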


## Partial Pooling

In the "partial pooling" approach, the data is grouped into distinct categories or levels, and each group has its own set of parameters. However, these parameters are constrained by a shared distribution, allowing for both individual variation within groups and shared information across groups.

In the partial pooling example, the `group_ids` variable indicates the group to which each data point belongs. This allows for the estimation of group-specific parameters while sharing information across groups through the shared distributions of `group_mu` and `group_sigma`.

These three approaches represent different ways to account for hierarchies in Bayesian inference, each with its own assumptions and implications for modeling real-world data. Depending on the specific context and data structure, one of these approaches may be more appropriate than the others.