In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Introduction

In the previous chapter, you saw how to build what we might call "multiple" estimation models.
In the example we've been working through, we have gone from
estimating $p$ for a single store to estimating $p$ for 1400+ stores.

Something that you might have noticed is that
some of the stores had really wide posterior distribution estimates.
Depending on your beliefs about the world,
this might be considered quite dissatisfying.
We might ask, for example, are there really no pieces of information in the data
that we might leverage to make more informed inferences
about the true like-ability of an ice cream shop?

Well, if you remember in the dataset,
there was another column that we did not use, `owner_idx`.
Let's see if that column might be of any use for us.

In [None]:
from bayes_tutorial.data import load_ice_cream

In [None]:
data = load_ice_cream()
data.head()

In [None]:
import janitor
import numpy as np

naive_p = (
    data
    .join_apply(  # calculate naive_p
        lambda x: 
            x["num_favs"] / x["num_customers"] 
            if x["num_favs"] > 0 
            else np.nan, 
        new_column_name="naive_p"
    )
)

(
    naive_p
    .groupby("owner_idx")
    .agg({"naive_p": ["mean", "count", "std"]})
)

In [None]:
import seaborn as sns

sns.swarmplot(data=naive_p, y="naive_p", x="owner_idx");

From the visualization, it seems to me that each of the owners might have a "characteristic" $p$,
and that each of the owners might also have its own characteristic degree of variability amongst stores.

## Data Generating Process

If we were to re-think our data generating process, we might suggest a slightly modified story.

Previously, we thought of our data generating process as follows:

In [None]:
from bayes_tutorial.solutions.estimation import ice_cream_n_group_pgm, ice_cream_one_group_pgm

ice_cream_n_group_pgm()
# pgm = ice_cream_one_group_pgm()

Here, each shop has its own $p$, and that generates its own "likes".
Each $p_i$ is drawn from its own Beta distribution,
configured with a common $\alpha$ and $\beta$.
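
To make that story concrete, here is a minimal simulation sketch of it using only the standard library.
The function name `simulate_shops` and the values chosen for `alpha`, `beta`, and `n_customers`
are assumptions made for illustration; they are not taken from the dataset or from the previous chapter.

```python
import random


def simulate_shops(n_shops, alpha, beta, n_customers, seed=0):
    """Simulate p_i ~ Beta(alpha, beta), then likes_i ~ Binomial(n_customers, p_i)."""
    rng = random.Random(seed)
    shops = []
    for _ in range(n_shops):
        # Every shop draws its own p from a Beta sharing the same alpha and beta.
        p_i = rng.betavariate(alpha, beta)
        # A Binomial draw is a sum of n_customers independent Bernoulli(p_i) trials.
        likes = sum(rng.random() < p_i for _ in range(n_customers))
        shops.append((p_i, likes))
    return shops


# Hypothetical settings, chosen only to illustrate the process.
shops = simulate_shops(n_shops=5, alpha=2.0, beta=2.0, n_customers=30)
for p_i, likes in shops:
    print(f"p = {p_i:.2f}, likes = {likes} / 30")
```

Notice that nothing in this process knows about owners: every shop is treated as an independent draw.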

What if we tried to capture the idea that each shop draws its $p$ from its owners?
Here's where the notion of hierarchical models comes in!

## Hierarchical Models

In a "hierarchical" version of the ice cream shop model,
we try to express the idea that not only does each shop have its own $p$,
but its $p$ is also conditionally dependent on its owner's $p$.

More generally, with a hierarchical model,
we impose the assumption
that each sample draws its key parameters from a "population" distribution.
Underlying this assumption is the idea
that "things from the same group should be put together".

If we ignored (for a moment) the "fixed" variables,
then the hierarchical model would look something like this:

In [None]:
from daft import PGM, Node

G = PGM()
G.add_node("p_shop", content=r"$p_{j, i}$", x=1, y=2, scale=1.2)
G.add_node("likes", content="$l_{j, i}$", x=1, y=1, scale=1.2, observed=True)
G.add_node("p_owner", content=r"$p_{j}$", x=1, y=3, scale=1.2)
G.add_node("p_pop", content=r"$p$", x=1, y=4, scale=1.2)

G.add_edge("p_pop", "p_owner")
G.add_edge("p_owner", "p_shop")
G.add_edge("p_shop", "likes")

G.add_plate(plate=[0.3, 0.3, 1.5, 2.2], label=r"shop $i$")
G.add_plate(plate=[0, -0.1, 2.1, 3.6], label=r"owner $j$")

G.render();

Here, we are expressing the idea that each shop $i$ draws its $p_{j, i}$ from the $p_{j}$ associated with its owner $j$,
and that each owner's $p_{j}$ is in turn drawn from a population-level $p$ distribution governing all owners.

In theory, this is really cool.
But implementing this is kind of difficult,
if we think more closely about the structure we've used thus far.
With Beta distributions as priors,
we might end up with a very convoluted structure instead:

In [None]:
G = PGM()
G.add_node("likes", content="$l_{j, i}$", x=1, y=1, scale=1.2, observed=True)
G.add_node("p_shop", content="$p_{j, i}$", x=1, y=2, scale=1.2)
G.add_node("alpha_owner", content=r"$\alpha_{j}$", x=0, y=3, scale=1.2)
G.add_node("beta_owner", content=r"$\beta_{j}$", x=2, y=3, scale=1.2)
G.add_node("lambda_a_pop", content=r"$\lambda_{\alpha}$", x=0, y=4, scale=1.2)
G.add_node("lambda_b_pop", content=r"$\lambda_{\beta}$", x=2, y=4, scale=1.2)
G.add_node("tau_lambda_a", content=r"$\tau_{\lambda_{\alpha}}$", x=0, y=5, fixed=True)
G.add_node("tau_lambda_b", content=r"$\tau_{\lambda_{\beta}}$", x=2, y=5, fixed=True)

G.add_edge("alpha_owner", "p_shop")
G.add_edge("beta_owner", "p_shop")
G.add_edge("p_shop", "likes")
G.add_edge("lambda_a_pop", "alpha_owner")
G.add_edge("lambda_b_pop", "beta_owner")
G.add_edge("tau_lambda_a", "lambda_a_pop")
G.add_edge("tau_lambda_b", "lambda_b_pop")

G.add_plate(plate=[0.5, 0.2, 1, 2.3], label=r"shop $i$")
G.add_plate(plate=[-0.5, 0, 3, 3.5], label=r"owner $j$")
G.render();

I'm not sure how you feel looking at that PGM diagram,
but at least to me, it looks convoluted.
I'd find it a hassle to implement.
Also, I wouldn't be able to bake interpretability into the model
by directly mapping key parameter values to quantities of interest.

The key problem here is that of _parameterization_.
By _directly_ modelling $p$ with a Beta distribution,
we are forced to place priors on the $\alpha$ and $\beta$ parameters
of the Beta distribution.
That immediately precludes us
from being able to model the "central tendencies"
of owner-level shop ratings.
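
One way to see the problem: the mean of a Beta distribution is $\frac{\alpha}{\alpha + \beta}$,
so its central tendency is a function of both parameters at once.
The sketch below (the helper names `beta_mean` and `beta_std` are my own, for illustration)
shows three different $(\alpha, \beta)$ pairs that share the same mean but have very different spreads.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta / (total ** 2 * (total + 1))) ** 0.5


# Same central tendency, very different variability.
for alpha, beta in [(1, 1), (5, 5), (50, 50)]:
    print(
        f"alpha = {alpha:>2}, beta = {beta:>2}: "
        f"mean = {beta_mean(alpha, beta):.2f}, std = {beta_std(alpha, beta):.3f}"
    )
```

Neither $\alpha$ nor $\beta$ alone tells us where the distribution is centered,
so a prior placed on either one only says something indirect about the central tendency.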

To get around this, I'm going to introduce you to this idea
of transforming a random variable,
which is immensely helpful in modelling tasks.

## Transformations of random variables

In our application,
being able to model directly the "central tendency" of the $p$,
for each shop and owner, matters a lot.

A Beta distribution parameterization does not allow us
to model $p$ with "central tendencies" directly.

On the other hand, if we were to "transform" the random variable $p$,
which has bounded support between 0 and 1,
into a regime that did not have a bounded support,
we could conveniently use Gaussian distributions,
which have central tendency parameters that we can model
using random variables directly.

### Logit Transform

One such transformation for a random variable that is bounded
is the **logit transform**.
In math form, given a random variable $p$ that is bounded in the $[0, 1]$ interval,
the logit transformation looks like this:

$$f(p) = \log(\frac{p}{1-p})$$

To help you understand a bit of the behaviour of the logit function, here it is plotted:

In [None]:
import matplotlib.pyplot as plt
from scipy.special import logit
import seaborn as sns


p = np.linspace(0, 1, 1000)
logit_p = logit(p)
fig, ax = plt.subplots(figsize=(3, 3))
plt.plot(p, logit_p)
plt.xlabel("p")
plt.ylabel("logit(p)")
sns.despine();

As you can see, the logit transformation function maps values on the interval between 0 and 1
onto the entire real line, $(-\infty, \infty)$.
And since the transformed random variable has infinite support,
we can use a distribution that has infinite support to model it.
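
For a feel of the numbers, here is a small round-trip sketch written with the standard library only.
The helper names `logit` and `inv_logit` are defined here for illustration;
`scipy.special.logit` and `scipy.special.expit` are the vectorized equivalents.

```python
import math


def logit(p):
    """Map a probability in (0, 1) onto the real line."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map a real number back into (0, 1)."""
    return 1 / (1 + math.exp(-x))


for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    x = logit(p)
    print(f"p = {p:.2f} -> logit(p) = {x:+.3f} -> inv_logit = {inv_logit(x):.2f}")
```

Two things are worth noticing: $p = 0.5$ maps to 0, and the transform is symmetric about that point,
so $p = 0.25$ and $p = 0.75$ land at the same distance on either side of 0.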

Remember also that we desired a way to model the central tendencies of our random variables,
and so a highly natural choice here is to use the Gaussian distribution,
which has a central tendency parameter $\mu$.
As such, we can instantiate a random variable for the _logit transformed version of $p$_,
and then use the inverse logit transformation to take it back to bounded $(0, 1)$ space,
which we can then use for our Binomial likelihood function for the data.
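
As a rough sketch of that recipe, the simulation below draws an owner-level value in logit space,
draws each shop's value around it, maps the result back to a probability, and then generates likes.
The function `simulate_owner` and all numeric settings (`owner_sd`, `shop_sd`, the shop and customer counts)
are hypothetical choices for illustration, not values estimated from the data.

```python
import math
import random


def inv_logit(x):
    """Map a real number back into (0, 1)."""
    return 1 / (1 + math.exp(-x))


def simulate_owner(rng, logit_p_overall, owner_sd, shop_sd, n_shops, n_customers):
    """Simulate one owner and its shops, working in logit space throughout."""
    # The owner's central tendency is a Gaussian draw around the population value.
    logit_p_owner = rng.gauss(logit_p_overall, owner_sd)
    shops = []
    for _ in range(n_shops):
        # Each shop is a Gaussian draw around its owner's central tendency.
        logit_p_shop = rng.gauss(logit_p_owner, shop_sd)
        # The inverse logit takes us back to a valid probability.
        p_shop = inv_logit(logit_p_shop)
        # Binomial likelihood, written as a sum of Bernoulli trials.
        likes = sum(rng.random() < p_shop for _ in range(n_customers))
        shops.append((p_shop, likes))
    return inv_logit(logit_p_owner), shops


rng = random.Random(42)
p_owner, shops = simulate_owner(
    rng, logit_p_overall=0.0, owner_sd=1.0, shop_sd=0.5, n_shops=4, n_customers=50
)
print(f"owner p = {p_owner:.2f}")
for p_shop, likes in shops:
    print(f"  shop p = {p_shop:.2f}, likes = {likes} / 50")
```

Because every Gaussian has an explicit location parameter,
the "central tendency" at each level is a quantity we can name and inspect directly.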

In [None]:
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider
from scipy.stats import norm
from scipy.special import expit

mu = FloatSlider(value=0, min=-3, max=3, step=0.1)
sigma = FloatSlider(value=1, min=0.1, max=5, step=0.1)  # scale must be > 0 for norm

@interact(mu=mu, sigma=sigma)
def plot_mu_p(mu, sigma):
    xs = np.linspace(mu - sigma * 4, mu + sigma * 4, 1000)
    ys = norm(loc=mu, scale=sigma).pdf(xs)

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4), sharey=True)
    ax[0].plot(xs, ys)
    ax[0].set_xlabel(r"$\mu$")
    ax[0].set_ylabel("PDF")
    ax[1].plot(expit(xs), ys)
    ax[1].set_xlim(0, 1)
    ax[1].set_xlabel(r"p = invlogit($\mu$)")
    sns.despine()
    plt.show()

## Hierarchical model for "indie" stores

In [None]:
import pymc3 as pm

# Number of distinct owners (not shops): one owner-level parameter per owner.
n_owners = len(data["owner_idx"].unique())
with pm.Model() as model:
    logit_p_overall = pm.Normal(
        "logit_p_overall",
        mu=0,
        sigma=1
    )
    logit_p_owner_mean = pm.Normal(
        "logit_p_owner_mean",
        mu=logit_p_overall,
        sigma=1,
        shape=(n_owners,)
    )
    logit_p_owner_scale = pm.Exponential(
        "logit_p_owner_scale",
        lam=1/5.,
        shape=(n_owners,)
    )
    logit_p_shop = pm.Normal(
        "logit_p_shop",
        mu=logit_p_owner_mean[data["owner_idx"]],
        sigma=logit_p_owner_scale[data["owner_idx"]],
        shape=(len(data),)
    )
    
    p_overall = pm.Deterministic("p_overall", pm.invlogit(logit_p_overall))
    p_shop = pm.Deterministic("p_shop", pm.invlogit(logit_p_shop))
    p_owner = pm.Deterministic("p_owner", pm.invlogit(logit_p_owner_mean))
    like = pm.Binomial("like", n=data["num_customers"], p=p_shop, observed=data["num_favs"])

In [None]:
with model:
    trace = pm.sample(2000)

In [None]:
import arviz as az

with model:
    trace = az.from_pymc3(trace, coords={"p_shop_dim_0": data["shopname"]})

In [None]:
az.plot_posterior(trace, var_names=["p_owner"]);

In [None]:
az.plot_forest(trace, var_names=["p_owner"]);

In [None]:
az.plot_forest(trace, var_names=["logit_p_owner_scale"]);

In [None]:
import janitor

quantiles = (
    trace.posterior
    .stack(draws=("chain", "draw"))
    ["p_owner"]
    .quantile(q=[0.03, 0.5, 0.97], dim=("draws"))
)
quantiles = (
    quantiles
    .to_dataframe()
    .reset_index()
    .pivot_table(columns="quantile", index="p_owner_dim_0", values="p_owner")
    .join_apply(lambda x: x[0.97] - x[0.03], "width")
)
quantiles