In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Introduction

In the previous chapter, you saw how to build what we might call "multiple" estimation models.
In the example we've been working through, we have gone from
estimating $p$ for a single store to estimating $p$ for 1400+ stores.

Something that you might have noticed is that
some of the stores had really wide posterior distribution estimates.
Depending on your beliefs about the world,
this might be considered quite dissatisfying.
We might ask, for example, are there really no pieces of information in the data
that we might leverage to make more informed inferences
about the true like-ability of an ice cream shop?

Well, if you remember in the dataset,
there was another column that we did not use, `owner_idx`.
Let's see if that column might be of any use for us.

In [None]:
from bayes_tutorial.data import load_ice_cream

In [None]:
data = load_ice_cream()
data.head()

In [None]:
import janitor
import numpy as np

naive_p = (
    data
    .join_apply(  # calculate naive_p
        lambda x:
            x["num_favs"] / x["num_customers"]
            if x["num_customers"] > 0  # NaN only when a shop has no customers at all
            else np.nan,
        new_column_name="naive_p"
    )
)

(
    naive_p
    .groupby("owner_idx")
    .agg({"naive_p": ["mean", "count", "std"]})
)

In [None]:
import seaborn as sns

sns.swarmplot(data=naive_p, y="naive_p", x="owner_idx");

With the visualization, it seems to me that each of the owners might have a "characteristic" $p$,
and that each of the owners might also have its own characteristic degree of variability amongst stores.

## Data Generating Process

If we were to re-think our data generating process, we might suggest a slightly modified story.

Previously, we thought of our data generating process as follows:

In [None]:
from bayes_tutorial.solutions.estimation import ice_cream_n_group_pgm, ice_cream_one_group_pgm

ice_cream_n_group_pgm()


Here, each shop has its own $p$, and that generates its own "likes".
Each $p_i$ is drawn from its own Beta distribution,
configured with a common $\alpha$ and $\beta$.

What if we tried to capture the idea that each shop draws its $p$ from its owners?
Here's where the notion of hierarchical models comes in!

## Hierarchical Models

In a "hierarchical" version of the ice cream shop model,
we try to express the idea that not only does each shop have its own $p$,
its $p$ is somehow conditionally dependent on its owner's $p$.

More generally, with a hierarchical model,
we impose the assumption
that each sample draws its key parameters from a "population" distribution.
Underlying this assumption is the idea
that "things from the same group should be put together".

If we ignored (for a moment) the "fixed" variables,
then the hierarchical model would look something like this:

In [None]:
from bayes_tutorial.solutions.hierarchical import hierarchical_p

hierarchical_p()

Here, we are expressing the idea that each shop $i$ draws its $p_{j, i}$ from the $p_{j}$ associated with its owner $j$,
and that its owner $p_{j}$ draws from a population $p$ distribution governing all owners.

In theory, this is really cool.
But implementing this is kind of difficult,
if we think more closely about the structure we've used thus far.
With Beta distributions as priors,
we might end up with a very convoluted structure instead:

In [None]:
from bayes_tutorial.solutions.hierarchical import convoluted_hierarchical_p

convoluted_hierarchical_p()

I'm not sure how you feel looking at that PGM diagram,
but at least to me, it looks convoluted.
I'd find it a hassle to implement.
Also, I wouldn't be able to bake in interpretability into the model
by directly mapping key parameter values to quantities of interest.

The key problem here is that of _parameterization_.
By _directly_ modelling $p$ with a Beta distribution,
we are forced to place priors on the $\alpha$ and $\beta$ parameters
of the Beta distribution.
That immediately precludes us
from being able to model the "central tendencies"
of owner-level shop ratings.
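
To make the parameterization issue a bit more concrete, here is a minimal sketch. The helper names `beta_mean` and `beta_concentration` are my own illustrations and are not part of the tutorial package. The mean of a Beta($\alpha$, $\beta$) distribution is $\frac{\alpha}{\alpha + \beta}$, so neither parameter on its own is the "central tendency"; many different pairs share the same mean.

```python
def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_concentration(alpha: float, beta: float) -> float:
    """alpha + beta; for a fixed mean, larger values give a tighter distribution."""
    return alpha + beta


# Three different (alpha, beta) pairs that all share the same mean of 0.25.
pairs = [(1.0, 3.0), (5.0, 15.0), (50.0, 150.0)]
means = [beta_mean(a, b) for a, b in pairs]
concentrations = [beta_concentration(a, b) for a, b in pairs]

for (a, b), m, c in zip(pairs, means, concentrations):
    print(f"alpha={a:>5}, beta={b:>6}: mean={m:.2f}, alpha+beta={c:.0f}")
```

Because the mean is tangled up in both parameters, placing a prior "on the owner's typical $p$" would mean placing a joint prior on $\alpha$ and $\beta$, which is the hassle described above.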

To get around this, I'm going to introduce you to this idea
of transforming a random variable,
which is immensely helpful in modelling tasks.

## Transformations of random variables

In our application,
being able to model directly the "central tendency" of the $p$,
for each shop and owner, matters a lot.

A Beta distribution parameterization does not allow us
to model $p$ with "central tendencies" directly.

On the other hand, if we were to "transform" the random variable $p$,
which has bounded support between 0 and 1,
into a regime that did not have a bounded support,
we could conveniently use Gaussian distributions,
which have central tendency parameters that we can model
using random variables directly.

### Logit Transform

One such transformation for a random variable that is bounded
is the **logit transform**.
In math form, given a random variable $p$ that is bounded in the $[0, 1]$ interval,
the logit transformation looks like this:

$$f(p) = \log(\frac{p}{1-p})$$

To help you understand a bit of the behaviour of the logit function, here it is plotted:

In [None]:
import matplotlib.pyplot as plt
from scipy.special import logit
import seaborn as sns

p = np.linspace(0, 1, 1000)
logit_p = logit(p)
fig, ax = plt.subplots(figsize=(3, 3))
plt.plot(p, logit_p)
plt.xlabel("p")
plt.ylabel("logit(p)")
sns.despine();

### Properties of the Logit Transformation

As you can see, the logit transformation function maps values of $p$,
which live on the interval between 0 and 1,
onto the interval $(-\infty, \infty)$.

It starts with the **odds** term, $\frac{p}{1-p}$,
which is the ratio of the probability of getting an outcome
to the probability of not getting the outcome.
We then take the odds and log-transform them.
When the probability of obtaining an outcome is less than 0.5,
we end up in the negative regime,
and when the probability of obtaining an outcome is greater than 0.5,
we end up in the positive regime.
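
As a quick numerical check of the sign behaviour described above, here is a small sketch using only the standard library. The `logit` function defined here is a stand-in I wrote for illustration; in the notebook itself we use `scipy.special.logit`.

```python
import math


def logit(p: float) -> float:
    """Log of p / (1 - p), for a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    ratio = p / (1 - p)
    print(f"p={p:.2f}  p/(1-p)={ratio:.3f}  logit(p)={logit(p):+.3f}")
```

Notice the symmetry: probabilities that are equally far from 0.5 (such as 0.25 and 0.75) land at the same distance from zero on the logit scale, with opposite signs.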

Remember also that we desired a way to model the central tendencies of our random variables,
and so a highly natural choice here is to use the Gaussian distribution,
which has a central tendency parameter $\mu$.
And since the logit of $p$ has infinite support,
we can use a distribution that has infinite support to model it.
As such, we can instantiate a random variable for the _logit transformed version of $p$_,
and then use the inverse logit transformation (also known as the `expit` function in `scipy.special`)
to take it back to bounded $(0, 1)$ space,
which we can then use for our Binomial likelihood function for the data.
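
The chain described above (Gaussian on the logit scale, inverse logit back to $p$, then a Binomial likelihood) can be sketched as a tiny simulation. This is only an illustration under assumed values: `mu`, `sigma` and `num_customers` are numbers I picked, and `expit` is a hand-written stand-in for `scipy.special.expit`.

```python
import math
import random


def expit(x: float) -> float:
    """Inverse logit: map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(42)

# Assumed, illustrative values for the Gaussian on the logit scale.
mu, sigma = 0.5, 1.0
num_customers = 20

logit_p = rng.gauss(mu, sigma)  # unbounded random variable
p = expit(logit_p)              # back in bounded (0, 1) space
# One Binomial(num_customers, p) draw, built from independent Bernoulli trials.
num_favs = sum(rng.random() < p for _ in range(num_customers))

print(f"logit_p={logit_p:.3f}, p={p:.3f}, num_favs={num_favs}/{num_customers}")
```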

### Exercise: Explore the transformation

Use the widgets below to explore how values on the logit scale ($f(p)$) and the original $p$ map onto one another.

In [None]:
from bayes_tutorial.solutions.hierarchical import plot_mu_p
from ipywidgets import interact, FloatSlider

mu = FloatSlider(value=0.5, min=-3, max=3, step=0.1)
sigma = FloatSlider(value=1, min=0.1, max=5, step=0.1)

interact(plot_mu_p, mu=mu, sigma=sigma);

You should notice a few things:

1. $\mu$ controls the central tendency of the bounded space $p$.
2. $\sigma$ controls the variance of the bounded space $p$.

With this transformation trick, it's possible to model both the central tendencies and the variance _directly_!
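
If you don't have the widgets handy, the two observations above can also be checked by simulation. This is a minimal sketch with assumed values for $\mu$ and $\sigma$; `simulate_p` is a hypothetical helper, not something from the tutorial package.

```python
import math
import random
import statistics


def expit(x: float) -> float:
    """Inverse logit: map any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_p(mu: float, sigma: float, n: int = 20_000, seed: int = 0) -> list:
    """Draw n Gaussian values on the logit scale and map them back to p."""
    rng = random.Random(seed)
    return [expit(rng.gauss(mu, sigma)) for _ in range(n)]


# Same sigma, different mu: the centre of p moves.
low_mu = simulate_p(mu=-1.0, sigma=0.5)
high_mu = simulate_p(mu=1.0, sigma=0.5)

# Same mu, different sigma: the spread of p changes.
narrow = simulate_p(mu=0.0, sigma=0.3)
wide = simulate_p(mu=0.0, sigma=2.0)

print(f"median p, mu=-1.0: {statistics.median(low_mu):.3f}")
print(f"median p, mu=+1.0: {statistics.median(high_mu):.3f}")
print(f"stdev of p, sigma=0.3: {statistics.stdev(narrow):.3f}")
print(f"stdev of p, sigma=2.0: {statistics.stdev(wide):.3f}")
```

Because the inverse logit is monotonic, the median of $p$ sits at the inverse logit of $\mu$, which is what makes $\mu$ directly interpretable.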

Let's see how this can get used, by redoing our model's PGM with the alternative parametrization.

In [None]:
from bayes_tutorial.solutions.hierarchical import hierarchical_pgm

hierarchical_pgm()

Here is how we read this new PGM:

- The "red" random variables are transformed versions of $p$ from their respective $\mu$s.
- The $\mu$s are hierarchically related, which gives us the central tendencies.
- The uncertainty in $\mu$ values (at all levels) is modeled by a variance term.
- Some of the variance terms are fixed, while others are modelled by a random variable; this is a modelling choice.
    - In setting up this problem I had this idea that analyzing the variance term of each owner might be handy, so I've included it.
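
To make the reading of the PGM more tangible, here is a forward simulation of the same kind of story: population, then owner, then shop, then likes. Please treat it as a sketch under my own assumptions. Every numeric value and the half-normal choice for the owner-level spread are made up for illustration; the actual priors live in `ice_cream_hierarchical_model`.

```python
import math
import random


def expit(x: float) -> float:
    """Inverse logit: map any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)

# All numeric values below are illustrative assumptions.
population_mu = 0.0      # central tendency across all owners (logit scale)
population_sigma = 1.0   # fixed spread of owner means around the population
num_owners, shops_per_owner, customers_per_shop = 3, 4, 10

shops = []
for owner in range(num_owners):
    # Each owner draws its own central tendency from the population.
    owner_mu = rng.gauss(population_mu, population_sigma)
    # Owner-specific spread, kept positive (a half-normal draw).
    owner_sigma = abs(rng.gauss(0.0, 0.5))
    for _ in range(shops_per_owner):
        # Each shop draws its logit(p) from its owner's distribution.
        p_shop = expit(rng.gauss(owner_mu, owner_sigma))
        num_favs = sum(rng.random() < p_shop for _ in range(customers_per_shop))
        shops.append({"owner": owner, "p_shop": p_shop, "num_favs": num_favs})

for shop in shops:
    print(f"owner={shop['owner']}  p_shop={shop['p_shop']:.3f}  "
          f"num_favs={shop['num_favs']}/{customers_per_shop}")
```

Inference runs this story in reverse: given the observed likes, we ask which values of the $\mu$s and $\sigma$s are plausible.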

## Hierarchical model

Here is the hierarchical model written down in PyMC3.

In [None]:
from bayes_tutorial.solutions.hierarchical import ice_cream_hierarchical_model
ice_cream_hierarchical_model??

In [None]:
model = ice_cream_hierarchical_model(data)

In [None]:
data["owner_idx"].sort_values().unique()

In [None]:
import arviz as az
import pymc3 as pm

with model:
    trace = pm.sample(2000, tune=2000)
    trace = az.from_pymc3(
        trace,
        coords={
            "p_shop_dim_0": data["shopname"],
            "logit_p_shop_dim_0": data["shopname"],
            "logit_p_owner_scale_dim_0": data["owner_idx"].sort_values().unique(),
            "p_owner_dim_0": data["owner_idx"].sort_values().unique(),
            "logit_p_owner_mean": data["owner_idx"].sort_values().unique(),
        },
    )

I am going to ask you to ignore the warnings about divergences for a moment; we will get there in the next chapter!

In [None]:
az.plot_trace(trace, var_names=["p_owner"]);

In [None]:
az.plot_posterior(trace, var_names=["p_owner"]);

## Interpretation In Context

### Owner-level $p$

Analysis of the posterior $p$s for the **owners** tells us that different owners have different characteristic $p$.

We can see this from the forest plot below:

In [None]:
az.plot_forest(trace, var_names=["p_owner"]);

Here, it seems clear to me that the shops belonging to owners 2, 3 and 5 are generally unfavourable,
while the shops belonging to owners 6, 7 and 8 are the best.

Worth mentioning is owner 8, which is actually the set of independently-owned shops.
Those shops are, in general, very well-rated.

### Analysis of Variation

I mentioned earlier that I thought that the variance of the owners' characteristic logits might be interesting to analyze,
and here is why:
if an owner's estimated $\sigma$ is large, that means that the shops might be quite _inconsistent_ in how much customers like them.
If customer service is the primary driver of how much their customers like them,
then that could be actionable information for owners to tighten up on customer service training.

At the same time, a tight distribution (small $\sigma$) coupled with poor ratings means something systematically bad might be happening.

Well, enough with the hypothesizing, let's dive in.

In [None]:
az.plot_forest(trace, var_names=["logit_p_owner_scale"]);

We might want to plot the _joint_ posterior distributions for each of the owners' $p$ and $\sigma$.

In [None]:
locations = trace.posterior["p_owner"].to_dataframe().unstack(-1)
scales = trace.posterior["logit_p_owner_scale"].to_dataframe().unstack(-1)
locations

In [None]:
for i in range(9):
    plt.scatter(
        locations[("p_owner", i)],
        scales[("logit_p_owner_scale", i)],
        alpha=0.3,
        label=f"{i}",
    )
plt.xlabel("owner p")
# Raw string so that "\s" is not treated as an (invalid) escape sequence.
plt.ylabel(r"owner $\sigma$")
sns.despine()
plt.legend();

By plotting the full posterior distribution of owner $\sigma$ against owner $p$,
we can immediately see how some owners are really good (to the right on the $p$ axis)
and very consistent (closer to the bottom on the $\sigma$ axis).

You might also notice that some of the shapes above look "funnel-like".
I have intentionally placed this plot here
as a foreshadowing of what we'll be investigating in the next chapter,
and it's related to the divergences that we saw above.
Those are what we will be diving deeper into later!

### Interpretation In Context

By plotting each owner's $p$ against its $\sigma$,
we can visualize the two points made above in a way that communicates really clearly
which owners might need help.

Qualitatively speaking, owners would ideally want to be in the bottom right quadrant of the plot.
That is where ratings are high and there's very little variability.
Owner 7 fits that bill very nicely, as does owner 6.
The independent shops are overall very highly rated, but they aren't very consistent;
this is the top-right quadrant of the plot.

The worst place to be in is the bottom-left: poor customer ratings, and consistently so.
We might devise further hypotheses as to why:
bad hygiene standards,
lack of training across the board,
some other historical factor etc.

## Analysis of Individual Shops

One of the promises of using a Bayesian hierarchical model here
is the ability to draw _tentative_ conclusions,
conditioned on our model's assumptions,
about the state of certain shops
_even in the low or zero data regime_.
In the machine learning world,
one might claim that this is a form of transfer learning,
or that it is a form of one-shot learning.
I'd prefer not to be quoted on that,
so I'll just call it what it actually is:
inference about the state of the world.

### A comparison of naive, Bayesian estimated, and owner-level $p$s

One thing we are going to do here is extract out the naive estimates,
which will contain nulls because of a lack of data,
the Bayesian estimated $p$s, which will be fully populated,
and compare them both against the owner-level $p$s.
We should see the effects of a hierarchical model here:
for each store, the $p$ will be centered on the owner's $p$,
but there will be variation around it.

The next few code cells will explicitly show how we gather out the necessary summary statistics,
while also highlighting the use of `pyjanitor`,
a library that I have developed to munge data with a clean, Pythonic API.

Firstly, we grab out the Bayesian estimates from the posterior samples.

In [None]:
import janitor

bayesian_estimates = (
    trace.posterior
    .stack(draws=("chain", "draw"))
    .median(dim="draws")
    ["p_shop"]
    .to_dataframe()
    .reset_index()
    .rename_column("p_shop_dim_0", "shopname")
    .rename_column("p_shop", "bayesian_p")
    .set_index("shopname")
)
bayesian_estimates

Next, we grab out the owner-level $p$.
In principle I would have used the posterior distribution,
but a naive estimate quickly calculated from the naive data
will be very close in our case.
(If you have the notebooks open in Binder,
you should definitely try
extracting the estimates from the posterior samples instead!)

In [None]:
owner_p = (
    naive_p
    .groupby_agg("owner_idx", new_column_name="owner_p", agg_column_name="naive_p", agg="mean")
    .set_index("shopname")
    .select_columns(["owner_p"])
)
owner_p

Finally, let's join everything together into a single DataFrame.

In [None]:
shrinkage = (
    naive_p
    .set_index("shopname")
    .select_columns(["naive_p", "owner_idx"])
    .join(bayesian_estimates)
    .join(owner_p)
)
shrinkage

Already, one of the advantages of a Bayesian estimate shows up:
we are able to fill in the NaN values left behind by a naive estimate
when no data are available.
How was this possible?
It was possible because the _structure_ of our model
presumed that each store drew its $\mu$ (and hence $p$) from the owner's $\mu$ (and hence $p$),
thus we obtain an estimate for the store,
which will look similar to the owner's $p$.

Let's visualize a comparison of the Bayesian $p$ estimates
against the naive and owner-level $p$ estimates.
We are going to construct a "shrinkage" plot.
(This is a diagnostic plot you can use
to help others visualize the comparison
we are about to go through.)

In [None]:
from ipywidgets import Dropdown

owner_idx = Dropdown(options=list(range(9)), description="Owner")
owner_idx

@interact(owner_idx=owner_idx)
def plot_shrinkage(owner_idx):
    data = (
        shrinkage
        .query("owner_idx == @owner_idx")
        .select_columns(["naive_p", "bayesian_p", "owner_p"])
    )
    nulls = (
        data.dropnotnull("naive_p")
    )
    non_nulls = (
        data.dropna(subset=["naive_p"])
    )
    fig, axes = plt.subplots(figsize=(8, 4), nrows=1, ncols=2, sharey=True, sharex=True)
    non_nulls.T.plot(legend=False, color="blue", alpha=0.1, marker='o', ax=axes[1], title="has data")
    nulls.T.plot(legend=False, color="blue", alpha=0.1, marker='o', ax=axes[0], title="no data")
    axes[0].set_ylabel("Estimated $p$")
    
    sns.despine()

The left plot shows the estimates for shops that have zero data.
Rather than estimate that its performance is unknowable,
we estimate that each shop's performance will be pretty close to
the owner-level $p$.

The right plot shows the estimates for shops that _do_ have data.
Those shops that have 1 out of 1 or 0 out of 1 no longer are estimated to have
a rating of 100% or 0% (respectively),
but rather are estimated to have their ratings closer to the owner's $p$.

As you should be able to see, the Bayesian estimates for each store's $p$
are _shrunk_ towards the owner-level $p$ estimates
relative to the naive $p$ estimates.
This phenomenon is called "shrinkage".
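
To build some intuition for the arithmetic of shrinkage, here is a deliberately simplified sketch. It is _not_ the logit-normal hierarchical model we fit above; I've swapped in a conjugate Beta-Binomial update, where the prior is centered on an assumed owner-level $p$, because its posterior mean can be written down in one line. The function `shrunk_estimate`, the `prior_strength` parameter, and the value of `owner_p` are all hypothetical.

```python
def shrunk_estimate(num_favs, num_customers, owner_p, prior_strength=10.0):
    """Posterior mean of p under a Beta prior centred on owner_p.

    prior_strength is alpha + beta, i.e. how many "pseudo-customers"
    the owner-level information is worth.
    """
    alpha = owner_p * prior_strength
    beta = (1.0 - owner_p) * prior_strength
    return (alpha + num_favs) / (alpha + beta + num_customers)


owner_p = 0.6  # assumed owner-level p, for illustration only

no_data = shrunk_estimate(0, 0, owner_p)          # naive estimate: undefined
one_of_one = shrunk_estimate(1, 1, owner_p)       # naive estimate: 1.0
zero_of_one = shrunk_estimate(0, 1, owner_p)      # naive estimate: 0.0
lots_of_data = shrunk_estimate(90, 100, owner_p)  # naive estimate: 0.9

print(f"no data:   {no_data:.3f}")
print(f"1 of 1:    {one_of_one:.3f}")
print(f"0 of 1:    {zero_of_one:.3f}")
print(f"90 of 100: {lots_of_data:.3f}")
```

The pattern matches what the plots show: with no data the estimate falls back to the owner-level value, with a single customer it barely moves away from it, and with lots of data it stays close to the naive estimate.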

Shrinkage in and of itself is a neutral thing.
Whether it is "good" or "bad" depends on the problem being solved.
In this case, I might consider shrinkage to be good,
because it is preventing us from giving wildly bad guesses.

## Where might the hierarchical modelling assumption be a bad thing?

In this chapter, we have gone in-depth about how hierarchical modelling can be a useful tool
to mathematically bake in the assumption that "birds of a feather flock together".
When this modelling assumption has no _serious_ detrimental effects,
it could be handy.

On the other hand, in an article titled [Meet the Secret Algorithm That's Keeping Students Out of College][wired] on Wired,
a highly revealing paragraph illuminated for me one scenario where this assumption could instead be potentially highly detrimental.

The backdrop here is that in 2020, because of the COVID-19 outbreak, International Baccalaureate examinations worldwide were cancelled,
and so the IB board had to come up with a method to grade students.
Other standardized exams, such as Cambridge University's GCEs and the SATs,
just went ahead with online tests,
but the IB board went with a model instead:

> The idea was to use prior patterns to infer what a student would have scored in a 2020 not dominated by a deadly pandemic. IB did not disclose details of the methodology but said grades would be calculated based on a student’s assignment scores, predicted grades, and historical IB results from their school. The foundation said grade boundaries were set to reflect the challenges of remote learning during a pandemic. For schools where historical data was lacking, predictions would build on data pooled from other schools instead.

Grading individual students using information from their school;
borrowing information from other schools where not enough historical information for a school was present...
These all sound oddly similar to the kind of thing we've done with ice cream shops.
The only difference here is that the consequences of using a model could be heavily life-shaping for individual students.
Also, individual students lose the agency to influence their grades through a final exam.
I'm going to withhold judgment on whether that is good or bad,
though I will state my personal preference for consistently good performance over a long run
rather than one-time tests that may be subject to a lot of noise.

Here, the use of a model may fundamentally be an unfair idea,
if we cannot disentangle long-run performance from confounders in the data.
What are your thoughts after reading the article?

[wired]: https://www.wired.com/story/algorithm-set-students-grades-altered-futures/

## Saving posterior traces

Knowing how to save posterior distribution traces is really handy,
as it allows us the chance to examine and compare model posterior distributions
given different model structures.
(That is what we'll be going through in the next notebook.)

Let's see how to use ArviZ to do this.

In [None]:
from pyprojroot import here

save_path = here() / "data/ice_cream_shop_hierarchical_posterior.nc"
az.to_netcdf(trace, save_path)