In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Introduction

In this chapter, we are going to build on your newfound PyMC3 knowledge to construct larger estimation models!

## Recap

In the previous chapter, you saw how we can use PyMC3
to infer, or estimate, the most likely set of values of parameters of a model.

- You defined a probabilistic model with key parameters of interest.
- You used the PyMC3 Inference Button (tm), which is called using `pm.sample(n_mcmc_steps)`.

"Inference", just so we are clear, is not "forward" prediction of a model. 
In statistics, it refers to the activity of ***estimating*** the most likely value of a parameter, given the data and model.
In _Bayesian_ statistics, it refers to the activity of estimating the most likely **set** of parameters,
given the data and model. (The "jargon" term here is the "typical set".)

## Multiple Estimation

Last chapter, you saw how to do "single estimation",
i.e. when there's only one collection of data points
on which we need to perform estimation.

In this chapter, we are going to learn
how to extend PyMC3 estimation code
to handle the case where we have
two or more groups
that we need to perform parameter estimation for.
Everything that you learned in the previous chapter will come in handy here!
Let's get going!

## I scream, you scream, we want ice cream!

We're going to take the classic coin flip problem
and put a different spin on it.

We've got some ice cream shop data!
Here, we have multiple ice cream shops,
and customers are leaving either "thumbs up"
or "thumbs down" ratings
on whether they enjoyed the ice cream shop experience.
The task at hand
is to rank-order the shops
according to their customer experience rating.

This example is going to be the pedagogical one that we work through
over this chapter and the next (when we discuss hierarchical models).
Other examples and exercises will be available
to help you get practice!

### The Data

Let's load in the data and take a quick look at it
to make sure we know what we're dealing with.

In [None]:
import pandas as pd
from pyprojroot import here
import janitor

In [None]:
from bayes_tutorial.data import load_ice_cream

data = load_ice_cream()
data.head()

Here's a description of the columns:

- `shopname`: A string identifier for the shop.
- `num_customers`: The number of customers that responded to a survey about whether they liked or didn't like the shopping experience.
- `owner_idx`: A numerical index of the holding company for a bunch of stores. Indices 0-7 are multi-store chains, while index 8 is a "catch-all" index for the stores that are independently and locally owned.
- `num_favs`: This is the observed number of likes in the responses.

## Statistical pitfalls

At first glance, the task of estimating how "good" a shop is
might sound like an easy task:
_Just_ calculate:

$$\hat{p} = \frac{n_{favs}}{n_{customers}}$$

Apparently, if we treat this "proportion of likes"
as a surrogate measure of the likeability of a store, 
this should give us an intrinsic measure of how good the store is.

Right??

### Exercise

We're going to see how that could be a difficult thing to justify.

Add a new column to the dataframe that adds a naive estimation of $p$, i.e. $\hat{p}$, for each store.

Some hints:

- You are _definitely_ going to have to worry about the case where `num_customers = 0`, and come up with a principled solution for it.
    - Your two choices, basically, are to call it 0, or call it some `null` type.

In [None]:
import janitor
import numpy as np

from bayes_tutorial.solutions.estimation import naive_estimate

# My answer:
estimated_p = naive_estimate(data)

In [None]:
# Your dataframe should look something like this.
estimated_p
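
One possible way to compute the naive estimate is sketched below. This is not the `naive_estimate` solution from `bayes_tutorial`; the toy shop names and the `p_hat` column name are assumptions made for illustration, while `num_customers` and `num_favs` follow the column description above. In this sketch, shops with zero customers get `NaN` rather than 0.

```python
import numpy as np
import pandas as pd

# Toy data with made-up shop names (hypothetical, not the tutorial dataset).
toy = pd.DataFrame(
    {
        "shopname": ["shop_a", "shop_b", "shop_c"],
        "num_customers": [10, 0, 1],
        "num_favs": [7, 0, 1],
    }
)


def naive_p(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a `p_hat` column (hypothetical name).

    Shops with zero customers get NaN instead of a division-by-zero result.
    """
    # Replace zero counts with NaN so that the division yields NaN for those shops.
    n = df["num_customers"].replace(0, np.nan)
    return df.assign(p_hat=df["num_favs"] / n)


toy_with_p = naive_p(toy)
print(toy_with_p)
```

Note that `shop_c`, with a single response, gets the same kind of extreme score that makes the naive estimate hard to justify.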

### Exercise

If you chose `0` as the calculated value of $\hat{p}$ when `num_customers = 0`,
what might you have to be careful about later on?

If you chose `np.nan` (or some other equivalent `null` representation)
as your calculated value of $\hat{p}$ when `num_customers = 0`,
what might you have to be careful about later on?

In [None]:
# Put in your answer in between the triple quotes below.
ans = """
Your answer here.
"""

# My answer is below. Uncomment to read it, or read it at the end.
from bayes_tutorial.solutions.estimation import assumptions

# print(assumptions())

## Constructing Two-Group Estimation Models

To illustrate how to progress from "one-group estimation" to "multi-group estimation",
we are going to stop by two-group estimation along the way.

### One-group estimation

To construct a so-called one-group estimation model,
let's build a $p$ estimation model for one of the stores,
say, "Gimpy periwinkle bombay".

In [None]:
store_data = estimated_p.query("shopname == 'Gimpy periwinkle bombay'")
store_data

### Build Model

Remember the protocol for building models:
start from a good likelihood distribution that describes the observed data,
and work backwards to the key parameters of interest.
The best distribution story for the sum of 1/0 trials
is the Binomial distribution.
It takes in two parameters, `n` and `p`.
`n` is known in the data,
but `p` is the intrinsic property
that we are trying to model.
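
To make the Binomial story concrete, here is a minimal sketch that evaluates the Binomial probability of a fixed observation under a few candidate values of `p`. The counts (`n = 10`, `k = 7`) are made up for illustration and are not taken from the ice cream data.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent 1/0 trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical observation: 7 likes out of 10 responses.
n, k = 10, 7

# The same data are more or less probable depending on the value of p.
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}: P(k = {k} | n = {n}, p) = {binomial_pmf(k, n, p):.4f}")
```

Values of `p` near the observed proportion make the data most probable, which is the information the likelihood passes back to `p` during inference.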

In [None]:
import pymc3 as pm

with pm.Model() as one_group_model:
    p = pm.Beta("p", alpha=2, beta=2)
    like = pm.Binomial(
        "like",
        n=store_data["num_customers"],
        p=p,
        observed=store_data["num_favs"]
    )

In graphical form, this model looks like the following:

In [None]:
from bayes_tutorial.solutions.estimation import ice_cream_one_group_pgm

ice_cream_one_group_pgm()

### Sample from Posterior

Now, we can hit the inference button!

(It helps to give the trace object an informative name though,
so let's call it something other than a generic `trace`.)

In [None]:
with one_group_model:
    trace_one_group = pm.sample(2000)

### Visualize Posterior Distribution

Let's now visualize the posterior distribution of `p`!

In [None]:
import arviz as az

az.plot_posterior(trace_one_group)
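
As a sanity check on the sampler, it may help to know that this particular model has a closed-form posterior: a Beta prior combined with a Binomial likelihood gives a Beta posterior with parameters `alpha + k` and `beta + (n - k)`. The sketch below uses only the standard library and made-up counts (`n = 30`, `k = 12`), not the actual store data.

```python
import random
import statistics


def beta_binomial_posterior(n, k, alpha=2.0, beta=2.0):
    """Posterior Beta parameters for a Beta(alpha, beta) prior and k successes in n trials."""
    return alpha + k, beta + (n - k)


# Hypothetical counts: 12 likes out of 30 responses.
n, k = 30, 12
a_post, b_post = beta_binomial_posterior(n, k)

# Draw from the exact posterior and compare against its analytical mean.
rng = random.Random(42)
draws = [rng.betavariate(a_post, b_post) for _ in range(20_000)]
exact_mean = a_post / (a_post + b_post)

print(f"posterior: Beta({a_post}, {b_post})")
print(f"exact mean = {exact_mean:.4f}, sample mean = {statistics.fmean(draws):.4f}")
```

If the MCMC posterior for `p` looks very different from the corresponding Beta distribution, that is a hint that something in the model or the data handling deserves a second look.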

### Build Model

Now, let's build the two-group version of this model.

In [None]:
estimated_p

In [None]:
import pymc3 as pm

wanted_stores = ["Crabby smalt walrus", "Gimpy periwinkle bombay"]
two_store_data = estimated_p.query("shopname in @wanted_stores")

with pm.Model() as two_group_model:
    p = pm.Beta("p", alpha=2, beta=2, shape=(len(two_store_data),))
    like = pm.Binomial("like", n=two_store_data["num_customers"], p=p, observed=two_store_data["num_favs"])

The key here is to express the "sample dimension" in the shape of the `p` random variable.
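
The same "one entry per store" idea can be illustrated with plain NumPy, which is used here only as a stand-in for what the `shape` argument does in the model. The customer counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical customer counts for two stores.
num_customers = np.array([25, 40])

# One p per store: the vector of p values has the same shape as the data.
p = rng.beta(2, 2, size=num_customers.shape)

# Element-wise pairing: store i uses num_customers[i] and p[i].
likes = rng.binomial(n=num_customers, p=p)

print(p.shape, likes.shape)
```

Because `p` and the observed counts line up element by element, the same model code works whether there are two stores or two hundred.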

As a graphical model, this model looks like the following:

In [None]:
from bayes_tutorial.solutions.estimation import ice_cream_n_group_pgm

ice_cream_n_group_pgm()

The one thing that might look different from what you've seen thus far
is the rectangular box.
This is known as a _plate_.
The _plate_ indicates that there's a "cloning" of the random variables.
Instead of one `p` and one `like`, there are `n_shops` of each.

### Sample from Posterior

Now, let's use the PyMC3 Inference Button (tm) to sample from the joint posterior distribution.

In [None]:
import arviz as az

with two_group_model:
    trace_two_group = pm.sample(2000)
    trace_two_group = az.from_pymc3(
        trace_two_group,
        coords={"p_dim_0": two_store_data["shopname"]}
    )

A convention to remember here: when you have an RV that has a shape axis, ArviZ will automatically name that axis by appending `_dim_0` to the variable name (here, `p_dim_0`) in the resulting xarray coordinate system.

As such, if you want to conveniently guarantee that the store labels (the true coordinates) are displayed in the posterior distribution plots, you must pass them in when converting the trace from a PyMC3 trace object into an ArviZ `InferenceData` object.

### Visualize Posterior Distribution

In [None]:
az.plot_posterior(trace_two_group);

### Forest Plot

The "forest plot" is a _compact_ visual representation of posterior distributions.
Let's take a look at how to use it.

In [None]:
az.plot_forest(trace_two_group);

### Interpreting forest plots

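- The circle is the median.
- The thick bars indicate the interquartile range (25th to 75th percentile).
- The thin bars indicate the 94% highest-density interval (HDI), which is ArviZ's default and, for a roughly symmetric posterior, is close to the 3rd-97th percentile range.
- There are four bars for each `p` because each bar represents one MCMC chain.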
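
The summaries drawn in a forest plot can also be computed directly from posterior draws. The sketch below uses stand-in draws from a Beta distribution rather than a real trace, and it computes the equal-tailed 3rd-97th percentile interval, which can differ slightly from a highest-density interval when the posterior is skewed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior draws of p (hypothetical, not from the tutorial trace).
samples = rng.beta(14, 20, size=8000)

median = np.median(samples)
q25, q75 = np.percentile(samples, [25, 75])  # interquartile range
lo, hi = np.percentile(samples, [3, 97])     # central 94% interval

print(f"median = {median:.3f}")
print(f"IQR    = [{q25:.3f}, {q75:.3f}]")
print(f"94%    = [{lo:.3f}, {hi:.3f}]")
```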

### Interpreting in context of the problem

Our goal here was to rank-order the stores.

With the model, we can rank-order the stores according to quantiles of the posterior distribution.

- By the median, the store Crabby smalt walrus is better than the Gimpy periwinkle bombay store.
- By the upper bound (97th percentile), the same holds.
- By the lower bound (3rd percentile), the same still holds.

Hence, we should be quite confident that Crabby smalt walrus >> Gimpy periwinkle bombay.

### Probability of Superiority

Another approach that we can take to comparing two stores
is to calculate the "probability of superiority" of one store over the other.
Given samples from the posterior distribution, this is trivial to calculate.
Over all pairs of samples taken, we simply have to ask
in what fraction of samples does one store have a higher `p` than the other.

This is a pretty useful way of directly comparing two posterior distributions to one another.

In [None]:
store1 = trace_two_group.posterior.stack(dimensions={"draws": ("chain", "draw")})["p"].sel(p_dim_0="Gimpy periwinkle bombay")
store2 = trace_two_group.posterior.stack(dimensions={"draws": ("chain", "draw")})["p"].sel(p_dim_0="Crabby smalt walrus")

np.sum(store1 > store2) / len(store1)

As we can see, the probability that the store "Gimpy periwinkle bombay" is better than "Crabby smalt walrus" is basically nothing.
This same pattern shows up in the forest plot.
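
For readers who want to try the calculation without a trace at hand, here is a self-contained sketch. The two sets of draws are stand-ins generated from Beta distributions with made-up counts; they are not the posteriors of the stores above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in posterior draws for two hypothetical stores.
store_a = rng.beta(2 + 30, 2 + 10, size=8000)  # e.g. 30 likes, 10 dislikes
store_b = rng.beta(2 + 12, 2 + 18, size=8000)  # e.g. 12 likes, 18 dislikes

# Fraction of paired draws in which store_a has the higher p.
prob_a_better = np.mean(store_a > store_b)
print(f"P(store_a > store_b) = {prob_a_better:.3f}")
```

Taking the mean of a boolean array is equivalent to summing it and dividing by its length.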

### Exercise: Extend the model to 4 stores

We're now going to build a model that can handle not just two stores, but an arbitrary number of stores.

In order to test-drive the construction of the estimation model,
we are going to start by ensuring that the model works on just four stores,
but you should write it in such a way that it can work with _any_ number of stores.

In [None]:
wanted_stores = [
    "Crabby smalt walrus",
    "Gimpy periwinkle bombay",
    "Beady razzmatazz jaguar",
    "Snazzy auburn skunk"
]

four_store_data = estimated_p.query("shopname in @wanted_stores")
four_store_data

In [None]:
four_store_data

In [None]:
def ice_cream_store_model(data: pd.DataFrame) -> pm.Model:
    with pm.Model() as model:
        # Your answer here.
        pass
    return model

from bayes_tutorial.solutions.estimation import ice_cream_store_model

In [None]:
with ice_cream_store_model(four_store_data):
    trace_four_store = pm.sample(2000)
    trace_four_store = az.from_pymc3(trace_four_store, coords={"p_dim_0": four_store_data["shopname"]})

In [None]:
az.plot_forest(trace_four_store)

### Interpretation in context

If you remember the data, the store _Snazzy auburn skunk_ had a rating of 1 like out of 1 response.
Under a naive point estimate, its score for `p` would be 1.0,
but we would probably be quite dissatisfied with ranking it first.
After all, there is (qualitatively speaking) very little information available in 1 vote.

With a Bayesian posterior, we now would rank _Snazzy auburn skunk_ in 3rd place according to the median and 3rd percentile,
and 2nd place according to the upper bound 97th percentile of the posterior.
Already, the benefits of a Bayesian approach to rank-ordering stores are visible:
in this particular case, a weakly informative prior distribution helped us pull the posterior estimates
away from extreme values
(this is called "regularization").

The posterior distribution width also quantitatively describes how uncertain we are;
**the larger the width of the posterior distribution, the greater the uncertainty.**
This is something worth keeping in mind!
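
The regularizing effect can be worked out by hand for this model, because a Beta(2, 2) prior with `k` likes out of `n` responses has posterior mean `(2 + k) / (4 + n)`. The sketch below is only that arithmetic; the second case (100 likes out of 100) is a made-up comparison.

```python
from fractions import Fraction


def posterior_mean(n: int, k: int, alpha: int = 2, beta: int = 2) -> Fraction:
    """Posterior mean of p under a Beta(alpha, beta) prior and k likes in n responses."""
    return Fraction(alpha + k, alpha + beta + n)


# 1 like out of 1 response: the naive estimate is 1.0, the posterior mean is not.
print(posterior_mean(n=1, k=1))      # 3/5

# Hypothetical store with 100 likes out of 100: the prior matters far less.
print(posterior_mean(n=100, k=100))  # 51/52
```

With a single response, the prior dominates and the estimate stays near the middle; with many responses, the data dominate.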


_A microbiome professional might chime in and remind us to analyze our posteriors -- they're very informative!_

### Exercise: Probability of superiority

Calculate the probability of superiority of Gimpy periwinkle bombay over Snazzy auburn skunk.

In [None]:
store1 = trace_four_store.posterior.stack(dimensions={"draws": ("chain", "draw")})["p"].sel(p_dim_0="Gimpy periwinkle bombay")
store2 = trace_four_store.posterior.stack(dimensions={"draws": ("chain", "draw")})["p"].sel(p_dim_0="Snazzy auburn skunk")

np.sum(store1 > store2) / len(store1)

According to the posterior distributions, Gimpy periwinkle bombay has about a 90% probability of superiority over Snazzy auburn skunk.

### Bonus Exercise

Write a function that takes in the posterior distribution trace
and returns a pandas DataFrame with each store's $k^{th}$ percentiles,
which we can rank-order after the fact
(i.e. don't return ranks!).

Some hints:

- You might need to be familiar with `xarray`'s API in order to work through this problem.
- The class methods that you are interested in are probably `stack` and `quantile`.

In [None]:
from bayes_tutorial.solutions.estimation import posterior_quantile

quantiles = posterior_quantile(trace_four_store, q=[0.03, 0.5, 0.97])
quantiles

In [None]:
quantiles.unstack().rank()
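
The shape of the result can be sketched with pandas alone. This is not the `posterior_quantile` solution (which works on the trace via `xarray`); the store names and draws below are stand-ins, and the orientation of the table is one assumed layout among several reasonable ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Stand-in posterior draws: one column per hypothetical store.
draws = pd.DataFrame(
    {
        "shop_a": rng.beta(32, 12, size=4000),
        "shop_b": rng.beta(14, 20, size=4000),
        "shop_c": rng.beta(3, 2, size=4000),  # very few responses, so very wide
    }
)

# Rows are stores, columns are the requested quantiles.
quantile_table = draws.quantile([0.03, 0.5, 0.97]).T

# Rank within each quantile column; rank 1 is the highest value.
ranks = quantile_table.rank(ascending=False)

print(quantile_table.round(3))
print(ranks)
```

Keeping the quantiles and the ranks as separate steps means that the ranking can be redone for any quantile without recomputing the summaries.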

## Rank-order all stores

Now that you've built a model that can generalize across multiple samples,
I'd like to invite you to go ahead and rank-order all stores.

In [None]:
# Sample from posterior of model that knows about all stores.

# Your answer below:


# The "correct" answer is here:
from bayes_tutorial.solutions.estimation import trace_all_stores
trace = trace_all_stores(data)

In [None]:
quantiles = posterior_quantile(trace, q=[0.03, 0.5, 0.97])

Based on each of the quantiles, is there a clear winner?

In [None]:
quantiles.unstack().rank().sort_values(("p", 0.97))

In [None]:
quantiles.unstack().rank().sort_values(("p", 0.03))

## Summary

In this chapter, we went a little deeper into the Bayesian workflow. Here's what you've learned from this chapter.

Firstly, you learned how to extend an estimation model that worked with "single" samples
to perform "multiple estimation", in which you estimated a key parameter for multiple samples.
The key idea here was to learn how to use vectorized syntax.

Secondly, you learned a bit of workflow.
Before we went ahead and built a model to be fit on _all samples_,
we built the model in such a way that it could handle _some_ of the samples.
Only after checking that we could perform posterior sampling on _some_ of the samples
did we then apply the model across _all_ of the samples.

Thirdly, you learned how to handle summary values from the posterior distribution trace.
This includes calculating quantiles of the posterior.
You can calculate means, standard deviations, and more,
but practically speaking, simple quantiles are already quite expressive.
`xarray` syntax is something you will want to become very familiar with,
as `xarray` provides idiomatic high-dimensional data structures
that are useful for storing Bayesian posterior calculations.

One thing that should have stood out here is that
we did not make binary/discretized decisions,
like ranking things, early on.
Instead, we deferred them
until the full posterior distributions were calculated.
Only then did we try to organize what we concluded.

## Next chapter

In the next chapter, we're going to solve some of the unsatisfying parts of this model.
In particular, if you remember some of those "really wide" posterior distributions
that have a ton of uncertainty in them and can recall yourself being distinctly dissatisfied with them,
then the next chapter might help resolve some of that lingering dissatisfaction!