# Tutorial: **Hierarchical modeling with repeated binary trial data**

In this tutorial we will demonstrate the application of *hierarchical models*
with data from the 1970 season of [Major League Baseball (MLB)](#references)
found in the paper by [Efron and Morris 1975](#references).

## Learning Outcomes

On completion of this tutorial, you should be able:

* to prepare data for running a hierarchical model with Bean Machine;
* to execute a hierarchical model with Bean Machine;
* to explain what different pooling techniques tell us about the data;
* to run diagnostics to understand what Bean Machine is doing; and
* to generalize the techniques demonstrated to build a hierarchical model based
  on new data.

## Problem

Data are from [Efron and Morris 1975](#references), which are taken from the
1970 Major League Baseball season for both the American and National leagues,
[Wikipedia (Major League Baseball)](#references). These data are an example of
_repeated binary trials_ in that a player has multiple opportunities (trials)
for being at-bat and they can either land a hit or miss the ball (binary
outcome). Many useful scenarios can be modeled as having repeated binary
trials, for example browser click data, watching a video to completion, voting
for or against propositions in a session of congress/parliament, and even rat
tumor development, see [Tarone](#references).

**Our task is to use the given baseball data and create a _hierarchical_ model
that estimates the chances each player has to land a hit when they are at-bat.**
The term _hierarchical_ suggests a hierarchy within the data, _e.g._
country > state > city, but our baseball data do not contain an explicit
hierarchy. Nonetheless, we can use a hierarchical model to describe the data. We
create a "hierarchy" by stating that all the data are from MLB players. Each
individual player is different, but all players belong to the MLB. So the
individuals are sampled from the MLB population, and each individual is a
different player. Thus our "hierarchy" is MLB-population > MLB-player. We can
do this even though we do not have any explicit data on the MLB population.
Later on we will create a model where we estimate population parameters and we
will use that knowledge to update the chances a player has to hit a ball when
at-bat.

## Prerequisites

We will be using the following packages within this tutorial.

* [`beanmachine`](#references) the Bean Machine library;
* [`arviz`](https://arviz-devs.github.io/arviz/) and
  [`bokeh`](https://docs.bokeh.org/en/latest/docs/) for interactive
  visualizations;
* [`numpy`](https://numpy.org/) and [`pandas`](https://pandas.pydata.org/) for
  data manipulation;
* [`torch`](https://pytorch.org/) for fundamental PyTorch classes; and
* [`statsmodels`](https://www.statsmodels.org/) for simple statistics.

In [None]:
from io import StringIO

import arviz as az
import beanmachine.ppl as bm
import numpy as np
import pandas as pd
import statsmodels.api as sm
import torch
import torch.distributions as dist
from beanmachine.ppl.model import RVIdentifier
from bokeh.io import output_notebook
from bokeh.models import Arrow, Band, ColumnDataSource, HoverTool, VeeHead, Whisker
from bokeh.plotting import figure, gridplot, show
from torch import tensor

The next cell includes convenient configuration settings to improve the notebook
presentation as well as setting a manual seed for `torch` for reproducibility.

In [None]:
# Plotting settings
az.style.use('arviz-darkgrid')
az.rcParams['plot.backend'] = 'bokeh'
az.rcParams['plot.bokeh.tools'] = ','.join(
    [
        'reset',
        'pan',
        'box_zoom',
        'wheel_zoom',
        'lasso_select',
        'undo',
        'save',
        'hover',
        'crosshair',
    ]
)
az.rcParams['plot.bokeh.figure.dpi'] = 60
az.rcParams['stats.hdi_prob'] = 0.89

# Manual seed for torch
torch.manual_seed(1199)

# Other settings for the notebook
output_notebook()

## Data

We will explore the data first so we can create a _data story_. We will use the
_data story_ to inform decisions that need to be made about priors and sampling
distributions when we begin to make models. The story will be used to create
logical procedures for processing information, see [McElreath](#references) for
further discussion about data stories.

The data columns are explained as follows.

| Column name      | Description                                       |
|------------------|---------------------------------------------------|
| FirstName        | Player's first name                               |
| LastName         | Player's last name                                |
| At-Bats          | 45 for each player                                |
| Hits             | Hits in first 45 at-bats                          |
| BattingAverage   | Batting average after first 45 at-bats            |
| RemainingAt-Bats | Number of at-bats for the remainder of the season |
| RemainingAverage | Batting average for the remainder of the season   |
| SeasonAt-Bats    | Number of at-bats for entire season               |
| Season Hits      | Number of hits for entire season                  |
| SeasonAverage    | Batting average for entire season                 |

The selected players include _Roberto Clemente_ explicitly because he was
presumed to be an outlier. See [Efron and Morris, 1977](#references) for further
discussion.

We now read the data in as a Pandas dataframe. The data are replicated below
using a `StringIO` object rather than reading from a `CSV` file.

In [None]:
# Data are from Efron and Morris (1975).
data_string = """FirstName,LastName,At-Bats,Hits,BattingAverage,RemainingAt-Bats,RemainingAverage,SeasonAt-Bats,Season Hits,SeasonAverage
Roberto,Clemente,45,18,0.4,367,0.346,412,145,0.352
Frank,Robinson,45,17,0.378,426,0.2981,471,144,0.306
Frank,Howard,45,16,0.356,521,0.2764,566,160,0.283
Jay,Johnstone,45,15,0.333,275,0.2218,320,76,0.238
Ken,Berry,45,14,0.311,418,0.2727,463,128,0.276
Jim,Spencer,45,14,0.311,466,0.2704,511,140,0.274
Don,Kessinger,45,13,0.289,586,0.2645,631,168,0.266
Luis,Alvarado,45,12,0.267,138,0.2101,183,41,0.224
Ron,Santo,45,11,0.244,510,0.2686,555,148,0.267
Ron,Swaboda,45,11,0.244,200,0.23,245,57,0.233
Rico,Petrocelli,45,10,0.222,538,0.2639,583,152,0.261
Ellie,Rodriguez,45,10,0.222,186,0.2258,231,52,0.225
George,Scott,45,10,0.222,435,0.3034,480,142,0.296
Del,Unser,45,10,0.222,277,0.2635,322,83,0.258
Billy,Williams,45,10,0.222,591,0.3299,636,205,0.251
Bert,Campaneris,45,9,0.2,558,0.2849,603,168,0.279
Thurman,Munson,45,8,0.178,408,0.3162,453,137,0.302
Max,Alvis,45,7,0.156,70,0.2,115,21,0.183"""
with StringIO(data_string) as f:
    df = pd.read_csv(f)

The next cell contains standard `pandas` transformations to make the data
clearer for later analysis.

In [None]:
# Rename columns in order to make the data clearer.
renames = {
    'Hits': 'Current hits',
    'At-Bats': 'Current at-bats',
    'Season Hits': 'Season hits',
    'SeasonAt-Bats': 'Season at-bats',
}
df = df.rename(columns=renames)

# Concatenate the first and last names together.
df['Name'] = df['FirstName'].str.cat(df['LastName'], sep=' ')

# Keep only those columns we will use in the analysis.
df = df[
    [
        'Name',
        'Current hits',
        'Current at-bats',
        'Season hits',
        'Season at-bats',
    ]
].copy()

# Show the resultant dataframe.
df.set_index('Name')

The above dataframe shows us that each player had a different number of
opportunities to be at-bat over the course of the season (the `Season at-bats`
column). We also see a defined slice in time where all the players have had the
same number of opportunities to be at-bat (the `Current at-bats` column). The
`Current at-bats` column is the number of trials each player has had, which is
45 for every player. The number of successful hits a player has made within the
first 45 trials is given in the `Current hits` column. From the dataframe, we
can see that we do indeed have _repeated binary trials_ data.

The data are telling us that each player is unique, but also that they are
similar. The players are similar because they are all highly trained in their
sport, and have a much better chance of hitting a ball when at-bat than an
average person not in the MLB. The data also tell us that we have captured a
slice in time where each player has been given the same number of opportunities
to be at-bat, and all of them had different amounts of success at landing a
hit.

This is now our **data story**: MLB players are unique because they have
different abilities associated with hitting a ball, but they are all similar in
that they are all highly skilled athletes. Our **problem** is now to estimate
the players' abilities and their chances of hitting a ball when at-bat.

## Models

We will create three separate models using the above data as we explore how to
create a hierarchical model using Bean Machine. The models we will be creating
are:

* [complete-pooling](#complete-pooling);
* [no-pooling](#no-pooling);
* [partial-pooling](#partial-pooling).

Each model will estimate a baseball player's chance of landing a hit when they
are at-bat. We can assume that every time a player is at-bat is a new
independent Bernoulli trial (see [Wikipedia (Bernoulli trial)](#references)) so
we can model the chance a player has when at-bat as a Binomial distribution,
see [Wikipedia (Binomial)](#references).

$$
p(y_n | \phi) = \text{Binomial}(y_n | K_n, \phi)
$$

where

* $\phi$ is the chance of landing a hit, to which we assign a prior;
* $y_n$ is the number of times player $n$ makes a hit (number of successes);
* $K_n$ is the number of times that player has been at-bat (number of trials).
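
To make the likelihood concrete, the following is a minimal sketch in plain
Python (standard library only, not Bean Machine) that evaluates the Binomial
probability for one player. The helper name `binomial_pmf` and the candidate
values of $\phi$ are illustrative assumptions; the counts are Roberto
Clemente's 18 hits in 45 at-bats from the data table.

```python
import math


def binomial_pmf(y: int, K: int, phi: float) -> float:
    """Probability of exactly y successes in K independent trials."""
    return math.comb(K, y) * phi**y * (1.0 - phi) ** (K - y)


# Roberto Clemente's first 45 at-bats from the data table: 18 hits.
K, y = 45, 18

# Evaluate the likelihood of the observed hits for a few candidate phi values.
candidates = (0.20, 0.27, 0.40)
likelihoods = {phi: binomial_pmf(y, K, phi) for phi in candidates}
for phi, value in likelihoods.items():
    print(f"phi = {phi:.2f}: p(y = {y} | K = {K}, phi) = {value:.4f}")

# Sanity check: the probabilities over all possible outcomes sum to one.
total = sum(binomial_pmf(k, K, 0.27) for k in range(K + 1))
print(f"sum over all outcomes = {total:.6f}")
```

The likelihood is largest for the candidate closest to the observed ratio
$18 / 45 = 0.4$, which is the behavior the models below build on.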

**`NOTE`** We can implement all of these models in Bean Machine by defining
random variable objects with the `@bm.random_variable` decorator. These
functions behave differently than ordinary Python functions. In each of the
models below we will see how to use the `@bm.random_variable` decorator.

<div style= "background: #daeaf3; border-left: 3px solid #2980b9; display: block; margin: 16px 0; padding: 12px;">
  Semantics for <code>@bm.random_variable</code> functions:
  <ul>
    <li>They must return PyTorch <code>Distribution</code> objects.</li>
    <li>
      Though they return distributions, callers actually receive a <i>sample</i>
      from the distribution. The machinery for obtaining samples from
      distributions is handled internally by Bean Machine.
    </li>
    <li>
      Inference runs the model through many iterations. During a particular
      inference iteration, a distinct random variable will correspond to
        exactly one sampled value: <b>calls to the same random variable
          function with the same arguments will receive the same sampled value
          within one inference iteration</b>. This makes it easy for multiple
        components of your model to refer to the same logical random variable.
    </li>
    <li>
      Consequently, to define distinct random variables that correspond to
      different sampled values during a particular inference iteration, an
      effective practice is to add a dummy "indexing" parameter to the
      function. Distinct random variables can be referred to with different
      values for this index.
    </li>
    <li>
      Please see the documentation for more information about this decorator.
    </li>
  </ul>
</div>
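
The "same arguments, same sampled value" rule can be pictured with a toy
analogy in plain Python. This is only an analogy built on
`functools.lru_cache`; it is not how Bean Machine is implemented, and the
function `theta` here is a hypothetical stand-in rather than a
`@bm.random_variable`.

```python
import functools
import random


@functools.lru_cache(maxsize=None)
def theta(i: int) -> float:
    """Toy stand-in for an indexed random variable: one cached draw per index."""
    return random.betavariate(1, 1)


random.seed(0)

# Within one "iteration", the same arguments give back the same value ...
same_value = theta(0) == theta(0)
# ... while a different index names a different variable with its own draw.
first, second = theta(0), theta(1)

# Clearing the cache plays the role of moving on to the next iteration.
theta.cache_clear()
redrawn = theta(0)

print(same_value, first, second, redrawn)
```

The dummy index plays the same role as the `i` argument used in the models
below: it gives each player its own random variable.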

<a id="complete-pooling"></a>
### Complete-pooling model

Complete pooling assumes every item being modeled is identical. This means for
our baseball data that we will be modeling each player as if their chance of
getting a hit when at-bat is the same for everyone. As we will see later on,
this is going to underestimate top batters' abilities and overestimate poor
batters' abilities.

<center>
  <img src="img/complete-pooling.svg">
</center>

The image above is a graphical representation of what complete pooling does: it
assumes that everyone is sampled from the same population distribution $\phi$.
This model assumes that all the players are the _same_, which is an extreme
statement compared to our data story that said all players were _similar_. This
is something we will fix later on.

If we assume no prior knowledge about the players then our prior for $\phi$
is a uniform distribution on $[0, 1]$. Bean Machine works well with `Uniform`
priors, but we will choose to use a $\text{Beta}(1, 1)$ distribution as our
prior because it is a conjugate prior to a Binomial distribution. Our
`complete-pooling` model can be written as

$$
\begin{align}
  \phi & \sim\text{Beta}(1, 1)\\
  y    & \sim\text{Binomial}(K, \phi).
\end{align}
$$
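
Because the Beta prior is conjugate to the Binomial likelihood, this model
also has a closed-form posterior,
$\text{Beta}(1 + \sum_n y_n,\ 1 + \sum_n (K_n - y_n))$, which can serve as a
rough cross-check on sampler output. The sketch below uses only the standard
library; the variable names are illustrative and the hit counts are copied
from the data table.

```python
# Hits in the first 45 at-bats for the 18 players, copied from the data table.
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
at_bats = [45] * len(hits)

# Beta(1, 1) prior updated with the pooled successes and failures.
prior_a, prior_b = 1.0, 1.0
post_a = prior_a + sum(hits)
post_b = prior_b + sum(at_bats) - sum(hits)

# Closed-form summaries of a Beta(a, b) distribution.
post_mean = post_a / (post_a + post_b)
post_mode = (post_a - 1) / (post_a + post_b - 2)
post_sd = (
    post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
) ** 0.5

print(f"posterior: Beta({post_a:.0f}, {post_b:.0f})")
print(f"mean = {post_mean:.4f}, mode = {post_mode:.4f}, sd = {post_sd:.4f}")
```

With a flat prior, the posterior mode equals the pooled ratio of hits to
at-bats.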

#### Beta priors _vs_ Uniform priors
Below we show the similarity between a uniform distribution and a Beta
distribution.

**`Note`** that we have chosen to draw samples from both distributions and then
estimate their probability density functions using `statsmodels` in order to
compare the two. Analytic plots of the Beta(1, 1) and Uniform densities would
show two exactly straight lines on top of each other.
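
The analytic claim can be checked directly. The following sketch, written with
the standard library only and with hypothetical helper names `beta_pdf` and
`uniform_pdf`, evaluates both densities on a small grid and measures the
largest difference.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def uniform_pdf(x: float, low: float = 0.0, high: float = 1.0) -> float:
    """Density of a Uniform(low, high) distribution at x."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


# Compare the two densities on an interior grid of the unit interval.
grid = [i / 10 for i in range(1, 10)]
max_gap = max(abs(beta_pdf(x, 1.0, 1.0) - uniform_pdf(x)) for x in grid)
print(f"largest difference on the grid: {max_gap}")
```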

In [None]:
beta = dist.Beta(1., 1.)
uniform = dist.Uniform(low=0., high=1.)

N = int(1e6)
beta_samples = beta.sample((N,))
uniform_samples = uniform.sample((N,))

beta_kde = sm.nonparametric.KDEUnivariate(beta_samples)
beta_kde.fit()
uniform_kde = sm.nonparametric.KDEUnivariate(uniform_samples)
uniform_kde.fit()

plot = figure(
    plot_width=400,
    plot_height=400,
    title='Beta(1, 1) vs Uniform',
    x_axis_label='Support',
    x_range=[0, 1],
    y_range=[0.8, 1.2],
)

beta_source = ColumnDataSource(
    {
        'x': beta_kde.support,
        'y': beta_kde.density / beta_kde.density.max(),
    }
)
beta_glyph = plot.line(
    x='x',
    y='y',
    source=beta_source,
    line_color='steelblue',
    line_alpha=0.7,
    line_width=2,
    legend_label='Beta(1, 1)',
)

uniform_source = ColumnDataSource(
    {
        'x': uniform_kde.support,
        'y': uniform_kde.density / uniform_kde.density.max(),
    }
)
uniform_glyph = plot.line(
    x='x',
    y='y',
    source=uniform_source,
    line_color='orange',
    line_alpha=0.7,
    line_width=2,
    legend_label='Uniform distribution',
)

plot.yaxis.major_tick_line_color = None
plot.yaxis.minor_tick_line_color = None
plot.yaxis.major_label_text_font_size = '0pt'
plot.outline_line_color = 'black'


show(plot)

We can create our `complete-pooling` model in Bean Machine using the
`@bm.random_variable` decorator.

In [None]:
@bm.random_variable
def phi() -> RVIdentifier:
    """Population's ability for hitting a ball when at bat."""
    return dist.Beta(1, 1)


@bm.random_variable
def y(i: int, K: int) -> RVIdentifier:
    """Chance a player has to make a hit when at-bat.

    Parameters
    ----------
    i : int
        An index.
    K : int
        Number of trials (at-bats).
    """
    return dist.Binomial(K, phi())

In the code cell above we have defined our population prior and chance of
landing a hit when at-bat in the top-level namespace of our notebook. Bean
Machine models can also be written as classes, and we illustrate how to do that
below for our `complete-pooling` model. We will not use the class approach for
the rest of the tutorial, but it shows how one could create models as class
objects in Python.

In [None]:
class CompletePoolingModel:
    """Example for creating a class object as a Bean Machine model."""

    @bm.random_variable
    def phi(self) -> RVIdentifier:
        """Population's ability for hitting a ball when at bat."""
        return dist.Beta(1, 1)

    @bm.random_variable
    def y(self, i: int, K: int) -> RVIdentifier:
        """Chance a player has to make a hit when at-bat.

        Parameters
        ----------
        i : int
            An index.
        K : int
            Number of trials (at-bats).
        """
        return dist.Binomial(K, self.phi())


complete_pooling_model = CompletePoolingModel()

#### Complete-pooling inference

We need to supply observational data to our model defined above. Bean Machine's
inference algorithms expect observations in the form of a dictionary. This
dictionary should consist of `@bm.random_variable` invocations as keys, and
tensor data as values. Recall from above that the `@bm.random_variable`
decorator allows us to define unique random variables by using a dummy index.
Below we use this to our advantage when defining data observations, which are
explicitly coded below.

In [None]:
# NOTE The `at_bats` type should be a native Python type since they are used as
#      arguments for the observation dictionary keys.
at_bats = df['Current at-bats'].astype(int).tolist()

# NOTE The `hits` type should be a Tensor type as Bean Machine requires all
#      values to be a tensor when executing inference.
hits = [tensor(hit) for hit in df['Current hits'].astype(int).tolist()]

# NOTE We are explicitly coding the index for `y` in the dictionary.
complete_pooling_observations = {
    y(0, at_bats[0]): hits[0],
    y(1, at_bats[1]): hits[1],
    y(2, at_bats[2]): hits[2],
    y(3, at_bats[3]): hits[3],
    y(4, at_bats[4]): hits[4],
    y(5, at_bats[5]): hits[5],
    y(6, at_bats[6]): hits[6],
    y(7, at_bats[7]): hits[7],
    y(8, at_bats[8]): hits[8],
    y(9, at_bats[9]): hits[9],
    y(10, at_bats[10]): hits[10],
    y(11, at_bats[11]): hits[11],
    y(12, at_bats[12]): hits[12],
    y(13, at_bats[13]): hits[13],
    y(14, at_bats[14]): hits[14],
    y(15, at_bats[15]): hits[15],
    y(16, at_bats[16]): hits[16],
    y(17, at_bats[17]): hits[17],
}
complete_pooling_observations

We are ready to run inference on our model and observations. Bean Machine
supports a rich library of inference algorithms that share a common `infer`
method with the following arguments:

| Name           | Usage                                                                                         |
|----------------|-----------------------------------------------------------------------------------------------|
| `queries`      | A list of @bm.random_variable targets to fit posterior distributions for.                     |
| `observations` | The Dict of observations we built up, above.                                                  |
| `num_samples`  | Number of samples to build up distributions for the values listed in queries.                 |
| `num_chains`   | Number of separate inference runs to use. Multiple chains can verify inference ran correctly. |

For this particular problem, we will use the `GlobalNoUTurnSampler`
inference method. We have chosen the No-U-Turn Sampler (NUTS) because its
results can be easily compared with those of other probabilistic tools.

In [None]:
complete_pooling_queries = [phi()]
complete_pooling_samples = bm.GlobalNoUTurnSampler().infer(
    queries=complete_pooling_queries,
    observations=complete_pooling_observations,
    num_samples=3000,
    num_chains=4,
    num_adaptive_samples=1500,
)

#### Complete-pooling analysis

The `complete_pooling_samples` object contains our inference results. We have
only one parameter in this model, $\phi$, which was supplied as our single query
parameter when we ran inference. We can investigate what the posterior
distribution looks like, but first let us look at the trace plots for $\phi$
to make sure the model was able to mix well for all chains. Bean Machine has a
convenient `Diagnostics` class that we can use to plot the traces for $\phi$
for each chain, as well as their autocorrelation plots.

In [None]:
complete_pooling_query_trace = bm.Diagnostics(complete_pooling_samples).plot(display=True)

We can see from this figure that the chains mixed well as the trace plot for
$\phi$ does not have any areas where the parameter gets "stuck".

Below we show two other diagnostic statistics: [$\hat{R}$](#references) and
[$N_\text{eff}$](#references).

  * $\hat{R} \in [1, \infty)$ summarizes how effective inference was at
    converging on the correct posterior distribution for a particular random
    variable. It uses information from all chains in order to assess
    whether inference had a good understanding of the distribution or not.
    Values very close to one indicate that all chains discovered similar
    distributions for a particular random variable. We do not recommend using
    inference results where $\hat{R} > 1.1$, as inference may not have
    converged. In that case, you may want to run inference for more samples.
  * $N_\text{eff} \in [1, \texttt{num}\_\texttt{samples}]$ summarizes how
    independent posterior samples are from one another. Although inference was
    run for `num_samples` iterations, it is possible that those samples were
    very similar to each other (due to the way inference is implemented), and
    may not be representative of the full posterior space. Larger numbers
    are better here, and if your particular use case calls for a certain number
    of samples to be considered, you should ensure that $N_\text{eff}$ is at
    least that large.
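
To give a feel for what $\hat{R}$ measures, here is a minimal sketch of the
classic split-$\hat{R}$ formula (within-chain versus between-chain variance)
using only the standard library. It is an illustration under stated
assumptions, not necessarily the exact computation Bean Machine performs; the
function name `split_r_hat` and the synthetic chains are hypothetical.

```python
import random
import statistics


def split_r_hat(chains):
    """Classic split R-hat for a list of equal-length chains of scalar draws."""
    half = len(chains[0]) // 2
    # Split every chain in two so that drift within a chain is also detected.
    halves = [c[:half] for c in chains] + [c[half : 2 * half] for c in chains]
    n = half
    # Average variance inside each half-chain.
    within = statistics.mean([statistics.variance(h) for h in halves])
    # Variance of the half-chain means, i.e. between-chain variance over n.
    between_over_n = statistics.variance([statistics.mean(h) for h in halves])
    var_hat = (n - 1) / n * within + between_over_n
    return (var_hat / within) ** 0.5


random.seed(1199)
# Four chains exploring the same distribution.
mixed = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains where two are stuck around a different value.
stuck = [
    [random.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0, 3.0)
]

r_mixed = split_r_hat(mixed)
r_stuck = split_r_hat(stuck)
print(f"well-mixed chains: {r_mixed:.3f}")
print(f"disagreeing chains: {r_stuck:.3f}")
```

Chains that agree give a value near one, while chains that settle on different
regions push the value well above the 1.1 threshold. With a finite number of
draws this particular estimate can land marginally below one.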

In this case, $\hat{R}$ and $N_\text{eff}$ have acceptable values.

In [None]:
complete_pooling_diagnostics = bm.Diagnostics(complete_pooling_samples)
print(f'r_hat: {complete_pooling_diagnostics.split_r_hat(query_list=[phi()]).values[0][0]}')
print(f'n_eff: {complete_pooling_diagnostics.effective_sample_size(query_list=[phi()]).values[0][0]}')

Below we see the posterior distribution for $\phi$. Its point estimate is
approximately 0.27, with the 89% highest density interval lying between 0.24
and 0.29.

Why an 89% highest density interval (HDI)? To prevent you from thinking about a
95% confidence interval. See [McElreath](#references) for further discussion on
this subject.
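
A highest density interval is the narrowest interval that contains the
requested share of the posterior draws. The sketch below computes one by hand
with the standard library. The helper name `hdi` is hypothetical, and the
draws come from a $\text{Beta}(216, 596)$ stand-in (what conjugacy gives for
215 hits in 810 at-bats under a $\text{Beta}(1, 1)$ prior) rather than from
the Bean Machine samples.

```python
import math
import random


def hdi(samples, prob=0.89):
    """Narrowest interval containing a fraction `prob` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, math.ceil(prob * n))
    # Slide a window holding `prob` of the draws and keep the narrowest one.
    start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[start], ordered[start + window - 1]


random.seed(1199)
# Stand-in posterior draws; in the notebook these would come from inference.
draws = [random.betavariate(216, 596) for _ in range(20000)]

lower, upper = hdi(draws, prob=0.89)
coverage = sum(lower <= d <= upper for d in draws) / len(draws)
print(f"89% HDI: [{lower:.3f}, {upper:.3f}], coverage = {coverage:.3f}")
```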

In [None]:
complete_pooling_query_plot = az.plot_posterior({'φ': complete_pooling_samples[phi()].numpy()})

What our model and data are telling us is that every player has a 27% chance of
hitting a ball when at-bat. So everyone has the same chance to hit a ball when
at-bat. How unrealistic this statement is becomes evident when we plot all the
players in the same plot showing their chances of success when at-bat.

In [None]:
complete_pooling_data = {'φ': complete_pooling_samples[phi()].numpy()}

complete_pooling_hdis = az.hdi(
    complete_pooling_data,
    hdi_prob=0.89,
).to_dataframe().T.rename(columns={'lower': 'hdi_11%', 'higher': 'hdi_89%'})

# Calculate the summary statistics for the complete pooling model.
complete_pooling_summary_df = az.summary(
    complete_pooling_data,
    round_to=6,
    stat_funcs={'median': np.median},
    extend=True,
).drop(
    ['hdi_5.5%', 'hdi_94.5%'],
    axis=1,
).join(complete_pooling_hdis)
complete_pooling_summary_df['mode'] = np.nan
modes = []
for complete_pooling_query in complete_pooling_queries:
    data = complete_pooling_samples[complete_pooling_query].reshape(-1,).numpy()
    kde = sm.nonparametric.KDEUnivariate(data)
    kde.fit()
    mode = kde.support[np.argmax(kde.density)]
    modes.append(mode)
complete_pooling_summary_df['mode'] = modes

population_mean = (df['Current hits'] / df['Current at-bats']).mean()
population_std = (df['Current hits'] / df['Current at-bats']).std()
x = (df['Current hits'] / df['Current at-bats']).values
posterior_mode = complete_pooling_summary_df['mode'].tolist() * df.shape[0]
posterior_upper_hdi = np.array(complete_pooling_summary_df['hdi_89%'].tolist() * df.shape[0])
posterior_lower_hdi = np.array(complete_pooling_summary_df['hdi_11%'].tolist() * df.shape[0])

complete_pooling_source = ColumnDataSource(
    {
        'x': x,
        'mode': posterior_mode,
        'upper_hdi': posterior_upper_hdi,
        'lower_hdi': posterior_lower_hdi,
        'lower_std': [population_mean - population_std] * df.shape[0],
        'upper_std': [population_mean + population_std] * df.shape[0],
        'name': df['Name'].values,
    }
)

complete_pooling_plot = figure(
    plot_width=500,
    plot_height=500,
    title='Complete-pooling',
    x_axis_label='Observed hits / at-bats',
    y_axis_label='Predicted chance of a hit',
    y_range=[0.05, 0.55],
    x_range=[0.14, 0.41],
)
mean_line = complete_pooling_plot.line(
    x=[0, 1],
    y=[population_mean, population_mean],
    line_color='orange',
    line_width=3,
    level='underlay',
    legend_label='Population mean',
)
std_band = Band(
    base='x',
    lower='lower_std',
    upper='upper_std',
    source=complete_pooling_source,
    level='underlay',
    fill_alpha=0.2,
    fill_color='orange',
    line_width=0.2,
    line_color='orange',
)
complete_pooling_plot.add_layout(std_band)
complete_pooling_whiskers = Whisker(
    base='x',
    upper='upper_hdi',
    lower='lower_hdi',
    source=complete_pooling_source,
    line_color='steelblue',
)
complete_pooling_whiskers.upper_head.line_color = 'steelblue'
complete_pooling_whiskers.lower_head.line_color = 'steelblue'
complete_pooling_plot.add_layout(complete_pooling_whiskers)
complete_pooling_glyph = complete_pooling_plot.circle(
    x='x',
    y='mode',
    source=complete_pooling_source,
    size=10,
    line_color='white',
    fill_color='steelblue',
    legend_label='Players',
)
complete_pooling_tooltips = HoverTool(
    renderers=[complete_pooling_glyph],
    tooltips=[
        ('Name', '@name'),
        ('Posterior Upper HDI', '@upper_hdi{0.000}'),
        ('Posterior Mode', '@mode{0.000}'),
        ('Posterior Lower HDI', '@lower_hdi{0.000}'),
    ],
)
complete_pooling_plot.add_tools(complete_pooling_tooltips)

complete_pooling_plot.legend.location = 'top_left'
complete_pooling_plot.legend.click_policy = 'mute'

show(complete_pooling_plot)

The above plot shows our predictions for each player's chance of hitting a ball
when at-bat, with the error bars showing the 89% HDI and the points showing the
posterior mode. The orange horizontal line shows the mean of the observed data,
and the orange band shows one standard deviation on either side of that mean.

Our data story said that every player is unique, but we have not captured that
information as the plot shows that every player is the same. The model also
tells us that poor hitters are just as likely to land a hit when at-bat as very
good hitters. Recall that at the beginning of this section we stated that this
type of model will overestimate the abilities of players who hit poorly while
underestimating the abilities of players who hit well. We will revisit this in
the next section, but take note that this is the case.

In hindsight our complete pooling model may seem trivial in that we
obtained the mean and its standard deviation by way of probabilistic
programming. Why would we do this when we can calculate the mean and the
standard deviation from the given data and get pretty close to the values we
would calculate using a PPL technique? Well, we can do better, but before we do
better we are going to make another model that is just as bad as our complete
pooling one. The reason why we do this will become evident when we get to our
third and final model, the [partial-pooling
model](#partial-pooling).
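
The direct calculation mentioned above can be sketched in a few lines of
standard-library Python. The variable names are illustrative; the hit counts
are copied from the data table. The sketch separates two different
quantities: the spread of the individual batting averages, and the Binomial
standard error of the pooled mean.

```python
import statistics

# Hits in the first 45 at-bats for the 18 players, copied from the data table.
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
K = 45

# Observed batting average for each player.
averages = [h / K for h in hits]

# Pooled estimate: total hits over total at-bats.
pooled_mean = sum(hits) / (K * len(hits))

# Spread of the individual batting averages (sample standard deviation).
spread = statistics.stdev(averages)

# Binomial standard error of the pooled estimate.
standard_error = (pooled_mean * (1 - pooled_mean) / (K * len(hits))) ** 0.5

print(f"pooled mean       = {pooled_mean:.4f}")
print(f"spread of players = {spread:.4f}")
print(f"standard error    = {standard_error:.4f}")
```

Because every player has the same number of at-bats, the pooled mean equals
the mean of the individual averages.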

<a id="no-pooling"></a>
### No-pooling model

`No-pooling` is the polar opposite of `complete-pooling`: instead of
treating each player identically, we will treat each player as having their own
separate chance of landing a hit when at-bat. Essentially we will be creating a
model for each player and each model will have no influence on any of the other
models. We will see later on that choosing a `no-pooling` model overestimates
top batters' abilities while underestimating poor batters' abilities. Note that
this is the opposite of what the `complete-pooling` model did, and it also
conflicts with our data story, which says players are _similar_, not entirely
unique.

<center>
  <img src="img/no-pooling.svg">
</center>

$$
\begin{align}
  \theta&\sim\text{Beta}(1, 1)\\
  y&\sim\text{Binomial}(K, \theta)
\end{align}
$$

The above graph shows what a `no-pooling` model does. Each player has a
separate model (the $\theta$s) and no information about a player is shared
within the group. This type of model assumes every player is uniquely
different.

We will use Bean Machine to create this type of model as follows.

In [None]:
@bm.random_variable
def theta(i: int) -> RVIdentifier:
    """An individual player's ability for landing a hit when at-bat.

    Parameters
    ----------
    i : int
        An index.
    """
    return dist.Beta(1, 1)


@bm.random_variable
def y(i: int, K: int) -> RVIdentifier:
    """An individual player's chances of hitting a ball when at-bat.

    Parameters
    ----------
    i : int
        An index.
    K : int
        Number of trials (at-bats).
    """
    return dist.Binomial(K, theta(i))

The code for the `no-pooling` model is quite similar to the code for the
`complete-pooling` model. The difference is that the ability parameter now
carries an index $i$: each $y$ depends on its own $\theta$ rather than on a
single shared $\phi$. The index value is used to create a family of
distributions, which is what we need since we are now creating 18 different
models, one for each player we have data for.
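
Since each player gets an independent $\text{Beta}(1, 1)$ prior, conjugacy
gives every player a closed-form posterior,
$\text{Beta}(1 + y_i,\ 1 + K_i - y_i)$. The sketch below (standard library
only, with illustrative variable names and hit counts copied from the data
table) shows that each posterior mode is exactly the player's observed
batting average, so nothing is borrowed from the other players.

```python
# Hits in the first 45 at-bats for the 18 players, copied from the data table.
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
K = 45

# One independent Beta posterior per player: (a, b) = (1 + hits, 1 + misses).
posteriors = [(1 + h, 1 + K - h) for h in hits]

# Closed-form mode and mean of each Beta(a, b) posterior.
modes = [(a - 1) / (a + b - 2) for a, b in posteriors]
means = [a / (a + b) for a, b in posteriors]

for h, mode, mean in zip(hits, modes, means):
    print(f"hits = {h:2d}: observed = {h / K:.3f}, "
          f"mode = {mode:.3f}, mean = {mean:.3f}")
```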

#### No-pooling inference

Just like before we will use the `GlobalNoUTurnSampler` inference
method. We construct the observations dictionary similarly to how we
constructed it in the `complete-pooling` model, except now we will construct
the dictionary in a more compact manner.

In [None]:
at_bats = [
    y(i, at_bat)
    for i, at_bat in enumerate(df['Current at-bats'].astype(int).tolist())
]
no_pooling_observations = dict(zip(at_bats, hits))

The queries we are interested in from this model are all the $\theta$'s as they
are the distributions for each player's individual ability. We created a family
of distributions using Bean Machine by adding the index $i$ to each random
variable. Now we will use that index to our advantage and use it to create a
list of queries, all of which are the $\theta$'s.

In [None]:
no_pooling_queries = [theta(i) for i in df.index]
no_pooling_samples = bm.GlobalNoUTurnSampler().infer(
    queries=no_pooling_queries,
    observations=no_pooling_observations,
    num_samples=3000,
    num_chains=4,
    num_adaptive_samples=1500,
)

#### No-pooling analysis

Again we will start our analysis of our no pooling model with summary statistics
and trace plots.

In [None]:
no_pooling_theta_traces = bm.Diagnostics(no_pooling_samples).plot(display=True)

All the trace plots look like they have mixed well. Next we check the $\hat{R}$
and $N_\text{eff}$ values for each $\theta$.

In [None]:
no_pooling_diagnostics = bm.Diagnostics(no_pooling_samples)
data = []
for index, name in df['Name'].items():
    data.append(
        {
            'query': f'θ[{name}]',
            'r_hat': no_pooling_diagnostics.split_r_hat(query_list=[theta(index)]).values[0][0],
            'n_eff': no_pooling_diagnostics.effective_sample_size(query_list=[theta(index)]).values[0][0],
        }
    )
no_pooling_bm_summary = pd.DataFrame.from_dict(data).set_index('query')
no_pooling_bm_summary

The above values for each model look great. Next we plot the posterior values.

In [None]:
no_pooling_data = {
    f'θ[{name}]': no_pooling_samples[theta(i)].numpy()
    for i, name in df['Name'].items()
}
no_pooling_query_plot = az.plot_posterior(no_pooling_data)

Both the trace plots and the summary statistics show that we have good mixing
between the chains. Just like in the complete pooling model analysis, we will
plot each of the player's abilities on a scatter plot.

In [None]:
no_pooling_hdis = az.hdi(
    no_pooling_data,
    hdi_prob=0.89,
).to_dataframe().T.rename(columns={'lower': 'hdi_11%', 'higher': 'hdi_89%'})
no_pooling_summary_df = az.summary(
    no_pooling_data,
    round_to=6,
    stat_funcs={'median': np.median},
    extend=True,
).drop(['hdi_5.5%', 'hdi_94.5%'], axis=1).join(no_pooling_hdis)
no_pooling_summary_df['mode'] = np.nan
modes = []
for no_pooling_query in no_pooling_queries:
    data = no_pooling_samples[no_pooling_query].reshape(-1,).numpy()
    kde = sm.nonparametric.KDEUnivariate(data)
    kde.fit()
    mode = kde.support[np.argmax(kde.density)]
    modes.append(mode)
no_pooling_summary_df['mode'] = modes

x = (df['Current hits'] / df['Current at-bats']).values
posterior_mode = no_pooling_summary_df['mode'].values
posterior_upper_hdi = no_pooling_summary_df['hdi_89%']
posterior_lower_hdi = no_pooling_summary_df['hdi_11%']

no_pooling_source = ColumnDataSource(
    {
        'x': x,
        'mode': posterior_mode,
        'upper_hdi': posterior_upper_hdi,
        'lower_hdi': posterior_lower_hdi,
        'name': df['Name'].values,
    }
)

no_pooling_plot = figure(
    plot_width=500,
    plot_height=500,
    title='No pooling',
    x_axis_label='Observed hits / at-bats',
    y_axis_label='Predicted chance of a hit',
    x_range=[0.14, 0.41],
    y_range=[0.05, 0.55],
)

mean_line = no_pooling_plot.line(
    x=[0, 1],
    y=[population_mean, population_mean],
    line_color='orange',
    line_width=3,
    level='underlay',
    legend_label='Population mean',
)

straight_line = no_pooling_plot.line(
    x=x,
    y=(df['Current hits'] / df['Current at-bats']).values,
    line_color='grey',
    line_alpha=0.7,
    line_width=2.0,
    legend_label='Current hits / Current at-bats',
)

no_pooling_whiskers = Whisker(
    base='x',
    upper='upper_hdi',
    lower='lower_hdi',
    source=no_pooling_source,
    line_color='steelblue',
)
no_pooling_whiskers.upper_head.line_color = 'steelblue'
no_pooling_whiskers.lower_head.line_color = 'steelblue'
no_pooling_plot.add_layout(no_pooling_whiskers)

no_pooling_glyph = no_pooling_plot.circle(
    x='x',
    y='mode',
    source=no_pooling_source,
    size=10,
    line_color='white',
    fill_color='steelblue',
    legend_label='Players',
)

no_pooling_tooltips = HoverTool(
    renderers=[no_pooling_glyph],
    tooltips=[
        ('Name', '@name'),
        ('Posterior Upper HDI', '@upper_hdi{0.000}'),
        ('Posterior Mode', '@mode{0.000}'),
        ('Posterior Lower HDI', '@lower_hdi{0.000}'),
    ],
)
no_pooling_plot.add_tools(no_pooling_tooltips)

no_pooling_plot.add_layout(std_band)

no_pooling_plot.legend.location = 'top_left'
no_pooling_plot.legend.click_policy = 'mute'

show(gridplot([[complete_pooling_plot, no_pooling_plot]]))

In the above plot we show the predicted chance of landing a hit when at-bat for
our first two models. As we stated in the [complete-pooling](#complete-pooling)
section, that model overestimates weak-hitting players' abilities and
underestimates good-hitting players' abilities. This effect is quite apparent
when compared with the no-pooling plot to the right. In the no-pooling plot we
again plot the posterior mode for each player, and the error bars show the 89%
highest density interval (HDI) of the posterior. The
grey line is discussed below.
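
To make the error bars concrete, here is a minimal sketch of how a highest
density interval can be computed from posterior draws: sort the draws and keep
the narrowest window that holds the requested share of them. This is similar in
spirit to what `az.hdi` does for unimodal posteriors, but the function
`hdi_sketch` and the synthetic draws are assumptions made for illustration.

```python
import math


def hdi_sketch(samples, prob=0.89):
    """Shortest interval containing roughly `prob` of the samples (illustrative only)."""
    ordered = sorted(samples)
    n = len(ordered)
    # Number of steps spanned by a window that covers `prob` of the draws.
    span = math.floor(prob * n)
    # Slide the window across the sorted draws and keep the narrowest one.
    widths = [ordered[i + span] - ordered[i] for i in range(n - span)]
    start = widths.index(min(widths))
    return ordered[start], ordered[start + span]


# Hypothetical right-skewed draws: squaring an even grid piles mass up near zero.
draws = [(i / 1000) ** 2 for i in range(1000)]
lower, upper = hdi_sketch(draws)
print(lower, upper)
```

For a skewed posterior like this one, the HDI hugs the dense region near zero,
whereas a central interval would cut the same share of draws from each tail.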

The players' abilities for the no-pooling model show a wide spread from 0.15 to
0.399. It turns out that a batting average in the high .300s is quite rare for
MLB players, see [Wikipedia (All-time-players)](#references) for historical
context. So, we in fact see that this model does what we initially said it
would: it underestimates poor players' ability to hit when at-bat and
overestimates highly successful players' ability when at-bat. Recall that this
is the exact opposite of what the complete-pooling model did.

So far we have not done a good job of telling our data story through our models.
Our complete-pooling model said every player was the same, while our no-pooling
model said all players are uniquely different. However, our data story said
that all players are different, but similar. What we have done so far is to
model the extreme cases for our data. In the next section we will combine these
two models so that we have a more accurate representation of players'
abilities and their chances of success when at-bat.

The grey line is found by setting

$$
y = \frac{\text{Current hits}}{\text{Current at-bats}}.
$$

Since this is also our x-axis, we have a straight line. Just as our
complete-pooling model recovered the population mean through probabilistic
programming, our no-pooling model has recovered each player's observed
average: the posteriors follow the line made by the above fraction, which can
be calculated directly from the data.
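
Why do the no-pooling estimates land on the grey line? A minimal sketch,
assuming the no-pooling prior on each $\theta$ is the uniform
$\text{Beta}(1, 1)$ (an assumption here; check it against the no-pooling model
definition): by Beta-Binomial conjugacy, the posterior for a player with $y$
hits in $K$ at-bats is $\text{Beta}(1 + y, 1 + K - y)$, and its mode is exactly
$y / K$. The function name and the hit counts below are hypothetical.

```python
def beta_posterior_mode(hits, at_bats, prior_a=1.0, prior_b=1.0):
    """Mode of the Beta posterior for a binomial likelihood (illustrative only)."""
    # Conjugate update: add hits to the first shape, misses to the second.
    a = prior_a + hits
    b = prior_b + (at_bats - hits)
    # The mode formula (a - 1) / (a + b - 2) is valid when a > 1 and b > 1.
    if a <= 1 or b <= 1:
        raise ValueError("mode formula needs both shape parameters above 1")
    return (a - 1) / (a + b - 2)


# Hypothetical counts for three players, each with 45 at-bats.
for hits in (7, 12, 18):
    mode = beta_posterior_mode(hits, 45)
    print(f"{hits}/45 = {hits / 45:.3f}, posterior mode = {mode:.3f}")
```

The modes in the notebook are estimated from samples with a kernel density
estimate, so they approximate this closed-form value rather than match it
exactly.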

<a id="parital-pooling"></a>
### Partial-pooling model

Partial pooling combines the [complete-pooling](#complete-pooling) and
[no-pooling](#no-pooling) models into a hybrid that fits a separate model for
each player _and simultaneously_ estimates the population's abilities. Remember
we do not have any data on the MLB population, but we can still create
estimates of it. The partial-pooling model does a better job of estimating all
batters' abilities by using information about the population. It will also do a
better job of not underestimating poor hitters' abilities and not
overestimating good hitters' abilities.

<center>
  <img src="img/partial-pooling.svg">
</center>

The above image shows that each player has their own chance of hitting when
at-bat ($\theta$) and that each $\theta$ is being sampled from a population
distribution $\phi$.

In the partial-pooling model we are again modeling each player separately since
each player is given their own individual chance of getting a hit when at-bat
($\theta$) just like the no-pooling model.  The new thing about this model is
that we now connect each player's chance of getting a hit to a population
parameter $\phi$, which is similar to what we did in the complete-pooling
model. Our model is then


$$
\begin{align}
  \phi&\sim\text{Beta}(1, 1)\\
  \kappa&\sim\text{Pareto}(1, 1.5)\\
  \theta&\sim\text{Beta}(\phi\kappa, (1 - \phi)\kappa)\\
  y&\sim\text{Binomial}(K, \theta)
\end{align}
$$

From the graphical representation, we can see that a partial-pooling model
combines the complete- and no-pooling models. The new component of the model is
the $\kappa$ parameter, which is sampled from a `Pareto` distribution. $\kappa$
has been introduced because the `Beta` prior on $\theta$, each player's chance
of getting a hit when at-bat, has two parameters, $\alpha = \phi\kappa$ and
$\beta = (1 - \phi)\kappa$, which both need to be estimated. For a more
thorough discussion about $\kappa$, see [Carpenter 2016](#references).
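
One way to read this parameterization is that $\phi$ sets the mean of the
`Beta` prior and $\kappa = \alpha + \beta$ sets how tightly the $\theta$s
cluster around that mean. The sketch below checks this with the closed-form
mean and variance of a `Beta` distribution; the helper names and the value of
$\phi$ are hypothetical.

```python
def beta_shapes(phi, kappa):
    """Convert a mean `phi` and concentration `kappa` into Beta shape parameters."""
    alpha = phi * kappa
    beta = (1 - phi) * kappa
    return alpha, beta


def beta_mean_and_variance(alpha, beta):
    """Closed-form mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


# Hypothetical population mean; larger kappa pulls every theta closer to phi.
phi = 0.27
for kappa in (2.0, 20.0, 200.0):
    alpha, beta = beta_shapes(phi, kappa)
    mean, variance = beta_mean_and_variance(alpha, beta)
    print(f"kappa={kappa:6.1f}  alpha={alpha:7.2f}  beta={beta:7.2f}  "
          f"mean={mean:.3f}  sd={variance ** 0.5:.3f}")
```

The mean stays at $\phi$ for every $\kappa$, while the standard deviation
shrinks as $\kappa$ grows.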

A few plots of the Pareto distribution with varying parameters are shown below.

In [None]:
plot = figure(
    plot_width=400,
    plot_height=400,
    title='Pareto distribution',
    x_axis_label='Support',
    x_range=[0, 5],
)
colors = ['steelblue', 'orange', 'brown', 'magenta']

for i, alpha in enumerate(np.linspace(start=1.5, stop=3, num=4)):
    pareto = dist.Pareto(tensor(1.), tensor(float(alpha)))
    pareto_samples = pareto.sample((N,))
    pareto_kde = sm.nonparametric.KDEUnivariate(pareto_samples)
    pareto_kde.fit()

    pareto_source = ColumnDataSource(
        {
            'x': pareto_kde.support,
            'y': pareto_kde.density / pareto_kde.density.max(),
        }
    )
    pareto_glyph = plot.line(
        x='x',
        y='y',
        source=pareto_source,
        line_color=colors[i],
        line_alpha=0.7,
        line_width=2,
        legend_label=f'α = {alpha}',
    )

plot.yaxis.major_tick_line_color = None
plot.yaxis.minor_tick_line_color = None
plot.yaxis.major_label_text_font_size = '0pt'
plot.outline_line_color = 'black'

show(plot)
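
The plotted curves are kernel density estimates of samples, so they may smooth
over the hard lower bound of the distribution. As a cross-check, here is a
minimal sketch of the closed-form Pareto density and CDF, using the same
`(scale, alpha)` ordering as `dist.Pareto`. The helper names are hypothetical.

```python
def pareto_pdf(x, scale=1.0, alpha=1.5):
    """Pareto density: zero below `scale`, a power-law tail above it."""
    if x < scale:
        return 0.0
    return alpha * scale ** alpha / x ** (alpha + 1)


def pareto_cdf(x, scale=1.0, alpha=1.5):
    """Probability that a Pareto draw is at most `x`."""
    if x < scale:
        return 0.0
    return 1.0 - (scale / x) ** alpha


for x in (0.5, 1.0, 2.0, 10.0, 100.0):
    print(f"x={x:6.1f}  pdf={pareto_pdf(x):.4f}  P(kappa > x)={1 - pareto_cdf(x):.4f}")
```

With a scale of 1 the density is zero below 1, which is why $\kappa$ is always
at least 1, and the heavy tail still leaves room for large values of $\kappa$.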

We can code our partial-pooling model easily in Bean Machine. All we need to do
is reintroduce $\phi$ as our estimate of the population's ability to hit a ball
when at-bat, add $\kappa$, and update our model for each individual's ability
to hit when at-bat.

In [None]:
@bm.random_variable
def phi() -> RVIdentifier:
    """The population's overall abiltity to hit a ball when at-bat."""
    return dist.Beta(1, 1)


@bm.random_variable
def kappa() -> RVIdentifier:
    """Hyperprior for theta."""
    return dist.Pareto(1, 1.5)


@bm.random_variable
def theta(i: int) -> RVIdentifier:
    """An individual's ability to hit a ball when at-bat.

    Parameters
    ----------
    i : int
        An index.
    """
    alpha = phi() * kappa()
    beta = (1 - phi()) * kappa()
    return dist.Beta(alpha, beta)


@bm.random_variable
def y(i: int, K: int) -> RVIdentifier:
    """Chance of hitting when at-bat.

    Parameters
    ----------
    i : int
        An index.
    K : int
        Number of trials (at-bats).
    """
    return dist.Binomial(K, theta(i))
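
The model above can also be read as a recipe for generating data. The sketch
below follows that recipe once with Python's standard `random` module instead
of Bean Machine, so it performs no inference; the function name, the seed, and
the at-bat counts are assumptions made for illustration.

```python
import random


def simulate_partial_pooling(at_bats, seed=0):
    """Draw one synthetic data set from the partial-pooling model (illustrative only)."""
    rng = random.Random(seed)
    phi = rng.betavariate(1, 1)        # population chance of a hit
    kappa = rng.paretovariate(1.5)     # Pareto with scale 1 and shape 1.5
    players = []
    for trials in at_bats:
        # Each player's ability is drawn around the population value.
        theta = rng.betavariate(phi * kappa, (1 - phi) * kappa)
        # Binomial draw written as a sum of Bernoulli trials.
        hits = sum(rng.random() < theta for _ in range(trials))
        players.append((theta, hits))
    return phi, kappa, players


# Hypothetical schedule: five players with 45 at-bats each.
phi, kappa, players = simulate_partial_pooling([45] * 5)
print(f"phi={phi:.3f}, kappa={kappa:.2f}")
for theta, hits in players:
    print(f"theta={theta:.3f}, hits={hits}/45")
```

Inference runs this story in reverse: given the observed hits, the sampler
looks for values of $\phi$, $\kappa$, and each $\theta$ that could plausibly
have produced them.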

#### Partial-pooling inference

Again we construct our observations into a dictionary, and sample using the
`GlobalNoUTurnSampler`.

In [None]:
partial_pooling_observations = dict(zip(at_bats, hits))

In [None]:
partial_pooling_queries = [kappa(), phi()] + [theta(i) for i in df.index]
partial_pooling_samples = bm.GlobalNoUTurnSampler().infer(
    queries=partial_pooling_queries,
    observations=partial_pooling_observations,
    num_samples=3000,
    num_chains=4,
    num_adaptive_samples=1500,
)

#### Partial-pooling analysis

All the trace plots and their autocorrelation plots look good. You should
investigate the trace plot for $\kappa$ and notice the differences between its
trace plot and the trace plots for the $\theta$s. Recall that we required
$\kappa$ to always be greater than 1.

In [None]:
partial_pooling_traces = bm.Diagnostics(partial_pooling_samples).plot(display=True)

Again we will look at the $\hat{R}$ and $N_{eff}$ values for our model.

In [None]:
partial_pooling_diagnostics = bm.Diagnostics(partial_pooling_samples)
partial_pooling_summary_data = []
partial_pooling_summary_data.append(
    {
        'query': 'κ',
        'r_hat': partial_pooling_diagnostics.split_r_hat(query_list=[kappa()]).values[0][0],
        'n_eff': partial_pooling_diagnostics.effective_sample_size(query_list=[kappa()]).values[0][0],
    }
)
partial_pooling_summary_data.append(
    {
        'query': 'φ',
        'r_hat': partial_pooling_diagnostics.split_r_hat(query_list=[phi()]).values[0][0],
        'n_eff': partial_pooling_diagnostics.effective_sample_size(query_list=[phi()]).values[0][0],
    }
)
for index, name in df['Name'].items():
    partial_pooling_summary_data.append(
        {
            'query': f'θ[{name}]',
            'r_hat': partial_pooling_diagnostics.split_r_hat(query_list=[theta(index)]).values[0][0],
            'n_eff': partial_pooling_diagnostics.effective_sample_size(query_list=[theta(index)]).values[0][0],
        }
    )
partial_pooling_bm_summary_df = pd.DataFrame.from_dict(partial_pooling_summary_data).set_index('query')
partial_pooling_bm_summary_df

These all look good so we will investigate the shapes of the posteriors.

In [None]:
partial_pooling_posteriror_data = {
    'κ': partial_pooling_samples[kappa()].numpy(),
    'φ': partial_pooling_samples[phi()].numpy(),
}
for i, name in df['Name'].items():
    partial_pooling_posteriror_data.update({f'θ[{name}]': partial_pooling_samples[theta(i)].numpy()})
partial_pooling_query_plot = az.plot_posterior(partial_pooling_posteriror_data)

All the posteriors look good as well. We will plot the results of our
partial-pooling model in a scatter plot, just like we did for the other two
models.

In [None]:
partial_pooling_hdis = az.hdi(
    partial_pooling_posteriror_data,
    hdi_prob=0.89,
).to_dataframe().T.rename(columns={'lower': 'hdi_11%', 'higher': 'hdi_89%'})
partial_pooling_summary_df = az.summary(
    partial_pooling_posteriror_data,
    round_to=6,
    stat_funcs={'median': np.median},
    extend=True,
).drop(['hdi_5.5%', 'hdi_94.5%'], axis=1).join(partial_pooling_hdis)
partial_pooling_summary_df['mode'] = np.nan
modes = []
for partial_pooling_query in partial_pooling_queries:
    data = partial_pooling_samples[partial_pooling_query].reshape(-1,).numpy()
    kde = sm.nonparametric.KDEUnivariate(data)
    kde.fit()
    mode = kde.support[np.argmax(kde.density)]
    modes.append(mode)
partial_pooling_summary_df['mode'] = modes

partial_pooling_source = ColumnDataSource(
    {
        'x': x,
        'mode': partial_pooling_summary_df.reset_index().loc[2:, 'mode'].values,
        'upper_hdi': partial_pooling_summary_df.reset_index().loc[2:, 'hdi_89%'],
        'lower_hdi': partial_pooling_summary_df.reset_index().loc[2:, 'hdi_11%'],
        'name': df['Name'].values,
    }
)

partial_pooling_plot = figure(
    plot_width=500,
    plot_height=500,
    title='Partial pooling',
    x_axis_label='Observed hits / at-bats',
    y_axis_label='Predicted chance of a hit',
    x_range=[0.14, 0.41],
    y_range=[0.05, 0.55],
)

mean_line = partial_pooling_plot.line(
    x=[0, 1],
    y=[population_mean, population_mean],
    line_color='orange',
    line_width=3,
    level='underlay',
    legend_label='Population mean',
)

straight_line = partial_pooling_plot.line(
    x=x,
    y=(df['Current hits'] / df['Current at-bats']).values,
    line_color='grey',
    line_alpha=0.7,
    line_width=2.0,
    legend_label='Current hits / Current at-bats',
)

partial_pooling_whiskers = Whisker(
    base='x',
    upper='upper_hdi',
    lower='lower_hdi',
    source=partial_pooling_source,
    line_color='steelblue',
)
partial_pooling_whiskers.upper_head.line_color = 'steelblue'
partial_pooling_whiskers.lower_head.line_color = 'steelblue'
partial_pooling_plot.add_layout(partial_pooling_whiskers)

partial_pooling_glyph = partial_pooling_plot.circle(
    x='x',
    y='mode',
    source=partial_pooling_source,
    size=10,
    line_color='white',
    fill_color='steelblue',
    legend_label='Players',
)
partial_pooling_tooltips = HoverTool(
    renderers=[partial_pooling_glyph],
    tooltips=[
        ('Name', '@name'),
        ('Posterior Upper HDI', '@upper_hdi{0.000}'),
        ('Posterior Mode', '@mode{0.000}'),
        ('Posterior Lower HDI', '@lower_hdi{0.000}'),
    ],
)
partial_pooling_plot.add_tools(partial_pooling_tooltips)
partial_pooling_plot.add_layout(std_band)


partial_pooling_plot.legend.location = 'top_left'
partial_pooling_plot.legend.click_policy = 'mute'

show(
    gridplot(
        [
            [complete_pooling_plot, no_pooling_plot],
            [partial_pooling_plot],
        ]
    )
)

The partial-pooling model has shifted our predictions. In fact, it has captured
our data story, which stated that MLB players are unique (they do not lie on the
population mean orange line) and yet they are not all that different (they do
not fall on the grey line). What we see is that players' chances of hitting a
ball when at-bat have moved closer to the mean line from their positions in the
no-pooling model.

In [None]:
partial_pooling_tomean_plot = figure(
    plot_width=500,
    plot_height=500,
    title='Partial pooling shift',
    x_axis_label='Observed hits / at-bats',
    y_axis_label='Predicted chance of a hit',
    x_range=[0.14, 0.41],
    y_range=[0.05, 0.55],
)
mean_line = partial_pooling_tomean_plot.line(
    x=[0, 1],
    y=[population_mean, population_mean],
    line_color='orange',
    line_width=3,
    level='underlay',
    legend_label='Population mean',
)
straight_line = partial_pooling_tomean_plot.line(
    x=x,
    y=(df['Current hits'] / df['Current at-bats']).values,
    line_color='grey',
    line_alpha=0.7,
    line_width=2.0,
    legend_label='Current hits / Current at-bats',
)

np_glyph = partial_pooling_tomean_plot.circle(
    x='x',
    y='mode',
    source=no_pooling_source,
    size=10,
    line_color='steelblue',
    fill_color='white',
    legend_label='No-pooling',
)
partial_pooling_tomean_plot.add_layout(partial_pooling_glyph)
for i, name in df['Name'].items():
    partial_pooling_tomean_plot.add_layout(
        Arrow(
            end=VeeHead(size=10),
            x_start=no_pooling_source.data['x'][i],
            y_start=no_pooling_source.data['mode'][i],
            x_end=partial_pooling_source.data['x'][i],
            y_end=partial_pooling_source.data['mode'][i],
        )
    )

partial_pooling_tomean_plot.add_layout(std_band)
partial_pooling_tomean_plot.legend.location = 'top_left'
partial_pooling_tomean_plot.legend.click_policy = 'mute'

show(partial_pooling_tomean_plot)
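
The direction of the arrows can be understood with a short calculation. If
$\phi$ and $\kappa$ were known, Beta-Binomial conjugacy would give each player
the posterior $\text{Beta}(\phi\kappa + y, (1 - \phi)\kappa + K - y)$, whose
mean is a weighted average of the population value $\phi$ and the observed
average $y / K$. The sketch below uses hypothetical values for $\phi$ and
$\kappa$; the actual model infers both and averages over their uncertainty, and
the plot shows posterior modes rather than means.

```python
def shrunk_estimate(hits, at_bats, phi, kappa):
    """Posterior mean of theta for one player when phi and kappa are held fixed."""
    weight = kappa / (kappa + at_bats)   # how much the population value counts
    observed = hits / at_bats
    return weight * phi + (1 - weight) * observed


# Hypothetical values for phi and kappa; the notebook infers both from the data.
phi, kappa = 0.27, 100.0
for hits in (7, 12, 18):
    observed = hits / 45
    estimate = shrunk_estimate(hits, 45, phi, kappa)
    print(f"observed={observed:.3f} -> shrunk={estimate:.3f}")
```

Every estimate sits between the observed average and $\phi$, and a larger
$\kappa$ or fewer at-bats pulls it further toward $\phi$.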

## Conclusion

To sum up

* Our complete-pooling model

  * overestimated a player's chance of landing a hit if the player hits poorly
  * underestimated a player's chance of landing a hit if the player hits well
  * gave every player the same estimate, centered on the population mean.


* Our no-pooling model

  * underestimated a player's chance of landing a hit if the player hits poorly
  * overestimated a player's chance of landing a hit if the player hits well.


* Our partial-pooling model

  * gave reasonable estimates of a player's chance of landing a hit regardless
    of whether they hit poorly or well, by pulling each no-pooling estimate
    toward the population mean.


With our limited data we were able to create a model that gives us a player's
chance of success when at-bat. Bayesian inference has given us a way to make
more accurate predictions about a player's chance of success with a small data
set and Bean Machine made creating those models very easy.

<a id="references"></a>
## References

* Carpenter B (2016) Hierarchical Partial Pooling for Repeated Binary Trials.
  [https://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html](https://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html)
* Efron B and Morris C (1975) Data analysis using Stein's estimator and its
  generalizations. _Journal of the American Statistical Association_ **70**(350),
  311–319 [doi: 10.1080/01621459.1975.10479864](https://dx.doi.org/10.1080/01621459.1975.10479864).
* Efron B and Morris C (1977) Stein's Paradox in Statistics. _Scientific
  American_ **236**(5), 119–127 [JSTOR](https://www.jstor.org/stable/24954030).
* McElreath R (2020) **Statistical Rethinking: A Bayesian Course with Examples
  in R and Stan** 2nd edition. Chapman and Hall/CRC.
  [doi: 10.1201/9780429029608](https://dx.doi.org/10.1201/9780429029608)
* $N_{\text{eff}}$ [MCMC Handbook](https://www.mcmchandbook.net/HandbookChapter1.pdf)
* $\hat{R}$ [Project Euclid](https://projecteuclid.org/euclid.ss/1177011136)
* Tarone RE (1982) The use of historical control information in testing for a
  trend in proportions. _Biometrics_ **38**(1):215–220
  [doi: 10.2307/2530304](https://doi.org/10.2307/2530304)
  [doi: 10.1214/20-BA1221](https://dx.doi.org/10.1214/20-BA1221)
* [Wikipedia (All-time-players)](https://en.wikipedia.org/wiki/Batting_average_(baseball)#All-time_leaders)
* [Wikipedia (Bernoulli trials)](https://en.wikipedia.org/wiki/Bernoulli_trial)
* [Wikipedia (Binomial)](https://en.wikipedia.org/wiki/Binomial_distribution)
* [Wikipedia (Major League Baseball)](https://en.wikipedia.org/wiki/Major_League_Baseball)