In [None]:
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

This notebook shows how to do hierarchical modelling with Binomially-distributed random variables.

# Problem Setup

Baseball players have many metrics measured for them. Suppose we are on a baseball team and would like to quantify player performance. One such metric is the batting average, defined as the number of times a batter hit a pitched ball divided by the number of times they were up for batting ("at bat"). How would you go about this task?

We first need some batting data: for each player, the number of times they have batted and the number of times they hit the ball while batting. Let's see an example dataset below.

In [None]:
df = pd.read_csv('../datasets/baseballdb/core/Batting.csv')
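# Treat zero at-bats as missing so that H / AB never divides by zero.
df['AB'] = df['AB'].replace(0, np.nan)
# Note: this drops rows with a missing value in any column, not just 'AB'.
df = df.dropna()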
df['batting_avg'] = df['H'] / df['AB']
df = df[df['yearID'] >= 2016]
df = df.iloc[0:15]  # select out only the first 15 players, just for illustration purposes.
df.head(5)

In this dataset, the columns `AB` and `H` are the most relevant.

- `AB` is the number of times a player was **A**t **B**at.
- `H` is the number of times a player **h**it the ball while batting.

The performance of a player can be defined by their batting percentage: the number of hits divided by the number of times at bat. (Technically, a percentage should run from 0 to 100, but American sportspeople are apparently not very strict with how they approach these definitions.)

# Model 1: Naive Model

One model we can write assumes that each player has a batting percentage that is independent of the other players in the dataset.

A pictorial view of the model is as follows:

![](../images/baseball-model.jpg)
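
As a sketch, and assuming the diagram matches the PyMC3 code in the next cell, the same model can be written in equations. Here $i$ indexes the rows of `df`, and the subscripted symbols $\theta_i$, $H_i$ and $AB_i$ are notation introduced for this note:

$$
\begin{aligned}
\theta_i &\sim \text{Beta}(\alpha = 0.5,\ \beta = 0.5) \\
H_i &\sim \text{Binomial}(n = AB_i,\ p = \theta_i)
\end{aligned}
$$

Each $\theta_i$ has its own prior and sees only its own data, so nothing learned about one player informs the estimate for another.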

Let's implement this model in PyMC3.

In [None]:
with pm.Model() as baseline_model:
    thetas = pm.Beta("thetas", alpha=0.5, beta=0.5, shape=(len(df)))
    like = pm.Binomial('likelihood', n=df['AB'], p=thetas, observed=df['H'])

In [None]:
with baseline_model:
    baseline_trace = pm.sample(2000, njobs=1)

Let's view the posterior distribution traces.

In [None]:
traceplot = pm.traceplot(baseline_trace)

The chains look like they have converged. Starting from a $\text{Beta}(\alpha=0.5, \beta=0.5)$ prior, the players for whom we have only one at-bat end up with very wide posterior distributions.
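
Because the Beta prior is conjugate to the Binomial likelihood, each player's posterior in this model is also available in closed form: $\text{Beta}(0.5 + H,\ 0.5 + AB - H)$. The sketch below uses that fact to compare interval widths without running a sampler. The function name `beta_posterior_interval` and the two (hits, at-bats) pairs are made up for illustration and are not taken from the dataset.

```python
import numpy as np


def beta_posterior_interval(hits, at_bats, alpha=0.5, beta=0.5,
                            mass=0.95, draws=100_000, seed=0):
    """Monte Carlo estimate of a central posterior interval for one player.

    With a Beta(alpha, beta) prior and a Binomial likelihood, the posterior
    is Beta(alpha + hits, beta + at_bats - hits).
    """
    rng = np.random.default_rng(seed)
    samples = rng.beta(alpha + hits, beta + at_bats - hits, size=draws)
    tail = 100 * (1 - mass) / 2
    lower, upper = np.percentile(samples, [tail, 100 - tail])
    return lower, upper


# Hypothetical players: one with a single at-bat, one with many.
few = beta_posterior_interval(hits=0, at_bats=1)
many = beta_posterior_interval(hits=30, at_bats=100)

print("1 at-bat interval width:   ", few[1] - few[0])
print("100 at-bats interval width:", many[1] - many[0])
```

The interval for the single at-bat player should come out much wider than the one for the player with 100 at-bats, which is consistent with the wide traces seen above.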

In [None]:
import theano.tensor as tt

with pm.Model() as baseball_model:
    
    phi = pm.Uniform('phi', lower=0.0, upper=1.0)
    kappa_log = pm.Exponential('kappa_log', lam=1.5)
    kappa = pm.Deterministic('kappa', tt.exp(kappa_log))

    thetas = pm.Beta('thetas', alpha=phi*kappa, beta=(1.0-phi)*kappa, shape=len(df))
    like = pm.Binomial('like', n=df['AB'], p=thetas, observed=df['H'])
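
Read as equations, the cell above appears to be the hierarchical version of the model: the players' $\theta_i$ now share a common Beta prior whose parameters are themselves estimated from the data. The subscripted notation is introduced here for illustration:

$$
\begin{aligned}
\phi &\sim \text{Uniform}(0, 1) \\
\log \kappa &\sim \text{Exponential}(\lambda = 1.5) \\
\theta_i &\sim \text{Beta}(\alpha = \phi\kappa,\ \beta = (1 - \phi)\kappa) \\
H_i &\sim \text{Binomial}(n = AB_i,\ p = \theta_i)
\end{aligned}
$$

A $\text{Beta}(\phi\kappa,\ (1-\phi)\kappa)$ distribution has mean $\phi$, and a larger $\kappa$ concentrates it more tightly around that mean. So $\phi$ can be read as the population-level batting average and $\kappa$ as a measure of how similar players are to one another.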

In [None]:
with baseball_model:
    trace = pm.sample(2000, init='advi')

In [None]:
pm.traceplot(trace)

In [None]:
ylabels = 'AB: ' + df['AB'].astype(str) + ', H: ' + df['H'].astype(str)
pm.forestplot(trace, varnames=['thetas'], ylabels=ylabels)
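
One way to read a forest plot like this is through shrinkage. Conditional on $\phi$ and $\kappa$, the posterior for each $\theta_i$ is $\text{Beta}(\phi\kappa + H_i,\ (1-\phi)\kappa + AB_i - H_i)$, whose mean is $(\phi\kappa + H_i)/(\kappa + AB_i)$: a weighted average of the group mean $\phi$ and the raw average $H_i / AB_i$. The sketch below evaluates that formula. The values of `phi`, `kappa`, the hits and the at-bats are hypothetical placeholders, not estimates from the trace, and `shrunken_mean` is a helper name made up for this note.

```python
import numpy as np


def shrunken_mean(hits, at_bats, phi, kappa):
    """Posterior mean of theta given fixed phi and kappa (conjugate update)."""
    hits = np.asarray(hits, dtype=float)
    at_bats = np.asarray(at_bats, dtype=float)
    return (phi * kappa + hits) / (kappa + at_bats)


# Hypothetical values, chosen only to illustrate the effect.
phi, kappa = 0.25, 20.0
hits = np.array([1, 0, 30])
at_bats = np.array([1, 4, 100])

raw = hits / at_bats
shrunk = shrunken_mean(hits, at_bats, phi, kappa)

for r, s, n in zip(raw, shrunk, at_bats):
    print(f"AB={n:3d}  raw={r:.3f}  shrunk={s:.3f}")
```

Players with few at-bats are pulled strongly towards `phi`, while players with many at-bats stay close to their raw average. In the fitted model, $\phi$ and $\kappa$ are themselves uncertain, so the actual posterior averages this effect over their plausible values.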