<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Conjugacy and the Beta-Binomial Model

_Authors: Kiefer Katovich (SF)_

---

### Learning Objectives
- Understand the concept of conjugacy and conjugate priors in Bayesian statistics.
- Set up an example of the beta-binomial model using a subscription probability example.
- Understand binomial likelihood and where it fits in Bayes’ theorem.
- Calculate the maximum likelihood estimate (MLE).
- Understand when and why the MLE point estimate is insufficient.
- Use the beta-binomial model to build our example in a Bayesian framework.
- Understand and visualize the beta distribution.
- Learn about the beta and gamma functions and where they fit in the beta distribution calculation.
- Mathematically derive the conjugacy relationship between the prior and posterior of the beta-binomial model.

### Lesson Guide
- [Introduction](#intro)
- [Review: The Binomial Distribution Probability Mass Function (PMF)](#pmf)
- [Modeling the $p$ Parameter Given Counts of Successes and Failures](#p)
- [The Binomial Likelihood](#likelihood)
- [The Maximum Likelihood Estimate (MLE) for $p$](#mle)
    - [The Likelihood Function](#likelihood-func)
    - [When the MLE Doesn't Make Sense](#nonsense)
- [Bayesian Modeling of the Parameter $p$ and the Beta Distribution](#beta)
- [The Beta PDF and the Beta Function](#beta-pdf)
- [The Gamma Function](#gamma)
- [Defining the Beta Function Using the Gamma Function](#beta-gamma)
- [Putting It All Together: The Beta as a Conjugate Prior to the Binomial Likelihood](#beta-conjugate)

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

sns.set_style("whitegrid")

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='intro'></a>
## Introduction
---

**Conjugacy** and conjugate priors are important concepts in Bayesian statistics. The essential idea is that the *posterior* distribution is guaranteed to have the same form as the *prior* distribution when the prior is a conjugate prior to the likelihood function.

There are many conjugate priors and posteriors. They’re extremely useful because they make the prior-posterior update algebraically solvable. When there is no conjugate prior, sampling techniques such as Markov chain Monte Carlo are often necessary.

This lecture covers the most classic conjugate prior scenario: the beta-binomial model. Binomial models are appropriate for binary events. The prior distribution on the probability of a binary event is a beta distribution, which is conjugate to the binomial likelihood. Therefore, we’re guaranteed to get a posterior distribution that is also a beta distribution.

If none of this makes sense right now, don't worry! We’ll be walking through this in great detail.

<a id='pmf'></a>
## Review: The Binomial Distribution Probability Mass Function
---

Recall that the number of successes in $n$ independent trials, each with the same probability of success, is modeled with the binomial distribution. The binomial distribution has the probability mass function:

### $$ P(k \;|\; n, p) = \binom{n}{k} p^k (1 - p)^{(n-k)} $$

$k$ is the number of successes.

$n$ is the number of total trials.

$p$ is the probability of success for each trial.

**We can plot the probability mass function for a given $n$ and $p$.**

In [4]:
# A:
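One possible way to fill in the cell above, as a minimal sketch. The values $n = 25$ and $p = 0.3$ are borrowed from later in this lesson; any other choice would work just as well.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Values borrowed from later in the lesson; any n and p would work here.
n, p = 25, 0.3

k_values = np.arange(0, n + 1)           # every possible number of successes
pmf = stats.binom(n, p).pmf(k_values)    # P(k | n, p) for each k

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(k_values, pmf)
ax.set_xlabel("k (number of successes)")
ax.set_ylabel("P(k | n, p)")
ax.set_title(f"Binomial PMF for n = {n}, p = {p}")
```

The bars sum to one because exactly one value of $k$ must occur.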

**If we change the probability of success, $p$ (or the total trials, $n$), we can see that the probability mass function changes — values of $k$ have different probabilities or likelihoods of occurring.**

In [None]:
# A:
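A sketch of one way to compare several success probabilities on the same axes. The particular values of $p$ (0.1, 0.3, and 0.7) are assumptions chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

n = 25
k_values = np.arange(0, n + 1)

fig, ax = plt.subplots(figsize=(8, 4))
pmfs = {}
for p in (0.1, 0.3, 0.7):                # illustrative success probabilities
    pmfs[p] = stats.binom(n, p).pmf(k_values)
    ax.plot(k_values, pmfs[p], marker="o", label=f"p = {p}")

ax.set_xlabel("k (number of successes)")
ax.set_ylabel("P(k | n, p)")
ax.set_title(f"Binomial PMF for n = {n} and several values of p")
ax.legend()
```

As $p$ grows, the bulk of the probability mass shifts toward larger values of $k$.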

<a id='p'></a>
## Modeling the $p$ Parameter Given Counts of Successes and Failures
---

Let's reframe this. 

Say that we were measuring visitors to our site and whether or not they chose to subscribe to our newsletter. We redefine $n$, $k$, and $p$ accordingly:

### $$ \begin{aligned} n &= \text{number of visitors to our website} \\
k &= \text{number of visitors who subscribed} \\
p &= \text{probability of a visitor subscribing (unknown)} \end{aligned}$$

Remember, now we’re _measuring_ $k$ subscribers out of the $n$ visitors. We can consider the measurement of subscribers to be our data.

At this point, we want to make an inference about the $p$ parameter — our probability of a visitor subscribing. We can talk about this in terms of Bayes' theorem:

### $$ P(p \;|\; data) = \frac{ P(data \;|\; p) }{ P(data) } P(p) $$

Or equivalently:

### $$ P(p \;|\; n,k) = \frac{ P(n,k \;|\; p) }{ P(n, k) } P(p) $$

Where we have:

### $$ \begin{aligned} 
P(p \;|\; n,k) &= \text{posterior} \\
P(n,k \;|\; p) &= \text{likelihood} \\
P(n,k) &= \text{marginal probability of the data} \\
P(p) &= \text{prior} 
\end{aligned} $$

<a id='likelihood'></a>
## The Binomial Likelihood
---

Let's start with the likelihood. The likelihood represents the probability of observing $k$ successes out of $n$ trials _given a probability of success, $p$._

This $p$ can be fixed — say at $p = 0.3$ — in which case we would evaluate the likelihood at exactly that point. We could also represent $p$ as a distribution over a range of possible $p$ values, evaluating the likelihood of what $p$ could be for all of our different hypotheses.

Let's start with a fixed value, $p = 0.3$. How do we evaluate the likelihood? As it turns out, the likelihood function has the same formula as the PMF we wrote above, because that formula answers the question, "What is the probability of $k$ successes given $n$ trials and probability of success $p$?" The only difference is one of perspective: for the likelihood, the data $n$ and $k$ are held fixed and $p$ is the quantity we vary. This is the term in the numerator of Bayes' theorem.

**We can use the binomial object initialized with $p = 0.3$ and $n = 25$ to find the likelihood value for a given $k$.**

In [None]:
# A:
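A minimal sketch of the evaluation described above. The observed count $k = 10$ is a hypothetical value, not one given in the lesson. The manual calculation is included only to show that the frozen binomial object's `pmf` matches the formula.

```python
import math

import scipy.stats as stats

n, p = 25, 0.3
binom_dist = stats.binom(n, p)    # binomial object with fixed n and p

k = 10                            # hypothetical observed number of successes
likelihood = binom_dist.pmf(k)    # P(k | n, p)

# The same value written out from the PMF formula.
manual = math.comb(n, k) * p ** k * (1 - p) ** (n - k)

print(f"likelihood from SciPy:   {likelihood:.6f}")
print(f"likelihood from formula: {manual:.6f}")
```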

<a id='mle'></a>
## The Maximum Likelihood Estimate (MLE) for $p$
---

If we were to _just_ focus on the likelihood part of Bayes' theorem, we could ask, "What is the value of the $p$ parameter that maximizes the value of the likelihood function?" This is precisely what we do in frequentist statistics to find the point estimate of a parameter.

Remember that frequentists have no interest in the prior or posterior beliefs about the probability of the parameter's value. Frequentists state that there is no probability associated with a parameter (such as our probability of subscription). Instead, there is one _true_ probability of subscription if we were to measure the entire population. 

Because we only take a sample of people, we may measure a probability of subscription that deviates from that true probability. Remember: In frequentist statistics, the data have a probability, rather than the parameter.

**For the binomial distribution, we can easily calculate the value for subscription rate $p$ that makes our observed data the most likely: It’s going to be the fraction of successes that we measured in our data.**

In [None]:
# A:
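As a sketch, the MLE is just the observed fraction of successes. The counts below ($k = 10$ subscribers out of $n = 25$ visitors) are hypothetical. The grid search is only a sanity check that no other candidate value of $p$ gives a larger likelihood.

```python
import numpy as np
import scipy.stats as stats

n, k = 25, 10                      # hypothetical visitors and subscribers
p_mle = k / n                      # fraction of successes

# Sanity check: evaluate the likelihood over a grid of candidate p values.
p_grid = np.linspace(0.01, 0.99, 99)
likelihoods = stats.binom.pmf(k, n, p_grid)
p_best_on_grid = p_grid[np.argmax(likelihoods)]

print(f"MLE (k / n):          {p_mle:.2f}")
print(f"best p on the grid:   {p_best_on_grid:.2f}")
```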

<a id='likelihood-func'></a>
### The Likelihood Function

We can also derive the MLE more formally. Our scenario here is simple, but for distributions and models that aren’t so simple, this becomes necessary.

**First, define the likelihood function $L$ (which is the same as the PMF):**

### $$ L(n, k \;|\; p) = \binom{n}{k} p^k (1 - p)^{(n-k)} $$

**Take the logarithm of this to get the log likelihood:**

### $$ LL(n, k \;|\; p) = \ln\binom{n}{k} + k \cdot \ln(p) + (n - k) \cdot \ln(1 - p) $$

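The log likelihood has nice properties. It turns products of very small probabilities into sums, which helps the computer avoid numerical underflow. It also turns our exponents into coefficients, which makes the derivative easier to find. Because the logarithm is monotonically increasing, the $p$ that maximizes the log likelihood also maximizes the likelihood.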

**Now, take the derivative of the log likelihood with respect to $p$ and set it to zero.** This will find the value of $p$ that maximizes the log likelihood (the log likelihood is concave in $p$, so the point where the derivative is zero is a maximum):

### $$ \begin{aligned}
\frac{\partial}{\partial p} LL(n, k \;|\; p) &= 0 \\
\frac{k}{p} - \frac{(n-k)}{(1-p)} &= 0 \\
\frac{(n-k)}{(1-p)} &= \frac{k}{p} \\
pn - pk &= k - pk \\
pn - pk + pk &= k \\
p &= \frac{k}{n} \\
\end{aligned}$$

**This distills down to what we calculated before: The fraction of users who subscribed is the maximum likelihood estimate for the subscription rate.**
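The derivation can also be checked numerically. The sketch below minimizes the negative log likelihood with `scipy.optimize.minimize_scalar` and compares the result to $k / n$. The counts ($n = 25$, $k = 10$) are hypothetical, and the bounds simply keep $p$ away from 0 and 1, where the logarithm is undefined.

```python
from scipy import optimize, stats

n, k = 25, 10    # hypothetical visitors and subscribers


def neg_log_likelihood(p):
    """Negative binomial log likelihood of p for the fixed data (n, k)."""
    return -stats.binom.logpmf(k, n, p)


result = optimize.minimize_scalar(
    neg_log_likelihood,
    bounds=(1e-6, 1 - 1e-6),
    method="bounded",
)
p_hat = result.x

print(f"numerical MLE: {p_hat:.4f}")
print(f"k / n:         {k / n:.4f}")
```

For models without a closed-form MLE, this kind of numerical optimization is the usual fallback.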

<a id='nonsense'></a>
### When the MLE Doesn’t Make Sense

Now, say we had $n = 5$ visitors to the site and, to our surprise, all of them subscribed ($k = 5$). Using the MLE for $p$, we would conclude that $p = 1.0$: A person has a 100-percent probability of subscribing when they reach our site.

This, of course, is a flawed conclusion. We’ve only measured five people! 

> **Note:** If we took the frequentist route, we would ask, "What is the probability of measuring a rate at least as extreme as $p = 1.0$ by chance when in fact the true rate is some predetermined null hypothesis value, say $H0_p = 0.3$?" This would be our p-value. (It is not the same thing as alpha, the type I error rate, which is the threshold we choose in advance to compare the p-value against.) With only five observations, such a test can tell us whether $0.3$ is plausible, but it still leaves a wide range of rates that are consistent with the data.

In [None]:
# A:
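A quick numerical look at the five-visitor scenario, as a sketch. It computes the MLE and then the probability of seeing five out of five subscriptions under a few candidate true rates. The candidate rates other than 0.3 are assumptions chosen for illustration.

```python
import scipy.stats as stats

n, k = 5, 5
p_mle = k / n    # MLE says every visitor subscribes

# Probability of observing 5 of 5 under several candidate true rates.
candidate_rates = [0.3, 0.5, 0.7, 0.9]
prob_all_subscribe = {p: stats.binom.pmf(k, n, p) for p in candidate_rates}

print(f"MLE: {p_mle}")
for p, prob in prob_all_subscribe.items():
    print(f"true rate {p}: P(5 of 5) = {prob:.4f}")
```

Even a true rate of 0.7 would produce five out of five about 17 percent of the time, so the data alone cannot pin $p$ down to exactly 1.0.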

<a id='beta'></a>

## Bayesian Modeling of the Parameter $p$ and the Beta Distribution
---

What if we took a Bayesian rather than frequentist approach?

Instead of thinking about the *data* as having a probability, we think of the *parameter*, $p$, as having a probability. In other words, different values of $p$ are more or less _plausible_. We will represent our beliefs about plausible values of $p$ with our prior distribution.

**The distribution that represents _a distribution of probabilities_ is the beta distribution. The beta distribution is parameterized by two values, $\alpha$ and $\beta$.**

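###  $$ Beta(\alpha, \beta): \quad
\begin{aligned}
\alpha &= \text{number of successes} + 1 \\
\beta &= \text{number of failures} + 1
\end{aligned} $$

(The $+ 1$ terms correspond to starting from a flat $Beta(1, 1)$ distribution before seeing any data.)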

**We can plot the beta distribution for the scenario in which we measured $k = 5$ out of $n = 5$.**

In [None]:
# A:
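One way to draw this distribution, as a sketch. With $k = 5$ successes and $n - k = 0$ failures, the rule above gives $\alpha = 6$ and $\beta = 1$. The variable name `beta_param` is just a choice made here to avoid confusion with SciPy's `beta` object.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

n, k = 5, 5
alpha = k + 1             # successes + 1
beta_param = n - k + 1    # failures + 1

p_grid = np.linspace(0, 1, 501)
density = stats.beta(alpha, beta_param).pdf(p_grid)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(p_grid, density)
ax.set_xlabel("p (probability of subscribing)")
ax.set_ylabel("density")
ax.set_title(f"Beta({alpha}, {beta_param}) after {k} of {n} visitors subscribed")
```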

**We can see from this distribution that the value of $p$ with the highest density is one. But remember, other probabilities are also plausible.**  Because of our low sample size, $n$, many values other than $p = 1.0$ still have substantial density.

**What if we measured 20 subscriptions out of 20 visitors?** Plot this scenario below to see how the beta distribution changes.

In [None]:
# A:
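A sketch that overlays the two scenarios and also reports the 5th percentile of each distribution as a rough lower bound on $p$. The percentile comparison is an addition for illustration and is not something the lesson asks for.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

p_grid = np.linspace(0, 1, 501)

# Both scenarios have zero failures, so beta = 1 in each case.
dist_5 = stats.beta(5 + 1, 0 + 1)      # 5 of 5 subscribed
dist_20 = stats.beta(20 + 1, 0 + 1)    # 20 of 20 subscribed

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(p_grid, dist_5.pdf(p_grid), label="5 of 5: Beta(6, 1)")
ax.plot(p_grid, dist_20.pdf(p_grid), label="20 of 20: Beta(21, 1)")
ax.set_xlabel("p (probability of subscribing)")
ax.set_ylabel("density")
ax.legend()

# Value of p below which only 5% of the probability mass lies.
lower_5 = dist_5.ppf(0.05)
lower_20 = dist_20.ppf(0.05)
print(f"5th percentile after 5 of 5:   {lower_5:.3f}")
print(f"5th percentile after 20 of 20: {lower_20:.3f}")
```

With more data, the distribution concentrates much more tightly near one.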

<a id='beta-pdf'></a>
## The Beta PDF and the Beta Function
---
You’re probably wondering how the beta distribution is defined. Formally, we define the probability density function of the beta distribution as:

### $$ PDF_{Beta}(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 u^{\alpha-1} (1-u)^{\beta-1}\, du} $$

Here, $x$ falls in the range `[0, 1]` and $u$ represents the values in that range over which we integrate.

In the denominator, we are integrating over the possible probabilities. The denominator of the PDF is actually called the beta function, not to be confused with the beta _distribution_. 

If this looks similar to the equation for the binomial likelihood above, that's because it is! The numerator has the same form as the binomial likelihood (without the binomial coefficient), with $\alpha - 1$ playing the role of the number of successes, $k$, and $\beta - 1$ playing the role of the number of failures, $n - k$. In the denominator, we're integrating that same expression over all possible probabilities.
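To make the definition concrete, the sketch below builds the beta PDF directly from the formula, using `scipy.integrate.quad` for the integral in the denominator, and compares it with SciPy's built-in version. The shape parameters $\alpha = 6$, $\beta = 1$ match the five-out-of-five scenario. The helper names are hypothetical.

```python
from scipy import integrate, stats

alpha, beta_param = 6, 1    # shape parameters from the 5-of-5 scenario


def unnormalized(u):
    """Numerator of the beta PDF: u^(alpha - 1) * (1 - u)^(beta - 1)."""
    return u ** (alpha - 1) * (1 - u) ** (beta_param - 1)


# Denominator: the beta function, computed by numerical integration.
normalizer, _ = integrate.quad(unnormalized, 0, 1)


def beta_pdf(x):
    """Beta PDF assembled from the definition."""
    return unnormalized(x) / normalizer


x = 0.9
print(f"from the definition: {beta_pdf(x):.6f}")
print(f"from scipy.stats:    {stats.beta(alpha, beta_param).pdf(x):.6f}")
```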

<a id='gamma'></a>
## The Gamma Function
---

There is another way to write the beta distribution using the gamma function. 

The gamma function is defined as:

###  $$ \Gamma(z) =
\begin{cases}
(z - 1)! & \text{when } z \text{ is a positive integer} \\
\int_0^{\infty} x^{z-1} e^{-x} \, dx & \text{when } z \text{ is a complex number with positive real part}
\end{cases} $$

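**We can plot a gamma distribution with shape parameter $z = 10$ using SciPy's gamma object.** Note that the gamma _distribution_ is not the gamma _function_: its density, $x^{z-1} e^{-x} / \Gamma(z)$, is the integrand above rescaled by $\Gamma(z)$ so that it integrates to one.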

In [None]:
# A:
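A sketch that touches both objects: `scipy.special.gamma` is the gamma function, and `scipy.stats.gamma` is the gamma distribution. The plotting range of 0 to 30 is an arbitrary choice made here.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import special, stats

z = 10
gamma_of_z = special.gamma(z)    # gamma function; for integer z this is (z - 1)!
print(f"Gamma({z}) = {gamma_of_z:.0f}")

x = np.linspace(0, 30, 601)
pdf = stats.gamma(a=z).pdf(x)    # gamma distribution with shape z and scale 1

# The same density written out: integrand of the gamma function over Gamma(z).
manual_pdf = x ** (z - 1) * np.exp(-x) / gamma_of_z

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, pdf)
ax.set_xlabel("x")
ax.set_ylabel("density")
ax.set_title(f"Gamma distribution with shape parameter {z}")
```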

<a id='beta-gamma'></a>
## Defining the Beta Function Using the Gamma Function
---

The gamma function is a generalization of the factorial function. The beta _function_ can also be written in terms of the gamma function:

### $$ Beta(\alpha, \beta) = \frac{ \Gamma (\alpha) \Gamma (\beta) }{\Gamma (\alpha + \beta) } = \int_0^1 u^{\alpha-1} (1-u)^{\beta-1}\, du $$

At this point, we can rewrite the beta _distribution_, or probability density function, like so:

### $$ PDF_{Beta}(x) = \frac{\Gamma (\alpha + \beta) }{ \Gamma (\alpha) \Gamma (\beta) }x^{\alpha-1}(1-x)^{\beta-1} $$
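The identity between the gamma-function form and the integral form can be checked numerically. The sketch below computes the beta function three ways for $\alpha = 3$, $\beta = 4$; these shape values are assumptions picked so that the exact answer, $2! \cdot 3! / 6! = 1/60$, is easy to verify by hand.

```python
from scipy import integrate, special

alpha, beta_param = 3, 4    # illustrative shape parameters

# 1. Ratio of gamma functions.
via_gamma = (
    special.gamma(alpha) * special.gamma(beta_param)
    / special.gamma(alpha + beta_param)
)

# 2. SciPy's built-in beta function.
via_scipy = special.beta(alpha, beta_param)

# 3. Numerical integration of u^(alpha - 1) * (1 - u)^(beta - 1) over [0, 1].
via_integral, _ = integrate.quad(
    lambda u: u ** (alpha - 1) * (1 - u) ** (beta_param - 1), 0, 1
)

print(via_gamma, via_scipy, via_integral)
```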

<a id='beta-conjugate'></a>
## Putting It All Together: The Beta as a Conjugate Prior to the Binomial Likelihood
---

Remember, our beta distribution is what we’re going to be using as our _prior_ over the probability of subscription, $p$. In other words, we have some distribution of beliefs about which subscription rates are most likely represented by a beta distribution.

**Recall the setup of this problem in terms of Bayes’ theorem:**

### $$ P(p \;|\; n,k) = \frac{ P(n,k \;|\; p) }{ P(n, k) } P(p) $$

**For now, let's ignore the normalizing constant — the marginal probability of the data, $k,n$. We can say that the unnormalized posterior is:**

### $$ P(p \;|\; n,k) \propto P(n,k \;|\; p) \cdot P(p) $$

**And we can put our binomial likelihood and the beta prior in where we had the placeholders:**

### $$ P(p \;|\; n,k) \propto \binom{n}{k} p^k (1 - p)^{(n-k)} \cdot \frac{\Gamma (\alpha + \beta) }{ \Gamma (\alpha) \Gamma (\beta) }p^{\alpha-1}(1-p)^{\beta-1} $$

**Let's now define a constant $c$ as:**

### $$ c = \binom{n}{k} \cdot \frac{\Gamma (\alpha + \beta) }{ \Gamma (\alpha) \Gamma (\beta) } $$

**Now, our formula for the unnormalized posterior is:**

### $$ \begin{aligned}
P(p \;|\; n,k) &\propto c \cdot p^k (1 - p)^{(n-k)} \cdot p^{\alpha-1}(1-p)^{\beta-1} \\
P(p \;|\; n,k) &\propto c \cdot p^{(k + \alpha - 1)} (1-p)^{(n - k + \beta - 1)}
\end{aligned}
$$

**If we define a new alpha and beta:**

### $$ \begin{aligned}
\alpha_{posterior} &= k + \alpha_{prior} \\
\beta_{posterior} &= n - k + \beta_{prior}
\end{aligned} $$

**We can see that the posterior distribution can in fact be parameterized as a beta distribution.** The constant term $c$ will be handled when we put the marginal likelihood back in and normalize the posterior distribution to be a proper probability distribution.
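The conjugate update can be checked numerically, as a sketch. The prior $Beta(2, 2)$ is a hypothetical choice, and the data are the five-out-of-five scenario from earlier. The code multiplies the likelihood by the prior on a grid of $p$ values, normalizes with a simple trapezoid rule, and compares the result with the beta distribution given by the update rule.

```python
import numpy as np
import scipy.stats as stats

alpha_prior, beta_prior = 2, 2    # hypothetical prior beliefs
n, k = 5, 5                       # data: 5 of 5 visitors subscribed

# Conjugate update from the derivation above.
alpha_post = k + alpha_prior
beta_post = n - k + beta_prior
analytic_posterior = stats.beta(alpha_post, beta_post)

# Brute-force check: likelihood times prior on a grid, then normalize.
p_grid = np.linspace(0, 1, 2001)
unnormalized = (
    stats.binom.pmf(k, n, p_grid)
    * stats.beta(alpha_prior, beta_prior).pdf(p_grid)
)
area = np.sum((unnormalized[1:] + unnormalized[:-1]) / 2 * np.diff(p_grid))
grid_posterior = unnormalized / area

max_gap = np.max(np.abs(grid_posterior - analytic_posterior.pdf(p_grid)))
print(f"posterior: Beta({alpha_post}, {beta_post})")
print(f"posterior mean: {analytic_posterior.mean():.3f}")
print(f"largest gap between grid and analytic posterior: {max_gap:.2e}")
```

Dividing by `area` plays the role of the marginal probability of the data, which is where the constant $c$ drops out.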