# Appendix D: Modelling Distributions

The purpose of this appendix is to discuss some of the stochastic distributions that might be useful for modelling
the outcomes of sporting matches, particularly Australian Rules football.

## Beta-Bernoulli distribution

Consider the situation where the outcome $X$ (i.e. a win or loss) of a match between team A and team B is stochastically decided,
given some (unknown) probability $\theta$ that team A wins. Essentially, the outcome may be considered as a biased coin toss.
However, prior to the match we might consider the relative strengths of teams A and B, and this knowledge might inform us
as to likely values of the probability $\theta$. Consequently, we may model $X$ by a continuous mixture distribution over 
$\theta$, and then use this to derive a posterior distribution for $\theta$.

### Bernoulli data likelihood

Let the variate $X\in\{0,1\}$ represent the outcome of a single match between some arbitrary team A and team B, with $X=1$ indicating a win by team A, and $X=0$ indicating a win by team B.
For convenience, we consider the Bernoulli distribution
\begin{eqnarray}
X\mid\theta & ~\sim~ & \texttt{Bern}(\theta)\,,
\end{eqnarray}
where $\theta$ is the assumed probability that team A wins, which is decided before the match commences.
Thus, we see that
\begin{eqnarray}
p(X=x\mid\theta) & ~=~ & \theta^x\,(1-\theta)^{1-x}\,.
\end{eqnarray}
If we further wish to model the outcomes of an ordered sequence of $n$ *conditionally independent* matches between the same teams A and B, 
with the same parameter
$\theta$, then we will obtain the joint data likelihood
\begin{eqnarray}
p(X=w\mid n,\theta) & ~=~ & \theta^w\,(1-\theta)^{n-w}\,,
\end{eqnarray}
where $w\in\{0,1,\ldots,n\}$ represents the total number of wins for team A, and $\ell=n-w$ represents the total number of
losses.

Note that the use of the Bernoulli distribution explicitly presumes that only wins and losses are possible.
How then might we deal with games for which matches may end in a draw?
A natural convention is to treat a draw as being half-a-win and half-a-loss for each team.
Thus, we might define the weighted log-likelihood for a draw as being given by
\begin{eqnarray}
L_\texttt{draw}(\theta) & ~\doteq~ & \frac{1}{2}\ln p(X=1\mid\theta)+\frac{1}{2}\ln p(X=0\mid\theta)
\\& =  &
\frac{1}{2}\ln\theta+\frac{1}{2}\ln\,(1-\theta)
~=~
\ln\left[\theta^{\frac{1}{2}}\,(1-\theta)^{\frac{1}{2}}\right]\,.
\end{eqnarray}
We observe that functionally this takes the same form as
\begin{eqnarray}
L_\texttt{draw}(\theta) & ~=~ & \ln p(X=0.5\mid\theta)\,.
\end{eqnarray}
Consequently, we may numerically treat observed draws as having the value $X=\frac{1}{2}$.
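
As a quick numerical check of this convention, the following sketch evaluates the Bernoulli log-likelihood at $x=\frac{1}{2}$ and compares it with the equally weighted average of the win and loss log-likelihoods. The function name `bernoulli_loglik` and the value $\theta=0.7$ are illustrative assumptions, not part of the derivation above.

```python
import math

def bernoulli_loglik(x, theta):
    """Log of theta^x (1-theta)^(1-x); x may be 1 (win), 0 (loss) or 0.5 (draw)."""
    return x * math.log(theta) + (1.0 - x) * math.log(1.0 - theta)

theta = 0.7  # assumed probability that team A wins (illustrative value)

# Log-likelihood with the draw encoded as x = 1/2.
draw = bernoulli_loglik(0.5, theta)

# Equally weighted average of the win and loss log-likelihoods.
average = 0.5 * bernoulli_loglik(1, theta) + 0.5 * bernoulli_loglik(0, theta)

print(draw, average)  # the two values should agree up to rounding
```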

### Beta prior distribution

We seek a non-informative prior distribution for $\theta$. 
Following Box and Tiao [[1]](#Citations "Citation [1]: Bayesian Inference in Statistical Analysis"), 
we desire a transformation $\phi(\theta)$ such that the likelihood $p(X=w\mid n,\theta)$, plotted as a function of $\phi$, remains
approximately invariant in shape and size for some fixed $n$ as $w$ varies (but varies in its mean location). 
A uniform prior for $\phi$ then induces a
non-informative prior for $\theta$. In general, it turns out that the relevant prior is usually inversely
proportional to the square-root of the variance. For the Bernoulli distribution, we have
\begin{eqnarray}
\mathbb{V}[X\mid\theta] & ~=~ & \theta\,(1-\theta)\,,
\end{eqnarray}
whereupon the non-informative prior is therefore
\begin{eqnarray}
p(\theta) ~\propto~ \frac{1}{\sqrt{\theta\,(1-\theta)}}
& ~~~\Rightarrow~~~ & \theta~\sim~\mathtt{Beta}\left(\frac{1}{2},\frac{1}{2}\right)
\,.
\end{eqnarray}
Observe that this is just 
[Jeffreys' prior](https://en.wikipedia.org/wiki/Jeffreys_prior "Wikipedia: Jeffreys' prior"), 
which notionally corresponds to a single prior pseudo-match
with half-a-win and half-a-loss, i.e. a draw.

More generally, we might choose an arbitrary beta prior distribution
\begin{eqnarray}
\theta & ~\sim~ & \mathtt{Beta}(\alpha,\beta)
\,,
\end{eqnarray}
for some hyper-parameters $\alpha$ and $\beta$, where
$\alpha$ represents the prior number of pseudo-wins and $\beta$ represents the prior number of pseudo-losses.
Note that technically the parameter dependence is explicitly $\theta\mid\alpha,\beta$. However, by convention
the hyper-parameters are often left implicit to indicate that they are held constant. Despite this useful convention,
it is best to always explicitly condition on all relevant parameters when dealing with the actual probability functions,
in order to avoid accidental ambiguity, e.g. by hiding important modelling assumptions.

The special case of $\alpha=\beta$ corresponds to having no other prior information with which to
preference a win/loss over a loss/win.
We could, for example, choose the improper
[Haldane prior](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution") 
with $\alpha=\beta=0$ to indicate that we have no
prior pseudo-matches.
Similarly, the uniform prior with $\alpha=\beta=1$ corresponds to one previous pseudo-win and one pseudo-loss.
Alternatively, 
Kerman [[2]](#Citations "Citation [2]: Neutral noninformative and informative conjugate beta and gamma prior distributions") 
derives $\alpha=\beta=\frac{1}{3}$ as being a "*neutral*" prior that has the property that it leads "to posterior distributions with approximately 50 per cent probability that the true value is either smaller or larger than the maximum likelihood estimate".

Note that the beta distribution is also a good choice due to its complementary symmetry in $\alpha$ and $\beta$.
Suppose that $\alpha$ denotes our pre-match knowledge about the strength of team A, and similarly let
$\beta$ represent our prior knowledge about team B. Now, $\theta=\theta_A$ is the prior probability of team A winning,
such that $\theta_A\sim\texttt{Beta}(\alpha,\beta)$.
However, the ordering of teams A and B is arbitrary. If we swapped the order of the teams and their respective hyper-parameters, 
then we would instead have $\theta_B\sim\texttt{Beta}(\beta,\alpha)$.
The beta distribution correctly satisfies the constraint that
\begin{eqnarray}
\mathbb{E}[\theta\mid\alpha,\beta] & ~=~ & \frac{\alpha}{\alpha+\beta}~=~1-\mathbb{E}[\theta\mid\beta,\alpha]\,.
\end{eqnarray}
This is the aforementioned complementary symmetry.

### Beta-Bernoulli mixture distribution

The mixture distribution for a single match is now derived as
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \int_0^1 p(x\mid\theta)\,p(\theta\mid\alpha,\beta)\,d\theta
\\& = &
\int_0^1 \theta^x(1-\theta)^{1-x}\,
\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta
\\& = &
\frac{1}{B(\alpha,\beta)}\int_0^1 \theta^{\alpha+x-1}(1-\theta)^{\beta-x}\,d\theta
\\& =  &
\frac{B(\alpha+x,\beta-x+1)}{B(\alpha,\beta)}\int_0^1 
\frac{\theta^{\alpha+x-1}(1-\theta)^{\beta-x}}{B(\alpha+x,\beta-x+1)}\,d\theta\,.
\end{eqnarray}
Recognising the final integrand as the density of the $\texttt{Beta}(\alpha+x,\beta-x+1)$ distribution, which integrates to one, we obtain
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ &
\frac{B(\alpha+x,\beta-x+1)}{B(\alpha,\beta)}\,.
\end{eqnarray}
Next, we expand $B(\cdot,\cdot)$ in terms of the *gamma* function $\Gamma(\cdot)$ to obtain
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ &
\frac{\Gamma(\alpha+x)\,\Gamma(\beta+1-x)}{\Gamma(\alpha+\beta+1)}\,
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,.
\end{eqnarray}
Finally, using the recurrence relation that $\Gamma(z+1)=z\,\Gamma(z)$, we deduce that
\begin{eqnarray}
p(X=1\mid\alpha,\beta)~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\;\; &
p(X=0\mid\alpha,\beta)~=~\frac{\beta}{\alpha+\beta}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{\alpha^x\,\beta^{1-x}}{\alpha+\beta}\,.
\end{eqnarray}
Observe that this is just
\begin{eqnarray}
X\mid\bar{\theta} & ~\sim~ & \texttt{Bern}(\bar{\theta})\,,
\end{eqnarray}
with $\bar{\theta}=\frac{\alpha}{\alpha+\beta}$ being the mean of the beta distribution.

Subsequently, for the case of $n$ *unconditionally independent* matches between teams A and B, we would obtain the joint distribution
\begin{eqnarray}
p_u(X=w\mid n,\alpha,\beta) & ~=~ & \frac{\alpha^w\,\beta^{n-w}}{(\alpha+\beta)^n}~=~\bar{\theta}^w\,(1-\bar{\theta})^{n-w}\,.
\end{eqnarray}
Note that this is notionally different from the *conditionally independent* case assumed
[previously](#Bernoulli-data-likelihood "Section: Bernoulli data likelihood"),
where the $n$ matches all shared the same value of the parameter $\theta$. In the unconditional form, the parameter
$\theta$ is assumed to be resampled before every match.

In contrast, for the conditionally independent case, where all $n$ matches share the same $\theta$, we would instead derive that
\begin{eqnarray}
p_c(X=w\mid n,\alpha,\beta) & ~=~ & \int_0^1 p(X=w\mid n,\theta)\,p(\theta\mid\alpha,\beta)\,d\theta
\\& = &
\int_0^1 \theta^w\,(1-\theta)^{n-w}\,
\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta
\\& = &
\frac{1}{B(\alpha,\beta)}\int_0^1 \theta^{\alpha+w-1}(1-\theta)^{\beta+n-w-1}\,d\theta
\\& =  &
\frac{B(\alpha+w,\beta+n-w)}{B(\alpha,\beta)}\,.
\end{eqnarray}
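
The difference between the two joint distributions can be seen numerically. The sketch below evaluates $p_c$ through the ratio of beta functions (computed on the log scale with `math.lgamma`) and $p_u$ through the product form. The function names are illustrative assumptions, and both functions return the probability of one particular ordered sequence containing $w$ wins.

```python
import math

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def p_conditional(w, n, alpha, beta):
    """p_c: a single theta is shared by all n matches."""
    return math.exp(log_beta(alpha + w, beta + n - w) - log_beta(alpha, beta))

def p_unconditional(w, n, alpha, beta):
    """p_u: theta is resampled from the prior before every match."""
    mean = alpha / (alpha + beta)
    return mean ** w * (1.0 - mean) ** (n - w)

# Two wins from two matches under Jeffreys' prior (alpha = beta = 1/2).
print(p_conditional(2, 2, 0.5, 0.5))    # approximately 0.375
print(p_unconditional(2, 2, 0.5, 0.5))  # 0.25
```

For a single match ($n=1$) the two forms coincide. For $n>1$ the shared $\theta$ makes runs of identical results more probable than under independent resampling.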

### Beta posterior distribution

For the case of observing a sequence of $n$ conditionally independent matches, the posterior distribution is given by
\begin{eqnarray}
p(\theta\mid n,w,\alpha,\beta) & ~=~ &
\frac{p(X=w\mid n,\theta)\,p(\theta\mid\alpha,\beta)}{p_c(X=w\mid n,\alpha,\beta)}
~=~
\frac{\theta^{\alpha+w-1}(1-\theta)^{\beta+n-w-1}}{B(\alpha+w,\beta+n-w)}\,,
\end{eqnarray}
such that
\begin{eqnarray}
\theta\mid n,w & ~\sim~ & \texttt{Beta}(\alpha+w,\beta+n-w)\,.
\end{eqnarray}

### Beta-Bernoulli predictive distribution

Suppose the result of the $(n+1)$-th match is now $X$, conditionally independent of previous matches for the same
parameter $\theta$. Then the likelihood of this result is
\begin{eqnarray}
p(X=x\mid\theta) & ~=~ & \theta^{x}\,(1-\theta)^{1-x}\,.
\end{eqnarray}
Hence, the predictive probability of this result is given by
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & \int_0^1 p(x\mid\theta)\,p(\theta\mid n,w,\alpha,\beta)\,d\theta
\\&~=~&
\int_0^1\frac{\theta^{\alpha+w+x-1}\,(1-\theta)^{\beta+n-w-x}}
{B(\alpha+w,\beta+n-w)}\,d\theta
\\ & = &
\frac{B(\alpha+w+x,\beta+n-w+1-x)}
{B(\alpha+w,\beta+n-w)}
\,.
\end{eqnarray}

In terms of the gamma function, $\Gamma(\cdot)$, this becomes
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & 
\frac{\Gamma(\alpha+w+x)\,\Gamma(\beta+n-w+1-x)}
     {\Gamma(\alpha+\beta+n+1)}\,
\frac{\Gamma(\alpha+\beta+n)}
     {\Gamma(\alpha+w)\,\Gamma(\beta+n-w)}\,.
\end{eqnarray}
For a loss, i.e. $X=0$, the respective probability reduces to
\begin{eqnarray}
p(X=0\mid n,w,\alpha,\beta) & ~=~ & \frac{\beta+n-w}{\alpha+\beta+n}\,,
\end{eqnarray}
again using the recurrence relation that $\Gamma(z+1)=z\,\Gamma(z)$.
The corresponding probability of a win is therefore
\begin{eqnarray}
p(X=1\mid n,w,\alpha,\beta) & ~=~ & \frac{\alpha+w}{\alpha+\beta+n}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & \frac{(\alpha+w)^x\,(\beta+n-w)^{1-x}}{\alpha+\beta+n}\,.
\end{eqnarray}
Defining $\hat{\theta}\doteq\frac{\alpha+w}{\alpha+\beta+n}$, we observe that
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & \left(\hat{\theta}\right)^x\,\left(1-\hat{\theta}\right)^{1-x}
\,,
\end{eqnarray}
such that
\begin{eqnarray}
X\mid\hat{\theta} & ~\sim~ & \texttt{Bern}(\hat{\theta})
\,.
\end{eqnarray}

Note that the denominator of $\hat{\theta}$ corresponds to assuming $n$ observed matches plus $\alpha+\beta$ prior pseudo-matches, and the numerator corresponds to $w$ observed wins plus $\alpha$ prior pseudo-wins.
Thus, $\hat{\theta}$ is just a *smoothed* estimate of the empirical probability of a win.
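
A minimal sketch of this smoothed estimate is given below. The function name `predictive_win_probability` and the example results are hypothetical, and the default hyper-parameters assume Jeffreys' prior. Allowing the value $\frac{1}{2}$ for a draw extends the earlier numerical convention to the posterior update, which is an assumption rather than something derived above.

```python
def predictive_win_probability(results, alpha=0.5, beta=0.5):
    """Posterior predictive probability that team A wins the next match.

    results: sequence of outcomes for team A, with 1 for a win, 0 for a loss
             and (by assumption) 0.5 for a draw.
    alpha, beta: prior pseudo-wins and pseudo-losses.
    """
    n = len(results)           # number of observed matches
    w = sum(results)           # number of observed wins (draws count as half)
    post_alpha = alpha + w     # posterior pseudo-wins
    post_beta = beta + n - w   # posterior pseudo-losses
    return post_alpha / (post_alpha + post_beta)

# Hypothetical record: three wins, one loss and one draw.
theta_hat = predictive_win_probability([1, 1, 0, 1, 0.5])
print(theta_hat)
```

With no observed matches the function simply returns the prior mean $\frac{\alpha}{\alpha+\beta}$.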

## Gamma-Poisson distribution

Consider a Poisson-like count $X\in\mathbb{Z}^{\ge 0}$ with over-dispersion, namely 
$\mathbb{V}[X]>\mathbb{E}[X]$. This suggests the need for an additional parameter, beyond the usual Poisson rate $\lambda$, 
to control the extra variance. 
One way in which over-dispersion can arise is when $\lambda$ is only held constant for a single trial 
or a single sequence of trials, but exhibits inter-trial or inter-sequence variation.
This variation may be captured by a prior distribution, giving rise to a continuous mixture distribution.

### Poisson data likelihood

For some event rate $\lambda$ (per unit time), let the count $X$ be distributed as
\begin{eqnarray}
X\mid\lambda & ~\sim~ & \mathtt{Poisson}(\lambda)\,,
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mathbb{E}[X\mid\lambda]~=~\lambda\,, & \;\;\mbox{and}\;\; &
\mathbb{V}[X\mid\lambda]~=~\lambda\,,
\end{eqnarray}
respectively. Then we may consider a 
single sequence of $n$ conditionally independent counts with $\lambda$ held constant just for that sequence.
If the average count of the sequence is $\langle X\rangle=\bar{X}$, then the joint likelihood is given by
\begin{eqnarray}
p(X_1,\ldots,X_n\mid\lambda) & ~=~ & \frac{e^{-n\lambda}\,\lambda^{n\bar{X}}}{\prod_{i=1}^n X_i!}\,.
\end{eqnarray}

### Gamma prior distribution

Consider, in general, some arbitrary prior distribution for the event rate $\lambda$, say $\lambda\sim D(\boldsymbol{\theta})$, governed by one or more 
(constant) hyper-parameters denoted by $\boldsymbol{\theta}$.
Let this distribution have mean and variance given by
\begin{eqnarray}
\mu_\lambda ~\doteq~\mathbb{E}[\lambda\mid\boldsymbol{\theta}]\,, 
& ~~~ & \sigma^2_\lambda~\doteq~\mathbb{V}[\lambda\mid\boldsymbol{\theta}]\,,
\end{eqnarray}
respectively.
Then it follows that the unconditional mean and variance
of the variate $X$ are given
[by](https://en.wikipedia.org/wiki/Law_of_total_expectation
"Wikipedia: Law of total expectation")
\begin{eqnarray}
\mathbb{E}[X] & ~\doteq~ & \mathbb{E}\left[\,\mathbb{E}[X\mid\lambda]\mid\boldsymbol{\theta}\right]
\\& = &
\mathbb{E}[\lambda\mid\boldsymbol{\theta}]~=~\mu_\lambda
\,,
\end{eqnarray}
[and](https://en.wikipedia.org/wiki/Law_of_total_variance
"Wikipedia: Law of total variance")
\begin{eqnarray}
\mathbb{V}[X] & ~\doteq~ & 
\mathbb{V}\left[\,\mathbb{E}[X\mid\lambda]\mid\boldsymbol{\theta}\right]
+
\mathbb{E}\left[\,\mathbb{V}[X\mid\lambda]\mid\boldsymbol{\theta}\right]
\\
& ~=~ & 
\mathbb{V}[\lambda\mid\boldsymbol{\theta}]
+
\mathbb{E}[\lambda\mid\boldsymbol{\theta}]
~=~\sigma^2_\lambda+\mu_\lambda
\,.
\end{eqnarray}
Consequently, we observe that $\mathbb{V}[X]>\mathbb{E}[X]$ whenever $\sigma^2_\lambda>0$, such that the overall process is over-dispersed compared
to a simple Poisson process.
Clearly we require a *proper* prior distribution with finite mean and variance.

Following the rationale discussed [previously](#Beta-prior-distribution "Section: Beta prior distribution"),
we first seek a non-informative prior for the Poisson rate $\lambda\in (0,\infty)$ via a variance-stabilising transformation. However, this results in the improper prior
\begin{eqnarray}
p(\lambda) & ~\propto~ & \frac{1}{\sqrt{\mathbb{V}[X\mid\lambda]}}~=~\frac{1}{\sqrt{\lambda}}\,.
\end{eqnarray}
If we truncate the domain to $\lambda\in(0,L]$, then the truncated prior becomes proper, namely
\begin{eqnarray}
p(\lambda\mid L) & ~=~ & \frac{1}{2\sqrt{L\lambda}}\,,
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mathbb{E}[\lambda\mid L]~=~\frac{1}{3}L\,, & \;\;\mbox{and}\;\; &
\mathbb{V}[\lambda\mid L]~=~\frac{4}{45}L^2\,,
\end{eqnarray}
respectively. The overall mean and variance are then given by
\begin{eqnarray}
\mathbb{E}[X]~=~\frac{1}{3}L\,, & \;\;\mbox{and}\;\; &
\mathbb{V}[X]~=~\frac{1}{3}L+\frac{4}{45}L^2\,,
\end{eqnarray}
respectively. Clearly, the hyper-parameter $L$ may be estimated (using the method of moments) from the sample mean as $\hat{L}=3\bar{X}$.
However, if the sample variance does not match $\frac{1}{3}\hat{L}+\frac{4}{45}\hat{L}^2$, then 
having only a single prior hyper-parameter does not provide enough modelling flexibility.
Consequently, we require a prior distribution having at least two hyper-parameters.
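
The limitation of the single hyper-parameter can be illustrated with a small sketch. The counts below are hypothetical, the function name is an assumption, and the sample variance is taken with divisor $n$ (one choice among several). The implied variance is fully determined by the sample mean, so any mismatch with the sample variance cannot be corrected.

```python
def truncated_prior_fit(counts):
    """Method-of-moments fit of L, plus the variance that this L implies."""
    n = len(counts)
    mean = sum(counts) / n
    sample_var = sum((x - mean) ** 2 for x in counts) / n
    L_hat = 3.0 * mean                                   # from E[X] = L / 3
    implied_var = L_hat / 3.0 + 4.0 * L_hat ** 2 / 45.0  # from V[X] = L/3 + 4L^2/45
    return L_hat, implied_var, sample_var

counts = [0, 1, 1, 2, 3, 5, 8, 0, 2, 10]  # hypothetical over-dispersed counts
L_hat, implied_var, sample_var = truncated_prior_fit(counts)
print(L_hat, implied_var, sample_var)
```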

To deduce a suitable form for such a prior, [recall](#Poisson-data-likelihood "Section: Poisson data likelihood")
that the joint likelihood of a sequence of $n$ trials, considered as a function of $\lambda$ with fixed
observed counts, is proportional to $\lambda^{n\bar{X}}\,e^{-n\lambda}$. This has the same proportional form as
the $\texttt{Gamma}(n\bar{X}+1,n)$ distribution.
Consequently, we consider a prior distribution in the general form
\begin{eqnarray}
\lambda & ~\sim~ & \mathtt{Gamma}(\alpha,\beta)\,,
\end{eqnarray}
where $\alpha$ is the shape parameter and $\beta$ is the rate parameter.
Note that the improper Jeffreys' prior above could notionally be thought of as $\texttt{Gamma}(\frac{1}{2},0)$.

For this gamma distribution, it is 
[known](https://en.wikipedia.org/wiki/Gamma_distribution
"Wikipedia: Gamma distribution")
that the mean and variance are given by
\begin{eqnarray}
\mu_\lambda~=~\frac{\alpha}{\beta}\,, 
& ~~~ & \sigma^2_\lambda~=~\frac{\alpha}{\beta^2}\,,
\end{eqnarray}
respectively.
It then follows that
\begin{eqnarray}
\mathbb{E}[X] & ~=~ & \mu_\lambda~=~\frac{\alpha}{\beta}\,,
\end{eqnarray}
and
\begin{eqnarray}
\mathbb{V}[X] & ~=~ & \mu_\lambda+\sigma^2_\lambda
~=~\frac{\alpha\,(\beta+1)}{\beta^2}\,.
\end{eqnarray}
Not only does this allow for over-dispersion for $\alpha>0$, but the two hyper-parameters $\alpha$ and $\beta$
provide sufficient freedom to flexibly model both the sample mean and variance.
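
To make this concrete, the sketch below inverts the two moment equations, which gives $\beta=\mu/(\sigma^2-\mu)$ and $\alpha=\mu\beta$ for an overall mean $\mu$ and variance $\sigma^2>\mu$. This method-of-moments inversion and the function names are illustrative assumptions that follow from the formulas above; they are not an estimator prescribed by the text.

```python
def gamma_poisson_moments(alpha, beta):
    """Overall mean and variance of X for a Gamma(alpha, beta) prior on the rate."""
    mean = alpha / beta
    var = alpha * (beta + 1.0) / beta ** 2
    return mean, var

def fit_by_moments(mean, var):
    """Solve mean = alpha/beta and var = alpha (beta + 1) / beta^2 for (alpha, beta)."""
    if var <= mean:
        raise ValueError("requires over-dispersion: var must exceed mean")
    beta = mean / (var - mean)
    alpha = mean * beta
    return alpha, beta

# Hypothetical sample moments.
alpha_mm, beta_mm = fit_by_moments(3.2, 10.56)
print(alpha_mm, beta_mm)
print(gamma_poisson_moments(alpha_mm, beta_mm))  # recovers the input moments
```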

### Negative binomial distribution

For a single trial, the explicit form of the mixture distribution is derived as
\begin{eqnarray}
p(X=k\mid\alpha,\beta) & ~=~ & \int_0^\infty p(X=k\mid\lambda)\,p(\lambda\mid\alpha,\beta)\,d\lambda
\\& = &
\int_0^\infty e^{-\lambda}\frac{\lambda^k}{k!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\,e^{-\beta\lambda}\,d\lambda
\\& = &
\frac{\beta^\alpha}{k!\,\Gamma(\alpha)}
\int_0^\infty \lambda^{\alpha+k-1}\,e^{-(\beta+1)\lambda}\,d\lambda
%\\& = &
%\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,\frac{\beta^\alpha}{(\beta+1)^{\alpha+k}}
%\int_0^\infty \frac{(\beta+1)^{\alpha+k}}{\Gamma(\alpha+k)}\,\lambda^{\alpha+k-1}\,e^{-(\beta+1)\lambda}\,d\lambda
\\& = &
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,\frac{\beta^\alpha}{(\beta+1)^{\alpha+k}}\,,
\end{eqnarray}
since the final integrand is proportional to the 
$\mathtt{Gamma}(\alpha+k,\beta+1)$
distribution.

In order to show that this is just the real-valued (*Polya* distribution) version of the 
[*negative binomial*](https://en.wikipedia.org/wiki/Negative_binomial_distribution "Wikipedia: Negative binomial distribution") 
distribution,
we define $p$ to be the probability of a *stopping* event (i.e. one out of the required number $\alpha$ of such events) and $q$ to be the probability of a *non-stopping* event,
namely
\begin{eqnarray}
p ~\doteq~ \frac{\beta}{\beta+1}\,, & ~~~\mbox{and}~~~ & q ~\doteq~ 1-p~=~\frac{1}{\beta+1}\,,
\end{eqnarray}
whereupon
\begin{eqnarray}
p(X=k\mid\alpha,p) & ~=~ & 
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,q^k\,p^\alpha\,.
\end{eqnarray}
This represents the overall probability of observing a sequence of $k+\alpha$ Bernoulli trials,
where the sequence was terminated as soon as $\alpha$ stopping events occurred.
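
The mixture form can be checked numerically. The following sketch evaluates the probability mass function on the log scale and confirms that it sums to one with mean $\frac{\alpha}{\beta}$. The function name, the hyper-parameter values and the truncation point of the sums are all illustrative assumptions.

```python
import math

def neg_binom_pmf(k, alpha, beta):
    """p(X = k | alpha, beta) for the gamma-Poisson mixture with real alpha > 0."""
    p = beta / (beta + 1.0)   # probability of a stopping event
    q = 1.0 / (beta + 1.0)    # probability of a non-stopping event
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + k * math.log(q) + alpha * math.log(p))

alpha, beta = 1.5, 0.5    # assumed hyper-parameters
ks = range(2000)          # truncation point chosen so the neglected tail is negligible
total = sum(neg_binom_pmf(k, alpha, beta) for k in ks)
mean = sum(k * neg_binom_pmf(k, alpha, beta) for k in ks)
print(total, mean)        # close to 1 and to alpha / beta respectively
```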

In the traditional integer-valued (*Pascal* distribution) form, $\alpha>0$ is replaced by $r\in\mathbb{Z}^{>0}$, giving
\begin{eqnarray}
p(X=k\mid r,p) & ~=~ & 
\frac{(r+k-1)!}{k!\,(r-1)!}\,q^k\,p^r~=~\binom{k+r-1}{k}\,q^k\,p^r\,.
\end{eqnarray}
If we terminate the sequence of trials after just $r=1$ stopping event, then this reduces to the
*geometric* distribution, i.e. $\mathtt{Geom}(p)\equiv \mathtt{NegBinom}(1,p)$.
Furthermore, $X\sim\mathtt{NegBinom}(r,p)$ corresponds to the sum of $r$ independent geometric counts, i.e.
$X=X_1+X_2+\ldots+X_r$ with $X_i\sim\mathtt{Geom}(p)$.

Finally, for a sequence of $n$ conditionally independent counts, the joint mixture distribution is
\begin{eqnarray}
p_c(X_1,\ldots,X_n\mid\alpha,\beta) & ~=~ & \int_0^\infty p(X_1,\ldots,X_n\mid\lambda)\,p(\lambda\mid\alpha,\beta)\,d\lambda
\\& = &
\int_0^\infty \frac{e^{-n\lambda}\,\lambda^{n\bar{X}}}{\prod_{i=1}^n X_i!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\,e^{-\beta\lambda}\,d\lambda
\\& = &
\frac{\Gamma(\alpha+n\bar{X})}{\Gamma(\alpha)\,\prod_{i=1}^n X_i!}\,\frac{\beta^\alpha}{(\beta+n)^{\alpha+n\bar{X}}}\,.
\end{eqnarray}
Alternatively, for *unconditionally* independent counts sampled with *different* values of $\lambda$, the joint mixture distribution is just
\begin{eqnarray}
p_u(X_1,\ldots,X_n\mid\alpha,\beta) & ~=~ &
\prod_{i=1}^n\frac{\Gamma(\alpha+X_i)}{X_i!}\cdot\frac{q^{n\bar{X}}\,p^{n\alpha}}{\Gamma(\alpha)^n}\,.
\end{eqnarray}

### Gamma posterior distribution

For a sequence of $n$ conditionally independent counts, the posterior distribution for the Poisson rate $\lambda$ is
given by
\begin{eqnarray}
p(\lambda\mid X_1,\ldots,X_n,\alpha,\beta) & ~=~ &
\frac{p(X_1,\ldots,X_n\mid\lambda)\,p(\lambda\mid\alpha,\beta)}{p_c(X_1,\ldots,X_n\mid\alpha,\beta)}
\\& = &
\left.
\frac{e^{-n\lambda}\,\lambda^{n\bar{X}}}{\prod_{i=1}^n X_i!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\,e^{-\beta\lambda}
\right/
\frac{\Gamma(\alpha+n\bar{X})}{\Gamma(\alpha)\,\prod_{i=1}^n X_i!}\,\frac{\beta^\alpha}{(\beta+n)^{\alpha+n\bar{X}}}
\\& = &
\frac{(\beta+n)^{\alpha+n\bar{X}}}{\Gamma(\alpha+n\bar{X})}\,\lambda^{\alpha+n\bar{X}-1}\,e^{-(\beta+n)\lambda}\,,
\end{eqnarray}
such that
\begin{eqnarray}
\lambda\mid X_1,\ldots,X_n & ~\sim~ & \texttt{Gamma}(\alpha+n\bar{X},\beta+n)\,.
\end{eqnarray}
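
In code, this conjugate update amounts to adding the total count to the shape and the number of observations to the rate. The sketch below uses a hypothetical function name, hypothetical counts and an arbitrarily chosen proper prior.

```python
def gamma_posterior(counts, alpha, beta):
    """Posterior (shape, rate) of the Poisson rate after observing the counts."""
    post_alpha = alpha + sum(counts)  # alpha + n * mean(X)
    post_beta = beta + len(counts)    # beta + n
    return post_alpha, post_beta

# Hypothetical counts with an assumed Gamma(0.5, 1) prior.
post_alpha, post_beta = gamma_posterior([2, 0, 3], alpha=0.5, beta=1.0)
posterior_mean = post_alpha / post_beta
print(post_alpha, post_beta, posterior_mean)
```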

### Maximum likelihood estimation

We now return to the sequence of unconditionally independent observations.
The log-likelihood is therefore given by
\begin{eqnarray}
L(\boldsymbol{\theta};X_1,\ldots,X_n) & ~\doteq~ &
\ln p_u(X_1,\ldots,X_n\mid\alpha,\beta)
\\& ~=~ &
\sum_{i=1}^n\left\{\ln\Gamma(\alpha+X_i)-\ln X_i!\right\}
+n\bar{X}\ln q+n\alpha\ln p-n\ln\Gamma(\alpha)\,.
\end{eqnarray}
Now, since
\begin{eqnarray}
p~\doteq~\frac{\beta}{\beta+1} & ~~~\Rightarrow~~~ &
\beta~\doteq~\frac{p}{q}\,,
\end{eqnarray}
we may consider the parameterisation as either
$\boldsymbol{\theta}=(\alpha,\beta)$ or
$\boldsymbol{\theta}'=(\alpha,p)$.
Taking the gradient with respect to $p$ then gives
\begin{eqnarray}
\nabla_p L & ~=~ &
\frac{n\alpha}{p}-\frac{n\bar{X}}{1-p}
~=~\frac{n\alpha-n(\alpha+\bar{X})p}{pq}\,,
\end{eqnarray}
such that  the maximum likelihood estimate
$\hat{\boldsymbol{\theta}}_\texttt{ML}'=(\hat{\alpha}_\texttt{ML},
\hat{p}_\texttt{ML})$ satisfies
\begin{eqnarray}
\hat{p}_\texttt{ML} & ~=~ & 
\frac{\hat{\alpha}_\texttt{ML}}{\hat{\alpha}_\texttt{ML}+\bar{X}}\,.
\end{eqnarray}

Next, we take the gradient with respect to $\alpha$, which gives
\begin{eqnarray}
\nabla_\alpha L & ~=~ & 
\sum_{i=1}^{n}\psi(\alpha+X_i)+n\ln p-n\psi(\alpha)\,.
\end{eqnarray}
Consequently, the maximum likelihood estimate satisfies the nonlinear
equation
\begin{eqnarray}
\frac{1}{n}\sum_{i=1}^{n}\psi(\hat{\alpha}_\texttt{ML}+X_i)
-\psi(\hat{\alpha}_\texttt{ML})
+\ln\hat{p}_\texttt{ML} & ~=~ & 0\,.
\end{eqnarray}
Now, we note that $\hat{p}_\texttt{ML}=g(\hat{\alpha}_\texttt{ML})$,
where
\begin{eqnarray}
g(\alpha) ~=~ \frac{\alpha}{\alpha+\bar{X}}
& ~~~\Rightarrow~~~ &
g'(\alpha)~=~\frac{\bar{X}}{(\alpha+\bar{X})^2}
~=~\frac{pq}{\alpha}\,,
\end{eqnarray}
using $p\doteq g(\alpha)$ everywhere.
Hence,  we may treat the nonlinear equation as $f(\hat{\alpha}_\texttt{ML})=0$, where
$n\,f(\alpha)\doteq\nabla_\alpha L(\alpha,p)$ with $p=g(\alpha)$.
As a result, we may then iteratively solve this equation via 
Newton's method, namely
\begin{eqnarray}
\hat{\alpha}' & ~=~ & \hat{\alpha}-\frac{f(\hat{\alpha})}{f'(\hat{\alpha})}\,,
\end{eqnarray}
using the derivative
\begin{eqnarray}
n\,f'(\alpha) & ~\doteq~ & 
\nabla_\alpha^2 L(\alpha,p)+
g'(\alpha)\,\nabla_p\nabla_\alpha L(\alpha,p)
\\
\Rightarrow f'(\alpha)
& = &
\frac{1}{n}\sum_{i=1}^{n}\psi'(\alpha+X_i)
-\psi'(\alpha)
+\frac{q}{\alpha}
\,.
\end{eqnarray}

Finally, in terms of the original rate parameter $\beta$, we obtain
the maximum likelihood estimate
\begin{eqnarray}
\hat{\beta}_\texttt{ML} & ~\doteq~ &
\frac{\hat{p}_\texttt{ML}}{1-\hat{p}_\texttt{ML}}
~=~\frac{\hat{\alpha}_\texttt{ML}}{\bar{X}}\,.
\end{eqnarray}
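
The Newton iteration can be sketched using only the standard library. For integer counts the recurrence $\psi(z+1)=\psi(z)+\frac{1}{z}$ gives $\psi(\alpha+X_i)-\psi(\alpha)=\sum_{j=0}^{X_i-1}\frac{1}{\alpha+j}$, and differentiating gives the analogous sum for $\psi'$. This substitution is made here purely to avoid a special-function library. The function name, starting value, step-halving safeguard and example data are all assumptions.

```python
import math

def fit_negative_binomial(counts, alpha0=1.0, tol=1e-12, max_iter=100):
    """Newton's method for alpha, with p = alpha / (alpha + mean) substituted in."""
    n = len(counts)
    xbar = sum(counts) / n

    def f(alpha):
        # (1/n) sum_i [psi(alpha + X_i) - psi(alpha)] + ln p
        s = sum(1.0 / (alpha + j) for x in counts for j in range(x))
        return s / n + math.log(alpha / (alpha + xbar))

    def f_prime(alpha):
        # (1/n) sum_i [psi'(alpha + X_i) - psi'(alpha)] + q / alpha
        s = sum(1.0 / (alpha + j) ** 2 for x in counts for j in range(x))
        q = xbar / (alpha + xbar)
        return -s / n + q / alpha

    alpha = alpha0
    for _ in range(max_iter):
        step = f(alpha) / f_prime(alpha)
        new_alpha = alpha - step
        while new_alpha <= 0.0:   # safeguard: keep alpha strictly positive
            step /= 2.0
            new_alpha = alpha - step
        converged = abs(new_alpha - alpha) < tol
        alpha = new_alpha
        if converged:
            break

    p = alpha / (alpha + xbar)
    beta = alpha / xbar
    return alpha, beta, p, f(alpha)

counts = [0, 1, 1, 2, 3, 5, 8, 0, 2, 10]  # hypothetical over-dispersed counts
alpha_ml, beta_ml, p_ml, residual = fit_negative_binomial(counts)
print(alpha_ml, beta_ml, p_ml, residual)
```

The sketch assumes a positive sample mean and over-dispersed data. If the sample variance does not exceed the sample mean, the equation $f(\alpha)=0$ need not have a finite root and the iteration should not be expected to converge.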

### Serial correlation

Suppose that some total number $Y$ of events is observed over a given period of time.
Further suppose that this time period is now partitioned into $n$ 
consecutive (but not necessarily equal) intervals, such that
\begin{eqnarray}
Y & ~=~ & X_1+X_2+\cdots+X_n~\doteq~n\bar{X}\,,
\end{eqnarray}
where $X_t$ denotes the number of events observed in the $t$th interval.
If these intervals are independent, then it follows that
\begin{eqnarray}
\mathbb{V}[Y] & ~=~ & \mathbb{V}\left[\sum_{t=1}^{n}X_t\right]
~=~\sum_{t=1}^{n}\mathbb{V}[X_t]\,.
\end{eqnarray}
However, there are situations in which over-dispersion occurs, such that
\begin{eqnarray}
\mathbb{V}\left[\sum_{t=1}^{n}X_t\right]
& ~=~ &
\sum_{t=1}^{n}\mathbb{V}[X_t]+2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}
\mathtt{Cov}[X_i,X_j]
~>~\sum_{t=1}^{n}\mathbb{V}[X_t]\,,
\end{eqnarray}
from which it follows that the intervals are not independent.
Typically, we find that $\mathtt{Cov}[X_i,X_j]>0$ for all
$i\neq j$ with $i,j=1,2,\ldots,n$.

To set the scene, let us first allow for possible differences between intervals
by assuming that each interval has some arbitrary (but prespecified) event rate, say $\lambda_t$ for each $t=1,2,\ldots,n$.
For convenience, let 
$\boldsymbol{\lambda}\doteq(\lambda_1,\lambda_2,\ldots,\lambda_n)$
collectively represent these underlying rates.
Furthermore, let
$\vec{\mathbf{X}}_t\doteq(X_1,X_2,\ldots,X_t)$
denote a sequence (or subsequence) of $t$ consecutive counts, where, by convention, $\vec{\mathbf{X}}_{0}\doteq()$ is the empty or null tuple.

There are now several ways to model the correlation between intervals.
One way is to consider temporal causality, such that $X_2$ depends upon
the observed $X_1$, and $X_3$ depends upon $X_2$, et cetera.
This leads either to a Markov chain or, more generally, to a Bayesian network
that embodies the sequential nature of $\vec{\mathbf{X}}_n$, e.g.
\begin{eqnarray}
p(\vec{\mathbf{X}}_n\mid\boldsymbol{\lambda}) & ~\doteq~ &
\prod_{t=1}^{n}p(X_t\mid\lambda_t,\vec{\mathbf{X}}_{t-1})\,,
\end{eqnarray}
where $\vec{\mathbf{X}}_{0}\doteq()$ indicates no previous dependencies.
Unfortunately, this approach leads to increasingly complex conditional distributions.

Another approach is to consider that the intervals, which are
dependent, are in fact *conditionally* independent given
some latent or hidden parameter, say $\gamma$.
The corresponding conditional model is then given by
\begin{eqnarray}
p(\vec{\mathbf{X}}_n\mid\boldsymbol{\lambda},\gamma) & ~\doteq~ &
\prod_{t=1}^{n}p(X_t\mid\lambda_t,\gamma)\,,
\end{eqnarray}
where the corresponding generative model first samples $\gamma$ and then samples $\vec{\mathbf{X}}_n$ with fixed $\gamma$.

We now extend our 
[earlier](#Poisson-data-likelihood "Section: Poisson data likelihood") 
assumption that each $X_t$ is Poisson distributed to include conditional dependence upon $\gamma$ (and $\lambda_t$) via
\begin{eqnarray}
p(X_t\mid\lambda_t,\gamma) & ~\doteq~ &
e^{-\gamma\lambda_t}\frac{(\gamma\lambda_t)^{X_t}}{X_t!}\,,
\end{eqnarray}
which has conditional mean and variance given by
\begin{eqnarray}
\mu_{X_t\mid\gamma}~\doteq~\mathbb{E}[X_t\mid\boldsymbol{\lambda},\gamma]~=~\gamma\lambda_t\,, 
& ~~~ &
\sigma^2_{X_t\mid\gamma}\doteq\mathbb{V}[X_t\mid\boldsymbol{\lambda},\gamma]~=~\gamma\lambda_t\,,
\end{eqnarray}
respectively.

Next, we borrow our 
[earlier](#Gamma-prior-distribution "Section: Gamma prior distribution")
assumption of a gamma prior to obtain $\gamma\sim\mathtt{Gamma}(\alpha,\beta)$.
The marginal distribution for each $X_t$ is thus
\begin{eqnarray}
p(X_t\mid\lambda_t) & ~=~ &
\int_0^\infty
e^{-\gamma\lambda_t}\frac{(\gamma\lambda_t)^{X_t}}{X_t!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\gamma^{\alpha-1}\,
e^{-\beta\gamma}\,d\gamma
\\& = &
\frac{\lambda_t^{X_t}}{X_t!}\,\frac{\beta^\alpha}{\Gamma(\alpha)}\,
\int_0^\infty
\gamma^{\alpha+X_t-1}\,e^{-\gamma(\beta+\lambda_t)}\,d\gamma
\\& = &
\frac{\lambda_t^{X_t}}{X_t!}\,\frac{\beta^\alpha}{\Gamma(\alpha)}\,
\frac{\Gamma(\alpha+X_t)}{(\beta+\lambda_t)^{\alpha+X_t}}
\\& = &
\frac{\Gamma(\alpha+X_t)}{X_t!\,\Gamma(\alpha)}\,
\left(\frac{\beta}{\beta+\lambda_t}\right)^\alpha\,
\left(\frac{\lambda_t}{\beta+\lambda_t}\right)^{X_t}
\,,
\end{eqnarray}
which is again a 
[negative binomial](#Negative-binomial-distribution
"Section: Negative binomial distribution")
distribution
with a probability $p_t\doteq\frac{\beta}{\beta+\lambda_t}$ of a non-countable
event and a probability $q_t\doteq\frac{\lambda_t}{\beta+\lambda_t}$ of a countable event.
The marginal mean number of countable events is then
\begin{eqnarray}
\mu_{X_t} & ~\doteq~ & 
\mathbb{E}[X_t\mid\boldsymbol{\lambda}]~=~
\mathbb{E}[\mu_{X_t\mid\gamma}\mid\boldsymbol{\lambda}]
\\& = &
\mathbb{E}[\gamma\lambda_t\mid\boldsymbol{\lambda}]
~=~\lambda_t\,\mu_\gamma
\\& = &
\frac{\alpha\lambda_t}{\beta}
~=~\frac{\alpha\,q_t}{p_t}\,,
\end{eqnarray}
and the marginal variance is
\begin{eqnarray}
\sigma^2_{X_t} & ~\doteq~ & 
\mathbb{V}[X_t\mid\boldsymbol{\lambda}]
~=~
\mathbb{V}[\mu_{X_t\mid\gamma}\mid\boldsymbol{\lambda}]
+\mathbb{E}[\sigma^2_{X_t\mid\gamma}\mid\boldsymbol{\lambda}]
\\& = &
\mathbb{V}[\gamma\lambda_t\mid\boldsymbol{\lambda}]
+\mathbb{E}[\gamma\lambda_t\mid\boldsymbol{\lambda}]
\\& = &
\lambda_t^2\,\sigma^2_\gamma+\lambda_t\,\mu_\gamma
~=~\frac{\alpha\lambda_t^2}{\beta^2}+\frac{\alpha\lambda_t}{\beta}
%~=~\frac{\lambda_t\,\mu_\gamma}{p_t}
\\& = &
\frac{\alpha\,(\beta+\lambda_t)\,\lambda_t}{\beta^2}
~=~\frac{\alpha\,q_t}{p_t^2}
\,.
\end{eqnarray}

Next, the dependence between the $i$th and $j$th intervals, for $i\neq j$, is measured by
\begin{eqnarray}
\mathtt{Cov}[X_i,X_j] & ~\doteq~ &
\mathbb{E}[X_i\,X_j\mid\boldsymbol{\lambda}]
-\mathbb{E}[X_i\mid\boldsymbol{\lambda}]
\,\mathbb{E}[X_j\mid\boldsymbol{\lambda}]
\\& = & 
\mathbb{E}[\,
\mathbb{E}[X_i\,X_j\mid\boldsymbol{\lambda},\gamma]\mid\boldsymbol{\lambda}
]
-\mathbb{E}[X_i\mid\boldsymbol{\lambda}]
\,\mathbb{E}[X_j\mid\boldsymbol{\lambda}]
\\& = &
\mathbb{E}[\gamma\lambda_i\,\gamma\lambda_j\mid\boldsymbol{\lambda}]
-\lambda_i\mu_\gamma\,\lambda_j\mu_\gamma
\\& = &
\lambda_i\lambda_j\,(\sigma^2_\gamma+\mu_\gamma^2)
-\lambda_i\lambda_j\,\mu_\gamma^2~=~\lambda_i\lambda_j\sigma^2_\gamma
\,.
\end{eqnarray}
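
These moments can be combined numerically to see how much the positive covariances inflate the variance of the total count. The sketch below sums the marginal variances and the pairwise covariances for some assumed interval rates and hyper-parameters. Algebraically, the total equals $\sigma^2_\gamma\,(\lambda_1+\cdots+\lambda_n)^2+\mu_\gamma\,(\lambda_1+\cdots+\lambda_n)$. The function name and numerical values are illustrative only.

```python
def variance_of_total(rates, alpha, beta):
    """Return (sum of marginal variances, variance of the total count)."""
    mu = alpha / beta             # mean of gamma
    sigma2 = alpha / beta ** 2    # variance of gamma
    n = len(rates)
    # Marginal variance of each interval: lambda_t^2 sigma^2 + lambda_t mu.
    sum_var = sum(lam ** 2 * sigma2 + lam * mu for lam in rates)
    # Covariance between distinct intervals: lambda_i lambda_j sigma^2.
    sum_cov = sum(rates[i] * rates[j] * sigma2
                  for i in range(n - 1) for j in range(i + 1, n))
    return sum_var, sum_var + 2.0 * sum_cov

rates = [0.5, 1.0, 1.5]   # assumed interval rates
sum_var, total_var = variance_of_total(rates, alpha=2.0, beta=4.0)
print(sum_var, total_var)
```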

Finally, let us return to the total number $Y=X_1+\cdots+X_n$ of
countable events.
The conditional joint distribution of the interval counts is given by
\begin{eqnarray}
p(\vec{\mathbf{X}}_n\mid\boldsymbol{\lambda},\gamma) & = &
\prod_{t=1}^{n}
e^{-\gamma\lambda_t}\frac{(\gamma\lambda_t)^{X_t}}{X_t!}
~=~
\gamma^{n\bar{X}}\,e^{-n\bar{\lambda}\gamma}\,
\prod_{t=1}^{n}\frac{\lambda_t^{X_t}}{X_t!}
\,,
\end{eqnarray}
where we have defined $n\bar{\lambda}\doteq\lambda_1+\lambda_2+\cdots+\lambda_n$
for convenience. Hence, the conditional distribution for $Y$ is
given by
\begin{eqnarray}
p(Y=k\mid\boldsymbol{\lambda},\gamma) & ~=~ &
\underset{\mid X_1+\cdots+X_n=k}{\sum_{X_1=0}^\infty\cdots\sum_{X_n=0}^\infty}
\gamma^{k}\,e^{-n\bar{\lambda}\gamma}\,
\prod_{t=1}^{n}\frac{\lambda_t^{X_t}}{X_t!}
~=~
e^{-n\bar{\lambda}\gamma}\,\frac{(n\bar{\lambda}\gamma)^k}{k!}
\,,
\end{eqnarray}
which follows from the multinomial expansion of $(\lambda_1+\cdots+\lambda_n)^k$. Consequently, this is just a Poisson distribution.

As an aside, suppose that we know the total $Y=k$ and want to determine the distribution of the interval counts. Then observe that
\begin{eqnarray}
p(\vec{\mathbf{X}}\mid Y=k,\boldsymbol{\lambda},\gamma)
& ~=~ &
\frac{p(\vec{\mathbf{X}},Y=k\mid\boldsymbol{\lambda},\gamma)}
{p(Y=k\mid\boldsymbol{\lambda},\gamma)}
\\& = &
\frac{\gamma^{k}\,e^{-n\bar{\lambda}\gamma}\,
\prod_{t=1}^{n}\frac{\lambda_t^{X_t}}{X_t!}}
{e^{-n\bar{\lambda}\gamma}\,\frac{(n\bar{\lambda}\gamma)^k}{k!}}
\\& = &
\frac{k!}{\prod_{t=1}^{n}X_t!}
\prod_{t=1}^{n}\left(\frac{\lambda_t}{n\bar{\lambda}}\right)^{X_t}
\,,
\end{eqnarray}
since $k=X_1+\cdots+X_n$. This is just the multinomial distribution,
where $\frac{\lambda_t}{\lambda_1+\cdots+\lambda_n}$ is the probability
of assigning any single one of the $k$ counts to the $t$th interval at random.
Note that this result is independent of the conditional parameter $\gamma$.
Hence, we could equally have taken the ratio of the marginal distributions
instead.
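
The independence from $\gamma$ can be checked numerically. The sketch below, using hypothetical rates and counts, forms the conditional probability as the ratio of the joint Poisson probability to the Poisson probability of the total, and compares it with the multinomial form for two different values of $\gamma$.

```python
import math

def poisson_pmf(k, rate):
    """Poisson probability of observing k events at the given rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def conditional_pmf(counts, lambdas, gamma):
    """p(X | Y=k, lambda, gamma): joint pmf divided by the pmf of the total."""
    k = sum(counts)
    joint = math.prod(poisson_pmf(x, gamma * lam)
                      for x, lam in zip(counts, lambdas))
    return joint / poisson_pmf(k, gamma * sum(lambdas))

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given counts with cell probabilities probs."""
    coeff = math.factorial(sum(counts))
    for x in counts:
        coeff //= math.factorial(x)   # exact integer division at every step
    return coeff * math.prod(p**x for p, x in zip(probs, counts))

lambdas = [0.5, 1.2, 2.0]                      # hypothetical interval rates
probs = [lam / sum(lambdas) for lam in lambdas]
counts = (1, 0, 3)                             # hypothetical interval counts

for gamma in (0.3, 2.5):
    print(gamma, conditional_pmf(counts, lambdas, gamma),
          multinomial_pmf(counts, probs))
```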

The marginal distribution for $Y$ is now given by
\begin{eqnarray}
p(Y=k\mid\boldsymbol{\lambda}) & ~=~ &
\int_0^\infty
e^{-n\bar{\lambda}\gamma}\frac{(n\bar{\lambda}\gamma)^k}{k!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\gamma^{\alpha-1}\,
e^{-\beta\gamma}\,d\gamma
\\& = &
\frac{(n\bar{\lambda})^k\,\beta^\alpha}{k!\,\Gamma(\alpha)}\,
\int_0^\infty
\gamma^{\alpha+k-1}\,e^{-(\beta+n\bar{\lambda})\gamma}\,d\gamma
\\& = &
\frac{(n\bar{\lambda})^k\,\beta^\alpha}{k!\,\Gamma(\alpha)}\,
\frac{\Gamma(\alpha+k)}{(\beta+n\bar{\lambda})^{\alpha+k}}
\\&=&
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,
\left(\frac{\beta}{\beta+n\bar{\lambda}}\right)^\alpha\,
\left(\frac{n\bar{\lambda}}{\beta+n\bar{\lambda}}\right)^k
\,.
\end{eqnarray}
This is just a negative binomial distribution with the overall
probabilities of non-countable and countable events given by
\begin{eqnarray}
\bar{p}~\doteq~\frac{\beta}{\beta+n\bar{\lambda}}
& ~\;\;\Rightarrow~\;\; &
\bar{q}~=~1-\bar{p}~\doteq~\frac{n\bar{\lambda}}{\beta+n\bar{\lambda}}\,,
\end{eqnarray}
respectively. The marginal mean count is thus
\begin{eqnarray}
\mu_Y & ~=~ & \frac{\alpha\bar{q}}{\bar{p}}
~=~\frac{n\bar{\lambda}\alpha}{\beta}\,,
\end{eqnarray}
and the marginal variance is
\begin{eqnarray}
\sigma^2_Y & ~=~ & \frac{\alpha\bar{q}}{\bar{p}^2}
~=~\frac{n\bar{\lambda}\alpha\,(\beta+n\bar{\lambda})}{\beta^2}
~=~\frac{n\bar{\lambda}\alpha}{\beta}+\frac{n^2\bar{\lambda}^2\alpha}{\beta^2}
\,.
\end{eqnarray}
In comparison, the sum of variances of the marginal interval counts is
\begin{eqnarray}
\sum_{t=1}^{n}\sigma^2_{X_t} & ~=~ &
\sum_{t=1}^{n}\frac{\alpha\lambda_t\,(\beta+\lambda_t)}{\beta^2}
~=~\frac{n\bar{\lambda}\alpha}{\beta}+\frac{n\overline{\lambda^2}\alpha}{\beta^2}
\,,
\end{eqnarray}
where $n\overline{\lambda^2}\doteq\lambda_1^2+\cdots+\lambda_n^2$.
The excess variance is then
\begin{eqnarray}
\sigma^2_Y - \sum_{t=1}^{n}\sigma^2_{X_t} & ~=~ &
\frac{\alpha}{\beta^2}\,\left[(\lambda_1+\cdots+\lambda_n)^2
-(\lambda_1^2+\cdots+\lambda_n^2)\right]
\\& = &
\frac{2\alpha}{\beta^2}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\lambda_i\lambda_j
~=~2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\mathtt{Cov}[X_i,X_j]
\,,
\end{eqnarray}
since $\sigma^2_\gamma=\frac{\alpha}{\beta^2}$.
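
The marginal results above can be verified numerically. The following sketch evaluates the negative binomial form in log space, obtains the mean and variance of $Y$ by truncated summation, and confirms that the excess variance equals twice the sum of the pairwise covariances. The parameter values and the truncation point of 400 terms are assumptions made only for this illustration.

```python
import math

def neg_binomial_pmf(k, alpha, beta, rate_sum):
    """Marginal p(Y=k | lambda), where rate_sum stands for n * lambda_bar."""
    p_bar = beta / (beta + rate_sum)
    log_pmf = (math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
               + alpha * math.log(p_bar) + k * math.log(1.0 - p_bar))
    return math.exp(log_pmf)

alpha, beta = 3.0, 2.0          # hypothetical gamma prior parameters
lambdas = [0.5, 1.2, 2.0]       # hypothetical interval rates
rate_sum = sum(lambdas)         # n * lambda_bar

# Moments of Y by direct summation; the tail beyond 400 is negligible here.
ks = range(400)
pmf = [neg_binomial_pmf(k, alpha, beta, rate_sum) for k in ks]
mean_y = sum(k * p for k, p in zip(ks, pmf))
var_y = sum((k - mean_y) ** 2 * p for k, p in zip(ks, pmf))

# Closed forms for the marginal mean and variance.
mean_closed = rate_sum * alpha / beta
var_closed = rate_sum * alpha * (beta + rate_sum) / beta**2

# Excess variance versus twice the sum of the pairwise covariances.
sum_var_x = sum(alpha * lam * (beta + lam) / beta**2 for lam in lambdas)
var_gamma = alpha / beta**2
cov_sum = sum(lambdas[i] * lambdas[j] * var_gamma
              for i in range(len(lambdas))
              for j in range(i + 1, len(lambdas)))

print(mean_y, mean_closed)
print(var_y, var_closed)
print(var_closed - sum_var_x, 2 * cov_sum)
```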

## Beta-Binomial distribution

The binomial distribution may be derived as a sum of $n$ independent
Bernoulli variables. Hence, we may generalise the
[beta-Bernoulli](#Beta-Bernoulli-distribution "Section: Beta-Bernoulli distribution") distribution to the binomial case.
One rationale for doing so is to model  data that are empirically
over-dispersed with respect to the binomial distribution.
This is similar to our use of the
[gamma-Poisson](#Gamma-Poisson-distribution "Section: Gamma-Poisson distribution") distribution to handle over-dispersion with respect to the
Poisson distribution.

### Beta prior distribution (again)

Once again, we utilise the general
[beta](#Beta-prior-distribution "Section: Beta prior distribution")
prior of
\begin{eqnarray}
\theta & ~\sim~ & \mathtt{Beta}(\alpha,\beta)
\,,
\end{eqnarray}
for which the mean and variance are given by
\begin{eqnarray}
\mu_\theta & ~\doteq~ & \mathbb{E}[\theta\mid\alpha,\beta]
~=~\frac{\alpha}{\alpha+\beta}\,,
\\
\sigma^2_\theta & ~\doteq~ & \mathbb{V}[\theta\mid\alpha,\beta]
~=~\frac{\alpha\beta}{(\alpha+\beta)^2\,(\alpha+\beta+1)}\,,
\end{eqnarray}
respectively.
[Recall](https://en.wikipedia.org/wiki/Beta_distribution
"Wikipedia: Beta distribution") that the beta density function is
\begin{eqnarray}
p(\theta\mid\alpha,\beta) & ~=~ &
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,
\theta^{\alpha-1}\,(1-\theta)^{\beta-1}\,.
\end{eqnarray}


### Binomial data likelihood

The binomial distribution abstracts from each observed Bernoulli sequence of $n$ trials by disregarding the order and simply counting the total number of
special events of interest (e.g. successes, or failures) which occurred. It happens that such sequences are *exchangeable*, meaning that every permutation of an observed
sequence has the *same* probability of occurrence. Thus, there are
$n\choose x$ exchangeable sequences all having exactly $x$ special events and $n-x$ non-special events, in any order.

Thus, let the count $X$ follow a binomial distribution, namely
\begin{eqnarray}
X\mid\theta & ~\sim~ & \texttt{Binom}(n,\theta)\,,
\end{eqnarray}
where $\theta$ is the probability of an event of interest.
In our sporting match scenario, we might let $n$ be the total allowable
number of attempted scoring shots in a fixed period of time.
Then $\theta$ might be the probability 
that any given attempt actually succeeds. Thus, $X$ would give the total score
of a given team for the match.

The likelihood of a given score is then
\begin{eqnarray}
p(X=x\mid n,\theta) & ~=~ & {n\choose x}\,\theta^x\,(1-\theta)^{n-x}\,,
\end{eqnarray}
which gives a conditional mean and variance of
\begin{eqnarray}
\mu_{X\mid\theta} & ~\doteq~ & \mathbb{E}[X\mid n,\theta]
~=~n\,\theta\,,
\\
\sigma^2_{X\mid\theta} & ~\doteq~ & \mathbb{V}[X\mid n,\theta]
~=~ n\,\theta\,(1-\theta)\,,
\end{eqnarray}
respectively. Observe that $\sigma^2_{X\mid\theta}<\mu_{X\mid\theta}$ for all $\theta\in(0,1)$.
Thus the binomial distribution is under-dispersed in comparison to
the Poisson distribution, for which 
$\sigma^2_{X\mid\lambda}=\mu_{X\mid\lambda}=\lambda$.

### Beta-binomial mixture distribution

To allow for over-dispersion, we permit the probability $\theta$
to first be drawn from the beta prior before the start of each match, and then to be held constant for the duration of the match.
The law of 
[iterated expectations](https://en.wikipedia.org/wiki/Law_of_total_expectation
"Wikipedia: Law of total expectation") then gives
\begin{eqnarray}
\mu_X & ~\doteq~ & \mathbb{E}[X] ~=~
\mathbb{E}[\mu_{X\mid\theta}\mid\alpha,\beta]
\\
& ~=~ & \mathbb{E}[n\,\theta\mid\alpha,\beta]~=~n\mu_\theta\,.
\end{eqnarray}
Similarly, the 
law of 
[iterated variances](https://en.wikipedia.org/wiki/Law_of_total_variance
"Wikipedia: Law of total variance")
gives
\begin{eqnarray}
\sigma^2_X & ~\doteq~ & \mathbb{V}[X] ~=~
\mathbb{V}[\mu_{X\mid\theta}\mid\alpha,\beta]
+
\mathbb{E}[\sigma^2_{X\mid\theta}\mid\alpha,\beta]
\\
& ~=~ & 
\mathbb{V}[n\,\theta\mid\alpha,\beta]
+
\mathbb{E}[n\,\theta\,(1-\theta)\mid\alpha,\beta]
\\
& ~=~ & n^2\sigma^2_\theta+n\,\mu_\theta-n\,(\sigma^2_\theta+\mu_\theta^2)
\\
& = &
n\,\mu_\theta\,(1-\mu_\theta)+n(n-1)\,\sigma^2_\theta
\,.
\end{eqnarray}
Observe that if $\sigma^2_\theta=0$ then $\theta=\mu_\theta$ and the variance reduces to the usual binomial variance of $\sigma^2_X=n\,\theta\,(1-\theta)$.
Hence, for $\sigma^2_\theta>0$ the beta-binomial allows for additional
variance over the binomial distribution.
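
The two-stage process described above can be simulated directly. The sketch below is a Monte Carlo illustration only: the parameter values, the number of simulated matches and the seed are arbitrary assumptions, and the sample moments will agree with the closed forms only up to sampling error.

```python
import random

def simulate_match_scores(n, alpha, beta, num_matches, seed=0):
    """Draw theta ~ Beta(alpha, beta) once per match, then n Bernoulli(theta) attempts."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_matches):
        theta = rng.betavariate(alpha, beta)   # held constant within the match
        scores.append(sum(rng.random() < theta for _ in range(n)))
    return scores

n, alpha, beta = 20, 2.0, 3.0      # hypothetical values
scores = simulate_match_scores(n, alpha, beta, num_matches=20000)

sample_mean = sum(scores) / len(scores)
sample_var = sum((s - sample_mean) ** 2 for s in scores) / (len(scores) - 1)

# Closed-form moments of the beta prior and of the mixture.
mu_theta = alpha / (alpha + beta)
var_theta = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
mean_closed = n * mu_theta
var_closed = n * mu_theta * (1 - mu_theta) + n * (n - 1) * var_theta
var_binomial = n * mu_theta * (1 - mu_theta)   # variance if theta were fixed at its mean

print(sample_mean, mean_closed)
print(sample_var, var_closed, var_binomial)
```

For these values the mixture variance is several times the plain binomial variance, which is the over-dispersion that the mixture is intended to capture.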

The 
[beta-binomial](https://en.wikipedia.org/wiki/Beta-binomial_distribution
"Wikipedia: Beta-binomial distribution")
mixture distribution for a single score $X$ is now derived as
\begin{eqnarray}
p(X=x\mid n,\alpha,\beta) & ~=~ & \int_0^1 p(x\mid n,\theta)\,p(\theta\mid\alpha,\beta)\,d\theta
\\& = &
\int_0^1 {n\choose x}\theta^x(1-\theta)^{n-x}\,
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}
\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta
\\& = &
{n\choose x}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}
\int_0^1 \theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1}\,d\theta
\\& =  &
{n\choose x}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}
\,\frac{\Gamma(\alpha+x)\,\Gamma(\beta+n-x)}{\Gamma(\alpha+\beta+n)}
\,,
\end{eqnarray}
since the final integrand is proportional to the $\texttt{Beta}(\alpha+x,\beta+n-x)$ distribution.
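
A minimal sketch of this probability mass function is given below, evaluated in log space with `math.lgamma` to avoid overflow in the gamma functions. The function name and the parameter values are hypothetical. The sketch also checks that the probabilities sum to one and that the mean and variance agree with the moment formulas derived earlier.

```python
import math

def beta_binomial_pmf(x, n, alpha, beta):
    """p(X=x | n, alpha, beta), computed from log-gamma terms for stability."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1))
    log_prior_norm = (math.lgamma(alpha + beta) - math.lgamma(alpha)
                      - math.lgamma(beta))
    log_beta_post = (math.lgamma(alpha + x) + math.lgamma(beta + n - x)
                     - math.lgamma(alpha + beta + n))
    return math.exp(log_choose + log_prior_norm + log_beta_post)

n, alpha, beta = 20, 0.7, 1.5     # hypothetical values; non-integer parameters are fine
pmf = [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]

total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))

# Moment formulas from the laws of iterated expectations and variances.
mu_theta = alpha / (alpha + beta)
var_theta = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
mean_closed = n * mu_theta
var_closed = n * mu_theta * (1 - mu_theta) + n * (n - 1) * var_theta

print(total)
print(mean, mean_closed)
print(var, var_closed)
```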


Let us now define $z!\doteq\Gamma(z+1)$ for real-valued $z\in(-1,\infty)$, so that terms such as $(\alpha-1)!$ remain defined for any $\alpha>0$.
Then the beta-binomial distribution becomes
\begin{eqnarray}
p(X=x\mid n,\alpha,\beta) & ~=~ &
\frac{n!\,(\alpha+\beta-1)!\,(\alpha+x-1)!\,(\beta+n-x-1)!}
{x!\,(n-x)!\,(\alpha-1)!\,(\beta-1)!\,(\alpha+\beta+n-1)!}
\\
& = &
\frac{(\alpha+x-1)!}{x!\,(\alpha-1)!}\,
\frac{(\beta+n-x-1)!}{(n-x)!\,(\beta-1)!}
\left/\frac{(\alpha+\beta+n-1)!}{n!\,(\alpha+\beta-1)!}\right.
\\ & = &
{\alpha+x-1\choose x}\,{\beta+n-x-1\choose n-x}
\left/{\alpha+\beta+n-1\choose n}\right.\,.
\end{eqnarray}
This is just the real-valued  version of
the [negative hypergeometric](https://en.wikipedia.org/wiki/Negative_hypergeometric_distribution
"Wikipedia: Negative hypergeometric distribution")
distribution, i.e. a *Pólya urn* distribution.
Notionally, in the integer-valued version, the urn contains $K=n$ successes and $N-K=\alpha+\beta-1$ failures,
and sampling without replacement halts as soon as $r=\alpha$ failures have been observed, giving $k=x$ observed successes.
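
For integer-valued $\alpha$ and $\beta$, the three-binomial form can be evaluated exactly. The sketch below uses rational arithmetic, so the normalisation and the mean $n\alpha/(\alpha+\beta)$ can be checked without rounding error. The function name and parameter values are hypothetical.

```python
from fractions import Fraction
from math import comb

def urn_pmf(x, n, alpha, beta):
    """Beta-binomial pmf for integer alpha and beta, as a ratio of binomial coefficients."""
    numerator = comb(alpha + x - 1, x) * comb(beta + n - x - 1, n - x)
    return Fraction(numerator, comb(alpha + beta + n - 1, n))

# Hypothetical integer parameters: the urn holds K = n successes and
# N - K = alpha + beta - 1 failures, and sampling stops after r = alpha failures.
n, alpha, beta = 10, 3, 4

pmf = [urn_pmf(x, n, alpha, beta) for x in range(n + 1)]
total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))

print(total)   # exact rational total probability
print(mean, Fraction(n * alpha, alpha + beta))
```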

Note that, paradoxically, $\alpha$ is actually the prior number of *successes*
from the $\texttt{Beta}(\alpha,\beta)$ prior distribution.
The fact that $r=\alpha$ is also the required number of urn sampling *failures* can be explained
(with a bit of hand waving)
by the fact that a high value of $\alpha$ would imply that a high number $x$ of successes would be expected to occur
before $\alpha$ failures were finally observed.
This makes sense since, from the beta distribution, a high value of $\alpha$ would lead to a high expected probability of observing a success in any given
Bernoulli trial.

## Citations

[1] George E. P. Box and George C. Tiao (1973), "*Bayesian Inference in Statistical Analysis*", John Wiley & Sons.

[2] Jouni Kerman (2011), "*Neutral noninformative and informative conjugate beta and gamma prior distributions*", Electron. J. Statist. 5: 1450-1470.