# Appendix D: Modelling Distributions

This appendix discusses some probability distributions that are useful for modelling
the outcomes of sporting matches, particularly Australian Rules football.

## Beta-Bernoulli distribution

Consider the situation where the outcome $X$ (i.e. a win or loss) of a match between team A and team B is stochastically decided,
given some (unknown) probability $\theta$ that team A wins. Essentially, the outcome may be considered as a biased coin toss.
However, prior to the match we might consider the relative strengths of teams A and B, and this knowledge might inform us
as to likely values of the probability $\theta$. Consequently, we may model $X$ by a continuous mixture distribution over 
$\theta$, and then use this to derive a posterior distribution for $\theta$.

### Bernoulli data likelihood

Let the variate $X\in\{0,1\}$ represent the outcome of a single match between some arbitrary team A and team B, with $X=1$ indicating a win by team A, and $X=0$ indicating a win by team B.
For convenience, we consider the Bernoulli distribution
\begin{eqnarray}
X\mid\theta & ~\sim~ & \texttt{Bern}(\theta)\,,
\end{eqnarray}
where $\theta$ is the assumed probability that team A wins, which is decided before the match commences.
Thus, we see that
\begin{eqnarray}
p(X=x\mid\theta) & ~=~ & \theta^x\,(1-\theta)^{1-x}\,.
\end{eqnarray}
If we further wish to model the outcomes of an ordered sequence of $n$ *conditionally independent* matches between the same teams A and B, 
with the same parameter
$\theta$, then we will obtain the joint data likelihood
\begin{eqnarray}
p(X=w\mid n,\theta) & ~=~ & \theta^w\,(1-\theta)^{n-w}\,,
\end{eqnarray}
where $w\in\{0,1,\ldots,n\}$ represents the total number of wins for team A, and $\ell=n-w$ represents the total number of
losses.

Note that the use of the Bernoulli distribution explicitly presumes that only wins and losses are possible.
How then might we deal with games for which matches may end in a draw?
It turns out that there are a number of good reasons to treat a draw as being half-a-win and half-a-loss for each team.
Thus, we might define the weighted log-likelihood for a draw as being given by
\begin{eqnarray}
L_\texttt{draw}(\theta) & ~\doteq~ & \frac{1}{2}\ln p(X=1\mid\theta)+\frac{1}{2}\ln p(X=0\mid\theta)
\\& =  &
\frac{1}{2}\ln\theta+\frac{1}{2}\ln\,(1-\theta)
~=~
\ln\left[\theta^{\frac{1}{2}}\,(1-\theta)^{\frac{1}{2}}\right]\,.
\end{eqnarray}
We observe that functionally this takes the same form as
\begin{eqnarray}
L_\texttt{draw}(\theta) & ~=~ & \ln p(X=0.5\mid\theta)\,.
\end{eqnarray}
Consequently, we may numerically treat observed draws as having the value $X=\frac{1}{2}$.
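
As a minimal sketch of this convention, the log-likelihood of a set of results may be accumulated with draws coded as $x=\frac{1}{2}$. The function name `bernoulli_log_likelihood` and the sample results are illustrative assumptions, not part of the model above.

```python
import math

def bernoulli_log_likelihood(outcomes, theta):
    """Sum x*ln(theta) + (1-x)*ln(1-theta) over outcomes in {0, 0.5, 1}.

    A draw is coded as x = 0.5, i.e. half-a-win and half-a-loss.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return sum(
        x * math.log(theta) + (1.0 - x) * math.log(1.0 - theta)
        for x in outcomes
    )

# Hypothetical results for team A: win, draw, loss, win.
results = [1, 0.5, 0, 1]
print(bernoulli_log_likelihood(results, 0.6))
```

For these hypothetical results the total is $2.5\ln\theta+1.5\ln(1-\theta)$, i.e. the draw contributes half a win and half a loss to the exponents.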

### Beta prior distribution

We seek a non-informative prior distribution for $\theta$. 
Following Box and Tiao [[1]](#Citations "Citation [1]: Bayesian Inference in Statistical Analysis"), 
we desire a transformation $\phi(\theta)$ such that the likelihood $p(X=w\mid n,\theta)$, plotted as a function of $\phi$, remains
approximately invariant in shape and size for some fixed $n$ as $w$ varies (but varies in its mean location). 
A uniform prior for $\phi$ then induces a
non-informative prior for $\theta$. For one-parameter likelihoods such as the Bernoulli and Poisson, parameterised
by their mean, the induced prior is inversely proportional to the square root of the variance. For the Bernoulli distribution, we have
\begin{eqnarray}
\mathbb{V}[X\mid\theta] & ~=~ & \theta\,(1-\theta)\,,
\end{eqnarray}
whereupon the non-informative prior is therefore
\begin{eqnarray}
p(\theta) ~\propto~ \frac{1}{\sqrt{\theta\,(1-\theta)}}
& ~~~\Rightarrow~~~ & \theta~\sim~\mathtt{Beta}\left(\frac{1}{2},\frac{1}{2}\right)
\,.
\end{eqnarray}
Observe that this is just 
[Jeffreys' prior](https://en.wikipedia.org/wiki/Jeffreys_prior "Wikipedia: Jeffreys' prior"), 
which notionally corresponds to a single prior pseudo-match
with half-a-win and half-a-loss, i.e. a draw.

More generally, we might choose an arbitrary beta prior distribution
\begin{eqnarray}
\theta & ~\sim~ & \mathtt{Beta}(\alpha,\beta)
\,,
\end{eqnarray}
for some hyper-parameters $\alpha$ and $\beta$, where
$\alpha$ represents the prior number of pseudo-wins and $\beta$ represents the prior number of pseudo-losses.
Note that technically the parameter dependence is explicitly $\theta\mid\alpha,\beta$. However, by convention
the hyper-parameters are often left implicit to indicate that they are held constant. Despite this useful convention,
it is best to always explicitly condition on all relevant parameters when dealing with the actual probability functions,
in order to avoid accidental ambiguity, e.g. by hiding important modelling assumptions.

The special case of $\alpha=\beta$ corresponds to having no prior information that favours
either team over the other.
We could, for example, choose the 
[Haldane prior](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution") 
with $\alpha=\beta=0$ to indicate that we have no
prior pseudo-matches.
Similarly, the uniform prior with $\alpha=\beta=1$ corresponds to one previous pseudo-win and one pseudo-loss.
Alternatively, 
Kerman [[2]](#Citations "Citation [2]: Neutral noninformative and informative conjugate beta and gamma prior distributions") 
derives $\alpha=\beta=\frac{1}{3}$ as being a noninformative prior that is flatter than Jeffreys' prior, and has the property that, under the resulting posterior, $\theta$ is approximately equally likely to be greater than or less than the maximum likelihood estimate.

Note that the beta distribution is also a good choice due to its complementary symmetry in $\alpha$ and $\beta$.
Suppose that $\alpha$ denotes our pre-match knowledge about the strength of team A, and similarly let
$\beta$ represent our prior knowledge about team B. Now, $\theta=\theta_A$ is the prior probability of team A winning,
such that $\theta_A\sim\texttt{Beta}(\alpha,\beta)$.
However, the ordering of teams A and B is arbitrary. If we swapped the order of the teams and their respective hyper-parameters, 
then we would instead have $\theta_B\sim\texttt{Beta}(\beta,\alpha)$.
The beta distribution correctly satisfies the constraint that
\begin{eqnarray}
\mathbb{E}[\theta\mid\alpha,\beta] & ~=~ & \frac{\alpha}{\alpha+\beta}~=~1-\mathbb{E}[\theta\mid\beta,\alpha]\,.
\end{eqnarray}
This is the aforementioned complementary symmetry.
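
The symmetry can be checked numerically. The following sketch uses only the standard library; the helper names `beta_log_pdf` and `beta_mean` and the chosen hyper-parameter values are assumptions made for illustration. It verifies both the relation between the means and the stronger statement that the density of $\texttt{Beta}(\alpha,\beta)$ at $\theta$ equals the density of $\texttt{Beta}(\beta,\alpha)$ at $1-\theta$.

```python
import math

def beta_log_pdf(theta, alpha, beta):
    """Log-density of Beta(alpha, beta) at theta, for alpha, beta > 0."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return ((alpha - 1.0) * math.log(theta)
            + (beta - 1.0) * math.log(1.0 - theta)
            - log_norm)

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Hypothetical strengths: team A has alpha = 3, team B has beta = 2.
alpha, beta = 3.0, 2.0
print(beta_mean(alpha, beta), 1.0 - beta_mean(beta, alpha))
print(beta_log_pdf(0.3, alpha, beta), beta_log_pdf(0.7, beta, alpha))
```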

### Beta-Bernoulli mixture distribution

The mixture distribution for a single match is now derived as
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \int_0^1 p(x\mid\theta)\,p(\theta\mid\alpha,\beta)\,d\theta
\\& = &
\int_0^1 \theta^x(1-\theta)^{1-x}\,
\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta
\\& = &
\frac{1}{B(\alpha,\beta)}\int_0^1 \theta^{\alpha+x-1}(1-\theta)^{\beta-x}\,d\theta
\\& =  &
\frac{B(\alpha+x,\beta-x+1)}{B(\alpha,\beta)}\int_0^1 
\frac{\theta^{\alpha+x-1}(1-\theta)^{\beta-x}}{B(\alpha+x,\beta-x+1)}\,d\theta\,.
\end{eqnarray}
Recognising the final integrand as the $\texttt{Beta}(\alpha+x,\beta-x+1)$ distribution, we obtain
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ &
\frac{B(\alpha+x,\beta-x+1)}{B(\alpha,\beta)}\,.
\end{eqnarray}
Next, we expand $B(\cdot,\cdot)$ in terms of the *gamma* function $\Gamma(\cdot)$ to obtain
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ &
\frac{\Gamma(\alpha+x)\,\Gamma(\beta+1-x)}{\Gamma(\alpha+\beta+1)}\,
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,.
\end{eqnarray}
Finally, using the recurrence relation that $\Gamma(z+1)=z\,\Gamma(z)$, we deduce that
\begin{eqnarray}
p(X=1\mid\alpha,\beta)~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\;\; &
p(X=0\mid\alpha,\beta)~=~\frac{\beta}{\alpha+\beta}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{\alpha^x\,\beta^{1-x}}{\alpha+\beta}\,.
\end{eqnarray}
Observe that this is just
\begin{eqnarray}
X & ~\sim~ & \texttt{Bern}(\bar{\theta})\,,
\end{eqnarray}
with $\bar{\theta}=\frac{\alpha}{\alpha+\beta}$ being the mean of the beta distribution.
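
As a quick numerical check of this reduction, the ratio of beta functions can be evaluated through log-gamma functions and compared with $\bar{\theta}$. This is a sketch only; the helper names `log_beta` and `mixture_prob` and the hyper-parameter values are assumptions, and the code requires $\alpha,\beta>0$.

```python
import math

def log_beta(a, b):
    """ln B(a, b) via log-gamma functions, for a, b > 0."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def mixture_prob(x, alpha, beta):
    """p(X=x | alpha, beta) = B(alpha + x, beta - x + 1) / B(alpha, beta)."""
    return math.exp(log_beta(alpha + x, beta - x + 1.0) - log_beta(alpha, beta))

# Hypothetical hyper-parameters.
alpha, beta = 2.5, 1.5
theta_bar = alpha / (alpha + beta)
print(mixture_prob(1, alpha, beta), theta_bar)        # win
print(mixture_prob(0, alpha, beta), 1.0 - theta_bar)  # loss
```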

Subsequently, for the case of $n$ *unconditionally independent* matches between teams A and B, we would obtain the joint distribution
\begin{eqnarray}
p_u(X=w\mid n,\alpha,\beta) & ~=~ & \frac{\alpha^w\,\beta^{n-w}}{(\alpha+\beta)^n}~=~\bar{\theta}^w\,(1-\bar{\theta})^{n-w}\,.
\end{eqnarray}
Note that this is notionally different from the *conditionally independent* case assumed
[previously](#Bernoulli-data-likelihood "Section: Bernoulli data likelihood"),
where the $n$ matches all shared the same value of the parameter $\theta$. In the unconditional form, the parameter
$\theta$ is assumed to be resampled before every match.

By contrast, for the conditionally independent case, we would instead derive that
\begin{eqnarray}
p_c(X=w\mid n,\alpha,\beta) & ~=~ & \int_0^1 p(X=w\mid n,\theta)\,p(\theta\mid\alpha,\beta)\,d\theta
\\& = &
\int_0^1 \theta^w\,(1-\theta)^{n-w}\,
\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta
\\& = &
\frac{1}{B(\alpha,\beta)}\int_0^1 \theta^{\alpha+w-1}(1-\theta)^{\beta+n-w-1}\,d\theta
\\& =  &
\frac{B(\alpha+w,\beta+n-w)}{B(\alpha,\beta)}\,.
\end{eqnarray}
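
The difference between the two joint distributions is easy to see numerically. In the following sketch (helper names and the example values are assumptions), `p_conditional` evaluates $p_c$ and `p_unconditional` evaluates $p_u$ for one particular ordered sequence containing $w$ wins in $n$ matches, assuming a proper prior with $\alpha,\beta>0$.

```python
import math

def log_beta(a, b):
    """ln B(a, b) via log-gamma functions, for a, b > 0."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def p_conditional(w, n, alpha, beta):
    """Shared theta: B(alpha + w, beta + n - w) / B(alpha, beta)."""
    return math.exp(log_beta(alpha + w, beta + n - w) - log_beta(alpha, beta))

def p_unconditional(w, n, alpha, beta):
    """Resampled theta: theta_bar^w * (1 - theta_bar)^(n - w)."""
    theta_bar = alpha / (alpha + beta)
    return theta_bar ** w * (1.0 - theta_bar) ** (n - w)

# Two straight wins under the uniform prior (alpha = beta = 1).
print(p_conditional(2, 2, 1.0, 1.0))    # 1/3 up to rounding
print(p_unconditional(2, 2, 1.0, 1.0))  # 0.25
```

With a shared $\theta$, the first win makes a second win more plausible, so the probability of two straight wins exceeds the product of the marginal probabilities. For $n=1$ the two functions agree.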

### Beta posterior distribution

For the case of observing a sequence of $n$ conditionally independent matches, the posterior distribution is given by
\begin{eqnarray}
p(\theta\mid n,w,\alpha,\beta) & ~=~ &
\frac{p(X=w\mid n,\theta)\,p(\theta\mid\alpha,\beta)}{p_c(X=w\mid n,\alpha,\beta)}
~=~
\frac{\theta^{\alpha+w-1}(1-\theta)^{\beta+n-w-1}}{B(\alpha+w,\beta+n-w)}\,,
\end{eqnarray}
such that
\begin{eqnarray}
\theta\mid n,w & ~\sim~ & \texttt{Beta}(\alpha+w,\beta+n-w)\,.
\end{eqnarray}
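
In code, the conjugate update amounts to adding the observed wins and losses to the prior pseudo-counts. The sketch below is illustrative only: the function names and the 7-from-10 record are assumptions.

```python
def beta_posterior(alpha, beta, w, n):
    """Hyper-parameters of the posterior Beta(alpha + w, beta + n - w)."""
    if not 0 <= w <= n:
        raise ValueError("w must satisfy 0 <= w <= n")
    return alpha + w, beta + n - w

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

# Hypothetical record: 7 wins from 10 matches, starting from Jeffreys' prior.
a_post, b_post = beta_posterior(0.5, 0.5, 7, 10)
print(a_post, b_post, beta_mean(a_post, b_post))

# Updating in two stages gives the same posterior as a single batch update.
a_mid, b_mid = beta_posterior(0.5, 0.5, 3, 4)
print(beta_posterior(a_mid, b_mid, 4, 6))
```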

### Beta-Bernoulli predictive distribution

Suppose the result of the $(n+1)$-th match is now $X$, conditionally independent of previous matches for the same
parameter $\theta$. Then the likelihood of this result is
\begin{eqnarray}
p(X=x\mid\theta) & ~=~ & \theta^{x}\,(1-\theta)^{1-x}\,.
\end{eqnarray}
Hence, the predictive probability of this result is given by
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & \int_0^1 p(x\mid\theta)\,p(\theta\mid n,w,\alpha,\beta)\,d\theta
\\&~=~&
\int_0^1\frac{\theta^{\alpha+w+x-1}\,(1-\theta)^{\beta+n-w-x}}
{B(\alpha+w,\beta+n-w)}\,d\theta
\\ & = &
\frac{B(\alpha+w+x,\beta+n-w+1-x)}
{B(\alpha+w,\beta+n-w)}
\,.
\end{eqnarray}

In terms of the gamma function, $\Gamma(\cdot)$, this becomes
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & 
\frac{\Gamma(\alpha+w+x)\,\Gamma(\beta+n-w+1-x)}
     {\Gamma(\alpha+\beta+n+1)}\,
\frac{\Gamma(\alpha+\beta+n)}
     {\Gamma(\alpha+w)\,\Gamma(\beta+n-w)}\,.
\end{eqnarray}
For a loss, i.e. $X=0$, the respective probability reduces to
\begin{eqnarray}
p(X=0\mid n,w,\alpha,\beta) & ~=~ & \frac{\beta+n-w}{\alpha+\beta+n}\,,
\end{eqnarray}
again using the recurrence relation that $\Gamma(z+1)=z\,\Gamma(z)$.
The corresponding probability of a win is therefore
\begin{eqnarray}
p(X=1\mid n,w,\alpha,\beta) & ~=~ & \frac{\alpha+w}{\alpha+\beta+n}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & \frac{(\alpha+w)^x\,(\beta+n-w)^{1-x}}{\alpha+\beta+n}\,.
\end{eqnarray}
Defining $\hat{\theta}\doteq\frac{\alpha+w}{\alpha+\beta+n}$, we observe that
\begin{eqnarray}
p(X=x\mid n,w,\alpha,\beta) & ~=~ & \left(\hat{\theta}\right)^x\,\left(1-\hat{\theta}\right)^{1-x}
\,,
\end{eqnarray}
such that
\begin{eqnarray}
X\mid\hat{\theta} & ~\sim~ & \texttt{Bern}(\hat{\theta})
\,.
\end{eqnarray}

Note that the denominator of $\hat{\theta}$ corresponds to assuming $n$ observed matches plus $\alpha+\beta$ prior pseudo-matches, and the numerator corresponds to $w$ observed wins plus $\alpha$ prior pseudo-wins.
Thus, $\hat{\theta}$ is just a *smoothed* estimate of the empirical probability of a win.
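
To illustrate the smoothing, the sketch below evaluates $\hat{\theta}$ for a hypothetical record of $w=7$ wins from $n=10$ matches under the symmetric priors discussed earlier. The function name and the record are assumptions made for this example.

```python
def predictive_win_probability(w, n, alpha, beta):
    """Smoothed estimate (alpha + w) / (alpha + beta + n)."""
    total = alpha + beta + n
    if total <= 0:
        raise ValueError("need at least one observed match or pseudo-match")
    return (alpha + w) / total

# Symmetric priors alpha = beta = a, keyed by their usual names.
priors = {"Haldane": 0.0, "Kerman": 1.0 / 3.0, "Jeffreys": 0.5, "uniform": 1.0}
for name, a in priors.items():
    print(name, predictive_win_probability(7, 10, a, a))
```

The Haldane prior reproduces the empirical proportion $w/n$, while larger values of $\alpha=\beta$ pull the estimate progressively towards $\frac{1}{2}$.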

## Gamma-Poisson distribution

Consider a Poisson-like count $X\in\mathbb{Z}^{\ge 0}$ with over-dispersion, namely 
$\mathbb{V}[X]>\mathbb{E}[X]$. This suggests the need for an additional parameter, beyond the usual Poisson rate $\lambda$, 
to control the extra variance. 
One way in which over-dispersion can arise is when $\lambda$ is only held constant for a single trial 
or a single sequence of trials, but exhibits inter-trial or inter-sequence variation.
This variation may be captured by a prior distribution, giving rise to a continuous mixture distribution.

### Poisson data likelihood

For some fixed rate $\lambda$, let the count $X$ be distributed as
\begin{eqnarray}
X\mid\lambda & ~\sim~ & \mathtt{Poisson}(\lambda)\,,
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mathbb{E}[X\mid\lambda]~=~\lambda\,, & \;\;\mbox{and}\;\; &
\mathbb{V}[X\mid\lambda]~=~\lambda\,,
\end{eqnarray}
respectively. Then we may consider a 
single sequence of $n$ conditionally independent counts with $\lambda$ held constant just for that sequence.
If the average count of the sequence is $\langle X\rangle=\bar{X}$, then the joint likelihood is given by
\begin{eqnarray}
p(X_1,\ldots,X_n\mid\lambda) & ~=~ & \frac{e^{-n\lambda}\,\lambda^{n\bar{X}}}{\prod_{i=1}^n X_i!}\,.
\end{eqnarray}
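
A minimal sketch of this joint likelihood on the log scale follows. The function name `poisson_joint_log_likelihood` and the sample counts are assumptions; $\ln X_i!$ is computed as `math.lgamma(x + 1)`.

```python
import math

def poisson_joint_log_likelihood(counts, lam):
    """ln p(X_1..X_n | lam) = -n*lam + (n*mean)*ln(lam) - sum(ln X_i!)."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    n = len(counts)
    total = sum(counts)  # equals n times the average count
    log_factorials = sum(math.lgamma(x + 1) for x in counts)
    return -n * lam + total * math.log(lam) - log_factorials

# Hypothetical counts from a single sequence of three trials.
counts = [2, 0, 3]
print(poisson_joint_log_likelihood(counts, 1.5))
```

Note that the data enter the $\lambda$-dependent part only through $n$ and $n\bar{X}$.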

### Gamma prior distribution

Consider, in general, some arbitrary prior distribution, say $\lambda\sim D(\boldsymbol{\theta})$, governed by one or more 
(constant) hyper-parameters denoted by $\boldsymbol{\theta}$.
Then it follows that  
\begin{eqnarray}
\mathbb{E}[X] & ~\doteq~ & \mathbb{E}\left[\mathbb{E}[X\mid\lambda]\mid\boldsymbol{\theta}\right]
 ~=~ \mathbb{E}[\lambda\mid\boldsymbol{\theta}]\,,
\end{eqnarray}
and
\begin{eqnarray}
\mathbb{V}[X] & ~\doteq~ & \mathbb{E}\left[\mathbb{V}[X\mid\lambda]\mid\boldsymbol{\theta}\right]
+\mathbb{V}\left[\mathbb{E}[X\mid\lambda]\mid\boldsymbol{\theta}\right]
 ~=~ \mathbb{E}[\lambda\mid\boldsymbol{\theta}]+\mathbb{V}[\lambda\mid\boldsymbol{\theta}]\,.
\end{eqnarray}
Consequently, provided that $\mathbb{V}[\lambda\mid\boldsymbol{\theta}]>0$, we observe that $\mathbb{V}[X]>\mathbb{E}[X]$, such that the overall process is over-dispersed compared
to a simple Poisson process.
Clearly we require a *proper* prior distribution with finite mean and variance.

Following the rationale discussed [previously](#Beta-prior-distribution "Section: Beta prior distribution"),
we first seek a non-informative prior for the Poisson rate $\lambda\in (0,\infty)$ via a variance-stabilising transformation. However, this results in the improper prior
\begin{eqnarray}
p(\lambda) & ~\propto~ & \frac{1}{\sqrt{\mathbb{V}[X\mid\lambda]}}~=~\frac{1}{\sqrt{\lambda}}\,.
\end{eqnarray}
If we truncate the domain to $\lambda\in(0,L]$, then the truncated prior becomes
\begin{eqnarray}
p(\lambda\mid L) & ~=~ & \frac{1}{2\sqrt{L\lambda}}\,,
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mathbb{E}[\lambda\mid L]~=~\frac{1}{3}L\,, & \;\;\mbox{and}\;\; &
\mathbb{V}[\lambda\mid L]~=~\frac{4}{45}L^2\,,
\end{eqnarray}
respectively. The overall mean and variance are then given by
\begin{eqnarray}
\mathbb{E}[X]~=~\frac{1}{3}L\,, & \;\;\mbox{and}\;\; &
\mathbb{V}[X]~=~\frac{1}{3}L+\frac{4}{45}L^2\,,
\end{eqnarray}
respectively. Clearly, the hyper-parameter $L$ may be estimated (using the method of moments) from the sample mean as $\hat{L}=3\bar{X}$.
However, if the sample variance does not match $\frac{1}{3}\hat{L}+\frac{4}{45}\hat{L}^2$, then 
having only a single prior hyper-parameter does not provide enough modelling flexibility.
Consequently, we require a prior distribution having at least two hyper-parameters.
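
The mismatch can be made concrete with a small sketch. The function name and the sample counts below are assumptions; the code estimates $\hat{L}=3\bar{X}$ and compares the variance implied by the truncated prior with the usual unbiased sample variance.

```python
def truncated_prior_moment_fit(counts):
    """Return (L_hat, implied variance, sample variance) for the counts."""
    n = len(counts)
    mean = sum(counts) / n
    l_hat = 3.0 * mean  # method of moments: E[X] = L / 3
    implied_var = l_hat / 3.0 + 4.0 * l_hat ** 2 / 45.0
    sample_var = sum((x - mean) ** 2 for x in counts) / (n - 1)
    return l_hat, implied_var, sample_var

# Hypothetical counts with sample mean 3.
l_hat, implied_var, sample_var = truncated_prior_moment_fit([1, 2, 3, 4, 5])
print(l_hat, implied_var, sample_var)
```

Here the single hyper-parameter forces an implied variance of $3+\frac{4}{45}\cdot 81=10.2$, far above the sample variance of $2.5$, and there is no remaining freedom with which to correct it.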

To deduce a suitable form for such a prior, [recall](#Poisson-data-likelihood "Section: Poisson data likelihood")
that the joint likelihood of a sequence of $n$ trials, considered as a function of $\lambda$ with fixed
observed counts, is proportional to $\lambda^{n\bar{X}}\,e^{-n\lambda}$. This has the same proportional form as
the $\texttt{Gamma}(n\bar{X}+1,n)$ distribution.
Consequently, we consider a prior distribution in the general form
\begin{eqnarray}
\lambda & ~\sim~ & \mathtt{Gamma}(\alpha,\beta)\,,
\end{eqnarray}
where $\alpha$ is the shape parameter and $\beta$ is the rate parameter.
It now follows that
\begin{eqnarray}
\mathbb{E}[X] & ~=~ & \mathbb{E}[\lambda\mid\alpha,\beta]~=~\frac{\alpha}{\beta}\,,
\end{eqnarray}
and
\begin{eqnarray}
\mathbb{V}[X] & ~=~ & \mathbb{E}[\lambda\mid\alpha,\beta]+\mathbb{V}[\lambda\mid\alpha,\beta]
\\ & = & \frac{\alpha}{\beta}+\frac{\alpha}{\beta^2}~=~\frac{\alpha\,(\beta+1)}{\beta^2}\,.
\end{eqnarray}
Not only is this over-dispersed for $\alpha>0$, but the two hyper-parameters $\alpha$ and $\beta$
provide sufficient freedom to flexibly model both the sample mean and variance.
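
One way to exploit this freedom is to match moments, although this particular estimator is an assumption of the sketch rather than a method prescribed above. Setting $\frac{\alpha}{\beta}=m$ and $\frac{\alpha\,(\beta+1)}{\beta^2}=v$ for a sample mean $m$ and sample variance $v>m$ gives $\beta=\frac{m}{v-m}$ and $\alpha=m\,\beta$.

```python
def gamma_prior_from_moments(mean, variance):
    """Solve alpha/beta = mean and alpha*(beta+1)/beta**2 = variance."""
    if mean <= 0.0:
        raise ValueError("mean must be positive")
    if variance <= mean:
        raise ValueError("requires over-dispersion: variance > mean")
    beta = mean / (variance - mean)
    alpha = mean * beta
    return alpha, beta

# Hypothetical sample moments.
alpha, beta = gamma_prior_from_moments(3.0, 4.5)
print(alpha, beta)
print(alpha / beta, alpha * (beta + 1.0) / beta ** 2)  # recovers 3.0 and 4.5
```

The solution only exists when the sample is over-dispersed; as $v\to m$ the rate $\beta$ diverges and the prior collapses onto a single value of $\lambda$.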

### Negative binomial distribution

For a single trial, the explicit form of the mixture distribution is derived as
\begin{eqnarray}
p(X=k\mid\alpha,\beta) & ~=~ & \int_0^\infty p(X=k\mid\lambda)\,p(\lambda\mid\alpha,\beta)\,d\lambda
\\& = &
\int_0^\infty e^{-\lambda}\frac{\lambda^k}{k!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\,e^{-\beta\lambda}\,d\lambda
\\& = &
\frac{\beta^\alpha}{k!\,\Gamma(\alpha)}
\int_0^\infty \lambda^{\alpha+k-1}\,e^{-(\beta+1)\lambda}\,d\lambda
\\& = &
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,\frac{\beta^\alpha}{(\beta+1)^{\alpha+k}}
\int_0^\infty \frac{(\beta+1)^{\alpha+k}}{\Gamma(\alpha+k)}\,\lambda^{\alpha+k-1}\,e^{-(\beta+1)\lambda}\,d\lambda
\\& = &
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,\frac{\beta^\alpha}{(\beta+1)^{\alpha+k}}\,,
\end{eqnarray}
since the final integrand is just the distribution $\mathtt{Gamma}(\alpha+k,\beta+1)$.

In order to show that this is just the real-valued (*Polya* distribution) version of the 
[*negative binomial*](https://en.wikipedia.org/wiki/Negative_binomial_distribution "Wikipedia: Negative binomial distribution") 
distribution,
we define $p$ to be the probability of a *stopping* event (i.e. one out of the required number $\alpha$ of such events) and $q$ to be the probability of a *non-stopping* event,
namely
\begin{eqnarray}
p ~\doteq~ \frac{\beta}{\beta+1}\,, & ~~~\mbox{and}~~~ & q ~\doteq~ 1-p~=~\frac{1}{\beta+1}\,,
\end{eqnarray}
whereupon
\begin{eqnarray}
p(X=k\mid\alpha,p) & ~=~ & 
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,q^k\,p^\alpha\,.
\end{eqnarray}
This represents the overall probability of observing a sequence of $k+\alpha$ Bernoulli trials,
where the sequence was terminated as soon as $\alpha$ stopping events occurred.
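
The following sketch evaluates this probability mass function for real-valued $\alpha$ and checks numerically that it sums to one, with mean $\frac{\alpha}{\beta}$ as derived earlier. The function name, the hyper-parameter values and the truncation point of the sum are assumptions.

```python
import math

def neg_binom_log_pmf(k, alpha, beta):
    """ln of Gamma(alpha+k) / (k! Gamma(alpha)) * q^k * p^alpha."""
    p = beta / (beta + 1.0)  # probability of a stopping event
    q = 1.0 / (beta + 1.0)   # probability of a non-stopping event
    return (math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
            + k * math.log(q) + alpha * math.log(p))

# Hypothetical hyper-parameters; the tail beyond k = 400 is negligible here.
alpha, beta = 2.5, 1.5
pmf = [math.exp(neg_binom_log_pmf(k, alpha, beta)) for k in range(400)]
print(sum(pmf))                               # close to 1
print(sum(k * pk for k, pk in enumerate(pmf)))  # close to alpha / beta
```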

In the traditional integer-valued (*Pascal* distribution) form, $\alpha>0$ is replaced by $r\in\mathbb{Z}^{>0}$, giving
\begin{eqnarray}
p(X=k\mid r,p) & ~=~ & 
\frac{(r+k-1)!}{k!\,(r-1)!}\,q^k\,p^r~=~\binom{k+r-1}{k}\,q^k\,p^r\,.
\end{eqnarray}
If we terminate the sequence of trials after just a single stopping event, i.e. $r=1$, then this reduces to the
*geometric* distribution, i.e. $\mathtt{Geom}(p)\equiv \mathtt{NegBinom}(1,p)$.
Furthermore, $X\sim\mathtt{NegBinom}(r,p)$ corresponds to the sum of $r$ independent geometric counts, i.e.
$X=X_1+X_2+\ldots+X_r$ with $X_i\sim\mathtt{Geom}(p)$.

Finally, for a sequence of $n$ conditionally independent counts, the joint mixture distribution is
\begin{eqnarray}
p_c(X_1,\ldots,X_n\mid\alpha,\beta) & ~=~ & \int_0^\infty p(X_1,\ldots,X_n\mid\lambda)\,p(\lambda\mid\alpha,\beta)\,d\lambda
\\& = &
\int_0^\infty \frac{e^{-n\lambda}\,\lambda^{n\bar{X}}}{\prod_{i=1}^n X_i!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\,e^{-\beta\lambda}\,d\lambda
\\& = &
\frac{\Gamma(\alpha+n\bar{X})}{\Gamma(\alpha)\,\prod_{i=1}^n X_i!}\,\frac{\beta^\alpha}{(\beta+n)^{\alpha+n\bar{X}}}\,.
\end{eqnarray}
Alternatively, for *unconditionally* independent counts sampled with *different* values of $\lambda$, the joint mixture distribution is just
\begin{eqnarray}
p_u(X_1,\ldots,X_n\mid\alpha,\beta) & ~=~ &
\prod_{i=1}^n\frac{\Gamma(\alpha+X_i)}{X_i!}\cdot\frac{q^{n\bar{X}}\,p^{n\alpha}}{\Gamma(\alpha)^n}\,.
\end{eqnarray}
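
The two joint distributions can be compared on the log scale. In this sketch (function names and example values are assumptions), `log_joint_conditional` evaluates $\ln p_c$ and `log_joint_unconditional` evaluates $\ln p_u$ as a sum of single-trial negative binomial terms.

```python
import math

def neg_binom_log_pmf(k, alpha, beta):
    """Single-trial mixture: ln of the negative binomial probability."""
    return (math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
            + alpha * math.log(beta) - (alpha + k) * math.log(beta + 1.0))

def log_joint_conditional(counts, alpha, beta):
    """ln p_c: all counts share one rate lambda drawn from the prior."""
    n = len(counts)
    total = sum(counts)  # equals n times the average count
    return (math.lgamma(alpha + total) - math.lgamma(alpha)
            - sum(math.lgamma(x + 1) for x in counts)
            + alpha * math.log(beta) - (alpha + total) * math.log(beta + n))

def log_joint_unconditional(counts, alpha, beta):
    """ln p_u: each count has its own rate lambda drawn from the prior."""
    return sum(neg_binom_log_pmf(x, alpha, beta) for x in counts)

# Hypothetical counts and hyper-parameters.
counts, alpha, beta = [2, 0, 3], 2.5, 1.5
print(log_joint_conditional(counts, alpha, beta))
print(log_joint_unconditional(counts, alpha, beta))
```

For a single count the two functions coincide; for longer sequences they generally differ.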

### Gamma posterior distribution

For a sequence of $n$ conditionally independent counts, the posterior distribution for the Poisson rate $\lambda$ is
given by
\begin{eqnarray}
p(\lambda\mid X_1,\ldots,X_n,\alpha,\beta) & ~=~ &
\frac{p(X_1,\ldots,X_n\mid\lambda)\,p(\lambda\mid\alpha,\beta)}{p_c(X_1,\ldots,X_n\mid\alpha,\beta)}
\\& = &
\left.
\frac{e^{-n\lambda}\,\lambda^{n\bar{X}}}{\prod_{i=1}^n X_i!}\,
\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\,e^{-\beta\lambda}
\right/
\frac{\Gamma(\alpha+n\bar{X})}{\Gamma(\alpha)\,\prod_{i=1}^n X_i!}\,\frac{\beta^\alpha}{(\beta+n)^{\alpha+n\bar{X}}}
\\& = &
\frac{(\beta+n)^{\alpha+n\bar{X}}}{\Gamma(\alpha+n\bar{X})}\,\lambda^{\alpha+n\bar{X}-1}\,e^{-(\beta+n)\lambda}\,,
\end{eqnarray}
such that
\begin{eqnarray}
\lambda\mid X_1,\ldots,X_n & ~\sim~ & \texttt{Gamma}(\alpha+n\bar{X},\beta+n)\,.
\end{eqnarray}
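
As with the beta posterior, the update reduces to simple arithmetic on the hyper-parameters: the shape gains the total observed count and the rate gains the number of trials. The sketch below uses assumed function names and hypothetical data.

```python
def gamma_posterior(alpha, beta, counts):
    """Hyper-parameters of the posterior Gamma(alpha + n*mean, beta + n)."""
    return alpha + sum(counts), beta + len(counts)

def gamma_mean(shape, rate):
    """Mean of Gamma(shape, rate)."""
    return shape / rate

# Hypothetical prior Gamma(2, 1) and three observed counts.
shape, rate = gamma_posterior(2.0, 1.0, [2, 0, 3])
print(shape, rate, gamma_mean(shape, rate))
```

The posterior mean $\frac{\alpha+n\bar{X}}{\beta+n}$ is again a smoothed estimate, with $\alpha$ acting as a prior pseudo-count observed over $\beta$ prior pseudo-trials.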

## Citations

[1] George E. P. Box and George C. Tiao (1973), "*Bayesian Inference in Statistical Analysis*", John Wiley & Sons.

[2] Jouni Kerman (2011), "*Neutral noninformative and informative conjugate beta and gamma prior distributions*", Electron. J. Statist. 5: 1450-1470.