# Appendix C: Regression Models

The purpose of this appendix is to provide an introduction from first principles to the mechanics of generalised linear models (GLMs) for regression,
first introduced by Nelder and Wedderburn [[1]](#Citations "Citation [1]: Generalized Linear Models"). 
However, note that traditional GLM theory was derived for distributions of a single parameter (possibly with a constant 
hyper-parameter defining variance or dispersion).
For distributions of multiple parameters, more careful handling is required, which we derive here.
We also consider generalised nonlinear regression and its relationship to least-squares fitting.

## Introduction

### Probability distribution functions

We consider either a multi-dimensional or a scalar (i.e. uni-dimensional) variate $X$ on domain $\mathcal{X}$. Let $X$ have an underlying probability distribution function (PDF) $p(X\mid\boldsymbol{\theta})$ governed by a scalar or vector parameter $\boldsymbol{\theta}$. Then $p$ must satisfy the constraints of non-negativity:
\begin{eqnarray}
p(\mathbf{x}\mid\boldsymbol{\theta})~\ge~0 && \forall\mathbf{x}\in\mathcal{X}\,,
\end{eqnarray}
and a total probability of unity:
\begin{eqnarray}
\int_\mathcal{X}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}| & = & 1\,,
\end{eqnarray}
where $|d\mathbf{x}|$ is taken to be an infinitesimal volume or length in $\mathcal{X}$. Note that for discrete variates the constraint
is instead
\begin{eqnarray}
\sum_{\mathbf{x}\in\mathcal{X}}p(\mathbf{x}\mid\boldsymbol{\theta}) & = & 1\,.
\end{eqnarray}
We shall henceforth assume continuous variates for convenience, but the resulting derivations will also hold 
in the discrete case by replacing integration with summation.

When considered as general functions, PDFs have a variety of additional properties and constraints. For instance, for a continuous distribution, the integral represents the area under the curve. Together with the non-negativity constraint, this places limits on the values of $p(\mathbf{x}\mid\boldsymbol{\theta})$. For example, if the domain $\mathcal{X}$ has no finite upper bound, then, for the well-behaved densities considered here, $p(\mathbf{x}\mid\boldsymbol{\theta}) \rightarrow 0$ as $\mathbf{x}\rightarrow\infty$. Similarly,
$p(\mathbf{x}\mid\boldsymbol{\theta}) \rightarrow 0$ as $\mathbf{x}\rightarrow -\infty$ if $\mathcal{X}$
does not have a finite lower bound. Similar conditions hold for spatial derivatives with respect to $\mathbf{x}$ at the extremes.

Consider now derivatives with respect to the parameter $\boldsymbol{\theta}$, denoted by the gradient vector operator
$\boldsymbol{\nabla}_\boldsymbol{\theta}\doteq\frac{\partial}{\partial\boldsymbol{\theta}}$, which we take to be a column vector. Similarly, second derivatives are denoted by the *Hessian* matrix operator
$\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}\doteq\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{T}}$.
[Later](#Parameter-transformations "Section: Parameter transformations") 
we shall also require the first and second derivatives with respect to some transformation
$\boldsymbol{\eta}(\boldsymbol{\theta})$ of the parameters, denoted
$\boldsymbol{\nabla}_\boldsymbol{\eta}$ and 
$\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}$, respectively.
In general, unless we specifically need to distinguish between these two cases, we may drop the subscript and assume the results hold for all parameterisations.

Now, considering the total probability constraint above, since the derivatives are with respect to
$\boldsymbol{\theta}$ or $\boldsymbol{\eta}(\boldsymbol{\theta})$ and not $\mathbf{x}$, it follows that
taking first derivatives of both sides gives
\begin{eqnarray}
\boldsymbol{\nabla}\int_\mathcal{X}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}| & = &
\int_\mathcal{X}\boldsymbol{\nabla} p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~\mathbf{0}\,,
\end{eqnarray}
and taking second derivatives gives
\begin{eqnarray}
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}\int_\mathcal{X}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}| & = &
\int_\mathcal{X}\boldsymbol{\nabla}\boldsymbol{\nabla}^{T} p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~\mathbf{O}\,,
\end{eqnarray}
where we let $\mathbf{O}$ denote an appropriately dimensioned zero matrix, as distinct from the zero (column) vector
denoted by $\mathbf{0}$.
We shall use these results in the
[next](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") section.

Note that there are various fields of stochastic modelling, such as Maximum Entropy, that use the general properties of PDFs to construct specific PDFs that fit given theoretical or practical requirements. For the rest of this document, however, we shall assume that the form of the PDF has been specified in advance. The properties we require then relate instead to fitting the PDF to observed data, e.g. via regression modelling.

### Expectations and log-likelihoods

We assume the 
[*law of the unconscious statistician*](https://en.wikipedia.org/wiki/Law_of_the_unconscious_statistician "Wikipedia: LOTUS"), and take the expectation of an arbitrary function $\mathbf{f}(X, \boldsymbol{\theta})$ to be given by
\begin{eqnarray}
\mathbb{E}_X\left[\mathbf{f}(X, \boldsymbol{\theta})\mid\boldsymbol{\theta}\right] & \doteq & 
\int_\mathcal{X}\mathbf{f}(\mathbf{x}, \boldsymbol{\theta})\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\,.
\end{eqnarray}
Note that in general we may drop the subscript when it is clear with respect to which variate we are taking the expectation.
Also note that since the expectation is the weighted mean of the function $\mathbf{f}$, we often denote this for convenience  as
\begin{eqnarray}
\boldsymbol{\mu}_\mathbf{f}(\boldsymbol{\theta}) & \doteq & \mathbb{E}\left[\mathbf{f}\mid\boldsymbol{\theta}\right]\,.
\end{eqnarray}
Finally, note that function $\mathbf{f}$ may in general be scalar, vector, matrix or even tensor valued.
However, we shall typically assume, without loss of generality, some scalar function $f$ (unless otherwise stated), since the expectation of a vector is a vector of scalar expectations, and likewise for matrices and tensors.

Thus, taking the gradient of the expectation, we see that
\begin{eqnarray}
\boldsymbol{\nabla}\,\mathbb{E}\left[f\mid\boldsymbol{\theta}\right] & = & 
\int_\mathcal{X}\boldsymbol{\nabla}f(\mathbf{x}, \boldsymbol{\theta})\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
+
\int_\mathcal{X}f(\mathbf{x}, \boldsymbol{\theta})\,\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\& = &
\int_\mathcal{X}\boldsymbol{\nabla}f(\mathbf{x}, \boldsymbol{\theta})\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
+
\int_\mathcal{X}f(\mathbf{x}, \boldsymbol{\theta})\,\boldsymbol{\nabla}\ln p(\mathbf{x}\mid\boldsymbol{\theta})\,
p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\& = &
\mathbb{E}\left[\boldsymbol{\nabla}f\mid\boldsymbol{\theta}\right]
+
\mathbb{E}\left[f\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
where, for convenience, we have defined the log-likelihood $L$ as
\begin{eqnarray}
L(\boldsymbol{\theta}; X) & \doteq & \ln p(X\mid\boldsymbol{\theta})\,.
\end{eqnarray}
Note that if we instead used a vector function $\mathbf{f}$, then we would have a choice of either the scalar gradient
$\boldsymbol{\nabla}^T\mathbf{f}$ or the matrix gradient $\boldsymbol{\nabla}\mathbf{f}^T$.

Suppose now, as an example, that we take the constant function 
$f\equiv 1~\Rightarrow\boldsymbol{\nabla}f\equiv\mathbf{0}$.
Then we immediately deduce that
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right] & = & \mathbf{0}\,.
\end{eqnarray}
This is one of the useful
 results from Kendall and Stuart [[2]](#Citations "Citation [2]: The Advanced Theory of Statistics").
It is of interest to derive this result directly.
We begin by taking the gradient of $L$, namely
\begin{eqnarray}
\boldsymbol{\nabla}L & = & 
\boldsymbol{\nabla}\ln p(\mathbf{x}\mid\boldsymbol{\theta})
~=~\frac{\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,.
\end{eqnarray}
Taking the expectation of the gradient, we therefore obtain
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right] & = &
\int_\mathcal{X}
\frac{\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~
\int_\mathcal{X}\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~\mathbf{0}\,,
\end{eqnarray}
as before. The last part follows from the
[previous](#Probability-distribution-functions "Section: Probability distribution functions") section.
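The zero-mean property of the gradient can also be checked numerically. The following is a minimal sketch, assuming a univariate Gaussian with parameters $(\mu,\sigma)$ chosen purely for illustration; the function names and the midpoint-rule integration are our own choices and not part of the derivation.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def score(x, mu, sigma):
    """Analytic gradient of L = ln p with respect to (mu, sigma)."""
    z = (x - mu) / sigma
    return (z / sigma, (z * z - 1.0) / sigma)

def expected_score(mu, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E[grad L] over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = [0.0, 0.0]
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        weight = gaussian_pdf(x, mu, sigma) * dx
        g = score(x, mu, sigma)
        total[0] += g[0] * weight
        total[1] += g[1] * weight
    return tuple(total)

# Illustrative parameter values (an assumption); both components should be close to zero.
print(expected_score(mu=1.5, sigma=0.7))
```

Both components vanish up to integration and rounding error, in agreement with $\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]=\mathbf{0}$.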

Similarly, taking the Hessian of $L$ gives
\begin{eqnarray}
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L & = & 
\boldsymbol{\nabla}\left\{
\frac{\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\right\}
~=~\frac{
p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
-\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
}
{p(\mathbf{x}\mid\boldsymbol{\theta})^2}
\,.
\end{eqnarray}
Hence, taking the expectation results in
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right] & = &
\int_\mathcal{X}
\frac{
p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
-\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
}
{p(\mathbf{x}\mid\boldsymbol{\theta})^2}
\,
p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\
& = & 
\int_\mathcal{X}\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
- \int_\mathcal{X}
\frac{\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,\frac{\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\
& = &
\mathbf{O}-\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
~=~-\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
which again follows from the 
[previous](#Probability-distribution-functions "Section: Probability distribution functions") section. We shall require these results later.
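As a hedged illustration of this identity, the sketch below integrates both the Hessian and the outer product of the gradient for a univariate Gaussian with parameters $(\mu,\sigma)$. The choice of distribution, the helper name and the midpoint-rule integration are assumptions made for the example only.

```python
import math

def expected_hessian_and_outer(mu, sigma, half_width=12.0, steps=20000):
    """Approximate E[Hessian of L] and E[grad L grad^T L] for a Gaussian in (mu, sigma)."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    hess = [[0.0, 0.0], [0.0, 0.0]]
    outer = [[0.0, 0.0], [0.0, 0.0]]
    s2 = sigma * sigma
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        z = (x - mu) / sigma
        # Probability mass of the current integration cell.
        w = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi)) * dx
        # Analytic gradient and Hessian of L = ln p with respect to (mu, sigma).
        g = (z / sigma, (z * z - 1.0) / sigma)
        h = ((-1.0 / s2, -2.0 * z / s2),
             (-2.0 * z / s2, (1.0 - 3.0 * z * z) / s2))
        for a in range(2):
            for b in range(2):
                hess[a][b] += h[a][b] * w
                outer[a][b] += g[a] * g[b] * w
    return hess, outer

hess, outer = expected_hessian_and_outer(mu=1.5, sigma=0.7)
print(hess)   # expected Hessian
print(outer)  # expected outer product; should equal minus the expected Hessian
```

For this example the expected Hessian is diagonal with entries $-1/\sigma^2$ and $-2/\sigma^2$, and the expected outer product has the same entries with opposite sign.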

Finally, we note that the expected value of the log-likelihood itself is given by
\begin{eqnarray}
\mathbb{E}\left[L\mid\boldsymbol{\theta}\right] & = & 
\int_\mathcal{X} p(\mathbf{x}\mid\boldsymbol{\theta})\,\ln p(\mathbf{x}\mid\boldsymbol{\theta})
\,|d\mathbf{x}|~\doteq~ -H(X\mid\boldsymbol{\theta})
\,,
\end{eqnarray}
where $H(X\mid\boldsymbol{\theta})$ is just the information-theoretic entropy of the distribution measured in *nats*. Consequently, we obtain
\begin{eqnarray}
\boldsymbol{\nabla}H(X\mid\boldsymbol{\theta}) & = &
-\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
-
\mathbb{E}\left[L\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
~=~ -\mathbb{E}\left[L\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
which follows from the derivation above.
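
As a worked check (an illustrative example, not part of the derivation above), consider the exponential distribution $p(x\mid\lambda)=\lambda e^{-\lambda x}$ on $x\ge 0$, for which $\mathbb{E}[X\mid\lambda]=1/\lambda$ and $\mathbb{E}[X^2\mid\lambda]=2/\lambda^2$. Here the log-likelihood and its gradient are
\begin{eqnarray}
L~=~\ln\lambda-\lambda X\,, & \;\;\; &
\frac{\partial L}{\partial\lambda}~=~\frac{1}{\lambda}-X\,,
\end{eqnarray}
so that the entropy and its gradient are
\begin{eqnarray}
H(X\mid\lambda)~=~-\mathbb{E}\left[L\mid\lambda\right]~=~1-\ln\lambda
& ~\Rightarrow~ &
\frac{\partial H}{\partial\lambda}~=~-\frac{1}{\lambda}\,.
\end{eqnarray}
On the other hand, direct evaluation gives
\begin{eqnarray}
\mathbb{E}\left[L\,\frac{\partial L}{\partial\lambda}\mid\lambda\right] & = &
\ln\lambda\;\mathbb{E}\left[\frac{1}{\lambda}-X\mid\lambda\right]
-\lambda\,\mathbb{E}\left[\frac{X}{\lambda}-X^2\mid\lambda\right]
~=~0-\lambda\left(\frac{1}{\lambda^2}-\frac{2}{\lambda^2}\right)
~=~\frac{1}{\lambda}\,,
\end{eqnarray}
which agrees with $\boldsymbol{\nabla}H=-\mathbb{E}\left[L\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]$.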

### Parameter transformations

Suppose now that we wish to take derivatives, not with respect to the 
[PDF](#Probability-distribution-functions "Section: Probability distribution functions") 
parameter $\boldsymbol{\theta}$, but with respect to some other reparameterisation, say $\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$.
For this purpose, we consider the chain rules, namely that
\begin{eqnarray}
\frac{\partial}{\partial\boldsymbol{\theta}}~=~
\frac{\partial\boldsymbol{\eta}^T}{\partial\boldsymbol{\theta}}\frac{\partial}{\partial\boldsymbol{\eta}}\,,
& \;\;\;\mbox{and}~ &
\frac{\partial}{\partial\boldsymbol{\eta}}~=~
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}\frac{\partial}{\partial\boldsymbol{\theta}}\,.
\end{eqnarray}
For convenience, we define 
$\mathbf{J}_\boldsymbol{\eta}\doteq\frac{\partial\boldsymbol{\eta}^T}{\partial\boldsymbol{\theta}}$
to be the *Jacobian* matrix of the transformation $\boldsymbol{\eta}(\boldsymbol{\theta})$. It then follows from the first chain rule that the gradients are related via
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\theta}~=~\mathbf{J}_\boldsymbol{\eta}\,\boldsymbol{\nabla}_\boldsymbol{\eta}
& ~\Rightarrow~ &
\boldsymbol{\nabla}_\boldsymbol{\eta}~=~\mathbf{J}_\boldsymbol{\eta}^{-1}\,\boldsymbol{\nabla}_\boldsymbol{\theta}\,,
\end{eqnarray}
and thus we deduce from the second chain rule that $\mathbf{J}_\boldsymbol{\eta}^{-1}\doteq\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}$. Note that $\mathbf{J}_\boldsymbol{\eta}$ and $\mathbf{J}_\boldsymbol{\eta}^{-1}$ are only truly matrix inverses of each other if both
$\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ have the same dimensions, otherwise we shall treat them symbolically.

The relationship between the Hessians is more involved. Firstly, we take the transpose of the second chain rule  to obtain
\begin{eqnarray}
\frac{\partial}{\partial\boldsymbol{\eta}^T} & = &
\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}\,,
\end{eqnarray}
where the symbol "$\cdot$" explicitly indicates the position of the argument.
Next, we apply the second chain rule directly to this result, thereby obtaining
\begin{eqnarray}
\frac{\partial^2}{\partial\boldsymbol{\eta}\,\partial\boldsymbol{\eta}^T} & = &
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}\frac{\partial}{\partial\boldsymbol{\theta}}
\left\{
\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
\right\}
\\& = &
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}
\left\{
\frac{\partial^2\,\cdot}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
+\left(\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial}{\partial\boldsymbol{\theta}}\right)
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
\right\}\,.
\end{eqnarray}
Finally, we replace the last derivative
$\frac{\partial}{\partial\boldsymbol{\theta}}$
via the first chain rule, to obtain
\begin{eqnarray}
\frac{\partial^2}{\partial\boldsymbol{\eta}\,\partial\boldsymbol{\eta}^T} & = &
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}
\left\{
\frac{\partial^2\,\cdot}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
+\left(\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\eta}^T}{\partial\boldsymbol{\theta}}\right)\odot
\frac{\partial^2\boldsymbol{\theta}}{\partial\boldsymbol{\eta}\,\partial\boldsymbol{\eta}^T}
\right\}
\,.
\end{eqnarray}
Note that the last term is the dot product "$\odot$" of a (row) vector with a *tensor*, i.e. a column "vector" in which each element is 
itself a matrix. 
Consequently, in terms of the gradient operators and Jacobian matrices, we obtain
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T} & = &
\mathbf{J}_\boldsymbol{\eta}^{-1}\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}(\cdot)\,
\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}
+
\mathbf{J}_\boldsymbol{\eta}^{-1}\left(\boldsymbol{\nabla}_\boldsymbol{\theta}^T(\cdot)\,
\mathbf{J}_\boldsymbol{\eta}\right)\odot
\boldsymbol{\nabla}_\boldsymbol{\eta}\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}\,.
\end{eqnarray}
Also note that various specialisations of this relationship occur depending upon both the parameters and the reparameterisation.
For example, in the case of scalar $\theta$ and scalar $\eta$, the identity simplifies to
\begin{eqnarray}
\frac{\partial^2}{\partial\eta^2} & = & \left(\frac{\partial\theta}{\partial\eta}\right)^2
\frac{\partial^2}{\partial\theta^2} +
\frac{\partial^2\theta}{\partial\eta^2}\,\frac{\partial}{\partial\theta}\,.
\end{eqnarray}
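
The scalar identity can be verified numerically. The sketch below is only an illustration under assumed choices: the test function $f(\theta)=\sin\theta$ and the transformation $\theta=e^{\eta}$ (so that $\partial\theta/\partial\eta=\partial^2\theta/\partial\eta^2=\theta$) are hypothetical, and the left-hand side is approximated by a central second difference.

```python
import math

def chain_rule_sides(eta, h=1e-4):
    """Return both sides of the scalar Hessian identity for f(theta) = sin(theta), theta = exp(eta)."""
    def composite(e):
        # f expressed directly as a function of eta.
        return math.sin(math.exp(e))

    # Left-hand side: second derivative with respect to eta, by central differences.
    lhs = (composite(eta + h) - 2.0 * composite(eta) + composite(eta - h)) / (h * h)

    # Right-hand side: (d theta/d eta)^2 f''(theta) + (d^2 theta/d eta^2) f'(theta).
    theta = math.exp(eta)
    dtheta = theta    # d theta / d eta for theta = exp(eta)
    d2theta = theta   # d^2 theta / d eta^2 for theta = exp(eta)
    rhs = dtheta**2 * (-math.sin(theta)) + d2theta * math.cos(theta)
    return lhs, rhs

lhs, rhs = chain_rule_sides(0.3)
print(lhs, rhs)  # the two values should agree up to finite-difference error
```

Dropping the second term on the right-hand side would break the agreement, which shows why the Hessian does not transform like a simple quadratic form in general.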

Now, the general identity is somewhat cumbersome to apply in practice.
However, it leads to a simpler approximation when we specifically consider the log-likelihood $L$, for which the identity reads
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L & = &
\mathbf{J}_\boldsymbol{\eta}^{-1}\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}
+
\mathbf{J}_\boldsymbol{\eta}^{-1}\left(\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L\,
\mathbf{J}_\boldsymbol{\eta}\right)\odot
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\boldsymbol{\theta}\,.
\end{eqnarray}
In particular, taking the expectation of both sides, we obtain
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L
\mid\boldsymbol{\theta}\right]
& = &
\mathbf{J}_\boldsymbol{\eta}^{-1}\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\mid\boldsymbol{\theta}\right]\,
\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}
+
\mathbf{J}_\boldsymbol{\eta}^{-1}\left(\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\mid\boldsymbol{\theta}\right]\,\mathbf{J}_\boldsymbol{\eta}\right)\odot
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\boldsymbol{\theta}
\\&=&
-\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\theta}L\,
\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\mid\boldsymbol{\theta}\right]\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^{T}}
~=~ -\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
which follows from a [previous](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") section,
where we derived that $\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]=\mathbf{0}$
and
$\mathbb{E}\left[\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]=
-\mathbb{E}\left[\boldsymbol{\nabla}L\,
\boldsymbol{\nabla}^{T}L
\mid\boldsymbol{\theta}\right]$.
This approximation to the Hessian of the log-likelihood will be used in the
[next](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") section.
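
As a worked illustration (our own example, assuming the exponential distribution $p(x\mid\lambda)=\lambda e^{-\lambda x}$ with $\theta=\lambda$ and the reparameterisation $\eta=\ln\lambda$), we have $\partial\theta/\partial\eta=\lambda$ and $\mathbb{E}\left[\partial^2 L/\partial\lambda^2\mid\lambda\right]=-1/\lambda^2$, so the expected Hessian transforms as
\begin{eqnarray}
\mathbb{E}\left[\frac{\partial^2 L}{\partial\eta^2}\mid\lambda\right] & = &
\lambda\left(-\frac{1}{\lambda^2}\right)\lambda~=~-1\,.
\end{eqnarray}
Working directly in terms of $\eta$, with $L=\eta-e^{\eta}X$, gives
\begin{eqnarray}
\frac{\partial L}{\partial\eta}~=~1-\lambda X\,, & \;\;\; &
\frac{\partial^2 L}{\partial\eta^2}~=~-\lambda X
~=~-1+\left(1-\lambda X\right)\,,
\end{eqnarray}
where the term in parentheses is the contribution of $\frac{\partial L}{\partial\theta}\frac{\partial^2\theta}{\partial\eta^2}$. This term has zero expectation, and
$\mathbb{E}\left[\left(1-\lambda X\right)^2\mid\lambda\right]=\lambda^2\,\mathbb{V}\left[X\mid\lambda\right]=1$, so that both forms of the expected Hessian agree, even though the Hessian itself differs from its expectation for any particular value of $X$.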

### Maximum likelihood estimation

Consider a stochastic sampling process that produces an (arbitrary) length-$n$ sequence of independent, identically distributed variables,
$X_1, X_2, \ldots, X_n$. Then the sample average is defined as
\begin{eqnarray}
\langle X\rangle ~\doteq~
\frac{1}{n}\sum_{i=1}^{n}X_i\,.
\end{eqnarray}
More generally, the sample average of an arbitrary function $\mathbf{f}(X,\ldots)$ is given by
\begin{eqnarray}
\langle\mathbf{f}\rangle(\ldots) & \doteq &
\frac{1}{n}\sum_{i=1}^{n}\mathbf{f}( X_i,\ldots)\,,
\end{eqnarray}
where the ellipsis "$\ldots$" represents arbitrary parameters that do not vary with $X$.
Due to the linearity of the operator $\langle\cdot\rangle$, we henceforth typically consider a scalar function $f$
(unless otherwise stated), since the sample mean of a vector is a vector of scalar sample means, and likewise for
matrices and tensors.
Similarly, it also follows from the linearity of the various other operators
that the parameter gradient and Hessian of the sample mean obey
\begin{eqnarray}
\boldsymbol{\nabla}\langle f\rangle ~=~
\left\langle\boldsymbol{\nabla}f\right\rangle &\;\;\mbox{and}\;&
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}\langle f\rangle ~=~
\left\langle\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}f\right\rangle\,,
\end{eqnarray}
respectively, and the expectation obeys
\begin{eqnarray}
\mathbb{E}\left[\langle f\rangle\mid\boldsymbol{\theta}\right] & = &
\left\langle\mathbb{E}\left[f\mid\boldsymbol{\theta}\right]\right\rangle
~=~\mathbb{E}\left[f\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
since the variates $X_i$ are here assumed to be independent and identically distributed (IID).
We shall relax this last restriction [later](#Regression-modelling "Section: Regression modelling").

We therefore see that the sample-mean log-likelihood $\langle L\rangle$ satisfies
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}\langle L\rangle\mid\boldsymbol{\theta}\right] & = &
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]\right\rangle
~=~\mathbf{0}
\,,
\end{eqnarray}
with the last result obtained from a [previous](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") section.
This result motivates the maximum likelihood approach, which is to determine the estimate $\hat{\boldsymbol{\theta}}_\texttt{ML}$ that satisfies
$\langle\boldsymbol{\nabla}L\rangle(\hat{\boldsymbol{\theta}}_\texttt{ML})=\mathbf{0}$,
if such a solution exists.
Under mild conditions of concavity,
$\langle L\rangle(\hat{\boldsymbol{\theta}}_\texttt{ML})$ is a local maximum.

The maximum-likelihood parameter value
$\hat{\boldsymbol{\theta}}_\texttt{ML}$ is usually found iteratively via an update scheme of the form
\begin{eqnarray}
\boldsymbol{\theta}' & = & \boldsymbol{\theta}+\Delta\boldsymbol{\theta}\,,
\end{eqnarray}
where all requisite quantities are evaluated at the current estimate $\boldsymbol{\theta}$.
Then the parameter increment $\Delta\boldsymbol{\theta}$ itself is usually computed either via
a direct gradient method, e.g. gradient ascent
\begin{eqnarray}
\Delta\boldsymbol{\theta} & \doteq & \rho\,\boldsymbol{\nabla}\langle L\rangle\,,
\end{eqnarray}
or via a modified gradient method, e.g. the Newton-Raphson method
\begin{eqnarray}
\Delta\boldsymbol{\theta} & \doteq & 
-\left[\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}\langle L\rangle\right]^{-1}\,
\boldsymbol{\nabla}\langle L\rangle
\,.
\end{eqnarray}
Note that, in either case, the iterations will halt when 
$\Delta\boldsymbol{\theta}=\mathbf{0}$, which occurs when
the gradient of the sample-mean log-likelihood vanishes.
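
The two update schemes can be compared on a small example. The following sketch assumes an exponential distribution with rate $\lambda$, for which $\langle L\rangle=\ln\lambda-\lambda\langle X\rangle$ and the maximum-likelihood estimate is $1/\langle X\rangle$; the data values, step size $\rho$ and iteration counts are arbitrary illustrative choices.

```python
def mean(values):
    """Sample average <X>."""
    return sum(values) / len(values)

def grad_mean_loglik(lam, xbar):
    """Gradient of <L> = ln(lam) - lam * <X> with respect to lam."""
    return 1.0 / lam - xbar

def hess_mean_loglik(lam):
    """Hessian (second derivative) of <L> with respect to lam."""
    return -1.0 / (lam * lam)

def gradient_ascent(xbar, lam, rho=0.1, iters=500):
    """Repeated update lam' = lam + rho * grad <L>."""
    for _ in range(iters):
        lam += rho * grad_mean_loglik(lam, xbar)
    return lam

def newton_raphson(xbar, lam, iters=20):
    """Repeated update lam' = lam - [Hessian <L>]^{-1} grad <L>."""
    for _ in range(iters):
        lam -= grad_mean_loglik(lam, xbar) / hess_mean_loglik(lam)
    return lam

data = [0.5, 1.2, 0.3, 2.0, 1.0]  # hypothetical observations
xbar = mean(data)
print(gradient_ascent(xbar, lam=0.5))
print(newton_raphson(xbar, lam=0.5))
print(1.0 / xbar)  # closed-form maximum-likelihood estimate for comparison
```

Both schemes stop changing once the gradient vanishes. In this example the Newton-Raphson iterates converge in far fewer steps, but only from a starting value in the interval $0<\lambda<2/\langle X\rangle$.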

In practice, we usually apply the Newton-Raphson scheme by solving the linear equation
\begin{eqnarray}
-\langle\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}L\rangle\,\Delta\boldsymbol{\theta} & = & 
\langle\boldsymbol{\nabla}L\rangle
\,.
\end{eqnarray}
However, the Hessian is often difficult to compute, and so an approximation is typically used.
Hence, following the reasoning from the
[previous](#Parameter-transformations "Section: Parameter transformations") section,
we take the expectation of the left-hand side only (since the expectation of the right-hand side is always zero).
This allows us to compute an approximate update as the solution to
\begin{eqnarray}
-\left\langle\mathbb{E}\left[
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]\right\rangle\,\Delta\boldsymbol{\theta} 
& = & 
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L
\mid\boldsymbol{\theta}\right]\right\rangle
\,\Delta\boldsymbol{\theta} ~=~
\langle\boldsymbol{\nabla}L\rangle
\,,
\end{eqnarray}
which has the advantage of only requiring knowledge about the gradient of the log-likelihood.
Note that some other gradient update schemes also use approximations to the Hessian. For example, the L-BFGS algorithm builds an approximation to the (inverse) Hessian matrix from multiple previous evaluations of the gradient,
effectively approximating the expectation itself by a temporal average.
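
The expected-Hessian update is most useful when the exact Hessian is awkward but its expectation is simple. The sketch below is an illustration under stated assumptions: it uses a Cauchy distribution with unknown location $\theta$ and unit scale, for which $\mathbb{E}\left[(\partial L/\partial\theta)^2\mid\theta\right]=\frac{1}{2}$; the synthetic data, the seed and the use of the sample median as starting value are our own choices.

```python
import math
import random
import statistics

def sample_cauchy(n, location, seed=0):
    """Draw n synthetic Cauchy variates (unit scale) by inverse-CDF sampling."""
    rng = random.Random(seed)
    return [location + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def mean_score(theta, xs):
    """Sample average of dL/dtheta for the unit-scale Cauchy log-likelihood."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs) / len(xs)

# Expected squared gradient per observation for a unit-scale Cauchy location.
EXPECTED_SQUARED_SCORE = 0.5

def approximate_newton(xs, iters=50):
    """Solve E[(dL/dtheta)^2] * delta = <dL/dtheta> repeatedly, starting from the median."""
    theta = statistics.median(xs)
    for _ in range(iters):
        theta += mean_score(theta, xs) / EXPECTED_SQUARED_SCORE
    return theta

xs = sample_cauchy(2000, location=3.0)
theta_hat = approximate_newton(xs)
print(theta_hat, mean_score(theta_hat, xs))
```

Only the gradient of the log-likelihood is evaluated on the data; the curvature is replaced by a constant. This scheme is commonly known as Fisher scoring, a name not used in the text above.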

Finally, note that we might more generally consider some 
[parameter transformation](#Parameter-transformations "Section: Parameter transformations") 
$\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$, and therefore the transformed parameter update
$\Delta\boldsymbol{\eta}$ would follow from the above formulae by
taking all gradients and Hessians with respect to $\boldsymbol{\eta}$ rather than $\boldsymbol{\theta}$.

## Exponential families

Since [PDFs](#Probability-distribution-functions "Section: Probability distribution functions") are required to be
non-negative, it follows that they may be expressed as exponentials. Different exponential forms lead to different families of distributions.
Of particular interest is a [general family](#General-form "Section: General form")
of distributions having linearly-additive log-likelihoods, and also a more 
[specialised family](#Seperable-dependencies "Section: Seperable dependencies"),
misleadingly called "**the**" exponential family,
having bilinear or separable dependencies between the variates and the parameters. These distributions are discussed in the following sections.

### General form

Clearly, a [PDF](#Probability-distribution-functions "Section: Probability distribution functions")
$p(\mathbf{x}\mid\boldsymbol{\theta})$ must be the exponential of its log-likelihood
$L(\boldsymbol{\theta};\,\mathbf{x})$. Considered as an additive model with variate $X$, the log-likelihood will in general include: constant terms; terms in $X$ but not $\boldsymbol{\theta}$; terms in $\boldsymbol{\theta}$ but not $X$;
and terms containing interactions between $X$ and $\boldsymbol{\theta}$. Hence, the general log-likelihood takes the form
\begin{eqnarray}
L(\boldsymbol{\theta};\,X) & = & \ln h(X)-\ln Ƶ(\boldsymbol{\theta})+s(X,\boldsymbol{\theta})
\,,
\end{eqnarray}
where any constant terms may be placed in either or both of $h(X)$ or $Ƶ(\boldsymbol{\theta})$, but the
interaction function $s(X,\boldsymbol{\theta})$ may contain neither constant terms, nor terms only in $X$,
nor terms only in $\boldsymbol{\theta}$.
Consequently, the general probability distribution is specified by
\begin{eqnarray}
p(\mathbf{x}\mid\boldsymbol{\theta}) & = & 
\frac{h(\mathbf{x})\,e^{s(\mathbf{x},\boldsymbol{\theta})}}
     {Ƶ(\boldsymbol{\theta})}
\,,
\end{eqnarray}
where $Ƶ(\boldsymbol{\theta})$ is now seen to be the normalising *partition* function defined by
\begin{eqnarray}
Ƶ(\boldsymbol{\theta}) & \doteq & 
\int_\mathcal{X}h(\mathbf{x})\,e^{s(\mathbf{x},\boldsymbol{\theta})}
\,|d\mathbf{x}|\,.
\end{eqnarray}

It now [follows](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") 
from the parameter gradient of the log-likelihood $L$ that
\begin{eqnarray}
\boldsymbol{\nabla}L(\boldsymbol{\theta}; X) & = & \boldsymbol{\nabla}s(X,\boldsymbol{\theta})
-\boldsymbol{\nabla}\ln Ƶ(\boldsymbol{\theta})
\\
\Rightarrow
\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right] & = & 
\mathbb{E}\left[\boldsymbol{\nabla}s\mid\boldsymbol{\theta}\right]
-\boldsymbol{\nabla}\ln Ƶ~=~\mathbf{0}
\\
\Rightarrow 
\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s} & ~\doteq~ &
\mathbb{E}\left[\boldsymbol{\nabla}s\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}\ln Ƶ
\\
\Rightarrow \boldsymbol{\nabla}L & = & \boldsymbol{\nabla}s-\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s}
\,.
\end{eqnarray}
Similarly, it [follows](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods")
from the Hessian of the log-likelihood that
\begin{eqnarray}
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}L(\boldsymbol{\theta}; X) & = & 
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s(X,\boldsymbol{\theta})
-\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ(\boldsymbol{\theta})
\\
\Rightarrow
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
& = &
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
-\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ
\\
\Rightarrow
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ
& = &
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
+\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
\\& = &
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
+\mathbb{E}\left[\left(\boldsymbol{\nabla}s-\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s}\right)\,
\left(\boldsymbol{\nabla}s-\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s}\right)^T
\mid\boldsymbol{\theta}\right]
\,.
\end{eqnarray}
We observe that the last term on the right-hand side is just the variance of $\boldsymbol{\nabla}s$, such that
we may define
\begin{eqnarray}
\boldsymbol{\Sigma}_{\small\boldsymbol{\nabla}s}
& ~\doteq~ &
\mathbb{V}\left[\boldsymbol{\nabla}s\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ
-
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]\,.
\end{eqnarray}
Note that the variance operator $\mathbb{V}[\cdot]$ is also often denoted as $\texttt{Var}[\cdot]$ instead.
We shall use the $\texttt{Var}[\cdot]$ form when it is also useful to consider the covariance operator
$\texttt{Cov}[\cdot,\cdot]$.
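
As a worked example of these identities (an illustration with assumed choices), consider a Gaussian with known mean $\mu$ and unknown parameter $\theta=\sigma$. Its log-likelihood may be written in the general form with
\begin{eqnarray}
h(X)~=~1\,, & \;\;\; Ƶ(\sigma)~=~\sigma\sqrt{2\pi}\,, \;\;\; &
s(X,\sigma)~=~-\frac{(X-\mu)^2}{2\sigma^2}\,.
\end{eqnarray}
Using $\mathbb{E}\left[(X-\mu)^2\mid\sigma\right]=\sigma^2$, the mean of the gradient of $s$ is
\begin{eqnarray}
\mathbb{E}\left[\frac{\partial s}{\partial\sigma}\mid\sigma\right] & = &
\mathbb{E}\left[\frac{(X-\mu)^2}{\sigma^3}\mid\sigma\right]
~=~\frac{1}{\sigma}~=~\frac{\partial\ln Ƶ}{\partial\sigma}\,,
\end{eqnarray}
as required. Similarly, the variance identity gives
\begin{eqnarray}
\frac{\partial^2\ln Ƶ}{\partial\sigma^2}
-\mathbb{E}\left[\frac{\partial^2 s}{\partial\sigma^2}\mid\sigma\right]
& = &
-\frac{1}{\sigma^2}+\mathbb{E}\left[\frac{3(X-\mu)^2}{\sigma^4}\mid\sigma\right]
~=~\frac{2}{\sigma^2}\,,
\end{eqnarray}
which matches the direct calculation
$\mathbb{V}\left[(X-\mu)^2/\sigma^3\mid\sigma\right]=2\sigma^4/\sigma^6=2/\sigma^2$.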

These identities hold for gradients and Hessians with respect to both the distributional parameter
$\boldsymbol{\theta}$ and also any arbitrary
[reparameterisation](#Parameter-transformations "Section: Parameter transformations")
$\boldsymbol{\eta}(\boldsymbol{\theta})$.
Consequently, we may always define a new variate of the form 
$Y\doteq\boldsymbol{\nabla}s(X,\boldsymbol{\theta})$, such that
its mean is given by
\begin{eqnarray}
\boldsymbol{\mu}_{Y} & ~=~ &
\mathbb{E}\left[Y\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}\ln Ƶ\,,
\end{eqnarray}
and its variance is given by
\begin{eqnarray}
\boldsymbol{\Sigma}_Y
& ~=~ &
\mathbb{V}\left[Y\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}^{}\boldsymbol{\mu}_{Y}^{T}
-
\mathbb{E}\left[\boldsymbol{\nabla}^{}Y^T\mid\boldsymbol{\theta}\right]
\,.
\end{eqnarray}
The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
estimate $\hat{\boldsymbol{\theta}}_\texttt{ML}$
may therefore be obtained iteratively via the approximate Newton-Raphson update
\begin{eqnarray}
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L
\mid\boldsymbol{\theta}\right]\right\rangle\,\Delta\boldsymbol{\theta} ~=~
\langle\boldsymbol{\nabla}L\rangle
& ~~~\Rightarrow~~~ &
\left\langle\boldsymbol{\Sigma}_Y\right\rangle\,\Delta\boldsymbol{\theta}
~=~
\left\langle Y-\boldsymbol{\mu}_Y\right\rangle
\,.
\end{eqnarray}
The iterations halt when $\Delta\boldsymbol{\theta}=\mathbf{0}$, at which point
the sample mean $\left\langle Y\right\rangle$ equals the expected mean
$\hat{\boldsymbol{\mu}}_Y=\boldsymbol{\mu}_Y(\hat{\boldsymbol{\theta}}_\texttt{ML})$.
Note that under the transformation $\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$,
we may also obtain updates for $\boldsymbol{\eta}$ via $\Delta\boldsymbol{\eta}$,
culminating in the maximum-likelihood estimate
$\hat{\boldsymbol{\eta}}_\texttt{ML}=\boldsymbol{\eta}(\hat{\boldsymbol{\theta}}_\texttt{ML})$.
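
A minimal sketch of this update follows, assuming a Gaussian with known mean $\mu$ and unknown $\theta=\sigma$, for which $Y=(X-\mu)^2/\sigma^3$, $\boldsymbol{\mu}_Y=1/\sigma$ and $\boldsymbol{\Sigma}_Y=2/\sigma^2$. The data values, starting point and function name are hypothetical.

```python
import math

def fit_sigma(xs, mu, sigma=1.0, iters=30):
    """Iterate <Sigma_Y> * delta = <Y - mu_Y> for a Gaussian with known mean mu."""
    n = len(xs)
    mean_sq_dev = sum((x - mu) ** 2 for x in xs) / n
    for _ in range(iters):
        mean_y = mean_sq_dev / sigma**3   # sample mean <Y> at the current sigma
        mu_y = 1.0 / sigma                # expected mean of Y
        sigma_y = 2.0 / sigma**2          # variance of Y
        sigma += (mean_y - mu_y) / sigma_y
    return sigma

data = [2.1, 1.4, 3.3, 2.6, 0.9]  # hypothetical observations
mu = 2.0                          # assumed known mean
sigma_hat = fit_sigma(data, mu)
print(sigma_hat)
print(math.sqrt(sum((x - mu) ** 2 for x in data) / len(data)))  # closed-form estimate
```

In this example the update simplifies to $\sigma'=\frac{1}{2}\left(\sigma+\langle (X-\mu)^2\rangle/\sigma\right)$, which is the classical Babylonian iteration for a square root, so the iterations halt at the root-mean-square deviation, where $\langle Y\rangle=\boldsymbol{\mu}_Y$.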

### Seperable dependencies

We now consider a specialisation of the [general form](#General-form "Section: General form") which had a (somewhat) arbitrary,
scalar interaction term $s(X,\boldsymbol{\theta})$.
The essential idea is that all interactions between the variate $X$ and the
parameter $\boldsymbol{\theta}$ are now multiplicatively separable, i.e. specified via one or more product terms. 
The simplest such product form is 
$s(X,\boldsymbol{\theta})=\boldsymbol{\theta}^{T}X$. However, more generally the nonlinear product
$s(X,\boldsymbol{\theta})=\boldsymbol{\eta}(\boldsymbol{\theta})^{T}\mathbf{u}(X)$ is also valid. 
Note that the vector function $\boldsymbol{\eta}(\cdot)$ may be thought of as defining the
*natural* parameterisation $\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$ of the distribution.
Also, note that we now have the special property that $\boldsymbol{\nabla}_\boldsymbol{\eta}\,s(X,\boldsymbol{\theta})=\mathbf{u}(X)$,
such that $Y_\boldsymbol{\eta}\doteq\mathbf{u}(X)$ may be thought of as the natural *variates*
of the distribution.

Note that distributions having the form $s(\mathbf{x},\boldsymbol{\theta})=\boldsymbol{\eta}(\boldsymbol{\theta})^{T}\mathbf{u}(\mathbf{x})$ are regarded as belonging to **the** *exponential family*. This is a somewhat misleading and presumptuous term, given that other forms of $s(\mathbf{x},\boldsymbol{\theta})$ exist, and other forms of log-likelihood not in the general form also exist, i.e. members of an even more general exponential family that are not in "the" exponential family.
Also note that distributions having the bilinear form $s(\mathbf{x},\boldsymbol{\theta})=\boldsymbol{\theta}^{T}\mathbf{x}$
are regarded as members of the *natural* exponential family, since the natural parameter $\boldsymbol{\eta}$
is identically the ordinary parameter $\boldsymbol{\theta}$. The other stipulation is 
([apparently](https://en.wikipedia.org/wiki/Natural_exponential_family "Wikipedia: Natural exponential family")) 
that we also must have the identity function
$\mathbf{u}(\mathbf{x})=\mathbf{x}$ to be in the natural exponential family. It is unclear what categorisation should be given to distributions for which the parameters are already *natural*, but for which $\mathbf{u}(\mathbf{x})\neq\mathbf{x}$.

As noted above, the special property of "the" exponential family is that, taking gradients with respect to
the natural parameter
$\boldsymbol{\eta}$, we have
\begin{eqnarray}
Y_\boldsymbol{\eta}~=~\boldsymbol{\nabla}_\boldsymbol{\eta}\,s(X,\boldsymbol{\theta})~=~\mathbf{u}(X)
& ~\Rightarrow~ &
\boldsymbol{\nabla}_\boldsymbol{\eta}\,Y_\boldsymbol{\eta}^{T}~=~
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\,s~=~\mathbf{O}
\,.
\end{eqnarray}
It then follows from the [previous](#General-form "Section: General form") section that
\begin{eqnarray}
\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta}) & ~=~ & 
\mathbb{E}\left[Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right] ~=~ 
\boldsymbol{\nabla}_\boldsymbol{\eta}\ln Z(\boldsymbol{\theta})\,,
\end{eqnarray}
and
\begin{eqnarray}
\boldsymbol{\Sigma}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta}) & ~=~ &
\mathbb{V}\left[ Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\ln Z(\boldsymbol{\theta})
~=~\boldsymbol{\nabla}_\boldsymbol{\eta}\,\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}^{T}(\boldsymbol{\theta})
\,.
\end{eqnarray}
As a simplification, we may use the [results](#Parameter-transformations "Section: Parameter transformations") that
\begin{eqnarray}
\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}~=~\mathbf{J}_\boldsymbol{\eta}^{-1}\,
\boldsymbol{\nabla}_\boldsymbol{\theta}\ln Z\,, 
& \;\;~\mbox{and}~\;\; &
\boldsymbol{\Sigma}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta})~=~
\mathbf{J}_\boldsymbol{\eta}^{-1}\,
~\boldsymbol{\nabla}_\boldsymbol{\theta}\,\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}^{T}
\,.
\end{eqnarray}
Consequently, the **exact** 
[Newton-Raphson](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") update
of the natural parameter $\boldsymbol{\eta}$ is now given by
\begin{eqnarray}
\boldsymbol{\Sigma}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta})\,\Delta\boldsymbol{\eta} & = & 
\langle Y_\boldsymbol{\eta}\rangle-\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta})\,,
\end{eqnarray}
such that the 
 [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
estimate $\hat{\boldsymbol{\eta}}_\texttt{ML}=\boldsymbol{\eta}(\hat{\boldsymbol{\theta}}_\texttt{ML})$
satisfies $\hat{\boldsymbol{\mu}}_{\small Y_\boldsymbol{\eta}}
=\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}(\hat{\boldsymbol{\theta}}_\texttt{ML})=\langle Y_\boldsymbol{\eta}\rangle$.
 We shall see from a later 
[example](#Beta-distribution "Section: Beta distribution") that $\langle Y_\boldsymbol{\eta}\rangle$ are the sufficient statistics for "the" exponential family.

### Bernoulli distribution

Consider a match between two teams, say team A and team B. Suppose further that, after considering all the evidence, our model proposes a probability $\theta$ of team A winning. In practice, team A may either win or lose the match, or even draw the match, which we shall deal with later. Hence, we let $X=1$ indicate that team A actually won the match, and let $X=0$ indicate that team A lost the match. The variate $X$ then follows the Bernoulli distribution
\begin{eqnarray}
p(X=x\mid\theta) & = & \theta^{x}\,(1-\theta)^{1-x}
~=~\frac{e^{x\ln\frac{\theta}{1-\theta}}}{(1-\theta)^{-1}}\,.
\end{eqnarray}
We therefore observe that the Bernoulli distribution is a member of "the" 
[exponential family](#Seperable-dependencies "Section: Seperable dependencies") 
with natural parameter
\begin{eqnarray}
\eta & \doteq & \ln\frac{\theta}{1-\theta}~\doteq~\sigma^{-1}(\theta)\,,
\end{eqnarray}
where $\sigma^{-1}(\cdot)$ is the *logit* function, and its inverse is the  *logistic* (sigmoid)
function $\sigma(\eta)\doteq(1+e^{-\eta})^{-1}$.

We also see that the utility function is just the identity, $u(x)=x$, 
such that $X$ is the natural variate.
Lastly, observe that the partition function is given by
\begin{eqnarray}
Ƶ(\theta)~=~(1-\theta)^{-1} & ~\Rightarrow~ & \nabla_\theta\ln Ƶ~=~\frac{1}{1-\theta}\,.
\end{eqnarray}
Given the reparameterisation $\eta=\sigma^{-1}(\theta)$, we also note that the
[Jacobian](#Parameter-transformations "Section: Parameter transformations") of this transformation is given by
\begin{eqnarray}
J_\eta & = & \nabla_\theta\,\eta~=~\frac{1}{\theta\,(1-\theta)}
\\
\Rightarrow \nabla_\eta\ln Ƶ & = & J_\eta^{-1}\nabla_\theta\ln Ƶ~=~\theta\,,
\end{eqnarray}
from which it [follows](#Seperable-dependencies "Section: Seperable dependencies") that
the distributional mean is given by
\begin{eqnarray}
\mu_X & = & \mathbb{E}[X\mid\theta]~=~\nabla_\eta\ln Ƶ~=~\theta\,.
\end{eqnarray}
Taking $\mu\doteq\mu_X$ for convenience,
we now see that the logit function $\sigma^{-1}(\cdot)$ is the
natural [link function](#Regression-modelling "Section: Regression modelling")
for the Bernoulli distribution, since $\eta=\sigma^{-1}(\mu)$.
This choice of link function is also justified in a 
[later](#Bernoulli-regression-(again) "Section: Bernoulli regression (again)") section.

Similarly, we observe that
\begin{eqnarray}
\nabla_\eta^2\ln Ƶ(\theta)~=~\nabla_\eta\,\theta~=~J_\eta^{-1}~=~\theta\,(1-\theta)\,.
\end{eqnarray}
It therefore [follows](#Seperable-dependencies "Section: Seperable dependencies") that
the distributional variance is given by
\begin{eqnarray}
\sigma_X^2 & = & \mathbb{V}[X\mid\theta]~=~\theta\,(1-\theta)~=~\mu\,(1-\mu)\,.
\end{eqnarray}
Hence, the Bernoulli distribution (or its variate $X$) is *heteroscedastic*, since the variance is not constant
but is instead a function of the mean.

It also [follows](#Seperable-dependencies "Section: Seperable dependencies") that the
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate of the mean
is given by 
\begin{eqnarray}
\hat{\mu}_\texttt{ML} & = & \hat{\theta}_\texttt{ML}~=~\langle X\rangle\,.
\end{eqnarray}

Finally, we make use of the fact that $1-\sigma(\eta)=\sigma(-\eta)$, and therefore observe that we may reparameterise the Bernoulli distribution 
in terms of its natural parameter $\eta$ as
\begin{eqnarray}
p(X=x\mid\eta) & = & \sigma(-\eta)\,e^{\eta x}\,,
\end{eqnarray}
which puts it into the natural exponential family. Consequently, it appears that the membership (or non-membership) of a probability distribution in a given exponential sub-family is largely determined by its parameterisation.
Also note that under this reparameterisation, the mean and variance are now given by
\begin{eqnarray}
\mu~=~\mathbb{E}[X\mid\eta]~=~\sigma(\eta)\,,
&\;\;\;& 
\sigma^2_X~=~\mathbb{V}[X\mid\eta]~=~\sigma(\eta)\,\sigma(-\eta)\,,
\end{eqnarray}
respectively.
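
As a numerical sanity check of the results above, the following sketch differentiates the log-partition function $\ln Ƶ=\ln(1+e^\eta)=-\ln\sigma(-\eta)$ by central finite differences, and compares the results with $\sigma(\eta)$ and $\sigma(\eta)\,\sigma(-\eta)$. The function names and the step size `h` are illustrative assumptions, not part of the derivation.

```python
import math

def sigmoid(eta):
    """Logistic function sigma(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_partition(eta):
    """ln Z(eta) = ln(1 + exp(eta)) for the Bernoulli distribution."""
    return math.log1p(math.exp(eta))

def bernoulli_moments_fd(eta, h=1e-4):
    """Approximate the mean and variance as the first and second
    central finite differences of ln Z with respect to eta."""
    f_plus = log_partition(eta + h)
    f_zero = log_partition(eta)
    f_minus = log_partition(eta - h)
    mean = (f_plus - f_minus) / (2.0 * h)
    var = (f_plus - 2.0 * f_zero + f_minus) / (h * h)
    return mean, var

theta = 0.7                              # hypothetical win probability
eta = math.log(theta / (1.0 - theta))    # natural parameter via the logit link
mean_fd, var_fd = bernoulli_moments_fd(eta)
print(mean_fd, sigmoid(eta))                   # both should be close to theta
print(var_fd, sigmoid(eta) * sigmoid(-eta))    # both should be close to theta * (1 - theta)
```

The finite-difference values agree with the closed forms only up to truncation and rounding error, so any comparison should use a tolerance.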

### Poisson distribution

It is feasible that team scores from some sporting games might be represented as Poisson variables.
We suppose that a scoring event occurs at an average rate of $\lambda$ units per unit time (e.g. per match).
Thus, let $X\in\mathbb{Z}^{\ge 0}$ follow the Poisson distribution
\begin{eqnarray}
p(X=x\mid\lambda) & = & e^{-\lambda}\,\frac{\lambda^x}{x!}
~=~\frac{(x!)^{-1}\,e^{x\ln\lambda}}{e^\lambda}\,,
\end{eqnarray}
which is in "the" [exponential family](#Seperable-dependencies "Section: Seperable dependencies").
The natural variate is therefore just $X$, but the natural parameter is $\eta\doteq\ln\lambda$.
The partition function is
\begin{eqnarray}
Ƶ(\lambda)~=~e^\lambda & ~~~\Rightarrow~~~ & \ln Ƶ~=~\lambda~=~e^\eta\,,
\end{eqnarray}
[whereupon](#Seperable-dependencies "Section: Seperable dependencies")
\begin{eqnarray}
\mu_X & ~=~ & \mathbb{E}[X\mid\lambda]~=~\nabla_\eta\ln Ƶ~=~e^\eta~=~\lambda\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_X & ~=~ & \mathbb{V}[X\mid\lambda]~=~\nabla^2_\eta\ln Ƶ~=~e^\eta~=~\lambda\,.
\end{eqnarray}
The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate of
the rate $\lambda$ is therefore just $\hat{\lambda}_\texttt{ML}=\hat{\mu}_\texttt{ML}=\langle X\rangle$.
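
To illustrate the exact Newton-Raphson update in the natural parameter, the sketch below iterates $\sigma^2_X\,\Delta\eta=\langle X\rangle-\mu_X$ with $\mu_X=\sigma^2_X=e^\eta$. The scores, the starting value $\eta=0$ and the stopping tolerance are hypothetical choices, and the sample mean is assumed to be strictly positive.

```python
import math

def poisson_newton_eta(xs, eta=0.0, tol=1e-12, max_iter=100):
    """Newton-Raphson on eta = ln(lambda):
    variance * delta_eta = <X> - mean, where mean = variance = exp(eta)."""
    x_bar = sum(xs) / len(xs)
    for _ in range(max_iter):
        lam = math.exp(eta)
        delta = (x_bar - lam) / lam
        eta += delta
        if abs(delta) < tol:
            break
    return eta

scores = [2, 0, 3, 1, 1, 4, 2, 0, 1, 2]    # hypothetical match scores
eta_hat = poisson_newton_eta(scores)
lam_hat = math.exp(eta_hat)
print(lam_hat, sum(scores) / len(scores))  # the two values should agree closely
```

The iteration is of course unnecessary here, since the closed-form solution is known, but it shows the update rule in its simplest setting.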

### Beta distribution

In a [previous](#Bernoulli-distribution "Section: Bernoulli distribution") section, we
considered the situation where a match between teams A and B might have a fixed probability $\theta$ of team A winning.
In contrast, we now suppose that this
probability is denoted by a variate $X$, which is itself sampled from another distribution. For example, $X$ might be drawn from the Beta distribution
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{x^{\alpha-1}\,(1-x)^{\beta-1}}{B(\alpha,\beta)}
~=~\frac{[x(1-x)]^{-1}\,e^{\alpha\ln x+\beta\ln(1-x)}}{B(\alpha,\beta)}\,.
\end{eqnarray}
This distribution is also in "the"
[exponential family](#Seperable-dependencies "Section: Seperable dependencies"), with natural parameters 
$\boldsymbol{\theta}=(\alpha,\beta)$, 
natural variates
$Y_\alpha=\ln X$ and $Y_\beta=\ln (1-X)$, and
partition function $Ƶ(\boldsymbol{\theta})$ given by
\begin{eqnarray}
B(\alpha,\beta) & ~\doteq~ & \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}\,,
\end{eqnarray}
where $\Gamma(\cdot)$ is the *gamma* function.
It therefore [follows](#Seperable-dependencies "Section: Seperable dependencies") that
the mean of $Y_\alpha=\ln X$ is given by
\begin{eqnarray}
\mu_{\small Y_\alpha} & ~=~ & \mathbb{E}\left[\ln X\mid\alpha,\beta\right] ~=~
\frac{\partial}{\partial\alpha}\ln B(\alpha,\beta)
~=~\psi(\alpha)-\psi(\alpha+\beta)\,,
\end{eqnarray}
where $\psi(\cdot)$ is the *digamma* function given by 
\begin{eqnarray}
\psi(z) & ~\doteq~ & \frac{d}{dz}\ln\Gamma(z)~=~\frac{\Gamma'(z)}{\Gamma(z)}\,.
\end{eqnarray}
Similarly, the mean of $Y_\beta=\ln (1-X)$ is given by
\begin{eqnarray}
\mu_{\small Y_\beta} & ~=~ & \mathbb{E}\left[\ln (1-X)\mid\alpha,\beta\right] ~=~
\frac{\partial}{\partial\beta}\ln B(\alpha,\beta)
~=~\psi(\beta)-\psi(\alpha+\beta)\,.
\end{eqnarray}
For interest's sake, note that if we let $Y\doteq\ln\frac{X}{1-X}=\sigma^{-1}(X)$, then we deduce that
\begin{eqnarray}
\mu_Y & ~=~ & \mathbb{E}\left[Y\mid\alpha,\beta\right]~=~
\mu_{\small Y_\alpha}-\mu_{\small Y_\beta} 
~=~\psi(\alpha)-\psi(\beta)\,.
\end{eqnarray}

Similarly, we find that the variance of $Y_\alpha$ is given by
\begin{eqnarray}
\sigma^2_{\small Y_\alpha} & ~=~ & \texttt{Var}\left[\ln X\mid\alpha,\beta\right] ~=~
\frac{\partial^2}{\partial\alpha^2}\ln B(\alpha,\beta)
~=~\psi'(\alpha)-\psi'(\alpha+\beta)\,,
\end{eqnarray}
where $\psi'(\cdot)\doteq\psi_1(\cdot)$ is the *trigamma* function given by
\begin{eqnarray}
\psi_1(z) & \doteq & \frac{\Gamma(z)\,\Gamma''(z)-\Gamma'(z)^2}{\Gamma(z)^2}\,.
\end{eqnarray}
Likewise, the variance of $Y_\beta$ is given by
\begin{eqnarray}
\sigma^2_{\small Y_\beta} & ~=~ & \texttt{Var}\left[\ln(1-X)\mid\alpha,\beta\right] ~=~
\frac{\partial^2}{\partial\beta^2}\ln B(\alpha,\beta)
~=~\psi'(\beta)-\psi'(\alpha+\beta)\,,
\end{eqnarray}
and the covariance between $Y_\alpha$ and $Y_\beta$ is given by
\begin{eqnarray}
\sigma_{Y_\alpha,Y_\beta} & ~=~ & \texttt{Cov}\left[\ln X,\,\ln(1-X)\mid\alpha,\beta\right] ~=~
\frac{\partial^2}{\partial\alpha\partial\beta}\ln B(\alpha,\beta)
~=~-\psi'(\alpha+\beta)\,.
\end{eqnarray}


It might seem somewhat surprising that following the defined procedure does not immediately give us the mean and variance of the variate $X$, but instead gives the mean and variance of $\mathbf{u}(X)$. 
In fact, it turns out that $\langle\ln X\rangle$ and $\langle\ln(1-X)\rangle$ provide the sufficient statistics for the 
[Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution"), 
and not $\langle X\rangle$. 
The mean and variance of $X$ are actually given by
\begin{eqnarray}
\mu_X & ~=~ & \mathbb{E}[X\mid\alpha,\beta] ~=~\frac{\alpha}{\alpha+\beta}\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_X & ~=~ & \texttt{Var}[X\mid\alpha,\beta] ~=~\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\,,
\end{eqnarray}
respectively.
Note that we can also reparameterise the Beta distribution in another way. If we define $\nu\doteq\alpha+\beta$, then we see that
\begin{eqnarray}
\sigma^2_X ~=~ \frac{\mu_X\,(1-\mu_X)}{\nu+1} & \;\;\Rightarrow\;\; &
\nu ~=~ \frac{\mu_X\,(1-\mu_X)}{\sigma^2_X}-1\,,
\end{eqnarray}
such that
\begin{eqnarray}
\alpha~=~\mu_X\,\nu\,, & \;\;\;\; & \beta=(1-\mu_X)\,\nu\,.
\end{eqnarray}
The distribution is therefore heteroscedastic, with the variance $\sigma_X^2$ clearly being a function of 
the mean $\mu_X$ and hyper-parameter $\nu$.
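
The mapping between $(\alpha,\beta)$ and $(\mu_X,\sigma^2_X)$ can be sketched as a pair of small helper functions. The function names are hypothetical, and the validity checks simply enforce $0<\mu_X<1$ and $0<\sigma^2_X<\mu_X\,(1-\mu_X)$, which is required for $\nu>0$.

```python
def beta_moments(alpha, beta):
    """Mean and variance of X ~ Beta(alpha, beta), using nu = alpha + beta."""
    nu = alpha + beta
    mean = alpha / nu
    var = mean * (1.0 - mean) / (nu + 1.0)
    return mean, var

def beta_from_moments(mean, var):
    """Recover (alpha, beta) from the mean and variance via
    nu = mean * (1 - mean) / var - 1."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    nu = mean * (1.0 - mean) / var - 1.0
    return mean * nu, (1.0 - mean) * nu

mean, var = beta_moments(2.0, 3.0)
alpha, beta = beta_from_moments(mean, var)
print(mean, var, alpha, beta)   # the round trip should return values close to (2, 3)
```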

Yet another reparameterisation is to retain $\alpha$ and $\nu$, such that the distribution becomes
\begin{eqnarray}
p(X=x\mid\alpha,\nu) & ~=~ & 
\frac{[x(1-x)]^{-1}\,e^{\alpha\ln\frac{x}{1-x}+\nu\ln(1-x)}}{B(\alpha,\nu-\alpha)}\,,
\end{eqnarray}
where the natural variates are now $\mathbf{u}(X)=[Y, \ln(1-X)]^{T}$. We therefore obtain the mean of variate $Y$ as
\begin{eqnarray}
\mu_Y & ~=~ & \frac{\partial}{\partial\alpha}\ln B(\alpha,\nu-\alpha)~=~\psi(\alpha)-\psi(\nu-\alpha)\,,
\end{eqnarray}
as before, and the variance of $Y$ is now also obtained as
\begin{eqnarray}
\sigma^2_Y & ~=~ & \frac{\partial^2}{\partial\alpha^2}\ln B(\alpha,\nu-\alpha)~=~\psi'(\alpha)+\psi'(\nu-\alpha)\,.
\end{eqnarray}
This should come as no surprise, since $Y=\ln X-\ln(1-X)$, such that
\begin{eqnarray}
\texttt{Var}[Y\mid\alpha,\beta] & ~=~ & 
\texttt{Var}[\ln X\mid\alpha,\beta]+\texttt{Var}[\ln(1-X)\mid\alpha,\beta]-2\,\texttt{Cov}[\ln X,\,\ln(1-X)\mid\alpha,\beta]
\\& ~=~ &
\sigma^2_{Y_\alpha}+\sigma^2_{Y_\beta}-2\,\sigma_{Y_\alpha, Y_\beta}~=~\psi'(\alpha)+\psi'(\beta)\,.
\end{eqnarray}
This little exercise demonstrates that although different parameterisations might make the various calculations either easier
or harder to obtain, they cannot alter the final results.

Finally, we observe that the [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
solution 
$\hat{\boldsymbol{\theta}}_\texttt{ML}=(\hat{\alpha}_\texttt{ML},\hat{\beta}_\texttt{ML})$
satisfies the nonlinear system of equations
\begin{eqnarray}
\left\langle\ln X\right\rangle & = & 
\psi(\hat{\alpha}_\texttt{ML})-\psi(\hat{\alpha}_\texttt{ML}+\hat{\beta}_\texttt{ML})\,,
\\
\left\langle\ln(1-X)\right\rangle & = & 
\psi(\hat{\beta}_\texttt{ML})-\psi(\hat{\alpha}_\texttt{ML}+\hat{\beta}_\texttt{ML})\,.
\end{eqnarray}
This may be solved numerically using the Newton-Raphson method with iterative parameter updates of the form
\begin{eqnarray}
\left[\begin{array}{}\Delta\alpha\\\Delta\beta\end{array}\right]
& = &
\left[\begin{array}{}
\psi'(\alpha)-\psi'(\alpha+\beta) & -\psi'(\alpha+\beta)\\
-\psi'(\alpha+\beta) & \psi'(\beta)-\psi'(\alpha+\beta)
\end{array}\right]^{-1}\,
\left[\begin{array}{}
\left\langle\ln X\right\rangle - \psi(\alpha)+\psi(\alpha+\beta)\\
\left\langle\ln(1-X)\right\rangle - \psi(\beta)+\psi(\alpha+\beta)
\end{array}\right]
\,.
\end{eqnarray}
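
A minimal sketch of this iteration is given below, using only the standard library. The `digamma` and `trigamma` helpers are illustrative approximations (upward recurrence followed by a truncated asymptotic series), and the starting point $(1,1)$, the tolerance and the step-halving guard that keeps both parameters positive are assumptions rather than part of the derivation.

```python
import math

def digamma(x):
    """psi(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def trigamma(x):
    """psi'(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + 1.0 / x + 0.5 * inv2 + (1/6 - inv2 * (1/30 - inv2 / 42)) / (x * x * x)

def beta_mle(mean_log_x, mean_log_1mx, alpha=1.0, beta=1.0, tol=1e-10, max_iter=200):
    """Newton-Raphson for (alpha, beta) given <ln X> and <ln(1 - X)>."""
    for _ in range(max_iter):
        t = trigamma(alpha + beta)
        a, b, d = trigamma(alpha) - t, -t, trigamma(beta) - t   # covariance entries
        r1 = mean_log_x - digamma(alpha) + digamma(alpha + beta)
        r2 = mean_log_1mx - digamma(beta) + digamma(alpha + beta)
        det = a * d - b * b
        d_alpha = (d * r1 - b * r2) / det    # explicit 2x2 inverse
        d_beta = (a * r2 - b * r1) / det
        step = 1.0
        while alpha + step * d_alpha <= 0.0 or beta + step * d_beta <= 0.0:
            step *= 0.5                      # keep both parameters positive
        alpha += step * d_alpha
        beta += step * d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    return alpha, beta

# Exact expectations for Beta(2, 3) stand in for sample averages here.
true_alpha, true_beta = 2.0, 3.0
mean_log_x = digamma(true_alpha) - digamma(true_alpha + true_beta)
mean_log_1mx = digamma(true_beta) - digamma(true_alpha + true_beta)
alpha_hat, beta_hat = beta_mle(mean_log_x, mean_log_1mx)
print(alpha_hat, beta_hat)   # should recover values close to (2, 3)
```

In practice the two averages would be computed from observed values of $X$ lying strictly inside $(0,1)$, since $\ln X$ and $\ln(1-X)$ are undefined at the end points.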

The final issue is which sample values of the variate $X$ are actually observed in practice. We can no longer use the
[Bernoulli](#Bernoulli-distribution "Section: Bernoulli distribution") values of $X=1$ for a win and $X=0$ for a loss,
since now $X$ is a probability. One possibility is to note that after a match between team A and team B, we might have observed team A's score $S_A$ and team B's score $S_B$. Hence, we could use the proportion $X=\frac{S_A}{S_A+S_B}$ as a proxy measure of the probability of team A winning a similar match against team B in the future.

Note that, in general, we should not expect $S_A$ and $S_B$ to be independent, since $S_A$ should increase with team A's offensive strength, and decrease with team B's defensive strength. For example, we might expect both scores to be higher in a match between poor defenders than in a match between strong defenders. However, if we do assume independence, then taking
$X\sim\texttt{Beta}(\alpha,\beta)$ [follows](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution")
from $S_A\sim\texttt{Gamma}(\alpha,\gamma)$ and $S_B\sim\texttt{Gamma}(\beta,\gamma)$,
where $\gamma$ is a parameter common across all teams.
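
The independence construction can be checked by simulation. The sketch below draws hypothetical scores from two Gamma distributions and compares the sample moments of the proportion with those of $\texttt{Beta}(\alpha,\beta)$. Note that Python's `random.gammavariate` takes a shape and a *scale*, so the common rate $\gamma$ corresponds to a scale of $1/\gamma$; the sample size and seed are arbitrary assumptions.

```python
import random

def simulate_score_proportions(alpha, beta, scale=1.0, n=20000, seed=0):
    """Draw independent Gamma scores S_A, S_B with a common scale and
    return the proportions X = S_A / (S_A + S_B)."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n):
        s_a = rng.gammavariate(alpha, scale)
        s_b = rng.gammavariate(beta, scale)
        proportions.append(s_a / (s_a + s_b))
    return proportions

alpha, beta = 2.0, 3.0
xs = simulate_score_proportions(alpha, beta)
sample_mean = sum(xs) / len(xs)
sample_var = sum((x - sample_mean) ** 2 for x in xs) / len(xs)
nu = alpha + beta
print(sample_mean, alpha / nu)                         # compare with the Beta mean
print(sample_var, alpha * beta / (nu**2 * (nu + 1)))   # compare with the Beta variance
```

The agreement is only statistical, so the sample moments should be compared with the theoretical values using a tolerance appropriate to the sample size.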

### Beta-Bernoulli distribution

We now consider the combined situation where each match has a 
[Bernoulli-distributed](#Bernoulli-distribution "Section: Bernoulli distribution") outcome, $X\mid\theta\sim\texttt{Bern}(\theta)$, but where the probability $\theta$ itself is [Beta-distributed](#Beta-distribution "Section: Beta distribution"),
with $\theta\sim\texttt{Beta}(\alpha,\beta)$. It may be 
[shown](D_distributions.ipynb#Beta-Bernoulli-mixture-distribution "Appendix D: Beta-Bernoulli mixture distribution") 
that
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & = & \frac{\alpha^x\,\beta^{1-x}}{\alpha+\beta}\,.
\end{eqnarray}

We may now rewrite the distribution in the form
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{\beta\,\left(\frac{\alpha}{\beta}\right)^x}{\alpha+\beta}
~=~\frac{e^{x\ln\frac{\alpha}{\beta}}}{1+\frac{\alpha}{\beta}}\,,
\end{eqnarray}
which is in "the" [exponential family](#Seperable-dependencies "Section: Seperable dependencies") with natural parameter
$\eta=\ln\frac{\alpha}{\beta}$. Consequently, the reparameterised distribution
\begin{eqnarray}
p(X=x\mid\eta) & \doteq & \frac{e^{\eta x}}{1+e^\eta}\,,
\end{eqnarray}
is thus in the natural exponential family, and has mean
\begin{eqnarray}
\mu & ~=~ & \mathbb{E}[X\mid\eta]~=~\frac{d}{d\eta}\ln(1+e^\eta)
~=~\frac{e^\eta}{1+e^\eta}~=~\frac{1}{1+e^{-\eta}}\,.
\end{eqnarray}
This is just the logistic function $\sigma(\cdot)$, such that
\begin{eqnarray}
\mu~=~\sigma(\eta) & ~\Rightarrow~ & \eta~=~\sigma^{-1}(\mu)\,.
\end{eqnarray}
The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate
$\hat{\eta}_\texttt{ML}$ therefore satisfies
\begin{eqnarray}
\hat{\mu}_\texttt{ML} & ~=~ & \sigma(\hat{\eta}_\texttt{ML})=\langle X\rangle\,.
\end{eqnarray}
Similarly, the variance is given by
\begin{eqnarray}
\sigma_X^2 & ~=~ & \texttt{Var}[X\mid\eta]~=~\frac{d}{d\eta}\frac{1}{1+e^{-\eta}}
~=~\frac{e^{-\eta}}{(1+e^{-\eta})^2}~=~\sigma(\eta)\,\sigma(-\eta)\,.
\end{eqnarray}


In terms of the original parameters $\alpha$ and $\beta$, substitution of $\eta=\ln\frac{\alpha}{\beta}$ into the above results gives
\begin{eqnarray}
\mu_X~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\; & \sigma^2_X~=~\frac{\alpha\beta}{(\alpha+\beta)^2}\,.
\end{eqnarray}
However, it must be noted that the natural exponential form of the distribution above required only a single parameter, namely $\eta$, instead of the two parameters specified here, namely $\alpha$ and $\beta$.
In order to help explain this discrepancy, consider the log-likelihood
\begin{eqnarray}
L(\alpha,\beta;X) & ~=~ & X\,(\ln\alpha-\ln\beta)+\ln\beta-\ln(\alpha+\beta)\,. 
\end{eqnarray}
Taking first derivatives with respect to $\alpha$ and $\beta$ then gives
\begin{eqnarray}
\frac{\partial L}{\partial\alpha} & ~=~ & \frac{X}{\alpha}-\frac{1}{\alpha+\beta}~\doteq~Y_\alpha-\mu_{Y_\alpha}\,,
\\
\frac{\partial L}{\partial\beta} & ~=~ & -\frac{X}{\beta}+\frac{\alpha}{\beta(\alpha+\beta)}
~\doteq~Y_\beta-\mu_{Y_\beta}\,,
\end{eqnarray}
from which it follows that
\begin{eqnarray}
\sigma^2_{Y_\alpha} & ~\doteq~ & \mathtt{Var}[Y_\alpha\mid\alpha,\beta]~=~\frac{\sigma^2_X}{\alpha^2}
~=~\frac{\beta}{\alpha(\alpha+\beta)^2}\,,
\\
\sigma^2_{Y_\beta} & ~\doteq~ & \mathtt{Var}[Y_\beta\mid\alpha,\beta]
~=~\frac{\sigma^2_X}{\beta^2}~=~\frac{\alpha}{\beta(\alpha+\beta)^2}\,,
\\
\sigma_{Y_\alpha,Y_\beta} & ~\doteq~ & \mathtt{Cov}[Y_\alpha,Y_\beta\mid\alpha,\beta]
~=~-\frac{\sigma^2_X}{\alpha\beta}~=~-\frac{1}{(\alpha+\beta)^2}\,.
\end{eqnarray}
In terms of the vector parameter $\boldsymbol{\theta}=(\alpha,\beta)$ with vector variate 
$Y_\boldsymbol{\theta}=(Y_\alpha,Y_\beta)$,
the covariance matrix is therefore given by
\begin{eqnarray}
\Sigma_{Y_\boldsymbol{\theta}} & ~\doteq~ & \left[
\begin{array}{cc}
\frac{\beta}{\alpha(\alpha+\beta)^2} & -\frac{1}{(\alpha+\beta)^2}
\\
-\frac{1}{(\alpha+\beta)^2} & \frac{\alpha}{\beta(\alpha+\beta)^2}
\end{array}
\right]\,.
\end{eqnarray}
We observe that this covariance matrix is singular, which prevents us from using the
[Newton-Raphson](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
approach to estimating $\alpha$ and $\beta$. We conclude that we cannot simultaneously estimate
both $\alpha$ and $\beta$.

Part of the problem is that we have no natural scale for $\alpha$ and $\beta$.
Observe that rescaling both $\alpha$ and $\beta$ by the same factor affects neither the
mean $\mu_X$ nor the variance $\sigma^2_X$. Consequently, we may set either of $\alpha$ or $\beta$ to a constant,
which is equivalent to pre-defining the scale.
For example, we could set $\alpha=1$ and then estimate $\beta=e^{-\eta}$.

The problem, fundamentally, is that the variates $Y_\alpha$ and $Y_\beta$ are not independent,
since $\alpha Y_\alpha+\beta Y_\beta=0$. Furthermore,
given $\alpha$ and $\beta$, both $Y_\alpha$ and $Y_\beta$ provide *identical* information about 
the response variate $X$. This is what makes the covariance matrix $\Sigma_{Y_\boldsymbol{\theta}}$ singular.
However, we can create an approximate estimation scheme by *assuming* that these variates *are* in fact independent. This amounts to artificially setting the covariance $\sigma_{Y_\alpha,Y_\beta}$ to zero, such that the
approximate covariance matrix is no longer singular. This approach appears to work in practice.
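
The singularity, and the simpler remedy of fixing the scale, can be illustrated as follows. This sketch builds the covariance matrix for arbitrary $(\alpha,\beta)$, confirms that its determinant vanishes up to rounding, and then estimates $\beta$ with $\alpha=1$ held fixed via $\hat{\eta}_\texttt{ML}=\sigma^{-1}(\langle X\rangle)$. The function names and the outcomes are hypothetical, and the sample mean is assumed to lie strictly between 0 and 1. It does not implement the approximate zero-covariance scheme.

```python
import math

def score_covariance(alpha, beta):
    """Covariance matrix of (Y_alpha, Y_beta) for the Beta-Bernoulli distribution."""
    total = alpha + beta
    var_a = beta / (alpha * total**2)
    var_b = alpha / (beta * total**2)
    cov = -1.0 / total**2
    return [[var_a, cov], [cov, var_b]]

def determinant(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def fit_beta_with_unit_alpha(xs):
    """Fix alpha = 1 and estimate beta = exp(-eta_hat), with eta_hat = logit(<X>)."""
    x_bar = sum(xs) / len(xs)
    eta_hat = math.log(x_bar / (1.0 - x_bar))
    return math.exp(-eta_hat)

det = determinant(score_covariance(2.0, 3.0))
print(det)                               # zero up to floating-point rounding

outcomes = [1, 0, 0, 0, 1, 0, 0, 0]      # hypothetical wins (1) and losses (0)
beta_hat = fit_beta_with_unit_alpha(outcomes)
print(beta_hat, 1.0 / (1.0 + beta_hat))  # implied mean should match the win fraction
```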

### Gamma distribution

As was noted [previously](#Beta-distribution "Section: Beta distribution"), in a match between team A and team B, we might observe scores $S_A$ and $S_B$, respectively. These scores are non-negative and usually, but not necessarily,
integer valued. Let $X$ be the variate denoting a team's score.
Then given appropriate assumptions of independence between the teams' scores, $X$ might follow the Gamma distribution:
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}\,e^{-\beta x}
~=~\frac{x^{-1}e^{\alpha\ln x-\beta x}}{\beta^{-\alpha}\,\Gamma(\alpha)}\,.
\end{eqnarray}
This [PDF](#Probability-distribution-functions "Section: Probability distribution functions") 
is in "the" [exponential family](#Seperable-dependencies "Section: Seperable dependencies") with natural parameters
$\boldsymbol{\theta}=(\alpha,\beta)$, natural variates 
$Y_\alpha=\ln X$ and $Y_\beta=-X$, and
partition function $Ƶ(\boldsymbol{\theta})$ that obeys
\begin{eqnarray}
\ln Ƶ(\boldsymbol{\theta}) & = & -\alpha\ln\beta+\ln\Gamma(\alpha)\,.
\end{eqnarray}

It therefore [follows](#Seperable-dependencies "Section: Seperable dependencies") that the means are given by
\begin{eqnarray}
\mu_{Y_\alpha} & ~=~ & \mathbb{E}[\ln X\mid\alpha,\beta]
~=~\frac{\partial}{\partial\alpha}\ln Ƶ
~=~-\ln\beta+\psi(\alpha)\,,
\\
\mu_{Y_\beta} & ~=~ & \mathbb{E}[-X\mid\alpha,\beta]
~=~\frac{\partial}{\partial\beta}\ln Ƶ
~=~-\frac{\alpha}{\beta}\,,
\end{eqnarray}
where $\psi(\cdot)$ is (as before) the *digamma* function. We note that the distributional mean is thus given by
\begin{eqnarray}
\mu_X & ~=~ & \mathbb{E}[X\mid\alpha,\beta]~=~\frac{\alpha}{\beta}\,.
\end{eqnarray}
Similarly, the variances are given by
\begin{eqnarray}
\sigma^2_{Y_\alpha} & ~=~ & \texttt{Var}[\ln X\mid\alpha,\beta]
~=~\frac{\partial^2}{\partial\alpha^2}\ln Ƶ
~=~\psi'(\alpha)\,,
\\
\sigma^2_{Y_\beta} & ~=~ & \texttt{Var}[-X\mid\alpha,\beta]
~=~\frac{\partial^2}{\partial\beta^2}\ln Ƶ
~=~\frac{\alpha}{\beta^2}\,,
\end{eqnarray}
and the covariance is given by
\begin{eqnarray}
\sigma_{Y_\alpha,Y_\beta} & ~=~ & \texttt{Cov}[\ln X,-X\mid\alpha,\beta]
~=~\frac{\partial^2}{\partial\alpha\partial\beta}\ln Ƶ
~=~-\frac{1}{\beta}\,,
\end{eqnarray}
such that the distributional variance is given by
\begin{eqnarray}
\sigma^2_X & ~=~ & \texttt{Var}[X\mid\alpha,\beta]
~=~\frac{\alpha}{\beta^2}\,.
\end{eqnarray}

The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate
$\hat{\boldsymbol{\theta}}_\texttt{ML}$ therefore satisfies the equations
\begin{eqnarray}
\left\langle -X\right\rangle~=~-\frac{\hat{\alpha}_\texttt{ML}}{\hat{\beta}_\texttt{ML}}
& ~~~\Rightarrow~~~ & \hat{\beta}_\texttt{ML}~=~\frac{\hat{\alpha}_\texttt{ML}}{\left\langle X\right\rangle}
\,,
\\
\left\langle\ln X\right\rangle~=~-\ln\hat{\beta}_\texttt{ML}+\psi(\hat{\alpha}_\texttt{ML})
& ~\Rightarrow~ & \ln\left\langle X\right\rangle-\left\langle\ln X\right\rangle
~=~\ln\hat{\alpha}_\texttt{ML}-\psi(\hat{\alpha}_\texttt{ML})\,.
\end{eqnarray}
[Apparently](https://en.wikipedia.org/wiki/Gamma_distribution "Wikipedia: Gamma distribution"), a good initial
estimate of $\hat{\alpha}_\texttt{ML}$ is given by
\begin{eqnarray}
\hat{\alpha}_0 & ~=~ & \frac{3-s+\sqrt{(s-3)^2+24s}}{12s}\,,
\end{eqnarray}
where $s=\ln\left\langle X\right\rangle-\left\langle\ln X\right\rangle$.
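
One way to refine this initial estimate is a scalar Newton iteration on $f(\alpha)=\ln\alpha-\psi(\alpha)-s$, which has derivative $f'(\alpha)=\frac{1}{\alpha}-\psi'(\alpha)$. The sketch below is an assumption about how the solve might be carried out, not a method prescribed above; the `digamma` and `trigamma` helpers are illustrative approximations, and the scores are hypothetical and must be strictly positive and not all equal (so that $s>0$).

```python
import math

def digamma(x):
    """psi(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def trigamma(x):
    """psi'(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + 1.0 / x + 0.5 * inv2 + (1/6 - inv2 * (1/30 - inv2 / 42)) / (x * x * x)

def gamma_mle(x_bar, mean_log_x, tol=1e-12, max_iter=100):
    """Solve ln(alpha) - psi(alpha) = s by Newton's method, then set beta = alpha / <X>."""
    s = math.log(x_bar) - mean_log_x
    alpha = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    for _ in range(max_iter):
        f = math.log(alpha) - digamma(alpha) - s
        f_prime = 1.0 / alpha - trigamma(alpha)
        alpha_new = alpha - f / f_prime
        if alpha_new <= 0.0:
            alpha_new = 0.5 * alpha          # guard against leaving the domain
        converged = abs(alpha_new - alpha) < tol
        alpha = alpha_new
        if converged:
            break
    return alpha, alpha / x_bar

scores = [3.0, 1.5, 4.0, 2.0, 2.5, 1.0]      # hypothetical positive scores
x_bar = sum(scores) / len(scores)
mean_log_x = sum(math.log(x) for x in scores) / len(scores)
alpha_hat, beta_hat = gamma_mle(x_bar, mean_log_x)
print(alpha_hat, beta_hat, alpha_hat / beta_hat)   # last value equals the sample mean
```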

### Negative binomial distribution

The [negative binomial distribution](D_distributions.ipynb#Negative-binomial-distribution
"Appendix D: Negative binomial distribution") may be derived as a continuous mixture of a Poisson distribution with a gamma prior. This distribution takes the form
\begin{eqnarray}
p(X=k\mid\alpha,p) & ~=~ & 
\frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}\,(1-p)^k\,p^\alpha\,,
\end{eqnarray}
where $p$ is the probability of a notional *stopping* event and the sequence of trials terminates immediately after observing
$\alpha$ such stopping events. The variate $X\in\mathbb{Z}^{\ge 0}$ counts the number of non-stopping events in a terminated sequence.
The mean and variance of the distribution are given by
\begin{eqnarray}
\mu_X & ~=~ & \mathbb{E}[X\mid\alpha,p]~=~\frac{\alpha\,(1-p)}{p}\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_X & ~=~ & \mathbb{V}[X\mid\alpha,p]~=~\frac{\alpha\,(1-p)}{p^2}\,,
\end{eqnarray}
respectively.
Note that when $\alpha$ is real-valued we have the *Polya* distribution, whereas when $\alpha$ is
integer-valued we have the *Pascal* distribution.

We may rewrite this distribution in the [general form](#General-form "Section: General form") as
\begin{eqnarray}
p(X=k\mid\alpha,p) & ~=~ & \frac{(k!)^{-1}\,e^{\ln\Gamma(\alpha+k)+k\ln(1-p)}}{\Gamma(\alpha)\,p^{-\alpha}}\,.
\end{eqnarray}
Denoting the parameters by $\boldsymbol{\theta}\doteq(\alpha,p)$, the interaction between the variate $X$ and the parameters
takes the form
\begin{eqnarray}
s(X,\boldsymbol{\theta}) & ~\doteq~ & \ln\Gamma(\alpha+X)+X\ln(1-p)\,.
\end{eqnarray}
Due to the nonlinear interaction of $\Gamma(\alpha+X)$, this is **not**
in "the" [exponential family](#Seperable-dependencies "Section: Seperable dependencies").

Clearly, the variate $X$ has the corresponding natural parameter $\eta\doteq\ln(1-p)$, whereupon
$p=1-e^\eta$.
Given the partition function
\begin{eqnarray}
Ƶ(\boldsymbol{\theta}) & ~=~ & \Gamma(\alpha)\,p^{-\alpha}\,,
\end{eqnarray}
we deduce that
\begin{eqnarray}
\nabla_\eta\ln Ƶ & ~=~ & 
-\frac{\partial p}{\partial\eta}\,\frac{\partial \alpha\ln p}{\partial p}
~=~(1-p)\,\frac{\alpha}{p}\,,
\end{eqnarray}
which agrees with $\mu_X$. Proceeding to take another derivative then gives
\begin{eqnarray}
\nabla^2_\eta\ln Ƶ & ~=~ & 
\frac{\partial p}{\partial\eta}\,\frac{\partial}{\partial p}\left\{\frac{\alpha}{p}-\alpha\right\}
~=~(1-p)\,\frac{\alpha}{p^2}\,,
\end{eqnarray}
which agrees with $\sigma^2_X$.
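
These two results can be cross-checked by summing the probability mass function directly. The sketch below evaluates the log-probabilities with `math.lgamma` and truncates the infinite sum at a hypothetical cut-off `k_max`, which is assumed to be large enough that the neglected tail is negligible for the chosen parameters.

```python
import math

def neg_binomial_log_pmf(k, alpha, p):
    """ln p(X = k | alpha, p) for the negative binomial distribution."""
    return (math.lgamma(alpha + k) - math.lgamma(k + 1.0) - math.lgamma(alpha)
            + k * math.log1p(-p) + alpha * math.log(p))

def neg_binomial_moments(alpha, p, k_max=2000):
    """Total probability, mean and variance from a truncated sum over k = 0..k_max."""
    probs = [math.exp(neg_binomial_log_pmf(k, alpha, p)) for k in range(k_max + 1)]
    total = sum(probs)
    mean = sum(k * w for k, w in enumerate(probs))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(probs))
    return total, mean, var

alpha, p = 2.5, 0.4                       # hypothetical parameter values
total, mean, var = neg_binomial_moments(alpha, p)
print(total)                              # should be close to 1
print(mean, alpha * (1.0 - p) / p)        # compare with the closed-form mean
print(var, alpha * (1.0 - p) / p**2)      # compare with the closed-form variance
```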

For convenience, let us now consider $\alpha$ as another 'natural' parameter, despite its nonlinear co-dependence with $X$.
We then define its corresponding variate to be
\begin{eqnarray}
Y_\alpha & ~\doteq~ & \nabla_\alpha\,s(X,\boldsymbol{\theta})
~=~\psi(\alpha+X)\,,
\end{eqnarray}
where $\psi(\cdot)$ is the *digamma* function. From the partition function we then derive that
\begin{eqnarray}
\mu_{Y_\alpha} & ~\doteq~ & \mathbb{E}[Y_\alpha\mid\boldsymbol{\theta}]
~=~\nabla_\alpha\ln Ƶ(\boldsymbol{\theta})
\\& = &
\frac{\partial}{\partial\alpha}\left\{\ln\Gamma(\alpha)-\alpha\ln p\right\}
~=~\psi(\alpha)-\ln p
\,.
\end{eqnarray}
It then [follows](#General-form "Section: General form") that  the variance of $Y_\alpha$ is given by
\begin{eqnarray}
\sigma^2_{Y_\alpha} & ~\doteq~ & \mathbb{V}[Y_\alpha\mid\boldsymbol{\theta}]
~=~\nabla^2_\alpha\ln Ƶ(\boldsymbol{\theta})-\mathbb{E}[\nabla^2_\alpha\,s(X,\boldsymbol{\theta})\mid\boldsymbol{\theta}]
\\& = &
\psi'(\alpha)-\mathbb{E}[\psi'(\alpha+X)\mid\boldsymbol{\theta}]\,,
\end{eqnarray}
and the covariance between $Y_\alpha$ and $Y_\eta\doteq\nabla_\eta\,s=X$ is given by
\begin{eqnarray}
\sigma_{Y_\alpha,X} & ~\doteq~ & \mathtt{Cov}[Y_\alpha,X\mid\boldsymbol{\theta}]
~=~\nabla_\alpha\nabla_\eta\ln Ƶ
~=~\frac{1-p}{p}\,.
\end{eqnarray}

The theoretical expectation 
$\mathbb{E}[\psi'(\alpha+X)\mid\boldsymbol{\theta}]$ above is likely to be intractable. However, if we consider a sequence of $n$
[*unconditionally independent*](D_distributions.ipynb#Negative-binomial-distribution
"Appendix D: Negative binomial distribution") observations, then the
Hessian of the
[*average* data log-likelihood](D_distributions.ipynb#Maximum-likelihood-estimation
"Appendix D: Maximum likelihood estimation") 
contains the analogous term
$\frac{1}{n}\sum_{i=1}^{n}\psi'(\alpha+X_i)$.
Consequently, we replace the exact variance $\sigma^2_{Y_\alpha}$
by its empirical estimate
\begin{eqnarray}
\hat{\sigma}^2_{Y_\alpha} & ~\doteq~ &
\psi'(\alpha)-\langle\psi'(\alpha+X)\rangle\,.
\end{eqnarray}

Taking the reparameterisation
$\boldsymbol{\theta}'\doteq(\eta,\alpha)$,
the [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate 
$\hat{\boldsymbol{\theta}}_\texttt{ML}'$ is then (hopefully) reached
via the empirical iterative update
\begin{eqnarray}
\left[\begin{array}{}\Delta\eta\\\Delta\alpha\end{array}\right]
& = &
\left[\begin{array}{}
\sigma_X^2 & \sigma_{X,Y_\alpha}\\
\sigma_{X,Y_\alpha} & \hat{\sigma}_{Y_\alpha}^2
\end{array}\right]^{-1}\,
\left[\begin{array}{}
\left\langle X\right\rangle - \mu_X\\
\left\langle Y_\alpha\right\rangle - \mu_{Y_\alpha}
\end{array}\right]
\,.
\end{eqnarray}
In particular, letting $q\doteq 1-p$ and $\bar{X}\doteq\langle X\rangle$,
the update for $\hat{\alpha}_\texttt{ML}$ takes the form
\begin{eqnarray}
\Delta\alpha & ~=~ &
\frac{\sigma_X^2\,(\langle Y_\alpha\rangle-\mu_{Y_\alpha})
-\sigma_{X,Y_\alpha}\,(\langle X\rangle-\mu_X)}
{\sigma_X^2\,\hat{\sigma}_{Y_\alpha}^2-\sigma_{X,Y_\alpha}^2}
\\& = &
\frac{\frac{\alpha q}{p^2}\,(\langle\psi(\alpha+X)\rangle-\psi(\alpha)+\ln p)-\frac{q}{p}\,(\bar{X}-\frac{\alpha q}{p})}
{\frac{\alpha q}{p^2}\,(\psi'(\alpha)-\langle\psi'(\alpha+X)\rangle)-
\frac{q^2}{p^2}}
\\& = &
\frac{\langle\psi(\alpha+X)\rangle-\psi(\alpha)+\ln p
+\left(q-\frac{p\bar{X}}{\alpha}\right)}
{\psi'(\alpha)-\langle\psi'(\alpha+X)\rangle-\frac{q}{\alpha}}
\,.
\end{eqnarray}

It turns out that the last term in the numerator vanishes under the
constraint that
\begin{eqnarray}
p~\doteq~\frac{\alpha}{\alpha+\bar{X}}
& ~~~\Rightarrow~~~ &
\frac{\partial}{\partial\alpha}\ln p~=~\frac{q}{\alpha}\,,
\end{eqnarray}
whereupon the denominator is recognised as the negative gradient of the
numerator with respect to $\alpha$. Hence, this Newton-Raphson update is 
[**identical**](D_distributions.ipynb#Maximum-likelihood-estimation
"Appendix D: Maximum likelihood estimation")
 to an update via Newton's method of the maximum likelihood
estimate $\hat{\alpha}_\texttt{ML}$ subject to the constraint that
\begin{eqnarray}
\langle X\rangle-\mu_X(\hat{\boldsymbol{\theta}}_\texttt{ML})~=~0 & ~~~\Rightarrow~~~ &
\hat{p}_\texttt{ML}~=~ 
\frac{\hat{\alpha}_\texttt{ML}}
{\hat{\alpha}_\texttt{ML}+\langle X\rangle}\,.
\end{eqnarray}
Note that, subject to the usual constraints on convergence,
the empirical approximation of the variance works here because
\begin{eqnarray}
\langle Y_\alpha\rangle\left(\hat{\boldsymbol{\theta}}_\texttt{ML}\right)
~=~\mathbb{E}[Y_\alpha\mid\hat{\boldsymbol{\theta}}_\texttt{ML}]
& ~~~\Rightarrow~~~ &
\sigma^2_{Y_\alpha}\left(\hat{\boldsymbol{\theta}}_\texttt{ML}\right)~=~\hat{\sigma}^2_{Y_\alpha}\left(\hat{\boldsymbol{\theta}}_\texttt{ML}\right)\,,
\end{eqnarray}
from the [definition](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
of the maximum likelihood.
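
The constrained update for $\hat{\alpha}_\texttt{ML}$ can be sketched numerically as follows. This is only a minimal illustration: the helper functions `digamma` and `trigamma` are hypothetical stand-ins (simple recurrence-plus-asymptotic-series approximations of $\psi$ and $\psi'$), the count data are invented, and the step-halving safeguard that keeps $\alpha$ positive is an addition that is not part of the derivation.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def trigamma(x):
    """Approximate psi'(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)  # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * inv * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + inv + 0.5 * inv2 + series

def profile_score(alpha, xs):
    """Numerator <psi(alpha + X)> - psi(alpha) + ln p, with p = alpha / (alpha + mean)."""
    x_bar = sum(xs) / len(xs)
    p = alpha / (alpha + x_bar)
    mean_psi = sum(digamma(alpha + x) for x in xs) / len(xs)
    return mean_psi - digamma(alpha) + math.log(p)

def newton_step(alpha, xs):
    """Update Delta alpha = numerator / denominator under the constraint on p."""
    x_bar = sum(xs) / len(xs)
    q = x_bar / (alpha + x_bar)
    mean_tri = sum(trigamma(alpha + x) for x in xs) / len(xs)
    denominator = trigamma(alpha) - mean_tri - q / alpha
    return profile_score(alpha, xs) / denominator

def fit_alpha(xs, alpha=1.0, tol=1e-12, max_iter=100):
    """Iterate the update until the step size is negligible."""
    for _ in range(max_iter):
        step = newton_step(alpha, xs)
        # Safeguard (an addition, not part of the derivation): keep alpha positive.
        while alpha + step <= 0.0:
            step *= 0.5
        alpha += step
        if abs(step) < tol:
            break
    return alpha

counts = [0, 1, 1, 2, 3, 5, 8, 13, 0, 2]  # invented, over-dispersed counts
alpha_hat = fit_alpha(counts)
p_hat = alpha_hat / (alpha_hat + sum(counts) / len(counts))
print(alpha_hat, p_hat)
```

Note that the sketch presumes over-dispersed data (sample variance exceeding the sample mean) and a starting value reasonably close to the solution; otherwise the iteration is not guaranteed to converge.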

## Regression modelling

For the case of regression modelling, we now suppose that the *response* (or *dependent*) variate $X$ is *explained* by one or more
exogenous (or *independent*) covariates, collectively denoted by $Z$, via some regression function $\mathbf{f}(Z,\boldsymbol{\phi})$
with regression parameter $\boldsymbol{\phi}$. 
In particular, the *unconditional* 
[distribution](#Probability-distribution-functions "Section: Probability distribution functions") 
$p(X\mid\boldsymbol{\theta})$ is now replaced by a
*conditional* distribution $p(X\mid\boldsymbol{\theta},Z,\boldsymbol{\phi})$.

The usual [mean regression](#Mean-regression "Section: Mean regression") 
model is to fit $\mathbf{f}$ to the mean $\boldsymbol{\mu}=\mathbb{E}[X\mid\boldsymbol{\theta}]$
of the unconditional distribution $p(X\mid\boldsymbol{\theta})$.
In essence, this assumes that the conditional distribution
$p(X\mid\boldsymbol{\theta},Z,\boldsymbol{\phi})$ takes the same form as the unconditional distribution.
We may represent this symbolically by the graphical model
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
The traditional approach is then to estimate the regression parameter $\boldsymbol{\phi}$ from the observed data using
some form of [least-squares](#Least-squares-regression "Section: Least-squares regression") approach that minimises the square
error of fit.

One of the issues with such an approach is that the range of the regression function $\mathbf{f}$ is usually unconstrained, especially for 
[linear models](#Generalised-linear-models "Section: Generalised linear models") 
like $\mathbf{f}(\mathbf{z},\boldsymbol{\Phi})=\boldsymbol{\Phi}^{T}\mathbf{z}\in\mathbb{R}^d$.
However, the permissible values of the mean $\boldsymbol{\mu}$ are usually restricted by the PDF, for example to the unit interval if $X$ represents a proportion.
The key innovation of generalised linear modelling (GLM) introduced by
 Nelder and Wedderburn [[1]](#Citations "Citation [1]: Generalized Linear Models") was to regress, not on the mean
 $\boldsymbol{\mu}$ itself, but instead on another parameter $\boldsymbol{\eta}$ related by a *link* function $\mathbf{g}$ to the mean
via $\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$. We shall henceforth refer to $\boldsymbol{\eta}$ as the *link parameter*,
although this is not standard terminology. This new relationship corresponds to the graphical model
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xrightarrow{\mathbf{g}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
Now, for $\mathbf{f}$ to explain $\boldsymbol{\mu}$, we require that the link function $\mathbf{g}$ be invertible,
such that the inverse relationship may be represented by the graphical model
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
In terms of mean regression, we could also collapse this model to become
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}\circ\,\mathbf{f}_Z}\boldsymbol{\phi}\,,
\end{eqnarray}
which corresponds to nonlinear regression, even with a linear regression function $\mathbf{f}$.
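
As a small illustration of why the link function helps, consider a scalar mean restricted to the unit interval. The sketch below assumes the logit link (one common choice, not the only one) together with a linear regression function; the parameter values and covariates are invented for the example.

```python
import math

def logit(mu):
    """Link g: maps a mean in (0, 1) onto the whole real line."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Inverse link g^{-1}: maps any real eta back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(z, phi):
    """Linear regression function f(z, phi) = phi^T z."""
    return sum(p * zi for p, zi in zip(phi, z))

phi = [0.5, -2.0]  # hypothetical regression parameter
covariates = [[1.0, -3.0], [1.0, 0.0], [1.0, 4.0]]  # leading 1 is a constant term

etas = [linear_predictor(z, phi) for z in covariates]
means = [inverse_logit(eta) for eta in etas]
print(etas)   # [6.5, 0.5, -7.5]: unconstrained link parameters
print(means)  # every mean lies strictly inside (0, 1)
```

Although the regression function is linear in $\boldsymbol{\phi}$, the collapsed map $\mathbf{g}^{-1}\circ\mathbf{f}$ from covariates to means is nonlinear.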

We now turn to the issue of how to explain the PDF parameter $\boldsymbol{\theta}$ in terms of the mean $\boldsymbol{\mu}$, regardless of whether the mean is directly or indirectly obtained from 
$\mathbf{f}(Z,\boldsymbol{\phi})$.
Now, for some simple PDFs, such as the 
[Bernoulli distribution](#Bernoulli-distribution "Section: Bernoulli distribution"),
the relationship between the distribution parameter $\boldsymbol{\theta}$ and the mean $\boldsymbol{\mu}$ is invertible,
and we may therefore deduce $\boldsymbol{\theta}$ from
knowledge of $\boldsymbol{\mu}$. In general, however, knowing $\boldsymbol{\mu}$ might only give us partial
knowledge of $\boldsymbol{\theta}$, as is the case, for example, with the 
[Beta distribution](#Beta-distribution "Section: Beta distribution").
We therefore propose that the parameter $\boldsymbol{\theta}$ may be partitioned into two parts,
namely $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$, 
such that implicitly $\boldsymbol{\mu}=\boldsymbol{\mu}(\boldsymbol{\psi},\boldsymbol{\varphi})$.
Conversely, however, we now suppose that the *independent* parameter $\boldsymbol{\psi}$ is not obtainable from
$\boldsymbol{\mu}$ and so must be estimated separately, but that the *dependent* parameter 
$\boldsymbol{\varphi}$ is obtainable from $\boldsymbol{\mu}$, given $\boldsymbol{\psi}$, via some implicit
function $\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi}, \boldsymbol{\mu})$,
such that now $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}))$.
This inverted relationship may be represented by the graphical model
\begin{eqnarray}
\boldsymbol{\varphi} & \xleftarrow{\boldsymbol{\iota}^{-1}_\boldsymbol{\psi}} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,,
\end{eqnarray}
where $\boldsymbol{\iota}_\boldsymbol{\psi}(\cdot)$ denotes some implicit, quasi-invertible function
such that
$\boldsymbol{\mu}=\boldsymbol{\iota}_\boldsymbol{\psi}(\boldsymbol{\varphi})=\boldsymbol{\mu}(\boldsymbol{\psi},\boldsymbol{\varphi})$
and 
$\boldsymbol{\varphi}=\boldsymbol{\iota}^{-1}_\boldsymbol{\psi}(\boldsymbol{\mu})=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.
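
To make the partition concrete, the following sketch uses a Beta distribution with shape parameters $(\alpha,\beta)$ and mean $\alpha/(\alpha+\beta)$. The assignment $\psi=\alpha$ (independent) and $\varphi=\beta$ (dependent) is an assumption made purely for illustration; the opposite assignment would serve equally well, and the function names are hypothetical.

```python
def beta_mean(psi, varphi):
    """iota_psi(varphi): mean of Beta(alpha=psi, beta=varphi)."""
    return psi / (psi + varphi)

def beta_dependent(psi, mu):
    """iota_psi^{-1}(mu): recover varphi from the mean, given psi."""
    return psi * (1.0 - mu) / mu

mu = 0.25
for psi in (2.0, 4.0):
    varphi = beta_dependent(psi, mu)
    # The same mean is reproduced by different (psi, varphi) pairs,
    # so mu alone gives only partial knowledge of theta.
    print(psi, varphi, beta_mean(psi, varphi))
```

Both parameter pairs reproduce the same mean, which is why $\boldsymbol{\psi}$ must be estimated separately.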

In summary, the generalised modelling approach may be boiled down to three essential requirements:

1. The mean $\boldsymbol{\mu}$ of the variate $X$ is a function of the
[PDF](#Probability-distribution-functions "Section: Probability distribution functions")
parameter $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$ via the expectation
$\boldsymbol{\mu}=\mathbb{E}[X\mid\boldsymbol{\theta}]$, with partial inversion given implicitly via
$\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.
2. There exists an explicitly invertible *link* function $\mathbf{g}$ that maps $\boldsymbol{\mu}$ into a more convenient *link* parameter
$\boldsymbol{\eta}$ via $\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$.
3. Each value of the *response* variate $X$ is sampled from a distribution for which the corresponding link parameter $\boldsymbol{\eta}$ is determined by a parameterised regression function 
$\mathbf{f}(Z,\boldsymbol{\phi})$ of exogenous covariates $Z$. 

We now provide brief explanations of [mean regression](#Mean-regression "Section: Mean regression") 
 and
[least-squares regression](#Least-squares-regression "Section: Least-squares regression"),
and then go on to expand upon the above points to derive the 
[general regression](#Generalised-nonlinear-models "Section: Generalised nonlinear models")
model, and then discuss its specialisation to a 
[linear regression](#Generalised-linear-models "Section: Generalised linear models")
function.

### Mean regression

Consider [again](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") 
the stochastic sampling process that produces an arbitrary length-$n$ sequence of independent variables,
$X_1, X_2, \ldots, X_n$. However, now we drop the requirement that these variables are identically distributed.
Instead, we suppose that (the value of) each $X_i$ is drawn from the same 
[PDF](#Probability-distribution-functions "Section: Probability distribution functions")
but with  potentially different parameters $\boldsymbol{\theta}_i$, with individual means
$\boldsymbol{\mu}_{X_i}\doteq\mathbb{E}\left[X_i\mid\boldsymbol{\theta}_i\right]$
and variances $\boldsymbol{\Sigma}_{X_i}\doteq\texttt{Var}\left[X_i\mid\boldsymbol{\theta}_i\right]$.


Now, if we were allowed to sample the value of variate $X_i$ multiple times, then we would expect the values
to be displaced about the mean $\boldsymbol{\mu}_i\doteq\boldsymbol{\mu}_{X_i}$ according to
\begin{eqnarray}
X_i ~=~ \boldsymbol{\mu}_i+\mathbf{e}_i
& ~\Rightarrow~ & \mathbf{e}_i~=~X_i-\boldsymbol{\mu}_i
\,.
\end{eqnarray}
In this context, the displacement $\mathbf{e}_i$ is called the *noise* or the measurement *error*, 
and arises due to imprecise, stochastic measurements of the unknown mean $\boldsymbol{\mu}_i$.
This error has the distributional properties
\begin{eqnarray}
\mathbb{E}\left[\mathbf{e}_i\mid\boldsymbol{\theta}_i\right] & = & \mathbf{0}\,,
\\
\mathbb{E}\left[\mathbf{e}_i\,\mathbf{e}_i^T\mid\boldsymbol{\theta}_i\right]
& = & \texttt{Var}\left[X_i\mid\boldsymbol{\theta}_i\right]~=~\boldsymbol{\Sigma}_{X_i}(\boldsymbol{\theta}_i)\,.
\end{eqnarray}
We must keep in mind that this variance is not necessarily constant, but is generally a function of the
distributional parameters $\boldsymbol{\theta}_i$, especially the
[mean](#Seperable-dependencies "Section: Seperable dependencies").

Next, we suppose that associated with each *response* (or *dependent*) variate $X_i$ is a corresponding *exogenous*
(or *independent*) covariate $Z_i$. We further suppose that $Z_i$ explains the mean $\boldsymbol{\mu}_i$ of $X_i$ via a parameterised regression function of the form
\begin{eqnarray}
\boldsymbol{\mu}_i & = & \mathbf{f}(Z_i,\boldsymbol{\phi})+\mathbf{r}_i\,,
\end{eqnarray}
where $\mathbf{r}_i$ is called the *residual* or *error* of fit. Note that, in contrast to the measurement error, the residual arises due to error in approximating the true mean $\boldsymbol{\mu}_i$ with an estimating function
$\hat{\boldsymbol{\mu}}_i=\mathbf{f}(Z_i,\boldsymbol{\phi})$. However, since the mean $\boldsymbol{\mu}_i$ is actually unknown, we may (conceptually) encode our uncertainty about its true value into another PDF, such that the mean of this PDF obeys
\begin{eqnarray}
\mathbb{E}[\boldsymbol{\mu}_i\mid Z_i,\boldsymbol{\phi}] ~=~
\hat{\boldsymbol{\mu}}_i~=~\mathbf{f}(Z_i,\boldsymbol{\phi})
& ~\Rightarrow~ & 
\mathbb{E}[\mathbf{r}_i\mid Z_i,\boldsymbol{\phi}] ~=~ \mathbf{0}\,.
\end{eqnarray}
Similarly, this PDF has a variance that measures our uncertainty, such that
\begin{eqnarray}
\texttt{Var}[\boldsymbol{\mu}_i\mid Z_i,\boldsymbol{\phi}] & ~=~ &
\mathbb{E}[\mathbf{r}_i^{}\,\mathbf{r}_i^T\mid Z_i,\boldsymbol{\phi}]
~=~\boldsymbol{\Sigma}_{\mathbf{r}_i}(Z_i,\boldsymbol{\phi})
\,.
\end{eqnarray}
Note that this variance is generally a function of both the covariates $Z_i$ and the regression parameter
$\boldsymbol{\phi}$.

Combining the two distributions, we now obtain the regression model
\begin{eqnarray}
X_i & = & \mathbf{f}(Z_i,\boldsymbol{\phi})+\boldsymbol{\varepsilon}_i\,,
\end{eqnarray}
where $\boldsymbol{\varepsilon}_i=\mathbf{e}_i+\mathbf{r}_i$ combines both measurement error and fitting error, and hence may be considered as either an error or a residual. Consequently, we deduce that
\begin{eqnarray}
\mathbb{E}[\boldsymbol{\varepsilon}_i\mid\boldsymbol{\theta}_i, Z_i, \boldsymbol{\phi}]
& = & \mathbf{0}\,,
\\
\mathbb{E}[\boldsymbol{\varepsilon}_i\,\boldsymbol{\varepsilon}_i^T\mid\boldsymbol{\theta}_i, Z_i, \boldsymbol{\phi}] & = & 
\boldsymbol{\Sigma}_{X_i}(\boldsymbol{\theta}_i)
+\boldsymbol{\Sigma}_{\mathbf{r}_i}(Z_i,\boldsymbol{\phi})
~\doteq~\boldsymbol{\Sigma}_i(\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi})\,,
\end{eqnarray}
on the assumption that the residual $\mathbf{r}_i$ is independent of the noise $\mathbf{e}_i$.
In practice, we do not know the value of $\boldsymbol{\Sigma}_i$, and consequently must either estimate it from
the empirical distribution of $\boldsymbol{\varepsilon}_i$, or else approximate it, for example by
$\boldsymbol{\Sigma}_{X_i}$, which corresponds to assuming $\boldsymbol{\Sigma}_{\mathbf{r}_i}=\mathbf{O}$,
i.e. being extremely certain of the regression function $\mathbf{f}$.

### Least-squares regression

Recall from the [previous](#Mean-regression "Section: Mean regression") section
that we are considering the regression model
\begin{eqnarray}
X_i~=~\boldsymbol{\mu}_i+\mathbf{e}_i\,, & \;\; \boldsymbol{\mu}_i~=~\mathbf{f}(Z_i,\boldsymbol{\phi})+\mathbf{r}_i
& ~\Rightarrow~
X_i~=~\mathbf{f}(Z_i,\boldsymbol{\phi})+\boldsymbol{\varepsilon}_i\,.
\end{eqnarray}
In order to fit the regression function $\mathbf{f}$ to observed data 
$\mathbf{X}\doteq(X_i)_{i=1}^{n}$ and $\mathbf{Z}\doteq(Z_i)_{i=1}^{n}$,
we first redefine the
sample mean [operator](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") $\langle\cdot\rangle$ to include the covariate $Z$. Consequently, 
the sample average of an arbitrary function $\mathbf{g}(X,Z,\ldots)$ is now given by
\begin{eqnarray}
\langle\mathbf{g}\rangle(\ldots) & \doteq &
\frac{1}{n}\sum_{i=1}^{n}\mathbf{g}(X_i,Z_i,\ldots)\,,
\end{eqnarray}
where the ellipsis "$\ldots$" represents arbitrary parameters that do not vary with $X$ or $Z$.
Note that all parameters and variables that depend upon $X_i$ and/or $Z_i$ must also be indexed in this summation,
e.g. the residual variance $\boldsymbol{\Sigma}_i$. If it becomes necessary to explicitly distinguish between constants and functions of $X_i$ and $Z_i$, then we may retain the subscript, e.g. $\langle\boldsymbol{\Sigma}_i\boldsymbol{\phi}\rangle$. Also note that the sample mean may still be treated as a function of $\mathbf{X}$ and $\mathbf{Z}$, when they are considered as variables rather than known, sampled values.
This is useful for computing expectations of sample means, for example.

The fitting process requires estimating the best value of the function parameter $\boldsymbol{\phi}$ that
minimises the overall error of fit. The *ordinary* least-squares (OLS) method is to minimise the mean of the squared lengths of the residuals, namely
$S(\boldsymbol{\phi})=\left\langle\boldsymbol{\varepsilon}^{T}\boldsymbol{\varepsilon}\right\rangle$.
However, use of OLS makes some implicit assumptions, in particular that:
1. The residuals are independent of each other.
2. All residuals are equally important.
3. The elements of each residual are independent of each other.
4. The elements of each residual are equally important.

Only the first assumption of residual independence really holds. In practice, if the residuals are independent, then the plot of the fitted residuals against the covariate $Z$ should appear randomly distributed. However, if a pattern appears then the particular choice of the regression function $\mathbf{f}$ must be reconsidered. The assumption that the elements of each residual $\boldsymbol{\varepsilon}_i$ are independent does not hold in general, since
\begin{eqnarray}
\mathbb{E}[\boldsymbol{\varepsilon}_i^{T}\,\boldsymbol{\varepsilon}_i
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]
& ~=~ &
\mathbb{E}[
\texttt{trace}\left(\boldsymbol{\varepsilon}_i\,\boldsymbol{\varepsilon}_i^T\right)
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]
~=~\texttt{trace}\left(\boldsymbol{\Sigma}_i\right)\,.
\end{eqnarray}
Consequently, the assumptions of equal importance do not hold either, since a residual (or an element of a residual)
with higher variance has higher uncertainty associated with its fit, and hence should be assigned less weight.
It turns out that weighting in inverse proportion to the variance is a good idea.
Note that use of OLS corresponds to assuming constant and equal variances of the form
$\boldsymbol{\Sigma}_i=\mathbf{I}\,\sigma^2$.

We can solve both problems of element-wise non-independence and unequal weighting of residuals and elements by applying a so-called
*whitening transformation* that decouples within-residual correlations and standardises the variances. This transformation takes the form
\begin{eqnarray}
\tilde{\boldsymbol{\varepsilon}}_i & ~\doteq~ & 
\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\boldsymbol{\varepsilon}_i
~=~\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\left[X_i-\mathbf{f}(Z_i,\boldsymbol{\phi})\right]\,,
\end{eqnarray}
such that 
\begin{eqnarray}
\mathbb{E}[\tilde{\boldsymbol{\varepsilon}}_i^{T}\,\tilde{\boldsymbol{\varepsilon}}_i
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]
& ~=~ &
\texttt{trace}\left(\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\,\mathbb{E}[
\boldsymbol{\varepsilon}_i\,\boldsymbol{\varepsilon}_i^T
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]\,
\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\right)
~=~\texttt{trace}\left(\mathbf{I}\right)\,.
\end{eqnarray}
Consequently, the *weighted* least-squares (WLS) method is to minimise
\begin{eqnarray}
S(\boldsymbol{\phi}) & ~=~ &
\left\langle\tilde{\boldsymbol{\varepsilon}}^{T}\,\tilde{\boldsymbol{\varepsilon}}\right\rangle
~=~\left\langle
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]^{T}\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]
\right\rangle\,.
\end{eqnarray}
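
The whitening transformation and the resulting WLS objective can be checked numerically. The sketch below assumes NumPy is available, computes $\boldsymbol{\Sigma}^{-\frac{1}{2}}$ via an eigendecomposition (one of several valid choices of square root), and uses an invented covariance matrix and residual; the function names are hypothetical.

```python
import numpy as np

def inverse_sqrt(sigma):
    """Symmetric inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def wls_objective(xs, fs, sigmas):
    """Sample mean of [x - f]^T Sigma^{-1} [x - f] over all observations."""
    terms = []
    for x, f, sigma in zip(xs, fs, sigmas):
        resid = x - f
        terms.append(resid @ np.linalg.solve(sigma, resid))
    return float(np.mean(terms))

sigma = np.array([[4.0, 1.2], [1.2, 1.0]])  # invented residual variance
w = inverse_sqrt(sigma)
whitened_cov = w @ sigma @ w.T              # should be the identity matrix

x = np.array([3.0, 1.0])                    # invented observation
f = np.array([1.0, 0.5])                    # invented fitted value
whitened = w @ (x - f)
objective = wls_objective([x], [f], [sigma])
print(whitened_cov)
print(whitened @ whitened, objective)       # the two values agree
```

The squared length of the whitened residual equals the quadratic form in $\boldsymbol{\Sigma}^{-1}$, so the explicit square root is never needed when evaluating $S(\boldsymbol{\phi})$ itself.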

However, we [recall](#Mean-regression "Section: Mean regression") that 
$\boldsymbol{\Sigma}_i=\boldsymbol{\Sigma}_i(\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi})$ is a function
of $\boldsymbol{\phi}$. Consequently, WLS is typically a nonlinear problem even when $\mathbf{f}$ is a linear function of $\boldsymbol{\phi}$. To overcome this difficulty, we 
use an iterative approximation where we
evaluate $\boldsymbol{\Sigma}_i$ at the
previous estimate $\boldsymbol{\phi}$, but evaluate $\mathbf{f}$ at the new estimate
$\boldsymbol{\phi}'$, resulting in
\begin{eqnarray}
S(\boldsymbol{\phi}, \boldsymbol{\phi}') & ~=~ &
\left\langle
\left[X-\mathbf{f}(Z,\boldsymbol{\phi}')\right]^{T}\,
\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta},Z,\boldsymbol{\phi})\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi}')\right]
\right\rangle\,.
\end{eqnarray}
Next, we substitute the Taylor series approximation
\begin{eqnarray}
\mathbf{f}(Z,\boldsymbol{\phi}') & ~\approx~ &
\mathbf{f}(Z,\boldsymbol{\phi})
+\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\,,
\end{eqnarray}
where
\begin{eqnarray}
\boldsymbol{\phi}' & ~=~ \boldsymbol{\phi}+\Delta\boldsymbol{\phi}\,,
\end{eqnarray}
to obtain the new approximation
\begin{eqnarray}
S(\boldsymbol{\phi}, \Delta\boldsymbol{\phi}) & ~=~ &
\left\langle
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})
-\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\right]^{T}\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})
-\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\right]
\right\rangle\,.
\end{eqnarray}
Finally, we take the gradient with respect to $\Delta\boldsymbol{\phi}$ to obtain
\begin{eqnarray}
\boldsymbol{\nabla}_{\Delta\boldsymbol{\phi}}S & ~=~ &
-2\,\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})
-\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\right]
\right\rangle\,,
\end{eqnarray}
which vanishes when
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})
\right\rangle\,\Delta\boldsymbol{\phi}
& ~=~ &
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]
\right\rangle
\,.
\end{eqnarray}
This is the nonlinear form of the *iteratively reweighted* least-squares (IRLS) method, due to the fact that the
variance $\boldsymbol{\Sigma}_i(\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi})$ needs to be
re-evaluated after every update of the parameter estimate $\boldsymbol{\phi}$.
Note that the iterations will cease when $\Delta\boldsymbol{\phi}=\mathbf{0}$, at which point the 
solution satisfies
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]
\right\rangle & ~=~ & \mathbf{0}\,.
\end{eqnarray}
This latter is just the solution to $\boldsymbol{\nabla}_\boldsymbol{\phi}S=\mathbf{0}$ from the original WLS formulation, on the assumption that $\boldsymbol{\Sigma}$ is held constant for the update and recomputed after the update.
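
The nonlinear IRLS update can be sketched for a scalar response as follows. The regression function $f(z,\boldsymbol{\phi})=\exp(\phi_0+\phi_1 z)$, the variance model $\Sigma_i=f(Z_i,\boldsymbol{\phi})$ and the data are all assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def f(z, phi):
    """Hypothetical regression function f(z, phi) = exp(phi0 + phi1 * z)."""
    return np.exp(phi[0] + phi[1] * z)

def grad_f(z, phi):
    """Gradient of f with respect to phi, one row per observation."""
    fz = f(z, phi)
    return np.column_stack([fz, fz * z])

def variance(z, phi):
    """Assumed variance model: Sigma_i equals the fitted mean."""
    return f(z, phi)

def irls(x, z, phi, tol=1e-10, max_iter=100):
    """Iterate the nonlinear IRLS update until the step is negligible."""
    for _ in range(max_iter):
        g = grad_f(z, phi)
        w = 1.0 / variance(z, phi)  # Sigma^{-1}, evaluated at the previous estimate
        resid = x - f(z, phi)
        lhs = (g * w[:, None]).T @ g / len(x)      # <grad f Sigma^{-1} grad^T f>
        rhs = (g * w[:, None]).T @ resid / len(x)  # <grad f Sigma^{-1} [X - f]>
        delta = np.linalg.solve(lhs, rhs)
        phi = phi + delta
        if np.max(np.abs(delta)) < tol:
            break
    return phi

z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # invented covariates
x = np.array([2.0, 2.0, 3.0, 4.0, 6.0, 7.0])  # invented responses
phi_start = np.array([np.log(x.mean()), 0.0])
phi_hat = irls(x, z, phi_start)

# Stationarity condition: <grad f Sigma^{-1} [X - f]> should vanish.
score = grad_f(z, phi_hat).T @ ((x - f(z, phi_hat)) / variance(z, phi_hat)) / len(x)
print(phi_hat, score)
```

With this particular choice of function and variance the weights cancel part of the gradient, so the stationarity condition reduces to $\langle(X-f)\,[1, Z]^T\rangle=\mathbf{0}$.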

### Generalised nonlinear models

[Previously](#Mean-regression "Section: Mean regression"), we considered the
regression of the mean $\boldsymbol{\mu}$ via some parameterised function
$\mathbf{f}(Z,\boldsymbol{\phi})$ of the covariate $Z$.
In my opinion, the key contribution of Nelder and Wedderburn
[[1]](#Citations "Citation [1]: Generalized Linear Models")
to generalised linear modelling (GLM) lies in the fact that we may instead apply regression, not to $\boldsymbol{\mu}$,
but to some more natural parameterisation $\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$,
where $\mathbf{g}(\cdot)$ is known as the *link* function.
We may therefore depict the resulting relationships by the graphical model
\begin{eqnarray}
\boldsymbol{\varphi} & \xrightarrow{\boldsymbol{\iota}_\boldsymbol{\psi}} & \boldsymbol{\mu}
\xrightarrow{\mathbf{g}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,,
\end{eqnarray}
which may further be interpreted as meaning that the parameters $\boldsymbol{\varphi}$ and $\boldsymbol{\phi}$ are conditionally independent given $\boldsymbol{\eta}$ (and $Z$ and $\boldsymbol{\psi}$).
It therefore [follows](#Parameter-transformations "Section: Parameter transformations") that
the gradient of the log-likelihood $L$ with respect to the regression parameter $\boldsymbol{\phi}$ takes the form
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}L & = &
\boldsymbol{\nabla}_\boldsymbol{\phi}\boldsymbol{\eta}^T\,\boldsymbol{\nabla}_\boldsymbol{\eta}L
~=~\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\nabla}_\boldsymbol{\eta}L(\boldsymbol{\theta}; X)\,.
\end{eqnarray}
Note that although the last term on the right-hand side appears to be a function only of $X$ and
$\boldsymbol{\theta}$, we must remember that $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$, and that $\boldsymbol{\varphi}$ is related to $\boldsymbol{\eta}=\mathbf{f}(Z,\boldsymbol{\phi})$ via the graphical model above.
In fact, in order to compute the gradient of the log-likelihood $L$ with respect to the parameter $\boldsymbol{\eta}$,
we must use the fact that the link function $\mathbf{g}$ is invertible, as represented
by the inverted graphical model
\begin{eqnarray}
\boldsymbol{\varphi} & \xleftarrow{\boldsymbol{\iota}_\boldsymbol{\psi}^{-1}} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
Consequently, we obtain
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}L & = &
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\mu}^T\,
\boldsymbol{\nabla}_\boldsymbol{\mu}\boldsymbol{\varphi}^T\,
\boldsymbol{\nabla}_\boldsymbol{\varphi}L
\\& = &
\left[\frac{\partial\mathbf{g}^T}{\partial\boldsymbol{\mu}}\right]^{-1}\,
\left[\frac{\partial\boldsymbol{\mu}^T}{\partial\boldsymbol{\varphi}}\right]^{-1}\,
\boldsymbol{\nabla}_\boldsymbol{\varphi}L\,.
\end{eqnarray}
Implicitly, we may therefore suppose that $\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}(Z,\boldsymbol{\phi}))$,
such that 
$\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}(Z,\boldsymbol{\phi})))$.
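
The chain rule for $\boldsymbol{\nabla}_\boldsymbol{\eta}L$ can be verified numerically in a simple scalar case. The sketch below assumes a Bernoulli variate with $\varphi=\mu$ (and no independent parameter $\boldsymbol{\psi}$) together with the logit link; these choices, and the function names, are illustrative assumptions.

```python
import math

def log_likelihood(x, mu):
    """Bernoulli log-likelihood for a single observation x in {0, 1}."""
    return x * math.log(mu) + (1 - x) * math.log(1.0 - mu)

def inverse_link(eta):
    """Inverse of the logit link, mu = g^{-1}(eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

def grad_eta_chain(x, eta):
    """Gradient of L with respect to eta via the chain rule."""
    mu = inverse_link(eta)
    dg_dmu = 1.0 / (mu * (1.0 - mu))           # derivative of the logit link g
    dmu_dvarphi = 1.0                          # varphi = mu in this example
    dL_dvarphi = (x - mu) / (mu * (1.0 - mu))  # gradient of L w.r.t. varphi
    return (1.0 / dg_dmu) * (1.0 / dmu_dvarphi) * dL_dvarphi

def grad_eta_numeric(x, eta, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    upper = log_likelihood(x, inverse_link(eta + h))
    lower = log_likelihood(x, inverse_link(eta - h))
    return (upper - lower) / (2.0 * h)

x, eta = 1, 0.3
chain = grad_eta_chain(x, eta)
numeric = grad_eta_numeric(x, eta)
print(chain, numeric, x - inverse_link(eta))
```

In this example the two inverse Jacobian factors cancel the denominator of $\boldsymbol{\nabla}_\boldsymbol{\varphi}L$, leaving the simple form $X-\mu$.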

The requisite 
[expectations](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods")
are therefore given by
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\phi}L\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}\right]
& = & \boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\mid\boldsymbol{\theta}\right]
~=~ \mathbf{0}\,,
\end{eqnarray}
and
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\phi}\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}L
\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}\right]
& = & -\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]
\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}
\,.
\end{eqnarray}
Consequently, 
in the special case where there is no independent parameter $\boldsymbol{\psi}$,
the [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
estimate $\hat{\boldsymbol{\phi}}_\texttt{ML}$ may be obtained iteratively via updates of the form
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]
\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}
\right\rangle\,\Delta\boldsymbol{\phi} & = &
\left\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,\boldsymbol{\nabla}_\boldsymbol{\eta}L
\right\rangle
\,.
\end{eqnarray}

It must be noted that the *link* parameter $\boldsymbol{\eta}$ here does not necessarily have to be the same as any of the natural *distributional* parameters used
[previously](#Seperable-dependencies "Section: Seperable dependencies") (which were, unfortunately, also denoted collectively by $\boldsymbol{\eta}$). In general they are not the same.
Despite this, we [may](#General-form "Section: General form")
still introduce a new variate $Y_\boldsymbol{\eta}$ as a function of $X$ (and possibly $\boldsymbol{\theta}$),
such that
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}L~=~Y_\boldsymbol{\eta}-\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}
& ~~~\Rightarrow~~~ & 
\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}~=~\mathbb{E}\left[Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right]
\,.
\end{eqnarray}
The variance is then directly obtained as
\begin{eqnarray}
\boldsymbol{\Sigma}_{Y_\boldsymbol{\eta}} & ~=~ &
\texttt{Var}\left[Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right]
~=~\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]\,.
\end{eqnarray}
It then follows, again only when there is **no** independent parameter $\boldsymbol{\psi}$,
that the update for the parameter $\boldsymbol{\phi}$ is given by
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,\boldsymbol{\Sigma}_{Y_\boldsymbol{\eta}}
\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}
\right\rangle\,\Delta\boldsymbol{\phi} & = &
\left\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}
\left[Y_\boldsymbol{\eta}-\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}\right]
\right\rangle
\,.
\end{eqnarray}
Observe the similarity (and dissimilarity) with the nonlinear 
[IRLS](#Least-squares-regression "Section: Least-squares regression") update equation.

In general, however, the parameter $\boldsymbol{\psi}$ does exist.
Note that although, by definition, $\boldsymbol{\psi}$ is independent of 
$\boldsymbol{\phi}$ and $\boldsymbol{\eta}$, the estimates of these parameters are **not**
independent.  This dependence was demonstrated 
[previously](#Beta-Bernoulli-distribution "Section: Beta-Bernoulli distribution").
[Recall](#Regression-modelling "Section: Regression modelling") 
that the regression model itself takes the form
$\boldsymbol{\eta}=\mathbf{f}(Z,\boldsymbol{\phi})$,
and that the dependent parameter $\boldsymbol{\varphi}$ is then obtained by inverting the link function 
$\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$ to give $\boldsymbol{\mu}=\mathbf{g}^{-1}(\boldsymbol{\eta})$, and hence
$\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.
It follows that $\boldsymbol{\varphi}$ is implicitly a function of $\boldsymbol{\psi}$ and $\boldsymbol{\eta}$
(and thus $\boldsymbol{\phi}$). 

This fact has two immediate consequences. Firstly, $\boldsymbol{\eta}$
only affects the log-likelihood $L(\boldsymbol{\theta};X)$ indirectly via $\boldsymbol{\varphi}$,
such that
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}L & ~\doteq~ & 
\frac{\partial\boldsymbol{\varphi}^T}{\partial\boldsymbol{\eta}}\,\frac{\partial L}{\partial\boldsymbol{\varphi}}
~=~Y_\boldsymbol{\eta}-\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}
\,,
\end{eqnarray}
and therefore
\begin{eqnarray}
Y_\boldsymbol{\phi} & ~\doteq~ & \boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
Y_\boldsymbol{\eta}\,.
\end{eqnarray}
Secondly, $\boldsymbol{\psi}$ affects the log-likelihood both directly, as before, and now also indirectly via
$\boldsymbol{\varphi}$, such that
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\psi}L & ~\doteq~ & 
\frac{\partial L}{\partial\boldsymbol{\psi}}
+\frac{\partial\boldsymbol{\varphi}^T}{\partial\boldsymbol{\psi}}\,\frac{\partial L}{\partial\boldsymbol{\varphi}}
~=~Y_\boldsymbol{\psi}-\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}\,.
\end{eqnarray}
The dependence between the estimates of $\boldsymbol{\psi}$ and $\boldsymbol{\phi}$ now arises in practice due to the correlation between $Y_\boldsymbol{\psi}$ and $Y_\boldsymbol{\eta}$ via their covariance, namely
\begin{eqnarray}
\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi},Y_\boldsymbol{\eta}} ~\doteq~
\mathtt{Cov}[Y_\boldsymbol{\psi},Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}]
& \;\;\;\Rightarrow\;\;\; & 
\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi},Y_\boldsymbol{\phi}} ~\doteq~
\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi},Y_\boldsymbol{\eta}}\,
\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}\,.
\end{eqnarray}
The complete update equation for the nonlinear regression model is therefore
\begin{eqnarray}
\left[\begin{array}{cc}
\left\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,\boldsymbol{\Sigma}_{Y_\boldsymbol{\eta}}\,
\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}\right\rangle &
\left\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^T\,
\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi},Y_\boldsymbol{\eta}}^T\right\rangle
\\
\left\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi},Y_\boldsymbol{\eta}}\,
\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}\right\rangle &
\left\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\right\rangle
\end{array}\right]\,
\left[\begin{array}{c}
\Delta\boldsymbol{\phi}
\\
\Delta\boldsymbol{\psi}
\end{array}\right] & ~=~ &
\left[\begin{array}{c}
\left\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\left[Y_\boldsymbol{\eta}-\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}\right]\right\rangle
\\
\left\langle Y_\boldsymbol{\psi}-\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}\right\rangle
\end{array}\right]\,.
\end{eqnarray}

### Generalised linear models

Generalised linear modelling (GLM) now follows immediately from
[nonlinear modelling](#Generalised-nonlinear-models "Section: Generalised nonlinear models").
In theory, the most general linear model is given by
\begin{eqnarray}
\mathbf{f}(\mathbf{z},\boldsymbol{\Phi}) & \doteq & \boldsymbol{\Phi}^T\,\mathbf{z}\,,
\end{eqnarray}
for a multi-dimensional vector $Z=\mathbf{z}$ of covariates (which may also include a constant component), and 
a matrix parameter $\boldsymbol{\Phi}$. However, we may separate the regression function into its components $\mathbf{f}=[f_i]$ by
considering independent scalar functions, each parameterised by a column
of $\boldsymbol{\Phi}=[\boldsymbol{\phi}_i]$.
We may therefore assume, without loss of generality, that the regression function takes the simple, scalar form
\begin{eqnarray}
\eta & = & f(Z,\boldsymbol{\phi}) ~\doteq~ \boldsymbol{\phi}^T\,Z
~=~Z^{T}\boldsymbol{\phi}\,.
\end{eqnarray}
Note, however, that in the full vector case, we would still need to reconstruct 
the link parameter vector
$\boldsymbol{\eta}=[\eta_i]$ in order to obtain
the vector mean $\boldsymbol{\mu}=\mathbf{g}^{-1}(\boldsymbol{\eta})$, and consequently 
the dependent parameter
$\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.

For scalar link parameter $\eta$, the general 
[update equation](#Generalised-nonlinear-models "Section: Generalised nonlinear models")
now reduces to
\begin{eqnarray}
\left[\begin{array}{cc}
\left\langle\sigma^2_{Y_\eta}\,ZZ^T\right\rangle &
\left\langle Z\,\boldsymbol{\sigma}_{Y_\boldsymbol{\psi},Y_\eta}^T\right\rangle
\\
\left\langle\boldsymbol{\sigma}_{Y_\boldsymbol{\psi},Y_\eta}\,Z^T\right\rangle &
\left\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\right\rangle
\end{array}\right]\,
\left[\begin{array}{c}
\Delta\boldsymbol{\phi}
\\
\Delta\boldsymbol{\psi}
\end{array}\right] & ~=~ &
\left[\begin{array}{c}
\left\langle(Y_\eta-\mu_{Y_\eta})\,Z\right\rangle
\\
\left\langle Y_\boldsymbol{\psi}-\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}\right\rangle
\end{array}\right]\,.
\end{eqnarray}

For comparison, let us now take another look at 
[least-squares](#Least-squares-regression "Section: Least-squares regression")
regression. 
We start with WLS using the weighted residual $\tilde{\varepsilon}$ such that the square error is
\begin{eqnarray}
S(\boldsymbol{\phi}) & ~=~ & \left\langle\tilde{\varepsilon}^2\right\rangle
~=~\left\langle\frac{(Y_\eta-\mu_{Y_\eta})^2}{\sigma_{Y_\eta}^2}\right\rangle\,.
\end{eqnarray}
Next, we make use of IRLS by temporarily holding $\sigma^2_{Y_\eta}$ constant for each iterative update, and then taking the chain of gradients
with respect to first $\mu_{Y_\eta}$, then $\eta$, and finally $\boldsymbol{\phi}$.
We make use of the [fact](#General-form "Section: General form") that
\begin{eqnarray}
\frac{\partial\mu_{Y_\eta}}{\partial\eta} & = & \sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]\,,
\end{eqnarray}
for any arbitrary, scalar link parameter $\eta$.
It then follows that the approximate gradient of $S(\boldsymbol{\phi})$ with respect to $\boldsymbol{\phi}$
is given by
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}\,S & ~\approx~ & 
-2\,\left\langle\frac{Y_\eta-\mu_{Y_\eta}}{\sigma^2_{Y_\eta}}\,\frac{\partial\mu_{Y_\eta}}{\partial\eta}\,\frac{\partial\eta}{\partial\boldsymbol{\phi}}
\right\rangle
~=~-2\,\left\langle(Y_\eta-\mu_{Y_\eta})\,
\frac{\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]}{\sigma^2_{Y_\eta}}
\,Z\right\rangle
\,.
\end{eqnarray}
Next, we temporarily hold the expectation of $\nabla_\eta Y_\eta$ constant, and take a second gradient to obtain
the approximate Hessian as
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}\,\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\,S & ~\approx~ & 
2\left\langle\frac{\partial\mu_{Y_\eta}}{\partial\eta}\,\frac{\partial\eta}{\partial\boldsymbol{\phi}}\,
\frac{\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]}{\sigma^2_{Y_\eta}}
\,Z^T\right\rangle
\\
&=& 2\left\langle
\frac{\left(\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]\right)^2}
{\sigma^2_{Y_\eta}}
\,Z\,Z^T\right\rangle
\,.
\end{eqnarray}
Finally, we use the Newton-Raphson approach to obtain the approximate update
\begin{eqnarray}
\left\langle
\frac{\left(\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]\right)^2}
{\sigma^2_{Y_\eta}}
\,Z\,Z^T\right\rangle
\,\Delta\boldsymbol{\phi} & ~=~ &
\left\langle(Y_\eta-\mu_{Y_\eta})\,
\frac{\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]}{\sigma^2_{Y_\eta}}
\,Z\right\rangle
\,.
\end{eqnarray}
In the special case where $\eta$ is a natural parameter, such that $\nabla_\eta Y_\eta=0$, this
reduces to
\begin{eqnarray}
\left\langle
\sigma^2_{Y_\eta}
\,Z\,Z^T\right\rangle
\,\Delta\boldsymbol{\phi} & ~=~ &
\left\langle(Y_\eta-\mu_{Y_\eta})\,
\,Z\right\rangle
\,,
\end{eqnarray}
which is just the update equation for $\boldsymbol{\phi}$ when there is no independent parameter $\boldsymbol{\psi}$.
In other words, the [IRLS](#Least-squares-regression "Section: Least-squares regression") 
method only becomes identical to the 
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") 
approach in the special case of a single-parameter distribution
for which $\eta$ is both the natural parameter and the link parameter.
This is important to note because Nelder and Wedderburn [[1]](#Citations "Citation [1]: Generalized Linear Models")
claim equivalence of IRLS and ML, but actually demonstrate it only for this special case.

## Linear distributional regression

In the following sections, we consider, for convenience, the linear regression function $\eta=Z^T\boldsymbol{\phi}$.
We examine what form the 
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") 
equations take for the regression parameter $\boldsymbol{\phi}$, for a variety of distributions.

### Bernoulli regression

Recall that the [Bernoulli distribution](#Bernoulli-distribution "Section: Bernoulli distribution")
is parameterised by $\theta$ and has mean $\mu=\theta$ and variance
$\sigma_X^2=\theta\,(1-\theta)$. We also recall that the natural parameterisation of the distribution is
$\eta=\sigma^{-1}(\theta)$, where $\sigma^{-1}(\cdot)$ is the logit function, such that the natural gradient of
the log-likelihood $L$ is $\nabla_\eta L=X-\mu$. Also, since $\mu=\theta$, we find that 
$\eta=\sigma^{-1}(\mu)$ is the natural parameterisation of the mean $\mu$.
Therefore, the dependent parameter is $\varphi=\theta$, and there is no independent parameter $\psi$.

We [deduce](#Generalised-linear-models "Section: Generalised linear models") that 
the iterative parameter update for $\boldsymbol{\phi}$ therefore takes the form
\begin{eqnarray}
\langle\mu\,(1-\mu)\,ZZ^T\rangle\,\Delta\boldsymbol{\phi}
& ~=~ & \langle (X-\mu)\,Z\rangle\,.
\end{eqnarray}
Also note that here $Y_\eta=X$ is the natural variate, and $\eta$ is both the natural parameter and the link parameter. Hence,
from the [previous](#Generalised-linear-models "Section: Generalised linear models") section, we see that the Bernoulli
distribution is one of the special cases where the 
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") approach
is equivalent to an iteratively reweighted 
[least-squares](#Least-squares-regression "Section: Least-squares regression") (IRLS) approach.

Note that the resulting regression model is
\begin{eqnarray}
X\mid Z,\boldsymbol{\phi} & ~\sim~ & \texttt{Bern}\left(\sigma(Z^T\boldsymbol{\phi})\right)\,.
\end{eqnarray}
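
As a minimal numerical sketch of the update above (an illustration, not part of the derivation), the following Python code iterates $\langle\mu\,(1-\mu)\,ZZ^T\rangle\,\Delta\boldsymbol{\phi}=\langle(X-\mu)\,Z\rangle$ on a small hand-made data set. The function names `sigmoid`, `bernoulli_update` and `fit_bernoulli`, the toy data, the zero initialisation, the fixed iteration count and the use of NumPy are all assumptions made here for illustration.

```python
import numpy as np


def sigmoid(eta):
    """Inverse logit link: maps the link parameter eta to the mean mu."""
    return 1.0 / (1.0 + np.exp(-eta))


def bernoulli_update(phi, Z, X):
    """Solve <mu (1 - mu) Z Z^T> dphi = <(X - mu) Z> and return phi + dphi.

    Z holds one row of covariates per observation; <.> is the sample average.
    """
    mu = sigmoid(Z @ phi)                        # mean of every observation
    weights = mu * (1.0 - mu)                    # Bernoulli variance
    lhs = (Z * weights[:, None]).T @ Z / len(X)  # <mu (1 - mu) Z Z^T>
    rhs = Z.T @ (X - mu) / len(X)                # <(X - mu) Z>
    return phi + np.linalg.solve(lhs, rhs)


def fit_bernoulli(Z, X, n_iter=25):
    """Iterate the update from phi = 0 for a fixed number of steps."""
    phi = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        phi = bernoulli_update(phi, Z, X)
    return phi


# Hypothetical toy data: a constant feature plus a single covariate z.
z = np.array([-2.0, -2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
X = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])
Z = np.column_stack([np.ones_like(z), z])

phi_hat = fit_bernoulli(Z, X)
score = Z.T @ (X - sigmoid(Z @ phi_hat)) / len(X)  # should vanish at the optimum
print(phi_hat, score)
```

At convergence the averaged score $\langle(X-\mu)\,Z\rangle$ should vanish. Because this toy data set is symmetric under $(z,X)\mapsto(-z,1-X)$, the fitted intercept should also be close to zero.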

### Poisson regression

Recall that the [Poisson distribution](#Poisson-distribution "Section: Poisson distribution") has mean and variance given by
$\mu_X=\sigma^2_X=\lambda$, and natural parameter $\eta=\ln\lambda$. Thus, we take the natural parameter to be the link parameter, and consequently
[obtain](#Generalised-linear-models "Section: Generalised linear models") the iterative parameter update
\begin{eqnarray}
\langle\mu_X\,ZZ^T\rangle\,\Delta\boldsymbol{\phi}
& ~=~ & \langle (X-\mu_X)\,Z\rangle\,.
\end{eqnarray}
This is equivalent to IRLS. Keep in mind that the mean $\mu_X$ is recomputed at each iteration from the regression function via the inverse link function, namely
\begin{eqnarray}
\mu_X & ~=~ & e^\eta ~=~ e^{Z^T\boldsymbol{\phi}}\,.
\end{eqnarray}
Also note that $\mu_X$ is computed within each iteration for every observed pair $(X,Z)$ of variate and covariates.
The resulting regression model is therefore
\begin{eqnarray}
X\mid Z,\boldsymbol{\phi} & ~\sim~ & \texttt{Poiss}\left(e^{Z^T\boldsymbol{\phi}}\right)\,.
\end{eqnarray}
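
The recomputation of $\mu_X$ within each iteration can be sketched as follows. This is only an illustrative implementation of the update $\langle\mu_X\,ZZ^T\rangle\,\Delta\boldsymbol{\phi}=\langle(X-\mu_X)\,Z\rangle$: the function names, the toy counts and the iteration limit are hypothetical, and the initial intercept $\ln\langle X\rangle$ is a convenience chosen here (assuming the first column of $Z$ is constant), not something prescribed by the derivation.

```python
import numpy as np


def poisson_update(phi, Z, X):
    """Solve <mu Z Z^T> dphi = <(X - mu) Z> and return phi + dphi."""
    mu = np.exp(Z @ phi)                    # inverse link, one mean per (X, Z) pair
    lhs = (Z * mu[:, None]).T @ Z / len(X)  # <mu Z Z^T>
    rhs = Z.T @ (X - mu) / len(X)           # <(X - mu) Z>
    return phi + np.linalg.solve(lhs, rhs)


def fit_poisson(Z, X, n_iter=50):
    """Iterate the update; assumes the first column of Z is the constant 1."""
    phi = np.zeros(Z.shape[1])
    phi[0] = np.log(X.mean())               # assumed starting point (not from the text)
    for _ in range(n_iter):
        phi = poisson_update(phi, Z, X)
    return phi


# Hypothetical toy counts whose mean grows with the covariate z.
z = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
X = np.array([1.0, 2.0, 2.0, 4.0, 6.0, 9.0])
Z = np.column_stack([np.ones_like(z), z])

phi_hat = fit_poisson(Z, X)
score = Z.T @ (X - np.exp(Z @ phi_hat)) / len(X)  # should vanish at the optimum
print(phi_hat, score)
```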

### Beta regression

Recall that the [Beta distribution](#Beta-distribution "Section: Beta distribution")
has natural parameters $\alpha,\beta>0$, with mean
\begin{eqnarray}
\mu & ~=~ & \frac{\alpha}{\alpha+\beta}~=~\frac{1}{1+\frac{\beta}{\alpha}}
~=~\frac{1}{1+e^{-\ln\frac{\alpha}{\beta}}}\,.
\end{eqnarray}
Hence, a useful choice of link parameter is
\begin{eqnarray}
\eta & ~=~ & \ln\frac{\alpha}{\beta}~=~\sigma^{-1}(\mu)\,,
\end{eqnarray}
which satisfies the dual constraints that $\eta\in\mathbb{R}$ and $\mu\in(0,1)$.

We may now invert this relationship to obtain either $\alpha=\beta\,e^\eta$ or $\beta=\alpha\,e^{-\eta}$.
Note that since we cannot recover both $\alpha$ and $\beta$ from $\mu=\sigma(\eta)$, we must choose one of these parameters to be the independent parameter $\psi$, and the other to be the dependent parameter $\varphi$.
Following Kieschnick and McCullough 
[[3]](#Citations "Citation [3]: Regression analysis of variates observed on $(0, 1)$"),
we choose $\psi=\alpha$ and $\varphi=\beta=\alpha\,e^{-\eta}$.
Consequently, we take the distributional parameter $\psi$ and the link parameter $\eta=f(Z,\boldsymbol{\phi})$ 
to be independent of each other.

The required log-likelihood gradients are therefore 
[given](#Generalised-nonlinear-models "Section: Generalised nonlinear models")
by
\begin{eqnarray}
\nabla_\eta L & ~\doteq~ & 
\frac{\partial\beta}{\partial\eta}\,\frac{\partial L}{\partial\beta}
~=~-\beta\,(Y_\beta-\mu_{Y_\beta})\,,
\end{eqnarray}
and
\begin{eqnarray}
\nabla_\psi L & ~\doteq~ & \frac{\partial L}{\partial\alpha}
+\frac{\partial\beta}{\partial\alpha}\,\frac{\partial L}{\partial\beta}
~=~(Y_\alpha-\mu_{Y_\alpha})+\frac{\beta}{\alpha}\,(Y_\beta-\mu_{Y_\beta})\,,
\end{eqnarray}
where $Y_\alpha=\ln X$ and $Y_\beta=\ln(1-X)$ are the natural variates
corresponding to the natural parameters $\alpha$ and $\beta$, respectively,
and $\mu_{Y_\alpha}$ and $\mu_{Y_\beta}$ are their respective means. 
We may therefore define the corresponding *link* variate 
\begin{eqnarray}
Y_\eta & ~\doteq~ & -\beta Y_\beta~=~-\beta\ln(1-X)\,, 
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mu_{Y_\eta} & ~\doteq~ & \mathbb{E}[Y_\eta\mid\alpha,\beta]~=~-\beta\mu_{Y_\beta}\,,
\\
\sigma^2_{Y_\eta} & ~\doteq~ & \mathbb{V}[Y_\eta\mid\alpha,\beta]~=~\beta^2\sigma^2_{Y_\beta}\,,
\end{eqnarray}
respectively.
Similarly, the *independent* variate is defined as 
\begin{eqnarray}
Y_\psi & ~\doteq~ & Y_\alpha+\frac{\beta}{\alpha}Y_\beta~=~\ln X+\frac{\beta}{\alpha}\,\ln(1-X)\,,
\end{eqnarray}
with mean
\begin{eqnarray}
\mu_{Y_\psi} & ~\doteq~ & \mathbb{E}[Y_\psi\mid\alpha,\beta] ~=~ \mu_{Y_\alpha}+\frac{\beta}{\alpha}\,\mu_{Y_\beta}\,,
\end{eqnarray}
and variance
\begin{eqnarray}
\sigma^2_{Y_\psi} & ~\doteq~ & \texttt{Var}[Y_\psi\mid\alpha,\beta]
~=~\sigma^2_{Y_\alpha}+\frac{\beta^2}{\alpha^2}\,\sigma^2_{Y_\beta}
+\frac{2\beta}{\alpha}\sigma_{Y_\alpha,Y_\beta}\,.
\end{eqnarray}
The covariance between $Y_\psi$ and $Y_\eta$ is therefore
\begin{eqnarray}
\sigma_{Y_\psi,Y_\eta} & ~\doteq~ & \mathtt{Cov}[Y_\psi,Y_\eta\mid\alpha,\beta]
~=~-\beta\,\sigma_{Y_\alpha,Y_\beta}-\frac{\beta^2}{\alpha}\,\sigma^2_{Y_\beta}\,.
\end{eqnarray}

The iterative updates for the regression parameter $\boldsymbol{\phi}$ and 
the independent parameter $\psi=\alpha$ therefore take the form
\begin{eqnarray}
\left[\begin{array}{cc}
\left\langle\sigma^2_{Y_\eta}\,ZZ^T\right\rangle &
\left\langle Z\,\sigma_{Y_\psi,Y_\eta}\right\rangle
\\
\left\langle\sigma_{Y_\psi,Y_\eta}\,Z^T\right\rangle &
\left\langle\sigma^2_{Y_\psi}\right\rangle
\end{array}\right]\,
\left[\begin{array}{c}
\Delta\boldsymbol{\phi}
\\
\Delta\alpha
\end{array}\right] & ~=~ &
\left[\begin{array}{c}
\left\langle(Y_\eta-\mu_{Y_\eta})\,Z\right\rangle
\\
\left\langle Y_\psi-\mu_{Y_\psi}\right\rangle
\end{array}\right]\,.
\end{eqnarray}
The resulting regression model is given by
\begin{eqnarray}
X\mid Z, \alpha, \boldsymbol{\phi} & ~\sim~ & \texttt{Beta}\left(\alpha, \alpha\,e^{-Z^T\boldsymbol{\phi}}\right)\,.
\end{eqnarray}
Note that the estimated value of $\alpha$ has no influence on the mean of the conditional distribution, but helps to control
the variance.
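
For reference, the moments appearing in the update equation above have closed forms, assuming the standard identities for the Beta distribution. Writing $\Psi_0$ and $\Psi_1$ for the digamma and trigamma functions (notation introduced here only to avoid a clash with the independent parameter $\psi$), we have
\begin{eqnarray}
\mu_{Y_\alpha}~=~\Psi_0(\alpha)-\Psi_0(\alpha+\beta)\,, & \;\;\;\; &
\mu_{Y_\beta}~=~\Psi_0(\beta)-\Psi_0(\alpha+\beta)\,,
\\
\sigma^2_{Y_\alpha}~=~\Psi_1(\alpha)-\Psi_1(\alpha+\beta)\,, & \;\;\;\; &
\sigma^2_{Y_\beta}~=~\Psi_1(\beta)-\Psi_1(\alpha+\beta)\,,
\end{eqnarray}
together with $\sigma_{Y_\alpha,Y_\beta}=-\Psi_1(\alpha+\beta)$. Substituting these into the definitions above gives, for example,
\begin{eqnarray}
\sigma^2_{Y_\eta} & ~=~ & \beta^2\,\left[\Psi_1(\beta)-\Psi_1(\alpha+\beta)\right]\,,
\\
\sigma_{Y_\psi,Y_\eta} & ~=~ & \beta\,\Psi_1(\alpha+\beta)
-\frac{\beta^2}{\alpha}\,\left[\Psi_1(\beta)-\Psi_1(\alpha+\beta)\right]\,.
\end{eqnarray}
Since $\beta=\alpha\,e^{-Z^T\boldsymbol{\phi}}$ depends on the covariates, these quantities must be re-evaluated for every observation within each iteration.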

### Beta-Bernoulli regression

Recall that the [Beta-Bernoulli distribution](#Beta-Bernoulli-distribution "Section: Beta-Bernoulli distribution"),
namely
\begin{eqnarray}
P(X=x\mid\alpha,\beta) & ~=~ & \frac{\alpha^x\,\beta^{1-x}}{\alpha+\beta}\,,
\end{eqnarray}
has mean $\mu_X$ and variance $\sigma^2_X$ given by
\begin{eqnarray}
\mu_X~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\mbox{and}\;\; & \sigma^2_X~=~\frac{\alpha\beta}{(\alpha+\beta)^2}\,,
\end{eqnarray}
respectively, 
along with a natural parameter $\eta$ given by
\begin{eqnarray}
\eta & ~=~ & \ln\frac{\alpha}{\beta}~=~\sigma^{-1}(\mu)\,.
\end{eqnarray}
We therefore deduce that $\eta$ is also the natural link parameter for a logit link function.
Following [Beta regression](#Beta-regression "Section: Beta regression"), 
we take $\eta$ as the independent link parameter,
$\psi=\alpha$ as the independent distributional parameter, and 
$\varphi=\beta=\alpha\,e^{-\eta}$ as the dependent distributional parameter. 
This effectively reparameterises the distribution from 
$\boldsymbol{\theta}=(\alpha,\beta)$ to
$\boldsymbol{\theta}'\doteq(\eta,\psi)$.

Now, the derivative of the log-likelihood with respect to $\eta$ is therefore
\begin{eqnarray}
\nabla_\eta L & ~\doteq~ & 
\frac{\partial\beta}{\partial\eta}\,\frac{\partial L}{\partial\beta}
~=~-\beta\,(Y_\beta-\mu_{Y_\beta})\,,
\end{eqnarray}
such that
\begin{eqnarray}
Y_\eta & ~\doteq~ & -\beta\,Y_\beta~=~X\,.
\end{eqnarray}
The corresponding derivative with respect to $\psi$ is 
[given](#Generalised-nonlinear-models "Section: Generalised nonlinear models")
by
\begin{eqnarray}
\nabla_\psi L & ~\doteq~ & \frac{\partial L}{\partial\alpha}
+\frac{\partial\beta}{\partial\alpha}\,\frac{\partial L}{\partial\beta}
~=~(Y_\alpha-\mu_{Y_\alpha})+\frac{\beta}{\alpha}\,(Y_\beta-\mu_{Y_\beta})\,,
\end{eqnarray}
from which we deduce that
\begin{eqnarray}
Y_\psi & ~\doteq~ & Y_\alpha+\frac{\beta}{\alpha}\,Y_\beta~\equiv~ 0\,.
\end{eqnarray}
We conclude that there is no valid update equation for $\psi=\alpha$, which should therefore remain constant.

However, we [previously](#Beta-Bernoulli-distribution "Section: Beta-Bernoulli distribution") noted
that we can, in practice, utilise an approximate update scheme for $\alpha$ by neglecting the known correlation between $Y_\alpha$ and $Y_\beta$. Taking a similar approach here, we first **redefine** 
\begin{eqnarray}
\nabla_\psi L & ~\doteq~ & \frac{\partial L}{\partial\alpha}
~=~(Y_\alpha-\mu_{Y_\alpha})\,,
\end{eqnarray}
such that now
\begin{eqnarray}
Y_\psi & ~\doteq~ & Y_\alpha~=~\frac{X}{\alpha}\,.
\end{eqnarray}
Note that the covariance between $Y_\eta$ and $Y_\psi$ is now given by
\begin{eqnarray}
\sigma_{Y_\eta,Y_\psi} & ~\doteq~ & \mathtt{Cov}\left[Y_\eta,Y_\psi\mid\alpha,\beta\,\right]
~=~\frac{\sigma^2_X}{\alpha}\,,
\end{eqnarray}
such that the corresponding covariance matrix for $Y_{\boldsymbol{\theta}'}\doteq(Y_\eta,Y_\psi)$ is now given by
\begin{eqnarray}
\boldsymbol{\Sigma}_{\boldsymbol{\theta}'} & ~\doteq~ &
\mathtt{Var}\left[Y_{\boldsymbol{\theta}'}\mid\alpha,\beta\,\right]
~=~
\left[\begin{array}{cc}
\sigma^2_X & \frac{\sigma^2_X}{\alpha}
\\
\frac{\sigma^2_X}{\alpha} & \frac{\sigma^2_X}{\alpha^2}
\end{array}\right]\,.
\end{eqnarray}
Unfortunately, this approximate covariance matrix is singular, just as the original covariance matrix
$\Sigma_{Y_\boldsymbol{\theta}}$ was.
Therefore, we additionally neglect the known correlation between $Y_\eta$ and $Y_\psi$, and take the approximate
regression update equation to be
\begin{eqnarray}
\left[\begin{array}{cc}
\left\langle\sigma^2_{Y_\eta}\,ZZ^T\right\rangle &
0
\\
0 &
\left\langle\sigma^2_{Y_\psi}\right\rangle
\end{array}\right]\,
\left[\begin{array}{c}
\Delta\boldsymbol{\phi}
\\
\Delta\alpha
\end{array}\right] & ~=~ &
\left[\begin{array}{c}
\left\langle(Y_\eta-\mu_{Y_\eta})\,Z\right\rangle
\\
\left\langle Y_\psi-\mu_{Y_\psi}\right\rangle
\end{array}\right]\,.
\end{eqnarray}
The resulting regression model is therefore
\begin{eqnarray}
X\mid Z, \alpha, \boldsymbol{\phi} & ~\sim~ & \texttt{BetaBern}\left(\alpha,\alpha\,e^{-Z^T\boldsymbol{\phi}}\right)\,.
\end{eqnarray}
However, note that, in practice, the estimated value of $\alpha$ has no influence on either the mean or the variance of
the conditional distribution.
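
The final remark can be checked numerically. The short sketch below evaluates the Beta-Bernoulli probabilities with $\beta=\alpha\,e^{-\eta}$ for several values of $\alpha$ at a fixed link parameter; the function names and the particular values of $\alpha$ and $\eta$ are hypothetical choices made for illustration.

```python
import math


def beta_bernoulli_pmf(x, alpha, beta):
    """P(X = x | alpha, beta) = alpha^x beta^(1 - x) / (alpha + beta), x in {0, 1}."""
    return alpha ** x * beta ** (1 - x) / (alpha + beta)


def conditional_moments(alpha, eta):
    """Mean and variance of X when beta = alpha * exp(-eta)."""
    beta = alpha * math.exp(-eta)
    p1 = beta_bernoulli_pmf(1, alpha, beta)
    p0 = beta_bernoulli_pmf(0, alpha, beta)
    return p1, p1 * p0  # mean alpha/(alpha+beta), variance alpha*beta/(alpha+beta)^2


eta = 0.7
moments = [conditional_moments(alpha, eta) for alpha in (0.5, 1.0, 4.0)]
expected_mean = 1.0 / (1.0 + math.exp(-eta))  # sigma(eta)
print(moments, expected_mean)
```

Every choice of $\alpha$ should return the same mean $\sigma(\eta)$ and the same variance $\sigma(\eta)\,[1-\sigma(\eta)]$, since only the ratio $\alpha/\beta=e^{\eta}$ enters the distribution.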

### Gamma regression

Recall that the [Gamma distribution](#Gamma-distribution "Section: Gamma distribution") has natural parameters
$\alpha,\beta>0$, with corresponding natural variates $Y_\alpha=\ln X$ and $Y_\beta=-X$. The mean and variance of the distribution are
given by
\begin{eqnarray}
\mu_X~=~\frac{\alpha}{\beta}\,, & \;\;~\mbox{and}~\;\; & \sigma^2_X~=~\frac{\alpha}{\beta^2}\,,
\end{eqnarray}
respectively. We choose the link parameter to be
\begin{eqnarray}
\eta~=~\ln\frac{\alpha}{\beta}~=~\ln\mu_X & ~\Rightarrow~ & \mu_X~=~e^\eta\,,
\end{eqnarray}
which satisfies $\eta\in\mathbb{R}$ and $\mu_X>0$.

Now, following an [earlier](#Beta-distribution "Section: Beta distribution") suggestion,
we take either $\alpha$ or $\beta$ to be the independent parameter, and let the other be the dependent parameter.
For example, if we were to estimate $\alpha$ and $\beta$ by the method of matching moments, then we could either
estimate
\begin{eqnarray}
\alpha~=~\frac{\mu_X^2}{\sigma_X^2} & ~~~\Rightarrow~~~ & \beta~=~\frac{\alpha}{\mu_X}\,,
\end{eqnarray}
or
\begin{eqnarray}
\beta~=~\frac{\mu_X}{\sigma_X^2} & ~~~\Rightarrow~~~ & \alpha~=~\beta\,\mu_X\,.
\end{eqnarray}
To help us choose, we note the
[fact](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution")
that if $X\sim\texttt{Gamma}(\alpha,\beta)$ and $Y\sim\texttt{Gamma}(\nu,\beta)$ then the proportion
$\frac{X}{X+Y}$ is distributed as
\begin{eqnarray}
\frac{X}{X+Y} & ~\sim~ & \texttt{Beta}(\alpha,\nu)\,.
\end{eqnarray}
Consequently, we assume the rate parameter $\beta$ to be common across all observations.
Hence, we take the independent parameter to be $\psi=\beta$, and the dependent parameter to be $\varphi=\alpha=\beta\,e^{\eta}$.


Subsequently, we take the distributional parameter $\psi$ 
to be independent of the link parameter $\eta$, and thus independent of the regression parameter
$\boldsymbol{\phi}$ of the regression function $\eta=f(Z,\boldsymbol{\phi})$.
The required log-likelihood gradients are now [given](#Generalised-linear-models "Section: Generalised linear models") by
\begin{eqnarray}
\nabla_\eta L & ~\doteq~ &
\frac{\partial\alpha}{\partial\eta}\,\frac{\partial L}{\partial\alpha}
~=~\alpha\,(Y_\alpha-\mu_{Y_\alpha})\,,
\end{eqnarray}
and
\begin{eqnarray}
\nabla_\psi L & ~\doteq~ &
\frac{\partial L}{\partial\beta}+\frac{\partial\alpha}{\partial\beta}\,\frac{\partial L}{\partial\alpha}
~=~(Y_\beta-\mu_{Y_\beta})+\frac{\alpha}{\beta}\,(Y_\alpha-\mu_{Y_\alpha})\,.
\end{eqnarray}
Consequently, we define the link variate as
\begin{eqnarray}
Y_\eta & ~\doteq~ & \alpha\,Y_\alpha~=~\alpha\,\ln X\,,
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mu_{Y_\eta} & ~\doteq~ & \mathbb{E}[Y_\eta\mid\alpha,\beta]~=~\alpha\,\mu_{\ln X}\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_{Y_\eta} & ~=~ & \texttt{Var}[Y_\eta\mid\alpha,\beta]~=~\alpha^2\,\sigma^2_{\ln X}\,,
\end{eqnarray}
respectively. Similarly, we define the independent variate as
\begin{eqnarray}
Y_\psi & ~\doteq~ & Y_\beta+\frac{\alpha}{\beta}\,Y_\alpha~=~-X+\frac{\alpha}{\beta}\,\ln X\,,
\end{eqnarray}
with mean and variance given by
\begin{eqnarray}
\mu_{Y_\psi} & ~\doteq~ & \mathbb{E}[Y_\psi\mid\alpha,\beta]~=~
-\mu_X+\frac{\alpha}{\beta}\,\mu_{\ln X}\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_{Y_\psi} & ~\doteq~ & \texttt{Var}[Y_\psi\mid\alpha,\beta]
~=~\sigma^2_X+\frac{\alpha^2}{\beta^2}\,\sigma^2_{\ln X}-\frac{2\alpha}{\beta}\,\sigma_{X,\ln X}\,,
\end{eqnarray}
respectively.
Finally, we note that the covariance between $Y_\eta$ and $Y_\psi$ is given by
\begin{eqnarray}
\sigma_{Y_\eta,Y_\psi} & ~\doteq~ & \texttt{Cov}\left[Y_\eta,Y_\psi\mid\alpha,\beta\,\right]
~=~-\alpha\,\sigma_{X,\ln X}+\frac{\alpha^2}{\beta}\,\sigma^2_{\ln X}\,.
\end{eqnarray}
Taking $\boldsymbol{\theta}'\doteq(\eta,\psi)$, we may determine that
\begin{eqnarray}
\left|\Sigma_{Y_{\boldsymbol{\theta}'}}\right| & ~=~ & 
\alpha^2\left(\sigma^2_X\,\sigma^2_{\ln X}-\sigma_{X,\ln X}^2\right)\,,
\end{eqnarray}
such that the covariance matrix $\Sigma_{Y_{\boldsymbol{\theta}'}}$ is non-singular.
Consequently, we may utilise the 
[standard](#Generalised-linear-models "Section: Generalised linear models")
parameter update equations for $\psi=\beta$ and $\boldsymbol{\phi}$. 
The resulting regression model is
\begin{eqnarray}
X\mid Z,\beta,\boldsymbol{\phi} & ~\sim~ & \texttt{Gamma}\left(\beta\,e^{Z^T\boldsymbol{\phi}},\beta\right)\,.
\end{eqnarray}
Note that the estimated value of $\beta$ has no influence on the mean of the conditional distribution, but
acts to control the variance.
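
To make the non-singularity claim above concrete, we may substitute closed-form moments, assuming the standard identities for a Gamma distribution with shape $\alpha$ and rate $\beta$. Writing $\Psi_0$ and $\Psi_1$ for the digamma and trigamma functions (notation introduced here to avoid a clash with the independent parameter $\psi$), these are
\begin{eqnarray}
\mu_{\ln X}~=~\Psi_0(\alpha)-\ln\beta\,, & \;\;\;\;
\sigma^2_{\ln X}~=~\Psi_1(\alpha)\,, \;\;\;\; &
\sigma_{X,\ln X}~=~\frac{1}{\beta}\,.
\end{eqnarray}
Together with $\sigma^2_X=\alpha/\beta^2$, the determinant then becomes
\begin{eqnarray}
\left|\Sigma_{Y_{\boldsymbol{\theta}'}}\right| & ~=~ &
\alpha^2\left(\frac{\alpha}{\beta^2}\,\Psi_1(\alpha)-\frac{1}{\beta^2}\right)
~=~\frac{\alpha^2}{\beta^2}\,\left[\alpha\,\Psi_1(\alpha)-1\right]\,,
\end{eqnarray}
which is strictly positive because $\Psi_1(\alpha)>1/\alpha$ for all $\alpha>0$.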

## Regression modelling revisited

Up until this point we have naively assumed that the covariates $Z$ give rise to a regression function of the
linear form $f(Z,\boldsymbol{\phi})=\boldsymbol{\phi}^TZ$. Essentially, we have assumed that the underlying
covariates have already been transformed into features suitable for linear regression. This naturally raises the question: what are these suitable transformations, and how are they derived? 
In order to answer this question, we turn to an alternative approach of deriving probabilistic regression models, offered by
Bergtold et al. [[4]](#Citations "Citation [4]: Bernoulli Regression Models"). 

Under this "*probabilistic reduction*" framework, both the dependent variate $X\in\mathcal{X}$ and the independent covariates $Z\in\mathcal{Z}$ are jointly modelled.
For such a joint density to exist, it must be possible to factor it into conditionals in two different ways, namely
\begin{eqnarray}
p(X,Z\mid\Theta) & ~=~ & p(X\mid\Theta)\,p(Z\mid X,\Theta)
~=~ p(Z\mid\Theta)\,p(X\mid Z,\Theta)\,.
\end{eqnarray}
For convenience, we reintroduce the prior (now marginal) distribution parameter $\boldsymbol{\theta}$ 
for the response variate $X$, and the regression parameter $\boldsymbol{\phi}$ for the covariates $Z$. 
Conversely, we now introduce the corresponding marginal parameter $\boldsymbol{\pi}$ for the covariates $Z$, and
the regression parameter $\boldsymbol{\zeta}$ for the response variate $X$.
Consequently, we take $\Theta=(\boldsymbol{\theta},\boldsymbol{\pi},\boldsymbol{\phi},\boldsymbol{\zeta})$ such that
\begin{eqnarray}
p(X,Z\mid\Theta) & ~=~ & p(X\mid\boldsymbol{\theta})\,p(Z\mid\boldsymbol{\pi},X,\boldsymbol{\zeta})
~=~ p(Z\mid\boldsymbol{\pi})\,p(X\mid\boldsymbol{\theta},Z,\boldsymbol{\phi})\,.
\end{eqnarray}

We may now combine both factorisations together and rearrange terms to obtain
\begin{eqnarray}
p(X\mid\boldsymbol{\theta}) & = &
\frac{p(Z\mid\boldsymbol{\pi})\,p(X\mid\boldsymbol{\theta},Z,\boldsymbol{\phi})}{p(Z\mid\boldsymbol{\pi},X,\boldsymbol{\zeta})}\,.
\end{eqnarray}
Note that the apparent dependency of the right-hand side on $Z$ must actually cancel out, since the left-hand side is purely a function of $X$.

The next step of "*probabilistic reduction*" is to consider two distinct values of $X$, say $x_0,x_1\in\mathcal{X}$, and evaluate the above formula at each point. Taking the ratio of the formula for the two values, rearranging terms, and taking the logarithm, we then obtain the *log-ratio formula*:
\begin{eqnarray}
\ln\frac{p(X=x_1\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})}{p(X=x_0\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})}
& ~=~ &
\ln\frac{p(X=x_1\mid\boldsymbol{\theta})}{p(X=x_0\mid\boldsymbol{\theta})}
+\ln\frac{p(Z\mid\boldsymbol{\pi},X=x_1,\boldsymbol{\zeta})}{p(Z\mid\boldsymbol{\pi},X=x_0,\boldsymbol{\zeta})}
\,.
\end{eqnarray}
Note that the term on the left-hand side determines the form of the regression model, and depends upon the
choice of the distribution of the dependent variate $X$, as we shall see in the following sections.
Also note that the first term on the right-hand side
is (for pre-determined $x_0$ and $x_1$) just a function of the (unknown) parameter $\boldsymbol{\theta}$,
which we shall represent as $\kappa(\boldsymbol{\theta})$.
Hence, this term corresponds to a constant term in the regression function arguments.
The remaining regression terms, as functions of the covariates $Z$, arise from the second term on the right-hand side, and depend upon the choice of the covariate conditional distribution.

In the approach of Bergtold et al. [[4]](#Citations "Citation [4]: Bernoulli Regression Models"),
we want to determine the form of the regression features based upon the choice of the covariate (marginal) distribution
$p(Z\mid\boldsymbol{\pi})$. Consequently, we assume that the conditional distribution
$p(Z\mid\boldsymbol{\pi},X,\boldsymbol{\zeta})$ takes the same functional form as the marginal distribution.
One way to achieve this aim is to assume that each observed value $X=x\in\mathcal{X}$ of the response variate selects a specific
marginal distribution for the covariates $Z$, corresponding to the model
\begin{eqnarray}
\hat{\boldsymbol{\pi}}(x) ~\doteq~ \mathbf{G}(x;\boldsymbol{\pi},\boldsymbol{\zeta})\,,
& \;\;\;\; &
p(Z\mid\boldsymbol{\pi},X,\boldsymbol{\zeta})~\doteq~p(Z\mid\hat{\boldsymbol{\pi}}(X))
\,.
\end{eqnarray}
Thus, letting $\hat{\boldsymbol{\pi}}_0\doteq\hat{\boldsymbol{\pi}}(x_0)$ and
$\hat{\boldsymbol{\pi}}_1\doteq\hat{\boldsymbol{\pi}}(x_1)$, the log-ratio formula reduces to
\begin{eqnarray}
\ln\frac{p(X=x_1\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})}{p(X=x_0\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})}
& ~=~ &
\kappa(\boldsymbol{\theta})+\ln\frac{p(Z\mid\hat{\boldsymbol{\pi}}_1)}{p(Z\mid\hat{\boldsymbol{\pi}}_0)}\,.
\end{eqnarray}
As we shall see in later sections, it seems to be a rule of thumb that the relevant covariate features correspond to the
[natural variates](#Seperable-dependencies "Section: Seperable dependencies") of each distribution.
Presumably this only holds true for members of "the" exponential family, especially since,
as Bergtold et al. [[4]](#Citations "Citation [4]: Bernoulli Regression Models") note, only "the" exponential family results in regression functions that are linear in their parameters. 
In such cases, the covariates will lead to an arbitrary number $n$ of features of the form $f_1(Z),\ldots,f_n(Z)$, such that the regression model takes the form
\begin{eqnarray}
\ln\frac{p(X=x_1\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})}
{p(X=x_0\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})}
& ~=~ &
f(Z,\boldsymbol{\phi})~\doteq~
\phi_0+\phi_1\,f_1(Z)+\cdots+\phi_n\,f_n(Z)
\,,
\end{eqnarray}
where the marginal function $\kappa(\boldsymbol{\theta})$ has been absorbed into the constant parameter
$\phi_0$, on the basis that both parameters
$\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ are unknown and must be estimated.

It might also be convenient to assume that the conditional distribution 
$p(X\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})$ takes the same functional form as utilised for the marginal distribution $p(X\mid\boldsymbol{\theta})$.
This would correspond to the model
\begin{eqnarray}
\hat{\boldsymbol{\theta}}(z)~\doteq~\mathbf{F}(z;\boldsymbol{\theta},\boldsymbol{\phi})\,,
& \;\;\;\; & 
p(X\mid \boldsymbol{\theta},Z,\boldsymbol{\phi})~\doteq~p(X\mid\hat{\boldsymbol{\theta}}(Z))\,.
\end{eqnarray}
However, as we shall see in later sections, this choice is not always feasible. In particular, there will
typically be constraints on the parameter $\boldsymbol{\theta}$, e.g. non-negativity, that must be satisfied
by the choice of regression function.

### Bernoulli regression (again)

The [Bernoulli distribution](#Bernoulli-distribution "Section: Bernoulli distribution")
has domain $\mathcal{X}=\{0,1\}$ and takes the (marginal) form 
\begin{eqnarray}
p(X=x\mid\theta) & ~=~ & \theta^x\,(1-\theta)^{1-x}\,,
\end{eqnarray}
for the given probability $\theta\in(0,1)$.
We therefore suppose that the conditional distribution takes a similar form, namely
\begin{eqnarray}
p(X=x\mid\theta,Z,\boldsymbol{\phi}) & ~=~ & F(Z;\boldsymbol{\phi})^x\,[1-F(Z;\boldsymbol{\phi})]^{1-x}\,,
\end{eqnarray}
where $F$ is constrained to satisfy $F(Z;\boldsymbol{\phi})\in(0,1)$ for all $Z\in\mathcal{Z}$
and all $\boldsymbol{\phi}\in\mathbb{R}^d$.
Note that since the mean of the Bernoulli distribution is just $\mu=\theta$, this is equivalent
to the [mean regression](#Mean-regression "Section: Mean regression") model
\begin{eqnarray}
\mu & ~=~ & F(Z;\boldsymbol{\phi})\,.
\end{eqnarray}
We now take $x_0=0$ and $x_1=1$, such that the log-ratio formula reduces to
\begin{eqnarray}
\ln\frac{F(Z;\boldsymbol{\phi})}{1-F(Z;\boldsymbol{\phi})}
& ~=~ & \ln\frac{\theta}{1-\theta} 
+ \ln\frac{p(Z\mid\boldsymbol{\pi},X=1,\boldsymbol{\zeta})}
{p(Z\mid\boldsymbol{\pi},X=0,\boldsymbol{\zeta})}
\,.
\end{eqnarray}
We recognise the left-hand side as the logit transform $\sigma^{-1}(\cdot)$ of $F(Z;\boldsymbol{\phi})$,
and hence the Bernoulli regression model takes the logistic form of
\begin{eqnarray}
\mu & ~=~ & F(Z;\boldsymbol{\phi}) ~=~\sigma(f(Z,\boldsymbol{\phi}))
\,,
\end{eqnarray}
with
\begin{eqnarray}
f(Z,\boldsymbol{\phi}) & ~\doteq~ &
\kappa(\theta) + \ln\frac{p(Z\mid\hat{\boldsymbol{\pi}}_1)}{p(Z\mid\hat{\boldsymbol{\pi}}_0)}
\,.
\end{eqnarray}
The correct form of regression on the covariates $Z$ now follows directly from the 
assumed marginal distribution $p(Z\mid\boldsymbol{\pi})$, as we demonstrate in the following sections.
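
As a quick sanity check of the logistic form, the sketch below compares $P(X=1\mid Z)$ computed directly from the joint factorisation $p(X)\,p(Z\mid X)$ with the value $\sigma\left(\kappa(\theta)+\ln\frac{p(Z\mid\hat{\boldsymbol{\pi}}_1)}{p(Z\mid\hat{\boldsymbol{\pi}}_0)}\right)$. For simplicity it assumes a single binary covariate; the function names and the numerical values of `theta`, `mu0` and `mu1` are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, the inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def bern_pmf(z, mu):
    """Bernoulli probability of z in {0, 1} with mean mu."""
    return mu if z == 1 else 1.0 - mu


def posterior_direct(z, theta, mu0, mu1):
    """P(X=1 | Z=z) from the joint p(X) p(Z | X), with Z | X=x ~ Bern(mu_x)."""
    numerator = theta * bern_pmf(z, mu1)
    return numerator / (numerator + (1.0 - theta) * bern_pmf(z, mu0))


def posterior_logistic(z, theta, mu0, mu1):
    """sigma(kappa(theta) + ln[p(z | mu1) / p(z | mu0)]) with kappa = logit(theta)."""
    log_ratio = math.log(bern_pmf(z, mu1)) - math.log(bern_pmf(z, mu0))
    return sigmoid(logit(theta) + log_ratio)


theta, mu0, mu1 = 0.3, 0.2, 0.6  # hypothetical parameter values
gaps = [abs(posterior_direct(z, theta, mu0, mu1) - posterior_logistic(z, theta, mu0, mu1))
        for z in (0, 1)]
print(gaps)
```

The two computations should agree to within floating-point rounding for both values of the covariate.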

### Bernoulli covariate

Suppose the covariate $Z$ is a binary variable that follows a 
[Bernoulli distribution](#Bernoulli-distribution "Section: Bernoulli distribution")
\begin{eqnarray}
p(Z=z\mid\mu) & ~\doteq~ & \mu^z\,(1-\mu)^{1-z}\,.
\end{eqnarray}
Then the corresponding features arise from
\begin{eqnarray}
\ln\frac{p(Z\mid\mu_1)}{p(Z\mid\mu_0)} & ~=~ & [\ln\mu_1-\ln\mu_0]\,Z+[\ln(1-\mu_1)-\ln(1-\mu_0)]\,(1-Z)
~=~\phi_0'+\phi_1\,Z\,.
\end{eqnarray}
Hence, the feature for a Bernoulli covariate $Z$ is just $Z$ itself.

### Categorical covariate

Suppose the covariate $Z$ takes a value from a finite set $\mathcal{Z}=\{z_1,z_2,\ldots,z_K\}$ of distinct values
with corresponding probabilities $\boldsymbol{\pi}=(\pi_1,\pi_2,\ldots,\pi_K)$. Then the categorical distribution
is
\begin{eqnarray}
p(Z=z_k\mid\boldsymbol{\pi}) & ~\doteq~ & \pi_k~=~\prod_{i=1}^{K}\pi_i^{\delta_{ik}}\,,
\end{eqnarray}
where $\delta_{ii}=1$ and $\delta_{ij}=0$ for $i\ne j$. The corresponding features then arise from
\begin{eqnarray}
\ln\frac{p(Z\mid\boldsymbol{\pi}_1)}{p(Z\mid\boldsymbol{\pi}_0)} & ~=~ &
\sum_{k=1}^{K}(\ln\pi_{k1}-\ln\pi_{k0})\,\delta(Z=z_k)
~=~\sum_{k=1}^{K}\phi_k\,\delta(Z=z_k)
\,.
\end{eqnarray}
Hence, the appropriate features for a categorical covariate $Z$ take the form of a binary indicator vector
$\boldsymbol{\delta}\doteq[\delta(Z=z_k)]_{k=1}^{K}$.
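
The indicator encoding can be sketched in a few lines of Python. The category labels, the two probability vectors and the function names below are hypothetical; the check simply confirms that the log-ratio equals the inner product of the coefficients $\phi_k=\ln\pi_{k1}-\ln\pi_{k0}$ with the indicator vector.

```python
import math


def indicator_features(z, categories):
    """Binary indicator vector [delta(Z = z_k)] for k = 1..K."""
    return [1.0 if z == c else 0.0 for c in categories]


categories = ["red", "green", "blue"]  # hypothetical category labels
pi_0 = [0.5, 0.3, 0.2]                 # p(Z | pi_0), assumed values
pi_1 = [0.2, 0.3, 0.5]                 # p(Z | pi_1), assumed values
phi = [math.log(p1) - math.log(p0) for p1, p0 in zip(pi_1, pi_0)]


def log_ratio_direct(z):
    """ln p(Z=z | pi_1) - ln p(Z=z | pi_0), read off the probability tables."""
    k = categories.index(z)
    return math.log(pi_1[k] / pi_0[k])


def log_ratio_linear(z):
    """The same quantity as a linear function of the indicator features."""
    delta = indicator_features(z, categories)
    return sum(p * d for p, d in zip(phi, delta))


max_err = max(abs(log_ratio_direct(z) - log_ratio_linear(z)) for z in categories)
print(indicator_features("green", categories), max_err)
```

Note that the $K$ indicators always sum to one, so when a constant term $\phi_0$ is also present in the regression function, one indicator is typically dropped to keep the parameters identifiable.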

### Beta covariate

Suppose the covariate $Z$ is a probability or proportion drawn from a 
[Beta distribution](#Beta-distribution "Section: Beta distribution")
\begin{eqnarray}
p(Z=z\mid\alpha,\beta) & ~\doteq~ & \frac{z^{\alpha-1}\,(1-z)^{\beta-1}}{B(\alpha,\beta)}\,.
\end{eqnarray}
Then we find that
\begin{eqnarray}
\ln\frac{p(Z\mid\alpha_1,\beta_1)}{p(Z\mid\alpha_0,\beta_0)}
& ~=~ &
(\alpha_1-\alpha_0)\ln Z+(\beta_1-\beta_0)\ln(1-Z)-\ln\frac{B(\alpha_1,\beta_1)}{B(\alpha_0,\beta_0)}
\\& ~=~ & \phi_0'+\phi_1\ln Z+\phi_2\ln(1-Z)
\,.
\end{eqnarray}
Hence, the appropriate features for a Beta covariate $Z$ are, in general, $\ln Z$ and $\ln(1-Z)$.
However, in the special case where we have theoretical reasons to suppose that $\phi_2=-\phi_1$, 
i.e. $\alpha_0+\beta_0=\alpha_1+\beta_1$, the two terms combine as $\phi_1\,[\ln Z-\ln(1-Z)]$,
and so we may use the single, logit-transformed feature $\sigma^{-1}(Z)$ instead.
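
The Beta result can be checked numerically. The sketch below, which uses only the Python standard library, evaluates the log-ratio directly and compares it with the linear form in $\ln Z$ and $\ln(1-Z)$. The parameter values are hypothetical, and are chosen such that $\alpha_0+\beta_0=\alpha_1+\beta_1$ in order to also illustrate the logit special case.

```python
import math

def log_beta_fn(a, b):
    """ln B(a, b), computed from log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_logpdf(z, a, b):
    """Log-density of the Beta distribution at z in (0, 1)."""
    return (a - 1) * math.log(z) + (b - 1) * math.log(1 - z) - log_beta_fn(a, b)

def log_ratio(z, params_0, params_1):
    """Direct evaluation of ln p(Z | alpha_1, beta_1) - ln p(Z | alpha_0, beta_0)."""
    return beta_logpdf(z, *params_1) - beta_logpdf(z, *params_0)

def linear_score(z, params_0, params_1):
    """The same quantity as phi_0' + phi_1 ln Z + phi_2 ln(1 - Z)."""
    (a0, b0), (a1, b1) = params_0, params_1
    phi_0 = log_beta_fn(a0, b0) - log_beta_fn(a1, b1)
    phi_1, phi_2 = a1 - a0, b1 - b0
    return phi_0 + phi_1 * math.log(z) + phi_2 * math.log(1 - z)

def logit(z):
    """Inverse of the logistic sigmoid."""
    return math.log(z / (1 - z))

# Hypothetical parameters with alpha_0 + beta_0 == alpha_1 + beta_1,
# so that phi_2 == -phi_1 == -1.5.
params_0, params_1 = (2.0, 3.0), (3.5, 1.5)
```

For these values, `log_ratio(z, params_0, params_1) - 1.5 * logit(z)` should not depend on `z`, which is the statement that a single logit feature suffices in the special case.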

### Gamma covariate

Suppose the covariate $Z$ represents either a generalised count or a non-negative measurement of some kind.
Assuming that $Z$ is drawn from a 
[Gamma distribution](#Gamma-distribution "Section: Gamma distribution")
\begin{eqnarray}
p(Z=z\mid\alpha,\beta) & ~\doteq~ & \frac{\beta^\alpha}{\Gamma(\alpha)}\,z^{\alpha-1}\,e^{-\beta z}
\,,
\end{eqnarray}
then we observe that
\begin{eqnarray}
\ln\frac{p(Z\mid\alpha_1,\beta_1)}{p(Z\mid\alpha_0,\beta_0)}
& ~=~ &
[\alpha_1\ln\beta_1-\alpha_0\ln\beta_0+\ln\Gamma(\alpha_0)-\ln\Gamma(\alpha_1)]
+(\alpha_1-\alpha_0)\,\ln Z+(\beta_0-\beta_1)\,Z
\\
& ~=~ & \phi_0'+\phi_1\,Z+\phi_2\,\ln Z
\,.
\end{eqnarray}
Hence, the appropriate features for a Gamma covariate $Z$ are both $Z$ and $\ln Z$.
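
One way to see that no further features are needed is to subtract the $Z$ and $\ln Z$ terms from the log-ratio and confirm that what remains is a constant. The following sketch does this for hypothetical parameter values; the function names are illustrative only.

```python
import math

def gamma_logpdf(z, alpha, beta):
    """Log-density of the Gamma distribution with shape alpha and rate beta."""
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1) * math.log(z) - beta * z)

def gamma_features(z):
    """Features suggested by the derivation: Z and ln Z."""
    return z, math.log(z)

# Hypothetical parameter values for the two conditional distributions
alpha_0, beta_0 = 2.0, 1.0
alpha_1, beta_1 = 3.0, 0.5

phi_1 = beta_0 - beta_1    # weight of Z
phi_2 = alpha_1 - alpha_0  # weight of ln Z

def residual(z):
    """Log-ratio minus the feature terms; this should equal the constant phi_0'."""
    f_1, f_2 = gamma_features(z)
    ratio = gamma_logpdf(z, alpha_1, beta_1) - gamma_logpdf(z, alpha_0, beta_0)
    return ratio - phi_1 * f_1 - phi_2 * f_2

residuals = [residual(z) for z in (0.1, 1.0, 2.5, 10.0)]
```

For these particular values the constant works out to $\phi_0'=3\ln\frac{1}{2}-\ln\Gamma(3)=-4\ln 2$.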

### Poisson covariate

Suppose the covariate $Z$ represents a count of the number of events that occurred in a fixed time-interval.
Assuming that $Z$ is drawn from a Poisson distribution
\begin{eqnarray}
p(Z=z\mid\lambda) & ~\doteq~ & e^{-\lambda}\,\frac{\lambda^z}{z!}\,,
\end{eqnarray}
then we observe that
\begin{eqnarray}
\ln\frac{p(Z\mid\lambda_1)}{p(Z\mid\lambda_0)} & ~=~ &
(\lambda_0-\lambda_1)+(\ln\lambda_1-\ln\lambda_0)\,Z~=~\phi_0'+\phi_1\,Z\,.
\end{eqnarray}
Hence, the appropriate feature for a Poisson covariate $Z$ is just $Z$ itself.

### Poisson regression

As another example of regression modelling using "*probabilistic reduction*", let us now consider that the response variate $X$ follows the Poisson distribution
\begin{eqnarray}
p(X=x\mid\mu) & ~\doteq~ & e^{-\mu}\frac{\mu^x}{x!}\,.
\end{eqnarray}
Choosing $x_1=x_0+1$, it therefore follows that
\begin{eqnarray}
\ln\frac{p(X=x_1\mid\mu)}{p(X=x_0\mid\mu)} & ~=~ &
\ln\mu-\ln(x_0+1)\,.
\end{eqnarray}
This suggests that the regression model should take the form
\begin{eqnarray}
\ln\hat{\mu} & ~\doteq~ & f(Z,\boldsymbol{\phi})\,.
\end{eqnarray}
This turns out to be a consistent choice for arbitrary $f:\mathcal{Z}\times\mathbb{R}^d\mapsto\mathbb{R}$, 
since the distribution is constrained to have mean $\mu\in(0,\infty)\Rightarrow\ln\mu\in\mathbb{R}$.
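
To make the log-link model concrete, here is a minimal sketch that fits $\ln\hat{\mu}=\phi_0+\phi_1 Z$ by Newton-Raphson ascent of the Poisson log-likelihood on simulated data. The linear choice of $f$, the scalar covariate, the simulated data, the "true" parameter values and the helper names are all assumptions made for illustration; this is not necessarily the update scheme used elsewhere in this appendix.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def fit_poisson_loglinear(zs, xs, n_iter=25):
    """Newton-Raphson for ln(mu) = phi_0 + phi_1 * z."""
    phi_0, phi_1 = math.log(sum(xs) / len(xs)), 0.0
    for _ in range(n_iter):
        g_0 = g_1 = h_00 = h_01 = h_11 = 0.0
        for z, x in zip(zs, xs):
            mu = math.exp(phi_0 + phi_1 * z)
            g_0 += x - mu         # gradient with respect to phi_0
            g_1 += (x - mu) * z   # gradient with respect to phi_1
            h_00 += mu            # entries of the negative Hessian
            h_01 += mu * z
            h_11 += mu * z * z
        det = h_00 * h_11 - h_01 * h_01
        phi_0 += (h_11 * g_0 - h_01 * g_1) / det
        phi_1 += (h_00 * g_1 - h_01 * g_0) / det
    return phi_0, phi_1

rng = random.Random(0)
true_phi_0, true_phi_1 = 0.5, 0.8  # hypothetical ground truth
zs = [rng.random() for _ in range(2000)]
xs = [sample_poisson(math.exp(true_phi_0 + true_phi_1 * z), rng) for z in zs]

phi0_hat, phi1_hat = fit_poisson_loglinear(zs, xs)
```

Because $\hat{\mu}=e^{f}$ is positive for every real value of $f$, no iterate can leave the valid parameter domain, which is the consistency property noted above.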

### Geometric regression

So far we have seen that "*probabilistic reduction*" appears to work well. However, we previously cautioned
that this might not always be the case. Let us now consider the geometric distribution
\begin{eqnarray}
p(X=x\mid\theta) & ~\doteq~ & (1-\theta)^{x-1}\,\theta\,,
\end{eqnarray}
where $\theta\in(0,1)$ is the probability of halting a sequence of independent trials, and $X$ is the number of
trials up to and including the halted trial. Again choosing $x_1=x_0+1$, we deduce that
\begin{eqnarray}
\ln\frac{p(X=x_1\mid\theta)}{p(X=x_0\mid\theta)} & ~=~ & \ln(1-\theta)\,.
\end{eqnarray}
This suggests a regression model of the form
\begin{eqnarray}
\ln(1-\hat{\theta}) & ~\doteq~ & f(Z,\boldsymbol{\phi})\,.
\end{eqnarray}
However, although the right-hand side may take any real value, the left-hand side is constrained to obey
$\hat{\theta}\in(0,1)\Rightarrow\ln(1-\hat{\theta})\in(-\infty,0)$.
This mismatch of domains indicates that our distributional assumptions do not hold.
In fact, given the constraint on $\theta$, it makes more sense to assume a regression model like
\begin{eqnarray}
\hat{\theta} & ~\doteq~ & \sigma(f(Z,\boldsymbol{\phi}))\,.
\end{eqnarray}
Now, since the mean number of trials is $\mu\doteq\mathbb{E}[X\mid\theta]=\frac{1}{\theta}$,
then, in terms of 
[generalised nonlinear modelling](#Generalised-nonlinear-models "Section: Generalised nonlinear models"),
this is equivalent to choosing a link parameter $\eta=\sigma^{-1}(\theta)$ with link function
$g(\mu)=\sigma^{-1}\left(\frac{1}{\mu}\right)=-\ln(\mu-1)$.
Consequently, we *hypothesise* that "*probabilistic reduction*" fails 
to specify a valid regression function when none of the natural parameters, or linear combinations thereof,
are suitable for the link parameter.
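
The domain mismatch, and the link function implied by the sigmoid model, can be illustrated with a few lines of Python. This is only a sketch; the function names are hypothetical.

```python
import math

def sigmoid(eta):
    """Logistic sigmoid, mapping the real line onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Inverse of the logistic sigmoid."""
    return math.log(p / (1.0 - p))

def theta_from_reduction(f):
    """Invert ln(1 - theta) = f; the result is a valid probability only when f < 0."""
    return 1.0 - math.exp(f)

def theta_from_sigmoid(f):
    """Sigmoid model theta = sigma(f); valid for every real f."""
    return sigmoid(f)

def link(mu):
    """Link function g(mu) = -ln(mu - 1), defined for mean mu > 1."""
    return -math.log(mu - 1.0)

def mean_from_eta(eta):
    """Inverse link: mu = 1 / sigma(eta) = 1 + exp(-eta)."""
    return 1.0 + math.exp(-eta)
```

For example, `theta_from_reduction(0.5)` is negative and therefore not a probability, whereas `theta_from_sigmoid(0.5)` lies strictly between 0 and 1.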

### Beta regression (again)

As another demonstration of potential issues with "*probabilistic reduction*", suppose that $X$ follows the
[Beta distribution](#Beta-distribution "Section: Beta distribution")
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~\doteq~ & \frac{x^{\alpha-1}\,(1-x)^{\beta-1}}{B(\alpha,\beta)}\,.
\end{eqnarray}
Choosing $x_1=1-x_0\in(0,1)$, it then follows that
\begin{eqnarray}
\ln\frac{p(X=x_1\mid\alpha,\beta)}{p(X=x_0\mid\alpha,\beta)}
& ~=~ &
(\alpha-1)\ln\frac{x_1}{x_0}+(\beta-1)\ln\frac{1-x_1}{1-x_0}
~=~\sigma^{-1}(x_1)\,(\alpha-\beta)\,.
\end{eqnarray}
This suggests a regression model of the form
\begin{eqnarray}
\hat{\alpha}-\hat{\beta} & ~\doteq~ & f(Z,\boldsymbol{\phi})\,.
\end{eqnarray}
Now, since $\alpha,\beta>0$, it follows that $\eta=\alpha-\beta\in\mathbb{R}$, and so this
model is consistent with the constraints.
Also note that since we usually take $\hat{\eta}=0$ to indicate that only minimal predictive information
is available from the covariates, this corresponds to initialising the distribution with
default parameter values such that $\hat{\alpha}=\hat{\beta}$.

The problem with this formulation lies with computing the separate estimates $\hat{\alpha}$ and $\hat{\beta}$, for which we require another parameter $\psi$ that is independent of the link parameter $\eta$.
However, if we choose $\beta$ as the independent parameter, then we cannot guarantee that $\hat{\alpha}=\hat{\eta}+\hat{\beta}>0$, since $\hat{\eta}$ may be large and negative.
Similarly, we cannot guarantee that $\hat{\beta}=\hat{\alpha}-\hat{\eta}>0$ if we choose $\alpha$ as the independent parameter, since $\hat{\eta}$ may be large and positive. This 
problem persists even if we take the independent parameter to be a function of both $\alpha$ and $\beta$.
For example, we might choose $\psi=\alpha+\beta$, from which we could recover the distributional parameters via
$\alpha=(\psi+\eta)/2$ and $\beta=(\psi-\eta)/2$.
However, once again we have the problem that $\hat{\eta}$ could take a very large positive or negative value,
such that we cannot guarantee that $\hat{\alpha},\hat{\beta}>0$.
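
The following sketch makes the constraint explicit for the choice $\psi=\alpha+\beta$: the recovered parameters are both positive only when $|\hat{\eta}|<\hat{\psi}$. The function names are hypothetical.

```python
def beta_params(eta, psi):
    """Recover (alpha, beta) from eta = alpha - beta and psi = alpha + beta."""
    return (psi + eta) / 2.0, (psi - eta) / 2.0

def is_valid(eta, psi):
    """True when both recovered parameters are positive, i.e. when |eta| < psi."""
    alpha, beta = beta_params(eta, psi)
    return alpha > 0.0 and beta > 0.0
```

For instance, `beta_params(1.0, 5.0)` gives `(3.0, 2.0)`, which is valid, whereas `beta_params(-6.0, 5.0)` gives a negative value for $\hat{\alpha}$.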

Despite these problems, this alternative approach can actually work in practice, provided that
the values of the regression model are never too extreme.
The required gradients of the log-likelihood are given by
\begin{eqnarray}
\nabla_\eta L & ~\doteq~ & \frac{\partial\alpha}{\partial\eta}\,\frac{\partial L}{\partial\alpha}
+\frac{\partial\beta}{\partial\eta}\,\frac{\partial L}{\partial\beta}
~=~\frac{1}{2}\left(Y_\alpha-\mu_{Y_\alpha}\right)
-\frac{1}{2}\left(Y_\beta-\mu_{Y_\beta}\right)\,,
\end{eqnarray}
and
\begin{eqnarray}
\nabla_\psi L & ~\doteq~ & \frac{\partial\alpha}{\partial\psi}\,\frac{\partial L}{\partial\alpha}
+\frac{\partial\beta}{\partial\psi}\,\frac{\partial L}{\partial\beta}
~=~\frac{1}{2}\left(Y_\alpha-\mu_{Y_\alpha}\right)
+\frac{1}{2}\left(Y_\beta-\mu_{Y_\beta}\right)\,,
\end{eqnarray}
respectively.
The corresponding variates are therefore
\begin{eqnarray}
Y_\eta & ~\doteq~ & \frac{1}{2}Y_\alpha-\frac{1}{2}Y_\beta~=~\frac{1}{2}\ln X-\frac{1}{2}\ln(1-X)\,,
\end{eqnarray}
and
\begin{eqnarray}
Y_\psi & ~\doteq~ & \frac{1}{2}Y_\alpha+\frac{1}{2}Y_\beta~=~\frac{1}{2}\ln X+\frac{1}{2}\ln(1-X)\,,
\end{eqnarray}
respectively, with corresponding variances given by
\begin{eqnarray}
\sigma^2_{Y_\eta} & ~\doteq~ & \mathtt{Var}\left[Y_\eta\mid\alpha,\beta\,\right]~=~
\frac{1}{4}\sigma^2_{Y_\alpha}+\frac{1}{4}\sigma^2_{Y_\beta}-\frac{1}{2}\sigma_{Y_\alpha,Y_\beta}
\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_{Y_\psi} & ~\doteq~ & \mathtt{Var}\left[Y_\psi\mid\alpha,\beta\,\right]~=~
\frac{1}{4}\sigma^2_{Y_\alpha}+\frac{1}{4}\sigma^2_{Y_\beta}+\frac{1}{2}\sigma_{Y_\alpha,Y_\beta}
\,,
\end{eqnarray}
respectively. The covariance is then given by
\begin{eqnarray}
\sigma_{Y_\psi,Y_\eta} & ~\doteq~ & \mathtt{Cov}\left[Y_\psi,Y_\eta\mid\alpha,\beta\,\right]~=~
\frac{1}{4}\sigma^2_{Y_\alpha}-\frac{1}{4}\sigma^2_{Y_\beta}
\,.
\end{eqnarray}
These values, in addition to the respective means of the variates, are sufficient for
parameter estimation. Whether or not the model converges in practice depends upon the observed data
and the settings of the [parameter update](#Generalised-nonlinear-models "Section: Generalised nonlinear models")
scheme.
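
As a sanity check on the variance and covariance decompositions above, the following sketch draws samples from a Beta distribution and confirms that the same identities hold for the sample moments of $Y_\eta$ and $Y_\psi$. The parameter values, sample size and helper name are hypothetical, and the check uses empirical moments (with $1/n$ normalisation) rather than the closed-form Beta moments.

```python
import math
import random

def sample_cov(us, vs):
    """Sample covariance (1/n normalisation) of two equal-length sequences."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / n

rng = random.Random(0)
alpha, beta = 2.0, 3.0  # hypothetical parameter values
xs = [rng.betavariate(alpha, beta) for _ in range(5000)]

y_alpha = [math.log(x) for x in xs]      # Y_alpha = ln X
y_beta = [math.log(1.0 - x) for x in xs]  # Y_beta = ln(1 - X)
y_eta = [0.5 * a - 0.5 * b for a, b in zip(y_alpha, y_beta)]
y_psi = [0.5 * a + 0.5 * b for a, b in zip(y_alpha, y_beta)]

var_a = sample_cov(y_alpha, y_alpha)
var_b = sample_cov(y_beta, y_beta)
cov_ab = sample_cov(y_alpha, y_beta)

var_eta = sample_cov(y_eta, y_eta)       # compare: var_a/4 + var_b/4 - cov_ab/2
var_psi = sample_cov(y_psi, y_psi)       # compare: var_a/4 + var_b/4 + cov_ab/2
cov_psi_eta = sample_cov(y_psi, y_eta)   # compare: var_a/4 - var_b/4
```

Since the identities are purely algebraic, they hold for the sample moments up to floating-point rounding, regardless of the sample size.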

## Citations

[1] J. A. Nelder and R. W. M. Wedderburn (1972), "*Generalized Linear Models*", J. Royal Stat. Soc. Series A, Vol. 135, No. 3, pp. 370-384.

[2] M. G. Kendall and A. Stuart (1967), "*The Advanced Theory of Statistics*", 2nd ed., Vol. 2.

[3] R. Kieschnick and B. D. McCullough (2003), "*Regression analysis of variates observed on $(0, 1)$*", Statistical Modelling 3(3):193-213. [[PDF]](https://journals.sagepub.com/doi/10.1191/1471082X03st053oa "journals.sagepub.com")

[4] J. S. Bergtold, A. Spanos and E. Onukwugha (2010), "*Bernoulli Regression Models: Revisiting the Specification of Statistical Models with Binary Dependent Variables*", J. Choice Modelling 3(2), pp. 1-28.