# Appendix C: Regression Models

The purpose of this notebook is to provide a quick introduction to the mechanics of generalised linear models (GLMs) for regression,
first introduced by Nelder and Wedderburn [[1]](#Citations "Citation [1]: Generalized Linear Models"). We extend the notion to generalised nonlinear models.

## Introduction

### Probability distribution functions

We consider a multi-dimensional or scalar (i.e. uni-dimensional) variate $X$ on domain $\mathcal{X}$. Let $X$ have an underlying probability distribution function (PDF) $p$ parameterised by $\boldsymbol{\theta}$, satisfying non-negativity, i.e.
\begin{eqnarray}
p(\mathbf{x}\mid\boldsymbol{\theta})~\ge~0 && \forall\mathbf{x}\in\mathcal{X}\,,
\end{eqnarray}
and a total probability of unity, i.e.
\begin{eqnarray}
\int_\mathcal{X}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}| & = & 1\,,
\end{eqnarray}
where $|d\mathbf{x}|$ is taken to be an infinitesimal volume or length in $\mathcal{X}$. Note that for discrete variates the constraint
is instead
\begin{eqnarray}
\sum_{\mathbf{x}\in\mathcal{X}}p(\mathbf{x}\mid\boldsymbol{\theta}) & = & 1\,.
\end{eqnarray}
We shall henceforth consider only continuous variates for convenience, but the resulting derivations will also hold 
in the discrete case by replacing integration with summation.

When considered as general functions, PDFs have a variety of properties and constraints. For instance, for a continuous distribution, the integral represents the area under the curve. As a consequence of the non-negativity constraint, this places limits on the values of $p(\mathbf{x}\mid\boldsymbol{\theta})$. For example, if the domain $\mathcal{X}$ has no finite upper bound, then $p(\mathbf{x}\mid\boldsymbol{\theta}) \rightarrow 0$ as $\mathbf{x}\rightarrow\infty$. Similarly,
$p(\mathbf{x}\mid\boldsymbol{\theta}) \rightarrow 0$ as $\mathbf{x}\rightarrow -\infty$ if $\mathcal{X}$
does not have a finite lower bound. Similar conditions hold for spatial derivatives with respect to $\mathbf{x}$ at the extremes.

Consider now derivatives with respect to the parameter $\boldsymbol{\theta}$, denoted by the gradient vector operator
$\boldsymbol{\nabla}_\boldsymbol{\theta}\doteq\frac{\partial}{\partial\boldsymbol{\theta}}$, which we take to be a column vector. Similarly, second derivatives are denoted by the *Hessian* matrix operator
$\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}\doteq\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{T}}$.
[Later](#Parameter-transformations "Section: Parameter transformations") 
we shall also require the first and second derivatives with respect to some transformation
$\boldsymbol{\eta}(\boldsymbol{\theta})$ of the parameters, denoted
$\boldsymbol{\nabla}_\boldsymbol{\eta}$ and 
$\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}$, respectively.
In general, unless we specifically need to distinguish between these two cases, we may drop the subscript and assume the results hold for all parameterisations.

Now, considering the total probability constraint above, since the derivatives are with respect to
$\boldsymbol{\theta}$ or $\boldsymbol{\eta}(\boldsymbol{\theta})$ and not $\mathbf{x}$, it follows that
taking first derivatives of both sides gives
\begin{eqnarray}
\boldsymbol{\nabla}\int_\mathcal{X}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}| & = &
\int_\mathcal{X}\boldsymbol{\nabla} p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~\mathbf{0}\,,
\end{eqnarray}
and taking second derivatives gives
\begin{eqnarray}
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}\int_\mathcal{X}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}| & = &
\int_\mathcal{X}\boldsymbol{\nabla}\boldsymbol{\nabla}^{T} p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~\mathbf{O}\,.
\end{eqnarray}
We shall use these results in the
[next](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") section.
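As a quick numerical sanity check of these two identities, the following sketch integrates the parameter derivatives of an exponential density $p(x\mid\theta)=\theta\,e^{-\theta x}$ on $x\ge 0$. The choice of distribution, the rate value and the helper `integrate` are assumptions made purely for illustration.

```python
import math

def integrate(func, lower, upper, steps=100_000):
    """Midpoint-rule quadrature of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))

theta = 1.7  # hypothetical rate parameter

def pdf(x):
    """Exponential density p(x | theta) on x >= 0."""
    return theta * math.exp(-theta * x)

def d_pdf(x):
    """First derivative of the density with respect to theta."""
    return (1.0 - theta * x) * math.exp(-theta * x)

def d2_pdf(x):
    """Second derivative of the density with respect to theta."""
    return (theta * x * x - 2.0 * x) * math.exp(-theta * x)

upper = 60.0 / theta  # the tail beyond this point is negligible
total = integrate(pdf, 0.0, upper)      # should be close to 1
first = integrate(d_pdf, 0.0, upper)    # should be close to 0
second = integrate(d2_pdf, 0.0, upper)  # should be close to 0
print(total, first, second)
```

Up to quadrature error, the density integrates to one while both derivative integrals vanish.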

Note that there are various fields of stochastic modelling, such as Maximum Entropy, that use the general properties of PDFs to construct specific PDFs satisfying given theoretical or practical requirements. For the rest of this document, however, we shall assume that the form of the PDF has been specified in advance. The properties we require then relate instead to fitting the PDF to observed data via regression modelling.

### Expectations and log-likelihoods

We assume the 
[*law of the unconscious statistician*](https://en.wikipedia.org/wiki/Law_of_the_unconscious_statistician "en.wikipedia.org"), and take the expectation of an arbitrary function $\mathbf{f}(X, \boldsymbol{\theta})$ to be given by
\begin{eqnarray}
\mathbb{E}_X\left[\mathbf{f}(X, \boldsymbol{\theta})\mid\boldsymbol{\theta}\right] & \doteq & 
\int_\mathcal{X}\mathbf{f}(\mathbf{x}, \boldsymbol{\theta})\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\,.
\end{eqnarray}
Note that in general we may drop the subscript when it is clear with respect to which variate we are taking the expectation.
Also note that since the expectation is the weighted mean of the function $\mathbf{f}$, we often denote this for convenience  as
\begin{eqnarray}
\boldsymbol{\mu}_\mathbf{f}(\boldsymbol{\theta}) & \doteq & \mathbb{E}\left[\mathbf{f}\mid\boldsymbol{\theta}\right]\,.
\end{eqnarray}
Finally, note that function $\mathbf{f}$ may in general be scalar, vector, matrix or even tensor valued.
However, we henceforth typically assume, without loss of generality, some scalar function $f$ (unless otherwise stated), since the expectation of a vector is a vector of scalar expectations, and likewise for matrices and tensors.

Thus, taking the gradient of the expectation, we see that
\begin{eqnarray}
\boldsymbol{\nabla}\,\mathbb{E}\left[f\mid\boldsymbol{\theta}\right] & = & 
\int_\mathcal{X}\boldsymbol{\nabla}f(\mathbf{x}, \boldsymbol{\theta})\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
+
\int_\mathcal{X}f(\mathbf{x}, \boldsymbol{\theta})\,\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\& = &
\int_\mathcal{X}\boldsymbol{\nabla}f(\mathbf{x}, \boldsymbol{\theta})\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
+
\int_\mathcal{X}f(\mathbf{x}, \boldsymbol{\theta})\,\boldsymbol{\nabla}\ln p(\mathbf{x}\mid\boldsymbol{\theta})\,
p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\& = &
\mathbb{E}\left[\boldsymbol{\nabla}f\mid\boldsymbol{\theta}\right]
+
\mathbb{E}\left[f\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
where, for convenience, we have defined the log-likelihood $L$ as
\begin{eqnarray}
L(\boldsymbol{\theta}; X) & \doteq & \ln p(X\mid\boldsymbol{\theta})\,.
\end{eqnarray}
Note that if we instead used a vector function $\mathbf{f}$, then we would have a choice of either the scalar gradient
$\boldsymbol{\nabla}^T\mathbf{f}$ or the matrix gradient $\boldsymbol{\nabla}\mathbf{f}^T$.
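
As a worked check of this identity (the distribution and function below are our own illustrative choices, not part of the derivation), take a scalar variate $X\sim\mathcal{N}(\mu,1)$ with parameter $\theta=\mu$, and let $f(X,\mu)=\mu X^2$. Then $\frac{\partial L}{\partial\mu}=X-\mu$, and using the moments $\mathbb{E}\left[X^2\right]=\mu^2+1$ and $\mathbb{E}\left[X^3\right]=\mu^3+3\mu$ we obtain
\begin{eqnarray}
\mathbb{E}\left[\frac{\partial f}{\partial\mu}\mid\mu\right]
+\mathbb{E}\left[f\,\frac{\partial L}{\partial\mu}\mid\mu\right]
& = &
\mathbb{E}\left[X^2\right]+\mu\,\mathbb{E}\left[X^3-\mu X^2\right]
\\& = &
(\mu^2+1)+\mu\,(\mu^3+3\mu-\mu^3-\mu)
~=~3\mu^2+1\,,
\end{eqnarray}
which agrees with differentiating $\mathbb{E}\left[f\mid\mu\right]=\mu\,(\mu^2+1)$ directly.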

Suppose now, as an example, that we take the constant function 
$f\equiv 1~\Rightarrow\boldsymbol{\nabla}f\equiv\mathbf{0}$.
Then we immediately deduce that
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right] & = & \mathbf{0}\,.
\end{eqnarray}
This is one of the useful
 results from Kendall and Stuart [[2]](#Citations "Citation [2]: The Advanced Theory of Statistics").
It is of interest to derive this result directly.
We begin by taking the gradient of $L$, namely
\begin{eqnarray}
\boldsymbol{\nabla}L & = & 
\boldsymbol{\nabla}\ln p(\mathbf{x}\mid\boldsymbol{\theta})
~=~\frac{\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,.
\end{eqnarray}
Hence, taking the expectation of the gradient, we obtain
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right] & = &
\int_\mathcal{X}
\frac{\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~
\int_\mathcal{X}\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
~=~\mathbf{0}\,,
\end{eqnarray}
as before. The last part follows from the
[previous](#Probability-distribution-functions "Section: Probability distribution functions") section.

Similarly, taking the Hessian of $L$ gives
\begin{eqnarray}
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L & = & 
\boldsymbol{\nabla}\left\{
\frac{\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\right\}
~=~\frac{
p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
-\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
}
{p(\mathbf{x}\mid\boldsymbol{\theta})^2}
\,.
\end{eqnarray}
Hence, taking the expectation results in
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right] & = &
\int_\mathcal{X}
\frac{
p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
-\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})\,\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})
}
{p(\mathbf{x}\mid\boldsymbol{\theta})^2}
\,
p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\
& = & 
\int_\mathcal{X}\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
- \int_\mathcal{X}
\frac{\boldsymbol{\nabla}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,\frac{\boldsymbol{\nabla}^{T}p(\mathbf{x}\mid\boldsymbol{\theta})}{p(\mathbf{x}\mid\boldsymbol{\theta})}
\,p(\mathbf{x}\mid\boldsymbol{\theta})\,|d\mathbf{x}|
\\
& = &
\mathbf{O}-\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
~=~-\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
which again follows from the [previous](#Probability-distribution-functions "Section: Probability distribution functions") section. We shall require these results later.
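
Both results, the zero-mean gradient and the Hessian identity, can be checked numerically. The sketch below does so for a Gaussian with parameters $\boldsymbol{\theta}=(\mu,\sigma)$; the distribution, the parameter values and the quadrature helper `expect` are illustrative assumptions rather than part of the derivation.

```python
import math

mu, sigma = 0.5, 1.3  # hypothetical parameter values

def pdf(x):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def score(x):
    """Gradient of the log-likelihood with respect to (mu, sigma)."""
    d = x - mu
    return [d / sigma**2, -1.0 / sigma + d * d / sigma**3]

def hessian(x):
    """Hessian of the log-likelihood with respect to (mu, sigma)."""
    d = x - mu
    off = -2.0 * d / sigma**3
    return [[-1.0 / sigma**2, off],
            [off, 1.0 / sigma**2 - 3.0 * d * d / sigma**4]]

def expect(func, steps=20_000):
    """Expectation of func(X) by midpoint quadrature over mu +/- 12 sigma."""
    lower, upper = mu - 12.0 * sigma, mu + 12.0 * sigma
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width
        total += func(x) * pdf(x)
    return total * width

mean_score = [expect(lambda x, i=i: score(x)[i]) for i in range(2)]
mean_hessian = [[expect(lambda x, i=i, j=j: hessian(x)[i][j])
                 for j in range(2)] for i in range(2)]
mean_outer = [[expect(lambda x, i=i, j=j: score(x)[i] * score(x)[j])
               for j in range(2)] for i in range(2)]
print(mean_score, mean_hessian, mean_outer)
```

The expected gradient vanishes, and the expected Hessian is the negative of the expected outer product of the gradient, here $\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$.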

Finally, we note that the expected value of the log-likelihood itself is given by
\begin{eqnarray}
\mathbb{E}\left[L\mid\boldsymbol{\theta}\right] & = & 
\int_\mathcal{X} p(\mathbf{x}\mid\boldsymbol{\theta})\,\ln p(\mathbf{x}\mid\boldsymbol{\theta})
\,|d\mathbf{x}|~\doteq~ -H(X\mid\boldsymbol{\theta})
\,,
\end{eqnarray}
where $H(X\mid\boldsymbol{\theta})$ is just the information-theoretic entropy of the distribution measured in *nats*. Consequently, we obtain
\begin{eqnarray}
\boldsymbol{\nabla}H(X\mid\boldsymbol{\theta}) & = &
-\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
-
\mathbb{E}\left[L\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
~=~ -\mathbb{E}\left[L\,\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
which follows from the derivation above.
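
For instance, consider the exponential distribution $p(x\mid\theta)=\theta\,e^{-\theta x}$ on $x\ge 0$ (our illustrative choice), for which $L=\ln\theta-\theta X$ and $\frac{\partial L}{\partial\theta}=\frac{1}{\theta}-X$, with moments $\mathbb{E}\left[X\right]=\frac{1}{\theta}$ and $\mathbb{E}\left[X^2\right]=\frac{2}{\theta^2}$. Then
\begin{eqnarray}
-\mathbb{E}\left[L\,\frac{\partial L}{\partial\theta}\mid\theta\right] & = &
-\ln\theta\;\mathbb{E}\left[\frac{1}{\theta}-X\right]
+\theta\,\mathbb{E}\left[\frac{X}{\theta}-X^2\right]
~=~ 0+\theta\left(\frac{1}{\theta^2}-\frac{2}{\theta^2}\right)
~=~ -\frac{1}{\theta}\,,
\end{eqnarray}
which matches the derivative of the known entropy $H(X\mid\theta)=1-\ln\theta$.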

### Parameter transformations

Suppose now that we wish to take derivatives, not with respect to the 
[PDF](#Probability-distribution-functions "Section: Probability distribution functions") 
parameters $\boldsymbol{\theta}$, but with respect to some other reparameterisation, say $\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$.
For this purpose, we consider the chain rules, namely that
\begin{eqnarray}
\frac{\partial}{\partial\boldsymbol{\theta}}~=~
\frac{\partial\boldsymbol{\eta}^T}{\partial\boldsymbol{\theta}}\frac{\partial}{\partial\boldsymbol{\eta}}\,,
& \;\;\;\mbox{and}~ &
\frac{\partial}{\partial\boldsymbol{\eta}}~=~
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}\frac{\partial}{\partial\boldsymbol{\theta}}\,.
\end{eqnarray}
For convenience, we define 
$\mathbf{J}_\boldsymbol{\eta}\doteq\frac{\partial\boldsymbol{\eta}^T}{\partial\boldsymbol{\theta}}$
to be the *Jacobian* matrix of the transformation $\boldsymbol{\eta}(\boldsymbol{\theta})$. It then follows from the first chain rule that the gradients are related via
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\theta}~=~\mathbf{J}_\boldsymbol{\eta}\,\boldsymbol{\nabla}_\boldsymbol{\eta}
& ~\Rightarrow~ &
\boldsymbol{\nabla}_\boldsymbol{\eta}~=~\mathbf{J}_\boldsymbol{\eta}^{-1}\,\boldsymbol{\nabla}_\boldsymbol{\theta}\,,
\end{eqnarray}
and thus we deduce from the second chain rule that $\mathbf{J}_\boldsymbol{\eta}^{-1}\doteq\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}$. Note that $\mathbf{J}_\boldsymbol{\eta}$ and $\mathbf{J}_\boldsymbol{\eta}^{-1}$ are only truly matrix inverses of each other if both
$\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ have the same dimensions, otherwise we shall treat them symbolically.

The relationship between the Hessians is more involved. Firstly, we take the transpose of the second chain rule  to obtain
\begin{eqnarray}
\frac{\partial}{\partial\boldsymbol{\eta}^T} & = &
\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}\,,
\end{eqnarray}
where the symbol "$\cdot$" indicates the position of the argument.
Next, we apply the second chain rule directly to this result, thereby obtaining
\begin{eqnarray}
\frac{\partial^2}{\partial\boldsymbol{\eta}\,\partial\boldsymbol{\eta}^T} & = &
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}\frac{\partial}{\partial\boldsymbol{\theta}}
\left\{
\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
\right\}
\\& = &
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}
\left\{
\frac{\partial^2\,\cdot}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
+\left(\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial}{\partial\boldsymbol{\theta}}\right)
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
\right\}\,.
\end{eqnarray}
Finally, we replace the last derivative
$\frac{\partial}{\partial\boldsymbol{\theta}}$
via the first chain rule, to obtain
\begin{eqnarray}
\frac{\partial^2}{\partial\boldsymbol{\eta}\,\partial\boldsymbol{\eta}^T} & = &
\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}
\left\{
\frac{\partial^2\,\cdot}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^T}
+\left(\frac{\partial\,\cdot}{\partial\boldsymbol{\theta}^T}\,
\frac{\partial\boldsymbol{\eta}^T}{\partial\boldsymbol{\theta}}\right)\odot
\frac{\partial^2\boldsymbol{\theta}}{\partial\boldsymbol{\eta}\,\partial\boldsymbol{\eta}^T}
\right\}
\,.
\end{eqnarray}
Note that the last term is the dot product "$\odot$" of a (row) vector with a *tensor*, i.e. a column "vector" in which each element is 
itself a matrix. 
Consequently, in terms of the gradient operators and Jacobian matrices, we obtain
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T} & = &
\mathbf{J}_\boldsymbol{\eta}^{-1}\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}(\cdot)\,
\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}
+
\mathbf{J}_\boldsymbol{\eta}^{-1}\left(\boldsymbol{\nabla}_\boldsymbol{\theta}^T(\cdot)\,
\mathbf{J}_\boldsymbol{\eta}\right)\odot
\boldsymbol{\nabla}_\boldsymbol{\eta}\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}\,.
\end{eqnarray}
Also note that various specialisations of this relationship occur depending upon both the parameters and the reparameterisation.
For example, in the case of scalar $\theta$ and scalar $\eta$, the identity simplifies to
\begin{eqnarray}
\frac{\partial^2}{\partial\eta^2} & = & \left(\frac{\partial\theta}{\partial\eta}\right)^2
\frac{\partial^2}{\partial\theta^2} +
\frac{\partial^2\theta}{\partial\eta^2}\,\frac{\partial}{\partial\theta}\,.
\end{eqnarray}
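
The scalar identity is easy to verify numerically. The following sketch assumes, purely for illustration, the transformation $\theta=e^{\eta}$ and the test function $f(\theta)=\sin\theta$, and compares a finite-difference second derivative in $\eta$ against the right-hand side of the identity.

```python
import math

def f(theta):
    """Hypothetical test function of the original parameter."""
    return math.sin(theta)

def df(theta):
    return math.cos(theta)

def d2f(theta):
    return -math.sin(theta)

def theta_of(eta):
    """Hypothetical reparameterisation theta = exp(eta)."""
    return math.exp(eta)

def g(eta):
    """The test function viewed as a function of eta."""
    return f(theta_of(eta))

eta, step = 0.3, 1e-4
theta = theta_of(eta)
dtheta_deta = math.exp(eta)    # first derivative of theta w.r.t. eta
d2theta_deta2 = math.exp(eta)  # second derivative of theta w.r.t. eta

# Left-hand side: central finite difference of the second derivative in eta.
lhs = (g(eta + step) - 2.0 * g(eta) + g(eta - step)) / step**2
# Right-hand side: the scalar chain-rule identity.
rhs = dtheta_deta**2 * d2f(theta) + d2theta_deta2 * df(theta)
print(lhs, rhs)
```

The two values agree to within the finite-difference error.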

Now, the general identity is cumbersome to apply in practice.
However, it simplifies considerably in expectation when we specifically consider the log-likelihood $L$, for which the identity reads
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L & = &
\mathbf{J}_\boldsymbol{\eta}^{-1}\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}
+
\mathbf{J}_\boldsymbol{\eta}^{-1}\left(\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L\,
\mathbf{J}_\boldsymbol{\eta}\right)\odot
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\boldsymbol{\theta}\,.
\end{eqnarray}
In particular, taking the expectation of both sides, we obtain
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L
\mid\boldsymbol{\theta}\right]
& = &
\mathbf{J}_\boldsymbol{\eta}^{-1}\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\theta}\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\mid\boldsymbol{\theta}\right]\,
\left[\mathbf{J}_\boldsymbol{\eta}^{-1}\right]^{T}
+
\mathbf{J}_\boldsymbol{\eta}^{-1}\left(\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\mid\boldsymbol{\theta}\right]\,\mathbf{J}_\boldsymbol{\eta}\right)\odot
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\boldsymbol{\theta}
\\&=&
-\frac{\partial\boldsymbol{\theta}^T}{\partial\boldsymbol{\eta}}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\theta}L\,
\boldsymbol{\nabla}_\boldsymbol{\theta}^{T}L
\mid\boldsymbol{\theta}\right]\,
\frac{\partial\boldsymbol{\theta}}{\partial\boldsymbol{\eta}^{T}}
~=~ -\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
which follows from a [previous](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") section,
where we derived that $\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]=\mathbf{0}$
and
$\mathbb{E}\left[\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]=
-\mathbb{E}\left[\boldsymbol{\nabla}L\,
\boldsymbol{\nabla}^{T}L
\mid\boldsymbol{\theta}\right]$.
This expected Hessian will be used as an approximation to the Hessian of the log-likelihood in the
[next](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") section.
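
As a small worked example (again using an exponential distribution as our own illustrative choice), let $p(x\mid\theta)=\theta\,e^{-\theta x}$ with the reparameterisation $\eta=\ln\theta$, so that $\frac{\partial\theta}{\partial\eta}=\theta$. In terms of $\theta$ we have $\frac{\partial L}{\partial\theta}=\frac{1}{\theta}-X$ and $\frac{\partial^2 L}{\partial\theta^2}=-\frac{1}{\theta^2}$, whereas in terms of $\eta$ we have $L=\eta-e^{\eta}X$, giving $\frac{\partial L}{\partial\eta}=1-\theta X$ and $\frac{\partial^2 L}{\partial\eta^2}=-\theta X$. Hence
\begin{eqnarray}
\mathbb{E}\left[\frac{\partial^2 L}{\partial\eta^2}\mid\theta\right]
~=~ -\theta\,\mathbb{E}\left[X\right] ~=~ -1
& \;\;\mbox{and}\;\; &
\left(\frac{\partial\theta}{\partial\eta}\right)^2
\mathbb{E}\left[\frac{\partial^2 L}{\partial\theta^2}\mid\theta\right]
~=~ \theta^2\left(-\frac{1}{\theta^2}\right) ~=~ -1\,,
\end{eqnarray}
while $-\mathbb{E}\left[\left(\frac{\partial L}{\partial\eta}\right)^2\mid\theta\right]=-\theta^2\,\texttt{Var}\left[X\mid\theta\right]=-1$, in agreement with the result above. Note that the Hessian in $\eta$ depends on the data, whereas its expectation does not.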

### Maximum likelihood estimation

Consider a stochastic sampling process that produces a sequence of (arbitrary) $n$ independent, identically distributed variables,
$X_1, X_2, \ldots, X_n$. Then the sample average is defined as
\begin{eqnarray}
\langle X\rangle ~\doteq~
\frac{1}{n}\sum_{i=1}^{n}X_i\,.
\end{eqnarray}
More generally, the sample average of an arbitrary function $\mathbf{f}(X,\ldots)$ is given by
\begin{eqnarray}
\langle\mathbf{f}\rangle(\ldots) & \doteq &
\frac{1}{n}\sum_{i=1}^{n}\mathbf{f}( X_i,\ldots)\,,
\end{eqnarray}
where the ellipsis "$\ldots$" represents arbitrary parameters that do not vary with $X$.
Due to the linearity of the operator $\langle\cdot\rangle$, we henceforth typically consider a scalar function $f$
(unless otherwise stated), since the sample mean of a vector is a vector of scalar sample means, and likewise for
matrices and tensors.
Similarly, it also follows from the linearity of the various other operators
that the parameter gradient and Hessian of the sample mean obey
\begin{eqnarray}
\boldsymbol{\nabla}\langle f\rangle ~=~
\left\langle\boldsymbol{\nabla}f\right\rangle &\;\;\mbox{and}\;&
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}\langle f\rangle ~=~
\left\langle\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}f\right\rangle\,,
\end{eqnarray}
respectively, and the expectation obeys
\begin{eqnarray}
\mathbb{E}\left[\langle f\rangle\mid\boldsymbol{\theta}\right] & = &
\left\langle\mathbb{E}\left[f\mid\boldsymbol{\theta}\right]\right\rangle
~=~\mathbb{E}\left[f\mid\boldsymbol{\theta}\right]
\,,
\end{eqnarray}
since the variates $X_i$ are here assumed to be independent and identically distributed (IID).
We shall relax this last restriction [later](#Regression-modelling "Section: Regression modelling").

We therefore see that the sample-mean log-likelihood $\langle L\rangle$ satisfies
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}\langle L\rangle\mid\boldsymbol{\theta}\right] & = &
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right]\right\rangle
~=~\mathbf{0}
\,,
\end{eqnarray}
with the last result obtained from a [previous](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") section.
This result motivates the maximum likelihood approach, which is to determine the estimate $\hat{\boldsymbol{\theta}}_\texttt{ML}$ that satisfies
$\langle\boldsymbol{\nabla}L\rangle(\hat{\boldsymbol{\theta}}_\texttt{ML})=\mathbf{0}$,
if such a solution exists.
Provided that $\langle L\rangle$ is locally concave there, i.e. its Hessian is negative definite,
$\langle L\rangle(\hat{\boldsymbol{\theta}}_\texttt{ML})$ is a local maximum.

The maximum-likelihood parameters
$\hat{\boldsymbol{\theta}}_\texttt{ML}$ themselves are usually found iteratively via an update scheme of the form
\begin{eqnarray}
\boldsymbol{\theta}' & = & \boldsymbol{\theta}+\Delta\boldsymbol{\theta}\,,
\end{eqnarray}
where all requisite quantities are evaluated at the current estimate $\boldsymbol{\theta}$ of the parameters.
Then the parameter increment $\Delta\boldsymbol{\theta}$ itself is usually computed either via
a direct gradient method, e.g. gradient ascent
\begin{eqnarray}
\Delta\boldsymbol{\theta} & \doteq & \rho\,\boldsymbol{\nabla}\langle L\rangle\,,
\end{eqnarray}
or via a modified gradient method, e.g. the Newton-Raphson method
\begin{eqnarray}
\Delta\boldsymbol{\theta} & \doteq & 
-\left[\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}\langle L\rangle\right]^{-1}\,
\boldsymbol{\nabla}\langle L\rangle
\,.
\end{eqnarray}
Note that, in either case, the iterations will halt when 
$\Delta\boldsymbol{\theta}=\mathbf{0}$, which occurs when
the gradient of the sample-mean log-likelihood vanishes.
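
The two update schemes can be sketched as follows for an exponential distribution $p(x\mid\theta)=\theta\,e^{-\theta x}$, for which $\langle L\rangle=\ln\theta-\theta\langle X\rangle$. The data, the starting value, the learning rate $\rho$ and the iteration counts are all hypothetical choices made for illustration.

```python
data = [0.5, 1.5, 2.0, 4.0]  # hypothetical observations with sample mean 2.0
sample_mean = sum(data) / len(data)

def mean_gradient(theta):
    """Gradient of the sample-mean log-likelihood ln(theta) - theta * <X>."""
    return 1.0 / theta - sample_mean

def mean_hessian(theta):
    """Hessian (second derivative) of the sample-mean log-likelihood."""
    return -1.0 / theta**2

def gradient_ascent(theta, rate=0.05, iterations=200):
    """Repeatedly apply the update theta += rate * gradient."""
    for _ in range(iterations):
        theta += rate * mean_gradient(theta)
    return theta

def newton_raphson(theta, iterations=20):
    """Repeatedly apply the update theta -= gradient / Hessian."""
    for _ in range(iterations):
        theta -= mean_gradient(theta) / mean_hessian(theta)
    return theta

theta_ga = gradient_ascent(0.2)
theta_nr = newton_raphson(0.2)
print(theta_ga, theta_nr, 1.0 / sample_mean)
```

Both schemes approach the closed-form solution $\hat{\theta}_\texttt{ML}=1/\langle X\rangle$, at which the gradient of $\langle L\rangle$ vanishes. In this example Newton-Raphson needs far fewer iterations, but it only converges from starting values $0<\theta<2/\langle X\rangle$.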

In practice, we usually apply the Newton-Raphson scheme by solving the linear equation
\begin{eqnarray}
-\langle\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}L\rangle\,\Delta\boldsymbol{\theta} & = & 
\langle\boldsymbol{\nabla}L\rangle
\,.
\end{eqnarray}
However, the Hessian is often difficult to compute, and so an approximation is typically used.
Hence, following the reasoning from the
[previous](#Parameter-transformations "Section: Parameter transformations") section,
we take the expectation of the left-hand side only (since the expectation of the right-hand side is always zero).
This allows us to compute an approximate update as the solution to
\begin{eqnarray}
-\left\langle\mathbb{E}\left[
\boldsymbol{\nabla}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]\right\rangle\,\Delta\boldsymbol{\theta} 
& = & 
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L
\mid\boldsymbol{\theta}\right]\right\rangle
\,\Delta\boldsymbol{\theta} ~=~
\langle\boldsymbol{\nabla}L\rangle
\,,
\end{eqnarray}
which has the advantage of only requiring knowledge about the gradient of the log-likelihood.
Note that some other gradient update schemes also use approximations to the Hessian. For example, the L-BFGS algorithm builds an approximation to the inverse Hessian from a limited history of previous gradients and parameter updates,
effectively approximating the expectation itself by a temporal average.
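
As an illustration of the approximate update, consider estimating the location $\theta$ of a unit-scale Cauchy distribution, for which $\boldsymbol{\nabla}L=2(x-\theta)/(1+(x-\theta)^2)$ and the expected squared gradient is the constant $\frac{1}{2}$. The following is a minimal sketch under these assumptions; the simulated data, seed, starting value and iteration count are hypothetical.

```python
import math
import random
import statistics

random.seed(0)
true_location = 3.0  # hypothetical value used only to simulate data
# Inverse-CDF sampling of unit-scale Cauchy variates.
data = [true_location + math.tan(math.pi * (random.random() - 0.5))
        for _ in range(2000)]

def mean_gradient(theta):
    """Sample mean of the log-likelihood gradient for the Cauchy location."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2)
               for x in data) / len(data)

# Expected squared gradient E[(dL/dtheta)^2] for a unit-scale Cauchy.
EXPECTED_SQUARED_GRADIENT = 0.5

theta = statistics.median(data)  # a robust starting value
for _ in range(50):
    # Solve E[grad * grad^T] * delta = <grad> for the scalar increment.
    theta += mean_gradient(theta) / EXPECTED_SQUARED_GRADIENT

print(theta, mean_gradient(theta))
```

Only the gradient is evaluated on the data; the left-hand side of the update is a known constant, so no Hessian needs to be computed.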

Finally, note that we might more generally consider some 
[parameter transformation](#Parameter-transformations "Section: Parameter transformations") 
$\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$, and therefore the transformed parameter update
$\Delta\boldsymbol{\eta}$ would follow from the above formulae by
taking all gradients and Hessians with respect to $\boldsymbol{\eta}$ rather than $\boldsymbol{\theta}$.

## Exponential families

Since [PDFs](#Probability-distribution-functions "Section: Probability distribution functions") are required to be
non-negative, it follows that they may be expressed as exponentials. Different exponential forms lead to different families of distributions.
Of particular interest is a [general family](#General-form "Section: General form")
of distributions having linearly-additive log-likelihoods, and also a more 
[specialised family](#Seperable-dependencies "Section: Seperable dependencies"),
misleadingly called "**the**" exponential family,
having bilinear or separable dependencies between the variates and the parameters. These distributions are discussed in the following sections.

### General form

Clearly, a [PDF](#Probability-distribution-functions "Section: Probability distribution functions")
$p(\mathbf{x}\mid\boldsymbol{\theta})$ must be the exponential of its log-likelihood
$L(\boldsymbol{\theta};\,\mathbf{x})$. Considered as an additive model with variate $X$, the log-likelihood will in general have constant terms, terms in $X$ but not $\boldsymbol{\theta}$, terms in $\boldsymbol{\theta}$ but not $X$,
and terms containing interactions between $X$ and $\boldsymbol{\theta}$. Hence, the general log-likelihood takes the form
\begin{eqnarray}
L(\boldsymbol{\theta};\,X) & = & \ln h(X)-\ln Ƶ(\boldsymbol{\theta})+s(X,\boldsymbol{\theta})
\,,
\end{eqnarray}
where any constant terms may be placed in either or both of $h(X)$ or $Ƶ(\boldsymbol{\theta})$, but the
interaction function $s(X,\boldsymbol{\theta})$ may contain neither constant terms, nor terms only in $X$,
nor terms only in $\boldsymbol{\theta}$.
Consequently, the general probability distribution is specified by
\begin{eqnarray}
p(\mathbf{x}\mid\boldsymbol{\theta}) & = & 
\frac{h(\mathbf{x})\,e^{s(\mathbf{x},\boldsymbol{\theta})}}
     {Ƶ(\boldsymbol{\theta})}
\,,
\end{eqnarray}
where $Ƶ(\boldsymbol{\theta})$ is now seen to be the normalising *partition* function defined by
\begin{eqnarray}
Ƶ(\boldsymbol{\theta}) & \doteq & 
\int_\mathcal{X}h(\mathbf{x})\,e^{s(\mathbf{x},\boldsymbol{\theta})}
\,|d\mathbf{x}|\,.
\end{eqnarray}
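
As a concrete instance (our own example, using a discrete variate so that summation replaces integration), the Poisson distribution with mean $\theta$ has
\begin{eqnarray}
p(x\mid\theta) ~=~ \frac{\theta^x e^{-\theta}}{x!}
& ~\Rightarrow~ &
h(x)~=~\frac{1}{x!}\,,\;\;
s(x,\theta)~=~x\ln\theta\,,\;\;
Ƶ(\theta)~=~\sum_{x=0}^{\infty}\frac{\theta^x}{x!}~=~e^{\theta}\,,
\end{eqnarray}
so that the interaction term $s$ contains no terms purely in $x$ or purely in $\theta$.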

It now [follows](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods") 
from the parameter gradient of the log-likelihood $L$ that
\begin{eqnarray}
\boldsymbol{\nabla}L(\boldsymbol{\theta}; X) & = & \boldsymbol{\nabla}s(X,\boldsymbol{\theta})
-\boldsymbol{\nabla}\ln Ƶ(\boldsymbol{\theta})
\\
\Rightarrow
\mathbb{E}\left[\boldsymbol{\nabla}L\mid\boldsymbol{\theta}\right] & = & 
\mathbb{E}\left[\boldsymbol{\nabla}s\mid\boldsymbol{\theta}\right]
-\boldsymbol{\nabla}\ln Ƶ~=~\mathbf{0}
\\
\Rightarrow 
\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s} & ~\doteq~ &
\mathbb{E}\left[\boldsymbol{\nabla}s\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}\ln Ƶ
\\
\Rightarrow \boldsymbol{\nabla}L & = & \boldsymbol{\nabla}s-\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s}
\,.
\end{eqnarray}
Similarly, it [follows](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods")
from the Hessian of the log-likelihood that
\begin{eqnarray}
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}L(\boldsymbol{\theta}; X) & = & 
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s(X,\boldsymbol{\theta})
-\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ(\boldsymbol{\theta})
\\
\Rightarrow
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
& = &
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
-\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ
\\
\Rightarrow
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ
& = &
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
+\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L\mid\boldsymbol{\theta}\right]
\\& = &
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
+\mathbb{E}\left[\left(\boldsymbol{\nabla}s-\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s}\right)\,
\left(\boldsymbol{\nabla}s-\boldsymbol{\mu}_{\small\boldsymbol{\nabla}s}\right)^T
\mid\boldsymbol{\theta}\right]
\\
\Rightarrow \boldsymbol{\Sigma}_{\small\boldsymbol{\nabla}s}
& ~\doteq~ &
\texttt{Var}\left[\boldsymbol{\nabla}s\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}\ln Ƶ
-
\mathbb{E}\left[\boldsymbol{\nabla}^{}\boldsymbol{\nabla}^{T}s\mid\boldsymbol{\theta}\right]
\,.
\end{eqnarray}
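
These mean and variance identities may be checked numerically. The sketch below assumes, for illustration only, a Poisson distribution with mean $\theta$, for which $s(x,\theta)=x\ln\theta$ and $\ln Ƶ(\theta)=\theta$; the parameter value and truncation limit are hypothetical.

```python
import math

theta = 2.5  # hypothetical Poisson mean

def poisson_pmf(limit):
    """Poisson probabilities for x = 0, ..., limit - 1, computed recursively."""
    probs = [math.exp(-theta)]
    for x in range(1, limit):
        probs.append(probs[-1] * theta / x)
    return probs

probs = poisson_pmf(200)  # the truncated tail is negligible

def expect(func):
    """Expectation of func(X) by direct summation."""
    return sum(func(x) * p for x, p in enumerate(probs))

def grad_s(x):
    """Derivative of s(x, theta) = x * ln(theta) with respect to theta."""
    return x / theta

def hess_s(x):
    """Second derivative of s(x, theta) with respect to theta."""
    return -x / theta**2

grad_log_z = 1.0  # derivative of ln Z(theta) = theta
hess_log_z = 0.0  # second derivative of ln Z(theta) = theta

mean_grad_s = expect(grad_s)
var_grad_s = expect(lambda x: (grad_s(x) - mean_grad_s) ** 2)
mean_hess_s = expect(hess_s)
print(mean_grad_s, var_grad_s, hess_log_z - mean_hess_s)
```

The mean of $\boldsymbol{\nabla}s$ equals $\boldsymbol{\nabla}\ln Ƶ$, and its variance equals the Hessian of $\ln Ƶ$ minus the expected Hessian of $s$, here $1/\theta$.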

Note that these identities hold for gradients and Hessians with respect to both the distributional parameters
$\boldsymbol{\theta}$ and also any arbitrary
[reparameterisation](#Parameter-transformations "Section: Parameter transformations")
$\boldsymbol{\eta}(\boldsymbol{\theta})$.
Consequently, we may always define a new variate of the form 
$Y\doteq\boldsymbol{\nabla}s(X,\boldsymbol{\theta})$, such that
its mean is given by
\begin{eqnarray}
\boldsymbol{\mu}_{Y} & ~=~ &
\mathbb{E}\left[Y\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}\ln Ƶ\,,
\end{eqnarray}
and its variance is given by
\begin{eqnarray}
\boldsymbol{\Sigma}_Y
& ~=~ &
\texttt{Var}\left[Y\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}^{}\boldsymbol{\mu}_{Y}^{T}
-
\mathbb{E}\left[\boldsymbol{\nabla}^{}Y^T\mid\boldsymbol{\theta}\right]
\,.
\end{eqnarray}
The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
estimate $\hat{\boldsymbol{\theta}}_\texttt{ML}$
may therefore be obtained iteratively via the approximate Newton-Raphson update
\begin{eqnarray}
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}L\,\boldsymbol{\nabla}^{T}L
\mid\boldsymbol{\theta}\right]\right\rangle\,\Delta\boldsymbol{\theta} ~=~
\langle\boldsymbol{\nabla}L\rangle
& ~\Rightarrow~ &
\left\langle\boldsymbol{\Sigma}_Y\right\rangle\,\Delta\boldsymbol{\theta}
~=~
\left\langle Y-\boldsymbol{\mu}_Y\right\rangle
\,.
\end{eqnarray}
The iterations halt when $\Delta\boldsymbol{\theta}=\mathbf{0}$, at which point
the sample mean $\left\langle Y\right\rangle$ equals the expected mean
$\hat{\boldsymbol{\mu}}_Y=\boldsymbol{\mu}_Y(\hat{\boldsymbol{\theta}}_\texttt{ML})$.
Note that under the transformation $\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$,
we may also obtain updates for $\boldsymbol{\eta}$ via $\Delta\boldsymbol{\eta}$,
culminating in the maximum-likelihood estimate
$\hat{\boldsymbol{\eta}}_\texttt{ML}=\boldsymbol{\eta}(\hat{\boldsymbol{\theta}}_\texttt{ML})$.
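
For example, for a Poisson distribution with mean $\theta$ (our illustrative choice), we have $s(X,\theta)=X\ln\theta$ and $\ln Ƶ(\theta)=\theta$, so that $Y=X/\theta$, $\mu_Y=1$ and $\Sigma_Y=0-\mathbb{E}\left[-X/\theta^2\mid\theta\right]=1/\theta$. The approximate update then reads
\begin{eqnarray}
\frac{1}{\theta}\,\Delta\theta ~=~ \frac{\langle X\rangle}{\theta}-1
& ~\Rightarrow~ &
\Delta\theta ~=~ \langle X\rangle-\theta
~\Rightarrow~
\theta' ~=~ \langle X\rangle\,,
\end{eqnarray}
which reaches $\hat{\theta}_\texttt{ML}=\langle X\rangle$ in a single step from any starting value $\theta>0$, at which point $\langle Y\rangle=1=\hat{\mu}_Y$.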

### Seperable dependencies

We now consider a specialisation of the [general form](#General-form "Section: General form"), which had a (semi-)arbitrary
scalar interaction term $s(X,\boldsymbol{\theta})$.
The essential idea is that the interactions between the variate $X$ and the
parameters $\boldsymbol{\theta}$ are now multiplicatively separable, i.e. specified via a product. 
The simplest product form is 
$s(X,\boldsymbol{\theta})=\boldsymbol{\theta}^{T}X$. However, more generally the nonlinear product
$s(X,\boldsymbol{\theta})=\boldsymbol{\eta}(\boldsymbol{\theta})^{T}\mathbf{u}(X)$ is also valid. 
Note that the vector function $\boldsymbol{\eta}(\cdot)$ may be thought of as defining the
*natural* parameterisation $\boldsymbol{\eta}=\boldsymbol{\eta}(\boldsymbol{\theta})$ of the distribution.
Also, note that we now have the special property that $\boldsymbol{\nabla}_\boldsymbol{\eta}\,s(X,\boldsymbol{\theta})=\mathbf{u}(X)$,
such that $Y_\boldsymbol{\eta}\doteq\mathbf{u}(X)$ may be thought of as the natural *variates*
of the distribution.

Note that distributions having the form $s(\mathbf{x},\boldsymbol{\theta})=\boldsymbol{\eta}(\boldsymbol{\theta})^{T}\mathbf{u}(\mathbf{x})$ are regarded as belonging to **the** *exponential family*. This is a somewhat misleading and presumptuous term, given that other forms of $s(\mathbf{x},\boldsymbol{\theta})$ exist, and other forms of log-likelihood not in the general form also exist, i.e. members of a more general exponential family that are not in "the" exponential family.
Also note that distributions having the bilinear form $s(\mathbf{x},\boldsymbol{\theta})=\boldsymbol{\theta}^{T}\mathbf{x}$
are regarded as members of the *natural* exponential family, since the natural parameters $\boldsymbol{\eta}$
are just the ordinary parameters $\boldsymbol{\theta}$. The other stipulation is (apparently) that we also must have the
identity function
$\mathbf{u}(\mathbf{x})=\mathbf{x}$ to be in the natural exponential family. It is unclear what categorisation should be given to distributions having natural parameters but for which $\mathbf{u}(\mathbf{x})\neq\mathbf{x}$.

As noted above, the special property of "the" exponential family is that, taking gradients with respect to
the natural parameters
$\boldsymbol{\eta}$, we have
\begin{eqnarray}
Y_\boldsymbol{\eta}~=~\boldsymbol{\nabla}_\boldsymbol{\eta}\,s(X,\boldsymbol{\theta})~=~\mathbf{u}(X)
& ~\Rightarrow~ &
\boldsymbol{\nabla}_\boldsymbol{\eta}\,Y_\boldsymbol{\eta}^{T}~=~
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\,s~=~\mathbf{O}
\,.
\end{eqnarray}
It then follows from the [previous](#General-form "Section: General form") section that
\begin{eqnarray}
\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta}) & ~=~ & 
\mathbb{E}\left[Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right] ~=~ 
\boldsymbol{\nabla}_\boldsymbol{\eta}\ln Z(\boldsymbol{\theta})\,,
\end{eqnarray}
and
\begin{eqnarray}
\boldsymbol{\Sigma}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta}) & ~=~ &
\texttt{Var}\left[ Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right]
~=~
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}\ln Z(\boldsymbol{\theta})
~=~\boldsymbol{\nabla}_\boldsymbol{\eta}\,\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}^{T}(\boldsymbol{\theta})
\,.
\end{eqnarray}
As a simplification, we may use the [results](#Parameter-transformations "Section: Parameter transformations") that
\begin{eqnarray}
\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}~=~\mathbf{J}_\boldsymbol{\eta}^{-1}\,
\boldsymbol{\nabla}_\boldsymbol{\theta}\ln Z\,, 
& \;\;~\mbox{and}~\;\; &
\boldsymbol{\Sigma}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta})~=~
\mathbf{J}_\boldsymbol{\eta}^{-1}\,
~\boldsymbol{\nabla}_\boldsymbol{\theta}\,\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}^{T}
\,.
\end{eqnarray}
Consequently, the exact [Newton-Raphson](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") update
of the natural parameters $\boldsymbol{\eta}$ is given by
\begin{eqnarray}
\boldsymbol{\Sigma}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta})\,\Delta\boldsymbol{\eta} & = & 
\langle Y_\boldsymbol{\eta}\rangle-\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}(\boldsymbol{\theta})\,,
\end{eqnarray}
such that the 
 [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
estimate $\hat{\boldsymbol{\eta}}_\texttt{ML}=\boldsymbol{\eta}(\hat{\boldsymbol{\theta}}_\texttt{ML})$
satisfies $\hat{\boldsymbol{\mu}}_{\small Y_\boldsymbol{\eta}}
=\boldsymbol{\mu}_{\small Y_\boldsymbol{\eta}}(\hat{\boldsymbol{\theta}}_\texttt{ML})=\langle Y_\boldsymbol{\eta}\rangle$.
 We shall see from a later 
[example](#Beta-distribution "Section: Beta distribution") that $\langle Y_\boldsymbol{\eta}\rangle$ are the sufficient statistics.

### Bernoulli distribution

Consider a match between two teams, say team A and team B. Suppose further that, after considering all the evidence, our model proposes a probability $\theta$ of team A winning. In practice, team A may either win or lose the match, or even draw the match, which we shall deal with later. Hence, we let $X=1$ indicate that team A actually won the match, and let $X=0$ indicate that team A lost the match. The variate $X$ then follows the Bernoulli distribution
\begin{eqnarray}
p(X=x\mid\theta) & = & \theta^{x}\,(1-\theta)^{1-x}
~=~\frac{e^{x\ln\frac{\theta}{1-\theta}}}{(1-\theta)^{-1}}\,.
\end{eqnarray}
We therefore observe that the Bernoulli distribution is a member of "the" 
[exponential family](#Seperable-dependencies "Section: Seperable dependencies") 
with natural parameter
\begin{eqnarray}
\eta & \doteq & \ln\frac{\theta}{1-\theta}~\doteq~\sigma^{-1}(\theta)\,,
\end{eqnarray}
where $\sigma^{-1}(\cdot)$ is the *logit* function, and its inverse is the *logistic* (sigmoid)
function $\sigma(\eta)\doteq(1+e^{-\eta})^{-1}$.

We also see that the utility function is just the identity, $u(x)=x$, 
such that $X$ is the natural variate.
Lastly, observe that the partition function is given by
\begin{eqnarray}
Ƶ(\theta)~=~(1-\theta)^{-1} & ~\Rightarrow~ & \nabla_\theta\ln Ƶ~=~\frac{1}{1-\theta}\,.
\end{eqnarray}
Given the reparameterisation $\eta=\sigma^{-1}(\theta)$, we also note that the
[Jacobian](#Parameter-transformations "Section: Parameter transformations") of this transformation is given by
\begin{eqnarray}
J_\eta & = & \nabla_\theta\,\eta~=~\frac{1}{\theta\,(1-\theta)}
\\
\Rightarrow \nabla_\eta\ln Ƶ & = & J_\eta^{-1}\nabla_\theta\ln Ƶ~=~\theta\,,
\end{eqnarray}
from which it [follows](#Seperable-dependencies "Section: Seperable dependencies") that
the distributional mean is given by
\begin{eqnarray}
\mu & = & \mathbb{E}[X\mid\theta]~=~\nabla_\eta\ln Ƶ~=~\theta\,.
\end{eqnarray}
We now see that the logit function $\sigma^{-1}(\cdot)$ is the
natural [link function](#Regression-modelling "Section: Regression modelling")
for the Bernoulli distribution, since $\eta=\sigma^{-1}(\mu)$.

Similarly, we observe that
\begin{eqnarray}
\nabla_\eta^2\ln Ƶ(\theta)~=~\nabla_\eta\,\theta~=~J_\eta^{-1}~=~\theta\,(1-\theta)\,.
\end{eqnarray}
It therefore [follows](#Seperable-dependencies "Section: Seperable dependencies") that
the distributional variance is given by
\begin{eqnarray}
\sigma_X^2 & = & \texttt{Var}[X\mid\theta]~=~\theta\,(1-\theta)~=~\mu\,(1-\mu)\,.
\end{eqnarray}
Hence, the Bernoulli distribution (or its variate $X$) is *heteroscedastic*, since the variance is not constant
but is instead a function of the mean.
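
As a quick numerical sanity check of the mean and variance results above, the following minimal sketch differentiates the log-partition function by central differences. It assumes the log-partition function is rewritten in terms of the natural parameter as $\ln Ƶ=\ln(1+e^{\eta})$, which follows from $1-\theta=(1+e^{\eta})^{-1}$. The helper names and the step size `h` are illustrative choices only.

```python
import math

def log_partition(eta):
    """ln Z as a function of the natural parameter: ln(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def bernoulli_moments(theta, h=1e-4):
    """Mean and variance of X from central differences of ln Z in eta."""
    eta = math.log(theta / (1.0 - theta))  # logit of theta
    f_plus, f_0, f_minus = (log_partition(eta + d) for d in (h, 0.0, -h))
    mean = (f_plus - f_minus) / (2.0 * h)            # first derivative
    var = (f_plus - 2.0 * f_0 + f_minus) / (h * h)   # second derivative
    return mean, var

theta = 0.3
mean, var = bernoulli_moments(theta)
print(mean, theta)                  # numerical mean vs theta
print(var, theta * (1.0 - theta))   # numerical variance vs theta (1 - theta)
```

Up to finite-difference error, the two printed pairs should agree, which illustrates that the variance is tied to the mean.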

It also [follows](#Seperable-dependencies "Section: Seperable dependencies") that the
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate of the mean
is given by 
\begin{eqnarray}
\hat{\mu}_\texttt{ML} & = & \hat{\theta}_\texttt{ML}~=~\langle X\rangle\,.
\end{eqnarray}

We now return to the issue of how we deal with draws. It turns out that there are a number of good reasons to treat a draw as being half-a-win and half-a-loss.
Hence, we take the weighted log-likelihood for a draw as being given by
\begin{eqnarray}
L_\texttt{draw}(\theta) & \doteq & \frac{1}{2}\ln p(X=1\mid\theta)+\frac{1}{2}\ln p(X=0\mid\theta)
\\& =  &
\frac{1}{2}\ln\theta+\frac{1}{2}\ln\,(1-\theta)
\\& = &
\ln\left[\theta^{\frac{1}{2}}\,(1-\theta)^{\frac{1}{2}}\right]
~=~\ln p(X={}^{\frac{1}{2}}\mid\theta)\,.
\end{eqnarray}
Consequently, we may numerically treat draws as having the value $X=\frac{1}{2}$.
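
To illustrate the numerical treatment of draws, the sketch below codes a small, entirely hypothetical set of match results with draws as $X=\frac{1}{2}$. It then compares the sample mean against a coarse grid search over the weighted log-likelihood. The data and the grid resolution are assumptions made purely for illustration.

```python
import math

def weighted_log_likelihood(theta, outcomes):
    """Sum of x ln(theta) + (1 - x) ln(1 - theta), with draws coded as x = 0.5."""
    return sum(x * math.log(theta) + (1.0 - x) * math.log(1.0 - theta)
               for x in outcomes)

# Hypothetical results for team A: 1 = win, 0 = loss, 0.5 = draw.
outcomes = [1.0, 0.0, 0.5, 1.0, 0.5, 0.0, 1.0, 1.0]
theta_ml = sum(outcomes) / len(outcomes)  # sample mean <X>

# A coarse grid search should not find a better theta than the sample mean.
grid = [k / 1000.0 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: weighted_log_likelihood(t, outcomes))
print(theta_ml, theta_grid)
```

Since the weighted log-likelihood is concave in $\theta$, the grid maximiser should coincide with the sample mean to within the grid spacing.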

Finally, we make use of the fact that $1-\sigma(\eta)=\sigma(-\eta)$, and therefore observe that we may reparameterise the Bernoulli distribution 
in terms of its natural parameter $\eta$ as
\begin{eqnarray}
p(X=x\mid\eta) & = & \sigma(-\eta)\,e^{\eta x}\,,
\end{eqnarray}
which puts it into the natural exponential family. Consequently, it appears that the membership (or non-membership) of a probability distribution in a given exponential subfamily is largely determined by its parameterisation.
Also note that under this reparameterisation, the mean and variance are now given by
\begin{eqnarray}
\mu~=~\mathbb{E}[X\mid\eta]~=~\sigma(\eta)\,,
&\;\;\;& 
\sigma^2_X~=~\texttt{Var}[X\mid\eta]~=~\sigma(\eta)\,\sigma(-\eta)\,,
\end{eqnarray}
respectively.

### Beta distribution

In contrast to the [previous](#Bernoulli-distribution "Section: Bernoulli distribution") section,
suppose that instead of the match between teams A and B having a fixed probability $\theta$ of team A winning, the
probability, now denoted by the variate $X$, is itself sampled from another distribution. For example, $X$ might be drawn from the Beta distribution
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{x^{\alpha-1}\,(1-x)^{\beta-1}}{B(\alpha,\beta)}
~=~\frac{[x(1-x)]^{-1}\,e^{\alpha\ln x+\beta\ln(1-x)}}{B(\alpha,\beta)}\,.
\end{eqnarray}
This distribution is also in "the"
[exponential family](#Seperable-dependencies "Section: Seperable dependencies"), with natural parameters 
$\boldsymbol{\theta}=[\alpha,\beta]^T$, 
natural variates
$\mathbf{u}(X)=[Y_\alpha,Y_\beta]^T=[\ln X,\ln (1-X)]^{T}$, and
partition function $Ƶ(\boldsymbol{\theta})$ given by
\begin{eqnarray}
B(\alpha,\beta) & \doteq & \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}\,,
\end{eqnarray}
where $\Gamma(\cdot)$ is the *gamma* function.
It therefore [follows](#Seperable-dependencies "Section: Seperable dependencies") that
the mean of $Y_\alpha=\ln X$ is given by
\begin{eqnarray}
\mu_{\small Y_\alpha} & ~=~ & \mathbb{E}\left[\ln X\mid\alpha,\beta\right] ~=~
\frac{\partial}{\partial\alpha}\ln B(\alpha,\beta)
~=~\psi(\alpha)-\psi(\alpha+\beta)\,,
\end{eqnarray}
where $\psi(\cdot)$ is the *digamma* function given by 
\begin{eqnarray}
\psi(z) & \doteq & \frac{\Gamma'(z)}{\Gamma(z)}\,.
\end{eqnarray}
Similarly, the mean of $Y_\beta=\ln (1-X)$ is given by
\begin{eqnarray}
\mu_{\small Y_\beta} & ~=~ & \mathbb{E}\left[\ln (1-X)\mid\alpha,\beta\right] ~=~
\frac{\partial}{\partial\beta}\ln B(\alpha,\beta)
~=~\psi(\beta)-\psi(\alpha+\beta)\,.
\end{eqnarray}
For interest's sake, note that if we let $Y\doteq\ln\frac{X}{1-X}=\sigma^{-1}(X)$, then we deduce that
\begin{eqnarray}
\mu_Y & ~=~ & \mathbb{E}\left[Y\mid\alpha,\beta\right]~=~
\mu_{\small Y_\alpha}-\mu_{\small Y_\beta} 
~=~\psi(\alpha)-\psi(\beta)\,.
\end{eqnarray}

Similarly, we find that the variance of $Y_\alpha$ is given by
\begin{eqnarray}
\sigma^2_{\small Y_\alpha} & ~=~ & \texttt{Var}\left[\ln X\mid\alpha,\beta\right] ~=~
\frac{\partial^2}{\partial\alpha^2}\ln B(\alpha,\beta)
~=~\psi'(\alpha)-\psi'(\alpha+\beta)\,,
\end{eqnarray}
where $\psi'(\cdot)\doteq\psi_1(\cdot)$ is the *trigamma* function given by
\begin{eqnarray}
\psi_1(z) & \doteq & \frac{\Gamma(z)\,\Gamma''(z)-\Gamma'(z)^2}{\Gamma(z)^2}\,.
\end{eqnarray}
Likewise, the variance of $Y_\beta$ is given by
\begin{eqnarray}
\sigma^2_{\small Y_\beta} & ~=~ & \texttt{Var}\left[\ln(1-X)\mid\alpha,\beta\right] ~=~
\frac{\partial^2}{\partial\beta^2}\ln B(\alpha,\beta)
~=~\psi'(\beta)-\psi'(\alpha+\beta)\,,
\end{eqnarray}
and the covariance between $Y_\alpha$ and $Y_\beta$ is given by
\begin{eqnarray}
\sigma_{Y_\alpha,Y_\beta} & ~=~ & \texttt{Cov}\left[\ln X,\,\ln(1-X)\mid\alpha,\beta\right] ~=~
\frac{\partial^2}{\partial\alpha\partial\beta}\ln B(\alpha,\beta)
~=~-\psi'(\alpha+\beta)\,.
\end{eqnarray}
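
The moments of the natural variate $\ln X$ can be checked by simulation. The following is a rough sketch, not a rigorous test: it compares Monte Carlo estimates of $\mathbb{E}[\ln X]$ and $\texttt{Var}[\ln X]$ against central-difference derivatives of $\ln B(\alpha,\beta)$, so that no digamma implementation is needed. The parameter values, sample size and step sizes are arbitrary assumptions.

```python
import math
import random

def log_beta(a, b):
    """ln B(a, b) via log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def d_alpha(a, b, h=1e-5):
    """Central-difference estimate of d/d(alpha) ln B, i.e. psi(a) - psi(a + b)."""
    return (log_beta(a + h, b) - log_beta(a - h, b)) / (2.0 * h)

def d2_alpha(a, b, h=1e-3):
    """Central-difference estimate of the second derivative in alpha."""
    return (log_beta(a + h, b) - 2.0 * log_beta(a, b) + log_beta(a - h, b)) / (h * h)

random.seed(0)
alpha, beta, n = 2.0, 3.0, 100_000
log_x = [math.log(random.betavariate(alpha, beta)) for _ in range(n)]
sample_mean = sum(log_x) / n
sample_var = sum((y - sample_mean) ** 2 for y in log_x) / n

print(sample_mean, d_alpha(alpha, beta))   # both estimate E[ln X]
print(sample_var, d2_alpha(alpha, beta))   # both estimate Var[ln X]
```

The agreement is only up to sampling error, which shrinks roughly as $n^{-1/2}$.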


It seems somewhat surprising that following the defined procedure does not immediately give us the mean and variance of the variate $X$, but instead gives the mean and variance of $\mathbf{u}(X)$. 
In fact, it turns out that $\langle\ln X\rangle$ and $\langle\ln(1-X)\rangle$ provide the sufficient statistics for the 
[Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution"), 
and not $\langle X\rangle$. 
The mean and variance of $X$ are actually given by
\begin{eqnarray}
\mu_X & ~=~ & \mathbb{E}[X\mid\alpha,\beta] ~=~\frac{\alpha}{\alpha+\beta}\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_X & ~=~ & \texttt{Var}[X\mid\alpha,\beta] ~=~\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\,,
\end{eqnarray}
respectively.
Note that we can also reparameterise the Beta distribution in another way. If we define $\nu\doteq\alpha+\beta$, then we see that
\begin{eqnarray}
\sigma^2_X ~=~ \frac{\mu_X\,(1-\mu_X)}{\nu+1} & \;\;\Rightarrow\;\; &
\nu ~=~ \frac{\mu_X\,(1-\mu_X)}{\sigma^2_X}-1\,,
\end{eqnarray}
such that
\begin{eqnarray}
\alpha~=~\mu_X\,\nu\,, & \;\;\;\; & \beta=(1-\mu_X)\,\nu\,.
\end{eqnarray}
The distribution is therefore heteroscedastic, with the variance $\sigma_X^2$ clearly being a function of 
the mean $\mu_X$.

Yet another reparameterisation is to retain $\alpha$ and $\nu$, such that the distribution becomes
\begin{eqnarray}
p(X=x\mid\alpha,\nu) & ~=~ & 
\frac{[x(1-x)]^{-1}\,e^{\alpha\ln\frac{x}{1-x}+\nu\ln(1-x)}}{B(\alpha,\nu-\alpha)}\,,
\end{eqnarray}
where the natural variates are now $\mathbf{u}(X)=[Y, \ln(1-X)]^{T}$. We therefore obtain the mean of variate $Y$ as
\begin{eqnarray}
\mu_Y & ~=~ & \frac{\partial}{\partial\alpha}\ln B(\alpha,\nu-\alpha)~=~\psi(\alpha)-\psi(\nu-\alpha)\,,
\end{eqnarray}
as before, and the variance of $Y$ is now also obtained as
\begin{eqnarray}
\sigma^2_Y & ~=~ & \frac{\partial^2}{\partial\alpha^2}\ln B(\alpha,\nu-\alpha)~=~\psi'(\alpha)+\psi'(\nu-\alpha)\,.
\end{eqnarray}
This should come as no surprise, since $Y=\ln X-\ln(1-X)$, such that
\begin{eqnarray}
\texttt{Var}[Y\mid\alpha,\beta] & ~=~ & 
\texttt{Var}[\ln X\mid\alpha,\beta]+\texttt{Var}[\ln(1-X)\mid\alpha,\beta]-2\,\texttt{Cov}[\ln X,\,\ln(1-X)\mid\alpha,\beta]
\\& ~=~ &
\sigma^2_{Y_\alpha}+\sigma^2_{Y_\beta}-2\,\sigma_{Y_\alpha, Y_\beta}~=~\psi'(\alpha)+\psi'(\beta)\,.
\end{eqnarray}
This little exercise demonstrates that although different parameterisations might make the various calculations either easier
or harder to obtain, they cannot alter the final results.
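
The equivalence of the two routes to $\texttt{Var}[Y]$ can also be confirmed numerically. The sketch below uses finite differences of $\ln B$ computed from `math.lgamma`. One function holds $\nu$ fixed and differentiates in $\alpha$, while the other combines the separate second partials in $\alpha$ and $\beta$. The function names and step size are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """ln B(a, b) via log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def var_y_fixed_nu(alpha, nu, h=1e-3):
    """Second derivative in alpha of ln B(alpha, nu - alpha), holding nu fixed."""
    f = lambda a: log_beta(a, nu - a)
    return (f(alpha + h) - 2.0 * f(alpha) + f(alpha - h)) / (h * h)

def var_y_from_covariances(alpha, beta, h=1e-3):
    """Var[ln X] + Var[ln(1-X)] - 2 Cov[ln X, ln(1-X)] from partials of ln B."""
    f = log_beta
    f0 = f(alpha, beta)
    d_aa = (f(alpha + h, beta) - 2.0 * f0 + f(alpha - h, beta)) / (h * h)
    d_bb = (f(alpha, beta + h) - 2.0 * f0 + f(alpha, beta - h)) / (h * h)
    d_ab = (f(alpha + h, beta + h) - f(alpha + h, beta - h)
            - f(alpha - h, beta + h) + f(alpha - h, beta - h)) / (4.0 * h * h)
    return d_aa + d_bb - 2.0 * d_ab

alpha, beta = 2.0, 3.0
v_nu = var_y_fixed_nu(alpha, alpha + beta)
v_ab = var_y_from_covariances(alpha, beta)
print(v_nu, v_ab)  # both approximate psi'(alpha) + psi'(beta)
```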

Finally, we observe that the maximum likelihood solution 
$\hat{\boldsymbol{\theta}}_\texttt{ML}=[\hat{\alpha}_\texttt{ML},\hat{\beta}_\texttt{ML}]^{T}$
satisfies the nonlinear system of equations
\begin{eqnarray}
\left\langle\ln X\right\rangle & = & 
\psi(\hat{\alpha}_\texttt{ML})-\psi(\hat{\alpha}_\texttt{ML}+\hat{\beta}_\texttt{ML})\,,
\\
\left\langle\ln(1-X)\right\rangle & = & 
\psi(\hat{\beta}_\texttt{ML})-\psi(\hat{\alpha}_\texttt{ML}+\hat{\beta}_\texttt{ML})\,.
\end{eqnarray}
This may be solved numerically using the Newton-Raphson method with iterative parameter updates of the form
\begin{eqnarray}
\left[\begin{array}{c}\Delta\alpha\\\Delta\beta\end{array}\right]
& = &
\left[\begin{array}{cc}
\psi'(\alpha)-\psi'(\alpha+\beta) & -\psi'(\alpha+\beta)\\
-\psi'(\alpha+\beta) & \psi'(\beta)-\psi'(\alpha+\beta)
\end{array}\right]^{-1}\,
\left[\begin{array}{c}
\left\langle\ln X\right\rangle - \psi(\alpha)+\psi(\alpha+\beta)\\
\left\langle\ln(1-X)\right\rangle - \psi(\beta)+\psi(\alpha+\beta)
\end{array}\right]
\,.
\end{eqnarray}
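
A minimal sketch of this Newton-Raphson iteration is given below, using only the Python standard library. Several pieces are our own assumptions rather than part of the derivation: the `digamma` and `trigamma` helpers (recurrence plus a truncated asymptotic series), the method-of-moments starting point taken from the $(\mu_X,\nu)$ reparameterisation above, and the step-halving safeguard that keeps both parameters positive. It also assumes the samples lie strictly inside $(0,1)$ and are not all identical.

```python
import math
import random

def digamma(x):
    """psi(x) for x > 0: upward recurrence, then a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def trigamma(x):
    """psi'(x) for x > 0: upward recurrence, then a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv, inv2 = 1.0 / x, 1.0 / (x * x)
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def fit_beta(samples, tol=1e-9, max_iter=100):
    """Maximum-likelihood (alpha, beta) by Newton-Raphson on the natural variates."""
    n = len(samples)
    mean_log_x = sum(math.log(x) for x in samples) / n          # <ln X>
    mean_log_1mx = sum(math.log(1.0 - x) for x in samples) / n  # <ln(1 - X)>
    # Method-of-moments start: nu = mu (1 - mu) / var - 1.
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    nu = mu * (1.0 - mu) / var - 1.0
    a, b = mu * nu, (1.0 - mu) * nu
    for _ in range(max_iter):
        t_ab = trigamma(a + b)
        s_aa, s_bb, s_ab = trigamma(a) - t_ab, trigamma(b) - t_ab, -t_ab
        r_a = mean_log_x - digamma(a) + digamma(a + b)
        r_b = mean_log_1mx - digamma(b) + digamma(a + b)
        det = s_aa * s_bb - s_ab * s_ab
        da = (s_bb * r_a - s_ab * r_b) / det   # explicit 2x2 inverse
        db = (s_aa * r_b - s_ab * r_a) / det
        # Safeguard (our addition): halve the step if it would leave the domain.
        while a + da <= 0.0 or b + db <= 0.0:
            da, db = 0.5 * da, 0.5 * db
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

random.seed(1)
data = [random.betavariate(2.0, 5.0) for _ in range(20_000)]
alpha_hat, beta_hat = fit_beta(data)
print(alpha_hat, beta_hat)  # should be near the generating values (2, 5)
```

In practice a library routine for the digamma and trigamma functions would be preferable to the hand-rolled series used here.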


A final issue remains: what sample values of the variate $X$ are actually observed in practice? We can no longer use the
[Bernoulli](#Bernoulli-distribution "Section: Bernoulli distribution") values of $X=1$ for a win and $X=0$ for a loss,
since now $X$ is a probability. One possibility is to note that after a match between team A and team B, we might have observed team A's score $S_A$ and team B's score $S_B$. Hence, we could use the proportion $X=\frac{S_A}{S_A+S_B}$ as a proxy measure of the probability of team A winning a similar match against team B in the future.

Note that, in general, we should not expect $S_A$ and $S_B$ to be independent, since $S_A$ should increase with team A's offensive strength, and decrease with team B's defensive strength. For example, we might expect both scores to be higher in a match between poor defenders than in a match between strong defenders. However, if we do assume independence, then taking
$X\sim\texttt{Beta}(\alpha,\beta)$ [follows](https://en.wikipedia.org/wiki/Beta_distribution "Wikipedia: Beta distribution")
from $S_A\sim\texttt{Gamma}(\alpha,\gamma)$ and $S_B\sim\texttt{Gamma}(\beta,\gamma)$,
where $\gamma$ is a parameter common across all teams.
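
The Gamma-to-Beta relationship can be illustrated by simulation. The sketch below draws two independent Gamma-distributed scores sharing a common scale, and compares the empirical mean and variance of the proportion $X=\frac{S_A}{S_A+S_B}$ with the Beta formulae. Note that Python's `random.gammavariate` takes a shape and a *scale* parameter. The parameter values and sample size are arbitrary assumptions.

```python
import random

random.seed(2)
alpha, beta, scale, n = 2.0, 3.0, 1.5, 50_000

# random.gammavariate(shape, scale): both scores share the same scale.
ratios = []
for _ in range(n):
    s_a = random.gammavariate(alpha, scale)
    s_b = random.gammavariate(beta, scale)
    ratios.append(s_a / (s_a + s_b))

mean = sum(ratios) / n
var = sum((x - mean) ** 2 for x in ratios) / n
nu = alpha + beta
print(mean, alpha / nu)                              # Beta mean
print(var, alpha * beta / (nu ** 2 * (nu + 1.0)))    # Beta variance
```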

### Beta-Bernoulli distribution

We now consider the combined situation where each match has a 
[Bernoulli-distributed](#Bernoulli-distribution "Section: Bernoulli distribution") outcome, $X\sim\texttt{Bern}(\theta)$, but where the probability $\theta$ itself is [Beta-distributed](#Beta-distribution "Section: Beta distribution"),
with $\theta\sim\texttt{Beta}(\alpha,\beta)$. It therefore follows that
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & = & \int_0^1 p(x\mid\theta)\,p(\theta\mid\alpha,\beta)\,d\theta
\\& = &
\int_0^1 \theta^x(1-\theta)^{1-x}\,
\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\,d\theta
\\& = &
\frac{1}{B(\alpha,\beta)}\int_0^1 \theta^{\alpha+x-1}(1-\theta)^{\beta+1-x-1}\,d\theta
\\& =  &
\frac{B(\alpha+x,\beta+1-x)}{B(\alpha,\beta)}~=~
\frac{\Gamma(\alpha+x)\,\Gamma(\beta+1-x)}{\Gamma(\alpha+\beta+1)}\,
\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,.
\end{eqnarray}
Hence, using the fact that $\Gamma(z+1)=z\,\Gamma(z)$, we deduce that
\begin{eqnarray}
p(X=1\mid\alpha,\beta)~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\;\; &
p(X=0\mid\alpha,\beta)~=~\frac{\beta}{\alpha+\beta}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & = & \frac{\alpha^x\,\beta^{1-x}}{\alpha+\beta}\,.
\end{eqnarray}
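
As a small check of this simplification, the following sketch evaluates both the ratio of Beta functions and the closed form for the two admissible outcomes $x\in\{0,1\}$. The parameter values are arbitrary, and the function names are our own.

```python
import math

def log_beta(a, b):
    """ln B(a, b) via log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def pmf_integral(x, alpha, beta):
    """B(alpha + x, beta + 1 - x) / B(alpha, beta), the marginalised form."""
    return math.exp(log_beta(alpha + x, beta + 1.0 - x) - log_beta(alpha, beta))

def pmf_closed(x, alpha, beta):
    """alpha^x beta^(1 - x) / (alpha + beta), the simplified form."""
    return alpha ** x * beta ** (1.0 - x) / (alpha + beta)

alpha, beta = 2.5, 4.0
for x in (0, 1):
    print(x, pmf_integral(x, alpha, beta), pmf_closed(x, alpha, beta))
```

The two forms should agree to rounding error for $x=0$ and $x=1$, and the two probabilities should sum to one.
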
We also deal with draws similarly to [before](#Bernoulli-distribution "Section: Bernoulli distribution"), by treating them as half-a-win and half-a-loss, with weighted log-likelihood
\begin{eqnarray}
L_\texttt{draw}(\alpha,\beta) & \doteq & \frac{1}{2}\ln p(X=1\mid\alpha,\beta)+\frac{1}{2}\ln p(X=0\mid\alpha,\beta)
\\& =  &
\frac{1}{2}\ln\frac{\alpha}{\alpha+\beta}+\frac{1}{2}\ln\frac{\beta}{\alpha+\beta}
\\& = &
\ln\frac{\alpha^{\frac{1}{2}}\,\beta^{\frac{1}{2}}}{\alpha+\beta}
~=~\ln p(X={}^{\frac{1}{2}}\mid\alpha,\beta)\,.
\end{eqnarray}
Consequently, we may numerically treat draws as having the value $X=\frac{1}{2}$.

We may now rewrite the distribution in the form
\begin{eqnarray}
p(x\mid\alpha,\beta) & ~=~ & \frac{\beta\,\left(\frac{\alpha}{\beta}\right)^x}{\alpha+\beta}
~=~\frac{e^{x\ln\frac{\alpha}{\beta}}}{1+\frac{\alpha}{\beta}}\,,
\end{eqnarray}
which is in "the" [exponential family](#Seperable-dependencies "Section: Seperable dependencies") with natural parameter
$\eta=\ln\frac{\alpha}{\beta}$. Consequently, the reparameterised distribution
\begin{eqnarray}
p(x\mid\eta) & \doteq & \frac{e^{\eta x}}{1+e^\eta}\,,
\end{eqnarray}
is thus in the natural exponential family, and has mean
\begin{eqnarray}
\mu & ~=~ & \mathbb{E}[X\mid\eta]~=~\frac{d}{d\eta}\ln(1+e^\eta)
~=~\frac{e^\eta}{1+e^\eta}~=~\frac{1}{1+e^{-\eta}}\,.
\end{eqnarray}
This is just the logistic function $\sigma(\cdot)$, such that
\begin{eqnarray}
\mu~=~\sigma(\eta) & ~\Rightarrow~ & \eta~=~\sigma^{-1}(\mu)\,.
\end{eqnarray}
The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate
$\hat{\eta}_\texttt{ML}$ therefore satisfies
\begin{eqnarray}
\hat{\mu}_\texttt{ML} & ~=~ & \sigma(\hat{\eta}_\texttt{ML})=\langle X\rangle\,.
\end{eqnarray}
Similarly, the variance is given by
\begin{eqnarray}
\sigma_X^2 & ~=~ & \texttt{Var}[X\mid\eta]~=~\frac{d}{d\eta}\frac{1}{1+e^{-\eta}}
~=~\frac{e^{-\eta}}{(1+e^{-\eta})^2}~=~\sigma(\eta)\,\sigma(-\eta)\,.
\end{eqnarray}
In terms of the original parameters $\alpha$ and $\beta$, substitution of $\eta=\ln\frac{\alpha}{\beta}$ into the above results gives
\begin{eqnarray}
\mu~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\; & \sigma^2_X~=~\frac{\alpha\beta}{(\alpha+\beta)^2}\,.
\end{eqnarray}
Observe that although the mean $\mu$ matches that of the
[Beta distribution](#Beta-distribution "Section: Beta distribution"), the variance $\sigma^2_X$
now differs. In fact, the variance is given by $\sigma^2_X=\mu(1-\mu)$, which matches the
[Bernoulli distribution](#Bernoulli-distribution "Section: Bernoulli distribution").
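
This behaviour can be seen in a two-stage simulation, sketched below under arbitrary parameter choices: each outcome first draws $\theta$ from the Beta distribution and then draws $X$ from the Bernoulli distribution with that $\theta$.

```python
import random

random.seed(3)
alpha, beta, n = 2.0, 3.0, 50_000

# Two-stage sampling: theta ~ Beta(alpha, beta), then X ~ Bern(theta).
xs = [1.0 if random.random() < random.betavariate(alpha, beta) else 0.0
      for _ in range(n)]

mu = sum(xs) / n
var = sum((x - mu) ** 2 for x in xs) / n
print(mu, alpha / (alpha + beta))                  # marginal mean
print(var, alpha * beta / (alpha + beta) ** 2)     # marginal variance mu (1 - mu)
```

The empirical variance should be close to $\mu(1-\mu)$ rather than to the smaller Beta variance, since the observed outcomes are still binary.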

### Gamma distribution

As was noted [previously](#Beta-distribution "Section: Beta distribution"), in a match between team A and team B, we might observe scores $S_A$ and $S_B$, respectively. These scores are non-negative and usually, but not necessarily,
integer valued. Let $X$ be the variate denoting a team's score.
Then given appropriate assumptions of independence between the teams' scores, $X$ might follow the Gamma distribution:
\begin{eqnarray}
p(X=x\mid\alpha,\beta) & ~=~ & \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}\,e^{-\beta x}
~=~\frac{x^{-1}e^{\alpha\ln x-\beta x}}{\beta^{-\alpha}\,\Gamma(\alpha)}\,.
\end{eqnarray}
This [PDF](#Probability-distribution-functions "Section: Probability distribution functions") 
is in "the" [exponential family](#Seperable-dependencies "Section: Seperable dependencies") with natural parameters
$\boldsymbol{\theta}=[\alpha,\beta]^T$, natural variates $\mathbf{u}(X)=[Y_\alpha,Y_\beta]^{T}=[\ln X, -X]^T$, and
partition function $Ƶ(\boldsymbol{\theta})$ that obeys
\begin{eqnarray}
\ln Ƶ(\boldsymbol{\theta}) & = & -\alpha\ln\beta+\ln\Gamma(\alpha)\,.
\end{eqnarray}

It therefore [follows](#Seperable-dependencies "Section: Seperable dependencies") that the means are given by
\begin{eqnarray}
\mu_{Y_\alpha} & ~=~ & \mathbb{E}[\ln X\mid\alpha,\beta]
~=~\frac{\partial}{\partial\alpha}\ln Ƶ
~=~-\ln\beta+\psi(\alpha)\,,
\\
\mu_{Y_\beta} & ~=~ & \mathbb{E}[-X\mid\alpha,\beta]
~=~\frac{\partial}{\partial\beta}\ln Ƶ
~=~-\frac{\alpha}{\beta}\,,
\end{eqnarray}
where $\psi(\cdot)$ is the *digamma* function. We note that the distributional mean is thus given by
\begin{eqnarray}
\mu & ~=~ & \mathbb{E}[X\mid\alpha,\beta]~=~\frac{\alpha}{\beta}\,.
\end{eqnarray}
Similarly, the variances are given by
\begin{eqnarray}
\sigma^2_{Y_\alpha} & ~=~ & \texttt{Var}[\ln X\mid\alpha,\beta]
~=~\frac{\partial^2}{\partial\alpha^2}\ln Ƶ
~=~\psi'(\alpha)\,,
\\
\sigma^2_{Y_\beta} & ~=~ & \texttt{Var}[-X\mid\alpha,\beta]
~=~\frac{\partial^2}{\partial\beta^2}\ln Ƶ
~=~\frac{\alpha}{\beta^2}\,,
\end{eqnarray}
and the covariance is given by
\begin{eqnarray}
\sigma_{Y_\alpha,Y_\beta} & ~=~ & \texttt{Cov}[\ln X,-X\mid\alpha,\beta]
~=~\frac{\partial^2}{\partial\alpha\partial\beta}\ln Ƶ
~=~-\frac{1}{\beta}\,,
\end{eqnarray}
such that the distributional variance is given by
\begin{eqnarray}
\sigma^2_X & ~=~ & \texttt{Var}[X\mid\alpha,\beta]
~=~\frac{\alpha}{\beta^2}\,.
\end{eqnarray}

The [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") estimate
$\hat{\boldsymbol{\theta}}_\texttt{ML}$ therefore satisfies the equations
\begin{eqnarray}
\left\langle -X\right\rangle~=~-\frac{\hat{\alpha}_\texttt{ML}}{\hat{\beta}_\texttt{ML}}
& ~\Rightarrow~ & \hat{\beta}_\texttt{ML}~=~\frac{\hat{\alpha}_\texttt{ML}}{\left\langle X\right\rangle}
\,,
\\
\left\langle\ln X\right\rangle~=~-\ln\hat{\beta}_\texttt{ML}+\psi(\hat{\alpha}_\texttt{ML})
& ~\Rightarrow~ & \ln\left\langle X\right\rangle-\left\langle\ln X\right\rangle
~=~\ln\hat{\alpha}_\texttt{ML}-\psi(\hat{\alpha}_\texttt{ML})\,.
\end{eqnarray}
[Apparently](https://en.wikipedia.org/wiki/Gamma_distribution "Wikipedia: Gamma distribution"), a good initial
estimate of $\hat{\alpha}_\texttt{ML}$ is given by
\begin{eqnarray}
\hat{\alpha} & = & \frac{3-s+\sqrt{(s-3)^2+24s}}{12s}\,,
\end{eqnarray}
where $s=\ln\left\langle X\right\rangle-\left\langle\ln X\right\rangle$.
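
The following sketch shows one way these equations might be solved in practice: a scalar Newton iteration on $\ln\alpha-\psi(\alpha)=s$, started from the initial estimate above, followed by $\hat{\beta}=\hat{\alpha}/\langle X\rangle$. The `digamma` and `trigamma` helpers (recurrence plus a truncated asymptotic series), the tolerances and the simulated data are our own assumptions. The sketch also assumes $s>0$, i.e. that the samples are positive and not all identical.

```python
import math
import random

def digamma(x):
    """psi(x) for x > 0: upward recurrence, then a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def trigamma(x):
    """psi'(x) for x > 0: upward recurrence, then a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv, inv2 = 1.0 / x, 1.0 / (x * x)
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def solve_shape(s, tol=1e-10, max_iter=50):
    """Solve ln(alpha) - psi(alpha) = s for alpha by Newton's method (s > 0)."""
    alpha = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    for _ in range(max_iter):
        f = math.log(alpha) - digamma(alpha) - s
        f_prime = 1.0 / alpha - trigamma(alpha)
        step = f / f_prime
        alpha -= step
        if abs(step) < tol * alpha:
            break
    return alpha

def fit_gamma(samples):
    """Maximum-likelihood (alpha, beta) with beta as the rate parameter."""
    n = len(samples)
    mean_x = sum(samples) / n
    mean_log_x = sum(math.log(x) for x in samples) / n
    s = math.log(mean_x) - mean_log_x
    alpha = solve_shape(s)
    return alpha, alpha / mean_x

random.seed(5)
# random.gammavariate takes (shape, scale); a scale of 0.5 is a rate of 2.
data = [random.gammavariate(3.0, 0.5) for _ in range(20_000)]
alpha_hat, beta_hat = fit_gamma(data)
print(alpha_hat, beta_hat)  # should be near (3, 2)
```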

## Regression modelling

For the case of regression modelling, we now suppose that the *response* (or *dependent*) variate $X$ is *explained* by
exogenous (or *independent*) covariates $Z$ via some regression function $\mathbf{f}(Z,\boldsymbol{\phi})$
with regression parameters $\boldsymbol{\phi}$. The usual [mean regression](#Mean-regression "Section: Mean regression") 
model is to fit $\mathbf{f}$ to the distribution mean $\boldsymbol{\mu}=\mathbb{E}[X\mid\boldsymbol{\theta}]$. We may represent this symbolically by the graphical model
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
The usual approach is to estimate the regression parameters $\boldsymbol{\phi}$ that best fit the observed data using
some form of [least-squares](#Least-squares-regression "Section: Least-squares regression") approach that minimises the square
error of fit.

One of the issues with such an approach is that the range of the regression function $\mathbf{f}$ is usually unconstrained, especially for linear models like $\mathbf{f}(\mathbf{z},\boldsymbol{\Phi})=\boldsymbol{\Phi}^{T}\mathbf{z}$.
However, the permissible values of the mean $\boldsymbol{\mu}$ are usually constrained by the PDF, for example if $X$ represents one or more proportions.
The key innovation of generalised linear modelling (GLM) introduced by
 Nelder and Wedderburn [[1]](#Citations "Citation [1]: Generalized Linear Models") was to regress, not on the mean
 $\boldsymbol{\mu}$ itself, but instead on parameters $\boldsymbol{\eta}$ related by a *link* function $\mathbf{g}$ to the mean
via $\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$. We shall henceforth refer to $\boldsymbol{\eta}$ as *link parameters*,
although this is not standard terminology. This new relationship corresponds to the graphical model
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xrightarrow{\mathbf{g}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
Now, for $\mathbf{f}$ to explain $\boldsymbol{\mu}$, we require that the link function $\mathbf{g}$ be invertible,
such that the inverse relationship may be represented by the graphical model
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
In terms of mean regression, we could also collapse this model to become
\begin{eqnarray}
\boldsymbol{\theta} & \xrightarrow{\mathbb{E}_X} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}\circ\,\mathbf{f}_Z}\boldsymbol{\phi}\,,
\end{eqnarray}
which corresponds to nonlinear least-squares regression, even with a linear regression function $\mathbf{f}$.

We now turn to the issue of how to explain the PDF parameters $\boldsymbol{\theta}$ in terms of the parameters $\boldsymbol{\mu}$, regardless of whether the mean is directly or indirectly obtained from $\mathbf{f}(Z,\boldsymbol{\phi})$.
Now, for some simple PDFs, such as the [Bernoulli](#Bernoulli-distribution "Section: Bernoulli distribution")
distribution,
the relationship between the distribution parameters $\boldsymbol{\theta}$ and the mean $\boldsymbol{\mu}$ is invertible,
and we may therefore deduce $\boldsymbol{\theta}$ from
knowledge of $\boldsymbol{\mu}$. In general, however, knowing $\boldsymbol{\mu}$ might only give us partial
knowledge of $\boldsymbol{\theta}$, as is the case, for example, with the [Beta](#Beta-distribution "Section: Beta distribution") distribution.
We therefore propose that the parameters $\boldsymbol{\theta}$ may be partitioned into two parts,
namely $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$, 
such that implicitly $\boldsymbol{\mu}=\boldsymbol{\mu}(\boldsymbol{\psi},\boldsymbol{\varphi})$.
Conversely, however, we now suppose that the *independent* parameters $\boldsymbol{\psi}$ are not obtainable from
$\boldsymbol{\mu}$ and so must be estimated separately, but that the *dependent* parameters 
$\boldsymbol{\varphi}$ are obtainable from $\boldsymbol{\mu}$, given $\boldsymbol{\psi}$, via some implicit
function $\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi}, \boldsymbol{\mu})$,
such that now $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}))$.
This inverted relationship may be represented by the graphical model
\begin{eqnarray}
\boldsymbol{\varphi} & \xleftarrow{\boldsymbol{\iota}^{-1}_\boldsymbol{\psi}} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,,
\end{eqnarray}
where $\boldsymbol{\iota}_\boldsymbol{\psi}(\cdot)$ denotes some implicit, quasi-invertible function
such that
$\boldsymbol{\mu}=\boldsymbol{\iota}_\boldsymbol{\psi}(\boldsymbol{\varphi})=\boldsymbol{\mu}(\boldsymbol{\psi},\boldsymbol{\varphi})$
and 
$\boldsymbol{\varphi}=\boldsymbol{\iota}^{-1}_\boldsymbol{\psi}(\boldsymbol{\mu})=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.

In summary, the generalised modelling approach may be boiled down to three essential requirements:

1. The mean $\boldsymbol{\mu}$ of the variate $X$ is a function of the
[PDF](#Probability-distribution-functions "Section: Probability distribution functions")
parameters $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$ via the expectation
$\boldsymbol{\mu}=\mathbb{E}[X\mid\boldsymbol{\theta}]$, with partial inversion given implicitly via
$\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.
2. There exists an invertible *link* function $\mathbf{g}$ that maps $\boldsymbol{\mu}$ into more "natural" *link* parameters
$\boldsymbol{\eta}$ via $\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$.
3. Each value of the *response* variate $X$ is sampled from a distribution for which the corresponding link parameters $\boldsymbol{\eta}$ are determined by a parameterised regression function $\mathbf{f}(Z,\boldsymbol{\phi})$ of an exogenous covariate $Z$. 
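
The three requirements above can be sketched for the simplest case of a Bernoulli response with a logit link and a linear regression function. Everything concrete here is a hypothetical choice made for illustration: the parameter values `phi`, the covariates, and the function names. There are no independent parameters $\boldsymbol{\psi}$ in this case, since $\theta=\mu$.

```python
import math

def f_linear(z, phi):
    """Linear regression function f(z, phi) = phi^T z, giving the link parameter."""
    return sum(p * zi for p, zi in zip(phi, z))

def logit(mu):
    """Link g: maps a mean in (0, 1) to an unconstrained link parameter."""
    return math.log(mu / (1.0 - mu))

def logistic(eta):
    """Inverse link g^{-1}: maps any real eta back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

phi = [0.5, -1.2, 2.0]            # hypothetical regression parameters
covariates = [[1.0, 0.3, 0.1], [1.0, 2.5, -0.4], [1.0, -1.0, 3.0]]

for z in covariates:
    eta = f_linear(z, phi)        # requirement 3: eta = f(Z, phi)
    mu = logistic(eta)            # requirement 2: mu = g^{-1}(eta)
    theta = mu                    # requirement 1: for the Bernoulli, theta = mu
    print(eta, mu, theta)
```

Although the linear predictor is unconstrained, every resulting mean lies strictly between 0 and 1, which is the purpose of the link function.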

We now provide brief explanations of [mean regression](#Mean-regression "Section: Mean regression") 
 and
[least-squares regression](#Least-squares-regression "Section: Least-squares regression"),
and then go on to expand upon the above points to derive the 
[general regression](#Generalised-nonlinear-models "Section: Generalised nonlinear models")
model, and then discuss its specialisation to a 
[linear regression](#Generalised-linear-models "Section: Generalised linear models")
function.

### Mean regression

Consider [again](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") 
the stochastic sampling process that produces an arbitrary length-$n$ sequence of independent variables,
$X_1, X_2, \ldots, X_n$. However, now we drop the requirement that these variables are identically distributed.
Instead, we suppose that (the value of) each $X_i$ is drawn from the same 
[PDF](#Probability-distribution-functions "Section: Probability distribution functions")
but with potentially different parameters $\boldsymbol{\theta}_i$, with individual means
$\boldsymbol{\mu}_{X_i}\doteq\mathbb{E}\left[X_i\mid\boldsymbol{\theta}_i\right]$
and variances $\boldsymbol{\Sigma}_{X_i}\doteq\texttt{Var}\left[X_i\mid\boldsymbol{\theta}_i\right]$.


Now, if we were allowed to sample the value of variate $X_i$ multiple times, then we would expect the values
to be displaced about the mean $\boldsymbol{\mu}_i$ according to
\begin{eqnarray}
X_i ~=~ \boldsymbol{\mu}_i+\mathbf{e}_i
& ~\Rightarrow~ & \mathbf{e}_i~=~X_i-\boldsymbol{\mu}_i
\,.
\end{eqnarray}
In this context, the displacement $\mathbf{e}_i$ is called the *noise* or the measurement *error*, 
and arises due to imprecise, stochastic measurements of the unknown mean $\boldsymbol{\mu}_i$.
This error has the distributional properties
\begin{eqnarray}
\mathbb{E}\left[\mathbf{e}_i\mid\boldsymbol{\theta}_i\right] & = & \mathbf{0}\,,
\\
\mathbb{E}\left[\mathbf{e}_i\,\mathbf{e}_i^T\mid\boldsymbol{\theta}_i\right]
& = & \texttt{Var}\left[X_i\mid\boldsymbol{\theta}_i\right]~=~\boldsymbol{\Sigma}_{X_i}(\boldsymbol{\theta}_i)\,.
\end{eqnarray}
We must keep in mind that this variance is not necessarily constant, but is generally a function of the
distributional parameters $\boldsymbol{\theta}_i$, especially the
[mean](#Seperable-dependencies "Section: Seperable dependencies").

Next, we suppose that associated with each *response* (or *dependent*) variate $X_i$ is a corresponding *exogenous*
(or *independent*) covariate $Z_i$. We further suppose that $Z_i$ explains the mean $\boldsymbol{\mu}_i$ of $X_i$ via a parameterised regression function of the form
\begin{eqnarray}
\boldsymbol{\mu}_i & = & \mathbf{f}(Z_i,\boldsymbol{\phi})+\mathbf{r}_i\,,
\end{eqnarray}
where $\mathbf{r}_i$ is called the *residual* or *error* of fit. Note that, in contrast to the measurement error, the residual arises due to error in approximating the true mean $\boldsymbol{\mu}_i$ with an estimating function
$\hat{\boldsymbol{\mu}}_i=\mathbf{f}(Z_i,\boldsymbol{\phi})$. However, since the mean $\boldsymbol{\mu}_i$ is actually unknown, we may (conceptually) encode our uncertainty about its true value into another PDF, such that the mean of this PDF obeys
\begin{eqnarray}
\mathbb{E}[\boldsymbol{\mu}_i\mid Z_i,\boldsymbol{\phi}] ~=~
\hat{\boldsymbol{\mu}}_i~=~\mathbf{f}(Z_i,\boldsymbol{\phi})
& ~\Rightarrow~ & 
\mathbb{E}[\mathbf{r}_i\mid Z_i,\boldsymbol{\phi}] ~=~ \mathbf{0}\,.
\end{eqnarray}
Similarly, this PDF has a variance that measures our uncertainty, such that
\begin{eqnarray}
\texttt{Var}[\boldsymbol{\mu}_i\mid Z_i,\boldsymbol{\phi}] & ~=~ &
\mathbb{E}[\mathbf{r}_i^{}\,\mathbf{r}_i^T\mid Z_i,\boldsymbol{\phi}]
~=~\boldsymbol{\Sigma}_{\mathbf{r}_i}(Z_i,\boldsymbol{\phi})
\,.
\end{eqnarray}
Note that this variance is generally a function of both the covariate $Z_i$ and the regression parameters
$\boldsymbol{\phi}$.

Combining the two distributions, we now obtain the regression model
\begin{eqnarray}
X_i & = & \mathbf{f}(Z_i,\boldsymbol{\phi})+\boldsymbol{\varepsilon}_i\,,
\end{eqnarray}
where $\boldsymbol{\varepsilon}_i=\mathbf{e}_i+\mathbf{r}_i$ combines both measurement error and fitting error, and hence may be considered as either an error or a residual. Consequently, we deduce that
\begin{eqnarray}
\mathbb{E}[\boldsymbol{\varepsilon}_i\mid\boldsymbol{\theta}_i, Z_i, \boldsymbol{\phi}]
& = & \mathbf{0}\,,
\\
\mathbb{E}[\boldsymbol{\varepsilon}_i\,\boldsymbol{\varepsilon}_i^T\mid\boldsymbol{\theta}_i, Z_i, \boldsymbol{\phi}] & = & 
\boldsymbol{\Sigma}_{X_i}(\boldsymbol{\theta}_i)
+\boldsymbol{\Sigma}_{\mathbf{r}_i}(Z_i,\boldsymbol{\phi})
~\doteq~\boldsymbol{\Sigma}_i(\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi})\,,
\end{eqnarray}
on the assumption that the residual $\mathbf{r}_i$ is independent of the noise $\mathbf{e}_i$.
In practice, we do not know the value of $\boldsymbol{\Sigma}_i$, and consequently must either estimate it from
the empirical distribution of $\boldsymbol{\varepsilon}_i$, or else approximate it, for example by
$\boldsymbol{\Sigma}_{X_i}$, which corresponds to assuming $\boldsymbol{\Sigma}_{\mathbf{r}_i}=\mathbf{O}$,
i.e. being extremely certain of the regression function $\mathbf{f}$.
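
As a quick numerical illustration of the variance decomposition above, the following sketch draws independent noise and residual samples and compares the sample estimate of $\mathbb{E}[\boldsymbol{\varepsilon}\,\boldsymbol{\varepsilon}^T]$ with $\boldsymbol{\Sigma}_{X}+\boldsymbol{\Sigma}_{\mathbf{r}}$. The particular covariance values, the Gaussian sampling and the sample size are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical covariances for the noise e and the residual r (illustrative values).
Sigma_X = np.array([[1.0, 0.3], [0.3, 0.5]])
Sigma_r = np.array([[0.2, -0.1], [-0.1, 0.4]])

# Independent zero-mean draws; Gaussianity is assumed only for convenience.
e = rng.multivariate_normal(np.zeros(2), Sigma_X, size=n)
r = rng.multivariate_normal(np.zeros(2), Sigma_r, size=n)
eps = e + r

# Sample estimate of E[eps eps^T], to be compared with Sigma_X + Sigma_r.
Sigma_hat = eps.T @ eps / n
max_abs_err = np.max(np.abs(Sigma_hat - (Sigma_X + Sigma_r)))
print(Sigma_hat)
print(max_abs_err)
```

If the two error sources were correlated, the cross terms would no longer vanish and the simple sum would not hold.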

### Least-squares regression

Recall from the [previous](#Mean-regresssion "Section: Mean regresssion") section
that we are considering the regression model
\begin{eqnarray}
X_i~=~\boldsymbol{\mu}_i+\mathbf{e}_i\,, & \;\; \boldsymbol{\mu}_i~=~\mathbf{f}(Z_i,\boldsymbol{\phi})+\mathbf{r}_i
& ~\Rightarrow~
X_i~=~\mathbf{f}(Z_i,\boldsymbol{\phi})+\boldsymbol{\varepsilon}_i\,.
\end{eqnarray}
In order to fit the regression function $\mathbf{f}$ to observed data 
$\mathbf{X}\doteq(X_i)_{i=1}^{n}$ and $\mathbf{Z}\doteq(Z_i)_{i=1}^{n}$,
we first redefine the
sample mean [operator](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") $\langle\cdot\rangle$ to include the covariate $Z$. Consequently, 
the sample average of an arbitrary function $\mathbf{g}(X,Z,\ldots)$ is now given by
\begin{eqnarray}
\langle\mathbf{g}\rangle(\ldots) & \doteq &
\frac{1}{n}\sum_{i=1}^{n}\mathbf{g}(X_i,Z_i,\ldots)\,,
\end{eqnarray}
where the ellipsis "$\ldots$" represents arbitrary parameters that do not vary with $X$ or $Z$.
Note that all parameters and variables that depend upon $X_i$ and/or $Z_i$ must also be indexed in this summation,
e.g. the residual variance $\boldsymbol{\Sigma}_i$. If it becomes necessary to explicitly distinguish between constants and functions of $X_i$ and $Z_i$, then we may retain the subscript, e.g. $\langle\boldsymbol{\Sigma}_i\boldsymbol{\phi}\rangle$. Also note that the sample mean may still be treated as a function of $\mathbf{X}$ and $\mathbf{Z}$, when they are considered as variables rather than known, sampled values.
This is useful for computing expectations of sample means, for example.

The fitting process requires estimating the best value of the function parameters $\boldsymbol{\phi}$ that
minimises the overall error of fit. The *ordinary* least-squares (OLS) method is to minimise the mean of the squared lengths of the residuals, namely
$S(\boldsymbol{\phi})=\left\langle\boldsymbol{\varepsilon}^{T}\boldsymbol{\varepsilon}\right\rangle$.
However, use of OLS makes some implicit assumptions, in particular that:
1. The residuals are independent of each other.
2. All residuals are equally important.
3. The elements of each residual are independent of each other.
4. The elements of each residual are equally important.

Only the first assumption of residual independence really holds. In practice, if the residuals are independent, then the plot of the fitted residuals against the covariate $Z$ should appear randomly distributed. However, if a pattern appears then the particular choice of the regression function $\mathbf{f}$ must be reconsidered. The assumption that the elements of each residual $\boldsymbol{\varepsilon}_i$ are independent does not hold in general, since
\begin{eqnarray}
\mathbb{E}[\boldsymbol{\varepsilon}_i^{T}\,\boldsymbol{\varepsilon}_i
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]
& ~=~ &
\mathbb{E}[
\texttt{trace}\left(\boldsymbol{\varepsilon}_i\,\boldsymbol{\varepsilon}_i^T\right)
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]
~=~\texttt{trace}\left(\boldsymbol{\Sigma}_i\right)\,.
\end{eqnarray}
Consequently, the assumptions of equal importance do not hold either, since a residual (or an element of a residual)
with higher variance has higher uncertainty associated with its fit, and hence should be assigned less weight.
A natural remedy is to weight each residual (and each element) in inverse proportion to its variance.
Note that use of OLS corresponds to assuming constant and equal variances of the form
$\boldsymbol{\Sigma}_i=\mathbf{I}\,\sigma^2$.

We can solve both problems of element-wise non-independence and unequal weighting of residuals and elements by applying a so-called
*whitening transformation* that decouples within-residual correlations and standardises the variances. This transformation takes the form
\begin{eqnarray}
\tilde{\boldsymbol{\varepsilon}}_i & ~\doteq~ & 
\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\boldsymbol{\varepsilon}_i
~=~\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\left[X_i-\mathbf{f}(Z_i,\boldsymbol{\phi})\right]\,,
\end{eqnarray}
such that 
\begin{eqnarray}
\mathbb{E}[\tilde{\boldsymbol{\varepsilon}}_i^{T}\,\tilde{\boldsymbol{\varepsilon}}_i
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]
& ~=~ &
\texttt{trace}\left(\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\,\mathbb{E}[
\boldsymbol{\varepsilon}_i\,\boldsymbol{\varepsilon}_i^T
\mid\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi}]\,
\boldsymbol{\Sigma}_i^{-\frac{1}{2}}\right)
~=~\texttt{trace}\left(\mathbf{I}\right)\,.
\end{eqnarray}
Consequently, the *weighted* least-squares (WLS) method is to minimise
\begin{eqnarray}
S(\boldsymbol{\phi}) & ~=~ &
\left\langle\tilde{\boldsymbol{\varepsilon}}^{T}\,\tilde{\boldsymbol{\varepsilon}}\right\rangle
~=~\left\langle
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]^{T}\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]
\right\rangle\,.
\end{eqnarray}
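
The whitening step and the resulting WLS objective can be sketched numerically as follows. The helper names `inv_sqrt` and `wls_objective`, the data values and the use of the symmetric (eigendecomposition-based) inverse square root are all assumptions made for illustration; any matrix square root of $\boldsymbol{\Sigma}_i^{-1}$ would give the same objective.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Symmetric inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(Sigma)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def wls_objective(X, F, Sigmas):
    """Mean of the whitened squared residual lengths, <eps~^T eps~>."""
    total = 0.0
    for x, f, S in zip(X, F, Sigmas):
        eps_tilde = inv_sqrt(S) @ (x - f)
        total += eps_tilde @ eps_tilde
    return total / len(X)

# Illustrative data: two observations with different residual covariances.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
F = np.array([[0.8, 1.5], [0.0, -0.5]])          # fitted values f(Z_i, phi)
Sigmas = np.array([[[2.0, 0.5], [0.5, 1.0]],
                   [[0.5, 0.0], [0.0, 4.0]]])

S_whitened = wls_objective(X, F, Sigmas)

# The same quantity written directly in terms of Sigma^{-1}.
S_direct = np.mean([(x - f) @ np.linalg.solve(S, x - f)
                    for x, f, S in zip(X, F, Sigmas)])

# Whitening maps each covariance to the identity.
W = inv_sqrt(Sigmas[0])
whitened_cov = W @ Sigmas[0] @ W
print(S_whitened, S_direct)
```

The two forms agree because $\boldsymbol{\Sigma}_i^{-\frac{1}{2}}$ is symmetric, so the product of the two whitening factors collapses to $\boldsymbol{\Sigma}_i^{-1}$.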

However, we [recall](#Mean-regresssion "Section: Mean regresssion") that 
$\boldsymbol{\Sigma}_i=\boldsymbol{\Sigma}_i(\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi})$ is a function
of $\boldsymbol{\phi}$. Consequently, WLS is typically a nonlinear problem even when $\mathbf{f}$ is a linear function of $\boldsymbol{\phi}$. To overcome this difficulty, we 
use an iterative approximation where we
evaluate $\boldsymbol{\Sigma}_i$ at the
previous estimate $\boldsymbol{\phi}$, but evaluate $\mathbf{f}$ at the new estimate
$\boldsymbol{\phi}'$, resulting in
\begin{eqnarray}
S(\boldsymbol{\phi}, \boldsymbol{\phi}') & ~=~ &
\left\langle
\left[X-\mathbf{f}(Z,\boldsymbol{\phi}')\right]^{T}\,
\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta},Z,\boldsymbol{\phi})\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi}')\right]
\right\rangle\,.
\end{eqnarray}
Next, we substitute the Taylor series approximation
\begin{eqnarray}
\mathbf{f}(Z,\boldsymbol{\phi}') & ~\approx~ &
\mathbf{f}(Z,\boldsymbol{\phi})
+\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\,,
\end{eqnarray}
where
\begin{eqnarray}
\boldsymbol{\phi}' & ~=~ & \boldsymbol{\phi}+\Delta\boldsymbol{\phi}\,,
\end{eqnarray}
to obtain the new approximation
\begin{eqnarray}
S(\boldsymbol{\phi}, \Delta\boldsymbol{\phi}) & ~=~ &
\left\langle
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})
-\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\right]^{T}\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})
-\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\right]
\right\rangle\,.
\end{eqnarray}
Finally, we take the gradient with respect to $\Delta\boldsymbol{\phi}$ to obtain
\begin{eqnarray}
\boldsymbol{\nabla}_{\Delta\boldsymbol{\phi}}S & ~=~ &
-2\,\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})
-\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}(Z,\boldsymbol{\phi})\,\Delta\boldsymbol{\phi}
\right]
\right\rangle\,,
\end{eqnarray}
which vanishes when
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\boldsymbol{\nabla}^T_\boldsymbol{\phi}\mathbf{f}(Z,\boldsymbol{\phi})
\right\rangle\,\Delta\boldsymbol{\phi}
& ~=~ &
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]
\right\rangle
\,.
\end{eqnarray}
This is the nonlinear form of the *iteratively reweighted* least-squares (IRLS) method, so called because the
variance $\boldsymbol{\Sigma}_i(\boldsymbol{\theta}_i,Z_i,\boldsymbol{\phi})$ must be
re-evaluated after every update of the parameter estimate $\boldsymbol{\phi}$.
Note that the iterations will cease when $\Delta\boldsymbol{\phi}=\mathbf{0}$, at which point the
solution satisfies
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\Sigma}^{-1}\,
\left[X-\mathbf{f}(Z,\boldsymbol{\phi})\right]
\right\rangle & ~=~ & \mathbf{0}\,.
\end{eqnarray}
The latter is just the solution to $\boldsymbol{\nabla}_\boldsymbol{\phi}S=\mathbf{0}$ from the original WLS formulation, on the assumption that $\boldsymbol{\Sigma}$ is held constant for the update and recomputed after the update.
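
A minimal sketch of the nonlinear IRLS iteration is given below for a scalar response. The regression function $f(z,\boldsymbol{\phi})=e^{\phi_0+\phi_1 z}$, the Poisson-like variance function $\Sigma_i=f(Z_i,\boldsymbol{\phi})$, the simulated data and the function names are all assumptions made for illustration, and are not prescribed by the derivation above.

```python
import numpy as np

def f(Z, phi):
    """Hypothetical scalar regression function f(z, phi) = exp(phi_0 + phi_1 z)."""
    return np.exp(phi[0] + phi[1] * Z)

def grad_f(Z, phi):
    """Gradient of f with respect to phi, one row per observation."""
    return f(Z, phi)[:, None] * np.column_stack([np.ones_like(Z), Z])

def variance(Z, phi):
    """Assumed variance function Sigma_i = f(Z_i, phi) (Poisson-like)."""
    return f(Z, phi)

def irls(X, Z, phi, max_iter=50, tol=1e-10):
    n = len(X)
    for _ in range(max_iter):
        G = grad_f(Z, phi)                    # n x p matrix of gradients
        w = 1.0 / variance(Z, phi)            # Sigma^{-1} at the previous estimate
        resid = X - f(Z, phi)
        A = (G * w[:, None]).T @ G / n        # <grad f Sigma^{-1} grad^T f>
        b = (G * w[:, None]).T @ resid / n    # <grad f Sigma^{-1} (X - f)>
        delta = np.linalg.solve(A, b)
        phi = phi + delta                     # variance is re-evaluated next pass
        if np.max(np.abs(delta)) < tol:
            break
    return phi

rng = np.random.default_rng(1)
Z = rng.uniform(-1.0, 1.0, size=2000)
phi_true = np.array([1.0, 0.5])
X = rng.poisson(f(Z, phi_true)).astype(float)

phi_hat = irls(X, Z, phi=np.array([np.log(X.mean()), 0.0]))

# At convergence the weighted normal equations should be satisfied.
G = grad_f(Z, phi_hat)
stationarity = (G / variance(Z, phi_hat)[:, None]).T @ (X - f(Z, phi_hat)) / len(X)
print(phi_hat, stationarity)
```

The starting value matches the sample mean of $X$ at $\phi_1=0$, which is one simple choice; a poor starting value may require damping of the step $\Delta\boldsymbol{\phi}$.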

### Generalised nonlinear models

[Previously](#Mean-regresssion "Section: Mean regresssion"), we considered the
regression of the mean $\boldsymbol{\mu}$ via some parameterised function
$\mathbf{f}(Z,\boldsymbol{\phi})$ of the covariate $Z$.
In my opinion, the key contribution of Nelder and Wedderburn
[[1]](#Citations "Citation [1]: Generalized Linear Models")
to generalised linear modelling (GLM) lies in the fact that we may instead apply regression, not to $\boldsymbol{\mu}$,
but to some more natural parameterisation $\boldsymbol{\eta}=\mathbf{g}(\boldsymbol{\mu})$,
where $\mathbf{g}(\cdot)$ is known as the *link* function.
We may therefore depict the resulting relationships by the graphical model
\begin{eqnarray}
\boldsymbol{\varphi} & \xrightarrow{\boldsymbol{\iota}_\boldsymbol{\psi}} & \boldsymbol{\mu}
\xrightarrow{\mathbf{g}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,,
\end{eqnarray}
which may further be interpreted as meaning that the parameters $\boldsymbol{\varphi}$ and $\boldsymbol{\phi}$ are conditionally independent given $\boldsymbol{\eta}$ (and $Z$ and $\boldsymbol{\psi}$).
It therefore [follows](#Parameter-transformations "Section: Parameter transformations") that
the gradient of the log-likelihood $L$ with respect to the regression parameters $\boldsymbol{\phi}$ takes the form
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}L & = &
\boldsymbol{\nabla}_\boldsymbol{\phi}\boldsymbol{\eta}^T\,\boldsymbol{\nabla}_\boldsymbol{\eta}L
~=~\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}(Z,\boldsymbol{\phi})\,
\boldsymbol{\nabla}_\boldsymbol{\eta}L(\boldsymbol{\theta}; X)\,.
\end{eqnarray}
Note that although the last term on the right-hand side appears to be a function only of $X$ and
$\boldsymbol{\theta}$, we must remember that $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$, and that $\boldsymbol{\varphi}$ is related to $\boldsymbol{\eta}=\mathbf{f}(Z,\boldsymbol{\phi})$ via the graphical model above.
In fact, in order to compute the gradient of the log-likelihood $L$ with respect to the parameters $\boldsymbol{\eta}$,
we must use the fact that the link function $\mathbf{g}$ is invertible, as represented
by the graphical model
\begin{eqnarray}
\boldsymbol{\varphi} & \xleftarrow{\boldsymbol{\iota}_\boldsymbol{\psi}^{-1}} & \boldsymbol{\mu}
\xleftarrow{\mathbf{g}^{-1}}\boldsymbol{\eta}
\xleftarrow{\mathbf{f}_Z}\boldsymbol{\phi}\,.
\end{eqnarray}
Consequently, we obtain
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}L & = &
\boldsymbol{\nabla}_\boldsymbol{\eta}\boldsymbol{\mu}^T\,
\boldsymbol{\nabla}_\boldsymbol{\mu}\boldsymbol{\varphi}^T\,
\boldsymbol{\nabla}_\boldsymbol{\varphi}L
\\& = &
\left[\frac{\partial\mathbf{g}^T}{\partial\boldsymbol{\mu}}\right]^{-1}\,
\left[\frac{\partial\boldsymbol{\mu}^T}{\partial\boldsymbol{\varphi}}\right]^{-1}\,
\boldsymbol{\nabla}_\boldsymbol{\varphi}L\,.
\end{eqnarray}
Implicitly, we may therefore suppose that $\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}(Z,\boldsymbol{\phi}))$,
such that 
$\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}(Z,\boldsymbol{\phi})))$.

The requisite 
[expectations](#Expectations-and-log-likelihoods "Section: Expectations and log-likelihoods")
are therefore given by
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\phi}L\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}\right]
& = & \boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\mid\boldsymbol{\theta}\right]
~=~ \mathbf{0}\,,
\end{eqnarray}
and
\begin{eqnarray}
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\phi}L\,\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}L
\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}\right]
& = & \boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]
\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}
\,.
\end{eqnarray}
Consequently, the [maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation")
estimate $\hat{\boldsymbol{\phi}}_\texttt{ML}$ may be obtained iteratively via updates of the form
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,
\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]
\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}
\right\rangle\,\Delta\boldsymbol{\phi} & = &
\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,\boldsymbol{\nabla}_\boldsymbol{\eta}L\rangle
\,.
\end{eqnarray}

Note that the natural *link* parameters $\boldsymbol{\eta}$ here are, in general, not the same as the natural *distributional* parameters used
[previously](#Seperable-dependencies "Section: Seperable dependencies"), which were also denoted by $\boldsymbol{\eta}$.
Despite this, we [may](#General-form "Section: General form")
still introduce a new variate $Y_\boldsymbol{\eta}$ as a function of $X$ (and $\boldsymbol{\theta}$),
such that
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\eta}L~=~Y_\boldsymbol{\eta}-\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}
& ~\Rightarrow~ & 
\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}~=~\mathbb{E}\left[Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right]
\,.
\end{eqnarray}
The variance is then directly obtained as
\begin{eqnarray}
\boldsymbol{\Sigma}_{Y_\boldsymbol{\eta}} & ~=~ &
\texttt{Var}\left[Y_\boldsymbol{\eta}\mid\boldsymbol{\theta}\right]
~=~\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\eta}L\,
\boldsymbol{\nabla}_\boldsymbol{\eta}^{T}L\mid\boldsymbol{\theta}\right]\,.
\end{eqnarray}
It then follows that the update for the parameter $\boldsymbol{\phi}$ is given by
\begin{eqnarray}
\left\langle
\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}\,\boldsymbol{\Sigma}_{Y_\boldsymbol{\eta}}
\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\mathbf{f}
\right\rangle\,\Delta\boldsymbol{\phi} & = &
\left\langle\boldsymbol{\nabla}_\boldsymbol{\phi}\mathbf{f}^{T}
\left[Y_\boldsymbol{\eta}-\boldsymbol{\mu}_{Y_\boldsymbol{\eta}}\right]
\right\rangle
\,.
\end{eqnarray}
We may observe the similarity (and dissimilarity) with the nonlinear 
[IRLS](#Least-squares-regression "Section: Least-squares regression") update equation.

Lastly, the maximum likelihood estimate $\hat{\boldsymbol{\psi}}_\texttt{ML}$
is directly obtained via iterative updates of the form
\begin{eqnarray}
\left\langle\mathbb{E}\left[\boldsymbol{\nabla}_\boldsymbol{\psi}L\,\boldsymbol{\nabla}_\boldsymbol{\psi}^{T}L
\mid\boldsymbol{\theta}\right]\right\rangle
\,\Delta\boldsymbol{\psi} & = &
\langle\boldsymbol{\nabla}_\boldsymbol{\psi}L\rangle
\,,
\end{eqnarray}
since $\boldsymbol{\psi}$ is an independent parameter that governs the distribution. Similarly to above, the parameter
$\boldsymbol{\psi}$ does not have to be a natural distributional parameter. Regardless, we may also define the variates
$Y_\boldsymbol{\psi}$ as functions of $X$, with corresponding means
$\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}$ and covariances $\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}$,
such that the update for parameter $\boldsymbol{\psi}$ becomes
\begin{eqnarray}
\left\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\right\rangle
\,\Delta\boldsymbol{\psi} & = &
\left\langle Y_\boldsymbol{\psi}-\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}\right\rangle
\,.
\end{eqnarray}
Despite this simplified form, recall that $\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}$ and  $\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}$ are still indirectly functions of $Z$ and $\boldsymbol{\phi}$
via $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu}(Z,\boldsymbol{\phi})))$.

### Generalised linear models

Generalised linear modelling (GLM) now follows immediately from
[nonlinear modelling](#Generalised-nonlinear-models "Section: Generalised nonlinear models").
In theory, the most general linear model is given by
\begin{eqnarray}
\mathbf{f}(\mathbf{z},\boldsymbol{\Phi}) & \doteq & \boldsymbol{\Phi}^T\,\mathbf{z}\,,
\end{eqnarray}
where the variate $Z$ is multi-dimensional (and may also include a constant component), and the parameters
$\boldsymbol{\Phi}$ take the form of a matrix. However, we may subsequently separate $\mathbf{f}$ into its components by
considering independent scalar functions parameterised by each column
of $\boldsymbol{\Phi}=[\boldsymbol{\phi}_k]$.
We may therefore assume, without loss of generality, that the regression function takes the simple, scalar form
\begin{eqnarray}
\eta & = & f(Z,\boldsymbol{\phi}) ~\doteq~ \boldsymbol{\phi}^T\,Z
~=~Z^{T}\boldsymbol{\phi}\,.
\end{eqnarray}
Note, however, that in the full vector case, we would still need to reconstruct $\boldsymbol{\eta}=[\eta_k]$
to obtain $\boldsymbol{\mu}=\mathbf{g}^{-1}(\boldsymbol{\eta})$ and thus 
$\boldsymbol{\varphi}=\boldsymbol{\varphi}(\boldsymbol{\psi},\boldsymbol{\mu})$.

The gradient of the log-likelihood is now given by
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}L  & ~=~ & Z\,\nabla_\eta L~=~Z\,\left[Y_\eta-\mu_{Y_\eta}\right]\,,
\end{eqnarray}
such that the parameter update
\begin{eqnarray}
\boldsymbol{\phi}' & = & \boldsymbol{\phi} + \Delta\boldsymbol{\phi}
\end{eqnarray}
is obtained via the solution of
\begin{eqnarray}
\left\langle
\sigma^2_{Y_\eta}\,
ZZ^T
\right\rangle\,\Delta\boldsymbol{\phi} & = &
\left\langle \left(Y_\eta-\mu_{Y_\eta}\right)\,Z\right\rangle
\,.
\end{eqnarray}
Consequently, applying the above matrix (on the left-hand side) directly to the updated parameters $\boldsymbol{\phi}'$ gives
\begin{eqnarray}
\langle\sigma^2_{Y_\eta}\,ZZ^T\rangle\,\boldsymbol{\phi}' & ~=~ &
\langle\sigma^2_{Y_\eta}\,ZZ^T\rangle\,\boldsymbol{\phi}+\langle\sigma^2_{Y_\eta}\,ZZ^T\rangle\,
\Delta\boldsymbol{\phi}
\\& = & \langle\sigma^2_{Y_\eta}\,Z\,\eta\rangle+
\left\langle \left(Y_\eta-\mu_{Y_\eta}\right)\,Z\right\rangle
\\&=& \left\langle \left(Y_\eta-\mu_{Y_\eta}+\eta\,\sigma^2_{Y_\eta}\right)\,Z\right\rangle
\,,
\end{eqnarray}
since $\eta=Z^{T}\boldsymbol{\phi}$.
Similarly, for the parameter update
\begin{eqnarray}
\boldsymbol{\psi}' & = & \boldsymbol{\psi} + \Delta\boldsymbol{\psi}\,,
\end{eqnarray}
we obtain
\begin{eqnarray}
\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\rangle\,\boldsymbol{\psi}' & ~=~ &
\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\rangle\,\boldsymbol{\psi}+
\langle\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\rangle\,\Delta\boldsymbol{\psi}
\\& = &
\left\langle 
Y_\boldsymbol{\psi}-\boldsymbol{\mu}_{Y_\boldsymbol{\psi}}+\boldsymbol{\Sigma}_{Y_\boldsymbol{\psi}}\,\boldsymbol{\psi}
\right\rangle
\,.
\end{eqnarray}


For comparison, let us now take another look at [least-squares](#Least-squares-regression "Section: Least-squares regression")
regression. We start with WLS using the weighted residual $\tilde{\varepsilon}$ such that the square error is
\begin{eqnarray}
S(\boldsymbol{\phi}) & ~=~ & \left\langle\tilde{\varepsilon}^2\right\rangle
~=~\left\langle\frac{(Y_\eta-\mu_{Y_\eta})^2}{\sigma_{Y_\eta}^2}\right\rangle\,.
\end{eqnarray}
Next, we make use of IRLS by temporarily holding $\sigma^2_{Y_\eta}$ constant and taking the gradient of
the mean $\mu_{Y_\eta}$ (abbreviated to $\mu$ in the derivatives below) with $\eta=Z^T\boldsymbol{\phi}$,
giving the approximate gradient
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}\,S & ~\approx~ & 
-2\,\left\langle\frac{Y_\eta-\mu_{Y_\eta}}{\sigma^2_{Y_\eta}}\,\frac{\partial\mu}{\partial\eta}\,\frac{\partial\eta}{\partial\boldsymbol{\phi}}
\right\rangle
~=~-2\,\left\langle(Y_\eta-\mu_{\small Y_\eta})\,
\frac{\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]}{\sigma^2_{Y_\eta}}
\,Z\right\rangle
\,.
\end{eqnarray}
Next, we temporarily hold $\nabla_\eta Y_\eta$ constant and take a second gradient of $\mu$ to obtain
the approximate Hessian as
\begin{eqnarray}
\boldsymbol{\nabla}_\boldsymbol{\phi}\,\boldsymbol{\nabla}_\boldsymbol{\phi}^{T}\,S & ~\approx~ & 
2\left\langle\frac{\partial\mu}{\partial\eta}\,\frac{\partial\eta}{\partial\boldsymbol{\phi}}\,
\frac{\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]}{\sigma^2_{Y_\eta}}
\,Z^T\right\rangle
\\
&=& 2\left\langle
\frac{\left(\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]\right)^2}
{\sigma^2_{Y_\eta}}
\,Z\,Z^T\right\rangle
\,.
\end{eqnarray}
Finally, we use the Newton-Raphson approach to obtain the approximate update
\begin{eqnarray}
\left\langle
\frac{\left(\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]\right)^2}
{\sigma^2_{Y_\eta}}
\,Z\,Z^T\right\rangle
\,\Delta\boldsymbol{\phi} & ~=~ &
\left\langle(Y_\eta-\mu_{Y_\eta})\,
\frac{\sigma^2_{Y_\eta}+\mathbb{E}[\nabla_\eta Y_\eta\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}]}{\sigma^2_{Y_\eta}}
\,Z\right\rangle
\,.
\end{eqnarray}
We observe that this [IRLS](#Least-squares-regression "Section: Least-squares regression") 
method only becomes identical to the 
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") 
approach in the special case where $\nabla_\eta Y_\eta=0$, i.e.
where $Y_\eta$ is a natural variate and thus $\eta$ is both a natural parameter and a link parameter.

### Bernoulli regression

We [recall](#Bernoulli-distribution "Section: Bernoulli distribution")
that the Bernoulli distribution parameterised by $\theta$ has mean $\mu=\theta$ and variance
$\sigma_X^2=\theta\,(1-\theta)$. We also recall that the natural parameterisation of the distribution is
$\eta=\sigma^{-1}(\theta)$, where $\sigma^{-1}(\cdot)$ is the logit function, such that the natural gradient of
the log-likelihood $L$ is $\nabla_\eta L=X-\mu$. Also, since $\mu=\theta$, we find that 
$\eta=\sigma^{-1}(\mu)$ is the natural parameterisation of the mean $\mu$.
Therefore, the dependent parameter is $\varphi=\theta$, and there is no independent parameter $\psi$.

We now assume, for convenience, the linear regression function $\eta=Z^T\boldsymbol{\phi}$.
Hence, from the [previous](#Generalised-linear-models "Section: Generalised linear models") section,
the iterative parameter update for $\boldsymbol{\phi}$ takes either the form
\begin{eqnarray}
\langle\mu\,(1-\mu)\,ZZ^T\rangle\,\Delta\boldsymbol{\phi}
& ~=~ & \langle (X-\mu)\,Z\rangle\,,
\end{eqnarray}
or
\begin{eqnarray}
\langle\mu\,(1-\mu)\,ZZ^T\rangle\,\boldsymbol{\phi}' 
& ~=~ & \langle \left[X-\mu+\eta\,\mu\,(1-\mu)\right]\,Z\rangle\,,
\end{eqnarray}
where $\eta=Z^T\boldsymbol{\phi}$ and $\mu=\sigma(\eta)$ are both functions of $Z$ and of the current
regression parameter estimate $\boldsymbol{\phi}$.
Also note that here $X$ is the natural variate, and $\eta$ is both the natural parameter and the link parameter. Hence,
from the [previous](#Generalised-linear-models "Section: Generalised linear models") section, we see that the Bernoulli
distribution is one of the special cases where the 
[maximum likelihood](#Maximum-likelihood-estimation "Section: Maximum likelihood estimation") approach
is identical to the 
[least-squares](#Least-squares-regression "Section: Least-squares regression") approach.
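
The two equivalent forms of the Bernoulli update can be sketched as follows. The simulated data, the true parameter values and the function names are assumptions made for illustration; the covariate $Z$ is taken to include a constant component.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def bernoulli_step_delta(X, Z, phi):
    """Solve <mu(1-mu) Z Z^T> delta = <(X - mu) Z> and return phi + delta."""
    mu = sigmoid(Z @ phi)
    w = mu * (1.0 - mu)
    A = (Z * w[:, None]).T @ Z / len(X)
    b = Z.T @ (X - mu) / len(X)
    return phi + np.linalg.solve(A, b)

def bernoulli_step_direct(X, Z, phi):
    """Solve <mu(1-mu) Z Z^T> phi' = <[X - mu + eta mu(1-mu)] Z> for phi'."""
    eta = Z @ phi
    mu = sigmoid(eta)
    w = mu * (1.0 - mu)
    A = (Z * w[:, None]).T @ Z / len(X)
    b = Z.T @ (X - mu + eta * w) / len(X)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(2)
n = 5000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant plus one covariate
phi_true = np.array([-0.5, 1.5])
X = (rng.uniform(size=n) < sigmoid(Z @ phi_true)).astype(float)

# Compare the two forms at a non-trivial parameter value.
phi_start = bernoulli_step_delta(X, Z, np.zeros(2))
next_delta = bernoulli_step_delta(X, Z, phi_start)
next_direct = bernoulli_step_direct(X, Z, phi_start)

# Iterate to convergence and check that the score <(X - mu) Z> vanishes.
phi = np.zeros(2)
for _ in range(25):
    phi = bernoulli_step_delta(X, Z, phi)
score = Z.T @ (X - sigmoid(Z @ phi)) / n
print(phi, score)
```

Both step functions should return the same $\boldsymbol{\phi}'$ up to rounding, since the second merely adds $\langle\mu\,(1-\mu)\,ZZ^T\rangle\,\boldsymbol{\phi}$ to both sides of the first.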

### Beta regression

Recall that the [Beta distribution](#Beta-distribution "Section: Beta distribution")
has natural parameters $\alpha$ and $\beta$, with mean
\begin{eqnarray}
\mu & ~=~ & \frac{\alpha}{\alpha+\beta}~=~\frac{1}{1+\frac{\beta}{\alpha}}
~=~\frac{1}{1+e^{-\ln\frac{\alpha}{\beta}}}\,.
\end{eqnarray}
Hence, the natural link parameter is given by
\begin{eqnarray}
\eta & ~=~ & \ln\frac{\alpha}{\beta}~=~\sigma^{-1}(\mu)\,.
\end{eqnarray}
We may invert this relationship to obtain either $\alpha=\beta\,e^\eta$ or $\beta=\alpha\,e^{-\eta}$, such that the
required log-likelihood gradient is given by
\begin{eqnarray}
\frac{\partial L}{\partial\eta} & ~=~ & 
\frac{\partial L}{\partial\alpha}\,\frac{\partial\alpha}{\partial\eta}
+\frac{\partial L}{\partial\beta}\,\frac{\partial\beta}{\partial\eta}
~=~\alpha\,(Y_\alpha-\mu_{Y_\alpha})-\beta\,(Y_\beta-\mu_{Y_\beta})\,,
\end{eqnarray}
where $Y_\alpha=\ln X$ and $Y_\beta=\ln(1-X)$ are the natural variates, and $\mu_{Y_\alpha}$ and
$\mu_{Y_\beta}$ are their respective means. 
We may therefore define the new *link* variate $Y_\eta$ and its mean $\mu_{Y_\eta}$ as
\begin{eqnarray}
Y_\eta~\doteq~\alpha\,Y_\alpha-\beta\,Y_\beta & \;\;~\mbox{and}~\;\;
\mu_{Y_\eta}~=~\alpha\,\mu_{Y_\alpha}-\beta\,\mu_{Y_\beta}\,,
\end{eqnarray}
respectively. Its variance is therefore
\begin{eqnarray}
\sigma^2_{Y_\eta} & ~=~ & \texttt{Var}[Y_\eta\mid\alpha,\beta]
~=~\alpha^2\,\sigma^2_{Y_\alpha}+\beta^2\,\sigma^2_{Y_\beta}-2\alpha\beta\,\sigma_{\small Y_\alpha,Y_\beta}\,.
\end{eqnarray}
Hence, the update for the [generalised linear regression](#Generalised-linear-models "Section: Generalised linear models")
parameter $\boldsymbol{\phi}$ is just
\begin{eqnarray}
\langle\sigma^2_{Y_\eta}\,Z\,Z^T\rangle\,\Delta\boldsymbol{\phi} & ~=~ &
\left\langle \left(Y_\eta-\mu_{Y_\eta}\right)\,Z\right\rangle
\,.
\end{eqnarray}

Note that since we cannot recover both $\alpha$ and $\beta$ from $\eta$ (or $\mu$), we must choose one of these parameters to be the independent parameter $\psi$, and the other to be the dependent parameter $\varphi$.
Following Kieschnick and McCullough [[3]](#Citations 
"Citation [3]: Regression analysis of variates observed on $(0, 1)$"),
we choose $\psi=\alpha$ and $\varphi=\beta=\alpha\,e^{-\eta}$.
Hence, the iterative update for $\alpha$ takes the form
\begin{eqnarray}
\langle\sigma^2_{Y_\alpha}\rangle\,\Delta\alpha & = &
\left\langle Y_\alpha-\mu_{Y_\alpha}\right\rangle
\,,
\end{eqnarray}
where $Y_\alpha\doteq\ln X$. The mean $\mu_{Y_\alpha}$ and variance $\sigma^2_{Y_\alpha}$ of
the variate $Y_\alpha$ may be obtained from 
the [Beta distribution](#Beta-distribution "Section: Beta distribution").

### Beta-Bernoulli regression

Recall that the [Beta-Bernoulli distribution](#Beta-Bernoulli-distribution "Section: Beta-Bernoulli distribution")
  has mean $\mu$ and variance $\sigma^2_X$ given by
\begin{eqnarray}
\mu~=~\frac{\alpha}{\alpha+\beta}\,, & \;\;\mbox{and}\;\; & \sigma^2_X~=~\frac{\alpha\beta}{(\alpha+\beta)^2}\,,
\end{eqnarray}
respectively, 
along with a natural parameter $\eta$ and link function $g(\mu)$ given by
\begin{eqnarray}
\eta & ~=~ & \ln\frac{\alpha}{\beta}~=~\sigma^{-1}(\mu)\,.
\end{eqnarray}
We therefore deduce that $\eta$ is also the natural link parameter for a logit link function.

Following [Beta regression](#Beta-regression "Section: Beta regression"), 
we take $\psi=\alpha$ as the independent parameter, and $\varphi=\beta=\alpha\,e^{-\eta}$ as the dependent parameter. Consequently, we derive that
\begin{eqnarray}
\frac{\partial L}{\partial\alpha} & ~=~ &
\frac{\partial\eta}{\partial\alpha}\frac{\partial L}{\partial\eta}
~=~\frac{1}{\alpha}\,(X-\mu)\,,
\end{eqnarray}
which results in a variate $Y_\alpha$ and corresponding mean $\mu_{Y_\alpha}$ given by
\begin{eqnarray}
Y_\alpha~\doteq~\frac{X}{\alpha}\,, & \;\;\mbox{and}\;\; & 
\mu_{Y_\alpha}~=~\frac{\mu}{\alpha}~=~\frac{1}{\alpha+\beta}\,,
\end{eqnarray}
respectively.
The variance $\sigma^2_{Y_\alpha}$ of $Y_\alpha$ is then given by
\begin{eqnarray}
\sigma^2_{Y_\alpha} & ~=~ & \frac{\sigma^2_X}{\alpha^2}~=~
\frac{\beta}{\alpha\,(\alpha+\beta)^2}\,.
\end{eqnarray}
The iterative update equation for the Beta parameter $\alpha$ is thus
\begin{eqnarray}
\langle\sigma^2_{Y_\alpha}\rangle\,\Delta\alpha & ~=~ & 
\langle Y_\alpha-\mu_{Y_\alpha}\rangle
\\
\Rightarrow
\Delta\alpha & = & \alpha\,\frac{\langle X-\mu\rangle}{\langle\mu(1-\mu)\rangle}
\,.
\end{eqnarray}

Finally, since $\eta$ is both the link parameter and the natural parameter, then the iterative
[GLM](#Generalised-linear-models "Section: Generalised linear models")
update for regression parameters $\boldsymbol{\phi}$ is just
\begin{eqnarray}
\left\langle\sigma^2_X\,ZZ^T\right\rangle\,\Delta\boldsymbol{\phi} & ~=~ &
\left\langle\mu(1-\mu)\,ZZ^T\right\rangle\,\Delta\boldsymbol{\phi}~=~
\left\langle(X-\mu)\,Z\right\rangle\,.
\end{eqnarray}
Note that this is just the same update equation as for plain
[Bernoulli regression](#Bernoulli-regression "Section: Bernoulli regression").

### Gamma regression

We [recall](#Gamma-distribution "Section: Gamma distribution") that the Gamma distribution has natural parameters
$\alpha$ and $\beta$, and natural variates $Y_\alpha=\ln X$ and $Y_\beta=-X$. The mean and variance of the distribution are
given by
\begin{eqnarray}
\mu~=~\frac{\alpha}{\beta}\,, & \;\;~\mbox{and}~\;\; & \sigma^2_X~=~\frac{\alpha}{\beta^2}\,,
\end{eqnarray}
respectively. Following the
[previous](#Beta-regression "Section: Beta regression") section, we take the link parameter to be
\begin{eqnarray}
\eta~=~\ln\frac{\alpha}{\beta}~=~\ln\mu & ~\Rightarrow~ & \mu~=~e^\eta\,,
\end{eqnarray}
which gives $\alpha=\beta\;e^{\eta}$ and $\beta=\alpha\,e^{-\eta}$ as before. Thus, we again obtain
\begin{eqnarray}
\frac{\partial L}{\partial\eta} & ~=~ &
\frac{\partial L}{\partial\alpha}\,\frac{\partial\alpha}{\partial\eta}
+\frac{\partial L}{\partial\beta}\,\frac{\partial\beta}{\partial\eta}
~=~\alpha\,(Y_\alpha-\mu_{Y_\alpha})-\beta\,(Y_\beta-\mu_{Y_\beta})\,,
\end{eqnarray}
leading to the new link variate 
\begin{eqnarray}
Y_\eta & ~\doteq~ & \alpha\,Y_\alpha-\beta\,Y_\beta~=~\alpha\,\ln X+\beta\,X\,,
\end{eqnarray}
with mean $\mu_{Y_\eta}$ and variance $\sigma^2_{Y_\eta}$ given by
\begin{eqnarray}
\mu_{Y_\eta} & ~=~ & \alpha\,\mu_{Y_\alpha}-\beta\,\mu_{Y_\beta}
~=~\alpha\,\left[1+\psi(\alpha)-\ln\beta\right]\,,
\end{eqnarray}
and
\begin{eqnarray}
\sigma^2_{Y_\eta} & ~=~ & \alpha^2\,\sigma^2_{Y_\alpha}+\beta^2\,\sigma^2_{Y_\beta}
-2\alpha\beta\,\sigma_{\small Y_\alpha,Y_\beta}
~=~\alpha\,\left[3+\alpha\psi'(\alpha)\right]\,,
\end{eqnarray}
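respectively, where $\psi(\cdot)$ and $\psi'(\cdot)$ here denote the digamma and trigamma functions, not the independent parameter $\boldsymbol{\psi}$.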
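
The closed forms for $\mu_{Y_\eta}$ and $\sigma^2_{Y_\eta}$ can be checked by simulation, as in the sketch below. The parameter values and sample size are assumptions; $\alpha$ is deliberately chosen to be an integer so that the digamma and trigamma functions reduce to finite sums, which avoids relying on any special-function library.

```python
import numpy as np

# Illustrative parameter values; alpha is an integer so that the digamma and
# trigamma functions have closed forms as finite sums.
alpha, beta = 3, 2.0

euler_gamma = 0.5772156649015329
digamma = -euler_gamma + sum(1.0 / k for k in range(1, alpha))          # psi(alpha)
trigamma = np.pi**2 / 6.0 - sum(1.0 / k**2 for k in range(1, alpha))    # psi'(alpha)

mu_formula = alpha * (1.0 + digamma - np.log(beta))
var_formula = alpha * (3.0 + alpha * trigamma)

# Monte Carlo estimate; NumPy uses a scale parameter, so scale = 1 / beta.
rng = np.random.default_rng(3)
X = rng.gamma(shape=alpha, scale=1.0 / beta, size=400_000)
Y_eta = alpha * np.log(X) + beta * X      # alpha * Y_alpha - beta * Y_beta

mu_mc, var_mc = Y_eta.mean(), Y_eta.var()
print(mu_formula, mu_mc)
print(var_formula, var_mc)
```

The simulated moments should agree with the formulas up to Monte Carlo error, which shrinks as the sample size grows.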

Now, following an [earlier](#Beta-distribution "Section: Beta distribution") suggestion,
we suppose that parameter $\beta$ is common across all observations. Hence, we take the independent parameter to be $\psi=\beta$, and the dependent parameter to be $\varphi=\alpha=\beta\,e^{\eta}$.
Consequently, the independent parameter update equation is now given by
\begin{eqnarray}
\langle\sigma^2_{Y_\beta}\rangle\,\Delta\beta & = &
\left\langle Y_\beta-\mu_{Y_\beta}\right\rangle
\\
\Rightarrow \Delta\beta & = &
-\beta\,\frac{\left\langle X-\mu\right\rangle}{\langle\mu\rangle}
\,,
\end{eqnarray}
where the current estimate of $\beta$ is held fixed within each iteration, while $\mu=e^{\eta}$ varies
from observation to observation through $\eta=Z^T\boldsymbol{\phi}$.
The linear regression parameter update equation is
\begin{eqnarray}
\langle\sigma^2_{Y_\eta}\,ZZ^T\rangle\,\Delta\boldsymbol{\phi} & ~=~ &
\left\langle\left(Y_\eta-\mu_{Y_\eta}\right)\,Z\right\rangle
\,,
\end{eqnarray}
as usual.
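
To make the two update equations concrete, the sketch below evaluates a single step of each on synthetic data. It is only an illustration under stated assumptions: a covariate vector $Z=(1,z)$ with scalar $z$, arbitrary "true" parameter values, and helper functions (`digamma`, `trigamma`, `beta_step`, `phi_step`) that are names invented here. It performs one step only and makes no claim about the convergence of the full iteration.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0 (helper assumed for this sketch)."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    u = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - u * (1 / 12 - u * (1 / 120 - u / 252))

def trigamma(x):
    """Approximate psi'(x) for x > 0 (helper assumed for this sketch)."""
    shift = 0.0
    while x < 6.0:
        shift += 1.0 / (x * x)
        x += 1.0
    u = 1.0 / (x * x)
    return shift + 1.0 / x + 0.5 * u + (u / x) * (1 / 6 - u * (1 / 30 - u / 42))

def beta_step(xs, mus, beta):
    """Delta beta = -beta * <X - mu> / <mu>."""
    n = len(xs)
    return -beta * (sum(x - m for x, m in zip(xs, mus)) / n) / (sum(mus) / n)

def phi_step(zs, xs, phi, beta):
    """Solve <var(Y_eta) Z Z^T> dphi = <(Y_eta - E[Y_eta]) Z> for Z = (1, z)."""
    a00 = a01 = a11 = b0 = b1 = 0.0
    for z, x in zip(zs, xs):
        alpha = beta * math.exp(phi[0] + phi[1] * z)      # alpha = beta * e^eta
        y_eta = alpha * math.log(x) + beta * x
        mean_y = alpha * (1.0 + digamma(alpha) - math.log(beta))
        var_y = alpha * (3.0 + alpha * trigamma(alpha))
        resid = y_eta - mean_y
        a00 += var_y
        a01 += var_y * z
        a11 += var_y * z * z
        b0 += resid
        b1 += resid * z
    det = a00 * a11 - a01 * a01                           # 2x2 solve by Cramer's rule
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)

# Synthetic data generated at the (assumed) true parameters.
rng = random.Random(1)
beta_true, phi_true = 2.0, (0.3, -0.5)
zs = [rng.uniform(-1.0, 1.0) for _ in range(20_000)]
mus = [math.exp(phi_true[0] + phi_true[1] * z) for z in zs]
xs = [rng.gammavariate(beta_true * m, 1.0 / beta_true) for m in mus]

delta_beta = beta_step(xs, mus, beta_true)
delta_phi = phi_step(zs, xs, phi_true, beta_true)
print("delta beta:", delta_beta)
print("delta phi:", delta_phi)
```

Because the data are simulated at the same parameter values used in the step, both increments should be close to zero up to sampling noise.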

## Regression modelling revisited

An alternative approach to deriving probabilistic regression models is offered by
Bergtold et al. [[4]](#Citations "Citation [4]: Bernoulli Regression Models"). 
Under this "*probabilistic reduction*" framework, both the dependent variate $X\in\mathcal{X}$ and the independent covariate(s) $Z\in\mathcal{Z}$ are jointly modelled.
For such a joint density to exist, it must admit a factorisation into conditionals in two different ways, namely
\begin{eqnarray}
p(X,Z\mid\Theta) & ~=~ & p(X\mid\boldsymbol{\theta})\,p(Z\mid X,\Theta)
~=~ p(Z\mid\boldsymbol{\pi})\,p(X\mid \boldsymbol{\psi},Z,\boldsymbol{\phi})\,,
\end{eqnarray}
where we have retained our previous notation of $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$ representing the parameters of the distribution of $X$, and $\boldsymbol{\phi}$ representing the regression function parameters. Here the arbitrary parameters $\Theta$ include both $\boldsymbol{\theta}$ and
$\boldsymbol{\phi}$, as well as the new parameters $\boldsymbol{\pi}$ and
any other parameters required to define the conditional distribution of $Z$.

We may now combine both factorisations together and rearrange terms to obtain
\begin{eqnarray}
p(X\mid\boldsymbol{\theta}) & = &
\frac{p(Z\mid\boldsymbol{\pi})\,p(X\mid \boldsymbol{\psi},Z,\boldsymbol{\phi})}{p(Z\mid X,\Theta)}\,.
\end{eqnarray}
It is of prime importance to note that the apparent dependency of the right-hand side on $Z$ must actually cancel out, since the left-hand side is purely a function of $X$.
The next step is to consider two distinct values, say $x_0,x_1\in\mathcal{X}$, and to evaluate the above formula at both points. Taking the ratio then gives
\begin{eqnarray}
\frac{p(X=x_1\mid\boldsymbol{\theta})}{p(X=x_0\mid\boldsymbol{\theta})}
& ~=~ & 
\frac{p(X=x_1\mid \boldsymbol{\psi},Z,\boldsymbol{\phi})\,p(Z\mid X=x_0,\Theta)}
{p(X=x_0\mid \boldsymbol{\psi},Z,\boldsymbol{\phi})\,p(Z\mid X=x_1,\Theta)}
\,.
\end{eqnarray}
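
The cancellation of $Z$ can be seen directly on a small discrete example, where integration is replaced by summation. The sketch below uses a hypothetical joint probability table (the numbers are arbitrary assumptions) and checks that both the rearranged factorisation and the ratio formula give the same value for every $z$.

```python
# Hypothetical joint pmf p(X = x, Z = z) with X in {0, 1} and Z in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.25, (0, 2): 0.05,
    (1, 0): 0.20, (1, 1): 0.10, (1, 2): 0.30,
}
x_values, z_values = (0, 1), (0, 1, 2)

# Marginals and conditionals derived from the joint table.
p_x = {x: sum(joint[x, z] for z in z_values) for x in x_values}
p_z = {z: sum(joint[x, z] for x in x_values) for z in z_values}
p_x_given_z = {(x, z): joint[x, z] / p_z[z] for x in x_values for z in z_values}
p_z_given_x = {(x, z): joint[x, z] / p_x[x] for x in x_values for z in z_values}

# Rearranged factorisation: p(Z) p(X | Z) / p(Z | X) should equal p(X) for every z.
recovered = {
    (x, z): p_z[z] * p_x_given_z[x, z] / p_z_given_x[x, z]
    for x in x_values for z in z_values
}

# Ratio formula with x0 = 0 and x1 = 1, evaluated separately at each z.
ratios = [
    p_x_given_z[1, z] * p_z_given_x[0, z] / (p_x_given_z[0, z] * p_z_given_x[1, z])
    for z in z_values
]

print("p(X=1) / p(X=0) =", p_x[1] / p_x[0])
print("ratio at each z  =", ratios)
```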

The final two steps of "probabilistic reduction" depend upon the distributions of $X$ and $Z$, respectively, as we shall see in the following sections.
In particular, since $\boldsymbol{\theta}=(\boldsymbol{\psi},\boldsymbol{\varphi})$,
we find it convenient to assume that the conditional distribution $p(X\mid \boldsymbol{\psi},Z,\boldsymbol{\phi})$ takes the same functional form as the marginal distribution $p(X\mid\boldsymbol{\theta})$, namely that
\begin{eqnarray}
p(X\mid\boldsymbol{\psi},Z,\boldsymbol{\phi}) & ~\doteq~ &
p(X\mid\boldsymbol{\psi},\mathbf{F}(Z;\boldsymbol{\phi}))\,,
\end{eqnarray}
which corresponds to the regression model
\begin{eqnarray}
\boldsymbol{\varphi} & ~\doteq~ & \mathbf{F}(Z;\boldsymbol{\phi})\,.
\end{eqnarray}
Similarly, we also make the simplifying assumption that the conditional distribution
$p(Z\mid X=x_i,\Theta)$ takes the same functional form as the marginal distribution
$p(Z\mid\boldsymbol{\pi})$. In particular, we 
suppose that $X=x_0$ acts to select specific parameters $\boldsymbol{\pi}_0\in\Theta$, and similarly $X=x_1$ selects alternative parameters $\boldsymbol{\pi}_1\in\Theta$. Consequently, we take
\begin{eqnarray}
p(Z\mid X=x_i,\Theta) & ~\doteq~ & p(Z\mid\boldsymbol{\pi}_i)\,,
\end{eqnarray}
for $i=0,1$. Under these conditions, the ratio formula above reduces to
\begin{eqnarray}
\frac{p(X=x_1\mid\boldsymbol{\theta})}{p(X=x_0\mid\boldsymbol{\theta})}
& ~=~ &
\left.
\frac{p(X=x_1\mid \boldsymbol{\psi},\mathbf{F}(Z;\boldsymbol{\phi}))}{p(Z\mid\boldsymbol{\pi}_1)}
\right/
\frac{p(X=x_0\mid \boldsymbol{\psi},\mathbf{F}(Z;\boldsymbol{\phi}))}{p(Z\mid\boldsymbol{\pi}_0)}
\,.
\end{eqnarray}
As we noted above, the dependence on $Z$ of the right-hand side must vanish in practice.
Consequently, the correct choice of $\mathbf{F}(Z;\boldsymbol{\phi})$ is determined
directly by the assumed (marginal) distribution of $Z$.

### Bernoulli regression (again)

The [Bernoulli distribution](#Bernoulli-distribution "Section: Bernoulli distribution")
has domain $\mathcal{X}=\{0,1\}$ and (marginal) density
\begin{eqnarray}
p(X=x\mid\theta) & ~=~ & \theta^x\,(1-\theta)^{1-x}\,.
\end{eqnarray}
It therefore makes sense for the conditional distribution to take a similar form, namely
\begin{eqnarray}
p(X=x\mid Z,\boldsymbol{\phi}) & ~=~ & F(Z;\boldsymbol{\phi})^x\,[1-F(Z;\boldsymbol{\phi})]^{1-x}\,.
\end{eqnarray}
Note that since the mean of the Bernoulli distribution is just $\mu=\theta$, this is equivalent
to the [mean regression](#Mean-regression "Section: Mean regression") model
\begin{eqnarray}
\mu & ~=~ & F(Z;\boldsymbol{\phi})\,.
\end{eqnarray}
We now take $x_0=0$ and $x_1=1$, such that the first ratio formula reduces to
\begin{eqnarray}
\frac{\theta}{1-\theta} & ~=~ & 
\frac{F(Z;\boldsymbol{\phi})\,p(Z\mid X=0,\Theta)}
{[1-F(Z;\boldsymbol{\phi})]\,p(Z\mid X=1,\Theta)}
\\
\Rightarrow \ln\frac{F(Z;\boldsymbol{\phi})}{1-F(Z;\boldsymbol{\phi})}
& ~=~ & \ln\frac{\theta}{1-\theta} 
+ \ln\frac{p(Z\mid X=1,\Theta)}{p(Z\mid X=0,\Theta)}
\,.
\end{eqnarray}
We recognise the left-hand side as the logit transform $\sigma^{-1}(\cdot)$ of $F(Z;\boldsymbol{\phi})$,
and hence the Bernoulli regression model takes the logistic form of
\begin{eqnarray}
\mu & ~=~ & F(Z;\boldsymbol{\phi}) ~=~\sigma(f(Z;\boldsymbol{\phi}))
\,,
\end{eqnarray}
with
\begin{eqnarray}
f(Z;\boldsymbol{\phi}) & ~\doteq~ &
\ln\frac{\theta}{1-\theta} + \ln\frac{p(Z\mid X=1,\Theta)}{p(Z\mid X=0,\Theta)}
\,.
\end{eqnarray}
Bergtold et al. [[4]](#Citations "Citation [4]: Bernoulli Regression Models") make the observation that
the first term on the right-hand side corresponds to a constant in the regression model, and that the
correct form of regression on the covariate(s) $Z$ follows directly from the 
assumed conditional distribution $p(Z\mid X,\Theta)$.
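
As a quick numerical illustration of this observation, the sketch below assumes, purely for the sake of example, that a scalar covariate $Z$ is exponentially distributed given $X$, with hypothetical rates `rate0` and `rate1`. Under that assumption the log density ratio is linear in $z$, so $f$ is the familiar linear predictor, and $\sigma(f(z))$ coincides with the posterior probability obtained directly from Bayes' rule.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta = 0.4                  # marginal probability P(X = 1), arbitrary
rate0, rate1 = 1.0, 2.5      # hypothetical exponential rates of Z given X = 0, 1

def density_z_given_x(z, x):
    """Assumed conditional density p(Z = z | X = x): exponential with rate_x."""
    rate = rate1 if x == 1 else rate0
    return rate * math.exp(-rate * z)

def f(z):
    """Regression function: logit of theta plus the log density ratio."""
    return (math.log(theta / (1.0 - theta))
            + math.log(density_z_given_x(z, 1) / density_z_given_x(z, 0)))

def posterior_bayes(z):
    """P(X = 1 | Z = z) computed directly from Bayes' rule."""
    numerator = theta * density_z_given_x(z, 1)
    return numerator / (numerator + (1.0 - theta) * density_z_given_x(z, 0))

# For this choice of covariate distribution, f is linear in z.
intercept = math.log(theta / (1.0 - theta)) + math.log(rate1 / rate0)
slope = -(rate1 - rate0)

for z in (0.1, 0.5, 1.0, 3.0):
    print(z, sigmoid(f(z)), posterior_bayes(z), intercept + slope * z - f(z))
```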

### Beta-distributed covariate

Let us suppose that a scalar covariate $Z$ represents a probability or proportion on the domain
$\mathcal{Z}=(0,1)$. For convenience, we might assume that each observed value of $Z$ is drawn from
a [Beta distribution](#Beta-distribution "Section: Beta distribution").
In the particular case that the response variate $X$ is 
[Bernoulli distributed](#Bernoulli-distribution "Section: Bernoulli distribution"), we assume that
\begin{eqnarray}
p(Z=z\mid X=x,\Theta) & ~\doteq~ & p(Z=z\mid\alpha_x,\beta_x) ~=~ 
\frac{z^{\alpha_x-1}\,(1-z)^{\beta_x-1}}{B(\alpha_x,\beta_x)}\,.
\end{eqnarray}
The corresponding terms for [Bernoulli regression](#Bernoulli-regression-(again) "Section: Bernoulli regression (again)") are therefore obtained from
\begin{eqnarray}
\ln\frac{p(Z=z\mid\alpha_1,\beta_1)}{p(Z=z\mid\alpha_0,\beta_0)}
& ~=~ &
(\alpha_1-\alpha_0)\ln z+(\beta_1-\beta_0)\ln(1-z)-\ln\frac{B(\alpha_1,\beta_1)}{B(\alpha_0,\beta_0)}
\,,
\end{eqnarray}
such that the appropriate regression function is given by
\begin{eqnarray}
f(Z;\boldsymbol{\phi}) & ~=~ & \phi_0+\phi_1\,\ln Z+\phi_2\,\ln(1-Z)\,.
\end{eqnarray}
The corresponding predictive model is therefore given by
\begin{eqnarray}
p(X=1\mid Z,\boldsymbol{\phi}) & ~=~ & \sigma(\phi_0+\phi_1\,\ln Z+\phi_2\,\ln(1-Z))\,.
\end{eqnarray}
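
The correspondence between the Beta parameters and the regression weights can be checked numerically. In the sketch below, the values of `theta` and of the Beta parameters are arbitrary assumptions, and the weights follow the derivation above: $\phi_1=\alpha_1-\alpha_0$, $\phi_2=\beta_1-\beta_0$, with $\phi_0$ collecting the logit of $\theta$ and the log ratio of Beta functions.

```python
import math

def log_beta_function(a, b):
    """ln B(a, b) via log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_pdf(z, a, b):
    """Beta density on (0, 1)."""
    return math.exp((a - 1.0) * math.log(z) + (b - 1.0) * math.log(1.0 - z)
                    - log_beta_function(a, b))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta = 0.3                  # marginal probability P(X = 1), arbitrary
a0, b0 = 2.0, 5.0            # Beta parameters of Z given X = 0, arbitrary
a1, b1 = 4.0, 1.5            # Beta parameters of Z given X = 1, arbitrary

# Regression weights implied by the derivation.
phi0 = (math.log(theta / (1.0 - theta))
        - (log_beta_function(a1, b1) - log_beta_function(a0, b0)))
phi1 = a1 - a0
phi2 = b1 - b0

def posterior_logistic(z):
    """Predictive model sigma(phi0 + phi1 ln z + phi2 ln(1 - z))."""
    return sigmoid(phi0 + phi1 * math.log(z) + phi2 * math.log(1.0 - z))

def posterior_bayes(z):
    """P(X = 1 | Z = z) computed directly from Bayes' rule."""
    numerator = theta * beta_pdf(z, a1, b1)
    return numerator / (numerator + (1.0 - theta) * beta_pdf(z, a0, b0))

grid = [0.05 * k for k in range(1, 20)]
max_gap = max(abs(posterior_logistic(z) - posterior_bayes(z)) for z in grid)
print("largest discrepancy on the grid:", max_gap)
```

In practice the weights $\boldsymbol{\phi}$ would be estimated from data rather than computed from known Beta parameters; the point here is only that the logistic form with $\ln Z$ and $\ln(1-Z)$ as features reproduces the exact posterior.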
In the special case where we have theoretical reasons to suppose that $\phi_2=-\phi_1$, 
i.e. $\alpha_0+\beta_0=\alpha_1+\beta_1$, the two logarithmic terms combine into
$\phi_1\,\ln\frac{Z}{1-Z}=\phi_1\,\sigma^{-1}(Z)$, and the model
reduces to the simpler form of
\begin{eqnarray}
p(X=1\mid Z,\boldsymbol{\phi}) & ~=~ & \sigma(\phi_0+\phi_1\,\sigma^{-1}(Z))\,.
\end{eqnarray}


## Citations

[1] J. A. Nelder and R. W. M. Wedderburn (1972), "*Generalized Linear Models*", J. Royal Stat. Soc. Series A, Vol. 135, No. 3, pp. 370-384.

[2] M. G. Kendall and A. Stuart (1967), "*The Advanced Theory of Statistics*", 2nd ed., Vol. 2.

[3] R. Kieschnick and B. D. McCullough (2003), "*Regression analysis of variates observed on $(0, 1)$*", Statistical Modelling, Vol. 3, No. 3, pp. 193-213. [[PDF]](https://journals.sagepub.com/doi/10.1191/1471082X03st053oa "journals.sagepub.com")

[4] J. S. Bergtold, A. Spanos and E. Onukwugha (2010), "*Bernoulli Regression Models: Revisiting the
Specification of Statistical Models with Binary Dependent Variables*",
J. Choice Modelling, Vol. 3, No. 2, pp. 1-28.