# Appendix A: Additively Weighted Models

## Introduction

### The problem

The task is to predict the class $c\in\mathcal{C}$ that best matches a sequence of observations $\vec{\mathbf{x}}\doteq (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_K)$.
In some areas, the sequence $\vec{\mathbf{x}}$ is known as the *context*, and the class label $c$ is known as the *target*.  

We assume the predictor takes the form of some probabilistic model $P(c\mid\vec{\mathbf{x}},\Theta)$ with unknown parameters $\Theta$.
The parameters are to be estimated from *supervised* training data, with known class labels $\mathbf{C}\doteq\left[c^{(d)}\right]_{d=1}^N$,
and known sequences $\mathbf{X}\doteq\left[\vec{\mathbf{x}}^{(d)}\right]_{d=1}^N$.

### Additive model

For a single sequence $\vec{\mathbf{x}}=(\mathbf{x}_k)_{k=1}^K$ of observations, we assume *a priori* that some particular observation in the sequence, say $\mathbf{x}_{k^*}$, is the best predictor of the target class, $c$.
In other words, we suppose that a hypothetical generative process first samples the component $k^*$ from some
distribution, say $P(k\mid\Phi)$, and then samples the class $c$ from another distribution,
say $P(c\mid k^*,\vec{\mathbf{x}},\Psi)$. Thus, the generative process
is described in general by the joint distribution
\begin{eqnarray}
P(k,c\mid\vec{\mathbf{x}},\Theta) & \doteq & P(k\mid\Phi)\,P(c\mid k, \vec{\mathbf{x}},\Psi)\,,
\end{eqnarray}
with parameters $\Theta=(\Phi,\Psi)$. Observe that summing both sides over $c$ exposes the
underlying modelling assumption, namely that
\begin{eqnarray}
P(k\mid\vec{\mathbf{x}},\Theta) & \doteq & P(k\mid\Phi)\,.
\end{eqnarray}
Consequently, the observations $\vec{\mathbf{x}}$ alone tell us nothing about $k$; we can only refine our belief about $k$ with knowledge of both $c$ and $\vec{\mathbf{x}}$ (see below).

To specify the predictive model, we first define, for convenience, that
$\phi_k\doteq P(k\mid\Phi)$, and $\boldsymbol{\phi}\doteq (\phi_k)_{k=1}^{K}$.
Consequently, we may consider the model to be (partly) parameterised either by $\Phi$ or by $\boldsymbol{\phi}$,
interchangeably.

Secondly, we consider the sub-model $P(c\mid k,\vec{\mathbf{x}},\Psi)$. In general, we 
impose no restrictions on the sub-models, other than to assume that parameters $\Phi$ and $\Psi$ are independent.
However, since we are supposing that $k$ selects a single observation $\mathbf{x}_k$ from the sequence
$\vec{\mathbf{x}}$, it makes sense to assume that
\begin{eqnarray}
P(c\mid k,\vec{\mathbf{x}},\Psi) & \doteq & P(c\mid\mathbf{x}_{k},\Psi_k)\,,
\end{eqnarray}
where $\Psi\doteq(\Psi_k)_{k=1}^{K}$. That is, we consider $K$ arbitrary but independent sub-models.

In practice, we must consider issues such as model overfitting and data scarcity. To combat model overfitting, 
we might assume that the sub-models all share the same parametric form, but with different parameters,
e.g. $\Psi_k$ for $k=1,\ldots,K$. To overcome data scarcity, we might further assume that all sub-models share the same parameters, i.e. $\Psi_k=\bar{\Psi}$. However, we use no such assumptions here.

Putting aside such issues, the desired predictive model is now given by
\begin{eqnarray}
P(c\mid\vec{\mathbf{x}},\Theta) & = & 
\sum_{k=1}^K P(k,c\mid\vec{\mathbf{x}},\Theta)
~\doteq~\sum_{k=1}^K \phi_k\,P(c\mid\mathbf{x}_k,\Psi_k)
\,.
\end{eqnarray}
Thus, the predictive model takes the form of an additively weighted *mixture of experts*,
with mixture component weights $\boldsymbol{\phi}$ satisfying $\phi_k\ge 0$ and $\sum_{k=1}^{K}\phi_k=1$.
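As a minimal sketch of this mixture, the Python snippet below combines the class distributions of $K=2$ experts using fixed weights. The function name `predict_mixture` and all numerical values are hypothetical, chosen purely for illustration; the expert outputs $P(c\mid\mathbf{x}_k,\Psi_k)$ are assumed to be supplied as the rows of a table.

```python
def predict_mixture(phi, expert_probs):
    """Return P(c | x, Theta) = sum_k phi[k] * expert_probs[k][c] for every class c."""
    num_classes = len(expert_probs[0])
    return [
        sum(phi_k * probs_k[c] for phi_k, probs_k in zip(phi, expert_probs))
        for c in range(num_classes)
    ]


# Hypothetical example with K = 2 experts and C = 3 classes.
phi = [0.25, 0.75]                 # mixture weights, non-negative and summing to one
expert_probs = [
    [0.5, 0.25, 0.25],             # assumed P(c | x_1, Psi_1)
    [0.0, 0.75, 0.25],             # assumed P(c | x_2, Psi_2)
]

class_probs = predict_mixture(phi, expert_probs)
# Predict the class with the largest mixed probability (classes indexed from 0).
predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)
```

Because each expert row sums to one and the weights sum to one, the mixed class probabilities also sum to one.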

For the remainder of this document, we shall be concerned primarily with the estimation of the mixture
weights $\boldsymbol{\phi}$ from the training data $\mathbf{C}$ and $\mathbf{X}$. For simplicity, we henceforth assume that the parameters $\Psi$ have already been estimated in some unspecified fashion, and thus remain
fixed. In other words, we assume that the experts, or sub-models, have already been trained.

As the final step in our modelling, we may invert the predictive model to obtain the posterior component weights, namely
\begin{eqnarray}
P(k\mid c,\vec{\mathbf{x}},\Theta) & = &
\frac{P(k, c\mid\vec{\mathbf{x}},\Theta)}{P(c\mid\vec{\mathbf{x}},\Theta)}
~ \doteq ~
\frac{\phi_k\,P(c\mid\mathbf{x}_k,\Psi_k)}
{\sum_{\tilde{k}=1}^K \phi_{\tilde{k}}\,P(c\mid\mathbf{x}_{\tilde{k}},\Psi_{\tilde{k}})}\,.
\end{eqnarray}
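The inversion above amounts to normalising the joint terms $\phi_k\,P(c\mid\mathbf{x}_k,\Psi_k)$ over $k$. A small sketch follows, in which the name `posterior_weights` and the input values are assumptions made for illustration only.

```python
def posterior_weights(phi, expert_lik):
    """Return P(k | c, x, Theta) for every component k.

    expert_lik[k] is assumed to hold P(c | x_k, Psi_k) for the observed class c.
    """
    joint = [phi_k * lik_k for phi_k, lik_k in zip(phi, expert_lik)]
    evidence = sum(joint)  # this is P(c | x, Theta)
    if evidence <= 0.0:
        raise ValueError("the observed class has zero probability under the model")
    return [j / evidence for j in joint]


# Hypothetical values: two equally weighted experts, the first of which
# assigns three times more probability to the observed class.
phi = [0.5, 0.5]
expert_lik = [0.75, 0.25]
weights = posterior_weights(phi, expert_lik)
```

With equal prior weights, the posterior weights are simply proportional to how well each expert explains the observed class.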

## Expectation-Maximisation Approach

### Known, hidden and complete information

The *expectation-maximisation* (EM) approach to parameter estimation differentiates between known data and hidden, or latent, information. If the hidden information were to become known, then we would have
*complete* data. Hence, the approach focuses first on modelling the complete data.

For the [additive model](#Additive-model "Introduction: Additive model"), 
our information would be complete if we knew the 'optimal' mixture component indices 
$\mathbf{K}\doteq [k^{(d)}]_{d=1}^{N}$ corresponding to the known sequences $\mathbf{X}$.
In that case, an appropriate log-likelihood of the complete data might be
\begin{eqnarray}
L(\Theta) & \doteq & \ln P(\mathbf{K},\mathbf{C}\mid\mathbf{X},\Theta)
~ = ~ \sum_{d=1}^{N}\ln P\left(k^{(d)}, c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\,.
\end{eqnarray}

### Expected log-likelihood

In practice, for any arbitrary sequence $\vec{\mathbf{x}}$, the optimal index $k^*$ remains unknown,
and thus $\mathbf{K}$ represents hidden information.
We defer this problem somewhat by now introducing notional binary indicators, namely $z_k\doteq\delta(k=k^*)$, 
such that $\mathbf{z}\doteq (z_k)_{k=1}^{K}$ is the hidden indicator vector for sequence $\vec{\mathbf{x}}$.
Clearly, if we knew $k^*$ we would know $\mathbf{z}$, and vice versa.

The point of this alternative parameterisation is that the 
[joint model](#Additive-model "Introduction: Additive model") now takes the form
\begin{eqnarray}
P(k^*,c\mid\vec{\mathbf{x}},\Theta) & = & 
\prod_{k=1}^K\left[\phi_k\,P(c\mid\mathbf{x}_{k},\Psi_k)\right]^{\,z_k}\,,
\end{eqnarray}
with log-likelihood
\begin{eqnarray}
\ln P(k^*,c\mid\vec{\mathbf{x}},\Theta) & = & \sum_{k=1}^K z_k\ln\left[\phi_k\,P(c\mid\mathbf{x}_{k},\Psi_k)
\right]\,.
\end{eqnarray}

Now, for training sequence $\vec{\mathbf{x}}^{(d)}$, the corresponding indicator vector is $\mathbf{z}^{(d)}$,
such that the hidden information $\mathbf{K}$ may be represented by $\mathbf{Z}\doteq\left[\mathbf{z}^{(d)}\right]_{d=1}^{N}$.
Consequently, given known data $\mathbf{C}$ and $\mathbf{X}$, the complete-data log-likelihood 
takes the form
\begin{eqnarray}
L(\Theta;\mathbf{Z}) & = &
\sum_{d=1}^N\sum_{k=1}^K z_k^{(d)}\,
\ln\left[\phi_k\,P\left(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Psi_k\right)\right]\,,
\end{eqnarray}
where the explicit dependence upon $\mathbf{Z}$ indicates that the likelihood still relies on hidden information.

In order to eliminate the unknown $\mathbf{Z}$, the EM approach is to take the expectation of the log-likelihood
with respect to the hidden information given the known data.
We thus consider expectations $\mathbb{E}_{\mathbf{Z}\mid\mathbf{C},\mathbf{X},\Theta}[\cdot]$ over $\mathbf{Z}$, given the known data $\mathbf{C}$ and $\mathbf{X}$, dependent upon the parameter $\Theta$
to be estimated.

Since we do not know the model parameters $\Theta=(\Phi,\Psi)$, we start the estimation process with
some known approximate values, say $\Theta'=(\Phi',\Psi')$. However, recall for our
[additive model](#Additive-model "Introduction: Additive model") that
we have assumed for convenience that $\Psi$ has been estimated separately, and is considered fixed, i.e. $\Psi'=\Psi$.

Consequently, the expected log-likelihood is given by
\begin{eqnarray}
Q(\Theta,\Theta') & \doteq &
\mathbb{E}_{\mathbf{Z}\mid\mathbf{C},\mathbf{X},\Theta'}[L(\Theta;\mathbf{Z})]
~=~ 
\sum_{d=1}^N\sum_{k=1}^K \mathbb{E}_{\mathbf{Z}\mid\mathbf{C},\mathbf{X},\Theta'}\left[z_k^{(d)}\right]\,
\ln\left[\phi_k\,P(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Psi_k)\right]\,.
\end{eqnarray}
However, since $z_k^{(d)}\doteq\delta(k^{(d)}=k)$, we observe that
\begin{eqnarray}
\mathbb{E}_{\mathbf{Z}\mid\mathbf{C},\mathbf{X},\Theta'}\left[z_k^{(d)}\right] & = & 
P(k^{(d)}=k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta') ~\doteq~ \bar{w}_k^{(d)}\,,
\end{eqnarray}
such that $\bar{w}_k^{(d)}$ is just the posterior mixture weight 
for the [additive model](#Additive-model "Introduction: Additive model").
Hence, the expected log-likelihood takes the form
\begin{eqnarray}
Q(\Theta,\Theta') & = & 
\sum_{d=1}^N\sum_{k=1}^K \bar{w}_k^{(d)}\,
\ln\left[\phi_k\,P(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Psi_k)\right]
\,.
\end{eqnarray}

### Maximising the expected log-likelihood

Before we maximise the 
[expected log-likelihood](#Expected-log-likelihood "Expectation-Maximisation Approach: Expected log-likelihood"),
we [recall](#Additive-model "Introduction: Additive model") 
that we may define the parameters $\Phi$ in terms of the mixture weights $\boldsymbol{\phi}$.
Furthermore, since these weights sum to unity, this constraint may be included via the use of a Lagrange multiplier. Hence, the appropriate objective function is
\begin{eqnarray}
F(\Phi;\Phi',\Psi) & \doteq & Q(\Theta,\Theta')-\lambda (\mathbf{1}^T\boldsymbol{\phi}-1)\,.
\end{eqnarray}
We now choose the weights to maximise this objective function.

The required gradient with respect to the $k$-th component is given by
\begin{eqnarray}
\frac{\partial F}{\partial \phi_k} & = &
\sum_{d=1}^{N}\frac{\bar{w}_k^{(d)}}{\phi_k}-\lambda\,,
\end{eqnarray}
which vanishes (i.e. becomes zero) exactly for the estimate
\begin{eqnarray}
\hat{\phi}_k & = & \frac{1}{\lambda}\sum_{d=1}^{N}\bar{w}_k^{(d)}\,.
\end{eqnarray}
Observe that summing both sides over $k$ results in the identity $\lambda=N$,
since $\hat{\phi}_k\doteq P(k\mid\hat{\Phi})$ and
$\bar{w}_k^{(d)}\doteq P(k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta')$.

From the [additive model](#Additive-model "Introduction: Additive model"), the update equation
for the estimate of the mixture weights $\boldsymbol{\phi}$ is therefore
\begin{eqnarray}
\hat{\phi}_k & = & 
\frac{1}{N}\sum_{d=1}^{N}P\left(k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta'\right) 
~=~
\frac{1}{N}\sum_{d=1}^{N}
\frac{\phi_k'\,P\left(c^{(d)}\mid\mathbf{x}_k^{(d)},\Psi_k\right)}
{\sum_{\tilde{k}=1}^K \phi_{\tilde{k}}'\,P\left(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Psi_{\tilde{k}}\right)}
\,.
\end{eqnarray}

The EM algorithm now proceeds by iteratively updating the 
previous parameter estimate $\Theta'$ with the new estimate $\hat{\Theta}$, i.e.
$\Theta'\leftarrow\hat{\Theta}$, or in this case $\Phi'\leftarrow\hat{\Phi}$.
This iteration continues until the parameter estimates converge (subject to mild conditions).
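The update and iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `em_step` and `fit_weights`, the uniform initialisation, the stopping tolerance and the toy likelihood table are all assumptions, and the experts are taken to be already trained so that `lik[d][k]` holds $P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Psi_k)$.

```python
def em_step(phi, lik):
    """One EM update of the mixture weights.

    lik[d][k] is assumed to hold P(c^(d) | x_k^(d), Psi_k) from the fixed experts.
    """
    num_cases, num_experts = len(lik), len(phi)
    totals = [0.0] * num_experts
    for row in lik:
        joint = [p * l for p, l in zip(phi, row)]
        evidence = sum(joint)
        for k in range(num_experts):
            totals[k] += joint[k] / evidence      # E-step: posterior weight of k
    return [t / num_cases for t in totals]        # M-step: average over cases


def fit_weights(lik, tol=1e-10, max_iter=1000):
    """Iterate Phi' <- Phi-hat from uniform weights until the change is below tol."""
    num_experts = len(lik[0])
    phi = [1.0 / num_experts] * num_experts
    for _ in range(max_iter):
        new_phi = em_step(phi, lik)
        if max(abs(a - b) for a, b in zip(new_phi, phi)) < tol:
            return new_phi
        phi = new_phi
    return phi


# Hypothetical likelihoods for N = 4 training cases and K = 2 experts.
lik = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.6], [0.7, 0.4]]
phi_hat = fit_weights(lik)
```

Each update keeps the weights non-negative and summing to one, because every row of posterior weights sums to one before averaging.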

## Direct Optimisation

### Marginal model

Recall that the [EM approach](#Expectation-Maximisation-Approach "Section: Expectation-Maximisation Approach") explicitly models the hidden information and then takes expectations over its conditional distribution.
In contrast, in the direct approach we simply marginalise over the missing information, 
i.e. the optimal index $k^*$ for sequence $\vec{\mathbf{x}}$,
and consider only the known data, namely the training labels $\mathbf{C}$ and sequences $\mathbf{X}$.

The [marginal model](#Additive-model "Introduction: Additive model") is therefore given by
\begin{eqnarray}
P(c\mid\vec{\mathbf{x}},\Theta) & = & \sum_{k=1}^{K}P(k,c\mid\vec{\mathbf{x}},\Theta)
~\doteq~ \sum_{k=1}^K \phi_k\,P(c\mid\mathbf{x}_k,\Psi_k)
\,,
\end{eqnarray}
with mixture weights $\phi_k\ge 0$ that satisfy $\sum_{k=1}^{K}\phi_k=1$.
Note that in the direct approach there is no need to interpret these weights as component probabilities.

### Discriminative log-likelihood

The discriminative log-likelihood of the training data $\mathbf{C}$ and $\mathbf{X}$ is now given by
\begin{eqnarray}
L(\Theta) & \doteq & \ln P(\mathbf{C}\mid\mathbf{X},\Theta)
\nonumber\\& = &
\sum_{d=1}^{N}\ln P\left(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\nonumber\\& = &
\sum_{d=1}^{N}\ln \sum_{k=1}^{K}\phi_k\,P\left(c^{(d)}\mid\mathbf{x}_k^{(d)},\Psi_k\right)\,.
\end{eqnarray}
Note that other forms of likelihood could also be used, such as the joint likelihood
$P(\mathbf{C},\mathbf{X}\mid\Theta)$.
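For concreteness, the discriminative log-likelihood can be evaluated as in the sketch below. The function name and the toy table `lik` are hypothetical; `lik[d][k]` is assumed to hold the fixed expert output $P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Psi_k)$.

```python
import math


def discriminative_log_likelihood(phi, lik):
    """Return sum_d ln sum_k phi[k] * lik[d][k]."""
    return sum(
        math.log(sum(p * l for p, l in zip(phi, row)))
        for row in lik
    )


# Hypothetical likelihoods for N = 4 training cases and K = 2 experts.
lik = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.6], [0.7, 0.4]]

uniform = discriminative_log_likelihood([0.5, 0.5], lik)
skewed = discriminative_log_likelihood([0.75, 0.25], lik)
```

On this toy data the first expert explains three of the four cases better, so weighting it more heavily gives a larger log-likelihood than uniform weights.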

### Maximising the discriminative log-likelihood

Given the log-likelihood and the constraints on the mixture weights, we obtain the objective function
\begin{eqnarray}
F(\Phi;\Psi) & \doteq & L(\Theta)-\lambda (\mathbf{1}^T\boldsymbol{\phi}-1)\,,
\end{eqnarray}
which is to be maximised. The gradient with respect to the $k$-th mixture component is then
\begin{eqnarray}
\frac{\partial F}{\partial \phi_k} & = & 
\sum_{d=1}^{N}\frac{P\left(c^{(d)}\mid\mathbf{x}_k^{(d)},\Psi_k\right)}
{\sum_{\tilde{k}=1}^K \phi_{\tilde{k}}\,P\left(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Psi_{\tilde{k}}\right)}
-\lambda
\nonumber\\
& = & \frac{1}{\phi_k}\sum_{d=1}^{N}
P\left(k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta\right)-\lambda
\,,
\end{eqnarray}
which vanishes exactly at the estimate
\begin{eqnarray}
\hat{\phi}_k & = & \frac{1}{N}\sum_{d=1}^{N}
P\left(k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\hat{\Theta}\right)\,,
\end{eqnarray}
for $\lambda=N$.

Observe that this is a nonlinear optimisation, due to the presence of $\hat{\Theta}$ on the right-hand side.
Hence, we could use a gradient-ascent approach to maximise $F$, which would likely be the fastest approach.
Alternatively, we could take an iterative approach, and repeatedly compute
\begin{eqnarray}
\hat{\phi}_k & = & \frac{1}{N}\sum_{d=1}^{N}
P\left(k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta'\right)
~\doteq~\frac{1}{N}\sum_{d=1}^{N}\bar{w}_k^{(d)}
\,,
\end{eqnarray}
in conjunction with the update $\Theta'\leftarrow\hat{\Theta}$, i.e. $\Phi'\leftarrow\hat{\Phi}$.
Observe that this iterative approach is exactly the 
[EM solution](#Maximising-the-expected-log-likelihood 
"Expectation-Maximisation Approach: Maximising the expected log-likelihood").
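The link between the two views can be checked numerically. Comparing the gradient with the iterative update shows that the new weight is the old weight multiplied by $\frac{1}{N}\,\partial L/\partial\phi_k$, so at an interior fixed point every component of $\partial L/\partial\phi_k$ must equal $\lambda=N$. The sketch below uses hypothetical function names and toy data, with `lik[d][k]` assumed to hold $P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Psi_k)$.

```python
def weight_gradient(phi, lik):
    """Return dL/dphi_k = sum_d lik[d][k] / P(c^(d) | x^(d), Theta), omitting -lambda."""
    grad = [0.0] * len(phi)
    for row in lik:
        evidence = sum(p * l for p, l in zip(phi, row))
        for k, l in enumerate(row):
            grad[k] += l / evidence
    return grad


def iterate(phi, lik):
    """Iterative update written as phi_k * (1/N) * dL/dphi_k."""
    num_cases = len(lik)
    return [p * g / num_cases for p, g in zip(phi, weight_gradient(phi, lik))]


# Hypothetical likelihoods for N = 4 training cases and K = 2 experts.
lik = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.6], [0.7, 0.4]]

phi = [0.5, 0.5]
for _ in range(200):          # fixed iteration count, chosen generously for this toy data
    phi = iterate(phi, lik)

grad = weight_gradient(phi, lik)   # each entry should be close to N = 4
```

Writing the update in this multiplicative form makes it clear why the iteration stops moving exactly where the Lagrangian gradient vanishes.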

## Unsupervised Training

In the previous sections, we assumed that the training labels $\mathbf{C}$ were known for all training cases
$\mathbf{X}$, and hence used *supervised* learning approaches. In contrast, we now assume instead that **no** class labels are known for any training case. Hence, we must use an *unsupervised* learning approach.

### Unsupervised model

As in the [previous](#Direct-Optimisation "Section: Direct Optimisation") section,
we assume a mixture model of the form
\begin{eqnarray}
P(c\mid\vec{\mathbf{x}},\Theta) & \doteq & \sum_{k=1}^K \phi_k\,P(c\mid\mathbf{x}_k,\Psi_k)
\,,
\end{eqnarray}
with mixture weights $\phi_k\ge 0$ that satisfy $\sum_{k=1}^{K}\phi_k=1$.

However, the true class label $c^{(d)}$ is no longer assumed known for the $d$-th sequence $\vec{\mathbf{x}}^{(d)}$. 
Hence, we [borrow](#Expected-log-likelihood "Expectation-Maximisation Approach: Expected log-likelihood")
the idea of using binary indicators to represent our ignorance.
In particular, we introduce the notional class indicator variable $z_c^{(d)}\doteq\delta(c^{(d)}=c)$.
Note that this is **not** the mixture component indicator $z^{(d)}_k$ used previously.
Thus, we (re)define $\mathbf{z}^{(d)}\doteq (z_c^{(d)})_{c=1}^{C}$ and
$\mathbf{Z}\doteq\left[\mathbf{z}^{(d)}\right]_{d=1}^N$.

As a consequence, the unsupervised model now takes the form
\begin{eqnarray}
P(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta) & \doteq & 
\prod_{c=1}^{C}\left[P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\right]^{\,z_c^{(d)}}
\\
\Rightarrow \ln P(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta) & \doteq &
\sum_{c=1}^{C}z_c^{(d)}\ln\sum_{k=1}^K \phi_k\,P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
\,.
\end{eqnarray}


### Expected class log-likelihood

As [before](#Discriminative-log-likelihood "Direct optimisation: Discriminative log-likelihood"),
the discriminative log-likelihood, given supervised training data $\mathbf{C}$ and $\mathbf{X}$, is taken to be
\begin{eqnarray}
L(\Theta) & \doteq & \ln P(\mathbf{C}\mid\mathbf{X},\Theta)
~ = ~ \sum_{d=1}^{N}\ln P\left(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\,.
\end{eqnarray}
However, since we no longer know $\mathbf{C}$ (nor $\mathbf{Z}$), we explicitly represent this uncertainty
via the log-likelihood
\begin{eqnarray}
L(\Theta;\mathbf{Z}) & = &
\sum_{d=1}^{N}\sum_{c=1}^{C}z_c^{(d)}\ln\sum_{k=1}^{K}\phi_k\,P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
\,.
\end{eqnarray}

Once [again](#Expected-log-likelihood "Expectation-Maximisation Approach: Expected log-likelihood"),
we eliminate the hidden information by taking expectations over $\mathbf{Z}$, this time given the known data
$\mathbf{X}$, and assumed parameter values $\Theta'$.
We observe that
\begin{eqnarray}
q_c^{(d)} & \doteq & \mathbb{E}_{\mathbf{Z}\mid\mathbf{X},\Theta'}\left[z_c^{(d)}\right] ~=~
P(c\mid\vec{\mathbf{x}}^{(d)},\Theta')
\,.
\end{eqnarray}
It therefore follows that the expected log-likelihood is given by
\begin{eqnarray}
Q(\Theta,\Theta') & \doteq & \mathbb{E}_{\mathbf{Z}\mid\mathbf{X},\Theta'}[L(\Theta;\mathbf{Z})] ~=~
\sum_{d=1}^{N}\sum_{c=1}^{C}q_c^{(d)}\ln\sum_{k=1}^{K}\phi_k\,P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
\,.
\end{eqnarray}


### Maximising the unsupervised log-likelihood

As [usual](#Maximising-the-expected-log-likelihood 
"Expectation-Maximisation Approach: Maximising the expected log-likelihood"), we
seek the mixture weights $\boldsymbol{\phi}$ that maximise the objective function
\begin{eqnarray}
F(\Phi;\Phi',\Psi) & \doteq & Q(\Theta,\Theta')-\lambda (\mathbf{1}^T\boldsymbol{\phi}-1)\,.
\end{eqnarray}
The gradient with respect to the $k$-th component is given by
\begin{eqnarray}
\frac{\partial F}{\partial \phi_k} & = & 
\sum_{d=1}^{N}\sum_{c=1}^{C}q_c^{(d)}\,
\frac{P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)}
{\sum_{\tilde{k}=1}^{K}\phi_{\tilde{k}}\,P\left(c\mid\mathbf{x}_{\tilde{k}}^{(d)},\Psi_{\tilde{k}}\right)}
-\lambda
\nonumber\\
& = & \frac{1}{\phi_k}\sum_{d=1}^{N}\sum_{c=1}^{C}q_c^{(d)}\,
P(k\mid c,\vec{\mathbf{x}}^{(d)},\Theta)-\lambda
\,,
\end{eqnarray}
which vanishes exactly at the estimate
\begin{eqnarray}
\hat{\phi}_k & = & \frac{1}{N}\sum_{d=1}^{N}\sum_{c=1}^{C}q_c^{(d)}\,
P\left(k\mid c,\vec{\mathbf{x}}^{(d)},\hat{\Theta}\right)
\,,
\end{eqnarray}
with $\lambda=N$.

Observe that this is a nonlinear optimisation. Even worse, the standard EM approach would be to iteratively
update $\hat{\Theta}$ (until convergence) keeping $\Theta'$ fixed, and only then update $\Theta'\leftarrow\hat{\Theta}$. It is tempting to short-cut this procedure by utilising only a single
iteration. However, this immediately causes problems if we replace $\hat{\Theta}$ on the right-hand side
by $\Theta'$, since then
\begin{eqnarray}
\hat{\phi}_k & = & \frac{1}{N}\sum_{d=1}^{N}\sum_{c=1}^{C} 
P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta'\right)\,
P\left(k\mid c,\vec{\mathbf{x}}^{(d)},\Theta'\right)
\nonumber\\
& = & 
\frac{1}{N}\sum_{d=1}^{N}\sum_{c=1}^{C}P\left(k,c\mid\vec{\mathbf{x}}^{(d)},\Theta'\right)
\nonumber\\
& = &
\frac{1}{N}\sum_{d=1}^{N}\sum_{c=1}^{C} \phi_k'\,
P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
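~=~\phi_k'
\,.
\end{eqnarray}
The last step holds because each expert's class probabilities sum to one over $c$. That is, the single-iteration update simply returns $\Phi'$ unchanged, so the weights never move from their initial values.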
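This degenerate behaviour of the short-cut is easy to confirm numerically. In the sketch below, `probs[d][k][c]` is assumed to hold $P(c\mid\mathbf{x}_k^{(d)},\Psi_k)$; the function name and all values are hypothetical.

```python
def single_step_update(phi, probs):
    """Unsupervised update with Theta-hat replaced by Theta' on the right-hand side."""
    num_cases, num_experts = len(probs), len(phi)
    num_classes = len(probs[0][0])
    totals = [0.0] * num_experts
    for case in probs:
        for c in range(num_classes):
            joint = [phi[k] * case[k][c] for k in range(num_experts)]
            q_c = sum(joint)                       # q_c = P(c | x, Theta')
            if q_c == 0.0:
                continue                           # class c contributes nothing
            for k in range(num_experts):
                totals[k] += q_c * (joint[k] / q_c)   # q_c * P(k | c, x, Theta')
    return [t / num_cases for t in totals]


# Hypothetical expert outputs for N = 2 cases, K = 2 experts and C = 3 classes.
probs = [
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],
    [[0.1, 0.8, 0.1], [0.5, 0.25, 0.25]],
]
phi_prime = [0.3, 0.7]
phi_new = single_step_update(phi_prime, probs)     # equals phi_prime up to rounding
```

Whatever starting weights are chosen, the update hands them back, which is why the short-cut cannot be used for learning.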


### Direct unsupervised training

As an alternative to the usual EM iteration of $\Theta'$ and $\hat{\Theta}$,
we might instead directly take the expectation over $\mathbf{Z}$ with respect to the true (but unknown) parameters
$\Theta$. This gives rise to the expected log-likelihood
\begin{eqnarray}
Q(\Theta) & \doteq & \mathbb{E}_{\mathbf{Z}\mid\mathbf{X},\Theta}\left[L(\Theta;\mathbf{Z})\right]
\nonumber\\ & = &
\sum_{d=1}^{N}\sum_{c=1}^{C} P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\nonumber\\ & = &
\sum_{d=1}^{N}\sum_{c=1}^{C}\left\{
\sum_{k=1}^{K}\phi_k\,P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
\right\}
\ln\sum_{k=1}^{K}\phi_k\,P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
\,.
\end{eqnarray}

Taking the objective function to be
\begin{eqnarray}
F(\Phi;\Psi) & \doteq & Q(\Theta)-\lambda (\mathbf{1}^T\boldsymbol{\phi}-1)\,,
\end{eqnarray}
the gradient with respect to the $k$-th component is therefore
\begin{eqnarray}
\frac{\partial F}{\partial\phi_k} & = &
\sum_{d=1}^{N}\sum_{c=1}^{C}\left\{
P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
+P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)
\right\}-\lambda
\nonumber\\
& = &
\sum_{d=1}^{N}\sum_{c=1}^{C}
P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
+N-\lambda
\nonumber\\
& = &
\sum_{d=1}^{N}\sum_{c=1}^{C}
P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\,,
\end{eqnarray}
for $\lambda=N$. Hence, we may use gradient ascent to obtain the optimal parameter estimate, $\hat{\Theta}$.
Also note that for $\lambda=N$ we have
\begin{eqnarray}
\sum_{k=1}^K \phi_k\,\frac{\partial F}{\partial\phi_k} & = &
\sum_{d=1}^{N}\sum_{c=1}^{C}\left\{
\sum_{k=1}^K \phi_k\,
P\left(c\mid\mathbf{x}_k^{(d)},\Psi_k\right)\right\}\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\nonumber\\& = &
\sum_{d=1}^{N}\sum_{c=1}^{C}P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
~=~Q(\Theta)\,.
\end{eqnarray}
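The gradient for $\lambda=N$ and the identity $\sum_k\phi_k\,\partial F/\partial\phi_k=Q(\Theta)$ can be verified numerically. The sketch below is illustrative only: the function names and data are hypothetical, `probs[d][k][c]` is assumed to hold $P(c\mid\mathbf{x}_k^{(d)},\Psi_k)$, and the weights are assumed strictly positive so that every logarithm is defined.

```python
import math


def mixture(phi, case):
    """Return P(c | x, Theta) for every class c of a single case."""
    num_classes = len(case[0])
    return [sum(p * probs_k[c] for p, probs_k in zip(phi, case))
            for c in range(num_classes)]


def expected_log_likelihood(phi, probs):
    """Q(Theta) = sum_d sum_c P(c | x, Theta) * ln P(c | x, Theta)."""
    total = 0.0
    for case in probs:
        for p_c in mixture(phi, case):
            if p_c > 0.0:
                total += p_c * math.log(p_c)
    return total


def gradient(phi, probs):
    """dF/dphi_k for lambda = N, assuming all weights are strictly positive."""
    grad = [0.0] * len(phi)
    for case in probs:
        mix = mixture(phi, case)
        for k, probs_k in enumerate(case):
            for c, p_kc in enumerate(probs_k):
                if p_kc > 0.0:
                    grad[k] += p_kc * math.log(mix[c])
    return grad


# Hypothetical expert outputs for N = 2 cases, K = 2 experts and C = 3 classes.
probs = [
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],
    [[0.1, 0.8, 0.1], [0.5, 0.25, 0.25]],
]
phi = [0.3, 0.7]
q_value = expected_log_likelihood(phi, probs)
weighted_grad = sum(p * g for p, g in zip(phi, gradient(phi, probs)))
```

Here `weighted_grad` and `q_value` agree up to rounding, matching the identity derived above.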

## Quasi-Supervised Training

### Supervised, unsupervised and semi-supervised training

In *supervised* learning, the aim is to retrospectively predict the outcome of a single event with known
result. Hence, the class label $c^{(d)}$ of the $d$-th training sequence $\vec{\mathbf{x}}^{(d)}$ is always known. An appropriate measure is therefore the 
[discriminative log-likelihood](#Discriminative-log-likelihood 
"Direct Optimisation: Discriminative log-likelihood"), namely 
$\ln P\left(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta\right)$.
Alternatively, we may specify the class $c^{(d)}$ via the 
[binary indicator](#Unsupervised-model "Unsupervised Training: Unsupervised model")
vector $\mathbf{z}^{(d)}\doteq\left(z_c^{(d)}\right)_{c=1}^{C}$,
where $z_c^{(d)}\doteq\delta(c^{(d)}=c)$. The log-likelihood therefore becomes
\begin{eqnarray}
L^{(d)}(\Theta) & \doteq &
\ln P\left(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta\right) ~=~ 
\sum_{c=1}^{C}z_c^{(d)}\,\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,.
\end{eqnarray}
Note that $\mathbf{z}^{(d)}$ is also known as a 1-of-$C$ vector (in the statistics literature), or a *one-hot* vector (in the engineering literature).

Conversely, in *unsupervised* learning, the aim is to predict the outcome of a single event with unknown outcome. Hence, the class label $c^{(d)}$ is never known. Since the indicator $z_c^{(d)}$ is
also unknown, it is replaced by its 
[expectation](#Expected-class-log-likelihood "Unsupervised Training: Expected class log-likelihood"), namely
\begin{eqnarray}
q_c^{(d)} & \doteq & \mathbb{E}_{\mathbf{Z}\mid\mathbf{X},\Theta}\left[z_c^{(d)}\right] ~=~
P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\,.
\end{eqnarray}
Note that we are evaluating these class probabilities at $\Theta$ instead of $\Theta'$.
Hence, the appropriate measure is now
\begin{eqnarray}
L^{(d)}(\Theta) & \doteq &
\mathbb{E}_{\mathbf{Z}\mid\mathbf{X},\Theta}\left[
\ln P\left(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta\right)
\right]
~=~
\sum_{c=1}^{C}q_c^{(d)}\,\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,.
\end{eqnarray}

In *semi-supervised* learning, some but not all of the class labels $\mathbf{C}$ are known. Note that when $c^{(d)}$ is known, we may define
\begin{eqnarray}
P\left(c\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta\right) & \doteq & 
P\left(c\mid c^{(d)}\right)
~=~\delta\left(c=c^{(d)}\right)
~\doteq~z_c^{(d)}
\,.
\end{eqnarray}
Conversely, when $c^{(d)}$ is missing, we observe that
\begin{eqnarray}
P\left(c\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta\right) & \doteq & 
P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)~\doteq~q_c^{(d)}
\,.
\end{eqnarray}
Hence, we may combine the supervised and unsupervised approaches into the common framework
\begin{eqnarray}
L^{(d)}(\Theta) & \doteq & 
\sum_{c=1}^{C}P\left(c\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta\right)
\,\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,,
\end{eqnarray}
where
\begin{eqnarray}
P\left(c\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta\right) & \doteq &
\left\{
\begin{array}{lr}
\delta\left(c=c^{(d)}\right) & \mbox{if $c^{(d)}$ is known}
\\
P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right) & \mbox{if $c^{(d)}$ is unknown}
\end{array}
\right.
\,.
\end{eqnarray}
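The piecewise definition translates directly into code. In this sketch, a missing label is represented by `None` and classes are indexed from zero; both conventions, along with the function names and example probabilities, are assumptions made for illustration.

```python
import math


def target_distribution(label, model_probs):
    """Return P(c | c^(d), x^(d), Theta) for every class c.

    label is the known class index, or None when the label is missing.
    model_probs[c] is assumed to hold P(c | x^(d), Theta).
    """
    if label is None:
        return list(model_probs)                  # unknown label: use q_c
    return [1.0 if c == label else 0.0            # known label: one-hot z_c
            for c in range(len(model_probs))]


def case_log_likelihood(label, model_probs):
    """Return L^(d)(Theta) = sum_c target[c] * ln model_probs[c]."""
    target = target_distribution(label, model_probs)
    return sum(t * math.log(p)
               for t, p in zip(target, model_probs) if t > 0.0)


# Hypothetical model output for one sequence with C = 3 classes.
model_probs = [0.5, 0.25, 0.25]
labelled = case_log_likelihood(1, model_probs)       # reduces to ln P(c = 1 | x)
unlabelled = case_log_likelihood(None, model_probs)  # reduces to sum_c q_c ln q_c
```

A labelled case contributes the usual discriminative term, whereas an unlabelled case contributes the negative entropy of the model's own prediction.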

### Quasi-supervised log-likelihood

[Recall](#Supervised,-unsupervised-and-semi-supervised-training
"Quasi-Supervised Training: Supervised, unsupervised and semi-supervised training")
that for supervised learning we know the class label, and for unsupervised learning we do not.
Here we consider an in-between case, for which I have coined the term *quasi-supervised* learning.
Rather than having either complete certainty or complete ignorance of the class label,
instead we know only the expected proportions of each class.

As an example, suppose that instead of modelling a single event, we model a collection of events.
Thus, we might amalgamate the class labels of the collected events by computing the proportion of events in each class.
Further generalising the [semi-supervised](#Supervised,-unsupervised-and-semi-supervised-training
"Quasi-Supervised Training: Supervised, unsupervised and semi-supervised training")
log-likelihood, the appropriate measure for quasi-supervised learning is therefore the negative cross-entropy
\begin{eqnarray}
L^{(d)}(\Theta) & \doteq & 
\sum_{c=1}^{C}P\left(c\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Gamma\right)
\,\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,,
\end{eqnarray}
where the use of $\Gamma$ indicates a different family of models than our 
[additive model](#Additive-model "Introduction: Additive model") 
using $\Theta$. Here we take $\Gamma$ to be fixed, such that each class proportion
\begin{eqnarray}
\gamma_c^{(d)} & \doteq & P\left(c\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Gamma\right)
~\doteq~P\left(c\mid\vec{\mathbf{x}}^{(d)},\Gamma\right)
\end{eqnarray}
is also constant and known. For convenience, we define
$\boldsymbol{\gamma}^{(d)}\doteq\left(\gamma_c^{(d)}\right)_{c=1}^{C}$
and $\boldsymbol{\Gamma}\doteq\left[\boldsymbol{\gamma}^{(d)}\right]_{d=1}^{N}$, such that
$\boldsymbol{\Gamma}$ now replaces $\mathbf{C}$ as part of the training data. 

Consequently, the overall log-likelihood is now given by
\begin{eqnarray}
L(\Theta) & \doteq & \sum_{d=1}^N\sum_{c=1}^C \gamma_c^{(d)}\,
\ln P\left(c\mid\vec{\mathbf{x}}^{(d)},\Theta\right)\,.
\end{eqnarray}

### Maximising the quasi-supervised log-likelihood

The objective function is taken to be
\begin{eqnarray}
F(\Phi;\Psi) & \doteq & L(\Theta)-\lambda (\mathbf{1}^T\boldsymbol{\phi}-1)\,.
\end{eqnarray}
The gradient with respect to the $k$-th component is therefore
\begin{eqnarray}
\frac{\partial F}{\partial\phi_k} & = &
\frac{1}{\phi_k}\sum_{d=1}^N\sum_{c=1}^C \gamma_c^{(d)}\,
P\left(k\mid c,\vec{\mathbf{x}}^{(d)},\Theta\right)-\lambda
\,,
\end{eqnarray}
which vanishes exactly when
\begin{eqnarray}
\hat{\phi}_k & = & \frac{1}{N}
\sum_{d=1}^N\sum_{c=1}^C \gamma_c^{(d)}\,
P\left(k\mid c,\vec{\mathbf{x}}^{(d)},\hat{\Theta}\right)
\,,
\end{eqnarray}
for $\lambda=N$.

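This nonlinear equation may be solved by fixed-point iteration: evaluate the right-hand side at the current estimate $\Theta'$ in place of $\hat{\Theta}$, then update $\Phi'\leftarrow\hat{\Phi}$, and repeat until the weights converge.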
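One way to carry out such an iteration is sketched below, assuming the right-hand side is evaluated at the current weights on each pass. The function name, the fixed iteration count and the toy data are hypothetical; `probs[d][k][c]` is assumed to hold $P(c\mid\mathbf{x}_k^{(d)},\Psi_k)$ and `gamma[d][c]` the known class proportion $\gamma_c^{(d)}$.

```python
def quasi_supervised_step(phi, probs, gamma):
    """One fixed-point update of the weights given known class proportions."""
    num_cases, num_experts = len(probs), len(phi)
    totals = [0.0] * num_experts
    for case, gamma_d in zip(probs, gamma):
        for c, g in enumerate(gamma_d):
            if g == 0.0:
                continue                              # class c carries no weight
            joint = [phi[k] * case[k][c] for k in range(num_experts)]
            evidence = sum(joint)                     # P(c | x^(d), Theta')
            for k in range(num_experts):
                totals[k] += g * joint[k] / evidence  # gamma_c * P(k | c, x, Theta')
    return [t / num_cases for t in totals]


# Hypothetical data for N = 2 cases, K = 2 experts and C = 3 classes.
probs = [
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],
    [[0.1, 0.8, 0.1], [0.5, 0.25, 0.25]],
]
gamma = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]

phi = [0.5, 0.5]
for _ in range(500):          # fixed iteration count, generous for this toy data
    phi = quasi_supervised_step(phi, probs, gamma)
```

When every $\boldsymbol{\gamma}^{(d)}$ is a one-hot vector, this update reduces to the supervised update for the mixture weights.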