# Appendix A: Additively Weighted Models

## The Problem

The task is to predict the class $c\in\mathcal{C}$ that best matches a sequence of observations $\vec{\mathbf{x}}\doteq (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_K)$.
We assume some probabilistic model $P(c\mid\vec{\mathbf{x}},\Theta)$ with unknown parameters $\Theta$.
The parameters are to be estimated from the training data $\mathbf{C}\doteq[c^{(d)}]_{d=1}^N$
and $\mathbf{X}\doteq[\vec{\mathbf{x}}^{(d)}]_{d=1}^N$.

## Expectation-Maximisation Approach

### Additive model

For a single sequence $\vec{\mathbf{x}}=(\mathbf{x}_k)_{k=1}^K$ of observations, we assume *a priori* that some particular observation, say $\mathbf{x}_{k^*}$, is the best predictor of the class, such that
\begin{eqnarray}
P(k^*,c\mid\vec{\mathbf{x}},\Theta) & \doteq & P(k^*\mid\Theta)\,P(c\mid\mathbf{x}_{k^*},\Theta)\,.
\end{eqnarray}
It then follows that the desired predictive model is given by
\begin{eqnarray}
P(c\mid\vec{\mathbf{x}},\Theta) & = & 
\sum_{k^*=1}^K P(k^*,c\mid\vec{\mathbf{x}},\Theta)
~=~\sum_{k=1}^K w_k\,P(c\mid\mathbf{x}_k,\Theta)
\,,
\end{eqnarray}
for prior observation weights $w_k\doteq P(k\mid\Theta)$. This model therefore takes the form of an additively weighted *mixture of experts*.
We leave the sub-model (or expert) $P(c\mid\mathbf{x}_k,\Theta)$ undefined, except to stipulate that its
parameters (nominally $\Theta$) do not depend on the prior weights.
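
As an illustration of the additive form, the following minimal sketch evaluates the weighted sum of expert predictions for a single sequence. The function name `mixture_predict`, the dictionary representation of each expert's output, and the numerical values are all hypothetical; the experts themselves are assumed to be given.

```python
def mixture_predict(weights, expert_probs):
    """Return P(c | x_1..x_K) as the sum over k of w_k * P(c | x_k).

    weights      -- list of K prior weights w_k (non-negative, summing to 1)
    expert_probs -- list of K dicts, where expert_probs[k][c] = P(c | x_k)
    """
    classes = expert_probs[0].keys()
    return {
        c: sum(w * p[c] for w, p in zip(weights, expert_probs))
        for c in classes
    }


# Hypothetical example: K = 3 observations and two classes "a" and "b".
weights = [0.5, 0.3, 0.2]
expert_probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.4, "b": 0.6},
    {"a": 0.2, "b": 0.8},
]
prediction = mixture_predict(weights, expert_probs)
best_class = max(prediction, key=prediction.get)
```

Since each expert returns a distribution over classes and the weights sum to unity, the mixture is itself a distribution over classes.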

Finally, we may invert the model to obtain the posterior observation weights, given by
\begin{eqnarray}
\bar{w}_k & \doteq & P(k\mid c,\vec{\mathbf{x}},\Theta)
~=~\frac{P(k, c\mid\vec{\mathbf{x}},\Theta)}{P(c\mid\vec{\mathbf{x}},\Theta)}
\nonumber\\& = &
\frac{w_k\,P(c\mid\mathbf{x}_k,\Theta)}
{\sum_{\tilde{k}=1}^K w_{\tilde{k}}\,P(c\mid\mathbf{x}_{\tilde{k}},\Theta)}\,.
\end{eqnarray}
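
The inversion is just a normalisation of the weighted expert likelihoods. The sketch below assumes that the values $P(c\mid\mathbf{x}_k,\Theta)$ for the observed class $c$ have already been computed, and that at least one of them is positive under a non-zero weight; the name `posterior_weights` and the numbers are hypothetical.

```python
def posterior_weights(weights, likelihoods):
    """Return the posterior weights for one sequence with observed class c.

    weights     -- list of K prior weights w_k
    likelihoods -- list of K values, likelihoods[k] = P(c | x_k)
    """
    joint = [w * p for w, p in zip(weights, likelihoods)]  # P(k, c | x)
    evidence = sum(joint)                                  # P(c | x), assumed > 0
    return [j / evidence for j in joint]


# Hypothetical example with K = 3 observations.
weights = [0.5, 0.3, 0.2]
likelihoods = [0.9, 0.4, 0.2]
post = posterior_weights(weights, likelihoods)
```

An observation whose expert explains the observed class well receives a posterior weight larger than its prior weight.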

### Expected log-likelihood

The index $k^*$ of the optimal observation $\mathbf{x}_{k^*}$ is, in practice, unknown. 
Hence, we introduce the binary indicators $z_k=\delta(k=k^*)$, such that the joint model now takes the
form
\begin{eqnarray}
P(k^*,c\mid\vec{\mathbf{x}},\Theta) & = & 
\prod_{k=1}^K\left[w_k\,P(c\mid\mathbf{x}_{k},\Theta)\right]^{z_k}\,,
\end{eqnarray}
with log-likelihood
\begin{eqnarray}
\ln P(k^*,c\mid\vec{\mathbf{x}},\Theta) & = & \sum_{k=1}^K z_k\ln\left[w_k\,P(c\mid\mathbf{x}_{k},\Theta)
\right]\,.
\end{eqnarray}

Given the training data $\mathbf{C}$ and $\mathbf{X}$, the complete-data log-likelihood (which depends on the hidden indicators) is now given by
\begin{eqnarray}
L(\Theta,\mathbf{Z}) & \doteq &
\sum_{d=1}^N\ln P(k^{*(d)},c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta)
~=~\sum_{d=1}^N\sum_{k=1}^K z_k^{(d)}\ln\left[w_k\,P(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Theta)
\right]\,,
\end{eqnarray}
where $\mathbf{Z}\doteq[\mathbf{z}^{(d)}]_{d=1}^N$ and $\mathbf{z}^{(d)}\doteq[z^{(d)}_k]_{k=1}^K$.

Now, the indicators $\mathbf{Z}$ are actually hidden variables, since they are defined in terms of the
unknown indices $k^{*(d)}$.
We therefore take expectations over $\mathbf{Z}$, resulting in the expected log-likelihood
\begin{eqnarray}
Q(\Theta,\Theta') & \doteq &
\mathbb{E}[L(\Theta,\mathbf{Z})\mid\Theta']
\nonumber\\
& = & 
\sum_{d=1}^N\sum_{k=1}^K \mathbb{E}[z_k^{(d)}\mid\Theta']\,
\ln\left[w_k\,P(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Theta)\right]
\nonumber\\
& = & 
\sum_{d=1}^N\sum_{k=1}^K \bar{w}_k^{(d)}\,
\ln\left[w_k\,P(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Theta)\right]
\,,
\end{eqnarray}
where $\bar{w}_k^{(d)}=P(k^{*(d)}=k\mid c^{(d)},\vec{\mathbf{x}}^{(d)},\Theta')$. These are just
the posterior weights from the
[previous](#Additive-model "Section: Additive model") section, evaluated using the parameter 
estimate $\Theta'$. 
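
A minimal sketch of evaluating $Q(\Theta,\Theta')$ follows. It assumes, for illustration only, that the experts are held fixed, so that $\Theta$ and $\Theta'$ differ only in the prior weights, and that every expert likelihood is strictly positive. The names `expected_log_likelihood` and `data` are hypothetical, with `data[d][k]` standing for $P(c^{(d)}\mid\mathbf{x}^{(d)}_k,\Theta)$.

```python
import math


def posterior_weights(weights, likelihoods):
    """Posterior weights for one sequence, given prior weights and P(c | x_k)."""
    joint = [w * p for w, p in zip(weights, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def expected_log_likelihood(weights, previous_weights, data):
    """Evaluate Q for candidate weights, with posteriors taken at previous_weights.

    Assumes fixed experts, so data[d][k] is the same under both parameter sets.
    """
    q = 0.0
    for likelihoods in data:
        post = posterior_weights(previous_weights, likelihoods)
        q += sum(
            pw * math.log(w * p)
            for pw, w, p in zip(post, weights, likelihoods)
        )
    return q


# Hypothetical expert likelihoods for N = 3 sequences and K = 3 observations.
data = [
    [0.9, 0.4, 0.2],
    [0.1, 0.7, 0.3],
    [0.8, 0.5, 0.6],
]
previous = [1.0 / 3.0] * 3
q_same = expected_log_likelihood(previous, previous, data)
q_shifted = expected_log_likelihood([0.5, 0.3, 0.2], previous, data)
```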

### Maximising the log-likelihood

We now assume that the model parameters $\Theta$ include the vector
$\mathbf{w}\doteq(w_1,\ldots,w_K)$ of prior observation weights.
We further assume that the sub-model $P(c\mid\mathbf{x}_k,\Theta)$ does not depend on $\mathbf{w}$.
Given that the prior weights sum to unity, we add this constraint to the expected log-likelihood with a Lagrange multiplier, to form the objective function
\begin{eqnarray}
F(\mathbf{w}) & \doteq & Q(\Theta,\Theta')-\lambda (\mathbf{1}^T\mathbf{w}-1)\,.
\end{eqnarray}
We now choose the weights to maximise this objective function.

The required gradient with respect to the $k$-th component is given by
\begin{eqnarray}
\frac{\partial F}{\partial w_k} & = &
\sum_{d=1}^{N}\frac{\bar{w}_k^{(d)}}{w_k}-\lambda\,,
\end{eqnarray}
which vanishes (i.e. becomes zero) exactly when
\begin{eqnarray}
w_k & = & \frac{1}{N}\sum_{d=1}^{N}\bar{w}_k^{(d)}\,,
\end{eqnarray}
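with $\lambda=N$. The value of $\lambda$ follows from summing the stationarity condition over $k$ and using $\sum_{k=1}^K w_k=1$ together with $\sum_{k=1}^K\bar{w}_k^{(d)}=1$ for each $d$.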

The expectation-maximisation (EM) algorithm now proceeds by iteratively updating the estimates of the 'prior' observation weights $w_k$ (as part of $\Theta$) using the posterior weights computed from the previous estimate
$\Theta'$.
From the [additive model](#Additive-model "Section: Additive model"), we obtain the update
\begin{eqnarray}
w_k & = & \frac{1}{N}\sum_{d=1}^{N}
\frac{w_k'\,P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta')}
{\sum_{\tilde{k}=1}^K w_{\tilde{k}}'\,P(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Theta')}
\,.
\end{eqnarray}
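
The update can be sketched as follows, under the simplifying assumption that the experts are fixed, so that only the weights change between iterations. The names `em_update`, `fit_weights` and `data` are hypothetical, with `data[d][k]` standing for $P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta')$; the uniform initialisation and the iteration count are arbitrary choices.

```python
import math


def log_likelihood(weights, data):
    """Sum over sequences of the log of the mixture probability."""
    return sum(
        math.log(sum(w * p for w, p in zip(weights, likelihoods)))
        for likelihoods in data
    )


def em_update(weights, data):
    """One EM step: average the posterior weights over the N sequences."""
    n = len(data)
    new_weights = [0.0] * len(weights)
    for likelihoods in data:
        joint = [w * p for w, p in zip(weights, likelihoods)]
        evidence = sum(joint)
        for k, j in enumerate(joint):
            new_weights[k] += j / (evidence * n)
    return new_weights


def fit_weights(data, iterations=100):
    """Iterate the EM update from uniform weights (an arbitrary starting point)."""
    k = len(data[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        weights = em_update(weights, data)
    return weights


# Hypothetical expert likelihoods for N = 3 sequences and K = 3 observations.
data = [
    [0.9, 0.4, 0.2],
    [0.1, 0.7, 0.3],
    [0.8, 0.5, 0.6],
]
uniform = [1.0 / 3.0] * 3
fitted = fit_weights(data)
```

Each update keeps the weights non-negative and summing to unity, so no explicit projection onto the constraint is needed.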

## Direct Approach

### Additive model

In the direct approach, we simply assume a mixture model of the form
\begin{eqnarray}
P(c\mid\vec{\mathbf{x}},\Theta) & \doteq & \sum_{k=1}^K w_k\,P(c\mid\mathbf{x}_k,\Theta)
\,,
\end{eqnarray}
with mixture weights $w_k\ge 0$ that satisfy $\sum_{k=1}^{K}w_k=1$.
There is no need to interpret these weights as component probabilities.

### Log-likelihood

The discriminative log-likelihood of the training data $\mathbf{C}$ and $\mathbf{X}$ is now given by
\begin{eqnarray}
L(\Theta) & \doteq & \ln P(\mathbf{C}\mid\mathbf{X},\Theta)
\nonumber\\& = &
\sum_{d=1}^{N}\ln P(c^{(d)}\mid\vec{\mathbf{x}}^{(d)},\Theta)
\nonumber\\& = &
\sum_{d=1}^{N}\ln \sum_{k=1}^{K}w_k\,P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta)\,.
\end{eqnarray}
Note that other forms of likelihood could also be used, such as the joint likelihood
$P(\mathbf{C},\mathbf{X}\mid\Theta)$.
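
For concreteness, the discriminative log-likelihood can be evaluated as in the sketch below. It assumes the expert likelihoods are precomputed, with the hypothetical array `data[d][k]` standing for $P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta)$, and that every mixture probability is strictly positive.

```python
import math


def log_likelihood(weights, data):
    """Return L = sum over d of ln( sum over k of w_k * P(c^(d) | x_k^(d)) )."""
    total = 0.0
    for likelihoods in data:
        mixture = sum(w * p for w, p in zip(weights, likelihoods))
        total += math.log(mixture)  # assumes mixture > 0
    return total


# Hypothetical expert likelihoods for N = 3 sequences and K = 3 observations.
data = [
    [0.9, 0.4, 0.2],
    [0.1, 0.7, 0.3],
    [0.8, 0.5, 0.6],
]
ll_uniform = log_likelihood([1.0 / 3.0] * 3, data)
ll_first_only = log_likelihood([1.0, 0.0, 0.0], data)
```

Placing all of the weight on one observation reduces the mixture to that single expert, which gives a simple check of the implementation.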

### Maximising the log-likelihood

Given the log-likelihood and the constraints on the mixture weights, we obtain the objective function
\begin{eqnarray}
F(\mathbf{w}) & \doteq & L(\Theta)-\lambda (\mathbf{1}^T\mathbf{w}-1)\,,
\end{eqnarray}
which is to be maximised. The gradient with respect to the $k$-th mixture weight is then
\begin{eqnarray}
\frac{\partial F}{\partial w_k} & = & 
\sum_{d=1}^{N}\frac{P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta)}
{\sum_{\tilde{k}=1}^K w_{\tilde{k}}\,P(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Theta)}
-\lambda\,.
\end{eqnarray}
Now, since $w_k\ge 0$, we scale the gradient by $w_k$ to obtain the modified gradient
\begin{eqnarray}
\delta w_k & \doteq & w_k\frac{\partial F}{\partial w_k} ~=~ 
\sum_{d=1}^{N}\frac{w_k\,P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta)}
{\sum_{\tilde{k}=1}^K w_{\tilde{k}}\,P(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Theta)}
-\lambda w_k\,,
\end{eqnarray}
such that
\begin{eqnarray}
\sum_{k=1}^{K}\delta w_k & = & 
\sum_{d=1}^{N}\frac{\sum_{k=1}^{K}w_k\,P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta)}
{\sum_{\tilde{k}=1}^K w_{\tilde{k}}\,P(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Theta)}
-\lambda\sum_{k=1}^{K}w_k~=~N-\lambda
\,.
\end{eqnarray}
The modified gradient vanishes whenever the gradient vanishes, at which point the sum of the modified gradients is also zero, giving
$\lambda=N$. Consequently, the optimal mixture weights $w_k^*$ satisfy the relation
\begin{eqnarray}
w_k^* & = & \frac{1}{N}\sum_{d=1}^{N}
\frac{w_k^*\,P(c^{(d)}\mid\mathbf{x}_k^{(d)},\Theta)}
{\sum_{\tilde{k}=1}^K w_{\tilde{k}}^*\,P(c^{(d)}\mid\mathbf{x}_{\tilde{k}}^{(d)},\Theta)}
\,.
\end{eqnarray}
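Thus, the mixture weights may be found by fixed-point iteration of this relation (assuming the iterative scheme is stable
and convergent). For fixed experts, this iteration has the same form as the EM update of the prior weights. Alternatively, any other gradient ascent scheme may be used, which might converge faster.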
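
One way to relate the two options is to step along the modified gradient with $\lambda=N$, i.e. $w_k\leftarrow w_k+\eta\,\delta w_k$. The sketch below is only illustrative: it assumes fixed experts with strictly positive likelihoods, and the names `modified_gradient`, `ascend`, `step` and `data` are hypothetical. With $\eta=1/N$ the step reproduces the fixed-point iteration, whilst a smaller $\eta$ gives a damped version. Since the modified gradients sum to $N-\lambda=0$, every step preserves the sum of the weights, and for $\eta\le 1/N$ the weights remain non-negative.

```python
def modified_gradient(weights, data, lam):
    """Return delta_w[k] = sum over d of the posterior weight, minus lam * w_k."""
    delta = [-lam * w for w in weights]
    for likelihoods in data:
        joint = [w * p for w, p in zip(weights, likelihoods)]
        evidence = sum(joint)
        for k, j in enumerate(joint):
            delta[k] += j / evidence
    return delta


def ascend(data, step, iterations=200):
    """Repeatedly step along the modified gradient with lambda = N."""
    n = len(data)
    k = len(data[0])
    weights = [1.0 / k] * k  # arbitrary uniform starting point
    for _ in range(iterations):
        delta = modified_gradient(weights, data, lam=n)
        weights = [w + step * d for w, d in zip(weights, delta)]
    return weights


# Hypothetical expert likelihoods for N = 3 sequences and K = 3 observations.
data = [
    [0.9, 0.4, 0.2],
    [0.1, 0.7, 0.3],
    [0.8, 0.5, 0.6],
]
n = len(data)
w_fixed_point = ascend(data, step=1.0 / n)  # same as the fixed-point iteration
w_damped = ascend(data, step=0.5 / n)       # damped variant of the same scheme
```

Whether a damped or an entirely different ascent scheme converges faster depends on the data, so this sketch makes no claim about speed.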