# Discrete Boltzmann Machines

### Discrete exponential distributions

Consider a $D$-dimensional space ${\cal V}$ of values. We notionally consider that each point ${\bf v}=(v_1,v_2,\ldots,v_D)\in{\cal V}$ is an instance of stochastic variables
$(V_1,V_2,\ldots,V_D)$.
A Boltzmann machine (BM) now takes the form of a completely connected, undirected graph, with a distinct node for each variable, as shown in the figure below.

![General Boltzmann Machine](BM_general.png "General Boltzmann Machine")

For convenience, we now restrict ourselves to the case where ${\cal V}$ is a space of discrete values. Note that for continuous variables, (some of) the corresponding summations below would be replaced by integrations, although the resulting derivations will be of similar form.

A discrete Boltzmann machine now has an energy function 
$E:{\cal V}\rightarrow\mathbb{R}$ that induces the probability distribution 
\begin{eqnarray}
p({\bf v}) & = & 
\frac{e^{-E({\bf v})}}
{\sum_{{\bf v'}\in{\cal V}}e^{-E({\bf v'})}}
\,.
\end{eqnarray}
In practice, it is more convenient to let $E=-f$ for some complementary function
$f:{\cal V}\rightarrow\mathbb{R}$, which forms the exponential family
\begin{eqnarray}
p({\bf v}) & = & 
\frac{e^{f({\bf v})}}
{\sum_{{\bf v'}\in{\cal V}}e^{f({\bf v'})}}
\,.
\end{eqnarray}
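
As a concrete illustration, the following minimal sketch enumerates a tiny binary space ${\cal V}=\{0,1\}^3$ and normalises $e^{f({\bf v})}$ explicitly. The quadratic form of `f` and the values of `W` and `b` are assumptions made purely for illustration; nothing above fixes the form of $f$.

```python
import itertools
import math

def f(v, W, b):
    """Hypothetical quadratic score: sum_i b_i v_i + sum_{i<j} W_ij v_i v_j."""
    D = len(v)
    linear = sum(b[i] * v[i] for i in range(D))
    pairwise = sum(W[i][j] * v[i] * v[j]
                   for i in range(D) for j in range(i + 1, D))
    return linear + pairwise

def distribution(W, b):
    """Return {v: p(v)} by explicit enumeration of V = {0,1}^D."""
    D = len(b)
    points = list(itertools.product((0, 1), repeat=D))
    scores = [f(v, W, b) for v in points]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    Z = sum(weights)
    return {v: w / Z for v, w in zip(points, weights)}

# Illustrative parameter values only (upper-triangular couplings).
W = [[0.0, 0.5, -1.0],
     [0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0]]
b = [0.1, -0.2, 0.4]
p = distribution(W, b)
```

The enumeration has $2^D$ terms, so this direct approach is only feasible for very small $D$.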

### Parameter estimation

We now suppose that $f({\bf v})$ is implicitly parameterised by some collection of parameters, denoted by $\Theta$. Our task is therefore to estimate $\Theta$ from a data-set of known training points. An obvious choice is to jointly maximise the likelihood
$p({\bf v})$ of each training point ${\bf v}$, under the assumption that training cases are independent. This is equivalent to maximising
the log-likelihood, given by
\begin{eqnarray}
\ln p({\bf v}) & = & 
f({\bf v})-\ln\sum_{{\bf v'}\in{\cal V}}e^{f({\bf v'})}
\,.
\end{eqnarray}

The usual choice for maximisation is gradient ascent, in one of its many variants.
Hence, for some arbitrary parameter $\theta\in\Theta$, the gradient of the log-likelihood with respect to $\theta$ is
\begin{eqnarray}
\nabla\ln p({\bf v}) & = & 
\nabla f({\bf v})-\nabla\ln\sum_{{\bf v'}\in{\cal V}}e^{f({\bf v'})}
\\& = &
\nabla f({\bf v})-
\frac{
 \sum_{{\bf v'}\in{\cal V}}e^{f({\bf v'})}\,\nabla f({\bf v'})
}
{
 \sum_{{\bf v'}\in{\cal V}}e^{f({\bf v'})}
}
\\& = &
\nabla f({\bf v})-\sum_{{\bf v'}\in{\cal V}} p({\bf v'})\,\nabla f({\bf v'})
\\& = &
\nabla f({\bf v})-\mathbb{E}_{\cal V}\left[\nabla f({\bf v'})\right]
\,.
\end{eqnarray}
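
The identity $\nabla\ln p({\bf v})=\nabla f({\bf v})-\mathbb{E}_{\cal V}[\nabla f({\bf v'})]$ can be checked numerically on a space small enough to enumerate. The sketch below assumes a hypothetical linear-in-parameters score $f({\bf v})=\theta\cdot\phi({\bf v})$, so that $\nabla f({\bf v})=\phi({\bf v})$; the feature map `features` is an illustrative choice, not part of the derivation.

```python
import itertools
import math

POINTS = list(itertools.product((0, 1), repeat=2))

def features(v):
    # Hypothetical feature map, so that f(v) = theta . features(v)
    return (v[0], v[1], v[0] * v[1])

def f(v, theta):
    return sum(t * phi for t, phi in zip(theta, features(v)))

def log_p(v, theta):
    log_Z = math.log(sum(math.exp(f(u, theta)) for u in POINTS))
    return f(v, theta) - log_Z

def grad_log_p(v, theta):
    """grad f(v) - E_V[grad f(v')], using grad f(v) = features(v)."""
    expected = [0.0] * len(theta)
    for u in POINTS:
        pu = math.exp(log_p(u, theta))
        for k, phi in enumerate(features(u)):
            expected[k] += pu * phi
    return [phi - e for phi, e in zip(features(v), expected)]

def numeric_grad(v, theta, eps=1e-6):
    """Central finite differences of ln p(v) with respect to each parameter."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((log_p(v, up) - log_p(v, dn)) / (2 * eps))
    return grad

theta = [0.3, -0.7, 1.1]
v = (1, 0)
analytic = grad_log_p(v, theta)
numeric = numeric_grad(v, theta)
```

The two gradients should agree up to finite-difference error.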

The biggest problem in practice with Boltzmann machines is that $p({\bf v})$ is
intractable to compute in general, due largely to the *curse of dimensionality*.
Hence, the unconditional expectation $\mathbb{E}_{\cal V}[\cdot]$ is also intractable.
In the following sections, we discuss approximation techniques for handling this
intractability.

### Partitioned Boltzmann machine

The usual rationale for constructing a Boltzmann machine, or indeed for assuming any probability distribution, is for the purpose of prediction. 
Let us therefore suppose that the point ${\bf v}\in{\cal V}$ may be partitioned into
two sub-points, ${\bf x}\in{\cal X}$ and ${\bf y}\in{\cal Y}$. We also suppose that
${\bf x}$ and ${\bf y}$ may be *stitched* back together to obtain
${\bf v}=\breve{\bf v}({\bf x},{\bf y})$. This *partitioned* Boltzmann machine is shown in the figure below.

![Partitioned Boltzmann Machine](BM_partitioned.png "Partitioned Boltzmann Machine")

For convenience, let us define 
$\breve{f}({\bf x},{\bf y})\doteq f(\breve{\bf v}({\bf x},{\bf y})) = f({\bf v})$.
Then the joint distribution becomes
\begin{eqnarray}
p({\bf x},{\bf y}) & = & 
\frac{e^{\breve{f}({\bf x},{\bf y})}}
{\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}}
\,.
\end{eqnarray}
In addition,
due to the non-directionality of edges, we may also predict in either direction, namely
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & 
\frac{e^{\breve{f}({\bf x},{\bf y})}}
{\sum_{{\bf y'}\in{\cal Y}}e^{\breve{f}({\bf x},{\bf y'})}}
\,,
\end{eqnarray}
or
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & 
\frac{e^{\breve{f}({\bf x},{\bf y})}}
{\sum_{{\bf x'}\in{\cal X}}e^{\breve{f}({\bf x'},{\bf y})}}
\,.
\end{eqnarray}
The corresponding marginal distributions are
\begin{eqnarray}
p({\bf y}) & = & 
\frac{\sum_{{\bf x'}\in{\cal X}}e^{\breve{f}({\bf x'},{\bf y})}}
{\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf x'}\in{\cal X}}
 e^{\breve{f}({\bf x'},{\bf y'})}}
\,,
\end{eqnarray}
and
\begin{eqnarray}
p({\bf x}) & = & 
\frac{\sum_{{\bf y'}\in{\cal Y}}e^{\breve{f}({\bf x},{\bf y'})}}
{\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}}
\,,
\end{eqnarray}
respectively.
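
The joint, conditional and marginal distributions above can be tabulated directly when ${\cal X}$ and ${\cal Y}$ are tiny. The following sketch assumes a hypothetical bilinear score $\breve{f}({\bf x},{\bf y})={\bf a}\cdot{\bf x}+{\bf b}\cdot{\bf y}+{\bf x}^T W{\bf y}$ over binary variables; the names `A`, `B`, `W` and their values are illustrative only.

```python
import itertools
import math

# Hypothetical bilinear score: f(x, y) = A.x + B.y + sum_ij x_i W_ij y_j
A = [0.2, -0.4]
B = [0.1, 0.3]
W = [[0.5, -1.0], [0.8, 0.2]]

X = list(itertools.product((0, 1), repeat=2))
Y = list(itertools.product((0, 1), repeat=2))

def f(x, y):
    s = sum(a * xi for a, xi in zip(A, x)) + sum(b * yj for b, yj in zip(B, y))
    s += sum(x[i] * W[i][j] * y[j] for i in range(2) for j in range(2))
    return s

# Full normaliser over both partitions.
Z = sum(math.exp(f(x, y)) for x in X for y in Y)

def p_joint(x, y):
    return math.exp(f(x, y)) / Z

def p_y_given_x(y, x):
    return math.exp(f(x, y)) / sum(math.exp(f(x, y2)) for y2 in Y)

def p_x_given_y(x, y):
    return math.exp(f(x, y)) / sum(math.exp(f(x2, y)) for x2 in X)

def p_x(x):
    return sum(math.exp(f(x, y2)) for y2 in Y) / Z

def p_y(y):
    return sum(math.exp(f(x2, y)) for x2 in X) / Z
```

With these definitions, $p({\bf x},{\bf y})=p({\bf y}\mid{\bf x})\,p({\bf x})=p({\bf x}\mid{\bf y})\,p({\bf y})$ holds for every pair of points, up to rounding.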

For convenience, from now on we treat the partitioned BM as a predictive model with 
*input* ${\bf x}$ and *output* ${\bf y}$, although it will always be able to operate in
reverse.

Before we proceed to the examination of parameter estimation for these various models, we first digress to the additional technique that we will utilise for handling the intractability of BMs, discussed in the next section.

### Mean field approximation

The mean field approximation states that the mean value of a function averaged over a number of points is approximately equal to the function evaluated at the average of those points. To demonstrate this, first let 
$\bar{\bf v}=\mathbb{E}_{\cal V}[{\bf v}]$ be the expected value of ${\bf v}$, i.e. the probability-weighted average of the points in 
${\cal V}$. Next, consider the first-order Taylor series approximation of some
$g({\bf v})$ about $\bar{\bf v}$, namely
\begin{eqnarray}
g({\bf v}) & \approx & g(\bar{\bf v})
+\left({\bf v}-\bar{\bf v}\right)^T \nabla g(\bar{\bf v})
\,.
\end{eqnarray}
Thus, taking the expectation over ${\cal V}$, it follows that
\begin{eqnarray}
\mathbb{E}_{\cal V}[g({\bf v})] & \approx & 
g(\bar{\bf v})
+\left(\mathbb{E}_{\cal V}[{\bf v}]-\bar{\bf v}\right)^T \nabla g(\bar{\bf v})
\\& = &
g(\bar{\bf v}) = g(\mathbb{E}_{\cal V}[{\bf v}])
\,.
\end{eqnarray}
This is the mean field approximation (MFA). 

If we proceed further to the second term in the Taylor series expansion (not shown here), then it becomes apparent that the accuracy of the approximation depends on both the smoothness of the function (especially its second derivative) and the variance of the points in ${\cal V}$.
However, my experience is that MFA works very well in practice, especially for computing BM gradients that are otherwise intractable.
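
The dependence of the MFA error on curvature and variance can be seen on a toy scalar distribution. In the sketch below, the points, probabilities and test functions are all illustrative assumptions. The error vanishes for an affine function, equals the variance for $g(v)=v^2$, and is small for a gently curved function.

```python
import math

# Illustrative discrete distribution over three scalar points.
points = [0.0, 1.0, 2.0]
probs = [0.2, 0.5, 0.3]

def expect(g):
    """Exact expectation of g(v) under the toy distribution."""
    return sum(p * g(v) for p, v in zip(probs, points))

mean = expect(lambda v: v)
variance = expect(lambda v: (v - mean) ** 2)

def mfa_error(g):
    """Difference between E[g(v)] and the mean field value g(E[v])."""
    return expect(g) - g(mean)

affine_err = mfa_error(lambda v: 3.0 * v - 1.0)       # no curvature
quad_err = mfa_error(lambda v: v * v)                 # second derivative 2
smooth_err = mfa_error(lambda v: math.exp(0.1 * v))   # small curvature
```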

### Joint likelihood optimisation

We suppose that the training data-set specifies both ${\bf x}$ and ${\bf y}$.
Thus, we utilise the joint model $p({\bf x},{\bf y})$, defined in an earlier section.
Observe that
\begin{eqnarray}
\ln p({\bf x},{\bf y}) & = & 
\breve{f}({\bf x},{\bf y})
-\ln\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}
\\
\Rightarrow
\nabla\ln p({\bf x},{\bf y}) & = & 
\nabla \breve{f}({\bf x},{\bf y})
-\frac{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}\,\nabla \breve{f}({\bf x'},{\bf y'})
}
{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}
}
\\& = &
\nabla\breve{f}({\bf x},{\bf y})
-\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 p({\bf x'},{\bf y'})\,\nabla\breve{f}({\bf x'},{\bf y'})
\\& = &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X,Y}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\,.
\end{eqnarray}
For convenience, we now let ${\cal X}'\equiv{\cal X}$ (and similarly for ${\cal Y}'$),
where the prime distinguishes expectation over ${\bf x'}\in{\cal X}'$ from
expectation over ${\bf x}\in{\cal X}$.
Thus, we write
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & = & 
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X',Y'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\\& = &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X'}\left[
 \mathbb{E}_{\cal Y'\mid X'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\,,
\end{eqnarray}
where we have made use of the BM partitioning.

Note that by $\mathbb{E}_{\cal Y'\mid X'}$ I really mean
$\mathbb{E}_{\cal Y'\mid {\bf x'}}$, and thus
$\mathbb{E}_{\cal X'}[\mathbb{E}_{\cal Y'\mid X'}]$ really means
$\mathbb{E}_{{\bf x'}\in\cal X'}[\mathbb{E}_{\cal Y'\mid {\bf x'}}]$.
However, I didn't feel like mixing sets and points. Alternatively, I could have just
written $\mathbb{E}_{{\bf y'}\mid {\bf x'}}$, as I have done in other notebooks, although this loses explicit mention of the domain. In my experience, all expectation notation suffers from exposing some explicit dependencies whilst hiding other implicit dependencies, and is thus never entirely unambiguous.

Now, computing $p({\bf x'})$ is still intractable in general. However, we do know 
${\bf x}$ and ${\bf y}$. Hence, we make further use of the partitioning by taking the
approximation
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & \approx & 
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X'\mid Y}\left[
 \mathbb{E}_{\cal Y'\mid X'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\,.
\end{eqnarray}
In other words, we keep alternating between use of the predictive models
$p({\bf x}\mid{\bf y})$ and $p({\bf y}\mid{\bf x})$ until we reach known values
of the conditional.
We shall call this the *conditional expectation approximation* (CEA), although another appropriate name would be
*conditional expectation alternation* - take your pick.

Note that we assume that these predictive models are tractable to compute!

Next, we define the expectation functions
\begin{eqnarray}
\bar{\bf x}({\bf y}) \doteq \mathbb{E}_{\cal X\mid Y}[{\bf x}]\,,&\,\,&
\bar{\bf y}({\bf x}) \doteq \mathbb{E}_{\cal Y\mid X}[{\bf y}]\,.
\end{eqnarray}
These convenience functions allow us to more easily apply MFA, giving
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & \approx & 
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X'\mid Y}\left[
 \mathbb{E}_{\cal Y'\mid X'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X'\mid Y}\left[
 \nabla\breve{f}({\bf x'},\bar{\bf y}({\bf x'}))
\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y})
-\nabla\breve{f}(\bar{\bf x}({\bf y}),\bar{\bf y}(\bar{\bf x}({\bf y})))
\,.
\end{eqnarray}



Alternatively, we may rewrite the gradient as
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & = & 
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal Y'}\left[
 \mathbb{E}_{\cal X'\mid Y'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal Y'\mid X}\left[
 \mathbb{E}_{\cal X'\mid Y'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal Y'\mid X}\left[
 \nabla\breve{f}(\bar{\bf x}({\bf y'}),{\bf y'})
\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y})
-\nabla\breve{f}(\bar{\bf x}(\bar{\bf y}({\bf x})),\bar{\bf y}({\bf x}))
\,.
\end{eqnarray}
It is not clear which alternative is preferable. However, if we notionally think of ${\bf x}$ as the input, then this perhaps suggests the latter approximation might be favoured. I have not implemented the former variant, but the latter variant appears to work well in practice.
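
To make the latter approximation concrete, here is a minimal sketch for one specific model, assumed purely for illustration: binary ${\bf x}$ and ${\bf y}$ with the bilinear score $\breve{f}({\bf x},{\bf y})={\bf a}\cdot{\bf x}+{\bf b}\cdot{\bf y}+{\bf x}^T W{\bf y}$. Under that assumption the conditional means are logistic functions, and $\partial\breve{f}/\partial W_{ij}=x_i y_j$. The function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def y_bar(x, W, b):
    # E[y | x] for binary y: sigmoid(b_j + sum_i x_i W_ij)
    return [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(len(b))]

def x_bar(y, W, a):
    # E[x | y] for binary x: sigmoid(a_i + sum_j W_ij y_j)
    return [sigmoid(a[i] + sum(W[i][j] * y[j] for j in range(len(y))))
            for i in range(len(a))]

def grad_W_f(x, y):
    # d f / d W_ij = x_i y_j (also valid at real-valued mean points)
    return [[xi * yj for yj in y] for xi in x]

def joint_grad_W(x, y, W, a, b):
    """grad f(x, y) - grad f(x_bar(y_bar(x)), y_bar(x)), for W only."""
    yb = y_bar(x, W, b)
    xb = x_bar(yb, W, a)
    pos = grad_W_f(x, y)
    neg = grad_W_f(xb, yb)
    return [[p - n for p, n in zip(prow, nrow)]
            for prow, nrow in zip(pos, neg)]

# With all parameters zero, both conditional means are 0.5.
zeros = [[0.0, 0.0], [0.0, 0.0]]
g = joint_grad_W((1, 0), (1, 1), zeros, [0.0, 0.0], [0.0, 0.0])
```

The gradients with respect to ${\bf a}$ and ${\bf b}$ follow the same pattern, with `x - xb` and `y - yb` respectively.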

In either case, we still cannot tractably compute the joint training score, 
$\ln p({\bf x},{\bf y})$. However, one possible approach is to observe that
\begin{eqnarray}
p({\bf x},{\bf y}) & = & p({\bf y}\mid{\bf x})\,p({\bf x})
\\& = &
p({\bf y}\mid{\bf x})\,\sum_{{\bf y'}\in{\cal Y}'} p({\bf y'})\,p({\bf x}\mid{\bf y'})
\\& = &
p({\bf y}\mid{\bf x})\,\mathbb{E}_{{\cal Y}'}\left[p({\bf x}\mid{\bf y'})\right]
\\
\Rightarrow p({\bf x},{\bf y}) & \approx &
p({\bf y}\mid{\bf x})\,\mathbb{E}_{{\cal Y}'\mid{\cal X}}\left[p({\bf x}\mid{\bf y'})\right]
\\& \approx &
p({\bf y}\mid{\bf x})\,p({\bf x}\mid\bar{\bf y}({\bf x}))
\,,
\end{eqnarray}
via CEA and MFA. Once again, this has been implemented and tested, and works well in practice.
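
The following sketch evaluates this approximate score, $\ln p({\bf y}\mid{\bf x})+\ln p({\bf x}\mid\bar{\bf y}({\bf x}))$, and compares it with the exact $\ln p({\bf x},{\bf y})$ obtained by enumeration. It assumes a hypothetical binary model with bilinear score $\breve{f}({\bf x},{\bf y})={\bf a}\cdot{\bf x}+{\bf b}\cdot{\bf y}+{\bf x}^T W{\bf y}$, for which both conditionals factorise into Bernoulli terms; the parameter values are illustrative only.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f(x, y, W, a, b):
    linear = sum(ai * xi for ai, xi in zip(a, x))
    linear += sum(bj * yj for bj, yj in zip(b, y))
    return linear + sum(x[i] * W[i][j] * y[j]
                        for i in range(len(x)) for j in range(len(y)))

def bernoulli_log_prob(values, means):
    # ln prod_k m_k^{v_k} (1 - m_k)^{1 - v_k} for binary values
    return sum(math.log(m if v == 1 else 1.0 - m)
               for v, m in zip(values, means))

def approx_log_joint(x, y, W, a, b):
    """ln p(y | x) + ln p(x | y_bar(x)), i.e. CEA followed by MFA."""
    y_means = [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(a))))
               for j in range(len(b))]
    x_means = [sigmoid(a[i] + sum(W[i][j] * y_means[j] for j in range(len(b))))
               for i in range(len(a))]
    return bernoulli_log_prob(y, y_means) + bernoulli_log_prob(x, x_means)

def exact_log_joint(x, y, W, a, b):
    """ln p(x, y) by brute-force enumeration (tiny models only)."""
    X = list(itertools.product((0, 1), repeat=len(a)))
    Y = list(itertools.product((0, 1), repeat=len(b)))
    log_Z = math.log(sum(math.exp(f(xv, yv, W, a, b)) for xv in X for yv in Y))
    return f(x, y, W, a, b) - log_Z

a = [0.2, -0.4]
b = [0.1, 0.3]
W = [[0.5, -1.0], [0.8, 0.2]]
x, y = (1, 0), (0, 1)
approx = approx_log_joint(x, y, W, a, b)
exact = exact_log_joint(x, y, W, a, b)
```

When $W=0$ the two partitions are independent and the approximation is exact; otherwise `approx` and `exact` will generally differ.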

### Marginal likelihood optimisation

We now suppose that the training data-set only specifies ${\bf x}$ but not ${\bf y}$.
Thus, we might utilise the marginal likelihood $p({\bf x})$, defined in an earlier section.
Observe that
\begin{eqnarray}
\ln p({\bf x}) & = & 
\ln\sum_{{\bf y}\in{\cal Y}}e^{\breve{f}({\bf x},{\bf y})}
-\ln\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}
\\
\Rightarrow
\nabla\ln p({\bf x}) & = &
\frac{
 \sum_{{\bf y}\in{\cal Y}}e^{\breve{f}({\bf x},{\bf y})}
 \,\nabla\breve{f}({\bf x},{\bf y})
}
{
 \sum_{{\bf y'}\in{\cal Y}}e^{\breve{f}({\bf x},{\bf y'})}
}
-\frac{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}
  \,\nabla\breve{f}({\bf x'},{\bf y'})
}
{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'})}
}
\\& = &
\sum_{{\bf y}\in{\cal Y}} p({\bf y}\mid{\bf x})\,\nabla\breve{f}({\bf x},{\bf y})
-\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
p({\bf x'},{\bf y'})\,\nabla\breve{f}({\bf x'},{\bf y'})
\\& = &
\mathbb{E}_{\cal Y\mid X}\left[\nabla\breve{f}({\bf x},{\bf y})\right]
-\mathbb{E}_{\cal X', Y'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\,.
\end{eqnarray}

We observe that a general pattern has emerged here, namely that summation over a collection of variables in the log-likelihood corresponds, in the gradient, to the expectation over those same variables conditional on the remaining variables.
Thus, $\sum_{\cal Y}$ in the first term on the right-hand side
becomes $\mathbb{E}_{\cal Y\mid X}$, and $\sum_{\cal X',Y'}$ in the second term
becomes $\mathbb{E}_{\cal X',Y'}$.

Now, applying CEA gives
\begin{eqnarray}
\nabla\ln p({\bf x}) & \approx &
\mathbb{E}_{\cal Y\mid X}\left[\nabla\breve{f}({\bf x},{\bf y})\right]
-\mathbb{E}_{\cal Y\mid X}\left[
  \mathbb{E}_{\cal X'\mid Y}\left[
   \mathbb{E}_{\cal Y'\mid X'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
  \right]
\right]
\,.
\end{eqnarray}

Finally, applying MFA gives
\begin{eqnarray}
\nabla\ln p({\bf x}) & \approx &
\nabla\breve{f}({\bf x},\bar{\bf y})
-\nabla\breve{f}(\bar{\bf x}',\bar{\bf y}')
\,,
\end{eqnarray}
where $\bar{\bf y}=\bar{\bf y}({\bf x})$, 
$\bar{\bf x}'=\bar{\bf x}(\bar{\bf y})=\bar{\bf x}(\bar{\bf y}({\bf x}))$,
and $\bar{\bf y}'=\bar{\bf y}(\bar{\bf x}')=
\bar{\bf y}(\bar{\bf x}(\bar{\bf y}({\bf x})))$.
Note that the ordering of these (shorthand) computations corresponds to computing the expectations from left (outside) to right (inside). This is another general pattern.

This version of the gradient has been tested with a Bernoulli Restricted BM, and works well. The alternative version, namely
\begin{eqnarray}
\nabla\ln p({\bf x}) & \approx &
\mathbb{E}_{\cal Y\mid X}\left[\nabla\breve{f}({\bf x},{\bf y})\right]
-\mathbb{E}_{\cal Y'\mid X}\left[
   \mathbb{E}_{\cal X'\mid Y'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
 \right]
\\& \approx &
\nabla\breve{f}({\bf x},\bar{\bf y}({\bf x}))
-\nabla\breve{f}(\bar{\bf x}(\bar{\bf y}({\bf x})),\bar{\bf y}({\bf x}))
\,,
\end{eqnarray}
has not been tested. However, the presence of the same term $\bar{\bf y}({\bf x})$ on both sides of the difference suggests that the reconstruction of ${\bf y'}$ would be poor!
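
For a model small enough to enumerate, the first (three-step) approximation can be placed alongside the exact marginal gradient. The sketch below assumes a hypothetical Bernoulli model with bilinear score $\breve{f}({\bf x},{\bf y})={\bf a}\cdot{\bf x}+{\bf b}\cdot{\bf y}+{\bf x}^T W{\bf y}$ and computes the gradient with respect to $W$ only; all names and values are illustrative.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def y_bar(x, W, b):
    return [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(len(b))]

def x_bar(y, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * y[j] for j in range(len(y))))
            for i in range(len(a))]

def approx_marginal_grad_W(x, W, a, b):
    """grad f(x, y_bar) - grad f(x_bar', y_bar') with d f / d W_ij = x_i y_j."""
    yb = y_bar(x, W, b)        # y_bar  = y_bar(x)
    xb1 = x_bar(yb, W, a)      # x_bar' = x_bar(y_bar)
    yb1 = y_bar(xb1, W, b)     # y_bar' = y_bar(x_bar')
    return [[x[i] * yb[j] - xb1[i] * yb1[j] for j in range(len(b))]
            for i in range(len(a))]

def exact_marginal_grad_W(x, W, a, b):
    """E_{Y|x}[x y^T] - E_{X',Y'}[x' y'^T] by brute-force enumeration."""
    nx, ny = len(a), len(b)
    X = list(itertools.product((0, 1), repeat=nx))
    Y = list(itertools.product((0, 1), repeat=ny))

    def weight(xv, yv):
        s = sum(a[i] * xv[i] for i in range(nx))
        s += sum(b[j] * yv[j] for j in range(ny))
        s += sum(xv[i] * W[i][j] * yv[j] for i in range(nx) for j in range(ny))
        return math.exp(s)

    cond_norm = sum(weight(x, yv) for yv in Y)
    full_norm = sum(weight(xv, yv) for xv in X for yv in Y)
    grad = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            pos = sum(weight(x, yv) * x[i] * yv[j] for yv in Y) / cond_norm
            neg = sum(weight(xv, yv) * xv[i] * yv[j]
                      for xv in X for yv in Y) / full_norm
            grad[i][j] = pos - neg
    return grad

W = [[0.5, -1.0], [0.8, 0.2]]
a = [0.2, -0.4]
b = [0.1, 0.3]
x = (1, 0)
approx = approx_marginal_grad_W(x, W, a, b)
exact = exact_marginal_grad_W(x, W, a, b)
```

The two results coincide when all parameters are zero, and generally differ otherwise; how closely they agree depends on the model and the data.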

Lastly, we observe that we cannot tractably compute 
the marginal score, $\ln p({\bf x})$, of training case ${\bf x}$, for the
same reason that we cannot in general compute the unconditional probability 
$p({\bf x})$. However, we recall from above that
\begin{eqnarray}
p({\bf x}) & = & \sum_{{\bf y}\in{\cal Y}} p({\bf x},{\bf y})
\\& = & \sum_{{\bf y}\in{\cal Y}} p({\bf x}\mid{\bf y})\,p({\bf y})
\\& = & \mathbb{E}_{\cal Y}\left[p({\bf x}\mid{\bf y})\right]
\\& \approx & \mathbb{E}_{\cal Y\mid X}\left[p({\bf x}\mid{\bf y})\right]
\\& \approx & p({\bf x}\mid\bar{\bf y}({\bf x}))\,,
\end{eqnarray}
via CEA then MFA.

### Conditional likelihood optimisation

Lastly, we look at the case where we wish to directly optimise the predictive model $p({\bf y}\mid{\bf x})$ instead of the
joint likelihood $p({\bf x},{\bf y})$.
Assuming we know both ${\bf x}$ and ${\bf y}$, then
(from our earlier derivation) we have
\begin{eqnarray}
\ln p({\bf y}\mid{\bf x}) & = &
\breve{f}({\bf x},{\bf y})-\ln\sum_{{\bf y'}\in{\cal Y}}e^{\breve{f}({\bf x},{\bf y'})}
\\
\Rightarrow
\nabla\ln p({\bf y}\mid{\bf x}) & = &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal Y'\mid X}\left[\nabla\breve{f}({\bf x},{\bf y'})\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y}) - \nabla\breve{f}({\bf x},\bar{\bf y}({\bf x}))
\,,
\end{eqnarray}
via MFA. The log-likelihood score, $\ln p({\bf y}\mid{\bf x})$, can now be computed
directly from the predictive model.
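
When $\nabla\breve{f}({\bf x},{\bf y'})$ happens to be affine in ${\bf y'}$, the MFA step here is exact by linearity of expectation. The sketch below illustrates this for an assumed bilinear score $\breve{f}({\bf x},{\bf y})={\bf a}\cdot{\bf x}+{\bf b}\cdot{\bf y}+{\bf x}^T W{\bf y}$ over binary variables, comparing the enumerated expectation against the mean field value for the $W$ gradient. The model and names are hypothetical.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cond_grad_W_exact(x, y, W, b):
    """grad f(x, y) - E_{Y'|x}[grad f(x, y')], enumerating y'."""
    nx, ny = len(x), len(b)
    Y = list(itertools.product((0, 1), repeat=ny))

    def score(yv):
        # Terms of f that depend on y; the a.x term cancels in p(y | x).
        return sum(yv[j] * (b[j] + sum(x[i] * W[i][j] for i in range(nx)))
                   for j in range(ny))

    weights = [math.exp(score(yv)) for yv in Y]
    norm = sum(weights)
    grad = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            neg = sum(w * x[i] * yv[j] for w, yv in zip(weights, Y)) / norm
            grad[i][j] = x[i] * y[j] - neg
    return grad

def cond_grad_W_mfa(x, y, W, b):
    """grad f(x, y) - grad f(x, y_bar(x)), using the logistic mean."""
    nx, ny = len(x), len(b)
    yb = [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(nx)))
          for j in range(ny)]
    return [[x[i] * (y[j] - yb[j]) for j in range(ny)] for i in range(nx)]

W = [[0.5, -1.0], [0.8, 0.2]]
b = [0.1, 0.3]
x, y = (1, 1), (0, 1)
exact = cond_grad_W_exact(x, y, W, b)
mfa = cond_grad_W_mfa(x, y, W, b)
```

For scores whose gradient is not affine in ${\bf y'}$, the two computations would no longer coincide.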

I should add, as a note of caution obtained from testing various other discriminative models, that directly optimising the conditional predictive model can often exacerbate the effect of over-training, although this
depends strongly on the training data.

Also note that there
is no explicit modelling of the distribution of ${\bf x}$, and hence any related model
parameters required for computing $p({\bf x}\mid{\bf y})$ would need to be estimated in some other fashion. We could, for example, alternate between the gradient updates
$\nabla\ln p({\bf y}\mid{\bf x})$ and $\nabla\ln p({\bf x}\mid{\bf y})$, where
\begin{eqnarray}
\ln p({\bf x}\mid{\bf y}) & = &
\breve{f}({\bf x},{\bf y})-\ln\sum_{{\bf x'}\in{\cal X}}e^{\breve{f}({\bf x'},{\bf y})}
\\
\Rightarrow
\nabla\ln p({\bf x}\mid{\bf y}) & = &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X'\mid Y}\left[\nabla\breve{f}({\bf x'},{\bf y})\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y}) - \nabla\breve{f}(\bar{\bf x}({\bf y}),{\bf y})
\,.
\end{eqnarray}
I have not tested this latter gradient scheme for BMs.

### Expected likelihood optimisation

What happens if we want to use discriminative training, but do not know ${\bf y}$?

We recall from a previous section that the traditional approach for unsupervised learning, when ${\bf y}$ is always unknown, is to maximise the marginal likelihood, $p({\bf x})$. 
Conversely, for supervised learning, when ${\bf y}$ is always known, we may
optimise either $p({\bf x},{\bf y})$ or $p({\bf y}\mid{\bf x})$, as we see fit
(again, refer to the previous sections above).
Now, for semi-supervised learning, where ${\bf y}$ is known for some cases but unknown for others, it is traditional to use either $p({\bf x},{\bf y})$ or $p({\bf y}\mid{\bf x})$ for the cases where ${\bf y}$ is known, but instead to use $p({\bf x})$ for the cases where ${\bf y}$ is unknown.

What is wrong with this traditional approach? The answer is that, for discrete distributions,
we have $p({\bf x})\approx\frac{1}{|{\cal X}|}$, $p({\bf y}\mid{\bf x})\approx\frac{1}{|{\cal Y}|}$,
and $p({\bf x},{\bf y})\approx\frac{1}{|{\cal X}|\,|{\cal Y}|}$,
in terms of approximate magnitudes.
Thus, for cases where we have computed $p({\bf x})$ in place of $p({\bf x},{\bf y})$, we have overestimated
the joint likelihood by a factor of $|{\cal Y}|$. Likewise, for cases where we have computed $p({\bf x})$ in place of $p({\bf y}\mid{\bf x})$, we have mis-scaled
the discriminative likelihood by a factor of $\frac{|{\cal Y}|}{|{\cal X}|}$.
In practice, this means that traditional semi-supervised learning gives more (possibly much more) weight to unknown cases than to known cases!

The solution is to either correct the magnitude of the likelihood approximation, or
else to use expected likelihoods (or, rather, expected log-likelihoods), for unsupervised and 
especially semi-supervised learning. Thus, if we wish to optimise the joint log-likelihood, 
$\ln p({\bf x},{\bf y})$, when ${\bf y}$ is unknown, then instead of the traditional approximation
\begin{eqnarray}
\ln p({\bf x},{\bf y}) & \approx & 
\ln\sum_{{\bf y'}\in{\cal Y}'}p({\bf x},{\bf y'})
=\ln p({\bf x})\,,
\end{eqnarray}
we could use the corrected version
\begin{eqnarray}
\ln p({\bf x},{\bf y}) & \approx & 
\frac{1}{|{\cal Y}|}
\sum_{{\bf y'}\in{\cal Y}'}\ln p({\bf x},{\bf y'})
\,.
\end{eqnarray}
This has the correct magnitude, but essentially assumes that each ${\bf y}\in{\cal Y}$ is of equal importance.

Alternatively,  we could instead use the expected value
\begin{eqnarray}
\ln p({\bf x},{\bf y}) & \approx & 
\sum_{{\bf y'}\in{\cal Y}'}p({\bf y'}\mid{\bf x})\,\ln p({\bf x},{\bf y'})
\\& = &
\mathbb{E}_{{\cal Y}'\mid{\cal X}}\left[\ln p({\bf x},{\bf y'})\right]
\doteq L_{J}({\bf x})
\,,
\end{eqnarray}
on the supposition that some values of ${\bf y}\in{\cal Y}$ are conditionally more likely than
others.
Interestingly, this means that $\ln p({\bf x},{\bf y}) \approx \ln p({\bf x},\bar{\bf y}({\bf x}))$,
via MFA, even though both terms remain intractable to compute.


The equivalent approximation to the discriminative log-likelihood is thus
\begin{eqnarray}
\ln p({\bf y}\mid{\bf x}) & \approx & 
\sum_{{\bf y'}\in{\cal Y}'}p({\bf y'}\mid{\bf x})\,\ln p({\bf y'}\mid{\bf x})
\\& = &
\mathbb{E}_{{\cal Y}'\mid{\cal X}}\left[\ln p({\bf y'}\mid{\bf x})\right]
\doteq L_{D}({\bf x})
\,.
\end{eqnarray}
In this case, since $p({\bf y}\mid{\bf x})$ is assumed to be tractable to compute 
(for small enough $|{\cal Y}|$), then the expectation is also tractable.
Also, via MFA, we have $\ln p({\bf y}\mid{\bf x})\approx \ln p(\bar{\bf y}({\bf x})\mid{\bf x})$.

To help with the gradient calculations, observe that, in general,
\begin{eqnarray}
\nabla\mathbb{E}_{\cal V}\left[g({\bf v})\right]
& = & \nabla\sum_{{\bf v}\in{\cal V}}p({\bf v})\,g({\bf v})
\\& = &
\sum_{{\bf v}\in{\cal V}}\left\{
p({\bf v})\,\nabla g({\bf v})+\nabla p({\bf v})\,g({\bf v})
\right\}
\\& = &
\sum_{{\bf v}\in{\cal V}}\left\{
p({\bf v})\,\nabla g({\bf v})+
g({\bf v})\,p({\bf v})\,\nabla\ln p({\bf v})
\right\}
\\& = &
\mathbb{E}_{\cal V}\left[\nabla g({\bf v})\right]
+\mathbb{E}_{\cal V}\left[g({\bf v})\,\nabla\ln p({\bf v})\right]
\,.
\end{eqnarray}
Hence, for the expected discriminative log-likelihood we have
\begin{eqnarray}
\nabla L_{D}({\bf x}) & = & 
\nabla\mathbb{E}_{\cal Y\mid\cal X}\left[
 \ln p({\bf y}\mid{\bf x})
\right]
\\& = &
\mathbb{E}_{\cal Y\mid\cal X}\left[\nabla \ln p({\bf y}\mid{\bf x})\right]
+\mathbb{E}_{\cal Y\mid\cal X}\left[
 \ln p({\bf y}\mid{\bf x})\,\nabla \ln p({\bf y}\mid{\bf x})
\right]
\,.
\end{eqnarray}
Now, from the previous section we have
\begin{eqnarray}
\nabla\ln p({\bf y}\mid{\bf x}) & = &
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal Y'\mid X}\left[\nabla\breve{f}({\bf x},{\bf y'})\right]
\\
\Rightarrow 
\mathbb{E}_{\cal Y\mid\cal X}\left[\nabla \ln p({\bf y}\mid{\bf x})\right]
& = &
\mathbb{E}_{\cal Y\mid\cal X}\left[\nabla\breve{f}({\bf x},{\bf y})\right]
-\mathbb{E}_{\cal Y'\mid\cal X}\left[\nabla\breve{f}({\bf x},{\bf y'})\right]
\equiv 0
\,.
\end{eqnarray}
Note that this means we cannot simply take the expectation of the gradient, but must go further
and take the gradient of the expectation, giving
\begin{eqnarray}
\nabla L_{D}({\bf x}) & = &
\mathbb{E}_{\cal Y\mid\cal X}\left[
 \ln p({\bf y}\mid{\bf x})\,\nabla\breve{f}({\bf x},{\bf y})
\right]
-\mathbb{E}_{\cal Y\mid\cal X}\left[
 \ln p({\bf y}\mid{\bf x})
\right]\,
\mathbb{E}_{\cal Y'\mid\cal X}\left[
 \nabla\breve{f}({\bf x},{\bf y'})
\right]
\\& = &
\mathbb{E}_{\cal Y\mid\cal X}\left[
 \ln p({\bf y}\mid{\bf x})\,\nabla\breve{f}({\bf x},{\bf y})
\right]
- L_{D}({\bf x})\,\mathbb{E}_{\cal Y\mid\cal X}\left[
 \nabla\breve{f}({\bf x},{\bf y})
\right]
\,.
\end{eqnarray}
However, note that MFA cannot help us much here, since it results in this last difference being approximated by zero!
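
The closed form for $\nabla L_{D}({\bf x})$ can be checked against finite differences when ${\cal Y}$ is small. The sketch below assumes a hypothetical linear-in-parameters score $\breve{f}({\bf x},{\bf y})=\theta\cdot\phi({\bf x},{\bf y})$ with a scalar input, so that $\nabla\breve{f}=\phi$; the feature map is an illustrative choice.

```python
import itertools
import math

Y = list(itertools.product((0, 1), repeat=2))

def features(x, y):
    # Hypothetical features, so that f(x, y) = theta . features(x, y)
    return (y[0], y[1], x * y[0], x * y[1], y[0] * y[1])

def log_p_y_given_x(x, theta):
    """Return {y: ln p(y | x)} by enumeration over Y."""
    scores = {y: sum(t * phi for t, phi in zip(theta, features(x, y)))
              for y in Y}
    log_norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: s - log_norm for y, s in scores.items()}

def L_D(x, theta):
    logp = log_p_y_given_x(x, theta)
    return sum(math.exp(lp) * lp for lp in logp.values())

def grad_L_D(x, theta):
    """E[ln p(y|x) grad f] - L_D(x) E[grad f], expectations over Y | x."""
    logp = log_p_y_given_x(x, theta)
    ld = sum(math.exp(lp) * lp for lp in logp.values())
    grad = []
    for k in range(len(theta)):
        e_logp_phi = sum(math.exp(lp) * lp * features(x, y)[k]
                         for y, lp in logp.items())
        e_phi = sum(math.exp(lp) * features(x, y)[k]
                    for y, lp in logp.items())
        grad.append(e_logp_phi - ld * e_phi)
    return grad

def numeric_grad_L_D(x, theta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((L_D(x, up) - L_D(x, dn)) / (2 * eps))
    return grad

theta = [0.4, -0.3, 0.9, 0.2, -0.6]
x = 1
analytic = grad_L_D(x, theta)
numeric = numeric_grad_L_D(x, theta)
```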

Now turning back to the expected joint log-likelihood, we have
\begin{eqnarray}
\nabla L_{J}({\bf x}) & = & \nabla\mathbb{E}_{{\cal Y}\mid{\cal X}}\left[\ln p({\bf x},{\bf y})\right]
\\& = &
\mathbb{E}_{\cal Y\mid\cal X}\left[\nabla \ln p({\bf x},{\bf y})\right]
+\mathbb{E}_{\cal Y\mid\cal X}\left[
 \ln p({\bf x},{\bf y})\,\nabla \ln p({\bf y}\mid{\bf x})
\right]
\,.
\end{eqnarray}
Now, from the previous section on joint likelihood optimisation we have
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & \approx & 
\nabla\breve{f}({\bf x},{\bf y})
-\mathbb{E}_{\cal X'\mid Y}\left[
 \mathbb{E}_{\cal Y'\mid X'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\\
\Rightarrow
\mathbb{E}_{\cal Y\mid X}\left[\nabla\ln p({\bf x},{\bf y})\right]
& \approx & 
\mathbb{E}_{\cal Y\mid X}\left[\nabla\breve{f}({\bf x},{\bf y})\right]
-\mathbb{E}_{\cal Y\mid X}\left[
 \mathbb{E}_{\cal X'\mid Y}\left[
 \mathbb{E}_{\cal Y'\mid X'}\left[\nabla\breve{f}({\bf x'},{\bf y'})\right]
\right]
\right]
\,.
\end{eqnarray}
Note that this is exactly the approximation for $\nabla\ln p({\bf x})$, from the previous section
on marginal likelihood optimisation! Hence, we could, if we wished, stop with the expectation of the gradient,
namely
\begin{eqnarray}
\mathbb{E}_{\cal Y\mid X}\left[\nabla\ln p({\bf x},{\bf y})\right]
& \approx & 
\nabla\ln p({\bf x})
\,.
\end{eqnarray}
However, the appropriate log-likelihood score is not $\ln p({\bf x})$, 
for the reasons outlined at the start of this section, but instead
\begin{eqnarray}
L_{J}({\bf x}) & = & \mathbb{E}_{\cal Y\mid X}\left[\ln p({\bf x},{\bf y})\right]
\\& = &
\mathbb{E}_{\cal Y\mid X}\left[\ln p({\bf x})+\ln p({\bf y}\mid{\bf x})\right]
\\& = &
\ln p({\bf x})+L_{D}({\bf x})\,.
\end{eqnarray}
Thus, we also see that the gradient of the expectation of the joint log-likelihood is
\begin{eqnarray}
\nabla L_{J}({\bf x}) & = & \nabla\ln p({\bf x}) + \nabla L_{D}({\bf x})\,. 
\end{eqnarray}
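
The decomposition $L_{J}({\bf x})=\ln p({\bf x})+L_{D}({\bf x})$ is easy to confirm by enumeration. In the sketch below, the score function `f` and the small spaces `X` and `Y` are assumptions chosen only for illustration.

```python
import math

X = [0, 1, 2]
Y = [0, 1]

def f(x, y):
    # Hypothetical score function, for illustration only
    return 0.3 * x - 0.5 * y + 0.7 * x * y

LOG_Z = math.log(sum(math.exp(f(x, y)) for x in X for y in Y))

def log_joint(x, y):
    return f(x, y) - LOG_Z

def log_marginal(x):
    return math.log(sum(math.exp(f(x, y)) for y in Y)) - LOG_Z

def log_cond(y, x):
    return log_joint(x, y) - log_marginal(x)

def L_J(x):
    """Expected joint log-likelihood E_{Y|x}[ln p(x, y)]."""
    return sum(math.exp(log_cond(y, x)) * log_joint(x, y) for y in Y)

def L_D(x):
    """Expected discriminative log-likelihood E_{Y|x}[ln p(y | x)]."""
    return sum(math.exp(log_cond(y, x)) * log_cond(y, x) for y in Y)

gaps = [L_J(x) - (log_marginal(x) + L_D(x)) for x in X]
```

Since $L_{D}({\bf x})\le 0$, the expected joint score never exceeds $\ln p({\bf x})$.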

### Hidden models

In the above derivations, we partitioned ${\bf v}$ into ${\bf x}$ and ${\bf y}$, where we assumed for training that ${\bf x}$ is always known, and ${\bf y}$ might or might not be known.

We now turn to the related case where ${\bf v}\in{\cal V}$ is partitioned into
${\bf x}\in{\cal X}$, ${\bf y}\in{\cal Y}$ and ${\bf z}\in{\cal Z}$, where ${\bf z}$
is never observed, i.e. it is latent or hidden.
For convenience, we redefine 
$f({\bf v})=f(\breve{\bf v}({\bf x},{\bf y},{\bf z}))
\doteq\breve{f}({\bf x},{\bf y},{\bf z})$.

The relevant unconditional distributions are now given by
\begin{eqnarray}
p({\bf x},{\bf y},{\bf z}) & = & 
\frac{e^{\breve{f}({\bf x},{\bf y},{\bf z})}}
{\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x'},{\bf y'},{\bf z'})}}
\,,
\\
p({\bf x},{\bf y}) & = & 
\frac{\sum_{{\bf z}\in{\cal Z}}e^{\breve{f}({\bf x},{\bf y},{\bf z})}}
{\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x'},{\bf y'},{\bf z'})}}
\,,
\\
p({\bf x}) & = & 
\frac{\sum_{{\bf y}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}}
 e^{\breve{f}({\bf x},{\bf y},{\bf z})}}
{\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x'},{\bf y'},{\bf z'})}}
%\,,
%\\
%p({\bf y}) & = & 
%\frac{\sum_{{\bf x}\in{\cal X}}\sum_{{\bf z}\in{\cal Z}}
% e^{\breve{f}({\bf x},{\bf y},{\bf z})}}
%{\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
% e^{\breve{f}({\bf x'},{\bf y'},{\bf z'})}}
\,.
\end{eqnarray}

Similarly, the relevant conditional distributions are
\begin{eqnarray}
p({\bf x},{\bf y}\mid{\bf z}) & = & 
\frac{
 e^{\breve{f}({\bf x},{\bf y},{\bf z})}
}
{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
 e^{\breve{f}({\bf x'},{\bf y'},{\bf z})}
}
\,,
\\
p({\bf z}\mid{\bf x},{\bf y}) & = & 
\frac{
 e^{\breve{f}({\bf x},{\bf y},{\bf z})}
}
{
 \sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x},{\bf y},{\bf z'})}
}
\,.
\end{eqnarray}

The forward and backward predictive distributions are
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & 
\frac{
 \sum_{{\bf z}\in{\cal Z}}e^{\breve{f}({\bf x},{\bf y},{\bf z})}
}
{
 \sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x},{\bf y'},{\bf z'})}
}
\,,
\\
p({\bf x}\mid{\bf y}) & = & 
\frac{
 \sum_{{\bf z}\in{\cal Z}}e^{\breve{f}({\bf x},{\bf y},{\bf z})}
}
{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x'},{\bf y},{\bf z'})}
}
\,.
\end{eqnarray}

### Joint likelihood optimisation

From the above definition of $p({\bf x},{\bf y})$, we have
\begin{eqnarray}
\ln p({\bf x},{\bf y}) & = & 
\ln\sum_{{\bf z}\in{\cal Z}}e^{\breve{f}({\bf x},{\bf y},{\bf z})}
-\ln\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x'},{\bf y'},{\bf z'})}
\\
\Rightarrow
\nabla\ln p({\bf x},{\bf y}) & = &
\mathbb{E}_{\cal Z\mid X,Y}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-
\mathbb{E}_{\cal X',Y',Z'}\left[\nabla\breve{f}({\bf x'},{\bf y'},{\bf z'})\right]
\,.
\end{eqnarray}

Examination of the various gradient approximations derived in earlier sections suggests
yet another pattern, namely that the inner conditional expectation on the right-hand side of the difference should match the conditional expectation on the left-hand side (with added primes).
Thus, we choose the CEA expansion
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & = &
\mathbb{E}_{\cal Z\mid X,Y}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-
\mathbb{E}_{\cal Z}\left[
 \mathbb{E}_{\cal X',Y'\mid Z}\left[
  \mathbb{E}_{\cal Z'\mid X',Y'}\left[
   \nabla\breve{f}({\bf x'},{\bf y'},{\bf z'})
  \right]
 \right]
\right]
\\& \approx &
\mathbb{E}_{\cal Z\mid X,Y}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-
\mathbb{E}_{\cal Z\mid X,Y}\left[
 \mathbb{E}_{\cal X',Y'\mid Z}\left[
  \mathbb{E}_{\cal Z'\mid X',Y'}\left[
   \nabla\breve{f}({\bf x'},{\bf y'},{\bf z'})
  \right]
 \right]
\right]
\,.
\end{eqnarray}

We lack sufficient information about the specific model to be able to approximate the
middle expectation $\mathbb{E}_{\cal X',Y'\mid Z}$ via MFA. However, for the inner and outer expectations, it is clear that we need to define
\begin{eqnarray}
\bar{\bf z}({\bf x},{\bf y}) & \doteq & 
\mathbb{E}_{\cal Z\mid X,Y}[{\bf z}]
\,.
\end{eqnarray}
Consequently, we at least know that
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & \approx &
\nabla\breve{f}({\bf x},{\bf y},\bar{\bf z})
-\nabla\breve{f}(\bar{\bf x}',\bar{\bf y}',\bar{\bf z}')
\,,
\end{eqnarray}
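where $\bar{\bf z}=\bar{\bf z}({\bf x},{\bf y})$ and 
$\bar{\bf z}'=\bar{\bf z}(\bar{\bf x}',\bar{\bf y}')$. The forms that $\bar{\bf x}'$ and $\bar{\bf y}'$ take depend upon $\mathbb{E}_{\cal X',Y'\mid Z}$.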

In order to compute the log-likelihood score, observe that
\begin{eqnarray}
p({\bf x},{\bf y}) & = & \sum_{{\bf z}\in{\cal Z}} p({\bf x},{\bf y},{\bf z})
\\
& = & \sum_{{\bf z}\in{\cal Z}} p({\bf x},{\bf y}\mid{\bf z})\,p({\bf z})
\\
& = & \mathbb{E}_{\cal Z}[p({\bf x},{\bf y}\mid{\bf z})]
\\
& \approx & \mathbb{E}_{\cal Z\mid X, Y}[p({\bf x},{\bf y}\mid{\bf z})]
\\
& \approx & p({\bf x},{\bf y}\mid\bar{\bf z}({\bf x},{\bf y}))
\,,
\end{eqnarray}
via CEA and MFA.

### Marginal likelihood optimisation

From the definition of $p({\bf x})$ in a previous section, we have
\begin{eqnarray}
\ln p({\bf x}) & = & 
\ln\sum_{{\bf y}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}}
e^{\breve{f}({\bf x},{\bf y},{\bf z})}
-\ln\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x'},{\bf y'},{\bf z'})}
\\
\Rightarrow
\nabla\ln p({\bf x}) & = &
\mathbb{E}_{\cal Y,Z\mid X}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-
\mathbb{E}_{\cal X',Y',Z'}\left[\nabla\breve{f}({\bf x'},{\bf y'},{\bf z'})\right]
\\& \approx &
\mathbb{E}_{\cal Y,Z\mid X}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-\mathbb{E}_{\cal Y,Z\mid X}\left[
  \mathbb{E}_{\cal X'\mid Y,Z}\left[
   \mathbb{E}_{\cal Y',Z'\mid X'}\left[
    \nabla\breve{f}({\bf x'},{\bf y'},{\bf z'})
   \right]
  \right]
\right]
\,,
\end{eqnarray}
via CEA. We further note that
\begin{eqnarray}
\mathbb{E}_{\cal Y,Z\mid X}[\cdot]
& = &
\mathbb{E}_{\cal Y\mid X}\left[\mathbb{E}_{\cal Z\mid X,Y}[\cdot]\right]
\\
\Rightarrow
\mathbb{E}_{\cal Y,Z\mid X}[{\bf y}] & = & 
\mathbb{E}_{\cal Y\mid X}[{\bf y}] = \bar{\bf y}({\bf x})\,,
\\
\mathbb{E}_{\cal Y,Z\mid X}[{\bf z}] & = & 
\mathbb{E}_{\cal Y\mid X}[\bar{\bf z}({\bf x},{\bf y})]
\approx
\bar{\bf z}({\bf x},\bar{\bf y}({\bf x}))
\,,
\end{eqnarray}
via MFA.
Consequently, we have
\begin{eqnarray}
\nabla\ln p({\bf x}) & \approx & 
\nabla\breve{f}({\bf x},\bar{\bf y},\bar{\bf z})
-
\nabla\breve{f}(\bar{\bf x}',\bar{\bf y}',\bar{\bf z}')
\,,
\end{eqnarray}
where $\bar{\bf y}=\bar{\bf y}({\bf x})$,
$\bar{\bf z}=\bar{\bf z}({\bf x},\bar{\bf y})$,
$\bar{\bf y}'=\bar{\bf y}(\bar{\bf x}')$,
and
$\bar{\bf z}'=\bar{\bf z}(\bar{\bf x}',\bar{\bf y}')$.
The form that $\bar{\bf x}'$ takes depends upon $\mathbb{E}_{\cal X'\mid Y,Z}$.

In order to compute the log-likelihood score, observe that
\begin{eqnarray}
p({\bf x}) & = & 
\sum_{{\bf y}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}} p({\bf x},{\bf y},{\bf z})
\\
& = & 
\sum_{{\bf y}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}} 
p({\bf x}\mid{\bf y},{\bf z})\,p({\bf y},{\bf z})
\\
& = & \mathbb{E}_{\cal Y,Z}[p({\bf x}\mid{\bf y},{\bf z})]
\\
& \approx & \mathbb{E}_{\cal Y,Z\mid X}[p({\bf x}\mid{\bf y},{\bf z})]
\\
& = &
\mathbb{E}_{\cal Y\mid X}\left[
 \mathbb{E}_{\cal Z\mid Y,X}[p({\bf x}\mid{\bf y},{\bf z})]
\right]
\\
& \approx & p({\bf x}\mid\bar{\bf y}({\bf x}),\bar{\bf z}({\bf x},\bar{\bf y}({\bf x})))
\,,
\end{eqnarray}
via CEA and MFA.

### Conditional likelihood optimisation

From the definition of $p({\bf y}\mid{\bf x})$ in an earlier section, we have
\begin{eqnarray}
\ln p({\bf y}\mid{\bf x}) & = & 
\ln\sum_{{\bf z}\in{\cal Z}}
e^{\breve{f}({\bf x},{\bf y},{\bf z})}
-\ln\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{\breve{f}({\bf x},{\bf y'},{\bf z'})}
\\
\Rightarrow
\nabla\ln p({\bf y}\mid{\bf x}) & = &
\mathbb{E}_{\cal Z\mid X,Y}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-
\mathbb{E}_{\cal Y',Z'\mid X}\left[\nabla\breve{f}({\bf x},{\bf y'},{\bf z'})\right]
\\& = &
\mathbb{E}_{\cal Z\mid X,Y}\left[\nabla\breve{f}({\bf x},{\bf y},{\bf z})\right]
-\mathbb{E}_{\cal Y'\mid X}\left[
   \mathbb{E}_{\cal Z'\mid X,Y'}\left[
    \nabla\breve{f}({\bf x},{\bf y'},{\bf z'})
   \right]
\right]
\\& \approx &
\nabla\breve{f}({\bf x},{\bf y},\bar{\bf z}({\bf x},{\bf y}))
-
\nabla\breve{f}({\bf x},\bar{\bf y}({\bf x}),\bar{\bf z}({\bf x},\bar{\bf y}({\bf x})))
\,,
\end{eqnarray}
via MFA.
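
As a check on the exact form of this gradient, and to show the MFA version alongside it, the sketch below uses a tiny hidden model with scalar binary $x$, $y$ and $z$. The linear-in-parameters score $\breve{f}=\theta\cdot\phi(x,y,z)$ and the feature map are hypothetical, and the conditional means are obtained by enumeration rather than in closed form.

```python
import math

Y = (0, 1)
Z = (0, 1)

def features(x, y, z):
    # Hypothetical features, so that f(x, y, z) = theta . features(x, y, z)
    return (x * z, y * z, x * y, z)

def f(x, y, z, theta):
    return sum(t * phi for t, phi in zip(theta, features(x, y, z)))

def log_p_y_given_x(y, x, theta):
    num = sum(math.exp(f(x, y, z, theta)) for z in Z)
    den = sum(math.exp(f(x, y2, z2, theta)) for y2 in Y for z2 in Z)
    return math.log(num) - math.log(den)

def exact_grad(y, x, theta):
    """E_{Z|x,y}[grad f] - E_{Y',Z'|x}[grad f], with grad f = features."""
    wz = {z: math.exp(f(x, y, z, theta)) for z in Z}
    wyz = {(y2, z2): math.exp(f(x, y2, z2, theta)) for y2 in Y for z2 in Z}
    grad = []
    for k in range(len(theta)):
        pos = sum(w * features(x, y, z)[k] for z, w in wz.items())
        pos /= sum(wz.values())
        neg = sum(w * features(x, y2, z2)[k] for (y2, z2), w in wyz.items())
        neg /= sum(wyz.values())
        grad.append(pos - neg)
    return grad

def z_bar(x, y, theta):
    # E[z | x, y]; y may be real-valued when called with a mean field value
    w = {z: math.exp(f(x, y, z, theta)) for z in Z}
    return sum(wi * z for z, wi in w.items()) / sum(w.values())

def y_bar(x, theta):
    # E[y | x], marginalising the hidden z
    w = {y2: sum(math.exp(f(x, y2, z2, theta)) for z2 in Z) for y2 in Y}
    return sum(wi * y2 for y2, wi in w.items()) / sum(w.values())

def mfa_grad(y, x, theta):
    """grad f(x, y, z_bar(x, y)) - grad f(x, y_bar, z_bar(x, y_bar))."""
    zb = z_bar(x, y, theta)
    yb = y_bar(x, theta)
    zb1 = z_bar(x, yb, theta)
    return [p - n for p, n in zip(features(x, y, zb), features(x, yb, zb1))]

theta = [0.4, -0.6, 0.9, 0.1]
x, y = 1, 1
exact = exact_grad(y, x, theta)
approx = mfa_grad(y, x, theta)
```

In this toy model the component for the feature $x\,y$ is reproduced exactly by MFA, since it is linear in $y$ alone, whereas components involving $z$ are only approximated.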

We can therefore write this as
\begin{eqnarray}
\nabla\ln p({\bf y}\mid{\bf x}) & \approx &
\nabla\breve{f}({\bf x},{\bf y},\bar{\bf z})
-\nabla\breve{f}({\bf x},\bar{\bf y}',\bar{\bf z}')
\,,
\end{eqnarray}
in contrast to the above gradient of the joint log-likelihood, namely
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & \approx &
\nabla\breve{f}({\bf x},{\bf y},\bar{\bf z})
-\nabla\breve{f}(\bar{\bf x}',\bar{\bf y}',\bar{\bf z}')
\,.
\end{eqnarray}
Thus, when directly optimising the conditional (or discriminative) log-likelihood,
we do not reconstruct the input ${\bf x}$ via $\bar{\bf x}'$. In practice, this means
that there are some parameters (i.e. those linked entirely to ${\bf x}$) that cannot be directly estimated (but might be indirectly estimated via some hybrid gradient scheme).