# Restricted Boltzmann Classifiers

This notebook is based largely on the [(Larochelle and Bengio)](https://www.researchgate.net/publication/221346359_Classification_using_discriminative_restricted_Boltzmann_machines) paper on discriminative RBMs. The related notes in [(Jarrad)](../notes/gaussian-restricted-boltzmann.pdf) were an attempt to see what the Gaussian input version looked like, in contrast to the usual Bernoulli input.
A thorough derivation of RBMs from first principles is given by 
[(Jarrad${}_2$)](./rbm_models.ipynb).
A thorough derivation of the properties of discrete Boltzmann machines from first principles is given by [(Jarrad${}_3$)](./discrete_boltzmann.ipynb).

### Composition of RBMs

The basic idea is to compose together two RBMs.
However, rather than simply feed the output of the first RBM into the input of the second RBM, the two RBMs are merged
to share the same hidden layer (see the figure below). The first RBM model takes a known input vector 
${\bf x}=(x_1,x_2,\ldots,x_F)\in{\cal X}$ and outputs the expectations of the 
hidden vector
${\bf z}=(z_1,z_2,\ldots,z_H)\in{\cal Z}$.
The second RBM model then takes ${\bf z}$ as input and outputs the expectations of the 
vector ${\bf y}=(y_1,y_2,\ldots,y_C)\in{\cal Y}$.

<img src="composedRBMs.png" width="50%" title="Composite Restricted Boltzmann Machine">

Since the input and output layers of an RBM are connected via undirected edges to form a bipartite graph, the first RBM (taken by itself) has the property (with caveats) that the elements of ${\bf x}$ are conditionally independent given ${\bf z}$, and the elements of ${\bf z}$ are conditionally independent given ${\bf x}$. Similarly, the second RBM (taken alone) has the same conditional independence property between ${\bf z}$ and ${\bf y}$. 

However, the composition of the two RBMs changes these independence properties somewhat.
We retain that the elements of ${\bf x}$ are conditionally independent given ${\bf z}$,
and that the elements of ${\bf y}$ are also conditionally independent given ${\bf z}$.
However, the converse is no longer true - the composed model now requires knowledge of both ${\bf x}$ and ${\bf y}$ for the elements of ${\bf z}$ to be independent.
In addition, the composed model has the further property of conditional independence between ${\bf x}$ and ${\bf y}$ given ${\bf z}$. 

Consequently, the joint probability distribution takes the form
\begin{eqnarray}
p({\bf x},{\bf y},{\bf z}) & \doteq &
\frac{e^{f({\bf x},{\bf y},{\bf z})}}
{
 \sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
 e^{f({\bf x'},{\bf y'},{\bf z'})}
}
=
\frac{
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}
  }  
}
{
\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z'}\in{\cal Z}}
  e^{
   {\bf a}^{T}{\bf x'}+{\bf b}^{T}{\bf z'}+{\bf x'}^{T}{\bf W}{\bf z'}
     +{\bf c}^{T}{\bf y'}+{\bf z'}^{T}{\bf U}{\bf y'}
  }  
}\,.
\end{eqnarray}
For convenience, we have assumed discrete values for ${\bf x}$, ${\bf y}$ and ${\bf z}$.
If continuous values are required instead, then the relevant summations will be replaced
by integrations.

As a demonstration of the aforementioned conditional independence properties, 
observe that
\begin{eqnarray}
p({\bf y}\mid{\bf x},{\bf z}) & = &
\frac{
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}
  }  
}
{
\sum_{{\bf y'}\in{\cal Y}}
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y'}+{\bf z}^{T}{\bf U}{\bf y'}
  }  
}
\\
& = &
\frac{e^{{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}}}
{\sum_{{\bf y'}\in{\cal Y}}e^{{\bf c}^{T}{\bf y'}+{\bf z}^{T}{\bf U}{\bf y'}}}
= p({\bf y}\mid{\bf z})
\,.
\end{eqnarray}
Further note that the exponent is a linear combination of the elements of ${\bf y}$.
Hence, we let ${\bf U}_{:,j}$ denote the $j$-th column of ${\bf U}$, such that
\begin{eqnarray}
p({\bf y}\mid{\bf z}) & = &
\frac{
 \prod_{j=1}^{C}
 e^{y_j\left(c_j+{\bf z}^{T}{\bf U}_{:,j}\right)}
}
{
\sum_{{\bf y'}\in{\cal Y}}
 \prod_{j=1}^{C}
 e^{y'_j\left(c_j+{\bf z}^{T}{\bf U}_{:,j}\right)}
}
\\& = &
\frac{
 \prod_{j=1}^{C}
 e^{y_j\left(c_j+{\bf z}^{T}{\bf U}_{:,j}\right)}
}
{
 \prod_{j=1}^{C}
 \sum_{y'_j\in{\cal Y}_j}
 e^{y'_j\left(c_j+{\bf z}^{T}{\bf U}_{:,j}\right)}
}
\\& = &
 \prod_{j=1}^{C} p(y_j\mid{\bf z})
\,,
\end{eqnarray}
where
\begin{eqnarray}
p(y_j\mid{\bf z}) & = &
\frac{
 e^{y_j\left(c_j+{\bf z}^{T}{\bf U}_{:,j}\right)}
}
{
 \sum_{y'_j\in{\cal Y}_j}
 e^{y'_j\left(c_j+{\bf z}^{T}{\bf U}_{:,j}\right)}
}
\,.
\end{eqnarray}
Note that this conditional independence of elements relies on the assumption that
the space factorises as the Cartesian product
\begin{eqnarray}
  {\cal Y} & = & {\cal Y}_1\times{\cal Y}_2\times\cdots\times{\cal Y}_C\,.
\end{eqnarray}
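
The factorisation of $p({\bf y}\mid{\bf x},{\bf z})$ derived above can be checked numerically on a tiny model. The following is a minimal sketch, assuming binary units with ${\cal Y}=\{0,1\}^{C}$ and random illustrative parameters. The helper names (`energy`, `p_y_given_xz`, `p_yj_given_z`) are hypothetical and are not part of the notebook. Plain Python lists are used so that the block has no dependencies.

```python
import itertools
import math
import random

def energy(x, y, z, a, b, c, W, U):
    """f(x, y, z) = a.x + b.z + x^T W z + c.y + z^T U y."""
    F, H, C = len(x), len(z), len(y)
    f = sum(a[i] * x[i] for i in range(F))
    f += sum(b[k] * z[k] for k in range(H))
    f += sum(c[j] * y[j] for j in range(C))
    f += sum(x[i] * W[i][k] * z[k] for i in range(F) for k in range(H))
    f += sum(z[k] * U[k][j] * y[j] for k in range(H) for j in range(C))
    return f

def p_y_given_xz(y, x, z, params):
    """Brute-force p(y | x, z), normalising over every y' in {0,1}^C."""
    num = math.exp(energy(x, y, z, *params))
    den = sum(math.exp(energy(x, yp, z, *params))
              for yp in itertools.product((0, 1), repeat=len(y)))
    return num / den

def p_yj_given_z(j, yj, z, c, U):
    """Factorised form p(y_j | z) with Y_j = {0, 1}."""
    act = c[j] + sum(z[k] * U[k][j] for k in range(len(z)))
    return math.exp(yj * act) / (1.0 + math.exp(act))

# Random illustrative parameters (assumed values, fixed seed).
rng = random.Random(0)
F, H, C = 3, 2, 2
a = [rng.gauss(0, 1) for _ in range(F)]
b = [rng.gauss(0, 1) for _ in range(H)]
c = [rng.gauss(0, 1) for _ in range(C)]
W = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(F)]  # F x H
U = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(H)]  # H x C
params = (a, b, c, W, U)

z, y = (1, 0), (0, 1)
# p(y | x, z) for every x: all values should coincide with prod_j p(y_j | z).
brute = [p_y_given_xz(y, x, z, params)
         for x in itertools.product((0, 1), repeat=F)]
factored = math.prod(p_yj_given_z(j, y[j], z, c, U) for j in range(C))
max_gap = max(abs(v - factored) for v in brute)
```

Here `max_gap` should be at the level of floating-point rounding, since the terms involving ${\bf x}$ cancel between numerator and denominator.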

Similarly, observe that
\begin{eqnarray}
p({\bf x}\mid{\bf z},{\bf y}) & = &
\frac{
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}
  }  
}
{
\sum_{{\bf x'}\in{\cal X}}
  e^{
   {\bf a}^{T}{\bf x'}+{\bf b}^{T}{\bf z}+{\bf x'}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}
  }  
}
\\& = &
\frac{e^{{\bf a}^{T}{\bf x}+{\bf x}^{T}{\bf W}{\bf z}}}
{\sum_{{\bf x'}\in{\cal X}} 
e^{{\bf a}^{T}{\bf x'}+{\bf x'}^{T}{\bf W}{\bf z}}}
= p({\bf x}\mid{\bf z})
\,.
\end{eqnarray}
Hence, we let ${\bf W}_{i,:}$ denote the $i$-th row of ${\bf W}$,
such that
\begin{eqnarray}
p({\bf x}\mid{\bf z}) & = &
\frac{
  \prod_{i=1}^{F}
  e^{x_i\left(a_i+{\bf W}_{i,:}{\bf z}\right)}
}
{
 \sum_{{\bf x'}\in{\cal X}} 
  \prod_{i=1}^{F}
  e^{x'_i\left(a_i+{\bf W}_{i,:}{\bf z}\right)}
}
\\& = &
\frac{
  \prod_{i=1}^{F}
  e^{x_i\left(a_i+{\bf W}_{i,:}{\bf z}\right)}
}
{
  \prod_{i=1}^{F}
 \sum_{x'_i\in{\cal X}_i} 
  e^{x'_i\left(a_i+{\bf W}_{i,:}{\bf z}\right)}
}
\\
& = &
\prod_{i=1}^{F} p(x_i\mid{\bf z})\,,
\end{eqnarray}
where
\begin{eqnarray}
p(x_i\mid{\bf z}) & = &
\frac{
  e^{x_i\left(a_i+{\bf W}_{i,:}{\bf z}\right)}
}
{
 \sum_{x'_i\in{\cal X}_i} 
  e^{x'_i\left(a_i+{\bf W}_{i,:}{\bf z}\right)}
}
\,.
\end{eqnarray}
Again, we note that this conditional independence of elements relies on the assumption that the space factorises as the Cartesian product
\begin{eqnarray}
  {\cal X} & = & {\cal X}_1\times{\cal X}_2\times\cdots\times{\cal X}_F\,.
\end{eqnarray}

Conversely, we observe that
\begin{eqnarray}
p({\bf z}\mid{\bf x},{\bf y}) & = &
\frac{
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}
  }  
}
{
\sum_{{\bf z'}\in{\cal Z}}
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z'}+{\bf x}^{T}{\bf W}{\bf z'}
     +{\bf c}^{T}{\bf y}+{\bf z'}^{T}{\bf U}{\bf y}
  }  
}
\\& = &
\frac{
  e^{{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}+{\bf z}^{T}{\bf U}{\bf y}}
}
{
\sum_{{\bf z'}\in{\cal Z}} 
  e^{{\bf b}^{T}{\bf z'}+{\bf x}^{T}{\bf W}{\bf z'}+{\bf z'}^{T}{\bf U}{\bf y}}
}
\\& = &
\frac{
\prod_{k=1}^{H}
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
{
\sum_{{\bf z'}\in{\cal Z}} 
\prod_{k=1}^{H}
  e^{z'_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
\\& = &
\frac{
\prod_{k=1}^{H}
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
{
\prod_{k=1}^{H}
\sum_{z'_k\in{\cal Z}_k} 
  e^{z'_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
\\& = &
\prod_{k=1}^{H}p(z_k\mid{\bf x},{\bf y})
\,,
\end{eqnarray}
where
\begin{eqnarray}
p(z_k\mid{\bf x},{\bf y}) & = &
\frac{
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
{
\sum_{z'_k\in{\cal Z}_k} 
  e^{z'_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
\,.
\end{eqnarray}
Once again, this conditional independence of elements relies on the assumption that the space factorises as the Cartesian product
\begin{eqnarray}
  {\cal Z} & = & {\cal Z}_1\times{\cal Z}_2\times\cdots\times{\cal Z}_H\,.
\end{eqnarray}

### Discriminative RBM

Since ${\bf z}$ is unknown in practice, the ultimate purpose of the composite RBM
is to predict output ${\bf y}$ based on known input ${\bf x}$. The discriminative probability
is thus
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & 
\sum_{{\bf z}\in{\cal Z}}p({\bf y},{\bf z}\mid{\bf x})
\\& = &
\frac{
\sum_{{\bf z}\in{\cal Z}}
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y}+{\bf z}^{T}{\bf U}{\bf y}
  }  
}
{
\sum_{{\bf y'}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}}
  e^{
   {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}
     +{\bf c}^{T}{\bf y'}+{\bf z}^{T}{\bf U}{\bf y'}
  }  
}
\\& = &
\frac{
e^{{\bf c}^{T}{\bf y}}
\sum_{{\bf z}\in{\cal Z}}
  e^{{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}+{\bf z}^{T}{\bf U}{\bf y}}
}
{
\sum_{{\bf y'}\in{\cal Y}}e^{{\bf c}^{T}{\bf y'}}
\sum_{{\bf z}\in{\cal Z}}
  e^{{\bf b}^{T}{\bf z}+{\bf x}^{T}{\bf W}{\bf z}+{\bf z}^{T}{\bf U}{\bf y'}}
}
\\& = &
\frac{
e^{{\bf c}^{T}{\bf y}}
\sum_{{\bf z}\in{\cal Z}}\prod_{k=1}^{H}
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
{
\sum_{{\bf y'}\in{\cal Y}}e^{{\bf c}^{T}{\bf y'}}
\sum_{{\bf z}\in{\cal Z}}\prod_{k=1}^{H}
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y'}\right)}
}
\\& = &
\frac{
e^{{\bf c}^{T}{\bf y}}
\prod_{k=1}^{H}\sum_{z_k\in{\cal Z}_k}
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
{
\sum_{{\bf y'}\in{\cal Y}}e^{{\bf c}^{T}{\bf y'}}
\prod_{k=1}^{H}\sum_{z_k\in{\cal Z}_k}
  e^{z_k\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y'}\right)}
}
\,.
\end{eqnarray}

### Bernoulli hidden layer

It is traditional, and convenient for the purposes of tractability, to assume that
the hidden layer takes binary-valued ${\bf z}$ vectors. Hence, we have 
${\cal Z}={\cal Z}_1\times{\cal Z}_2\times\cdots\times{\cal Z}_H=\{0,1\}^{H}$, with the result that
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = &
\frac{
  e^{{\bf c}^{T}{\bf y}}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}}\right)
}
{
\sum_{{\bf y'}\in{\cal Y}}e^{{\bf c}^{T}{\bf y'}}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y'}}\right)
}
\,.
\end{eqnarray}
In addition, we note from the earlier derivation of $p({\bf z}\mid{\bf x},{\bf y})$
above that now
\begin{eqnarray}
p(z_k=1\mid{\bf x},{\bf y}) & = &
\frac{
  e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}}
}
{
1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}}
}
\\ & = &
\frac{1}
{
1+e^{-\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)}
}
\\
& = & \sigma\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)
\,,
\end{eqnarray}
where $\sigma(\cdot)$ is the sigmoid logistic function.

For later use, we can therefore define the expected value of element $z_k$ as
\begin{eqnarray}
\bar{z}_k({\bf x},{\bf y}) & \doteq & 
\mathbb{E}_{{\bf z}\mid{\bf x},{\bf y}}[z_k]
= p(z_k=1\mid{\bf x},{\bf y})
\,.
\end{eqnarray}
We may also collect these individual expected values together into the vector
\begin{eqnarray}
\bar{\bf z}({\bf x},{\bf y}) & \doteq & 
\mathbb{E}_{{\bf z}\mid{\bf x},{\bf y}}[{\bf z}]
= \left[\bar{z}_k({\bf x},{\bf y})\right]_{k=1}^{H}
\,.
\end{eqnarray}
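
As an illustration, the hidden-layer expectation $\bar{\bf z}({\bf x},{\bf y})$ can be computed elementwise with the sigmoid and compared against a brute-force expectation over all ${\bf z}\in\{0,1\}^{H}$. This is only a sketch under assumed parameter values. The function names `z_bar` and `z_bar_brute` are hypothetical.

```python
import itertools
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^{-t}); adequate for moderate |t|."""
    return 1.0 / (1.0 + math.exp(-t))

def hidden_activation(k, x, y, b, W, U):
    """b_k + x^T W[:, k] + U[k, :] y."""
    return (b[k]
            + sum(x[i] * W[i][k] for i in range(len(x)))
            + sum(U[k][j] * y[j] for j in range(len(y))))

def z_bar(x, y, b, W, U):
    """Elementwise expectation E[z_k | x, y] = sigmoid(activation_k)."""
    return [sigmoid(hidden_activation(k, x, y, b, W, U))
            for k in range(len(b))]

def z_bar_brute(x, y, b, W, U):
    """E[z | x, y] by enumerating every z in {0,1}^H."""
    H = len(b)
    act = [hidden_activation(k, x, y, b, W, U) for k in range(H)]
    states = list(itertools.product((0, 1), repeat=H))
    weights = [math.exp(sum(z[k] * act[k] for k in range(H)))
               for z in states]
    total = sum(weights)
    return [sum(w * z[k] for z, w in zip(states, weights)) / total
            for k in range(H)]

# Assumed illustrative values: F = 3 inputs, H = 2 hidden units, C = 2 outputs.
x = [1, 0, 1]
y = [0, 1]
b = [0.1, -0.3]
W = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x H
U = [[0.4, -0.5], [-0.3, 0.7]]               # H x C

fast = z_bar(x, y, b, W, U)
slow = z_bar_brute(x, y, b, W, U)
```

The two results should agree to rounding error, which reflects the product form of $p({\bf z}\mid{\bf x},{\bf y})$.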

### Restricted Boltzmann Classifier

Similarly to the hidden layer, we also assume that the output layer takes
binary-valued ${\bf y}$ vectors. However, we further assume that each
input ${\bf x}$ belongs to exactly one of $C$ possible classes. Hence,
${\bf y}$ is a 1-of-$C$ (or one-hot) vector, for which $C-1$ elements take the value $0$ and exactly 1 element takes the value $1$. Thus, we have
${\cal Y}=\{{\bf y}\in\{0,1\}^{C}\mid\sum_{j=1}^{C}y_j=1\}$.
It follows that the discriminative model above reduces to the probabilistic classifier
\begin{eqnarray}
p(y_j=1\mid{\bf x}) & = &
\frac{
  e^{c_j}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+u_{kj}}\right)
}
{
\sum_{j'=1}^{C}e^{c_{j'}}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+u_{kj'}}\right)
}
\,.
\end{eqnarray}
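
The classifier above is a softmax over the per-class scores $c_j+\sum_{k}\ln\left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+u_{kj}}\right)$, which suggests evaluating it in the log domain. The sketch below does so and compares the result against an explicit sum over all hidden states. The parameter values and the function names (`rbc_predict`, `rbc_predict_brute`) are assumptions made for illustration.

```python
import itertools
import math

def softplus(t):
    """ln(1 + e^t), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def rbc_predict(x, b, c, W, U):
    """p(y_j = 1 | x) for one-hot y, via the product over hidden units."""
    F, H, C = len(x), len(b), len(c)
    pre = [b[k] + sum(x[i] * W[i][k] for i in range(F)) for k in range(H)]
    scores = [c[j] + sum(softplus(pre[k] + U[k][j]) for k in range(H))
              for j in range(C)]
    m = max(scores)                       # subtract the max before exp
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [v / total for v in w]

def rbc_predict_brute(x, b, c, W, U):
    """Same quantity by summing explicitly over all z in {0,1}^H."""
    F, H, C = len(x), len(b), len(c)
    sums = []
    for j in range(C):
        acc = 0.0
        for z in itertools.product((0, 1), repeat=H):
            e = c[j] + sum(z[k] * (b[k] + U[k][j]) for k in range(H))
            e += sum(x[i] * W[i][k] * z[k]
                     for i in range(F) for k in range(H))
            acc += math.exp(e)
        sums.append(acc)
    total = sum(sums)
    return [v / total for v in sums]

# Assumed illustrative values: F = 3, H = 2, C = 3.
x = [1, 0, 1]
b = [0.1, -0.3]
c = [0.2, 0.0, -0.1]
W = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]        # F x H
U = [[0.4, -0.5, 0.2], [-0.3, 0.7, 0.1]]          # H x C

fast = rbc_predict(x, b, c, W, U)
slow = rbc_predict_brute(x, b, c, W, U)
```

The term ${\bf a}^{T}{\bf x}$ is omitted from both functions because it cancels in the ratio. The brute-force version costs $O(2^{H})$ per class and is only intended as a check.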

For the purposes of comparison, we contrast this Restricted Boltzmann Classifier (RBC)
with the traditional (linear) logistic classifier
\begin{eqnarray}
p(y_j=1\mid{\bf x}) & = &
\frac{e^{c_j+{\bf x}^T{\bf v}_j}}
{\sum_{j'=1}^{C}e^{c_{j'}+{\bf x}^T{\bf v}_{j'}}}
\,.
\end{eqnarray}
Hence, we observe that the presence of hidden units (even for $H=1$, but especially for $H>1$) provides
additional nonlinearity to the logistic classifier.

It must be noted that the restriction imposed upon ${\cal Y}$ has the consequence that
${\cal Y}\ne{\cal Y}_1\times\cdots\times{\cal Y}_C$, and so this breaks the usual conditional independence property. Thus, from the earlier derivation of 
$p({\bf y}\mid{\bf z})$ above, we instead have
the probability
\begin{eqnarray}
p(y_j=1\mid{\bf z}) & = &
\frac{e^{c_j+{\bf z}^T{\bf U}_{:,j}}}
{\sum_{j'=1}^{C}e^{c_{j'}+{\bf z}^T{\bf U}_{:,j'}}}
\,.
\end{eqnarray}
However, we may still define the expected values
\begin{eqnarray}
\bar{y}_j({\bf z}) & \doteq & 
\mathbb{E}_{{\bf y}\mid{\bf z}}[y_j] = p(y_j=1\mid{\bf z})
\,,
\end{eqnarray}
and
\begin{eqnarray}
\bar{\bf y}({\bf z}) & \doteq & 
\mathbb{E}_{{\bf y}\mid{\bf z}}[{\bf y}]
= \left[\bar{y}_j({\bf z})\right]_{j=1}^{C}\,.
\end{eqnarray}

### Modelling the input distribution

By design, the discriminative classifier above does not explicitly take into account the distribution of the input ${\bf x}$. This is manifested by the fact that,
as noted by [(Jarrad)](../notes/gaussian-restricted-boltzmann.pdf), the coefficients
${\bf a}$ do not appear in the classifier. However, there can be some implicit dependence upon the input distribution. For example, 
[(Jarrad${}_2$)](./rbm_models.ipynb)
showed that Gaussian inputs can be modelled with a standard (Bernoulli) RBM by replacing ${\bf x}$
on the right-hand side by $\tilde{\bf x}\doteq(x_1,\ldots,x_F,x_1^2,\ldots,x_F^2)$.
This 'trick' can also be applied to the standard (linear) logistic classifier.

Clearly, to be able to estimate ${\bf a}$ from data, we need to optimise a different model than $p({\bf y}\mid{\bf x})$, such as the joint model $p({\bf x},{\bf y})$. Empirically, my general observation, from past experiments on a wide variety of probabilistic classifiers, is that discriminative models have a tendency to overfit the training data, whereas joint models tend to be more robust, since they explicitly take account of the distribution of data points.

For convenience, we now make the choice that
${\bf x}$ is also binary-valued, such that 
${\cal X}={\cal X}_1\times\cdots\times{\cal X}_F=\{0,1\}^{F}$.
Hence, from the earlier derivation of $p({\bf x}\mid{\bf z})$, we now have
\begin{eqnarray}
p(x_i=1\mid{\bf z}) & = &
\frac{
  e^{a_i+{\bf W}_{i,:}{\bf z}}
}
{
 1+e^{a_i+{\bf W}_{i,:}{\bf z}}
}
= \sigma\left(a_i+{\bf W}_{i,:}{\bf z}\right)
\,.
\end{eqnarray}
Thus, we may also define the expected values
\begin{eqnarray}
  \bar{x}_i({\bf z}) & \doteq & \mathbb{E}_{{\bf x}\mid{\bf z}}[x_i]
= p(x_i=1\mid{\bf z})\,,
\end{eqnarray}
and
\begin{eqnarray}
  \bar{\bf x}({\bf z}) & \doteq & \mathbb{E}_{{\bf x}\mid{\bf z}}[{\bf x}]
= [\bar{x}_i({\bf z})]_{i=1}^{F}
\,.
\end{eqnarray}

### Optimising the joint likelihood

The RBC presented in this notebook is a discrete Boltzmann machine with a hidden layer,
and we now wish to optimise the joint likelihood $p({\bf x},{\bf y})$.
Hence, from [(Jarrad${}_3$)](./discrete_boltzmann.ipynb) we obtain
\begin{eqnarray}
\nabla\ln p({\bf x},{\bf y}) & \approx &
\mathbb{E}_{{\bf z}\mid{\bf x},{\bf y}}\left[
 \nabla f({\bf x},{\bf y},{\bf z})
\right]
-
\mathbb{E}_{{\bf z}\mid{\bf x},{\bf y}}\left[
 \mathbb{E}_{{\bf x'},{\bf y'}\mid{\bf z}}\left[
  \mathbb{E}_{{\bf z'}\mid{\bf x'},{\bf y'}}\left[
   \nabla f({\bf x'},{\bf y'},{\bf z'})
  \right]
 \right]
\right]
\\& \approx &
\nabla f({\bf x},{\bf y},\bar{\bf z}({\bf x},{\bf y}))
-\nabla f(\bar{\bf x}',\bar{\bf y}',\bar{\bf z}(\bar{\bf x}',\bar{\bf y}'))
\,.
\end{eqnarray}
Since we have demonstrated for the RBC that ${\bf x}$ and ${\bf y}$ are conditionally independent given ${\bf z}$, we see that
\begin{eqnarray}
\mathbb{E}_{{\bf x},{\bf y}\mid{\bf z}}[\cdot]
& = & \mathbb{E}_{{\bf x}\mid{\bf z}}[\cdot]\,
 \mathbb{E}_{{\bf y}\mid{\bf z}}[\cdot]
\,,
\end{eqnarray}
such that $\bar{\bf x}'=\bar{\bf x}(\bar{\bf z}({\bf x},{\bf y}))$
and $\bar{\bf y}'=\bar{\bf y}(\bar{\bf z}({\bf x},{\bf y}))$.

To interpret this result, note that it corresponds to the procedure:
1. Push the known input ${\bf x}$ and output ${\bf y}$ into the hidden layer of the RBC, and compute 
  $\bar{\bf z}=\bar{\bf z}({\bf x},{\bf y})$.
2. Push $\bar{\bf z}$ back out of the hidden layer into the input and output layers,
and compute $\bar{\bf x}'=\bar{\bf x}(\bar{\bf z})$ and 
$\bar{\bf y}'=\bar{\bf y}(\bar{\bf z})$.
3. Push the reconstructed input $\bar{\bf x}'$ and output $\bar{\bf y}'$ back into the hidden layer, and compute
$\bar{\bf z}'=\bar{\bf z}(\bar{\bf x}',\bar{\bf y}')$.
4. Compute the gradient as the difference of the data term
$\nabla f({\bf x},{\bf y},\bar{\bf z})$ and the reconstruction term
$\nabla f(\bar{\bf x}',\bar{\bf y}',\bar{\bf z}')$.
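
The first three steps of the procedure above can be sketched as a single reconstruction pass. This is a minimal illustration for the Bernoulli RBC with one-hot outputs, not a definitive implementation. The function names and the parameter values are assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax(v):
    m = max(v)                            # subtract the max for stability
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def hidden_mean(x, y, b, W, U):
    """z_bar(x, y): sigmoid(b_k + x^T W[:, k] + U[k, :] y) for each k."""
    return [sigmoid(b[k]
                    + sum(x[i] * W[i][k] for i in range(len(x)))
                    + sum(U[k][j] * y[j] for j in range(len(y))))
            for k in range(len(b))]

def input_mean(z, a, W):
    """x_bar(z): sigmoid(a_i + W[i, :] z) for each i."""
    return [sigmoid(a[i] + sum(W[i][k] * z[k] for k in range(len(z))))
            for i in range(len(a))]

def output_mean(z, c, U):
    """y_bar(z): softmax over classes of c_j + z^T U[:, j]."""
    return softmax([c[j] + sum(z[k] * U[k][j] for k in range(len(z)))
                    for j in range(len(c))])

def reconstruct(x, y, a, b, c, W, U):
    z0 = hidden_mean(x, y, b, W, U)       # data -> hidden
    x1 = input_mean(z0, a, W)             # hidden -> reconstructed input
    y1 = output_mean(z0, c, U)            # hidden -> reconstructed output
    z1 = hidden_mean(x1, y1, b, W, U)     # reconstruction -> hidden
    return z0, x1, y1, z1

# Assumed illustrative values: F = 3, H = 2, C = 2.
a = [0.0, 0.2, -0.1]
b = [0.1, -0.3]
c = [0.2, -0.2]
W = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x H
U = [[0.4, -0.5], [-0.3, 0.7]]               # H x C

z0, x1, y1, z1 = reconstruct([1, 0, 1], [0, 1], a, b, c, W, U)
```

The returned tuple holds $\bar{\bf z}$, $\bar{\bf x}'$, $\bar{\bf y}'$ and $\bar{\bf z}'$, which are the quantities needed to form the data and reconstruction terms of the gradient.
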

In addition to approximating the gradient, we also need to approximate the joint likelihood of input ${\bf x}$ and output ${\bf y}$.
Again from [(Jarrad${}_3$)](./discrete_boltzmann.ipynb), we have
\begin{eqnarray}
p({\bf x},{\bf y})
& \approx &
p({\bf x},{\bf y}\mid\bar{\bf z}({\bf x},{\bf y}))
\\
& = & p({\bf x}\mid\bar{\bf z}({\bf x},{\bf y}))\,
p({\bf y}\mid\bar{\bf z}({\bf x},{\bf y}))
\,,
\end{eqnarray}
again due to the conditional independence of ${\bf x}$ and ${\bf y}$ given ${\bf z}$.

### Supervised RBC training

Putting together the various assumptions and derivations from above, the key expectations for a Bernoulli RBC are:
\begin{eqnarray}
\bar{z}_k({\bf x},{\bf y}) & = &
p(z_k=1\mid{\bf x},{\bf y}) =
\sigma\left(b_k+{\bf x}^{T}{\bf W}_{:,k}+{\bf U}_{k,:}{\bf y}\right)\,,
\end{eqnarray}
for $k=1,2,\ldots,H$,
\begin{eqnarray}
\bar{x}_i({\bf z}) & = &
p(x_i=1\mid{\bf z}) =
\sigma\left(a_i+{\bf W}_{i,:}{\bf z}\right)\,,
\end{eqnarray}
for $i=1,2,\ldots,F$, and
\begin{eqnarray}
\bar{y}_j({\bf z}) & = &
p(y_j=1\mid{\bf z}) =
\frac{e^{c_j+{\bf z}^T{\bf U}_{:,j}}}
{\sum_{j'=1}^{C}e^{c_{j'}+{\bf z}^T{\bf U}_{:,j'}}}\,,
\end{eqnarray}
for $j=1,2,\ldots,C$.

Hence, we have
\begin{eqnarray}
p({\bf x}\mid{\bf z}) & = &
\prod_{i=1}^{F} \bar{x}_i({\bf z})^{\,x_i}\,
\left[1-\bar{x}_i({\bf z})\right]^{\,1-x_i}\,,
\end{eqnarray}
and
\begin{eqnarray}
p({\bf y}\mid{\bf z}) & = &
\frac{e^{{\bf c}^T{\bf y}+{\bf z}^T{\bf U}{\bf y}}}
{\sum_{j'=1}^{C}e^{c_{j'}+{\bf z}^T{\bf U}_{:,j'}}}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p({\bf x},{\bf y}) & \approx & p({\bf x}\mid\bar{\bf z})\,p({\bf y}\mid\bar{\bf z})
\,.
\end{eqnarray}
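
The approximate joint score $\ln p({\bf x}\mid\bar{\bf z})+\ln p({\bf y}\mid\bar{\bf z})$ can be evaluated directly from the expressions above. The following sketch assumes binary inputs and one-hot outputs. The name `joint_log_score` and all parameter values are hypothetical.

```python
import math

def log_sigmoid(t):
    """ln sigmoid(t) = -ln(1 + e^{-t}), computed stably."""
    return -(max(-t, 0.0) + math.log1p(math.exp(-abs(t))))

def joint_log_score(x, y, a, b, c, W, U):
    """Approximate ln p(x, y) as ln p(x | z_bar) + ln p(y | z_bar)."""
    F, H, C = len(x), len(b), len(y)
    # Hidden expectation z_bar(x, y).
    z = [1.0 / (1.0 + math.exp(-(b[k]
                                 + sum(x[i] * W[i][k] for i in range(F))
                                 + sum(U[k][j] * y[j] for j in range(C)))))
         for k in range(H)]
    # Bernoulli log-likelihood of x with logits a_i + W[i, :] z_bar.
    log_px = 0.0
    for i in range(F):
        t = a[i] + sum(W[i][k] * z[k] for k in range(H))
        log_px += log_sigmoid(t) if x[i] == 1 else log_sigmoid(-t)
    # Log-softmax of the logits c_j + z_bar^T U[:, j] at the true class.
    logits = [c[j] + sum(z[k] * U[k][j] for k in range(H)) for j in range(C)]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(t - m) for t in logits))
    log_py = sum(y[j] * logits[j] for j in range(C)) - log_norm
    return log_px + log_py

# Assumed illustrative values: F = 3, H = 2, C = 2.
x, y = [1, 0, 1], [0, 1]
a = [0.0, 0.2, -0.1]
b = [0.1, -0.3]
c = [0.2, -0.2]
W = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x H
U = [[0.4, -0.5], [-0.3, 0.7]]               # H x C
score = joint_log_score(x, y, a, b, c, W, U)

# Sanity check with all parameters zero: every probability is uniform.
zero_score = joint_log_score(x, y, [0.0] * 3, [0.0] * 2, [0.0] * 2,
                             [[0.0] * 2 for _ in range(3)],
                             [[0.0] * 2 for _ in range(2)])
```

With all parameters set to zero, the score reduces to $F\ln\frac{1}{2}+\ln\frac{1}{C}$, which gives a simple check of the implementation.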

Consequently, the required gradients are:
\begin{eqnarray}
\frac{\partial f}{\partial{\bf a}} = {\bf x}
& \Rightarrow &
\frac{\partial}{\partial{\bf a}}\ln p({\bf x},{\bf y}) \approx {\bf x} - \bar{\bf x}'
\,,
\\
\frac{\partial f}{\partial{\bf b}} = {\bf z}
& \Rightarrow &
\frac{\partial}{\partial{\bf b}}\ln p({\bf x},{\bf y}) 
\approx \bar{\bf z} - \bar{\bf z}'
\,,
\\
\frac{\partial f}{\partial{\bf c}} = {\bf y}
& \Rightarrow &
\frac{\partial}{\partial{\bf c}}\ln p({\bf x},{\bf y}) 
\approx {\bf y} - \bar{\bf y}'
\,,
\\
\frac{\partial f}{\partial{\bf W}} = {\bf x}\,{\bf z}^T
& \Rightarrow &
\frac{\partial}{\partial{\bf W}}\ln p({\bf x},{\bf y}) 
\approx {\bf x}\,\bar{\bf z}^T - \bar{\bf x}'\,\bar{\bf z}'^T
\,,
\\
\frac{\partial f}{\partial{\bf U}} = {\bf z}\,{\bf y}^T
& \Rightarrow &
\frac{\partial}{\partial{\bf U}}\ln p({\bf x},{\bf y}) 
\approx \bar{\bf z}\,{\bf y}^T - \bar{\bf z}'\,\bar{\bf y}'^T
\,.
\end{eqnarray}
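
Given the data-side and reconstruction-side expectations, the gradients above are differences of vectors and outer products. The sketch below assumes that $\bar{\bf z}$, $\bar{\bf x}'$, $\bar{\bf y}'$ and $\bar{\bf z}'$ have already been computed, and uses hand-picked illustrative numbers for them. The function name `supervised_gradients` is hypothetical.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def vec_diff(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

def mat_diff(A, B):
    return [vec_diff(ra, rb) for ra, rb in zip(A, B)]

def supervised_gradients(x, y, z0, x1, y1, z1):
    """Approximate gradients of ln p(x, y): data term minus reconstruction.

    z0 = z_bar(x, y); x1, y1 = reconstructions from z0; z1 = z_bar(x1, y1).
    """
    return {
        "a": vec_diff(x, x1),
        "b": vec_diff(z0, z1),
        "c": vec_diff(y, y1),
        "W": mat_diff(outer(x, z0), outer(x1, z1)),   # F x H
        "U": mat_diff(outer(z0, y), outer(z1, y1)),   # H x C
    }

# Hand-picked illustrative statistics for F = 2, H = 1, C = 2.
grads = supervised_gradients(
    x=[1, 0], y=[0, 1],
    z0=[0.5], x1=[0.5, 0.5], y1=[0.25, 0.75], z1=[0.25],
)
```

Since these are gradients of a log-likelihood, a training loop would add a small multiple of each entry to the corresponding parameter. When the reconstruction reproduces the data exactly, every gradient entry is zero.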

### Optimising the marginal likelihood

In the case of unsupervised training, we have no class label for input ${\bf x}$, and thus no explicit ${\bf y}$ vector. Consequently, rather than maximising the joint
likelihood $p({\bf x}, {\bf y})$, we instead maximise the marginal likelihood 
$p({\bf x})$. Hence, from [(Jarrad${}_3$)](./discrete_boltzmann.ipynb), we obtain
\begin{eqnarray}
p({\bf x}) & \approx &
p({\bf x}\mid\tilde{\bf y}({\bf x}),\bar{\bf z}({\bf x},\tilde{\bf y}({\bf x})))
\,,
\end{eqnarray}
where we have now defined
\begin{eqnarray}
\tilde{\bf y}({\bf x}) & \doteq &
\mathbb{E}_{{\bf y}\mid{\bf x}}[{\bf y}]
\,,
\end{eqnarray}
in order to distinguish it from $\bar{\bf y}({\bf z})$.
Using the conditional independence of ${\bf x}$ and ${\bf y}$ given ${\bf z}$,
we thus have
\begin{eqnarray}
p({\bf x}) & \approx &
p({\bf x}\mid\bar{\bf z}({\bf x},\tilde{\bf y}({\bf x})))
\,.
\end{eqnarray}

Also from [(Jarrad${}_3$)](./discrete_boltzmann.ipynb), we obtain the gradient as
\begin{eqnarray}
\nabla\ln p({\bf x}) & \approx & 
\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
 \mathbb{E}_{{\bf z}\mid{\bf x},{\bf y}}[\nabla f]
\right]
-\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
  \mathbb{E}_{{\bf z}\mid{\bf x},{\bf y}}\left[
   \mathbb{E}_{{\bf x'}\mid{\bf y},{\bf z}}\left[
    \mathbb{E}_{{\bf y'}\mid{\bf x'}}\left[
     \mathbb{E}_{{\bf z'}\mid{\bf x'},{\bf y'}}[\nabla f]
     \right]
    \right]
  \right]
 \right]
\,.
\end{eqnarray}
However, observe that 
$\mathbb{E}_{{\bf x'}\mid{\bf y},{\bf z}}=\mathbb{E}_{{\bf x'}\mid{\bf z}}$.
Hence, the gradient is given by
\begin{eqnarray}
\nabla\ln p({\bf x}) & \approx & 
\nabla f({\bf x},\tilde{\bf y},\bar{\bf z})
-\nabla f(\bar{\bf x}',\tilde{\bf y}',\bar{\bf z}')
\,,
\end{eqnarray}
where now $\tilde{\bf y}=\tilde{\bf y}({\bf x})$, 
$\bar{\bf z}=\bar{\bf z}({\bf x},\tilde{\bf y})$,
$\bar{\bf x}'=\bar{\bf x}(\bar{\bf z})$,
$\tilde{\bf y}'=\tilde{\bf y}(\bar{\bf x}')$,
and $\bar{\bf z}'=\bar{\bf z}(\bar{\bf x}',\tilde{\bf y}')$.

The procedure for computing the unsupervised gradient is:
1. Push the known input ${\bf x}$ through the RBC to the output, and compute 
$\tilde{\bf y}=\tilde{\bf y}({\bf x})$.
2. Push the known input ${\bf x}$ and estimated output $\tilde{\bf y}$ back into the hidden layer of the RBC, and compute 
  $\bar{\bf z}=\bar{\bf z}({\bf x},\tilde{\bf y})$.
3. Push $\bar{\bf z}$ back out of the hidden layer into the input layer,
and compute $\bar{\bf x}'=\bar{\bf x}(\bar{\bf z})$.
4. Push the reconstructed $\bar{\bf x}'$ through the RBC to the output, and compute
$\tilde{\bf y}'=\tilde{\bf y}(\bar{\bf x}')$.
5. Push the reconstructed input $\bar{\bf x}'$ and output $\tilde{\bf y}'$ back into the hidden layer, and compute
$\bar{\bf z}'=\bar{\bf z}(\bar{\bf x}',\tilde{\bf y}')$.
6. Compute the gradient as the difference of the data term
$\nabla f({\bf x},\tilde{\bf y},\bar{\bf z})$ and the reconstruction term
$\nabla f(\bar{\bf x}',\tilde{\bf y}',\bar{\bf z}')$.
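
The unsupervised procedure above differs from the supervised one only in that the output is predicted from the input rather than observed. A minimal sketch of the pass is given below, assuming the Bernoulli RBC with one-hot outputs. The function names and parameter values are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softplus(t):
    """ln(1 + e^t), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def predict(x, b, c, W, U):
    """y_tilde(x): p(y_j = 1 | x) with the hidden layer summed out."""
    H, C = len(b), len(c)
    pre = [b[k] + sum(x[i] * W[i][k] for i in range(len(x)))
           for k in range(H)]
    scores = [c[j] + sum(softplus(pre[k] + U[k][j]) for k in range(H))
              for j in range(C)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [v / total for v in w]

def hidden_mean(x, y, b, W, U):
    """z_bar(x, y): sigmoid(b_k + x^T W[:, k] + U[k, :] y) for each k."""
    return [sigmoid(b[k]
                    + sum(x[i] * W[i][k] for i in range(len(x)))
                    + sum(U[k][j] * y[j] for j in range(len(y))))
            for k in range(len(b))]

def input_mean(z, a, W):
    """x_bar(z): sigmoid(a_i + W[i, :] z) for each i."""
    return [sigmoid(a[i] + sum(W[i][k] * z[k] for k in range(len(z))))
            for i in range(len(a))]

def unsupervised_pass(x, a, b, c, W, U):
    y0 = predict(x, b, c, W, U)           # input -> predicted output
    z0 = hidden_mean(x, y0, b, W, U)      # input and prediction -> hidden
    x1 = input_mean(z0, a, W)             # hidden -> reconstructed input
    y1 = predict(x1, b, c, W, U)          # reconstruction -> predicted output
    z1 = hidden_mean(x1, y1, b, W, U)     # reconstruction -> hidden
    return y0, z0, x1, y1, z1

# Assumed illustrative values: F = 3, H = 2, C = 2.
a = [0.0, 0.2, -0.1]
b = [0.1, -0.3]
c = [0.2, -0.2]
W = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x H
U = [[0.4, -0.5], [-0.3, 0.7]]               # H x C

y0, z0, x1, y1, z1 = unsupervised_pass([1, 0, 1], a, b, c, W, U)
```

The five returned quantities correspond to $\tilde{\bf y}$, $\bar{\bf z}$, $\bar{\bf x}'$, $\tilde{\bf y}'$ and $\bar{\bf z}'$, computed in the order of the listed steps.
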

### Unsupervised RBC training

From the previous section, the required gradients are:
\begin{eqnarray}
\frac{\partial f}{\partial{\bf a}} = {\bf x}
& \Rightarrow &
\frac{\partial}{\partial{\bf a}}\ln p({\bf x}) \approx {\bf x} - \bar{\bf x}'
\,,
\\
\frac{\partial f}{\partial{\bf b}} = {\bf z}
& \Rightarrow &
\frac{\partial}{\partial{\bf b}}\ln p({\bf x}) 
\approx \bar{\bf z} - \bar{\bf z}'
\,,
\\
\frac{\partial f}{\partial{\bf c}} = {\bf y}
& \Rightarrow &
\frac{\partial}{\partial{\bf c}}\ln p({\bf x}) 
\approx \tilde{\bf y} - \tilde{\bf y}'
\,,
\\
\frac{\partial f}{\partial{\bf W}} = {\bf x}\,{\bf z}^T
& \Rightarrow &
\frac{\partial}{\partial{\bf W}}\ln p({\bf x}) 
\approx {\bf x}\,\bar{\bf z}^T - \bar{\bf x}'\,\bar{\bf z}'^T
\,,
\\
\frac{\partial f}{\partial{\bf U}} = {\bf z}\,{\bf y}^T
& \Rightarrow &
\frac{\partial}{\partial{\bf U}}\ln p({\bf x}) 
\approx \bar{\bf z}\,\tilde{\bf y}^T - \bar{\bf z}'\,\tilde{\bf y}'^T
\,.
\end{eqnarray}

We also require
\begin{eqnarray}
\tilde{y}_j({\bf x}) & \doteq &
p(y_j=1\mid{\bf x}) =
\frac{
  e^{c_j}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+u_{kj}}\right)
}
{
\sum_{j'=1}^{C}e^{c_{j'}}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}{\bf W}_{:,k}+u_{kj'}}\right)
}
\,,
\end{eqnarray}
such that
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = &
\prod_{j=1}^{C} \tilde{y}_j({\bf x})^{\,y_j}\,.
\end{eqnarray}

The log-likelihood score is given by
\begin{eqnarray}
\ln p({\bf x}) & \approx &
\ln p({\bf x}\mid\bar{\bf z})
\,.
\end{eqnarray}

### Sequential RBC

[(Jarrad${}_2$)](./rbm_models.ipynb) examined the case where the input ${\bf x}$ to an RBM implicitly took the form of a Markov sequence. Following similar reasoning here, we obtain
\begin{eqnarray}
p({\bf x}) & = & p(x_1)\,p(x_2\mid x_1)\ldots p(x_F\mid x_1,\ldots,x_{F-1})
\\
& = & \prod_{i=1}^{F} p(x_i\mid{\bf x}_{1:i-1})
\,,
\end{eqnarray}
where now
\begin{eqnarray}
p(x_i\mid{\bf x}_{1:i-1}) & = & 
\sum_{{\bf y}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}}
p(x_i,{\bf y},{\bf z}\mid{\bf x}_{1:i-1})
\\& = &
\sum_{{\bf y}\in{\cal Y}}\sum_{{\bf z}\in{\cal Z}}
p(x_i\mid{\bf z})\,p({\bf z}\mid{\bf x}_{1:i-1},{\bf y})
\,p({\bf y}\mid{\bf x}_{1:i-1})
\\& = &
\mathbb{E}_{{\bf y}\mid{\bf x}_{1:i-1}}\left[
 \mathbb{E}_{{\bf z}\mid{\bf x}_{1:i-1},{\bf y}}\left[
  p(x_i\mid{\bf z})
 \right]
\right]
\\& \approx &
p\left(
 x_i\mid\bar{\bf z}\left(
  {\bf x}_{1:i-1}, \tilde{\bf y}({\bf x}_{1:i-1})
 \right)
\right)
\,,
\end{eqnarray}
due to the conditional independence of ${\bf x}$ and ${\bf y}$ given ${\bf z}$.

We require
\begin{eqnarray}
\bar{x}_i({\bf z}) & \doteq &
p(x_i=1\mid{\bf z}) =
\sigma\left(a_i+{\bf W}_{i,:}{\bf z}\right)\,,
\end{eqnarray}
for $i=1,2,\ldots,F$, and
\begin{eqnarray}
\tilde{y}_j({\bf x}_{1:i-1}) & \doteq &
p(y_j=1\mid{\bf x}_{1:i-1}) =
\frac{
  e^{c_j}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}_{1:i-1}^{T}{\bf W}_{1:i-1,k}+u_{kj}}\right)
}
{
\sum_{j'=1}^{C}e^{c_{j'}}
  \prod_{k=1}^{H}
  \left(1+e^{b_k+{\bf x}^{T}_{1:i-1}{\bf W}_{1:i-1,k}+u_{kj'}}\right)
}
\,,
\end{eqnarray}
for $j=1,2,\ldots,C$, 
and
\begin{eqnarray}
\bar{z}_k({\bf x}_{1:i-1},{\bf y}) & \doteq &
p(z_k=1\mid{\bf x}_{1:i-1},{\bf y}) =
\sigma\left(b_k+{\bf x}_{1:i-1}^{T}{\bf W}_{1:i-1,k}+{\bf U}_{k,:}{\bf y}\right)\,,
\end{eqnarray}
for $k=1,2,\ldots,H$.
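
The chain of approximate conditionals above can be accumulated into a sequence log-likelihood. The following is a rough sketch of that evaluation only, assuming binary inputs and one-hot outputs. It does not address the gradients. The function names are hypothetical, and the prefix ${\bf x}_{1:i-1}$ is paired with the first $i-1$ rows of ${\bf W}$.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softplus(t):
    """ln(1 + e^t), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def predict(x, b, c, W, U):
    """y_tilde(x) for a (possibly empty) prefix x and matching rows of W."""
    H, C = len(b), len(c)
    pre = [b[k] + sum(x[i] * W[i][k] for i in range(len(x)))
           for k in range(H)]
    scores = [c[j] + sum(softplus(pre[k] + U[k][j]) for k in range(H))
              for j in range(C)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [v / total for v in w]

def hidden_mean(x, y, b, W, U):
    """z_bar(x, y) for a (possibly empty) prefix x and matching rows of W."""
    return [sigmoid(b[k]
                    + sum(x[i] * W[i][k] for i in range(len(x)))
                    + sum(U[k][j] * y[j] for j in range(len(y))))
            for k in range(len(b))]

def sequential_log_likelihood(x, a, b, c, W, U):
    """Sum over i of ln p(x_i | z_bar(x_{1:i-1}, y_tilde(x_{1:i-1})))."""
    total = 0.0
    for i in range(len(x)):
        prefix, W_prefix = x[:i], W[:i]   # empty when i == 0
        y_t = predict(prefix, b, c, W_prefix, U)
        z = hidden_mean(prefix, y_t, b, W_prefix, U)
        p1 = sigmoid(a[i] + sum(W[i][k] * z[k] for k in range(len(b))))
        total += math.log(p1 if x[i] == 1 else 1.0 - p1)
    return total

# Assumed illustrative values: F = 3, H = 2, C = 2.
x = [1, 0, 1]
a = [0.0, 0.2, -0.1]
b = [0.1, -0.3]
c = [0.2, -0.2]
W = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x H
U = [[0.4, -0.5], [-0.3, 0.7]]               # H x C
ll = sequential_log_likelihood(x, a, b, c, W, U)

# With all parameters zero, each conditional equals 1/2.
zero_ll = sequential_log_likelihood(
    x, [0.0] * 3, [0.0] * 2, [0.0] * 2,
    [[0.0] * 2 for _ in range(3)], [[0.0] * 2 for _ in range(2)])
```

For the first element the prefix is empty, so the hidden pre-activations reduce to the biases $b_k$ plus the contribution from $\tilde{\bf y}$.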

TODO: Finish the derivatives of the log-likelihood.

### Logistic RBM

For the purposes of comparison, we now turn to a version of the RBC without the hidden ${\bf z}$ layer. Hence, the joint probability is now
\begin{eqnarray}
p({\bf x},{\bf y}) & \doteq &
\frac{
  e^{g({\bf x},{\bf y})}
}
{
\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
  e^{g({\bf x'},{\bf y'})}
}
=
\frac{
  e^{
   {\bf a}^{T}{\bf x}
     +{\bf c}^{T}{\bf y}+{\bf x}^{T}{\bf V}{\bf y}
  }  
}
{
\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}
  e^{
   {\bf a}^{T}{\bf x'}
     +{\bf c}^{T}{\bf y'}+{\bf x'}^{T}{\bf V}{\bf y'}
  }  
}
\,,
\end{eqnarray}
where we have introduced the new weight matrix ${\bf V}$ to account for the interactions between input ${\bf x}$ and output ${\bf y}$.

The joint log-likelihood is thus
\begin{eqnarray}
\ln p({\bf x},{\bf y}) & = & g({\bf x},{\bf y})
-\ln\sum_{{\bf x'}\in{\cal X}}\sum_{{\bf y'}\in{\cal Y}}e^{g({\bf x'},{\bf y'})}
\\
\Rightarrow \nabla\ln p({\bf x},{\bf y})
& = & 
\nabla g({\bf x},{\bf y})
-\mathbb{E}_{{\bf x'},{\bf y'}}\left[\nabla g({\bf x'},{\bf y'})\right]
\\& \approx &
\nabla g({\bf x},{\bf y})
-\mathbb{E}_{{\bf y'}\mid{\bf x}}\left[
  \mathbb{E}_{{\bf x'}\mid{\bf y'}}\left[\nabla g({\bf x'},{\bf y'})\right]
 \right]
\\& \approx &
\nabla g({\bf x},{\bf y})
-\nabla g\left(
 \bar{\bf x}\left(\tilde{\bf y}({\bf x})\right),\tilde{\bf y}({\bf x})
\right)
\,,
\end{eqnarray}
using CEA and MFA, from [(Jarrad${}_3$)](./discrete_boltzmann.ipynb).

Consequently, the joint gradients are:
\begin{eqnarray}
\frac{\partial g}{\partial{\bf a}} = {\bf x}
& \Rightarrow &
\frac{\partial}{\partial{\bf a}}\ln p({\bf x},{\bf y}) 
\approx {\bf x} - \bar{\bf x}(\tilde{\bf y}({\bf x}))
\,,
\\
\frac{\partial g}{\partial{\bf c}} = {\bf y}
& \Rightarrow &
\frac{\partial}{\partial{\bf c}}\ln p({\bf x},{\bf y}) 
\approx {\bf y} - \tilde{\bf y}({\bf x})
\,,
\\
\frac{\partial g}{\partial{\bf V}} = {\bf x}\,{\bf y}^T
& \Rightarrow &
\frac{\partial}{\partial{\bf V}}\ln p({\bf x},{\bf y}) 
\approx {\bf x}\,{\bf y}^T - \bar{\bf x}\left(\tilde{\bf y}({\bf x})\right)
\,\tilde{\bf y}({\bf x})^T
\,.
\end{eqnarray}
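
The gradients of the hidden-layer-free model can be sketched in a few lines. The block below assumes binary inputs and one-hot outputs, so that $\tilde{\bf y}({\bf x})$ is a softmax of $c_j+{\bf x}^{T}{\bf V}_{:,j}$ and $\bar{\bf x}({\bf y})$ is a sigmoid of $a_i+{\bf V}_{i,:}{\bf y}$. The function names and parameter values are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax(v):
    m = max(v)                            # subtract the max for stability
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def y_tilde(x, c, V):
    """p(y_j = 1 | x): softmax of c_j + x^T V[:, j]."""
    return softmax([c[j] + sum(x[i] * V[i][j] for i in range(len(x)))
                    for j in range(len(c))])

def x_bar(y, a, V):
    """p(x_i = 1 | y): sigmoid of a_i + V[i, :] y."""
    return [sigmoid(a[i] + sum(V[i][j] * y[j] for j in range(len(y))))
            for i in range(len(a))]

def logistic_rbm_gradients(x, y, a, c, V):
    """Approximate gradients of ln p(x, y) for the model without z."""
    y0 = y_tilde(x, c, V)                 # predicted output from the data
    x1 = x_bar(y0, a, V)                  # reconstructed input from y0
    F, C = len(x), len(y)
    return {
        "a": [x[i] - x1[i] for i in range(F)],
        "c": [y[j] - y0[j] for j in range(C)],
        "V": [[x[i] * y[j] - x1[i] * y0[j] for j in range(C)]
              for i in range(F)],         # F x C
    }

# Assumed illustrative values: F = 3, C = 2.
x, y = [1, 0, 1], [0, 1]
a = [0.0, 0.2, -0.1]
c = [0.2, -0.2]
V = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x C
grads = logistic_rbm_gradients(x, y, a, c, V)
```

Because both ${\bf y}$ and $\tilde{\bf y}({\bf x})$ sum to one, the entries of the gradient with respect to ${\bf c}$ sum to zero.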

The predictive output of the model is
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = &
\frac{e^{g({\bf x},{\bf y})}}
{\sum_{{\bf y'}\in{\cal Y}}e^{g({\bf x},{\bf y'})}}
= 
\frac{e^{{\bf c}^T{\bf y}+{\bf x}^T{\bf V}{\bf y}}}
{\sum_{{\bf y'}\in{\cal Y}}e^{{\bf c}^T{\bf y'}+{\bf x}^T{\bf V}{\bf y'}}}
\,.
\end{eqnarray}
For binary one-hot outputs, we thus have
\begin{eqnarray}
\tilde{y}_j({\bf x}) & \doteq & p(y_j=1\mid{\bf x})
= \frac{e^{c_j+{\bf x}^T{\bf V}_{:,j}}}
{\sum_{j'=1}^{C}e^{c_{j'}+{\bf x}^T{\bf V}_{:,j'}}}
\,,
\end{eqnarray}
for $j=1,2,\ldots,C$.

Conversely, the predictive input of the model is
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & 
\frac{e^{g({\bf x},{\bf y})}}
{\sum_{{\bf x'}\in{\cal X}}e^{g({\bf x'},{\bf y})}}
=
\frac{
 e^{{\bf a}^T{\bf x}+{\bf x}^T{\bf V}{\bf y}}
}
{
 \sum_{{\bf x'}\in{\cal X}}e^{{\bf a}^T{\bf x'}+{\bf x'}^T{\bf V}{\bf y}}
}
\,.
\end{eqnarray}
For binary inputs, we thus have
\begin{eqnarray}
\bar{x}_i({\bf y}) & \doteq & p(x_i=1\mid{\bf y})
=\sigma(a_i+{\bf V}_{i,:}{\bf y})
\,,
\end{eqnarray}
for $i=1,2,\ldots,F$.

In order to score the gradient updates, first observe that
\begin{eqnarray}
p({\bf x},{\bf y}) & = & p({\bf y}\mid{\bf x})\,p({\bf x})\,,
\end{eqnarray}
where
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & \prod_{j=1}^{C}\tilde{y}_j({\bf x})^{y_j}\,,
\end{eqnarray}
for one-hot outputs.
Next, note that
\begin{eqnarray}
p({\bf x}) & = & \sum_{{\bf y}\in{\cal Y}}p({\bf x},{\bf y})
= \sum_{{\bf y}\in{\cal Y}}p({\bf x}\mid{\bf y})\,p({\bf y})
\\& = &
\mathbb{E}_{\bf y}[p({\bf x}\mid{\bf y})]
\approx \mathbb{E}_{{\bf y}\mid{\bf x}}[p({\bf x}\mid{\bf y})]
\\& \approx &
p\left({\bf x}\mid\tilde{\bf y}({\bf x})\right)
\,,
\end{eqnarray}
where
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & 
\prod_{i=1}^{F}\bar{x}_i({\bf y})^{\,x_i}\,
\left[1-\bar{x}_i({\bf y})\right]^{\,1-x_i}
\,,
\end{eqnarray}
for binary inputs.
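
Putting these pieces together, the score $\ln p({\bf x},{\bf y})\approx\ln p({\bf y}\mid{\bf x})+\ln p({\bf x}\mid\tilde{\bf y}({\bf x}))$ can be sketched as follows. This is an illustration under the stated assumptions of binary inputs and one-hot outputs. The name `logistic_rbm_score` and the parameter values are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax(v):
    m = max(v)                            # subtract the max for stability
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def logistic_rbm_score(x, y, a, c, V):
    """Approximate ln p(x, y) = ln p(y | x) + ln p(x | y_tilde(x))."""
    F, C = len(x), len(y)
    # y_tilde(x): softmax of c_j + x^T V[:, j].
    y0 = softmax([c[j] + sum(x[i] * V[i][j] for i in range(F))
                  for j in range(C)])
    # ln p(y | x) for one-hot y picks out the true class.
    log_py = sum(y[j] * math.log(y0[j]) for j in range(C))
    # x_bar(y_tilde): sigmoid of a_i + V[i, :] y_tilde.
    x1 = [sigmoid(a[i] + sum(V[i][j] * y0[j] for j in range(C)))
          for i in range(F)]
    # Bernoulli log-likelihood of the observed x under x_bar.
    log_px = sum(math.log(x1[i]) if x[i] == 1 else math.log(1.0 - x1[i])
                 for i in range(F))
    return log_py + log_px

# Assumed illustrative values: F = 3, C = 2.
x, y = [1, 0, 1], [0, 1]
a = [0.0, 0.2, -0.1]
c = [0.2, -0.2]
V = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]   # F x C
score = logistic_rbm_score(x, y, a, c, V)

# With all parameters zero, every probability is uniform.
zero_score = logistic_rbm_score(x, y, [0.0] * 3, [0.0] * 2,
                                [[0.0] * 2 for _ in range(3)])
```

The direct use of `math.log` is adequate for moderate parameter values. For very large weights a log-domain formulation would be safer.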