# Modelling Restricted Boltzmann Machines

## Boltzmann Machines

A Boltzmann Machine has an input layer with vector ${\bf x}=(x_1,x_2,\ldots,x_N)$ and an output layer with vector ${\bf y}=(y_1,y_2,\ldots,y_M)$. The relationships between input elements $x_i$ and output elements $y_j$
take the form of an undirected graph (see below).

<img src="Boltzmann.png" width="40%" title="Boltzmann Machine graph">

Taking inspiration from Markov random fields and the Boltzmann distribution, we specify an arbitrary energy function
\begin{eqnarray}
E({\bf x},{\bf y}) & = & -\left[
  f({\bf x})+g({\bf y})+h({\bf x},{\bf y})
\right]\,,
\end{eqnarray}
such that the joint probability (or density) of ${\bf x}$ and ${\bf y}$ is defined as
\begin{eqnarray}
p({\bf x},{\bf y}) & = & \frac{e^{-E({\bf x},{\bf y})}}{Z_{X,Y}}
= \frac{e^{f({\bf x})+g({\bf y})+h({\bf x},{\bf y})}}{Z_{X,Y}}
\,,
\end{eqnarray}
where $Z_{X,Y}$ is the appropriate partition function obtained by summing (or integrating) the numerator over all
possible values of ${\bf x}$ and ${\bf y}$.

It follows that the conditional distributions are given by
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & \frac{e^{f({\bf x})+h({\bf x},{\bf y})}}{Z_{X}({\bf y})}\,,
\end{eqnarray}
and
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & \frac{e^{g({\bf y})+h({\bf x},{\bf y})}}{Z_{Y}({\bf x})}\,,
\end{eqnarray}
with respective partition functions.

## Restricted Boltzmann Machines

A *Restricted* Boltzmann Machine (RBM) further restricts the relationships between ${\bf x}$ and ${\bf y}$ to take the form of an undirected *bipartite* graph, such that the elements of ${\bf x}=(x_1,x_2,\ldots,x_N)$ form disconnected nodes in one partition, the elements of ${\bf y}=(y_1,y_2,\ldots,y_M)$ form disconnected nodes in the other partition, and edges exist only between nodes in different partitions, i.e. $x_i -\!\!\!- y_j$ (see below).

![Restricted Boltzmann Machine graph](RBM.png "Restricted Boltzmann Machine")

Consequently, by design, the elements $x_i$ of ${\bf x}$ are conditionally independent given ${\bf y}$, viz.
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & \prod_{i=1}^{N}p(x_i\mid{\bf y})\,,
\end{eqnarray}
and the elements $y_j$ of ${\bf y}$ are likewise conditionally independent given ${\bf x}$, viz.
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & \prod_{j=1}^{M}p(y_j\mid{\bf x})\,.
\end{eqnarray}
Therefore, the functions $f(\cdot)$, $g(\cdot)$ and $h(\cdot,\cdot)$ must be additively separable in their
arguments, namely
\begin{eqnarray}
f({\bf x})=\sum_{i=1}^{N}f_{i}(x_i)\,,\; &
g({\bf y})=\sum_{j=1}^{M}g_{j}(y_j)\,, &
h({\bf x},{\bf y})=\sum_{i=1}^{N}\sum_{j=1}^{M}h_{ij}(x_i,y_j)\,,
\end{eqnarray}
leading to
\begin{eqnarray}
p(x_i\mid{\bf y}) = \frac{e^{f_i(x_i)+\sum_{j=1}^{M}h_{ij}(x_i,y_j)}}{Z_{X_i}({\bf y})}\,,
&\;&
p(y_j\mid{\bf x}) = \frac{e^{g_j(y_j)+\sum_{i=1}^{N}h_{ij}(x_i,y_j)}}{Z_{Y_j}({\bf x})}
\,,
\end{eqnarray}
and thus
\begin{eqnarray}
p({\bf x},{\bf y}) &=& 
\frac{e^{\sum_{i=1}^{N}f_i(x_i)+\sum_{j=1}^{M}g_{j}(y_j)+\sum_{i=1}^{N}\sum_{j=1}^{M}h_{ij}(x_i,y_j)}}{Z_{X,Y}}\,.
\end{eqnarray}
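
The factorisation of the conditional distribution can be checked numerically for a model small enough to enumerate. The sketch below assumes one concrete choice of the separable functions, namely binary $x_i$ and $y_j$ with $f_i(x_i)=a_i x_i$, $g_j(y_j)=b_j y_j$ and $h_{ij}(x_i,y_j)=x_i w_{ij} y_j$; the parameter values and function names are hypothetical and chosen only for illustration.

```python
import itertools
import math

# Hypothetical parameters for a tiny RBM: N = 3 inputs, M = 2 outputs.
a = [0.2, -0.4, 0.1]                        # f_i(x_i) = a[i] * x_i
b = [-0.3, 0.5]                             # g_j(y_j) = b[j] * y_j
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]  # h_ij(x_i, y_j) = x_i * W[i][j] * y_j
N, M = len(a), len(b)

def neg_energy(x, y):
    """-E(x, y) = f(x) + g(y) + h(x, y) for the separable form."""
    return (sum(a[i] * x[i] for i in range(N))
            + sum(b[j] * y[j] for j in range(M))
            + sum(x[i] * W[i][j] * y[j] for i in range(N) for j in range(M)))

def joint_conditional(x, y):
    """p(x | y), normalised by brute force over all 2^N inputs."""
    z = sum(math.exp(neg_energy(xp, y))
            for xp in itertools.product((0, 1), repeat=N))
    return math.exp(neg_energy(x, y)) / z

def element_conditional(i, value, y):
    """p(x_i = value | y), normalised over x_i in {0, 1} only."""
    def weight(v):
        return math.exp(a[i] * v + sum(v * W[i][j] * y[j] for j in range(M)))
    return weight(value) / (weight(0) + weight(1))

y = (1, 0)
max_gap = 0.0
for x in itertools.product((0, 1), repeat=N):
    factorised = math.prod(element_conditional(i, x[i], y) for i in range(N))
    max_gap = max(max_gap, abs(joint_conditional(x, y) - factorised))
print(max_gap)  # expected to be at the level of floating-point round-off
```

The terms in $g({\bf y})$ cancel in every conditional on ${\bf y}$, which is why `element_conditional` never touches `b`.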

## Hidden outputs

Traditionally, the RBM output is considered to be a hidden or latent layer, for which the values of ${\bf y}$ are never observed in practice, and hence must be summed (or integrated) out.

### Bernoulli outputs

For tractability, ${\bf y}$ is usually taken to be a vector of binary values, i.e. ${\bf y}\in\{0,1\}^M$.
Consequently, we obtain
\begin{eqnarray}
p(y_j=1\mid{\bf x}) & = & \frac{e^{g_j(1)+\sum_{i=1}^{N}h_{ij}(x_i,1)}}
{e^{g_j(0)+\sum_{i=1}^{N}h_{ij}(x_i,0)}+e^{g_j(1)+\sum_{i=1}^{N}h_{ij}(x_i,1)}}
\\
& = &
\frac{e^{g_j(1)-g_j(0)+\sum_{i=1}^{N}[h_{ij}(x_i,1)-h_{ij}(x_i,0)]}}
{1+e^{g_j(1)-g_j(0)+\sum_{i=1}^{N}[h_{ij}(x_i,1)-h_{ij}(x_i,0)]}}
\\
& = &
\frac{1}
{1+e^{-\left[g_j(1)-g_j(0)+\sum_{i=1}^{N}[h_{ij}(x_i,1)-h_{ij}(x_i,0)]\right]}}
\\
& = & \sigma\left(b_j+\sum_{i=1}^{N}w_{ij}(x_i)\right)\,,
\end{eqnarray}
where $b_j\doteq g_j(1)-g_j(0)$, $w_{ij}(x_i)\doteq h_{ij}(x_i,1)-h_{ij}(x_i,0)$ and $\sigma(\cdot)$ is the logistic function.

For convenience, note that we may invert these relations and define $g_j(y_j)\doteq b_j y_j$ and
$h_{ij}(x_i,y_j)\doteq w_{ij}(x_i)\,y_j$, without loss of generality.
The converse conditional distribution then becomes
\begin{eqnarray}
p(x_i\mid{\bf y}) & = & \frac{e^{f_i(x_i)+\sum_{j=1}^{M}w_{ij}(x_i)\,y_j}}{Z_{X_i}({\bf y})}\,,
\end{eqnarray}
from above.

The marginal distribution of ${\bf x}$ can also be derived.
Recall that
\begin{eqnarray}
p({\bf x},{\bf y}) & \propto & 
e^{\sum_{i=1}^{N}f_i(x_i)+\sum_{j=1}^{M}g_{j}(y_j)+\sum_{i=1}^{N}\sum_{j=1}^{M}h_{ij}(x_i,y_j)}\,,
\end{eqnarray}
so that
\begin{eqnarray}
p({\bf x}) & \propto & 
\sum_{y_1\in\{0,1\}}\cdots\sum_{y_M\in\{0,1\}}
e^{\sum_{i=1}^{N}f_i(x_i)+\sum_{j=1}^{M}b_{j}y_j+\sum_{i=1}^{N}\sum_{j=1}^{M}w_{ij}(x_i)\,y_j}
%\\
%& = &
%e^{\sum_{i=1}^{N}f_i(x_i)}
%\sum_{y_1\in\{0,1\}}\cdots\sum_{y_{M-1}\in\{0,1\}}
%e^{\sum_{j=1}^{M-1}b_{j}y_j+\sum_{i=1}^{N}\sum_{j=1}^{M-1}w_{ij}(x_i)\,y_j}
%\sum_{y_M\in\{0,1\}}
%e^{b_M y_M+\sum_{i=1}^{N} w_{ij}(x_i)\,y_M}
\\
& = &
e^{\sum_{i=1}^{N}f_i(x_i)}
\sum_{y_1\in\{0,1\}}e^{b_1 y_1+\sum_{i=1}^{N} w_{i1}(x_i)\,y_1}
\cdots\sum_{y_{M}\in\{0,1\}}
e^{b_M y_M+\sum_{i=1}^{N} w_{iM}(x_i)\,y_M}\,.
\end{eqnarray}
Therefore, we obtain
\begin{eqnarray}
p({\bf x}) & = & 
\frac{e^{\sum_{i=1}^{N}f_i(x_i)}\prod_{j=1}^{M}\left(1+e^{b_j+\sum_{i=1}^{N} w_{ij}(x_i)}\right)}
{Z_{X}}\,.
\end{eqnarray}

## Observed inputs

The input vector ${\bf x}$ forms the observed part of the RBM, and hence requires specialised handling to match the assumed input distribution.

### Bernoulli inputs

In some analyses, the input ${\bf x}$ is a vector of binary values, i.e. ${\bf x}\in\{0,1\}^{N}$. One example is from the field of natural language processing, where each vocabulary word (or token) is either in or not in a given document. Another example is
from the field of image processing, where a black-and-white image has pixels that are either on or off.
Thus, we obtain
\begin{eqnarray}
p(x_i=1\mid{\bf y}) & = & \frac{e^{f_i(1)+\sum_{j=1}^{M}h_{ij}(1,y_j)}}
{e^{f_i(0)+\sum_{j=1}^{M}h_{ij}(0,y_j)}+e^{f_i(1)+\sum_{j=1}^{M}h_{ij}(1,y_j)}}
\\
& = &
\frac{1}
{1+e^{-\left[f_i(1)-f_i(0)+\sum_{j=1}^{M}[h_{ij}(1,y_j)-h_{ij}(0,y_j)]\right]}}
\\
& = & \sigma\left(a_i+\sum_{j=1}^{M}w_{ij}'(y_j)\right)\,,
\end{eqnarray}
where $a_i\doteq f_i(1)-f_i(0)$ and $w_{ij}'(y_j)\doteq h_{ij}(1,y_j)-h_{ij}(0,y_j)$.
Hence, we may define
$f_i(x_i)\doteq a_i x_i$ and $h_{ij}(x_i,y_j)\doteq x_i w_{ij}'(y_j)$, without loss of generality. 

Upon also assuming Bernoulli outputs, we further find that $h_{ij}(x_i,y_j)\doteq x_i w_{ij} y_j$, which
gives rise to the standard, bilinear model
\begin{eqnarray}
p({\bf x},{\bf y}) & = & 
\frac{e^{{\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf y}+{\bf x}^{T}W{\bf y}}}
{Z_{X,Y}}\,,
\end{eqnarray}
for coefficient vectors ${\bf a}=(a_1,\ldots,a_N)$ and ${\bf b}=(b_1,\ldots,b_M)$, and coefficient matrix
$W=[w_{ij}]$, where we now interpret all vectors column-wise. 

The conditional distributions from above therefore take the forms
\begin{eqnarray}
p(x_i=1\mid{\bf y}) & = & \sigma\left(\left[{\bf a}+W{\bf y}\right]_i\right)\,,
\\
p(y_j=1\mid{\bf x}) & = & \sigma\left(\left[{\bf b}+W^T{\bf x}\right]_j\right)\,,
\end{eqnarray}
and the marginal distribution becomes
\begin{eqnarray}
p({\bf x}) =
\frac{e^{{\bf a}^{T}{\bf x}}\prod_{j=1}^{M}\left(1+e^{\left[{\bf b}+W^T{\bf x}\right]_j}\right)}
{Z_{X}}\,.
\end{eqnarray}
Note, however, that even for the simplifications inherent in the Bernoulli RBM, the partition functions $Z_{X}$
and $Z_{X,Y}$ remain intractable.
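
For a toy model the closed forms above can be compared against explicit enumeration. The following sketch uses hypothetical parameter values with $N=3$ and $M=2$, so that the partition functions can be summed by brute force; this is only feasible because the model is tiny.

```python
import itertools
import math

# Hypothetical Bernoulli RBM parameters (N = 3, M = 2).
a = [0.2, -0.4, 0.1]
b = [-0.3, 0.5]
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]
N, M = len(a), len(b)
states_x = list(itertools.product((0, 1), repeat=N))
states_y = list(itertools.product((0, 1), repeat=M))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_energy(x, y):
    """a^T x + b^T y + x^T W y."""
    return (sum(a[i] * x[i] for i in range(N))
            + sum(b[j] * y[j] for j in range(M))
            + sum(x[i] * W[i][j] * y[j] for i in range(N) for j in range(M)))

def unnormalised_marginal(x):
    """e^{a^T x} * prod_j (1 + e^{[b + W^T x]_j})."""
    value = math.exp(sum(a[i] * x[i] for i in range(N)))
    for j in range(M):
        value *= 1.0 + math.exp(b[j] + sum(W[i][j] * x[i] for i in range(N)))
    return value

# Closed-form marginal versus the explicit sum over all hidden states.
marginal_gap = max(
    abs(unnormalised_marginal(x) - sum(math.exp(neg_energy(x, y)) for y in states_y))
    for x in states_x
)

# Closed-form p(y_j = 1 | x) versus brute-force conditioning.
x = (1, 0, 1)
norm = sum(math.exp(neg_energy(x, y)) for y in states_y)
cond_gap = 0.0
for j in range(M):
    brute = sum(math.exp(neg_energy(x, y)) for y in states_y if y[j] == 1) / norm
    closed = sigmoid(b[j] + sum(W[i][j] * x[i] for i in range(N)))
    cond_gap = max(cond_gap, abs(brute - closed))

# Both partition functions, by enumeration.
Z_XY = sum(math.exp(neg_energy(xs, ys)) for xs in states_x for ys in states_y)
Z_X = sum(unnormalised_marginal(xs) for xs in states_x)
print(marginal_gap, cond_gap, Z_XY, Z_X)
```

With this normalisation $Z_X$ and $Z_{X,Y}$ coincide, since the marginal is just the joint summed over ${\bf y}$.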

### Free energy

The *free energy* of ${\bf x}$, denoted here as $F({\bf x})$, is obtained by
reusing the energy-based formulation of the distribution, namely:
\begin{eqnarray}
p({\bf x}) & = & \frac{e^{-F({\bf x})}}{Z_X}
\\
\Rightarrow F({\bf x}) & = & -\ln p({\bf x}) - \ln Z_{X}
\\
& = & -{\bf a}^{T}{\bf x}-\sum_{j=1}^{M}\ln\left(
1+e^{[{\bf b}+W^T{\bf x}]_{j}}
\right)\,.
\end{eqnarray}
The mean free energy of a dataset $X$ is then defined as
\begin{eqnarray}
\bar{F}(X) & \doteq & \frac{1}{D}\sum_{d=1}^{D}F({\bf x}_d)\,.
\end{eqnarray}
Note that since $Z_X$ is unknown in practice, we cannot compute the mean log-likelihood of $X$, namely $-\bar{F}(X)-\ln Z_X$, and so we cannot score an individual dataset. However, for fixed RBM parameters,
the $\ln Z_X$ terms cancel, so the difference between the scores of two datasets is equal to the negative of the difference between
their respective mean free energies. Hence, we could, for example, monitor the difference in scores between the training set and a validation set. When this difference starts to grow persistently larger, it is a sign that the RBM might be overfitting the training data.
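
As a rough illustration of this monitoring idea, the sketch below computes the free energy of a Bernoulli RBM and the gap in mean free energy between two toy datasets. The parameters, the data and the function names are hypothetical; `softplus` is simply a numerically stable way of evaluating $\ln(1+e^{z})$.

```python
import math

def softplus(z):
    """Numerically stable ln(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def free_energy(x, a, b, W):
    """F(x) = -a^T x - sum_j ln(1 + e^{[b + W^T x]_j})."""
    linear = sum(a[i] * x[i] for i in range(len(a)))
    hidden = sum(
        softplus(b[j] + sum(W[i][j] * x[i] for i in range(len(a))))
        for j in range(len(b))
    )
    return -linear - hidden

def mean_free_energy(data, a, b, W):
    return sum(free_energy(x, a, b, W) for x in data) / len(data)

# Hypothetical parameters and toy datasets.
a = [0.2, -0.4, 0.1]
b = [-0.3, 0.5]
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]
train = [(1, 0, 1), (1, 1, 1), (0, 0, 1)]
valid = [(0, 1, 0), (1, 1, 0)]

# Sanity check: -F(x) should equal ln sum_y exp(a^T x + b^T y + x^T W y).
x = (1, 0, 1)
brute = 0.0
for y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    brute += math.exp(
        sum(a[i] * x[i] for i in range(3))
        + sum(b[j] * y[j] for j in range(2))
        + sum(x[i] * W[i][j] * y[j] for i in range(3) for j in range(2))
    )
check_gap = abs(free_energy(x, a, b, W) + math.log(brute))

# A persistently growing gap during training would hint at overfitting.
gap = mean_free_energy(valid, a, b, W) - mean_free_energy(train, a, b, W)
print(check_gap, gap)
```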

### Gaussian inputs

In other situations, it is more realistic that the ${\bf x}$ values are unrestricted, i.e.
${\bf x}\in\mathbb{R}^{N}$. Typically, we take ${\bf x}$ to be Gaussian distributed.
Recalling that the $x_i$ values (treated as variables) must be conditionally independent in an RBM,
we conclude that
\begin{eqnarray}
p(x_i\mid{\bf y}) & = & [2\pi\sigma_i^2({\bf y})]^{-\frac{1}{2}}
e^{-\left.[x_i-\mu_i({\bf y})]^2\right/2\sigma_i^2({\bf y})}
\\
& = & [2\pi\sigma_i^2({\bf y})]^{-\frac{1}{2}}
e^{-\left.\left[x_i^2-2x_i\mu_i({\bf y})+\mu_i^2({\bf y})\right]\right/2\sigma_i^2({\bf y})}
\\
& = & e^{\alpha_i({\bf y})+\beta_i({\bf y})\,x_i+\gamma_i({\bf y})\,x_i^2}\,,
\end{eqnarray}
for appropriately defined coefficient functions.

Hence, given Bernoulli outputs, we might modify the standard model to include independent squared terms, namely
\begin{eqnarray}
p({\bf x},{\bf y}) & = & 
\frac{e^{{\bf a}^{T}{\bf x}+{\bf x}^{T}\Gamma({\bf y}){\bf x}+{\bf b}^{T}{\bf y}+{\bf x}^{T}W{\bf y}}}
{Z_{X,Y}}\,,
\end{eqnarray}
where $\Gamma({\bf y})\doteq {\tt diag}({\bf c}+D{\bf y})$.
Letting ${\bf x}^2\doteq (x_1^2,\ldots,x_N^2)$ for convenience, this may be rewritten as
\begin{eqnarray}
p({\bf x},{\bf y}) & = & 
\frac{e^{
  ({\bf a}\oplus{\bf c})^{T}({\bf x}\oplus{\bf x}^2)+{\bf b}^{T}{\bf y}
      +({\bf x}\oplus{\bf x}^2)^{T}(W\oplus D){\bf y}
}}{Z_{X,Y}}\,,
\end{eqnarray}
where the operator $\oplus$ denotes column-wise concatenation.
Hence, with some care required when resampling $p({\bf x}\mid{\bf y})$, we may notionally augment our feature vector ${\bf x}$ with its squared elements and thus reuse the standard bilinear model.
In other words, defining $\tilde{\bf x}\doteq{\bf x}\oplus{\bf x}^2$, 
$\tilde{\bf a}\doteq{\bf a}\oplus{\bf c}$ and $\tilde{W}\doteq W\oplus D$, we have
\begin{eqnarray}
p({\bf x},{\bf y}) & = & 
\frac{e^{
  \tilde{\bf a}^{T}\tilde{\bf x}+{\bf b}^{T}{\bf y}
      +\tilde{\bf x}^{T}\tilde{W}{\bf y}
}}{Z_{X,Y}}\,.
\end{eqnarray}
In particular, we immediately obtain the conditional distribution
\begin{eqnarray}
p(y_j=1\mid{\bf x}) & = & \sigma\left(\left[{\bf b}+\tilde{W}^T\tilde{\bf x}\right]_j\right)\,,
\end{eqnarray}
and the marginal distribution
\begin{eqnarray}
p({\bf x}) & = &
\frac{e^{\tilde{\bf a}^{T}\tilde{\bf x}}\prod_{j=1}^{M}
\left(1+e^{\left[{\bf b}+\tilde{W}^T\tilde{\bf x}\right]_j}\right)}
{Z_{X}}\,.
\end{eqnarray}



For the conditional Gaussian, we obtain
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & \propto & 
e^{
  \tilde{\bf a}^T \tilde{\bf x}
  + \tilde{\bf x}^T \tilde{W}{\bf y}
}
\\
& = & e^{
  ({\bf a}\oplus{\bf c})^T ({\bf x}\oplus{\bf x}^2)
  + ({\bf x}\oplus{\bf x}^2)^T (W\oplus D){\bf y}
}
\\
& = & \prod_{i=1}^{N}e^{
  [{\bf a}+W{\bf y}]_i x_i + [{\bf c}+D{\bf y}]_i x_i^2
 }\,.
\end{eqnarray}
Equating coefficients with the standard Gaussian form above then gives
the individual variances and means as
\begin{eqnarray}
\sigma_i^2({\bf y}) = -\frac{1}{2[{\bf c}+D{\bf y}]_i}\,, && 
\mu_i({\bf y}) = -\frac{[{\bf a}+W{\bf y}]_i}{2[{\bf c}+D{\bf y}]_i}
\,.
\end{eqnarray}

Note that the constraint $\sigma_i^2({\bf y})>0$ requires enforcement, such that 
${\bf c}+D{\bf y}<{\bf 0}\; \forall{\bf y}\in\{0,1\}^M$. Thus, it is necessary that
${\bf c}<{\bf 0}$ (take ${\bf y}={\bf 0}$), and, given this, sufficient that $D\le 0$ element-wise.
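
The mapping from the coefficients to the conditional means and variances can be sketched as follows. The function name and the example values are hypothetical; the check on the sign of $[{\bf c}+D{\bf y}]_i$ is one simple way of flagging a violated variance constraint.

```python
def gaussian_conditional(y, a, c, W, D):
    """Return the per-input means and variances of p(x_i | y)."""
    means, variances = [], []
    for i in range(len(a)):
        linear = a[i] + sum(W[i][j] * y[j] for j in range(len(y)))  # [a + W y]_i
        quad = c[i] + sum(D[i][j] * y[j] for j in range(len(y)))    # [c + D y]_i
        if quad >= 0.0:
            raise ValueError("need [c + D y]_i < 0 for a positive variance")
        variances.append(-1.0 / (2.0 * quad))
        means.append(-linear / (2.0 * quad))
    return means, variances

# Hypothetical example with N = 2 inputs and M = 1 hidden unit.
a = [1.0, -2.0]
c = [-0.5, -0.5]
W = [[1.0], [0.5]]
D = [[0.0], [-0.5]]
means, variances = gaussian_conditional([1], a, c, W, D)
print(means, variances)
```

With ${\bf c}=-\frac{1}{2}{\bf 1}$ and $D=0$ every variance is fixed at one and the means reduce to ${\bf a}+W{\bf y}$.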

## Gradient Approximations

Reconsider the Boltzmann Machine
\begin{eqnarray}
p({\bf x},{\bf y}) & = & \frac{e^{-E({\bf x},{\bf y})}}{Z_{X,Y}}
\,,
\end{eqnarray}
where the energy function $E({\bf x},{\bf y})$ implicitly has model parameters $\Theta$.
Assuming that the output ${\bf y}$ is always hidden, then we wish to estimate $\Theta$ by maximising
the marginal distribution $p({\bf x})$ over all cases of training data. Typically, this is
achieved by some gradient ascent procedure.

However, in general the partition function $Z_{X,Y}$ is intractable to compute, and thus the gradient is also intractable. The solution is to approximate the gradient. There are various approaches, including Gibbs sampling and mean field approximations.

For convenience, let us temporarily assume that the input ${\bf x}$ and output ${\bf y}$ are both discrete valued. However, the derivation below is also valid for continuous variables with summations replaced by integrations. 

Thus, let the joint distribution be
\begin{eqnarray}
p({\bf x},{\bf y}) & = & \frac{e^{-E({\bf x},{\bf y})}}
                      {\sum_{\bf x'}\sum_{\bf y'}e^{-E({\bf x'},{\bf y'})}}
\,,
\end{eqnarray}
such that the conditional distribution is
\begin{eqnarray}
p({\bf y}\mid{\bf x}) & = & \frac{e^{-E({\bf x},{\bf y})}}
                      {\sum_{\bf y'}e^{-E({\bf x},{\bf y'})}}
\,,
\end{eqnarray}
and the marginal distribution is
\begin{eqnarray}
p({\bf x}) & = & \frac{\sum_{\bf y}e^{-E({\bf x},{\bf y})}}
                      {\sum_{\bf x'}\sum_{\bf y'}e^{-E({\bf x'},{\bf y'})}}
\\
\Rightarrow
\ln p({\bf x}) & = & \ln\sum_{\bf y}e^{-E({\bf x},{\bf y})}
                      -\ln\sum_{\bf x'}\sum_{\bf y'}e^{-E({\bf x'},{\bf y'})}
\,.
\end{eqnarray}

Then, for each model parameter $\theta\in\Theta$, we have
\begin{eqnarray}
\frac{\partial}{\partial\theta}\ln p({\bf x}) & = &
-\frac{\sum_{\bf y}e^{-E({\bf x},{\bf y})}\frac{\partial E}{\partial\theta}({\bf x},{\bf y})}
{\sum_{\bf y}e^{-E({\bf x},{\bf y})}}
+\frac{\sum_{\bf x'}\sum_{\bf y'}e^{-E({\bf x'},{\bf y'})}\frac{\partial E}{\partial\theta}({\bf x'},{\bf y'})}
{\sum_{\bf x'}\sum_{\bf y'}e^{-E({\bf x'},{\bf y'})}}
\\
& = &
-\sum_{\bf y}p({\bf y}\mid{\bf x})\frac{\partial E}{\partial\theta}({\bf x},{\bf y})
+\sum_{\bf x'}\sum_{\bf y'}p({\bf x'},{\bf y'})\frac{\partial E}{\partial\theta}({\bf x'},{\bf y'})
\\
& = & -\mathbb{E}_{{\bf y}|{\bf x}}\left[\frac{\partial E}{\partial\theta}({\bf x},{\bf y})\right]
+\mathbb{E}_{{\bf x'},{\bf y'}}\left[\frac{\partial E}{\partial\theta}({\bf x'},{\bf y'})\right]
\\
& = & -\mathbb{E}_{{\bf y}|{\bf x}}\left[\frac{\partial E}{\partial\theta}({\bf x},{\bf y})\right]
+\mathbb{E}_{{\bf x'}}\left[
  \mathbb{E}_{{\bf y'}\mid{\bf x'}}\left[\frac{\partial E}{\partial\theta}({\bf x'},{\bf y'})\right]
\right]
\\
& = & +\mathbb{E}_{{\bf y}|{\bf x}}\left[\frac{\partial(-E)}{\partial\theta}({\bf x},{\bf y})\right]
-\mathbb{E}_{{\bf x'}}\left[
  \mathbb{E}_{{\bf y'}\mid{\bf x'}}\left[\frac{\partial(-E)}{\partial\theta}({\bf x'},{\bf y'})\right]
\right]
\,.
\end{eqnarray}

We assume that $p({\bf y}\mid{\bf x})$ is tractable to compute, and thus the conditional expectations are also
tractable. 
However, we have noted above that $p({\bf x})$ and $p({\bf x},{\bf y})$ generally are not tractable, so we still have to resort to approximations.
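
For a model small enough to enumerate, the gradient identity above can be verified directly. The sketch below assumes a Bernoulli RBM with hypothetical parameters and compares the difference of expectations, for a single weight $w_{ij}$, against a central finite difference of $\ln p({\bf x})$.

```python
import itertools
import math

N, M = 3, 2
states_x = list(itertools.product((0, 1), repeat=N))
states_y = list(itertools.product((0, 1), repeat=M))

def neg_energy(x, y, a, b, W):
    return (sum(a[i] * x[i] for i in range(N))
            + sum(b[j] * y[j] for j in range(M))
            + sum(x[i] * W[i][j] * y[j] for i in range(N) for j in range(M)))

def log_marginal(x, a, b, W):
    """ln p(x) by brute-force enumeration (tiny models only)."""
    num = sum(math.exp(neg_energy(x, y, a, b, W)) for y in states_y)
    z = sum(math.exp(neg_energy(xp, yp, a, b, W))
            for xp in states_x for yp in states_y)
    return math.log(num) - math.log(z)

def exact_grad_w(x, a, b, W, i, j):
    """E_{y|x}[x_i y_j] - E_{x',y'}[x'_i y'_j], using d(-E)/dw_ij = x_i y_j."""
    weights = [math.exp(neg_energy(x, y, a, b, W)) for y in states_y]
    positive = sum(w * x[i] * y[j] for w, y in zip(weights, states_y)) / sum(weights)
    joint = {(xp, yp): math.exp(neg_energy(xp, yp, a, b, W))
             for xp in states_x for yp in states_y}
    z = sum(joint.values())
    negative = sum(p * xp[i] * yp[j] for (xp, yp), p in joint.items()) / z
    return positive - negative

# Hypothetical parameters and a single training case.
a = [0.2, -0.4, 0.1]
b = [-0.3, 0.5]
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]
x = (1, 0, 1)
i, j, eps = 0, 1, 1e-6

W_plus = [row[:] for row in W]
W_plus[i][j] += eps
W_minus = [row[:] for row in W]
W_minus[i][j] -= eps
numeric = (log_marginal(x, a, b, W_plus) - log_marginal(x, a, b, W_minus)) / (2 * eps)
analytic = exact_grad_w(x, a, b, W, i, j)
print(numeric, analytic)
```

The negative term is the part that requires a sum over every joint state, which is what becomes infeasible for realistic $N$ and $M$.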

### Gibbs sampling

We note that, under suitable conditions, expectations obey
\begin{eqnarray}
\mathbb{E}_{X}\left[f(X)\right] & = & \lim_{K\rightarrow\infty}
\frac{1}{K}\sum_{k=1}^{K}f({\bf x}_k)\,,
\end{eqnarray}
where ${\bf x}_k\sim p(X)$. In particular, for the single sample ${\bf x}'$, $f({\bf x}')$ is an unbiased estimator of $\mathbb{E}_{X}\left[f(X)\right]$. Thus, the above gradient could be approximated by the stochastic gradient
\begin{eqnarray}
\frac{\partial}{\partial\theta}\ln p({\bf x}) & \approx &
\mathbb{E}_{{\bf y}|{\bf x}}\left[\frac{\partial(-E)}{\partial\theta}({\bf x},{\bf y})\right]
-\mathbb{E}_{{\bf y'}\mid{\bf x'}}\left[\frac{\partial(-E)}{\partial\theta}({\bf x'},{\bf y'})\right]
\,.
\end{eqnarray}

How are we supposed to sample ${\bf x'}$ if computing $p({\bf x'})$ is intractable?
This is where the Gibbs sampling comes in. 
Since we are assuming that the conditional distributions are tractable, then we approximate unconditional distributions by conditional ones. Thus, we let
\begin{eqnarray}
p({\bf x'}) = \sum_{\bf y}p({\bf x'}\mid{\bf y})\,p({\bf y})
& \Rightarrow & \mathbb{E}_{\bf x'}[\cdot] = 
\mathbb{E}_{\bf y}\left[\mathbb{E}_{{\bf x'}\mid{\bf y}}[\cdot]\right]
\,,
\end{eqnarray}
using the other conditional distribution
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & 
\frac{e^{-E({\bf x},{\bf y})}}
     {\sum_{\bf x'}e^{-E({\bf x'},{\bf y})}}
\,,
\end{eqnarray}
which is also assumed to be tractable to compute.

Now we are faced with the fact that $p({\bf y})$ is also intractable. Hence, we
repeat the above step, letting
\begin{eqnarray}
p({\bf y}) = \sum_{\bf x}p({\bf y}\mid{\bf x})\,p({\bf x})
& \Rightarrow & \mathbb{E}_{\bf y}[\cdot] = 
\mathbb{E}_{\bf x}\left[\mathbb{E}_{{\bf y}\mid{\bf x}}[\cdot]\right]
\,.
\end{eqnarray}

We could repeat this cycle any number of times, corresponding to multiple steps of 
sequential Gibbs sampling. However, we are already given ${\bf x}$, so we halt with the approximation
that
\begin{eqnarray}
\mathbb{E}_{\bf y}[\cdot] & \approx & \mathbb{E}_{{\bf y}|{\bf x}}[\cdot]
\,,
\end{eqnarray}
on the basis that $f({\bf x})$ is an unbiased estimate of 
$\mathbb{E}_{\bf x}[f({\bf x})]$, using the same argument as above.

This results in the final approximation
\begin{eqnarray}
\frac{\partial}{\partial\theta}\ln p({\bf x}) & \approx &
\mathbb{E}_{{\bf y}|{\bf x}}\left[\frac{\partial(-E)}{\partial\theta}({\bf x},{\bf y})\right]
-\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
 \mathbb{E}_{{\bf x'}\mid{\bf y}}\left[
  \mathbb{E}_{{\bf y'}\mid{\bf x'}}\left[
   \frac{\partial(-E)}{\partial\theta}\left({\bf x'},{\bf y'}\right)
  \right]
 \right]
\right]
\,.
\end{eqnarray}

In practice, Gibbs sampling of ${\bf y}$ and ${\bf x'}$ replaces the outer two expectations of the negative term. Hence, the Gibbs sampling algorithm is:
1. For visible input ${\bf x}$, compute the distribution 
$p({\bf y}\mid{\bf x})$ of the hidden output, and compute
the expectation term $\mathbb{E}_{{\bf y}\mid{\bf x}}[\cdot]$ in the gradient.
2. Sample the hidden output ${\bf y}$ from the distribution 
$p({\bf y}\mid{\bf x})$. 
The pair ${\bf x},{\bf y}$ form the so-called *positive* case.
3. Using ${\bf y}$, compute the distribution $p({\bf x'}\mid{\bf y})$, and sample ${\bf x'}$.
4. Using ${\bf x'}$, compute the distribution
$p({\bf y'}\mid{\bf x'})$, and compute the expectation term $\mathbb{E}_{{\bf y'}\mid{\bf x'}}[\cdot]$.
The pair ${\bf x'},{\bf y'}$ form the *negative* case.
5. Compute the approximate stochastic gradient as the difference of expectations.

### Mean field approximation

Note that in the special case where $E({\bf x},{\bf y})$ is linear in ${\bf y}$
for parameter $\theta$, e.g.
\begin{eqnarray}
E({\bf x},{\bf y}) & = & {\bf w}({\bf x},\theta)^{T}{\bf y}+\cdots\,,
\end{eqnarray}
then we have
\begin{eqnarray}
\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
  \frac{\partial E}{\partial\theta}({\bf x},{\bf y})
\right]
& = & \frac{\partial E}{\partial\theta}\left(
  {\bf x},\mathbb{E}_{{\bf y}\mid{\bf x}}\left[{\bf y}\right]
\right)\,,
\end{eqnarray}
exactly. If, however, there are nonlinearities in ${\bf y}$, then the above no longer holds
exactly, but it does still hold approximately, to first order. This is the *mean field* approximation.

To see how the mean field approximation works, we define
\begin{eqnarray}
\bar{\bf y}_{\bf x} & = & \bar{\bf y}({\bf x}) \doteq 
\mathbb{E}_{{\bf y}\mid{\bf x}}[{\bf y}]\,,
\end{eqnarray}
and consider the 
first-order Taylor approximation
\begin{eqnarray}
E({\bf x},{\bf y}) & \approx & E({\bf x},\bar{\bf y}_{\bf x})
+ ({\bf y}-\bar{\bf y}_{\bf x})^{T}\frac{\partial E}{\partial{\bf y}}
({\bf x},\bar{\bf y}_{\bf x})
\\
\Rightarrow \mathbb{E}_{{\bf y}|{\bf x}}\left[E({\bf x},{\bf y})\right]
& \approx & E({\bf x},\bar{\bf y}_{\bf x})
+ \left(\mathbb{E}_{{\bf y}|{\bf x}}\left[{\bf y}\right]
-\bar{\bf y}_{\bf x}\right)^{T}
\frac{\partial E}{\partial{\bf y}}({\bf x},\bar{\bf y}_{\bf x})
\\
%\Rightarrow \mathbb{E}_{{\bf y}|{\bf x}}\left[E({\bf x},{\bf y})\right]
%& \approx &
& = &
E\left({\bf x},\mathbb{E}_{{\bf y}\mid{\bf x}}[{\bf y}]\right)
\,.
\end{eqnarray}
Taking derivatives, we see that
\begin{eqnarray}
\mathbb{E}_{{\bf y}|{\bf x}}\left[
  \frac{\partial E}{\partial\theta}({\bf x},{\bf y})
\right]
& \approx & \frac{\partial E}{\partial\theta}\left(
  {\bf x},\mathbb{E}_{{\bf y}|{\bf x}}\left[{\bf y}\right]
\right)
\end{eqnarray}
also holds true.

If we now apply the mean field approximation to the Gibbs sampling approximation
of the gradient (above), then we obtain
\begin{eqnarray}
\frac{\partial}{\partial\theta}\ln p({\bf x}) & \approx &
\mathbb{E}_{{\bf y}|{\bf x}}\left[
  \frac{\partial(-E)}{\partial\theta}({\bf x},{\bf y})
\right]
-\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
 \mathbb{E}_{{\bf x'}\mid{\bf y}}\left[
  \mathbb{E}_{{\bf y'}\mid{\bf x'}}\left[
   \frac{\partial(-E)}{\partial\theta}\left({\bf x'},{\bf y'}\right)
  \right]
 \right]
\right]
\\
& \approx &
\frac{\partial(-E)}{\partial\theta}\left({\bf x},\bar{\bf y}_{\bf x}\right)
-\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
 \mathbb{E}_{{\bf x'}\mid{\bf y}}\left[
   \frac{\partial(-E)}{\partial\theta}\left({\bf x'},\bar{\bf y}_{\bf x'}\right)
 \right]
\right]
\\
& \approx &
\frac{\partial(-E)}{\partial\theta}\left({\bf x},\bar{\bf y}_{\bf x}\right)
-\mathbb{E}_{{\bf y}\mid{\bf x}}\left[
   \frac{\partial(-E)}{\partial\theta}
   \left(\bar{\bf x}_{\bf y},\bar{\bf y}(\bar{\bf x}_{\bf y})\right)
\right]
\\
& \approx &
\frac{\partial(-E)}{\partial\theta}\left({\bf x},\bar{\bf y}_{\bf x}\right)
-\frac{\partial(-E)}{\partial\theta}
   \left(
    \bar{\bf x}\left(\bar{\bf y}_{\bf x}\right),
    \bar{\bf y}\left(\bar{\bf x}\left(\bar{\bf y}_{\bf x}\right)\right)
   \right)
\,.
\end{eqnarray}

### Bernoulli RBM gradient

For the Bernoulli RBM, we have (from above) that
\begin{eqnarray}
-E({\bf x},{\bf y}) & = & {\bf a}^{T}{\bf x}+{\bf b}^{T}{\bf y}+{\bf x}^{T}W{\bf y}\,,
\end{eqnarray}
and thus
\begin{eqnarray}
\frac{\partial(-E)}{\partial{\bf a}} = {\bf x}\,,
&\;
\frac{\partial(-E)}{\partial{\bf b}} = {\bf y}\,,
&
\frac{\partial(-E)}{\partial W} = {\bf x}\,{\bf y}^{T}\,.
\end{eqnarray}
Furthermore, we find that
\begin{eqnarray}
\bar{\bf y}({\bf x}) & = & [p(y_j=1\mid{\bf x})]_{j=1}^{M}\,,
\\
\bar{\bf x}({\bf y}) & = & [p(x_i=1\mid{\bf y})]_{i=1}^{N}\,.
\end{eqnarray}

#### Gibbs sampling

Under the Gibbs sampling approximation described above, we first sample 
${\bf y}\sim\bar{\bf y}({\bf x})$ and let ${\bf y}$ stand in for the outer expectation
$\mathbb{E}_{{\bf y}\mid{\bf x}}[\cdot]$ of the negative term. Next, we sample
${\bf x'}\sim\bar{\bf x}({\bf y})$, and let ${\bf x'}$ stand in for
$\mathbb{E}_{{\bf x'}\mid{\bf y}}[\cdot]$.
This leaves only the positive term $\mathbb{E}_{{\bf y}\mid{\bf x}}[\cdot]$ and the innermost expectation
$\mathbb{E}_{{\bf y'}\mid{\bf x'}}[\cdot]$ to be evaluated, both of which are exact here because the derivatives of $-E$ are linear in ${\bf y}$.
Hence,
\begin{eqnarray}
\frac{\partial}{\partial{\bf a}}\ln p({\bf x}) & \approx & {\bf x} - {\bf x'}\,,
\\
\frac{\partial}{\partial{\bf b}}\ln p({\bf x}) & \approx & 
\bar{\bf y}_{\bf x}-\bar{\bf y}_{\bf x'}\,,
\\
\frac{\partial}{\partial W}\ln p({\bf x}) & \approx & 
{\bf x}\,\bar{\bf y}_{\bf x}^{T}-{\bf x'}\bar{\bf y}_{\bf x'}^{T}\,.
\end{eqnarray}
In practice, this approximation does not work well, with the parameters seemingly wandering about randomly.
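
One step of this single-sample Gibbs estimate can be sketched as below. The helper names, the parameter values and the use of a seeded `random.Random` generator are assumptions made for illustration, not a reference implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(x, b, W):
    """y_bar(x): element j is p(y_j = 1 | x)."""
    return [sigmoid(b[j] + sum(W[i][j] * x[i] for i in range(len(x))))
            for j in range(len(b))]

def visible_probs(y, a, W):
    """x_bar(y): element i is p(x_i = 1 | y)."""
    return [sigmoid(a[i] + sum(W[i][j] * y[j] for j in range(len(y))))
            for i in range(len(a))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_gradient(x, a, b, W, rng):
    """Single-sample Gibbs estimate of the gradients of ln p(x)."""
    y_bar = hidden_probs(x, b, W)                          # p(y | x)
    y = sample_bernoulli(y_bar, rng)                       # sampled hidden state
    x_neg = sample_bernoulli(visible_probs(y, a, W), rng)  # x' ~ p(x' | y)
    y_neg_bar = hidden_probs(x_neg, b, W)                  # p(y' | x')
    grad_a = [x[i] - x_neg[i] for i in range(len(a))]
    grad_b = [y_bar[j] - y_neg_bar[j] for j in range(len(b))]
    grad_W = [[x[i] * y_bar[j] - x_neg[i] * y_neg_bar[j] for j in range(len(b))]
              for i in range(len(a))]
    return grad_a, grad_b, grad_W

# Hypothetical parameters and one training case.
a = [0.2, -0.4, 0.1]
b = [-0.3, 0.5]
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]
grad_a, grad_b, grad_W = gibbs_gradient((1, 0, 1), a, b, W, random.Random(0))
print(grad_a, grad_b, grad_W)
```

Because ${\bf x'}$ is a binary sample, each element of `grad_a` can only be $-1$, $0$ or $1$ for a single case, which gives some intuition for how noisy the estimate is.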

#### Hinton modified gradient

[Hinton](https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf) offers a practical guide to training RBMs, although I found the commentary to be rather terse. Briefly, Hinton asserts that the positive (or data) term in the expectations above should couple the binary input with binary (sampled) output. However, the negative (or reconstruction) term can seemingly forego sampling altogether. My interpretation is that this leads to the modified gradient scheme:
\begin{eqnarray}
\frac{\partial}{\partial{\bf a}}\ln p({\bf x}) & \approx & 
{\bf x} - \bar{\bf x}_{\bf y}\,,
\\
\frac{\partial}{\partial{\bf b}}\ln p({\bf x}) & \approx & 
{\bf y}-\bar{\bf y}(\bar{\bf x}_{\bf y})\,,
\\
\frac{\partial}{\partial W}\ln p({\bf x}) & \approx & 
{\bf x}\,{\bf y}^{T}-\bar{\bf x}_{\bf y}\,\bar{\bf y}(\bar{\bf x}_{\bf y})^{T}\,,
\end{eqnarray}
where we sample ${\bf y}\sim\bar{\bf y}({\bf x})$ as before.

This scheme seems to work well, after a burn-in training period of random fluctuations.
It is interesting to note that in practice this Hinton-modified gradient appears to minimise the RMS error discussed in a later section. This would appear to be due to the fact that both the modified gradient above and the reconstruction probabilities (later)
utilise some aspects of mean field approximations (see the next section).

#### Mean field approximation

The mean field approximation, described in detail above, is straightforward to
apply. The resulting gradient approximation is
\begin{eqnarray}
\frac{\partial}{\partial{\bf a}}\ln p({\bf x}) & \approx & 
{\bf x} - \bar{\bf x}\left(\bar{\bf y}_{\bf x}\right)\,,
\\
\frac{\partial}{\partial{\bf b}}\ln p({\bf x}) & \approx & 
\bar{\bf y}_{\bf x}-\bar{\bf y}\left(
  \bar{\bf x}\left(\bar{\bf y}_{\bf x}\right)
\right)\,,
\\
\frac{\partial}{\partial W}\ln p({\bf x}) & \approx & 
{\bf x}\,\bar{\bf y}_{\bf x}^{T}-\bar{\bf x}\left(\bar{\bf y}_{\bf x}\right)
\,
\bar{\bf y}\left(
  \bar{\bf x}\left(\bar{\bf y}_{\bf x}\right)
\right)^{T}\,.
\end{eqnarray}

Clearly, this is closely related to the Hinton-modified gradient above, except that 
now no Gibbs sampling is required. 
In practice, this gradient approximation appears to work very well, and seemingly has better convergence than the Hinton-modified gradient (although YMMV). 
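
A minimal sketch of the mean field gradient for a single training case follows. It is fully deterministic, since the sampled states are replaced by their conditional means; the function names and the zero-valued test parameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_mean(x, b, W):
    """y_bar(x) = sigma(b + W^T x); x may be binary or real-valued in [0, 1]."""
    return [sigmoid(b[j] + sum(W[i][j] * x[i] for i in range(len(x))))
            for j in range(len(b))]

def visible_mean(y, a, W):
    """x_bar(y) = sigma(a + W y); y may be binary or real-valued in [0, 1]."""
    return [sigmoid(a[i] + sum(W[i][j] * y[j] for j in range(len(y))))
            for i in range(len(a))]

def mean_field_gradient(x, a, b, W):
    y_pos = hidden_mean(x, b, W)        # y_bar(x)
    x_neg = visible_mean(y_pos, a, W)   # x_bar(y_bar(x))
    y_neg = hidden_mean(x_neg, b, W)    # y_bar(x_bar(y_bar(x)))
    grad_a = [x[i] - x_neg[i] for i in range(len(a))]
    grad_b = [y_pos[j] - y_neg[j] for j in range(len(b))]
    grad_W = [[x[i] * y_pos[j] - x_neg[i] * y_neg[j] for j in range(len(b))]
              for i in range(len(a))]
    return grad_a, grad_b, grad_W

# With all parameters at zero, every conditional mean is exactly 0.5.
a0, b0 = [0.0, 0.0, 0.0], [0.0, 0.0]
W0 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
grad_a, grad_b, grad_W = mean_field_gradient((1, 0, 1), a0, b0, W0)
print(grad_a, grad_b, grad_W)
```

At this all-zero starting point the bias gradient for ${\bf b}$ vanishes, while the gradients for ${\bf a}$ and $W$ already push the reconstruction towards the data.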

### Gaussian RBM gradient

For the Gaussian RBM (i.e. Gaussian input with Bernoulli output), the negative energy function is
\begin{eqnarray}
-E({\bf x},{\bf y}) & = & {\bf a}^{T}{\bf x}+{\bf c}^{T}{\bf x}^2
+{\bf b}^{T}{\bf y}+{\bf x}^{T}W{\bf y}+({\bf x}^{2})^{T}D{\bf y}\,,
\end{eqnarray}
and thus we obtain the above derivatives in ${\bf a}$, ${\bf b}$ and $W$, as well as
\begin{eqnarray}
\frac{\partial(-E)}{\partial{\bf c}} = {\bf x}^2\,,
&&
\frac{\partial(-E)}{\partial D} = {\bf x}^{2}\,{\bf y}^{T}\,.
\end{eqnarray}
Note that to preserve the constraints $c_i<0$ and $d_{ij}<0$, we might choose
$c_i\doteq -e^{c'_i}$ and $d_{ij}\doteq -e^{d_{ij}'}$, such that
\begin{eqnarray}
\frac{\partial(-E)}{\partial c_{i}'} = 
\frac{\partial(-E)}{\partial c_{i}}\frac{\partial c_{i}}{\partial c_{i}'} = x_{i}^2\,c_{i}\,,
&&
\frac{\partial(-E)}{\partial d_{ij}'} =
\frac{\partial(-E)}{\partial d_{ij}}\frac{\partial d_{ij}}{\partial d_{ij}'} =
x_{i}^{2}\,y_{j}\,d_{ij}\,.
\end{eqnarray}
Alternatively, we might simply rectify the updated estimates of ${\bf c}$ and $D$ to thus obey the constraints.
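
The chain rule for the exponential reparameterisation can be checked with finite differences. The sketch below evaluates only the ${\bf c}$ and $D$ terms of $-E$, since the remaining terms do not depend on $c'_i$ or $d'_{ij}$; all names and values are hypothetical.

```python
import math

def quadratic_terms(x, y, c_prime, d_prime):
    """The c and D terms of -E, with c_i = -exp(c'_i) and d_ij = -exp(d'_ij)."""
    total = 0.0
    for i in range(len(x)):
        total += -math.exp(c_prime[i]) * x[i] ** 2
        for j in range(len(y)):
            total += -math.exp(d_prime[i][j]) * x[i] ** 2 * y[j]
    return total

# Hypothetical values: N = 2 real-valued inputs, M = 3 binary outputs.
x, y = [0.5, -1.5], [1, 0, 1]
c_prime = [0.1, -0.3]
d_prime = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
i, j, eps = 1, 2, 1e-6

# Analytic derivatives from the chain rule in the text.
c_i = -math.exp(c_prime[i])
d_ij = -math.exp(d_prime[i][j])
analytic_c = x[i] ** 2 * c_i
analytic_d = x[i] ** 2 * y[j] * d_ij

# Central finite differences in the unconstrained parameters.
c_hi, c_lo = c_prime[:], c_prime[:]
c_hi[i] += eps
c_lo[i] -= eps
numeric_c = (quadratic_terms(x, y, c_hi, d_prime)
             - quadratic_terms(x, y, c_lo, d_prime)) / (2 * eps)

d_hi = [row[:] for row in d_prime]
d_lo = [row[:] for row in d_prime]
d_hi[i][j] += eps
d_lo[i][j] -= eps
numeric_d = (quadratic_terms(x, y, c_prime, d_hi)
             - quadratic_terms(x, y, c_prime, d_lo)) / (2 * eps)
print(analytic_c, numeric_c, analytic_d, numeric_d)
```

Whatever real values $c'_i$ and $d'_{ij}$ take, the implied $c_i$ and $d_{ij}$ stay strictly negative, so no explicit projection step is needed under this parameterisation.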

## Non-standard Estimation

RBMs can be rather difficult to train, since the usual parameter update scheme described above does not really maximise any particular likelihood [(Hinton)](https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf). Additionally, since computing $p({\bf x})$ is intractable, we cannot properly score the updates to test for convergence.

In practice, we need to use some sort of approximation to $p({\bf x})$. One approach is offered by the mean field approximation (described in an earlier section). We note that
\begin{eqnarray}
p({\bf x}) & = & \sum_{\bf y}p({\bf x}\mid{\bf y})\,p({\bf y})
= \mathbb{E}_{\bf y}\left[p({\bf x}\mid{\bf y})\right]\,,
\end{eqnarray}
and hence, following previous reasoning, we have
\begin{eqnarray}
p({\bf x}) & \approx & 
\mathbb{E}_{{\bf y}\mid{\bf x}}\left[p({\bf x}\mid{\bf y})\right]
\approx
p\left({\bf x}\mid\bar{\bf y}_{\bf x}\right)\,.
\end{eqnarray}
Next, we recall that RBMs obey the conditional independence property
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & \prod_{i=1}^{N}p(x_i\mid{\bf y})\,.
\end{eqnarray}
Thus, for a Bernoulli RBM we have
\begin{eqnarray}
p({\bf x}\mid{\bf y}) & = & \prod_{i=1}^{N}
p\left(x_i=1\mid{\bf y}\right)^{\,x_i}
\,
\left[1 - p\left(x_i=1\mid{\bf y}\right)\right]^{\,1-x_i}
\\
& = &
\prod_{i=1}^{N}
\bar{x}_i\left({\bf y}\right)^{x_i}
\,\left[1-\bar{x}_i\left({\bf y}\right)\right]^{1-x_i}
\,,
\end{eqnarray}
where
\begin{eqnarray}
\bar{x}_i({\bf y}) & \doteq &
p(x_i=1\mid{\bf y}) =
\mathbb{E}_{{\bf x}\mid{\bf y}}[x_i] =
\left[\bar{\bf x}({\bf y})\right]_i
\,.
\end{eqnarray}
The mean field approximate probability, i.e. the so-called *reconstruction* probability, is therefore
\begin{eqnarray}
p({\bf x}) & \approx & 
\prod_{i=1}^{N}
\bar{x}_i\left(\bar{\bf y}_{\bf x}\right)^{x_i}
\,\left[1-\bar{x}_i\left(\bar{\bf y}_{\bf x}\right)\right]^{1-x_i}
\,.
\end{eqnarray}
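
In practice it is usually more convenient to work with the logarithm of the reconstruction probability. A minimal sketch for a Bernoulli RBM is given below, with hypothetical function names and parameter values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_mean(x, b, W):
    """y_bar(x) = sigma(b + W^T x)."""
    return [sigmoid(b[j] + sum(W[i][j] * x[i] for i in range(len(x))))
            for j in range(len(b))]

def visible_mean(y, a, W):
    """x_bar(y) = sigma(a + W y)."""
    return [sigmoid(a[i] + sum(W[i][j] * y[j] for j in range(len(y))))
            for i in range(len(a))]

def reconstruction_log_prob(x, a, b, W):
    """ln of prod_i x_bar_i^{x_i} (1 - x_bar_i)^{1 - x_i}, with x_bar = x_bar(y_bar(x))."""
    x_bar = visible_mean(hidden_mean(x, b, W), a, W)
    return sum(math.log(p) if xi == 1 else math.log(1.0 - p)
               for xi, p in zip(x, x_bar))

# Hypothetical parameters.
a = [0.2, -0.4, 0.1]
b = [-0.3, 0.5]
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]
score = reconstruction_log_prob((1, 0, 1), a, b, W)

# With all-zero parameters every x_bar_i is 0.5, so the score is N ln(1/2).
zero_score = reconstruction_log_prob(
    (1, 0, 1), [0.0] * 3, [0.0] * 2, [[0.0, 0.0] for _ in range(3)]
)
print(score, zero_score)
```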



### Minimising the reconstruction error

As another approach, suppose we approximate $p({\bf x})$, and score how close this approximation is to the binary input data. Specifically, we minimise the mean square error
\begin{eqnarray}
R^2 & \doteq & \frac{1}{D}\sum_{d=1}^{D}\sum_{i=1}^{N}(x_{di}-p_{di})^2\,,
\end{eqnarray}
where
\begin{eqnarray}
p_{di} & \doteq & \bar{x}_i({\bf q}_d) = \sigma\left(a_i+\sum_{j=1}^{M}w_{ij} q_{dj}\right)\,,
\\
q_{dj} & \doteq & \bar{y}_j({\bf x}_d) = \sigma\left(b_j+\sum_{i=1}^{N}x_{di} w_{ij}\right)\,.
\end{eqnarray}

We note that, for arbitrary parameter $\theta$, the gradient is
\begin{eqnarray}
\frac{\partial R^2}{\partial\theta} & = & 
-\frac{2}{D}\sum_{d=1}^{D}\sum_{i=1}^{N}\delta_i(\theta)\,(x_{di}-p_{di})
\frac{\partial p_{di}}{\partial\theta}\,,
\end{eqnarray}
where $\delta_i(\theta)$ is a notional 0/1 indicator that causes the summation over $i$ to be dropped if parameter $\theta$ is indexed by $i$.
Furthermore, we see that
\begin{eqnarray}
\frac{\partial p_{di}}{\partial\theta} & = & 
p_{di}\,(1-p_{di})\left\{ 
\frac{\partial a_i}{\partial\theta}
+\sum_{j=1}^{M}\delta_j(\theta)\,\frac{\partial w_{ij}}{\partial\theta} q_{dj}
+\sum_{j=1}^{M}\delta_j(\theta)\,w_{ij}\frac{\partial q_{dj}}{\partial\theta}
\right\}\,,
\end{eqnarray}
since $\sigma'(z)=\sigma(z)\,[1-\sigma(z)]$.
Similarly, we have
\begin{eqnarray}
\frac{\partial q_{dj}}{\partial\theta} & = & 
q_{dj}\,(1-q_{dj})\left\{ 
\frac{\partial b_j}{\partial\theta}
+\sum_{i=1}^{N}\delta_i(\theta)\,x_{di}\frac{\partial w_{ij}}{\partial\theta}
\right\}\,.
\end{eqnarray}

Consequently, we derive that
\begin{eqnarray}
\frac{\partial q_{dj}}{\partial a_i} & = & 0\,,\; 
\frac{\partial p_{di}}{\partial a_i}=p_{di}\,(1-p_{di})
\\
\Rightarrow \frac{\partial R^2}{\partial\,{\bf a}} & = & 
-2\,{\tt mean}(A)\,,\; A \doteq (X-P)\otimes P\otimes (1-P)\,,
\end{eqnarray}
for element-wise multiplicative operator $\otimes$, where the function 
${\tt mean}(\cdot)$ averages over the data.

Similarly, we find that
\begin{eqnarray}
\frac{\partial q_{dj}}{\partial b_j} & = & q_{dj}\,(1-q_{dj})\,,\; 
\frac{\partial p_{di}}{\partial b_j}=p_{di}\,(1-p_{di})\,
w_{ij}\frac{\partial q_{dj}}{\partial b_j}
\\
\Rightarrow \frac{\partial R^2}{\partial\,{\bf b}} & = & 
-2\,{\tt mean}([AW]\otimes B)\,,\; B \doteq Q\otimes (1-Q)\,.
\end{eqnarray}

Lastly, we obtain
\begin{eqnarray}
\frac{\partial q_{dj}}{\partial w_{ij}} & = & x_{di}\,q_{dj}\,(1-q_{dj})\,,\;
\frac{\partial p_{dk}}{\partial w_{ij}}=p_{dk}\,(1-p_{dk})\,\left\{
\delta_{ki}\,q_{dj}+w_{kj}\frac{\partial q_{dj}}{\partial w_{ij}}
\right\}
\\
\Rightarrow \frac{\partial R^2}{\partial W} & = &
-\frac{2}{D}\left\{
A^{T}Q+X^{T}\left([AW]\otimes B\right)
\right\}
\,,
\end{eqnarray}
where $\delta_{ki}$ is the Kronecker delta. Note that the summation over inputs cannot be dropped for the second term, since $w_{ij}$ influences every $p_{dk}$ through $q_{dj}$.

Note that we need to update the parameter estimates in the opposite direction of these gradients in order to minimise the reconstruction error.
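
Gradients of this kind are easy to get subtly wrong, so a finite-difference check is worthwhile. The sketch below uses hypothetical data and parameters, with plain nested lists in place of matrices. The gradient for $W$ is computed with the full chain rule, in which every $p_{dk}$ depends on $w_{ij}$ through $q_{dj}$, i.e. as $-\frac{2}{D}\{A^{T}Q+X^{T}([AW]\otimes B)\}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(X, a, b, W):
    """Return P (rows p_d) and Q (rows q_d) for the data rows in X."""
    N, M = len(a), len(b)
    Q = [[sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(N))) for j in range(M)]
         for x in X]
    P = [[sigmoid(a[i] + sum(W[i][j] * q[j] for j in range(M))) for i in range(N)]
         for q in Q]
    return P, Q

def reconstruction_error(X, a, b, W):
    P, _ = forward(X, a, b, W)
    return sum((x[i] - p[i]) ** 2 for x, p in zip(X, P) for i in range(len(a))) / len(X)

def gradients(X, a, b, W):
    N, M, n = len(a), len(b), len(X)  # n plays the role of D in the text
    P, Q = forward(X, a, b, W)
    A = [[(X[d][i] - P[d][i]) * P[d][i] * (1 - P[d][i]) for i in range(N)]
         for d in range(n)]
    B = [[Q[d][j] * (1 - Q[d][j]) for j in range(M)] for d in range(n)]
    AW = [[sum(A[d][i] * W[i][j] for i in range(N)) for j in range(M)]
          for d in range(n)]
    grad_a = [-2.0 * sum(A[d][i] for d in range(n)) / n for i in range(N)]
    grad_b = [-2.0 * sum(AW[d][j] * B[d][j] for d in range(n)) / n for j in range(M)]
    grad_W = [[-2.0 * sum(A[d][i] * Q[d][j] + X[d][i] * AW[d][j] * B[d][j]
                          for d in range(n)) / n
               for j in range(M)] for i in range(N)]
    return grad_a, grad_b, grad_W

def finite_difference(X, a, b, W, name, index, eps=1e-6):
    """Central difference of R^2 with respect to one scalar parameter."""
    values = []
    for sign in (1.0, -1.0):
        a2, b2, W2 = a[:], b[:], [row[:] for row in W]
        if name == "a":
            a2[index] += sign * eps
        elif name == "b":
            b2[index] += sign * eps
        else:
            W2[index[0]][index[1]] += sign * eps
        values.append(reconstruction_error(X, a2, b2, W2))
    return (values[0] - values[1]) / (2.0 * eps)

# Hypothetical data (four cases) and parameters.
X = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
a = [0.2, -0.4, 0.1]
b = [-0.3, 0.5]
W = [[0.7, -1.2], [0.3, 0.8], [-0.5, 0.4]]

grad_a, grad_b, grad_W = gradients(X, a, b, W)
gap = 0.0
for i in range(len(a)):
    gap = max(gap, abs(grad_a[i] - finite_difference(X, a, b, W, "a", i)))
for j in range(len(b)):
    gap = max(gap, abs(grad_b[j] - finite_difference(X, a, b, W, "b", j)))
for i in range(len(a)):
    for j in range(len(b)):
        gap = max(gap, abs(grad_W[i][j] - finite_difference(X, a, b, W, "W", (i, j))))
print(gap)  # largest disagreement between analytic and numeric gradients
```

A descent step would then subtract a small multiple of each gradient from the corresponding parameter.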

### Maximising the approximate likelihood

Following similar reasoning to that above, we could instead maximise the approximate likelihood defined earlier. Reusing the definition of the reconstruction probability, we recall that
\begin{eqnarray}
p({\bf x}_{d}) & \approx & \prod_{i=1}^{N} p_{di}^{x_{di}}(1-p_{di})^{1-x_{di}}\,.
\end{eqnarray}
This leads to the average log-likelihood
\begin{eqnarray}
L & = & \frac{1}{D}\ln\prod_{d=1}^{D} p({\bf x}_{d})
\\
& \approx & \frac{1}{D}\sum_{d=1}^{D}\sum_{i=1}^{N}\left[
x_{di}\ln p_{di}+(1-x_{di})\ln(1-p_{di})
\right]
\\
\Rightarrow\frac{\partial L}{\partial\theta} & \approx & 
\frac{1}{D}\sum_{d=1}^{D}\sum_{i=1}^{N}\delta_{i}(\theta)\,\left[
\frac{x_{di}}{p_{di}}
-\frac{1-x_{di}}{1-p_{di}}
\right]\,\frac{\partial p_{di}}{\partial\theta}\,.
\end{eqnarray}



Thus, we obtain
\begin{eqnarray}
\frac{\partial L}{\partial a_i} & \approx &
\frac{1}{D}\sum_{d=1}^{D}\left[
x_{di}(1-p_{di})-(1-x_{di})p_{di}
\right]
= \frac{1}{D}\sum_{d=1}^{D}(x_{di}-p_{di})
\\
\Rightarrow \frac{\partial L}{\partial\,{\bf a}} & \approx &
{\tt mean}(X-P)\,.
\end{eqnarray}

Similarly, we have
\begin{eqnarray}
\frac{\partial L}{\partial b_j} & \approx &
\frac{1}{D}\sum_{d=1}^{D}\sum_{i=1}^{N}(x_{di}-p_{di})\,w_{ij}q_{dj}\,(1-q_{dj})
\\
\Rightarrow \frac{\partial L}{\partial\,{\bf b}} & \approx &
{\tt mean}([(X-P)W]\otimes B)\,.
\end{eqnarray}

Lastly, we obtain
\begin{eqnarray}
\frac{\partial L}{\partial w_{ij}}
& \approx & \frac{1}{D}\sum_{d=1}^{D}
(x_{di}-p_{di})\left[
q_{dj}+w_{ij}x_{di}q_{dj}\,(1-q_{dj})
\right]
\\
\Rightarrow \frac{\partial L}{\partial W} & \approx &
\frac{1}{D}\left\{
(X-P)^{T}Q+W\otimes\left([(X-P)\otimes X]^{T}B\right)
\right\}
\,.
\end{eqnarray}
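The corresponding log-likelihood expressions might be sketched in NumPy as below. As before, this assumes a $D\times N$ binary data matrix $X$, an $N\times M$ weight matrix $W$, and the reconstruction quantities $q_{dj}=\sigma(b_j+\sum_i x_{di}w_{ij})$ and $p_{di}=\sigma(a_i+\sum_j w_{ij}q_{dj})$; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(X, a, b, W):
    """Return hidden means Q (D x M) and reconstruction probabilities P (D x N)."""
    Q = sigmoid(b + X @ W)
    P = sigmoid(a + Q @ W.T)
    return Q, P

def avg_log_likelihood(X, a, b, W):
    """L ~ (1/D) sum_d sum_i [x ln p + (1 - x) ln(1 - p)]."""
    _, P = reconstruct(X, a, b, W)
    return np.sum(X * np.log(P) + (1.0 - X) * np.log(1.0 - P)) / X.shape[0]

def log_likelihood_grads(X, a, b, W):
    """Transcribe the matrix expressions for dL/da, dL/db and dL/dW."""
    D = X.shape[0]
    Q, P = reconstruct(X, a, b, W)
    R = X - P                                  # residuals, D x N
    B = Q * (1.0 - Q)                          # D x M
    grad_a = R.mean(axis=0)                    # mean(X - P)
    grad_b = ((R @ W) * B).mean(axis=0)        # mean([(X - P)W] (x) B)
    # Keeps exactly the terms of the expression for dL/dW given above.
    grad_W = (R.T @ Q + W * ((R * X).T @ B)) / D
    return grad_a, grad_b, grad_W
```

Since $L$ is to be maximised, the updates here move along the gradients, e.g. `a += lr * grad_a` for a hypothetical learning rate `lr`.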

## Sequential Restricted Boltzmann Machines

[(Larochelle and Murray)](http://proceedings.mlr.press/v15/larochelle11a/larochelle11a.pdf)
extend the RBM to the case where the input vector ${\bf x}$ has internal Bayesian Network dependencies. 
They call this the Neural Autoregressive Distribution Estimator (NADE).
In particular, it is assumed that the
elements ${\bf x}=(x_1,x_2,\ldots,x_N)$ have been placed in some fixed order,
for example randomly or according to causal effects. Thus, each element may depend on all of its predecessors in the ordering, so that the input dependencies form an ordered, directed network (see the figure below).

![Networked Restricted Boltzmann Machine graph](NADE.png "Networked Restricted Boltzmann Machine")

Borrowing Bayesian Network methodology, and temporarily ignoring the hidden output,
we now suppose that the visible input has the joint distribution:
\begin{eqnarray}
p({\bf x}) & = & p(x_1,x_2,\ldots,x_N)
\\
& \doteq & p(x_1)\,p(x_2\mid x_1)\,p(x_3\mid x_1, x_2)\cdots 
p(x_N\mid x_1,x_2,\ldots,x_{N-1})
\\
& = &
\prod_{i=1}^{N} p(x_i\mid{\bf x}_{1:i-1})\,,
\end{eqnarray}
where we define ${\bf x}_{j:k}\doteq (x_j,x_{j+1},\ldots,x_k)$ for $j\le k\in\{1,2,\ldots,N\}$. For convenience, whenever $j>k$ we let
${\bf x}_{j:k}=(\,)$.
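As a small numerical illustration of this factorisation, the sketch below builds an arbitrary joint table over $N=3$ binary variables, computes each conditional $p(x_i\mid{\bf x}_{1:i-1})$ by marginalisation, and multiplies the conditionals back together. The table and the helper name `conditional` are made up for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 3
joint = rng.random((2,) * N)
joint /= joint.sum()          # an arbitrary joint p(x_1, x_2, x_3)

def conditional(joint, x, i):
    """p(x_i | x_{1:i-1}) for 0-based position i, obtained by marginalisation."""
    numerator = joint[tuple(x[:i + 1])].sum()   # p(x_{1:i})
    denominator = joint[tuple(x[:i])].sum()     # p(x_{1:i-1}); equals 1 when i = 0
    return numerator / denominator

def chain_rule_product(joint, x):
    """Product of the N conditionals for the configuration x."""
    return np.prod([conditional(joint, x, i) for i in range(joint.ndim)])

# The product of conditionals recovers the joint for every configuration.
for x in itertools.product([0, 1], repeat=N):
    assert np.isclose(chain_rule_product(joint, x), joint[x])
```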

Next, we reintroduce the binary hidden output ${\bf y}\in\{0,1\}^{M}$, such that
\begin{eqnarray}
p(x_i\mid{\bf x}_{1:i-1}) & = & \sum_{{\bf y}\in\{0,1\}^{M}}
p(x_i,{\bf y}\mid{\bf x}_{1:i-1})\,,
\end{eqnarray}
where, by the product rule of probability (see the diagram),
\begin{eqnarray}
p(x_i,{\bf y}\mid{\bf x}_{1:i-1}) & = &
p(x_i\mid{\bf y},{\bf x}_{1:i-1})\,
p({\bf y}\mid{\bf x}_{1:i-1})\,.
\end{eqnarray}
In order to keep the networked model reasonably simple, we now assume that, given the hidden output ${\bf y}$, the element $x_i$ no longer depends directly on its predecessors, and define
\begin{eqnarray}
p(x_i\mid{\bf y},{\bf x}_{1:i-1}) & \doteq & p(x_i\mid {\bf y})\,.
\end{eqnarray}
Effectively, we have returned to the standard RBM formulation, but have implicitly retained the dependencies amongst the input units by utilising their effects on the output units.

Thus, for the Bernoulli RBM, we have
\begin{eqnarray}
\bar{x}_i({\bf y}) & = &
p(x_i=1\mid{\bf y}) = \sigma\left(\left[{\bf a}+W{\bf y}\right]_i\right)\,,
\end{eqnarray}
as before, but now we use the truncated models
\begin{eqnarray}
\bar{y}_j({\bf x}_{1:k}) & = & p(y_j=1\mid{\bf x}_{1:k}) \doteq 
\sigma\left(\left[{\bf b}+W_{1:k,:}^{T}{\bf x}_{1:k}\right]_j\right)\,,
\end{eqnarray}
for $k=1,2,\ldots,N$,
where $W_{1:k,:}$ is the matrix obtained by retaining the first $k$ rows of $W$.
For the boundary case $i=1$, where $k=i-1=0$ and ${\bf x}_{1:0}$ is empty, we take $\bar{y}_j({\bf x}_{1:0})=\sigma(b_j)$.
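One way the truncated models might be computed for every $k$ at once is with a cumulative sum over the rows of $W$, as in the sketch below. It assumes an $N\times M$ matrix $W$ and a single binary input vector ${\bf x}$ of length $N$; the function name is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def truncated_hidden_means(x, b, W):
    """Return an (N+1) x M array whose row k holds y_bar(x_{1:k}), k = 0, ..., N."""
    M = W.shape[1]
    # Row k-1 of `running` is sum_{l=1}^{k} x_l W[l-1, :], i.e. W_{1:k,:}^T x_{1:k}.
    running = np.cumsum(x[:, None] * W, axis=0)
    # Prepend a zero row for the empty prefix x_{1:0}, giving sigma(b).
    partial = np.vstack([np.zeros((1, M)), running])
    return sigmoid(b + partial)
```

Row $k=0$ reproduces the boundary case $\sigma(b_j)$, and row $k=N$ is the usual RBM hidden mean for the full input.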

Reusing the derivations above, we similarly assume that
\begin{eqnarray}
p({\bf y}\mid{\bf x}_{1:i-1}) & = &
\prod_{j=1}^{M}\bar{y}_j({\bf x}_{1:i-1})^{y_j}
\left[1-\bar{y}_j({\bf x}_{1:i-1})\right]^{1-y_j}\,,
\end{eqnarray}
such that
\begin{eqnarray}
p(x_i=1\mid{\bf x}_{1:i-1}) & = & \sum_{{\bf y}\in\{0,1\}^M}
\bar{x}_i({\bf y})\,p({\bf y}\mid{\bf x}_{1:i-1})
= \mathbb{E}_{{\bf y}\mid{\bf x}_{1:i-1}}\left[\bar{x}_i({\bf y})\right]
\,.
\end{eqnarray}
However, we note that this summation runs over all $2^M$ hidden configurations, and so remains intractable for large $M$.

In order to obtain a tractable model,
[(Larochelle and Murray)](http://proceedings.mlr.press/v15/larochelle11a/larochelle11a.pdf)
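used a mean field approximation to $p(x_i\mid{\bf x}_{1:i-1})$, which replaces the expectation of $\bar{x}_i({\bf y})$ by $\bar{x}_i$ evaluated at the expected hidden output $\bar{\bf y}({\bf x}_{1:i-1})$, namely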
\begin{eqnarray}
p(x_i=1\mid{\bf x}_{1:i-1}) & \approx & \bar{x}_i(\bar{\bf y}({\bf x}_{1:i-1}))\,.
\end{eqnarray}
Hence, the joint probability of input ${\bf x}$ is
\begin{eqnarray}
p({\bf x}) & \approx & \prod_{i=1}^{N}
\bar{x}_i\left(\bar{\bf y}\left({\bf x}_{1:i-1}\right)\right)^{\,x_i}\,
\left[1-\bar{x}_i\left(\bar{\bf y}\left({\bf x}_{1:i-1}\right)\right)
\right]^{\,1-x_i}\,.
\end{eqnarray}
This is the essence of the NADE model. 
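For a small number of hidden units, the exact expectation can still be enumerated and compared with the mean field value. The following sketch does this for a single conditional, assuming an $N\times M$ matrix $W$ and 0-based indexing of the input position; the function names are hypothetical.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_means(x_prefix, b, W):
    """y_bar(x_{1:k}), using only the first k = len(x_prefix) rows of W."""
    k = len(x_prefix)
    return sigmoid(b + W[:k].T @ x_prefix)

def exact_conditional(i, x_prefix, a, b, W):
    """E_{y | x_prefix}[x_bar_i(y)] by enumerating all 2^M hidden states."""
    y_bar = hidden_means(x_prefix, b, W)
    total = 0.0
    for y in itertools.product([0.0, 1.0], repeat=len(b)):
        y = np.array(y)
        p_y = np.prod(y_bar ** y * (1.0 - y_bar) ** (1.0 - y))  # p(y | x_prefix)
        total += sigmoid(a[i] + W[i] @ y) * p_y
    return total

def mean_field_conditional(i, x_prefix, a, b, W):
    """x_bar_i evaluated at the expected hidden output y_bar(x_prefix)."""
    y_bar = hidden_means(x_prefix, b, W)
    return sigmoid(a[i] + W[i] @ y_bar)
```

The two values coincide when row $i$ of $W$ is zero, since $\bar{x}_i({\bf y})$ then no longer depends on ${\bf y}$; in general they differ because $\sigma$ is nonlinear.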

Observe that this is just a modified form of the approximate reconstruction probability derived in an earlier section. 
It follows that most of the maths we previously derived for the gradient of the log-likelihood still holds. Thus, we compute
\begin{eqnarray}
\bar{y}_j^{(i)}({\bf x}) & \doteq &\bar{y}_j({\bf x}_{1:i-1}) = \sigma\left(b_j+\sum_{k=1}^{i-1}x_k w_{kj}\right)\,,
\\
\bar{x}_i\left(\bar{\bf y}^{(i)}_{\bf x}\right) 
& = & \sigma\left(a_i+\sum_{j=1}^{M}w_{ij}\bar{y}_j^{(i)}({\bf x})\right)\,,
\end{eqnarray}
such that
\begin{eqnarray}
\ln p({\bf x}) & \approx & \sum_{i=1}^{N}\left\{
  x_i\ln \bar{x}_i\left(\bar{\bf y}^{(i)}_{\bf x}\right)
  +(1-x_i)\ln\left[1-\bar{x}_i\left(\bar{\bf y}^{(i)}_{\bf x}\right)\right]
\right\}\,.
\end{eqnarray}
Consequently, we obtain the approximate gradients
\begin{eqnarray}
\frac{\partial}{\partial a_i}\ln p({\bf x}) & \approx & 
x_i-\bar{x}_i\left(\bar{\bf y}^{(i)}_{\bf x}\right)\,,
\\
\frac{\partial}{\partial b_j}\ln p({\bf x}) & \approx & 
\sum_{i=1}^{N}B_{ij}\,,
\\
\frac{\partial}{\partial w_{ij}}\ln p({\bf x}) & \approx & 
\left[x_i-\bar{x}_i\left(\bar{\bf y}^{(i)}_{\bf x}\right)\right]
\,\bar{y}_j^{(i)}({\bf x})
+x_i\sum_{k=i+1}^{N}B_{kj}
\,,
\end{eqnarray}
where, for each $i$ and $j$ (and distinct from the matrix $B$ used in the previous sections), we define
\begin{eqnarray}
B_{ij} & = & 
\left[x_i-\bar{x}_i\left(\bar{\bf y}^{(i)}_{\bf x}\right)\right]
  \,w_{ij}
  \,\bar{y}_j^{(i)}({\bf x})\left[1-\bar{y}_j^{(i)}({\bf x})\right]
\,.
\end{eqnarray}
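The approximate log-probability and its gradients for a single input vector might be sketched in NumPy as follows. The sketch assumes an $N\times M$ matrix $W$ and a binary vector ${\bf x}$ of length $N$ stored as floats, and uses 0-based row indices in place of $i=1,\ldots,N$; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_forward(x, a, b, W):
    """Return (x_bar, Y), where Y[i, j] is y_bar_j^{(i)} and x_bar[i] is the
    i-th conditional mean (0-based i)."""
    M = W.shape[1]
    running = np.cumsum(x[:, None] * W, axis=0)
    # Row i of `prefix` holds sum_{k<i} x_k w_kj (a zero row for the first element).
    prefix = np.vstack([np.zeros((1, M)), running[:-1]])
    Y = sigmoid(b + prefix)
    x_bar = sigmoid(a + np.sum(W * Y, axis=1))
    return x_bar, Y

def nade_log_prob(x, a, b, W):
    """Approximate ln p(x) as the sum of Bernoulli log conditionals."""
    x_bar, _ = nade_forward(x, a, b, W)
    return np.sum(x * np.log(x_bar) + (1.0 - x) * np.log(1.0 - x_bar))

def nade_grads(x, a, b, W):
    """Gradients of nade_log_prob with respect to a, b and W."""
    x_bar, Y = nade_forward(x, a, b, W)
    err = x - x_bar                              # x_i - x_bar_i
    Bmat = err[:, None] * W * Y * (1.0 - Y)      # B_ij
    grad_a = err
    grad_b = Bmat.sum(axis=0)                    # sum_i B_ij
    # tail[i, j] = sum_{k > i} B_kj: reversed cumulative sum minus the row itself.
    tail = np.cumsum(Bmat[::-1], axis=0)[::-1] - Bmat
    grad_W = err[:, None] * Y + x[:, None] * tail
    return grad_a, grad_b, grad_W
```

Because each factor is a valid Bernoulli conditional that depends only on the preceding elements, `np.exp(nade_log_prob(...))` sums to one over all $2^N$ binary inputs, which gives a convenient sanity check alongside finite-difference checks of the gradients.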

We might recall, however, the various gradient schemes that we have derived so far, namely the Hinton-modified gradient and the mean field approximation, as well as the explicit gradients of the reconstruction error and the log reconstruction probability.
In practice, all of these gradient schemes act to minimise the reconstruction error and maximise the reconstruction probability.

Consequently, it seems reasonable to suppose that we might equally modify one of these existing gradient schemes to allow for dependencies between the input units. Note, however, that whereas previously we computed the component
$\bar{x}_i(\bar{\bf y}_{\bf x})$ from the vector 
$\bar{\bf x}(\bar{\bf y}_{\bf x})$, 
we now need to reverse this procedure and instead compute the vector by stacking the components incrementally. 