In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

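# IPython.display.set_matplotlib_formats is deprecated in recent IPython releases;
# the same function is provided by the matplotlib_inline package.
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')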
plt.rcParams['savefig.dpi'] = 75

plt.rcParams['figure.autolayout'] = False
plt.rcParams['figure.figsize'] = 10, 6
plt.rcParams['axes.labelsize'] = 18
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['font.size'] = 16
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 8
plt.rcParams['legend.fontsize'] = 14

# Rendering text with LaTeX requires a working LaTeX installation.
plt.rcParams['text.usetex'] = True
plt.rcParams['font.family'] = "serif"
plt.rcParams['font.serif'] = "Computer Modern Roman"

## Introduction

Recent progress in machine learning and artificial intelligence has been driven by a renaissance in research on artificial neural networks. Deep learning methods have been especially powerful tools for large-scale supervised learning problems. In fact, one could go so far as to claim that there is now a well-established recipe for success on supervised learning problems: step 1 -- collect a very large dataset with reliable labels (for example, a billion images labeled as "cat" or "not cat"), step 2 -- apply noise or nuisance transformations to the data that you want your model to be insensitive to, step 3 -- use stochastic gradient descent to train a very large multi-layer neural network on these data. Unfortunately, there are many problems that cannot be solved using this three-step recipe, usually because of a failure at step 1. If there is little or no data with reliable labels, then we have to throw away the cookbook and try something else.

Unsupervised learning is the task of learning from data without labels. In the unsupervised context, "learning" refers to developing an understanding of the *process that generates the data*, or to constructing a simpler (e.g., sparse or low-dimensional) *representation* of the data. A Restricted Boltzmann Machine, or RBM, is a type of neural network designed for this task.  

Training an RBM on an unsupervised problem is not as easy as training a neural network on a supervised problem. Neural networks used for supervised problems are deterministic -- they assign a unique label to every image. Moreover, the gradient of the objective function can be calculated very easily. By contrast, RBMs are stochastic; they describe probability distributions instead of input-output functions. As a result, computing the gradients exactly requires averaging over all states of the model, which is so costly that it is effectively impossible for all but the smallest networks. Therefore, the ability to train RBMs efficiently hinges on the development of fast and accurate approximations to the gradients of the objective function.


### Restricted Boltzmann Machines

An RBM is a type of energy-based model defined through an energy function. Let $v_i$ for $i = 1, \ldots, N$ denote the units of the visible layer and $h_{\mu}$ for $\mu = 1, \ldots, M$ denote the units of the hidden layer. The energy function of a general RBM is:

\begin{equation}
H(\boldsymbol{v}, \boldsymbol{h}) = -\sum_i f_i(v_i) - \sum_{\mu} f_{\mu}(h_{\mu}) 
- \sum_{i \mu} W_{i \mu} g_i(v_i) g_{\mu}(h_{\mu})
\end{equation}

Here, $f_i(\cdot)$ and $g_i(\cdot)$ are functions defined for each visible unit, and $f_{\mu}(\cdot)$ and $g_{\mu}(\cdot)$ are functions defined for each hidden unit. The joint probability distribution of the visible and hidden units is obtained by analogy with the Boltzmann distribution in physics:

\begin{equation}
p_{RBM}(\boldsymbol{v}, \boldsymbol{h}) = Z^{-1} e^{-H(\boldsymbol{v}, \boldsymbol{h}) }
\end{equation}

where $Z = \int d \boldsymbol{v} d \boldsymbol{h} e^{-H(\boldsymbol{v}, \boldsymbol{h}) }$ is the normalizing constant of the distribution (also called the partition function). For units that take discrete values, the integrals are replaced by sums.
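To make the energy function and the partition function concrete, here is a minimal sketch for one special case: binary units ($v_i, h_{\mu} \in \{0, 1\}$) with $f_i(v_i) = a_i v_i$, $f_{\mu}(h_{\mu}) = b_{\mu} h_{\mu}$, and $g(x) = x$. These choices, the names `a`, `b`, `W`, and the helper functions are assumptions made for illustration; they are not taken from paysage. With discrete units the integral defining $Z$ becomes a sum over all $2^{N+M}$ states, which is only feasible for tiny models.

```python
import itertools
import math


def energy(v, h, a, b, W):
    """H(v, h) for the assumed binary case: f_i(v) = a_i v, f_mu(h) = b_mu h, g(x) = x."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_m * h_m for b_m, h_m in zip(b, h))
    coupling = sum(W[i][m] * v[i] * h[m]
                   for i in range(len(v)) for m in range(len(h)))
    return -visible_term - hidden_term - coupling


def all_states(n):
    """Every binary vector of length n."""
    return list(itertools.product((0, 1), repeat=n))


def partition_function(a, b, W):
    """Z as a brute-force sum of exp(-H) over all 2^(N+M) joint states."""
    return sum(math.exp(-energy(v, h, a, b, W))
               for v in all_states(len(a))
               for h in all_states(len(b)))


def joint_probability(v, h, a, b, W):
    """p_RBM(v, h) = exp(-H(v, h)) / Z."""
    return math.exp(-energy(v, h, a, b, W)) / partition_function(a, b, W)


# Hypothetical parameters: N = 2 visible units, M = 1 hidden unit.
a = [0.1, -0.2]
b = [0.3]
W = [[0.5], [-1.0]]

Z = partition_function(a, b, W)
total = sum(joint_probability(v, h, a, b, W)
            for v in all_states(2) for h in all_states(1))
print(f"Z = {Z:.4f}, total probability = {total:.4f}")
```

The number of terms in the sum doubles with every added unit, which is why exact evaluation of $Z$ is out of reach for models of practical size.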


Training an RBM involves optimizing the parameters of the model so that the marginal distribution of the visible units is approximately equal to the data distribution:
\begin{equation}
p_{data}(\boldsymbol{v})\approx p_{RBM}(\boldsymbol{v}) = Z^{-1} \int d \boldsymbol{h} e^{-H(\boldsymbol{v}, \boldsymbol{h}) }
\end{equation}
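As an illustration of the marginal over the visible units, the sketch below again assumes binary units with $f_i(v_i) = a_i v_i$, $f_{\mu}(h_{\mu}) = b_{\mu} h_{\mu}$, and $g(x) = x$ (an assumption, not the general model above). In that case the sum over hidden states factorizes, giving $\sum_{\boldsymbol{h}} e^{-H(\boldsymbol{v}, \boldsymbol{h})} = e^{\sum_i a_i v_i} \prod_{\mu} \left(1 + e^{b_{\mu} + \sum_i W_{i \mu} v_i}\right)$. The function names and parameter values are hypothetical; the code checks the closed form against direct enumeration.

```python
import itertools
import math


def unnormalized_marginal_brute_force(v, a, b, W):
    """Sum exp(-H(v, h)) over every binary hidden state h."""
    total = 0.0
    for h in itertools.product((0, 1), repeat=len(b)):
        minus_energy = (sum(a[i] * v[i] for i in range(len(v)))
                        + sum(b[m] * h[m] for m in range(len(h)))
                        + sum(W[i][m] * v[i] * h[m]
                              for i in range(len(v)) for m in range(len(h))))
        total += math.exp(minus_energy)
    return total


def unnormalized_marginal_closed_form(v, a, b, W):
    """exp(sum_i a_i v_i) * prod_mu (1 + exp(b_mu + sum_i W_imu v_i))."""
    result = math.exp(sum(a[i] * v[i] for i in range(len(v))))
    for m in range(len(b)):
        hidden_input = b[m] + sum(W[i][m] * v[i] for i in range(len(v)))
        result *= 1.0 + math.exp(hidden_input)
    return result


def marginal_distribution(a, b, W):
    """p_RBM(v) for every visible state, normalized by enumerating visible states only."""
    states = list(itertools.product((0, 1), repeat=len(a)))
    weights = [unnormalized_marginal_closed_form(v, a, b, W) for v in states]
    Z = sum(weights)
    return {v: w / Z for v, w in zip(states, weights)}


# Hypothetical parameters: N = 3 visible units, M = 2 hidden units.
a = [0.1, -0.2, 0.0]
b = [0.3, -0.4]
W = [[0.5, -0.3], [-1.0, 0.8], [0.2, 0.1]]

p_rbm = marginal_distribution(a, b, W)
for v, p in p_rbm.items():
    print(v, round(p, 4))
```

The hidden units can be summed out cheaply here, but normalizing still requires a sum over all $2^N$ visible states, so the partition function remains the bottleneck.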

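We formulate the training problem as an optimization problem aimed at minimizing a measure of difference between $p_{data}(\boldsymbol{v})$ and $p_{RBM}(\boldsymbol{v})$ called the Kullback-Leibler divergence. Up to a constant that does not depend on the model parameters, this divergence equals the negative expected log-likelihood of the data under the model: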

\begin{equation}
D_{KL}(p_{data}(\boldsymbol{v}) || p_{RBM}(\boldsymbol{v})) = -E_{data}[\log p_{RBM}(\boldsymbol{v}) ] + \text{constant}
\end{equation}
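The relation between the Kullback-Leibler divergence and the expected log-likelihood can be checked numerically on a toy problem. The sketch below assumes a tiny binary RBM (the same simplified form, $f(x) = $ bias $\times\, x$ and $g(x) = x$) and a handful of made-up samples; all names, parameters, and data are hypothetical. It confirms that the constant in the equation is $E_{data}[\log p_{data}(\boldsymbol{v})]$, the negative entropy of the data distribution.

```python
import itertools
import math
from collections import Counter


def rbm_marginal(a, b, W):
    """p_RBM(v) for an assumed binary RBM, by enumerating all visible states."""
    states = list(itertools.product((0, 1), repeat=len(a)))
    weights = []
    for v in states:
        w = math.exp(sum(a[i] * v[i] for i in range(len(v))))
        for m in range(len(b)):
            w *= 1.0 + math.exp(b[m] + sum(W[i][m] * v[i] for i in range(len(v))))
        weights.append(w)
    Z = sum(weights)
    return {v: w / Z for v, w in zip(states, weights)}


def empirical_distribution(samples):
    """Relative frequency of each observed visible state."""
    counts = Counter(samples)
    return {v: c / len(samples) for v, c in counts.items()}


def kl_divergence(p_data, p_model):
    """D_KL(p_data || p_model); states with zero data probability contribute nothing."""
    return sum(p * math.log(p / p_model[v]) for v, p in p_data.items() if p > 0)


def negative_log_likelihood(p_data, p_model):
    """-E_data[log p_model(v)]."""
    return -sum(p * math.log(p_model[v]) for v, p in p_data.items())


def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


# Made-up data: six observations of two binary visible units.
samples = [(1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (1, 0)]
p_data = empirical_distribution(samples)

# Hypothetical model parameters: N = 2 visible units, M = 1 hidden unit.
p_model = rbm_marginal(a=[0.0, 0.0], b=[0.0], W=[[2.0], [2.0]])

kl = kl_divergence(p_data, p_model)
nll = negative_log_likelihood(p_data, p_model)
constant = -entropy(p_data)  # E_data[log p_data], independent of the model
print(f"KL = {kl:.6f}, NLL + constant = {nll + constant:.6f}")
```

Because the constant depends only on the data, minimizing the divergence over the model parameters is the same as maximizing the average log-likelihood of the data.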

In [2]:
# Examples
import paysage

## Discussion



## Acknowledgements


## References