Restricted Boltzmann machine:
Why are they interesting?
1. They are unsupervised by nature.
2. They use an energy function to define probabilities, use those probabilities to increase the log likelihood of the training data, and update the parameters using gradients.
3. They are trained with contrastive divergence, which increases the probability of the data seen at the visible (input) nodes and reduces the probability of configurations the model generates on its own. Since every possible $(v,h)$ combination contributes to the partition function, many of those combinations do not represent the input data. To push their probability down, we sample visible units ($\bar{v}$) from the hidden units. Since the process is random, chances are that $\bar{v}$ isn't present in the data. So we adjust the weights so that these sampled visible units ($\bar{v}$) get lower probability than the visible vectors present in our data.

In [1]:
import torch
import random

# Initialize params:
visible_vector_size = 28*28
hidden_vector_size = 28*2
learning_rate = 0.1

# Initialize visible and hidden vectors:
v = torch.randn([visible_vector_size,1]) #visible units vector
a = torch.zeros([visible_vector_size,1]) # visible units bias
h = torch.randn([hidden_vector_size,1]) # hidden units vector
b = torch.zeros([hidden_vector_size,1]) # hidden units bias

# Initialize weight matrix
W = torch.randn([visible_vector_size,hidden_vector_size])

#### Energy:
Energy is defined using:
![Energy](https://wikimedia.org/api/rest_v1/media/math/render/svg/ef4edf17279787e29bb1a581316d17d70de2072e)

Note:
1. There are 3 learnable params here: W, a and b.
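2. We are trying to lower the energy of configurations whose visible part matches the training data, so that the hidden units can be used to model that data.
3. I think you can replace the bilinear term $v^{T}Wh$ with something else, like an MLP or a single perceptron, and it would still work because we can get gradients for it. One mathematical reason to keep the bilinear form is that it makes the units in one layer conditionally independent given the other layer, which is what gives the simple logistic conditionals used further down. Will check this out further.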

In [2]:
# Energy 
def energy():
    return -(a.T.mm(v) + b.T.mm(h) + v.T.mm(W).mm(h))

print("Start energy = {}".format(energy()))

 Start energy = tensor([[47.0543]])


### Probabilities:

#### A simple case:
As an example, let us start with an RBM of 3 visible and 2 hidden units. There are a total of $2^5=32$ possible combinations of $(v,h)$. We convert the idea of energy into probabilities. The probability of a particular configuration $(v_1,v_2,v_3,h_1,h_2)$ is given by:
\begin{align*} \mathcal{P}(v_1,v_2,v_3,h_1,h_2) &= \frac{e^{- E(v_1,v_2,v_3,h_1,h_2)}}{Z} \\
\text{where      } Z &= \sum_{v', h'} e^{-E(v',h')}
\end{align*}
Here the sum in $Z$ runs over all 32 configurations, so the probabilities add up to 1.

So now, to find the probability of a given visible vector (an input vector), we sum the joint probability over all the hidden vectors:
![Probaility for a visible vector](https://wikimedia.org/api/rest_v1/media/math/render/svg/70aed07d8a53e0f60dd5679ceae1799c69cbc62f)
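
For a model this small, the definitions above can be checked by brute force. The following is a minimal sketch, assuming binary units and made-up parameters (`W_toy`, `a_toy`, `b_toy` are hypothetical and separate from the notebook's `W`, `a`, `b`): it enumerates all 32 configurations, computes $Z$, and confirms that the marginal probabilities of the visible vectors sum to 1.

```python
import itertools
import torch

torch.manual_seed(0)

n_visible, n_hidden = 3, 2  # the toy sizes used in the text

# Hypothetical small parameters for the toy model (assumption, not the notebook's values)
W_toy = 0.5 * torch.randn(n_visible, n_hidden)
a_toy = torch.zeros(n_visible)
b_toy = torch.zeros(n_hidden)

def toy_energy(v, h):
    """E(v, h) = -(a^T v + b^T h + v^T W h) for 1-D binary vectors."""
    return -(a_toy @ v + b_toy @ h + v @ W_toy @ h)

# Enumerate every binary visible and hidden vector
visible_states = [torch.tensor(s, dtype=torch.float32)
                  for s in itertools.product([0, 1], repeat=n_visible)]
hidden_states = [torch.tensor(s, dtype=torch.float32)
                 for s in itertools.product([0, 1], repeat=n_hidden)]

# Partition function: sum of exp(-E) over all (v, h) pairs
Z = sum(torch.exp(-toy_energy(v, h))
        for v in visible_states for h in hidden_states)

def joint_prob(v, h):
    return torch.exp(-toy_energy(v, h)) / Z

def marginal_prob(v):
    # P(v) is the joint probability summed over all hidden vectors
    return sum(joint_prob(v, h) for h in hidden_states)

total = sum(marginal_prob(v) for v in visible_states)
print("number of configurations:", len(visible_states) * len(hidden_states))
print("sum of P(v) over all v:", float(total))
```

The enumeration costs $2^{n_v + n_h}$ energy evaluations, which is why this only works for toy sizes.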

#### When state space increases
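A problem arises when we have a large number of units in our model: the number of configurations grows exponentially, so finding the value of $Z$ becomes intractable. The conditional probabilities of one layer given the other do not need $Z$, though, and they reduce to logistic (sigmoid) functions: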
![probability of hidden vector](https://wikimedia.org/api/rest_v1/media/math/render/svg/5cb8dcbe9e8df021e89fc51b22a6fed8fe5f41ff)
![probability of visible vector](https://wikimedia.org/api/rest_v1/media/math/render/svg/057f0c5b5e369ebac4ecc1053a7fcec0af48567d)

In [5]:
def hidden_probs():
    return torch.sigmoid(b + W.T.mm(v))

def visible_probs():
    return torch.sigmoid(a + W.mm(h))
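
To see why the sigmoid form needs no partition function, one can compare it with the conditional computed directly from the energies on a tiny model. This is a sketch under assumed binary units and made-up parameters (`W_toy`, `a_toy`, `b_toy` and the vector `v` are hypothetical): for a fixed $v$, $Z$ cancels in the ratio, and the result should match `sigmoid(b + W^T v)`.

```python
import itertools
import torch

torch.manual_seed(0)

n_visible, n_hidden = 3, 2

# Hypothetical parameters for a toy RBM (assumption for illustration)
W_toy = torch.randn(n_visible, n_hidden)
a_toy = torch.randn(n_visible)
b_toy = torch.randn(n_hidden)

def toy_energy(v, h):
    return -(a_toy @ v + b_toy @ h + v @ W_toy @ h)

v = torch.tensor([1.0, 0.0, 1.0])  # an arbitrary binary visible vector

hidden_states = [torch.tensor(s, dtype=torch.float32)
                 for s in itertools.product([0, 1], repeat=n_hidden)]

# Unnormalized weights exp(-E(v, h)) for this fixed v; Z cancels in the ratio below
weights = torch.stack([torch.exp(-toy_energy(v, h)) for h in hidden_states])
states = torch.stack(hidden_states)  # shape (4, n_hidden)

# P(h_j = 1 | v): total weight of states with h_j = 1, divided by total weight
brute_force = (weights[:, None] * states).sum(0) / weights.sum()

# Closed form used by hidden_probs() in the notebook
closed_form = torch.sigmoid(b_toy + W_toy.T @ v)

print("brute force :", brute_force)
print("sigmoid form:", closed_form)
```

The two vectors should agree up to floating-point error, and the same argument applies to the visible units given $h$.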

### Cost function:
Our cost function here is the log likelihood of the training data, which we want to maximize.
\begin{equation*} \mathcal{J} = \sum_{v \in V} \log(\mathcal{P}(v)) \text{  where, V is the training data}\end{equation*}

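So the gradient for W is the difference between two expectations of $vh^{T}$, one with $v$ clamped to the training data and one under the model's own distribution:
\begin{equation*} \frac{\partial \log \mathcal{P}(v)}{\partial W} = \langle v h^{T} \rangle_{\text{data}} - \langle v h^{T} \rangle_{\text{model}} \end{equation*}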
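
The standard RBM result is that the gradient of $\log \mathcal{P}(v)$ with respect to $W$ is a data term minus a model term, each an expectation of $vh^{T}$. The sketch below checks this numerically on a toy model by comparing it against autograd. It assumes binary units and uses the free energy $F(v) = -a^{T}v - \sum_j \log(1 + e^{b_j + (W^{T}v)_j})$, which comes from summing $e^{-E(v,h)}$ over $h$. All names (`W_toy`, `free_energy`, `v_data`, and so on) are hypothetical.

```python
import itertools
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_visible, n_hidden = 3, 2

# Hypothetical toy parameters; only W is differentiated here
W_toy = torch.randn(n_visible, n_hidden, requires_grad=True)
a_toy = torch.randn(n_visible)
b_toy = torch.randn(n_hidden)

def free_energy(v):
    """F(v) such that P(v) = exp(-F(v)) / Z, with hidden units summed out."""
    return -(a_toy @ v) - F.softplus(b_toy + W_toy.T @ v).sum()

all_v = [torch.tensor(s, dtype=torch.float32)
         for s in itertools.product([0, 1], repeat=n_visible)]

def log_prob(v):
    # log Z by enumeration over all 2^3 visible vectors (toy sizes only)
    log_Z = torch.logsumexp(torch.stack([-free_energy(u) for u in all_v]), dim=0)
    return -free_energy(v) - log_Z

def outer_vh(v):
    # v times E[h | v]^T, the quantity averaged in both terms of the gradient
    return torch.outer(v, torch.sigmoid(b_toy + W_toy.T @ v))

v_data = torch.tensor([1.0, 0.0, 1.0])  # a made-up training vector

# Gradient of log P(v_data) with respect to W from autograd
log_prob(v_data).backward()
autograd_grad = W_toy.grad.clone()

# Same gradient from the formula: data term minus model expectation
with torch.no_grad():
    positive = outer_vh(v_data)
    negative = sum(torch.exp(log_prob(u)) * outer_vh(u) for u in all_v)
    formula_grad = positive - negative

print("max abs difference:", (autograd_grad - formula_grad).abs().max().item())
```

The model term is the expensive part, since it sums over every visible vector. Contrastive divergence replaces it with an estimate from a short Gibbs chain.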

In [None]:
# distributions

In [None]:
# Loading data

In [None]:
# Training loop

# Gibbs sampling
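
One way the training loop and Gibbs sampling cells could be filled in is with single-step contrastive divergence (CD-1). This is a minimal sketch, not a definitive implementation. It assumes binary units, a made-up two-pattern dataset, toy sizes instead of the `28*28` setup, and batched row vectors instead of the column vectors used earlier. The function names and the choice to use hidden probabilities rather than sampled states in the update are choices made for this example.

```python
import torch

torch.manual_seed(0)

n_visible, n_hidden = 6, 3   # hypothetical toy sizes
learning_rate = 0.1

# Made-up binary dataset: two patterns, each repeated 10 times
data = torch.tensor([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=torch.float32).repeat(10, 1)

W = 0.01 * torch.randn(n_visible, n_hidden)  # small initial weights
a = torch.zeros(n_visible)                   # visible bias
b = torch.zeros(n_hidden)                    # hidden bias

def hidden_probs(v):
    return torch.sigmoid(b + v @ W)      # v: (batch, n_visible)

def visible_probs(h):
    return torch.sigmoid(a + h @ W.T)    # h: (batch, n_hidden)

def cd1_step(v0):
    """One CD-1 update on a batch; returns the mean squared reconstruction error."""
    # Positive phase: hidden units driven by the data
    ph0 = hidden_probs(v0)
    h0 = torch.bernoulli(ph0)
    # One Gibbs step: sample visible units from h0, then recompute hidden probabilities
    pv1 = visible_probs(h0)
    v1 = torch.bernoulli(pv1)
    ph1 = hidden_probs(v1)
    # Raise the probability of the data, lower it for the sampled visible vectors
    batch = v0.shape[0]
    W.add_(learning_rate * (v0.T @ ph0 - v1.T @ ph1) / batch)
    a.add_(learning_rate * (v0 - v1).mean(0))
    b.add_(learning_rate * (ph0 - ph1).mean(0))
    return ((v0 - pv1) ** 2).mean().item()

errors = [cd1_step(data) for _ in range(1000)]
print("first error:", errors[0])
print("mean of last 50 errors:", sum(errors[-50:]) / 50)
```

Reconstruction error is only a rough progress signal, since it is not the quantity CD actually optimizes. To run more Gibbs steps per update (CD-k), repeat the visible and hidden sampling before computing the negative term.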



Resources:
1. Nice introduction and overview of RBMs: https://www.youtube.com/watch?v=Fkw0_aAtwIw&t=13s
2. For detailed mathematical dive in the future check out the series by this guy: https://www.youtube.com/watch?v=p4Vh_zMw-HQ
3. Lecture by the father himself, 11.5 to 12.5: https://www.youtube.com/watch?v=5jaBneYd5Ig&t=616s