# Restricted Boltzmann Machines for Collaborative Filtering
- Ruslan Salakhutdinov and Andriy Mnih are authors of this paper

## RBM architecture

![](https://viblo.asia/uploads/e416dffb-642d-4a95-ae27-85cb1b57511b.png)

[RBM blog](https://viblo.asia/p/restricted-boltzmann-machine-an-overview-aWj531oQZ6m)



- Visible - v
- Hidden - h
- Can calculate h from v (v -> h)
- Can also calculate v from h! (h -> v)

## Bernoulli RBMs

- Both visible and hidden units can only take on the values 0 and 1
- What's Bernoulli?
    - Binary random variable - only 2 outcomes
    - Coin toss
    - whether or not user clicks on advertisement
    - whether or not user signs up for website
- very useful for web-based applications


## Key calculations

- very intuitive
- How to calculate h from v

$$
\text{vector form} = p(h=1 |v) = \sigma(W^{T}v+c)\\
\text{Scalar form} = p(h_{j} = 1|v) = \sigma(\sum_{i=1}^{D}W_{ij}v_{i}+c_{j}), i = 1...D,j = 1,...M\\
\text{len}(v) = D, \text{len}(h) = M
$$

- going from h to v
- W is a shared weight

$$
\text{vector form} = p(v=1 |h) = \sigma(Wh+b)\\
\text{Scalar form} = p(v_{i} = 1|h) = \sigma(\sum_{j=1}^{M}W_{ij}h_{j}+b_{i}), i = 1...D,j = 1,...M\\
\text{len}(v) = D, \text{len}(h) = M
$$

## Both

- we only get probabilities
- What if I want the actual h?
- There's no fixed value - it's a nonsense question
- It's like a coin- I can flip the coin to grab a sample
- e.g. if $p(h_{j}=1|v) = 0.5$, unit j comes up 1 about half the time
- `sample = np.random.random(p.shape) < p`

$$
p(h=1 |v) = \sigma(W^{T}v+c)\\
p(v=1 |h) = \sigma(Wh+b)
$$
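A minimal NumPy sketch of one v -> h -> v' pass, assuming a small randomly initialized RBM. The sizes, the seed, and the name `v_prime` are made up for illustration; only the two sigmoid equations and the coin-flip sampling come from the notes above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)           # fixed seed, illustration only
D, M = 6, 3                              # hypothetical sizes: len(v) = D, len(h) = M
W = rng.normal(scale=0.1, size=(D, M))   # shared weight matrix
b = np.zeros(D)                          # visible bias
c = np.zeros(M)                          # hidden bias

v = rng.integers(0, 2, size=D).astype(float)      # a binary visible vector

p_h = sigmoid(W.T.dot(v) + c)                     # p(h=1|v), one probability per hidden unit
h = (rng.random(p_h.shape) < p_h).astype(float)   # flip M biased coins to get a sample

p_v = sigmoid(W.dot(h) + b)                       # p(v=1|h), one probability per visible unit
v_prime = (rng.random(p_v.shape) < p_v).astype(float)
```

Running the sampling lines again gives a different `h` and `v_prime` each time, which is the point: there is no single "actual" h, only samples.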

## Relaxing the Bernoulli constraint

- when going from h->v, simply use p(h=1|v) itself as input

$$
\tilde{h} = p(h=1|v) = \sigma(W^{T}v+c)\\
p(v' = 1|h) = \sigma(W\tilde{h}+b)
$$

## Comparison to Autoencoder
- looks familiar

$$
h = \sigma(W^{T}v + c)\\
v' = \sigma(Wh + b)
$$

![](https://image.slidesharecdn.com/ucl-irdm-deeplearning-160429080538/95/deep-learning-55-638.jpg?cb=1461917930)

- v -> h->v' in RBM, the cross entropy (distance between v and v') will go down as we train, even though we don't optimize it

# Motivation Behind RBMs

## Boltzmann Machine
- Energy of a Boltzmann machine
- Goal is to find some equilibrium
- Looks like a neural network equation, has a weight matrix, bias term

$$
E = -(\sum_{i,j}^{}W_{ij}s_{i}s_{j}+ \sum_{i}^{}b_{i}s_{i})
$$

## Training a Boltzmann Machine
- Boltzmann machines are difficult to train
- they only work on trivial examples
- Restricted Boltzmann machines do train well, and scale up to non-trivial problems

$$
G = \sum_{v}^{}P^{+}(v)ln(\frac{P^{+}(v)}{P^{-}(v)})
$$

## Restricted Boltzmann Machine
- Discard any connections between hidden-hidden and visible-visible

![](https://cn.bing.com/th?id=OIP.TWt5Uc1QlbKUoJvObnUzgQHaDi&pid=Api&rs=1)

## Energy of an RBM

$$
E(v,h) = -(\sum_{i=1}^{D}\sum_{j=1}^{M}W_{ij}v_{i}h_{j}+\sum_{i=1}^{D}b_{i}v_{i}+\sum_{j=1}^{M}c_{j}h_{j})\\
E(v,h) = -(v^{T}Wh+b^{T}v+c^{T}h)
$$
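As a sanity check that the scalar and vector forms agree, here is a small sketch with random parameters. The sizes and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 5, 4                                  # hypothetical sizes
W = rng.normal(size=(D, M))
b = rng.normal(size=D)
c = rng.normal(size=M)
v = rng.integers(0, 2, size=D)
h = rng.integers(0, 2, size=M)

def energy_vector(v, h):
    # E(v,h) = -(v^T W h + b^T v + c^T h)
    return -(v @ W @ h + b @ v + c @ h)

def energy_scalar(v, h):
    # same quantity written out as explicit sums
    total = 0.0
    for i in range(D):
        for j in range(M):
            total += W[i, j] * v[i] * h[j]
    total += sum(b[i] * v[i] for i in range(D))
    total += sum(c[j] * h[j] for j in range(M))
    return -total

print(energy_vector(v, h), energy_scalar(v, h))  # the two values should match
```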

## Action plan
- How does the energy function lead us to a probabilistic model?
- How does the probabilistic model lead us to the neural network equations?
    - It's quite remarkable that this energy function could lead us back to the same neural network equations we are already familiar with

## Probability Model
$$
p(v,h) \propto e^{-E(v,h)}\\
p(v,h) = \frac{1}{Z}e^{-E(v,h)}, Z = \sum_{v}^{}\sum_{h}^{}e^{-E(v,h)}\\
\text{so that}: \sum_{v}^{}\sum_{h}^{}p(v,h) = 1
$$

## Why?
- General outline: $p_{i}$ is the probability that a system is in a microstate with energy $E_{i}$
- T = temperature, Z = partition function

$$
p_{i} \propto e^{-E_{i}/(kT)}\\
p_{i} = \frac{1}{Z}e^{-E_{i}/(kT)}, Z = \sum_{i}^{}e^{-E_{i}/(kT)}\\
$$

# Intractability

## Intractability
- What does it mean to sum over v and sum over h?
- v and h are Bernoulli vectors(vectors of 0s and 1s)
- If v has length D and h has length M, the number of total possibilities is
- $2^D$ x $2^M$ = $2^{D+M}$


$$
p(v,h) = \frac{1}{Z}e^{-E(v,h)}, Z = \sum_{v}^{}\sum_{h}^{}e^{-E(v,h)}\\
$$

## Simple calculation
- MNIST contains 28x28 images, so D = 784
- we can choose M, let's pick M = 100 (hidden units)
- what's the total number of possible values of v and h?
- $2^{784} \times 2^{100} = 1.3 \times 10^{266}$
- How big is this number?
- Imagine how much money a billionaire has: 1 billion = 1 x $10^9$
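Python integers have arbitrary precision, so the count above can be checked directly. A small sketch (variable names are illustrative):

```python
D, M = 784, 100
n_states = 2 ** (D + M)              # number of (v, h) pairs to sum over

n_digits = len(str(n_states))        # how many decimal digits the count has
approx = f"{float(n_states):.1e}"    # scientific notation, 2 significant figures

# how many "billions" fit into this number
billions = n_states // 10 ** 9

print(n_digits)    # 267
print(approx)      # 1.3e+266
```

Even after dividing by a billion, the result still has 258 digits.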

## Exercise
- Calculate how much time it would take you to sum over all possible values of v and h
- Compute 100 or 1000 steps, then extrapolate to determine how long it would take in total

```python
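import itertools
import numpy as np

D, M = 4, 3  # tiny sizes so the loops finish; the real case has 2^(D+M) terms
W, b, c = np.random.randn(D, M), np.random.randn(D), np.random.randn(M)
z = 0
for v in itertools.product([0, 1], repeat=D):      # every visible vector
    for h in itertools.product([0, 1], repeat=M):  # every hidden vector
        v_, h_ = np.array(v), np.array(h)
        z += np.exp(v_.dot(W).dot(h_) + b.dot(v_) + c.dot(h_))  # e^{-E(v,h)}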
```

# Neural Network Equations
- How does our probability model lead us to the neural network equations we showed earlier?
- Remarkable that statistical mechanics leads us back to the equations we already know and love
- No wonder that research in this area was strong

$$
E(v,h) = -(v^{T}Wh + b^{T}v + c^Th) \\
p(h=1 |v) = \sigma(W^{T}v+c)\\
p(v=1 |h) = \sigma(Wh+b)
$$

## Bayes rule
- we can start with the basic rules of probability

$$
p(v|h) = p(v,h)/p(h)\\
p(h|v) = p(v,h)/p(v)\\
$$

## Marginals
- we can calculate the denominator from the numerator
$$
p(v) = \sum_{h}^{}p(v,h),p(h) = \sum_{v}^{}p(v,h)
$$

## Plug in what we know

$$
p(v,h) = \frac{1}{Z}exp(v^{T}Wh + b^{T}v + c^Th)\\
p(v) = \sum_{h}^{}\frac{1}{Z}exp(v^{T}Wh + b^{T}v + c^Th)\\
p(h|v) = \frac{exp(v^{T}Wh + b^{T}v + c^Th)}{\sum_{h}^{}exp(v^{T}Wh + b^{T}v + c^Th)}
$$

## Simplify
- The denominator is simply another normalizing constant

$$
p(h|v) = \frac{1}{z'}exp(v^{T}Wh + b^{T}v + c^Th)
$$

## Write it in scalar form

$$
p(h|v) = \frac{1}{z'}exp(\sum_{i=1}^{D}\sum_{j=1}^{M}W_{ij}v_{i}h_{j}+\sum_{i=1}^{D}b_{i}v_{i}+\sum_{j=1}^{M}c_{j}h_{j})
$$

## Exponent rule
- exp(A+B) = exp(A)exp(B)

$$
p(h|v) = \frac{1}{z'}exp(\sum_{i=1}^{D}b_{i}v_{i})\prod_{j=1}^{M}exp(\sum_{i=1}^{D}W_{ij}v_{i}h_{j}+c_{j}h_{j})
$$

## Absorb Normalizing Constant
- Anything that doesn't depend on h can be absorbed
- What remains is a product with one factor per hidden unit $h_{j}$, which is what makes the hidden units independent given v

$$
p(h|v) = \frac{1}{Z''}\prod_{j=1}^{M}exp(\sum_{i=1}^{D}W_{ij}v_{i}h_{j}+c_{j}h_{j})
$$

- Factor out $h_{j}$

$$
p(h|v) =  \frac{1}{Z''}\prod_{j=1}^{M}exp(h_{j}\bigg\{\sum_{i=1}^{D}W_{ij}v_{i}+c_{j} \bigg\})
$$

## Independence
- Key point: if A and B are independent, then p(A,B) = p(A)p(B)  
- we can see that p(h|v) factors into a product where each $p(h_{j}|v)$ is independent of the others - let's just look at a single one  
- __Makes sense graphically - hidden units can't connect to other hidden units__
$$
p(h_{j}|v) =  \frac{1}{Z'''}exp(h_{j}\bigg\{\sum_{i=1}^{D}W_{ij}v_{i}+c_{j} \bigg\})
$$

## It's Bernoulli

- $h_{j}$ can only ever be 0 or 1
- Let's just plug it in 

$$
p(h_{j}= 1|v) = \frac{1}{Z'''}exp(\bigg\{\sum_{i=1}^{D}W_{ij}v_{i}+c_{j} \bigg\})\\
p(h_{j}= 0|v) = \frac{1}{Z'''}exp(0) = \frac{1}{Z'''}\\
$$

- 0 and 1 are the only possibilities, therefore, they must sum to 1
- we have found the normalizing constant Z'''  

$$
p(h_{j}=1|v) + p(h_{j}=0|v) = 1\\
\frac{1}{Z'''}exp(\sum_{i=1}^{D}W_{ij}v_{i}+c_{j} ) + \frac{1}{Z'''} = 1\\
Z''' = 1 + exp(\sum_{i=1}^{D}W_{ij}v_{i}+c_{j})
$$

## Plug it back in
- we arrive at our usual neural network equation

$$
p(h_{j}=1 |v) = \frac{exp(\sum_{i=1}^{D}W_{ij}v_{i}+c_{j})}{1 + exp(\sum_{i=1}^{D}W_{ij}v_{i}+c_{j})} \\
 = \sigma(\sum_{i=1}^{D}W_{ij}v_{i}+c_{j})
$$

## Full vector form
- Remember, this abuses notation a little bit
- It's not a probability, but __`a vector of probabilities`__

$$
p(h=1|v) = \sigma(W^Tv +c)
$$
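The derivation can be checked numerically on a tiny RBM: enumerate every h, normalize $e^{-E(v,h)}$ to get p(h|v), and compare the marginals with the sigmoid formula. This is only a sketch with made-up sizes and random parameters.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
D, M = 4, 3                                   # small enough to enumerate 2^M hidden vectors
W = rng.normal(size=(D, M))
b = rng.normal(size=D)
c = rng.normal(size=M)
v = rng.integers(0, 2, size=D).astype(float)

# all 2^M hidden configurations, shape (2^M, M)
hs = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)

# exp(v^T W h + b^T v + c^T h) for every h, then normalize over h
unnorm = np.exp(hs @ (W.T @ v) + b @ v + hs @ c)
p_h_given_v = unnorm / unnorm.sum()

# p(h_j = 1 | v): add up the probability of every h that has h_j = 1
brute = hs.T @ p_h_given_v
closed_form = sigmoid(W.T @ v + c)

print(brute)
print(closed_form)   # should agree with the line above
```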

# Training an RBM(part 1)

## How to train an RBM
- More complicated than usual(not just plain gradient descent)
- Intuitively, we know some quantities are intractable to calculate
- start by discussing what we'd like to do
- end with how we approximate it instead


## Maximum Likelihood Estimation
- we want W,b,c, to maximize the likelihood
- can take the log since log is monotonically increasing

$$
\text{Maximize }p(v) \text{ wrt } W,b,c \\
W,b,c = argmax_{W,b,c}p(v;W,b,c)\\
W,b,c = argmax_{W,b,c}logp(v;W,b,c)
$$

## Why does maximum likelihood make sense?
- we measure the heights of all students in our class
    - $x_{1},x_{2},...,x_{n}$
- model it as Gaussian
    - find $\mu,\sigma^2$
- as usual, solve by setting derivative to 0    

$$
\mu,\sigma^2 = argmax_{\mu,\sigma^2}log\prod_{i=1}^{N}p(x_{i};\mu,\sigma^2)\\
\text{solution}: \mu = \frac{1}{N}\sum_{i=1}^{N}x_{i},\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^2
$$
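A short sketch of the Gaussian case, using synthetic "heights" (the numbers 170 and 8 are made up). It computes the closed-form solution and checks that nudging the parameters lowers the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=170.0, scale=8.0, size=1000)   # hypothetical heights in cm

# closed-form maximum likelihood estimates
mu_hat = x.sum() / len(x)
var_hat = ((x - mu_hat) ** 2).sum() / len(x)      # divides by N, not N - 1

def log_likelihood(mu, var):
    # log of prod_i N(x_i; mu, var)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

best = log_likelihood(mu_hat, var_hat)
shifted_mean = log_likelihood(mu_hat + 1.0, var_hat)
inflated_var = log_likelihood(mu_hat, var_hat * 1.2)
```

Both `shifted_mean` and `inflated_var` come out lower than `best`, as expected at a maximum.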

## Another Example
- Gaussian Mixture Model(GMM)
- Used when distribution is multimodal
- multiple humps in histogram

![](https://www.researchgate.net/profile/Gregory_Valiant/publication/220427369/figure/fig1/AS:668920864309268@1536494570651/The-Gaussian-approximations-of-the-heights-of-adult-women-red-and-men-blue-Can-one.png)


## GMM(Gaussian Mixture Model)
- need to introduce a hidden variable $h_{j}$ to tell us which Gaussian a datapoint belongs to
- Assuming we have M Gaussians, the distribution of x is 

$$
p(x) = \sum_{j=1}^{M}p(h_{j})p(x|h_{j}) = \sum_{j=1}^{M}\pi_{j}\frac{1}{\sqrt{2\pi\sigma^2}}exp(\frac{-(x-\mu_{j})^2}{2\sigma^2})
$$

- $\pi_{j}$ is weight 


- 2 sets of variables: x(observed), h(unobserved)
- we still want to maximize p(x), wrt, $\pi,\mu,\sigma^2$
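To make the mixture density concrete, here is a sketch that evaluates p(x) for M = 2 components with a shared variance, as in the formula above. All parameter values are invented for illustration.

```python
import numpy as np

pi = np.array([0.5, 0.5])         # mixing weights p(h_j), must sum to 1
mu = np.array([162.0, 176.0])     # hypothetical component means
sigma2 = 49.0                     # shared variance

def gmm_pdf(x):
    # add a trailing axis so x broadcasts against the M component means
    x = np.asarray(x, dtype=float)[..., None]
    comp = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return comp @ pi              # sum_j pi_j * N(x; mu_j, sigma2)

# a density should integrate to 1; check with a simple Riemann sum
grid = np.linspace(100.0, 240.0, 2001)
area = np.sum(gmm_pdf(grid)) * (grid[1] - grid[0])
```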

## Back to RBM
- Hopefully you're convinced it makes sense to maximize p(v)(or equivalently logp(v))
- If we could build up this expression, Theano or Tensorflow would automatically find the gradient and we would be done
- Problem: building up this expression is intractable

$$
p(v) = \sum_{h}^{}\frac{1}{Z}exp(\bigg\{v^{T}Wh + b^{T}v + c^Th\bigg \})\\
Z = \sum_{v}^{}\sum_{h}^{}exp(\bigg\{v^{T}Wh + b^{T}v + c^Th\bigg \})
$$

## Free Energy
- What should we do instead? Let's introduce a new quantity called free energy
- Totally not obvious why it's useful, but you'll see

$$
F(v) = -log \sum_{h}^{}e^{-E(v,h)}
$$

$$
E(v,h) = -(\sum_{i=1}^{D}\sum_{j=1}^{M}W_{ij}v_{i}h_{j}+\sum_{i=1}^{D}b_{i}v_{i}+\sum_{j=1}^{M}c_{j}h_{j})\\
E(v,h) = -(v^{T}Wh+b^{T}v+c^{T}h)
$$

## Common Term
- Can this allow us to write p(v) in terms of F(v)?
- Exercise: 

$$
F(v) = -log \sum_{h}^{}e^{-E(v,h)}\\
p(v) = \frac{1}{Z}\sum_{h}^{}e^{-E(v,h)}
$$

## Manipulate F(v)

- take the negative of both sides

$$
- F(v) = log \sum_{h}^{}e^{-E(v,h)}\\
$$

- Exponentiate both sides
$$
e^{-F(v)} = \sum_{h}^{}e^{-E(v,h)}, \text{since }e^{log(x)} = x \\
$$

$$
p(v) = \frac{1}{Z}e^{-F(v)}
$$

## Redefine Z

$$
Z = \sum_{v}^{}e^{-F(v)}
$$

## Gradient Descent
- Let's pretend p(v) is tractable and that gradient descent is possible
- What will the update look like?
- Find derivative wrt arbitrary parameter: $W_{ij},b_{i},c_{j}$ - doesn't matter

$$
\frac{\partial{logp(v)}}{\partial \theta} = \frac{\partial}{\partial{\theta}}(log\bigg\{ \frac{e^{-F(v)}}{Z} \bigg\})
$$

- since log(A/B) = log(A) - log(B)

$$
\frac{\partial{logp(v)}}{\partial{\theta}} = \frac{\partial}{\partial\theta}(-F(v)- logZ)\\
= -\frac{\partial{F(v)}}{\partial\theta} - \frac{\partial}{\partial\theta}logZ
$$

- use the log rule

$$
\frac{\partial{logp(v)}}{\partial{\theta}} = -\frac{\partial{F(v)}}{\partial\theta} - \frac{\partial}{\partial\theta}logZ\\
= -\frac{\partial{F(v)}}{\partial\theta} - \frac{1}{Z}\frac{\partial}{\partial\theta}Z
$$

$$
-\frac{\partial{F(v)}}{\partial\theta} - \frac{\partial}{\partial\theta}logZ\\
= -\frac{\partial{F(v)}}{\partial\theta} - \frac{1}{Z}\frac{\partial}{\partial\theta}\sum_{v'}^{}e^{-F(v')}
$$

- move derivative inside the sum

$$
= -\frac{\partial{F(v)}}{\partial\theta} - \frac{1}{Z}\sum_{v'}^{}\frac{\partial}{\partial\theta}e^{-F(v')}
$$

- use exponent rule and chain rule

$$
= -\frac{\partial{F(v)}}{\partial\theta} - \frac{1}{Z}\sum_{v'}^{}e^{-F(v')}\frac{\partial{(-F(v'))}}{\partial\theta}
$$


- move Z back into summation


$$
= -\frac{\partial{F(v)}}{\partial\theta} - \sum_{v'}^{}\frac{1}{Z}e^{-F(v')}\frac{\partial{(-F(v'))}}{\partial\theta}
$$


$$
\frac{\partial{logp(v)}}{\partial{\theta}} = -\frac{\partial{F(v)}}{\partial{\theta}}+ \sum_{v'}^{}p(v')\frac{\partial{F(v')}}{\partial{\theta}}
$$

- Flip all the signs (we want something to minimize rather than maximize)
- gradient descent rather than gradient ascent
- maximize p(v) is equivalent to minimizing -logp(v)
    - both mean squared error and cross entropy are negative log likelihoods

$$
-\frac{\partial{logp(v)}}{\partial{\theta}} = \frac{\partial{F(v)}}{\partial{\theta}}-\sum_{v'}^{}p(v')\frac{\partial{F(v')}}{\partial{\theta}}
$$

## Interpretation
- well known result for energy based models such as RBMs
- first term makes seeing v (the observed feature vector) more likely
- second term makes seeing every possible v' less likely (weighted by p(v'))
- Imagine: bad model makes v'(never observed) very likely - p(v') is high
    - This gradient update exactly corrects it


$$
\text{first term: positive phase :} \frac{\partial{F(v)}}{\partial{\theta}}\\
\text{second term: negative phase :} -\sum_{v'}^{}p(v')\frac{\partial{F(v')}}{\partial{\theta}}
$$
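On a tiny RBM the sum over v' is small enough to compute exactly, so the two-phase formula can be compared against a finite-difference gradient. The sketch below does this for the visible bias b, where $\partial F(v)/\partial b_{i} = -v_{i}$ because $b^{T}v$ does not depend on h. Sizes, seed, and helper names are assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
D, M = 4, 3
W = rng.normal(size=(D, M))
b = rng.normal(size=D)
c = rng.normal(size=M)

all_v = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)  # (2^D, D)
all_h = np.array(list(itertools.product([0, 1], repeat=M)), dtype=float)  # (2^M, M)

def free_energy(V, b):
    # F(v) = -log sum_h exp(v^T W h + b^T v + c^T h), brute force over h
    neg_E = (V @ W) @ all_h.T + (V @ b)[:, None] + all_h @ c   # (n_v, 2^M)
    return -np.log(np.exp(neg_E).sum(axis=1))

def neg_log_p(v, b):
    # -log p(v) = F(v) + log Z, with Z = sum_v' exp(-F(v'))
    log_Z = np.log(np.exp(-free_energy(all_v, b)).sum())
    return free_energy(v[None, :], b)[0] + log_Z

v = all_v[5]                                  # pick one "observed" vector
p_all = np.exp(-free_energy(all_v, b))
p_all /= p_all.sum()                          # p(v') for every possible v'

# positive phase (-v) minus negative phase (sum_v' p(v') * (-v'))
analytic = -v + p_all @ all_v

# central finite differences on -log p(v) with respect to each b_i
eps = 1e-6
numeric = np.array([
    (neg_log_p(v, b + eps * np.eye(D)[i]) - neg_log_p(v, b - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])
```

The negative phase here is just the model's expected visible vector, which previews the expected-value view in the next part.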

# Training an RBM (part 2)


$$
-\frac{\partial{logp(v)}}{\partial{\theta}} = \frac{\partial{F(v)}}{\partial{\theta}}-\sum_{v'}^{}p(v')\frac{\partial{F(v')}}{\partial{\theta}}
$$

## Expected Value
- 2nd term is exactly an expected value, Recall the general definition of expected value

$$
E(f(x)) = \sum_{x}^{}p(x)f(x)\\
-\frac{\partial{logp(v)}}{\partial{\theta}} = \frac{\partial{F(v)}}{\partial{\theta}}-E\bigg\{\frac{\partial{F(v')}}{\partial{\theta}}\bigg\}
$$

## Approximating expected value
- Expected value is a mean
- Approximate means using sample means
- How do we generate these samples?

$$
E(f(x)) \approx \frac{1}{N}\sum_{n}^{}f(x_{n})
$$
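A quick illustration of approximating an expected value with a sample mean, using a fair die and $f(x) = x^2$ (an arbitrary choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
values = np.arange(1, 7)            # faces of a fair die
probs = np.full(6, 1 / 6)

def f(x):
    return x ** 2

exact = np.sum(probs * f(values))                   # sum_x p(x) f(x) = 91/6
samples = rng.choice(values, size=100_000, p=probs)
approx = f(samples).mean()                          # (1/N) sum_n f(x_n)
```

For the die we can sample directly. For an RBM we cannot sample from p(v') directly, which is why the next section uses Gibbs sampling.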

## How to sample
- just go one round and only use that one sample
- v ->p(h|v) -> h ~p(h|v) -> p(v'|h)->v' ~p(v'|h)

$$
v \rightarrow h \rightarrow v'\\
-\frac{\partial{logp(v)}}{\partial{\theta}} \approx \frac{\partial{F(v)}}{\partial{\theta}}-\frac{\partial{F(v')}}{\partial{\theta}}
$$

## Contrastive Divergence
- Abbreviated as CD-k, to mean k steps of Gibbs sampling were performed to get v'
- we will do CD-1(you can try more and see if your results improve)
- one epoch  

```python
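# pseudocode for one epoch of CD-1; sample_from, grad and F are placeholders
for v in dataset:
    p_h = sigmoid(W.T.dot(v) + c)          # p(h=1|v)
    h = sample_from(p_h)
    p_v_prime = sigmoid(W.dot(h) + b)      # p(v'=1|h)
    v_prime = sample_from(p_v_prime)
    param = param - learning_rate * (grad(F(v)) - grad(F(v_prime)))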
```
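A runnable NumPy sketch of CD-1 for a Bernoulli RBM, under some assumptions: it uses the gradients of the closed-form free energy derived later in these notes ($\partial F/\partial W = -v\,p(h=1|v)^{T}$, $\partial F/\partial b = -v$, $\partial F/\partial c = -p(h=1|v)$), and the toy dataset, sizes, learning rate, and the function name `cd1_epoch` are all made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_epoch(data, W, b, c, lr, rng):
    """One pass of CD-1 over data (n_samples x D, binary). Updates W, b, c in place."""
    for v in data:
        p_h = sigmoid(W.T @ v + c)                          # p(h=1|v)
        h = (rng.random(p_h.shape) < p_h).astype(float)     # h ~ p(h|v)
        p_v = sigmoid(W @ h + b)                            # p(v'=1|h)
        v_prime = (rng.random(p_v.shape) < p_v).astype(float)
        p_h_prime = sigmoid(W.T @ v_prime + c)              # p(h=1|v')

        # param -= lr * (grad F(v) - grad F(v')), with the gradients written out
        W += lr * (np.outer(v, p_h) - np.outer(v_prime, p_h_prime))
        b += lr * (v - v_prime)
        c += lr * (p_h - p_h_prime)

rng = np.random.default_rng(6)
D, M = 6, 4
# toy dataset: two binary patterns repeated 20 times each
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 20, dtype=float)
W = rng.normal(scale=0.01, size=(D, M))
b = np.zeros(D)
c = np.zeros(M)

for epoch in range(50):
    cd1_epoch(data, W, b, c, lr=0.1, rng=rng)
```

With an autodiff library the three update lines are replaced by a single gradient step on F(v) - F(v'), which is what the next sections set up.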

## Theano and Tensorflow
- gradients will be calculated 
- in other words, our actual function will just be L = F(v)- F(v')



## __`Fake loss`__

- The gradient of our loss is approximated by the gradient of L
$$
\frac{\partial{-logp(v)}}{\partial{\theta}} \approx\frac{\partial L}{\partial\theta}\approx \frac{\partial{F(v)}}{\partial{\theta}}-\frac{\partial{F(v')}}{\partial{\theta}}
$$

- we don't express gradients explicitly in Theano or Tensorflow, therefore, we only want L itself

$$
\int{\frac{\partial L}{\partial\theta}d\theta } = \int{\frac{\partial F(v)}{\partial \theta}}d\theta - \int{\frac{\partial F(v')}{\partial\theta}}d\theta\\
L = F(v) - F(v')
$$

## But wait
- our free energy expression still appears to be intractable


$$
F(v) = -log \sum_{h}^{}e^{-E(v,h)}\\
$$

## Free Energy
- How can we calculate F(v) in a way that's not intractable
- Last theoretical lecture before you're able to write an RBM in code


$$
F(v) = -log \sum_{h}^{}e^{v^{T}Wh+b^Tv+c^Th}
$$

- $b^Tv$ term doesn't depend on h, move it outside

$$
F(v) = -log \bigg\{ e^{b^Tv} \sum_{h}^{}e^{v^{T}Wh+c^Th} \bigg\}
$$

- use the log(AB) = log(A) + log(B)

$$
F(v) = -b^Tv -log \sum_{h}^{}e^{v^{T}Wh+c^Th}
$$

- unvectorize h

$$
F(v) = -b^Tv -log \sum_{h}^{}exp(\sum_{j=1}^{M}v^{T}W_{:,j}h_{j}+c_{j}h_{j})
$$

- Bring summation outside exponent
- Factor out $h_{j}$

$$
F(v) = -b^Tv -log \sum_{h}^{}\prod_{j=1}^{M}exp h_{j}(v^{T}W_{:,j}+c_{j})
$$

- What terms actually need to be summed?
- If M = 2
    - possible values of h ={00,01,10,11}
- If M = 3
    - possible values of h ={000,001,010,011,100,101,110,111}
    
- as you know, this is exponential: $2^M$ terms
    

- simplified version

$$
F(v) = -b^Tv -log \sum_{h}^{}\prod_{j=1}^{M}exp h_{j}(v^{T}W_{:,j}+c_{j})\\
\sum_{h}^{}\prod_{j=1}^{M}exp h_{j}(v^{T}W_{:,j}+c_{j}) = \sum_{h}^{}\prod_{j=1}^{M}e^{h_{j}u_{j}}, \quad u_{j} = v^{T}W_{:,j}+c_{j}
$$

- Let's plug in the values of h for M = 2

$$
\sum_{h}^{}\prod_{j=1}^{M}exp h_{j}(v^{T}W_{:,j}+c_{j}) = \sum_{h}^{}\prod_{j=1}^{M}e^{h_{j}u_{j}}\\
e^{0u_{1}}e^{0u_{2}} + e^{0u_{1}}e^{1u_{2}}+ e^{1u_{1}}e^{0u_{2}}+e^{1u_{1}}e^{1u_{2}}
$$


- Factorization

$$
e^{0u_{1}}e^{0u_{2}} + e^{0u_{1}}e^{1u_{2}}+ e^{1u_{1}}e^{0u_{2}}+e^{1u_{1}}e^{1u_{2}}
= (e^{0u_{1}}+e^{1u_{1}})(e^{0u_{2}}+e^{1u_{2}})
$$

- M = 3

$$
= (e^{0u_{1}}+e^{1u_{1}})(e^{0u_{2}}+e^{1u_{2}})(e^{0u_{3}}+e^{1u_{3}})
$$

- Can we generalize this pattern for any value of M? sure
- This is going by quite fast, so work through the M = 2 case by hand if it is unclear
- more generally known as the sum-product rule and appears when working with Bayesian networks

$$
\sum_{h}^{}\prod_{j=1}^{M}e^{h_{j}u_{j}} = \prod_{j=1}^{M}\sum_{h_{j}=\big\{ 0,1\big\}}^{}e^{h_{j}u_{j}}
$$
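The identity is easy to check numerically for a small M. In this sketch the $u_{j}$ are just random numbers; the left side has $2^M$ terms, the right side has M factors.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
M = 5
u = rng.normal(size=M)          # stand-ins for u_j

# left side: one term per hidden configuration h in {0,1}^M
lhs = sum(np.exp(np.dot(h, u)) for h in itertools.product([0, 1], repeat=M))

# right side: each factor is e^{0 * u_j} + e^{1 * u_j} = 1 + e^{u_j}
rhs = np.prod(1.0 + np.exp(u))

print(lhs, rhs)                 # the two values should match
```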

- Plug our result back in

$$
F(v) = -b^Tv -log \sum_{h}^{}\prod_{j=1}^{M}exp h_{j}(v^{T}W_{:,j}+c_{j})\\
F(v) = -b^Tv -log \prod_{j=1}^{M}\sum_{h_{j}=\big\{ 0,1\big\}}^{} exp h_{j}(v^TW_{:,j}+c_{j})
$$

- apply log rule log(AB) = log(A) + log(B)

$$
F(v) = -b^Tv -log \prod_{j=1}^{M}\sum_{h_{j}=\big\{ 0,1\big\}}^{} exp h_{j}(v^TW_{:,j}+c_{j})\\ 
F(v) = -b^Tv - \sum_{j=1}^{M}log\sum_{h_{j}= \big\{ 0,1\big\}}^{}exp h_{j}(v^TW_{:,j}+c_{j})
$$

- $h_{j}$ can only take on 2 values - 0 and 1
- Result is linear in M rather than exponential

$$
F(v) = -b^Tv - \sum_{j=1}^{M}log \big\{ 1+exp (v^TW_{:,j}+c_{j}) \big\}
$$
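A sketch comparing the closed form against the original definition $F(v) = -log\sum_{h}e^{-E(v,h)}$ on a small random RBM. Sizes and function names are illustrative; `np.logaddexp(0, x)` is used as a numerically stable `log(1 + exp(x))`.

```python
import itertools
import numpy as np

rng = np.random.default_rng(8)
D, M = 5, 4
W = rng.normal(size=(D, M))
b = rng.normal(size=D)
c = rng.normal(size=M)
v = rng.integers(0, 2, size=D).astype(float)

def free_energy_brute(v):
    # sum over all 2^M hidden vectors: exponential in M
    total = 0.0
    for h in itertools.product([0, 1], repeat=M):
        h = np.array(h, dtype=float)
        total += np.exp(v @ W @ h + b @ v + c @ h)   # e^{-E(v,h)}
    return -np.log(total)

def free_energy_closed(v):
    # -b^T v - sum_j log(1 + exp(v^T W[:, j] + c_j)): linear in M
    return -b @ v - np.logaddexp(0.0, v @ W + c).sum()

print(free_energy_brute(v), free_energy_closed(v))   # should match
```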

# Categorical RBM for recommender system Ratings

## Expanding Bernoulli RBM
- Why won't Bernoulli work?
- Bernoulli must be 0 or 1 (or, when relaxed, between 0 and 1), but a rating can take more than two values


## Categorical RBM
- Visible units represent a K-class categorical distribution
- Ratings go from 0.5 -> 5 = 10 categories

![](https://media.springernature.com/lw785/springer-static/image/chp%3A10.1007%2F978-3-319-73317-3_45/MediaObjects/462481_1_En_45_Fig1_HTML.gif)
[RBM paper](https://link.springer.com/chapter/10.1007/978-3-319-73317-3_45)

## New Neural Network Equations
- Hidden units remain as Bernoulli

$$
p(h_{j}=1|v) = \sigma(\sum_{k=1}^{K}\sum_{i=1}^{D}W_{ij}^{k}v_{i}^{k}+c_{j})\\
p(v_{i}^{k}=1|h) = softmax(\sum_{j=1}^{M}W_{ij}^{k}h_{j}+b_{i}^{k})\\
\text{where: } softmax(\alpha_{i}^{k}) = \frac{exp(\alpha_{i}^{k})}{\sum_{k'=1}^{K}exp(\alpha_{i}^{k'})}
$$
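A NumPy sketch of the two categorical-RBM conditionals for a single user, using `np.einsum` for the sums. The sizes, the seed, and the one-hot construction via `np.eye` are assumptions for illustration; the shapes follow the D x K x M convention used in the rest of the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(9)
D, K, M = 4, 5, 3                              # movies, rating classes, hidden units
W = rng.normal(scale=0.1, size=(D, K, M))
b = np.zeros((D, K))
c = np.zeros(M)

# one-hot visible units: movie i has rating class ratings[i]
ratings = rng.integers(0, K, size=D)
v = np.eye(K)[ratings]                         # shape D x K

# p(h_j = 1 | v): sum over both i and k
p_h = sigmoid(np.einsum('dk,dkm->m', v, W) + c)
h = (rng.random(M) < p_h).astype(float)

# p(v_i^k = 1 | h): sum over j, then softmax over the K classes of each movie
p_v = softmax(np.einsum('dkm,m->dk', W, h) + b, axis=1)
```

Each row of `p_v` is a distribution over the K rating classes for one movie.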

## Shapes
- very useful for thinking about how data is structured/implementation
- h = vector of size M
- v = 2-D array of size D x K
- W = 3-D array of size D x K x M
- b = 2-D array of size D x K
- c = vector of size M


## Vectorization
- can you vectorize these equations?
- not quite a dot product or matrix multiply because we sum over two dimensions
- like a double dot product
- Let's use the same notation, $W^{T}v$ and $b^{T}v$, but remember it's actually a double sum

$$
\sum_{k=1}^{K}\sum_{i=1}^{D}W_{ij}^{k}v_{i}^{k}\\
\sum_{k=1}^{K}\sum_{i=1}^{D}b_{i}^{k}v_{i}^{k}\\
$$
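One way to vectorize the double sum in NumPy is `np.tensordot` with two contracted axes. The sketch below checks it against explicit loops; sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
D, K, M = 4, 5, 3
W = rng.normal(size=(D, K, M))
b = rng.normal(size=(D, K))
v = rng.normal(size=(D, K))

# explicit double sum for "W^T v": one number per hidden unit j
loop = np.zeros(M)
for j in range(M):
    for i in range(D):
        for k in range(K):
            loop[j] += W[i, k, j] * v[i, k]

# contract v's axes (0, 1) with W's axes (0, 1), leaving shape (M,)
Wv = np.tensordot(v, W, axes=([0, 1], [0, 1]))

# "b^T v" contracts both axes and leaves a scalar
bv = np.sum(b * v)
bv_loop = sum(b[i, k] * v[i, k] for i in range(D) for k in range(K))
```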

## Naming conventions
- normally with recommender systems
    - N = number of users
    - M = number of movies
    - K = latent dimension
- with RBMs - too many dimensions to represent
- we'll stick to more typical deep learning conventions
    - D = input dimensionality = number of movies
    - K = number of classes = number of possible ratings
    - M = number of hidden units
    - N = number of samples = number of users

## Training
- As usual, now that we have our model, we want to know how to train it 
- Start with energy again
- Remember, this looks the same, but it's not the same because implicitly there is a double dot product

$$
E(v,h) = -\bigg\{v^{T}Wh + b^{T}v + c^{T}h \bigg\}
$$

## Free Energy
- E(v,h) is just a scalar so this definition doesn't change

$$
F(v) = -log \sum_{h}^{}e^{-E(v,h)}\\
$$

- Plug in E

$$
F(v) = -log \sum_{h}^{}e^{v^{T}Wh + b^{T}v + c^{T}h }\\
$$

- Each additive term is a scalar, so we can still move it outside the sum like before

$$
F(v) = -log \bigg\{ e^{b^{T}v}\sum_{h}^{}e^{v^{T}Wh + c^{T}h } \bigg\}\\
$$

- Bring it outside the log

$$
F(v) = -b^{T}v-log\sum_{h}^{}e^{v^{T}Wh + c^{T}h }\\
$$

- unvectorize h
- W has two colons ($W_{:,:,j}$) because it's 3-D

$$
F(v) = -b^{T}v-log\sum_{h}^{}exp(\sum_{j=1}^{M}v^{T}W_{:,:,j}h_{j} + c_{j}h_{j})\\
F(v) = -b^{T}v-log\sum_{h}^{}exp(\sum_{j=1}^{M}h_{j}\bigg\{ v^{T}W_{:,:,j} + c_{j} \bigg\})\\
$$

- Everything being multiplied by $h_{j}$ is a scalar- let's just call it $u_{j}$

$$
F(v) = -b^{T}v-log\sum_{h}^{}exp(\sum_{j=1}^{M}h_{j}u_{j})\\
$$

- This is the same logic we applied before, therefore, we can arrive at the same conclusion

$$
F(v) = -b^{T}v-log\sum_{h}^{}exp(\sum_{j=1}^{M}h_{j}u_{j})\\
F(v) = -b^{T}v-\sum_{j=1}^{M}log\bigg\{ 1+exp(v^{T}W_{:,:,j}+c_{j}) \bigg\}
$$

## Tensorflow
- cannot use tf.matmul - it multiplies 2-D matrices (or batches of them) and contracts a single axis
- here we need to contract two axes at once (D and K) between a 3-D array and another array
- must write your own double dot function using tf.tensordot

## Training
- we've concluded that F(v) has the same form, therefore, training will generally be the same
- but how do we deal with missing ratings?
- v(i,k) = 1 if user rates movie i the value k
    - whatever rating class k corresponds to 
    

## Missing Ratings

- consider a length -10 vector v(i) 
- If user rates movie i a 0.5
    - v(i) = [1,0,0,0,0,0,0,0,0,0]
- If user rates movie i a 5
    - v(i) = [0,0,0,0,0,0,0,0,0,1]
- If rating is missing
    - v(i) = [0,0,0,0,0,0,0,0,0,0] (all zeros)
    

- our loss L = F(v) - F(v')
- we have to mask v', so that if user did not rate movie i, then
    - v(i) = [0,0,0,0,0,0,0,0,0,0](all zeros)
    - v'(i) = [0,0,0,0,0,0,0,0,0,0](all zeros)
    - L(i) = F(v(i)) - F(v'(i)) = 0
- Difference is 0, gradient is 0, no changes made to parameters based on this movie    
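A small sketch of the masking step for three movies and K = 10. The sampled `v_prime` is invented for illustration, and building the mask from the row sums of `v` is just one convenient way to do it.

```python
import numpy as np

K = 10
# three movies: rated 0.5, rated 5, and unrated (all zeros)
v = np.zeros((3, K))
v[0, 0] = 1
v[1, 9] = 1

# mask row is all ones if the movie was rated, all zeros otherwise
mask = np.repeat(v.sum(axis=1, keepdims=True), K, axis=1)

# a made-up reconstruction sample: it assigns a rating class to every movie
v_prime = np.eye(K)[[2, 9, 4]]

# zero out the rows for movies the user never rated
v_prime_masked = v_prime * mask
```

After masking, the third row of `v_prime_masked` is all zeros, the same as the third row of `v`.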



## Making Predictions
- normally, when we have softmax we just take the argmax to get the prediction
- y = np.argmax(probs)
- In this paper, they discuss using the weighted average
- Example (pretend K = 5)
- Possible ratings = [1,2,3,4,5]
- Probabilities = [0.05,0.1,0.2,0.5,0.15]
- Prediction = $0.05\cdot 1 + 0.1\cdot2 + 0.2\cdot3 + 0.5\cdot4 + 0.15\cdot 5 = 3.6$
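The same example in NumPy, showing both the argmax prediction and the weighted average (variable names are illustrative):

```python
import numpy as np

ratings = np.array([1, 2, 3, 4, 5])
probs = np.array([0.05, 0.1, 0.2, 0.5, 0.15])

argmax_prediction = ratings[np.argmax(probs)]   # most likely class: 4
expected_prediction = probs @ ratings           # weighted average, about 3.6
```

The weighted average can land between rating classes, which argmax cannot do.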


# RBM code

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

import pandas as pd
from scipy.sparse import lil_matrix, csr_matrix, save_npz, load_npz
from datetime import datetime
```

- D = input dimensionality = number of movies
- K = number of classes = number of possible ratings
- M = number of hidden units
- N = number of samples = number of users


```python
def one_hot_encode(X, K):
    # input is N x D
    # output is N x D x K
    N, D = X.shape
    Y = np.zeros((N, D, K))
    for n, d in zip(*X.nonzero()):
        # 0.5...5 --> 1..10 --> 0..9
        k = int(X[n,d]*2 - 1)
        Y[n,d,k] = 1
    return Y

# Masking examples (K = 5):
#   v'(i) * mask = [1,0,0,0,0] * [1,1,1,1,1] = [1,0,0,0,0]  (movie was rated)
#   v'(i) * mask = [0,0,1,0,0] * [0,0,0,0,0] = [0,0,0,0,0]  (rating is missing)

def one_hot_mask(X, K):
    # input is N x D
    # output is N x D x K
    N, D = X.shape
    Y = np.zeros((N, D, K))
    # if X[n,d] == 0, there's a missing rating
    # so the mask should be all zeros
    # else, it should be all ones
    for n, d in zip(*X.nonzero()):
        Y[n,d,:] = 1
    return Y

one_to_ten = np.arange(10) + 1 # [1, 2, 3, ..., 10]
```

- Example(pretend k= 5)
- Possible ratings = [1,2,3,4,5]
- Probabilities = [0.05,0.1,0.2,0.5,0.15]
- Prediction = $0.05\cdot 1 + 0.1\cdot2 + 0.2\cdot3 + 0.5\cdot4 + 0.15\cdot 5 = 3.6$

```python
def convert_probs_to_ratings(probs):
    # probs is N x D x K
    # output is N x D matrix of predicted ratings
    # N, D, K = probs.shape
    # out = np.zeros((N, D))
    # each predicted rating is a weighted average using the probabilities
    # for n in range(N):
    #     for d in range(D):
    #         out[n,d] = probs[n,d].dot(one_to_ten) / 2
    # return out
    return probs.dot(one_to_ten) / 2

```




$$
p(h_{j}=1|v) = \sigma(\sum_{k=1}^{K}\sum_{i=1}^{D}W_{ij}^{k}v_{i}^{k}+c_{j})\\
p(v_{i}^{k}=1|h) = softmax(\sum_{j=1}^{M}W_{ij}^{k}h_{j}+b_{i}^{k})\\
\text{where: } softmax(\alpha_{i}^{k}) = \frac{exp(\alpha_{i}^{k})}{\sum_{k'=1}^{K}exp(\alpha_{i}^{k'})}
$$  




```python

def dot1(V, W):
    # V is N x D x K (batch of visible units)
    # W is D x K x M (weights)
    # returns N x M (hidden layer size)
    return tf.tensordot(V, W, axes=[[1,2], [0,1]])

def dot2(H, W):
    # H is N x M (batch of hiddens)
    # W is D x K x M (weights transposed)
    # returns N x D x K (visible)
    return tf.tensordot(H, W, axes=[[1], [2]])

```

$$
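V : N \times \underline{D} \times K \quad \text{axes } [\underline{1},2]\\
W : \underline{D} \times K \times M \quad \text{axes } [\underline{0},1]\\
V : N \times D \times \underline{K} \quad \text{axes } [1,\underline{2}]\\
W : D \times \underline{K} \times M \quad \text{axes } [0,\underline{1}]\\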
$$


- h = vector of size M
- v = 2-D array of size D x K
- W = 3-D array of size D x K x M
- b = 2-D array of size D x K
- c = vector of size M


```python
N, M = A.shape
rbm = RBM(M, 50, 10)

class RBM(object):
    def __init__(self, D, M, K):
        self.D = D # input feature size
        self.M = M # hidden size
        self.K = K # number of ratings
        self.build(D, M, K)

```

```python
    def build(self, D, M, K):
        # params
        self.W = tf.Variable(tf.random_normal(shape=(D, K, M)) * np.sqrt(2.0 / M))
        self.c = tf.Variable(np.zeros(M).astype(np.float32))
        self.b = tf.Variable(np.zeros((D, K)).astype(np.float32))

        # data
        self.X_in = tf.placeholder(tf.float32, shape=(None, D, K))
        self.mask = tf.placeholder(tf.float32, shape=(None, D, K))
```

$$
p(h_{j}=1|v) = \sigma(\sum_{k=1}^{K}\sum_{i=1}^{D}W_{ij}^{k}v_{i}^{k}+c_{j})\\
$$  

```python
        # conditional probabilities
        # NOTE: tf.contrib.distributions.Bernoulli API has changed in Tensorflow v1.2
        V = self.X_in
        p_h_given_v = tf.nn.sigmoid(dot1(V, self.W) + self.c)
        self.p_h_given_v = p_h_given_v # save for later

```


```python
        # draw a sample from p(h | v)
        r = tf.random_uniform(shape=tf.shape(p_h_given_v))
        H = tf.to_float(r < p_h_given_v)

```

$$
p(v_{i}^{k}=1|h) = softmax(\sum_{j=1}^{M}W_{ij}^{k}h_{j}+b_{i}^{k})\\
\text{where: } softmax(\alpha_{i}^{k}) = \frac{exp(\alpha_{i}^{k})}{\sum_{k'=1}^{K}exp(\alpha_{i}^{k'})}
$$

```python
        # draw a sample from p(v | h)
        # note: we don't have to actually do the softmax
        logits = dot2(H, self.W) + self.b
        cdist = tf.distributions.Categorical(logits=logits)
        X_sample = cdist.sample() # shape is (N, D)
        X_sample = tf.one_hot(X_sample, depth=K) # turn it into (N, D, K)
        X_sample = X_sample * self.mask # missing ratings shouldn't contribute to objective

```





```python
# build the objective
objective = (
    tf.reduce_mean(self.free_energy(self.X_in))
    - tf.reduce_mean(self.free_energy(X_sample))
)
self.train_op = tf.train.AdamOptimizer(1e-2).minimize(objective)
# self.train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(objective)
```

```python
def free_energy(self, V):
    # per-sample bias term, shape (N,); summing over the batch here would mix samples
    first_term = -dot1(V, self.b)
    second_term = -tf.reduce_sum(
        # tf.log(1 + tf.exp(tf.matmul(V, self.W) + self.c)),
        tf.nn.softplus(dot1(V, self.W) + self.c),
        axis=1
    )
    return first_term + second_term
```

$$
F(v) = -b^{T}v-log\sum_{h}^{}exp(\sum_{j=1}^{M}h_{j}u_{j})\\
F(v) = -b^{T}v-\sum_{j=1}^{M}log\bigg\{ 1+exp(v^{T}W_{:,:,j}+c_{j}) \bigg\}
$$

$$
\text{first term: positive phase :} \frac{\partial{F(v)}}{\partial{\theta}}\\
\text{second term: negative phase :} -\sum_{v'}^{}p(v')\frac{\partial{F(v')}}{\partial{\theta}}
$$

![](https://image.slidesharecdn.com/activationfunction-170608093401/95/activation-function-in-deep-neural-network-17-638.jpg?cb=1496914980)


- `forward_logits` is the input to the softmax, shown below:  

$$
p(h_{j}=1|v) = \sigma(\sum_{k=1}^{K}\sum_{i=1}^{D}W_{ij}^{k}v_{i}^{k}+c_{j})\\
\text{forward_logits} = \sum_{j=1}^{M}W_{ij}^{k}h_{j}+b_{i}^{k}\\
$$


```python
        # build the cost
        # we won't use this to optimize the model parameters
        # just to observe what happens during training
        logits = self.forward_logits(self.X_in)
        self.cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=self.X_in,
                logits=logits,
            )
        )

```

$$
\text{output_visible} =  p(v_{i}^{k}=1|h) = softmax(\sum_{j=1}^{M}W_{ij}^{k}h_{j}+b_{i}^{k})\\
\text{where: } softmax(\alpha_{i}^{k}) = \frac{exp(\alpha_{i}^{k})}{\sum_{k'=1}^{K}exp(\alpha_{i}^{k'})}
$$


```python
        # to get the output
        self.output_visible = self.forward_output(self.X_in)
```


```python
for j in range(n_batches):
    x = X[j*batch_sz:(j*batch_sz + batch_sz)].toarray()
    m = mask[j*batch_sz:(j*batch_sz + batch_sz)].toarray()

    # both visible units and mask have to be in one-hot form
    # N x D --> N x D x K
    batch_one_hot = one_hot_encode(x, self.K)
    m = one_hot_mask(m, self.K)

    _, c = self.session.run(
        (self.train_op, self.cost),
        feed_dict={self.X_in: batch_one_hot, self.mask: m}
    )
```