# ***Recommender Systems Model with Restricted Boltzmann Machine (Energy-based Model)***

Intuition: We will build a recommender system using a Restricted Boltzmann Machine (RBM), an energy-based model (EBM), trained on viewers' ratings of various movies. We will use PyTorch to transform our dataset into tensors, and the Gibbs sampling technique to draw observations from the specified multivariate probability distribution. Finally, we will evaluate our model with RMSE. For the complete mathematical intuition behind RBMs, see https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf.

Dataset source: https://www.kaggle.com/grouplens/movielens-latest-small

### Importing libraries and packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable
import warnings
warnings.filterwarnings('ignore')

### Importing the 'Ratings' dataset

In [2]:
dataset = pd.read_csv('../input/movielens-latest-small/ratings.csv')

In [3]:
dataset

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |


### Splitting the dataset into Training & Test Sets

In [4]:
training_set, test_set = train_test_split(dataset, test_size= 0.2, random_state = 0)

In [5]:
training_set

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 77701 | 483 | 8529 | 4.0 | 1215545278 |
| 94477 | 599 | 33437 | 2.5 | 1498518389 |
| 36246 | 247 | 5349 | 2.0 | 1467645405 |
| 17483 | 111 | 7361 | 3.5 | 1516140853 |
| 100300 | 610 | 57504 | 4.5 | 1493847901 |
| ... | ... | ... | ... | ... |
| 21243 | 140 | 1393 | 4.0 | 949667278 |
| 45891 | 304 | 1676 | 4.0 | 896268382 |
| 42613 | 288 | 2529 | 5.0 | 976139309 |
| 43567 | 292 | 1307 | 3.0 | 1323631984 |


In [6]:
test_set

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 41008 | 276 | 780 | 5.0 | 858350384 |
| 94274 | 599 | 7624 | 2.5 | 1519235950 |
| 77380 | 483 | 1320 | 2.5 | 1215895327 |
| 29744 | 202 | 3448 | 3.0 | 974924072 |
| 40462 | 274 | 60291 | 4.0 | 1296947017 |
| ... | ... | ... | ... | ... |
| 54027 | 356 | 6620 | 2.0 | 1229139803 |
| 84234 | 538 | 5618 | 5.0 | 1307846388 |
| 12840 | 82 | 2987 | 3.5 | 1084468057 |
| 74661 | 474 | 6787 | 4.0 | 1064171862 |


### Transforming the Training Set & Test Set into Arrays

In [7]:
training_set = np.array(training_set, dtype = 'int')
training_set

array([[       483,       8529,          4, 1215545278],
       [       599,      33437,          2, 1498518389],
       [       247,       5349,          2, 1467645405],
       ...,
       [       288,       2529,          5,  976139309],
       [       292,       1307,          3, 1323631984],
       [       440,       7361,          4, 1237569025]])

In [8]:
test_set = np.array(test_set, dtype = 'int')
test_set

array([[       276,        780,          5,  858350384],
       [       599,       7624,          2, 1519235950],
       [       483,       1320,          2, 1215895327],
       ...,
       [        82,       2987,          3, 1084468057],
       [       474,       6787,          4, 1064171862],
       [         6,        318,          5,  845553200]])
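
Casting with `dtype = 'int'` truncates toward zero, so half-star ratings lose their fractional part (2.5 becomes 2, as in the second row of each array above). The minimal sketch below uses made-up values (the `ratings` array is hypothetical) to show the effect, including that a 0.5 rating becomes 0, the same value that later marks a missing rating:

```python
import numpy as np

# Hypothetical half-star ratings, for illustration only.
ratings = np.array([0.5, 2.5, 3.5, 4.0])

# Casting to int truncates toward zero rather than rounding.
truncated = np.array(ratings, dtype='int')
print(truncated)  # [0 2 3 4]
```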

### Getting the number of users and movies

In [9]:
nb_users = int(max(max(training_set[:,0]), max(test_set[:,0])))
nb_movies = int(max(max(training_set[:,1]), max(test_set[:,1])))
print('Total number of users = %s' % nb_users, '\nTotal number of movies = %s' % nb_movies)

Total number of users = 610 
Total number of movies = 193609
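
The movie count above is the largest `movieId`, not the number of distinct movies, so many columns of the user-by-movie matrix may never receive a rating. One possible alternative, sketched here on hypothetical IDs, is to remap the IDs to contiguous column indices with `np.unique`. This is an assumption about how one might shrink the matrix, not what the notebook does.

```python
import numpy as np

# Hypothetical sparse movie IDs, for illustration only.
movie_ids = np.array([1, 3, 6, 47, 193609, 3])

# unique_ids: sorted distinct IDs.
# col_index: position of each original ID inside unique_ids.
unique_ids, col_index = np.unique(movie_ids, return_inverse=True)

print(int(movie_ids.max()))  # 193609 columns if the raw ID is used as the index
print(len(unique_ids))       # 5 columns after remapping
print(col_index)             # [0 1 2 3 4 1]
```

With this mapping, `unique_ids[col_index]` recovers the original IDs, so predictions can be translated back to `movieId` values.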


### Converting the data into an array with users in rows and movies in columns

In [10]:
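def convert(data):
    """Return one list per user holding that user's rating for every movie (0 = not rated)."""
    new_data = []
    for id_users in range(1, nb_users + 1):
        # Movies rated by this user and the corresponding ratings
        id_movies = data[:,1][data[:,0] == id_users]
        id_ratings = data[:,2][data[:,0] == id_users]
        ratings = np.zeros(nb_movies)
        # movieId values are 1-based, so movieId m is stored at index m - 1
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)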
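
The same user-by-movie matrix can also be filled without a Python loop by using the user and movie IDs as a pair of index arrays. The sketch below is an assumed alternative with a hypothetical name (`convert_vectorized`) and toy data; it relies on the same 1-based IDs as `convert` and returns a NumPy array instead of a list of lists.

```python
import numpy as np

def convert_vectorized(data, nb_users, nb_movies):
    """Build a (nb_users, nb_movies) matrix; 0 marks a missing rating."""
    matrix = np.zeros((nb_users, nb_movies))
    # IDs are 1-based, so shift them to 0-based row and column indices.
    matrix[data[:, 0] - 1, data[:, 1] - 1] = data[:, 2]
    return matrix

# Toy rows in the notebook's layout: userId, movieId, rating, timestamp.
toy = np.array([[1, 1, 4, 0],
                [1, 3, 5, 0],
                [2, 2, 2, 0]])
print(convert_vectorized(toy, nb_users=2, nb_movies=3))
```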

### Converting the Training Set & Test Set into PyTorch Tensors

In [11]:
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

In [12]:
training_set[:1]

tensor([[4., 0., 4.,  ..., 0., 0., 0.]])

In [13]:
test_set[:1]

tensor([[0., 0., 0.,  ..., 0., 0., 0.]])

### Converting the Ratings into Binary, i.e. '1' if liked, '0' if not liked & '-1' if not rated

In [14]:
training_set[training_set == 0] = -1
training_set[training_set == 1] = 0
training_set[training_set == 2] = 0
training_set[training_set >= 3] = 1
test_set[test_set == 0] = -1
test_set[test_set == 1] = 0
test_set[test_set == 2] = 0
test_set[test_set >= 3] = 1

In [15]:
training_set[:1]

tensor([[ 1., -1.,  1.,  ..., -1., -1., -1.]])

In [16]:
test_set[:1]

tensor([[-1., -1., -1.,  ..., -1., -1., -1.]])
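
The four in-place assignments depend on their order: unrated entries must be set to -1 before ratings of 1 are mapped to 0. A minimal sketch of the same mapping as a single function is shown below; `binarize` is a hypothetical helper that returns a new tensor and leaves its input unchanged.

```python
import torch

def binarize(ratings):
    """Map 0 -> -1 (not rated), 1 or 2 -> 0 (not liked), 3 and above -> 1 (liked)."""
    liked = torch.where(ratings >= 3,
                        torch.ones_like(ratings),
                        torch.zeros_like(ratings))
    # Entries equal to 0 in the raw matrix mean "no rating" and become -1.
    return torch.where(ratings == 0, -torch.ones_like(ratings), liked)

# Hypothetical raw ratings for one user.
toy = torch.tensor([[0., 1., 2., 3., 5.]])
print(binarize(toy))
```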

### Structuring the RBM class

In [32]:
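class RBM():
    def __init__(self, nv, nh):
        # nv: number of visible nodes (movies), nh: number of hidden nodes
        self.W = torch.randn(nh, nv)  # weights
        self.a = torch.randn(1, nh)   # bias of the hidden nodes
        self.b = torch.randn(1, nv)   # bias of the visible nodes
    def sample_h(self, x):
        # Return p(h = 1 | v) and a Bernoulli sample of the hidden nodes
        wx = torch.mm(x, self.W.t())
        activation = wx + self.a.expand_as(wx)
        p_h_given_v = torch.sigmoid(activation)
        return p_h_given_v, torch.bernoulli(p_h_given_v)
    def sample_v(self, y):
        # Return p(v = 1 | h) and a Bernoulli sample of the visible nodes
        wy = torch.mm(y, self.W)
        activation = wy + self.b.expand_as(wy)
        p_v_given_h = torch.sigmoid(activation)
        return p_v_given_h, torch.bernoulli(p_v_given_h)
    def train(self, v0, vk, ph0, phk):
        # Contrastive divergence update, summed over the batch (no learning rate)
        self.W += (torch.mm(v0.t(),ph0) - torch.mm(vk.t(),phk)).t()
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)
    def predict(self, x):
        # One Gibbs step: visible -> hidden -> visible
        _, h = self.sample_h(x)
        _, v = self.sample_v(h)
        return v
nv = len(training_set[0])  # one visible node per movie
nh = 100
batch_size = 100
rbm = RBM(nv, nh)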
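
In terms of the class attributes, with $W$ the `nh` × `nv` weight matrix, $a$ the hidden bias, $b$ the visible bias and $\sigma$ the logistic sigmoid, `sample_h` and `sample_v` draw from the usual RBM conditionals, and `train` applies a contrastive-divergence update summed over the batch index $n$. The notation below is a sketch written to match the code; the symbols $i$ (visible node), $j$ (hidden node) and $n$ (batch row) are introduced here for illustration, and the update has no explicit learning rate.

$$
p(h_j = 1 \mid v) = \sigma\Big(a_j + \sum_i W_{ji}\, v_i\Big), \qquad
p(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ji}\, h_j\Big)
$$

$$
\Delta W_{ji} = \sum_n \Big( p(h_j = 1 \mid v_0^{(n)})\, v_{0,i}^{(n)} - p(h_j = 1 \mid v_k^{(n)})\, v_{k,i}^{(n)} \Big)
$$

$$
\Delta b_i = \sum_n \big( v_{0,i}^{(n)} - v_{k,i}^{(n)} \big), \qquad
\Delta a_j = \sum_n \Big( p(h_j = 1 \mid v_0^{(n)}) - p(h_j = 1 \mid v_k^{(n)}) \Big)
$$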

### Training the RBM

In [33]:
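nb_epoch = 10
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    rmse_tr = 0
    s = 0.  # number of batches processed
    for id_user in range(0, nb_users - batch_size, batch_size):
        # NOTE: vk starts from the training batch, while v0 (the target) is read from test_set.
        # In standard contrastive divergence both are the same training batch.
        vk = training_set[id_user:id_user+batch_size]
        v0 = test_set[id_user:id_user+batch_size]
        ph0,_ = rbm.sample_h(v0)
        # 10 steps of Gibbs sampling
        for k in range(10):
            _,hk = rbm.sample_h(vk)
            _,vk = rbm.sample_v(hk)
            vk[v0<0] = v0[v0<0]  # keep entries that are unrated in v0 at -1
        phk,_ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        # Errors are measured only on entries that are rated in v0
        train_loss += torch.mean(torch.abs(v0[v0>=0] - vk[v0>=0]))
        rmse_tr += torch.sqrt(torch.mean((v0[v0>=0] - vk[v0>=0])**2))
        s += 1.
    print('Epoch: '+str(epoch)+' , Training Loss: '+str(train_loss/s))
    # rmse_tr is the sum of the per-batch RMSEs; unlike the loss, it is not divided by s
    print('RMSE: '+str(rmse_tr))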

Epoch: 1 , Training Loss: tensor(0.4590)
RMSE: tensor(4.0383)
Epoch: 2 , Training Loss: tensor(0.2530)
RMSE: tensor(3.0122)
Epoch: 3 , Training Loss: tensor(0.2137)
RMSE: tensor(2.7732)
Epoch: 4 , Training Loss: tensor(0.2236)
RMSE: tensor(2.8367)
Epoch: 5 , Training Loss: tensor(0.2091)
RMSE: tensor(2.7432)
Epoch: 6 , Training Loss: tensor(0.2096)
RMSE: tensor(2.7454)
Epoch: 7 , Training Loss: tensor(0.2095)
RMSE: tensor(2.7454)
Epoch: 8 , Training Loss: tensor(0.2124)
RMSE: tensor(2.7651)
Epoch: 9 , Training Loss: tensor(0.2067)
RMSE: tensor(2.7276)
Epoch: 10 , Training Loss: tensor(0.2082)
RMSE: tensor(2.7368)
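
For comparison, here is a hedged sketch of one contrastive-divergence epoch in which `v0` and the starting `vk` are the same training batch and the per-batch RMSE is averaged rather than summed. The tiny random dataset, the helper name `cd_epoch` and the layer sizes are assumptions made for illustration, and the `RBM` class is a trimmed copy of the one above.

```python
import torch

class RBM:
    def __init__(self, nv, nh):
        self.W = torch.randn(nh, nv)   # weights
        self.a = torch.randn(1, nh)    # hidden bias
        self.b = torch.randn(1, nv)    # visible bias
    def sample_h(self, x):
        p = torch.sigmoid(torch.mm(x, self.W.t()) + self.a)
        return p, torch.bernoulli(p)
    def sample_v(self, y):
        p = torch.sigmoid(torch.mm(y, self.W) + self.b)
        return p, torch.bernoulli(p)
    def train(self, v0, vk, ph0, phk):
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        self.b += torch.sum(v0 - vk, 0)
        self.a += torch.sum(ph0 - phk, 0)

def cd_epoch(rbm, data, batch_size, k=10):
    """Run one epoch of CD-k; return mean absolute error and mean per-batch RMSE."""
    mae, rmse, n_batches = 0.0, 0.0, 0
    for start in range(0, data.shape[0], batch_size):
        v0 = data[start:start + batch_size]  # target: the training batch itself
        vk = v0.clone()                       # the Gibbs chain starts at the same batch
        ph0, _ = rbm.sample_h(v0)
        for _ in range(k):
            _, hk = rbm.sample_h(vk)
            _, vk = rbm.sample_v(hk)
            vk[v0 < 0] = v0[v0 < 0]           # keep unrated entries at -1
        phk, _ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        rated = v0 >= 0
        if rated.any():
            diff = v0[rated] - vk[rated]
            mae += diff.abs().mean().item()
            rmse += torch.sqrt((diff ** 2).mean()).item()
            n_batches += 1
    return mae / max(n_batches, 1), rmse / max(n_batches, 1)

torch.manual_seed(0)
# Hypothetical data: 20 users x 8 movies with values in {-1, 0, 1}.
toy = torch.randint(-1, 2, (20, 8)).float()
rbm = RBM(nv=8, nh=4)
mae, rmse = cd_epoch(rbm, toy, batch_size=5)
print(mae, rmse)
```

Because the values are binary, both metrics computed this way stay between 0 and 1 regardless of how many batches are processed.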


### Testing the RBM

In [34]:
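test_loss = 0
rmse_ts = 0
s = 0.  # number of users with at least one rating in the test set
for id_user in range(nb_users):
    v = training_set[id_user:id_user+1]  # input: the user's training ratings
    vt = test_set[id_user:id_user+1]     # target: the user's test ratings
    if len(vt[vt>=0]) > 0:
        # One Gibbs step from the training ratings
        _,h = rbm.sample_h(v)
        _,v = rbm.sample_v(h)
        test_loss += torch.mean(torch.abs(vt[vt>=0] - v[vt>=0]))
        rmse_ts += torch.sqrt(torch.mean((vt[vt>=0] - v[vt>=0])**2))
        s += 1.
print('Test Loss: '+str(test_loss/s))
# rmse_ts is the sum of the per-user RMSEs; unlike the loss, it is not divided by s
print('RMSE: '+str(rmse_ts))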

Test Loss: tensor(0.1950)
RMSE: tensor(234.7473)
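
The RMSE printed above is accumulated across users without dividing by `s`, so it grows with the number of users evaluated. A minimal sketch of a per-user RMSE restricted to rated entries and then averaged is shown below on hypothetical toy tensors; `masked_rmse` is an illustrative helper, not part of the notebook.

```python
import torch

def masked_rmse(target, pred):
    """RMSE over entries where target >= 0, i.e. movies the user actually rated."""
    rated = target >= 0
    return torch.sqrt(torch.mean((target[rated] - pred[rated]) ** 2))

# Hypothetical targets (-1 = not rated) and predictions for two users.
targets = torch.tensor([[1., -1., 0., 1.],
                        [0., 0., -1., -1.]])
preds = torch.tensor([[1., 1., 1., 0.],
                      [0., 1., 0., 1.]])

# Average the per-user values instead of summing them.
per_user = [masked_rmse(t, p) for t, p in zip(targets, preds)]
mean_rmse = torch.stack(per_user).mean()
print(mean_rmse)
```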


### Choosing a User from the Test Set and converting that User's full list of Movies to a PyTorch Tensor

In [44]:
user_id = 276
user_input = test_set[user_id - 1].unsqueeze(0)  # row for userId 276, shaped (1, nb_movies)
user_input

tensor([[ 1., -1., -1.,  ..., -1., -1., -1.]])

### Making the Predictions using the Input

In [45]:
output = rbm.predict(user_input)
output = output.data.numpy()
output

array([[1., 1., 1., ..., 0., 0., 0.]], dtype=float32)

### Stacking the inputs and outputs

In [46]:
input_output = np.vstack([user_input, output])
pd.DataFrame(input_output)

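|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 193599 | 193600 | 193601 | 193602 | 193603 | 193604 | 193605 | 193606 | 193607 | 193608 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |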


*In the dataframe above, row 0 is the user's input and row 1 is the RBM's output. Among the visible columns, the system predicts a like for columns 1, 2, 3, 5, 7 and 9, none of which the user has rated. Since column j holds movieId j + 1, these correspond to movieIds 2, 3, 4, 6, 8 and 10. Column 0 (movieId 1) is also predicted as liked, but the user has already rated it, so the output can include movies the user has already seen. Only a few of the predictions are visible because the notebook truncates the displayed output.*
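
Rather than reading the recommendations off the dataframe by eye, they can be extracted programmatically. The sketch below uses short hypothetical rows in place of `user_input` and `output`, and relies on the fact that `convert` stores movieId m at index m - 1.

```python
import numpy as np

# Hypothetical short rows standing in for `user_input` and `output`.
user_row = np.array([1., -1., -1., -1., -1., -1.])  # -1 = not rated yet
predicted = np.array([1., 1., 1., 1., 0., 1.])      # 1 = predicted "liked"

# Keep columns predicted as liked that the user has not rated.
new_columns = np.where((predicted == 1) & (user_row == -1))[0]

# Column j holds movieId j + 1 because `convert` writes to index movieId - 1.
recommended_movie_ids = new_columns + 1
print(recommended_movie_ids)  # [2 3 4 6]
```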