# *Recommender System Model with Stacked Autoencoder*

**Intuition**: We will build a Stacked Autoencoder in PyTorch to predict viewer ratings for our Recommender System. To learn more about Stacked (Denoising) Autoencoders, see the following paper: http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf.

**Dataset Source**: https://grouplens.org/datasets/movielens/100k/

### Importing libraries and packages

In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable
import warnings
warnings.filterwarnings('ignore')

### Importing the Training Set & Test Set

In [2]:
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set

Unnamed: 0,1,1.1,5,874965758
0,1,2,3,876893171
1,1,3,4,878542960
2,1,4,3,876893119
3,1,5,3,889751712
4,1,7,4,875071561
...,...,...,...,...
79994,943,1067,2,875501756
79995,943,1074,4,888640250
79996,943,1188,3,888640250
79997,943,1228,3,888640275
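The column labels in the output above (`1`, `1.1`, `5`, `874965758`) suggest that pandas used the first rating as a header row, so that rating is not part of the data. A minimal sketch of one way to keep it, assuming the file has no header line: pass `header=None`. The in-memory sample stands in for `u1.base` (its first row is inferred from the labels above), and the column names are illustrative assumptions.

```python
import io
import pandas as pd

# Three tab-separated rows in the (user, movie, rating, timestamp) layout.
# This in-memory sample is a stand-in for the real file.
sample = "1\t1\t5\t874965758\n1\t2\t3\t876893171\n1\t3\t4\t878542960\n"

# Column names are assumptions chosen for readability; the file itself has none.
columns = ["user_id", "movie_id", "rating", "timestamp"]

# Default behaviour: the first line is consumed as column labels.
with_default = pd.read_csv(io.StringIO(sample), delimiter="\t")

# header=None keeps every line as data; names= supplies the labels.
with_header_none = pd.read_csv(
    io.StringIO(sample), delimiter="\t", header=None, names=columns
)

print(len(with_default), len(with_header_none))
```

With the default settings the sample yields two data rows, and with `header=None` it yields three.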


In [3]:
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set

Unnamed: 0,1,6,5,887431973
0,1,10,3,875693118
1,1,12,5,878542960
2,1,14,5,874965706
3,1,17,3,875073198
4,1,20,4,887431883
...,...,...,...,...
19994,458,648,4,886395899
19995,458,1101,4,886397931
19996,459,934,3,879563639
19997,460,10,3,882912371


### Transforming the Training Set & Test Set into Arrays

In [4]:
training_set_copy = training_set.copy()
training_set = np.array(training_set, dtype = 'int')
training_set

array([[        1,         2,         3, 876893171],
       [        1,         3,         4, 878542960],
       [        1,         4,         3, 876893119],
       ...,
       [      943,      1188,         3, 888640250],
       [      943,      1228,         3, 888640275],
       [      943,      1330,         3, 888692465]])

In [5]:
test_set_copy = test_set.copy()
test_set = np.array(test_set, dtype = 'int')
test_set

array([[        1,        10,         3, 875693118],
       [        1,        12,         5, 878542960],
       [        1,        14,         5, 874965706],
       ...,
       [      459,       934,         3, 879563639],
       [      460,        10,         3, 882912371],
       [      462,       682,         5, 886365231]])

### Getting the number of users and movies

In [6]:
nb_users = int(max(max(training_set[:,0]), max(test_set[:,0])))
nb_movies = int(max(max(training_set[:,1]), max(test_set[:,1])))
print('Total number of users = %s' % nb_users, '\nTotal number of movies = %s' % nb_movies)

Total number of users = 943 
Total number of movies = 1682


### Converting the data into an array with users in rows and movies in columns

In [7]:
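def convert(data):
    # Build one row per user: a list of nb_movies ratings, with 0 meaning "not rated".
    new_data = []
    for id_users in range(1, nb_users + 1):
        # Movies rated by this user and the corresponding ratings
        id_movies = data[:,1][data[:,0] == id_users]
        id_ratings = data[:,2][data[:,0] == id_users]
        ratings = np.zeros(nb_movies)
        # Movie IDs start at 1, so shift by one to get column indices
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)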
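To make the resulting layout concrete, here is a small sketch on toy data. The function name `to_user_movie_matrix` and the toy ratings are hypothetical. It fills the matrix with a single indexed assignment instead of a per-user loop, which should give the same layout as `convert` provided each (user, movie) pair appears at most once.

```python
import numpy as np

def to_user_movie_matrix(data, nb_users, nb_movies):
    """Return a (nb_users, nb_movies) array; 0 marks a movie the user has not rated."""
    matrix = np.zeros((nb_users, nb_movies))
    # User and movie IDs start at 1, so shift them to 0-based indices.
    users = data[:, 0] - 1
    movies = data[:, 1] - 1
    matrix[users, movies] = data[:, 2]
    return matrix

# Toy ratings as (user_id, movie_id, rating) rows; user 2 has rated nothing.
toy = np.array([[1, 2, 3],
                [1, 3, 4],
                [3, 1, 5]])
matrix = to_user_movie_matrix(toy, nb_users=3, nb_movies=4)
print(matrix)
```

Each row is one user's rating vector, which is the input format the autoencoder expects.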

### Converting the Training Set & Test Set into PyTorch Tensors

In [8]:
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

In [9]:
training_set[:1]

tensor([[0., 3., 4.,  ..., 0., 0., 0.]])

In [10]:
test_set[:1]

tensor([[0., 0., 0.,  ..., 0., 0., 0.]])

### Structuring the Stacked Autoencoder class

In [11]:
class SAE(nn.Module):
    def __init__(self):
        super(SAE, self).__init__()
        # Encoder: nb_movies -> 20 -> 10
        self.fc1 = nn.Linear(nb_movies, 20)
        self.fc2 = nn.Linear(20, 10)
        # Decoder: 10 -> 20 -> nb_movies
        self.fc3 = nn.Linear(10, 20)
        self.fc4 = nn.Linear(20, nb_movies)
        self.activation = nn.Sigmoid()
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        # No activation on the output layer, so predictions are unbounded real values
        x = self.fc4(x)
        return x
sae = SAE()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(sae.parameters(), lr = 0.01, weight_decay = 0.5)
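The sketch below traces tensor shapes through the same layer sizes (20, 10, 20) to show the 10-unit bottleneck. It is an illustration only: the catalogue size `nb_movies_toy`, the `nn.Sequential` grouping, and the names `encoder` and `decoder` are assumptions and are not part of the `SAE` class above.

```python
import torch
import torch.nn as nn

nb_movies_toy = 8  # hypothetical small catalogue, standing in for nb_movies

# Same layer sizes as the SAE class, grouped into two halves for inspection.
encoder = nn.Sequential(
    nn.Linear(nb_movies_toy, 20), nn.Sigmoid(),
    nn.Linear(20, 10), nn.Sigmoid(),
)
decoder = nn.Sequential(
    nn.Linear(10, 20), nn.Sigmoid(),
    nn.Linear(20, nb_movies_toy),  # no activation on the output layer
)

# One user who rated two movies; zeros mean "not rated".
x = torch.zeros(1, nb_movies_toy)
x[0, 1] = 3.0
x[0, 2] = 4.0

code = encoder(x)               # compressed representation of the user
reconstruction = decoder(code)  # one predicted rating per movie
print(tuple(code.shape), tuple(reconstruction.shape))
```

The output has one value per movie, including movies the user never rated, which is what makes the reconstruction usable as a rating prediction.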

### Training the Stacked Autoencoder

In [12]:
nb_epoch = 200
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for id_user in range(nb_users):
        input = Variable(training_set[id_user]).unsqueeze(0)
        target = input.clone()
        if torch.sum(target.data > 0) > 0:
            output = sae(input)
            target.requires_grad = False
            output[target == 0] = 0
            loss = criterion(output, target)
            mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
            loss.backward()
            train_loss += np.sqrt(loss.data*mean_corrector)
            s += 1.
            optimizer.step()
    print('Epoch: '+str(epoch)+' & Training Loss: '+str(train_loss/s))

Epoch: 1 & Training Loss: tensor(1.7715)
Epoch: 2 & Training Loss: tensor(1.0968)
Epoch: 3 & Training Loss: tensor(1.0534)
Epoch: 4 & Training Loss: tensor(1.0382)
Epoch: 5 & Training Loss: tensor(1.0309)
Epoch: 6 & Training Loss: tensor(1.0266)
Epoch: 7 & Training Loss: tensor(1.0238)
Epoch: 8 & Training Loss: tensor(1.0219)
Epoch: 9 & Training Loss: tensor(1.0207)
Epoch: 10 & Training Loss: tensor(1.0195)
Epoch: 11 & Training Loss: tensor(1.0187)
Epoch: 12 & Training Loss: tensor(1.0187)
Epoch: 13 & Training Loss: tensor(1.0178)
Epoch: 14 & Training Loss: tensor(1.0175)
Epoch: 15 & Training Loss: tensor(1.0173)
Epoch: 16 & Training Loss: tensor(1.0170)
Epoch: 17 & Training Loss: tensor(1.0167)
Epoch: 18 & Training Loss: tensor(1.0164)
Epoch: 19 & Training Loss: tensor(1.0164)
Epoch: 20 & Training Loss: tensor(1.0161)
Epoch: 21 & Training Loss: tensor(1.0160)
Epoch: 22 & Training Loss: tensor(1.0159)
Epoch: 23 & Training Loss: tensor(1.0158)
Epoch: 24 & Training Loss: tensor(1.0159)
...

Epoch: 195 & Training Loss: tensor(0.9139)
Epoch: 196 & Training Loss: tensor(0.9142)
Epoch: 197 & Training Loss: tensor(0.9136)
Epoch: 198 & Training Loss: tensor(0.9141)
Epoch: 199 & Training Loss: tensor(0.9133)
Epoch: 200 & Training Loss: tensor(0.9136)
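Two details of the training loop are worth illustrating. First, `mean_corrector` rescales the loss: `nn.MSELoss` averages over all movies, so multiplying by `nb_movies / (number of rated movies)` gives the mean squared error over rated movies only. Second, the loop never calls `optimizer.zero_grad()`, and PyTorch accumulates gradients across `backward()` calls unless they are cleared; whether that accumulation was intended is not stated. The sketch below shows a single step that clears gradients first and checks the corrector identity. The toy model, its sizes, and the ratings are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nb_movies_toy = 6  # hypothetical catalogue size
model = nn.Sequential(
    nn.Linear(nb_movies_toy, 4), nn.Sigmoid(), nn.Linear(4, nb_movies_toy)
)
criterion = nn.MSELoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)

# One user's ratings; zeros mean "not rated".
target = torch.tensor([[0.0, 3.0, 4.0, 0.0, 0.0, 5.0]])
mask = (target > 0).float()

optimizer.zero_grad()           # clear gradients left over from earlier steps
output = model(target)
masked_output = output * mask   # ignore predictions for unrated movies
loss = criterion(masked_output, target)
loss.backward()
optimizer.step()

# MSELoss divides by all nb_movies_toy entries; rescale to rated entries only.
mean_corrector = nb_movies_toy / float(mask.sum())
corrected = loss.item() * mean_corrector

# The same quantity computed directly over the rated entries.
rated = target > 0
direct = ((output.detach() - target)[rated] ** 2).mean().item()
print(corrected, direct)
```

The two printed values agree up to floating-point error, and the square root of this value is the per-user quantity that the loop adds to `train_loss`.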


### Testing the Stacked Autoencoder

In [13]:
test_loss = 0
s = 0.
for id_user in range(nb_users):
    input = Variable(training_set[id_user]).unsqueeze(0)
    target=Variable(test_set[id_user]).unsqueeze(0)
    if torch.sum(target.data > 0) > 0:
        output = sae(input)
        target.requires_grad = False
        output[target == 0] = 0
        loss = criterion(output, target)
        mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
        test_loss += np.sqrt(loss.data * mean_corrector)
        s += 1.
print('Test Loss: '+str(test_loss/s))

Test Loss: tensor(0.9502)
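The per-user quantity accumulated in the test loop is a root-mean-square error over the movies that user rated in the test set. A minimal sketch of that computation as a standalone helper follows; the name `masked_rmse` and the toy tensors are hypothetical. It runs under `torch.no_grad()`, since no gradients are needed at evaluation time.

```python
import torch

def masked_rmse(prediction, target):
    """RMSE over the entries where target > 0, i.e. rated movies only."""
    mask = target > 0
    return torch.sqrt(((prediction - target)[mask] ** 2).mean())

# Toy prediction and test ratings for one user; zeros mean "not rated".
prediction = torch.tensor([[2.0, 3.5, 1.0, 4.0]])
target = torch.tensor([[0.0, 3.0, 0.0, 5.0]])

with torch.no_grad():  # evaluation only, so skip gradient tracking
    rmse = masked_rmse(prediction, target).item()
print(rmse)
```

Here the errors on the two rated movies are 0.5 and -1.0, so the result is the square root of 0.625.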


### Extracting the title of the Movies from the Movies dataset

In [14]:
movies = pd.read_csv('ml-100k/u.item', sep = '|', engine = 'python', encoding = 'latin-1', header = None)
movie_title = movies.iloc[:nb_movies, 1:2]
movie_title

Unnamed: 0,1
0,Toy Story (1995)
1,GoldenEye (1995)
2,Four Rooms (1995)
3,Get Shorty (1995)
4,Copycat (1995)
...,...
1677,Mat' i syn (1997)
1678,B. Monkey (1998)
1679,Sliding Doors (1998)
1680,You So Crazy (1994)


### Choosing a User from the Test Set and Retrieving That User's Full List of Movie Ratings

In [15]:
user_id = 101  # used as a 0-based row index throughout, so this selects the user whose MovieLens ID is 102
user_rating = training_set.data.numpy()[user_id, :].reshape(-1,1)
user_target = test_set.data.numpy()[user_id, :].reshape(-1,1)

### Making the Predictions using the Input

In [16]:
user_input = Variable(training_set[user_id]).unsqueeze(0)
predicted = sae(user_input)
predicted = predicted.data.numpy().reshape(-1,1)
predicted

array([[2.8014321],
       [2.2636726],
       [1.7945926],
       ...,
       [1.4948449],
       [2.4509208],
       [2.236212 ]], dtype=float32)

### Combining all the info into one dataframe

In [17]:
result_array = np.hstack([movie_title, user_target, predicted])
result_array = result_array[result_array[:, 1] > 0]
result_df = pd.DataFrame(data=result_array, columns=['Movie', 'Target Rating', 'Predicted'])
result_df

Unnamed: 0,Movie,Target Rating,Predicted
0,Copycat (1995),3,2.27144
1,Ed Wood (1994),2,2.69804
2,Star Wars (1977),4,3.66983
3,Stargate (1994),3,2.41681
4,While You Were Sleeping (1995),3,2.6145
...,...,...,...
99,Chain Reaction (1996),2,1.9055
100,Turbulence (1997),1,1.34292
101,Fire Down Below (1997),2,1.37296
102,"Beverly Hillbillies, The (1993)",1,1.62214
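Stacking string titles and float ratings with `np.hstack` produces a single mixed-type array, so the numeric columns most likely end up with `object` dtype in the DataFrame. A sketch of an alternative that keeps numeric dtypes by building the DataFrame column by column follows; the titles and ratings here are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for movie_title, user_target and predicted.
titles = pd.Series(["Movie A", "Movie B", "Movie C"])
user_target = np.array([[3.0], [0.0], [1.0]])
predicted = np.array([[2.7], [3.1], [1.4]], dtype=np.float32)

# Build the frame from separate columns so each keeps its own dtype.
result_df = pd.DataFrame({
    "Movie": titles.to_numpy(),
    "Target Rating": user_target.ravel(),
    "Predicted": predicted.ravel(),
})

# Keep only the movies the user rated in the test set.
result_df = result_df[result_df["Target Rating"] > 0].reset_index(drop=True)
print(result_df.dtypes)
```

Keeping the ratings numeric makes later steps such as sorting by predicted rating or computing an error column straightforward.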


### Exporting the Results as a CSV file

In [18]:
result_df.to_csv('Generated ratings vs OG ratings.csv')

*The code in In [17] was run to compare the predicted ratings with the real ratings (targets) from the test dataset. The resulting DataFrame contains the movie name, the target rating (from the test set), and the predicted rating. The array was filtered to keep only the movies this user rated in the test set, that is, movies with a rating > 0. The results can also be viewed in the exported CSV file.*