# Building A Movie Recommendation System With PyTorch

In this project, we try to predict the rating that a user will give to an unseen movie, based on the ratings they gave to other movies. We will use the [MovieLens dataset](https://grouplens.org/datasets/movielens/). Two main folders contain our data. The folder *ml-1m* contains about 1 million ratings, and the folder *ml-100k* contains 100,000 ratings. We will use autoencoders to create our recommendation system. Let's start by importing the required libraries.

## Importing the libraries

In [1]:
# AutoEncoders

# Importing the libraries

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

Let's consider the folder *ml-1m*. It contains the following files:
- movies.dat : contains information about each movie (movie_id, movie_title, movie_genre)
- users.dat : contains information about each user
- ratings.dat : contains the ratings given by users to different movies

We can import these files using pandas.

## Importing the dataset

In [2]:
# Importing the dataset
movies = pd.read_csv('C:/Users/user/.spyder-py3/P16-AutoEncoders/AutoEncoders/ml-1m/ml-1m/movies.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
users = pd.read_csv('C:/Users/user/.spyder-py3/P16-AutoEncoders/AutoEncoders/ml-1m/ml-1m/users.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')
ratings = pd.read_csv('C:/Users/user/.spyder-py3/P16-AutoEncoders/AutoEncoders/ml-1m/ml-1m/ratings.dat', sep = '::', header = None, engine = 'python', encoding = 'latin-1')


In [3]:
# Visualizing the first elements of the movies dataset
movies.head()

Unnamed: 0,0,1,2
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


The *movies* dataset contains three columns:
- column 0: contains the movie_id
- column 1: contains the title of the movie
- column 2: contains the genre of the movie

In [4]:
# Visualizing the first elements of the users dataset
users.head()

Unnamed: 0,0,1,2,3,4
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [5]:
# Visualizing the first elements of the ratings dataset
ratings.head()

Unnamed: 0,0,1,2,3
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


The *ratings* dataset has four columns:
- column 0: contains the user_id x
- column 1: contains the movie_id y
- column 2: contains the rating that user x gave to movie y
- column 3: contains the timestamp

We will train the autoencoder on a subset of the data. The folder *ml-100k* contains a file named *u1.base*, which holds user ratings and is the file we'll use to train the model. We will then test our algorithm using the *u1.test* file.

## Preparing the training set and the test set

In [6]:
# Preparing the training set and the test set

training_set = pd.read_csv('C:/Users/user/.spyder-py3/P16-AutoEncoders/AutoEncoders/ml-100k/ml-100k/u1.base', delimiter = '\t')
test_set = pd.read_csv('C:/Users/user/.spyder-py3/P16-AutoEncoders/AutoEncoders/ml-100k/ml-100k/u1.test', delimiter = '\t')

In [7]:
# Visualizing the first elements of the training_set
training_set.head()

Unnamed: 0,1,1.1,5,874965758
0,1,2,3,876893171
1,1,3,4,878542960
2,1,4,3,876893119
3,1,5,3,889751712
4,1,7,4,875071561


The first three columns of training_set are the user_id, the movie_id and the rating, respectively; the fourth is the timestamp. The same applies to test_set.
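The rating files have no header line, so `read_csv` with its default settings uses the first record as column labels, which is what the `1, 1.1, 5, 874965758` labels in the output above suggest. Here is a minimal sketch of the difference `header=None` makes, using a made-up in-memory sample rather than the real file. The names passed to `names` are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Three made-up tab-separated records in the u1.base layout:
# user_id, movie_id, rating, timestamp (no header line).
sample = "1\t1\t5\t874965758\n1\t2\t3\t876893171\n1\t3\t4\t878542960\n"

# Default behaviour: the first record is consumed as the header row.
with_default = pd.read_csv(io.StringIO(sample), delimiter='\t')

# header=None keeps every record as data; the `names` labels are hypothetical.
with_header_none = pd.read_csv(
    io.StringIO(sample),
    delimiter='\t',
    header=None,
    names=['user_id', 'movie_id', 'rating', 'timestamp'],
)

print(len(with_default))      # 2 rows: the first record became column labels
print(len(with_header_none))  # 3 rows: nothing is lost
```

Reading the real files this way would add one rating to each set, so the losses reported later could change slightly.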

In [8]:
# Visualizing the first elements of the test_set
test_set.head()

Unnamed: 0,1,6,5,887431973
0,1,10,3,875693118
1,1,12,5,878542960
2,1,14,5,874965706
3,1,17,3,875073198
4,1,20,4,887431883


We will use PyTorch to create the autoencoder. Before building tensors, we convert our DataFrames into NumPy arrays, which makes it easy to select the user, movie and rating columns by position.

In [9]:
# Converting the training and test sets into numpy arrays
training_set = np.array(training_set, dtype = 'int')
test_set = np.array(test_set, dtype = 'int')

We'll need the number of users and the number of movies to build our recommendation system. Since user ids and movie ids start at 1, the number of users is the maximum user id, and the number of movies is the maximum movie id. However, since the data is split into a training set and a test set, that maximum may be in either one, so we take the larger of the two.

In [10]:
# Getting the number of users and movies
nb_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))

In [11]:
print("Number of users: {}".format(nb_users))
print("Number of movies: {}".format(nb_movies))

Number of users: 943
Number of movies: 1682


To feed the autoencoder, we need a specific data structure: a list of lists that can be turned into a PyTorch tensor. Each inner list contains the ratings that one user gave to every movie. If a user didn't rate a movie, we put a 0 at that position. We will define a function that builds this list of lists for us.
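To make the target structure concrete, here is a small sketch on made-up ratings. It fills the user-by-movie matrix in one vectorized NumPy assignment, which is a different route from the loop-based function defined in this notebook. The names `toy_ratings`, `toy_nb_users`, `toy_nb_movies` and `toy_matrix` are hypothetical.

```python
import numpy as np

# Made-up ratings: each row is (user_id, movie_id, rating); ids start at 1.
toy_ratings = np.array([
    [1, 1, 5],
    [1, 3, 4],
    [2, 2, 3],
])
toy_nb_users, toy_nb_movies = 2, 4  # hypothetical sizes for this example

# One row per user, one column per movie, 0 meaning "not rated".
toy_matrix = np.zeros((toy_nb_users, toy_nb_movies))
# Shift ids by 1 because array indices start at 0.
toy_matrix[toy_ratings[:, 0] - 1, toy_ratings[:, 1] - 1] = toy_ratings[:, 2]

print(toy_matrix.tolist())
# [[5.0, 0.0, 4.0, 0.0], [0.0, 3.0, 0.0, 0.0]]
```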

## Converting the data into an array with users in lines and movies in columns

In [12]:
def convert(data):
    # Initializing an empty list that will take the list of ratings given by a specific user
    new_data = []
    # Looping over all the users
    for id_users in range(1, nb_users + 1):
        # We get the id of the movies rated by the current user
        id_movies = data[:, 1][data[:, 0] == id_users]
        # We get the ratings given by the current user
        id_ratings = data[:, 2][data[:, 0] == id_users]
        # We start from a vector of zeros, one entry per movie (0 means "not rated")
        ratings = np.zeros(nb_movies)
        # For movies rated by the current user, we replace 0 with the rating
        # The first element of ratings is at index 0. However, id_movies start at index 1.
        # Therefore, ratings[id_movies - 1] will correspond to the location of the movie we're considering
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data

In [13]:
# Applying the convert function to the training and test set.
training_set = convert(training_set)
test_set = convert(test_set)

In [14]:
# Convert the data into Torch tensors
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

## Creating the architecture of the Neural Network

We'll create a stacked autoencoder with one input layer, two encoding layers and two decoding layers, the last of which is the output layer. As a reminder, the output layer of an autoencoder has the same number of nodes as its input layer.

![autoencoder](https://miro.medium.com/max/3524/1*oUbsOnYKX5DEpMOK3pH_lg.png)
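To check that the layer sizes fit together, here is a minimal sketch that uses `nn.Sequential` instead of the `SAE` class of this notebook. The hidden sizes 20 and 10 are the ones used in that class; `toy_sae` and `toy_nb_movies` are names made up for the example, and the input is a vector of zeros used only to inspect shapes.

```python
import torch
import torch.nn as nn

toy_nb_movies = 1682  # assumption: the movie count printed earlier for ml-100k

# Layer sizes: nb_movies -> 20 -> 10 -> 20 -> nb_movies.
toy_sae = nn.Sequential(
    nn.Linear(toy_nb_movies, 20), nn.Sigmoid(),
    nn.Linear(20, 10), nn.Sigmoid(),
    nn.Linear(10, 20), nn.Sigmoid(),
    nn.Linear(20, toy_nb_movies),
)

# A batch holding one user's ratings vector (all zeros, only to check shapes).
one_user = torch.zeros(1, toy_nb_movies)
code = toy_sae[:4](one_user)         # output of the two encoding layers
reconstruction = toy_sae(one_user)   # output of the full network

print(code.shape)            # torch.Size([1, 10])
print(reconstruction.shape)  # torch.Size([1, 1682])
```

The 10-value code in the middle is the compressed representation of a user; the decoder expands it back to one predicted rating per movie.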

In [15]:
# We create the SAE (stacked autoencoder) class, which inherits from nn.Module
class SAE(nn.Module):
    # Initializing the class
    def __init__(self):
        # Calling the constructor of the parent class nn.Module
        super(SAE, self).__init__()
        # Creating the first encoding layer. The number of input corresponds to the number of movies
        # We decide to encode it into 20 outputs
        self.fc1 = nn.Linear(nb_movies, 20)
        # Creating the second encoding layer. From 20 inputs to 10 outputs
        self.fc2 = nn.Linear(20, 10)
        # Creating the first decoding layer. From 10 inputs to 20 outputs
        self.fc3 = nn.Linear(10, 20)
        # Creating the second decoding layer (the output layer). From 20 inputs to nb_movies outputs
        self.fc4 = nn.Linear(20, nb_movies)
        # Creating the activation function applied after each hidden layer
        self.activation = nn.Sigmoid()

    # Creating the function for forward propagation
    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        # The output layer has no activation, so predictions are not squashed into (0, 1)
        x = self.fc4(x)
        return x

In [16]:
# Creating an instance of our SAE class
sae = SAE()
# Defining a criterion which specifies the metric to minimize. In this case, we want to minimize the MSE (Mean Squared Error)
criterion = nn.MSELoss()
# Defining the algorithm used to minimize the loss function. In this case, we'll use RMSprop
optimizer = optim.RMSprop(sae.parameters(), lr = 0.01, weight_decay = 0.5)

## Training the SAE

In [17]:
# Setting the number of epochs
nb_epoch = 200
# Iterating over each epoch
for epoch in range(1, nb_epoch + 1):
    # Initializing the train_loss which will be updated
    train_loss = 0
    # Initializing a counter
    s = 0.
    # Iterating over each user
    for id_user in range(nb_users):
        # The input corresponds to the ratings given by the current user for each movie
        input = Variable(training_set[id_user]).unsqueeze(0)
        target = input.clone()
        # Skip users who have not rated any movie
        if torch.sum(target.data > 0) > 0:
            # Forward pass: the SAE predicts a rating for every movie
            output = sae(input)
            # The target is a constant, so no gradient is computed with respect to it
            target.requires_grad = False
            # Ignore predictions for movies the user did not rate
            output[target == 0] = 0
            # Defining our loss function, comparing the output with the target
            loss = criterion(output, target)
            # Rescale so the mean is taken over rated movies only, not over all nb_movies
            mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
            # Computing the gradients necessary to adjust the weights
            loss.backward()
            # Updating the train_loss
            train_loss += np.sqrt(loss.data*mean_corrector)
            s += 1.
            # Updating the weights of the neural network
            optimizer.step()
    print('epoch: '+str(epoch)+' loss: '+str(train_loss/s))

epoch: 1 loss: tensor(1.7715)
epoch: 2 loss: tensor(1.0964)
epoch: 3 loss: tensor(1.0533)
epoch: 4 loss: tensor(1.0383)
epoch: 5 loss: tensor(1.0308)
epoch: 6 loss: tensor(1.0268)
epoch: 7 loss: tensor(1.0239)
epoch: 8 loss: tensor(1.0218)
epoch: 9 loss: tensor(1.0208)
epoch: 10 loss: tensor(1.0198)
epoch: 11 loss: tensor(1.0188)
epoch: 12 loss: tensor(1.0186)
epoch: 13 loss: tensor(1.0180)
epoch: 14 loss: tensor(1.0175)
epoch: 15 loss: tensor(1.0173)
epoch: 16 loss: tensor(1.0170)
epoch: 17 loss: tensor(1.0167)
epoch: 18 loss: tensor(1.0165)
epoch: 19 loss: tensor(1.0164)
epoch: 20 loss: tensor(1.0162)
epoch: 21 loss: tensor(1.0161)
epoch: 22 loss: tensor(1.0161)
epoch: 23 loss: tensor(1.0158)
epoch: 24 loss: tensor(1.0157)
epoch: 25 loss: tensor(1.0156)
epoch: 26 loss: tensor(1.0156)
epoch: 27 loss: tensor(1.0153)
epoch: 28 loss: tensor(1.0150)
epoch: 29 loss: tensor(1.0128)
epoch: 30 loss: tensor(1.0113)
epoch: 31 loss: tensor(1.0101)
epoch: 32 loss: tensor(1.0077)
epoch: 33 loss: t

After training for 200 epochs, we get an overall loss of 0.9124. This loss is a root mean squared error, so on the training_set the predicted rating is off by roughly 0.91 stars on average: **real_rating ≈ predicted_rating ± 0.9124**. This is a typical error, not a guaranteed bound.
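One detail worth noting: the training loop above calls `loss.backward()` and `optimizer.step()` but never clears the gradients, and PyTorch accumulates gradients across calls to `backward()` unless they are reset. Here is a minimal sketch of a training step that resets them with `optimizer.zero_grad()`. It runs on a tiny made-up model and data set (every `toy_` name is hypothetical), and it masks the output by multiplication instead of in-place assignment. Applying this to the real loop changes the optimization, so the losses printed above would not be reproduced exactly.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Tiny stand-ins for the real objects (hypothetical sizes and data).
toy_nb_movies = 6
toy_model = nn.Sequential(nn.Linear(toy_nb_movies, 3), nn.Sigmoid(),
                          nn.Linear(3, toy_nb_movies))
toy_criterion = nn.MSELoss()
toy_optimizer = optim.RMSprop(toy_model.parameters(), lr=0.01, weight_decay=0.5)

# Two made-up users; 0 means "not rated".
toy_training_set = torch.tensor([[5., 0., 3., 0., 0., 1.],
                                 [0., 4., 0., 0., 2., 0.]])

for user_ratings in toy_training_set:
    inputs = user_ratings.unsqueeze(0)    # add a batch dimension
    target = inputs.clone()
    rated = (target > 0).float()          # 1 where the user rated the movie

    toy_optimizer.zero_grad()             # clear gradients left by the previous user
    output = toy_model(inputs) * rated    # ignore predictions for unrated movies
    loss = toy_criterion(output, target)
    loss.backward()                       # gradients for this user only
    toy_optimizer.step()
```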

## Testing the SAE

In [18]:
# Initializing the test_loss
test_loss = 0
# Counting the users who have at least one rating in the test set
s = 0.
for id_user in range(nb_users):
    # The input is still the user's training ratings
    input = Variable(training_set[id_user]).unsqueeze(0)
    # The target holds the ratings kept aside in the test set
    target = Variable(test_set[id_user]).unsqueeze(0)
    if torch.sum(target.data > 0) > 0:
        output = sae(input)
        target.requires_grad = False
        # Compare only the movies rated in the test set
        output[(target == 0)] = 0
        loss = criterion(output, target)
        mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
        test_loss += np.sqrt(loss.data*mean_corrector)
        s += 1.
print('test_loss: '+str(test_loss/s))

test_loss: tensor(0.9517)


The test_loss is 0.9517. Therefore, on this specific test_set, the predicted rating is off by roughly 0.95 stars on average: **real_rating ≈ predicted_rating ± 0.9517**. As with the training loss, this is a typical error, not a guaranteed bound.
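Both loops multiply the loss by `mean_corrector` before taking the square root. The sketch below, on made-up numbers, shows why: `nn.MSELoss` averages over all movies, including the ones zeroed out by the mask, and the correction turns that into an average over rated movies only. It uses NumPy in place of PyTorch, and assumes the default mean reduction of `nn.MSELoss`.

```python
import numpy as np

# Made-up example: 5 movies, of which the user rated 2 (0 means "not rated").
target = np.array([4.0, 0.0, 0.0, 2.0, 0.0])
output = np.array([3.5, 2.9, 4.1, 3.0, 1.7])

rated = target > 0
# Same effect as output[target == 0] = 0 in the loops above.
masked_output = np.where(rated, output, 0.0)

# Mean squared error over ALL movies, unrated ones contributing 0.
mse_all = np.mean((masked_output - target) ** 2)
mean_corrector = len(target) / rated.sum()

# Rescaled loss versus the RMSE computed directly on the rated movies.
corrected_rmse = np.sqrt(mse_all * mean_corrector)
direct_rmse = np.sqrt(np.mean((output[rated] - target[rated]) ** 2))

print(corrected_rmse, direct_rmse)  # the two values agree
```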