## Building a recommendation system with Stacked Autoencoders

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

### 1. Importing our data

- This is the well-known MovieLens dataset; you can read more about it [here](https://grouplens.org/datasets/movielens/).
- Here I'm going to use the older 100k version of the dataset.

In [7]:
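# Import the training dataset and convert it to a NumPy array
# Note: read_csv treats the first line as a header by default; pass header=None if the file has no header row
training_set_df = pd.read_csv('dataset/ml-100k/u1.base', delimiter = '\t')
trainin_set = np.array(training_set_df, dtype=int)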

In [11]:
trainin_set[10]

array([        1,        16,         5, 878543541])

- As shown above, the first column is the user ID, the second is the movie ID, the third is the rating, and the last is the timestamp at which the user rated the movie.
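
To make the column layout explicit, here is a minimal sketch that loads tab-separated ratings with named columns. The column names (`user_id`, `movie_id`, `rating`, `timestamp`) are assumptions chosen for readability, and the two inline rows are the arrays printed in this notebook rather than a real file; `header=None` tells pandas that the first line is data, not column names.

```python
import io
import pandas as pd

# Two rows in the same tab-separated layout as the ratings files (illustrative input)
sample = "1\t16\t5\t878543541\n1\t36\t2\t875073180\n"

# Column names are assumptions, the files themselves carry no header
cols = ["user_id", "movie_id", "rating", "timestamp"]
ratings_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=cols)

print(ratings_df.shape)              # (2, 4)
print(ratings_df.loc[0, "rating"])   # 5
```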

In [25]:
# Import the test dataset and convert it to a NumPy array
test_set_df = pd.read_csv('dataset/ml-100k/u1.test', delimiter='\t')
test_set = np.array(test_set_df, dtype=int)
test_set[10]

array([        1,        36,         2, 875073180])

---

### 2. Preparing the data

- The data will be one large matrix in which each row holds one user's ratings
    > e.g. the first row represents the first user, and the first column of that row holds the first user's rating of the first movie
    
    > e.g. matrix[5, 28] gives the rating from user \#5 for movie \#28
    

- Get the max user_id and the max movie_id
    > to know what the dimensions of our matrix will be
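
As a small illustration of that layout, the sketch below builds a toy user-by-movie matrix with NumPy. The sizes and ratings are made up. Note that IDs in the dataset start at 1 while array indices start at 0, so in code the rating of user `u` for movie `m` lives at `matrix[u - 1, m - 1]`.

```python
import numpy as np

n_users, n_movies = 3, 4          # toy sizes, not the real dataset
matrix = np.zeros((n_users, n_movies))

# (user_id, movie_id, rating) triples with 1-based IDs, made up for illustration
triples = [(1, 2, 3), (1, 4, 5), (3, 1, 4)]
for user_id, movie_id, rating in triples:
    matrix[user_id - 1, movie_id - 1] = rating

print(matrix[0])       # [0. 3. 0. 5.]
print(matrix[2, 0])    # 4.0
```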

#### First we want to get the total number of users and the total number of movies

- In the cell below, we take the maximum ID across both the training set and the test set, because the ratings were randomly shuffled between the two sets, so the highest user ID or movie ID could be in either one

In [26]:
nb_users = int(max(max(trainin_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(trainin_set[:, 1]), max(test_set[:, 1])))

In [27]:
# printing the results
print(f'The number of users {nb_users}, The number of movies {nb_movies}')

The number of users 943, The number of movies 1682


- List of User 1: [Ratings of all the movies by User 1]
- List of User 2: [Ratings of all the movies by User 2]

+ Define a function that converts both the training data and the test data into this format

#### Some points about the function below

- What we are trying to do here is build a list of lists, where every list corresponds to a specific user and every element in that list corresponds to a specific movie

- First we create an empty list
- Then we loop over the users in our data, and at every step we create a list whose length is the number of movies; every value represents the rating for that movie
- For every movie that a user didn't rate, we put a zero in that element
- For every movie the user did rate, we put the rating in the list
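
The same steps can also be sketched without the per-user loop, using NumPy index arrays. This is an alternative illustration, not the function used in this notebook; `to_user_movie_matrix` and the toy input are hypothetical, and the sketch returns a 2-D array rather than a list of lists.

```python
import numpy as np

def to_user_movie_matrix(data, n_users, n_movies):
    """Hypothetical vectorized variant; data has columns [user_id, movie_id, rating, ...]."""
    matrix = np.zeros((n_users, n_movies))   # zero means "not rated"
    # IDs are 1-based, so shift by one to get array indices
    matrix[data[:, 0] - 1, data[:, 1] - 1] = data[:, 2]
    return matrix

# Made-up ratings: user 1 rated movies 2 and 3, user 2 rated movie 1
toy = np.array([[1, 2, 3, 0],
                [1, 3, 4, 0],
                [2, 1, 5, 0]])
result = to_user_movie_matrix(toy, n_users=2, n_movies=3)
print(result.tolist())   # [[0.0, 3.0, 4.0], [5.0, 0.0, 0.0]]
```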

In [39]:
def convert_fn(data):
    converted_data = []
    
    for user_id in range(1, nb_users + 1):
        # IDs of all the movies rated by the current user
        movies_for_user_id = data[:, 1][data[:, 0] == user_id]
        # the ratings the current user gave to those movies
        rating_for_user_id = data[:, 2][data[:, 0] == user_id]
        
        ratings = np.zeros(nb_movies) # initialize all ratings to zero, then fill in the rated movies
        ratings[movies_for_user_id - 1] = rating_for_user_id # movie IDs start at 1, indices at 0
        converted_data.append(list(ratings))
        
    return converted_data

- To check that our function works, let's print some data to build some intuition

In [40]:
for row in trainin_set[:3]:
    print(f'Rating value: {row[2]}, for movie number: {row[1]}, from user number: {row[0]}')

Rating value: 3, for movie number: 2, from user number: 1
Rating value: 4, for movie number: 3, from user number: 1
Rating value: 3, for movie number: 4, from user number: 1


#### Some intuition from the previous prints:
- user \#1 has no rating for movie \#1 in the array, since the first row is for movie \#2
    - So the first element in the first list should be 0
- the beginning of the first list should look like this \[0, 3, 4, 3...\]
- so let's convert the data and see the results

In [42]:
new_training = convert_fn(trainin_set)
new_test = convert_fn(test_set)

Printing the first **list** from the data set

In [58]:
print(new_training[0][:10]) # print the first ten elements of the first list
print(f'\nThe dimensions of the data is {len(new_training)} X {len(new_training[0])}')

[0.0, 3.0, 4.0, 3.0, 3.0, 0.0, 4.0, 1.0, 5.0, 0.0]

The dimensions of the data is 943 X 1682


So, as we expected, the first four values are correct and the dimensions are right
- So we're good to go

### Convert the data into torch tensors

In [59]:
training_d = torch.FloatTensor(new_training)
test_d = torch.FloatTensor(new_test)

---

### 3. Creating the stacked autoencoder

Here I'm using 6 fully connected layers to build the stacked autoencoder
- fc1 to fc3 form the encoder 
- fc4 to fc6 form the decoder 
- The sigmoid activation function is used between layers
    > Sigmoid was chosen after tuning the model, as it gave better results than tanh and relu
- No activation function is used in the output layer
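
The layer sizes described above can be checked with a quick shape trace. The sketch below is an assumption-laden stand-in: it rebuilds the same sequence of sizes with `torch.nn.Sequential` (the name `toy_sae` is hypothetical) and pushes one dummy user through it, using the `nb_movies` value printed earlier in this notebook.

```python
import torch

nb_movies = 1682                       # value printed earlier in this notebook
sizes = [nb_movies, 30, 15, 10, 15, 30, nb_movies]

layers = []
for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
    layers.append(torch.nn.Linear(n_in, n_out))
    if i < len(sizes) - 2:             # no activation after the output layer
        layers.append(torch.nn.Sigmoid())
toy_sae = torch.nn.Sequential(*layers)

x = torch.zeros(1, nb_movies)          # one dummy user with no ratings
shapes = []
for layer in toy_sae:
    x = layer(x)
    if isinstance(layer, torch.nn.Linear):
        shapes.append(tuple(x.shape))  # record the output shape of each linear layer
print(shapes)
# [(1, 30), (1, 15), (1, 10), (1, 15), (1, 30), (1, 1682)]
```

The narrowest layer has 10 units, so every user's 1682 ratings are compressed into a 10-dimensional code before being reconstructed.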

In [76]:
class SAE(torch.nn.Module):
    def __init__(self):
        super().__init__() # inherit all the functionality of the parent class
        self.fc1 = torch.nn.Linear(nb_movies, 30) # Encoding
        self.fc2 = torch.nn.Linear(30, 15) # Encoding
        self.fc3 = torch.nn.Linear(15, 10) # Encoding
        self.fc4 = torch.nn.Linear(10, 15) # Decoding
        self.fc5 = torch.nn.Linear(15, 30) # Decoding
        self.fc6 = torch.nn.Linear(30, nb_movies) # Decoding
        
        self.activation = torch.nn.Sigmoid()
        
    def forward(self, x):
        lyr_1 = self.activation(self.fc1(x))
        lyr_2 = self.activation(self.fc2(lyr_1))
        lyr_3 = self.activation(self.fc3(lyr_2))
        lyr_4 = self.activation(self.fc4(lyr_3))
        lyr_5 = self.activation(self.fc5(lyr_4))
        out = self.fc6(lyr_5) # no activation on the output layer
        
        return out

### Creating our cost and optimizer

In [77]:
sae = SAE()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.RMSprop(sae.parameters(), lr=0.01, weight_decay=0.5)

---

### Training our model

Training our model for 50 epochs gives us an average loss of approximately 1.01

In [79]:
nb_epochs = 50
for epoch in range(1, nb_epochs + 1):
    train_loss = 0
    nb_usr_rated = 0.0 # number of users with at least one rating, used to average the loss
    
    for id_user in range(nb_users):
        # our input batch here is a single user at a time;
        # torch expects a batch dimension, so we add one with unsqueeze(0)
        input_batch = training_d[id_user].unsqueeze(0)
        target = input_batch.clone()
        # to save memory and computation, skip users who didn't rate any movie
        if torch.sum(target > 0) > 0:
            out = sae(input_batch)
            target.requires_grad = False # no gradients are needed for the targets
            out[target == 0] = 0 # unrated movies must not contribute to the loss
            
            loss = criterion(out, target)
            # MSELoss averages over all movies; rescale to the mean over rated movies only
            mean_corrector = nb_movies / float(torch.sum(target > 0) + 1e-10)
            # note: optimizer.zero_grad() is never called, so gradients accumulate across steps
            loss.backward()
            
            train_loss += np.sqrt(loss.item() * mean_corrector) # loss.item() replaces the old loss.data[0]
            
            nb_usr_rated += 1
            optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch num: {epoch}, Loss value is: {(train_loss / nb_usr_rated)}')
    



Epoch num: 10, Loss value is: 1.0236103534698486
Epoch num: 20, Loss value is: 1.020880937576294
Epoch num: 30, Loss value is: 1.0180864334106445
Epoch num: 40, Loss value is: 1.0165773630142212
Epoch num: 50, Loss value is: 1.0112740993499756
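
The `mean_corrector` in the loop above compensates for the fact that `MSELoss` averages over all `nb_movies` entries, while only the rated ones contribute a non-zero error once the unrated outputs are set to 0. Multiplying by `nb_movies / n_rated` turns that into the mean over the rated movies only. Here is a small numeric check with made-up values (the tensors and the names `n_rated`, `mse_all` and `mse_rated` are illustrative):

```python
import torch

target = torch.tensor([[0.0, 3.0, 0.0, 5.0, 4.0]])   # 0 means "not rated"
out = torch.tensor([[2.5, 2.0, 1.0, 4.0, 4.5]])      # made-up predictions
out[target == 0] = 0                                 # ignore unrated movies

n_movies = target.shape[1]
n_rated = int((target > 0).sum())

mse_all = torch.nn.MSELoss()(out, target)            # averaged over all 5 entries
corrected = mse_all * n_movies / n_rated             # rescaled to the 3 rated entries

rated = target > 0
mse_rated = ((out[rated] - target[rated]) ** 2).mean()  # direct mean over rated entries
print(round(corrected.item(), 4), round(mse_rated.item(), 4))   # 0.75 0.75
```

Taking the square root of the corrected value, as the loop does, gives a per-user RMSE, so a loss of about 1.0 corresponds to being off by roughly one star on average.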


---

### Testing our model 

In [83]:
testing_loss = 0
nb_usr_rated = 0.0 # number of users with at least one test rating, used to average the loss
    
for id_user in range(nb_users):
    # the model sees the training ratings and is scored against the held-out test ratings
    input_batch = training_d[id_user].unsqueeze(0)
    target = test_d[id_user].unsqueeze(0)
    # skip users who have no ratings in the test set
    if torch.sum(target > 0) > 0:
        out = sae(input_batch)
        target.requires_grad = False # no gradients are needed for the targets
        out[target == 0] = 0 # only score movies that have a test rating

        loss = criterion(out, target)
        mean_corrector = nb_movies / float(torch.sum(target > 0) + 1e-10)

        testing_loss += np.sqrt(loss.item() * mean_corrector) # loss.item() replaces the old loss.data[0]
        nb_usr_rated += 1
        
print(f'The Testing data loss value is: {testing_loss / nb_usr_rated}')


The Testing data loss value is: 1.015864610671997


**Hope you enjoyed it,  
Thanks for reading.  
Peace ^_^**