In [1]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
#Set your project path 
project_path =  '/content/gdrive/MyDrive/dl-rl/recommendation-systems/'

# Introduction to Recommendation Systems

In this tutorial we use a [deep autoencoder](https://arxiv.org/abs/1708.01715) to build a recommender system on the [Netflix dataset](https://netflixprize.com/).

The deep autoencoder in this tutorial is implemented in [PyTorch](http://pytorch.org/) and is based on [this repo](https://github.com/NVIDIA/DeepRecommender) by NVIDIA.

## Overview of Recommendation Systems

A [recommendation system](https://en.wikipedia.org/wiki/Recommender_system) seeks to understand user preferences in order to recommend items to each user. These systems have become increasingly popular in recent years, in parallel with the growth of internet companies like Amazon, Netflix or Spotify. Recommender systems are used in a variety of areas including movies, music, news, books, research articles, search queries, social tags, and products in general. In terms of business impact, according to a [study](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.895.3477&rep=rep1&type=pdf) from the Wharton School, recommendation engines can cause a 25% lift in the number of views and a 35% lift in the number of items purchased. These systems are therefore worth understanding.


Generally speaking, there are 3 methodologies for recommendation systems: collaborative filtering, content-based filtering and hybrid recommender systems.

[**Collaborative filtering**](https://en.wikipedia.org/wiki/Collaborative_filtering) collects large amounts of information on users’ behaviors, activities or preferences in order to predict what users will like based on their similarity to other users. This information can be explicit, where the user directly provides ratings for the items, or implicit, where the ratings have to be inferred from user behavior such as the number of views, likes, purchases, etc.

<p align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif" width=300px/>
</p>

Mathematically, it is based on inferring the missing entries in an `m x n` matrix, `R`, whose `(i, j)` entry is the rating given by the `i`-th user to the `j`-th item. The performance is then measured using Root Mean Squared Error (RMSE). This problem has been addressed in a variety of ways. Traditional methods include low-rank matrix factorization like [Alternating Least Squares](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf), which became popular due to its implementation in [Spark](https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). Methods based on deep learning have grown in popularity recently: some techniques use embeddings and ReLU activations, as at [YouTube](http://dl.acm.org/citation.cfm?doid=2959100.2959190); others use recurrent neural networks, as in [Hidasi et al., 2016](http://arxiv.org/abs/1511.06939), or deep autoencoders, as in the present tutorial.
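To make the matrix view concrete, here is a minimal sketch of the RMSE computed only over the observed entries of `R`. The matrices `R` and `R_hat` and the helper `masked_rmse` are hypothetical, made up for illustration, and 0 is assumed to mark a missing rating.

```python
import numpy as np

# Hypothetical 3-user x 4-item rating matrix; 0 marks a missing rating.
R = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 2.0, 0.0, 1.0],
    [0.0, 5.0, 4.0, 0.0],
])

# Hypothetical dense predictions from some model, same shape as R.
R_hat = np.array([
    [4.5, 3.0, 3.5, 2.0],
    [4.0, 2.5, 3.0, 1.0],
    [3.0, 4.0, 4.0, 2.5],
])


def masked_rmse(actual, predicted):
    """RMSE computed only over entries where a rating was observed."""
    mask = actual != 0
    squared_errors = (actual[mask] - predicted[mask]) ** 2
    return float(np.sqrt(squared_errors.mean()))


print(masked_rmse(R, R_hat))  # 0.5 for the toy matrices above
```

Predictions for unrated entries do not affect the score, because there is no ground truth to compare them against.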

**Content-based filtering** takes into account contextual user factors such as location, date of purchase and user demographics, and item factors such as price, brand and type of item, to recommend items that are similar to those that a user liked in the past.

The system creates a content-based profile of users based on a weighted vector of item features. The weights denote the importance of each feature to the user and can be computed from individually rated content vectors using a variety of techniques. Simple approaches use the average values of the rated item vector while other sophisticated methods use machine learning techniques such as Bayesian Classifiers, cluster analysis, decision trees, and neural networks in order to estimate the probability that the user is going to like the item.
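The simple averaging approach can be sketched as follows. The feature matrix, the ratings and the helper `cosine` are hypothetical; the profile is assumed to be a rating-weighted average of the rated item vectors, and unseen items are ranked by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical item feature vectors: [action, comedy, drama].
item_features = np.array([
    [1.0, 0.0, 0.0],  # item 0: action
    [0.0, 1.0, 0.0],  # item 1: comedy
    [1.0, 0.0, 1.0],  # item 2: action drama
    [0.0, 1.0, 1.0],  # item 3: comedy drama
])

# The user rated items 0 and 2; items 1 and 3 are unseen.
rated_items = np.array([0, 2])
ratings = np.array([5.0, 4.0])

# Profile = rating-weighted average of the rated item vectors.
weighted = ratings[:, None] * item_features[rated_items]
profile = weighted.sum(axis=0) / ratings.sum()


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Score each unseen item against the profile and pick the best match.
unseen = [1, 3]
scores = {i: cosine(profile, item_features[i]) for i in unseen}
best = max(scores, key=scores.get)
print(profile, scores, best)
```

In this toy case the user has only liked action and drama titles, so the comedy drama (item 3) ranks above the pure comedy (item 1).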

**Hybrid recommender systems** combine multiple techniques to achieve some synergy between them. They can draw on collaborative filtering, content-based filtering, knowledge-based and demographic approaches. They have proved to be very effective in some cases, like the [Bellkor solution](https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf), winner of the Netflix prize, the [Netflix Recommender System](http://delivery.acm.org/10.1145/2850000/2843948/a13-gomez-uribe.pdf) or [LightFM](https://github.com/lyst/lightfm).

In this tutorial we focus on collaborative filtering with autoencoders, using data from the Netflix Prize.

In [3]:
!pip install aiohttp

Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

In [4]:
import sys
import os
import numpy as np
import pandas as pd
import torch
import aiohttp
import asyncio
import json
import requests

# Reuse the project path defined above so the helper modules can be imported
sys.path.append(project_path)
from utils import get_gpu_name, get_number_processors, get_gpu_memory, get_cuda_version
from parameters import *


print("OS: ", sys.platform)
print("Python: ", sys.version)
print("PyTorch: ", torch.__version__)
print("Numpy: ", np.__version__)
print("Number of CPU processors: ", get_number_processors())
print("GPU: ", get_gpu_name())
print("GPU memory: ", get_gpu_memory())
print("CUDA: ", get_cuda_version())

%matplotlib inline
%load_ext autoreload
%autoreload 2

OS:  linux
Python:  3.7.13 (default, Apr 24 2022, 01:04:09) 
[GCC 7.5.0]
PyTorch:  1.11.0+cu113
Numpy:  1.21.6
Number of CPU processors:  2
GPU:  ['Tesla T4']
GPU memory:  ['15109 MiB']
CUDA:  No CUDA in this machine


## Dataset: Netflix

This dataset was constructed to support participants in the [Netflix Prize](http://www.netflixprize.com). The movie rating files contain over 100 million ratings from 480 thousand randomly-chosen, anonymous Netflix customers over 17 thousand movie titles.  The data were collected between October, 1998 and December, 2005 and reflect the distribution of all ratings received during this period.  The ratings are on a scale from 1 to 5 (integral) stars.

The dataset can be [downloaded here](http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a). To uncompress it:

```bash
tar -xvf nf_prize_dataset.tar.gz
tar -xf download/training_set.tar
```

When we download the data, there are two important files:

1) The file `training_set.tar` is a tar of a directory containing 17770 files, one per movie.  The first line of each file contains the movie id followed by a colon.  Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

`CustomerID, Rating, Date`
- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.
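As a rough illustration of this layout, the sketch below parses one per-movie file into `(CustomerID, MovieID, Rating, Date)` rows. The function name `parse_movie_file` and the sample contents are assumptions made for this example; this is not the parser used by the conversion script.

```python
import io

# Illustrative contents of one per-movie file (not read from disk here).
sample = """1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
"""


def parse_movie_file(handle):
    """Yield (customer_id, movie_id, rating, date) tuples from one file."""
    # The first line holds the movie id followed by a colon.
    movie_id = int(handle.readline().strip().rstrip(":"))
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Strip each field so "a, b, c" and "a,b,c" are both accepted.
        customer_id, rating, date = [f.strip() for f in line.split(",")]
        yield int(customer_id), movie_id, int(rating), date


rows = list(parse_movie_file(io.StringIO(sample)))
print(rows[0])
```

For a real file, the same function can be called with a handle from `open(path)` instead of the `io.StringIO` wrapper.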

2) Movie information in [`movie_titles.txt`](data/movie_titles.txt) is in the following format:

`MovieID, YearOfRelease, Title`

- MovieIDs do not correspond to actual Netflix movie IDs or IMDB movie IDs.
- YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
- Title is the Netflix movie title, in English.

### Data prep

The first step is to convert the data to the format the autoencoder expects. This can take between 1 and 2 hours.

In [7]:
%%time
%run /content/gdrive/MyDrive/dl-rl/recommendation-systems/DeepRecommender/data_utils/data_convert.py $NF_PRIZE_DATASET $NF_DATA


The script splits the data into train, test and validation sets, creating files with three columns: `CustomerID,MovieID,Rating`. The data is split over time, generating 4 datasets: Netflix 3 months, Netflix 6 months, Netflix 1 year and Netflix full. The following table gives some details of each dataset:

| Dataset | Netflix 3 months | Netflix 6 months | Netflix 1 year | Netflix full |
| ------- | ---------------- | ---------------- | -------------- | ------------ |
| Ratings train | 13,675,402 | 29,179,009 | 41,451,832 | 98,074,901 |
| Users train | 311,315 | 390,795 | 345,855 | 477,412 |
| Items train | 17,736 | 17,757 | 16,907 | 17,768 |
| Time range train | 2005-09-01 to 2005-11-30 | 2005-06-01 to 2005-11-30 | 2004-06-01 to 2005-05-31 | 1999-12-01 to 2005-11-30 |
| | | | | |
| Ratings test | 2,082,559 | 2,175,535 | 3,888,684 | 2,250,481 |
| Users test | 160,906 | 169,541 | 197,951 | 173,482 |
| Items test | 17,261 | 17,290 | 16,506 | 17,305 |
| Time range test | 2005-12-01 to 2005-12-31 | 2005-12-01 to 2005-12-31 | 2005-06-01 to 2005-06-30 | 2005-12-01 to 2005-12-31 |
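
The idea of splitting by time can be sketched in a few lines. This is only an illustration using the Netflix 3 months boundaries as half-open date ranges; the rows and the helper `in_range` are hypothetical, and the actual logic in `data_convert.py` may differ in its details.

```python
# Hypothetical (customer_id, movie_id, rating, date) rows; values are made up.
ratings = [
    (10, 1, 4, "2005-08-15"),
    (10, 2, 5, "2005-09-03"),
    (11, 1, 3, "2005-11-30"),
    (11, 3, 2, "2005-12-01"),
    (12, 2, 1, "2005-12-25"),
]


def in_range(rows, start, end):
    """Keep rows with start <= date < end.

    ISO dates (YYYY-MM-DD) sort correctly when compared as strings.
    """
    return [row for row in rows if start <= row[3] < end]


# Train on September to November 2005, test on December 2005.
train = in_range(ratings, "2005-09-01", "2005-12-01")
test = in_range(ratings, "2005-12-01", "2006-01-01")
print(len(train), len(test))
```

Splitting by date rather than at random means the model is always evaluated on ratings made after the ones it was trained on.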

Let's take a look at some of the files.

In [None]:
nf_3m_valid = os.path.join(NF_DATA, 'N3M_VALID', 'n3m.valid.txt')
df = pd.read_csv(nf_3m_valid, names=['CustomerID','MovieID','Rating'], sep='\t')
print(df.shape)
df.head()

In [None]:
nf_3m_test = os.path.join(NF_DATA, 'N3M_TEST', 'n3m.test.txt')
df2 = pd.read_csv(nf_3m_test, names=['CustomerID','MovieID','Rating'], sep='\t')
print(df2.shape)
df2.head()

(1040820, 3)


Unnamed: 0,CustomerID,MovieID,Rating
0,0,159,4.0
1,0,4830,1.0
2,0,1261,3.0
3,0,12058,3.0
4,0,13412,2.0


## Deep Autoencoder for Collaborative Filtering

Now that we have the data, let's look in some detail at the model we are going to use. The [model](https://arxiv.org/abs/1708.01715), developed at NVIDIA, is a deep autoencoder with 6 layers that uses the SELU (scaled exponential linear unit) non-linear activation function, dropout and iterative dense re-feeding.

An autoencoder is a network which implements two transformations: $encode(x): R^n \to R^d$ and $decode(z): R^d \to R^n$. The “goal” of the autoencoder is to obtain a $d$-dimensional representation of the data such that an error measure between $x$ and $f(x) = decode(encode(x))$ is minimized. The next figure shows the autoencoder architecture proposed in the [paper](https://arxiv.org/abs/1708.01715). The encoder has 2 layers, $e_1$ and $e_2$, and the decoder has 2 layers, $d_1$ and $d_2$. Dropout may be applied to the coding layer $z$. In the paper, the authors show experiments with different numbers of layers, from 2 to 12 (see Table 2 in the original paper).

<p align="center">
    <img src="./data/AutoEncoder.png" width=350px/>
</p>
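A minimal PyTorch sketch of such an architecture follows. The class name `TinyAutoEncoder` and the small sizes in the usage lines are hypothetical; this is not the implementation from the NVIDIA repo, only an illustration of a symmetric encoder and decoder with SELU activations and dropout on the coding layer.

```python
import torch
import torch.nn as nn


class TinyAutoEncoder(nn.Module):
    """Symmetric autoencoder sketch: n -> hidden sizes -> n, with SELU."""

    def __init__(self, n_items, hidden=(512, 512, 1024), drop_prob=0.8):
        super().__init__()
        sizes = [n_items, *hidden]
        # Encoder: one Linear + SELU per consecutive pair of sizes.
        enc = []
        for a, b in zip(sizes[:-1], sizes[1:]):
            enc += [nn.Linear(a, b), nn.SELU()]
        self.encoder = nn.Sequential(*enc)
        # Dropout is applied to the coding layer z only.
        self.dropout = nn.Dropout(drop_prob)
        # Decoder mirrors the encoder sizes in reverse order.
        rev = sizes[::-1]
        dec = []
        for a, b in zip(rev[:-1], rev[1:]):
            dec += [nn.Linear(a, b), nn.SELU()]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.dropout(self.encoder(x))
        return self.decoder(z)


# Small hypothetical sizes so the example runs quickly on CPU.
model = TinyAutoEncoder(n_items=100, hidden=(16, 16, 32))
x = torch.zeros(4, 100)  # batch of 4 users, 100 items
y = model(x)
print(y.shape)
```

With the default `hidden=(512, 512, 1024)` and 17,736 items, the layer sizes would be `[17736, 512, 512, 1024]` for the encoder, mirrored for the decoder.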

During the forward pass the model takes a user, represented by that user's vector of training-set ratings $x \in R^n$, where $n$ is the number of items. Note that $x$ is very sparse, while the output of the decoder, $y = f(x) \in R^n$, is dense and contains rating predictions for all items in the corpus. Because a zero in $x$ means "not rated", the paper optimizes a masked mean squared error (MMSE) that only counts the observed ratings; the reported RMSE is the square root of this loss.

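One of the key ideas of the paper is dense re-feeding. Consider an idealized scenario with a perfect $f$. Then $f(x)_i = x_i, \forall i : x_i \ne 0$, and $f(x)_i$ accurately predicts all of the user's future ratings for the items with $x_i = 0$. This means that if the user rates a new item $k$ (thereby creating a new vector $x'$), then $f(x)_k = x'_k$ and $f(x) = f(x')$. In this idealized scenario, $y = f(x)$ should therefore be a fixed point of a well-trained autoencoder: $f(y) = y$. To enforce this constraint explicitly, and to train on dense inputs as well as sparse ones, the authors feed the output back into the autoencoder, which effectively augments the dataset. The method consists of the following steps: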

1. Given a sparse $x$, compute the forward pass to get $f(x)$ and the loss.

2. Backpropagate the loss and update the weights.

3. Treat $f(x)$ as a new example and compute $f(f(x))$ and its loss (second forward pass).

4. Backpropagate again and update the weights (second backward pass).

Steps 3 and 4 can be repeated several times.
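The four steps can be sketched as a single training step in PyTorch. Everything here is a hypothetical stand-in: the tiny network, the toy batch, and the helpers `masked_mse` and `train_step`. The loss is assumed to be a mean squared error over non-zero targets, and the optimizer is assumed to be SGD with momentum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def masked_mse(pred, target):
    """Mean squared error over the non-zero (observed) entries of target."""
    mask = (target != 0).float()
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)


# Tiny stand-in autoencoder (hypothetical sizes), not the paper's network.
model = nn.Sequential(nn.Linear(6, 4), nn.SELU(), nn.Linear(4, 6), nn.SELU())
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# Hypothetical sparse batch: two users, six items, 0 = unrated.
x = torch.tensor([[5.0, 0.0, 3.0, 0.0, 0.0, 1.0],
                  [0.0, 4.0, 0.0, 2.0, 5.0, 0.0]])


def train_step(batch, refeed_steps=1):
    # Steps 1-2: forward pass on the sparse input, then backward and update.
    optimizer.zero_grad()
    y = model(batch)
    loss = masked_mse(y, batch)
    loss.backward()
    optimizer.step()
    # Steps 3-4: treat the dense output as a new example and train on it.
    for _ in range(refeed_steps):
        dense_input = y.detach()  # no gradient flows through the old output
        optimizer.zero_grad()
        y = model(dense_input)
        refeed_loss = masked_mse(y, dense_input)
        refeed_loss.backward()
        optimizer.step()
    return loss.item()


losses = [train_step(x) for _ in range(3)]
print(losses)
```

Detaching the first output before feeding it back keeps the second pass an independent training example, so each pass gets its own weight update.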

Finally, the authors explore different non-linear [activation functions](https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py). They found that on this task ELU, SELU and LRELU, which have non-zero negative parts, perform much better than SIGMOID, RELU, RELU6, and TANH.
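To see what "non-zero negative part" means, the sketch below evaluates a few of these activations at a negative input using plain Python. The SELU constants are the standard ones from the SELU definition; the function names and the default ELU and leaky ReLU parameters are chosen for this example.

```python
import math

# Standard constants from the SELU definition.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


# ReLU maps every negative input to 0; the other three keep a negative output.
for fn in (relu, leaky_relu, elu, selu):
    print(fn.__name__, fn(-1.0), fn(1.0))
```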

Now, let's run the training. The model parameters can be found in [parameters.py](parameters.py).

In [None]:
%run ./DeepRecommender/run.py --gpu_ids $GPUS \
    --path_to_train_data $TRAIN \
    --path_to_eval_data $EVAL \
    --hidden_layers $HIDDEN \
    --non_linearity_type $ACTIVATION \
    --batch_size $BATCH_SIZE \
    --logdir $MODEL_OUTPUT_DIR \
    --drop_prob $DROPOUT \
    --optimizer $OPTIMIZER \
    --lr $LR \
    --weight_decay $WD \
    --aug_step $AUG_STEP \
    --num_epochs $EPOCHS 

Namespace(aug_step=1, batch_size=128, constrained=False, drop_prob=0.8, gpu_ids='0', hidden_layers='512,512,1024', logdir='model_save', lr=0.005, noise_prob=0.0, non_linearity_type='selu', num_epochs=10, optimizer='momentum', path_to_eval_data='Netflix/N3M_VALID', path_to_train_data='Netflix/N3M_TRAIN', skip_last_layer_nl=False, weight_decay=0.0)
Loading training data from Netflix/N3M_TRAIN
Data loaded
Total items found: 311315
Vector dim: 17736
Loading eval data from Netflix/N3M_VALID
******************************
******************************
[17736, 512, 512, 1024]
Dropout drop probability: 0.8
Encoder pass:
torch.Size([512, 17736])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([1024, 512])
torch.Size([1024])
Decoder pass:
torch.Size([512, 1024])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([17736, 512])
torch.Size([17736])
******************************
******************************
Using GPUs: [0]
Doing epoch 0 of 10
Total epoch 

## Evaluation
Now we are going to evaluate the model on the test set and compute the final loss.

In [None]:
%run ./DeepRecommender/infer.py \
--path_to_train_data $TRAIN \
--path_to_eval_data $TEST \
--hidden_layers $HIDDEN \
--non_linearity_type $ACTIVATION \
--save_path  $MODEL_PATH \
--drop_prob $DROPOUT \
--predictions_path $INFER_OUTPUT

Namespace(constrained=False, drop_prob=0.8, hidden_layers='512,512,1024', non_linearity_type='selu', path_to_eval_data='Netflix/N3M_TEST', path_to_train_data='Netflix/N3M_TRAIN', predictions_path='preds.txt', save_path='model_save/model.epoch_9', skip_last_layer_nl=False)
Loading training data
Data loaded
Total items found: 311315
Vector dim: 17736
Loading eval data
******************************
******************************
[17736, 512, 512, 1024]
Dropout drop probability: 0.8
Encoder pass:
torch.Size([512, 17736])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([1024, 512])
torch.Size([1024])
Decoder pass:
torch.Size([512, 1024])
torch.Size([512])
torch.Size([512, 512])
torch.Size([512])
torch.Size([17736, 512])
torch.Size([17736])
******************************
******************************
Loading model from: model_save/model.epoch_9
Done: 0
Done: 10000
Done: 20000
Done: 30000
Done: 40000
Done: 50000
Done: 60000
Done: 70000
Done: 80000
Done: 90000
Done: 100

In [None]:
%run ./DeepRecommender/compute_RMSE.py --path_to_predictions=$INFER_OUTPUT

Namespace(path_to_predictions='preds.txt', round=False)
####################
RMSE: 0.9746437597050387
####################


## Conclusion
In this notebook we showed how to create a recommendation system with a deep autoencoder. 


Happy coding!