# Neural Collaborative Filtering

Recommender systems have benefited from deep learning's success. Today's state-of-the-art recommender systems, such as those at YouTube and Amazon, are powered by complex deep learning systems and rely less on traditional methods.

In this notebook, we will go through the following:

- How to create your own deep learning based recommender system using PyTorch Lightning/PyTorch
- The difference between implicit and explicit feedback for recommender systems
- How to train-test split a dataset for training recommender systems without introducing bias or data leakage
- Metrics for evaluating recommender systems (hint: accuracy or RMSE is not appropriate!)



Reference: [Kaggle notebook by jamesloy](https://www.kaggle.com/jamesloy/deep-learning-based-recommender-systems)

## Building Recommender Systems using Implicit Feedback

Before we build our model, it is important to understand the distinction between _implicit_ and _explicit_ feedback in the context of recommender systems, and why __modern recommender systems are built on implicit feedback__.


### Explicit Feedback

In the context of recommender systems, explicit feedback is __direct__ and __quantitative__ data collected from users.
- For example, Amazon allows users to rate purchased items on a scale of 1 to 5 stars. These ratings are provided directly by users, and the scale allows Amazon to quantify user preference.
- Another example of explicit feedback is the thumbs up button on YouTube, which captures users' explicit preference (i.e. like or dislike) for a particular video.

However, __the problem with explicit feedback is that it is rare__. If you think about it, when was the last time you clicked the like button on a YouTube video, or rated your online purchases? Chances are, the number of videos you watch on YouTube is far greater than the number of videos that you have explicitly rated.


### Implicit Feedback

On the other hand, __implicit feedback is collected indirectly from user _interactions___, and it acts as a proxy for user preference.
- For example, videos that you watch on YouTube are used as implicit feedback to tailor recommendations for you, even if you don't rate the videos explicitly.
- Another example of implicit feedback is the items that you have browsed on Amazon, which are used to suggest other similar items for you.

__The advantage of implicit feedback is that it is abundant. Recommender systems built on implicit feedback can also tailor recommendations in real time, with every click and interaction, which is why today's online recommender systems are built using it__.

_However, implicit feedback has its shortcomings as well. Unlike explicit feedback, every interaction is assumed to be positive and we are unable to capture negative preference from users_. How then do we capture negative feedback? One technique that can be applied is negative sampling, which we will go through in a later section.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
!cp /content/drive/MyDrive/ratings_small.csv /content

In [13]:
!cp /content/drive/MyDrive/movie.csv /content

In [14]:
# Read in CSV-
# ratings = pd.read_csv("rating.csv", parse_dates = ['timestamp'], nrows = 100000)
ratings = pd.read_csv("ratings_small.csv", parse_dates = ['timestamp'])

In [15]:
# Sort by time stamp-
ratings.sort_values(by = 'timestamp', ascending = True, inplace = True)

In [16]:
ratings.shape
# (20000263, 4)

(100000, 5)

In [17]:
ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp', 'rank_latest'], dtype='object')

In [18]:
ratings.isna().values.any()

False

In [19]:
ratings.isna().sum().sum()

0

In [20]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   userId       100000 non-null  int64         
 1   movieId      100000 non-null  int64         
 2   rating       100000 non-null  float64       
 3   timestamp    100000 non-null  datetime64[ns]
 4   rank_latest  100000 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 4.6 MB


In [None]:
# Save 'ratings.csv'-
# ratings.to_csv("ratings_small.csv", index = False)

In [None]:
# In order to keep memory low, use data randomly selected from 30% of the users in this dataset-
'''
random_userIds = np.random.choice(
    ratings['userId'].unique(), size = int(len(ratings['userId'].unique()) * 0.3),
    replace = False
)

ratings = ratings.loc[ratings['userId'].isin(random_userIds)]

print(f"{len(ratings)} rows of data from {len(random_userIds)} users")
'''

5975024 rows of data from 41547 users


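- After filtering the full dataset, there were 5,975,024 rows of data from 41,547 users. The saved subset `ratings_small.csv` used in the rest of this notebook has 100,000 rows from 702 users.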
- Each row in the dataframe corresponds to a movie review made by a single user.

In [21]:
print(f"number of users = {ratings['userId'].nunique()} and number of movies = {ratings['movieId'].nunique()}")

number of users = 702 and number of movies = 8227


In [22]:
ratings.reset_index(drop = True, inplace = True) 

In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,491,5,3.0,1996-04-11 15:02:44
1,491,10,4.0,1996-04-11 15:02:44
2,491,11,4.0,1996-04-11 15:02:44
3,491,17,5.0,1996-04-11 15:02:44
4,491,709,4.0,1996-04-11 15:02:44


In [23]:
ratings.sample(6)

Unnamed: 0,userId,movieId,rating,timestamp,rank_latest
13711,552,1193,5.0,1999-07-26 11:27:28,4.0
22248,394,1950,5.0,2000-01-16 04:13:57,996.0
78345,11,53000,5.0,2009-01-01 04:16:37,473.0
76900,387,1347,3.0,2008-10-11 23:41:14,136.0
617,607,141,5.0,1996-05-15 08:12:31,11.0
46917,208,6870,3.5,2003-11-15 18:48:58,662.0


In [24]:
movie = pd.read_csv("movie.csv")

In [25]:
movie.shape

(27278, 3)

In [26]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [27]:
# Merge 'movie' and 'ratings' pd DataFrames-
combined = pd.merge(ratings, movie, on = 'movieId', how = 'inner')

In [28]:
combined.shape, ratings.shape, movie.shape

((100000, 7), (100000, 5), (27278, 3))

In [29]:
combined.head()

Unnamed: 0,userId,movieId,rating,timestamp,rank_latest,title,genres
0,491,5,3.0,1996-04-11 15:02:44,27.0,Father of the Bride Part II (1995),Comedy
1,524,5,4.0,1996-04-22 08:13:13,12.0,Father of the Bride Part II (1995),Comedy
2,591,5,5.0,1996-06-19 13:46:00,2.0,Father of the Bride Part II (1995),Comedy
3,390,5,5.0,1996-06-30 12:59:43,42.0,Father of the Bride Part II (1995),Comedy
4,509,5,5.0,1996-07-13 09:11:58,55.0,Father of the Bride Part II (1995),Comedy


## Train-test split:

Along with the rating, there is also a timestamp column that shows the date and time the review was submitted. Using the timestamp column, we will implement our train-test split strategy using the __leave-one-out methodology. For each user, the most recent review is used as the test set (i.e. leave one out), while the rest will be used as training data__.

To illustrate this, the movies reviewed by user 491 are shown below.

NOTE: The original illustration referred to userId = 39849. That user is not present in this subset, so userId = 491 is shown instead.

In [None]:
# ratings[ratings['userId'] == 491]

In [30]:
# ratings[ratings['userId'] == 491]
combined[combined['userId'] == 491]

Unnamed: 0,userId,movieId,rating,timestamp,rank_latest,title,genres
0,491,5,3.0,1996-04-11 15:02:44,27.0,Father of the Bride Part II (1995),Comedy
69,491,10,4.0,1996-04-11 15:02:44,28.0,GoldenEye (1995),Action|Adventure|Thriller
207,491,11,4.0,1996-04-11 15:02:44,29.0,"American President, The (1995)",Comedy|Drama|Romance
305,491,17,5.0,1996-04-11 15:02:44,30.0,Sense and Sensibility (1995),Drama|Romance
404,491,709,4.0,1996-04-11 15:02:44,31.0,Oliver & Company (1988),Adventure|Animation|Children|Comedy|Musical
415,491,62,4.0,1996-04-11 15:02:44,32.0,Mr. Holland's Opus (1995),Drama
515,491,494,4.0,1996-04-11 15:02:44,33.0,Executive Decision (1996),Action|Adventure|Thriller
1105,491,39,4.0,1996-04-16 17:44:13,25.0,Clueless (1995),Comedy|Romance
1308,491,165,4.0,1996-04-16 17:44:13,10.0,Die Hard: With a Vengeance (1995),Action|Crime|Thriller
1472,491,110,4.0,1996-04-16 17:44:13,14.0,Braveheart (1995),Action|Drama|War


In [31]:
combined.loc[combined['userId'] == 491, 'title']

0                      Father of the Bride Part II (1995)
69                                       GoldenEye (1995)
207                        American President, The (1995)
305                          Sense and Sensibility (1995)
404                               Oliver & Company (1988)
415                             Mr. Holland's Opus (1995)
515                             Executive Decision (1996)
1105                                      Clueless (1995)
1308                    Die Hard: With a Vengeance (1995)
1472                                    Braveheart (1995)
2537                                     Apollo 13 (1995)
2772                    Ace Ventura: Pet Detective (1994)
3105                                  Crimson Tide (1995)
3226                                        Casper (1995)
3516                                          Babe (1995)
3800                      Clear and Present Danger (1994)
4647               Dumb & Dumber (Dumb and Dumber) (1994)
5314          

The last movie reviewed by the user is the 1995 movie _The Usual Suspects_. We will use this movie as the testing data for this user, and use the rest of the reviewed movies as training data.

__This train-test split strategy is often used when training and evaluating recommender systems. Doing a random split would not be fair, as we could potentially be using a user's recent reviews for training and earlier reviews for testing. This introduces data leakage with a look-ahead bias, and the performance of the trained model would not generalize to real-world performance.__
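
As a minimal sketch of the leave-one-out idea, the snippet below applies it to a tiny, made-up interaction log (the `toy` DataFrame and its values are hypothetical, not MovieLens data). Each user's newest review goes to the test set and everything else stays in the training set.

```python
import pandas as pd

# Toy interaction log (hypothetical data, not from MovieLens).
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2],
    "movieId":   [10, 20, 30, 10, 40],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-05", "2020-01-03",
        "2020-02-01", "2020-02-02",
    ]),
})

# Rank each user's reviews from newest (rank 1) to oldest.
toy["rank_latest"] = (
    toy.groupby("userId")["timestamp"].rank(method="first", ascending=False)
)

# The newest review per user is held out; the rest is training data.
toy_test = toy[toy["rank_latest"] == 1]
toy_train = toy[toy["rank_latest"] != 1]

print(toy_test[["userId", "movieId"]])
print(len(toy_train), len(toy_test))
```

Because the split is done per user on the timestamp, no training row for a user is newer than that user's test row.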

The code below will split our ratings dataset into a train and test set using the __leave-one-out methodology__:

In [32]:
ratings['rank_latest'] = ratings.groupby(['userId'])['timestamp'].rank(method = 'first', ascending = False)

In [33]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,rank_latest
0,491,5,3.0,1996-04-11 15:02:44,27.0
1,491,10,4.0,1996-04-11 15:02:44,28.0
2,491,11,4.0,1996-04-11 15:02:44,29.0
3,491,17,5.0,1996-04-11 15:02:44,30.0
4,491,709,4.0,1996-04-11 15:02:44,31.0


In [34]:
ratings[ratings['rank_latest'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp,rank_latest
78,611,265,4.0,1996-04-13 17:07:38,1.0
99,491,186,3.0,1996-04-16 17:44:14,1.0
133,524,661,3.0,1996-04-22 08:24:32,1.0
168,172,173,3.0,1996-04-23 17:19:31,1.0
248,438,722,2.0,1996-04-23 21:36:39,1.0
...,...,...,...,...,...
99917,342,108190,3.5,2015-03-21 22:47:21,1.0
99995,260,107230,5.0,2015-03-24 21:00:53,1.0
99997,631,5147,4.0,2015-03-29 06:13:52,1.0
99998,96,2337,2.0,2015-03-29 16:19:58,1.0


In [35]:
# Create training and testing sets-
train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]

In [36]:
train_ratings.shape, test_ratings.shape

((99298, 5), (702, 5))

In [37]:
# Use features needed for training ('timestamp' and 'rank_latest' columns are deleted)-
# train_ratings.drop(labels = ['timestamp', 'rank_latest'], axis = 1, inplace = True)
train_ratings = train_ratings[['userId', 'movieId', 'rating']].copy()  # copy so later assignments do not act on a view of 'ratings'
test_ratings = test_ratings[['userId', 'movieId', 'rating']]

In [38]:
train_ratings.head()

Unnamed: 0,userId,movieId,rating
0,491,5,3.0
1,491,10,4.0
2,491,11,4.0
3,491,17,5.0
4,491,709,4.0


## Converting the dataset into an implicit feedback dataset:

As discussed earlier, we will train a recommender system using _implicit feedback_. However, the MovieLens dataset that we're using is based on explicit feedback. __To convert this dataset into an implicit feedback dataset, we'll simply binarize the ratings such that they are all '1' (i.e. positive class). The value of '1' represents that the user has interacted with the item__.

It is important to note that __using implicit feedback reframes the problem that our recommender is trying to solve. Instead of trying to predict movie ratings (when using explicit feedback), we are trying to predict whether the user will interact (i.e. click/buy/watch) with each movie, with the aim of presenting to users the movies with the highest interaction likelihood__.

In [39]:
# Add 'rating' column as ones-
train_ratings.loc[:, 'rating'] = 1

In [40]:
train_ratings.sample(5)

Unnamed: 0,userId,movieId,rating
48464,278,1991,1
12662,362,1208,1
48131,489,4889,1
60408,116,8622,1
41781,156,2946,1


In [41]:
train_ratings['userId'].nunique(), test_ratings['userId'].nunique()

(702, 702)

We do have a problem now though. After binarizing our dataset, we see that every sample in the dataset now belongs to the positive class. However, we also require __negative samples to train our models, to indicate movies that the user has not interacted with__. _We assume that such movies are those that the user is not interested in. Even though this is a sweeping assumption that may not be true, it usually works out rather well in practice_.

The code below generates 4 negative samples for each row of data. In other words, the ratio of negative to positive samples is 4:1. This ratio is chosen arbitrarily but I found that it works rather well (feel free to find the best ratio yourself):

In [42]:
# Get a list of all movie IDs
all_movieIds = ratings['movieId'].unique()

# Placeholders that will hold the training data
users, items, labels = [], [], []

# This is the set of items that each user has interaction with
user_item_set = set(zip(train_ratings['userId'], train_ratings['movieId']))

In [43]:
# 4:1 ratio of negative to positive samples
num_negatives = 4

# for (u, i) in tqdm(user_item_set):
for user, item in user_item_set:
    users.append(user)
    items.append(item)
    labels.append(1) # items that the user has interacted with are positive
    
    for _ in range(num_negatives):
        # randomly select an item-
        negative_item = np.random.choice(all_movieIds)
        
        # check that the user has not interacted with this item-
        while (user, negative_item) in user_item_set:
            negative_item = np.random.choice(all_movieIds)
        
        users.append(user)
        items.append(negative_item)
        labels.append(0) # items not interacted with are negative

We now have the data in the format required by our model. Before we move on, let's define a PyTorch Dataset to facilitate training. The `MovieLensTrainDataset` class defined further below simply encapsulates the code we have written above into a PyTorch Dataset class.

In [44]:
len(users), len(items), len(labels)

(496490, 496490, 496490)

In [45]:
users[:5]

[247, 247, 247, 247, 247]

In [46]:
items[:5]

[3998, 101525, 100083, 3737, 53972]

In [47]:
labels[:5]

[1, 0, 0, 0, 0]
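
The sampling loop can be checked on a small example. The sketch below is a self-contained toy version (the function name `sample_with_negatives`, the seed, and the toy IDs are all assumptions made for illustration). It confirms that each positive is followed by four negatives and that no sampled negative is an item the user has interacted with.

```python
import numpy as np

def sample_with_negatives(positive_pairs, all_items, num_negatives=4, seed=0):
    """Return (users, items, labels) with `num_negatives` sampled negatives per positive pair."""
    rng = np.random.default_rng(seed)
    positives = set(positive_pairs)
    users, items, labels = [], [], []
    for user, item in sorted(positives):
        users.append(user)
        items.append(item)
        labels.append(1)  # observed interaction
        for _ in range(num_negatives):
            candidate = int(rng.choice(all_items))
            # Resample until the candidate is not one of this user's positives.
            while (user, candidate) in positives:
                candidate = int(rng.choice(all_items))
            users.append(user)
            items.append(candidate)
            labels.append(0)  # sampled, assumed-negative interaction
    return users, items, labels

toy_pairs = [(1, 10), (1, 20), (2, 30)]      # hypothetical (userId, movieId) positives
toy_items = np.arange(10, 110, 10)           # hypothetical catalogue of 10 movie IDs
toy_users, toy_sampled_items, toy_labels = sample_with_negatives(toy_pairs, toy_items)

print(len(toy_labels), sum(toy_labels))
```

As in the loop above, only pairs in the positive set passed in are excluded, so an item held out for testing could still be drawn as a negative. The resampling loop also assumes every user has at least one item they have not interacted with; otherwise it would never terminate.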

In [48]:
import torch
from torch.utils.data import Dataset, DataLoader
# from torchvision import transforms, utils

In [49]:
print(f"PyTorch Version: {torch.__version__}")

PyTorch Version: 1.9.0+cu102


In [50]:
# Device configuration-
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Available device is {device}')

Available device is cuda


In [51]:
print(torch.cuda.is_available())

True


In [52]:
# Check if there are multiple devices (i.e., GPU cards)-
print(torch.cuda.device_count())

1


In [53]:
# Which GPU Is The Current GPU?
print(torch.cuda.current_device())

0


In [54]:
# Get the name of the current GPU
print(torch.cuda.get_device_name(torch.cuda.current_device()))

Tesla T4


In [55]:
class MovieLensTrainDataset(Dataset):
    """
    MovieLens PyTorch Dataset for Training
    
    Args:
    1. ratings (pd.DataFrame): Dataframe containing the movie ratings
    2. all_movieIds (list): List containing all movieIds
    """

    def __init__(self, ratings, all_movieIds):
        self.users, self.items, self.labels = self.get_dataset(ratings, all_movieIds)

    def __len__(self):
        return len(self.users)
  
    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]

    def get_dataset(self, ratings, all_movieIds):
        users, items, labels = [], [], []
        user_item_set = set(zip(ratings['userId'], ratings['movieId']))

        num_negatives = 4
        for u, i in user_item_set:
            users.append(u)
            items.append(i)
            labels.append(1)
            for _ in range(num_negatives):
                negative_item = np.random.choice(all_movieIds)
                while (u, negative_item) in user_item_set:
                    negative_item = np.random.choice(all_movieIds)
                users.append(u)
                items.append(negative_item)
                labels.append(0)

        return torch.tensor(users), torch.tensor(items), torch.tensor(labels)
    

## Neural Collaborative Filtering (NCF):

This section implements [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) by Xiangnan He et al., using a deliberately _simple_ version of the model.

### User Embeddings:

Before looking at the architecture of the neural network, let's look at the concept of _embeddings_. An __embedding is a low-dimensional space that captures the relationship of vectors from a higher dimensional space__. To better understand this concept, let's take a closer look at user embeddings.

Imagine that we want to represent our users according to their preference for two genres of movies - 'action' and 'romance' movies. Let the first dimension be how much the user likes action movies, and the second dimension be how much the user likes romance movies.

<img src="pics/user1.png" width = 600 height = 600 />

Now, assume that Bob is our first user. Bob likes action movies but isn't a fan of romance movies. To represent Bob as a two dimensional vector, we place him in the graph according to his preference.

<img src="pics/user2.png" width = 600 height = 600 />

Our next user is Joe. Joe is a huge fan of both action and romance movies. We represent Joe using a two dimensional vector just like Bob.

<img src="pics/user3.png" width = 600 height = 600 />


__This two dimensional space is known as an embedding. Essentially, the embedding reduces our users such that they can be represented in a meaningful manner in a lower dimensional space.__ _In this embedding, users with similar movie preferences are placed near each other, and users with dissimilar preferences are placed far apart_.

<img src="pics/embedding.png" width = 600 height = 600 />


_We are not restricted to using just 2 dimensions to represent our users. We can use an arbitrary number of dimensions_. __A larger number of dimensions would allow us to capture the traits of each user more accurately, at the cost of model complexity__. In this code example, we use 8 dimensions (which we will see later).
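
To make the Bob and Joe example concrete, here is a small sketch with hand-picked 2-D vectors. The numeric values, and the third user "Amy", are hypothetical and chosen only to illustrate that users with similar tastes end up close together in the embedding space.

```python
import numpy as np

# Hypothetical 2-D user embeddings: [liking for action, liking for romance].
embeddings = {
    "Bob": np.array([0.9, 0.1]),   # likes action, not romance
    "Joe": np.array([0.9, 0.9]),   # likes both genres
    "Amy": np.array([0.8, 0.2]),   # hypothetical user with tastes close to Bob's
}

def distance(a, b):
    """Euclidean distance between two users in the embedding space."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print("Bob-Amy:", distance("Bob", "Amy"))
print("Bob-Joe:", distance("Bob", "Joe"))
```

Bob sits closer to Amy than to Joe, which matches the intuition that Bob and Amy share a preference profile. In the model, such vectors are learned from data instead of being set by hand.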

### Learned Embeddings:

_Similarly, we will use a separate item embedding layer to represent the traits of the items (i.e. movies) in a lower dimensional space_.

You might be wondering, _how can we learn the weights of the embedding layer, such that it provides an accurate representation of users and items_? In our previous example, we used Bob and Joe's preference for action and romance movies to manually create our embedding. Is there a way to learn such preferences automatically?

The answer is __Collaborative Filtering - by using the ratings dataset, we can identify similar users and movies, creating user and item embeddings learned from existing ratings__.

### Model Architecture:

Now that we have a better understanding of _embeddings_, we are ready to define the neural network model architecture. As you will see, the _user_ and _item_ embeddings are key to the model.

Let's walk through the model architecture using the following training sample:

<img src="pics/example.png" width = 200 height = 200 />

<img src="pics/model_architecture.png" width = 600 height = 600 />


The inputs to the model are the one-hot encoded user and item vectors for ```userId = 3``` and ```movieId = 1```. Because this is a positive sample (movie actually rated by the user), the true label (```interacted```) is 1.

The user input vector and item input vector are fed to the user embedding and item embedding respectively, which results in smaller, denser user and item vectors.
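
In code, the one-hot vectors are usually never built: an embedding layer takes the integer ID and looks up the corresponding row of its weight matrix, which gives the same result as multiplying the one-hot vector by that matrix. The sketch below checks this equivalence with `nn.Embedding` on toy sizes (5 users is an assumption for illustration; 8 dimensions matches the model in this notebook).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_users, embedding_dim = 5, 8          # toy number of users; 8 dims as in this notebook
user_embedding = nn.Embedding(num_users, embedding_dim)

user_id = torch.tensor([3])              # userId = 3, as in the example above
one_hot = F.one_hot(user_id, num_classes=num_users).float()   # shape (1, 5)

dense_via_matmul = one_hot @ user_embedding.weight   # one-hot vector times weight matrix
dense_via_lookup = user_embedding(user_id)           # direct row lookup by ID

print(dense_via_lookup.shape)
print(torch.allclose(dense_via_matmul, dense_via_lookup))
```

The lookup avoids materialising a very wide, mostly-zero input vector, which matters when there are many users or items.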

The embedded user and item vectors are concatenated before passing through a series of fully connected layers, which map the concatenated embeddings to a single output value. Finally, we apply a ```Sigmoid``` function to turn this value into the predicted probability of the positive class. In the example above, the most probable class is 1 (positive class), since 0.8 > 0.2.

In [None]:
!pip install pytorch-lightning

In [None]:
'''
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
'''
import pytorch_lightning as pl

In [60]:
print(f"PyTorch lightning version: {pl.__version__}")

PyTorch lightning version: 1.3.8


In [61]:
class NCF(pl.LightningModule):
    """ Neural Collaborative Filtering (NCF)
    
        Args:
            num_users (int): Number of unique users
            num_items (int): Number of unique items
            ratings (pd.DataFrame): Dataframe containing the movie ratings for training
            all_movieIds (list): List containing all movieIds (train + test)
    """
    
    def __init__(self, num_users, num_items, ratings, all_movieIds):
        super().__init__()
        self.user_embedding = nn.Embedding(num_embeddings = num_users, embedding_dim = 8)
        self.item_embedding = nn.Embedding(num_embeddings = num_items, embedding_dim = 8)
        self.fc1 = nn.Linear(in_features=16, out_features = 64)
        self.fc2 = nn.Linear(in_features=64, out_features = 32)
        self.output = nn.Linear(in_features=32, out_features = 1)
        self.ratings = ratings
        self.all_movieIds = all_movieIds
        
    def forward(self, user_input, item_input):
        
        # Pass through embedding layers
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)

        # Concat the two embedding layers
        vector = torch.cat([user_embedded, item_embedded], dim = -1)

        # Pass through dense layers
        vector = torch.relu(self.fc1(vector))
        vector = torch.relu(self.fc2(vector))

        # Output layer: sigmoid maps the raw output to an interaction probability
        pred = torch.sigmoid(self.output(vector))

        return pred
    
    def training_step(self, batch, batch_idx):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(MovieLensTrainDataset(self.ratings, self.all_movieIds),
                          batch_size = 512, num_workers = 4)
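
For reference, the loss minimised in `training_step` is binary cross-entropy. For a batch of $N$ samples with labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$ (the sigmoid outputs), `nn.BCELoss` with its default mean reduction computes:

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big]
```

As a worked example, a positive sample ($y = 1$) predicted with $\hat{y} = 0.8$ contributes $-\log 0.8 \approx 0.223$, while the same prediction for a negative sample ($y = 0$) would contribute $-\log 0.2 \approx 1.609$.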
    

In [62]:
num_users = ratings['userId'].max() + 1
num_items = ratings['movieId'].max() + 1

all_movieIds = ratings['movieId'].unique()

In [63]:
num_users, num_items, all_movieIds.shape

(703, 128595, (8227,))
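
Note that `num_items` is 128,595 even though only 8,227 distinct movies appear in the data: the embedding tables are indexed by raw ID, so each table needs `max ID + 1` rows. The arithmetic below is a rough sketch of what that implies for model size, using the layer widths from the `NCF` class. The last line describes a hypothetical alternative (remapping movie IDs to contiguous indices) that this notebook does not use.

```python
# Sizes printed above and layer widths from the NCF class.
num_users, num_items, embedding_dim = 703, 128595, 8
num_distinct_movies = 8227

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

param_counts = {
    "user_embedding": num_users * embedding_dim,
    "item_embedding": num_items * embedding_dim,
    "fc1": linear_params(16, 64),
    "fc2": linear_params(64, 32),
    "output": linear_params(32, 1),
}
total_params = sum(param_counts.values())
print(param_counts)
print(total_params)

# Hypothetical alternative (not used here): with movie IDs remapped to 0..8226,
# the item embedding table would need only this many parameters.
compact_item_params = num_distinct_movies * embedding_dim
print(compact_item_params)
```

These counts agree with the model summary that Lightning prints when training starts, and assuming 4 bytes per parameter they also account for the reported size of about 4.15 MB. Almost all of the parameters sit in the item embedding table.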

In [66]:
# Instantiate NCF model-
model = NCF(num_users, num_items, train_ratings, all_movieIds)

In [67]:
trainer = pl.Trainer(
    max_epochs = 5, gpus = 1,
    reload_dataloaders_every_epoch = True,
    progress_bar_refresh_rate = 50, logger = False,
    checkpoint_callback = False)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores


Notice that we are using the argument ```reload_dataloaders_every_epoch=True```. This creates a new randomly chosen set of negative samples for each epoch, which ensures that our model is not biased by the selection of negative samples.

In [68]:
# Train NCF model-
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type      | Params
---------------------------------------------
0 | user_embedding | Embedding | 5.6 K 
1 | item_embedding | Embedding | 1.0 M 
2 | fc1            | Linear    | 1.1 K 
3 | fc2            | Linear    | 2.1 K 
4 | output         | Linear    | 33    
---------------------------------------------
1.0 M     Trainable params
0         Non-trainable params
1.0 M     Total params
4.150     Total estimated model params size (MB)




### Evaluating our Recommender System:

Now that our model is trained, we are ready to evaluate it using the test data. In traditional Machine Learning projects, we evaluate our models using metrics such as Accuracy (for classification problems) and RMSE (for regression problems). However, such metrics are too simplistic for evaluating recommender systems.


_To design a good metric for evaluating recommender systems, we need to first understand how modern recommender systems are used_.

For example, both Netflix and Amazon present each user with a list of recommended items, not a single prediction.

The __key__ here is that __we don't need the user to interact with every single item in the list of recommendations. Instead, we just need the user to interact with at least one item on the list__ - _as long as the user does that, the recommendations have worked_.

To simulate this, let's run the following evaluation protocol to generate a list of 10 recommended items for each user.

- For each user, randomly select 99 items that the user __has not interacted with__
- Combine these 99 items with the test item (the actual item that the user interacted with). We now have 100 items.
- Run the model on these 100 items, and rank them according to their predicted probabilities
- Select the top 10 items from the list of 100 items. If the test item is present within the top 10 items, then we say that this is a hit.
- Repeat the process for all users. The Hit Ratio is then the fraction of users for whom the test item was a hit.

This evaluation protocol is known as __Hit Ratio @ 10__, and it is commonly used to evaluate recommender systems.
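
The steps above can be sketched on made-up scores before running them on the real model. In the snippet below, `hit_at_k` is a hypothetical helper, and the two score arrays stand in for model outputs on 99 negatives plus the test item (placed at index 99 by the convention of this sketch).

```python
import numpy as np

def hit_at_k(scores, test_index, k=10):
    """Return 1 if the item at `test_index` is among the k highest scores, else 0."""
    top_k = np.argsort(scores)[::-1][:k]
    return int(test_index in top_k)

# Hypothetical user A: the test item (index 99) has the highest score.
scores_a = np.linspace(0.0, 0.9, 100)
# Hypothetical user B: the test item (index 99) has the lowest score.
scores_b = scores_a[::-1].copy()

user_hits = [hit_at_k(scores_a, 99), hit_at_k(scores_b, 99)]
hit_ratio = float(np.mean(user_hits))
print(user_hits, hit_ratio)
```

One user gets a hit and the other does not, so the Hit Ratio @ 10 over these two users is 0.5.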

In [69]:
# User-item pairs for testing
test_user_item_set = set(zip(test_ratings['userId'], test_ratings['movieId']))

# Dict of all items that are interacted with by each user
user_interacted_items = ratings.groupby('userId')['movieId'].apply(list).to_dict()

In [73]:
from tqdm import tqdm

In [74]:
hits = []
model.eval()  # switch to inference mode
with torch.no_grad():  # gradients are not needed for evaluation
    for (u, i) in tqdm(test_user_item_set):
        interacted_items = user_interacted_items[u]
        not_interacted_items = set(all_movieIds) - set(interacted_items)
        # Sample 99 distinct negatives (replace = False avoids duplicate candidates)
        selected_not_interacted = list(np.random.choice(list(not_interacted_items), 99, replace = False))
        test_items = selected_not_interacted + [i]

        predicted_labels = np.squeeze(model(torch.tensor([u] * 100),
                                            torch.tensor(test_items)).numpy())

        # Indices of the 10 highest-scoring candidates, mapped back to movie IDs
        top10_items = [test_items[idx] for idx in np.argsort(predicted_labels)[::-1][0:10].tolist()]

        hits.append(1 if i in top10_items else 0)

print("The Hit Ratio @ 10 is {:.2f}".format(np.average(hits)))

100%|██████████| 702/702 [00:01<00:00, 355.41it/s]

The Hit Ratio @ 10 is 0.56





### _Hit Ratio @ 10_ interpretation:

This means that 56% of the users were recommended the actual item (amongst a list of 10 items) that they eventually interacted with.
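
For context, a recommender that ranked the 100 candidates at random would place the test item in the top 10 with probability 10/100 = 0.10. The sketch below checks this baseline with a small simulation (the seed, the number of trials, and the convention that the test item sits at index 99 are assumptions of this sketch).

```python
import numpy as np

k, num_candidates = 10, 100
expected_random_hit_ratio = k / num_candidates

# Monte Carlo check: score 100 candidates at random and count how often
# the test item (index 99) lands among the 10 highest scores.
rng = np.random.default_rng(0)
trials = 20_000
scores = rng.random((trials, num_candidates))
num_items_scoring_higher = (scores > scores[:, [99]]).sum(axis=1)
simulated_hit_ratio = float((num_items_scoring_higher < k).mean())

print(expected_random_hit_ratio, round(simulated_hit_ratio, 3))
```

The simulated value should land close to 0.10, so a Hit Ratio @ 10 well above that indicates the model ranks held-out items much better than chance.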