# Recommender Systems 2024/25

### Practice 9 - Deep Learning Models

## The basics of Deep Learning: Multi-Layer Perceptron 

*Deep Learning* is a branch of Machine Learning research whose name is associated to many different meanings. With *Machine Learning*, we model a problem using data points that are often represented by means of multiple **features** of our design. In *Deep Learning* scenarios, instead, the "machine" often receives input data that received very little pre-processing, and the deep learner extracts the relevant features to solve the problem by itself. The following are some *Deep Learning*-based approaches to Recommendation.

In [1]:
from Data_manager.split_functions.split_train_validation_random_holdout import split_train_in_two_percentage_global_sample
from Data_manager.Movielens.Movielens10MReader import Movielens10MReader

data_reader = Movielens10MReader()
data_loaded = data_reader.load_data()

URM_all = data_loaded.get_URM_all()

URM_train_val, URM_test = split_train_in_two_percentage_global_sample(URM_all, 0.8)
URM_train, URM_val = split_train_in_two_percentage_global_sample(URM_train_val, 0.8)

Movielens10M: Verifying data consistency...
Movielens10M: Verifying data consistency... Passed!
DataReader: current dataset is: Movielens10M
	Number of items: 10681
	Number of users: 69878
	Number of interactions in URM_all: 10000054
	Value range in URM_all: 0.50-5.00
	Interaction density: 1.34E-02
	Interactions per user:
		 Min: 2.00E+01
		 Avg: 1.43E+02
		 Max: 7.36E+03
	Interactions per item:
		 Min: 0.00E+00
		 Avg: 9.36E+02
		 Max: 3.49E+04
	Gini Index: 0.57

	ICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10106, feature occurrences: 106820, density 9.90E-04
	ICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01
	ICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10126, feature occurrences: 128384, density 1.19E-03
	ICM name: ICM_year, Value range: 1.92E+03 / 2.01E+03, Num features: 1, feature occurrences: 10681, density 1.00E+00




In [2]:
# Training and testing
from Evaluation.Evaluator import EvaluatorHoldout

evaluator_test = EvaluatorHoldout(URM_test, [10])
evaluator_validation = EvaluatorHoldout(URM_val, [10])

EvaluatorHoldout: Ignoring 73 ( 0.1%) Users that have less than 1 test interactions
EvaluatorHoldout: Ignoring 243 ( 0.3%) Users that have less than 1 test interactions


Most *Deep Learning* techniques are **neural networks** of sorts. You may think Machine Learning is a "young" field of research, because its greatest breakthroughs are relatively recent, but its roots grow a bit further back in time: the concept of *Perceptron* as a function estimator was introduced by McCulloch and Pitts in 1943, and simulated for the first time by Rosenblatt in 1958! Neural networks, in their basic **Multi-Layer Perceptron** implementation, are nothing short than networks of densely interconnected *Perceptrons* (hence the name).

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim

if torch.backends.mps.is_available(): # if torch.cuda.is_available() if you use NVIDIA GPUs
    device = torch.device("mps")
else:
    device = torch.device("cpu")

Let's build a simple Multi-Layer Perceptron in PyTorch to understand how it works. We will train and test it over the simple Iris Dataset (see also Practice 11).

In [4]:
# load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create a custom dataset class
class IrisDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, dtype=torch.float32):
        self.X = torch.tensor(X, dtype=dtype)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

Our simple implementation will have 4 layers: one input layer, one output layer, and two hidden layers. In theory, three layers are enough to build a universal estimator (given that the hidden layer amounts for infinite perceptrons).

In [5]:
# create a custom nn.Module class
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16) # Input (4) -> hidden layer 1 (16)
        self.fc2 = nn.Linear(16, 32) # hidden layer 1 (16) -> hidden layer 2 (32)
        self.fc3 = nn.Linear(32, 3) # hidden layer 2 (32) -> output (3)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

The output layer has 3 neurons, as the Iris classification problem has 3 classes. Notice how we implement the model's "architecture", i.e., its layer structure, in the `__init__()` method, while the `forward()` method is used to define how layers interact with one another (by means of **activation functions**). Let's now fit and test the model on the data.

In [6]:
# create a data loader and model
def train(dtype : torch.dtype):
    dataset = IrisDataset(X_train, y_train, dtype=dtype)

    torch.set_default_dtype(dtype)
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
    model = MLP()
    
    # define a loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    
    # train the model
    for epoch in range(100):
        running_loss = 0.0
        for i, data in enumerate(data_loader, 0):
            inputs, labels = data
            optimizer.zero_grad()
    
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    
            running_loss += loss.item()
        print('Epoch %d, loss: %.3f' % (epoch+1, running_loss/(i+1)))
    
    # evaluate the model
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data in data_loader:
            inputs, labels = data
            outputs = model(inputs)
            test_loss += criterion(outputs, labels).item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
    
    accuracy = correct / len(dataset)
    print('Test loss: {:.3f}, Accuracy: {:.2f}'.format(test_loss/(len(data_loader)), accuracy*100))

In [7]:
train(torch.float32)

Epoch 1, loss: 1.074
Epoch 2, loss: 1.054
Epoch 3, loss: 1.037
Epoch 4, loss: 1.019
Epoch 5, loss: 1.005
Epoch 6, loss: 0.988
Epoch 7, loss: 0.974
Epoch 8, loss: 0.960
Epoch 9, loss: 0.948
Epoch 10, loss: 0.934
Epoch 11, loss: 0.920
Epoch 12, loss: 0.908
Epoch 13, loss: 0.895
Epoch 14, loss: 0.882
Epoch 15, loss: 0.867
Epoch 16, loss: 0.855
Epoch 17, loss: 0.839
Epoch 18, loss: 0.825
Epoch 19, loss: 0.814
Epoch 20, loss: 0.799
Epoch 21, loss: 0.782
Epoch 22, loss: 0.768
Epoch 23, loss: 0.756
Epoch 24, loss: 0.742
Epoch 25, loss: 0.728
Epoch 26, loss: 0.717
Epoch 27, loss: 0.704
Epoch 28, loss: 0.691
Epoch 29, loss: 0.676
Epoch 30, loss: 0.669
Epoch 31, loss: 0.653
Epoch 32, loss: 0.642
Epoch 33, loss: 0.637
Epoch 34, loss: 0.621
Epoch 35, loss: 0.611
Epoch 36, loss: 0.603
Epoch 37, loss: 0.594
Epoch 38, loss: 0.587
Epoch 39, loss: 0.575
Epoch 40, loss: 0.568
Epoch 41, loss: 0.557
Epoch 42, loss: 0.552
Epoch 43, loss: 0.545
Epoch 44, loss: 0.531
Epoch 45, loss: 0.525
Epoch 46, loss: 0.5

In the cell above, we define an **optimization strategy** (SGD), a **loss function** (Cross Entropy Loss) and a **training loop** that goes on for 100 epochs. We will need to do the same for every Deep Learning model we want to create.

Unsurprisingly, our Multi-Layer Perceptron is very accurate! After all, the Iris problem is very simple for Machine Learning models.

Let's now define a framework to build Deep Learning Recommender Systems. The following are our ingredients:
- `__init__()` function that defines model architecture
- data sampling strategy to generate batches of data over which models can be trained
- `forward()` method to define layer-layer dynamics
- `fit()` method to integrate architecture, sampling and `forward()` into a cohesive training schema and optimize the loss function
- `_compute_item_score()` method to compute rating predictions. The mother class `BaseRecommender`, when this method is implemented, will correctly handle the `recommend()` part.

In [11]:
import numpy as np
import scipy.sparse as sp

In [12]:
from Recommenders.BaseRecommender import BaseRecommender

class DeepLearningRecommender(nn.Module, BaseRecommender):

    def __init__(self, URM_train, verbose=True):
        super().__init__()
        BaseRecommender.__init__(self, URM_train, verbose)
        # self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        self.device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

    def _data_generator(self, batch_size, num_negatives=3, num_items=None):
        user_input, item_input, labels = [], [], []
        dok_train = URM_train.todok() # <- Dictionary representation of a sparse matrix: allows us to check existing interactions as key-value pairs
        if num_items is None : num_items = self.URM_train.shape[1]

        self.batch_counter = 0
        start = self.batch_counter
        stop = min(self.batch_counter + batch_size, len(dok_train.keys()))
        for (u,i) in dok_train[start:stop].keys():
            # positive interaction
            user_input.append(u)
            item_input.append(i)
            labels.append(1) # <- (Implicit ratings)
            # negative interactions
            for t in range(num_negatives): # <- num_negatives is a hyperparameter
                # randomly select an interaction; check if negative
                j = np.random.randint(num_items)
                while (u,j) in dok_train:
                    j = np.random.randint(num_items)
                user_input.append(u)
                item_input.append(j)
                labels.append(0)
        self.batch_counter += 1
        
        user_input = torch.tensor(user_input, dtype=torch.int32, device=self.device)
        item_input = torch.tensor(item_input, dtype=torch.int32, device=self.device)
        labels = torch.tensor(labels, dtype=torch.int32, device=self.device)
        labels = labels.reshape((labels.shape[0],1))
        yield user_input, item_input, labels
    
    def forward(self, user_input, item_input=None):
        raise NotImplementedError("Forward function not implemented.")

    def fit(self, epochs=30, batch_size=1024, learning_rate=0.0001):
        optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate) # <- The optimizer can be (additionally) considered as a hyperparameter
        for i in range(epochs):
            for user_input, item_input, labels in self._data_generator(batch_size):
                optimizer.zero_grad()
                predictions = self.forward(user_input, item_input)
                loss = torch.nn.BCELoss().to(self.device) # <- The loss function can be (additionally) considered as a hyperparameter
                loss = loss(predictions, labels.float())
                loss.backward()
                optimizer.step()
            self._print("Epoch {} finished. Loss: {}".format(i, loss.item()))

    def _compute_item_score(self, user_id_array, items_to_compute=None):
        step = user_id_array.shape[0]
        
        if items_to_compute is None:
            items_to_compute = np.arange(self.URM_train.shape[1], dtype=np.int32)
        
        predictions = np.empty((step,items_to_compute.shape[0]))
        for item in items_to_compute:
            with torch.no_grad():
                predictions[:, item] = self.forward(
                    torch.tensor(user_id_array),
                    torch.tensor(
                        np.ones(step, dtype=np.int32) * item)
                    ).cpu().detach().numpy().ravel()
        return predictions
    

## AutoEncoders

### Denoising Autoencoder

The first Deep Learning model we're going to study is a Denoising Autoencoder.

*Autoencoders* are a family of Deep Learning algorithms often used for *Representation Learning*. They can be divided in two components: an *encoder*, which computes a compressed representation of the input data, and a *decoder*, which is tasked with reconstructing the original input data given a representation.

A *Denoising Autoencoder* adds some noise to the input before passing it to the encoder. Since the final label for the decoder is the original (not noisy) input, training such an algorithm will return a model that is able of filtering noise out.

In Recommendation, we make no use of a "perfect" Autoencoder: in fact, the interactions it will make up to recommend are nothing but the **reconstruction errors** the decoder will (inevitably) make. We count on the fact that such errors are driven by a representation learned over collaborative signals.

In [13]:
class DenoisingAutoencoder(DeepLearningRecommender):

    RECOMMENDER_NAME = """DENOISING_AUTOENCODER"""
    def __init__(self, URM_train, encoding_dim=69, noise_p=0.01, verbose=True):
        super().__init__(URM_train, verbose)
        self.noise_p = noise_p
        num_items = URM_train.shape[1]
        self.encoder = nn.Sequential(
            nn.Linear(num_items, 420),
            nn.ReLU(),
            nn.Linear(420, encoding_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 420),
            nn.ReLU(),
            nn.Linear(420, num_items)
        )
        self.to(self.device)
    
    # override: both input and label are batches of user profiles
    def _data_generator(self, batch_size):
        row_idx = np.arange(self.URM_train.shape[0])
        for start in range(0, len(row_idx), batch_size):
            end = min(len(row_idx), start + batch_size)
            user_input = torch.tensor(self.URM_train[row_idx[start:end],:].toarray(), dtype=torch.float32, device=self.device)
            labels = user_input
            yield user_input, _, labels

    def forward(self, user_input, item_input=None):
        # assert(item_input == None, "Item input not needed")
        noisy_input = self._add_noise(user_input)
        encoded = self.encoder(noisy_input)
        reconstructed = self.decoder(encoded)
        return reconstructed

    # override: evaluator passes user profile ids as inputs, we need the
    #           full profiles for the forward function to work properly
    def _compute_item_score(self, user_id_array, items_to_compute=None):
        user_profiles = self.URM_train[user_id_array, :]

        if items_to_compute is not None:
            mask = np.zeros(self.URM.shape[1], dtype=np.int32)
            mask[items_to_compute] = 1
            user_profiles = user_profiles[:, mask]

        with torch.no_grad():
            predictions = self.forward(torch.tensor(user_profiles.toarray(), dtype=torch.float32, device=self.device))

        return predictions.cpu().detach().numpy()

    def _add_noise(self, x):
        zeros_mask = np.random.choice([False,True], size=x.shape, p=[1-self.noise_p, self.noise_p])
        ones_mask = np.random.choice([False,True], size=x.shape, p=[1-self.noise_p, self.noise_p])
        x[zeros_mask] = 0
        x[ones_mask] = 1
        return x

As the class we created inherits from `DeepLearningRecommender`, which, in turn, inherits from `BaseRecommender`, we can now evaluate its performance using our framework's classic evaluation procedure!

In [14]:
denoising_autoencoder = DenoisingAutoencoder(URM_train)

denoising_autoencoder.fit(epochs=100, batch_size=1024, learning_rate=0.005)

results_df, _ = evaluator_test.evaluateRecommender(denoising_autoencoder)

results_df

DENOISING_AUTOENCODER: URM Detected 69 ( 0.6%) items with no interactions.
DENOISING_AUTOENCODER: Epoch 0 finished. Loss: 0.22018179297447205
DENOISING_AUTOENCODER: Epoch 1 finished. Loss: 0.29662758111953735
DENOISING_AUTOENCODER: Epoch 2 finished. Loss: 0.24997271597385406
DENOISING_AUTOENCODER: Epoch 3 finished. Loss: 0.21621082723140717
DENOISING_AUTOENCODER: Epoch 4 finished. Loss: 0.2159867286682129
DENOISING_AUTOENCODER: Epoch 5 finished. Loss: 0.21620379388332367
DENOISING_AUTOENCODER: Epoch 6 finished. Loss: 0.21621476113796234
DENOISING_AUTOENCODER: Epoch 7 finished. Loss: 0.21616677939891815
DENOISING_AUTOENCODER: Epoch 8 finished. Loss: 0.21621081233024597
DENOISING_AUTOENCODER: Epoch 9 finished. Loss: 0.2159128487110138
DENOISING_AUTOENCODER: Epoch 10 finished. Loss: 0.21611090004444122
DENOISING_AUTOENCODER: Epoch 11 finished. Loss: 0.21600240468978882
DENOISING_AUTOENCODER: Epoch 12 finished. Loss: 0.21617646515369415
DENOISING_AUTOENCODER: Epoch 13 finished. Loss: 0.215

Unnamed: 0_level_0,PRECISION,PRECISION_RECALL_MIN_DEN,RECALL,MAP,MAP_MIN_DEN,MRR,NDCG,F1,HIT_RATE,ARHR_ALL_HITS,...,COVERAGE_USER,COVERAGE_USER_HIT,USERS_IN_GT,DIVERSITY_GINI,SHANNON_ENTROPY,RATIO_DIVERSITY_HERFINDAHL,RATIO_DIVERSITY_GINI,RATIO_SHANNON_ENTROPY,RATIO_AVERAGE_POPULARITY,RATIO_NOVELTY
cutoff,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10,0.005063,0.005264,0.001533,0.001578,0.001653,0.014941,0.00333,0.002354,0.048048,0.015353,...,0.998955,0.047998,0.998955,0.001035,3.396463,0.903551,0.005315,0.299901,0.160543,0.15083


### $EASE^R$

An Autoencoder with no hidden layers used in Recommendation is equivalent to an **item-item similarity** based Recommender. $EASE^R$ is a closed form, similarity-based model built with such notion in mind, that's why we classify it as an Autoencoder. It's usually a very strong baseline and very fast to compute, but its memory consumption is a known issue.

The algorithm is defined by the following formula: $$S^* = \underset{S}{\arg\min} ||R - RS||_F + 2 \vec{\gamma} \odot diag(S),$$ which develops into this solution: $$S^* = I_{|I|} - P \cdot diagMat(\vec{1} \oslash diag(P)),$$ having $P = (R^T \cdot R + \lambda I_{|I|})^{-1}$.

Below, we present its `fit` function, that defines the gist of the algorithm.

In [15]:
from sklearn.preprocessing import normalize

def fit(self, topK=None, l2_norm = 1e3, normalize_matrix = False):

        if normalize_matrix:
            # Normalize rows and then columns
            self.URM_train = normalize(self.URM_train, norm='l2', axis=1)
            self.URM_train = normalize(self.URM_train, norm='l2', axis=0)
            self.URM_train = sp.csr_matrix(self.URM_train)


        # Grahm matrix is X^t X, compute dot product
        grahm_matrix = self.URM_train.T.dot(self.URM_train).toarray() # <- HUGE dense matrix!

        diag_indices = np.diag_indices(grahm_matrix.shape[0])
        grahm_matrix[diag_indices] += l2_norm

        P = np.linalg.inv(grahm_matrix) # <- Matrix inversion is a HEAVY computation!

        B = P / (-np.diag(P))

        B[diag_indices] = 0.0 # <- avoid having identity matrix as a solution

$EASE^R$ is available in the `Recommenders` module, so you can use it by means of a simple import.

In [16]:
from Recommenders.EASE_R.EASE_R_Recommender import EASE_R_Recommender

model = EASE_R_Recommender(URM_train)

model.fit() # <- hyperparams left to default value, obviously could (and should) be optimized

results_df, _ = evaluator_test.evaluateRecommender(model)

results_df

EASE_R_Recommender: URM Detected 69 ( 0.6%) items with no interactions.
EASE_R_Recommender: Fitting model... 
EASE_R_Recommender: Fitting model... done in 16.12 sec
EvaluatorHoldout: Processed 69805 (100.0%) in 22.03 sec. Users per second: 3169


Unnamed: 0_level_0,PRECISION,PRECISION_RECALL_MIN_DEN,RECALL,MAP,MAP_MIN_DEN,MRR,NDCG,F1,HIT_RATE,ARHR_ALL_HITS,...,COVERAGE_USER,COVERAGE_USER_HIT,USERS_IN_GT,DIVERSITY_GINI,SHANNON_ENTROPY,RATIO_DIVERSITY_HERFINDAHL,RATIO_DIVERSITY_GINI,RATIO_SHANNON_ENTROPY,RATIO_AVERAGE_POPULARITY,RATIO_NOVELTY
cutoff,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10,0.277227,0.322483,0.189253,0.16523,0.187821,0.534191,0.277978,0.224944,0.867775,0.91869,...,0.998955,0.866868,0.998955,0.025288,8.329067,0.994685,0.129835,0.73544,1.690005,0.091214


## Two-Tower Models

Two-tower neural recommenders are sort of an industry standard, nowadays. Being model-based, very fast at inference time and very versatile, they fit very well in "Big Data" contexts, where users and items are in the order of millions and content diversity would overwhelm anyone who tried to model the task's features by hand. Usually, they are used as a "first filter", selecting a large amount of recommendable items, which are subsequently sorted by a more refined re-ranker.

Their basic architecture, as the name suggests, has two components (towers): one that processes user-related information and another for item-related information. Information could include their interaction profiles, some type of unstructured data or pretty much anything, really. After creating two representations of profile batches, they are merged and passed to a final set of layers, tasked with predicting ratings/ranking.

In this session, we will see two different ways to implement a basic Two Tower model. The first type, reported below, has shallow towers and a deeper "post-merge block".

In [17]:
# Variant 1:
# 1. Make 2 embeddings of equal dimensions and concatenate
# 2. Couple of Dense layers
# 3. Obtain prediction (single score)
class TwoTowerRecommender_type1(DeepLearningRecommender):
    
    RECOMMENDER_NAME = """TWO_TOWER_1"""
    
    def __init__(self, URM_train, num_users, num_items, layers=[10], reg_layers=[0], verbose = True):
        super().__init__(URM_train, verbose)
        self.mlp_embedding_user = nn.Embedding(num_users, int(layers[0]/2), device=self.device) # <- The input for each tower will be a learned latent representation,
        self.mlp_embedding_item = nn.Embedding(num_items, int(layers[0]/2), device=self.device) # <- sort of like what we have seen for Matrix Factorization.

        self.mlp_layers = nn.ModuleList([
            nn.Linear(layers[i-1], layers[i], bias=True, device=self.device) for i in range(1, len(layers))
            ])
        for i, layer in enumerate(self.mlp_layers):
            nn.init.normal_(layer.weight)
            layer.bias.data.zero_()
            layer.weight_decay = reg_layers[i]

        self.prediction_layer = nn.Linear(layers[-1], 1, bias=True, device=self.device)
        nn.init.uniform_(self.prediction_layer.weight)
        self.prediction_layer.bias.data.zero_()
        self.to(self.device)

    def forward(self, user_input, item_input):
        mlp_user_latent = self.mlp_embedding_user(user_input.long().to(self.device)) # <- shallow tower: we just extract the embedding corresponding to the profiles
        mlp_item_latent = self.mlp_embedding_item(item_input.long().to(self.device))
        mlp_vector = torch.cat((mlp_user_latent, mlp_item_latent), dim=1) # <- Concatenate user and item embeddings
        for layer in self.mlp_layers:
            mlp_vector = torch.relu(layer(mlp_vector)) # <- MLP after-merge processing block

        predict_vector = mlp_vector
        prediction = torch.sigmoid(self.prediction_layer(predict_vector))
        return prediction

The second type has deeper towers, and a simple sigmoid predictor as "post-merge block".

In [18]:
# Variant 2:
# 1. Couple of Dense layers process user/item profiles
# 2. Merge and final Dense layer to obtain prediciton
class TwoTowerRecommender_type2(DeepLearningRecommender):

    RECOMMENDER_NAME = """TWO_TOWER_2"""

    def __init__(self, URM_train, num_users, num_items, layers=[10], reg_layers=[0], verbose = True):
        super().__init__(URM_train, verbose)
        layers[0] = int(layers[0]/2) # <- The first layer is split in two tower inputs at the beginning
        self.mlp_embedding_user = nn.Embedding(num_users, layers[0], device=self.device)
        self.mlp_embedding_item = nn.Embedding(num_items, layers[0], device=self.device) # <- It's possible to make the towers asymmetric! Mind the output dimension though

        self.mlp_layers_tower1 = nn.ModuleList([ # <- First tower MLP
            nn.Linear(
                layers[i-1],
                layers[i], bias=True, device=self.device
                ) for i in range(1, len(layers))
            ])
        
        self.mlp_layers_tower2 = nn.ModuleList([ # <- Second tower MLP
            nn.Linear(
                layers[i-1],
                layers[i], bias=True, device=self.device
                ) for i in range(1, len(layers))
            ])
        
        for i, layer in enumerate(self.mlp_layers_tower1):
            nn.init.normal_(layer.weight)
            layer.bias.data.zero_()
            layer.weight_decay = reg_layers[i]

        for i, layer in enumerate(self.mlp_layers_tower2):
            nn.init.normal_(layer.weight)
            layer.bias.data.zero_()
            layer.weight_decay = reg_layers[i]

        self.prediction_layer = nn.Linear(layers[-1], 1, bias=True, device=self.device) # <- shallow post-merge block: a simple linear layer with sigmoid activation
        nn.init.uniform_(self.prediction_layer.weight)
        self.prediction_layer.bias.data.zero_()
        self.to(self.device)

    def forward(self, user_input, item_input):
        mlp_user_latent = self.mlp_embedding_user(user_input.long().to(self.device))
        mlp_item_latent = self.mlp_embedding_item(item_input.long().to(self.device))

        mlp_user_vector = mlp_user_latent
        mlp_item_vector = mlp_item_latent

        for layer in self.mlp_layers_tower1:
            mlp_user_vector = torch.relu(layer(mlp_user_vector))

        for layer in self.mlp_layers_tower2:
            mlp_item_vector = torch.relu(layer(mlp_item_vector))

        predict_vector = mlp_user_vector * mlp_item_vector # <- Merge the tensors via element-wise multiplication
        prediction = torch.sigmoid(self.prediction_layer(predict_vector))
        return prediction

Both classes inherit from `DeepLearningRecommender`, without overriding its `_data_generator()` and `fit()` methods. Given that `DeepLearningRecommender` inherits from `BaseRecommender`, we can now use the framework's evaluation functions.

In [19]:
# Train and test type 1
twotower_1 = TwoTowerRecommender_type1(URM_train, URM_train.shape[0], URM_train.shape[1], layers=[10,5,2,2], reg_layers=[0,0,0,0])

twotower_1.fit(epochs=100, batch_size=1024, learning_rate=0.01)

results_df, _ = evaluator_test.evaluateRecommender(twotower_1)

results_df

TWO_TOWER_1: URM Detected 69 ( 0.6%) items with no interactions.
TWO_TOWER_1: Epoch 0 finished. Loss: 1.126193642616272
TWO_TOWER_1: Epoch 1 finished. Loss: 1.0683515071868896
TWO_TOWER_1: Epoch 2 finished. Loss: 1.0175378322601318
TWO_TOWER_1: Epoch 3 finished. Loss: 0.9728472828865051
TWO_TOWER_1: Epoch 4 finished. Loss: 0.9329859614372253
TWO_TOWER_1: Epoch 5 finished. Loss: 0.8970998525619507
TWO_TOWER_1: Epoch 6 finished. Loss: 0.8661079406738281
TWO_TOWER_1: Epoch 7 finished. Loss: 0.8362732529640198
TWO_TOWER_1: Epoch 8 finished. Loss: 0.8134678602218628
TWO_TOWER_1: Epoch 9 finished. Loss: 0.791581928730011
TWO_TOWER_1: Epoch 10 finished. Loss: 0.7729694843292236
TWO_TOWER_1: Epoch 11 finished. Loss: 0.7558155655860901
TWO_TOWER_1: Epoch 12 finished. Loss: 0.7418611645698547
TWO_TOWER_1: Epoch 13 finished. Loss: 0.729861855506897
TWO_TOWER_1: Epoch 14 finished. Loss: 0.7185139060020447
TWO_TOWER_1: Epoch 15 finished. Loss: 0.708511471748352
TWO_TOWER_1: Epoch 16 finished. Loss:

Unnamed: 0_level_0,PRECISION,PRECISION_RECALL_MIN_DEN,RECALL,MAP,MAP_MIN_DEN,MRR,NDCG,F1,HIT_RATE,ARHR_ALL_HITS,...,COVERAGE_USER,COVERAGE_USER_HIT,USERS_IN_GT,DIVERSITY_GINI,SHANNON_ENTROPY,RATIO_DIVERSITY_HERFINDAHL,RATIO_DIVERSITY_GINI,RATIO_SHANNON_ENTROPY,RATIO_AVERAGE_POPULARITY,RATIO_NOVELTY
cutoff,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10,0.036142,0.039076,0.016211,0.0137,0.014795,0.098304,0.024266,0.022383,0.267946,0.115308,...,0.998955,0.267667,0.998955,0.003863,5.724543,0.970558,0.019834,0.505465,0.990784,0.106248


In [20]:
# Train and test type 2
twotower_2 = TwoTowerRecommender_type2(URM_train, URM_train.shape[0], URM_train.shape[1], layers=[10,5,2,2], reg_layers=[0,0,0,0])

twotower_2.fit(epochs=100, batch_size=1024, learning_rate=0.01)

results_df, _ = evaluator_test.evaluateRecommender(twotower_2)

results_df

TWO_TOWER_2: URM Detected 69 ( 0.6%) items with no interactions.
TWO_TOWER_2: Epoch 0 finished. Loss: 0.7569167613983154
TWO_TOWER_2: Epoch 1 finished. Loss: 0.7312115430831909
TWO_TOWER_2: Epoch 2 finished. Loss: 0.7133404016494751
TWO_TOWER_2: Epoch 3 finished. Loss: 0.7009876370429993
TWO_TOWER_2: Epoch 4 finished. Loss: 0.6925101280212402
TWO_TOWER_2: Epoch 5 finished. Loss: 0.6866433620452881
TWO_TOWER_2: Epoch 6 finished. Loss: 0.6820237040519714
TWO_TOWER_2: Epoch 7 finished. Loss: 0.6783905625343323
TWO_TOWER_2: Epoch 8 finished. Loss: 0.6752374768257141
TWO_TOWER_2: Epoch 9 finished. Loss: 0.6725500226020813
TWO_TOWER_2: Epoch 10 finished. Loss: 0.6700538396835327
TWO_TOWER_2: Epoch 11 finished. Loss: 0.6677030920982361
TWO_TOWER_2: Epoch 12 finished. Loss: 0.6654337644577026
TWO_TOWER_2: Epoch 13 finished. Loss: 0.6632372736930847
TWO_TOWER_2: Epoch 14 finished. Loss: 0.6610905528068542
TWO_TOWER_2: Epoch 15 finished. Loss: 0.6589948534965515
TWO_TOWER_2: Epoch 16 finished. L

Unnamed: 0_level_0,PRECISION,PRECISION_RECALL_MIN_DEN,RECALL,MAP,MAP_MIN_DEN,MRR,NDCG,F1,HIT_RATE,ARHR_ALL_HITS,...,COVERAGE_USER,COVERAGE_USER_HIT,USERS_IN_GT,DIVERSITY_GINI,SHANNON_ENTROPY,RATIO_DIVERSITY_HERFINDAHL,RATIO_DIVERSITY_GINI,RATIO_SHANNON_ENTROPY,RATIO_AVERAGE_POPULARITY,RATIO_NOVELTY
cutoff,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10,0.000219,0.00022,3.5e-05,7.5e-05,7.5e-05,0.000645,0.000116,6e-05,0.001977,0.000686,...,0.998955,0.001975,0.998955,0.001135,3.687291,0.913732,0.005825,0.32558,0.007468,0.167939


## Graph Convolution

Graph Convolution methods combine the qualities of Graph-based models to the expressiveness of latent factor-based models. In this session, we will see a short synthesis of the main algorithms, LightGCN and GF-CF.

To build a Graph Convolution method, we need 2 ingredients (plus, we need to decide whether to use self connections or not):
1. A convolution (embedding aggregation) function
2. A loss function

### LightGCN

LightGCN is a barebone, but very effective Graph Convolution recommender.

1. Its convolution function is simply a weighted average (no self connections)
2. It uses **BPR** as loss function

The `computer()` method defines how to obtain the embedding matrix at step $h$.

In [21]:
def computer(self):
        """
        propagate methods for lightGCN
        """
        users_emb = self.embedding_user.weight
        items_emb = self.embedding_item.weight
        all_emb = torch.cat([users_emb, items_emb])
        embs = [all_emb]
        if self.dropout_rate > 0.0:
            if self.training:
                g_dropped = self.__dropout(1 - self.dropout_rate)
            else:
                g_dropped = self.Graph
        else:
            g_dropped = self.Graph

        for layer in range(self.n_layers): # <- n_layers = hops
            all_emb = torch.sparse.mm(g_dropped, all_emb) # <- G * all_emb
            embs.append(all_emb) # <- Collect results
        embs = torch.stack(embs, dim=1)
        light_out = torch.mean(embs, dim=1) # <- Aggregation
        users, items = torch.split(light_out, [self.n_users, self.n_items])
        return users, items

The `forward()` function uses `computer()` to get the embeddings, then computes predictions by multiplying user embeddings and item embeddings together.

In [22]:
def forward(self, users, items):
        # compute embedding
        all_users, all_items = self.computer()
        users_emb = all_users[users]
        items_emb = all_items[items]
        inner_pro = torch.mul(users_emb, items_emb)
        gamma = torch.sum(inner_pro, dim=1)
        return gamma

## A quick word about speed

Let's have a look at what happens when we change the precision of our `dtype` by changing it in the Iris Dataset's initialization. As you can see, on domestic NVIDIA GPUs the training time becomes dramatically higher when working with double precision (float64). That is why we work with float32 most of the time. Sometimes, even going down to 16 bits can be useful. Latest NVIDIA GPUs can also use the brainfloat16 dtype (bfloat16), which is recommended to maximize batch size.

In [None]:
from Recommenders.GraphBased.LightGCNRecommender import LightGCNRecommender

In [25]:
%%timeit
torch.set_default_dtype(torch.float32)
rec = LightGCNRecommender(URM_train)
rec.fit()
result_df, _ = evaluator_test.evaluateRecommender(rec)

result_df

LightGCNRecommender: URM Detected 69 ( 0.6%) items with no interactions.


AssertionError: GPU is requested but not available

In [None]:
%%timeit
torch.set_default_dtype(torch.float64)
rec = LightGCNRecommender(URM_train)
rec.fit()
result_df, _ = evaluator_test.evaluateRecommender(rec)

result_df

In [None]:
%%timeit
torch.set_default_dtype(torch.float16)
rec = LightGCNRecommender(URM_train)
rec.fit()
result_df, _ = evaluator_test.evaluateRecommender(rec)

result_df

### GF-CF

Graph Filter Collaborative Filtering is a Graph Convolution algorithm that combines simple item-based similarity recommendation to graph convolution. It imposes a "high-pass filter" over the adjacency matrix to select short-distance collaborative signals, ignoring those related to popularity.

The model can be summarized by the following formula: $$S = \hat{R^T} \cdot \hat{R} + \alpha D_I^{-\frac12}\cdot V_K \cdot V_K^T \cdot D_I^{+\frac12},$$ where $\hat{R}$ is the L2-normalized URM, $D_I$ is the item degree diagonal matrix, and $V_K^T$ is the singular value matrix obtained from the URM using *Truncated SVD*. Just like $EASE^R$, it's very fast to compute (closed form solution), but very memory intensive. The following is its `fit()` method.

In [23]:
from sklearn.utils.extmath import randomized_svd

def fit(self, alpha=1.0, num_factors=50, random_seed = None):
        self._print("Computing SVD decomposition of the normalized adjacency matrix...")

        self.alpha = alpha

        self.D_I = np.sqrt(np.array(self.URM_train.sum(axis = 0))).squeeze()
        self.D_I_inv = 1/(self.D_I + 1e-6)
        self.D_U_inv = 1/np.sqrt(np.array(self.URM_train.sum(axis = 1))).squeeze() + 1e-6

        self.D_I = sp.diags(self.D_I)
        self.D_I_inv = sp.diags(self.D_I_inv)
        self.D_U_inv = sp.diags(self.D_U_inv)

        self.R_tilde = self.D_U_inv.dot(self.URM_train).dot(self.D_I_inv) # <- Normalized URM

        _, _, self.V = randomized_svd(self.R_tilde,
                                     n_components = num_factors,
                                     random_state = random_seed) # <- Obtain V_K using truncated SVD

        self.D_I = sp.csr_matrix(self.D_I)
        self.D_I_inv = sp.csr_matrix(self.D_I_inv)

The `_compute_item_score()` method will then perform the aggregation of the components we thus obtained following the formula above. To avoid memory errors, matrix multiplications should be performed in blocks, as we have seen covering Graph-based models.