# BT4222: Neural Matrix Factorization (NeuMF) for Movie Recommendation
## Introduction
This notebook presents the development and evaluation of a Neural Matrix Factorization (NeuMF) model for personalized movie recommendation. NeuMF is a hybrid deep learning architecture that combines the strengths of generalized matrix factorization (GMF) and multi-layer perceptrons (MLP) to effectively model both linear and nonlinear user-item interaction patterns.

In this architecture, user and item embeddings are passed through two parallel paths:

- The GMF path performs element-wise interactions to capture linear relationships.

- The MLP path concatenates embeddings and processes them through stacked dense layers to learn complex, high-order patterns.

Metadata embeddings are also fed into the MLP path, allowing the model to learn from additional item-specific features. These precomputed embeddings are derived from metadata attributes such as director, main cast, production company, genre, and language, enriching the model’s understanding of item relationships beyond user-item interactions.

The outputs of the GMF and MLP paths (the latter already incorporating the metadata embeddings) are then concatenated and passed through a final prediction layer. The entire model is trained end-to-end using PyTorch on explicit rating data.

To assess model performance, we evaluate:

- RMSE, which measures the accuracy of predicted ratings.

- Top-K ranking metrics such as Precision@K, Recall@K, NDCG@K, MAP@K, Hit Rate, and MRR, which reflect the quality of the ranked recommendations across users.

- AUC, which evaluates the model’s ability to distinguish between relevant (positive) and irrelevant (negative) items across users.

## Mount Google Drive
This code cell mounts your Google Drive in the Colab environment, enabling access to datasets stored in your drive for subsequent processing.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import Libraries and Load Raw Datasets
This cell imports the required libraries and loads the primary datasets used in the project:

- `df_tmdb_final.csv`: Final movie metadata with embeddings

- `df_links_with_ratings.csv`: User ratings merged with TMDB movie IDs


In [None]:
import pandas as pd
import numpy as np
import ast
import re
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from collections import defaultdict
from tqdm import tqdm
import heapq
import random
from sklearn.metrics import roc_auc_score


df_tmdb_final = pd.read_csv('/content/drive/MyDrive/BT4222_Project/df_tmdb_final.csv')
df_links_with_ratings = pd.read_csv('/content/drive/MyDrive/BT4222_Project/df_links_with_ratings.csv')

## Step 1: Prepare Content Feature Dataset
We create a new DataFrame, `df_content`, by selecting only the columns needed for this model:

- `movie_id`: Unique identifier for each movie  
- `original_title`: Movie title  
- `weighted_vote_score`: Popularity/review score used for ranking  
- `genre`: Genre tags (typically as lists or encoded vectors)  
- `director_embedding`: Numeric embedding representing the movie’s director  
- `main_cast_embeddings`: Averaged or stacked embeddings of main cast members  
- `production_company_embedding`: Embedding of the associated production company  
- `original_language_embedding`: Embedding for the movie's original language  

Duplicate entries are removed based on `movie_id` to ensure each movie appears only once.

### 1.1 String Cleaning through parse_embedding_from_array_wrapper
When the embeddings were extracted and saved during data cleaning, they were stored as strings in the CSV. These strings sometimes include extra wrappers, such as array( ... ), and have numbers separated by spaces instead of commas, so they are not valid Python literals. The `parse_embedding_from_array_wrapper` function recovers the numbers as follows:

- **Removing Array Wrappers**:
It uses a regular expression to detect and remove the "array(...)" wrapper so that only the inner content remains.

- **Stripping Brackets and Newlines**:
It removes newlines and square brackets, leaving a flat run of numbers separated by whitespace or commas.

- **Extracting Floats**:
It then uses another regular expression to pull out every decimal number (including scientific notation) and converts the matches into a float vector.

- **Padding or Truncating**:
The vector is zero-padded or truncated to the target dimension; missing values yield a zero vector.
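
The function defined in the code cell further below recovers the numbers with a regular expression. As a rough, dependency-free sketch of that idea, the snippet below runs a similar pattern on a made-up string. The helper name `extract_floats` and the sample string are hypothetical, and it returns a plain list rather than a NumPy array.

```python
import re

# Matches numbers that contain a decimal point, with optional scientific notation.
FLOAT_PATTERN = re.compile(r'-?\d+\.\d*(?:[eE][+-]?\d+)?')

def extract_floats(s, dim):
    """Pull floats out of a messy string, then pad or truncate to `dim` values."""
    values = [float(tok) for tok in FLOAT_PATTERN.findall(s)]
    values = values[:dim]                          # truncate if too long
    return values + [0.0] * (dim - len(values))    # zero-pad if too short

# Hypothetical CSV cell: wrapper, line break, and a trailing dtype annotation.
sample = "array([ 0.25, -1.5e-03,\n  2.  ], dtype=float32)"
vec = extract_floats(sample, dim=5)
print(vec)
```

Because the pattern requires a decimal point, the `32` in `dtype=float32` is ignored, which is why the wrapper text does not pollute the vector.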

### 1.2 Defining parse_embedding
Embeddings that were saved as well-formed list strings (here, the genre vectors) only need standard parsing back into numerical arrays. The parse_embedding function handles this conversion by:

- **Checking if the input is a string**:
It may be a string representation of a list (e.g., "[0.1, 0.2, 0.3]"), so we need to convert it back to a Python list and then into a NumPy array.

- **Handling errors gracefully**:
If the string cannot be converted (due to formatting issues), the function returns a zero vector with the specified dimension as a fallback.

- **Working with lists or arrays**:
If the embedding is already in list or array format, it converts (or reaffirms) it as a NumPy array.
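
The three cases above can be sketched without NumPy as follows. This is an illustrative stand-in, not the notebook's function: the helper name `parse_list_literal` is hypothetical and it returns plain lists.

```python
import ast

def parse_list_literal(x, dim):
    """Parse a string such as "[0.1, 0.2]" into a list of floats; fall back to zeros."""
    if isinstance(x, str):
        try:
            parsed = ast.literal_eval(x.strip())
            return [float(v) for v in parsed]
        except (ValueError, SyntaxError, TypeError):
            return [0.0] * dim        # malformed string: zero-vector fallback
    if isinstance(x, (list, tuple)):
        return [float(v) for v in x]  # already a sequence
    return [0.0] * dim                # anything else (for example NaN)

print(parse_list_literal("[0.1, 0.2, 0.3]", dim=3))   # valid list literal
print(parse_list_literal("[0.1 0.2 0.3]", dim=3))     # space-separated: not a literal
```

The second call shows why space-separated strings need the regex-based cleaner instead: `ast.literal_eval` rejects them, so only the fallback is returned.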

In [None]:
#step 1: Preparing the Dataset

# Select relevant content-based features
df_content = df_tmdb_final[[
    'movie_id',
    'original_title',
    'weighted_vote_score',
    'genre',
    'director_embedding',
    'main_cast_embeddings',
    'production_company_embedding',
    'original_language_embedding'
]].drop_duplicates(subset='movie_id').reset_index(drop=True)


def parse_embedding_from_array_wrapper(s, dim=300):
    """
    Convert messy string embeddings (with array(...) wrappers or irregular formatting)
    into fixed-length NumPy vectors.
    """
    if pd.isna(s):
        return np.zeros(dim)

    # Clean wrappers and whitespace
    s = s.replace('\n', ' ')                     # Join multi-line arrays first
    s = re.sub(r'array\((.*?)\)', r'\1', s)      # Remove 'array(...)'
    s = s.replace('[', '').replace(']', '')      # Strip brackets

    # Extract floats using regex (handles scientific notation and values like '1.')
    numbers = re.findall(r'-?\d+\.\d*(?:[eE][+-]?\d+)?', s)

    try:
        vec = np.array([float(n) for n in numbers], dtype=np.float32)

        # Pad or truncate to target dimension
        if len(vec) < dim:
            vec = np.pad(vec, (0, dim - len(vec)), mode='constant')
        elif len(vec) > dim:
            vec = vec[:dim]

        return vec

    except ValueError:
        return np.zeros(dim, dtype=np.float32)


def parse_embedding(x, dim=300):
    """
    Parses an embedding from either a valid string or an actual list/array.
    """
    if isinstance(x, str):
        x = x.strip()
        if x.startswith('[') and x.endswith(']') and 'array' not in x:
            try:
                return np.array(ast.literal_eval(x))
            except (ValueError, SyntaxError):
                return np.zeros(dim)
    elif isinstance(x, (list, np.ndarray)):
        return np.array(x)

    return np.zeros(dim)



embedding_cols = [
    'main_cast_embeddings',
    'director_embedding',
    'production_company_embedding',
    'original_language_embedding',
]

# Apply wrapper-based parsing to messy embedding strings
for col in embedding_cols:
    df_content[col] = df_content[col].apply(lambda v: parse_embedding_from_array_wrapper(v, dim=300))

# Apply standard parsing to clean genre embeddings
df_content['genre'] = df_content['genre'].apply(lambda v: parse_embedding(v, dim=300))

df_merged = pd.merge(df_links_with_ratings, df_content, left_on='tmdbId', right_on='movie_id', how='inner')


In [None]:
embedding_cols = [
    'main_cast_embeddings',
    'director_embedding',
    'production_company_embedding',
    'original_language_embedding',
    'genre'
]

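# Note: this counts vectors whose components sum to a positive value, so a
# non-zero vector with a non-positive sum is not counted.
for col in embedding_cols:
    print(f"{col} — valid (non-zero) vectors:", df_content[col].apply(lambda x: np.sum(x) > 0).sum())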

main_cast_embeddings — valid (non-zero) vectors: 268
director_embedding — valid (non-zero) vectors: 1216
production_company_embedding — valid (non-zero) vectors: 1341
original_language_embedding — valid (non-zero) vectors: 1418
genre — valid (non-zero) vectors: 1446


### 1.3 Combining Embeddings and Extracting the Relevant Columns
This code combines multiple feature embeddings (such as main cast, director, production company, genre, and original language) into a single metadata embedding vector for each movie. It concatenates the embeddings for each movie and stores them in the `combined_metadata` column. The final dataset, `df_ratings`, includes userId, movieId, rating, and combined_metadata. The code also checks for consistent embedding lengths and prints the dimension of the combined metadata vector, ensuring that all vectors have the same length for model input.

In [None]:
# Combine all into a single metadata embedding vector
df_merged['combined_metadata'] = df_merged.apply(
    lambda row: np.concatenate([row['main_cast_embeddings'].astype(np.float32),
                               row['director_embedding'].astype(np.float32),
                               row['production_company_embedding'].astype(np.float32),
                               row['genre'].astype(np.float32),
                               row['original_language_embedding'].astype(np.float32)]),
    axis=1
)
# Final dataset to be used, columns extracted
df_ratings = df_merged[['userId', 'movieId', 'rating', 'combined_metadata']].copy()  # copy so later column assignments do not hit a slice

# Checks for embedding lengths and print dimension of metadata vector
embedding_lengths = df_ratings['combined_metadata'].apply(lambda x: len(x))
assert embedding_lengths.nunique() == 1, "Inconsistent embedding lengths detected!"
metadata_dim = embedding_lengths.iloc[0]
print(f"Combined metadata dimension: {metadata_dim}")

Combined metadata dimension: 1246
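
The reported dimension is consistent with simple arithmetic on the component sizes. The sketch below assumes the four regex-parsed embeddings are each 300-dimensional (the `dim=300` passed above) and infers the genre length as the remainder. The value 46 is an inference from the printed total, not something the notebook states directly.

```python
# Component sizes: four embeddings parsed with dim=300, as in the cells above.
embedding_dims = {
    'main_cast_embeddings': 300,
    'director_embedding': 300,
    'production_company_embedding': 300,
    'original_language_embedding': 300,
}
reported_total = 1246                      # value printed by the cell above

# Whatever is left over must come from the genre vector (inferred, not stated).
genre_dim = reported_total - sum(embedding_dims.values())
print(genre_dim)

# Concatenation simply appends the vectors, so the lengths add up.
combined = []
for size in list(embedding_dims.values()) + [genre_dim]:
    combined.extend([0.0] * size)
print(len(combined) == reported_total)
```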


## Step 2: Splitting User Interaction Data into Train and Test Sets by Ratio
To ensure a fair and user-centric evaluation, we split the ratings data per user, such that roughly 80% of each user’s ratings are allocated for training and the remaining 20% for testing. Users with only a single rating are dropped, so every remaining user is represented in both sets, and the model is evaluated on items it has not seen for that specific user.

The split is randomized but reproducible via a fixed random seed, and it guarantees at least one test item per retained user, so no user is evaluated without training data.

Outputs:

- `train_df`: User-item interaction data used for training the model
- `test_df`: Held-out interactions used for Top-K recommendation evaluation
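
The sizing rule for a single user can be sketched on its own. The helper `split_one_user` below is hypothetical and works on a plain list rather than a DataFrame; it only illustrates the "at least one test item, skip users with too few ratings" logic with a seeded shuffle.

```python
import random

def split_one_user(items, test_ratio=0.2, seed=42):
    """Return (train, test) for one user's items, or None if the user has too few."""
    n_test = max(1, int(len(items) * test_ratio))   # always hold out at least one
    if len(items) <= n_test:
        return None                                  # nothing would be left to train on
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)            # seeded, so the split is repeatable
    return shuffled[n_test:], shuffled[:n_test]

print(split_one_user(['a']))                 # single rating: user is skipped
train, test = split_one_user(list(range(10)))
print(len(train), len(test))                 # 10 ratings at ratio 0.2
```

With ten ratings and a ratio of 0.2, two items are held out and eight remain for training; a user with one rating yields `None` and would be excluded from both sets.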

In [None]:
def user_train_test_split_by_ratio(df, test_ratio=0.2, user_col='userId', item_col='movieId', random_state=42):
    """
    Splits each user's interactions into train and test sets by percentage.

    Parameters:
        df (pd.DataFrame): Input user-item interaction DataFrame.
        test_ratio (float): Proportion of each user’s interactions to hold out for test.
        user_col (str): User ID column name.
        item_col (str): Item ID column name.
        random_state (int): Random seed for reproducibility.

    Returns:
        train_df, test_df: DataFrames split per user.
    """
    train_list = []
    test_list = []

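    for user_id, user_data in df.groupby(user_col):
        n_test = max(1, int(len(user_data) * test_ratio))
        if len(user_data) <= n_test:
            continue  # too few ratings to leave any for training
        # Shuffle this user's rows so the held-out items are a random sample
        user_data = user_data.sample(frac=1, random_state=random_state)
        test = user_data.iloc[:n_test]
        train = user_data.iloc[n_test:]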
        test_list.append(test)
        train_list.append(train)

    train_df = pd.concat(train_list).reset_index(drop=True)
    test_df = pd.concat(test_list).reset_index(drop=True)
    return train_df, test_df

train_df, test_df = user_train_test_split_by_ratio(df_ratings, test_ratio=0.2)


## Step 3: Neural Matrix Factorization (NeuMF) with PyTorch
In this step, we implement the Neural Matrix Factorization (NeuMF) model using PyTorch. NeuMF is a hybrid deep learning architecture that combines Generalized Matrix Factorization (GMF) and Multi-Layer Perceptrons (MLP) to model both linear and non-linear user-item interaction patterns.

The model is designed to learn from both explicit user-item interactions (ratings) and additional metadata embeddings. Specifically, it integrates two main components:

- GMF (Generalized Matrix Factorization): This path captures linear relationships between users and items by learning low-dimensional embeddings for users and items, and then performing element-wise interactions between them.

- MLP (Multi-Layer Perceptron): This path learns non-linear interaction patterns by concatenating the embeddings of users, items, and metadata, followed by a series of dense layers to learn complex relationships. This helps the model capture high-order interactions.

These two paths are combined, and the output is passed through a final layer that predicts the rating for each user-item pair. Additionally, `metadata` embeddings (such as director, cast, genre, language, etc.) are incorporated into the MLP path to enrich the representation of items, making the model more robust to data sparsity and improving the recommendation quality.

Components:
- `RatingDataset`: A custom PyTorch dataset that returns (user, item, metadata, rating) tuples, so that a `DataLoader` can batch them during training.

- `NeuMFWithMetadata`: The main model class, integrating GMF and MLP components, along with metadata processing. This model is capable of learning from both user-item interactions and metadata embeddings.

Outputs:
A NeuMF model class that can be instantiated and trained on batches of user-item interactions and metadata. The model predicts ratings by learning from both shallow linear interactions (GMF) and deep non-linear patterns (MLP).
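
To make the tensor shapes concrete, the sketch below works out the input and output size of each linear layer from the hyperparameters. It assumes the default `mf_dim` and `mlp_dims` of the class defined in the next cell and the metadata dimension of 1246 printed earlier. The helper `neumf_layer_sizes` is hypothetical and does no learning; it only does the arithmetic.

```python
def neumf_layer_sizes(mf_dim=32, mlp_dims=(128, 64, 32), metadata_dim=1246):
    """Return the (in, out) sizes of each Linear layer implied by the hyperparameters."""
    mlp_embed_dim = mlp_dims[0] // 4                 # per-entity MLP embedding size
    in_size = 2 * mlp_embed_dim + metadata_dim       # user + item + metadata
    mlp = []
    for out_size in mlp_dims:
        mlp.append((in_size, out_size))
        in_size = out_size
    output = (mf_dim + mlp_dims[-1], 1)              # GMF vector + last MLP layer
    return mlp, output

mlp, output = neumf_layer_sizes()
print(mlp)
print(output)
```

Under these assumptions the first MLP layer receives 32 + 32 + 1246 = 1310 features, so the metadata vector dominates the MLP input, while the GMF path contributes only its 32-dimensional element-wise product to the final layer.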

In [None]:
class RatingDataset(Dataset):
    def __init__(self, df):
        self.users = df['user'].values
        self.items = df['item'].values
        self.ratings = df['rating'].values
        self.metadata = np.stack(df['combined_metadata'].values)

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.metadata[idx], self.ratings[idx]


class NeuMFWithMetadata(nn.Module):
    def __init__(self, n_users, n_items, mf_dim=32, mlp_dims=(128, 64, 32), metadata_dim=800):
        super().__init__()

        # GMF embeddings (no metadata)
        self.mf_user_embed = nn.Embedding(n_users, mf_dim)
        self.mf_item_embed = nn.Embedding(n_items, mf_dim)

        # MLP embeddings
        mlp_embed_dim = mlp_dims[0] // 4
        self.mlp_user_embed = nn.Embedding(n_users, mlp_embed_dim)
        self.mlp_item_embed = nn.Embedding(n_items, mlp_embed_dim)

        # Input size to MLP = user_embed + item_embed + metadata_emb
        mlp_input_size = 2 * mlp_embed_dim + metadata_dim

        # MLP layers
        mlp_layers = []
        input_size = mlp_input_size
        for dim in mlp_dims:
            mlp_layers.append(nn.Linear(input_size, dim))
            mlp_layers.append(nn.ReLU())
            input_size = dim
        self.mlp = nn.Sequential(*mlp_layers)

        # Final output layer (GMF + MLP output)
        self.output_layer = nn.Linear(mf_dim + mlp_dims[-1], 1)

    def forward(self, user, item, metadata_embedding):
        # GMF path (no metadata)
        mf_user = self.mf_user_embed(user)
        mf_item = self.mf_item_embed(item)
        mf_vector = mf_user * mf_item

        # MLP path (user + item + metadata)
        mlp_user = self.mlp_user_embed(user)
        mlp_item = self.mlp_item_embed(item)
        mlp_input = torch.cat([mlp_user, mlp_item, metadata_embedding], dim=-1)  # Concatenate metadata
        mlp_vector = self.mlp(mlp_input)

        # Combine both paths
        combined = torch.cat([mf_vector, mlp_vector], dim=-1)
        prediction = self.output_layer(combined).squeeze(-1)  # keep the batch dimension even for a batch of one
        return prediction


## Step 4: Training the Neural Matrix Factorization (NeuMF) Model
In this step, we train the Neural Matrix Factorization (NeuMF) model, which combines the strengths of Generalized Matrix Factorization (GMF) and Multi-Layer Perceptron (MLP), while incorporating metadata embeddings to enhance prediction accuracy.

User and item IDs are first label-encoded into integer indices, which are then used to create a custom `RatingDataset`. The dataset is loaded into PyTorch `DataLoader`s for efficient mini-batch training. The model is trained with Mean Squared Error (MSE) loss, which measures the discrepancy between predicted and actual ratings, and optimized using the Adam optimizer.
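
Label encoding matters here because an embedding table is indexed by position, so raw IDs must be mapped to contiguous integers starting at zero. The sketch below mimics that mapping conceptually. It is not scikit-learn's implementation, and `fit_label_encoder` and the sample IDs are hypothetical.

```python
def fit_label_encoder(raw_ids):
    """Map each distinct ID to a contiguous index, in sorted order."""
    classes = sorted(set(raw_ids))
    return {raw: idx for idx, raw in enumerate(classes)}, classes

movie_ids = [296, 1, 50, 296, 1]          # hypothetical raw movieId values
mapping, classes = fit_label_encoder(movie_ids)
encoded = [mapping[m] for m in movie_ids]
decoded = [classes[i] for i in encoded]   # inverse transform
print(encoded)
print(decoded)
```

The number of distinct classes is what sizes the embedding tables, and the inverse lookup is what later restores the original IDs for reporting.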

During training:

- The model processes batches of (user, item, metadata, rating) tuples.

- It computes predictions through the GMF and MLP paths, which are later combined.

- Metadata embeddings (e.g., movie genre, director, cast) are also used in the MLP path to provide additional context, helping the model learn richer item representations.

- The model performs backpropagation and updates the parameters after each batch.

This process is repeated over multiple epochs, allowing the model to converge and better fit the training data.

Components:
- `NeuMF Model`: Combines GMF and MLP components, with metadata embeddings incorporated into the MLP path.

- `DataLoader`: Efficiently streams batches of user-item-rating data for training.

- `MSE Loss`: Guides the model to minimize prediction error on ratings.

- `Adam Optimizer`: An adaptive optimization algorithm used to fine-tune the model’s parameters.

Output:
A fully trained NeuMF model that learns to predict ratings by modeling both shallow linear interactions (via GMF) and deep non-linear patterns (via MLP), with enriched metadata features. The average loss printed after each epoch shows how well the model is fitting the training data.

In [None]:
le_user = LabelEncoder()
le_item = LabelEncoder()

df_ratings.loc[:, 'user'] = le_user.fit_transform(df_ratings['userId'])
df_ratings.loc[:, 'item'] = le_item.fit_transform(df_ratings['movieId'])

n_users = df_ratings['user'].nunique()
n_items = df_ratings['item'].nunique()

# Apply label encoding to split sets
train_df['user'] = le_user.transform(train_df['userId'])
train_df['item'] = le_item.transform(train_df['movieId'])

test_df['user'] = le_user.transform(test_df['userId'])
test_df['item'] = le_item.transform(test_df['movieId'])

train_dataset = RatingDataset(train_df)
test_dataset = RatingDataset(test_df)

train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=512, shuffle=False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = NeuMFWithMetadata(n_users, n_items, metadata_dim=metadata_dim).to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


# Train loop
for epoch in range(5):
    model.train()
    epoch_loss = 0
    for users, items, metadata, ratings in train_loader:
        users = users.to(device)
        items = items.to(device)
        metadata = metadata.float().to(device)
        ratings = ratings.float().to(device)

        preds = model(users, items, metadata)
        loss = criterion(preds, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(train_loader)
    print(f"Epoch {epoch+1}, Avg Loss: {avg_loss:.4f}")



Epoch 1, Avg Loss: 0.8428
Epoch 2, Avg Loss: 0.7275
Epoch 3, Avg Loss: 0.7046
Epoch 4, Avg Loss: 0.6928
Epoch 5, Avg Loss: 0.6796


## Step 5: NeuMF Model Prediction and RMSE Evaluation
In this step, we evaluate the performance of the trained NeuMF model by generating rating predictions for the test set and calculating the Root Mean Squared Error (RMSE). RMSE is the square root of the mean squared difference between predicted and actual ratings, so it is expressed in rating units and penalizes large errors more heavily than small ones.

We begin by switching the model to evaluation mode (`model.eval()`) and disabling gradient computation (`torch.no_grad()`) for efficiency. The model then generates predictions for all user-item pairs in the test set. These predicted ratings are compared with the ground-truth ratings to assess how well the model has learned user preferences.

The predicted ratings, along with their corresponding user and item IDs, are stored in a DataFrame. The label encoders are used to reverse the encoding and convert the indices back to the original userId and movieId for readability.

Steps Involved:
- Switch the model to evaluation mode (`model.eval()`) and disable gradients (`torch.no_grad()`).

- Generate predictions for each user-item pair in the test set.

- Collect predictions and actual ratings into a DataFrame.

- Reverse the label encoding for userId and movieId to return the original values.

- Calculate RMSE to measure the prediction error.

Outputs:
- `rmse_df`: A DataFrame containing the encoded user and item indices, the original userId and movieId, the true rating, and the predicted rating.

- `rmse`: A scalar value representing the RMSE on the test set.

This RMSE score serves as an interpretable metric for evaluating how well the NeuMF model predicts user ratings and how closely it approximates actual user preferences.
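
As a small worked example of the metric itself, the sketch below computes RMSE by hand on four made-up rating pairs. The function name `rmse_score` and the numbers are hypothetical and unrelated to the notebook's data.

```python
import math

def rmse_score(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

true_ratings = [4.0, 3.0, 5.0, 2.0]        # hypothetical ratings
predicted    = [3.5, 3.0, 4.0, 3.0]        # hypothetical predictions
value = rmse_score(true_ratings, predicted)
print(value)
```

The squared errors are 0.25, 0, 1, and 1, their mean is 0.5625, and the square root is 0.75. The two one-star misses dominate the result, which illustrates how RMSE weights large errors.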

In [None]:
# Prepare NeuMF predictions
model.eval()
preds = []
true_ratings = []
user_ids = []
item_ids = []

with torch.no_grad():
    for users, items, metadata, ratings in test_loader:
        users = users.to(device)
        items = items.to(device)
        metadata = metadata.float().to(device)

        outputs = model(users, items, metadata).cpu().numpy()
        preds.extend(outputs)
        true_ratings.extend(ratings.numpy())
        user_ids.extend(users.cpu().numpy())
        item_ids.extend(items.cpu().numpy())

# Build results DataFrame
rmse_df = pd.DataFrame({
    'user': user_ids,
    'item': item_ids,
    'true_rating': true_ratings,
    'predicted_rating': preds
})

# Inverse label encoding for readability
rmse_df['userId'] = le_user.inverse_transform(rmse_df['user'])
rmse_df['movieId'] = le_item.inverse_transform(rmse_df['item'])

# Compute RMSE
rmse = np.sqrt(mean_squared_error(rmse_df['true_rating'], rmse_df['predicted_rating']))
print(f"NeuMF with Metadata RMSE: {rmse:.4f}")


NeuMF with Metadata RMSE: 0.8813


## Step 6: Batched Rank-All Evaluation of the Model
In this step, we implement a batched Rank-All evaluation procedure to assess the performance of the trained NeuMF model, which incorporates metadata embeddings in the prediction process. This evaluation simulates a realistic recommendation scenario where the model ranks all unseen items for each user and recommends the Top-K most relevant items.

To improve computational efficiency, scores for all candidate items are computed in batches using PyTorch tensor operations. For each user, the model computes scores for all candidate items (which are the items not seen during training) and also includes the test items in the candidate pool to ensure the model has the opportunity to rank them correctly.

This evaluation is done using the following Top-K ranking metrics:

- Precision@K: The proportion of items in the top-K recommendations that are relevant (i.e., items in the test set for the user).

- Recall@K: The proportion of the user’s relevant items that appear in the top-K recommendations.

- Hit Rate@K: The proportion of users who received at least one relevant recommendation in their top-K list.

- MRR@K (Mean Reciprocal Rank): The reciprocal of the rank of the first relevant item in the top-K list (0 if none appears), averaged over users.

- MAP@K (Mean Average Precision): Averages the precision values at the ranks where relevant items appear in the top-K list.

- NDCG@K (Normalized Discounted Cumulative Gain): Penalizes relevant items ranked lower and rewards those ranked higher in the list.
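
The metric definitions above can be checked on a toy ranking for a single user. The sketch below is a simplified, standalone version: the function `topk_metrics` and the sample items are hypothetical, and average precision is normalized by the number of relevant items, which is the convention the evaluation cell below also uses.

```python
import math

def topk_metrics(ranked, relevant, k):
    """Compute top-K metrics for one user from a ranked list and a set of relevant items."""
    top_k = ranked[:k]
    hit_flags = [1 if item in relevant else 0 for item in top_k]
    n_hits = sum(hit_flags)

    precision = n_hits / k
    recall = n_hits / len(relevant)
    hit_rate = 1.0 if n_hits else 0.0

    # Reciprocal rank of the first relevant item (0 if none appears).
    rr = next((1.0 / (i + 1) for i, h in enumerate(hit_flags) if h), 0.0)

    # Average precision: precision at each rank that holds a relevant item.
    ap = sum(sum(hit_flags[:i + 1]) / (i + 1) for i, h in enumerate(hit_flags) if h)
    ap /= len(relevant)

    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hit_flags))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, hit_rate, rr, ap, ndcg

ranked = ['A', 'B', 'C', 'D', 'E']     # hypothetical ranking, best first
relevant = {'B', 'E', 'Z'}             # hypothetical held-out items
metrics = topk_metrics(ranked, relevant, k=5)
print(metrics)
```

Here two of the three relevant items appear at ranks 2 and 5, so precision is 2/5, recall is 2/3, the reciprocal rank is 1/2, and average precision is (1/2 + 2/5) / 3 = 0.3.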

The evaluation passes the metadata embeddings (e.g., director, cast, genre) to the model alongside each candidate item. The same procedure is run on both the train and test sets so that the two sets of metrics can be compared.

Output:
A printed summary of the mean Top-K metrics across all users, reflecting the model’s ability to rank and recommend relevant items when faced with a large number of candidate items. These metrics show how well the NeuMF model, with metadata, performs in a full ranking scenario.

In [None]:
# === Build user-item interaction dictionaries from DataFrame ===
def get_user_item_dict(df, user_col='user', item_col='item'):
    user_item_dict = defaultdict(set)
    for row in df.itertuples():
        user_item_dict[getattr(row, user_col)].add(getattr(row, item_col))
    return user_item_dict

# Build a mapping from item index to metadata vector
item_metadata_dict = dict(zip(
    df_ratings['item'],  # this is after label encoding
    df_ratings['combined_metadata']
))

# === Batched Top-K Evaluation ===
def evaluate_rank_all_batched_with_metadata(model, train_dict, test_dict, n_items, item_metadata_dict, K=10, device='cpu'):
    precision_list, recall_list, hit_list, ndcg_list, map_list, mrr_list = [], [], [], [], [], []

    model.eval()
    with torch.no_grad():
        for user in tqdm(test_dict.keys(), desc="Rank-All Evaluation (With Metadata)"):
            test_items = test_dict[user]
            if len(test_items) == 0:
                continue

            # Candidates: every item the user did not interact with in training,
            # plus the held-out items themselves (set union avoids slow list lookups)
            candidate_items = list((set(range(n_items)) - train_dict[user]) | set(test_items))

            # Prepare input tensors
            user_tensor = torch.tensor([user] * len(candidate_items), dtype=torch.long).to(device)
            item_tensor = torch.tensor(candidate_items, dtype=torch.long).to(device)

            metadata_tensor = torch.stack([torch.tensor(item_metadata_dict[i]) for i in candidate_items]).float().to(device)

            scores = model(user_tensor, item_tensor, metadata_tensor).cpu().numpy()
            ranked_items = list(zip(candidate_items, scores))

            # Top-K selection
            top_k = heapq.nlargest(K, ranked_items, key=lambda x: x[1])
            top_k_items = set([i[0] for i in top_k])
            hits = [item for item in test_items if item in top_k_items]

            # Metrics
            precision_list.append(len(hits) / K)
            recall_list.append(len(hits) / len(test_items))
            hit_list.append(1.0 if hits else 0.0)

            # Average precision: accumulate precision at each rank holding a relevant item
            ap, n_hits = 0.0, 0
            for idx, (item_id, _) in enumerate(top_k):
                if item_id in test_items:
                    n_hits += 1
                    ap += n_hits / (idx + 1)
            map_list.append(ap / len(test_items))

            rr = 0.0
            for idx, (item_id, _) in enumerate(top_k):
                if item_id in test_items:
                    rr = 1.0 / (idx + 1)
                    break
            mrr_list.append(rr)

            dcg = sum([1.0 / np.log2(idx + 2) for idx, (itm, _) in enumerate(top_k) if itm in test_items])
            idcg = sum([1.0 / np.log2(i + 2) for i in range(min(len(test_items), K))])
            ndcg = dcg / idcg if idcg > 0 else 0.0
            ndcg_list.append(ndcg)

    print(f"\nPrecision@{K}: {np.mean(precision_list):.4f}")
    print(f"Recall@{K}: {np.mean(recall_list):.4f}")
    print(f"Hit Rate@{K}: {np.mean(hit_list):.4f}")
    print(f"MRR@{K}: {np.mean(mrr_list):.4f}")
    print(f"MAP@{K}: {np.mean(map_list):.4f}")
    print(f"NDCG@{K}: {np.mean(ndcg_list):.4f}")


train_user_items = get_user_item_dict(train_df)
test_user_items = get_user_item_dict(test_df)

print("Model's performance on Test Set\n")
evaluate_rank_all_batched_with_metadata(
    model=model,
    train_dict=train_user_items,
    test_dict=test_user_items,
    item_metadata_dict=item_metadata_dict,
    n_items=n_items,
    K=10,
    device=device
)

print()
print("Model's performance on Train Set\n")
evaluate_rank_all_batched_with_metadata(
    model=model,
    train_dict=train_user_items,
    test_dict=train_user_items,
    item_metadata_dict=item_metadata_dict,
    n_items=n_items,
    K=10,
    device=device
)


Model's performance on Test Set



Rank-All Evaluation (With Metadata): 100%|██████████| 215586/215586 [1:07:15<00:00, 53.42it/s]



Precision@10: 0.1048
Recall@10: 0.1943
Hit Rate@10: 0.4647
MRR@10: 0.2782
MAP@10: 0.1019
NDCG@10: 0.1885
Model's performance on Train Set



Rank-All Evaluation (With Metadata): 100%|██████████| 215586/215586 [1:10:18<00:00, 51.11it/s]



Precision@10: 0.0983
Recall@10: 0.0760
Hit Rate@10: 0.5069
MRR@10: 0.1993
MAP@10: 0.0335
NDCG@10: 0.1125


## Step 7: AUC Evaluation of the Model
In this step, we compute the Area Under the ROC Curve (AUC) to evaluate the model’s ability to distinguish between relevant (positive) and irrelevant (negative) items for each user. A higher AUC score indicates that the model is better at distinguishing between relevant and irrelevant items.

For each user:
- Positive items are those present in the user's test set (i.e., held-out items the user has rated).

- Negative items are drawn from the items the user has not interacted with in either the training or the test set; a random sample of up to 100 such items is used.

- The model generates predicted scores for both positive and negative items, and AUC is computed based on how well the model ranks relevant items higher than irrelevant ones.

AUC Calculation:
- A user-specific AUC is computed using roc_auc_score from sklearn.metrics, which compares the predicted scores for relevant (positive) and irrelevant (negative) items.

- Only users who have both positive and negative samples are included, because AUC is undefined when only one class is present.

- The final output is the mean AUC across all users, representing the model’s ability to rank relevant items higher than irrelevant ones. The same computation is repeated on the training set, with each user's training items as positives, for comparison.

Key Steps:
- Sample negative items from the pool of unseen items for each user.

- Compute predicted scores for both positive (test) and negative (sampled) items.

- Calculate AUC using roc_auc_score, comparing the predicted scores and true labels (1 for relevant, 0 for irrelevant).

- Mean AUC is computed across all users.

Output:
- The mean AUC score represents the model's overall ability to correctly rank relevant items higher than irrelevant ones, independent of a fixed threshold.
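
AUC has a direct pairwise reading: it is the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as half. The sketch below computes it that way for one user. The function `pairwise_auc` and the scores are hypothetical; it illustrates the definition and is not how `roc_auc_score` is implemented.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))

pos = [4.2, 3.1]              # hypothetical scores for held-out (positive) items
neg = [3.5, 2.0, 1.0]         # hypothetical scores for sampled negatives
auc_value = pairwise_auc(pos, neg)
print(auc_value)
```

Five of the six pairs are ordered correctly (only 3.1 versus 3.5 is wrong), giving 5/6. Because only the ordering matters, the result does not depend on the scale of the predicted ratings.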



In [None]:
# === Compute AUC for Train Set ===
random.seed(42)  # fix the negative sampling so the AUC values are reproducible
user_aucs_train = []

for user in train_user_items:
    positives = train_user_items[user]
    negatives = list(set(range(n_items)) - positives)
    if len(positives) == 0 or len(negatives) == 0:
        continue

    # Sample negatives to reduce computation
    sampled_negatives = random.sample(negatives, min(len(negatives), 100))
    items = list(positives) + sampled_negatives
    labels = [1] * len(positives) + [0] * len(sampled_negatives)

    # Score all items for this user
    user_tensor = torch.tensor([user] * len(items)).to(device)
    item_tensor = torch.tensor(items).to(device)

    # Metadata tensor for this batch
    metadata_tensor = torch.stack([torch.tensor(item_metadata_dict[i]) for i in items]).float().to(device)

    # Pass metadata to the model during prediction
    scores = model(user_tensor, item_tensor, metadata_tensor).detach().cpu().numpy()

    # Calculate AUC if both labels exist
    if len(set(labels)) == 2:
        auc = roc_auc_score(labels, scores)
        user_aucs_train.append(auc)

print(f"Train AUC: {np.mean(user_aucs_train):.4f}")

# === Compute AUC for Test Set ===
user_aucs = []

for user in test_user_items:
    positives = test_user_items[user]
    negatives = list(set(range(n_items)) - train_user_items[user] - positives)
    if len(positives) == 0 or len(negatives) == 0:
        continue

    sampled_negatives = random.sample(negatives, min(len(negatives), 100))  # or use all
    items = list(positives) + sampled_negatives
    labels = [1] * len(positives) + [0] * len(sampled_negatives)

    user_tensor = torch.tensor([user] * len(items)).to(device)
    item_tensor = torch.tensor(items).to(device)

    # Metadata tensor for this batch
    metadata_tensor = torch.stack([torch.tensor(item_metadata_dict[i]) for i in items]).float().to(device)

    # Pass metadata to the model during prediction
    scores = model(user_tensor, item_tensor, metadata_tensor).detach().cpu().numpy()

    if len(set(labels)) == 2:
        auc = roc_auc_score(labels, scores)
        user_aucs.append(auc)

print(f"Test AUC: {np.mean(user_aucs):.4f}")


Train AUC: 0.7582
Test AUC: 0.8236
