## Model: NeuMF (Neural Collaborative Filtering)

This notebook implements the **NeuMF** model as described in:

> Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua.  
> **Neural Collaborative Filtering**  
> WWW 2017.  
> [PDF](https://arxiv.org/pdf/1708.05031.pdf)

### Core idea
NeuMF fuses two models:
- GMF (Generalized Matrix Factorization): A linear latent factor model.
- MLP (Multi-Layer Perceptron): A deep model that captures non-linear user–item interactions.

These are combined into a unified neural architecture to predict interaction likelihoods.
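
In symbols, the fusion can be sketched as follows, in notation close to the paper's. Here $\mathbf{p}_u$ and $\mathbf{q}_i$ are the user and item embeddings (superscripts $G$ and $M$ mark the separate GMF and MLP copies), $\odot$ is the element-wise product, $a_\ell$, $\mathbf{W}_\ell$, and $\mathbf{b}_\ell$ are the activation, weights, and bias of MLP layer $\ell$, $\mathbf{h}$ is the weight vector of the final layer, and $\sigma$ is the sigmoid:

$$
\begin{aligned}
\phi^{GMF} &= \mathbf{p}_u^{G} \odot \mathbf{q}_i^{G}, \\
\phi^{MLP} &= a_L\!\left(\mathbf{W}_L^{\top}\left(\cdots a_1\!\left(\mathbf{W}_1^{\top}\begin{bmatrix}\mathbf{p}_u^{M} \\ \mathbf{q}_i^{M}\end{bmatrix} + \mathbf{b}_1\right)\cdots\right) + \mathbf{b}_L\right), \\
\hat{y}_{ui} &= \sigma\!\left(\mathbf{h}^{\top}\begin{bmatrix}\phi^{GMF} \\ \phi^{MLP}\end{bmatrix}\right).
\end{aligned}
$$

The code in this notebook follows the same structure, with ReLU activations and an additional bias term in the final linear layer.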

### Implementation notes
- The model consists of embedding layers, MLP layers, and a fusion layer.
- Binary cross-entropy is used as the loss function.
- Training uses negative sampling and is optimized with Adam.

The architecture matches the original paper’s proposal and is structured for comparison with other recommenders and post-processing reranking methods.
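
As a rough illustration of the loss and sampling scheme listed in the implementation notes, the sketch below computes binary cross-entropy over one positive and four uniformly sampled negatives for a single user. It is a minimal standalone example, not the notebook's training loop: the helper names `bce` and `sample_negatives`, the item ids, and the predicted probabilities are all hypothetical.

```python
import math
import random

def bce(prob, label):
    """Binary cross-entropy for one predicted probability in (0, 1)."""
    eps = 1e-12  # keeps log() away from 0
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def sample_negatives(seen_items, n_items, k, rng):
    """Draw k item ids uniformly (with replacement) from items the user has not seen."""
    negatives = []
    while len(negatives) < k:
        item = rng.randrange(n_items)
        if item not in seen_items:
            negatives.append(item)
    return negatives

rng = random.Random(0)
seen = {1, 4}  # hypothetical positives for one user
negs = sample_negatives(seen, n_items=10, k=4, rng=rng)

# Hypothetical model outputs: one confident positive, four low-scored negatives.
batch = [(0.9, 1.0)] + [(0.2, 0.0)] * len(negs)
mean_loss = sum(bce(p, y) for p, y in batch) / len(batch)
print(f"negatives={negs}, mean BCE={mean_loss:.4f}")
```

With four negatives per positive, most of the loss mass comes from the negative class, which is one reason the sampling ratio can noticeably affect ranking metrics.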


### 📊 Comparison with RecBole: NeuMF Implementation

This notebook implements the **NeuMF** architecture based on *He et al. (2017)*, combining both GMF and MLP components to learn user-item interactions. While the design remains faithful to the original paper, there are notable architectural and procedural differences when compared to RecBole’s built-in `NeuMF` model.

#### 🔍 Metric Comparison (ML-100K, Original NeuMF Only)

| Metric         | RecBole NeuMF | Custom NeuMF | Δ (%)        |
|----------------|---------------|---------------|--------------|
| NDCG@10        | 0.2812        | 0.2657        | –5.51%       |
| Precision@10   | 0.1940        | 0.2928        | **+50.93%**  |
| Recall@10      | 0.2410        | 0.1942        | –19.41%      |
| Gini Index     | 0.8954        | 0.6376        | –28.79%      |
| Item Coverage  | 0.3363        | 0.4444        | **+32.16%**  |
| Entropy        | 0.0096        | 0.7905        | **+8140%**   |
| Tail %         | 0.0           | 0.0           | –            |

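> *Note: the two entropy values are on very different scales, which suggests that the metric is defined or normalized differently in the two pipelines. This notebook divides the Shannon entropy of item occurrence counts by its maximum attainable value, so the Δ for this row is probably not a like-for-like comparison.*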
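
The Δ column appears to be the relative change of the custom value with respect to the RecBole value. A minimal sketch of that calculation, using two rows from the table (the function name `relative_change` is only illustrative):

```python
def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0

ndcg_delta = relative_change(0.2812, 0.2657)        # NDCG@10 row
precision_delta = relative_change(0.1940, 0.2928)   # Precision@10 row
print(f"{ndcg_delta:+.2f}%  {precision_delta:+.2f}%")
```

Other rows can differ from this calculation in the last digit if the reported deltas were computed from unrounded metric values.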

---

### Explanation of Deviations from RecBole

The following technical differences likely explain the observed deviations between this implementation and RecBole’s internal `NeuMF`:

- **Model Architecture**  
  The NeuMF model here separates GMF and MLP paths and merges them only at the final layer.  
  RecBole may differ in the way it fuses or pretrains these components.

- **Training Procedure**  
  This code uses **binary cross-entropy** with **uniform negative sampling** and PyTorch `DataLoader`.  
  RecBole offers several loss functions and sampling schemes configurable via YAML (e.g., BPR, popularity-based negatives).

- **Embedding Initialization**  
  Embeddings are initialized with a **normal distribution** (std=0.01).  
  Many RecBole models use **Xavier** initialization instead; the initializer of RecBole's `NeuMF` should be checked in its source before attributing any gap to this difference.

- **Data Splitting and Filtering**  
  This implementation uses a random **80/20 split stratified by user**, with no user or item filtering.  
  RecBole filters users/items based on configurable thresholds and supports **leave-one-out** and **chronological splits**.

- **Relevance Scoring for NDCG**  
  Here, **raw ratings** (1 to 5) enter DCG through the exponential gain `2^rel - 1`, and every held-out rating counts as relevant regardless of its value.  
  RecBole often applies **binary relevance**, marking items as relevant or not.

- **Evaluation Framework**  
  All metrics are computed explicitly via NumPy, giving full transparency.  
  RecBole uses an internal evaluator which may include cold-start filtering or thresholding.

- **Reranking Compatibility**  
  This code enables custom rerankers like **Simple** and **MMR**, applied post hoc.  
  RecBole does not natively support external reranking models within its evaluation pipeline.
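
To make the relevance-scoring difference concrete, the sketch below scores the same ranked list under graded and binary relevance, using the exponential gain `2^rel - 1` described above. The function `ndcg_at_k`, the item ids, and the ratings are hypothetical and only meant to show how the two conventions can disagree.

```python
import math

def ndcg_at_k(ranked, relevance, k, binary=False):
    """NDCG@k with exponential gain; `relevance` maps item id -> rating."""
    def gain(item):
        rel = relevance.get(item, 0)
        if binary:
            rel = 1 if rel > 0 else 0  # collapse ratings to relevant / not relevant
        return 2 ** rel - 1

    dcg = sum(gain(item) / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]))
    ideal = sorted(relevance, key=gain, reverse=True)[:k]
    idcg = sum(gain(item) / math.log2(pos + 2)
               for pos, item in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ratings = {"a": 5, "b": 2}   # hypothetical held-out ratings for one user
ranked = ["b", "x", "a"]     # hypothetical top-3 recommendation list

graded = ndcg_at_k(ranked, ratings, k=3)
binary = ndcg_at_k(ranked, ratings, k=3, binary=True)
print(f"graded={graded:.4f}  binary={binary:.4f}")
```

Because the graded variant heavily rewards placing the 5-star item first, the same list scores lower under graded relevance than under binary relevance, so NDCG values from the two conventions are not directly comparable.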

---

### Conclusion

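This NeuMF implementation reproduces the core design of the original paper while keeping training, evaluation, and reranking fully configurable. Relative to RecBole, NDCG@10 and Recall@10 are lower, Precision@10 is higher, and the diversity metrics differ substantially. Because the two pipelines use different splits, relevance definitions, and metric formulas, these gaps reflect protocol differences as much as model quality. The main strength of this notebook is transparency: every step, from negative sampling to metric computation, is explicit and easy to modify for research use.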


In [1]:
# NeuMF Recommender with Diversity Reranking

import numpy as np
import pandas as pd
from collections import defaultdict, Counter
import math
import random
import time
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

#################################
# NEUMF RECOMMENDER IMPLEMENTATION
#################################

class NCFDataset(Dataset):
    """Dataset for NCF"""
    def __init__(self, user_item_matrix, neg_samples=4):
        """
        Initialize the dataset
        
        Parameters:
        - user_item_matrix: scipy sparse matrix with user-item interactions
        - neg_samples: number of negative samples per positive interaction
        """
        self.user_item_matrix = user_item_matrix
        self.users, self.items = user_item_matrix.nonzero()
        self.n_users = user_item_matrix.shape[0]
        self.n_items = user_item_matrix.shape[1]
        self.neg_samples = neg_samples
        
        # Create a set of (user, item) pairs for quick lookup
        self.user_item_set = set(zip(self.users, self.items))
        
        # Create a dictionary of items each user has interacted with
        self.user_items = defaultdict(set)
        for u, i in zip(self.users, self.items):
            self.user_items[u].add(i)
    
    def __len__(self):
        return len(self.users) * (1 + self.neg_samples)
    
    def __getitem__(self, idx):
        # Determine if this is a positive or negative sample
        if idx < len(self.users):
            # Positive sample
            user = self.users[idx]
            item = self.items[idx]
            label = 1.0
        else:
            # Negative sample: reuse the user of a positive pair and draw an item they have not interacted with
            pos_idx = idx % len(self.users)
            user = self.users[pos_idx]
            
            # Sample a negative item for this user
            item = random.randint(0, self.n_items - 1)
            while item in self.user_items[user]:
                item = random.randint(0, self.n_items - 1)
            
            label = 0.0
            
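        # Cast to plain ints so positives (NumPy scalars) and negatives (Python ints) collate uniformly
        return int(user), int(item), label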

class GMF(nn.Module):
    """Generalized Matrix Factorization model"""
    def __init__(self, n_users, n_items, latent_dim):
        super(GMF, self).__init__()
        self.user_embedding = nn.Embedding(n_users, latent_dim)
        self.item_embedding = nn.Embedding(n_items, latent_dim)
        self.output_layer = nn.Linear(latent_dim, 1)
        
        # Initialize weights
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)
        
    def forward(self, user_indices, item_indices):
        user_embeddings = self.user_embedding(user_indices)
        item_embeddings = self.item_embedding(item_indices)
        element_product = torch.mul(user_embeddings, item_embeddings)
        output = self.output_layer(element_product)
        return output.view(-1)

class MLP(nn.Module):
    """Multi-Layer Perceptron model"""
    def __init__(self, n_users, n_items, latent_dim, layers=[64, 32, 16, 8]):
        super(MLP, self).__init__()
        self.user_embedding = nn.Embedding(n_users, latent_dim)
        self.item_embedding = nn.Embedding(n_items, latent_dim)
        
        # MLP layers
        self.layers = nn.ModuleList()
        layer_dims = [2 * latent_dim] + layers
        
        for i in range(len(layer_dims) - 1):
            self.layers.append(nn.Linear(layer_dims[i], layer_dims[i+1]))
            self.layers.append(nn.ReLU())
        
        # Output layer
        self.output_layer = nn.Linear(layer_dims[-1], 1)
        
        # Initialize weights
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)
        
    def forward(self, user_indices, item_indices):
        user_embeddings = self.user_embedding(user_indices)
        item_embeddings = self.item_embedding(item_indices)
        vector = torch.cat([user_embeddings, item_embeddings], dim=-1)
        
        # Apply each layer
        for layer in self.layers:
            vector = layer(vector)
            
        output = self.output_layer(vector)
        return output.view(-1)

class NeuMF(nn.Module):
    """Neural Matrix Factorization model"""
    def __init__(self, n_users, n_items, latent_dim=32, mlp_layers=[64, 32, 16, 8]):
        super(NeuMF, self).__init__()
        self.gmf = GMF(n_users, n_items, latent_dim)
        self.mlp = MLP(n_users, n_items, latent_dim, mlp_layers)
        
        # NeuMF output layer
        self.output_layer = nn.Linear(mlp_layers[-1] + latent_dim, 1)
        
        # Initialize weights
        nn.init.normal_(self.output_layer.weight, std=0.01)
        
    def forward(self, user_indices, item_indices):
        # GMF path
        gmf_user = self.gmf.user_embedding(user_indices)
        gmf_item = self.gmf.item_embedding(item_indices)
        gmf_vector = torch.mul(gmf_user, gmf_item)
        
        # MLP path
        mlp_user = self.mlp.user_embedding(user_indices)
        mlp_item = self.mlp.item_embedding(item_indices)
        mlp_vector = torch.cat([mlp_user, mlp_item], dim=-1)
        
        for layer in self.mlp.layers:
            mlp_vector = layer(mlp_vector)
        
        # Concatenate GMF and MLP vectors
        vector = torch.cat([gmf_vector, mlp_vector], dim=-1)
        
        # Final output
        output = self.output_layer(vector)
        
        return torch.sigmoid(output.view(-1))

class NeuMFRecommender:
    def __init__(self, latent_dim=32, mlp_layers=[64, 32, 16, 8], epochs=20, batch_size=256, 
                 lr=0.001, neg_samples=4, device=None, random_state=42):
        """
        Neural Matrix Factorization recommender algorithm
        
        Parameters:
        - latent_dim: dimensionality of latent factors
        - mlp_layers: list of layer sizes for MLP component
        - epochs: number of training epochs
        - batch_size: batch size for training
        - lr: learning rate
        - neg_samples: number of negative samples per positive interaction
        - device: torch device (cpu or cuda)
        - random_state: seed for reproducibility
        """
        self.latent_dim = latent_dim
        self.mlp_layers = mlp_layers
        self.epochs = epochs
        self.batch_size = batch_size
        self.lr = lr
        self.neg_samples = neg_samples
        self.random_state = random_state
        
        # Set random seeds
        random.seed(random_state)
        np.random.seed(random_state)
        torch.manual_seed(random_state)
        
        # Set device
        if device is None:
            self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        else:
            self.device = device
        
    def fit(self, user_item_matrix):
        """
        Train the NeuMF model on the user-item matrix
        
        Parameters:
        - user_item_matrix: scipy sparse matrix with user-item interactions
        
        Returns:
        - self
        """
        self.user_item_matrix = user_item_matrix
        self.n_users, self.n_items = user_item_matrix.shape
        
        # Create a dictionary of items each user has interacted with
        self.user_items = defaultdict(set)
        for user, item in zip(*self.user_item_matrix.nonzero()):
            self.user_items[user].add(item)
        
        # Create dataset and dataloader
        dataset = NCFDataset(user_item_matrix, neg_samples=self.neg_samples)
        dataloader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
        
        # Initialize model
        self.model = NeuMF(self.n_users, self.n_items, self.latent_dim, self.mlp_layers).to(self.device)
        
        # Loss function and optimizer
        criterion = nn.BCELoss()
        optimizer = optim.Adam(self.model.parameters(), lr=self.lr)
        
        # Training loop
        print(f"Training NeuMF model for {self.epochs} epochs...")
        self.model.train()
        
        for epoch in range(self.epochs):
            start_time = time.time()
            running_loss = 0.0
            
            for users, items, labels in dataloader:
                users = users.to(self.device)
                items = items.to(self.device)
                labels = labels.float().to(self.device)
                
                # Forward pass
                outputs = self.model(users, items)
                loss = criterion(outputs, labels)
                
                # Backward pass and optimize
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                
                running_loss += loss.item()
            
            elapsed_time = time.time() - start_time
            if (epoch + 1) % 5 == 0 or epoch == 0:
                print(f"Epoch {epoch+1}/{self.epochs}, Loss: {running_loss/len(dataloader):.4f}, Time: {elapsed_time:.2f}s")
        
        # Switch to evaluation mode
        self.model.eval()
        
        # Cache the learned user and item embedding tables (GMF and MLP)
        with torch.no_grad():
            self.user_gmf_embeddings = self.model.gmf.user_embedding.weight.data
            self.item_gmf_embeddings = self.model.gmf.item_embedding.weight.data
            self.user_mlp_embeddings = self.model.mlp.user_embedding.weight.data
            self.item_mlp_embeddings = self.model.mlp.item_embedding.weight.data
        
        # Set up item factors (for compatibility with rerankers)
        # We'll use a combination of GMF and MLP embeddings as item factors
        self.item_factors = np.concatenate([
            self.item_gmf_embeddings.cpu().numpy(),
            self.item_mlp_embeddings.cpu().numpy()
        ], axis=1)
        
        return self
    
    def recommend(self, user_id, n=10, exclude_seen=True):
        """
        Generate item recommendations for a user
        
        Parameters:
        - user_id: user index
        - n: number of recommendations to generate
        - exclude_seen: whether to exclude items the user has already interacted with
        
        Returns:
        - list of n recommended item indices
        """
        # If the user has no interactions in training set, return random recommendations
        if user_id not in self.user_items:
            all_items = list(range(self.n_items))
            recommendations = random.sample(all_items, min(n, len(all_items)))
            return np.array(recommendations)
        
        # Predict scores for all items for this user
        with torch.no_grad():
            user_tensor = torch.full((self.n_items,), int(user_id), dtype=torch.long, device=self.device)
            item_tensor = torch.arange(self.n_items, dtype=torch.long, device=self.device)
            
            scores = self.model(user_tensor, item_tensor).cpu().numpy()
        
        # If requested, exclude items the user has already interacted with
        if exclude_seen:
            for item_id in self.user_items[user_id]:
                scores[item_id] = -np.inf
        
        # Get top n items by score
        top_items = np.argsort(scores)[::-1][:n]
        
        return top_items

#################################
# RERANKER IMPLEMENTATION
#################################

class SimpleReranker:
    """
    Simple reranker that balances original scores with diversity
    """
    def __init__(self, model, alpha=0.7):
        """
        Initialize reranker
        
        Parameters:
        - model: trained recommender model
        - alpha: weight for original scores (between 0 and 1)
                 higher alpha means more focus on accuracy
        """
        self.model = model
        self.alpha = alpha
        
        # Calculate item popularity
        self.item_popularity = np.zeros(model.n_items)
        for user in range(model.n_users):
            if user in model.user_items:
                for item in model.user_items[user]:
                    self.item_popularity[item] += 1
        
        # Normalize popularity
        max_pop = np.max(self.item_popularity)
        if max_pop > 0:
            self.norm_popularity = self.item_popularity / max_pop
        else:
            self.norm_popularity = np.zeros_like(self.item_popularity)
    
    def rerank(self, user_id, n=10):
        """
        Generate reranked recommendations
        """
        # Get original recommendations as a larger candidate pool
        candidates = self.model.recommend(user_id, n=n*3, exclude_seen=True)
        
        # Get scores for all items for this user
        with torch.no_grad():
            user_tensor = torch.LongTensor([user_id] * self.model.n_items).to(self.model.device)
            item_tensor = torch.LongTensor(list(range(self.model.n_items))).to(self.model.device)
            scores = self.model.model(user_tensor, item_tensor).cpu().numpy()
        
        # Initialize selected items
        selected = []
        
        # Iteratively select items
        while len(selected) < n and candidates.size > 0:
            best_score = -np.inf
            best_item = None
            
            for item in candidates:
                if item in selected:
                    continue
                
                # Original score component
                score_orig = scores[item]
                
                # Diversity component
                diversity_score = 0
                if selected:
                    # Use item factors to calculate similarity
                    item_factors = self.model.item_factors[item]
                    selected_factors = self.model.item_factors[selected]
                    
                    # Calculate average similarity
                    similarities = []
                    for sel_factors in selected_factors:
                        # Cosine similarity
                        dot_product = np.dot(item_factors, sel_factors)
                        norm_product = np.linalg.norm(item_factors) * np.linalg.norm(sel_factors)
                        if norm_product > 0:
                            sim = dot_product / norm_product
                        else:
                            sim = 0
                        similarities.append(sim)
                    
                    if similarities:
                        avg_sim = np.mean(similarities)
                        diversity_score = 1 - avg_sim
                
                # Novelty component (inverse popularity)
                novelty_score = 1 - self.norm_popularity[item]
                
                # Calculate weighted score
                combined_score = (
                    self.alpha * score_orig + 
                    (1 - self.alpha) * 0.5 * diversity_score + 
                    (1 - self.alpha) * 0.5 * novelty_score
                )
                
                if combined_score > best_score:
                    best_score = combined_score
                    best_item = item
            
            if best_item is None:
                break
                
            selected.append(best_item)
            candidates = candidates[candidates != best_item]
            
        return np.array(selected)

class MMRReranker:
    """
    Maximum Marginal Relevance (MMR) Reranker
    
    This reranker balances between relevance and diversity explicitly by
    selecting items that maximize marginal relevance - items that are
    both relevant to the user and different from already selected items.
    
    MMR formula: MMR = λ * rel(i) - (1-λ) * max(sim(i,j)) for j in selected items
    
    Where:
    - rel(i) is the relevance of item i to the user
    - sim(i,j) is the similarity between items i and j
    - λ is a parameter that controls the trade-off between relevance and diversity
    """
    
    def __init__(self, model, lambda_param=0.7):
        """
        Initialize the MMR reranker
        
        Parameters:
        - model: trained recommender model
        - lambda_param: trade-off parameter between relevance and diversity (0-1)
                        higher values favor relevance, lower values favor diversity
        """
        self.model = model
        self.lambda_param = lambda_param
        
    def calculate_item_similarity(self, item1, item2):
        """
        Calculate similarity between two items
        
        Parameters:
        - item1: index of first item
        - item2: index of second item
        
        Returns:
        - similarity: cosine similarity between the item embeddings (-1 to 1)
        """
        # Calculate cosine similarity between item embeddings
        item1_factors = self.model.item_factors[item1]
        item2_factors = self.model.item_factors[item2]
        
        # Cosine similarity
        dot_product = np.dot(item1_factors, item2_factors)
        norm_product = np.linalg.norm(item1_factors) * np.linalg.norm(item2_factors)
        
        if norm_product == 0:
            return 0
        
        return dot_product / norm_product
    
    def rerank(self, user_id, n=10, candidate_size=100):
        """
        Generate reranked recommendations using Maximum Marginal Relevance
        
        Parameters:
        - user_id: user index in the model
        - n: number of recommendations to return
        - candidate_size: number of initial candidates to consider
        
        Returns:
        - reranked_items: list of reranked item indices
        """
        # Get candidate items and their scores
        candidates = self.model.recommend(user_id, n=candidate_size, exclude_seen=True)
        
        # Score all items once; candidate scores are indexed out below
        with torch.no_grad():
            user_tensor = torch.LongTensor([user_id] * self.model.n_items).to(self.model.device)
            item_tensor = torch.LongTensor(list(range(self.model.n_items))).to(self.model.device)
            relevance_scores = self.model.model(user_tensor, item_tensor).cpu().numpy()
        
        # Normalize relevance scores to [0,1] range for the candidates
        candidate_scores = relevance_scores[candidates]
        min_score = np.min(candidate_scores)
        max_score = np.max(candidate_scores)
        score_range = max_score - min_score
        
        if score_range > 0:
            normalized_scores = (candidate_scores - min_score) / score_range
        else:
            normalized_scores = np.zeros_like(candidate_scores)
        
        # Initialize selected items
        selected = []
        
        # Select first item (most relevant)
        if candidates.size > 0:
            selected.append(candidates[np.argmax(normalized_scores)])
            remaining_candidates = set(candidates) - set(selected)
        else:
            remaining_candidates = set()
        
        # Iteratively select items using MMR
        while len(selected) < n and remaining_candidates:
            max_mmr = -np.inf
            max_item = None
            
            for item in remaining_candidates:
                # Get relevance component
                item_idx = np.where(candidates == item)[0][0]
                relevance = normalized_scores[item_idx]
                
                # Redundancy penalty: maximum similarity to any already selected item (floored at 0)
                max_sim = 0
                for selected_item in selected:
                    sim = self.calculate_item_similarity(item, selected_item)
                    max_sim = max(max_sim, sim)
                
                # Calculate MMR score
                mmr_score = self.lambda_param * relevance - (1 - self.lambda_param) * max_sim
                
                if mmr_score > max_mmr:
                    max_mmr = mmr_score
                    max_item = item
            
            if max_item is not None:
                selected.append(max_item)
                remaining_candidates.remove(max_item)
            else:
                break
                
        return np.array(selected)

#################################
# EVALUATION METRICS
#################################

def calculate_ndcg(recommended_items, relevant_items, relevant_scores, k=None):
    """
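    Calculate Normalized Discounted Cumulative Gain with exponential gain (2^rel - 1)
    
    Parameters:
    - recommended_items: ranked list of recommended item ids
    - relevant_items: held-out item ids for the user
    - relevant_scores: ratings aligned with relevant_items, used as graded relevance
    - k: cutoff; defaults to the length of recommended_items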
    """
    if k is None:
        k = len(recommended_items)
    else:
        k = min(k, len(recommended_items))
    
    # Create a dictionary mapping relevant items to their scores
    relevance_map = {item_id: score for item_id, score in zip(relevant_items, relevant_scores)}
    
    # Calculate DCG
    dcg = 0
    for i, item_id in enumerate(recommended_items[:k]):
        if item_id in relevance_map:
            # Use rating as relevance score
            rel = relevance_map[item_id]
            # DCG formula: (2^rel - 1) / log2(i+2)
            dcg += (2 ** rel - 1) / np.log2(i + 2)
    
    # Calculate ideal DCG (IDCG)
    # Sort relevant items by their relevance scores in descending order
    sorted_relevant = sorted(zip(relevant_items, relevant_scores), 
                           key=lambda x: x[1], reverse=True)
    
    idcg = 0
    for i, (item_id, rel) in enumerate(sorted_relevant[:k]):
        # IDCG formula: (2^rel - 1) / log2(i+2)
        idcg += (2 ** rel - 1) / np.log2(i + 2)
    
    # Avoid division by zero
    if idcg == 0:
        return 0
    
    # Calculate NDCG
    ndcg = dcg / idcg
    
    return ndcg

def calculate_precision(recommended_items, relevant_items):
    """
    Calculate Precision@k
    """
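    # Count number of relevant items in recommended items (set gives O(1) membership tests)
    relevant_set = set(relevant_items)
    num_relevant_recommended = sum(1 for item in recommended_items if item in relevant_set)
    
    # Calculate precision; len() also works when recommended_items is a NumPy array
    precision = num_relevant_recommended / len(recommended_items) if len(recommended_items) > 0 else 0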
    
    return precision

def calculate_recall(recommended_items, relevant_items):
    """
    Calculate Recall@k
    """
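    # Count number of relevant items in recommended items (set gives O(1) membership tests)
    relevant_set = set(relevant_items)
    num_relevant_recommended = sum(1 for item in recommended_items if item in relevant_set)
    
    # Calculate recall; len() also works when relevant_items is a NumPy array
    recall = num_relevant_recommended / len(relevant_items) if len(relevant_items) > 0 else 0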
    
    return recall

def calculate_diversity_metrics(recommendations, item_popularity, total_items, tail_items=None):
    """
    Calculate diversity metrics for a set of recommendations
    """
    # Count occurrences of each item in recommendations
    rec_counts = Counter(recommendations)
    
    # 1. Item Coverage
    recommended_items = len(rec_counts)
    item_coverage = recommended_items / total_items
    
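    # 2. Gini Index (over recommended items only; never-recommended items are not counted as zeros,
    #    which yields a lower value than a catalog-wide Gini)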
    sorted_counts = sorted(rec_counts.values())
    n = len(sorted_counts)
    
    if n == 0:
        gini_index = 0
    else:
        cumulative_sum = 0
        for i, count in enumerate(sorted_counts):
            cumulative_sum += (i + 1) * count
        
        # Gini index formula
        gini_index = (2 * cumulative_sum) / (n * sum(sorted_counts)) - (n + 1) / n
    
    # 3. Shannon Entropy
    recommendations_count = sum(rec_counts.values())
    probabilities = [count / recommendations_count for count in rec_counts.values()]
    entropy = -sum(p * np.log2(p) for p in probabilities if p > 0)
    
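    # Normalize entropy by its maximum attainable value (guarding against log2(0))
    max_outcomes = min(total_items, recommendations_count)
    max_entropy = np.log2(max_outcomes) if max_outcomes > 1 else 0
    normalized_entropy = entropy / max_entropy if max_entropy > 0 else 0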
    
    # 4. Tail Percentage
    if tail_items is None:
        # If tail_items not provided, use the bottom 20% by popularity
        sorted_pop_indices = np.argsort(item_popularity)
        num_tail_items = int(len(sorted_pop_indices) * 0.2)  # 20% least popular items
        tail_items = set(sorted_pop_indices[:num_tail_items])
    
    tail_recommendations = sum(1 for item in recommendations if item in tail_items)
    # len() keeps this check valid when recommendations is a NumPy array
    tail_percentage = tail_recommendations / len(recommendations) if len(recommendations) > 0 else 0
    
    # Create results dictionary
    metrics = {
        'item_coverage': item_coverage,
        'gini_index': gini_index,
        'shannon_entropy': normalized_entropy,
        'tail_percentage': tail_percentage
    }
    
    return metrics, tail_items

#################################
# HELPER FUNCTIONS
#################################

def load_movielens_100k(path="ml-100k"):
    """
    Load the MovieLens 100K dataset
    """
    # Load ratings
    ratings_df = pd.read_csv(f"{path}/u.data", sep='\t', 
                           names=['user_id', 'item_id', 'rating', 'timestamp'])
    
    # Load movie information
    movie_df = pd.read_csv(f"{path}/u.item", sep='|', encoding='latin-1',
                          names=['item_id', 'title', 'release_date', 'video_release_date',
                                 'IMDb_URL'] + [f'genre_{i}' for i in range(19)])
    
    return ratings_df, movie_df

def create_user_item_matrix(ratings_df):
    """
    Create a sparse user-item interaction matrix from ratings
    """
    # Create mappings from original IDs to matrix indices
    user_ids = ratings_df['user_id'].unique()
    item_ids = ratings_df['item_id'].unique()
    
    user_mapping = {user_id: i for i, user_id in enumerate(user_ids)}
    item_mapping = {item_id: i for i, item_id in enumerate(item_ids)}
    
    # Map original IDs to matrix indices
    rows = ratings_df['user_id'].map(user_mapping)
    cols = ratings_df['item_id'].map(item_mapping)
    
    # Create binary matrix (1 if interaction exists, 0 otherwise)
    data = np.ones(len(ratings_df))
    user_item_matrix = csr_matrix((data, (rows, cols)), 
                                 shape=(len(user_mapping), len(item_mapping)))
    
    return user_item_matrix, user_mapping, item_mapping

#################################
# COMPREHENSIVE EVALUATION
#################################

def comprehensive_evaluation_multiple_rerankers(k=10, sample_size=None):
    """
    Run a comprehensive evaluation measuring both accuracy and diversity for multiple rerankers
    """
    print("="*80)
    print(f"COMPREHENSIVE EVALUATION WITH MULTIPLE RERANKERS (k={k})")
    print("="*80)
    
    # Load and prepare data
    print("\nLoading MovieLens 100K dataset...")
    ratings_df, movie_df = load_movielens_100k()
    
    print("Splitting data for evaluation...")
    train_df, test_df = train_test_split(
        ratings_df, 
        test_size=0.2, 
        stratify=ratings_df['user_id'], 
        random_state=42
    )
    
    print("Creating user-item matrix...")
    user_item_matrix, user_mapping, item_mapping = create_user_item_matrix(train_df)
    
    # Prepare for evaluation (matrix index -> original id); loop names avoid shadowing the cutoff k
    reverse_user_mapping = {idx: uid for uid, idx in user_mapping.items()}
    reverse_item_mapping = {idx: iid for iid, idx in item_mapping.items()}
    
    # Create test set ground truth
    test_relevant_items = defaultdict(list)
    test_relevant_scores = defaultdict(list)
    
    for _, row in test_df.iterrows():
        user_id = row['user_id']
        item_id = row['item_id']
        rating = row['rating']
        
        # Only include users and items that exist in our mappings
        if user_id in user_mapping and item_id in item_mapping:
            test_relevant_items[user_id].append(item_id)
            test_relevant_scores[user_id].append(rating)
    
    # Train model - use fewer epochs for NeuMF since it's more computationally intensive
    print("\nTraining NeuMF model...")
    model = NeuMFRecommender(latent_dim=32, epochs=10, batch_size=256)
    model.fit(user_item_matrix)
    
    # Initialize rerankers
    print("\nInitializing rerankers...")
    simple_reranker = SimpleReranker(model=model, alpha=0.7)
    mmr_reranker = MMRReranker(model=model, lambda_param=0.7)
    
    # Setup dictionary for all rerankers' results
    rerankers = {
        "Original NeuMF": None,
        "Simple Reranker": simple_reranker,
        "MMR Reranker": mmr_reranker
    }
    
    # Results dictionary
    all_results = {}
    
    # Select users for evaluation (fixed seed so a sampled run is reproducible)
    if sample_size is not None and sample_size < len(test_relevant_items):
        eval_users = random.Random(42).sample(list(test_relevant_items.keys()), sample_size)
    else:
        eval_users = list(test_relevant_items.keys())
    
    print(f"\nEvaluating {len(eval_users)} users...")
    
    # Evaluate each reranker
    for reranker_name, reranker in rerankers.items():
        print(f"\nEvaluating {reranker_name}...")
        
        # Initialize metrics collectors
        ndcg_scores = []
        precision_scores = []
        recall_scores = []
        all_recs = []
        
        # Evaluate each user
        for user_id in eval_users:
            # Skip if user has no relevant items
            if not test_relevant_items[user_id]:
                continue
            
            user_idx = user_mapping[user_id]
            
            # Get recommendations
            if reranker is None:  # Original NeuMF
                rec_idx = model.recommend(user_idx, n=k)
            else:  # Use reranker
                rec_idx = reranker.rerank(user_idx, n=k)
                
            rec = [reverse_item_mapping[idx] for idx in rec_idx]
            all_recs.extend(rec_idx)
            
            # Calculate accuracy metrics
            ndcg_scores.append(calculate_ndcg(
                rec, test_relevant_items[user_id], test_relevant_scores[user_id]
            ))
            precision_scores.append(calculate_precision(
                rec, test_relevant_items[user_id]
            ))
            recall_scores.append(calculate_recall(
                rec, test_relevant_items[user_id]
            ))
        
        # Calculate average accuracy metrics
        accuracy_metrics = {
            f'ndcg@{k}': np.mean(ndcg_scores),
            f'precision@{k}': np.mean(precision_scores),
            f'recall@{k}': np.mean(recall_scores)
        }
        
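        # Calculate diversity metrics
        # Item popularity = number of training interactions per item; it does not
        # depend on the reranker, so it is identical on every pass of this loop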
        item_popularity = np.zeros(model.n_items)
        for user in range(model.n_users):
            if user in model.user_items:
                for item in model.user_items[user]:
                    item_popularity[item] += 1
        
        # Then calculate diversity metrics
        diversity_metrics, _ = calculate_diversity_metrics(
            recommendations=all_recs,
            item_popularity=item_popularity,
            total_items=model.n_items
        )
        
        # Store results
        all_results[reranker_name] = {
            'accuracy': accuracy_metrics,
            'diversity': diversity_metrics
        }
    
    # Print comparative results
    print("\n" + "="*30 + " ACCURACY METRICS COMPARISON " + "="*30)
    print(f"{'Metric':<15}", end='')
    for reranker_name in rerankers.keys():
        print(f"{reranker_name:<20}", end='')
    print()
    print("-" * 80)
    
    for metric in [f'ndcg@{k}', f'precision@{k}', f'recall@{k}']:
        print(f"{metric:<15}", end='')
        baseline = all_results["Original NeuMF"]['accuracy'][metric]
        for reranker_name in rerankers.keys():
            value = all_results[reranker_name]['accuracy'][metric]
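            # Percent change vs. the baseline; with a zero baseline it is undefined,
            # so report 0 when the value is unchanged and inf otherwise
            if baseline > 0:
                change = (value - baseline) / baseline * 100
            else:
                change = 0.0 if value == baseline else float('inf')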
            
            if reranker_name == "Original NeuMF":
                print(f"{value:.4f}{' '*15}", end='')
            else:
                print(f"{value:.4f} ({change:+.1f}%){' '*5}", end='')
        print()
    
    print("\n" + "="*30 + " DIVERSITY METRICS COMPARISON " + "="*30)
    print(f"{'Metric':<15}", end='')
    for reranker_name in rerankers.keys():
        print(f"{reranker_name:<20}", end='')
    print()
    print("-" * 80)
    
    for metric in ['item_coverage', 'gini_index', 'shannon_entropy', 'tail_percentage']:
        print(f"{metric:<15}", end='')
        baseline = all_results["Original NeuMF"]['diversity'][metric]
        for reranker_name in rerankers.keys():
            value = all_results[reranker_name]['diversity'][metric]
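            # Percent change vs. the baseline; a zero baseline (e.g. a tail percentage
            # of 0) has no defined percentage, so report 0 when unchanged and inf otherwise
            if baseline > 0:
                change = (value - baseline) / baseline * 100
            else:
                change = 0.0 if value == baseline else float('inf')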
            
            if reranker_name == "Original NeuMF":
                print(f"{value:.4f}{' '*15}", end='')
            else:
                print(f"{value:.4f} ({change:+.1f}%){' '*5}", end='')
        print()
    
    # Print interpretations
    print("\n" + "="*30 + " METRIC INTERPRETATIONS " + "="*30)
    print("Accuracy Metrics:")
    print("- NDCG: Higher is better, measures ranking quality")
    print("- Precision: Higher is better, measures relevant item ratio in recommendations")
    print("- Recall: Higher is better, measures coverage of all relevant items")
    
    print("\nDiversity Metrics:")
    print("- Item Coverage: Higher means more catalog items are recommended")
    print("- Gini Index: Lower means more equality in item recommendations")
    print("- Shannon Entropy: Higher means more diverse recommendations")
    print("- Tail Percentage: Higher means more niche items are recommended")
    
    # Return all results
    return all_results

# Execute with multiple rerankers when running the script directly
if __name__ == "__main__":
    comprehensive_evaluation_multiple_rerankers(k=10)
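
For reference, the three accuracy calls in the evaluation loop take one user's ranked list plus that user's held-out test items (and their ratings, for NDCG). The notebook's own `calculate_ndcg`, `calculate_precision`, and `calculate_recall` are defined earlier and are not reproduced here. The sketch below only illustrates the standard definitions: the names `precision_at_k`, `recall_at_k`, and `ndcg_at_k` are hypothetical, and the NDCG variant (rating as linear gain, log2 discount) is an assumption, so its numbers may differ from the notebook's.

```python
import math

def precision_at_k(recommended, relevant):
    """Fraction of the recommended items that appear in the held-out set."""
    if not recommended:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for item in recommended if item in relevant_set)
    return hits / len(recommended)

def recall_at_k(recommended, relevant):
    """Fraction of the held-out items that appear in the recommendations."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = sum(1 for item in recommended if item in relevant_set)
    return hits / len(relevant_set)

def ndcg_at_k(recommended, relevant, scores):
    """NDCG using the held-out rating as gain (assumed: linear gain, log2 discount)."""
    gain = dict(zip(relevant, scores))
    dcg = sum(gain.get(item, 0.0) / math.log2(rank + 2)
              for rank, item in enumerate(recommended))
    # Ideal ordering: the highest ratings first, truncated to the list length
    ideal = sorted(scores, reverse=True)[:len(recommended)]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: 4 recommendations, 3 held-out items with ratings
recs = [10, 20, 30, 40]
held_out = [20, 40, 50]
ratings = [5, 3, 4]
print(precision_at_k(recs, held_out))   # 2 hits out of 4 recommended
print(recall_at_k(recs, held_out))      # 2 hits out of 3 held-out items
print(ndcg_at_k(recs, held_out, ratings))
```

Note that the evaluation loop above treats every held-out test item as relevant regardless of its rating, which is why the sketch does the same.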

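The diversity metrics are computed once per reranker over the pooled recommendations of all evaluated users (`all_recs`). The notebook's `calculate_diversity_metrics` is defined earlier; the following is only a minimal sketch of how coverage, Gini index, and Shannon entropy can be derived from such a pooled list. The function name `sketch_diversity_metrics` is hypothetical, normalizing the entropy by `log(total_items)` is an assumption, and tail percentage is left out because the popularity threshold the notebook uses is not shown in this section.

```python
import numpy as np

def sketch_diversity_metrics(recommendations, total_items):
    """Aggregate diversity of a flat list of recommended item indices.

    Assumes every index lies in [0, total_items).
    """
    recs = np.asarray(recommendations, dtype=int)
    if recs.size == 0:
        raise ValueError("recommendations must not be empty")

    # How often each catalog item was recommended
    counts = np.bincount(recs, minlength=total_items)
    total = counts.sum()

    # Item coverage: share of the catalog recommended at least once
    item_coverage = np.count_nonzero(counts) / total_items

    # Gini index of the recommendation counts
    # (0 = perfectly even, (n - 1) / n = everything on a single item)
    sorted_counts = np.sort(counts)
    ranks = np.arange(1, total_items + 1)
    gini = 2 * np.sum(ranks * sorted_counts) / (total_items * total) \
        - (total_items + 1) / total_items

    # Shannon entropy of the recommendation distribution, scaled to [0, 1]
    probs = counts[counts > 0] / total
    entropy = -np.sum(probs * np.log(probs))
    normalized_entropy = entropy / np.log(total_items) if total_items > 1 else 0.0

    return {
        'item_coverage': float(item_coverage),
        'gini_index': float(gini),
        'shannon_entropy': float(normalized_entropy),
    }

# Toy example: 6 recommendations over a 5-item catalog
toy_recs = [0, 1, 1, 2, 2, 2]
print(sketch_diversity_metrics(toy_recs, total_items=5))
```
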
COMPREHENSIVE EVALUATION WITH MULTIPLE RERANKERS (k=10)

Loading MovieLens 100K dataset...
Splitting data for evaluation...
Creating user-item matrix...

Training NeuMF model...
Training NeuMF model for 10 epochs...
Epoch 1/10, Loss: 0.3813, Time: 3.56s
Epoch 5/10, Loss: 0.2489, Time: 3.44s
Epoch 10/10, Loss: 0.1978, Time: 3.09s

Initializing rerankers...

Evaluating 943 users...

Evaluating Original NeuMF...

Evaluating Simple Reranker...

Evaluating MMR Reranker...

ACCURACY METRICS COMPARISON
Metric         Original NeuMF      Simple Reranker     MMR Reranker        
--------------------------------------------------------------------------------
ndcg@10        0.2657               0.1717 (-35.4%)     0.2371 (-10.8%)     
precision@10   0.2928               0.2223 (-24.1%)     0.2678 (-8.5%)     
recall@10      0.1942               0.1394 (-28.2%)     0.1759 (-9.4%)     
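
The percentages in parentheses are relative changes against the Original NeuMF column, computed as (value - baseline) / baseline * 100. A quick check of the NDCG row, recomputed from the rounded values printed above (the notebook works from unrounded values, so in general the last digit could differ); `relative_change` is a hypothetical helper name:

```python
def relative_change(value, baseline):
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100

baseline_ndcg = 0.2657  # Original NeuMF, ndcg@10
for name, value in [("Simple Reranker", 0.1717), ("MMR Reranker", 0.2371)]:
    print(f"{name}: {relative_change(value, baseline_ndcg):+.1f}%")
# Simple Reranker: -35.4%
# MMR Reranker: -10.8%
```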

DIVERSITY METRICS COMPARISON
Metric         Original NeuMF      Simple Reranker     MMR Reranker        
----------------------------------------------------------------