# NeuMF+ Training - Advanced Model with Content Features

This notebook trains **NeuMF+**, the advanced model that combines:
- **Collaborative Filtering** (NeuMF: GMF + MLP)
- **Genre Features** (content information)
- **Gated Fusion** (dynamic weighting of CF and content signals)

**Difference from NeuMF (baseline):**
- NeuMF: Only user-item interactions
- NeuMF+: User-item + Genre + Gated Fusion

**Configuration:**
- Data: 10% sample (~900K ratings)
- Batch size: 512
- Mixed precision: FP16
- Max epochs: 15 (stops early if no improvement)
- Early stopping patience: 3 (reduced from 5 so training stops sooner)
- Estimated time: 1-2 hours on T4 GPU
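
The early-stopping rule in the configuration above can be sketched in a few lines. This is a minimal illustration, not the logic inside `src.train.train_model`; the `EarlyStopper` class and the validation scores are hypothetical, and it assumes a higher-is-better metric such as HR@10.

```python
class EarlyStopper:
    """Hypothetical helper: stop when a higher-is-better metric stalls."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores: the best value first appears at epoch 3.
val_scores = [0.40, 0.45, 0.47, 0.46, 0.47, 0.44, 0.48]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, score in enumerate(val_scores, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
print(f"Stopped at epoch {stopped_at}, best score {stopper.best:.2f}")
```

With a patience of 3, three consecutive epochs without a strict improvement end the run, so the later score of 0.48 is never reached. That is the trade-off of a smaller patience: less compute, at the risk of stopping before a late improvement.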

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✓ Google Drive mounted!")

## Step 2: Clone and Setup

In [None]:
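import os

# Clone the repository once; re-running this cell reuses the existing checkout
REPO_DIR = '/content/NCF-Movie-Recommender'
if not os.path.isdir(REPO_DIR):
    !git clone https://github.com/albertabayor/NCF-Movie-Recommender.git {REPO_DIR}
os.chdir(REPO_DIR)
!git pull origin main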

# Link data
!rm -rf data
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/data" data

# Link experiments
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/experiments/trained_models
!rm -rf experiments
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/experiments experiments

# Install dependencies
!pip install -q torch torchvision pandas numpy scikit-learn tqdm tensorboard

print("\n✓ Setup complete!")

## Step 3: Load and Sample Data (10%)

In [None]:
import sys
sys.path.insert(0, '.')

import pandas as pd
import numpy as np
import pickle
import torch

from src.negative_sampling import build_user_history
from src.models.neumf_plus import NeuMFPlus
from src.train import train_model

# Load data
train_df = pd.read_pickle('data/train.pkl')
val_df = pd.read_pickle('data/val.pkl')
test_df = pd.read_pickle('data/test.pkl')

with open('data/mappings.pkl', 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

print("="*70)
print("SAMPLING 10% OF TRAINING DATA")
print("="*70)
print(f"\nOriginal train size: {len(train_df):,}")

# Sample 10% of training data
SAMPLE_RATIO = 0.10
np.random.seed(42)
sample_idx = np.random.choice(
    len(train_df), 
    int(len(train_df) * SAMPLE_RATIO), 
    replace=False
)
train_df = train_df.iloc[sample_idx].copy()

print(f"Sampled train size: {len(train_df):,} ({SAMPLE_RATIO*100:.0f}%)")

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

# Extract genre features
print("\nExtracting genre features...")
train_genre_features = np.stack(train_df['genre_features'].values)
val_genre_features = np.stack(val_df['genre_features'].values)
print(f"Train genre shape: {train_genre_features.shape}")
print(f"Val genre shape: {val_genre_features.shape}")

# Build user history
user_history = build_user_history(train_users, train_items)

print(f"\nUsers: {num_users:,}")
print(f"Items: {num_items:,}")
print(f"Genres: {num_genres}")
print(f"\nTrain: {len(train_users):,} ratings")
print(f"Val:   {len(val_users):,} ratings")

## Step 4: Check GPU

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## Step 5: Create NeuMF+ Model

**NeuMF+ Architecture:**
- **CF Branch**: NeuMF (GMF + MLP) from user-item interactions
- **Content Branch**: Genre encoder for movie features
- **Gated Fusion**: Dynamic weighting of CF and content signals
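
A gate of the kind described above is commonly computed as a sigmoid over the concatenated CF and content vectors, then used to blend the two branches. The sketch below shows that idea in plain Python for readability. It is an assumption about the general technique, not the code in `src.models.neumf_plus`; `gated_fusion`, its weights, and the example vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(cf_vec, content_vec, gate_weights, gate_bias):
    """Blend two equal-length vectors with a per-dimension gate.

    gate  = sigmoid(W @ [cf; content] + b)
    fused = gate * cf + (1 - gate) * content
    """
    concat = cf_vec + content_vec  # list concatenation: [cf; content]
    fused, gates = [], []
    for i, (cf, content) in enumerate(zip(cf_vec, content_vec)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = sigmoid(logit)
        gates.append(gate)
        fused.append(gate * cf + (1.0 - gate) * content)
    return fused, gates


cf_vec = [1.0, -2.0]
content_vec = [3.0, 4.0]

# Zero weights and bias give gate = 0.5: a plain average of the two branches.
zero_w = [[0.0] * 4 for _ in range(2)]
fused, gates = gated_fusion(cf_vec, content_vec, zero_w, [0.0, 0.0])
print(gates, fused)

# A large positive bias pushes the gate toward 1, so the CF branch dominates.
fused_cf, gates_cf = gated_fusion(cf_vec, content_vec, zero_w, [50.0, 50.0])
print(gates_cf, fused_cf)
```

Because the gate depends on the inputs, a trained model can lean on the content branch for items with few interactions and on the CF branch elsewhere. In a real model the weights would be learned tensors rather than fixed lists.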

In [None]:
model = NeuMFPlus(
    num_users=num_users,
    num_items=num_items,
    num_genres=num_genres,
    # Content encoder settings
    genre_embed_dim=64,
    content_embed_dim=256,
    content_encoder_dropout=0.1,
    # Gated fusion settings
    gated_fusion_hidden_dim=64,
    gated_fusion_dropout=0.1,
    # Output settings
    output_hidden_dim=64,
    output_dropout=0.2,
    # Ablation study flags
    use_genre=True,         # Enable genre features
    use_synopsis=False,     # Disable synopsis (need SBERT embeddings)
    use_gated_fusion=True,  # Enable gated fusion
)

model = model.to(device)  # Use the device selected in Step 4 instead of hard-coding 'cuda'
param_count = sum(p.numel() for p in model.parameters())

print("\n" + "="*70)
print("NeuMF+ MODEL CREATED")
print("="*70)
print(f"\nModel: NeuMF+ (Neural Collaborative Filtering + Content Features)")
print(f"Parameters: {param_count:,}")
print(f"\nArchitecture:")
print(f"  ✓ CF Branch (NeuMF: GMF + MLP)")
print(f"  ✓ Content Encoder (Genre Features)")
print(f"  ✓ Gated Fusion (CF + Content)")
print(f"\nAblation Flags:")
print(f"  - use_genre: True (enabled)")
print(f"  - use_synopsis: False (disabled - need SBERT embeddings)")
print(f"  - use_gated_fusion: True (enabled)")

## Step 6: Train NeuMF+

**Configuration:**
- Data: 10% sample (~900K ratings)
- Batch size: 512
- Workers: 2 (a reasonable choice for Colab's small CPU allocation)
- Mixed precision: FP16
- Estimated time: 1-2 hours on T4 GPU
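
The training call in the next cell also passes `num_negatives=4`, which in the usual NCF setup means each observed rating is paired with four items the user has not interacted with. A rough sketch of that sampling step follows. `build_history` and `sample_negatives` are hypothetical stand-ins, not the functions in `src.negative_sampling`, whose behavior may differ.

```python
import random
from collections import defaultdict


def build_history(users, items):
    """Map each user ID to the set of item IDs they interacted with."""
    history = defaultdict(set)
    for u, i in zip(users, items):
        history[u].add(i)
    return history


def sample_negatives(user, history, num_items, num_negatives, rng):
    """Draw distinct unseen items for a user by rejection sampling."""
    seen = history[user]
    # Guard against an impossible request, which would otherwise loop forever.
    if num_items - len(seen) < num_negatives:
        raise ValueError("not enough unseen items to sample from")
    negatives = set()
    while len(negatives) < num_negatives:
        candidate = rng.randrange(num_items)
        if candidate not in seen:
            negatives.add(candidate)
    return sorted(negatives)


users = [0, 0, 1]
items = [2, 5, 2]
history = build_history(users, items)
rng = random.Random(42)  # seeded for repeatable draws
negatives = sample_negatives(0, history, num_items=10, num_negatives=4, rng=rng)
print(negatives)
```

Rejection sampling is cheap when each user has seen only a small fraction of the catalog, which is typical for rating data. Note that sampling only 10% of the training ratings shrinks each user's history, so some sampled "negatives" may be items the user rated in the unsampled 90%.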

In [None]:
from src.train import train_model

print("\n" + "="*70)
print("NeuMF+ TRAINING - 10% DATA SAMPLE")
print("="*70)
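# Querying the GPU name raises an error on CPU-only runtimes, so guard the call
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none (running on CPU)"
print(f"\nGPU: {gpu_name}")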
print(f"\nConfiguration:")
print(f"  - Data: 10% sample ({len(train_users):,} ratings)")
print(f"  - Batch size: 512")
print(f"  - Workers: 2")
print(f"  - Mixed precision: FP16")
print(f"  - Max epochs: 15 (stops early if no improvement)")
print(f"  - Early stopping patience: 3 (reduced from 5)")
print(f"  - Genre features: Enabled")
print(f"  - Gated fusion: Enabled")
print(f"\nEstimated time: 1-2 hours")
print("="*70)

# Prepare validation data with genre features
val_data = {
    'users': val_users,
    'items': val_items,
    'genre_features': val_genre_features,
}

# Train the model with optimized settings
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data=val_data,
    num_items=num_items,
    num_epochs=15,  # Reduced from 30
    batch_size=512,
    learning_rate=1e-3,
    weight_decay=1e-5,
    num_negatives=4,
    device=device,  # Reuse the device selected in Step 4
    num_workers=2,
    use_amp=True,
    save_dir='./experiments/trained_models',
    early_stopping_patience=3,  # Reduced from 5 for faster stopping
    early_stopping_metric='hr@10',
    lr_scheduler_patience=2,  # Also reduced
    lr_scheduler_factor=0.5,
    log_dir='./experiments/logs/tensorboard',
    train_genre_features=train_genre_features,  # IMPORTANT: Pass genre features for training
)

print("\n" + "="*70)
print("TRAINING COMPLETE!")
print("="*70)
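# default=0.0 avoids a ValueError if no validation metrics were recorded
best_hr10 = max((m.get('hr@10', 0) for m in history['val_metrics']), default=0.0)
print(f"\nBest HR@10: {best_hr10:.4f}")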
print(f"\nModel saved to: experiments/trained_models/")
print(f"Also saved to Google Drive for persistence.")

## Summary

**What happened:**
1. ✅ Loaded 10% sample of training data (~900K ratings)
2. ✅ Created NeuMF+ with genre features and gated fusion
3. ✅ Trained on T4 GPU with mixed precision
4. ✅ Saved best model to Google Drive

**NeuMF+ vs NeuMF (baseline):**
- **NeuMF**: Only user-item CF interactions
- **NeuMF+**: CF + Genre + Gated Fusion (expected to improve accuracy; confirm against the baseline)

**Next steps:**
- Compare HR@10 with NeuMF baseline
- For better results, run full preprocessing with real genres
- For even better results, add synopsis embeddings (SBERT)
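
HR@10, the metric used for early stopping and for the comparison suggested above, is usually computed per test case as whether the held-out item appears in the top 10 of a ranked candidate list, then averaged over all cases. The sketch below assumes that common definition. `hit_ratio_at_k`, `rank_items`, and the scores are hypothetical, and the repository's evaluation code may build or rank candidates differently.

```python
def hit_ratio_at_k(ranked_lists, targets, k=10):
    """Fraction of cases whose target item is in the top-k of its ranked list."""
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


def rank_items(scores):
    """Return item IDs sorted by descending predicted score."""
    return sorted(scores, key=scores.get, reverse=True)


# Two toy users, each with a dict of candidate item -> predicted score.
scores_a = {10: 0.9, 11: 0.2, 12: 0.7}
scores_b = {20: 0.1, 21: 0.8, 22: 0.6}
ranked = [rank_items(scores_a), rank_items(scores_b)]
targets = [12, 20]  # held-out item for each user

print(hit_ratio_at_k(ranked, targets, k=2))
print(hit_ratio_at_k(ranked, targets, k=3))
```

When comparing NeuMF+ with the NeuMF baseline, both models should be scored on the same validation split with the same candidate lists and the same `k`; otherwise the HR@10 values are not comparable.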