# NeuMF+ Training - Genre + Synopsis (Full Model)

This notebook trains the **complete NeuMF+** model with:
- **Collaborative Filtering** (NeuMF: GMF + MLP)
- **Genre Features** (multi-hot encoding)
- **Synopsis Features** (Sentence-BERT embeddings)
- **Gated Fusion** (dynamic weighting of all signals)
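
Gated fusion is only named in the list above. One common formulation, shown here as a minimal NumPy sketch and not as the repository's actual `NeuMFPlus` code, computes a sigmoid gate from the concatenated CF and content vectors and uses it to blend the two element-wise. The function name `gated_fusion`, the weight shapes, and the random inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(cf_vec, content_vec, W, b):
    """Blend two (batch, d) vectors with an element-wise gate.

    gate  = sigmoid([cf; content] @ W + b)      # shape (batch, d), values in (0, 1)
    fused = gate * cf + (1 - gate) * content
    """
    gate = sigmoid(np.concatenate([cf_vec, content_vec], axis=1) @ W + b)
    fused = gate * cf_vec + (1.0 - gate) * content_vec
    return fused, gate

# Hypothetical sizes and random inputs, for illustration only
rng = np.random.default_rng(0)
batch, d = 4, 8
cf = rng.normal(size=(batch, d))
content = rng.normal(size=(batch, d))
W = rng.normal(scale=0.1, size=(2 * d, d))
b = np.zeros(d)

fused, gate = gated_fusion(cf, content, W, b)
print(fused.shape, float(gate.min()), float(gate.max()))
```

Because the gate depends on the inputs, the balance between CF and content can differ for every user-item pair, which is what "dynamic weighting" refers to.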

**Model Variants:**
- NeuMF (baseline): Only CF interactions → HR@10 ~0.973
- NeuMF+ (genre only): CF + Genre → HR@10 ~0.970
- **NeuMF+ (genre + synopsis): CF + Genre + Synopsis → Expected HR@10 ~0.975+**
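
HR@10 (hit ratio at 10) is the fraction of evaluation cases in which the held-out item appears among the top 10 ranked candidates. The sketch below assumes the common NCF protocol of ranking one held-out positive against a set of candidates per user; whether this repository follows exactly that protocol is not stated here, and `hit_ratio_at_k` is an illustrative name.

```python
def hit_ratio_at_k(ranked_lists, positives, k=10):
    """Fraction of users whose held-out item is in their top-k ranking.

    ranked_lists : one list of item IDs per user, best-scored first
    positives    : the held-out item ID for each user, in the same order
    """
    hits = sum(1 for ranked, pos in zip(ranked_lists, positives) if pos in ranked[:k])
    return hits / len(positives)

# Toy example with three users and k=2 (values are made up)
ranked = [[5, 3, 9], [1, 2, 3], [7, 8, 4]]
held_out = [3, 9, 4]
print(hit_ratio_at_k(ranked, held_out, k=2))  # only the first user scores a hit
```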

**Prerequisites:**
- Google Drive folder `MyDrive/NCF-Movie-Recommender/data` containing ratings.csv, movies_metadata.csv, links.csv
- Colab Pro (recommended for longer sessions)
- Estimated time: 2-3 hours for preprocessing, 4-6 hours for training

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✓ Google Drive mounted!")

## Step 2: Clone and Setup Repository

In [None]:
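# Clone the repository (skipped if it already exists, so the cell can be re-run)
import os
REPO_DIR = '/content/NCF-Movie-Recommender'
if not os.path.isdir(REPO_DIR):
    !git clone https://github.com/albertabayor/NCF-Movie-Recommender.git {REPO_DIR}

os.chdir(REPO_DIR)
!git pull origin main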

# Link data
!rm -rf data
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/data" data

# Link experiments
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/experiments/trained_models
!rm -rf experiments
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/experiments experiments

# Link datasets
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/datasets
!rm -rf datasets
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/datasets datasets

# Install dependencies
!pip install -q torch torchvision pandas numpy scikit-learn tqdm tensorboard sentence-transformers

print("\n✓ Setup complete!")

## Step 3: Extract Synopsis Embeddings (30-60 minutes)

This step:
1. Loads movie overviews from movies_metadata.csv
2. Converts text to 384-dimensional vectors using Sentence-BERT
3. Saves embeddings for use in NeuMF+ training
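
The steps above can be sketched as follows. This is a hedged illustration, not the contents of `extract_synopsis_embeddings.py`: the helper name `embed_overviews`, the zero-vector handling of missing overviews, and the model name in the usage comment are assumptions. `all-MiniLM-L6-v2` is mentioned only because it is one Sentence-BERT model that outputs 384-dimensional vectors; the notebook does not say which model the script uses.

```python
import numpy as np

def embed_overviews(overviews, encoder, batch_size=64):
    """Encode overview strings into an (n_movies, dim) float32 array.

    `encoder` is any object exposing `encode(list_of_str, batch_size=...)`,
    such as a sentence_transformers.SentenceTransformer instance.
    Missing or blank overviews get an all-zero vector.
    """
    # NaN/None overviews (common in movies_metadata.csv) become empty strings
    texts = [o if isinstance(o, str) else "" for o in overviews]
    has_text = np.array([bool(t.strip()) for t in texts], dtype=bool)
    # np.array copies, so the encoder's own output is never modified in place
    vectors = np.array(encoder.encode(texts, batch_size=batch_size), dtype=np.float32)
    vectors[~has_text] = 0.0
    return vectors

# Usage with the real library (downloads the model on first use):
#   from sentence_transformers import SentenceTransformer
#   encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model name
#   emb = embed_overviews(movies["overview"].tolist(), encoder)  # `movies` is hypothetical
#   np.save("data/synopsis_embeddings.npy", emb)
```

Passing the encoder in as an argument keeps the helper easy to test with a stub, without downloading a model.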

**Skip this if synopsis_embeddings.npy already exists in data/**

In [None]:
import sys
sys.path.insert(0, '.')

# Check if synopsis embeddings already exist
import os
if os.path.exists('data/synopsis_embeddings.npy'):
    print("Synopsis embeddings already exist! Skipping extraction.")
    print("If you want to re-extract, delete data/synopsis_embeddings.npy first.")
else:
    print("Extracting synopsis embeddings (this takes 30-60 minutes)...")
    exec(open('extract_synopsis_embeddings.py').read())

## Step 4: Map Synopsis Embeddings to MovieLens movieIds
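
The embeddings from Step 3 follow the rows of movies_metadata.csv, while the ratings use MovieLens movieIds, so the two have to be joined through links.csv. Below is a minimal sketch of that join. It assumes links.csv provides `movieId` and `tmdbId` columns and that the `id` column of movies_metadata.csv holds the TMDB ID as a clean integer. The real logic lives in `map_synopsis_to_movieid.py` and may differ; `map_embeddings_to_movie_ids` and its arguments are illustrative names.

```python
import numpy as np

def map_embeddings_to_movie_ids(embeddings, metadata_tmdb_ids, links):
    """Return {movieId: embedding} for movies present in both sources.

    embeddings        : (n_metadata_rows, dim) array; row i belongs to metadata row i
    metadata_tmdb_ids : TMDB ID of each metadata row, in the same order
    links             : iterable of (movieId, tmdbId) pairs from links.csv
    """
    # First occurrence wins if the metadata contains duplicate TMDB IDs
    row_of_tmdb = {}
    for row, tmdb_id in enumerate(metadata_tmdb_ids):
        row_of_tmdb.setdefault(int(tmdb_id), row)

    mapped = {}
    for movie_id, tmdb_id in links:
        if tmdb_id is None or tmdb_id != tmdb_id:   # skip missing / NaN TMDB IDs
            continue
        row = row_of_tmdb.get(int(tmdb_id))
        if row is not None:                         # movies without metadata are skipped
            mapped[int(movie_id)] = embeddings[row]
    return mapped

# Toy data: three metadata rows with 4-dim embeddings (values are made up)
emb = np.arange(12, dtype=np.float32).reshape(3, 4)
tmdb_ids = [862, 8844, 15602]
links = [(1, 862), (2, 8844), (99, 11111), (100, float("nan"))]
item_emb = map_embeddings_to_movie_ids(emb, tmdb_ids, links)
print(sorted(item_emb))
```

Movies that cannot be matched are simply absent from the result; a downstream step would need to decide whether to fill them with zeros.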

In [None]:
# Check if mapping already exists
if os.path.exists('data/item_synopsis_embeddings.npy'):
    print("Synopsis mapping already exists! Skipping.")
else:
    print("Mapping synopsis embeddings to MovieLens movieIds...")
    exec(open('map_synopsis_to_movieid.py').read())

## Step 5: Add Synopsis Features to Processed Data

In [None]:
# Check if data already has synopsis features
import pandas as pd

try:
    test_df = pd.read_pickle('data/test.pkl')
except FileNotFoundError as e:
    # Only a missing file means preprocessing has not been run yet
    print(f"Error: {e}")
    print("Please run preprocessing first (colab_preprocess_full.ipynb)")
else:
    if 'synopsis_features' in test_df.columns:
        print("Data already has synopsis features! Skipping.")
    else:
        print("Adding synopsis features to processed data...")
        exec(open('add_synopsis_to_data.py').read())

## Step 6: Load and Sample Data (10% for faster training)

In [None]:
import sys
sys.path.insert(0, '.')

import pandas as pd
import numpy as np
import pickle
import torch

from src.negative_sampling import build_user_history
from src.models.neumf_plus import NeuMFPlus
from src.train import train_model

# Load data
train_df = pd.read_pickle('data/train.pkl')
val_df = pd.read_pickle('data/val.pkl')
test_df = pd.read_pickle('data/test.pkl')

with open('data/mappings.pkl', 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

print("="*70)
print("SAMPLING 10% OF TRAINING DATA")
print("="*70)
print(f"\nOriginal train size: {len(train_df):,}")

# Build item-to-genre mapping BEFORE sampling so every training item keeps its features
print("\nBuilding item-to-genre mapping...")
unique_items = train_df['movieId'].unique()
item_genre_features = np.zeros((num_items, num_genres), dtype=np.float32)

# Take the first row per movie in one pass instead of filtering train_df once per item
genre_rows = train_df.drop_duplicates('movieId')
item_genre_features[genre_rows['movieId'].values] = np.stack(genre_rows['genre_features'].values)

print(f"  Item-to-genre mapping shape: {item_genre_features.shape}")

# Build item-to-synopsis mapping BEFORE sampling
if 'synopsis_features' in train_df.columns:
    print("\nBuilding item-to-synopsis mapping...")
    item_synopsis_embeddings = np.zeros((num_items, 384), dtype=np.float32)

    # Take the first row per movie in one pass instead of filtering train_df once per item
    synopsis_rows = train_df.drop_duplicates('movieId')
    item_synopsis_embeddings[synopsis_rows['movieId'].values] = np.stack(
        synopsis_rows['synopsis_features'].values
    )

    print(f"  Item-to-synopsis mapping shape: {item_synopsis_embeddings.shape}")
    has_synopsis = True
else:
    print("\n⚠️  No synopsis features found in data!")
    print("   Run Steps 3-5 first to add synopsis features.")
    has_synopsis = False
    item_synopsis_embeddings = None

# Sample 10% of training data
SAMPLE_RATIO = 0.10
np.random.seed(42)
sample_idx = np.random.choice(
    len(train_df), 
    int(len(train_df) * SAMPLE_RATIO), 
    replace=False
)
train_df = train_df.iloc[sample_idx].copy()

print(f"\nSampled train size: {len(train_df):,} ({SAMPLE_RATIO*100:.0f}%)")

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

# Extract validation features
val_genre_features = np.stack(val_df['genre_features'].values)
print(f"\nVal genre shape: {val_genre_features.shape}")

if has_synopsis:
    val_synopsis_features = np.stack(val_df['synopsis_features'].values)
    print(f"Val synopsis shape: {val_synopsis_features.shape}")
else:
    val_synopsis_features = None

print(f"\nUsers: {num_users:,}")
print(f"Items: {num_items:,}")
print(f"Genres: {num_genres}")
print(f"\nTrain: {len(train_users):,} ratings")
print(f"Val:   {len(val_users):,} ratings")

## Step 7: Check GPU

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## Step 8: Create NeuMF+ Model (Genre + Synopsis)

In [None]:
model = NeuMFPlus(
    num_users=num_users,
    num_items=num_items,
    num_genres=num_genres,
    # Content encoder settings
    genre_embed_dim=64,
    synopsis_embed_dim=384,
    content_embed_dim=256,
    content_encoder_dropout=0.1,
    # Gated fusion settings
    gated_fusion_hidden_dim=64,
    gated_fusion_dropout=0.1,
    # Output settings
    output_hidden_dim=64,
    output_dropout=0.2,
    # Ablation study flags
    use_genre=True,         # Enable genre features
    use_synopsis=has_synopsis,  # Enable synopsis if available
    use_gated_fusion=True,  # Enable gated fusion
)

model = model.to(device)  # use the device detected in Step 7 instead of assuming CUDA
param_count = sum(p.numel() for p in model.parameters())

print("\n" + "="*70)
print("NeuMF+ MODEL CREATED")
print("="*70)
print(f"\nModel: NeuMF+ (Neural Collaborative Filtering + Content Features)")
print(f"Parameters: {param_count:,}")
print(f"\nArchitecture:")
print(f"  ✓ CF Branch (NeuMF: GMF + MLP)")
print(f"  ✓ Genre Encoder ({num_genres} → 64)")
if has_synopsis:
    print(f"  ✓ Synopsis Encoder (384 → 192)")
print(f"  ✓ Gated Fusion (CF + Content)")
print(f"\nFeatures:")
print(f"  - use_genre: True")
print(f"  - use_synopsis: {has_synopsis}")
print(f"  - use_gated_fusion: True")

## Step 9: Train NeuMF+ (Genre + Synopsis)

**Configuration:**
- Data: 10% sample (~1.8M ratings)
- Batch size: 512
- Learning rate: 1e-4 (reduced to prevent NaN losses during training)
- Workers: 0 (single worker)
- Mixed precision: FP16
- Max epochs: 15
- Early stopping patience: 3
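
Early stopping interacts with the learning-rate schedule passed to `train_model` (`lr_scheduler_patience=2`, `lr_scheduler_factor=0.5`). The pure-Python sketch below replays a list of per-epoch validation scores to show one plausible way the two counters behave. It is not the code in `src/train.py`: `simulate_schedule` is a hypothetical helper, the HR@10 values are made up, and it assumes the learning rate is cut once `lr_patience` consecutive epochs fail to improve.

```python
def simulate_schedule(val_scores, lr=1e-4, lr_patience=2, lr_factor=0.5, stop_patience=3):
    """Replay per-epoch validation scores (higher is better).

    Returns (epochs_run, best_score, final_lr).
    """
    best = float("-inf")
    epochs_since_best = 0      # drives early stopping
    epochs_since_lr_step = 0   # drives the plateau-based LR reduction
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            epochs_since_best = 0
            epochs_since_lr_step = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_step += 1
        if epochs_since_lr_step >= lr_patience:
            lr *= lr_factor            # plateau detected: shrink the learning rate
            epochs_since_lr_step = 0
        if epochs_since_best >= stop_patience:
            return epoch, best, lr     # no improvement for stop_patience epochs
    return len(val_scores), best, lr

# Made-up HR@10 curve that peaks at epoch 2
scores = [0.960, 0.968, 0.967, 0.966, 0.965, 0.970]
print(simulate_schedule(scores))
```

With these made-up scores the learning rate is halved once (after epoch 4) and training stops at epoch 5, so the sixth epoch is never run. With a scheduler patience of 2 and a stopping patience of 3, the model gets only one epoch at the reduced rate before stopping.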

In [None]:
from src.train import train_model

print("\n" + "="*70)
print("NeuMF+ TRAINING - GENRE + SYNOPSIS")
print("="*70)
print(f"\nDevice: {torch.cuda.get_device_name(0) if device == 'cuda' else 'cpu'}")
print(f"\nConfiguration:")
print(f"  - Data: 10% sample ({len(train_users):,} ratings)")
print(f"  - Batch size: 512")
print(f"  - Learning rate: 1e-4")
print(f"  - Workers: 0")
print(f"  - Mixed precision: FP16")
print(f"  - Max epochs: 15")
print(f"  - Early stopping patience: 3")
print(f"\nFeatures:")
print(f"  - Genre: Enabled")
print(f"  - Synopsis: {has_synopsis}")
print(f"\nEstimated time: 4-6 hours")
print("="*70)

# Prepare validation data
val_data = {
    'users': val_users,
    'items': val_items,
    'genre_features': val_genre_features,
}

if has_synopsis:
    val_data['synopsis_features'] = val_synopsis_features

# Train the model
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data=val_data,
    num_items=num_items,
    num_epochs=15,
    batch_size=512,
    learning_rate=1e-4,
    weight_decay=1e-5,
    num_negatives=4,
    device=device,               # device detected in Step 7
    num_workers=0,
    use_amp=(device == 'cuda'),  # mixed precision only when a GPU is available
    save_dir='./experiments/trained_models',
    early_stopping_patience=3,
    early_stopping_metric='hr@10',
    lr_scheduler_patience=2,
    lr_scheduler_factor=0.5,
    log_dir='./experiments/logs/tensorboard',
    item_genre_features=item_genre_features,
    item_synopsis_features=item_synopsis_embeddings,  # None when synopsis features are unavailable
)

print("\n" + "="*70)
print("TRAINING COMPLETE!")
print("="*70)
print(f"\nBest HR@10: {max((m.get('hr@10', 0) for m in history['val_metrics']), default=0.0):.4f}")
print(f"\nModel saved to: experiments/trained_models/")
print("Also saved to Google Drive for persistence.")

## Summary

**What happened:**
1. ✅ Extracted synopsis embeddings using Sentence-BERT (if not already present)
2. ✅ Mapped embeddings to MovieLens movieIds
3. ✅ Added synopsis features to processed data
4. ✅ Created NeuMF+ with genre + synopsis
5. ✅ Trained on T4/L4 GPU with mixed precision
6. ✅ Saved best model to Google Drive

**Model Comparison:**
| Model | Features | HR@10 |
|-------|----------|-------|
| NeuMF | CF only | ~0.973 |
| NeuMF+ | CF + Genre | ~0.970 |
| NeuMF+ | CF + Genre + Synopsis | **Expected: ~0.975+** |

**Next steps:**
- Compare HR@10 across all model variants
- Use best model for recommendations
- Analyze gate values to understand CF vs content contribution
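
For the last item, a simple starting point is to summarize the gate activations collected over the validation set. The sketch assumes that a gate value near 1 means the model leans on the CF branch and a value near 0 means it leans on content, and that the values have already been exported as a flat list. How to extract them from `NeuMFPlus` depends on the model code and is not shown; `summarize_gates` is an illustrative helper.

```python
from statistics import fmean, median

def summarize_gates(gates, threshold=0.5):
    """Summarize gate activations in [0, 1] (assumed: higher = more weight on CF)."""
    gates = list(gates)
    if not gates:
        raise ValueError("no gate values to summarize")
    return {
        "mean": fmean(gates),
        "median": median(gates),
        # share of interactions where the CF branch outweighs content
        "share_cf_dominant": sum(g > threshold for g in gates) / len(gates),
    }

# Made-up gate values for four validation interactions
print(summarize_gates([0.9, 0.8, 0.4, 0.7]))
```

Breaking the same summary down by item popularity or user activity would show whether content features matter more for sparsely rated movies.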