# NeuMF+ Training - CPU Version (Genre Only)

This notebook trains **NeuMF+** on **CPU**, for use when GPU credits are not available.

**Model Features:**
- Collaborative Filtering (NeuMF: GMF + MLP)
- Genre Features (content information)
- Gated Fusion (dynamic weighting)
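
Gated fusion can be pictured as a learned gate that decides, per dimension, how much of the collaborative-filtering vector versus the genre vector passes through. The sketch below illustrates that idea in plain Python. It is not the repository's `NeuMFPlus` code: the function name `gated_fusion`, the layout of the gate weights, and the toy vectors are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(cf_vec, content_vec, gate_weights, gate_bias):
    """Blend two equal-length vectors with a per-dimension gate in (0, 1).

    gate_weights[i] holds the (hypothetical) weights applied to the
    concatenation [cf_vec, content_vec] to produce the gate for dimension i.
    """
    concat = list(cf_vec) + list(content_vec)
    fused = []
    for i, (c, g) in enumerate(zip(cf_vec, content_vec)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = sigmoid(logit)
        # gate near 1 favours the CF signal, gate near 0 favours the content signal
        fused.append(gate * c + (1.0 - gate) * g)
    return fused

cf = [1.0, 0.0]
content = [0.0, 1.0]
zero_w = [[0.0] * 4, [0.0] * 4]  # all-zero weights give a gate of exactly 0.5
fused = gated_fusion(cf, content, zero_w, [0.0, 0.0])
print(fused)
```

With all-zero weights and biases the gate sits at 0.5, so both sources contribute equally. A trained gate would shift that balance per input, which is what "dynamic weighting" refers to.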

**Configuration:**
- Data: 5% sample (~900K ratings)
- Device: CPU (no GPU required)
- Batch size: 256 (smaller for CPU)
- Max epochs: 3
- Early stopping patience: 1
- Estimated time: 1-2 hours on CPU
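
With a patience of 1, training stops at the first epoch that fails to improve the monitored metric, so the run may end before all 3 epochs are used. The sketch below shows that rule in isolation. It assumes a "higher is better" metric and a strict improvement test; the repository's `train_model` may apply a different criterion, and `epochs_run` is a hypothetical helper name.

```python
def epochs_run(metric_per_epoch, patience=1):
    """Return how many epochs run before early stopping triggers.

    Stops once `patience` consecutive epochs fail to beat the best
    metric seen so far (higher is better, as with HR@10).
    """
    best = float("-inf")
    bad_epochs = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(metric_per_epoch)

# Made-up validation scores: epoch 2 is worse than epoch 1, so training stops there.
print(epochs_run([0.60, 0.58, 0.65], patience=1))
```

In this toy run the third epoch would have been the best one, which is the trade-off of a short patience: less CPU time at the risk of stopping on a temporary dip.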

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✓ Google Drive mounted!")

## Step 2: Clone and Setup

In [None]:
# Clone the repository
!git clone https://github.com/albertabayor/NCF-Movie-Recommender.git

import os
os.chdir('NCF-Movie-Recommender')
!git pull origin main

# Link data
!rm -rf data
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/data" data

# Link experiments
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/experiments/trained_models
!rm -rf experiments
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/experiments experiments

# Install dependencies
!pip install -q torch torchvision pandas numpy scikit-learn tqdm tensorboard

print("\n✓ Setup complete!")
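
Because `data` and `experiments` are symlinks into Google Drive, a missing Drive folder only shows up later as a file-not-found error. An optional check like the one below can catch that early. It is a sketch: `missing_paths` is a hypothetical helper, and the file list is taken from the loading cell in Step 3.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not resolve to an existing file or directory."""
    # os.path.exists follows symlinks, so a broken link is reported as missing
    return [p for p in paths if not os.path.exists(p)]

# File names taken from the loading cell in Step 3.
required = [
    "data/train.pkl",
    "data/val.pkl",
    "data/test.pkl",
    "data/mappings.pkl",
    "experiments/trained_models",
]
missing = missing_paths(required)
print("Missing:", missing if missing else "none")
```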

## Step 3: Load and Sample Data (5%)

In [None]:
import sys
sys.path.insert(0, '.')

import pandas as pd
import numpy as np
import pickle
import torch

from src.negative_sampling import build_user_history
from src.models.neumf_plus import NeuMFPlus
from src.train import train_model

# Load data
train_df = pd.read_pickle('data/train.pkl')
val_df = pd.read_pickle('data/val.pkl')
test_df = pd.read_pickle('data/test.pkl')

with open('data/mappings.pkl', 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

print("="*70)
print("SAMPLING 5% OF TRAINING DATA (CPU OPTIMIZED)")
print("="*70)
print(f"\nOriginal train size: {len(train_df):,}")

# Build the item-to-genre mapping BEFORE sampling, so items that drop out of
# the 5% sample still have genre features.
print("\nBuilding item-to-genre mapping...")
item_genre_features = np.zeros((num_items, num_genres), dtype=np.float32)

# Take the first row per movie in one pass instead of filtering the full
# DataFrame once per item.
first_rows = train_df.drop_duplicates(subset='movieId')
item_genre_features[first_rows['movieId'].values] = np.stack(
    first_rows['genre_features'].values
)

print(f"  Item-to-genre mapping shape: {item_genre_features.shape}")

# Sample 5% of training data (faster for CPU)
SAMPLE_RATIO = 0.05
np.random.seed(42)
sample_idx = np.random.choice(
    len(train_df), 
    int(len(train_df) * SAMPLE_RATIO), 
    replace=False
)
train_df = train_df.iloc[sample_idx].copy()

print(f"\nSampled train size: {len(train_df):,} ({SAMPLE_RATIO*100:.0f}%)")

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

# Extract validation genre features
val_genre_features = np.stack(val_df['genre_features'].values)
print(f"\nVal genre shape: {val_genre_features.shape}")

print(f"\nUsers: {num_users:,}")
print(f"Items: {num_items:,}")
print(f"Genres: {num_genres}")
print(f"\nTrain: {len(train_users):,} ratings")
print(f"Val:   {len(val_users):,} ratings")
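
The genre mapping is built before sampling because a 5% row sample can drop every rating of a rarely-rated item. The sketch below illustrates the effect on synthetic long-tail data using only the standard library. The toy rating counts are made up and `item_coverage` is a hypothetical helper, so the printed figure says nothing about the real dataset.

```python
import random

def item_coverage(item_ids, sample_ratio, seed=42):
    """Fraction of distinct items that still appear after sampling rows without replacement."""
    rng = random.Random(seed)
    k = int(len(item_ids) * sample_ratio)
    sampled = rng.sample(item_ids, k)
    return len(set(sampled)) / len(set(item_ids))

# Toy long-tail data (made up): item i is rated roughly 1000 / (i + 1) times.
toy_items = [i for i in range(200) for _ in range(max(1, 1000 // (i + 1)))]
coverage = item_coverage(toy_items, sample_ratio=0.05)
print(f"{coverage:.0%} of toy items survive a 5% sample")
```

The same check can be run on the real data by passing `train_df['movieId'].tolist()` from before sampling.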

## Step 4: Check Device

In [None]:
device = 'cpu'  # Force CPU usage
print(f"Using device: {device}")
print(f"\n⚠️  Training on CPU - will take longer than GPU")
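
If the notebook is later reused on a GPU runtime, the hard-coded device can be swapped for a small helper. The version below is a sketch: `pick_device` is a hypothetical name, and it relies only on `torch.cuda.is_available()`. PyTorch is imported lazily so the helper also runs where it is not installed.

```python
def pick_device(force_cpu=True):
    """Return 'cpu' when forced or when CUDA is unavailable, else 'cuda'."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device(force_cpu=True)
print(f"Using device: {device}")
```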

## Step 5: Create NeuMF+ Model (Genre Only)

In [None]:
model = NeuMFPlus(
    num_users=num_users,
    num_items=num_items,
    num_genres=num_genres,
    # Content encoder settings
    genre_embed_dim=64,
    content_embed_dim=256,
    content_encoder_dropout=0.1,
    # Gated fusion settings
    gated_fusion_hidden_dim=64,
    gated_fusion_dropout=0.1,
    # Output settings
    output_hidden_dim=64,
    output_dropout=0.2,
    # Ablation study flags
    use_genre=True,         # Enable genre features
    use_synopsis=False,     # Disable synopsis (not available)
    use_gated_fusion=True,  # Enable gated fusion
)

model = model.to('cpu')  # Use CPU
param_count = sum(p.numel() for p in model.parameters())

print("\n" + "="*70)
print("NeuMF+ MODEL CREATED (GENRE ONLY)")
print("="*70)
print(f"\nModel: NeuMF+ (CF + Genre + Gated Fusion)")
print(f"Parameters: {param_count:,}")
print(f"\nFeatures:")
print(f"  - use_genre: True")
print(f"  - use_synopsis: False")
print(f"  - use_gated_fusion: True")
print(f"  - Device: CPU")
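
On CPU the parameter count also gives a rough lower bound on RAM use, since float32 weights take 4 bytes each. The sketch below does that conversion. `param_memory_mib` is a hypothetical helper and the example count is made up; substitute the `param_count` printed above.

```python
def param_memory_mib(param_count, bytes_per_param=4):
    """Approximate weight memory in MiB (float32 uses 4 bytes per parameter)."""
    return param_count * bytes_per_param / (1024 ** 2)

# Example with a made-up count; substitute the notebook's `param_count`.
example_count = 10_000_000
print(f"{param_memory_mib(example_count):.1f} MiB for weights alone")
```

Gradients and any per-parameter optimizer state come on top of this figure, so actual training memory will be higher.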

## Step 6: Train NeuMF+ (CPU, 3 Epochs, Patience 1)

**Configuration:**
- Data: 5% sample (~900K ratings)
- Device: CPU
- Batch size: 256 (smaller for CPU)
- Max epochs: 3
- Early stopping patience: 1
- Estimated time: 1-2 hours
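
The time estimate follows from how many batches each epoch contains. The sketch below does that arithmetic under one assumption: that `train_model` pairs every positive rating with `num_negatives` sampled negatives in the same epoch (the training call below sets `num_negatives=4`). The notebook does not show whether the trainer works this way, so treat the result as a ballpark.

```python
import math

def steps_per_epoch(num_positives, num_negatives, batch_size):
    """Batches per epoch if each positive is paired with `num_negatives` sampled negatives."""
    total_examples = num_positives * (1 + num_negatives)
    return math.ceil(total_examples / batch_size)

# ~900K positives, 4 negatives each, batch size 256 (values from the configuration)
steps = steps_per_epoch(num_positives=900_000, num_negatives=4, batch_size=256)
print(f"~{steps:,} batches per epoch")
```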

In [None]:
from src.train import train_model

print("\n" + "="*70)
print("NeuMF+ TRAINING - CPU VERSION (GENRE ONLY)")
print("="*70)
print(f"\nConfiguration:")
print(f"  - Data: 5% sample ({len(train_users):,} ratings)")
print(f"  - Device: CPU")
print(f"  - Batch size: 256")
print(f"  - Learning rate: 1e-4")
print(f"  - Max epochs: 3")
print(f"  - Early stopping patience: 1")
print(f"  - Mixed precision: Disabled (requires CUDA)")
print(f"\nFeatures:")
print(f"  - Genre: Enabled")
print(f"  - Synopsis: Disabled")
print(f"\n⚠️  Estimated time: 1-2 hours on CPU")
print("="*70)

# Prepare validation data
val_data = {
    'users': val_users,
    'items': val_items,
    'genre_features': val_genre_features,
}

# Train the model
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data=val_data,
    num_items=num_items,
    num_epochs=3,                      # Only 3 epochs for faster training
    batch_size=256,                    # Smaller batch for CPU
    learning_rate=1e-4,
    weight_decay=1e-5,
    num_negatives=4,
    device='cpu',                      # Use CPU
    num_workers=2,                     # Use 2 workers for data loading
    use_amp=False,                     # Disable mixed precision (CUDA only)
    save_dir='./experiments/trained_models',
    early_stopping_patience=1,         # Stop after 1 epoch without improvement
    early_stopping_metric='hr@10',
    lr_scheduler_patience=1,
    lr_scheduler_factor=0.5,
    log_dir='./experiments/logs/tensorboard',
    item_genre_features=item_genre_features,
    item_synopsis_embeddings=None,     # No synopsis for this model
)

print("\n" + "="*70)
print("TRAINING COMPLETE!")
print("="*70)
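# default=0.0 avoids a ValueError if no validation metrics were recorded
best_hr10 = max((m.get('hr@10', 0.0) for m in history['val_metrics']), default=0.0)
print(f"\nBest HR@10: {best_hr10:.4f}")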
print(f"\nModel saved to: experiments/trained_models/NeuMFPlus_genre_best.pt")
print("The experiments directory is symlinked to Google Drive, so the checkpoint persists across sessions.")
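
HR@10, the metric used for early stopping and reported above, is the fraction of users whose held-out item appears among their top 10 ranked items. The sketch below computes that quantity on toy data. `hit_ratio_at_k` is a hypothetical helper; the repository's evaluation protocol (for example, how many candidate items each held-out item is ranked against) is not shown in this notebook and may differ.

```python
def hit_ratio_at_k(ranked_items_per_user, held_out_item_per_user, k=10):
    """Fraction of users whose held-out item appears in their top-k ranked list."""
    hits = 0
    for user, target in held_out_item_per_user.items():
        if target in ranked_items_per_user[user][:k]:
            hits += 1
    return hits / len(held_out_item_per_user)

# Toy rankings (best first) and one held-out item per user.
rankings = {"u1": [5, 3, 9], "u2": [7, 1, 4]}
targets = {"u1": 3, "u2": 8}
print(hit_ratio_at_k(rankings, targets, k=2))  # u1 hits, u2 misses -> 0.5
```

Because the score depends on the size of the candidate list being ranked, HR@10 values are only comparable between models evaluated under the same protocol.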

## Summary

**What happened:**
1. ✅ Loaded a 5% sample of the training data (~900K ratings)
2. ✅ Created NeuMF+ with genre features
3. ✅ Trained on CPU (no GPU required)
4. ✅ Saved model as `NeuMFPlus_genre_best.pt`

**Model Comparison:**
| Model | Features | HR@10 |
|-------|----------|-------|
| NeuMF | CF only | ~0.973 |
| NeuMF+ | CF + Genre | TBD |
| NeuMF+ | CF + Genre + Synopsis | TBD |

**Next:** Train NeuMF+ with synopsis features (requires SBERT embeddings)