# NCF Training - Preprocessed Data (Google Drive)

This notebook assumes you've already run preprocessing locally and
uploaded the processed data to Google Drive.

**Files needed in Google Drive** (under `MyDrive/NCF-Movie-Recommender/`):
- `data/train.pkl`
- `data/val.pkl`
- `data/test.pkl`
- `data/mappings.pkl`

**Workflow:**
1. Mount Google Drive
2. Clone code from GitHub
3. Link preprocessed data
4. Train model
5. Save trained models back to Drive

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✓ Google Drive mounted!")

## Step 2: Verify Preprocessed Data

In [None]:
import os

# Check if data exists in Drive
data_path = "/content/drive/MyDrive/NCF-Movie-Recommender/data"

if os.path.exists(data_path):
    files = os.listdir(data_path)
    print(f"✓ Found {len(files)} files in Drive:")
    for f in files:
        size = os.path.getsize(os.path.join(data_path, f)) / 1024 / 1024
        print(f"  - {f}: {size:.1f} MB")
else:
    print(f"❌ Data not found at: {data_path}")
    print("\nPlease upload these files from your PC to Google Drive:")
    print("  - data/train.pkl")
    print("  - data/val.pkl")
    print("  - data/test.pkl")
    print("  - data/mappings.pkl")
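
The cell above lists whatever is in the folder but does not confirm that each required file is present. A minimal sketch of such a check follows; the helper name `find_missing_files` and the `REQUIRED_FILES` tuple are illustrative and not part of the repository.

```python
from pathlib import Path

# The four files this notebook expects (taken from the list at the top).
REQUIRED_FILES = ("train.pkl", "val.pkl", "test.pkl", "mappings.pkl")

def find_missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_file()]

# Hypothetical usage against the Drive folder used in this notebook.
missing = find_missing_files("/content/drive/MyDrive/NCF-Movie-Recommender/data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

Failing early here is cheaper than discovering a missing `mappings.pkl` after the dependencies have been installed.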

## Step 3: Clone Repository from GitHub

In [None]:
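# Clone the repository (skipped if it already exists, so this cell can be re-run)
import os

REPO_DIR = '/content/NCF-Movie-Recommender'
if not os.path.isdir(REPO_DIR):
    !git clone https://github.com/albertabayor/NCF-Movie-Recommender.git {REPO_DIR}

# Use an absolute path so a second run does not chdir into a nested folder
os.chdir(REPO_DIR)

# Pull latest changes
!git pull origin main

print("\n✓ Repository ready")
print(f"Working directory: {os.getcwd()}")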

## Step 4: Link Preprocessed Data

In [None]:
# Create symlink to preprocessed data in Drive
!rm -rf data  # Remove the repo's data folder (or a stale symlink); anything inside a real folder is deleted
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/data" data

# Verify
!ls -lh data/

print("\n✓ Preprocessed data linked from Google Drive!")
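
The same link can be created from Python in a way that is safe to re-run. This is a sketch, not the notebook's method: `link_dir` is a hypothetical helper, and unlike `rm -rf` it refuses to delete a real folder.

```python
import os

def link_dir(target, link_name):
    """Point link_name at target, replacing a stale symlink but never a real folder."""
    if os.path.islink(link_name):
        os.unlink(link_name)  # removes the link only, not the target's contents
    elif os.path.exists(link_name):
        raise FileExistsError(f"{link_name} exists and is not a symlink")
    os.symlink(target, link_name, target_is_directory=True)
    return os.path.realpath(link_name)

# Hypothetical usage from the repository root:
# link_dir("/content/drive/MyDrive/NCF-Movie-Recommender/data", "data")
```

Raising on a real folder is a deliberate choice: it avoids silently discarding local files if the repository ever ships a populated `data/` directory.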

## Step 5: Install Dependencies

In [None]:
!pip install -q torch torchvision pandas numpy scikit-learn matplotlib seaborn tqdm tensorboard

print("✓ Dependencies installed!")

## Step 6: Load Preprocessed Data

In [None]:
import sys
sys.path.insert(0, '.')

import pandas as pd
import numpy as np
import pickle
import torch

from src.negative_sampling import build_user_history

# Load data splits
train_df = pd.read_pickle('data/train.pkl')
val_df = pd.read_pickle('data/val.pkl')
test_df = pd.read_pickle('data/test.pkl')

# Load mappings
with open('data/mappings.pkl', 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

# Build user history for negative sampling
user_history = build_user_history(train_users, train_items)

print("="*50)
print("DATASET LOADED")
print("="*50)
print(f"Users: {num_users:,}")
print(f"Items: {num_items:,}")
print(f"Genres: {num_genres}")
print(f"\nTrain: {len(train_df):,} ratings")
print(f"Val:   {len(val_df):,} ratings")
print(f"Test:  {len(test_df):,} ratings")
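
The user history exists so that negative sampling can avoid items a user has already rated. The implementation in `src.negative_sampling` is not shown in this notebook, so the sketch below is only an illustration of the idea. It assumes item IDs are contiguous integers in `[0, num_items)`, and the names `build_history` and `sample_negatives` are hypothetical.

```python
import random
from collections import defaultdict

def build_history(users, items):
    """Map each user ID to the set of item IDs that user interacted with."""
    history = defaultdict(set)
    for u, i in zip(users, items):
        history[int(u)].add(int(i))
    return dict(history)

def sample_negatives(user, history, num_items, num_negatives=4, rng=random):
    """Draw distinct item IDs in [0, num_items) that the user has not seen."""
    seen = history.get(user, set())
    if num_items - len(seen) < num_negatives:
        raise ValueError("not enough unseen items to sample from")
    negatives = []
    while len(negatives) < num_negatives:
        candidate = rng.randrange(num_items)
        # Reject items from the user's history and duplicates already drawn.
        if candidate not in seen and candidate not in negatives:
            negatives.append(candidate)
    return negatives

history = build_history([0, 0, 1], [2, 3, 2])
print(sample_negatives(0, history, num_items=6))
```

With `num_negatives=4`, as used later in training, each observed rating is paired with four unseen items labeled as negatives.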

## Step 7: Check GPU

In [None]:
# Check available GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"\n✓ GPU ready for training!")
else:
    print("\n⚠️  No GPU found. Training will be slow on CPU.")

## Step 8: Create NeuMF Model

**Note:** NeuMF is a baseline collaborative filtering model that only needs
user-item interactions. It doesn't use content features (genres/synopsis).

In [None]:
from src.models.neumf import NeuMF

# Create baseline NeuMF model
model = NeuMF(
    num_users=num_users,
    num_items=num_items,
)

param_count = sum(p.numel() for p in model.parameters())
print(f"\nModel: NeuMF (Neural Matrix Factorization)")
print(f"Parameters: {param_count:,}")
print(f"\nArchitecture:")
print(f"  - GMF branch (element-wise user×item)")
print(f"  - MLP branch (concatenated user/item)")
print(f"  - Fusion layer combining both")
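
To make the three architecture bullets concrete, here is a dependency-free sketch of how one (user, item) pair could be scored. It is not the code in `src.models.neumf`: the real model learns embedding tables and typically stacks several MLP layers, while this version takes the embeddings as plain lists and uses a single hidden layer. All parameter names are illustrative.

```python
import math

def neumf_score(user_gmf, item_gmf, user_mlp, item_mlp,
                hidden_w, hidden_b, out_w, out_b):
    """Score one (user, item) pair with a one-hidden-layer NeuMF-style network."""
    # GMF branch: element-wise product of the user and item embeddings.
    gmf = [u * v for u, v in zip(user_gmf, item_gmf)]
    # MLP branch: concatenate the embeddings, then apply one ReLU layer.
    concat = user_mlp + item_mlp
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    # Fusion layer: linear map over both branch outputs, squashed to (0, 1).
    fused = gmf + hidden
    logit = sum(w * x for w, x in zip(out_w, fused)) + out_b
    return 1.0 / (1.0 + math.exp(-logit))

score = neumf_score(
    user_gmf=[1.0, 2.0], item_gmf=[3.0, 4.0],
    user_mlp=[1.0], item_mlp=[-1.0],
    hidden_w=[[1.0, 1.0], [1.0, -1.0]], hidden_b=[0.0, 0.0],
    out_w=[0.0, 0.0, 0.0, 1.0], out_b=0.0,
)
print(f"{score:.4f}")
```

The two branches keep separate embeddings, so the GMF side can model linear interactions while the MLP side models non-linear ones.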

## Step 9: Train Model (GPU-Optimized for Colab Pro)

**Optimizations applied:**
- Larger batch size (512) for T4 GPU with 16GB VRAM
- 4 parallel workers for data loading
- Mixed precision training (FP16) for faster computation
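- Tuned for Colab Pro, whose longer session limits make a mid-run disconnect less likely

Expect roughly **1-2 hours** on Colab Pro's T4 GPU; actual time depends on dataset size and the assigned hardware.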

In [None]:
from src.train import train_model

# Create symlink for experiments (save to Drive)
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/experiments/trained_models
!rm -rf experiments
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/experiments experiments

print("\n" + "="*60)
print("GPU-OPTIMIZED TRAINING (Colab Pro)")
print("="*60)
if torch.cuda.is_available():  # Avoid a crash when no GPU is attached
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"\nOptimizations enabled:")
print(f"  ✓ Batch size: 512 (maximizes GPU utilization)")
print(f"  ✓ Data workers: 4 (parallel loading)")
print(f"  ✓ Mixed precision: FP16 (typically faster on a T4)")
print(f"\nEstimated time: 1-2 hours")
print("="*60)

# Train the model with all GPU optimizations
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data={
        'users': val_users,
        'items': val_items,
    },
    num_items=num_items,
    num_epochs=30,
    batch_size=512,  # Large batch for T4 GPU (16GB VRAM)
    learning_rate=1e-3,
    weight_decay=1e-5,
    num_negatives=4,
    device=device,  # Set in Step 7; falls back to CPU when no GPU is available
    num_workers=4,  # Parallel data loading
    use_amp=(device == 'cuda'),  # Mixed precision (FP16) only when a GPU is present
    save_dir='./experiments/trained_models',
    early_stopping_patience=5,
    early_stopping_metric='hr@10',
    lr_scheduler_patience=3,
    lr_scheduler_factor=0.5,
    log_dir='./experiments/logs/tensorboard',
)

print("\n" + "="*60)
print("TRAINING COMPLETE!")
print("="*60)
print(f"\nBest HR@10: {max([m.get('hr@10', 0) for m in history['val_metrics']]):.4f}")
print(f"\nModel saved to: experiments/trained_models/")
print(f"experiments/ is a symlink to Google Drive, so checkpoints persist across sessions.")
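
The `early_stopping_patience=5` setting is easiest to understand with a worked example. The exact rule used by `src.train` is not shown here, so the sketch below assumes a common convention: stop once the metric has failed to improve for `patience` consecutive epochs. The function name is hypothetical.

```python
def best_epoch_with_patience(scores, patience=5):
    """Return (best_epoch, stop_epoch) for a higher-is-better validation metric."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return best_epoch, epoch
    # Ran out of epochs without triggering early stopping.
    return best_epoch, len(scores) - 1

# Illustrative HR@10 values per epoch (made up for this example).
hr_at_10 = [0.10, 0.20, 0.15, 0.15, 0.15, 0.15, 0.15, 0.30]
print(best_epoch_with_patience(hr_at_10, patience=5))
```

In this example training stops at epoch 6 and never reaches the better score at epoch 7, which is the trade-off a small patience value makes.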

## Step 10: Monitor Training (Optional)

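Run this cell before starting Step 9. Colab executes one cell at a time, so TensorBoard cannot be launched while the training cell is still running; once open, it picks up new logs as training writes them (use the refresh button if the dashboard starts out empty).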

In [None]:
%load_ext tensorboard
%tensorboard --logdir experiments/logs/tensorboard --port 6006

## Step 11: Evaluate on Test Set

Once training is complete, evaluate the model on the held-out test set.

In [None]:
from src.evaluate import evaluate_model
from src.models.neumf import NeuMF

# Load best model
model = NeuMF.load(
    './experiments/trained_models/NeuMF_best.pt',
    NeuMF,
    num_users=num_users,
    num_items=num_items,
)

model = model.to(device)  # Reuse the device selected in Step 7

# Evaluate on test set
test_metrics = evaluate_model(
    model=model,
    users=test_df['userId'].values,
    items=test_df['movieId'].values,
    k_values=[5, 10, 20],
    device=device,
    num_items=num_items,
    user_history=user_history,
)

print("\n" + "="*50)
print("TEST SET RESULTS")
print("="*50)
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")
print("="*50)
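
For reference, HR@K asks whether the held-out item appears in the model's top K recommendations. The sketch below shows that calculation for a single user, together with NDCG@K, which is commonly reported alongside it. Whether `evaluate_model` computes NDCG, and how it builds the ranked list, is not shown in this notebook, so treat both functions as illustrative.

```python
import math

def hit_ratio_at_k(ranked_items, target, k):
    """Return 1.0 if target appears in the top k of ranked_items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """Discounted gain for one relevant item: 1 / log2(position + 2)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    # Position 0 scores 1.0; lower positions are discounted logarithmically.
    return 1.0 / math.log2(top_k.index(target) + 2)

ranked = [7, 3, 9, 1]  # item IDs ordered by predicted score, best first
print(hit_ratio_at_k(ranked, target=9, k=3), ndcg_at_k(ranked, target=9, k=3))
```

The reported metric is the mean of these per-user values, so HR@10 of 0.65 would mean the held-out item landed in the top 10 for 65% of test users.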

## Summary

**What happened:**
1. ✅ Loaded preprocessed data from Google Drive
2. ✅ Cloned latest code from GitHub
3. ✅ Trained NeuMF model on T4 GPU
4. ✅ Saved trained model back to Google Drive
5. ✅ Evaluated on test set

**Next steps:**
- Download trained model from `experiments/trained_models/`
- Run inference locally on your WSL2 machine
- Or try NeuMF+ with content features (run full preprocessing locally)