# Fast Training - Smaller Dataset for Colab Pro

This notebook trains on a random 10% sample of the training set to cut training time. The validation and test sets are loaded in full.

**Trade-off:**
- Full dataset: 18M samples, ~50 hours training
- 10% sample: 1.8M samples, ~2-3 hours training

**Recommended:** Use the 10% sample for quick iteration, then train on the full dataset overnight.

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✓ Google Drive mounted!")

## Step 2: Clone and Setup

In [None]:
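# Clone the repository once; re-running this cell reuses the existing checkout
import os
REPO_DIR = '/content/NCF-Movie-Recommender'
if not os.path.isdir(REPO_DIR):
    !git clone https://github.com/albertabayor/NCF-Movie-Recommender.git {REPO_DIR}

os.chdir(REPO_DIR)
!git pull origin main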

# Link data
!rm -rf data
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/data" data

# Link experiments
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/experiments/trained_models
!rm -rf experiments
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/experiments experiments

# Install dependencies
!pip install -q torch torchvision pandas numpy scikit-learn tqdm tensorboard

print("\n✓ Setup complete!")
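Before loading anything, it can help to confirm that the `data` symlink resolves to the files Step 3 reads. The sketch below is a hypothetical helper (`missing_data_files` is not part of the repository); the four filenames are the ones loaded in the next cell.

```python
from pathlib import Path

# Files read in Step 3 of this notebook.
REQUIRED_FILES = ("train.pkl", "val.pkl", "test.pkl", "mappings.pkl")


def missing_data_files(data_dir="data", required=REQUIRED_FILES):
    """Return the required filenames that are absent from data_dir.

    Hypothetical helper for this notebook, not part of the repository.
    A broken symlink or an unmounted Drive shows up as every file missing.
    """
    root = Path(data_dir)
    return [name for name in required if not (root / name).is_file()]


missing = missing_data_files("data")
print("Missing data files:", missing if missing else "none")
```

If the list is not empty, re-check that Drive is mounted and that the Drive folder name matches the path used in the `ln -s` command above.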

## Step 3: Load Data with 10% Sample

In [None]:
import sys
sys.path.insert(0, '.')

import pandas as pd
import numpy as np
import pickle
import torch

from src.negative_sampling import build_user_history
from src.models.neumf import NeuMF
from src.train import train_model

# Load data
train_df = pd.read_pickle('data/train.pkl')
val_df = pd.read_pickle('data/val.pkl')
test_df = pd.read_pickle('data/test.pkl')

with open('data/mappings.pkl', 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']

print("="*60)
print("SAMPLING 10% OF TRAINING DATA")
print("="*60)
print(f"\nOriginal train size: {len(train_df):,}")

# Sample 10% of training data
SAMPLE_RATIO = 0.1
np.random.seed(42)
sample_idx = np.random.choice(
    len(train_df), 
    int(len(train_df) * SAMPLE_RATIO), 
    replace=False
)
train_df_sample = train_df.iloc[sample_idx].copy()

print(f"Sampled train size: {len(train_df_sample):,} ({SAMPLE_RATIO*100:.0f}%)")

train_users = train_df_sample['userId'].values
train_items = train_df_sample['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

user_history = build_user_history(train_users, train_items)

print(f"\nUsers: {num_users:,}")
print(f"Items: {num_items:,}")
print(f"\nTrain: {len(train_users):,} ratings")
print(f"Val:   {len(val_users):,} ratings")
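Uniform row sampling keeps every rating with equal probability, so users with few ratings can drop out of the 10% sample entirely even though they still appear in `val_df`. The sketch below measures that effect on synthetic data. `sample_coverage` and the toy ID array are illustrative assumptions, not part of the repository.

```python
import numpy as np


def sample_coverage(user_ids, sample_ratio, seed=42):
    """Fraction of distinct users that survive a uniform row sample.

    Illustrative helper: it draws its own sample with the given seed
    instead of reusing the notebook's sample_idx.
    """
    rng = np.random.default_rng(seed)
    n_rows = len(user_ids)
    n_keep = int(n_rows * sample_ratio)
    idx = rng.choice(n_rows, size=n_keep, replace=False)
    kept_users = np.unique(user_ids[idx])
    all_users = np.unique(user_ids)
    return len(kept_users) / len(all_users)


# Toy data: 1,000 users with exactly 2 ratings each.
toy_user_ids = np.repeat(np.arange(1000), 2)
coverage = sample_coverage(toy_user_ids, sample_ratio=0.1)
print(f"Users still present after sampling: {coverage:.1%}")
```

On the real data, the equivalent check for the sample already drawn is `train_df_sample['userId'].nunique() / train_df['userId'].nunique()`.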

## Step 4: Check GPU

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## Step 5: Create Model

In [None]:
model = NeuMF(
    num_users=num_users,
    num_items=num_items,
)

model = model.to(device)  # use the device selected in Step 4
param_count = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {param_count:,}")

## Step 6: Train (Fast!)

**With 10% data:**
- ~3.5K iterations per epoch (1.8M samples at batch size 512)
- ~15-20 minutes per epoch
- ~2-3 hours total if early stopping ends the run after roughly 6-12 epochs
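The iteration count follows from the sample size and the batch size. This is a quick sanity check, assuming each batch element is one positive rating. If `train_model` counts the `num_negatives=4` sampled negatives toward the batch, the figure would be roughly five times larger.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Batches needed to see every sample once (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


full_size = 18_000_000               # full training set, as stated above
sample_size = int(full_size * 0.1)   # 10% sample
steps = iterations_per_epoch(sample_size, 512)
print(f"{sample_size:,} samples / 512 per batch = {steps:,} iterations")
```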

In [None]:
print("\n" + "="*60)
print(f"FAST TRAINING ({SAMPLE_RATIO*100:.0f}% DATA SAMPLE)")
print("="*60)
if device == 'cuda':
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
print("\nConfiguration:")
print(f"  - Data: {SAMPLE_RATIO*100:.0f}% sample ({len(train_users):,} ratings)")
print("  - Batch size: 512")
print("  - Workers: 2 (Colab optimal)")
print("  - Mixed precision: FP16")
print("\nEstimated time: 2-3 hours")
print("="*60)

history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data={
        'users': val_users,
        'items': val_items,
    },
    num_items=num_items,
    num_epochs=30,
    batch_size=512,
    learning_rate=1e-3,
    weight_decay=1e-5,
    num_negatives=4,
    device=device,
    num_workers=2,
    use_amp=(device == 'cuda'),  # enable FP16 mixed precision only when a GPU is present
    save_dir='./experiments/trained_models',
    early_stopping_patience=5,
    early_stopping_metric='hr@10',
    lr_scheduler_patience=3,
    lr_scheduler_factor=0.5,
    log_dir='./experiments/logs/tensorboard',
)

print("\n" + "="*60)
print("TRAINING COMPLETE!")
print("="*60)
best_hr = max((m.get('hr@10', 0.0) for m in history['val_metrics']), default=0.0)
print(f"\nBest HR@10: {best_hr:.4f}")
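Early stopping and the final report both rely on HR@10. As a reference point, here is a minimal sketch of hit ratio at K under the common leave-one-out protocol, where each user has one held-out item ranked among a list of candidates. The function name and data layout are assumptions, and the repository's own evaluation code may compute the metric differently.

```python
def hit_ratio_at_k(ranked_items_per_user, held_out_item_per_user, k=10):
    """Fraction of users whose held-out item appears in their top-k ranking.

    ranked_items_per_user: dict mapping user -> list of item ids, best first.
    held_out_item_per_user: dict mapping user -> the single held-out item id.
    """
    if not held_out_item_per_user:
        return 0.0
    hits = sum(
        1
        for user, target in held_out_item_per_user.items()
        if target in ranked_items_per_user.get(user, [])[:k]
    )
    return hits / len(held_out_item_per_user)


# Toy example: three users, each with a three-item ranking.
rankings = {0: [5, 3, 9], 1: [2, 7, 4], 2: [8, 1, 6]}
held_out = {0: 3, 1: 4, 2: 0}
print(f"HR@2: {hit_ratio_at_k(rankings, held_out, k=2):.3f}")
print(f"HR@3: {hit_ratio_at_k(rankings, held_out, k=3):.3f}")
```

Because the model here is trained on a 10% sample, HR@10 values from this notebook are best compared against other runs on the same sample rather than against full-dataset results.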