# NCF Movie Recommender - Google Colab Training

This notebook is optimized for Google Colab's free GPU (T4 with 16GB VRAM).

**Steps:**
1. Upload your datasets to Google Drive
2. Mount Google Drive
3. Run preprocessing
4. Train models
5. Download trained models

## 1. Setup & Mount Drive

In [None]:
# Check GPU
!nvidia-smi

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Set paths
import os

# Change this to your folder in Google Drive
DRIVE_PATH = "/content/drive/MyDrive/NCF-Movie-Recommender"

# Create directory if it doesn't exist
!mkdir -p "{DRIVE_PATH}"

# Set working directory
os.chdir(DRIVE_PATH)
print(f"Working directory: {os.getcwd()}")

## 2. Upload Datasets

**Before running the cell below, upload these files to your Google Drive folder:**
- `ratings.csv`
- `movies_metadata.csv`
- `links.csv`
- `keywords.csv` (optional)

Place them in a `datasets/` subfolder.
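
As a complement to the `ls` check, the file list above can be verified programmatically. This is a minimal sketch, assuming the working directory is the Drive project folder; `find_missing_files`, `REQUIRED_FILES`, and `OPTIONAL_FILES` are hypothetical names and not part of the project code.

```python
from pathlib import Path

# File names taken from the list above; keywords.csv is optional.
REQUIRED_FILES = ["ratings.csv", "movies_metadata.csv", "links.csv"]
OPTIONAL_FILES = ["keywords.csv"]


def find_missing_files(dataset_dir, filenames):
    """Return the names in `filenames` that are not regular files in `dataset_dir`."""
    base = Path(dataset_dir)
    return [name for name in filenames if not (base / name).is_file()]


# Assumes the datasets/ subfolder sits in the current working directory.
missing = find_missing_files("datasets", REQUIRED_FILES)
if missing:
    print(f"Missing required files: {missing}")
else:
    print("All required datasets found.")
```

Running the same function with `OPTIONAL_FILES` shows whether `keywords.csv` is present without treating its absence as an error.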

In [None]:
# Check if datasets exist
!ls -lh datasets/

# Should see:
# ratings.csv (~709MB)
# movies_metadata.csv (~34MB)
# links.csv (~989KB)

## 3. Clone/Install Code

In [None]:
# Option A: If you have the code in Drive (it should already be there)
print("Using existing code in Drive")

# Option B: Clone from GitHub (if you've pushed the code)
# !git clone https://github.com/YOUR_USERNAME/NCF-Movie-Recommender.git
# %cd NCF-Movie-Recommender

In [None]:
# Install dependencies
!pip install -q torch torchvision pandas numpy scikit-learn matplotlib seaborn tqdm tensorboard sentence-transformers

## 4. Run Preprocessing

In [None]:
import sys
sys.path.insert(0, '.')

from src.preprocessing import DataPreprocessor
from src.config import config

# Run preprocessing
preprocessor = DataPreprocessor()
preprocessor.run()

## 5. Train NeuMF+ (Full Model)
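
The `train_model` call in this section passes `num_negatives=4`. In NCF-style training on implicit feedback, this usually means each observed (user, item) pair is matched with four sampled items the user has not interacted with. The sketch below illustrates that idea with rejection sampling. It is not the code in `src.negative_sampling`; `build_history` and `sample_negatives` are hypothetical names, and the project's own sampler may work differently.

```python
import random


def build_history(users, items):
    """Map each user ID to the set of item IDs that user interacted with."""
    history = {}
    for user, item in zip(users, items):
        history.setdefault(user, set()).add(item)
    return history


def sample_negatives(user, history, num_items, num_negatives, rng):
    """Draw `num_negatives` item IDs that `user` has not interacted with.

    Uses rejection sampling with replacement, so the same negative
    item can appear more than once.
    """
    seen = history.get(user, set())
    if len(seen) >= num_items:
        raise ValueError("user has interacted with every item")
    negatives = []
    while len(negatives) < num_negatives:
        candidate = rng.randrange(num_items)
        if candidate not in seen:
            negatives.append(candidate)
    return negatives


rng = random.Random(0)  # fixed seed for reproducibility
history = build_history([0, 0, 1], [2, 3, 1])
negatives = sample_negatives(0, history, num_items=6, num_negatives=4, rng=rng)
print(negatives)  # four item IDs, none of them 2 or 3
```

Rejection sampling is cheap when each user has seen only a small fraction of the catalogue, which is typical for movie rating data.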

In [None]:
import pandas as pd
import numpy as np
import pickle
import torch

from src.models.neumf_plus import NeuMFPlus
from src.train import train_model
from src.negative_sampling import build_user_history

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Load data
train_df = pd.read_pickle(config.paths.train_path)
val_df = pd.read_pickle(config.paths.val_path)
test_df = pd.read_pickle(config.paths.test_path)

with open(config.paths.mappings_path, 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

user_history = build_user_history(train_users, train_items)

# Content features
genre_features = np.stack(train_df['genre_features'].values)
val_genre_features = np.stack(val_df['genre_features'].values)
# Placeholder: random 384-dim vectors stand in for real synopsis embeddings
synopsis_embeddings = np.random.randn(num_items, 384).astype(np.float32)

print(f"Dataset: {num_users:,} users, {num_items:,} items")

In [None]:
# Create model
model = NeuMFPlus(
    num_users=num_users,
    num_items=num_items,
    num_genres=num_genres,
)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Train with Colab-optimized settings (larger batch size)
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data={
        'users': val_users,
        'items': val_items,
        'genre_features': val_genre_features,
        'synopsis_embeddings': synopsis_embeddings,
    },
    num_items=num_items,
    num_epochs=30,
    batch_size=512,  # Larger batch for Colab's T4
    learning_rate=1e-3,
    num_negatives=4,
    device=device,  # detected earlier; falls back to 'cpu' when no GPU is available
    save_dir='./experiments/trained_models',
    early_stopping_patience=5,
    log_dir='./experiments/logs/tensorboard',
)

## 6. Monitor Training

In [None]:
# Load TensorBoard
%load_ext tensorboard
%tensorboard --logdir experiments/logs/tensorboard

## 7. Download Trained Models

After training, download the best model to your local machine.
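
If the output folder holds several checkpoints, bundling them into one archive means a single browser download. This is a minimal sketch using the standard library's `shutil.make_archive`; `archive_models` is a hypothetical helper, and the paths in the usage comment assume the `save_dir` used during training.

```python
import shutil
from pathlib import Path


def archive_models(model_dir, archive_stem):
    """Zip the contents of `model_dir` into `<archive_stem>.zip`.

    Returns the path of the created archive. Keep `archive_stem`
    outside `model_dir` so the archive does not try to include itself.
    """
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"No such directory: {model_dir}")
    # make_archive appends the ".zip" extension itself.
    return shutil.make_archive(str(archive_stem), "zip", root_dir=str(model_dir))


# Usage in Colab (uncomment after training):
# from google.colab import files
# archive_path = archive_models("experiments/trained_models", "trained_models")
# files.download(archive_path)
```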

In [None]:
# List trained models
!ls -lh experiments/trained_models/

# Download to local machine (browser will download the file)
from google.colab import files

# Uncomment to download the best model
# files.download('experiments/trained_models/NeuMFPlus_best.pt')

## Colab Tips

1. **Session timeout**: The Colab free tier disconnects after ~90 min of inactivity.
2. **Runtime limit**: Continuous runtime is capped at ~12 hours.
3. **Save frequently**: The working directory is on Google Drive, so saved models persist after a disconnect.
4. **GPU types**: The free tier usually provides a T4 (16GB), but availability is not guaranteed. Pro ($10/mo) may give a V100 or A100.
5. **Memory**: Colab has ~12GB of RAM, which is sufficient for this dataset.
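
One way to work within the runtime limit in tip 2 is to track elapsed time and stop cleanly before the session is cut off. This is a minimal sketch; `RuntimeBudget` is a hypothetical helper, not part of the project. Its clock starts when the object is created rather than when the Colab session started, so create it early or choose a smaller budget.

```python
import time


class RuntimeBudget:
    """Track elapsed wall-clock time against a fixed time budget."""

    def __init__(self, budget_hours, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._budget_seconds = budget_hours * 3600.0

    def elapsed_seconds(self):
        """Seconds elapsed since this object was created."""
        return self._clock() - self._start

    def has_time_for(self, seconds_needed):
        """Return True if `seconds_needed` more seconds still fit in the budget."""
        return self.elapsed_seconds() + seconds_needed <= self._budget_seconds


# Leave a margin below the ~12 hour limit.
budget = RuntimeBudget(budget_hours=11.5)

# Inside a hypothetical epoch loop:
# if not budget.has_time_for(last_epoch_seconds):
#     ...save a checkpoint and stop training...
```

Passing the clock as a parameter keeps the helper easy to test, since a fake clock can stand in for `time.monotonic`.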