# Colab with Google Drive - One-Time Setup

This notebook sets up a workflow in which you upload your datasets to Google Drive once and then reuse them in every training session.

**Workflow:**
1. Upload datasets to Google Drive (one time, takes ~10 minutes)
2. Mount Drive in Colab
3. Clone code from GitHub
4. Train using Drive datasets
5. Models save back to Drive

## Step 1: Upload Datasets to Google Drive (One-Time Setup)

**Do this once from your computer:**

1. Go to https://drive.google.com/
2. Create a folder: `NCF-Movie-Recommender/datasets/`
3. Upload these files:
   - `ratings.csv` (~709MB)
   - `movies_metadata.csv` (~34MB)
   - `links.csv` (~989KB)

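**Or use Google Drive for desktop (formerly Drive File Stream), which is faster:**
1. Install Google Drive for desktop
2. Copy the files into the same `NCF-Movie-Recommender/datasets/` folder via your file manager or the command line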

## Step 2: Mount Google Drive in Colab

In [None]:
# Mount Google Drive. Colab will prompt you to authorize access to your account
# before this call returns.
from google.colab import drive
drive.mount('/content/drive')

print("Drive mounted successfully!")
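A quick check after mounting can confirm that the Step 1 uploads are visible before any later cell depends on them. The sketch below assumes the folder layout from Step 1 and the default `MyDrive` mount path; `find_missing` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# Assumed location: the folder created in Step 1, seen through the mount point.
DATASET_DIR = Path("/content/drive/MyDrive/NCF-Movie-Recommender/datasets")
EXPECTED_FILES = ["ratings.csv", "movies_metadata.csv", "links.csv"]


def find_missing(folder, names):
    """Return the names that are not regular files inside folder."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


missing = find_missing(DATASET_DIR, EXPECTED_FILES)
if missing:
    print(f"Missing from {DATASET_DIR}: {missing}")
else:
    print("All dataset files found in Drive.")
```

If anything is reported missing, the most likely causes are an upload that has not finished syncing or a folder name that differs from the one in Step 1.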

## Step 3: Clone Repository from GitHub

In [None]:
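# Clone the repo into /content (skipped if a previous run already cloned it)
import os

REPO_DIR = '/content/NCF-Movie-Recommender'
if not os.path.isdir(REPO_DIR):
    !git clone https://github.com/albertabayor/NCF-Movie-Recommender.git {REPO_DIR}

# Use an absolute path so re-running this cell does not fail
os.chdir(REPO_DIR)

!pwd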

## Step 4: Symlink Drive Datasets to Repo

This creates a symbolic link named `datasets` inside the repo that points to the Drive folder, so the code reads the files from Drive without copying them.

In [None]:
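# Remove the repo's placeholder datasets folder (or a stale link from a previous run).
# Note: rm -rf deletes the folder even if it is not empty.
!rm -rf datasets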

# Create symlink to Google Drive datasets
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/datasets" datasets

# Verify
!ls -lh datasets/

print("Datasets linked from Google Drive!")
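The shell commands above can also be written in Python, which makes it easier to re-run the cell safely. The following is a minimal sketch, assuming the same Drive path as the cell above; `relink` is a hypothetical helper. Unlike `rm -rf`, it refuses to delete a folder that still contains files.

```python
import os


def relink(target, link_path):
    """Point link_path at target, replacing an old link or an empty folder."""
    if os.path.islink(link_path):
        os.unlink(link_path)   # removes the link only, never the target
    elif os.path.isdir(link_path):
        os.rmdir(link_path)    # raises OSError if the folder is not empty
    os.symlink(target, link_path)
    return os.path.realpath(link_path)


# Assumed Drive location, same as in the shell version of this step.
DRIVE_DATASETS = "/content/drive/MyDrive/NCF-Movie-Recommender/datasets"
if os.path.isdir(DRIVE_DATASETS):
    print("datasets ->", relink(DRIVE_DATASETS, "datasets"))
else:
    print("Drive folder not found; mount Drive and upload the datasets first.")
```

Checking for an existing link first means the cell can run twice in one session without nesting a second link inside the Drive folder.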

## Step 5: Install Dependencies

In [None]:
!pip install -q torch torchvision pandas numpy scikit-learn matplotlib seaborn tqdm tensorboard sentence-transformers

print("Dependencies installed!")
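Because `-q` hides most of pip's output and Colab ships with many of these packages preinstalled, it can help to record which versions actually ended up in the runtime. This sketch uses only the standard library; `installed_versions` is a hypothetical helper and the package list is a subset of the install command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


PACKAGES = ["torch", "pandas", "numpy", "scikit-learn", "sentence-transformers"]
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Saving this output alongside a trained model makes it easier to reproduce a run later, since Colab's preinstalled versions change over time.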

## Step 6: Run Preprocessing

In [None]:
import sys
sys.path.insert(0, '.')

from src.preprocessing import DataPreprocessor
from src.config import config

# Run preprocessing (saves to local Colab storage)
preprocessor = DataPreprocessor()
preprocessor.run()

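## Step 7: Copy Processed Data to Drive and Link Experiments

Save processed data to Drive so you don't need to re-preprocess next time. Copying the processed data is optional, but the `experiments` symlink created in this cell is what lets Step 8 save models to Drive.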

In [None]:
# Create data folder in Drive
!mkdir -p "/content/drive/MyDrive/NCF-Movie-Recommender/data"

# Copy processed data (-r also copies any subfolders)
!cp -r data/. "/content/drive/MyDrive/NCF-Movie-Recommender/data/"

# Create experiment folders in Drive
!mkdir -p "/content/drive/MyDrive/NCF-Movie-Recommender/experiments/trained_models"
!mkdir -p "/content/drive/MyDrive/NCF-Movie-Recommender/experiments/logs"

# Replace the repo's local experiments folder with a symlink to Drive.
# Note: rm -rf deletes anything already in ./experiments.
!rm -rf experiments
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/experiments" experiments

print("Processed data copied; experiments will be saved to Google Drive!")
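In a later session, the copy can be reversed to skip preprocessing. The sketch below is one way to do that, assuming the preprocessor writes everything training needs under `data/` (the repo may store files elsewhere); `restore_cache` is a hypothetical helper.

```python
import os
import shutil


def restore_cache(cache_dir, local_dir):
    """Copy cache_dir into local_dir if the cache exists and is non-empty.

    Returns True when files were restored, False otherwise.
    """
    if not os.path.isdir(cache_dir) or not os.listdir(cache_dir):
        return False
    shutil.copytree(cache_dir, local_dir, dirs_exist_ok=True)
    return True


# Assumed Drive location, same as the copy destination in this step.
DRIVE_DATA = "/content/drive/MyDrive/NCF-Movie-Recommender/data"
if restore_cache(DRIVE_DATA, "data"):
    print("Restored processed data from Drive; preprocessing can be skipped.")
else:
    print("No cached data found; run the preprocessing cell.")
```

Copying to local Colab storage, rather than symlinking `data/` to Drive, keeps repeated reads during training on the local disk.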

## Step 8: Train Model

In [None]:
import pandas as pd
import numpy as np
import pickle
import torch

from src.models.neumf_plus import NeuMFPlus
from src.train import train_model
from src.negative_sampling import build_user_history

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Load data
train_df = pd.read_pickle(config.paths.train_path)
val_df = pd.read_pickle(config.paths.val_path)
test_df = pd.read_pickle(config.paths.test_path)

with open(config.paths.mappings_path, 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

user_history = build_user_history(train_users, train_items)

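# Note: genre_features is built here but is not passed to train_model below
genre_features = np.stack(train_df['genre_features'].values)
val_genre_features = np.stack(val_df['genre_features'].values)
# Placeholder: random vectors, not real synopsis embeddings.
# Replace with actual embeddings before relying on the results.
synopsis_embeddings = np.random.randn(num_items, 384).astype(np.float32)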

print(f"Dataset: {num_users:,} users, {num_items:,} items")
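Before building the model, it can be worth confirming that every ID fits inside the embedding tables, since an out-of-range index typically surfaces later as an opaque CUDA error. The sketch below assumes that preprocessing remaps IDs to contiguous 0-based indices, which the notebook does not state; `ids_in_range` is a hypothetical helper shown on toy data.

```python
import numpy as np


def ids_in_range(ids, size):
    """Return True if every id satisfies 0 <= id < size."""
    ids = np.asarray(ids)
    if ids.size == 0:
        return True
    return bool(ids.min() >= 0 and ids.max() < size)


# Toy example; in the notebook, pass train_users with num_users
# and train_items with num_items instead.
print(ids_in_range([0, 1, 2], size=3))
print(ids_in_range([0, 3], size=3))
```

The first call reports a valid range and the second an invalid one, because index 3 does not exist in a table with 3 rows.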

In [None]:
# Create and train model
model = NeuMFPlus(
    num_users=num_users,
    num_items=num_items,
    num_genres=num_genres,
)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Train (models save to Drive via symlink)
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data={
        'users': val_users,
        'items': val_items,
        'genre_features': val_genre_features,
        'synopsis_embeddings': synopsis_embeddings,
    },
    num_items=num_items,
    num_epochs=30,
    batch_size=512,
    learning_rate=1e-3,
    num_negatives=4,
    device=device,  # set in the GPU check cell; falls back to CPU when no GPU is available
    save_dir='./experiments/trained_models',
    early_stopping_patience=5,
    log_dir='./experiments/logs/tensorboard',
)

## Complete!

**Benefits:**
- Datasets stay in Google Drive (upload once)
- Trained models save to Drive (persist between sessions)
- Code is cloned fresh from GitHub each session
- Datasets, processed data, and trained models live in Drive, so they survive when the Colab runtime is reset

**Next time:**
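1. Open this notebook
2. Mount Drive (Step 2)
3. Clone repo (Step 3)
4. Create symlinks (Steps 4 and 7)
5. Install dependencies (Step 5), since a new runtime does not keep pip installs
6. Re-run preprocessing (Step 6), or copy the processed files from Drive back into `data/`
7. Start training (no need to re-upload!)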