# NCF Movie Recommender - Colab + GitHub Workflow

This notebook clones your code from GitHub and trains the model in Colab.

**Workflow:**
1. Clone code from GitHub (always latest version)
2. Upload datasets directly to Colab
3. Train model
4. Download trained models
5. (Optional) Push updated code/models back to GitHub

## 0. Your GitHub Repository URL

**Replace `YOUR_USERNAME` below with your actual GitHub username.**

In [None]:
# ====================== CONFIGURATION ======================
# Replace with YOUR GitHub username and repo name
GITHUB_USERNAME = "YOUR_USERNAME"  # CHANGE THIS!
REPO_NAME = "NCF-Movie-Recommender"

GITHUB_REPO = f"https://github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"
GITHUB_SSH = f"git@github.com:{GITHUB_USERNAME}/{REPO_NAME}.git"

print(f"Will clone from: {GITHUB_REPO}")
print(f"\nMake sure your repo is public or you've set up GitHub credentials!")
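Leaving the placeholder in place is an easy mistake, and it only shows up later as a failed clone. A minimal sketch of a fail-fast check, assuming the placeholder text stays `YOUR_USERNAME`; the helper name `build_repo_url` is hypothetical and not part of the repository:

```python
def build_repo_url(username: str, repo_name: str) -> str:
    """Return the HTTPS clone URL, refusing the unedited placeholder."""
    if not username or username == "YOUR_USERNAME":
        raise ValueError("Set GITHUB_USERNAME to your GitHub username first.")
    return f"https://github.com/{username}/{repo_name}.git"


# Example with a made-up username
print(build_repo_url("octocat", "NCF-Movie-Recommender"))
```

Calling it with `GITHUB_USERNAME` and `REPO_NAME` before the clone cell stops the notebook early instead of cloning from a non-existent URL.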

## 1. Check GPU & Setup

In [None]:
# Check available GPU
!nvidia-smi

# Check RAM
!cat /proc/meminfo | grep MemTotal

## 2. Clone Repository from GitHub

In [None]:
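# Clone the repository (skipped on re-runs so this cell stays idempotent)
import os

if os.path.basename(os.getcwd()) != REPO_NAME:
    if not os.path.isdir(REPO_NAME):
        !git clone {GITHUB_REPO}
    # Change to repo directory
    os.chdir(REPO_NAME)

# Fast-forward to the latest commit if the repo was already cloned
!git pull --ff-only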

# Verify we're in the right place
!pwd
!ls -la

## 3. Install Dependencies

In [None]:
# Install all requirements
!pip install -q torch torchvision pandas numpy scikit-learn matplotlib seaborn tqdm tensorboard sentence-transformers

print("Dependencies installed!")

## 4. Upload Datasets

**You need to upload your dataset files to Colab:**

Click the folder icon 📁 on the left sidebar, then:
1. Click the upload button (file with ↑ arrow)
2. Upload these files to the `{REPO_NAME}/datasets/` folder:
   - `ratings.csv` (~709MB)
   - `movies_metadata.csv` (~34MB)
   - `links.csv` (~989KB)
   - `keywords.csv` (optional, ~6MB)

**OR use the upload widget below:**

In [None]:
# Alternative: Upload files using a widget
from google.colab import files
import os

# Create datasets directory
!mkdir -p datasets

print("Please upload these files when prompted:")
print("1. ratings.csv")
print("2. movies_metadata.csv")
print("3. links.csv")
print("4. keywords.csv (optional; not prompted below, add it via the sidebar)")
print("\nUpload the three required files one by one.")

# Upload each file
for filename in ['ratings.csv', 'movies_metadata.csv', 'links.csv']:
    print(f"\nWaiting for {filename}...")
    uploaded = files.upload()
    
    # Move to datasets folder
    for uploaded_file in uploaded.keys():
        os.rename(uploaded_file, f"datasets/{uploaded_file}")
        print(f"✓ Moved {uploaded_file} to datasets/")

print("\nAll files uploaded!")

## 5. Verify Datasets

In [None]:
# Check that datasets are present
!ls -lh datasets/

# Expected sizes:
# ratings.csv           ~709MB
# movies_metadata.csv   ~34MB
# links.csv             ~989KB
# keywords.csv          ~6MB (if uploaded)
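Checking the `ls` listing by eye is easy to get wrong, so the same check can be automated. A minimal sketch that reports required files that are missing or empty; the helper name `missing_datasets` is hypothetical, and the file list simply mirrors the required files named above:

```python
from pathlib import Path

REQUIRED_FILES = ("ratings.csv", "movies_metadata.csv", "links.csv")


def missing_datasets(dataset_dir: str = "datasets", required=REQUIRED_FILES) -> list:
    """Return the required file names that are absent or empty in dataset_dir."""
    root = Path(dataset_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = missing_datasets()
if missing:
    print(f"Missing or empty: {', '.join(missing)}")
else:
    print("All required datasets are present.")
```

An empty file counts as missing here because an interrupted upload can leave a zero-byte file behind.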

## 6. Run Preprocessing

In [None]:
import sys
sys.path.insert(0, '.')

from src.preprocessing import DataPreprocessor
from src.config import config

# Run preprocessing
preprocessor = DataPreprocessor()
preprocessor.run()

## 7. Train NeuMF+ Model

In [None]:
import pandas as pd
import numpy as np
import pickle
import torch

from src.models.neumf_plus import NeuMFPlus
from src.train import train_model
from src.negative_sampling import build_user_history

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Load processed data
train_df = pd.read_pickle(config.paths.train_path)
val_df = pd.read_pickle(config.paths.val_path)
test_df = pd.read_pickle(config.paths.test_path)

with open(config.paths.mappings_path, 'rb') as f:
    mappings = pickle.load(f)

num_users = mappings['num_users']
num_items = mappings['num_items']
num_genres = mappings['num_genres']

train_users = train_df['userId'].values
train_items = train_df['movieId'].values
val_users = val_df['userId'].values
val_items = val_df['movieId'].values

user_history = build_user_history(train_users, train_items)

# Content features
genre_features = np.stack(train_df['genre_features'].values)
val_genre_features = np.stack(val_df['genre_features'].values)

# For demo: random synopsis embeddings (in production, use Sentence-BERT)
synopsis_embeddings = np.random.randn(num_items, 384).astype(np.float32)

print(f"\n{'='*50}")
print(f"Dataset Loaded")
print(f"{'='*50}")
print(f"Users: {num_users:,}")
print(f"Items: {num_items:,}")
print(f"Genres: {num_genres}")
print(f"Train samples: {len(train_df):,}")
print(f"Val samples: {len(val_df):,}")
print(f"Test samples: {len(test_df):,}")
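The random synopsis embeddings above are only a stand-in. A rough sketch of how real Sentence-BERT vectors might be produced with `sentence-transformers` follows. It rests on several assumptions: `overview_by_item` is a hypothetical dict from 0-based item index to synopsis text (building it from `movies_metadata.csv` is not shown), both helper names are illustrative, and `all-MiniLM-L6-v2` is chosen only because its 384-dimensional output matches the shape used above.

```python
def align_synopses(overview_by_item: dict, num_items: int) -> list:
    """Return one text per item index 0..num_items-1, using "" where none is known."""
    texts = [""] * num_items
    for item_idx, text in overview_by_item.items():
        # Ignore indices outside the item range and non-string values (e.g. NaN)
        if 0 <= item_idx < num_items and isinstance(text, str):
            texts[item_idx] = text.strip()
    return texts


def embed_synopses(texts: list, model_name: str = "all-MiniLM-L6-v2"):
    """Encode texts into a NumPy array (384 columns for the default model)."""
    # Imported lazily so align_synopses works without the heavy dependency
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(model_name)
    return encoder.encode(
        texts, batch_size=256, show_progress_bar=True, convert_to_numpy=True
    )


# Possible usage in this notebook (overview_by_item is assumed to exist):
# texts = align_synopses(overview_by_item, num_items)
# synopsis_embeddings = embed_synopses(texts).astype("float32")
```

Items without a synopsis get the embedding of the empty string, which keeps the matrix aligned with item indices; whether that is the right fallback depends on how the model uses these features.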

In [None]:
# Create model
model = NeuMFPlus(
    num_users=num_users,
    num_items=num_items,
    num_genres=num_genres,
)

param_count = sum(p.numel() for p in model.parameters())
print(f"\nModel: NeuMF+ (full model)")
print(f"Parameters: {param_count:,}")

In [None]:
# Train the model
history = train_model(
    model=model,
    train_users=train_users,
    train_items=train_items,
    val_data={
        'users': val_users,
        'items': val_items,
        'genre_features': val_genre_features,
        'synopsis_embeddings': synopsis_embeddings,
    },
    num_items=num_items,
    num_epochs=30,
    batch_size=512,  # Larger batch for Colab's T4 GPU
    learning_rate=1e-3,
    weight_decay=1e-5,
    num_negatives=4,
    device=device,  # use the device detected above; falls back to CPU without a GPU
    save_dir='./experiments/trained_models',
    early_stopping_patience=5,
    early_stopping_metric='hr@10',
    lr_scheduler_patience=3,
    lr_scheduler_factor=0.5,
    gradient_clip_max_norm=5.0,
    log_dir='./experiments/logs/tensorboard',
)
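The `early_stopping_metric='hr@10'` argument presumably refers to Hit Ratio at 10 as commonly used in NCF-style evaluation: a user counts as a hit when their held-out positive item appears among the top 10 ranked candidates. How `train_model` actually computes it is not shown in this notebook, so the following is only an illustrative sketch with a hypothetical helper name:

```python
def hit_ratio_at_k(ranked_lists, positives, k=10):
    """Fraction of users whose held-out positive item is in their top-k ranking."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1 for ranked, pos in zip(ranked_lists, positives) if pos in ranked[:k]
    )
    return hits / len(ranked_lists)


# Three users, each with candidate items ranked best-first;
# the held-out positives are 7, 3 and 9 respectively.
rankings = [[7, 1, 2], [4, 5, 3], [8, 6, 0]]
print(hit_ratio_at_k(rankings, [7, 3, 9], k=2))
```

Higher is better, which is why early stopping on this metric waits for it to stop improving rather than for a loss to stop decreasing.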

## 8. Monitor Training with TensorBoard

In [None]:
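# Run this cell before the training cell to watch metrics live;
# run afterwards, it shows the logs of the finished run.
%load_ext tensorboard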
%tensorboard --logdir experiments/logs/tensorboard --port 6006

## 9. Download Trained Models

In [None]:
# List trained models
!ls -lh experiments/trained_models/

# Download the best model
from google.colab import files

# Uncomment to download
# files.download('experiments/trained_models/NeuMFPlus_best.pt')
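When several checkpoints are saved, downloading them one at a time gets tedious. A minimal sketch that bundles the whole model directory into one zip archive using the standard library; the helper name `archive_models` and the archive name are assumptions, not part of the repository:

```python
import shutil
from pathlib import Path


def archive_models(model_dir: str = "experiments/trained_models",
                   archive_base: str = "trained_models") -> str:
    """Zip model_dir into <archive_base>.zip and return the archive path."""
    if not Path(model_dir).is_dir():
        raise FileNotFoundError(f"No such directory: {model_dir}")
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)


# In Colab, after training:
# from google.colab import files
# files.download(archive_models())
```

The archive is written outside `model_dir` by default, so it does not end up inside itself on a second run.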

## 10. (Optional) Push Results Back to GitHub

**If you want to save your trained models and logs back to GitHub:**

Note: This requires setting up GitHub credentials or using a personal access token.

In [None]:
# Configure git (if needed)
!git config user.email "your.email@example.com"
!git config user.name "Your Name"

# Check git status
!git status

# Add new files (data/, experiments/)
# !git add data/ experiments/

# Commit
# !git commit -m "Add trained models and processed data"

# Push (requires GitHub auth)
# !git push origin main

print("\nNote: To push to GitHub, you need to:")
print("1. Set up a personal access token, or")
print("2. Use SSH keys, or")
print("3. Run 'git push' in a terminal where you're authenticated")
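For the personal access token route, one common pattern is to embed the token in the HTTPS URL for a single push instead of saving it in the git config. A hedged sketch, assuming a token with write access to the repository; the helper name `authenticated_remote` is hypothetical:

```python
from urllib.parse import quote


def authenticated_remote(username: str, repo_name: str, token: str) -> str:
    """Build an HTTPS remote URL that embeds a personal access token."""
    if not token:
        raise ValueError("A personal access token is required.")
    # Percent-encode the credentials so special characters survive in the URL
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@github.com/{username}/{repo_name}.git"


# In Colab, read the token without echoing it, then push to the URL directly:
# from getpass import getpass
# remote = authenticated_remote(GITHUB_USERNAME, REPO_NAME, getpass("GitHub token: "))
# !git push {remote} main
```

Because the URL contains the token, clear the cell output afterwards and avoid committing the notebook with that output. Note also that GitHub rejects individual files larger than 100 MB, which may affect large model checkpoints.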

## Complete!

You've trained a NeuMF+ model in Colab using code from GitHub.

**Next steps:**
1. Download the trained model
2. Copy it into `experiments/trained_models/` in your local WSL2 checkout of the repo
3. Run inference locally using the trained model