# Extract Synopsis Embeddings for NeuMF+

This notebook extracts **synopsis embeddings** using Sentence-BERT.

**What it does:**
1. Loads movie overviews from `movies_metadata.csv`
2. Encodes each overview as a 384-dimensional vector using Sentence-BERT (`all-MiniLM-L6-v2`)
3. Saves embeddings for use in NeuMF+ training

**Why this matters:**
- NeuMF+ with genre features only: HR@10 of roughly 0.08-0.15
- NeuMF+ with genre + synopsis features: HR@10 of roughly 0.12-0.20 (higher is better)
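
HR@10 (hit ratio at 10) is the fraction of test users whose held-out movie appears among the model's top 10 recommendations. The following is a minimal sketch of the metric, assuming one held-out item per user; the function name and the toy data are illustrative and not taken from the repository:

```python
def hit_ratio_at_k(ranked_items, held_out, k=10):
    """Fraction of users whose held-out item is in their top-k ranked list.

    ranked_items: dict mapping user -> list of item ids, best first.
    held_out: dict mapping user -> the single held-out item id.
    """
    if not held_out:
        return 0.0
    hits = sum(
        1
        for user, item in held_out.items()
        if item in ranked_items.get(user, [])[:k]
    )
    return hits / len(held_out)


# Toy data (hypothetical): three users, each with a ranked list of three items.
ranked = {"u1": [5, 9, 2], "u2": [7, 1, 3], "u3": [4, 8, 6]}
truth = {"u1": 9, "u2": 3, "u3": 10}
print(hit_ratio_at_k(ranked, truth, k=2))  # only u1 has a hit within the top 2
```

An HR@10 of 0.15 therefore means that the held-out movie made the top 10 for about 15% of test users.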

**Prerequisites:**
- Google Drive folder `MyDrive/NCF-Movie-Recommender/datasets/` containing `movies_metadata.csv` and `links.csv`
- Colab Pro (T4 GPU recommended)
- Estimated time: 30-60 minutes

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("\n✓ Google Drive mounted!")

## Step 2: Clone and Setup

In [None]:
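# Clone the repository (skipped if a previous run of this cell already cloned it)
import os

REPO_DIR = '/content/NCF-Movie-Recommender'
if not os.path.isdir(REPO_DIR):
    !git clone https://github.com/albertabayor/NCF-Movie-Recommender.git {REPO_DIR}
os.chdir(REPO_DIR)
!git pull origin main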

# Link datasets from Drive
!rm -rf datasets
!ln -s "/content/drive/MyDrive/NCF-Movie-Recommender/datasets" datasets

# Link data folder
!mkdir -p /content/drive/MyDrive/NCF-Movie-Recommender/data
!rm -rf data
!ln -s /content/drive/MyDrive/NCF-Movie-Recommender/data data

# Install dependencies
!pip install -q sentence-transformers tqdm

print("\n✓ Setup complete!")

## Step 3: Verify Dataset Files

In [None]:
import os

# Check required files
required_files = ['movies_metadata.csv', 'links.csv']
datasets_path = 'datasets'

print("Checking dataset files...\n")
all_exist = True
for filename in required_files:
    filepath = os.path.join(datasets_path, filename)
    if os.path.exists(filepath):
        size = os.path.getsize(filepath) / 1024 / 1024
        print(f"✓ {filename}: {size:.1f} MB")
    else:
        print(f"❌ {filename}: NOT FOUND")
        all_exist = False

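if all_exist:
    print("\n✓ All required files found!")
else:
    # Stop here so "Run all" does not continue into the long extraction step
    raise FileNotFoundError(
        "Some files are missing. Upload them to "
        "MyDrive/NCF-Movie-Recommender/datasets on Google Drive first."
    )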

## Step 4: Extract Synopsis Embeddings

This will:
1. Load Sentence-BERT model (all-MiniLM-L6-v2)
2. Extract embeddings for ~45K movie overviews
3. Save to Google Drive for later use

**Estimated time:** 30-60 minutes on T4 GPU
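
The extraction itself lives in `extract_synopsis_embeddings.py`, which is not reproduced in this notebook. As a rough sketch of the steps listed above, the core of such a script could look like the code below. It assumes that missing overviews (which pandas reads as `NaN`) are encoded as empty strings; the helper names `prepare_overviews` and `encode_overviews` and the batch size are hypothetical.

```python
def prepare_overviews(raw_overviews):
    """Replace missing or non-string overviews with '' and trim whitespace."""
    return [t.strip() if isinstance(t, str) else "" for t in raw_overviews]


def encode_overviews(texts, model_name="all-MiniLM-L6-v2", batch_size=64):
    """Encode texts with Sentence-BERT; the default model yields 384-dim vectors."""
    # Imported here so prepare_overviews works without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # picks the GPU when one is available
    return model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
    )


# Missing CSV cells typically arrive as float NaN or None; both become "".
cleaned = prepare_overviews(["  A toy comes to life. ", float("nan"), None])
print(cleaned)
```

Calling `encode_overviews(cleaned)` would download the model on first use and return one row per input text.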

In [None]:
import sys
sys.path.insert(0, '.')

print("="*70)
print("EXTRACTING SYNOPSIS EMBEDDINGS")
print("="*70)
print("\nThis will take 30-60 minutes on T4 GPU")
print("\nPlease be patient, the notebook will continue running...")
print("\n" + "="*70)

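# Run the extraction script as if launched with `python extract_synopsis_embeddings.py`
import runpy
runpy.run_path('extract_synopsis_embeddings.py', run_name='__main__')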

## Step 5: Verify Extracted Embeddings

In [None]:
import numpy as np
import pickle

# Load embeddings
embeddings = np.load('data/synopsis_embeddings.npy')

# Load metadata
with open('data/synopsis_metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)

print("="*70)
print("VERIFICATION - SYNOPSIS EMBEDDINGS")
print("="*70)
print(f"\nShape: {embeddings.shape}")
print(f"  Movies: {metadata['num_movies']:,}")
print(f"  Embedding dim: {metadata['embedding_dim']}")
print(f"  Memory: {embeddings.nbytes / 1024 / 1024:.1f} MB")

# Show sample
print("\nSample embeddings (first 3 movies):")
for i in range(min(3, len(embeddings))):
    emb = embeddings[i]
    print(f"  Movie {i}: shape={emb.shape}, mean={emb.mean():.4f}, std={emb.std():.4f}")

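# Sanity checks: fail loudly instead of reporting success unconditionally
assert embeddings.ndim == 2, "expected a 2-D array (movies x embedding dim)"
assert embeddings.shape == (metadata['num_movies'], metadata['embedding_dim']), \
    "embedding shape does not match the saved metadata"
assert np.isfinite(embeddings).all(), "embeddings contain NaN or inf values"

print("\n" + "="*70)
print("✓ Embeddings look good!")
print("="*70)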

## Summary

**What happened:**
1. ✅ Loaded Sentence-BERT model
2. ✅ Extracted embeddings for ~45K movie overviews
3. ✅ Saved embeddings to Google Drive

**Files created:**
- `data/synopsis_embeddings.npy` (float array of shape ~45K × 384, one row per movie)
- `data/synopsis_metadata.pkl` (metadata and mappings)
- `data/tmdb_to_movieid.pkl` (TMDB ID → MovieLens ID mapping)
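
How `tmdb_to_movieid.pkl` is built is not shown in this notebook. The sketch below is one way to derive such a mapping. It assumes `links.csv` uses the standard MovieLens columns `movieId`, `imdbId`, and `tmdbId`, and that rows with a blank `tmdbId` are skipped. The function name and the in-memory sample rows are illustrative only:

```python
import csv
import io


def build_tmdb_to_movieid(csv_file):
    """Map TMDB id -> MovieLens movieId, skipping rows without a TMDB id."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        tmdb_id = (row.get("tmdbId") or "").strip()
        if not tmdb_id:
            continue
        # float() first tolerates ids that were written as "862.0".
        mapping[int(float(tmdb_id))] = int(row["movieId"])
    return mapping


# In-memory stand-in for datasets/links.csv; the third row has no TMDB id.
sample = io.StringIO(
    "movieId,imdbId,tmdbId\n"
    "1,0114709,862\n"
    "2,0113497,8844\n"
    "3,0000000,\n"
)
tmdb_to_movieid = build_tmdb_to_movieid(sample)
print(tmdb_to_movieid)
```

With the real file, the same function could be called as `build_tmdb_to_movieid(open('datasets/links.csv', newline=''))`.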

**Next steps:**
1. Update preprocessing to link synopsis embeddings to movieIds
2. Train NeuMF+ with `use_synopsis=True`
3. Enjoy improved recommendations with genre + synopsis!
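
For the first of these steps, one possible approach is to reorder the embedding rows so that row `i` matches the `i`-th movie in the model's item index, leaving zeros for movies without a synopsis. The sketch below assumes the TMDB id of each embedding row is available in row order (for example from `synopsis_metadata.pkl`, whose exact layout is not documented here). The names `align_embeddings`, `row_tmdb_ids`, and `movie_ids` are hypothetical:

```python
import numpy as np


def align_embeddings(embeddings, row_tmdb_ids, tmdb_to_movieid, movie_ids):
    """Return a (len(movie_ids), dim) matrix whose row i belongs to movie_ids[i].

    Movies without a synopsis embedding keep an all-zero row.
    """
    # movieId -> row in the original embedding matrix
    row_of_movie = {
        tmdb_to_movieid[tmdb_id]: row
        for row, tmdb_id in enumerate(row_tmdb_ids)
        if tmdb_id in tmdb_to_movieid
    }
    aligned = np.zeros((len(movie_ids), embeddings.shape[1]), dtype=embeddings.dtype)
    for i, movie_id in enumerate(movie_ids):
        row = row_of_movie.get(movie_id)
        if row is not None:
            aligned[i] = embeddings[row]
    return aligned


# Toy example with 2-dimensional "embeddings" (the real ones have 384 dimensions).
emb = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
aligned = align_embeddings(
    emb,
    row_tmdb_ids=[862, 8844],
    tmdb_to_movieid={862: 1, 8844: 2},
    movie_ids=[2, 3, 1],  # movie 3 has no synopsis embedding
)
print(aligned)
```

A zero row is only one choice for missing synopses; a learned "unknown" vector or the mean embedding are alternatives worth comparing.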