# 🎬 Movie Discovery Assistant - Data Preparation

**Goal**: Prepare TMDb data for RAG by creating a vector database.

## What We'll Do
1. Load TMDb 5000 dataset from Kaggle
2. Parse JSON fields (genres, keywords, cast)
3. Create enriched "documents" for each movie
4. Generate embeddings using sentence-transformers
5. Store in ChromaDB for fast semantic search
6. Download the ChromaDB folder for local use

---

## 🎓 Key Concepts

### What is a "Document" in RAG?
For each movie, we create a text document that combines:
- **Title**: "Inception"
- **Plot**: "A thief who steals corporate secrets through dream-sharing..."
- **Genres**: "Action, Sci-Fi, Thriller"
- **Keywords**: "dream, subconscious, heist"

This becomes: `"Inception. A thief who steals corporate secrets... Action Sci-Fi Thriller dream subconscious heist"`

### Why Combine Fields?
The embedding model needs **context**. Just "Inception" doesn't tell us much. But the full document captures:
- Semantic meaning (what the movie is about)
- Genre signals (for filtering)
- Thematic keywords (for similarity)

### What are Embeddings?
Embeddings convert text to vectors (lists of numbers). Similar movies get similar vectors.

Example:
- "Inception" → [0.23, -0.45, 0.78, ...] (384 numbers)
- "Interstellar" → [0.25, -0.43, 0.81, ...] (very close!)
- "Toy Story" → [-0.12, 0.67, -0.34, ...] (far away)
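
"Close" and "far away" are usually measured with cosine similarity. A minimal sketch in plain Python, using only the first three illustrative numbers from the example above (they are not real model output, and `cosine_similarity` is a helper written here for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (illustrative values only)
inception = [0.23, -0.45, 0.78]
interstellar = [0.25, -0.43, 0.81]
toy_story = [-0.12, 0.67, -0.34]

sim_close = cosine_similarity(inception, interstellar)
sim_far = cosine_similarity(inception, toy_story)

print(f"Inception vs Interstellar: {sim_close:.3f}")  # close to 1
print(f"Inception vs Toy Story:    {sim_far:.3f}")    # negative
```

A value near 1 means the vectors point in almost the same direction; values near 0 or below mean the texts are unrelated.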

---

Let's start! 🚀

In [None]:
# ============================================================================
# Step 1: Install Required Libraries
# ============================================================================
# 🎓 CONCEPT: Dependencies
#
# - chromadb: Vector database (stores embeddings)
# - sentence-transformers: Creates embeddings locally
# - pandas: Data manipulation
#
# Why ChromaDB?
# - Works in-memory (no server setup)
# - Can persist to disk (save for later)
# - Built for embeddings (not like PostgreSQL)
# ============================================================================

!pip install -q chromadb sentence-transformers pandas numpy

print("✅ Libraries installed!")

In [None]:
# ============================================================================
# Step 2: Download TMDb Dataset from Kaggle API
# ============================================================================
# 🎓 CONCEPT: Kaggle API
#
# Kaggle provides an API to download datasets programmatically.
# You need to upload your kaggle.json credentials file first.
#
# How to get kaggle.json:
# 1. Go to kaggle.com → Account → Create New API Token
# 2. Download kaggle.json
# 3. Upload it to Colab when prompted below
# ============================================================================

from google.colab import files
import os

# Upload kaggle.json
print("📁 Please upload your kaggle.json file:")
uploaded = files.upload()

# Set up Kaggle credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download TMDb dataset
print("⬇️ Downloading TMDb 5000 dataset...")
!kaggle datasets download -d tmdb/tmdb-movie-metadata
!unzip -q tmdb-movie-metadata.zip

print("✅ Dataset downloaded!")
!ls -lh *.csv

In [None]:
# ============================================================================
# Step 3: Load and Explore the Data
# ============================================================================
# 🎓 CONCEPT: Data Loading
#
# TMDb provides two CSVs:
# - tmdb_5000_movies.csv: Movie metadata (title, plot, budget, revenue)
# - tmdb_5000_credits.csv: Cast and crew (actors, directors)
#
# We'll merge these in Step 5 to get complete information.
# ============================================================================

import pandas as pd
import json
from ast import literal_eval

# Load datasets
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

print(f"📊 Loaded {len(movies)} movies")
print(f"📊 Loaded {len(credits)} credit records\n")

# Preview the data
print("🔍 Movie columns:")
print(movies.columns.tolist())
print("\n🔍 First movies:")
movies.head(2)

In [None]:
# ============================================================================
# Step 4: Parse JSON Fields
# ============================================================================
# 🎓 CONCEPT: JSON Parsing in Pandas
#
# TMDb stores complex fields as JSON strings:
# genres: '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
#
# We need to:
# 1. Parse the JSON string → Python list
# 2. Extract just the 'name' field
# 3. Keep the names as a list (joined into text later, in Step 6)
#
# Why?
# The embedding model works better with text than structured data.
# ============================================================================

def parse_json_field(field, key='name', limit=5):
    """
    Parse a JSON field and extract specific key.
    
    Args:
        field: JSON string like '[{"name": "Action"}]'
        key: Which key to extract (default: 'name')
        limit: Maximum items to extract
    
    Returns:
        List of strings: ['Action', 'Adventure']
    """
    if pd.isna(field):
        return []
    try:
        parsed = json.loads(field)  # Fields are JSON strings, so use the JSON parser
        return [item[key] for item in parsed[:limit]]
    except (ValueError, TypeError, KeyError):
        return []  # Malformed or unexpected data: treat as empty

# Parse genres
movies['genres_list'] = movies['genres'].apply(parse_json_field)

# Parse keywords
movies['keywords_list'] = movies['keywords'].apply(parse_json_field, limit=10)

# Parse production companies
movies['companies_list'] = movies['production_companies'].apply(parse_json_field, limit=3)

print("✅ Parsed JSON fields!")
print("\n🔍 Example:")
print(f"Title: {movies.iloc[0]['title']}")
print(f"Genres: {movies.iloc[0]['genres_list']}")
print(f"Keywords: {movies.iloc[0]['keywords_list']}")
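
The parsing idea can also be tried on its own, outside pandas. A minimal standalone sketch using the standard `json` module, which also accepts JSON-only tokens such as `null`; `extract_names` is a hypothetical helper name, and the sample string mirrors the genres example above:

```python
import json

def extract_names(raw, key="name", limit=5):
    """Return up to `limit` values of `key` from a JSON array of objects."""
    if not isinstance(raw, str):
        return []
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    # Skip entries that are not objects or lack the requested key
    return [item[key] for item in items[:limit]
            if isinstance(item, dict) and key in item]

raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(extract_names(raw_genres))           # ['Action', 'Adventure']
print(extract_names(raw_genres, limit=1))  # ['Action']
print(extract_names("not json"))           # []
```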

In [None]:
# ============================================================================
# Step 5: Merge with Credits Data
# ============================================================================
# 🎓 CONCEPT: Data Merging
#
# We merge movies + credits to get:
# - Cast (top 5 actors)
# - Director
#
# This allows queries like:
# "Movies with Tom Hanks"
# "Christopher Nolan films"
# ============================================================================

# Parse cast (top 5 actors)
credits['cast_list'] = credits['cast'].apply(parse_json_field, limit=5)

# Parse crew to get director
def get_director(crew_str):
    """Extract the first crew member whose job is 'Director'."""
    if pd.isna(crew_str):
        return None
    try:
        crew = json.loads(crew_str)
        for person in crew:
            if person.get('job') == 'Director':
                return person.get('name')
    except (ValueError, TypeError, AttributeError):
        pass  # Malformed crew data: treat as no director
    return None

credits['director'] = credits['crew'].apply(get_director)

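# Merge on the TMDb movie ID. Titles are not unique, so merging on
# 'title' would duplicate rows for films that share a name.
movies_full = movies.merge(
    credits[['movie_id', 'cast_list', 'director']],
    left_on='id',
    right_on='movie_id',
    how='left'
).drop(columns='movie_id')

print(f"✅ Merged data: {len(movies_full)} movies with cast & crew")
print("\n🔍 Example:")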
sample = movies_full.iloc[0]
print(f"Title: {sample['title']}")
print(f"Director: {sample['director']}")
print(f"Cast: {sample['cast_list']}")
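
Whichever key a merge uses, it is worth checking that the key is unique. A minimal sketch in plain Python of how a left join on a non-unique key multiplies rows; `left_join` is a hypothetical helper and the rows are made up for illustration:

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Naive left join: one output row per matching pair, or the left row alone."""
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if r[right_key] == left[left_key]]
        if not matches:
            joined.append({**left})
        for right in matches:
            joined.append({**left, **right})
    return joined

# Hypothetical rows: two different films that share a title
movie_rows = [
    {"id": 1, "title": "The Host"},
    {"id": 2, "title": "The Host"},
]
credit_rows = [
    {"movie_id": 1, "title": "The Host", "director": "Director A"},
    {"movie_id": 2, "title": "The Host", "director": "Director B"},
]

by_title = left_join(movie_rows, credit_rows, "title", "title")
by_id = left_join(movie_rows, credit_rows, "id", "movie_id")

print(len(by_title))  # 4: every movie row matched both credit rows
print(len(by_id))     # 2: one credit row per movie
```

Comparing the row count before and after a merge is a quick way to spot this kind of duplication.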

In [None]:
# ============================================================================
# Step 6: Create Enriched Documents
# ============================================================================
# 🎓 CONCEPT: Document Construction for RAG
#
# Each movie becomes a single text document with ALL relevant info.
#
# Structure:
# Title: [Movie Name]
# Plot: [Overview]
# Genres: [Genre1, Genre2]
# Keywords: [Keyword1, Keyword2]
# Director: [Name]
# Cast: [Actor1, Actor2]
#
# Why this format?
# - Clear section headers help the LLM understand context
# - Natural language (not just concatenated words)
# - Embeddings capture both semantic and structural info
# ============================================================================

def create_movie_document(row):
    """
    Create a rich text document for a movie.
    
    This becomes the "content" that gets embedded.
    """
    parts = []
    
    # Title (always include)
    parts.append(f"Title: {row['title']}")
    
    # Plot/Overview
    if pd.notna(row['overview']) and row['overview'].strip():
        parts.append(f"Plot: {row['overview']}")
    
    # Genres
    if row['genres_list']:
        parts.append(f"Genres: {', '.join(row['genres_list'])}")
    
    # Keywords
    if row['keywords_list']:
        parts.append(f"Keywords: {', '.join(row['keywords_list'])}")
    
    # Director
    if pd.notna(row['director']):
        parts.append(f"Director: {row['director']}")
    
    # Cast
    if row['cast_list']:
        parts.append(f"Cast: {', '.join(row['cast_list'])}")
    
    # Release year (helpful for filtering)
    if pd.notna(row['release_date']):
        year = row['release_date'][:4]
        parts.append(f"Year: {year}")
    
    return "\n".join(parts)

# Create documents
movies_full['document'] = movies_full.apply(create_movie_document, axis=1)

# Keep only movies with a non-empty overview (the plot carries most of the meaning)
has_overview = movies_full['overview'].fillna('').str.strip().ne('')
movies_clean = movies_full[has_overview].copy()
movies_clean = movies_clean.reset_index(drop=True)

print(f"✅ Created {len(movies_clean)} documents\n")
print("🔍 Example document:")
print("=" * 60)
print(movies_clean.iloc[0]['document'])
print("=" * 60)

In [None]:
# ============================================================================
# Step 7: Prepare Metadata for ChromaDB
# ============================================================================
# 🎓 CONCEPT: Metadata in Vector Databases
#
# ChromaDB stores:
# 1. Document text (the enriched string)
# 2. Embedding vector (384 numbers)
# 3. Metadata (structured info for filtering)
#
# Metadata allows queries like:
# "Recommend a horror movie from 2015-2020"
#
# ChromaDB applies the metadata filter as part of the query, so results
# are restricted to matching movies without a separate post-filtering step.
# ============================================================================

def create_metadata(row):
    """
    Extract metadata for filtering.
    
    Note: ChromaDB only supports:
    - Strings
    - Numbers (int, float)
    - Booleans
    
    No lists/dicts directly. We'll convert lists to comma-separated strings.
    """
    meta = {
        'title': row['title'],
        'movie_id': str(row['id']),
    }
    
    # Genres (as string for filtering)
    if row['genres_list']:
        meta['genres'] = ','.join(row['genres_list'])
    
    # Year (as integer for range filtering)
    if pd.notna(row['release_date']):
        try:
            meta['year'] = int(row['release_date'][:4])
        except (ValueError, TypeError):
            pass  # Skip malformed dates
    
    # Rating
    if pd.notna(row['vote_average']):
        meta['rating'] = float(row['vote_average'])
    
    # Director
    if pd.notna(row['director']):
        meta['director'] = row['director']
    
    return meta

# Create metadata for all movies
metadatas = movies_clean.apply(create_metadata, axis=1).tolist()

print("✅ Created metadata for all movies\n")
print("🔍 Example metadata:")
print(metadatas[0])
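
A query such as "from 2015-2020" maps onto a `where` clause on the integer `year` field. The sketch below builds such a clause with ChromaDB's `$and`, `$gte` and `$lte` operators and checks it against sample metadata using a tiny local evaluator. Both `year_range_filter` and `matches` are hypothetical helpers written for illustration; `matches` only mimics the two comparison operators used here and is not how ChromaDB evaluates filters internally.

```python
def year_range_filter(start_year, end_year):
    """Build a ChromaDB-style `where` clause for start_year <= year <= end_year."""
    return {
        "$and": [
            {"year": {"$gte": start_year}},
            {"year": {"$lte": end_year}},
        ]
    }

def matches(meta, where):
    """Local stand-in evaluator for $and / $gte / $lte (illustration only)."""
    if "$and" in where:
        return all(matches(meta, clause) for clause in where["$and"])
    for field, condition in where.items():
        value = meta.get(field)
        if value is None:
            return False  # Missing field never matches
        for op, bound in condition.items():
            if op == "$gte" and not value >= bound:
                return False
            if op == "$lte" and not value <= bound:
                return False
    return True

where = year_range_filter(2015, 2020)

print(matches({"title": "Example A", "year": 2016}, where))  # True
print(matches({"title": "Example B", "year": 2010}, where))  # False
print(matches({"title": "Example C"}, where))                # False

# With a real collection, the same dict would be passed as:
# collection.query(query_embeddings=..., n_results=5, where=where)
```

Because genres are stored as one comma-separated string, an equality filter on `genres` would only match movies whose full genre string is identical, so range filters on `year` or `rating` are the more natural fit for this metadata layout.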

In [None]:
# ============================================================================
# Step 8: Initialize ChromaDB and Embedding Model
# ============================================================================
# 🎓 CONCEPT: Embedding Models
#
# all-MiniLM-L6-v2 is a sentence-transformer model:
# - Input: Text (by default truncated beyond 256 word pieces)
# - Output: 384-dimensional vector
#
# Why this model?
# - Fast: ~14,000 sentences/second in the sentence-transformers
#   benchmarks (measured on a V100 GPU; CPU is considerably slower)
# - Small: 80MB download
# - Quality: Trained on 1 billion sentence pairs
#
# Alternatives:
# - all-mpnet-base-v2: Better quality, slower (768-dim)
# - all-distilroberta-v1: Good balance (768-dim)
# ============================================================================

import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model
print("📊 Loading embedding model: all-MiniLM-L6-v2")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!\n")

# Initialize ChromaDB with on-disk persistence.
# PersistentClient (chromadb >= 0.4) writes to ./chroma_db automatically.
print("🗄️ Initializing ChromaDB...")
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Create or get collection
# If collection exists, delete it (fresh start)
try:
    chroma_client.delete_collection("movies")
except Exception:
    pass  # Collection did not exist yet

collection = chroma_client.create_collection(
    name="movies",
    metadata={
        "description": "TMDb 5000 movies for RAG",
        # Use cosine distance so that 1 - distance is a cosine similarity
        # (ChromaDB's default space is squared L2)
        "hnsw:space": "cosine",
    }
)

print("✅ ChromaDB initialized!")

In [None]:
# ============================================================================
# Step 9: Generate Embeddings and Ingest into ChromaDB
# ============================================================================
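# 🎓 CONCEPT: Batch Processing
#
# We have ~4800 movies. Encoding them one-by-one would be slow.
#
# Two batch sizes are in play:
# - BATCH_SIZE=100 (below): documents sent to ChromaDB per add() call
# - batch_size=32: encode()'s default internal batch for the model
#
# Batching lets the model process many documents per forward pass
# (a GPU can parallelize this, if available), which is much faster
# than one-at-a-time.
#
# Why not one batch of 4800?
# - Could exceed GPU memory
# - Larger batches give diminishing returns
#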
# This step takes ~2-3 minutes on Colab's free tier.
# ============================================================================

from tqdm import tqdm

print("🚀 Generating embeddings and ingesting into ChromaDB...")
print(f"   Total movies: {len(movies_clean)}")
print("   This will take ~2-3 minutes...\n")

# Batch size for ingestion
BATCH_SIZE = 100

documents = movies_clean['document'].tolist()
ids = [f"movie_{i}" for i in range(len(documents))]

# Process in batches
for i in tqdm(range(0, len(documents), BATCH_SIZE)):
    batch_docs = documents[i:i+BATCH_SIZE]
    batch_ids = ids[i:i+BATCH_SIZE]
    batch_meta = metadatas[i:i+BATCH_SIZE]
    
    # Generate embeddings
    embeddings = embedding_model.encode(
        batch_docs,
        convert_to_numpy=True,
        show_progress_bar=False
    ).tolist()
    
    # Add to ChromaDB
    collection.add(
        ids=batch_ids,
        documents=batch_docs,
        embeddings=embeddings,
        metadatas=batch_meta
    )

print(f"\n✅ Ingested {collection.count()} movies into ChromaDB!")
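
The slicing pattern in the loop above can be factored into a small generator. A minimal sketch, with `iter_batches` as a hypothetical helper name and placeholder strings standing in for the movie documents:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder documents (illustrative only)
docs = [f"doc_{i}" for i in range(250)]

sizes = [len(batch) for batch in iter_batches(docs, 100)]
print(sizes)  # [100, 100, 50]
```

The last batch is simply shorter, so no padding or special casing is needed.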

In [None]:
# ============================================================================
# Step 10: Test the Vector Database
# ============================================================================
# 🎓 CONCEPT: Semantic Search
#
# Let's test if our embeddings work!
#
# We'll query: "mind-bending sci-fi about dreams"
#
# The search works in three steps:
# 1. Embed the query → vector (with the same model as the documents)
# 2. Compare to the movie vectors (cosine similarity)
# 3. Return top-K most similar
#
# Expected result (may vary): Inception, Paprika, The Matrix, Interstellar
# ============================================================================

def test_query(query_text, n_results=5):
    """Test semantic search."""
    print(f"🔍 Query: '{query_text}'\n")
    
    # Embed query
    query_embedding = embedding_model.encode([query_text]).tolist()
    
    # Search ChromaDB
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )
    
    # Display results
    print("📋 Top Results:")
    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
        print(f"\n{i+1}. {meta['title']}")
        print(f"   Similarity: {1 - dist:.3f}")  # Convert distance to similarity
        print(f"   Year: {meta.get('year', 'N/A')} | Rating: {meta.get('rating', 'N/A')}")
        print(f"   Genres: {meta.get('genres', 'N/A')}")

# Test queries
test_query("mind-bending sci-fi about dreams")
print("\n" + "="*60 + "\n")
test_query("family-friendly animated adventure")
print("\n" + "="*60 + "\n")
test_query("dark thriller with a detective")
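
The embed, compare, return-top-K steps can be mimicked without a database. A brute-force sketch in plain Python, assuming made-up 3-dimensional vectors and a hypothetical `top_k` helper; ChromaDB itself uses an approximate index rather than scanning every vector like this:

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def top_k(query, vectors, k=2):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = []
    for name, vec in vectors.items():
        v = normalize(vec)
        # Dot product of unit vectors equals cosine similarity
        scored.append((name, sum(a * b for a, b in zip(q, v))))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up "movie" vectors (illustrative only)
vectors = {
    "dream heist": [0.9, 0.1, 0.0],
    "space epic": [0.7, 0.3, 0.1],
    "animated toys": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], vectors, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Here the query vector points along the first axis, so the two entries with the largest first component rank highest.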

In [None]:
# ============================================================================
# Step 11: Persist ChromaDB to Disk
# ============================================================================
# 🎓 CONCEPT: Persistence
#
# With a persistent client, ChromaDB writes to the 'chroma_db' folder
# as data is added. Older releases (before 0.4) also exposed an explicit
# persist() call, so we invoke it only when it exists.
#
# The 'chroma_db' folder contains:
# - The stored embeddings and documents
# - The vector index (for fast search)
# - Metadata (schemas, config)
#
# This folder is roughly tens of MB; the exact file layout depends on
# the ChromaDB version.
# ============================================================================

# Ensure data is persisted (skipped on versions that persist automatically)
if hasattr(chroma_client, "persist"):
    chroma_client.persist()

print("✅ ChromaDB persisted to ./chroma_db")
print("\n📁 Folder contents:")
!ls -lh chroma_db/

In [None]:
# ============================================================================
# Step 12: Download ChromaDB for Local Use
# ============================================================================
# 🎓 CONCEPT: Transferring Data from Colab
#
# We need to download the 'chroma_db' folder to use locally.
#
# Steps:
# 1. Zip the folder
# 2. Download the zip file
# 3. Extract it in your local project
# ============================================================================

# Zip the ChromaDB folder
!zip -r chroma_db.zip chroma_db/

print("📦 ChromaDB zipped!")
!ls -lh chroma_db.zip

# Download the file
print("\n⬇️ Downloading chroma_db.zip...")
files.download('chroma_db.zip')

print("\n✅ Download complete!")
print("\n📋 Next Steps:")
print("1. Extract chroma_db.zip in your local project")
print("2. Place it at: d:/PROJECTS/StreamSage/data/chroma_db")
print("3. We'll use this in the RAG service!")
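
The zip and extract steps can also be done from Python with the standard library, which is handy on machines without the `zip` command. A minimal sketch using a temporary folder and a dummy file in place of the real database; `zip_folder` is a hypothetical helper name:

```python
import shutil
import tempfile
from pathlib import Path

def zip_folder(folder, archive_stem):
    """Zip `folder` (keeping its top-level name) and return the archive path."""
    folder = Path(folder)
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(folder.parent), base_dir=folder.name
    )

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real chroma_db folder
    db_dir = Path(tmp) / "chroma_db"
    db_dir.mkdir()
    (db_dir / "example.bin").write_bytes(b"\x00\x01")

    archive = zip_folder(db_dir, Path(tmp) / "chroma_db")

    # Simulate the local side: extract into a fresh project folder
    restore_dir = Path(tmp) / "restored"
    shutil.unpack_archive(archive, restore_dir)
    restored_ok = (restore_dir / "chroma_db" / "example.bin").exists()

print(restored_ok)  # True
```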

---

## 🎉 Congratulations!

You've successfully:
1. ✅ Loaded TMDb data
2. ✅ Parsed complex JSON fields
3. ✅ Created enriched documents
4. ✅ Generated embeddings for 4800+ movies
5. ✅ Ingested into ChromaDB
6. ✅ Tested semantic search
7. ✅ Downloaded for local use

---

## 🎓 What You Learned

### 1. Document Construction
- Why we combine multiple fields into one text
- How structured metadata enables filtering
- Balance between context and noise

### 2. Embeddings
- What embeddings are (text → vectors)
- Why similar content gets similar vectors
- Trade-offs between model size and quality

### 3. Vector Databases
- How ChromaDB stores and searches embeddings
- Metadata filtering vs semantic search
- Batch processing for efficiency

---

## 🚀 Next Steps

Now that we have the vector database ready, we'll build the **RAG Service** locally:

1. Load this ChromaDB
2. Connect to Mistral LLM (via Ollama)
3. Create a FastAPI service
4. Test conversational recommendations

Let's do it! 🎬