# Movie Embeddings Creation for Semantic Search

**Project:** Albert - Intelligent SQL Query Agent  
**Purpose:** Create vector embeddings for semantic movie/show search  
**Author:** Vincent Lamy  
**Date:** 2025-11-10

---

## 📋 Objective

This notebook creates vector embeddings for all movies and TV shows across Netflix, Amazon Prime, and Disney+ databases to enable:

- **Semantic search:** "dark psychological thrillers"
- **Similarity search:** "movies like Inception"
- **Mood-based queries:** "heartwarming family films"
- **Style matching:** "intense crime dramas"

## 🎯 Process Overview

1. Load movies from SQLite databases
2. Create rich text representations (title + type + genres + description)
3. Generate embeddings using OpenAI API
4. Store in Chroma vector database
5. Test and validate semantic search

## 💰 Cost Estimation

- **Model:** `text-embedding-3-small` (1536 dimensions)
- **Records:** ~20,000 movies/shows
- **Estimated cost:** $0.02 - $0.05 USD (one-time)
- **Processing time:** 5-10 minutes
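
The estimate above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes roughly four characters per token (a common rule of thumb for English text, not an exact tokenizer count) and the $0.02 per 1M tokens price quoted for the model in this notebook. `estimate_embedding_cost` is a hypothetical helper, and the 300-character average is an assumed round figure.

```python
def estimate_embedding_cost(
    n_records: int,
    avg_chars: float,
    usd_per_million_tokens: float = 0.02,  # price quoted for text-embedding-3-small in this notebook
    chars_per_token: float = 4.0,          # assumption: rough rule of thumb, not a tokenizer count
) -> float:
    """Return an approximate one-time embedding cost in USD."""
    total_tokens = n_records * avg_chars / chars_per_token
    return total_tokens / 1_000_000 * usd_per_million_tokens


# ~20,000 records at an assumed ~300 characters each
cost = estimate_embedding_cost(n_records=20_000, avg_chars=300)
print(f"Estimated cost: ${cost:.3f}")
```

With these inputs the result falls inside the $0.02 - $0.05 range given above; a real tokenizer count would be needed for an exact figure.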

---

## 📦 Step 1: Setup & Dependencies

In [1]:
# Install required packages (run once)
# !pip install langchain-openai langchain-chroma pandas python-dotenv tqdm

In [2]:
import sqlite3
import pandas as pd
import os
from pathlib import Path
from dotenv import load_dotenv
from tqdm import tqdm
import json
from datetime import datetime

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load environment variables
load_dotenv()

# Verify API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("❌ OPENAI_API_KEY not found in .env file")

print("✅ Dependencies loaded successfully")
print(f"📁 Working directory: {Path.cwd()}")

✅ Dependencies loaded successfully
📁 Working directory: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\code


---

## 🗄️ Step 2: Load Data from SQLite Databases

We'll load all movies and TV shows from the three streaming platform databases.

In [3]:
# Define database paths (adjust if needed)
# Assuming notebook is in code/ folder and databases are in ../data/
DB_PATH = Path("../data")  # Adjust this path if needed

DATABASES = {
    "netflix": DB_PATH / "netflix.db",
    "amazon_prime": DB_PATH / "amazon_prime.db",
    "disney_plus": DB_PATH / "disney_plus.db"
}

# Verify databases exist
for platform, db_path in DATABASES.items():
    if db_path.exists():
        print(f"✅ Found {platform}: {db_path}")
    else:
        print(f"❌ Missing {platform}: {db_path}")

✅ Found netflix: ..\data\netflix.db
✅ Found amazon_prime: ..\data\amazon_prime.db
✅ Found disney_plus: ..\data\disney_plus.db


In [4]:
def load_database(db_path: Path, platform: str) -> pd.DataFrame:
    """Load all shows from a SQLite database and tag them with their platform."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            df = pd.read_sql_query("SELECT * FROM shows", conn)
        finally:
            conn.close()  # always release the connection, even if the query fails
        df['platform'] = platform
        print(f"📊 Loaded {len(df)} records from {platform}")
        return df
    except Exception as e:
        print(f"❌ Error loading {platform}: {e}")
        return pd.DataFrame()

# Load all databases
dataframes = []
for platform, db_path in DATABASES.items():
    if db_path.exists():
        df = load_database(db_path, platform)
        if not df.empty:
            dataframes.append(df)

# Combine all data
if dataframes:
    all_shows = pd.concat(dataframes, ignore_index=True)
    print(f"\n✅ Total records loaded: {len(all_shows)}")
else:
    raise ValueError("❌ No data loaded from databases")

📊 Loaded 8807 records from netflix
📊 Loaded 9668 records from amazon_prime
📊 Loaded 1450 records from disney_plus

✅ Total records loaded: 19925
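
`load_database` assumes every database has a `shows` table and reports a failure only as a printed message. One way to make that assumption explicit is to check the schema before querying. This is a minimal sketch using only the standard library; `table_exists` is a hypothetical helper, and the in-memory database stands in for the real files.

```python
import sqlite3
from contextlib import closing


def table_exists(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if `table` is present in the connected SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


# Demo on an in-memory database standing in for netflix.db etc.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE shows (show_id TEXT, title TEXT)")
    has_shows = table_exists(conn, "shows")
    has_movies = table_exists(conn, "movies")

print(has_shows, has_movies)
```

`contextlib.closing` closes the connection when the block exits, including when an exception is raised inside it.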


In [5]:
# Display sample data
print("\n📋 Sample Data:")
print("="*80)
all_shows.head(3)


📋 Sample Data:


| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | added_at | platform |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | | United States | 2021-09-25 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... | 2025-11-05 07:17:19 | netflix |
| 1 | s2 | TV Show | Blood & Water | | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... | 2025-11-05 07:17:19 | netflix |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | | 2021-09-24 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... | 2025-11-05 07:17:19 | netflix |


In [6]:
# Data overview
print("\n📊 Data Summary:")
print("="*80)
print(f"Total shows: {len(all_shows)}")
print(f"\nBy Platform:")
print(all_shows['platform'].value_counts())
print(f"\nBy Type:")
print(all_shows['type'].value_counts())
print(f"\nColumn Info:")
print(all_shows.info())


📊 Data Summary:
Total shows: 19925

By Platform:
platform
amazon_prime    9668
netflix         8807
disney_plus     1450
Name: count, dtype: int64

By Type:
type
Movie      14997
TV Show     4928
Name: count, dtype: int64

Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19925 entries, 0 to 19924
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       19925 non-null  object
 1   type          19925 non-null  object
 2   title         19925 non-null  object
 3   director      14736 non-null  object
 4   cast          17677 non-null  object
 5   country       9879 non-null   object
 6   date_added    10399 non-null  object
 7   release_year  19925 non-null  int64 
 8   rating        19581 non-null  object
 9   duration      19922 non-null  object
 10  listed_in     19925 non-null  object
 11  description   19925 non-null  object
 12  added_at      19925 non-null  object
 13  platform      19

---

## 📝 Step 3: Create Rich Text for Embeddings

We'll combine multiple fields to create semantically rich text:
- **Title** - The show name
- **Type** - Movie or TV Show
- **Genres** - From `listed_in` field
- **Description** - Plot summary

This allows the embeddings to capture both explicit (title, genre) and implicit (description) information.

In [7]:
def create_embedding_text(row) -> str:
    """
    Create rich text representation for embedding.
    Format: Title | Type | Genres | Description
    """
    text_parts = [
        f"Title: {row['title']}",
        f"Type: {row['type']}",
    ]
    
    # Add genres if available
    if pd.notna(row['listed_in']) and row['listed_in'].strip():
        text_parts.append(f"Genres: {row['listed_in']}")
    
    # Add description if available
    if pd.notna(row['description']) and row['description'].strip():
        text_parts.append(f"Description: {row['description']}")
    
    return " | ".join(text_parts)

# Create embedding text for all records
print("🔄 Creating embedding text...")
all_shows['embedding_text'] = all_shows.apply(create_embedding_text, axis=1)
print("✅ Embedding text created")

🔄 Creating embedding text...
✅ Embedding text created


In [8]:
# Analyze text quality
print("\n📊 Text Quality Analysis:")
print("="*50)

# Check for missing descriptions
has_description = all_shows['description'].notna() & (all_shows['description'].str.strip() != '')
coverage_pct = (has_description.sum() / len(all_shows)) * 100

print(f"Records with descriptions: {has_description.sum()} ({coverage_pct:.1f}%)")
print(f"Average text length: {all_shows['embedding_text'].str.len().mean():.0f} characters")
print(f"Min text length: {all_shows['embedding_text'].str.len().min()}")
print(f"Max text length: {all_shows['embedding_text'].str.len().max()}")

# Show examples
print("\n📝 Example Embedding Texts:")
print("="*50)
for i in range(min(3, len(all_shows))):
    print(f"\nExample {i+1}:")
    print(all_shows['embedding_text'].iloc[i])


📊 Text Quality Analysis:
Records with descriptions: 19925 (100.0%)
Average text length: 284 characters
Min text length: 66
Max text length: 1185

📝 Example Embedding Texts:

Example 1:
Title: Dick Johnson Is Dead | Type: Movie | Genres: Documentaries | Description: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.

Example 2:
Title: Blood & Water | Type: TV Show | Genres: International TV Shows, TV Dramas, TV Mysteries | Description: After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.

Example 3:
Title: Ganglands | Type: TV Show | Genres: Crime TV Shows, International TV Shows, TV Action & Adventure | Description: To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
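
To see how the format degrades when a field is missing, here is a pandas-free sketch of the same joining logic applied to plain dictionaries. `build_text` is a hypothetical stand-in for `create_embedding_text`; the first description is shortened and the second record is invented for illustration.

```python
def build_text(record: dict) -> str:
    """Join the available fields as 'Title | Type | Genres | Description'."""
    parts = [f"Title: {record['title']}", f"Type: {record['type']}"]
    # Optional fields are skipped when missing, None, non-string, or blank
    for label, key in (("Genres", "listed_in"), ("Description", "description")):
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            parts.append(f"{label}: {value.strip()}")
    return " | ".join(parts)


full = build_text({
    "title": "Ganglands",
    "type": "TV Show",
    "listed_in": "Crime TV Shows",
    "description": "A skilled thief is pulled into a turf war.",
})
sparse = build_text({"title": "Untitled Pilot", "type": "TV Show", "description": "  "})

print(full)
print(sparse)
```

The `isinstance` check also guards against non-string values such as a float `NaN`, on which `.strip()` would raise an `AttributeError`.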


---

## 🧹 Step 4: Data Cleaning & Preparation

Prepare metadata for vector storage. De-duplication is left disabled: `show_id` values appear to be assigned per source database (the first Netflix row is `s1`) and may repeat across platforms, so dropping duplicates on `show_id` alone could discard valid records. If de-duplication is needed, key it on `platform` and `show_id` together.

In [9]:
# De-duplication (disabled). Uncomment to drop repeated (platform, show_id) pairs.
# initial_count = len(all_shows)
# all_shows = all_shows.drop_duplicates(subset=['platform', 'show_id'], keep='first')
# print(f"🧹 Removed {initial_count - len(all_shows)} duplicate records")
# print(f"✅ Final record count: {len(all_shows)}")

In [10]:
def prepare_metadata(row) -> dict:
    """
    Prepare metadata dictionary for vector storage.
    This metadata will be stored with each embedding for filtering and retrieval.
    """
    return {
        "show_id": str(row['show_id']),
        "title": str(row['title']),
        "type": str(row['type']),
        "platform": str(row['platform']),
        "release_year": int(row['release_year']) if pd.notna(row['release_year']) else 0,
        "rating": str(row['rating']) if pd.notna(row['rating']) else "",
        "duration": str(row['duration']) if pd.notna(row['duration']) else "",
        "genres": str(row['listed_in']) if pd.notna(row['listed_in']) else "",
        "director": str(row['director']) if pd.notna(row['director']) else "",
        "cast": str(row['cast'])[:200] if pd.notna(row['cast']) else "",  # Truncate cast
        "country": str(row['country']) if pd.notna(row['country']) else ""
    }

# Prepare documents and metadata
print("🔄 Preparing documents and metadata...")
documents = all_shows['embedding_text'].tolist()
metadatas = [prepare_metadata(row) for _, row in all_shows.iterrows()]

print(f"✅ Prepared {len(documents)} documents with metadata")

🔄 Preparing documents and metadata...
✅ Prepared 19925 documents with metadata
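
If `show_id` values are assigned per source database, the same ID can appear on more than one platform (an assumption worth checking against your data). A platform-qualified key avoids such collisions and, if passed as `ids=` to `Chroma.from_texts`, should let a re-run overwrite existing entries rather than append duplicates. This sketch works on plain metadata dictionaries; `make_doc_id` and the sample records are illustrative.

```python
from collections import Counter


def make_doc_id(meta: dict) -> str:
    """Build a platform-qualified ID such as 'netflix:s1'."""
    return f"{meta['platform']}:{meta['show_id']}"


# Illustrative metadata; the real list comes from prepare_metadata()
sample_metadatas = [
    {"platform": "netflix", "show_id": "s1", "title": "Dick Johnson Is Dead"},
    {"platform": "amazon_prime", "show_id": "s1", "title": "Hypothetical Title"},
    {"platform": "netflix", "show_id": "s2", "title": "Blood & Water"},
]

ids = [make_doc_id(m) for m in sample_metadatas]
duplicates = [doc_id for doc_id, n in Counter(ids).items() if n > 1]

print(ids)
print("Duplicate IDs:", duplicates)
```

An empty `duplicates` list means the composite key is unique for the sample; running the same check over the full `metadatas` list would confirm it for the whole dataset.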


---

## 🚀 Step 5: Generate Embeddings & Create Vector Store

This is the main step where we:
1. Initialize OpenAI embeddings model
2. Generate embeddings for all documents
3. Store in Chroma vector database

⚠️ **Note:** This step calls the OpenAI API and may take several minutes. The estimate was 5-10 minutes; the run recorded below finished in about 1.6 minutes.

In [11]:
# Initialize embeddings model
print("🔄 Initializing OpenAI Embeddings model...")
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dimensions, fast and cheap
    api_key=OPENAI_API_KEY
)

print("✅ Embeddings model initialized")
print(f"   Model: text-embedding-3-small")
print(f"   Dimensions: 1536")
print(f"   Cost per 1M tokens: ~$0.02")

🔄 Initializing OpenAI Embeddings model...
✅ Embeddings model initialized
   Model: text-embedding-3-small
   Dimensions: 1536
   Cost per 1M tokens: ~$0.02


In [26]:
from pathlib import Path

# Get the current working directory
CURRENT_DIR = Path.cwd()

# Path to the "data/chroma_db" folder, relative to the project root
PROJECT_ROOT = CURRENT_DIR.parent  # One level up from "code/"
CHROMA_DIR = PROJECT_ROOT / "data" / "chroma_db"

# Create the folder if it does not exist
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

print(f"📁 Vector store will be saved to: {CHROMA_DIR.absolute()}")


📁 Vector store will be saved to: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\data\chroma_db


In [27]:
# Create vector store with progress tracking
print("\n🚀 Creating vector store...")
print("   This will take 5-10 minutes. Please wait...\n")

start_time = datetime.now()

try:
    vectorstore = Chroma.from_texts(
        texts=documents,
        embedding=embeddings,
        metadatas=metadatas,
        persist_directory=str(CHROMA_DIR),
        collection_name="movies_shows"
    )
    
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()
    
    print(f"\n✅ Vector store created successfully!")
    print(f"   Records embedded: {vectorstore._collection.count()}")
    print(f"   Time taken: {duration:.1f} seconds ({duration/60:.1f} minutes)")
    print(f"   Storage location: {CHROMA_DIR.absolute()}")
    
except Exception as e:
    print(f"❌ Error creating vector store: {e}")
    raise


🚀 Creating vector store...
   This will take 5-10 minutes. Please wait...


✅ Vector store created successfully!
   Records embedded: 19925
   Time taken: 96.2 seconds (1.6 minutes)
   Storage location: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\data\chroma_db
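
`Chroma.from_texts` embeds everything in one call, so a failure partway through gives no indication of progress. An alternative is to feed the store in fixed-size batches. The sketch below shows only the batching logic: `iter_batches` and the batch size of 500 are assumptions, and the commented line marks where a call such as `vectorstore.add_texts(...)` would go.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    texts: Sequence[str], metas: Sequence[dict], batch_size: int = 500
) -> Iterator[Tuple[List[str], List[dict]]]:
    """Yield aligned (texts, metadatas) slices of at most `batch_size` items."""
    if len(texts) != len(metas):
        raise ValueError("texts and metas must have the same length")
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        yield list(texts[start:end]), list(metas[start:end])


# Toy data standing in for `documents` and `metadatas`
toy_texts = [f"doc {i}" for i in range(1200)]
toy_metas = [{"idx": i} for i in range(1200)]

batch_sizes = []
for batch_texts, batch_metas in iter_batches(toy_texts, toy_metas):
    # In the notebook this is where each batch would be embedded, e.g.
    # vectorstore.add_texts(texts=batch_texts, metadatas=batch_metas)
    batch_sizes.append(len(batch_texts))

print(batch_sizes)
```

Wrapping the loop in `tqdm`, which the notebook already imports, would add a progress bar.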


---

## 🧪 Step 6: Test Semantic Search

Let's test the semantic search capabilities with various query types.

In [29]:
def test_semantic_search(query: str, k: int = 5, filters: dict = None):
    """
    Test semantic search with a query.
    
    Args:
        query: Search query
        k: Number of results to return
        filters: Optional metadata filters (e.g., {"platform": "netflix"})
    """
    print(f"\n🔍 Query: '{query}'")
    if filters:
        print(f"   Filters: {filters}")
    print("-" * 80)
    
    try:
        if filters:
            results = vectorstore.similarity_search(query, k=k, filter=filters)
        else:
            results = vectorstore.similarity_search(query, k=k)
        
        for i, doc in enumerate(results, 1):
            metadata = doc.metadata
            print(f"\n{i}. {metadata['title']} ({metadata['release_year']})")
            print(f"   Type: {metadata['type']} | Platform: {metadata['platform']}")
            print(f"   Genres: {metadata['genres'][:60]}..." if len(metadata['genres']) > 60 else f"   Genres: {metadata['genres']}")
            if metadata['director']:
                print(f"   Director: {metadata['director']}")
        
        return results
    
    except Exception as e:
        print(f"❌ Search error: {e}")
        return []
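
Conceptually, `similarity_search` embeds the query and returns the stored vectors closest to it. The toy example below ranks three hand-made 3-dimensional vectors by cosine similarity. It illustrates the idea only and is not how Chroma works internally: the real vectors have 1536 dimensions, the store uses an approximate nearest-neighbour index, and the distance metric depends on the collection's configuration. All vectors here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k titles whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda title: cosine_similarity(query_vec, store[title]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors: axes loosely mean (thriller, family, comedy)
toy_store = {
    "Dark Crimes": [0.9, 0.0, 0.1],
    "From Our Family to Yours": [0.0, 0.9, 0.3],
    "Valentine's Day": [0.1, 0.3, 0.9],
}
query = [1.0, 0.1, 0.0]  # a "dark thriller"-like query

nearest = top_k(query, toy_store, k=2)
print(nearest)
```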

### Test Case 1: Mood-Based Queries

In [30]:
# Test various mood-based queries
mood_queries = [
    "dark psychological thrillers with twist endings",
    "heartwarming family movies",
    "intense crime dramas"
]

for query in mood_queries:
    test_semantic_search(query, k=3)


🔍 Query: 'dark psychological thrillers with twist endings'
--------------------------------------------------------------------------------

1. Dark Crimes (2016)
   Type: Movie | Platform: netflix
   Genres: Dramas, Thrillers
   Director: Alexandros Avranas

2. A Kind of Murder (2016)
   Type: Movie | Platform: netflix
   Genres: Thrillers
   Director: Andy Goddard

3. Face your Fears | Thriller shorts for Adults (2020)
   Type: Movie | Platform: amazon_prime
   Genres: Horror, Suspense
   Director: Vanessa Gazy,  Jeremy Robbins,  Nicholas Verso,  Shawn Thompson

🔍 Query: 'heartwarming family movies'
--------------------------------------------------------------------------------

1. From Our Family to Yours (2020)
   Type: Movie | Platform: disney_plus
   Genres: Animation, Family
   Director: Angela Affinita

2. A Family Reunion Christmas (2019)
   Type: Movie | Platform: netflix
   Genres: Children & Family Movies, Comedies
   Director: Robbie Countryman

3. The Family Tree 

### Test Case 2: Similarity Queries

In [31]:
# Test similarity searches
similarity_queries = [
    "movies like Inception",
    "shows similar to Stranger Things",
    "films like The Dark Knight"
]

for query in similarity_queries:
    test_semantic_search(query, k=3)


🔍 Query: 'movies like Inception'
--------------------------------------------------------------------------------

1. Inception (2010)
   Type: Movie | Platform: netflix
   Genres: Action & Adventure, Sci-Fi & Fantasy, Thrillers
   Director: Christopher Nolan

2. Inconceivable (2017)
   Type: Movie | Platform: amazon_prime
   Genres: Suspense
   Director: Jonathan Baker

3. In Paradox (2019)
   Type: Movie | Platform: netflix
   Genres: International Movies, Sci-Fi & Fantasy, Thrillers
   Director: Hamad AlSarraf

🔍 Query: 'shows similar to Stranger Things'
--------------------------------------------------------------------------------

1. Stranger Things (2019)
   Type: TV Show | Platform: netflix
   Genres: TV Horror, TV Mysteries, TV Sci-Fi & Fantasy

2. Beyond Stranger Things (2017)
   Type: TV Show | Platform: netflix
   Genres: Stand-Up Comedy & Talk Shows, TV Mysteries, TV Sci-Fi & Fant...

3. Tales From the Stranger Side (2021)
   Type: TV Show | Platform: amazon_prime


### Test Case 3: Genre/Style Queries

In [33]:
# Test genre and style queries
style_queries = [
    "funny animated shows for kids",
    "romantic comedies",
    "sci-fi space adventures"
]

for query in style_queries:
    test_semantic_search(query, k=3)


🔍 Query: 'funny animated shows for kids'
--------------------------------------------------------------------------------

1. Moral stories and more shows (2021)
   Type: Movie | Platform: amazon_prime
   Genres: Animation, Anime

2. LooLoo Kids (2021)
   Type: TV Show | Platform: amazon_prime
   Genres: Animation, Kids

3. Steve and Maggie - Funny Friends (2021)
   Type: TV Show | Platform: amazon_prime
   Genres: Animation, Kids

🔍 Query: 'romantic comedies'
--------------------------------------------------------------------------------

1. Mr. Romantic (2009)
   Type: Movie | Platform: netflix
   Genres: Comedies, International Movies, Romantic Movies
   Director: Ahmed Al-Badry

2. Valentine's Day (2010)
   Type: Movie | Platform: netflix
   Genres: Comedies, Romantic Movies
   Director: Garry Marshall

3. Romantik Komedi (2010)
   Type: Movie | Platform: netflix
   Genres: Comedies, International Movies, Romantic Movies
   Director: Ketche

🔍 Query: 'sci-fi space advent

### Test Case 4: Filtered Search

Test search restricted by metadata filters, here by platform and by content type.

In [34]:
# Test with platform filter
test_semantic_search(
    "action movies",
    k=5,
    filters={"platform": "netflix"}
)


🔍 Query: 'action movies'
   Filters: {'platform': 'netflix'}
--------------------------------------------------------------------------------

1. Xtreme (2021)
   Type: Movie | Platform: netflix
   Genres: Action & Adventure, International Movies
   Director: Daniel Benmayor

2. XXx (2002)
   Type: Movie | Platform: netflix
   Genres: Action & Adventure, Sports Movies
   Director: Rob Cohen

3. Acts of Violence (2018)
   Type: Movie | Platform: netflix
   Genres: Action & Adventure
   Director: Brett Donowho

4. Triple Threat (2019)
   Type: Movie | Platform: netflix
   Genres: Action & Adventure, International Movies
   Director: Jesse V. Johnson, Jesse Johnson

5. Ava (2020)
   Type: Movie | Platform: netflix
   Genres: Action & Adventure, Dramas
   Director: Tate Taylor


[Document(id='f1b6577c-48ff-4f69-bb9d-7fd102ff8fa3', metadata={'title': 'Xtreme', 'rating': 'TV-MA', 'type': 'Movie', 'platform': 'netflix', 'genres': 'Action & Adventure, International Movies', 'show_id': 's766', 'director': 'Daniel Benmayor', 'duration': '112 min', 'release_year': 2021, 'cast': 'Teo García, Óscar Jaenada, Óscar Casas, Andrea Duro, Sergio Peris-Mencheta, Alberto Jo Lee, Luis Zahera, Andrés Herrera, Nao Albet, César Bandera, Isa Montalbán', 'country': 'Spain'}, page_content='Title: Xtreme | Type: Movie | Genres: Action & Adventure, International Movies | Description: In this fast-paced and action-packed thriller, a retired hitman — along with his sister and a troubled teen — takes revenge on his lethal stepbrother.'),
 Document(id='8d9093e0-115a-49a6-a658-036af5e4d400', metadata={'type': 'Movie', 'genres': 'Action & Adventure, Sports Movies', 'title': 'XXx', 'duration': '124 min', 'release_year': 2002, 'country': 'United States', 'director': 'Rob Cohen', 'sho

In [35]:
# Test with type filter
test_semantic_search(
    "comedy",
    k=5,
    filters={"type": "Movie"}
)


🔍 Query: 'comedy'
   Filters: {'type': 'Movie'}
--------------------------------------------------------------------------------

1. The Human Comedy (2017)
   Type: Movie | Platform: amazon_prime
   Genres: Comedy, Drama
   Director: Mohammad Hadi Karimi

2. Indian Comedy Tour (2010)
   Type: Movie | Platform: amazon_prime
   Genres: Arts, Entertainment, and Culture, Comedy, Special Interest
   Director: Iqbal Hans

3. Kims of Comedy (2005)
   Type: Movie | Platform: amazon_prime
   Genres: Arts, Entertainment, and Culture, Comedy, Special Interest
   Director: Chuck Vinson

4. Is This A Joke? (2021)
   Type: Movie | Platform: amazon_prime
   Genres: Comedy
   Director: Jim Haggerty

5. Can We Take a Joke? (2016)
   Type: Movie | Platform: amazon_prime
   Genres: Arthouse, Arts, Entertainment, and Culture, Comedy
   Director: Ted Balaker


[Document(id='36f2fcc7-ba7c-4808-934c-90b43accb4a7', metadata={'country': '', 'rating': '13+', 'platform': 'amazon_prime', 'director': 'Mohammad Hadi Karimi', 'cast': 'Arman Darvish, Hooman Seyedi, Leila Zareh, Hasti Mahdavifard, Niki Karimi, Bahareh Kianafshar, Alireza Shojanoori', 'duration': '90 min', 'title': 'The Human Comedy', 'genres': 'Comedy, Drama', 'type': 'Movie', 'release_year': 2017, 'show_id': 's6474'}, page_content='Title: The Human Comedy | Type: Movie | Genres: Comedy, Drama | Description: A comedy of the tragic life of someone who must live like the people around him despite his own preference. The Human Comedy is a comedy narrative of the tragic life of a person who finds himself doing as the Romans do unwillingly.'),
 Document(id='db3fd554-ea9d-4611-bcdb-b8b6de6894fe', metadata={'cast': 'Vidur Kapur, Vijai Nathan, Mark Saldana, Dalia McPhee, Rajiv Satyal', 'country': '', 'director': 'Iqbal Hans', 'title': 'Indian Comedy Tour', 'rating': '18+', 'genres': 'Arts, Ente
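
The filtered searches above constrain one field each. Chroma's metadata filter syntax also supports comparison operators such as `$gte` and combining clauses with `$and`, which is how a release-year constraint could be added. This is a sketch of a small builder; `build_filter` and its parameters are hypothetical, and the resulting dict is meant to be passed as `filters=` to `test_semantic_search`.

```python
from typing import Optional


def build_filter(
    platform: Optional[str] = None,
    content_type: Optional[str] = None,
    min_year: Optional[int] = None,
) -> Optional[dict]:
    """Combine optional constraints into a Chroma-style metadata filter."""
    clauses = []
    if platform is not None:
        clauses.append({"platform": platform})
    if content_type is not None:
        clauses.append({"type": content_type})
    if min_year is not None:
        clauses.append({"release_year": {"$gte": min_year}})

    if not clauses:
        return None          # no filtering
    if len(clauses) == 1:
        return clauses[0]    # a single clause needs no $and wrapper
    return {"$and": clauses}


netflix_recent_movies = build_filter(platform="netflix", content_type="Movie", min_year=2015)
print(netflix_recent_movies)
```

Returning a bare clause for the single-constraint case keeps the output identical to the hand-written filters used in the two cells above.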

---

## 📊 Step 7: Evaluation & Quality Metrics

In [36]:
print("\n📊 EMBEDDING QUALITY METRICS")
print("=" * 80)

# Storage metrics
print(f"\n📦 Storage:")
print(f"   Total vectors: {vectorstore._collection.count()}")
print(f"   Vector dimensions: 1536")
print(f"   Storage location: {CHROMA_DIR.absolute()}")

# Data coverage
print(f"\n📈 Data Coverage:")
has_desc = all_shows['description'].notna() & (all_shows['description'].str.strip() != '')
print(f"   Records with descriptions: {has_desc.sum()} ({(has_desc.sum()/len(all_shows)*100):.1f}%)")
print(f"   Average text length: {all_shows['embedding_text'].str.len().mean():.0f} chars")

# Platform distribution
print(f"\n🎬 Platform Distribution:")
for platform, count in all_shows['platform'].value_counts().items():
    pct = (count / len(all_shows)) * 100
    print(f"   {platform}: {count} ({pct:.1f}%)")

# Type distribution
print(f"\n📺 Content Type:")
for content_type, count in all_shows['type'].value_counts().items():
    pct = (count / len(all_shows)) * 100
    print(f"   {content_type}: {count} ({pct:.1f}%)")

# Year range
valid_years = all_shows[all_shows['release_year'] > 0]['release_year']
if len(valid_years) > 0:
    print(f"\n📅 Release Year Range:")
    print(f"   Min: {valid_years.min():.0f}")
    print(f"   Max: {valid_years.max():.0f}")
    print(f"   Mean: {valid_years.mean():.0f}")


📊 EMBEDDING QUALITY METRICS

📦 Storage:
   Total vectors: 19925
   Vector dimensions: 1536
   Storage location: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\data\chroma_db

📈 Data Coverage:
   Records with descriptions: 19925 (100.0%)
   Average text length: 284 chars

🎬 Platform Distribution:
   amazon_prime: 9668 (48.5%)
   netflix: 8807 (44.2%)
   disney_plus: 1450 (7.3%)

📺 Content Type:
   Movie: 14997 (75.3%)
   TV Show: 4928 (24.7%)

📅 Release Year Range:
   Min: 1920
   Max: 2021
   Mean: 2011
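
The figures above describe the stored data rather than retrieval quality. A simple, admittedly crude, proxy for relevance is the share of top-k results whose genre metadata contains an expected keyword. The sketch operates on plain metadata dictionaries; `genre_hit_rate` is a hypothetical helper, and the sample results are copied from the "dark psychological thrillers" test output in Step 6.

```python
def genre_hit_rate(results: list, keyword: str) -> float:
    """Fraction of results whose 'genres' field contains `keyword` (case-insensitive)."""
    if not results:
        return 0.0
    hits = sum(1 for meta in results if keyword.lower() in meta.get("genres", "").lower())
    return hits / len(results)


# Metadata for the top-3 results of the "dark psychological thrillers" query
thriller_results = [
    {"title": "Dark Crimes", "genres": "Dramas, Thrillers"},
    {"title": "A Kind of Murder", "genres": "Thrillers"},
    {"title": "Face your Fears | Thriller shorts for Adults", "genres": "Horror, Suspense"},
]

rate = genre_hit_rate(thriller_results, "thriller")
print(f"Genre hit rate: {rate:.2f}")
```

A keyword match understates relevance when platforms label genres differently (the third result is tagged "Suspense" rather than "Thrillers"), so the number is best read as a lower bound.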


---

## 💾 Step 8: Save Metadata & Statistics

In [37]:
# Save embedding statistics
stats = {
    "created_at": datetime.now().isoformat(),
    "total_records": len(all_shows),
    "total_embeddings": vectorstore._collection.count(),
    "embedding_model": "text-embedding-3-small",
    "dimensions": 1536,
    "platforms": all_shows['platform'].value_counts().to_dict(),
    "content_types": all_shows['type'].value_counts().to_dict(),
    "year_range": {
        "min": int(valid_years.min()) if len(valid_years) > 0 else 0,
        "max": int(valid_years.max()) if len(valid_years) > 0 else 0
    },
    "coverage": {
        "has_description": int(has_desc.sum()),
        "description_percentage": float((has_desc.sum()/len(all_shows)*100))
    }
}

stats_file = CHROMA_DIR / "embedding_stats.json"
with open(stats_file, 'w', encoding='utf-8') as f:
    json.dump(stats, f, indent=2)

print(f"✅ Statistics saved to: {stats_file}")

✅ Statistics saved to: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\data\chroma_db\embedding_stats.json
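
A quick sanity check after saving is to read the file back and confirm that the number of embeddings matches the number of records. The sketch below writes to a temporary directory so it runs on its own; `stats_are_consistent` is a hypothetical helper, and the sample values mirror the counts reported in this notebook.

```python
import json
import tempfile
from pathlib import Path


def stats_are_consistent(path: Path) -> bool:
    """Return True if the saved stats report one embedding per record."""
    with open(path, encoding="utf-8") as f:
        saved = json.load(f)
    return saved["total_records"] == saved["total_embeddings"]


with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "embedding_stats.json"
    demo_file.write_text(
        json.dumps({"total_records": 19925, "total_embeddings": 19925}),
        encoding="utf-8",
    )
    consistent = stats_are_consistent(demo_file)

print("Stats consistent:", consistent)
```

A mismatch would suggest that the embedding cell was run more than once against the same collection, or that some records failed to embed.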


---

## ✅ Step 9: Verification & Next Steps

In [38]:
print("\n" + "="*80)
print("🎉 EMBEDDING CREATION COMPLETE!")
print("="*80)

print(f"\n✅ Successfully created embeddings for {vectorstore._collection.count()} movies/shows")
print(f"\n📁 Files created:")
print(f"   - Vector database: {CHROMA_DIR.absolute()}")
print(f"   - Statistics: {stats_file.absolute()}")

print(f"\n🚀 Next Steps:")
print(f"   1. Integrate semantic_search_node into albert_v3.py")
print(f"   2. Add hybrid search combining SQL + vector search")
print(f"   3. Update workflow routing to use semantic search")
print(f"   4. Test with Albert agent in Streamlit")

print(f"\n💡 Try these queries in Albert:")
queries = [
    "dark psychological thrillers",
    "movies like Inception",
    "heartwarming family films",
    "intense crime dramas"
]
for q in queries:
    print(f"   - {q}")

print(f"\n📚 Documentation: See README.md and albert_v3_architecture_analysis.md")


🎉 EMBEDDING CREATION COMPLETE!

✅ Successfully created embeddings for 19925 movies/shows

📁 Files created:
   - Vector database: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\data\chroma_db
   - Statistics: c:\Users\Vincent\GitHub\Vincent-20-100\Agentic_Systems_Project_Vlamy\data\chroma_db\embedding_stats.json

🚀 Next Steps:
   1. Integrate semantic_search_node into albert_v3.py
   2. Add hybrid search combining SQL + vector search
   3. Update workflow routing to use semantic search
   4. Test with Albert agent in Streamlit

💡 Try these queries in Albert:
   - dark psychological thrillers
   - movies like Inception
   - heartwarming family films
   - intense crime dramas

📚 Documentation: See README.md and albert_v3_architecture_analysis.md


---

## 🔄 Optional: Reload Vector Store

Use this cell to reload the vector store in future sessions without recreating embeddings.

In [None]:
# Reload existing vector store (run this instead of creating new embeddings)
def reload_vectorstore():
    """Reload existing Chroma vector store"""
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small",
        api_key=os.getenv("OPENAI_API_KEY")
    )
    
    vectorstore = Chroma(
        persist_directory=str(CHROMA_DIR),
        embedding_function=embeddings,
        collection_name="movies_shows"
    )
    
    print(f"✅ Vector store reloaded: {vectorstore._collection.count()} vectors")
    return vectorstore

# Uncomment to reload:
# vectorstore = reload_vectorstore()
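
Pointing `Chroma` at a wrong or empty directory typically does not raise an error; it creates a new, empty collection, which can mask a path mistake. A cheap guard is to check that the directory exists and is not empty before reloading. This is a sketch using only the standard library; `store_looks_populated` is a hypothetical helper and the placeholder file merely stands in for Chroma's own files.

```python
import tempfile
from pathlib import Path


def store_looks_populated(persist_dir: Path) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return persist_dir.is_dir() and any(persist_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "chroma_db"
    demo_dir.mkdir()
    before = store_looks_populated(demo_dir)

    (demo_dir / "placeholder.bin").write_bytes(b"\x00")  # stand-in for Chroma's files
    after = store_looks_populated(demo_dir)

print(before, after)
```

In the notebook, the check would run on `CHROMA_DIR` before calling `reload_vectorstore()`, followed by a look at the reported vector count.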