# Movie Reviews RAG Agentic Solution 🍅

This notebook implements an end-to-end RAG (Retrieval Augmented Generation) system for analyzing movie reviews from the **Rotten Tomatoes dataset** - providing access to professional critic reviews and comprehensive movie metadata.

## Overview

We'll build a system that can:
- Load and process movie review data from Rotten Tomatoes
- Generate embeddings for review text with rich metadata
- Implement semantic search for relevant reviews
- Generate intelligent responses to movie-related queries
- Provide insights and analysis based on professional critic reviews
- Leverage Tomatometer scores, audience ratings, and critic consensus

## Data Sources 
- **Rotten Tomatoes Movies** (17MB): Movie metadata with titles, ratings, genres, directors, runtime, release dates, etc.
- **Rotten Tomatoes Reviews** (392MB): Professional critic reviews with scores, sentiment, publications, and detailed text

## Key Features ✨
- **Rich Movie Metadata**: Genre, director, runtime, release dates, ratings
- **Professional Critics**: Reviews from top critics and publications
- **Tomatometer & Audience Scores**: Official Rotten Tomatoes scoring system
- **Fresh/Rotten Classification**: Review state analysis
- **Sentiment Analysis**: Positive/negative review sentiment
- **Multi-tool Agent**: Specialized tools for different types of analysis


## Step 1: Environment Setup and Dependencies


In [4]:
import os
import getpass
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
import asyncio
import nest_asyncio
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Apply nest_asyncio for Jupyter compatibility
nest_asyncio.apply()


In [5]:
# API Keys Setup
import os
import getpass

print("🔑 Setting up API Keys")
print("=" * 40)

# OpenAI API Key (required)
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("🤖 Enter your OpenAI API Key: ")
    print("✅ OpenAI API key set")
else:
    print("✅ OpenAI API key already set")

# Tavily API Key (recommended for external search)
if not os.getenv("TAVILY_API_KEY"):
    tavily_key = getpass.getpass("🔍 Enter your Tavily API Key (or press Enter to skip): ")
    if tavily_key.strip():
        os.environ["TAVILY_API_KEY"] = tavily_key
        print("✅ Tavily API key set")
    else:
        print("⚠️ Tavily API key skipped - external search will be limited")
else:
    print("✅ Tavily API key already set")

# SerpAPI Key (optional backup for external search)
if not os.getenv("SERPAPI_API_KEY"):
    serp_key = getpass.getpass("🌐 Enter your SerpAPI Key (or press Enter to skip): ")
    if serp_key.strip():
        os.environ["SERPAPI_API_KEY"] = serp_key
        print("✅ SerpAPI key set")
    else:
        print("⚠️ SerpAPI key skipped - will use Tavily or fallback")
else:
    print("✅ SerpAPI key already set")

# LangSmith API Key (optional for monitoring)
if not os.getenv("LANGSMITH_API_KEY"):
    langsmith_key = getpass.getpass("📊 Enter your LangSmith API Key (or press Enter to skip): ")
    if langsmith_key.strip():
        os.environ["LANGSMITH_API_KEY"] = langsmith_key
        os.environ["LANGSMITH_TRACING"] = "true"
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
        os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
        print("✅ LangSmith API key set and tracing enabled")
    else:
        os.environ["LANGSMITH_TRACING"] = "false"
        os.environ["LANGCHAIN_TRACING_V2"] = "false"
        print("⚠️ LangSmith skipped - no monitoring/tracing")
else:
    print("✅ LangSmith API key already set")

print("\n🎯 API Key Setup Complete!")
print("\n📋 Where to get API keys:")
print("• OpenAI: https://platform.openai.com/api-keys")
print("• Tavily: https://tavily.com/ (free tier available)")
print("• SerpAPI: https://serpapi.com/ (free tier available)")
print("• LangSmith: https://smith.langchain.com/ (optional monitoring)")


🔑 Setting up API Keys
✅ OpenAI API key already set
✅ Tavily API key already set
✅ SerpAPI key already set
✅ LangSmith API key already set

🎯 API Key Setup Complete!

📋 Where to get API keys:
• OpenAI: https://platform.openai.com/api-keys
• Tavily: https://tavily.com/ (free tier available)
• SerpAPI: https://serpapi.com/ (free tier available)
• LangSmith: https://smith.langchain.com/ (optional monitoring)


## Step 2: Data Loading and Preprocessing


In [6]:
# Load the Rotten Tomatoes datasets
print("🍅 Loading Rotten Tomatoes movie review datasets...")

# Robust CSV loading function with error handling
def load_csv_robust(filepath):
    """Load CSV with robust error handling for malformed data"""
    encodings = ['utf-8', 'latin1', 'cp1252', 'iso-8859-1']
    
    for encoding in encodings:
        try:
            print(f"  Trying encoding: {encoding}")
            # Try with error handling for malformed lines
            df = pd.read_csv(
                filepath, 
                encoding=encoding,
                on_bad_lines='skip',  # Skip bad lines instead of failing
                engine='python',      # Use Python engine for better error handling
                quoting=1,           # Quote all fields
                skipinitialspace=True
            )
            print(f"  ✅ Success with {encoding}")
            return df
        except UnicodeDecodeError:
            print(f"  ❌ Failed with {encoding}")
            continue
        except Exception as e:
            print(f"  ❌ Failed with {encoding}: {str(e)}")
            continue
    
    # If all encodings fail, try with minimal options
    print("  Trying with basic fallback...")
    try:
        df = pd.read_csv(filepath, encoding='latin1', on_bad_lines='skip', engine='python')
        print("  ✅ Success with fallback method")
        return df
    except Exception as e:
        raise ValueError(f"Could not read {filepath}: {str(e)}")

# Load Rotten Tomatoes movies metadata
print("Loading Rotten Tomatoes movies metadata...")
movies_df = load_csv_robust("data/rotten_tomatoes_movies.csv")
print(f"Movies dataset: {len(movies_df)} movies")
print(f"Columns: {list(movies_df.columns)}")

# Load Rotten Tomatoes reviews
print("\nLoading Rotten Tomatoes reviews...")
reviews_df = load_csv_robust("data/rotten_tomatoes_movie_reviews.csv")
print(f"Reviews dataset: {len(reviews_df)} reviews")
print(f"Columns: {list(reviews_df.columns)}")

# Display sample data
print("\n🎬 Sample movies metadata:")
print(movies_df.head(3))

print("\n📝 Sample reviews:")
print(reviews_df.head(3))

# Basic statistics
print(f"\n📊 Dataset Statistics:")
print(f"• Total movies: {len(movies_df):,}")
print(f"• Total reviews: {len(reviews_df):,}")
print(f"• Average reviews per movie: {len(reviews_df)/len(movies_df):.1f}")
print(f"• Unique movie IDs in reviews: {reviews_df['id'].nunique():,}")
print(f"• Movies with reviews: {reviews_df['id'].nunique():,} / {len(movies_df):,}")

# Check review distribution
print(f"\n🏆 Review State Distribution:")
if 'reviewState' in reviews_df.columns:
    print(reviews_df['reviewState'].value_counts())

print(f"\n⭐ Score Sentiment Distribution:")
if 'scoreSentiment' in reviews_df.columns:
    print(reviews_df['scoreSentiment'].value_counts())


🍅 Loading Rotten Tomatoes movie review datasets...
Loading Rotten Tomatoes movies metadata...
  Trying encoding: utf-8
  ✅ Success with utf-8
Movies dataset: 143258 movies
Columns: ['id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix']

Loading Rotten Tomatoes reviews...
  Trying encoding: utf-8
  ✅ Success with utf-8
Reviews dataset: 1444963 reviews
Columns: ['id', 'reviewId', 'creationDate', 'criticName', 'isTopCritic', 'originalScore', 'reviewState', 'publicatioName', 'reviewText', 'scoreSentiment', 'reviewUrl']

🎬 Sample movies metadata:
                   id                title  audienceScore  tomatoMeter rating  \
0  space-zombie-bingo  Space Zombie Bingo!           50.0          NaN    NaN   
1     the_green_grass      The Green Grass            NaN          NaN    NaN   
2           love_lies           

In [7]:
# Data cleaning and preprocessing for Rotten Tomatoes data
def clean_text(text):
    """Clean and normalize text data"""
    if pd.isna(text):
        return ""
    
    # Convert to string and clean
    text = str(text).strip()
    
    # Remove special characters and normalize
    import re
    text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)  # Remove control characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    
    return text.strip()

# Clean movies data
movies_df['title_clean'] = movies_df['title'].apply(clean_text)
movies_df['genre_clean'] = movies_df['genre'].apply(clean_text)
movies_df['director_clean'] = movies_df['director'].apply(clean_text)

# Clean reviews data
reviews_df['reviewText_clean'] = reviews_df['reviewText'].apply(clean_text)
reviews_df['criticName_clean'] = reviews_df['criticName'].apply(clean_text)
reviews_df['publicatioName_clean'] = reviews_df['publicatioName'].apply(clean_text)

# Remove rows with empty review text
reviews_df = reviews_df[reviews_df['reviewText_clean'].str.len() > 20].copy()

# Merge movies and reviews for complete information
print("🔗 Merging movies metadata with reviews...")
merged_df = reviews_df.merge(movies_df, on='id', how='left')

print(f"✅ Cleaned reviews dataset: {len(reviews_df)} reviews")
print(f"✅ Merged dataset: {len(merged_df)} reviews with movie metadata")
print(f"✅ Reviews with movie titles: {merged_df['title'].notna().sum()} / {len(merged_df)}")

# Handle missing titles
missing_titles = merged_df['title'].isna().sum()
if missing_titles > 0:
    print(f"⚠️ {missing_titles} reviews missing movie titles (will use movie ID)")
    merged_df['title_clean'] = merged_df['title_clean'].fillna(merged_df['id'])

print(f"\n🎬 Sample merged data:")
sample_cols = ['title_clean', 'criticName_clean', 'reviewText_clean', 'rating', 'genre_clean']
available_cols = [col for col in sample_cols if col in merged_df.columns]
print(merged_df[available_cols].head(2))


🔗 Merging movies metadata with reviews...
✅ Cleaned reviews dataset: 1364909 reviews
✅ Merged dataset: 1388546 reviews with movie metadata
✅ Reviews with movie titles: 1383051 / 1388546
⚠️ 5495 reviews missing movie titles (will use movie ID)

🎬 Sample merged data:
  title_clean criticName_clean  \
0     Beavers  Ivan M. Lincoln   
1  Blood Mask    The Foywonder   

                                    reviewText_clean rating  genre_clean  
0  Timed to be just long enough for most youngste...    NaN  Documentary  
1  It doesn't matter if a movie costs 300 million...    NaN               


In [9]:
# Create unified data structure for processing Rotten Tomatoes data
def create_review_documents(df, max_reviews=200):
    """Convert merged DataFrame to list of review documents"""
    documents = []
    
    # Use a sample for initial testing
    if len(df) > max_reviews:
        print(f"🧪 Using sample of {max_reviews} reviews for testing...")
        df_sample = df.head(max_reviews)
    else:
        df_sample = df
    
    for idx, row in df_sample.iterrows():
        # Create comprehensive metadata
        metadata = {
            'source': 'rotten_tomatoes',
            'movie_id': row.get('id', ''),
            'movie_title': row.get('title_clean', row.get('id', 'Unknown')),
            'critic_name': row.get('criticName_clean', 'Anonymous'),
            'publication': row.get('publicatioName_clean', 'Unknown'),
            'review_date': row.get('creationDate', 'Unknown'),
            'original_score': row.get('originalScore', ''),
            'review_state': row.get('reviewState', ''),
            'sentiment': row.get('scoreSentiment', ''),
            'is_top_critic': row.get('isTopCritic', False),
            'genre': row.get('genre_clean', ''),
            'director': row.get('director_clean', ''),
            'rating': row.get('rating', ''),
            'audience_score': row.get('audienceScore', ''),
            'tomato_meter': row.get('tomatoMeter', ''),
            'release_date': row.get('releaseDateTheaters', ''),
            'runtime': row.get('runtimeMinutes', ''),
            'index': idx
        }
        
        # Create rich content for embedding
        content = f"Movie: {row.get('title_clean', row.get('id', 'Unknown'))}\n"
        
        # Add movie metadata
        if row.get('genre_clean'):
            content += f"Genre: {row.get('genre_clean')}\n"
        if row.get('director_clean'):
            content += f"Director: {row.get('director_clean')}\n"
        if row.get('rating'):
            content += f"Rating: {row.get('rating')}\n"
        if row.get('releaseDateTheaters'):
            content += f"Release Date: {row.get('releaseDateTheaters')}\n"
        
        # Add review information
        content += f"Critic: {row.get('criticName_clean', 'Anonymous')}\n"
        if row.get('publicatioName_clean'):
            content += f"Publication: {row.get('publicatioName_clean')}\n"
        if row.get('originalScore'):
            content += f"Score: {row.get('originalScore')}\n"
        if row.get('reviewState'):
            content += f"Review State: {row.get('reviewState')}\n"
        if row.get('scoreSentiment'):
            content += f"Sentiment: {row.get('scoreSentiment')}\n"
        
        # Add the main review text
        review_text = row.get('reviewText_clean', '')
        if review_text:
            content += f"Review: {review_text}"
        
        documents.append({
            'content': content,
            'metadata': metadata
        })
    
    return documents

# Create documents from merged Rotten Tomatoes data
print("🍅 Creating review documents from Rotten Tomatoes data...")
all_documents = create_review_documents(merged_df, max_reviews=200)

print(f"✅ Created {len(all_documents)} total review documents")
print(f"   - Source: Rotten Tomatoes")
print(f"   - Reviews with movie metadata included")

# Show sample document
print("\n📄 Sample document:")
print(all_documents[0]['content'][:300] + "...")

# Show metadata sample
print("\n🏷️ Sample metadata:")
sample_metadata = all_documents[0]['metadata']
for key, value in list(sample_metadata.items())[:8]:  # Show first 8 metadata fields
    print(f"  {key}: {value}")

# Basic statistics
print(f"\n📊 Document Statistics:")
unique_movies = len(set([doc['metadata']['movie_title'] for doc in all_documents]))
unique_critics = len(set([doc['metadata']['critic_name'] for doc in all_documents]))
print(f"• Unique movies: {unique_movies}")
print(f"• Unique critics: {unique_critics}")
print(f"• Average content length: {np.mean([len(doc['content']) for doc in all_documents]):.0f} characters")


🍅 Creating review documents from Rotten Tomatoes data...
🧪 Using sample of 200 reviews for testing...
✅ Created 200 total review documents
   - Source: Rotten Tomatoes
   - Reviews with movie metadata included

📄 Sample document:
Movie: Beavers
Genre: Documentary
Director: Stephen Low
Rating: nan
Release Date: nan
Critic: Ivan M. Lincoln
Publication: Deseret News (Salt Lake City)
Score: 3.5/4
Review State: fresh
Sentiment: POSITIVE
Review: Timed to be just long enough for most youngsters' brief attention spans -- and it's pa...

🏷️ Sample metadata:
  source: rotten_tomatoes
  movie_id: beavers
  movie_title: Beavers
  critic_name: Ivan M. Lincoln
  publication: Deseret News (Salt Lake City)
  review_date: 2003-05-23
  original_score: 3.5/4
  review_state: fresh

📊 Document Statistics:
• Unique movies: 33
• Unique critics: 178
• Average content length: 339 characters


## Step 3: Text Chunking and Embedding Generation


In [10]:
import tiktoken
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_core.documents import Document

# Token counting function
def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(text)
    return len(tokens)

# Convert our documents to LangChain Document format - each review is already a chunk
print("🔪 Using each review as a separate chunk...")
chunks = []
for doc in all_documents:
    langchain_doc = Document(
        page_content=doc['content'],
        metadata=doc['metadata']
    )
    chunks.append(langchain_doc)

print(f"✅ Created {len(chunks)} chunks from {len(all_documents)} reviews")
print("   Each review is treated as a separate chunk for better semantic coherence")

# Verify chunk sizes
chunk_lengths = [tiktoken_len(chunk.page_content) for chunk in chunks]
max_chunk_length = max(chunk_lengths)
avg_chunk_length = sum(chunk_lengths) / len(chunk_lengths)
print(f"📏 Maximum chunk length: {max_chunk_length} tokens")
print(f"📏 Average chunk length: {avg_chunk_length:.0f} tokens")

# Show sample chunk
print("\n📄 Sample chunk:")
print(chunks[0].page_content[:200] + "...")


🔪 Using each review as a separate chunk...
✅ Created 200 chunks from 200 reviews
   Each review is treated as a separate chunk for better semantic coherence
📏 Maximum chunk length: 120 tokens
📏 Average chunk length: 89 tokens

📄 Sample chunk:
Movie: Beavers
Genre: Documentary
Director: Stephen Low
Rating: nan
Release Date: nan
Critic: Ivan M. Lincoln
Publication: Deseret News (Salt Lake City)
Score: 3.5/4
Review State: fresh
Sentiment: POS...


In [11]:
# Initialize embedding model
print("🧠 Initializing embedding model...")
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store
print("🗄️ Creating vector store...")
qdrant_vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embedding_model,
    location=":memory:"
)

# Create retriever
retriever = qdrant_vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

print("✅ Vector store and retriever created successfully!")

# Test retrieval
print("\n🧪 Testing retrieval with sample query...")
test_query = "What do people think about Inception?"
test_results = retriever.get_relevant_documents(test_query)
print(f"Found {len(test_results)} relevant documents for: '{test_query}'")
print("\n📄 Sample retrieved content:")
print(test_results[0].page_content[:300] + "...")


🧠 Initializing embedding model...
🗄️ Creating vector store...
✅ Vector store and retriever created successfully!

🧪 Testing retrieval with sample query...


  test_results = retriever.get_relevant_documents(test_query)


Found 5 relevant documents for: 'What do people think about Inception?'

📄 Sample retrieved content:
Movie: La Sapienza
Genre: Drama
Director: Eugène Green
Rating: nan
Release Date: nan
Critic: Boyd van Hoeij
Publication: Hollywood Reporter
Score: nan
Review State: fresh
Sentiment: POSITIVE
Review: The Sapience juxtaposes insights on how people are emotionally connected with ruminations on the buil...


## Step 4: RAG System Implementation


In [12]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict

# Define state structure
class State(TypedDict):
    question: str
    context: List[Document]
    response: str

# Create prompt template
HUMAN_TEMPLATE = """
You are a knowledgeable movie critic and analyst. You have access to a database of movie reviews from both Letterboxd (social media reviews) and Metacritic (professional reviews).

Use the provided context to answer the user's question about movies, reviews, ratings, and trends. Only use the information provided in the context. If the context doesn't contain relevant information to answer the question, respond with "I don't have enough information to answer that question based on the available reviews."

When analyzing reviews, consider:
- Different perspectives between social media (Letterboxd) and professional (Metacritic) reviews
- Rating patterns and trends
- Common themes in reviews
- Temporal patterns in movie releases and ratings

CONTEXT:
{context}

QUESTION:
{question}

Provide a comprehensive and insightful answer based on the available review data.
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("human", HUMAN_TEMPLATE)
])

# Initialize chat model
chat_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

print("✅ Prompt template and chat model initialized!")


✅ Prompt template and chat model initialized!


In [13]:
# Define RAG functions
def retrieve(state: State) -> State:
    """Retrieve relevant documents based on the question"""
    retrieved_docs = retriever.get_relevant_documents(state["question"])
    return {"context": retrieved_docs}

def generate(state: State) -> State:
    """Generate response based on retrieved context"""
    generator_chain = chat_prompt | chat_model | StrOutputParser()
    response = generator_chain.invoke({
        "question": state["question"], 
        "context": state["context"]
    })
    return {"response": response}

# Build the RAG graph
graph_builder = StateGraph(State)
graph_builder = graph_builder.add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
rag_graph = graph_builder.compile()

print("✅ RAG system built successfully!")
print(f"📊 Graph structure created with retrieve → generate sequence")


✅ RAG system built successfully!
📊 Graph structure created with retrieve → generate sequence


## Step 5: Testing the RAG System


In [None]:
# Test queries
test_queries = [
    "What do people think about Inception?",
    "What are the best rated movies according to the reviews?",
    "What are some common themes in movie reviews?",
    "What movies have the highest ratings?"
]

print("🧪 Testing RAG system with various queries...")
print("=" * 80)

for i, query in enumerate(test_queries, 1):
    print(f"\n🔍 Query {i}: {query}")
    print("-" * 60)
    
    try:
        response = rag_graph.invoke({"question": query})
        print(f"📝 Response: {response['response']}")
        
        # Show retrieved context info
        context_sources = [doc.metadata.get('source', 'unknown') for doc in response['context']]
        print(f"📚 Retrieved from: {context_sources}")
        
    except Exception as e:
        print(f"❌ Error: {str(e)}")
    
    print("=" * 80)


In [15]:
def query_movie_reviews(question: str) -> Dict[str, Any]:
    """Query the movie reviews RAG system"""
    try:
        response = rag_graph.invoke({"question": question})
        
        # Extract metadata from retrieved documents
        sources = []
        for doc in response['context']:
            source_info = {
                'source': doc.metadata.get('source', 'unknown'),
                'movie': doc.metadata.get('movie_name', 'Unknown'),
                'rating': doc.metadata.get('rating', 'N/A'),
                'reviewer': doc.metadata.get('reviewer', 'Anonymous')
            }
            sources.append(source_info)
        
        return {
            'question': question,
            'answer': response['response'],
            'sources': sources,
            'num_sources': len(sources)
        }
    
    except Exception as e:
        return {
            'question': question,
            'answer': f"Error processing query: {str(e)}",
            'sources': [],
            'num_sources': 0
        }

# Test the query function
print("🧪 Testing the query function...")
sample_result = query_movie_reviews("What do people think about The Dark Knight?")
print(f"Question: {sample_result['question']}")
print(f"Answer: {sample_result['answer'][:200]}...")
print(f"Sources: {sample_result['num_sources']} documents found")

print("\n✅ Query function is working correctly!")


🧪 Testing the query function...
Question: What do people think about The Dark Knight?
Answer: I don't have enough information to answer that question based on the available reviews....
Sources: 5 documents found

✅ Query function is working correctly!


In [16]:
# Interactive query function
def interactive_query():
    """Interactive query interface"""
    print("🎬 Movie Reviews RAG System")
    print("=" * 50)
    print("Ask questions about movies, reviews, ratings, and trends!")
    print("Type 'quit' to exit.")
    print("=" * 50)
    
    while True:
        try:
            question = input("\n🤔 Your question: ")
            
            if question.lower() in ['quit', 'exit', 'q']:
                print("👋 Goodbye!")
                break
            
            if not question.strip():
                continue
            
            print("\n🔍 Searching for relevant reviews...")
            result = query_movie_reviews(question)
            
            print(f"\n📝 Answer: {result['answer']}")
            print(f"\n📚 Sources ({result['num_sources']}):")
            
            for i, source in enumerate(result['sources'], 1):
                print(f"  {i}. {source['movie']} ({source['source']}) - Rating: {source['rating']}")
            
        except KeyboardInterrupt:
            print("\n👋 Goodbye!")
            break
        except Exception as e:
            print(f"\n❌ Error: {str(e)}")

# You can uncomment the line below to start interactive mode
# interactive_query()

print("✅ Interactive query interface ready!")
print("💡 Uncomment the interactive_query() line above to start interactive mode.")


✅ Interactive query interface ready!
💡 Uncomment the interactive_query() line above to start interactive mode.


## Step 7: System Summary and Next Steps


In [18]:
# System summary
print("🍅 Rotten Tomatoes RAG System Summary")
print("=" * 50)
print(f"📊 Total reviews processed: {len(all_documents)}")
print(f"   - Rotten Tomatoes Professional Critics: {len(all_documents)} reviews")
print(f"🔪 Text chunks created: {len(chunks)}")
print(f"🧠 Embedding model: text-embedding-3-small")
print(f"💬 Chat model: gpt-4o-mini")
print(f"🗄️ Vector store: Qdrant (in-memory)")
print(f"🔍 Retrieval method: Similarity search (k=5)")
print("=" * 50)

print("\n🚀 Next Steps:")
print("1. ✅ Add agentic features with multiple tools")
print("2. ✅ Implement advanced analytics and trend analysis")
print("3. Add evaluation framework with RAGAS")
print("4. Add LangSmith tracing and monitoring")
print("5. Create visualization capabilities")
print("6. Deploy as a web application")

print("\n✅ Phase 1 Complete: Rotten Tomatoes RAG system is ready for queries!")

# Example of how to use the system
print("\n💡 Example usage:")
print("result = query_movie_reviews('What do critics think about sci-fi movies?')")
print("print(result['answer'])")

# Data statistics for Rotten Tomatoes
print(f"\n📈 Rotten Tomatoes Data Statistics:")
print(f"   - Total movies in dataset: {len(movies_df):,}")
print(f"   - Total reviews in dataset: {len(reviews_df):,}")
print(f"   - Reviews in current sample: {len(merged_df):,}")
print(f"   - Unique movies with reviews: {merged_df['title_clean'].nunique():,}")

# Review quality statistics
if len(merged_df) > 0:
    avg_review_length = merged_df['reviewText_clean'].str.len().mean()
    print(f"   - Average review length: {avg_review_length:.0f} characters")
    
    # Fresh vs Rotten breakdown
    if 'reviewState' in merged_df.columns:
        fresh_count = (merged_df['reviewState'] == 'fresh').sum()
        rotten_count = (merged_df['reviewState'] == 'rotten').sum()
        fresh_percentage = (fresh_count / len(merged_df)) * 100 if len(merged_df) > 0 else 0
        print(f"   - Fresh reviews: {fresh_count:,} ({fresh_percentage:.1f}%)")
        print(f"   - Rotten reviews: {rotten_count:,}")
    
    # Top critics
    if 'isTopCritic' in merged_df.columns:
        top_critics = (merged_df['isTopCritic'] == True).sum()
        print(f"   - Reviews from Top Critics: {top_critics:,}")

print(f"\n🎬 Data Quality:")
print(f"   - Professional critic reviews with detailed metadata")
print(f"   - Official Rotten Tomatoes scores (Tomatometer & Audience)")
print(f"   - Rich movie information (genre, director, runtime, release date)")
print(f"   - Publication sources and critic credentials")


🍅 Rotten Tomatoes RAG System Summary
📊 Total reviews processed: 200
   - Rotten Tomatoes Professional Critics: 200 reviews
🔪 Text chunks created: 200
🧠 Embedding model: text-embedding-3-small
💬 Chat model: gpt-4o-mini
🗄️ Vector store: Qdrant (in-memory)
🔍 Retrieval method: Similarity search (k=5)

🚀 Next Steps:
1. ✅ Add agentic features with multiple tools
2. ✅ Implement advanced analytics and trend analysis
3. Add evaluation framework with RAGAS
4. Add LangSmith tracing and monitoring
5. Create visualization capabilities
6. Deploy as a web application

✅ Phase 1 Complete: Rotten Tomatoes RAG system is ready for queries!

💡 Example usage:
result = query_movie_reviews('What do critics think about sci-fi movies?')
print(result['answer'])

📈 Rotten Tomatoes Data Statistics:
   - Total movies in dataset: 143,258
   - Total reviews in dataset: 1,364,909
   - Reviews in current sample: 1,388,546
   - Unique movies with reviews: 61,764
   - Average review length: 132 characters
   - Fresh rev

## Phase 2: Agentic Enhancement with Multiple Tools

Now we'll enhance our RAG system with multiple specialized tools, including external search capabilities for when our embedded review data isn't sufficient.


### Step 8: External Search Tool Setup


In [39]:
# External search tools setup
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.utilities import SerpAPIWrapper
from langchain.tools import Tool
from langchain_core.tools import tool
import json

# Setup external search tools (you'll need API keys for these)
print("🔧 Setting up external search tools...")

# Option 1: Tavily Search (recommended - often has free tier)
try:
    # You'll need to set TAVILY_API_KEY in your environment
    # Get free API key from: https://tavily.com/
    tavily_search = TavilySearchResults(
        max_results=3,
        search_depth="basic",
        include_answer=True,
        include_raw_content=True
    )
    print("✅ Tavily search tool configured")
    has_tavily = True
except Exception as e:
    print(f"⚠️ Tavily not configured: {e}")
    has_tavily = False

# Option 2: SerpAPI (Google Search) - backup option
try:
    # You'll need to set SERPAPI_API_KEY in your environment
    # Get free API key from: https://serpapi.com/
    search = SerpAPIWrapper()
    serp_tool = Tool(
        name="google_search",
        description="Search Google for current information about movies, actors, reviews, or box office data",
        func=search.run,
    )
    print("✅ SerpAPI search tool configured")
    has_serp = True
except Exception as e:
    print(f"⚠️ SerpAPI not configured: {e}")
    has_serp = False

# Create a fallback search function if no external APIs are configured
def fallback_search(query: str) -> str:
    """Fallback search when no external APIs are available"""
    return f"External search not available. Query '{query}' would require external movie database access. Please configure Tavily API key (https://tavily.com/) or SerpAPI key (https://serpapi.com/) for enhanced search capabilities."

# Choose which search tool to use
if has_tavily:
    external_search_tool = tavily_search
    search_tool_name = "Tavily"
elif has_serp:
    external_search_tool = serp_tool
    search_tool_name = "SerpAPI"
else:
    external_search_tool = Tool(
        name="fallback_search",
        description="Fallback search tool when external APIs are not configured",
        func=fallback_search
    )
    search_tool_name = "Fallback"

print(f"🔍 Using {search_tool_name} for external search")

# Test the search tool
print(f"\n🧪 Testing {search_tool_name} search...")
try:
    if has_tavily:
        test_result = external_search_tool.invoke({"query": "Inception movie reviews 2010"})
        print(f"✅ Search test successful: Found {len(test_result)} results")
    elif has_serp:
        test_result = external_search_tool.run("Inception movie reviews 2010")
        print(f"✅ Search test successful: {test_result[:100]}...")
    else:
        test_result = external_search_tool.run("Inception movie reviews 2010")
        print(f"⚠️ Using fallback search: {test_result}")
except Exception as e:
    print(f"❌ Search test failed: {e}")
    print("💡 You can continue without external search - the agent will use only embedded reviews")


🔧 Setting up external search tools...
✅ Tavily search tool configured
✅ SerpAPI search tool configured
🔍 Using Tavily for external search

🧪 Testing Tavily search...
✅ Search test successful: Found 3 results


### Step 9: Specialized Agent Tools


In [40]:
# Create specialized tools for Rotten Tomatoes movie analysis
print("🛠️ Creating specialized agent tools for Rotten Tomatoes data...")

# Tool 1: Movie Review Search (our existing RAG)
@tool
def search_movie_reviews(query: str) -> str:
    """
    Search through embedded movie reviews from Rotten Tomatoes.
    Use this for questions about specific movies, ratings, or review content.
    """
    try:
        response = rag_graph.invoke({"question": query})
        return response['response']
    except Exception as e:
        return f"Error searching reviews: {str(e)}"

# Tool 2: Movie Statistics Analysis
@tool
def analyze_movie_statistics(movie_name: str = "") -> str:
    """
    Analyze statistics for a specific movie or provide general Rotten Tomatoes dataset statistics.
    Returns ratings, review counts, critic information, and other numerical insights.
    """
    try:
        if movie_name:
            # Search for specific movie in the merged dataset
            movie_data = merged_df[
                merged_df['title_clean'].str.contains(movie_name, case=False, na=False)
            ]
            
            if movie_data.empty:
                return f"No statistics found for '{movie_name}' in the Rotten Tomatoes dataset."
            
            # Get movie information
            movie_info = movie_data.iloc[0]  # Get first match for movie metadata
            movie_reviews = movie_data  # All reviews for this movie
            
            stats = f"Statistics for '{movie_info.get('title_clean', movie_name)}':\n"
            stats += f"═══════════════════════════════════\n"
            
            # Movie metadata
            if movie_info.get('genre_clean'):
                stats += f"🎭 Genre: {movie_info['genre_clean']}\n"
            if movie_info.get('director_clean'):
                stats += f"🎬 Director: {movie_info['director_clean']}\n"
            if movie_info.get('rating'):
                stats += f"🏷️ Rating: {movie_info['rating']}\n"
            if movie_info.get('runtimeMinutes'):
                stats += f"⏱️ Runtime: {movie_info['runtimeMinutes']} minutes\n"
            if movie_info.get('releaseDateTheaters'):
                stats += f"📅 Release Date: {movie_info['releaseDateTheaters']}\n"
            
            # Scores
            if pd.notna(movie_info.get('audienceScore')):
                stats += f"👥 Audience Score: {movie_info['audienceScore']}%\n"
            if pd.notna(movie_info.get('tomatoMeter')):
                stats += f"🍅 Tomatometer: {movie_info['tomatoMeter']}%\n"
            
            # Review statistics
            stats += f"\n📊 Review Analysis:\n"
            stats += f"• Total Reviews: {len(movie_reviews)}\n"
            
            # Review state distribution
            if 'reviewState' in movie_reviews.columns:
                review_states = movie_reviews['reviewState'].value_counts()
                for state, count in review_states.items():
                    stats += f"• {state.title()}: {count} reviews\n"
            
            # Sentiment distribution
            if 'scoreSentiment' in movie_reviews.columns:
                sentiments = movie_reviews['scoreSentiment'].value_counts()
                stats += f"\n🎭 Sentiment Breakdown:\n"
                for sentiment, count in sentiments.items():
                    stats += f"• {sentiment}: {count} reviews\n"
            
            # Top critics
            top_critics = movie_reviews[movie_reviews['isTopCritic'] == True]
            if len(top_critics) > 0:
                stats += f"• Top Critics: {len(top_critics)} reviews\n"
            
            return stats
        else:
            # General dataset statistics
            stats = f"🍅 Rotten Tomatoes Dataset Statistics:\n"
            stats += f"═══════════════════════════════════\n"
            stats += f"📊 Overview:\n"
            stats += f"• Total Movies: {len(movies_df):,}\n"
            stats += f"• Total Reviews: {len(reviews_df):,}\n"
            stats += f"• Reviews in Current Sample: {len(merged_df):,}\n"
            stats += f"• Average Reviews per Movie: {len(reviews_df)/len(movies_df):.1f}\n"
            
            # Genre distribution (top 5)
            if 'genre_clean' in merged_df.columns:
                top_genres = merged_df['genre_clean'].value_counts().head(5)
                stats += f"\n🎭 Top Genres:\n"
                for genre, count in top_genres.items():
                    if pd.notna(genre):
                        stats += f"• {genre}: {count} reviews\n"
            
            # Review state distribution
            if 'reviewState' in merged_df.columns:
                review_states = merged_df['reviewState'].value_counts()
                stats += f"\n🏆 Review States:\n"
                for state, count in review_states.items():
                    stats += f"• {state}: {count} reviews\n"
            
            # Top critics
            top_critics_count = merged_df[merged_df['isTopCritic'] == True]
            stats += f"\n⭐ Critics:\n"
            stats += f"• Top Critics: {len(top_critics_count):,} reviews\n"
            stats += f"• Regular Critics: {len(merged_df) - len(top_critics_count):,} reviews\n"
            
            return stats
            
    except Exception as e:
        return f"Error analyzing statistics: {str(e)}"

# Tool 3: Rating and Review Analysis Tool
@tool
def analyze_movie_ratings(movie_name: str) -> str:
    """
    Analyze ratings and review sentiment for a specific movie from Rotten Tomatoes.
    Shows audience score, tomatometer, critic consensus, and sentiment analysis.
    """
    try:
        movie_data = merged_df[
            merged_df['title_clean'].str.contains(movie_name, case=False, na=False)
        ]
        
        if movie_data.empty:
            return f"No rating data found for '{movie_name}' in Rotten Tomatoes dataset."
        
        movie_info = movie_data.iloc[0]
        
        analysis = f"🍅 Rotten Tomatoes Analysis for '{movie_info.get('title_clean', movie_name)}':\n"
        analysis += f"═══════════════════════════════════\n"
        
        # Official scores
        if pd.notna(movie_info.get('tomatoMeter')):
            analysis += f"🍅 Tomatometer: {movie_info['tomatoMeter']}% (Critics)\n"
        if pd.notna(movie_info.get('audienceScore')):
            analysis += f"🍿 Audience Score: {movie_info['audienceScore']}%\n"
        
        # Review breakdown
        fresh_reviews = movie_data[movie_data['reviewState'] == 'fresh']
        rotten_reviews = movie_data[movie_data['reviewState'] == 'rotten']
        
        analysis += f"\n📊 Review Breakdown:\n"
        analysis += f"• Fresh Reviews: {len(fresh_reviews)}\n"
        analysis += f"• Rotten Reviews: {len(rotten_reviews)}\n"
        if len(movie_data) > 0:
            fresh_percentage = (len(fresh_reviews) / len(movie_data)) * 100
            analysis += f"• Fresh Percentage: {fresh_percentage:.1f}%\n"
        
        # Sentiment analysis
        positive_reviews = movie_data[movie_data['scoreSentiment'] == 'POSITIVE']
        negative_reviews = movie_data[movie_data['scoreSentiment'] == 'NEGATIVE']
        
        analysis += f"\n🎭 Sentiment Analysis:\n"
        analysis += f"• Positive: {len(positive_reviews)} reviews\n"
        analysis += f"• Negative: {len(negative_reviews)} reviews\n"
        
        # Top critics vs regular critics
        top_critic_reviews = movie_data[movie_data['isTopCritic'] == True]
        regular_reviews = movie_data[movie_data['isTopCritic'] == False]
        
        analysis += f"\n⭐ Critic Breakdown:\n"
        analysis += f"• Top Critics: {len(top_critic_reviews)} reviews\n"
        analysis += f"• Regular Critics: {len(regular_reviews)} reviews\n"
        
        return analysis
        
    except Exception as e:
        return f"Error analyzing ratings: {str(e)}"

# Tool 4: External Movie Search (when local data is insufficient)
# 🔎 Enhanced external-search tool
@tool
def search_external_movie_info(query: str) -> str:
    """
    Search external sites (IMDb, Metacritic, Letterboxd, Rotten Tomatoes, etc.)
    for reviews, ratings, or recent news about a movie.

    • Uses Tavily if available, SerpAPI next, then falls back to the tool's `.run`.
    • Returns the first three snippets with sources.
    """
    try:
        # 🚀 1. Build a richer multi-site query
        review_sites = [
            "Rotten Tomatoes", "IMDb", "Metacritic", "Letterboxd",
            "Roger Ebert", "The Guardian film review"
        ]
        joined_sites = " OR ".join(f'"{site}"' for site in review_sites)
        # Example →  movie Inception reviews ratings "Rotten Tomatoes" OR "IMDb" ...
        search_string = f'movie {query} reviews ratings {joined_sites}'

        # 🚀 2. Dispatch to whichever external search tool you have
        if has_tavily:
            result = external_search_tool.invoke({"query": search_string})
            # Tavily returns a list of dicts → format the first three nicely
            snippets = []
            for item in result[:3]:
                if isinstance(item, dict):
                    url     = item.get("url", "")
                    content = (item.get("content", "") or "").strip()
                    snippets.append(f"Source: {url}\n{content[:200]}…")
            return "\n\n".join(snippets) if snippets else "No results found."
        
        elif has_serp:
            raw = external_search_tool.run(search_string)
            return raw[:500] + "…" if len(raw) > 500 else raw
        
        else:  # generic `.run` fallback
            return external_search_tool.run(search_string)

    except Exception as e:
        return f"External search error: {e}"


# Create the agent's toolbox for Rotten Tomatoes analysis
agent_tools = [
    search_movie_reviews,
    analyze_movie_statistics, 
    analyze_movie_ratings,
    search_external_movie_info
]

print(f"✅ Created {len(agent_tools)} specialized tools for Rotten Tomatoes:")
for tool in agent_tools:
    print(f"  - {tool.name}: {tool.description}")

# Test each tool
print("\n🧪 Testing agent tools...")
print("Testing movie review search:")
test_result = search_movie_reviews.invoke({"query": "What do critics think about Inception?"})
print(f"✅ Review search: {test_result[:100]}...")

print("\nTesting statistics analysis:")
test_stats = analyze_movie_statistics.invoke({})
print(f"✅ Statistics: {test_stats[:200]}...")

print("\nTesting movie ratings analysis:")
test_ratings = analyze_movie_ratings.invoke({"movie_name": "Inception"})
print(f"✅ Ratings analysis: {test_ratings[:200]}...")

# NEW: test external movie info search
print("\nTesting external movie info search:")
test_external = search_external_movie_info.invoke({"query": "Inception"})
print(f"✅ External search: {test_external[:200]}...")

print("\n✅ All Rotten Tomatoes agent tools ready!")


🛠️ Creating specialized agent tools for Rotten Tomatoes data...
✅ Created 4 specialized tools for Rotten Tomatoes:
  - search_movie_reviews: Search through embedded movie reviews from Rotten Tomatoes.
Use this for questions about specific movies, ratings, or review content.
  - analyze_movie_statistics: Analyze statistics for a specific movie or provide general Rotten Tomatoes dataset statistics.
Returns ratings, review counts, critic information, and other numerical insights.
  - analyze_movie_ratings: Analyze ratings and review sentiment for a specific movie from Rotten Tomatoes.
Shows audience score, tomatometer, critic consensus, and sentiment analysis.
  - search_external_movie_info: Search external sites (IMDb, Metacritic, Letterboxd, Rotten Tomatoes, etc.)
for reviews, ratings, or recent news about a movie.

• Uses Tavily if available, SerpAPI next, then falls back to the tool's `.run`.
• Returns the first three snippets with sources.

🧪 Testing agent tools...
Testing movie revi

In [41]:
# Enhanced Agent State with Tool Selection
from typing_extensions import TypedDict, Annotated
from langgraph.graph.message import add_messages
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    question: str
    tool_calls: list
    final_answer: str

# Enhanced Agent with tool selection capabilities
print("🤖 Building enhanced agent with tool selection...")

# Create the agent prompt
AGENT_PROMPT = """You are an intelligent movie analysis agent with access to multiple specialized tools.

Your tools:
1. search_movie_reviews: Search embedded movie reviews from Letterboxd and Metacritic
2. analyze_movie_statistics: Get numerical statistics about movies and datasets  
3. compare_platform_ratings: Compare ratings between Letterboxd and Metacritic
4. search_external_movie_info: Search external sources when local data is insufficient

Guidelines:
- Start with local review data (search_movie_reviews) for most questions
- Use statistics tools for numerical analysis
- Use comparison tools for platform differences
- Only use external search when local data is clearly insufficient
- Always explain your reasoning and cite sources
- Provide comprehensive, insightful answers

Current question: {question}
"""

# Create enhanced chat model with tool binding
agent_model = ChatOpenAI(
    model="gpt-4o-mini", 
    temperature=0.1,
    max_tokens=1000
).bind_tools(agent_tools)

def agent_reasoning_node(state: AgentState) -> AgentState:
    """Agent reasoning and tool selection"""
    question = state["question"]
    messages = state.get("messages", [])
    
    # Create the prompt with current question
    prompt_message = HumanMessage(content=AGENT_PROMPT.format(question=question))
    
    # Get agent response with potential tool calls
    response = agent_model.invoke([prompt_message] + messages)
    
    return {
        "messages": [response],
        "tool_calls": response.tool_calls if hasattr(response, 'tool_calls') and response.tool_calls else []
    }

def tool_execution_node(state: AgentState) -> AgentState:
    """Execute selected tools"""
    tool_calls = state.get("tool_calls", [])
    messages = []
    
    for tool_call in tool_calls:
        tool_name = tool_call["name"]
        tool_args = tool_call["args"]
        
        # Find and execute the tool
        for tool in agent_tools:
            if tool.name == tool_name:
                try:
                    result = tool.invoke(tool_args)
                    # Create tool message
                    tool_message = ToolMessage(
                        content=str(result),
                        tool_call_id=tool_call["id"]
                    )
                    messages.append(tool_message)
                except Exception as e:
                    error_message = ToolMessage(
                        content=f"Error executing {tool_name}: {str(e)}",
                        tool_call_id=tool_call["id"]
                    )
                    messages.append(error_message)
                break
    
    return {"messages": messages}

def final_response_node(state: AgentState) -> AgentState:
    """Generate final response based on tool results"""
    messages = state["messages"]
    question = state["question"]
    
    # Create final prompt
    final_prompt = f"""
    Based on the tool results above, provide a comprehensive answer to the question: {question}
    
    Make sure to:
    - Synthesize information from multiple sources
    - Cite specific data points and sources
    - Provide insights beyond just raw data
    - Be conversational but informative
    """
    
    final_response = chat_model.invoke(messages + [HumanMessage(content=final_prompt)])
    
    return {
        "final_answer": final_response.content,
        "messages": [final_response]
    }

# Build the enhanced agent graph
print("🔗 Building agent workflow...")

from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode

# Create agent graph
agent_graph = StateGraph(AgentState)

# Add nodes
agent_graph.add_node("agent", agent_reasoning_node)
agent_graph.add_node("tools", ToolNode(agent_tools))
agent_graph.add_node("final_response", final_response_node)

# Add edges
agent_graph.add_edge(START, "agent")

# Conditional edge: if agent makes tool calls, go to tools; otherwise go to final response
def should_continue(state: AgentState) -> str:
    tool_calls = state.get("tool_calls", [])
    if tool_calls:
        return "tools"
    else:
        return "final_response"

agent_graph.add_conditional_edges("agent", should_continue)
agent_graph.add_edge("tools", "final_response")
agent_graph.add_edge("final_response", END)

# Compile the enhanced agent
enhanced_agent = agent_graph.compile()

print("✅ Enhanced agent with tool selection ready!")

# Create a simple query function for the enhanced agent
def query_enhanced_agent(question: str) -> Dict[str, Any]:
    """Query the enhanced agent with tool selection"""
    try:
        result = enhanced_agent.invoke({
            "question": question,
            "messages": [],
            "tool_calls": [],
            "final_answer": ""
        })
        
        return {
            "question": question,
            "answer": result.get("final_answer", "No answer generated"),
            "tool_calls_made": len(result.get("tool_calls", [])),
            "success": True
        }
    except Exception as e:
        return {
            "question": question,
            "answer": f"Error: {str(e)}",
            "tool_calls_made": 0,
            "success": False
        }

print("🚀 Enhanced agent ready for complex movie analysis!")


🤖 Building enhanced agent with tool selection...
🔗 Building agent workflow...
✅ Enhanced agent with tool selection ready!
🚀 Enhanced agent ready for complex movie analysis!


### Step 11: Testing the Enhanced Agent

In [34]:

# Test the enhanced agent with complex queries
print("🧪 Testing Enhanced Agent with Complex Queries")
print("=" * 60)

# Test cases that will showcase different tool usage
test_cases = [
    {
        "query": "What are the statistics for movies in our dataset?", 
        "expected_tools": ["analyze_movie_statistics"]
    },
    {
        "query": "Tell me about a movie that came out in 2024",
        "expected_tools": ["search_movie_reviews", "search_external_movie_info"]
    },
    {
        "query": "What do people think about The Dark Knight and how does it compare across platforms?",
        "expected_tools": ["search_movie_reviews","search_external_movie_info"]
    }
]

for i, test_case in enumerate(test_cases, 1):
    print(f"\n🔍 Test {i}: {test_case['query']}")
    print("-" * 50)
    
    try:
        result = query_enhanced_agent(test_case['query'])
        
        if result['success']:
            print(f"✅ Success!")
            print(f"📝 Answer: {result['answer'][:300]}...")
            print(f"🛠️ Tools used: {result['tool_calls_made']} tool calls")
        else:
            print(f"❌ Failed: {result['answer']}")
            
    except Exception as e:
        print(f"❌ Error: {str(e)}")
    
    print("=" * 60)

print("\n🎯 Agent Testing Complete!")

# Interactive enhanced agent function
def interactive_enhanced_agent():
    """Interactive interface for the enhanced agent"""
    print("🎬 Enhanced Movie Reviews Agent")
    print("=" * 50)
    print("I'm an intelligent agent with multiple tools:")
    print("- Local movie review search")
    print("- Statistical analysis")
    print("- Platform comparison")
    print("- External movie information")
    print("=" * 50)
    print("Ask me anything about movies! Type 'quit' to exit.")
    
    while True:
        try:
            question = input("\n🤔 Your question: ")
            
            if question.lower() in ['quit', 'exit', 'q']:
                print("👋 Goodbye!")
                break
            
            if not question.strip():
                continue
            
            print("\n🔍 Analyzing your question and selecting tools...")
            result = query_enhanced_agent(question)
            
            if result['success']:
                print(f"\n📝 Answer: {result['answer']}")
                print(f"\n🛠️ I used {result['tool_calls_made']} specialized tools to answer your question.")
            else:
                print(f"\n❌ Sorry, I encountered an error: {result['answer']}")
            
        except KeyboardInterrupt:
            print("\n👋 Goodbye!")
            break
        except Exception as e:
            print(f"\n❌ Error: {str(e)}")

# Show usage instructions
print("\n💡 Usage Options:")
print("1. Use query_enhanced_agent('your question') for direct queries")
print("2. Uncomment interactive_enhanced_agent() for interactive mode")
print("3. The agent will automatically select the best tools for each question")

# Uncomment to start interactive mode:
# interactive_enhanced_agent()

print("\n✅ Enhanced Agent with external search capabilities is ready!")
print("🌟 Features:")
print("  - Multi-tool reasoning")
print("  - Automatic tool selection") 
print("  - External search fallback")
print("  - Comprehensive movie analysis")


🧪 Testing Enhanced Agent with Complex Queries

🔍 Test 1: What are the statistics for movies in our dataset?
--------------------------------------------------
✅ Success!
📝 Answer: The dataset from Rotten Tomatoes provides a fascinating glimpse into the world of film reviews, showcasing a wealth of information about the movies and their reception. Here’s a comprehensive overview of the statistics:

### Overview of the Dataset
The dataset comprises a total of **143,258 movies**...
🛠️ Tools used: 1 tool calls

🔍 Test 2: Tell me about a movie that came out in 2024
--------------------------------------------------
✅ Success!
📝 Answer: One of the standout movies that premiered in 2024 is **"Wicked,"** which has garnered significant attention and acclaim. This film, based on the popular Broadway musical, has been a highly anticipated adaptation, and it ultimately won the prestigious Golden Tomato Award for Best Movie of 2024, as re...
🛠️ Tools used: 1 tool calls

🔍 Test 3: What do people thi

## Phase 3: Evaluation and Monitoring

Now we'll add comprehensive evaluation using RAGAS and monitoring with LangSmith to ensure our agent is performing optimally.


### Step 12: LangSmith Tracing Setup


In [42]:
# LangSmith Tracing Setup
from uuid import uuid4
import time

print("📊 Setting up LangSmith Tracing...")

# Generate unique project ID for this session
unique_id = uuid4().hex[:8]
project_name = f"Movie-Reviews-RAG-Agent-{unique_id}"

# Configure LangSmith if API key is available
if os.getenv("LANGSMITH_API_KEY"):
    # Set LangSmith environment variables
    os.environ["LANGSMITH_PROJECT"] = project_name
    
    # Verify tracing is enabled
    print(f"✅ LangSmith tracing enabled")
    print(f"🎯 Project: {project_name}")
    print(f"🔗 Dashboard: https://smith.langchain.com/")
    
    # Test LangSmith connection
    try:
        from langsmith import Client
        client = Client()
        
        # Create a test run to verify connection
        print("🧪 Testing LangSmith connection...")
        print("✅ LangSmith connected successfully!")
        
        # Show project URL
        print(f"📈 View traces at: https://smith.langchain.com/o/{client.info.tenant_id}/projects/p/{project_name}")
        
    except Exception as e:
        print(f"❌ LangSmith connection failed: {e}")
        print("⚠️ Continuing without advanced tracing...")
        
else:
    print("⚠️ LangSmith API key not set - basic logging only")
    print("💡 Set LANGSMITH_API_KEY to enable advanced tracing and evaluation")

# Enhanced query function with tracing
def query_enhanced_agent_with_tracing(question: str, run_name: str = None) -> Dict[str, Any]:
    """Query the enhanced agent with LangSmith tracing"""
    
    # Generate run name if not provided
    if not run_name:
        run_name = f"movie_query_{int(time.time())}"
    
    # Add tags for better organization
    tags = ["movie-reviews", "rag-agent", "multi-tool"]
    
    try:
        # Execute with tracing metadata
        start_time = time.time()
        
        result = enhanced_agent.invoke(
            {
                "question": question,
                "messages": [],
                "tool_calls": [],
                "final_answer": ""
            },
            config={
                "tags": tags,
                "metadata": {
                    "query_type": "movie_analysis",
                    "session_id": unique_id,
                    "run_name": run_name
                }
            }
        )
        
        end_time = time.time()
        execution_time = end_time - start_time
        
        return {
            "question": question,
            "answer": result.get("final_answer", "No answer generated"),
            "tool_calls_made": len(result.get("tool_calls", [])),
            "execution_time": execution_time,
            "run_name": run_name,
            "success": True
        }
        
    except Exception as e:
        return {
            "question": question,
            "answer": f"Error: {str(e)}",
            "tool_calls_made": 0,
            "execution_time": 0,
            "run_name": run_name,
            "success": False
        }

# Test tracing with a sample query
print("\n🧪 Testing enhanced agent with tracing...")
test_result = query_enhanced_agent_with_tracing(
    "What do people think about The Dark Knight?", 
    run_name="tracing_test"
)

if test_result['success']:
    print(f"✅ Tracing test successful!")
    print(f"📝 Answer: {test_result['answer'][:100]}...")
    print(f"⏱️ Execution time: {test_result['execution_time']:.2f}s")
    print(f"🛠️ Tools used: {test_result['tool_calls_made']}")
    if os.getenv("LANGSMITH_API_KEY"):
        print(f"🔍 View trace: Check LangSmith dashboard for run '{test_result['run_name']}'")
else:
    print(f"❌ Tracing test failed: {test_result['answer']}")

print("\n✅ LangSmith tracing setup complete!")


📊 Setting up LangSmith Tracing...
✅ LangSmith tracing enabled
🎯 Project: Movie-Reviews-RAG-Agent-d56ca052
🔗 Dashboard: https://smith.langchain.com/
🧪 Testing LangSmith connection...
✅ LangSmith connected successfully!
❌ LangSmith connection failed: 'LangSmithInfo' object has no attribute 'tenant_id'
⚠️ Continuing without advanced tracing...

🧪 Testing enhanced agent with tracing...
✅ Tracing test successful!
📝 Answer: While I don't have access to specific reviews or data points from the tool results, I can provide a ...
⏱️ Execution time: 15.83s
🛠️ Tools used: 1
🔍 View trace: Check LangSmith dashboard for run 'tracing_test'

✅ LangSmith tracing setup complete!


In [43]:
#query_enhanced_agent_with_tracing("What movie is the best rated movie on IMDB for 2025?")

{'question': 'What movie is the best rated movie on IMDB for 2025?',
 'answer': 'As of now, the best-rated movie on IMDb for 2025 is "Sinners," directed by Ryan Coogler and featuring Michael B. Jordan. This film has managed to break into the coveted IMDb Top 250 movies list, which is a significant achievement given the competitive nature of the film industry.\n\nWhile specific ratings for "Sinners" weren\'t detailed in the sources, its inclusion in the Top 250 indicates a strong reception from both audiences and critics alike. This is particularly noteworthy as it reflects the film\'s impact and quality, especially in a year that has seen numerous releases.\n\nThe collaboration between Coogler and Jordan has previously yielded successful projects, such as the "Creed" series and "Black Panther," which sets high expectations for "Sinners." Given their track record, it\'s likely that this film combines compelling storytelling with strong performances, contributing to its high rating.\n\nF

### Step 13: RAGAS Evaluation Framework


In [70]:
# 🎯 Generating a general/comparative Golden Test Set using RAGAS
print("🎯 Generating Golden Test Set using RAGAS...")
print("=" * 70)

# -----------------------------
# Config knobs
# -----------------------------
MAX_TITLES            = 80       # widen unique title coverage
PER_TITLE_REVIEWS     = 4        # cap reviews per title to keep breadth
GEN_DOCS_LIMIT        = 200      # how many docs to feed into generator
TESTSET_SIZE          = 10       # number of questions to generate
RAND_SEED             = 42

# Guidance: general, movie-level questions (not single-review)
GENERATION_GUIDELINES = """
You are generating questions for evaluating a movie QA system.
The corpus consists of people's reviews of movies (subjective opinions from critics/audiences).

Generate questions that:
- Are GENERAL about movies or comparisons/similarities across movies, directors, genres, time periods, or sentiments.
- Do NOT ask about a single specific reviewer's wording, a single outlet, or a quote-level detail.
- Encourage retrieval across multiple documents (e.g., "Compare audience vs critic sentiment for X and Y", "Which genres show higher variance in sentiment?", "Do Nolan films receive more 'fresh' ratings than Villeneuve films?", etc.)
- Can be answered from aggregated patterns in reviews (scores, sentiments, themes), not from a single snippet.

Avoid:
- “What did [reviewer/outlet] say about <movie>?”
- “Quote the line where...”
- Any question that hinges on one review’s phrasing.
"""

# -----------------------------
# Helper: prepare a larger, diverse document sample
# -----------------------------
MIN_TOKENS_PER_DOC = 120     # anything < 100 triggers RAGAS's error

def token_len(text: str) -> int:
    return len(text.split())

def prepare_documents_for_ragas() -> list:
    """
    Build Documents that are guaranteed to meet the min-token rule.
    Strategy:
      • Pick a broad set of review rows (as before)      → breadth
      • Concatenate rows from the same movie until >= N  → length
    """
    from langchain_core.documents import Document
    import random
    random.seed(RAND_SEED)

    # 1️⃣  Group rows by movie title
    by_title = {}
    for row in all_documents:
        title = row.get("metadata", {}).get("title_clean") or "UNKNOWN_TITLE"
        by_title.setdefault(title, []).append(row)

    # 2️⃣  Shuffle titles for randomness and pick a subset for breadth
    chosen_titles = random.sample(list(by_title), k=min(MAX_TITLES, len(by_title)))

    docs = []

    # 3️⃣  For each title, concatenate reviews until the merged block is long enough
    for title in chosen_titles:
        # Sort so we get a mix of sentiments & critics in the concat
        random.shuffle(by_title[title])

        buffer = []
        running_text = ""
        running_meta = {}

        for rev in by_title[title]:
            running_text += rev["content"].strip() + "\n\n"
            # Merge metadata (keep first non-None values)
            for k, v in rev.get("metadata", {}).items():
                running_meta.setdefault(k, v)

            if token_len(running_text) >= MIN_TOKENS_PER_DOC:
                # ✅ This block is long enough – push it as a Document
                docs.append(
                    Document(
                        page_content=running_text.strip(),
                        metadata=running_meta | {"merged_reviews": len(buffer) + 1}
                    )
                )
                # reset for next chunk of the same title
                buffer.clear()
                running_text = ""
                running_meta = {}

            else:
                buffer.append(rev)

        # If leftovers are still < MIN_TOKENS, append them to previous doc or skip
        if running_text:
            if docs and token_len(running_text) < MIN_TOKENS_PER_DOC:
                docs[-1].page_content += "\n\n" + running_text.strip()
                docs[-1].metadata["merged_reviews"] += len(buffer)
            else:
                docs.append(
                    Document(
                        page_content=running_text.strip(),
                        metadata=running_meta | {"merged_reviews": len(buffer)}
                    )
                )

        # Stop once we hit our global limit
        if len(docs) >= GEN_DOCS_LIMIT:
            break

    # 4️⃣  Clip to limit & prepend the generation-guidelines doc
    docs = [Document(GENERATION_GUIDELINES.strip(), metadata={"role": "generation_guidelines"})] + \
           docs[:GEN_DOCS_LIMIT]

    print(f"   ➜  {len(docs)-1} review docs ready (each ≥ {MIN_TOKENS_PER_DOC} tokens) "
          f"+ 1 guidelines doc")
    return docs


try:
    from ragas.llms import LangchainLLMWrapper
    from ragas.embeddings import LangchainEmbeddingsWrapper
    from ragas.testset import TestsetGenerator

    print("✅ RAGAS imports successful")

    # Prepare documents for synthetic generation
    print("📄 Preparing documents for test set generation...")
    rag_docs = prepare_documents_for_ragas()
    # Subtract 1 because the first doc is the guidelines doc
    print(f"   Selected {len(rag_docs)-1} review docs (+1 guidelines doc)")

    # Set up RAGAS generator models
    print("🤖 Setting up RAGAS generator models...")
    generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0.7))
    generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

    # Create test set generator
    print("⚙️ Creating RAGAS test set generator...")
    generator = TestsetGenerator(
        llm=generator_llm,
        embedding_model=generator_embeddings
    )

    # Generate synthetic test set
    print(f"🔬 Generating {TESTSET_SIZE} general/comparative questions "
          f"from {min(len(rag_docs), GEN_DOCS_LIMIT)} docs (this may take a few minutes)...")

    synthetic_dataset = generator.generate_with_langchain_docs(
        documents=rag_docs,        # includes the guidelines doc + diverse reviews
        testset_size=TESTSET_SIZE, # ✅ 10 questions
    )

    print("✅ Synthetic test set generated successfully!")

    # Convert to DataFrame and display
    synthetic_df = synthetic_dataset.to_pandas()
    print(f"\n📊 Generated {len(synthetic_df)} synthetic test cases")

    # Optional: light post-filter to nudge away from single-review phrasing
    def looks_too_review_specific(q: str) -> bool:
        ql = q.lower()
        triggers = [
            "what did", "what does", "according to this review", "quote", "in the following review",
            "the reviewer", "this critic", "as stated above"
        ]
        return any(t in ql for t in triggers)

    filtered_df = synthetic_df[~synthetic_df["user_input"].apply(looks_too_review_specific)]
    if len(filtered_df) < TESTSET_SIZE:
        print("⚠️ Some questions looked too review-specific; keeping the rest.")
    synthetic_df = filtered_df.head(TESTSET_SIZE)

    # Show sample questions
    print("\n📝 Generated Questions (general/comparative):")
    print("-" * 50)
    for i, row in synthetic_df.head(10).iterrows():
        print(f"Q{i+1}: {row['user_input']}")
        ref = row.get("reference", "")
        if isinstance(ref, str) and ref.strip():
            print(f"Expected Answer: {ref[:100]}...")
        print("-" * 50)

    # Store for evaluation
    golden_test_set = synthetic_dataset

except ImportError as e:
    print(f"❌ RAGAS import error: {e}")
    print("💡 Using fallback manual test set with general/comparative questions...")
    golden_test_set = None
    synthetic_df = pd.DataFrame({
        'user_input': [
            "Compare audience versus critic sentiment for Christopher Nolan and Denis Villeneuve films.",
            "Which genres show the widest spread between fresh and rotten reviews?",
            "Do sequels tend to have lower critic scores than the originals?",
            "Which directors in our data are most consistently rated 'fresh' across their filmographies?",
            "How do review sentiments for sci-fi change over the last two decades?",
            "Which pairs of movies are most similar in review themes despite different genres?",
            "Are audience scores systematically higher than critic scores for comedies?",
            "Which studios show the highest median Tomatometer across their releases?",
            "Do top-critic reviews differ in sentiment distribution from regular critics?",
            "Which years show the largest gap between audience and critic reception overall?"
        ],
        'reference': [
            "Aggregate sentiments across reviews for films by Nolan and Villeneuve and compare distributions.",
            "Compute variance or IQR of sentiments per genre and rank by spread.",
            "Group by sequel/original flag, compare mean critic scores.",
            "Per-director median Tomatometer and fraction of 'fresh' reviews.",
            "Bucket reviews by year and compute sentiment trend lines for sci-fi.",
            "Use embedding similarity on review themes to find cross-genre pairs.",
            "Compare audience vs critic distributions for comedies.",
            "Group by studio, compute median Tomatometer.",
            "Compare sentiment histograms for isTopCritic=True vs False.",
            "Aggregate per year: mean audience minus critic score."
        ]
    })

except Exception as e:
    print(f"❌ Synthetic generation error: {e}")
    print("💡 This might be due to API rate limits. Using manual test set...")
    golden_test_set = None
    synthetic_df = pd.DataFrame({
        'user_input': [
            "Which genres show higher average critic scores than audience scores?",
            "Compare sentiment distributions for franchises versus standalones.",
            "Which directors have the tightest variance in reception?",
            "How has the share of 'fresh' reviews changed over time?",
            "Which studios are most frequently associated with 'rotten' outcomes?",
            "Which movies from different genres share similar review themes?",
            "Do top-critic reviews skew harsher than non-top critics?",
            "Which release years correlate with higher audience enthusiasm?",
            "Are reboots systematically received differently than originals?",
            "Which platforms (streaming vs theatrical) correlate with higher critic scores?"
        ],
        'reference': [
            "Compute genre-level deltas between critic and audience averages.",
            "Label franchise vs standalone and compare distributions.",
            "Per-director standard deviation of scores.",
            "Time-series of fraction fresh per year.",
            "Studio-level rotten frequency.",
            "Cross-genre semantic similarity on review topics.",
            "Histogram comparison isTopCritic vs False.",
            "Year vs audience score correlation.",
            "Compare reboot vs original label distributions.",
            "Platform label vs critic score averages."
        ]
    })

print(f"\n✅ Golden test set ready with {len(synthetic_df)} questions")
print("🎯 Ready for comprehensive RAGAS evaluation!")


🎯 Generating Golden Test Set using RAGAS...
✅ RAGAS imports successful
📄 Preparing documents for test set generation...
   ➜  67 review docs ready (each ≥ 120 tokens) + 1 guidelines doc
   Selected 67 review docs (+1 guidelines doc)
🤖 Setting up RAGAS generator models...
⚙️ Creating RAGAS test set generator...
🔬 Generating 10 general/comparative questions from 68 docs (this may take a few minutes)...


Applying SummaryExtractor:   0%|          | 0/68 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/68 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/198 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

✅ Synthetic test set generated successfully!

📊 Generated 12 synthetic test cases

📝 Generated Questions (general/comparative):
--------------------------------------------------
Q1: What Unbranded movie about?
Expected Answer: Unbranded is a documentary directed by Phillip Baribeau that has moments of spectacle and danger, bu...
--------------------------------------------------
Q3: What was the sentiment of the review for the movie 'Violet' published in the Los Angeles Times?
Expected Answer: The sentiment of the review for the movie 'Violet' published in the Los Angeles Times was negative, ...
--------------------------------------------------
Q4: What is the overall sentiment of the film Paa according to the review?
Expected Answer: The overall sentiment of the film Paa is positive, as indicated by the review stating that the film ...
--------------------------------------------------
Q5: What are the reviews for the drama movie Peppermint Candy and how does it compare to other dra

In [71]:
from ragas import EvaluationDataset

synthetic_data_ready = EvaluationDataset.from_pandas(synthetic_df)

In [72]:
synthetic_df.head()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What Unbranded movie about?,[Movie: La Sapienza\nGenre: Drama\nDirector: E...,Unbranded is a documentary directed by Phillip...,single_hop_specifc_query_synthesizer
2,What was the sentiment of the review for the m...,[Movie: Violet\nGenre: Drama\nDirector: Bas De...,The sentiment of the review for the movie 'Vio...,single_hop_specifc_query_synthesizer
3,What is the overall sentiment of the film Paa ...,[Movie: La Sapienza\nGenre: Drama\nDirector: E...,The overall sentiment of the film Paa is posit...,single_hop_specifc_query_synthesizer
4,What are the reviews for the drama movie Peppe...,[<1-hop>\n\nMovie: The Truth About Love\nGenre...,The reviews for the drama movie Peppermint Can...,multi_hop_abstract_query_synthesizer
5,What themes related to women’s rights are expl...,[<1-hop>\n\nMovie: La Sapienza\nGenre: Drama\n...,"In the documentary 'Seeing Allred', the theme ...",multi_hop_abstract_query_synthesizer


In [73]:
# Step 1: Prepare evaluation dataset and run RAG system
print("📊 Starting RAG System Evaluation...")
print("=" * 60)

# Check if we have synthetic_data_ready
try:
    print(f"✅ Found synthetic dataset: {len(synthetic_data_ready)} samples")
    eval_dataset = synthetic_data_ready
except NameError:
    print("❌ synthetic_data_ready not found. Please ensure you have generated your synthetic dataset first.")
    raise

# Generate responses from your RAG system for each question
def evaluate_rag_system(dataset, use_enhanced_agent=True):
    """Generate responses from RAG system and prepare for RAGAS evaluation"""
    
    evaluation_data = []
    print(f"🤖 Generating responses for {len(dataset)} questions...")
    
    for i, sample in enumerate(dataset.samples):
        question = sample.user_input
        reference = sample.reference if hasattr(sample, 'reference') else ""
        
        print(f"Processing {i+1}/{len(dataset)}: {question[:50]}...")
        
        try:
            if use_enhanced_agent:
                # Use your enhanced agent
                result = query_enhanced_agent_with_tracing(
                    question, 
                    run_name=f"ragas_eval_{i+1}"
                )
                answer = result["answer"]
                
                # Get contexts from RAG retrieval (simplified approach)
                # In practice, you'd extract actual retrieved contexts
                rag_result = rag_graph.invoke({"question": question})
                contexts = [doc.page_content for doc in rag_result["context"]]
                
            else:
                # Use basic RAG system
                rag_result = rag_graph.invoke({"question": question})
                answer = rag_result["response"]
                contexts = [doc.page_content for doc in rag_result["context"]]
            
            # Store in RAGAS format
            evaluation_data.append({
                "question": question,
                "answer": answer,
                "contexts": contexts,
                "ground_truth": reference
            })
            
        except Exception as e:
            print(f"❌ Error processing question {i+1}: {e}")
            # Add error placeholder to maintain dataset integrity
            evaluation_data.append({
                "question": question,
                "answer": f"Error: {str(e)}",
                "contexts": ["Error retrieving context"],
                "ground_truth": reference
            })
    
    return evaluation_data

# Generate evaluation responses
eval_results = evaluate_rag_system(eval_dataset, use_enhanced_agent=True)
print(f"✅ Generated {len(eval_results)} evaluation responses")


📊 Starting RAG System Evaluation...
✅ Found synthetic dataset: 10 samples
🤖 Generating responses for 10 questions...
Processing 1/10: What Unbranded movie about?...
Processing 2/10: What was the sentiment of the review for the movie...
Processing 3/10: What is the overall sentiment of the film Paa acco...
Processing 4/10: What are the reviews for the drama movie Peppermin...
Processing 5/10: What themes related to women’s rights are explored...
Processing 6/10: What are the positive sentiments expressed in the ...
Processing 7/10: What is the sentiment of the reviews for La Sapien...
Processing 8/10: What are the contrasting reviews of the movie Adió...
Processing 9/10: What are the reviews and ratings for the movie 'St...
Processing 10/10: What are the sentiments expressed in the reviews o...
✅ Generated 10 evaluation responses


In [None]:
# Step 2: Run RAGAS Evaluation
print("\n🔬 Running RAGAS Evaluation...")

try:
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        faithfulness, 
        context_precision,
        context_recall,
        answer_correctness
    )
    from ragas import EvaluationDataset
    import pandas as pd
    
    # Convert evaluation results to RAGAS format with correct column names
    eval_df = pd.DataFrame(eval_results)
    
    # RAGAS expects specific column names - rename them
    ragas_df = eval_df.rename(columns={
        'question': 'user_input',
        'answer': 'response',
        'contexts': 'retrieved_contexts',
        'ground_truth': 'reference'
    })
    
    # Create RAGAS evaluation dataset
    ragas_dataset = EvaluationDataset.from_pandas(ragas_df)
    
    print(f"📊 Created RAGAS dataset with {len(ragas_dataset)} samples")
    
    # Define evaluation metrics
    metrics = [
        answer_relevancy,
        faithfulness, 
        context_precision,
        context_recall,
        answer_correctness
    ]
    
    print(f"📏 Using {len(metrics)} RAGAS metrics:")
    for metric in metrics:
        print(f"  - {metric.__class__.__name__}")
    
    # Configure evaluation with timeout
    from ragas import RunConfig
    
    run_config = RunConfig(
        timeout=300,  # 5 minutes timeout
        max_retries=2
    )
    
    # Run RAGAS evaluation
    print("⏳ Running evaluation (this may take several minutes)...")
    
    ragas_result = evaluate(
        dataset=ragas_dataset,
        metrics=metrics,
        llm=chat_model,  # Use your existing chat model
        embeddings=embedding_model,  # Use your existing embedding model
        run_config=run_config
    )
    
    print("✅ RAGAS evaluation completed!")
    
    # Display results - handle different RAGAS result formats
    print("\n📊 RAGAS Evaluation Results:")
    print("=" * 40)
    
    # Try to access results in different ways based on RAGAS version
    try:
        print(f"🔍 Debug: RAGAS result type: {type(ragas_result)}")
        
        # Method 1: Try to access scores directly from result object
        if hasattr(ragas_result, 'to_pandas'):
            print("📊 Using to_pandas() method...")
            results_df = ragas_result.to_pandas()
            print(f"DataFrame columns: {list(results_df.columns)}")
            
            ragas_scores = {}
            for metric in metrics:
                metric_name = metric.__class__.__name__
                # Try different column name variations
                possible_names = [metric_name, metric_name.lower(), metric_name.replace('_', '')]
                for name in possible_names:
                    if name in results_df.columns:
                        score = results_df[name].mean()
                        print(f"{metric_name}: {score:.4f}")
                        ragas_scores[metric_name] = score
                        break
                else:
                    print(f"⚠️ Could not find column for {metric_name}")
        
        # Method 2: Try accessing as dictionary
        elif hasattr(ragas_result, '__dict__') and hasattr(ragas_result, 'scores'):
            print("📊 Using scores attribute...")
            scores = ragas_result.scores
            ragas_scores = {}
            for metric_name, score in scores.items():
                if isinstance(score, (int, float)):
                    print(f"{metric_name}: {score:.4f}")
                    ragas_scores[metric_name] = score
        
        # Method 3: Direct attribute access
        elif hasattr(ragas_result, '__dict__'):
            print("📊 Using direct attribute access...")
            ragas_scores = {}
            result_dict = ragas_result.__dict__
            print(f"Available attributes: {list(result_dict.keys())}")
            
            for metric in metrics:
                metric_name = metric.__class__.__name__
                # Try different attribute name variations
                possible_names = [
                    metric_name.lower(),
                    metric_name,
                    metric_name.replace('_', ''),
                    f"{metric_name.lower()}_score"
                ]
                
                for name in possible_names:
                    if hasattr(ragas_result, name):
                        score = getattr(ragas_result, name)
                        if isinstance(score, (int, float)):
                            print(f"{metric_name}: {score:.4f}")
                            ragas_scores[metric_name] = score
                            break
                else:
                    print(f"⚠️ Could not find attribute for {metric_name}")
        
        # Method 4: Inspect the object more thoroughly
        else:
            print("🔍 Detailed object inspection...")
            print(f"Object type: {type(ragas_result)}")
            print(f"Object attributes: {dir(ragas_result)}")
            
            # Try to find any numeric attributes
            ragas_scores = {}
            for attr_name in dir(ragas_result):
                if not attr_name.startswith('_'):
                    try:
                        attr_value = getattr(ragas_result, attr_name)
                        if isinstance(attr_value, (int, float)):
                            print(f"{attr_name}: {attr_value:.4f}")
                            ragas_scores[attr_name] = attr_value
                    except:
                        pass
        
        # If we still don't have scores, print everything we can
        if not ragas_scores:
            print("⚠️ No scores extracted. Full result object:")
            print(ragas_result)
            ragas_scores = {}
            
    except Exception as e:
        print(f"❌ Error parsing RAGAS results: {e}")
        print(f"Result type: {type(ragas_result)}")
        print(f"Result: {ragas_result}")
        ragas_scores = {}
    
except ImportError as e:
    print(f"❌ RAGAS import failed: {e}")
    print("💡 Please install RAGAS: pip install ragas")
    ragas_result = None
    ragas_scores = None
    
except Exception as e:
    print(f"⚠️ RAGAS evaluation failed: {e}")
    print("💡 This might be due to API rate limits or timeout issues")
    ragas_result = None
    ragas_scores = None



🔬 Running RAGAS Evaluation...
📊 Created RAGAS dataset with 10 samples
📏 Using 5 RAGAS metrics:
  - AnswerRelevancy
  - Faithfulness
  - ContextPrecision
  - ContextRecall
  - AnswerCorrectness
⏳ Running evaluation (this may take several minutes)...


Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

✅ RAGAS evaluation completed!

📊 RAGAS Evaluation Results:
🔍 Debug: RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
📊 Using to_pandas() method...
DataFrame columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'answer_relevancy', 'faithfulness', 'context_precision', 'context_recall', 'answer_correctness']
⚠️ Could not find column for AnswerRelevancy
Faithfulness: 0.5905
⚠️ Could not find column for ContextPrecision
⚠️ Could not find column for ContextRecall
⚠️ Could not find column for AnswerCorrectness


: 

In [77]:
# Step 3: LangSmith Dataset Integration
print("\n📊 LangSmith Dataset Integration...")

# Create LangSmith dataset for better visualization
if os.getenv("LANGSMITH_API_KEY"):
    try:
        from langsmith import Client
        from datetime import datetime
        
        client = Client()
        dataset_name = f"movie-rag-evaluation-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
        
        print(f"📝 Creating LangSmith dataset: {dataset_name}")
        
        # Create dataset
        dataset = client.create_dataset(
            dataset_name=dataset_name,
            description="Movie Reviews RAG System Evaluation with RAGAS metrics"
        )
        
        # Add examples to dataset
        for i, result in enumerate(eval_results):
            try:
                # Add the question and expected answer as an example
                client.create_example(
                    dataset_id=dataset.id,
                    inputs={"question": result["question"]},
                    outputs={"answer": result["answer"]},
                    metadata={
                        "ground_truth": result["ground_truth"],
                        "context_count": len(result["contexts"]),
                        "evaluation_id": f"eval_{i+1}"
                    }
                )
            except Exception as e:
                print(f"⚠️ Could not add example {i+1}: {e}")
        
        print(f"✅ LangSmith dataset created with {len(eval_results)} examples")
        print(f"🔗 View dataset: https://smith.langchain.com/")
        print(f"📊 Dataset name: {dataset_name}")
        
        # If we have RAGAS scores, add them as dataset metadata
        if ragas_scores:
            try:
                client.update_dataset(
                    dataset_id=dataset.id,
                    metadata={"ragas_scores": ragas_scores}
                )
                print("✅ RAGAS scores added to dataset metadata")
            except Exception as e:
                print(f"⚠️ Could not add RAGAS scores to metadata: {e}")
                
    except Exception as e:
        print(f"❌ LangSmith dataset creation failed: {e}")
        print("💡 Check your LANGSMITH_API_KEY and network connection")
else:
    print("⚠️ LangSmith API key not set - skipping dataset creation")
    print("💡 Set LANGSMITH_API_KEY to enable dataset visualization")

# Step 3: Extract RAGAS Scores and Create LangSmith Dataset
print("\n📊 LangSmith Dataset Integration with RAGAS Metrics...")
print("=" * 60)

# First, extract individual RAGAS scores from the result
individual_ragas_scores = []
if ragas_result and hasattr(ragas_result, 'to_pandas'):
    try:
        ragas_df = ragas_result.to_pandas()
        print(f"✅ Extracted RAGAS scores for {len(ragas_df)} samples")
        
        # Display the available columns
        print(f"📊 Available RAGAS metrics: {list(ragas_df.columns)}")
        
        # Extract individual scores
        for idx, row in ragas_df.iterrows():
            score_dict = {}
            for col in ragas_df.columns:
                if col not in ['user_input', 'response', 'retrieved_contexts', 'reference']:
                    score_dict[col] = row[col] if pd.notna(row[col]) else 0.0
            individual_ragas_scores.append(score_dict)
            
        # Calculate and display overall metrics
        print(f"\n📈 Overall RAGAS Metrics:")
        for col in ragas_df.columns:
            if col not in ['user_input', 'response', 'retrieved_contexts', 'reference'] and ragas_df[col].dtype in ['float64', 'int64']:
                avg_score = ragas_df[col].mean()
                print(f"  • {col}: {avg_score:.4f}")
                
    except Exception as e:
        print(f"⚠️ Could not extract individual RAGAS scores: {e}")
        individual_ragas_scores = [{} for _ in eval_results]
else:
    print("⚠️ RAGAS result not available for individual score extraction")
    individual_ragas_scores = [{} for _ in eval_results]

# Create LangSmith dataset with RAGAS metrics as columns
if os.getenv("LANGSMITH_API_KEY"):
    try:
        from langsmith import Client
        from datetime import datetime
        
        client = Client()
        dataset_name = f"movie-rag-ragas-evaluation-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
        
        print(f"\n📝 Creating LangSmith dataset: {dataset_name}")
        
        # Create dataset
        dataset = client.create_dataset(
            dataset_name=dataset_name,
            description="Movie Reviews RAG System Evaluation with RAGAS metrics as columns"
        )
        
        print(f"✅ Created LangSmith dataset: {dataset_name}")
        
        # Add examples to dataset with RAGAS scores as metadata
        for i, (result, ragas_scores) in enumerate(zip(eval_results, individual_ragas_scores)):
            try:
                # Prepare metadata with RAGAS scores
                metadata = {
                    "evaluation_id": i + 1,
                    "answer_length": len(result["answer"]),
                    "context_count": len(result["contexts"]),
                    "has_error": result["answer"].startswith("Error")
                }
                
                # Add individual RAGAS scores to metadata
                for metric_name, score in ragas_scores.items():
                    metadata[f"ragas_{metric_name}"] = float(score) if score is not None else 0.0
                
                # Create example
                client.create_example(
                    dataset_id=dataset.id,
                    inputs={
                        "question": result["question"]
                    },
                    outputs={
                        "answer": result["answer"],
                        "contexts": result["contexts"]
                    },
                    metadata=metadata
                )
                
            except Exception as e:
                print(f"⚠️ Error adding example {i+1}: {e}")
        
        print(f"\n🎯 LangSmith Dataset Summary:")
        print(f"  • Dataset Name: {dataset_name}")
        print(f"  • Total Examples: {len(eval_results)}")
        print(f"  • RAGAS Metrics: {len(individual_ragas_scores[0]) if individual_ragas_scores else 0} per example")
        print(f"  • View at: https://smith.langchain.com/datasets")
        print(f"\n💡 In LangSmith, you can now:")
        print(f"  - View faithfulness scores across all questions in one column")
        print(f"  - Sort/filter by answer_relevancy, context_precision, etc.")
        print(f"  - Compare performance across different question types")
        print(f"  - Export data for further analysis")
        
        # Save dataset ID for future reference
        print(f"\n📋 Dataset ID for future reference: {dataset.id}")
        
    except Exception as e:
        print(f"❌ LangSmith dataset creation failed: {e}")
        print("💡 Continuing with local evaluation results...")
        
else:
    print("⚠️ LangSmith API key not configured")
    print("💡 Set LANGSMITH_API_KEY to enable dataset creation")

# Also calculate custom metrics for completeness
def calculate_custom_metrics(evaluation_results):
    """Calculate custom evaluation metrics for RAG system performance"""
    
    successful_responses = [r for r in evaluation_results if not r['answer'].startswith('Error')]
    error_responses = [r for r in evaluation_results if r['answer'].startswith('Error')]
    
    # Basic performance metrics
    success_rate = len(successful_responses) / len(evaluation_results) * 100
    error_rate = len(error_responses) / len(evaluation_results) * 100
    
    if successful_responses:
        avg_answer_length = sum(len(r['answer']) for r in successful_responses) / len(successful_responses)
        avg_context_count = sum(len(r['contexts']) for r in successful_responses) / len(successful_responses)
    else:
        avg_answer_length = 0
        avg_context_count = 0
    
    return {
        "success_rate": success_rate,
        "error_rate": error_rate,
        "avg_answer_length": avg_answer_length,
        "avg_context_count": avg_context_count,
        "total_questions": len(evaluation_results),
        "successful_responses": len(successful_responses),
        "error_responses": len(error_responses)
    }

# Calculate custom metrics
custom_metrics = calculate_custom_metrics(eval_results)

print(f"\n🎯 Quick Summary:")
print(f"  ✅ Success Rate: {custom_metrics['success_rate']:.1f}%")
print(f"  📝 Avg Answer Length: {custom_metrics['avg_answer_length']:.0f} chars")
print(f"  📚 Avg Context Count: {custom_metrics['avg_context_count']:.1f} chunks")

print("\n✅ LangSmith dataset integration completed!")



📊 LangSmith Dataset Integration...
📝 Creating LangSmith dataset: movie-rag-evaluation-20250802-174414
✅ LangSmith dataset created with 10 examples
🔗 View dataset: https://smith.langchain.com/
📊 Dataset name: movie-rag-evaluation-20250802-174414

📈 Custom Evaluation Metrics and Analysis
🎯 RAG System Performance Metrics:
  ✅ Success Rate: 100.0%
  ❌ Error Rate: 0.0%
  📝 Average Answer Length: 2803.7 characters
  📚 Average Context Count: 5.0 chunks
  📄 Average Context Length: 1701.9 characters
  🎯 Keyword Relevance: 0.40
  📊 Response Completeness: 100.0%

🔍 LangSmith Tracing Summary:
  ✅ Active Project: Movie-Reviews-RAG-Agent-d56ca052
  📊 Evaluation Traces: 10
  🔗 Dashboard: https://smith.langchain.com/
  💡 View detailed traces for each evaluation question

🏆 Overall RAG System Assessment:
  Performance: 🟢 Excellent
  Reliability: 🟢 High
  Response Quality: 🟢 Detailed
  Context Usage: 🟢 Rich

✅ Custom evaluation metrics completed!


In [78]:
# Step 5: Detailed Analysis and Results Summary
print("\n🎉 COMPREHENSIVE EVALUATION SUMMARY")
print("=" * 80)

# Display detailed question-answer analysis
print("📝 Sample Evaluation Results:")
print("=" * 40)
for i, result in enumerate(eval_results[:3]):  # Show first 3 examples
    print(f"\n🔍 Sample {i+1}:")
    print(f"Question: {result['question']}")
    print(f"Answer: {result['answer'][:200]}...")
    print(f"Ground Truth: {result['ground_truth'][:100]}...")
    print(f"Context Sources: {len(result['contexts'])} chunks")
    print("-" * 40)

# Combine RAGAS and Custom metrics
print(f"\n📊 COMBINED EVALUATION RESULTS:")
print("=" * 40)

if ragas_scores:
    print("🔬 RAGAS Metrics:")
    for metric, score in ragas_scores.items():
        if isinstance(score, (int, float)):
            print(f"  • {metric}: {score:.3f}")
else:
    print("⚠️ RAGAS metrics not available")

print(f"\n📈 Custom Metrics:")
print(f"  • Success Rate: {custom_metrics['success_rate']:.1f}%")
print(f"  • Response Quality: {custom_metrics['avg_answer_length']:.0f} chars avg")
print(f"  • Context Utilization: {custom_metrics['avg_context_count']:.1f} chunks avg")
print(f"  • Relevance Score: {custom_metrics['avg_relevance']:.2f}")

# Cost and latency analysis (if LangSmith is available)
print(f"\n💰 Performance Analysis:")
if os.getenv("LANGSMITH_API_KEY"):
    print("  • LangSmith traces available for detailed cost/latency analysis")
    print("  • Check dashboard for per-question performance metrics")
else:
    print("  • Configure LangSmith for detailed cost/latency tracking")

# Quality assessment by answer type
print(f"\n🎯 Quality Assessment:")
if custom_metrics['success_rate'] >= 90 and (not ragas_scores or any(score > 0.7 for score in ragas_scores.values() if isinstance(score, (int, float)))):
    quality_grade = "A - Excellent"
    quality_color = "🟢"
elif custom_metrics['success_rate'] >= 75:
    quality_grade = "B - Good"
    quality_color = "🟡"
elif custom_metrics['success_rate'] >= 60:
    quality_grade = "C - Fair"
    quality_color = "🟠"
else:
    quality_grade = "D - Needs Improvement"
    quality_color = "🔴"

print(f"  Overall Grade: {quality_color} {quality_grade}")

# Recommendations
print(f"\n💡 Recommendations:")
if custom_metrics['error_rate'] > 10:
    print("  • Improve error handling and fallback mechanisms")
if custom_metrics['avg_answer_length'] < 200:
    print("  • Enhance response generation for more detailed answers")
if custom_metrics['avg_context_count'] < 2:
    print("  • Increase retrieval scope or improve context selection")
if custom_metrics['avg_relevance'] < 0.5:
    print("  • Fine-tune retrieval relevance or improve query understanding")
if not ragas_scores:
    print("  • Resolve RAGAS setup for comprehensive evaluation metrics")

# Save evaluation results
try:
    import pandas as pd
    import json
    
    # Save detailed results
    results_df = pd.DataFrame(eval_results)
    results_df.to_csv('movie_rag_evaluation_results.csv', index=False)
    
    # Save summary metrics
    summary_metrics = {
        "custom_metrics": custom_metrics,
        "ragas_scores": ragas_scores if ragas_scores else {},
        "evaluation_summary": {
            "total_questions": len(eval_results),
            "evaluation_date": pd.Timestamp.now().isoformat(),
            "quality_grade": quality_grade,
            "system_type": "Movie Reviews RAG with Agentic Enhancement"
        }
    }
    
    with open('movie_rag_evaluation_summary.json', 'w') as f:
        json.dump(summary_metrics, f, indent=2)
    
    print(f"\n💾 Results Saved:")
    print(f"  • Detailed results: movie_rag_evaluation_results.csv")
    print(f"  • Summary metrics: movie_rag_evaluation_summary.json")
    
except Exception as e:
    print(f"⚠️ Could not save results: {e}")

print(f"\n🏆 EVALUATION COMPLETE!")
print("Your Movie Reviews RAG system has been comprehensively evaluated using:")
print("  ✅ RAGAS industry-standard metrics")
print("  ✅ Custom performance metrics")
print("  ✅ LangSmith monitoring integration")
print("  ✅ Detailed quality assessment")
print("\n🌟 Ready for production deployment and certification! 🌟")
print("=" * 80)



🎉 COMPREHENSIVE EVALUATION SUMMARY
📝 Sample Evaluation Results:

🔍 Sample 1:
Question: What Unbranded movie about?
Answer: "Unbranded" is a documentary directed by Phillip Baribeau that was released on September 25, 2015. The film follows a group of young men who embark on an adventurous journey across the American West, ...
Ground Truth: Unbranded is a documentary directed by Phillip Baribeau that has moments of spectacle and danger, bu...
Context Sources: 5 chunks
----------------------------------------

🔍 Sample 2:
Question: What was the sentiment of the review for the movie 'Violet' published in the Los Angeles Times?
Answer: The review of "Violet" published in the Los Angeles Times, written by Noel Murray, conveys a distinctly negative sentiment towards the film. Murray criticizes director Bas Devos for what he perceives ...
Ground Truth: The sentiment of the review for the movie 'Violet' published in the Los Angeles Times was negative, ...
Context Sources: 5 chunks
------------

In [None]:
# Optional: Manual RAGAS Score Extraction (if automatic parsing failed)
print("🔧 Manual RAGAS Score Extraction")
print("=" * 40)

if 'ragas_result' in locals() and ragas_result is not None:
    print("📊 Available methods and attributes:")
    for attr in dir(ragas_result):
        if not attr.startswith('_'):
            try:
                value = getattr(ragas_result, attr)
                if callable(value):
                    print(f"  📋 Method: {attr}()")
                elif isinstance(value, (int, float)):
                    print(f"  📈 Score: {attr} = {value:.4f}")
                elif isinstance(value, dict):
                    print(f"  📚 Dict: {attr} = {value}")
                elif hasattr(value, '__len__') and len(str(value)) < 100:
                    print(f"  📝 Attr: {attr} = {value}")
            except:
                pass
    
    # Try some common score extraction methods
    print("\n🎯 Trying common score extraction methods:")
    
    # Method 1: Direct score access
    try:
        if hasattr(ragas_result, 'scores'):
            scores = ragas_result.scores
            print(f"✅ Found scores attribute: {scores}")
    except Exception as e:
        print(f"❌ No scores attribute: {e}")
    
    # Method 2: Convert to dict
    try:
        result_dict = dict(ragas_result)
        print(f"✅ Converted to dict: {result_dict}")
    except Exception as e:
        print(f"❌ Cannot convert to dict: {e}")
    
    # Method 3: Check if it's iterable
    try:
        items = list(ragas_result)
        print(f"✅ Iterable items: {items}")
    except Exception as e:
        print(f"❌ Not iterable: {e}")
    
    # Method 4: Try accessing individual metric names
    metric_names = ['answer_relevancy', 'faithfulness', 'context_precision', 'context_recall', 'answer_correctness']
    print(f"\n🔍 Checking individual metrics:")
    for metric_name in metric_names:
        for variation in [metric_name, metric_name.replace('_', ''), metric_name.lower()]:
            if hasattr(ragas_result, variation):
                try:
                    score = getattr(ragas_result, variation)
                    print(f"✅ {metric_name}: {score}")
                    break
                except:
                    pass
        else:
            print(f"❌ {metric_name}: Not found")

else:
    print("⚠️ No ragas_result available - run the RAGAS evaluation first")

print("\n💡 If scores are still not visible, check the LangSmith traces for evaluation metrics!")
