## Movie Reviews Retriever Evaluation with Enhanced RAG Chain

This notebook evaluates different retriever methods using the complete enhanced RAG chain from the movie_reviews_rag_system.ipynb, allowing us to test how different retrievers perform with the full agentic pipeline.

### Objectives:
1. Set up the complete enhanced RAG chain with multi-tool agents
2. Create a "golden dataset" using RAGAS Synthetic Data Generation from movie reviews
3. Evaluate 6 different retrievers within the enhanced agent pipeline
4. Compare performance, cost, and latency of the complete system
5. Provide recommendations for movie review retrieval in production

### Data Sources:
- **Rotten Tomatoes Movies**: Movie metadata with titles, ratings, genres, directors, runtime, release dates
- **Rotten Tomatoes Reviews**: Professional critic reviews with scores, sentiment, publications, and detailed text

### Retrievers to Evaluate:
- Naive Retrieval (Embedding-based)
- BM25 Retriever
- Multi-Query Retriever
- Parent-Document Retriever
- Contextual Compression (Reranking)
- Ensemble Retriever

### Enhanced RAG Features:
- Multi-tool agent with specialized movie analysis tools
- External search capabilities
- LangSmith tracing and monitoring
- Advanced analytics and trend analysis


## Step 1: Setup and Dependencies


In [15]:
import os
import time
import pandas as pd
from datetime import datetime
import getpass
import warnings
import numpy as np
from typing import List, Dict, Any, Optional
import asyncio
import nest_asyncio
from uuid import uuid4
warnings.filterwarnings('ignore')

# Apply nest_asyncio for Jupyter compatibility
nest_asyncio.apply()

print("🔑 Setting up API Keys")
print("=" * 40)

# OpenAI API Key (required)
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("🤖 Enter your OpenAI API Key: ")
    print("✅ OpenAI API key set")
else:
    print("✅ OpenAI API key already set")

# Cohere API Key (required for reranking)
if not os.getenv("COHERE_API_KEY"):
    os.environ["COHERE_API_KEY"] = getpass.getpass("🔄 Enter your Cohere API Key: ")
    print("✅ Cohere API key set")
else:
    print("✅ Cohere API key already set")




# Serapi API Key (recommended for external search)
if not os.getenv("SERPAPI_API_KEY"):
    serapi_key = getpass.getpass("🔍 Enter your Serapi API Key (or press Enter to skip): ")
    if serapi_key.strip():
        os.environ["SERPAPI_API_KEY"] = serapi_key
        print("✅ SerpAPI API key set")
    else:
        print("⚠️ SerpAPI API key skipped - external search will be limited")
else:
    print("✅ SerpAPI API key already set")



# Tavily API Key (recommended for external search)
if not os.getenv("TAVILY_API_KEY"):
    tavily_key = getpass.getpass("🔍 Enter your Tavily API Key (or press Enter to skip): ")
    if tavily_key.strip():
        os.environ["TAVILY_API_KEY"] = tavily_key
        print("✅ Tavily API key set")
    else:
        print("⚠️ Tavily API key skipped - external search will be limited")
else:
    print("✅ Tavily API key already set")

# LangSmith API Key (optional for monitoring)
if not os.getenv("LANGSMITH_API_KEY"):
    langsmith_key = getpass.getpass("📊 Enter your LangSmith API Key (or press Enter to skip): ")
    if langsmith_key.strip():
        os.environ["LANGSMITH_API_KEY"] = langsmith_key
        os.environ["LANGSMITH_TRACING"] = "true"
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
        os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
        print("✅ LangSmith API key set and tracing enabled")
    else:
        os.environ["LANGSMITH_TRACING"] = "false"
        os.environ["LANGCHAIN_TRACING_V2"] = "false"
        print("⚠️ LangSmith skipped - no monitoring/tracing")
else:
    print("✅ LangSmith API key already set")

print("\n🎯 API Key Setup Complete!")
print("✅ Ready for enhanced movie review retriever evaluation!")


🔑 Setting up API Keys
✅ OpenAI API key already set
✅ Cohere API key already set
✅ SerpAPI API key already set
✅ Tavily API key already set
✅ LangSmith API key already set

🎯 API Key Setup Complete!
✅ Ready for enhanced movie review retriever evaluation!


## Step 2: Load and Prepare Movie Reviews Data


In [16]:
# Load the Rotten Tomatoes datasets
print("🍅 Loading Rotten Tomatoes movie review datasets...")

# Robust CSV loading function with error handling
def load_csv_robust(filepath):
    """Load CSV with robust error handling for malformed data"""
    encodings = ['utf-8', 'latin1', 'cp1252', 'iso-8859-1']
    
    for encoding in encodings:
        try:
            print(f"  Trying encoding: {encoding}")
            # Try with error handling for malformed lines
            df = pd.read_csv(
                filepath, 
                encoding=encoding,
                on_bad_lines='skip',  # Skip bad lines instead of failing
                engine='python',      # Use Python engine for better error handling
                quoting=1,           # Quote all fields
                skipinitialspace=True
            )
            print(f"  ✅ Success with {encoding}")
            return df
        except UnicodeDecodeError:
            print(f"  ❌ Failed with {encoding}")
            continue
        except Exception as e:
            print(f"  ❌ Failed with {encoding}: {str(e)}")
            continue
    
    # If all encodings fail, try with minimal options
    print("  Trying with basic fallback...")
    try:
        df = pd.read_csv(filepath, encoding='latin1', on_bad_lines='skip', engine='python')
        print("  ✅ Success with fallback method")
        return df
    except Exception as e:
        raise ValueError(f"Could not read {filepath}: {str(e)}")

# Load Rotten Tomatoes movies metadata
print("Loading Rotten Tomatoes movies metadata...")
movies_df = load_csv_robust("data/rotten_tomatoes_movies.csv")
print(f"Movies dataset: {len(movies_df)} movies")
print(f"Columns: {list(movies_df.columns)}")

# Load Rotten Tomatoes reviews
print("\nLoading Rotten Tomatoes reviews...")
reviews_df = load_csv_robust("data/rotten_tomatoes_movie_reviews.csv")
print(f"Reviews dataset: {len(reviews_df)} reviews")
print(f"Columns: {list(reviews_df.columns)}")

# Display sample data
print("\n🎬 Sample movies metadata:")
print(movies_df.head(3))

print("\n📝 Sample reviews:")
print(reviews_df.head(3))

# Basic statistics
print(f"\n📊 Dataset Statistics:")
print(f"• Total movies: {len(movies_df):,}")
print(f"• Total reviews: {len(reviews_df):,}")
print(f"• Average reviews per movie: {len(reviews_df)/len(movies_df):.1f}")
print(f"• Unique movie IDs in reviews: {reviews_df['id'].nunique():,}")
print(f"• Movies with reviews: {reviews_df['id'].nunique():,} / {len(movies_df):,}")

# Check review distribution
print(f"\n🏆 Review State Distribution:")
if 'reviewState' in reviews_df.columns:
    print(reviews_df['reviewState'].value_counts())

print(f"\n⭐ Score Sentiment Distribution:")
if 'scoreSentiment' in reviews_df.columns:
    print(reviews_df['scoreSentiment'].value_counts())


🍅 Loading Rotten Tomatoes movie review datasets...
Loading Rotten Tomatoes movies metadata...
  Trying encoding: utf-8
  ✅ Success with utf-8
Movies dataset: 143258 movies
Columns: ['id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix']

Loading Rotten Tomatoes reviews...
  Trying encoding: utf-8
  ✅ Success with utf-8
Reviews dataset: 1444963 reviews
Columns: ['id', 'reviewId', 'creationDate', 'criticName', 'isTopCritic', 'originalScore', 'reviewState', 'publicatioName', 'reviewText', 'scoreSentiment', 'reviewUrl']

🎬 Sample movies metadata:
                   id                title  audienceScore  tomatoMeter rating  \
0  space-zombie-bingo  Space Zombie Bingo!           50.0          NaN    NaN   
1     the_green_grass      The Green Grass            NaN          NaN    NaN   
2           love_lies           

In [17]:
# Data cleaning and preprocessing for Rotten Tomatoes data
def clean_text(text):
    """Clean and normalize text data"""
    if pd.isna(text):
        return ""
    
    # Convert to string and clean
    text = str(text).strip()
    
    # Remove special characters and normalize
    import re
    text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)  # Remove control characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    
    return text.strip()

# Clean movies data
movies_df['title_clean'] = movies_df['title'].apply(clean_text)
movies_df['genre_clean'] = movies_df['genre'].apply(clean_text)
movies_df['director_clean'] = movies_df['director'].apply(clean_text)

# Clean reviews data
reviews_df['reviewText_clean'] = reviews_df['reviewText'].apply(clean_text)
reviews_df['criticName_clean'] = reviews_df['criticName'].apply(clean_text)
reviews_df['publicatioName_clean'] = reviews_df['publicatioName'].apply(clean_text)

# Remove rows with empty review text
reviews_df = reviews_df[reviews_df['reviewText_clean'].str.len() > 20].copy()

# Merge movies and reviews for complete information
print("🔗 Merging movies metadata with reviews...")
merged_df = reviews_df.merge(movies_df, on='id', how='left')

print(f"✅ Cleaned reviews dataset: {len(reviews_df)} reviews")
print(f"✅ Merged dataset: {len(merged_df)} reviews with movie metadata")
print(f"✅ Reviews with movie titles: {merged_df['title'].notna().sum()} / {len(merged_df)}")

# Handle missing titles
missing_titles = merged_df['title'].isna().sum()
if missing_titles > 0:
    print(f"⚠️ {missing_titles} reviews missing movie titles (will use movie ID)")
    merged_df['title_clean'] = merged_df['title_clean'].fillna(merged_df['id'])

print(f"\n🎬 Sample merged data:")
sample_cols = ['title_clean', 'criticName_clean', 'reviewText_clean', 'rating', 'genre_clean']
available_cols = [col for col in sample_cols if col in merged_df.columns]
print(merged_df[available_cols].head(2))


🔗 Merging movies metadata with reviews...
✅ Cleaned reviews dataset: 1364909 reviews
✅ Merged dataset: 1388546 reviews with movie metadata
✅ Reviews with movie titles: 1383051 / 1388546
⚠️ 5495 reviews missing movie titles (will use movie ID)

🎬 Sample merged data:
  title_clean criticName_clean  \
0     Beavers  Ivan M. Lincoln   
1  Blood Mask    The Foywonder   

                                    reviewText_clean rating  genre_clean  
0  Timed to be just long enough for most youngste...    NaN  Documentary  
1  It doesn't matter if a movie costs 300 million...    NaN               


In [18]:
# Helper function to filter movies by review count
def filter_movies_by_review_count(df, min_reviews=5):
    """Filter movies to only include those with at least min_reviews reviews"""
    print(f"🔍 Filtering movies with at least {min_reviews} reviews...")
    
    # Count reviews per movie
    review_counts = df.groupby('id').size().reset_index(name='review_count')
    
    # Filter movies that meet the minimum review count
    qualified_movies = review_counts[review_counts['review_count'] >= min_reviews]
    
    # Filter the original dataframe to only include qualified movies
    filtered_df = df[df['id'].isin(qualified_movies['id'])]
    
    print(f"✅ Filtered dataset: {len(filtered_df)} reviews from {len(qualified_movies)} movies (min {min_reviews} reviews each)")
    print(f"📊 Review count distribution:")
    review_distribution = qualified_movies['review_count'].describe()
    for stat, value in review_distribution.items():
        print(f"   {stat}: {value:.1f}")
    
    return filtered_df, qualified_movies

# Create unified data structure for processing Rotten Tomatoes data (now groups by movie)
def create_review_documents(df, max_movies=100, min_reviews=5):
    """Convert merged DataFrame to list of movie documents (grouped by movie)"""
    documents = []
    
    # First filter by review count
    filtered_df, qualified_movies = filter_movies_by_review_count(df, min_reviews)
    
    # Group by movie ID to collect all reviews for each movie
    print("🎬 Grouping reviews by movie...")
    movie_groups = filtered_df.groupby('id')
    
    # Sort movies by review count (descending) to prioritize movies with more reviews
    movie_review_counts = qualified_movies.sort_values('review_count', ascending=False)
    
    # Limit the number of movies for performance
    movie_count = 0
    processed_movies = 0
    
    for movie_id in movie_review_counts['id']:
        if movie_count >= max_movies:
            break
            
        movie_reviews = movie_groups.get_group(movie_id)
        processed_movies += 1
        
        # Get the first row for movie metadata (all rows have same movie info)
        first_review = movie_reviews.iloc[0]
        
        # Create comprehensive metadata for the movie
        metadata = {
            'source': 'rotten_tomatoes',
            'movie_id': movie_id,
            'movie_title': first_review.get('title_clean', str(movie_id)),
            'genre': first_review.get('genre_clean', ''),
            'director': first_review.get('director_clean', ''),
            'rating': first_review.get('rating', ''),
            'audience_score': first_review.get('audienceScore', ''),
            'tomato_meter': first_review.get('tomatoMeter', ''),
            'release_date': first_review.get('releaseDateTheaters', ''),
            'runtime': first_review.get('runtimeMinutes', ''),
            'total_reviews': len(movie_reviews),
            'review_count': len(movie_reviews)
        }
        
        # Create rich content for embedding - start with movie info
        content = f"Movie: {first_review.get('title_clean', str(movie_id))}\\n"
        
        # Add movie metadata
        if first_review.get('genre_clean'):
            content += f"Genre: {first_review.get('genre_clean')}\\n"
        if first_review.get('director_clean'):
            content += f"Director: {first_review.get('director_clean')}\\n"
        if first_review.get('rating'):
            content += f"Rating: {first_review.get('rating')}\\n"
        if first_review.get('releaseDateTheaters'):
            content += f"Release Date: {first_review.get('releaseDateTheaters')}\\n"
        if first_review.get('audienceScore'):
            content += f"Audience Score: {first_review.get('audienceScore')}%\\n"
        if first_review.get('tomatoMeter'):
            content += f"Tomato Meter: {first_review.get('tomatoMeter')}%\\n"
        if first_review.get('runtimeMinutes'):
            content += f"Runtime: {first_review.get('runtimeMinutes')} minutes\\n"
        
        # Add review count
        content += f"Total Reviews: {len(movie_reviews)}\\n\\n"
        
        # Add all reviews for this movie (with character limits)
        content += "Reviews:\\n"
        review_count = 0
        total_chars = len(content)  # Start with the content we've already added
        max_chars_per_movie = 10000  # Limit total characters per movie (reduced from 20000)
        
        for idx, review in movie_reviews.iterrows():
            # Check if we're approaching the character limit
            if total_chars > max_chars_per_movie:
                remaining_reviews = len(movie_reviews) - review_count
                content += f"\\n... [Additional {remaining_reviews} reviews truncated due to character limit]\\n"
                break
                
            # Calculate how many characters this review will add
            critic_name = review.get('criticName_clean', 'Anonymous')
            publication = review.get('publicatioName_clean', 'Unknown')
            review_text = review.get('reviewText_clean', '')
            
            # Estimate characters this review will add
            review_header = f"\\n--- Review {idx + 1} ---\\n"
            critic_line = f"Critic: {critic_name}"
            if publication != 'Unknown':
                critic_line += f" ({publication})"
            critic_line += "\\n"
            
            score_lines = ""
            if review.get('originalScore'):
                score_lines += f"Score: {review.get('originalScore')}\\n"
            if review.get('scoreSentiment'):
                score_lines += f"Sentiment: {review.get('scoreSentiment')}\\n"
            if review.get('reviewState'):
                score_lines += f"Status: {review.get('reviewState')}\\n"
            
            # Truncate review text if needed (reduced from 500 to 250 chars)
            if review_text and len(review_text) > 250:
                review_text = review_text[:250] + "..."
            
            review_content = f"Review: {review_text}\\n" if review_text else ""
            
            # Calculate total characters for this review
            review_chars = len(review_header) + len(critic_line) + len(score_lines) + len(review_content)
            
            # Check if adding this review would exceed the limit
            if total_chars + review_chars > max_chars_per_movie:
                remaining_reviews = len(movie_reviews) - review_count
                content += f"\\n... [Additional {remaining_reviews} reviews truncated due to character limit]\\n"
                break
            
            # Add the review content
            content += review_header
            content += critic_line
            content += score_lines
            content += review_content
            
            review_count += 1
            total_chars += review_chars
        
        documents.append({
            'content': content,
            'metadata': metadata
        })
        
        movie_count += 1
    
    print(f"✅ Created {len(documents)} movie documents from {movie_count} movies")
    return documents

# Create documents from merged Rotten Tomatoes data
print("🍅 Creating movie documents from Rotten Tomatoes data...")
all_documents = create_review_documents(merged_df, max_movies=100, min_reviews=5)

print(f"✅ Created {len(all_documents)} total review documents")
print(f"   - Source: Rotten Tomatoes")
print(f"   - Reviews with movie metadata included")

# Show sample document
print("\\n📄 Sample document:")
print(all_documents[0]['content'][:300] + "...")

# Show metadata sample
print("\\n🏷️ Sample metadata:")
sample_metadata = all_documents[0]['metadata']
for key, value in list(sample_metadata.items())[:8]:  # Show first 8 metadata fields
    print(f"  {key}: {value}")

# Basic statistics
print(f"\\n📊 Document Statistics:")
unique_movies = len(set([doc['metadata']['movie_title'] for doc in all_documents]))
#unique_critics = len(set([doc['metadata']['criticName'] for doc in all_documents]))
print(f"• Unique movies: {unique_movies}")
#print(f"• Unique critics: {unique_critics}")
print(f"• Average content length: {np.mean([len(doc['content']) for doc in all_documents]):.0f} characters")


🍅 Creating movie documents from Rotten Tomatoes data...
🔍 Filtering movies with at least 5 reviews...
✅ Filtered dataset: 1324508 reviews from 31456 movies (min 5 reviews each)
📊 Review count distribution:
   count: 31456.0
   mean: 42.1
   std: 65.0
   min: 5.0
   25%: 8.0
   50%: 16.0
   75%: 45.0
   max: 1896.0
🎬 Grouping reviews by movie...
✅ Created 100 movie documents from 100 movies
✅ Created 100 total review documents
   - Source: Rotten Tomatoes
   - Reviews with movie metadata included
\n📄 Sample document:
Movie: Parasite\nGenre: Comedy, Mystery & thriller, Drama\nDirector: Bong Joon Ho\nRating: R\nRelease Date: 2019-11-01\nAudience Score: 90.0%\nTomato Meter: 99.0%\nRuntime: 132.0 minutes\nTotal Reviews: 1896\n\nReviews:\n\n--- Review 492485 ---\nCritic: Patricia Puentes (CNET)\nScore: nan\nSentiment...
\n🏷️ Sample metadata:
  source: rotten_tomatoes
  movie_id: parasite_2019
  movie_title: Parasite
  genre: Comedy, Mystery & thriller, Drama
  director: Bong Joon Ho
  rating

## Step 3: Setup Basic RAG Components


In [19]:
import tiktoken
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict

# Token counting function
def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(text)
    return len(tokens)

# Convert our documents to LangChain Document format - each movie is already a chunk
print("🔪 Using each movie as a separate chunk...")
chunks = []
for doc in all_documents:
    langchain_doc = Document(
        page_content=doc['content'],
        metadata=doc['metadata']
    )
    chunks.append(langchain_doc)

print(f"✅ Created {len(chunks)} chunks from {len(all_documents)} movies")
print("   Each movie is treated as a separate chunk containing all its reviews for better semantic coherence")

# Verify chunk sizes
chunk_lengths = [tiktoken_len(chunk.page_content) for chunk in chunks]
max_chunk_length = max(chunk_lengths)
avg_chunk_length = sum(chunk_lengths) / len(chunk_lengths)
print(f"📏 Maximum chunk length: {max_chunk_length} tokens")
print(f"📏 Average chunk length: {avg_chunk_length:.0f} tokens")

# Initialize embedding model
print("🧠 Initializing embedding model...")
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize chat model
chat_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

print("✅ Basic RAG components initialized!")


🔪 Using each movie as a separate chunk...
✅ Created 100 chunks from 100 movies
   Each movie is treated as a separate chunk containing all its reviews for better semantic coherence
📏 Maximum chunk length: 2841 tokens
📏 Average chunk length: 2679 tokens
🧠 Initializing embedding model...
✅ Basic RAG components initialized!


## Step 4: Setup Enhanced Agent Tools and External Search


In [20]:
# External search tools setup
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.utilities import SerpAPIWrapper
from langchain.tools import Tool
from langchain_core.tools import tool
import json

# Setup external search tools
print("🔧 Setting up external search tools...")

# Option 1: Tavily Search (recommended - often has free tier)
try:
    tavily_search = TavilySearchResults(
        max_results=3,
        search_depth="basic",
        include_answer=True,
        include_raw_content=True
    )
    print("✅ Tavily search tool configured")
    has_tavily = True
except Exception as e:
    print(f"⚠️ Tavily not configured: {e}")
    has_tavily = False

# Option 2: SerpAPI (Google Search) - backup option
try:
    search = SerpAPIWrapper()
    serp_tool = Tool(
        name="google_search",
        description="Search Google for current information about movies, actors, reviews, or box office data",
        func=search.run,
    )
    print("✅ SerpAPI search tool configured")
    has_serp = True
except Exception as e:
    print(f"⚠️ SerpAPI not configured: {e}")
    has_serp = False

# Create a fallback search function if no external APIs are configured
def fallback_search(query: str) -> str:
    """Fallback search when no external APIs are available"""
    return f"External search not available. Query '{query}' would require external movie database access. Please configure Tavily API key (https://tavily.com/) or SerpAPI key (https://serpapi.com/) for enhanced search capabilities."

# Choose which search tool to use
if has_tavily:
    external_search_tool = tavily_search
    search_tool_name = "Tavily"
elif has_serp:
    external_search_tool = serp_tool
    search_tool_name = "SerpAPI"
else:
    external_search_tool = Tool(
        name="fallback_search",
        description="Fallback search tool when external APIs are not configured",
        func=fallback_search
    )
    search_tool_name = "Fallback"

print(f"🔍 Using {search_tool_name} for external search")

# Test the search tool
print(f"\\n🧪 Testing {search_tool_name} search...")
try:
    if has_tavily:
        test_result = external_search_tool.invoke({"query": "Inception movie reviews 2010"})
        print(f"✅ Search test successful: Found {len(test_result)} results")
    elif has_serp:
        test_result = external_search_tool.run("Inception movie reviews 2010")
        print(f"✅ Search test successful: {test_result[:100]}...")
    else:
        test_result = external_search_tool.run("Inception movie reviews 2010")
        print(f"⚠️ Using fallback search: {test_result}")
except Exception as e:
    print(f"❌ Search test failed: {e}")
    print("💡 You can continue without external search - the agent will use only embedded reviews")


🔧 Setting up external search tools...
✅ Tavily search tool configured
✅ SerpAPI search tool configured
🔍 Using Tavily for external search
\n🧪 Testing Tavily search...
✅ Search test successful: Found 3 results


In [21]:
# Create specialized tools for Rotten Tomatoes movie analysis
print("🛠️ Creating specialized agent tools for Rotten Tomatoes data...")

# We'll create a placeholder retriever that will be swapped out during evaluation
# This will be replaced with different retrievers during testing
def get_base_retriever():
    """Get the base retriever - this will be swapped during evaluation"""
    # Create a default vectorstore for now
    vectorstore = Qdrant.from_documents(
        chunks,
        embedding_model,
        location=":memory:",
        collection_name="MovieReviews"
    )
    return vectorstore.as_retriever(search_kwargs={"k": 5})

# Initialize base retriever
base_retriever = get_base_retriever()

# Create basic RAG chain components for tools
HUMAN_TEMPLATE = """
You are a knowledgeable movie critic and analyst. You have access to a database of movie reviews from Rotten Tomatoes.

Use the provided context to answer the user's question about movies, reviews, ratings, and trends. Only use the information provided in the context. If the context doesn't contain relevant information to answer the question, respond with "I don't have enough information to answer that question based on the available reviews."

When analyzing reviews, consider:
- Professional critic perspectives and ratings
- Review states (fresh/rotten) and sentiment analysis  
- Tomatometer and audience score patterns
- Publication sources and critic credentials
- Movie metadata (genre, director, release date, runtime)

CONTEXT:
{context}

QUESTION:
{question}

Provide a comprehensive and insightful answer based on the available review data.
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("human", HUMAN_TEMPLATE)
])

# Define state structure for enhanced RAG
class State(TypedDict):
    question: str
    context: List[Document]
    response: str

# Tool 1: Movie Review Search (our main RAG)
@tool
def search_movie_reviews(query: str) -> str:
    """
    Search through embedded movie reviews from Rotten Tomatoes.
    Use this for questions about specific movies, ratings, or review content.
    """
    try:
        # Use the current base_retriever (will be swapped during evaluation)
        retrieved_docs = base_retriever.get_relevant_documents(query)
        
        # Generate response using RAG
        generator_chain = chat_prompt | chat_model | StrOutputParser()
        response = generator_chain.invoke({
            "question": query, 
            "context": retrieved_docs
        })
        return response
    except Exception as e:
        return f"Error searching reviews: {str(e)}"

print("✅ Created search_movie_reviews tool")


🛠️ Creating specialized agent tools for Rotten Tomatoes data...
✅ Created search_movie_reviews tool


In [22]:
# Tool 2: Movie Statistics Analysis
@tool
def analyze_movie_statistics(movie_name: str = "") -> str:
    """
    Analyze statistics for a specific movie or provide general Rotten Tomatoes dataset statistics.
    Returns ratings, review counts, critic information, and other numerical insights.
    """
    try:
        if movie_name:
            # Search for specific movie in the merged dataset
            movie_data = merged_df[
                merged_df['title_clean'].str.contains(movie_name, case=False, na=False)
            ]
            
            if movie_data.empty:
                return f"No statistics found for '{movie_name}' in the Rotten Tomatoes dataset."
            
            # Get movie information
            movie_info = movie_data.iloc[0]  # Get first match for movie metadata
            movie_reviews = movie_data  # All reviews for this movie
            
            stats = f"Statistics for '{movie_info.get('title_clean', movie_name)}':\\n"
            stats += f"═══════════════════════════════════\\n"
            
            # Movie metadata
            if movie_info.get('genre_clean'):
                stats += f"🎭 Genre: {movie_info['genre_clean']}\\n"
            if movie_info.get('director_clean'):
                stats += f"🎬 Director: {movie_info['director_clean']}\\n"
            if movie_info.get('rating'):
                stats += f"🏷️ Rating: {movie_info['rating']}\\n"
            if movie_info.get('runtimeMinutes'):
                stats += f"⏱️ Runtime: {movie_info['runtimeMinutes']} minutes\\n"
            if movie_info.get('releaseDateTheaters'):
                stats += f"📅 Release Date: {movie_info['releaseDateTheaters']}\\n"
            
            # Scores
            if pd.notna(movie_info.get('audienceScore')):
                stats += f"👥 Audience Score: {movie_info['audienceScore']}%\\n"
            if pd.notna(movie_info.get('tomatoMeter')):
                stats += f"🍅 Tomatometer: {movie_info['tomatoMeter']}%\\n"
            
            # Review statistics
            stats += f"\\n📊 Review Analysis:\\n"
            stats += f"• Total Reviews: {len(movie_reviews)}\\n"
            
            # Review state distribution
            if 'reviewState' in movie_reviews.columns:
                review_states = movie_reviews['reviewState'].value_counts()
                for state, count in review_states.items():
                    stats += f"• {state.title()}: {count} reviews\\n"
            
            # Sentiment distribution
            if 'scoreSentiment' in movie_reviews.columns:
                sentiments = movie_reviews['scoreSentiment'].value_counts()
                stats += f"\\n🎭 Sentiment Breakdown:\\n"
                for sentiment, count in sentiments.items():
                    stats += f"• {sentiment}: {count} reviews\\n"
            
            # Top critics
            top_critics = movie_reviews[movie_reviews['isTopCritic'] == True]
            if len(top_critics) > 0:
                stats += f"• Top Critics: {len(top_critics)} reviews\\n"
            
            return stats
        else:
            # General dataset statistics
            stats = f"🍅 Rotten Tomatoes Dataset Statistics:\\n"
            stats += f"═══════════════════════════════════\\n"
            stats += f"📊 Overview:\\n"
            stats += f"• Total Movies: {len(movies_df):,}\\n"
            stats += f"• Total Reviews: {len(reviews_df):,}\\n"
            stats += f"• Reviews in Current Sample: {len(merged_df):,}\\n"
            stats += f"• Average Reviews per Movie: {len(reviews_df)/len(movies_df):.1f}\\n"
            
            # Genre distribution (top 5)
            if 'genre_clean' in merged_df.columns:
                top_genres = merged_df['genre_clean'].value_counts().head(5)
                stats += f"\\n🎭 Top Genres:\\n"
                for genre, count in top_genres.items():
                    if pd.notna(genre):
                        stats += f"• {genre}: {count} reviews\\n"
            
            # Review state distribution
            if 'reviewState' in merged_df.columns:
                review_states = merged_df['reviewState'].value_counts()
                stats += f"\\n🏆 Review States:\\n"
                for state, count in review_states.items():
                    stats += f"• {state}: {count} reviews\\n"
            
            # Top critics
            top_critics_count = merged_df[merged_df['isTopCritic'] == True]
            stats += f"\\n⭐ Critics:\\n"
            stats += f"• Top Critics: {len(top_critics_count):,} reviews\\n"
            stats += f"• Regular Critics: {len(merged_df) - len(top_critics_count):,} reviews\\n"
            
            return stats
            
    except Exception as e:
        return f"Error analyzing statistics: {str(e)}"

print("✅ Created analyze_movie_statistics tool")


✅ Created analyze_movie_statistics tool


In [23]:
# Tool 3: Rating and Review Analysis Tool
@tool
def analyze_movie_ratings(movie_name: str) -> str:
    """
    Analyze ratings and review sentiment for a specific movie from Rotten Tomatoes.
    Shows audience score, tomatometer, critic consensus, and sentiment analysis.
    """
    try:
        movie_data = merged_df[
            merged_df['title_clean'].str.contains(movie_name, case=False, na=False)
        ]
        
        if movie_data.empty:
            return f"No rating data found for '{movie_name}' in Rotten Tomatoes dataset."
        
        movie_info = movie_data.iloc[0]
        
        analysis = f"🍅 Rotten Tomatoes Analysis for '{movie_info.get('title_clean', movie_name)}':\\n"
        analysis += f"═══════════════════════════════════\\n"
        
        # Official scores
        if pd.notna(movie_info.get('tomatoMeter')):
            analysis += f"🍅 Tomatometer: {movie_info['tomatoMeter']}% (Critics)\\n"
        if pd.notna(movie_info.get('audienceScore')):
            analysis += f"🍿 Audience Score: {movie_info['audienceScore']}%\\n"
        
        # Review breakdown
        fresh_reviews = movie_data[movie_data['reviewState'] == 'fresh']
        rotten_reviews = movie_data[movie_data['reviewState'] == 'rotten']
        
        analysis += f"\\n📊 Review Breakdown:\\n"
        analysis += f"• Fresh Reviews: {len(fresh_reviews)}\\n"
        analysis += f"• Rotten Reviews: {len(rotten_reviews)}\\n"
        if len(movie_data) > 0:
            fresh_percentage = (len(fresh_reviews) / len(movie_data)) * 100
            analysis += f"• Fresh Percentage: {fresh_percentage:.1f}%\\n"
        
        # Sentiment analysis
        positive_reviews = movie_data[movie_data['scoreSentiment'] == 'POSITIVE']
        negative_reviews = movie_data[movie_data['scoreSentiment'] == 'NEGATIVE']
        
        analysis += f"\\n🎭 Sentiment Analysis:\\n"
        analysis += f"• Positive: {len(positive_reviews)} reviews\\n"
        analysis += f"• Negative: {len(negative_reviews)} reviews\\n"
        
        # Top critics vs regular critics
        top_critic_reviews = movie_data[movie_data['isTopCritic'] == True]
        regular_reviews = movie_data[movie_data['isTopCritic'] == False]
        
        analysis += f"\\n⭐ Critic Breakdown:\\n"
        analysis += f"• Top Critics: {len(top_critic_reviews)} reviews\\n"
        analysis += f"• Regular Critics: {len(regular_reviews)} reviews\\n"
        
        return analysis
        
    except Exception as e:
        return f"Error analyzing ratings: {str(e)}"

# Tool 4: External Movie Search (when local data is insufficient)
@tool
def search_external_movie_info(query: str) -> str:
    """
    Search external sites (IMDb, Metacritic, Letterboxd, Rotten Tomatoes, etc.)
    for reviews, ratings, or recent news about a movie.
    """
    try:
        # Build a richer multi-site query
        review_sites = [
            "Rotten Tomatoes", "IMDb", "Metacritic", "Letterboxd",
            "Roger Ebert", "The Guardian film review"
        ]
        joined_sites = " OR ".join(f'"{site}"' for site in review_sites)
        search_string = f'movie {query} reviews ratings {joined_sites}'

        # Dispatch to whichever external search tool you have
        if has_tavily:
            result = external_search_tool.invoke({"query": search_string})
            # Tavily returns a list of dicts → format the first three nicely
            snippets = []
            for item in result[:3]:
                if isinstance(item, dict):
                    url     = item.get("url", "")
                    content = (item.get("content", "") or "").strip()
                    snippets.append(f"Source: {url}\\n{content[:200]}…")
            return "\\n\\n".join(snippets) if snippets else "No results found."
        
        elif has_serp:
            raw = external_search_tool.run(search_string)
            return raw[:500] + "…" if len(raw) > 500 else raw
        
        else:  # generic `.run` fallback
            return external_search_tool.run(search_string)

    except Exception as e:
        return f"External search error: {e}"

# Create the agent's toolbox for Rotten Tomatoes analysis
agent_tools = [
    search_movie_reviews,
    analyze_movie_statistics, 
    analyze_movie_ratings,
    search_external_movie_info
]

print(f"✅ Created {len(agent_tools)} specialized tools for Rotten Tomatoes:")
for tool in agent_tools:
    print(f"  - {tool.name}: {tool.description}")

print("\\n✅ All Rotten Tomatoes agent tools ready!")


✅ Created 4 specialized tools for Rotten Tomatoes:
  - search_movie_reviews: Search through embedded movie reviews from Rotten Tomatoes.
Use this for questions about specific movies, ratings, or review content.
  - analyze_movie_statistics: Analyze statistics for a specific movie or provide general Rotten Tomatoes dataset statistics.
Returns ratings, review counts, critic information, and other numerical insights.
  - analyze_movie_ratings: Analyze ratings and review sentiment for a specific movie from Rotten Tomatoes.
Shows audience score, tomatometer, critic consensus, and sentiment analysis.
  - search_external_movie_info: Search external sites (IMDb, Metacritic, Letterboxd, Rotten Tomatoes, etc.)
for reviews, ratings, or recent news about a movie.
\n✅ All Rotten Tomatoes agent tools ready!


## Step 5: Setup Enhanced Agent with Tool Selection


In [24]:
# Enhanced Agent State with Tool Selection
from typing_extensions import TypedDict, Annotated
from langgraph.graph.message import add_messages
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    question: str
    tool_calls: list
    final_answer: str

# Enhanced Agent with tool selection capabilities
print("🤖 Building enhanced agent with tool selection...")

# Create the agent prompt
AGENT_PROMPT = """You are an intelligent movie analysis agent with access to multiple specialized tools.

Your tools:
1. search_movie_reviews: Search embedded movie reviews from Rotten Tomatoes
2. analyze_movie_statistics: Get numerical statistics about movies and datasets  
3. analyze_movie_ratings: Analyze ratings and review sentiment for specific movies
4. search_external_movie_info: Search external sources when local data is insufficient

Guidelines:
- Start with local review data (search_movie_reviews) for most questions
- Use statistics tools for numerical analysis
- Use rating analysis for detailed movie performance breakdown
- Only use external search when local data is clearly insufficient
- Always explain your reasoning and cite sources
- Provide comprehensive, insightful answers

Current question: {question}
"""

# Create enhanced chat model with tool binding
agent_model = ChatOpenAI(
    model="gpt-4o-mini", 
    temperature=0.1,
    max_tokens=1000
).bind_tools(agent_tools)

def agent_reasoning_node(state: AgentState) -> AgentState:
    """Agent reasoning and tool selection"""
    question = state["question"]
    messages = state.get("messages", [])
    
    # Create the prompt with current question
    prompt_message = HumanMessage(content=AGENT_PROMPT.format(question=question))
    
    # Get agent response with potential tool calls
    response = agent_model.invoke([prompt_message] + messages)
    
    return {
        "messages": [response],
        "tool_calls": response.tool_calls if hasattr(response, 'tool_calls') and response.tool_calls else []
    }

def tool_execution_node(state: AgentState) -> AgentState:
    """Execute selected tools"""
    tool_calls = state.get("tool_calls", [])
    messages = []
    
    for tool_call in tool_calls:
        tool_name = tool_call["name"]
        tool_args = tool_call["args"]
        
        # Find and execute the tool
        for tool in agent_tools:
            if tool.name == tool_name:
                try:
                    result = tool.invoke(tool_args)
                    # Create tool message
                    tool_message = ToolMessage(
                        content=str(result),
                        tool_call_id=tool_call["id"]
                    )
                    messages.append(tool_message)
                except Exception as e:
                    error_message = ToolMessage(
                        content=f"Error executing {tool_name}: {str(e)}",
                        tool_call_id=tool_call["id"]
                    )
                    messages.append(error_message)
                break
    
    return {"messages": messages}

def final_response_node(state: AgentState) -> AgentState:
    """Generate final response based on tool results"""
    messages = state["messages"]
    question = state["question"]
    
    # Create final prompt
    final_prompt = f"""
    Based on the tool results above, provide a comprehensive answer to the question: {question}
    
    Make sure to:
    - Synthesize information from multiple sources
    - Cite specific data points and sources
    - Provide insights beyond just raw data
    - Be conversational but informative
    """
    
    final_response = chat_model.invoke(messages + [HumanMessage(content=final_prompt)])
    
    return {
        "final_answer": final_response.content,
        "messages": [final_response]
    }

print("✅ Enhanced agent nodes defined!")


🤖 Building enhanced agent with tool selection...
✅ Enhanced agent nodes defined!


In [25]:
# Build the enhanced agent graph
print("🔗 Building agent workflow...")

from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode

# Create agent graph
agent_graph = StateGraph(AgentState)

# Add nodes
agent_graph.add_node("agent", agent_reasoning_node)
agent_graph.add_node("tools", ToolNode(agent_tools))
agent_graph.add_node("final_response", final_response_node)

# Add edges
agent_graph.add_edge(START, "agent")

# Conditional edge: if agent makes tool calls, go to tools; otherwise go to final response
def should_continue(state: AgentState) -> str:
    tool_calls = state.get("tool_calls", [])
    if tool_calls:
        return "tools"
    else:
        return "final_response"

agent_graph.add_conditional_edges("agent", should_continue)
agent_graph.add_edge("tools", "final_response")
agent_graph.add_edge("final_response", END)

# Compile the enhanced agent
enhanced_agent = agent_graph.compile()

print("✅ Enhanced agent with tool selection ready!")

# Generate unique project ID for this session
unique_id = uuid4().hex[:8]
project_name = f"Movie-Reviews-Enhanced-RAG-{unique_id}"

# Configure LangSmith if available
if os.getenv("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_PROJECT"] = project_name
    print(f"🎯 LangSmith project: {project_name}")

# Create a query function for the enhanced agent with tracing
def query_enhanced_agent_with_tracing(question: str, run_name: str = None) -> Dict[str, Any]:
    """Query the enhanced agent with LangSmith tracing"""
    
    # Generate run name if not provided
    if not run_name:
        run_name = f"movie_query_{int(time.time())}"
    
    # Add tags for better organization
    tags = ["movie-reviews", "rag-agent", "multi-tool"]
    
    try:
        # Execute with tracing metadata
        start_time = time.time()
        
        result = enhanced_agent.invoke(
            {
                "question": question,
                "messages": [],
                "tool_calls": [],
                "final_answer": ""
            },
            config={
                "tags": tags,
                "metadata": {
                    "query_type": "movie_analysis",
                    "session_id": unique_id,
                    "run_name": run_name
                }
            }
        )
        
        end_time = time.time()
        execution_time = end_time - start_time
        
        return {
            "question": question,
            "answer": result.get("final_answer", "No answer generated"),
            "tool_calls_made": len(result.get("tool_calls", [])),
            "execution_time": execution_time,
            "run_name": run_name,
            "success": True
        }
        
    except Exception as e:
        return {
            "question": question,
            "answer": f"Error: {str(e)}",
            "tool_calls_made": 0,
            "execution_time": 0,
            "run_name": run_name,
            "success": False
        }

print("🚀 Enhanced agent ready for complex movie analysis!")
print("✅ LangSmith tracing configured for evaluation!")


🔗 Building agent workflow...
✅ Enhanced agent with tool selection ready!
🎯 LangSmith project: Movie-Reviews-Enhanced-RAG-7de23b06
🚀 Enhanced agent ready for complex movie analysis!
✅ LangSmith tracing configured for evaluation!


## Step 6: Set Up Different Retrievers for Evaluation


In [26]:
# Import retriever components
from langchain_community.vectorstores import Qdrant
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ParentDocumentRetriever, EnsembleRetriever
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models

print("Setting up different retrievers for evaluation...")

# 1. Naive Retriever (Embedding-based)
def create_naive_retriever():
    vectorstore = Qdrant.from_documents(
        chunks,
        embedding_model,
        location=":memory:",
        collection_name="MovieReviews_Naive"
    )
    return vectorstore.as_retriever(search_kwargs={"k": 3})

naive_retriever = create_naive_retriever()
print("✅ 1. Naive retriever ready")

# 2. BM25 Retriever (Keyword-based)
def create_bm25_retriever():
    bm25 = BM25Retriever.from_documents(chunks)
    bm25.k = 3
    return bm25

bm25_retriever = create_bm25_retriever()
print("✅ 2. BM25 retriever ready")

# 3. Multi-Query Retriever
def create_multi_query_retriever():
    base_retriever = create_naive_retriever()
    return MultiQueryRetriever.from_llm(
        retriever=base_retriever, 
        llm=chat_model
    )

multi_query_retriever = create_multi_query_retriever()
print("✅ 3. Multi-query retriever ready")

# 4. Parent Document Retriever
def create_parent_document_retriever():
    # Create smaller chunks for parent document retrieval
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    
    # Create new QdrantClient and collection for parent docs
    client = QdrantClient(location=":memory:")
    client.create_collection(
        collection_name="movie_reviews_parent_docs",
        vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
    )
    
    parent_document_vectorstore = QdrantVectorStore(
        collection_name="movie_reviews_parent_docs", 
        embedding=embedding_model, 
        client=client
    )
    
    store = InMemoryStore()
    parent_retriever = ParentDocumentRetriever(
        vectorstore=parent_document_vectorstore,
        docstore=store,
        child_splitter=child_splitter,
    )
    
    parent_retriever.add_documents(chunks, ids=None)
    return parent_retriever

parent_document_retriever = create_parent_document_retriever()
print("✅ 4. Parent document retriever ready")

# 5. Contextual Compression Retriever (with Cohere reranking)
def create_compression_retriever():
    base_retriever = create_naive_retriever()
    compressor = CohereRerank(model="rerank-v3.5")
    return ContextualCompressionRetriever(
        base_compressor=compressor, 
        base_retriever=base_retriever
    )

compression_retriever = create_compression_retriever()
print("✅ 5. Contextual compression retriever ready")

# 6. Ensemble Retriever (combines multiple approaches)
def create_ensemble_retriever():
    # Use fresh instances to avoid conflicts
    naive = create_naive_retriever()
    bm25 = create_bm25_retriever()
    compression = create_compression_retriever()
    
    retrievers = [bm25, naive, compression]
    weights = [0.4, 0.4, 0.2]  # Slightly favor BM25 and naive
    
    return EnsembleRetriever(
        retrievers=retrievers, 
        weights=weights
    )

ensemble_retriever = create_ensemble_retriever()
print("✅ 6. Ensemble retriever ready")

# Store all retrievers for evaluation
retrievers_to_evaluate = [
    ("Naive", naive_retriever),
    ("BM25", bm25_retriever),
    ("Multi-Query", multi_query_retriever),
    ("Parent Document", parent_document_retriever),
    ("Contextual Compression", compression_retriever),
    ("Ensemble", ensemble_retriever)
]

print(f"\n✅ All {len(retrievers_to_evaluate)} retrievers ready for evaluation!")
for name, _ in retrievers_to_evaluate:
    print(f"  - {name}")


Setting up different retrievers for evaluation...
✅ 1. Naive retriever ready
✅ 2. BM25 retriever ready
✅ 3. Multi-query retriever ready
✅ 4. Parent document retriever ready
✅ 5. Contextual compression retriever ready
✅ 6. Ensemble retriever ready

✅ All 6 retrievers ready for evaluation!
  - Naive
  - BM25
  - Multi-Query
  - Parent Document
  - Contextual Compression
  - Ensemble


## Step 7: Create Golden Dataset using RAGAS


In [None]:
# # RAGAS setup for movie reviews
# from ragas.llms import LangchainLLMWrapper
# from ragas.embeddings import LangchainEmbeddingsWrapper
# from ragas.testset import TestsetGenerator

# print("🎯 Generating Golden Test Set for Movie Reviews using RAGAS...")
# print("=" * 70)

# # Initialize models for RAGAS
# generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0.7))
# generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# print("✅ RAGAS models initialized")

# # Prepare documents for test set generation
# print("📄 Preparing documents for test set generation...")

# # Use diverse subset of movie reviews for better question generation
# # Focus on diverse movies, critics, and sentiments
# testset_docs = chunks[:50]  # Use first 50 documents for cost efficiency

# print(f"   Selected {len(testset_docs)} review documents for question generation")

# # Create test set generator
# print("⚙️ Creating RAGAS test set generator...")
# generator = TestsetGenerator(
#     llm=generator_llm,
#     embedding_model=generator_embeddings
# )

# # Generate synthetic test set
# TESTSET_SIZE = 10  # Generate enough questions for robust evaluation
# print(f"🔬 Generating {TESTSET_SIZE} movie review questions (this may take a few minutes)...")

# try:
#     golden_dataset = generator.generate_with_langchain_docs(
#         documents=testset_docs,
#         testset_size=TESTSET_SIZE
#     )
    
#     print("✅ Synthetic test set generated successfully!")
    
#     # Convert to DataFrame and display
#     synthetic_df = golden_dataset.to_pandas()
#     print(f"\n📊 Generated {len(synthetic_df)} synthetic test cases")
    
#     # Show sample questions
#     print("\n📝 Generated Movie Review Questions:")
#     print("-" * 50)
    
#     # Find the correct question column
#     if 'question' in synthetic_df.columns:
#         question_col = 'question'
#     elif 'user_input' in synthetic_df.columns:
#         question_col = 'user_input'
#     else:
#         question_col = synthetic_df.columns[0]
#         print(f"Using column '{question_col}' as questions")
    
#     for i, row in synthetic_df.head(5).iterrows():
#         print(f"Q{i+1}: {row[question_col]}")
#         if 'reference' in row and pd.notna(row['reference']):
#             print(f"Expected: {row['reference'][:100]}...")
#         print("-" * 50)
    
#     # Store for evaluation
#     questions_for_evaluation = synthetic_df[question_col].tolist()
#     print(f"\n✅ Ready to evaluate {len(questions_for_evaluation)} questions with enhanced agent")
# finally:
#     print(f"\n✅ Golden test set ready with {len(questions_for_evaluation)} questions")
#     print("🎯 Ready for comprehensive retriever evaluation with enhanced agent!")


🎯 Generating Golden Test Set for Movie Reviews using RAGAS...
✅ RAGAS models initialized
📄 Preparing documents for test set generation...
   Selected 50 review documents for question generation
⚙️ Creating RAGAS test set generator...
🔬 Generating 10 movie review questions (this may take a few minutes)...


Applying HeadlinesExtractor:   0%|          | 0/50 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/50 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/50 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/244 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

✅ Synthetic test set generated successfully!

📊 Generated 12 synthetic test cases

📝 Generated Movie Review Questions:
--------------------------------------------------
Q1: Who is Patricia Puentes and what did she say about Parasite?
Expected: Patricia Puentes is a critic from CNET who provided a positive review of the film Parasite, stating ...
--------------------------------------------------
Q2: What insights does Catherine Springer from CathsFilmForum.com provide about the film Parasite?
Expected: Catherine Springer from CathsFilmForum.com describes Parasite as visually stunning, thematically res...
--------------------------------------------------
Q3: What are the sentiments expressed by Eric Webb in his reviews of the film?
Expected: Eric Webb, writing for the Austin American-Statesman, expressed a positive sentiment in his reviews ...
--------------------------------------------------
Q4: What did Mike Massie think about the disaster film 2012?
Expected: Mike Massie gave the 

In [29]:
# 🎯 IMPROVED RAGAS: Generating a general/comparative Golden Test Set
print("🎯 Generating Golden Test Set using RAGAS...")
print("=" * 70)

# -----------------------------
# Config knobs
# -----------------------------
MAX_TITLES            = 50       # widen unique title coverage
PER_TITLE_REVIEWS     = 20        # cap reviews per title to keep breadth
GEN_DOCS_LIMIT        = 50     # how many docs to feed into generator
TESTSET_SIZE          = 10       # number of questions to generate
RAND_SEED             = 42

# Guidance: general, movie-level questions (not single-review)
GENERATION_GUIDELINES = """
You are generating questions for evaluating a movie QA system.
The corpus consists of people's reviews of movies (subjective opinions from critics/audiences).

Generate questions that:
- Are GENERAL about movies or comparisons/similarities across movies, directors, genres, time periods, or sentiments.
- Do NOT ask about a single specific reviewer's wording, a single outlet, or a quote-level detail.
- Encourage retrieval across multiple documents (e.g., "Compare audience vs critic sentiment for X and Y", "Which genres show higher variance in sentiment?", "Do Nolan films receive more 'fresh' ratings than Villeneuve films?", etc.)
- Can be answered from aggregated patterns in reviews (scores, sentiments, themes), not from a single snippet.
- Ask about reccomendations based on movie x they like. i.e. "What movies are similar to The Dark Knight?"

Avoid:
- "What did [reviewer/outlet] say about <movie>?"
- "Quote the line where..."
- Any question that hinges on one review's phrasing.
"""

# -----------------------------
# Helper: prepare a larger, diverse document sample
# -----------------------------
MIN_TOKENS_PER_DOC = 120     # anything < 100 triggers RAGAS's error

def token_len(text: str) -> int:
    return len(text.split())

def prepare_documents_for_ragas() -> list:
    """
    Build Documents that are guaranteed to meet the min-token rule.
    Strategy:
      • Pick a broad set of review rows (as before)      → breadth
      • Concatenate rows from the same movie until >= N  → length
    """
    from langchain_core.documents import Document
    import random
    random.seed(RAND_SEED)

    # 1️⃣  Group rows by movie title
    by_title = {}
    for row in all_documents:
        title = row.get("metadata", {}).get("movie_title") or "UNKNOWN_TITLE"
        by_title.setdefault(title, []).append(row)

    # 2️⃣  Shuffle titles for randomness and pick a subset for breadth
    chosen_titles = random.sample(list(by_title), k=min(MAX_TITLES, len(by_title)))

    docs = []

    # 3️⃣  For each title, concatenate reviews until the merged block is long enough
    for title in chosen_titles:
        # Sort so we get a mix of sentiments & critics in the concat
        random.shuffle(by_title[title])

        buffer = []
        running_text = ""
        running_meta = {}

        for rev in by_title[title]:
            running_text += rev["content"].strip() + "\n\n"
            # Merge metadata (keep first non-None values)
            for k, v in rev.get("metadata", {}).items():
                running_meta.setdefault(k, v)

            if token_len(running_text) >= MIN_TOKENS_PER_DOC:
                # ✅ This block is long enough – push it as a Document
                docs.append(
                    Document(
                        page_content=running_text.strip(),
                        metadata=running_meta | {"merged_reviews": len(buffer) + 1}
                    )
                )
                # reset for next chunk of the same title
                buffer.clear()
                running_text = ""
                running_meta = {}

            else:
                buffer.append(rev)

        # If leftovers are still < MIN_TOKENS, append them to previous doc or skip
        if running_text:
            if docs and token_len(running_text) < MIN_TOKENS_PER_DOC:
                docs[-1].page_content += "\n\n" + running_text.strip()
                docs[-1].metadata["merged_reviews"] += len(buffer)
            else:
                docs.append(
                    Document(
                        page_content=running_text.strip(),
                        metadata=running_meta | {"merged_reviews": len(buffer)}
                    )
                )

        # Stop once we hit our global limit
        if len(docs) >= GEN_DOCS_LIMIT:
            break

    # 4️⃣  Clip to limit & prepend the generation-guidelines doc
    docs = [Document(GENERATION_GUIDELINES.strip(), metadata={"role": "generation_guidelines"})] + \
           docs[:GEN_DOCS_LIMIT]

    print(f"   ➜  {len(docs)-1} review docs ready (each ≥ {MIN_TOKENS_PER_DOC} tokens) "
          f"+ 1 guidelines doc")
    return docs


try:
    from ragas.llms import LangchainLLMWrapper
    from ragas.embeddings import LangchainEmbeddingsWrapper
    from ragas.testset import TestsetGenerator

    print("✅ RAGAS imports successful")

    # Prepare documents for synthetic generation
    print("📄 Preparing documents for test set generation...")
    rag_docs = prepare_documents_for_ragas()
    # Subtract 1 because the first doc is the guidelines doc
    print(f"   Selected {len(rag_docs)-1} review docs (+1 guidelines doc)")

    # Set up RAGAS generator models
    print("🤖 Setting up RAGAS generator models...")
    generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0.7))
    generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

    # Create test set generator
    print("⚙️ Creating RAGAS test set generator...")
    generator = TestsetGenerator(
        llm=generator_llm,
        embedding_model=generator_embeddings
    )

    # Generate synthetic test set
    print(f"🔬 Generating {TESTSET_SIZE} general/comparative questions "
          f"from {min(len(rag_docs), GEN_DOCS_LIMIT)} docs (this may take a few minutes)...")

    synthetic_dataset = generator.generate_with_langchain_docs(
        documents=rag_docs,        # includes the guidelines doc + diverse reviews
        testset_size=TESTSET_SIZE, # ✅ 10 questions
    )

    print("✅ Synthetic test set generated successfully!")

    # Convert to DataFrame and display
    synthetic_df = synthetic_dataset.to_pandas()
    print(f"\n📊 Generated {len(synthetic_df)} synthetic test cases")

    # Optional: light post-filter to nudge away from single-review phrasing
    def looks_too_review_specific(q: str) -> bool:
        ql = q.lower()
        triggers = [
            "what did", "what does", "according to this review", "quote", "in the following review",
            "the reviewer", "this critic", "as stated above"
        ]
        return any(t in ql for t in triggers)

    filtered_df = synthetic_df[~synthetic_df["user_input"].apply(looks_too_review_specific)]
    if len(filtered_df) < TESTSET_SIZE:
        print("⚠️ Some questions looked too review-specific; keeping the rest.")
    synthetic_df = filtered_df.head(TESTSET_SIZE)

    # Show sample questions
    print("\n📝 Generated Questions (general/comparative):")
    print("-" * 50)
    for i, row in synthetic_df.head(10).iterrows():
        print(f"Q{i+1}: {row['user_input']}")
        ref = row.get("reference", "")
        if isinstance(ref, str) and ref.strip():
            print(f"Expected Answer: {ref[:100]}...")
        print("-" * 50)

    # Store for evaluation
    golden_test_set = synthetic_dataset
    
    # Update questions_for_evaluation for the rest of the notebook
    questions_for_evaluation = synthetic_df["user_input"].tolist()

except Exception as e:
    print(f"❌ Error in RAGAS setup: {e}")
    print("💡 You can continue without RAGAS - the agent will use only embedded reviews")
    questions_for_evaluation = []

print(f"\n✅ Golden test set ready with {len(synthetic_df)} questions")
print("🎯 Ready for comprehensive RAGAS evaluation!")
print("\n🔍 Key Improvements in this version:")
print("  • General/comparative questions instead of reviewer-specific")
print("  • Better document preparation with minimum token requirements")
print("  • Grouped reviews by movie for better context")
print("  • Post-filtering to remove overly specific questions")
print("  • Configurable parameters for fine-tuning")


🎯 Generating Golden Test Set using RAGAS...
✅ RAGAS imports successful
📄 Preparing documents for test set generation...
   ➜  50 review docs ready (each ≥ 120 tokens) + 1 guidelines doc
   Selected 50 review docs (+1 guidelines doc)
🤖 Setting up RAGAS generator models...
⚙️ Creating RAGAS test set generator...
🔬 Generating 10 general/comparative questions from 50 docs (this may take a few minutes)...


Applying HeadlinesExtractor:   0%|          | 0/50 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/51 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/50 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/246 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

✅ Synthetic test set generated successfully!

📊 Generated 12 synthetic test cases
⚠️ Some questions looked too review-specific; keeping the rest.

📝 Generated Questions (general/comparative):
--------------------------------------------------
Q2: What was Matt Brunson's overall sentiment regarding Deadpool 2?
Expected Answer: Matt Brunson from Film Frenzy gave Deadpool 2 a score of 3/4 and expressed a positive sentiment, sta...
--------------------------------------------------
Q5: What positive sentiments do critics express about the nostalgia and emotional impact of Star Wars: The Force Awakens?
Expected Answer: Critics express a range of positive sentiments about Star Wars: The Force Awakens, highlighting its ...
--------------------------------------------------
Q6: What are the audience and critic sentiments towards Star Wars: The Force Awakens, and how does it compare to the original trilogy in terms of storytelling and nostalgia?
Expected Answer: The audience sentiment towards S

In [30]:
synthetic_df = golden_test_set.to_pandas()

In [31]:
synthetic_df

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What did Nicholas Oon think about the sequel t...,[Audience Score: 85.0%\nTomato Meter: 84.0%\nR...,Nicholas Oon from Maximum Hype (YouTube) rated...,single_hop_specifc_query_synthesizer
1,What was Matt Brunson's overall sentiment rega...,"[Movie: Deadpool 2\n Genre: Action, Adventure,...",Matt Brunson from Film Frenzy gave Deadpool 2 ...,single_hop_specifc_query_synthesizer
2,What did Christine Champagne from Out Magazene...,[Audience Score: 86.0%\nTomato Meter: 94.0%\nR...,Christine Champagne from Out Magazine describe...,single_hop_specifc_query_synthesizer
3,What did Keith H. Brown from Eye for Film say ...,"[Movie: American Splendor\n Genre: Biography, ...",Keith H. Brown from Eye for Film gave the movi...,single_hop_specifc_query_synthesizer
4,What positive sentiments do critics express ab...,[<1-hop>\n\nReviews: 445\n\nReviews:\n\n--- Re...,Critics express a range of positive sentiments...,multi_hop_abstract_query_synthesizer
5,What are the audience and critic sentiments to...,[<1-hop>\n\nReviews: 445\n\nReviews:\n\n--- Re...,The audience sentiment towards Star Wars: The ...,multi_hop_abstract_query_synthesizer
6,What is the audience score for Captain America...,[<1-hop>\n\nRelease Date: 2019-04-26\nAudience...,The audience score for Captain America: Civil ...,multi_hop_abstract_query_synthesizer
7,What was the release date of Avengers: Endgame...,[<1-hop>\n\nRelease Date: 2019-04-26\nAudience...,"Avengers: Endgame was released on April 26, 20...",multi_hop_abstract_query_synthesizer
8,What are the sentiments expressed by Eleanor R...,[<1-hop>\n\nAudience Score: 90.0%\nTomato Mete...,"In her review of 'Rain Man', Eleanor Ringel Ca...",multi_hop_specific_query_synthesizer
9,What are the critical sentiments expressed abo...,[<1-hop>\n\nMovie: Roma\n Genre: Drama\n Direc...,Critics have expressed a range of sentiments a...,multi_hop_specific_query_synthesizer


## Step 8: Enhanced Agent Evaluation with Retriever Swapping


In [32]:
# Function to swap retriever in the enhanced agent and evaluate
def evaluate_enhanced_agent_with_retriever(retriever_name: str, retriever, questions: List[str]):
    """
    Evaluate the enhanced agent with a specific retriever.
    This swaps out the base_retriever used in the search_movie_reviews tool.
    """
    print(f"\n🔍 Evaluating Enhanced Agent with {retriever_name} Retriever...")
    print("-" * 60)
    
    # Global reference to swap the retriever
    global base_retriever
    original_retriever = base_retriever
    
    try:
        # Swap in the new retriever
        base_retriever = retriever
        
        start_time = time.time()
        successful_queries = 0
        total_tool_calls = 0
        results = []
        
        for i, question in enumerate(questions):
            print(f"Processing question {i+1}/{len(questions)}: {question[:50]}...")
            
            try:
                # Query the enhanced agent with the swapped retriever
                result = query_enhanced_agent_with_tracing(
                    question, 
                    run_name=f"{retriever_name.lower().replace(' ', '_')}_q{i+1}"
                )
                
                if result['success']:
                    successful_queries += 1
                    total_tool_calls += result['tool_calls_made']
                    results.append({
                        'question': question,
                        'answer': result['answer'],
                        'tool_calls': result['tool_calls_made'],
                        'execution_time': result['execution_time'],
                        'success': True
                    })
                else:
                    print(f"   ❌ Failed: {result['answer']}")
                    results.append({
                        'question': question,
                        'answer': result['answer'],
                        'tool_calls': 0,
                        'execution_time': 0,
                        'success': False
                    })
                    
            except Exception as e:
                print(f"   ❌ Error: {str(e)}")
                results.append({
                    'question': question,
                    'answer': f"Error: {str(e)}",
                    'tool_calls': 0,
                    'execution_time': 0,
                    'success': False
                })
        
        end_time = time.time()
        total_time = end_time - start_time
        
        # Calculate metrics
        success_rate = successful_queries / len(questions) if questions else 0
        avg_tool_calls = total_tool_calls / len(questions) if questions else 0
        avg_execution_time = total_time / len(questions) if questions else 0
        
        evaluation_result = {
            'retriever_name': retriever_name,
            'success_rate': success_rate,
            'avg_tool_calls': avg_tool_calls,
            'total_time': total_time,
            'avg_execution_time': avg_execution_time,
            'successful_queries': successful_queries,
            'total_queries': len(questions),
            'results': results
        }
        
        print(f"✅ {retriever_name} Evaluation Complete:")
        print(f"   Success Rate: {success_rate:.2%}")
        print(f"   Avg Tool Calls: {avg_tool_calls:.1f}")
        print(f"   Total Time: {total_time:.2f}s")
        print(f"   Avg Time per Query: {avg_execution_time:.2f}s")
        
        return evaluation_result
        
    finally:
        # Always restore the original retriever
        base_retriever = original_retriever

# Function to estimate costs for enhanced agent evaluation
def estimate_enhanced_agent_cost(retriever_name: str, num_queries: int) -> float:
    """Estimate API costs for enhanced agent with different retrievers"""
    
    # Base costs per query (including agent reasoning + tool calls)
    base_costs = {
        'Naive': 0.008,  # OpenAI embeddings + LLM calls + tool execution
        'BM25': 0.006,   # No embeddings, but more tool calls due to lower accuracy
        'Multi-Query': 0.015,  # Multiple LLM calls + embeddings + retries
        'Parent Document': 0.010,  # Embeddings + LLM + larger context
        'Contextual Compression': 0.025,  # Cohere rerank + embeddings + LLM
        'Ensemble': 0.035,  # All of the above combined + coordination overhead
    }
    
    return base_costs.get(retriever_name, 0.012) * num_queries

print("✅ Enhanced agent evaluation functions ready!")
print("🔄 Ready to swap retrievers and test the complete enhanced pipeline!")


✅ Enhanced agent evaluation functions ready!
🔄 Ready to swap retrievers and test the complete enhanced pipeline!


## Step 9: Run Complete Enhanced Agent Evaluation


In [19]:
# Run comprehensive evaluation of enhanced agent with all retrievers
print("🚀 Starting Comprehensive Enhanced Agent Evaluation")
print("=" * 80)
print(f"📊 Testing {len(retrievers_to_evaluate)} retrievers on {len(questions_for_evaluation)} questions")
print(f"🤖 Each question will go through the complete enhanced agent pipeline")
print("=" * 80)

# Store all evaluation results
all_evaluation_results = []

# Evaluate each retriever with the enhanced agent
for retriever_name, retriever in retrievers_to_evaluate:
    try:
        # Run evaluation with this retriever
        result = evaluate_enhanced_agent_with_retriever(
            retriever_name, 
            retriever, 
            questions_for_evaluation
        )
        
        # Add cost estimation
        result['estimated_cost'] = estimate_enhanced_agent_cost(
            retriever_name, 
            len(questions_for_evaluation)
        )
        
        all_evaluation_results.append(result)
        
        # Add a brief pause between evaluations to avoid rate limits
        print(f"⏱️ Pausing briefly before next retriever...")
        time.sleep(2)
        
    except Exception as e:
        print(f"❌ Failed to evaluate {retriever_name}: {str(e)}")
        # Add placeholder result for failed evaluation
        all_evaluation_results.append({
            'retriever_name': retriever_name,
            'success_rate': 0.0,
            'avg_tool_calls': 0.0,
            'total_time': 0.0,
            'avg_execution_time': 0.0,
            'successful_queries': 0,
            'total_queries': len(questions_for_evaluation),
            'estimated_cost': 0.0,
            'results': []
        })

print("\\n✅ All Enhanced Agent Evaluations Complete!")
print(f"📈 Evaluated {len(all_evaluation_results)} retrievers with enhanced agent pipeline")

# Create results DataFrame for analysis
results_df = pd.DataFrame([
    {
        'Retriever': result['retriever_name'],
        'Success Rate': result['success_rate'],
        'Avg Tool Calls': result['avg_tool_calls'],
        'Total Time (s)': result['total_time'],
        'Avg Time/Query (s)': result['avg_execution_time'],
        'Estimated Cost ($)': result['estimated_cost']
    }
    for result in all_evaluation_results
])

# Display comprehensive results
print("\\n📊 Enhanced Agent Retriever Evaluation Results:")
print("=" * 80)
print(results_df.round(3).to_string(index=False))

# Find best performers
successful_results = results_df[results_df['Success Rate'] > 0.5]

if len(successful_results) > 0:
    fastest = successful_results.loc[successful_results['Total Time (s)'].idxmin()]
    cheapest = successful_results.loc[successful_results['Estimated Cost ($)'].idxmin()]
    most_reliable = successful_results.loc[successful_results['Success Rate'].idxmax()]
    most_efficient = successful_results.loc[successful_results['Avg Tool Calls'].idxmin()]
    
    print("\\n🏆 ENHANCED AGENT PERFORMANCE WINNERS:")
    print("=" * 50)
    print(f"⚡ Fastest: {fastest['Retriever']} ({fastest['Total Time (s)']:.2f}s total)")
    print(f"💰 Most Cost-Effective: {cheapest['Retriever']} (${cheapest['Estimated Cost ($)']:.3f})")
    print(f"🎯 Most Reliable: {most_reliable['Retriever']} ({most_reliable['Success Rate']:.1%} success)")
    print(f"🔧 Most Efficient: {most_efficient['Retriever']} ({most_efficient['Avg Tool Calls']:.1f} avg tools)")
    
    # Calculate combined score for overall best
    successful_results = successful_results.copy()
    successful_results['Combined Score'] = (
        0.3 * successful_results['Success Rate'] + 
        0.25 * (1 / (successful_results['Total Time (s)'] + 1)) + 
        0.25 * (1 / (successful_results['Estimated Cost ($)'] + 0.001)) +
        0.2 * (1 / (successful_results['Avg Tool Calls'] + 1))
    )
    
    best_overall = successful_results.loc[successful_results['Combined Score'].idxmax()]
    print(f"🌟 Best Overall: {best_overall['Retriever']} (Score: {best_overall['Combined Score']:.3f})")
    
else:
    print("\\n⚠️ No retrievers achieved >50% success rate. Check system configuration.")

print(f"\\n🎬 Enhanced Agent Movie Review Evaluation Complete!")
print(f"⏰ Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


🚀 Starting Comprehensive Enhanced Agent Evaluation
📊 Testing 6 retrievers on 12 questions
🤖 Each question will go through the complete enhanced agent pipeline

🔍 Evaluating Enhanced Agent with Naive Retriever...
------------------------------------------------------------
Processing question 1/12: What are the key highlights of the documentary 'Be...


  retrieved_docs = base_retriever.get_relevant_documents(query)


Processing question 2/12: What is the overall sentiment of the review for Bl...
Processing question 3/12: How does the choreography in 'City Hunter: Shinjuk...
Processing question 4/12: What are the main criticisms of the film City Hunt...
Processing question 5/12: What are the contrasting sentiments expressed in t...
Processing question 6/12: What are the critical perspectives on Alan J. Paku...
Processing question 7/12: What are the contrasting reviews of The DUFF in te...
Processing question 8/12: Who directed the movie Klute and what is notable a...
Processing question 9/12: How does John DeFore's review of 'More Than Honey'...
Processing question 10/12: In what ways does Richard Dutcher's film 'Falling'...
Processing question 11/12: In what ways does the film 'Miss and the Doctors' ...
Processing question 12/12: What are the critical reviews of the movie '10 Ter...
✅ Naive Evaluation Complete:
   Success Rate: 100.00%
   Avg Tool Calls: 1.4
   Total Time: 247.74s
   Avg Time per Q

## Step 11: LangSmith Advanced Evaluation Setup (QA Accuracy, Helpfulness, Relevance)

In [33]:
# Setup LangSmith evaluation dataset and configuration
print("🔬 Setting up LangSmith Advanced Evaluation...")

# Check if LangSmith is available
USE_LANGSMITH = bool(os.getenv("LANGSMITH_API_KEY"))

if USE_LANGSMITH:
    print("✅ LangSmith detected - setting up advanced evaluation")
    
    try:
        from langsmith import Client
        from langsmith.evaluation import LangChainStringEvaluator, evaluate
        from langchain.prompts import ChatPromptTemplate
        from langchain.schema import StrOutputParser
        from operator import itemgetter
        
        # Initialize LangSmith client
        langsmith_client = Client()
        
        # Create dataset name for this evaluation
        LANGSMITH_DATASET_NAME = f"movie-reviews-retriever-evalfinal-{unique_id}"
        print(f"📊 LangSmith dataset: {LANGSMITH_DATASET_NAME}")
        
        # Create dataset from our evaluation questions using RAGAS reference answers
        dataset_inputs = []
        dataset_outputs = []
        
        # Use the synthetic_df that was created by RAGAS (contains both questions and references)
        num_questions = min(10, len(synthetic_df))  # Use subset for LangSmith eval
        
        print(f"📝 Using RAGAS-generated reference answers from synthetic_df...")
        print(f"📊 Available columns in synthetic_df: {list(synthetic_df.columns)}")
        
        # Find the correct reference column
        if 'reference' in synthetic_df.columns:
            reference_col = 'reference'
        elif 'expected_output' in synthetic_df.columns:
            reference_col = 'expected_output'
        elif 'answer' in synthetic_df.columns:
            reference_col = 'answer'
        else:
            reference_col = synthetic_df.columns[-1]  # Use last column as fallback
            print(f"⚠️ Using column '{reference_col}' as reference answers")
        
        # Find the question column (already determined in RAGAS step)
        if 'question' in synthetic_df.columns:
            question_col = 'question'
        elif 'user_input' in synthetic_df.columns:
            question_col = 'user_input'
        else:
            question_col = synthetic_df.columns[0]
        
        for i in range(num_questions):
            row = synthetic_df.iloc[i]
            question = row[question_col]
            reference = row[reference_col]
            
            dataset_inputs.append({"question": question})
            dataset_outputs.append({"answer": reference})
        
        print(f"✅ Using {len(dataset_inputs)} RAGAS question-reference pairs for LangSmith evaluation")
        
        # Create the dataset in LangSmith
        try:
            dataset = langsmith_client.create_dataset(
                dataset_name=LANGSMITH_DATASET_NAME,
                description="Movie review questions for retriever evaluation with RAGAS-generated references"
            )
            
            # Add examples to dataset
            for inputs, outputs in zip(dataset_inputs, dataset_outputs):
                langsmith_client.create_example(
                    dataset_id=dataset.id,
                    inputs=inputs,
                    outputs=outputs
                )
            
            print(f"✅ Created LangSmith dataset with {len(dataset_inputs)} examples")
            
        except Exception as e:
            if "already exists" in str(e).lower():
                print(f"📋 Dataset {LANGSMITH_DATASET_NAME} already exists - using existing dataset")
            else:
                print(f"⚠️ Dataset creation issue: {e}")
                # Continue with evaluation anyway
        
        print("🎯 LangSmith evaluation setup complete!")
        
    except ImportError as e:
        print(f"❌ LangSmith libraries not available: {e}")
        print("💡 Install with: pip install langsmith")
        USE_LANGSMITH = False
    except Exception as e:
        print(f"⚠️ LangSmith setup issue: {e}")
        USE_LANGSMITH = False

else:
    print("⚠️ LangSmith not configured - skipping advanced evaluation")
    print("💡 Set LANGSMITH_API_KEY environment variable to enable advanced metrics")

print(f"🔧 LangSmith evaluation: {'ENABLED' if USE_LANGSMITH else 'DISABLED'}")


🔬 Setting up LangSmith Advanced Evaluation...
✅ LangSmith detected - setting up advanced evaluation
📊 LangSmith dataset: movie-reviews-retriever-evalfinal-7de23b06
📝 Using RAGAS-generated reference answers from synthetic_df...
📊 Available columns in synthetic_df: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']
✅ Using 10 RAGAS question-reference pairs for LangSmith evaluation
✅ Created LangSmith dataset with 10 examples
🎯 LangSmith evaluation setup complete!
🔧 LangSmith evaluation: ENABLED


## Step 12: LangSmith Advanced Evaluation Run Evaluation Using AGENTIC RAG

In [35]:

from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

# Run LangSmith Advanced Evaluation with ENHANCED AGENT (not simple RAG)
if USE_LANGSMITH:
    print("\n🔬 Running LangSmith evaluation with ENHANCED AGENT for ALL retrievers...")
    print("🤖 This tests the complete agentic pipeline with multi-tool selection")
    
    try:
        # Movie-specific evaluators (same as before)
        eval_llm = ChatOpenAI(model="gpt-4o-mini")
        qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})
        
        # Movie review helpfulness evaluator
        movie_helpfulness_evaluator = LangChainStringEvaluator(
            "labeled_criteria",
            config={
                "criteria": {
                    "helpfulness": (
                        "Is this submission helpful for someone looking for movie information,"
                        " taking into account the correct reference answer about movies/reviews?"
                    )
                },
                "llm": eval_llm
            },
            prepare_data=lambda run, example: {
                "prediction": run.outputs["output"],
                "reference": example.outputs["answer"],
                "input": example.inputs["question"],
            }
        )
        
        # Movie review relevance evaluator
        movie_relevance_evaluator = LangChainStringEvaluator(
            "criteria",
            config={
                "criteria": {
                    "relevance": "Is this response relevant to the movie/review question? Does it provide useful movie information?",
                },
                "llm": eval_llm
            }
        )
        
        # Define all retrievers to evaluate (using the same ones from our main evaluation)
        all_retrievers_to_evaluate = [
            (naive_retriever, "Naive"),
            (bm25_retriever, "BM25"),
            (multi_query_retriever, "Multi-Query"),
            (parent_document_retriever, "Parent-Document"),
            (compression_retriever, "Contextual-Compression"),
            (ensemble_retriever, "Ensemble")
        ]
        
        print(f"📊 Evaluating {len(all_retrievers_to_evaluate)} retrievers with ENHANCED AGENT...")
        print("🔍 Evaluators: QA Accuracy, Movie Helpfulness, Movie Relevance")
        print("🤖 Testing: Complete agentic pipeline with tool selection and reasoning")
        
        # Evaluate each retriever with the ENHANCED AGENT
        for retriever, name in all_retrievers_to_evaluate:
            print(f"\n🔍 Evaluating {name} retriever with Enhanced Agent...")
            
            try:
                # Create a wrapper function for LangSmith that uses our enhanced agent
                def enhanced_agent_wrapper(inputs):
                    """
                    Wrapper function that LangSmith can call to evaluate our enhanced agent.
                    This swaps the retriever and runs the complete agentic pipeline.
                    """
                    question = inputs["question"]
                    
                    # Temporarily swap the retriever
                    global base_retriever
                    original_retriever = base_retriever
                    
                    try:
                        # Swap in the current retriever being evaluated
                        base_retriever = retriever
                        
                        # Use our enhanced agent function
                        result = query_enhanced_agent_with_tracing(
                            question, 
                            run_name=f"langsmith_{name.lower().replace(' ', '_').replace('-', '_')}"
                        )
                        
                        # Return the answer (LangSmith expects this format)
                        return result.get("answer", "No answer generated")
                        
                    except Exception as e:
                        return f"Enhanced agent error: {str(e)}"
                    
                    finally:
                        # Always restore original retriever
                        base_retriever = original_retriever
                
                # Run LangSmith evaluation with our enhanced agent wrapper
                experiment_results = evaluate(
                    enhanced_agent_wrapper,
                    data=LANGSMITH_DATASET_NAME,
                    evaluators=[
                        qa_evaluator,
                        movie_helpfulness_evaluator,
                        movie_relevance_evaluator
                    ],
                    metadata={
                        "retriever_type": name, 
                        "evaluation_run": "enhanced_agent_retrievers",
                        "evaluators": "qa_helpfulness_relevance",
                        "domain": "movie_reviews",
                        "evaluation_mode": "enhanced_agent_pipeline",
                        "agent_features": "multi_tool_selection_external_search_analytics"
                    },
                    experiment_prefix=f"enhanced_agent_{name.lower().replace(' ', '_').replace('-', '_')}"
                )
                
                print(f"✅ {name} enhanced agent evaluation completed successfully")
                
                # Add rate limiting delay between retrievers
                print(f"⏱️ Pausing briefly before next retriever...")
                time.sleep(3)  # 3 second delay between retrievers
                
            except Exception as e:
                print(f"❌ {name} enhanced agent evaluation failed: {e}")
                continue
        
        print("\n🎯 All enhanced agent retriever evaluations completed!")
        print("📊 Check LangSmith dashboard for detailed comparison results!")
        print("🔍 Each retriever tested with: Complete Enhanced Agent Pipeline")
        print("🤖 Includes: Multi-tool selection, external search, analytics, reasoning")
        print(f"🌐 LangSmith Project: {project_name}")
        print(f"📋 Dataset: {LANGSMITH_DATASET_NAME}")
        
    except Exception as e:
        print(f"❌ Enhanced Agent LangSmith evaluation failed: {e}")
        print("💡 Check your LangSmith API key and network connection")
        
else:
    print("\n⚠️ Skipping Enhanced Agent LangSmith evaluation (not configured)")
    print("💡 Configure LangSmith API key to enable:")
    print("   - Enhanced Agent QA Accuracy scoring")
    print("   - Enhanced Agent Movie helpfulness analysis") 
    print("   - Enhanced Agent Content relevance evaluation")
    print("   - Enhanced Agent Tool usage analytics")
    print("   - Detailed agentic pipeline performance comparisons")



🔬 Running LangSmith evaluation with ENHANCED AGENT for ALL retrievers...
🤖 This tests the complete agentic pipeline with multi-tool selection
📊 Evaluating 6 retrievers with ENHANCED AGENT...
🔍 Evaluators: QA Accuracy, Movie Helpfulness, Movie Relevance
🤖 Testing: Complete agentic pipeline with tool selection and reasoning

🔍 Evaluating Naive retriever with Enhanced Agent...
View the evaluation results for experiment: 'enhanced_agent_naive-b02acd5f' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=bf8b540c-a127-49c8-ad4e-f75c85ea5ac5




0it [00:00, ?it/s]

  retrieved_docs = base_retriever.get_relevant_documents(query)


KeyboardInterrupt: 

## Step 13: LangSmith Advanced Evaluation Run Evaluation Basic RAG

In [30]:
# Run LangSmith Advanced Evaluation for ALL Retrievers
if USE_LANGSMITH:
    print("\n🔬 Running LangSmith evaluation for ALL movie review retrievers...")
    
    try:
        # Create RAG chain for movie review evaluation
        MOVIE_RAG_PROMPT = """Given the provided movie review context and question, answer the question based only on the context.
If you cannot answer based on the context, say "I don't know".

Focus on:
- Movie titles, directors, genres, and release information
- Critic opinions, scores, and sentiments
- Review publication sources and dates
- Rotten Tomatoes ratings (Tomatometer and Audience Score)

Context: {context}
Question: {question}"""
        
        rag_prompt = ChatPromptTemplate.from_template(MOVIE_RAG_PROMPT)
        eval_llm = ChatOpenAI(model="gpt-4o-mini")
        
        # Movie-specific evaluators
        qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})
        
        # Movie review helpfulness evaluator
        movie_helpfulness_evaluator = LangChainStringEvaluator(
            "labeled_criteria",
            config={
                "criteria": {
                    "helpfulness": (
                        "Is this submission helpful for someone looking for movie information,"
                        " taking into account the correct reference answer about movies/reviews?"
                    )
                },
                "llm": eval_llm
            },
            prepare_data=lambda run, example: {
                "prediction": run.outputs["output"],
                "reference": example.outputs["answer"],
                "input": example.inputs["question"],
            }
        )
        
        # Movie review relevance evaluator
        movie_relevance_evaluator = LangChainStringEvaluator(
            "criteria",
            config={
                "criteria": {
                    "relevance": "Is this response relevant to the movie/review question? Does it provide useful movie information?",
                },
                "llm": eval_llm
            }
        )
        
        # Define all retrievers to evaluate (using the same ones from our main evaluation)
        all_retrievers_to_evaluate = [
            (naive_retriever, "Naive"),
            (bm25_retriever, "BM25"),
            (multi_query_retriever, "Multi-Query"),
            (parent_document_retriever, "Parent-Document"),
            (compression_retriever, "Contextual-Compression"),
            (ensemble_retriever, "Ensemble")
        ]
        
        print(f"📊 Evaluating {len(all_retrievers_to_evaluate)} movie review retrievers with LangSmith...")
        print("🔍 Evaluators: QA Accuracy, Movie Helpfulness, Movie Relevance")
        print("📋 This evaluation tests retrievers directly (not through enhanced agent)")
        
        # Evaluate each retriever
        for retriever, name in all_retrievers_to_evaluate:
            print(f"\n🔍 Evaluating {name} retriever on movie reviews...")
            
            try:
                # Create RAG chain for this retriever
                rag_chain = (
                    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
                    | rag_prompt | eval_llm | StrOutputParser()
                )
                
                # Run evaluation for this retriever
                experiment_results = evaluate(
                    rag_chain.invoke,
                    data=LANGSMITH_DATASET_NAME,
                    evaluators=[
                        qa_evaluator,
                        movie_helpfulness_evaluator,
                        movie_relevance_evaluator
                    ],
                    metadata={
                        "retriever_type": name, 
                        "evaluation_run": "movie_review_retrievers",
                        "evaluators": "qa_helpfulness_relevance",
                        "domain": "movie_reviews",
                        "evaluation_mode": "direct_retriever"
                    },
                    experiment_prefix=f"movie_retriever_{name.lower().replace(' ', '_').replace('-', '_')}"
                )
                
                print(f"✅ {name} movie review evaluation completed successfully")
                
                # Add rate limiting delay between retrievers
                print(f"⏱️ Pausing briefly before next retriever...")
                time.sleep(3)  # 3 second delay between retrievers
                
            except Exception as e:
                print(f"❌ {name} evaluation failed: {e}")
                continue
        
        print("\n🎯 All movie review retriever evaluations completed!")
        print("📊 Check LangSmith dashboard for detailed comparison results!")
        print("🔍 Each retriever has been evaluated for: QA Accuracy, Movie Helpfulness, Movie Relevance")
        print(f"🌐 LangSmith Project: {project_name}")
        print(f"📋 Dataset: {LANGSMITH_DATASET_NAME}")
        
    except Exception as e:
        print(f"❌ LangSmith evaluation failed: {e}")
        print("💡 Check your LangSmith API key and network connection")
        
else:
    print("\n⚠️ Skipping LangSmith evaluation (not configured)")
    print("💡 Configure LangSmith API key to enable:")
    print("   - QA Accuracy scoring")
    print("   - Movie helpfulness analysis") 
    print("   - Content relevance evaluation")
    print("   - Detailed performance comparisons")



🔬 Running LangSmith evaluation for ALL movie review retrievers...
📊 Evaluating 6 movie review retrievers with LangSmith...
🔍 Evaluators: QA Accuracy, Movie Helpfulness, Movie Relevance
📋 This evaluation tests retrievers directly (not through enhanced agent)

🔍 Evaluating Naive retriever on movie reviews...
View the evaluation results for experiment: 'movie_retriever_naive-ac3fd236' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/2d89c42c-0b26-429c-82c5-e11a2ec88e53/compare?selectedSessions=1cd0d506-8e96-478b-9c7c-bf0cddb739de




0it [00:00, ?it/s]

✅ Naive movie review evaluation completed successfully
⏱️ Pausing briefly before next retriever...

🔍 Evaluating BM25 retriever on movie reviews...
View the evaluation results for experiment: 'movie_retriever_bm25-871d04be' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/2d89c42c-0b26-429c-82c5-e11a2ec88e53/compare?selectedSessions=97fa8ff8-96c6-4a43-ad8f-16ebbd467e59




0it [00:00, ?it/s]

✅ BM25 movie review evaluation completed successfully
⏱️ Pausing briefly before next retriever...

🔍 Evaluating Multi-Query retriever on movie reviews...
View the evaluation results for experiment: 'movie_retriever_multi_query-9ab2d4e4' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/2d89c42c-0b26-429c-82c5-e11a2ec88e53/compare?selectedSessions=36e9f2de-38aa-4b00-8419-1256d17ea5ad




0it [00:00, ?it/s]

✅ Multi-Query movie review evaluation completed successfully
⏱️ Pausing briefly before next retriever...

🔍 Evaluating Parent-Document retriever on movie reviews...
View the evaluation results for experiment: 'movie_retriever_parent_document-d0a54e61' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/2d89c42c-0b26-429c-82c5-e11a2ec88e53/compare?selectedSessions=5d6ad0b6-ddd2-49a9-94f0-4c0aa29a7eef




0it [00:00, ?it/s]

✅ Parent-Document movie review evaluation completed successfully
⏱️ Pausing briefly before next retriever...

🔍 Evaluating Contextual-Compression retriever on movie reviews...
View the evaluation results for experiment: 'movie_retriever_contextual_compression-e3873fe3' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/2d89c42c-0b26-429c-82c5-e11a2ec88e53/compare?selectedSessions=a869713e-662c-4330-afc9-677f74a0334a




0it [00:00, ?it/s]

✅ Contextual-Compression movie review evaluation completed successfully
⏱️ Pausing briefly before next retriever...

🔍 Evaluating Ensemble retriever on movie reviews...
View the evaluation results for experiment: 'movie_retriever_ensemble-1a5e3f73' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/2d89c42c-0b26-429c-82c5-e11a2ec88e53/compare?selectedSessions=4e314c41-7427-4d23-a1ca-718d68adf17f




0it [00:00, ?it/s]

✅ Ensemble movie review evaluation completed successfully
⏱️ Pausing briefly before next retriever...

🎯 All movie review retriever evaluations completed!
📊 Check LangSmith dashboard for detailed comparison results!
🔍 Each retriever has been evaluated for: QA Accuracy, Movie Helpfulness, Movie Relevance
🌐 LangSmith Project: Movie-Reviews-Enhanced-RAG-31693300
📋 Dataset: movie-reviews-retriever-eval1-31693300


## Evaluation Summary: Two Complementary Approaches

This notebook provides **two complementary evaluation approaches** for your movie review RAG system:

### 🤖 **Enhanced Agent Evaluation** (Steps 8-10)
- **What**: Tests retrievers within your complete enhanced agent pipeline
- **Includes**: Multi-tool selection, external search, statistical analysis, and LangSmith tracing
- **Metrics**: Success rate, tool usage, timing, cost estimation
- **Purpose**: Real-world performance evaluation of your production system
- **Output**: Production recommendations by use case (speed vs accuracy vs cost)

### 🔬 **LangSmith Advanced Evaluation** (Step 11)
- **What**: Tests retrievers directly with standardized LangSmith metrics
- **Includes**: QA Accuracy, Helpfulness, and Relevance scoring
- **Metrics**: Scientific evaluation scores for retriever comparison
- **Purpose**: Detailed retriever analysis with industry-standard metrics
- **Output**: Dashboard analytics and comparative scoring

### 💡 **Why Both Matter**:
- **Enhanced Agent**: Shows how retrievers perform in your actual use case with all tools
- **LangSmith**: Provides standardized metrics for scientific comparison
- **Together**: Complete picture of retriever performance for optimal selection

### 🎯 **Next Steps**:
1. **Run the notebook** to get both enhanced agent and LangSmith evaluations
2. **Use Enhanced Agent results** for production deployment decisions  
3. **Use LangSmith metrics** for detailed retriever analysis and optimization
4. **Compare RAGAS references** with actual retriever outputs in LangSmith dashboard
5. **Monitor both** in production for continuous improvement

### 🔧 **Key Fix Applied**:
- ✅ **RAGAS Reference Integration**: LangSmith now uses actual RAGAS-generated reference answers instead of generic ones
- ✅ **Proper Question-Answer Mapping**: Each evaluation question has its corresponding RAGAS reference
- ✅ **Better Evaluation Quality**: More accurate helpfulness and relevance scoring


In [None]:
# # 🐛 ISOLATED RAGAS DEBUG - Single Row Test (No LangSmith)

# print("🐛 ISOLATED RAGAS DEBUG - Testing with single example...")
# print("=" * 80)

# try:
#     from ragas import evaluate as ragas_evaluate, EvaluationDataset
#     from ragas.metrics import faithfulness
#     import os
    
#     # Disable RAGAS tracking
#     os.environ['RAGAS_DO_NOT_TRACK'] = 'true'
    
#     print("✅ RAGAS imports successful")
#     print("🎯 Testing with SINGLE example outside LangSmith context")
    
#     # 1. Get one example from the dataset
#     if 'synthetic_df' in locals() and len(synthetic_df) > 0:
#         test_row = synthetic_df.iloc[0]
#         test_question = test_row['user_input']
#         test_reference = test_row['reference']
        
#         print(f"\n📋 TEST DATA:")
#         print(f"   Question: {test_question}")
#         print(f"   Reference: {test_reference[:100]}...")
        
#         # 2. Use naive retriever to get contexts
#         print(f"\n🔍 RETRIEVAL:")
#         retrieved_docs = naive_retriever.invoke(test_question)
#         contexts = [doc.page_content for doc in retrieved_docs[:2]]
        
#         print(f"   Retrieved {len(retrieved_docs)} docs, using first 2")
#         for i, ctx in enumerate(contexts):
#             print(f"   Context[{i}]: {ctx[:100]}...")
        
#         # 3. Generate response using our RAG pipeline
#         print(f"\n🤖 RAG RESPONSE GENERATION:")
#         context_str = "\n\n".join(contexts)
#         rag_prompt = f"""Given the movie review context, answer the question based only on the context.

# Context: {context_str}
# Question: {test_question}"""
        
#         test_response = chat_model.invoke(rag_prompt).content
#         print(f"   Generated response: {test_response[:150]}...")
        
#         # 4. Create RAGAS evaluation data
#         print(f"\n📊 RAGAS DATA STRUCTURE:")
#         eval_data = [{
#             'user_input': test_question,
#             'response': test_response,
#             'retrieved_contexts': contexts,
#             'reference': test_reference
#         }]
        
#         print(f"   user_input: {type(eval_data[0]['user_input'])} - '{eval_data[0]['user_input'][:50]}...'")
#         print(f"   response: {type(eval_data[0]['response'])} - '{eval_data[0]['response'][:50]}...'")
#         print(f"   retrieved_contexts: {type(eval_data[0]['retrieved_contexts'])} - {len(eval_data[0]['retrieved_contexts'])} items")
#         print(f"   reference: {type(eval_data[0]['reference'])} - '{eval_data[0]['reference'][:50]}...'")
        
#         # 5. Create RAGAS dataset
#         print(f"\n🔬 CREATING RAGAS DATASET:")
#         ragas_dataset = EvaluationDataset.from_list(eval_data)
#         print(f"   Dataset created successfully!")
#         print(f"   Dataset type: {type(ragas_dataset)}")
#         print(f"   Dataset length: {len(ragas_dataset) if hasattr(ragas_dataset, '__len__') else 'Unknown'}")
        
#         # 6. Run RAGAS evaluation (THIS IS WHERE THE ERROR LIKELY OCCURS)
#         print(f"\n⚡ RUNNING RAGAS EVALUATION:")
#         print(f"   Metric: Faithfulness")
#         print(f"   LLM: {chat_model}")
#         print(f"   Embeddings: {embedding_model}")
        
#         # Try with minimal parameters first
#         try:
#             result = ragas_evaluate(
#                 dataset=ragas_dataset,
#                 metrics=[faithfulness],
#                 llm=chat_model,
#                 embeddings=embedding_model
#             )
            
#             print(f"✅ RAGAS evaluation completed successfully!")
#             print(f"   Result type: {type(result)}")
            
#             # Try to extract the score
#             try:
#                 result_df = result.to_pandas()
#                 print(f"   Result DataFrame shape: {result_df.shape}")
#                 print(f"   Result DataFrame columns: {list(result_df.columns)}")
                
#                 if 'faithfulness' in result_df.columns:
#                     score = result_df['faithfulness'].iloc[0]
#                     print(f"   Faithfulness score: {score}")
#                 else:
#                     print(f"   ❌ 'faithfulness' column not found")
                    
#             except Exception as score_error:
#                 print(f"   ❌ Error extracting score: {score_error}")
                
#         except Exception as ragas_error:
#             print(f"❌ RAGAS EVALUATION FAILED:")
#             print(f"   Error: {ragas_error}")
#             print(f"   Error type: {type(ragas_error)}")
            
#             # Print full traceback for debugging
#             import traceback
#             print(f"\n🔍 FULL TRACEBACK:")
#             traceback.print_exc()
        
#     else:
#         print("❌ No synthetic_df found - run RAGAS dataset generation first")
    
#     # Clean up
#     if 'RAGAS_DO_NOT_TRACK' in os.environ:
#         del os.environ['RAGAS_DO_NOT_TRACK']
        
# except ImportError as e:
#     print(f"❌ Import error: {e}")
# except Exception as e:
#     print(f"❌ Unexpected error: {e}")
#     import traceback
#     traceback.print_exc()

# print(f"\n🐛 ISOLATED DEBUG COMPLETE!")
# print(f"💡 This test shows exactly where the RAGAS error occurs")
# print(f"💡 If this fails, the issue is in RAGAS itself, not LangSmith integration")


🐛 ISOLATED RAGAS DEBUG - Testing with single example...
✅ RAGAS imports successful
🎯 Testing with SINGLE example outside LangSmith context

📋 TEST DATA:
   Question: What did Slant Magazine say about the movie Stay Cool and how does it reflect on the film's quality?
   Reference: Slant Magazine's critic Adam Keleman described Stay Cool as thoroughly a family affair but felt it w...

🔍 RETRIEVAL:
   Retrieved 5 docs, using first 2
   Context[0]: Movie: Stay Cool\nGenre: Comedy\nDirector: Michael Polish\nRating: PG-13\nRelease Date: nan\nCritic:...
   Context[1]: Movie: Stay Cool\nGenre: Comedy\nDirector: Michael Polish\nRating: PG-13\nRelease Date: nan\nCritic:...

🤖 RAG RESPONSE GENERATION:
   Generated response: Slant Magazine, through critic Adam Keleman, described "Stay Cool" as feeling rushed and merely an homage to films like "Pretty in Pink" or "Some Kind...

📊 RAGAS DATA STRUCTURE:
   user_input: <class 'str'> - 'What did Slant Magazine say about the movie Stay C...'
   response

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

✅ RAGAS evaluation completed successfully!
   Result type: <class 'ragas.dataset_schema.EvaluationResult'>
   Result DataFrame shape: (1, 5)
   Result DataFrame columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness']
   Faithfulness score: 0.7142857142857143

🐛 ISOLATED DEBUG COMPLETE!
💡 This test shows exactly where the RAGAS error occurs
💡 If this fails, the issue is in RAGAS itself, not LangSmith integration


In [None]:
# 🔧 RAGAS Metric Evaluation - WORKAROUND (Pre-compute scores outside LangSmith)

if USE_LANGSMITH:
    print("🔧 RAGAS WORKAROUND - Pre-computing scores outside LangSmith context...")
    print("=" * 80)
    
    try:
        from ragas import evaluate as ragas_evaluate, EvaluationDataset
        from ragas.metrics import (
            answer_relevancy,
            faithfulness,
            context_precision,
            context_recall
        )
        from langsmith.evaluation import evaluate
        from langsmith.schemas import Run, Example
        import os
        import pandas as pd
        
        # Disable RAGAS tracking
        os.environ['RAGAS_DO_NOT_TRACK'] = 'true'
        
        print("✅ RAGAS and LangSmith imports successful")
        print("🔧 WORKAROUND: Pre-computing RAGAS scores outside LangSmith context")
        print("💡 Issue: RAGAS tracing conflicts when running inside LangSmith evaluators")
        
        # Step 1: Pre-compute RAGAS scores for all questions with each retriever
        print(f"\n📊 Step 1: Pre-computing RAGAS scores for all dataset questions...")
        
        # Get all questions from the dataset
        if 'synthetic_df' in locals() and len(synthetic_df) > 0:
            questions = synthetic_df['user_input'].tolist()
            references = synthetic_df['reference'].tolist()
            print(f"   Found {len(questions)} questions in synthetic dataset")
        else:
            print("❌ No synthetic_df found - run RAGAS dataset generation first")
            questions = []
            references = []
        
        # Define retrievers to evaluate
        retrievers_for_ragas = [
            (naive_retriever, "Naive"),
            (bm25_retriever, "BM25"),
            (multi_query_retriever, "Multi-Query"),
            (parent_document_retriever, "Parent-Document"),
            (compression_retriever, "Contextual-Compression"),
            (ensemble_retriever, "Ensemble")
        ]
        
        # Store pre-computed scores
        precomputed_scores = {}
        
        # Pre-compute RAGAS scores for each retriever
        for retriever, retriever_name in retrievers_for_ragas:
            print(f"\n🔍 Pre-computing RAGAS scores for {retriever_name} retriever...")
            
            retriever_scores = {
                'faithfulness': [],
                'answer_relevancy': [],
                'context_precision': [],
                'context_recall': []
            }
            
            # Process each question
            for i, (question, reference) in enumerate(zip(questions, references)):
                print(f"   Processing question {i+1}/{len(questions)}...")
                
                try:
                    # Get contexts using current retriever
                    retrieved_docs = retriever.invoke(question)
                    contexts = [doc.page_content for doc in retrieved_docs[:2]]
                    
                    # Generate response using RAG
                    context_str = "\n\n".join(contexts)
                    rag_prompt = f"""Given the movie review context, answer the question based only on the context.

Context: {context_str}
Question: {question}"""
                    
                    response = chat_model.invoke(rag_prompt).content
                    
                    # Create RAGAS evaluation data
                    eval_data = [{
                        'user_input': question,
                        'response': response,
                        'retrieved_contexts': contexts,
                        'reference': reference
                    }]
                    
                    # Run RAGAS evaluation OUTSIDE LangSmith context
                    ragas_dataset = EvaluationDataset.from_list(eval_data)
                    
                    # Evaluate each metric separately to isolate any issues
                    metrics_to_eval = [
                        (faithfulness, 'faithfulness'),
                        (answer_relevancy, 'answer_relevancy'),
                        (context_precision, 'context_precision'),
                        (context_recall, 'context_recall')
                    ]
                    
                    for metric, metric_name in metrics_to_eval:
                        try:
                            result = ragas_evaluate(
                                dataset=ragas_dataset,
                                metrics=[metric],
                                llm=chat_model,
                                embeddings=embedding_model
                            )
                            
                            # Extract score
                            result_df = result.to_pandas()
                            if metric_name in result_df.columns:
                                score = float(result_df[metric_name].iloc[0])
                            else:
                                score = 0.0
                            
                            retriever_scores[metric_name].append(score)
                            
                        except Exception as metric_error:
                            print(f"     ⚠️ Error evaluating {metric_name}: {metric_error}")
                            retriever_scores[metric_name].append(0.0)
                    
                except Exception as question_error:
                    print(f"     ❌ Error processing question {i+1}: {question_error}")
                    # Add 0.0 for all metrics if question fails
                    for metric_name in retriever_scores:
                        retriever_scores[metric_name].append(0.0)
            
            # Store average scores for this retriever
            avg_scores = {}
            for metric_name, scores in retriever_scores.items():
                avg_scores[metric_name] = sum(scores) / len(scores) if scores else 0.0
            
            precomputed_scores[retriever_name] = avg_scores
            print(f"   ✅ {retriever_name} average scores: {avg_scores}")
        
        # Step 2: Create simple LangSmith evaluators that return pre-computed scores
        print(f"\n📊 Step 2: Creating LangSmith evaluators with pre-computed scores...")

        def make_metric_evaluator(metric_key: str, retriever_name: str):
            """Return a LangSmith evaluator that always looks up the
            already-computed `metric_key` for `retriever_name`."""
            def _evaluator(run: Run, example: Example) -> dict:
                return {
                    "key": metric_key,
                    "score": precomputed_scores[retriever_name][metric_key]
                }
            return _evaluator

        
        def create_precomputed_evaluator(metric_name):
            """Create evaluator that returns pre-computed RAGAS scores"""
            def precomputed_evaluator(run: Run, example: Example) -> dict:
                # Determine which retriever is being used (from metadata or experiment name)
                current_retriever_name = "Naive"  # Default fallback
                
                # Try to get retriever name from run metadata
                if hasattr(run, 'extra') and run.extra:
                    current_retriever_name = run.extra.get('retriever_type', 'Naive')
                
                # Get pre-computed score for this retriever and metric
                if current_retriever_name in precomputed_scores:
                    score = precomputed_scores[current_retriever_name].get(metric_name.lower().replace(' ', '_'), 0.0)
                else:
                    score = 0.0
                
                return {
                    "key": metric_name.lower().replace(' ', '_'),
                    "score": score
                }
            
            return precomputed_evaluator
        
        # Create pre-computed evaluators
        faithfulness_evaluator = create_precomputed_evaluator("Faithfulness")
        answer_relevancy_evaluator = create_precomputed_evaluator("Answer_Relevancy")
        context_precision_evaluator = create_precomputed_evaluator("Context_Precision")
        context_recall_evaluator = create_precomputed_evaluator("Context_Recall")
        
        print(f"✅ Pre-computed evaluators created")
        
        # Step 3: Run LangSmith evaluation with pre-computed scores
        print(f"\n📊 Step 3: Running LangSmith evaluation with pre-computed RAGAS scores...")
        
        # Global variable to track current retriever
        current_retriever_for_ragas = None
        
        # Evaluate each retriever with pre-computed RAGAS scores
        for retriever, name in retrievers_for_ragas:
            print(f"\n🔍 Running LangSmith evaluation for {name} retriever (with pre-computed RAGAS)...")
            
            try:
                # Set current retriever
                current_retriever_for_ragas = retriever
                
                # Create RAG wrapper
                def precomputed_rag_wrapper(inputs):
                    """Simple RAG wrapper for use with pre-computed scores"""
                    question = inputs["question"]
                    
                    try:
                        retrieved_docs = current_retriever_for_ragas.invoke(question)
                        contexts = [doc.page_content for doc in retrieved_docs[:2]]
                        context_str = "\n\n".join(contexts)
                        
                        rag_prompt = f"""Given the movie review context, answer the question based only on the context.

Context: {context_str}
Question: {question}"""
                        
                        response = chat_model.invoke(rag_prompt).content
                        return {"output": response}
                        
                    except Exception as e:
                        return {"output": f"RAG error: {str(e)}"}
                
                # Run LangSmith evaluation with pre-computed RAGAS metrics
                experiment_results = evaluate(
                    precomputed_rag_wrapper,
                    data=LANGSMITH_DATASET_NAME,
                    evaluators=[
                        faithfulness_evaluator,
                        answer_relevancy_evaluator,
                        context_precision_evaluator,
                        context_recall_evaluator
                    ],
                    metadata={
                        "retriever_type": name,
                        "evaluation_run": "ragas_metrics_precomputed",
                        "evaluators": "faithfulness_answer_relevancy_context_precision_context_recall",
                        "domain": "movie_reviews",
                        "evaluation_mode": "ragas_precomputed_scores",
                        "framework": "ragas_langsmith_workaround"
                    },
                    experiment_prefix=f"ragas_precomputed_{name.lower().replace(' ', '_').replace('-', '_')}"
                )
                
                print(f"✅ {name} LangSmith evaluation completed (with pre-computed RAGAS scores)")
                
                # Rate limiting delay
                print(f"⏱️ Pausing before next retriever...")
                time.sleep(3)
                
            except Exception as e:
                print(f"❌ {name} evaluation failed: {e}")
                continue
        
        # Display summary of pre-computed scores
        print(f"\n📊 SUMMARY: Pre-computed RAGAS Scores by Retriever")
        print("=" * 80)
        summary_df = pd.DataFrame(precomputed_scores).T
        print(summary_df.round(3))
        
        print(f"\n🎯 All evaluations completed!")
        print(f"✅ WORKAROUND: RAGAS tracing conflicts resolved by pre-computing scores")
        print(f"📊 Check LangSmith dashboard for detailed comparisons!")
        print(f"🔬 Each retriever evaluated with pre-computed RAGAS metrics")
        print(f"🌐 LangSmith Project: {project_name}")
        print(f"📋 Dataset: {LANGSMITH_DATASET_NAME}")
        
        # Clean up
        current_retriever_for_ragas = None
        if 'RAGAS_DO_NOT_TRACK' in os.environ:
            del os.environ['RAGAS_DO_NOT_TRACK']
        
    except ImportError as e:
        print(f"❌ RAGAS or LangSmith import error: {e}")
        
    except Exception as e:
        print(f"❌ Pre-computed RAGAS evaluation failed: {e}")
        import traceback
        traceback.print_exc()

else:
    print("⚠️ Skipping RAGAS evaluation (LangSmith not configured)")

print(f"\n🔧 RAGAS-LangSmith Integration Complete (WORKAROUND VERSION)!")
print("🔬 Pre-computed approach avoids tracing conflicts completely")
print("📊 Scientific retriever comparison available in LangSmith dashboard")


🔧 RAGAS WORKAROUND - Pre-computing scores outside LangSmith context...
✅ RAGAS and LangSmith imports successful
🔧 WORKAROUND: Pre-computing RAGAS scores outside LangSmith context
💡 Issue: RAGAS tracing conflicts when running inside LangSmith evaluators

📊 Step 1: Pre-computing RAGAS scores for all dataset questions...
   Found 12 questions in synthetic dataset

🔍 Pre-computing RAGAS scores for Naive retriever...
   Processing question 1/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 2/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 3/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 4/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 5/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 6/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 7/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 8/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 9/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 10/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 11/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 12/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Naive average scores: {'faithfulness': 0.9090909090909092, 'answer_relevancy': 0.6541499491567708, 'context_precision': 0.6249999999458333, 'context_recall': 0.5416666666666666}

🔍 Pre-computing RAGAS scores for BM25 retriever...
   Processing question 1/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 2/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 3/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 4/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 5/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 6/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 7/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 8/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 9/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 10/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 11/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 12/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ BM25 average scores: {'faithfulness': 0.876388888888889, 'answer_relevancy': 0.5621569131638228, 'context_precision': 0.4999999999625, 'context_recall': 0.4166666666666667}

🔍 Pre-computing RAGAS scores for Multi-Query retriever...
   Processing question 1/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 2/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 3/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 4/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 5/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 6/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 7/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 8/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 9/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 10/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 11/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 12/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Multi-Query average scores: {'faithfulness': 0.9041666666666667, 'answer_relevancy': 0.6479321683277005, 'context_precision': 0.5833333332833334, 'context_recall': 0.47222222222222227}

🔍 Pre-computing RAGAS scores for Parent-Document retriever...
   Processing question 1/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 2/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 3/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 4/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 5/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 6/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 7/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 8/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 9/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 10/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 11/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 12/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Parent-Document average scores: {'faithfulness': 0.8954156954156954, 'answer_relevancy': 0.7214588575192243, 'context_precision': 0.6666666666083333, 'context_recall': 0.548611111111111}

🔍 Pre-computing RAGAS scores for Contextual-Compression retriever...
   Processing question 1/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 2/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 3/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 4/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 5/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 6/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 7/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 8/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 9/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 10/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 11/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 12/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Contextual-Compression average scores: {'faithfulness': 0.8898148148148147, 'answer_relevancy': 0.7228061368183281, 'context_precision': 0.7499999999375001, 'context_recall': 0.576388888888889}

🔍 Pre-computing RAGAS scores for Ensemble retriever...
   Processing question 1/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 2/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 3/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 4/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 5/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 6/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 7/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 8/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 9/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 10/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 11/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   Processing question 12/12...


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Ensemble average scores: {'faithfulness': 0.7811507936507937, 'answer_relevancy': 0.7235935305423552, 'context_precision': 0.5833333332833334, 'context_recall': 0.5416666666666666}

📊 Step 2: Creating LangSmith evaluators with pre-computed scores...
✅ Pre-computed evaluators created

📊 Step 3: Running LangSmith evaluation with pre-computed RAGAS scores...

🔍 Running LangSmith evaluation for Naive retriever (with pre-computed RAGAS)...
View the evaluation results for experiment: 'ragas_precomputed_naive-f8b4f1e5' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=df1d9682-d5cc-4aca-af79-f2d71f14512c




0it [00:00, ?it/s]

✅ Naive LangSmith evaluation completed (with pre-computed RAGAS scores)
⏱️ Pausing before next retriever...

🔍 Running LangSmith evaluation for BM25 retriever (with pre-computed RAGAS)...
View the evaluation results for experiment: 'ragas_precomputed_bm25-c3c9c984' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=b132156c-9891-4098-bfda-bc1ed4f305b8




0it [00:00, ?it/s]

✅ BM25 LangSmith evaluation completed (with pre-computed RAGAS scores)
⏱️ Pausing before next retriever...

🔍 Running LangSmith evaluation for Multi-Query retriever (with pre-computed RAGAS)...
View the evaluation results for experiment: 'ragas_precomputed_multi_query-98e4f4c3' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=b9446cb8-e658-47a7-815b-5fe331a210ea




0it [00:00, ?it/s]

✅ Multi-Query LangSmith evaluation completed (with pre-computed RAGAS scores)
⏱️ Pausing before next retriever...

🔍 Running LangSmith evaluation for Parent-Document retriever (with pre-computed RAGAS)...
View the evaluation results for experiment: 'ragas_precomputed_parent_document-482d830d' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=d3bce4c0-15b3-4e38-b9da-77967a3fbd88




0it [00:00, ?it/s]

✅ Parent-Document LangSmith evaluation completed (with pre-computed RAGAS scores)
⏱️ Pausing before next retriever...

🔍 Running LangSmith evaluation for Contextual-Compression retriever (with pre-computed RAGAS)...
View the evaluation results for experiment: 'ragas_precomputed_contextual_compression-8e076536' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=0650d6a0-1afc-4095-a7bf-78226c205bda




0it [00:00, ?it/s]

✅ Contextual-Compression LangSmith evaluation completed (with pre-computed RAGAS scores)
⏱️ Pausing before next retriever...

🔍 Running LangSmith evaluation for Ensemble retriever (with pre-computed RAGAS)...
View the evaluation results for experiment: 'ragas_precomputed_ensemble-6cf337da' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/571b3b25-acd2-44c8-8012-6015f838d6c9/compare?selectedSessions=57988428-7430-4a47-ae8b-6bfb63cd33fa




0it [00:00, ?it/s]

✅ Ensemble LangSmith evaluation completed (with pre-computed RAGAS scores)
⏱️ Pausing before next retriever...

📊 SUMMARY: Pre-computed RAGAS Scores by Retriever
                        faithfulness  answer_relevancy  context_precision  \
Naive                          0.909             0.654              0.625   
BM25                           0.876             0.562              0.500   
Multi-Query                    0.904             0.648              0.583   
Parent-Document                0.895             0.721              0.667   
Contextual-Compression         0.890             0.723              0.750   
Ensemble                       0.781             0.724              0.583   

                        context_recall  
Naive                            0.542  
BM25                             0.417  
Multi-Query                      0.472  
Parent-Document                  0.549  
Contextual-Compression           0.576  
Ensemble                         0.542  

🎯 All eval

: 

In [34]:
# ===============================================================
# 🔧 RAGAS Metric Evaluation – WORKAROUND (Pre-compute scores
#    outside LangSmith, then upload the correct values 1-to-1)
# ===============================================================

if USE_LANGSMITH:
    print("🔧 RAGAS WORKAROUND – Pre-computing scores outside LangSmith context…")
    print("=" * 80)

    # -----------------------------------------------------------
    # Imports & setup
    # -----------------------------------------------------------
    import os, time, pandas as pd
    from ragas import evaluate as ragas_evaluate, EvaluationDataset
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    from langsmith.evaluation import evaluate
    from langsmith.schemas import Run, Example

    # Disable RAGAS anonymous telemetry (optional)
    os.environ["RAGAS_DO_NOT_TRACK"] = "true"

    print("✅ RAGAS and LangSmith imports successful")

    # -----------------------------------------------------------
    # Step 1 ▸ Pre-compute RAGAS scores for every retriever
    # -----------------------------------------------------------
    if "synthetic_df" in locals() and len(synthetic_df) > 0:
        questions   = synthetic_df["user_input"].tolist()
        references  = synthetic_df["reference"].tolist()
        print(f"📊 Found {len(questions)} questions in synthetic dataset")
    else:
        raise RuntimeError("❌ No synthetic_df found – run RAGAS dataset generation first")

    retrievers_for_ragas = [
        (naive_retriever,          "Naive"),
        (bm25_retriever,           "BM25"),
        (multi_query_retriever,    "Multi-Query"),
        (parent_document_retriever,"Parent-Document"),
        (compression_retriever,    "Contextual-Compression"),
        (ensemble_retriever,       "Ensemble"),
    ]

    precomputed_scores: dict[str, dict[str, float]] = {}

    for retriever, retriever_name in retrievers_for_ragas:
        print(f"\n🔍 Pre-computing scores for «{retriever_name}» retriever …")

        metrics_buf = {
            "faithfulness":       [],
            "answer_relevancy":   [],
            "context_precision":  [],
            "context_recall":     [],
        }

        for i, (q, ref) in enumerate(zip(questions, references), start=1):
            print(f"   • Question {i}/{len(questions)}")

            try:
                # Retrieve context
                docs      = retriever.invoke(q)
                contexts  = [d.page_content for d in docs[:2]]
                ctx_str   = "\n\n".join(contexts)

                # Generate answer with your chat_model
                prompt = (
                    "Given the movie review context, answer the question "
                    "based only on the context.\n\n"
                    f"Context: {ctx_str}\nQuestion: {q}"
                )
                answer = chat_model.invoke(prompt).content

                # Build a temporary EvaluationDataset with one row
                data_row = {
                    "user_input":          q,
                    "response":            answer,
                    "retrieved_contexts":  contexts,
                    "reference":           ref,
                }
                dataset = EvaluationDataset.from_list([data_row])

                # Evaluate each metric separately
                for metric, key in [
                    (faithfulness,      "faithfulness"),
                    (answer_relevancy,  "answer_relevancy"),
                    (context_precision, "context_precision"),
                    (context_recall,    "context_recall"),
                ]:
                    try:
                        res_df = ragas_evaluate(
                            dataset=dataset,
                            metrics=[metric],
                            llm=chat_model,
                            embeddings=embedding_model,
                        ).to_pandas()
                        score = float(res_df[key].iloc[0])
                    except Exception as metric_err:
                        print(f"     ⚠️  {key} failed: {metric_err}")
                        score = 0.0

                    metrics_buf[key].append(score)

            except Exception as q_err:
                print(f"     ❌ Error on question {i}: {q_err}")
                # Push zeros so indices align
                for key in metrics_buf:
                    metrics_buf[key].append(0.0)

        # Average per-retriever
        precomputed_scores[retriever_name] = {
            k: (sum(v) / len(v) if v else 0.0) for k, v in metrics_buf.items()
        }
        print(f"   ✅ Avg scores for «{retriever_name}»: {precomputed_scores[retriever_name]}")

    # -----------------------------------------------------------
    # Step 2 ▸ Build evaluator factories that close over retriever
    # -----------------------------------------------------------
    def make_metric_evaluator(metric_key: str, retriever_name: str):
        """Return evaluator that always serves the matching pre-computed score."""
        def _eval(_: Run, __: Example) -> dict:
            return {
                "key":   metric_key,
                "score": precomputed_scores[retriever_name][metric_key],
            }
        return _eval

    # -----------------------------------------------------------
    # Step 3 ▸ Run LangSmith evaluation with correct evaluators
    # -----------------------------------------------------------
    for retriever, name in retrievers_for_ragas:
        print(f"\n🔍 Running LangSmith evaluation for «{name}» …")

        # 3-A  Small RAG wrapper that uses *this* retriever
        def rag_fn(inputs, _retriever=retriever):
            q      = inputs["question"]
            docs   = _retriever.invoke(q)
            ctx    = "\n\n".join(d.page_content for d in docs[:2])
            prompt = (
                "Given the movie review context, answer the question "
                "based only on the context.\n\n"
                f"Context: {ctx}\nQuestion: {q}"
            )
            return {"output": chat_model.invoke(prompt).content}

        # 3-B  Evaluators for just this retriever
        evaluator_list = [
            make_metric_evaluator("faithfulness",       name),
            make_metric_evaluator("answer_relevancy",   name),
            make_metric_evaluator("context_precision",  name),
            make_metric_evaluator("context_recall",     name),
        ]

        # 3-C  Send to LangSmith
        try:
            evaluate(
                rag_fn,
                data=LANGSMITH_DATASET_NAME,
                evaluators=evaluator_list,
                metadata={
                    "retriever_type": name,
                    "evaluation_run": "ragas_metrics_precomputed",
                    "domain": "movie_reviews",
                    "framework": "ragas_langsmith_workaround",
                },
                experiment_prefix=f"ragas_precomputed_{name.lower().replace(' ', '_')}",
            )
            print(f"✅ «{name}» evaluation complete")
            time.sleep(3)  # polite rate-limit
        except Exception as exc:
            print(f"❌ «{name}» evaluation failed: {exc}")

    # -----------------------------------------------------------
    # EXTRA ▸ Store all scores locally for later analysis
    # -----------------------------------------------------------
    scores_df = (
        pd.DataFrame(precomputed_scores)  # metrics as columns
          .T                              # rows = retrievers
          .reset_index()
          .rename(columns={"index": "retriever"})
    )
    print("\n📄 Local RAGAS score summary:")
    print(scores_df.round(3))

    # Persist to disk for future notebooks / reports
    scores_df.to_csv("precomputed_ragas_scores.csv", index=False)
    print("💾 Scores saved → precomputed_ragas_scores.csv")

    # Clean-up env var
    os.environ.pop("RAGAS_DO_NOT_TRACK", None)

else:
    print("⚠️  Skipping RAGAS evaluation (LangSmith not configured)")

print("\n🔧 RAGAS-LangSmith integration finished (WORKAROUND version)")


🔧 RAGAS WORKAROUND – Pre-computing scores outside LangSmith context…
✅ RAGAS and LangSmith imports successful
📊 Found 12 questions in synthetic dataset

🔍 Pre-computing scores for «Naive» retriever …
   • Question 1/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 2/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 3/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 4/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 5/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 6/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 7/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 8/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 9/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 10/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 11/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 12/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Avg scores for «Naive»: {'faithfulness': 0.7786172161172161, 'answer_relevancy': 0.7946670977353238, 'context_precision': 0.8333333332541667, 'context_recall': 0.8166666666666668}

🔍 Pre-computing scores for «BM25» retriever …
   • Question 1/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 2/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 3/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 4/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 5/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 6/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 7/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 8/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 9/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 10/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 11/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 12/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Avg scores for «BM25»: {'faithfulness': 0.8208573833573833, 'answer_relevancy': 0.484477993961654, 'context_precision': 0.7083333332624999, 'context_recall': 0.6944444444444445}

🔍 Pre-computing scores for «Multi-Query» retriever …
   • Question 1/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 2/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Exception raised in Job[0]: TimeoutError()


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 3/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 4/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 5/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 6/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 7/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 8/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 9/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 10/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 11/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 12/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Avg scores for «Multi-Query»: {'faithfulness': 0.8390151515151515, 'answer_relevancy': 0.7226773462384805, 'context_precision': nan, 'context_recall': 0.8333333333333334}

🔍 Pre-computing scores for «Parent-Document» retriever …
   • Question 1/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 2/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 3/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 4/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 5/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 6/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 7/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 8/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 9/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 10/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 11/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 12/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Avg scores for «Parent-Document»: {'faithfulness': 0.8305555555555556, 'answer_relevancy': 0.765725307251287, 'context_precision': 0.916666666575, 'context_recall': 0.8888888888888888}

🔍 Pre-computing scores for «Contextual-Compression» retriever …
   • Question 1/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 2/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 3/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 4/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 5/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 6/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 7/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 8/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 9/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 10/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 11/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 12/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Avg scores for «Contextual-Compression»: {'faithfulness': 0.8301282051282052, 'answer_relevancy': 0.7892890209138014, 'context_precision': 0.7083333332624999, 'context_recall': 0.8333333333333334}

🔍 Pre-computing scores for «Ensemble» retriever …
   • Question 1/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 2/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 3/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 4/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 5/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 6/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 7/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 8/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 9/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 10/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 11/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   • Question 12/12


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Avg scores for «Ensemble»: {'faithfulness': 0.8598484848484849, 'answer_relevancy': 0.7112428931916175, 'context_precision': 0.8333333332541667, 'context_recall': 0.7611111111111111}

🔍 Running LangSmith evaluation for «Naive» …
View the evaluation results for experiment: 'ragas_precomputed_naive-ba48e48b' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/555a2c84-0480-4b03-8f05-7b3a60c05722/compare?selectedSessions=5e5c5b32-dc55-4f49-a0dc-357d508f84b2




0it [00:00, ?it/s]

✅ «Naive» evaluation complete

🔍 Running LangSmith evaluation for «BM25» …
View the evaluation results for experiment: 'ragas_precomputed_bm25-51d75229' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/555a2c84-0480-4b03-8f05-7b3a60c05722/compare?selectedSessions=d228b687-425c-42cf-a7f8-2ecb2302c664




0it [00:00, ?it/s]

✅ «BM25» evaluation complete

🔍 Running LangSmith evaluation for «Multi-Query» …
View the evaluation results for experiment: 'ragas_precomputed_multi-query-91320fa5' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/555a2c84-0480-4b03-8f05-7b3a60c05722/compare?selectedSessions=b229517e-7c45-419a-b06a-1b8f0597cb87




0it [00:00, ?it/s]

✅ «Multi-Query» evaluation complete

🔍 Running LangSmith evaluation for «Parent-Document» …
View the evaluation results for experiment: 'ragas_precomputed_parent-document-32100124' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/555a2c84-0480-4b03-8f05-7b3a60c05722/compare?selectedSessions=8ab986db-d4da-49e5-8879-af50928dcbf0




0it [00:00, ?it/s]

✅ «Parent-Document» evaluation complete

🔍 Running LangSmith evaluation for «Contextual-Compression» …
View the evaluation results for experiment: 'ragas_precomputed_contextual-compression-3acfb502' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/555a2c84-0480-4b03-8f05-7b3a60c05722/compare?selectedSessions=999c5006-6287-41b1-965a-b65fd39825bc




0it [00:00, ?it/s]

✅ «Contextual-Compression» evaluation complete

🔍 Running LangSmith evaluation for «Ensemble» …
View the evaluation results for experiment: 'ragas_precomputed_ensemble-dd89dd72' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/555a2c84-0480-4b03-8f05-7b3a60c05722/compare?selectedSessions=20c919a6-ad40-43e8-a3ae-8be2c9d32c05




0it [00:00, ?it/s]

Error running target function: headers: {'access-control-expose-headers': 'X-Debug-Trace-ID', 'cache-control': 'no-cache, no-store, no-transform, must-revalidate, private, max-age=0', 'content-type': 'application/json', 'expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'pragma': 'no-cache', 'vary': 'Origin', 'x-accel-expires': '0', 'x-debug-trace-id': '26cf5bfe2d7ff00cf0f5c06c4bde8127', 'date': 'Tue, 05 Aug 2025 16:38:18 GMT', 'content-length': '372', 'x-envoy-upstream-service-time': '2', 'server': 'envoy', 'via': '1.1 google', 'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'}, status_code: 429, body: {'id': '9c96dffd-cc3d-4c33-b4fa-3e4026078118', 'message': "You are using a Trial key, which is limited to 10 API calls / minute. You can continue to use the Trial key for free or upgrade to a Production key with higher rate limits at 'https://dashboard.cohere.com/api-keys'. Contact us on 'https://discord.gg/XW44jPfYJu' or email us at support@cohere.com with any questions"}
Traceback 

✅ «Ensemble» evaluation complete

📄 Local RAGAS score summary:
                retriever  faithfulness  answer_relevancy  context_precision  \
0                   Naive         0.779             0.795              0.833   
1                    BM25         0.821             0.484              0.708   
2             Multi-Query         0.839             0.723                NaN   
3         Parent-Document         0.831             0.766              0.917   
4  Contextual-Compression         0.830             0.789              0.708   
5                Ensemble         0.860             0.711              0.833   

   context_recall  
0           0.817  
1           0.694  
2           0.833  
3           0.889  
4           0.833  
5           0.761  
💾 Scores saved → precomputed_ragas_scores.csv

🔧 RAGAS-LangSmith integration finished (WORKAROUND version)
