# React Agent Evaluation with DeepEval

This notebook demonstrates how to evaluate a ReAct-style agent with the DeepEval framework. The agent comes from the langchain-agent example: it is built with LangGraph's `create_react_agent` and answers questions about React and web development from an Azure AI Search index. We evaluate it with the following metrics:

- Answer Relevancy
- Faithfulness
- Contextual Relevancy
- Citation Accuracy

The evaluation runs over a set of questions loaded from a CSV file, and the results are saved to separate CSV and Excel files for analysis.

In [1]:
# Install required packages
%pip install deepeval langchain langchain-community langchain-openai langgraph azure-search-documents pandas openpyxl python-dotenv

In [2]:
# Import required libraries
import os
import pandas as pd
import json
from typing import List, Dict, Any
from dotenv import load_dotenv

# DeepEval imports
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    GEval
)
from deepeval.test_case import LLMTestCase

# LangChain imports
from langchain_openai import AzureChatOpenAI
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import QueryType

# Load environment variables
load_dotenv()

True

## Setup Azure OpenAI and Search Clients

In [None]:
# Initialize Azure OpenAI client
llm = AzureChatOpenAI(
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("OPENAI_API_VERSION"),
)

# Initialize Azure Search client
search_client = SearchIndexClient(
    endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"), 
    credential=AzureKeyCredential(key=os.getenv("AZURE_AI_SEARCH_API_KEY"))
)
index_details = search_client.get_index(os.getenv("AZURE_AI_SEARCH_INDEX_NAME"))

print("✅ Azure clients initialized successfully")

## Define Search Tool and React Agent

In [None]:
@tool
def search_index(query: str) -> List[Dict[str, Any]]:
    """
    This tool searches the Azure Search index for the given query.

    Args:
        query (str): The query to search for.

    Returns:
        List[Dict]: The search results with filename, content, score, and metadata.
    """
    client = search_client.get_search_client(index_name=os.getenv("AZURE_AI_SEARCH_INDEX_NAME"))
    search_fields = [field.name for field in index_details.fields]
    search_results = client.search(
        search_text=query, 
        query_type=QueryType.SIMPLE,
        top=10,
        select=search_fields,
    )
    documents = [{
        "filename": i.get('title', ''),
        "content": i.get('chunk', ''),
        "score": i.get('@search.score', 0),
        "metadata": i.get('metadata', {})
    } for i in search_results]
    return documents

# Create React agent
memory = MemorySaver()
tools = [search_index]
prompt = """You are a helpful assistant specializing in React and web development. 
You should use the search_index tool to search for relevant information in the knowledge base 
to answer user questions accurately. Always provide citations from the search results when available."""

agent_executor = create_react_agent(llm, tools, checkpointer=memory, version="v2", prompt=prompt)

print("✅ React agent created successfully")

## Helper Functions

In [None]:
def query_agent(question: str, thread_id: str = "eval_thread") -> tuple[str, List[str]]:
    """
    Query the React agent and extract answer and citations.
    
    Args:
        question: The question to ask the agent
        thread_id: Thread ID for conversation memory
        
    Returns:
        Tuple of (answer, citations)
    """
    config = {"configurable": {"thread_id": thread_id}}
    
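    # Stream the agent's response. With stream_mode="messages", each item is a
    # (message_chunk, metadata) tuple. Tool outputs are streamed as well, so keep
    # only the chunks emitted by the model node ("agent").
    response_chunks = []
    for message_chunk, metadata in agent_executor.stream(
        {"messages": [("user", question)]}, stream_mode="messages", config=config
    ):
        if metadata.get("langgraph_node") == "agent" and isinstance(message_chunk.content, str):
            response_chunks.append(message_chunk.content)
    
    # Token chunks carry their own whitespace, so join them without a separator
    answer = ''.join(response_chunks).strip()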
    
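    # Extract citations (placeholder heuristic: any answer that mentions React or a
    # handbook is attributed to React-handbook.pdf; the citations the agent actually
    # wrote are not parsed)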
    citations = []
    if 'React' in answer or 'handbook' in answer.lower():
        citations.append('React-handbook.pdf')
    
    return answer, citations

def extract_context_from_search(question: str) -> List[str]:
    """
    Extract context by directly calling the search tool.
    
    Args:
        question: The question to search for
        
    Returns:
        List of context strings
    """
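    # search_index is a LangChain tool, so call it through invoke() rather than directly
    search_results = search_index.invoke({"query": question})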
    contexts = [doc['content'] for doc in search_results if doc['content']]
    return contexts[:5]  # Limit to top 5 results

print("✅ Helper functions defined")
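
The citation step in `query_agent` is a keyword heuristic that can only ever return `React-handbook.pdf`. A minimal sketch of a less brittle alternative follows. It assumes the agent cites its sources by file name inside the answer text, which the notebook does not guarantee. `extract_citations` and `CITATION_PATTERN` are hypothetical names, and the list of extensions is illustrative.

```python
import re
from typing import List

# Assumption: citations appear in the answer as file names ending in one of
# these extensions (for example "React-handbook.pdf").
CITATION_PATTERN = re.compile(r"[\w\-.]+\.(?:pdf|md|docx|txt)\b", re.IGNORECASE)

def extract_citations(answer: str) -> List[str]:
    """Return the unique file names mentioned in the answer, in order of appearance."""
    seen = set()
    citations = []
    for match in CITATION_PATTERN.findall(answer):
        key = match.lower()  # treat differently-cased repeats as the same source
        if key not in seen:
            seen.add(key)
            citations.append(match)
    return citations
```

With a helper like this, `query_agent` could return `extract_citations(answer)` instead of the fixed file name, so the citation columns would reflect what the agent actually wrote.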

## Initialize DeepEval Metrics

In [None]:
# Initialize evaluation metrics
answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

# Custom citation accuracy metric using G-Eval.
# GEval expects LLMTestCaseParams members in evaluation_params, not free-form strings.
# To judge citations with this metric, the test case passed to it needs the expected
# citation in its expected_output field.
from deepeval.test_case import LLMTestCaseParams

citation_accuracy_metric = GEval(
    name="Citation Accuracy",
    criteria="Determine if the actual answer includes proper citations that match the expected citations. Check if the sources referenced are accurate and relevant.",
    evaluation_steps=[
        "Check if the actual answer contains citations",
        "Compare the citations in actual answer with expected citations", 
        "Evaluate if the citations are relevant to the content",
        "Assign a score from 0 to 1 based on citation accuracy"
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.7,
    model="gpt-4"
)

print("✅ DeepEval metrics initialized")

## Load Evaluation Questions

In [None]:
# Load evaluation questions from CSV
df = pd.read_csv('evaluation_questions.csv')
print(f"Loaded {len(df)} evaluation questions")
print("\nSample questions:")
for i, row in df.head(3).iterrows():
    print(f"Q{i+1}: {row['question']}")
    print(f"Expected: {row['expected_answer'][:100]}...\n")
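
The evaluation loop reads the `question`, `expected_answer`, and `expected_citation` columns, so a malformed file only fails once the agent is already running. A pre-flight check along the following lines can catch that earlier. This is a sketch using the standard `csv` module, and `check_question_file` is a hypothetical helper, not part of the notebook.

```python
import csv
import io
from typing import List, TextIO

# Column names taken from the evaluation loop in this notebook.
REQUIRED_COLUMNS = ["question", "expected_answer", "expected_citation"]

def check_question_file(handle: TextIO) -> List[str]:
    """Return a list of problems found in the CSV; the list is empty if none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in REQUIRED_COLUMNS if col not in header]
    if problems:
        return problems
    # Data rows are numbered from 1, not counting the header
    for row_no, row in enumerate(reader, start=1):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"row {row_no}: empty {col}")
    return problems

# Example with an in-memory file; a real run would pass an open file handle.
sample = "question,expected_answer,expected_citation\nWhat is JSX?,,React-handbook.pdf\n"
print(check_question_file(io.StringIO(sample)))
```

For the real file, open `evaluation_questions.csv` with `newline=''` and pass the handle to the function before calling `pd.read_csv`.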

## Run Evaluation

In [None]:
# Run evaluation on all questions
results = []
test_cases = []

print("Starting evaluation...\n")

for idx, row in df.iterrows():
    question = row['question']
    expected_answer = row['expected_answer']
    expected_citation = row['expected_citation']
    
    print(f"Evaluating question {idx + 1}/{len(df)}: {question[:50]}...")
    
    try:
        # Get agent response
        actual_answer, actual_citations = query_agent(question, thread_id=f"eval_{idx}")
        
        # Get context for evaluation
        retrieval_context = extract_context_from_search(question)
        
        # Create test case
        test_case = LLMTestCase(
            input=question,
            actual_output=actual_answer,
            expected_output=expected_answer,
            retrieval_context=retrieval_context
        )
        
        # Store results
        result = {
            'question': question,
            'expected_answer': expected_answer,
            'expected_citation': expected_citation,
            'actual_answer': actual_answer,
            'actual_citation': '; '.join(actual_citations),
            'no_of_citation_correct': len([c for c in actual_citations if expected_citation in c])
        }
        
        # Append to both lists together so they stay index-aligned
        test_cases.append(test_case)
        results.append(result)
        
        print(f"✅ Question {idx + 1} completed")
        
    except Exception as e:
        print(f"❌ Error processing question {idx + 1}: {str(e)}")
        # Add placeholder entries to both lists to maintain alignment
        test_cases.append(None)
        result = {
            'question': question,
            'expected_answer': expected_answer,
            'expected_citation': expected_citation,
            'actual_answer': f"Error: {str(e)}",
            'actual_citation': '',
            'no_of_citation_correct': 0
        }
        results.append(result)

print(f"\n✅ Completed querying agent for {len(results)} questions")

## Evaluate with DeepEval Metrics

In [None]:
print("Running DeepEval metrics evaluation...\n")

# Evaluate each test case with metrics
for i, (test_case, result) in enumerate(zip(test_cases, results)):
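    if test_case is None or result['actual_answer'].startswith("Error:"):
        # Skip evaluation for error cases (a substring match would also skip
        # legitimate answers that merely mention "Error:")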
        result.update({
            'answer_relevancy_score': 0.0,
            'faithfulness_score': 0.0,
            'contextual_relevancy_score': 0.0,
            'citation_accuracy_score': 0.0,
            'correctness': 'Failed',
            'answer_score': 0.0
        })
        continue
    
    print(f"Evaluating metrics for question {i + 1}...")
    
    try:
        # Score each metric directly with measure() and read the score from the
        # metric object. The object returned by evaluate() has a different shape
        # in different DeepEval releases, so it is not used here.
        metrics_to_evaluate = [answer_relevancy_metric, faithfulness_metric, contextual_relevancy_metric]
        for metric in metrics_to_evaluate:
            metric.measure(test_case)
        
        # Extract scores
        answer_relevancy_score = answer_relevancy_metric.score or 0.0
        faithfulness_score = faithfulness_metric.score or 0.0
        contextual_relevancy_score = contextual_relevancy_metric.score or 0.0
        
        # Calculate overall answer score (average of metrics)
        answer_score = (answer_relevancy_score + faithfulness_score + contextual_relevancy_score) / 3
        
        # Determine correctness based on thresholds
        correctness = "Pass" if answer_score >= 0.7 else "Fail"
        
        # Simple citation accuracy check (substring match; the G-Eval citation
        # metric defined earlier is not applied here)
        citation_accuracy_score = 1.0 if result['expected_citation'] in result['actual_citation'] else 0.0
        
        # Update result with scores
        result.update({
            'answer_relevancy_score': round(answer_relevancy_score, 3),
            'faithfulness_score': round(faithfulness_score, 3),
            'contextual_relevancy_score': round(contextual_relevancy_score, 3),
            'citation_accuracy_score': round(citation_accuracy_score, 3),
            'correctness': correctness,
            'answer_score': round(answer_score, 3)
        })
        
        print(f"✅ Question {i + 1} evaluated - Score: {answer_score:.3f}")
        
    except Exception as e:
        print(f"❌ Error evaluating question {i + 1}: {str(e)}")
        result.update({
            'answer_relevancy_score': 0.0,
            'faithfulness_score': 0.0,
            'contextual_relevancy_score': 0.0,
            'citation_accuracy_score': 0.0,
            'correctness': 'Failed',
            'answer_score': 0.0
        })

print("\n✅ Completed evaluation with DeepEval metrics")
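
The scoring rule in the loop above averages the three metric scores and passes the question when the average reaches 0.7. Written as a standalone function, the rule looks like the sketch below. `overall_score` is a hypothetical helper, and the `require_all` option is an added assumption that shows a stricter variant in which every metric must meet the threshold on its own.

```python
from statistics import mean
from typing import Dict, Tuple

PASS_THRESHOLD = 0.7  # same cut-off the notebook applies to the averaged score

def overall_score(
    scores: Dict[str, float],
    threshold: float = PASS_THRESHOLD,
    require_all: bool = False,
) -> Tuple[float, str]:
    """Average the metric scores and label the result 'Pass' or 'Fail'."""
    value = mean(scores.values())
    passed = value >= threshold
    if require_all:
        # Stricter variant (an assumption, not the notebook's rule):
        # every individual metric must also reach the threshold.
        passed = passed and all(score >= threshold for score in scores.values())
    return round(value, 3), ("Pass" if passed else "Fail")

example = {"answer_relevancy": 1.0, "faithfulness": 1.0, "contextual_relevancy": 0.2}
print(overall_score(example))
print(overall_score(example, require_all=True))
```

The example illustrates a property of plain averaging: one weak metric can be masked by two strong ones, which may or may not be the behaviour you want from the `correctness` column.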

## Save Results and Analysis

In [None]:
# Create results DataFrame
results_df = pd.DataFrame(results)

# Reorder columns to match the requested format
column_order = [
    'question', 'expected_answer', 'expected_citation', 
    'actual_answer', 'actual_citation', 'correctness', 
    'answer_score', 'no_of_citation_correct',
    'answer_relevancy_score', 'faithfulness_score', 
    'contextual_relevancy_score', 'citation_accuracy_score'
]

# Add any missing columns
for col in column_order:
    if col not in results_df.columns:
        results_df[col] = ''

results_df = results_df[column_order]

# Save to CSV
output_file = 'react_agent_evaluation_results.csv'
results_df.to_csv(output_file, index=False)

# Also save to Excel for better formatting
excel_file = 'react_agent_evaluation_results.xlsx'
results_df.to_excel(excel_file, index=False, sheet_name='Evaluation Results')

print(f"✅ Results saved to {output_file} and {excel_file}")
print(f"\nEvaluation Summary:")
print(f"Total questions: {len(results_df)}")
print(f"Passed: {len(results_df[results_df['correctness'] == 'Pass'])}")
print(f"Failed: {len(results_df[results_df['correctness'] == 'Fail'])}")
print(f"Errors: {len(results_df[results_df['correctness'] == 'Failed'])}")
print(f"Average Answer Score: {results_df['answer_score'].mean():.3f}")
print(f"Average Citation Accuracy: {results_df['citation_accuracy_score'].mean():.3f}")

## Detailed Analysis

In [None]:
# Display detailed results
print("\n📊 Detailed Evaluation Results:")
print("=" * 80)

for idx, row in results_df.iterrows():
    print(f"\nQuestion {idx + 1}: {row['question']}")
    print(f"Expected Answer: {row['expected_answer'][:100]}...")
    print(f"Actual Answer: {row['actual_answer'][:100]}...")
    print(f"Expected Citation: {row['expected_citation']}")
    print(f"Actual Citation: {row['actual_citation']}")
    print(f"Correctness: {row['correctness']}")
    print(f"Answer Score: {row['answer_score']}")
    print(f"Citation Accuracy: {row['citation_accuracy_score']}")
    print("-" * 40)

# Show metric breakdown
print("\n📈 Metric Breakdown:")
print(f"Answer Relevancy: {results_df['answer_relevancy_score'].mean():.3f} ± {results_df['answer_relevancy_score'].std():.3f}")
print(f"Faithfulness: {results_df['faithfulness_score'].mean():.3f} ± {results_df['faithfulness_score'].std():.3f}")
print(f"Contextual Relevancy: {results_df['contextual_relevancy_score'].mean():.3f} ± {results_df['contextual_relevancy_score'].std():.3f}")
print(f"Citation Accuracy: {results_df['citation_accuracy_score'].mean():.3f} ± {results_df['citation_accuracy_score'].std():.3f}")
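
The means and standard deviations above include rows where the agent or a metric raised an error, because those rows are stored with scores of 0.0 and a `correctness` of `'Failed'`. A sketch of a summary that leaves those rows out is shown below. It works on the `results` list of dictionaries built earlier, and `summarize_metric` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import List, Mapping, Tuple

def summarize_metric(rows: List[Mapping[str, object]], key: str) -> Tuple[float, float, int]:
    """Mean, sample standard deviation, and count for one score column,
    ignoring rows whose correctness is 'Failed' (agent or metric errors)."""
    values = [float(row[key]) for row in rows if row.get("correctness") != "Failed"]
    if not values:
        return 0.0, 0.0, 0
    # stdev needs at least two data points
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread, len(values)

rows = [
    {"correctness": "Pass", "answer_score": 0.8},
    {"correctness": "Fail", "answer_score": 0.6},
    {"correctness": "Failed", "answer_score": 0.0},  # error row, excluded
]
avg, spread, count = summarize_metric(rows, "answer_score")
print(f"answer_score: {avg:.3f} ± {spread:.3f} over {count} rows")
```

`statistics.stdev` is the sample standard deviation, the same default that pandas uses for `Series.std()`, so the numbers remain comparable with the breakdown printed above.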

## Visualization (Optional)

In [None]:
# Optional: Create visualizations if matplotlib is available
try:
    import matplotlib.pyplot as plt
    
    # Set up the plotting style
    plt.style.use('default')
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('React Agent Evaluation Results', fontsize=16)
    
    # Metric scores distribution
    metrics = ['answer_relevancy_score', 'faithfulness_score', 'contextual_relevancy_score', 'citation_accuracy_score']
    metric_names = ['Answer Relevancy', 'Faithfulness', 'Contextual Relevancy', 'Citation Accuracy']
    
    for i, (metric, name) in enumerate(zip(metrics, metric_names)):
        ax = axes[i//2, i%2]
        ax.hist(results_df[metric], bins=10, alpha=0.7, edgecolor='black')
        ax.set_title(f'{name} Distribution')
        ax.set_xlabel('Score')
        ax.set_ylabel('Frequency')
        ax.axvline(results_df[metric].mean(), color='red', linestyle='--', label=f'Mean: {results_df[metric].mean():.3f}')
        ax.legend()
    
    plt.tight_layout()
    plt.savefig('evaluation_results_visualization.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Correctness pie chart
    plt.figure(figsize=(8, 6))
    correctness_counts = results_df['correctness'].value_counts()
    plt.pie(correctness_counts.values, labels=correctness_counts.index, autopct='%1.1f%%', startangle=90)
    plt.title('Evaluation Results Distribution')
    plt.savefig('correctness_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("✅ Visualizations saved as PNG files")
    
except ImportError:
    print("📊 Matplotlib not available. Install it for visualizations: pip install matplotlib seaborn")

## Conclusion

This notebook demonstrates a comprehensive evaluation approach for LLM agents using DeepEval. The evaluation covers:

1. **Answer Quality**: Using DeepEval's built-in metrics (Answer Relevancy, Faithfulness, Contextual Relevancy)
2. **Citation Accuracy**: A substring check of the agent's citations against the expected source; a G-Eval citation metric is also defined for LLM-based judging
3. **Systematic Testing**: Batch evaluation of multiple questions from a CSV file
4. **Results Export**: Saving detailed results to CSV and Excel files for further analysis

### Key Features:
- **Automated Evaluation**: No manual scoring required
- **Multiple Metrics**: Comprehensive assessment across different dimensions
- **Reproducible**: Can be re-run on the same question set; because the metrics are LLM-judged, scores may vary slightly between runs
- **Extensible**: Easy to add new questions or modify evaluation criteria

### Next Steps:
1. Analyze the results to identify areas for improvement
2. Tune the agent's prompt or search parameters based on evaluation findings
3. Add more diverse questions to the evaluation set
4. Implement automated evaluation as part of CI/CD pipeline
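
For the CI/CD step, one possible shape is a small gate that turns the saved results into an exit code. The sketch below is an assumption about how such a gate could look: `pass_rate` and `gate_exit_code` are hypothetical helpers, and the 0.8 minimum is an arbitrary example value.

```python
from typing import Iterable, Mapping

def pass_rate(rows: Iterable[Mapping[str, str]]) -> float:
    """Fraction of rows whose 'correctness' column equals 'Pass'."""
    rows = list(rows)
    if not rows:
        return 0.0
    passed = sum(1 for row in rows if row.get("correctness") == "Pass")
    return passed / len(rows)

def gate_exit_code(rows: Iterable[Mapping[str, str]], minimum: float = 0.8) -> int:
    """Return 0 when the pass rate meets the minimum, 1 otherwise."""
    return 0 if pass_rate(rows) >= minimum else 1

# In-memory example; rows labelled 'Failed' (errors) count against the pass rate.
example_rows = [
    {"correctness": "Pass"},
    {"correctness": "Pass"},
    {"correctness": "Fail"},
    {"correctness": "Failed"},
]
print(pass_rate(example_rows), gate_exit_code(example_rows))
```

In a pipeline job, the rows would come from `csv.DictReader` over `react_agent_evaluation_results.csv`, and the return value of `gate_exit_code` would be passed to `sys.exit` so that a low pass rate fails the build.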