# DeepEval for RAG and Agent Evaluations

This notebook demonstrates how to evaluate RAG systems and AI agents with DeepEval.

## 🎯 Evaluation Types

### 1. RAG System Evaluations
- **Answer Relevancy**: Measures whether answers address the question asked
- **Faithfulness**: Detects hallucinations by checking grounding in context
- **Contextual Relevancy**: Validates retrieved context quality
- **Correctness (G-Eval)**: Compares actual vs expected outputs

### 2. AI Agent Evaluations
- **Tool Correctness**: Validates correct tool selection
- **Tool Arguments**: Verifies proper tool invocation
- **Expected Output Comparison**: Compares agent outputs with expected results

## 📊 Data Sources
- **Vector Store**: Pinecone with real insurance policy documents
- **Database**: SQLite with sales data for agent testing
- **LLM**: Azure OpenAI (`gpt-4.1` deployment)

## 1. Setup and Installation

Install required packages for DeepEval, LangChain, and vector store integration.

In [None]:
!pip install -Uq deepeval langchain langchain-openai langgraph langchain-pinecone langchain-community

## 2. Environment Configuration

Configure Azure OpenAI, LangSmith tracing, and Pinecone credentials.

In [None]:
import os
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# Azure OpenAI Configuration
os.environ["OPENAI_API_VERSION"] = "2024-12-01-preview"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://ai-agents-sept-cohort-resource.cognitiveservices.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = ""

# LangSmith Tracing
os.environ['LANGSMITH_TRACING'] = 'true'
os.environ['LANGSMITH_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGSMITH_API_KEY'] = ''
os.environ['LANGSMITH_PROJECT'] = 'cohort-3-langgraph'

# Pinecone Configuration
os.environ["PINECONE_API_KEY"] = ""  # set your Pinecone API key here
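The cell above assigns credentials as string literals. A minimal sketch of one alternative follows, assuming the keys are already exported in the shell or can be typed interactively. The helper name `require_env` is hypothetical and is not part of LangChain, DeepEval, or Pinecone.

```python
import os
from getpass import getpass


def require_env(name: str, prompt_if_missing: bool = False) -> str:
    """Return a non-empty environment variable or raise.

    Hypothetical helper, not part of LangChain, DeepEval, or Pinecone.
    """
    value = os.environ.get(name, "").strip()
    if not value and prompt_if_missing:
        # getpass hides the typed value so it never shows up in cell output
        value = getpass(f"Enter {name}: ").strip()
        if value:
            os.environ[name] = value
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `require_env("PINECONE_API_KEY", prompt_if_missing=True)` could stand in for the literal assignment, which keeps the key out of the saved notebook.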

## 3. Initialize LLM and Embeddings

Create Azure OpenAI instances for chat and embeddings.

In [2]:
# Initialize LLM
llm = AzureChatOpenAI(
    deployment_name="gpt-4.1",
    temperature=0.7,
    top_p=0.8
)

# Initialize Embeddings
embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-3-large",
)

print("✓ LLM and Embeddings initialized successfully")

✓ LLM and Embeddings initialized successfully


---

# Part 1: RAG System Evaluation

We'll evaluate a RAG system using real chunks retrieved from a Pinecone vector store.

## 4. Connect to Pinecone and Fetch Real Chunks

Retrieve actual document chunks from the vector store for realistic testing.

In [3]:
from pinecone import Pinecone
from langchain_pinecone import PineconeVectorStore

# Connect to Pinecone
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index('policy-agenticrag')

# Create vector store
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

# Create retriever with score threshold
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.3},
)

# Define sample queries for testing
sample_queries = [
    "What is the age eligibility for term insurance?",
    "What are the benefits of group gratuity?",
    "Who should I contact for grievances?"
]

# Fetch and store chunks
fetched_chunks = {}
print("Fetching real chunks from Pinecone...\n")

for query in sample_queries:
    docs = retriever.invoke(query)
    if docs:
        fetched_chunks[query] = {
            'contexts': [doc.page_content for doc in docs],
            'metadata': [doc.metadata for doc in docs]
        }
        print(f"✓ Query: {query}")
        print(f"  Retrieved {len(docs)} chunks")
        print(f"  Sample: {docs[0].page_content[:150]}...\n")

print(f"\n✓ Successfully fetched chunks for {len(fetched_chunks)} queries")
pinecone_available = True


Fetching real chunks from Pinecone...

✓ Query: What is the age eligibility for term insurance?
  Retrieved 5 chunks
  Sample: |Eligibility Conditions|Col2|
|---|---|
|Minimum Entry Age|16 years for Employer Employee schemes<br>(18 years if rider is chosen)<br>18 years for Non...

✓ Query: What are the benefits of group gratuity?
  Retrieved 5 chunks
  Sample: **HDFC Life Group Gratuity Product offers**


- Range of Debt and Equity oriented funds to choose from

- Flexibility of paying premiums

- Control ov...

✓ Query: Who should I contact for grievances?
  Retrieved 5 chunks
  Sample: You may refer to the escalation matrix in case there is no response to a grievance within the
prescribed timelines

If you are still not satisfi ed wi...


✓ Successfully fetched chunks for 3 queries
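The retriever above is configured with `search_type="similarity_score_threshold"`, `k=5`, and `score_threshold=0.3`. The idea can be sketched in plain Python, assuming each hit is a `(text, score)` pair where a higher score means a closer match. The function name and the sample scores are invented for illustration, and this is not LangChain's implementation.

```python
from typing import List, Tuple


def filter_by_score(
    hits: List[Tuple[str, float]],
    k: int = 5,
    score_threshold: float = 0.3,
) -> List[Tuple[str, float]]:
    """Keep at most k hits whose score is at or above the threshold, best first."""
    kept = [hit for hit in hits if hit[1] >= score_threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:k]


# Invented scores, for illustration only
hits = [("chunk A", 0.82), ("chunk B", 0.25), ("chunk C", 0.41)]
top_hits = filter_by_score(hits)
print(top_hits)
```

A low threshold admits loosely related chunks; raising it trades recall for precision.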


## 5. Create RAG Query Function

Build a function that answers questions using the fetched chunks.

In [4]:
def rag_query_with_real_chunks(question):
    """RAG system using real chunks fetched from vector store"""
    
    # Find matching chunks
    relevant_contexts = []
    
    # Exact-match lookup only: a question that was not pre-fetched gets an empty context
    if question in fetched_chunks:
        relevant_contexts = fetched_chunks[question]['contexts']
    
    # Generate answer using LLM
    context = "\n\n".join(relevant_contexts)
    
    prompt = f"""You are an insurance policy assistant. Use the following context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information to answer this."

Context:
{context}

Question: {question}

Answer:"""
    
    response = llm.invoke(prompt)
    
    return {
        'result': response.content,
        'source_documents': relevant_contexts
    }

# Test the RAG function
test_query = "What is the age eligibility for term insurance?"
test_result = rag_query_with_real_chunks(test_query)
print(f"Question: {test_query}")
print(f"\nAnswer: {test_result['result']}")
print(f"\nNumber of source documents: {len(test_result['source_documents'])}")

Question: What is the age eligibility for term insurance?

Answer: **Answer:**

The age eligibility for HDFC Life Group Term Insurance is as follows:

- **Employer-Employee Schemes:**
  - **Minimum Entry Age:** 16 years (18 years if rider is chosen)
  - **Maximum Entry Age:** 79 years (64 years if rider is chosen)
  - **Minimum Maturity Age:** 17 years (19 years if rider is chosen)

- **Non Employer-Employee Schemes:**
  - **Minimum Entry Age:** 18 years
  - **Maximum Entry Age:** 79 years (64 years if rider is chosen)
  - **Minimum Maturity Age:** 19 years

- **Maximum Maturity Age (for all):** 80 years

**Note:**  
- Risk cover starts from the date of commencement of the policy for all lives, including minors. For minors, the policy vests on the Life Assured upon attaining age 18 years.

Number of source documents: 5
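`rag_query_with_real_chunks` looks the question up by exact string match, so a rephrased question would reach the LLM with an empty context. A minimal sketch of a more tolerant lookup using the standard library's `difflib` follows. The helper `find_contexts`, the `cutoff` value, and the stand-in data are assumptions for illustration; a production system would more likely call the retriever again.

```python
from difflib import get_close_matches
from typing import Dict, List


def find_contexts(question: str, chunks: Dict[str, dict], cutoff: float = 0.8) -> List[str]:
    """Return stored contexts for the closest stored query, or [] if none is close.

    Hypothetical helper; `chunks` is assumed to have the same shape as `fetched_chunks`.
    """
    # Compare case-insensitively, but remember the original keys
    normalized = {key.lower().strip(): key for key in chunks}
    matches = get_close_matches(
        question.lower().strip(), list(normalized), n=1, cutoff=cutoff
    )
    if not matches:
        return []
    return chunks[normalized[matches[0]]]["contexts"]


# Stand-in data with the same shape as fetched_chunks
demo_chunks = {
    "What is the age eligibility for term insurance?": {"contexts": ["placeholder context"]},
}
print(find_contexts("what is the age eligibility for term insurance", demo_chunks))
```

String similarity only catches near-identical wording; it does not recognise paraphrases, so the cutoff is deliberately strict.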


## 6. Configure DeepEval with Azure OpenAI

Create a custom DeepEval model wrapper to integrate with Azure OpenAI.

In [5]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models.base_model import DeepEvalBaseLLM

class AzureOpenAIModel(DeepEvalBaseLLM):
    """Custom DeepEval model wrapper for Azure OpenAI"""
    
    def __init__(self):
        self.model = llm
    
    def load_model(self):
        return self.model
    
    def generate(self, prompt: str) -> str:
        response = self.model.invoke(prompt)
        return response.content
    
    async def a_generate(self, prompt: str) -> str:
        response = await self.model.ainvoke(prompt)
        return response.content
    
    def get_model_name(self):
        return "gpt-4.1"

# Initialize DeepEval model
deepeval_model = AzureOpenAIModel()

print("✓ DeepEval configured successfully with Azure OpenAI")

✓ DeepEval configured successfully with Azure OpenAI


## 7. Create RAG Test Cases with Expected Outputs

Generate test cases with expected answers for comparison.

In [6]:
# Define test queries with expected answers
rag_test_data = [
    {
        'query': 'What is the age eligibility for term insurance?',
        'expected_output': 'The age eligibility for term insurance is minimum 18 years and maximum 65 years.'
    },
    {
        'query': 'What are the benefits of group gratuity?',
        'expected_output': 'The benefits include tax-efficient retirement planning, lump sum payment upon retirement, professional fund management, flexible investment options, and death and disability benefits.'
    },
    {
        'query': 'Who should I contact for grievances?',
        'expected_output': 'For grievances, you can contact service@hdfclife.com or call toll-free 1860-267-9999. The Grievance Officer is Mr. Rajesh Kumar.'
    }
]

# Generate RAG test cases
rag_test_cases = []
for test in rag_test_data:
    result = rag_query_with_real_chunks(test['query'])
    
    test_case = LLMTestCase(
        input=test['query'],
        actual_output=result['result'],
        expected_output=test['expected_output'],
        retrieval_context=result['source_documents']
    )
    rag_test_cases.append(test_case)

print(f"✓ Created {len(rag_test_cases)} RAG test cases with expected outputs")
print("\nSample Test Case:")
print(f"Question: {rag_test_cases[0].input}")
print(f"Expected: {rag_test_cases[0].expected_output}")
print(f"Actual: {rag_test_cases[0].actual_output}")

✓ Created 3 RAG test cases with expected outputs

Sample Test Case:
Question: What is the age eligibility for term insurance?
Expected: The age eligibility for term insurance is minimum 18 years and maximum 65 years.
Actual: **Answer:**

The age eligibility for HDFC Life Group Term Insurance is as follows:

- **Employer-Employee Schemes:**
  - **Minimum Entry Age:** 16 years (18 years if rider is chosen)
  - **Maximum Entry Age:** 79 years (64 years if rider is chosen)
  - **Minimum Maturity Age:** 17 years (19 years if rider is chosen)

- **Non Employer-Employee Schemes:**
  - **Minimum Entry Age:** 18 years
  - **Maximum Entry Age:** 79 years (64 years if rider is chosen)
  - **Minimum Maturity Age:** 19 years

- **Maximum Maturity Age for all:** 80 years

**Note:**  
- Risk cover starts from the date of commencement of policy for all lives, including minors.  
- In case of a minor, the policy will vest on the Life Assured upon attaining age 18 years.


## 8. Run RAG Evaluations

Run four metrics (Answer Relevancy, Faithfulness, Contextual Relevancy, and a G-Eval Correctness check) against the first RAG test case.

In [7]:
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric, GEval

print("="*60)
print("Running RAG Evaluations")
print("="*60)

# 1. Answer Relevancy - Are answers relevant to questions?
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model=deepeval_model)
answer_relevancy_metric.measure(rag_test_cases[0])
print(f"\n1. Answer Relevancy Score: {answer_relevancy_metric.score:.2f}")
print(f"   Reason: {answer_relevancy_metric.reason}")

# 2. Faithfulness - Are there hallucinations?
faithfulness_metric = FaithfulnessMetric(threshold=0.7, model=deepeval_model)
faithfulness_metric.measure(rag_test_cases[0])
print(f"\n2. Faithfulness Score: {faithfulness_metric.score:.2f}")
print(f"   Reason: {faithfulness_metric.reason}")

# 3. Contextual Relevancy - Is retrieved context relevant?
contextual_relevancy_metric = ContextualRelevancyMetric(threshold=0.7, model=deepeval_model)
contextual_relevancy_metric.measure(rag_test_cases[0])
print(f"\n3. Contextual Relevancy Score: {contextual_relevancy_metric.score:.2f}")
print(f"   Reason: {contextual_relevancy_metric.reason}")

# 4. Correctness - Compare with expected output
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check if the actual output contains the key information from the expected output",
        "Verify if the facts and figures match",
        "Ensure there are no contradictions with the expected output"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.7,
    model=deepeval_model
)
correctness_metric.measure(rag_test_cases[0])
print(f"\n4. Correctness (vs Expected) Score: {correctness_metric.score:.2f}")
print(f"   Reason: {correctness_metric.reason}")

Running RAG Evaluations



1. Answer Relevancy Score: 0.77
   Reason: The score is 0.77 because the output included information about minimum maturity age and riders, which is irrelevant to age eligibility for entering term insurance. However, it still provided some relevant details about entry age, which is why the score isn't lower.



2. Faithfulness Score: 1.00
   Reason: The score is 1.00 because there are no contradictions - the actual output aligns perfectly with the retrieval context. Great job staying faithful to the source!



3. Contextual Relevancy Score: 0.18
   Reason: The score is 0.18 because, despite the presence of highly relevant statements like 'Minimum Entry Age is 16 years for Employer Employee schemes (18 years if rider is chosen), 18 years for Non Employer Employee schemes.' and 'Maximum Entry Age is 79 years (64 years if rider is chosen).', the majority of the retrieval context consists of information unrelated to age eligibility, such as 'Policy Term is one year renewable' and 'Minimum Sum Assured is Rs. 10000.', which significantly lowers overall relevancy.



4. Correctness (vs Expected) Score: 0.20
   Reason: The actual output provides detailed age eligibility criteria for different schemes and rider options, including minimum and maximum entry ages, as well as maturity ages. However, it does not align with the expected output, which simply states a minimum eligibility age of 18 and a maximum of 65. The actual output lists maximum entry ages up to 79, which contradicts the expected maximum of 65, and includes more nuanced details not requested. Therefore, the key information and figures do not match, and there are contradictions present.
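DeepEval's documentation describes Answer Relevancy, Faithfulness, and Contextual Relevancy as ratios (for example, relevant statements over total statements), with the LLM judge deciding which statements count. A minimal sketch of that arithmetic follows. The counts are invented to show how a score near 0.18 can arise; the real counts for this run are not recorded in the output above.

```python
def ratio_score(relevant: int, total: int) -> float:
    """Fraction of statements judged relevant; the judging itself is done by an LLM."""
    if total <= 0:
        raise ValueError("total must be positive")
    return relevant / total


def passes(score: float, threshold: float = 0.7) -> bool:
    """A metric passes when its score meets or exceeds the threshold."""
    return score >= threshold


# Hypothetical counts: 2 relevant statements out of 11 extracted from the context
score = ratio_score(relevant=2, total=11)
print(f"{score:.2f}", passes(score))
```

Read this way, a low Contextual Relevancy score means most of the retrieved text did not bear on the question, even though the answer itself stayed faithful to it.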


---

# Part 2: Agent Evaluation

We'll evaluate an SQL agent's tool selection and execution accuracy.

## 9. Setup SQL Agent

Create an agent with database tools for querying SQLite.

In [8]:
import sqlite3
from langchain.agents import create_agent

# Define database tools
def get_tables():
    """Return list of tables in the database."""
    conn = sqlite3.connect("sales_data.db")
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = [row[0] for row in cursor.fetchall()]
    conn.close()
    return tables

def get_schema(table_name: str):
    """Return schema of a given table."""
    conn = sqlite3.connect("sales_data.db")
    cursor = conn.cursor()
    cursor.execute(f"PRAGMA table_info({table_name});")
    results = cursor.fetchall()
    conn.close()
    return results

def execute_query(query: str):
    """Execute arbitrary SQL query and return results."""
    conn = sqlite3.connect("sales_data.db")
    cursor = conn.cursor()
    cursor.execute(query)
    results = cursor.fetchall()
    conn.close()
    return results

# Create SQL agent
sql_agent = create_agent(
    llm,
    tools=[get_tables, get_schema, execute_query],
    system_prompt="""You are a sqlite database agent with tools to get tables, get schema and execute queries.
    Understand column stats before generating queries."""
)

print("✓ SQL Agent configured successfully")

✓ SQL Agent configured successfully
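The tools above interpolate `table_name` into a `PRAGMA` statement and run whatever SQL the agent produces. A minimal sketch of more defensive variants follows, assuming the tools receive an open connection. The function names and the `demo_sales` table are hypothetical, and the `SELECT` prefix check is a coarse guard rather than a complete safeguard.

```python
import sqlite3


def get_schema_safe(conn: sqlite3.Connection, table_name: str):
    """Return PRAGMA table_info rows, but only for a table that really exists."""
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
    }
    if table_name not in known:
        raise ValueError(f"Unknown table: {table_name!r}")
    # The name was checked against sqlite_master; quote it and escape embedded quotes
    quoted = table_name.replace('"', '""')
    return conn.execute(f'PRAGMA table_info("{quoted}");').fetchall()


def execute_select(conn: sqlite3.Connection, query: str):
    """Run a single statement only if it starts with SELECT, and return its rows."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    return conn.execute(query).fetchall()


# In-memory database with a made-up table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO demo_sales (amount) VALUES (10.5)")
print(get_schema_safe(conn, "demo_sales"))
print(execute_select(conn, "SELECT amount FROM demo_sales"))
```

Another option is to open the database read-only with `sqlite3.connect("file:sales_data.db?mode=ro", uri=True)`, which rejects writes at the connection level.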


## 10. Create Agent Test Cases with Expected Tools

Generate test cases that specify which tools the agent should use.

In [9]:
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import ToolCall

agent_test_data = [
    {
        'query': 'What tables exist in the database?',
        'expected_tools': ['get_tables'],
        'expected_output': 'Available tables in the database'
    },
    {
        'query': 'Show me the schema of the sales_data table',
        'expected_tools': ['get_schema'],
        'expected_output': 'Schema information for sales_data table'
    }
]

agent_test_cases = []
for test in agent_test_data:
    try:
        result = sql_agent.invoke({'messages': [test['query']]})
        
        # Extract tools called and create ToolCall objects
        tools_called = []
        for message in result['messages']:
            if hasattr(message, 'tool_calls') and message.tool_calls:
                for tool_call in message.tool_calls:
                    tool_call_obj = ToolCall(
                        name=tool_call['name'],
                        input=tool_call.get('args', {})
                    )
                    if not any(tc.name == tool_call_obj.name for tc in tools_called):
                        tools_called.append(tool_call_obj)
        
        # Convert expected tools to ToolCall objects
        expected_tools = [ToolCall(name=tool, input={}) for tool in test['expected_tools']]
        
        agent_test_case = LLMTestCase(
            input=test['query'],
            actual_output=result['messages'][-1].content,
            expected_output=test['expected_output'],
            tools_called=tools_called,
            expected_tools=expected_tools
        )
        agent_test_cases.append(agent_test_case)
    except Exception as e:
        print(f"⚠ Skipping test case: {test['query']}")
        print(f"  Error: {str(e)[:100]}")

print(f"\n✓ Created {len(agent_test_cases)} agent test cases successfully")
if agent_test_cases:
    print("\nSample Agent Test Case:")
    print(f"Query: {agent_test_cases[0].input}")
    print(f"Tools Called: {[tc.name for tc in agent_test_cases[0].tools_called]}")
    print(f"Expected Tools: {[tc.name for tc in agent_test_cases[0].expected_tools]}")


✓ Created 2 agent test cases successfully

Sample Agent Test Case:
Query: What tables exist in the database?
Tools Called: ['get_tables']
Expected Tools: ['get_tables']
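The extraction loop above keeps only the first call for each tool name, so repeated calls to the same tool with different arguments are dropped. A minimal sketch that records every call with its arguments follows, assuming each message exposes `tool_calls` as a list of dicts with `name` and `args` keys, as the loop above does. The stand-in messages are invented for illustration.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def collect_tool_calls(messages: List[Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (name, args) for every tool call, in order, without de-duplicating."""
    calls = []
    for message in messages:
        # Messages without tool calls either lack the attribute or hold an empty list
        for call in getattr(message, "tool_calls", None) or []:
            calls.append((call["name"], call.get("args", {})))
    return calls


# Stand-in messages; real ones would come from result['messages']
messages = [
    SimpleNamespace(content="question"),
    SimpleNamespace(content="", tool_calls=[{"name": "get_schema", "args": {"table_name": "a"}}]),
    SimpleNamespace(content="", tool_calls=[{"name": "get_schema", "args": {"table_name": "b"}}]),
]
print(collect_tool_calls(messages))
```

Keeping the arguments is a prerequisite for any check of how a tool was invoked, not just which tool was chosen.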


## 11. Run Agent Evaluations

Evaluate tool selection accuracy and batch process all test cases.

In [10]:
print("="*60)
print("Running Agent Evaluations")
print("="*60)

# 1. Tool Correctness - Single test case
tool_correctness_metric = ToolCorrectnessMetric(threshold=0.7, model=deepeval_model)
tool_correctness_metric.measure(agent_test_cases[0])
print(f"\n1. Tool Correctness Score: {tool_correctness_metric.score:.2f}")
print(f"   Tools Called: {[tc.name for tc in agent_test_cases[0].tools_called]}")
print(f"   Expected Tools: {[tc.name for tc in agent_test_cases[0].expected_tools]}")
print(f"   Match: {'✓' if tool_correctness_metric.score >= 0.7 else '✗'}")

# 2. Batch Evaluation - All test cases
print("\n" + "="*60)
print("Batch Evaluation Results")
print("="*60)

results = evaluate(
    test_cases=agent_test_cases,
    metrics=[ToolCorrectnessMetric(threshold=0.5, model=deepeval_model)]
)

print("\nAgent Evaluation Summary:")
print(f"Total Test Cases: {len(agent_test_cases)}")
print("Evaluation Complete!")

Running Agent Evaluations



1. Tool Correctness Score: 1.00
   Tools Called: ['get_tables']
   Expected Tools: ['get_tables']
   Match: ✓

Batch Evaluation Results




Metrics Summary

  - ✅ Tool Correctness (score: 1.0, threshold: 0.5, strict: False, evaluation model: None, reason: [
	 Tool Calling Reason: All expected tools ['get_tables'] were called (order not considered).
	 Tool Selection Reason: No available tools were provided to assess tool selection criteria
]
, error: None)

For test case:

  - input: What tables exist in the database?
  - actual output: There are currently no tables in the database. If you have any other requests or need to create tables, please let me know!
  - expected output: Available tables in the database
  - context: None
  - retrieval context: None


Metrics Summary

  - ✅ Tool Correctness (score: 1.0, threshold: 0.5, strict: False, evaluation model: None, reason: [
	 Tool Calling Reason: All expected tools ['get_schema'] were called (order not considered).
	 Tool Selection Reason: No available tools were provided to assess tool selection criteria
]
, error: None)

For test case:

  - input: Show me the schema


Agent Evaluation Summary:
Total Test Cases: 2
Evaluation Complete!
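The reason strings above say that all expected tools were called, with order not considered. A simplified stand-in for that kind of check, comparing tool names only, is sketched below. It is an illustration under that assumption and not DeepEval's scoring code; the function name is hypothetical.

```python
from typing import List


def tool_name_overlap(called: List[str], expected: List[str]) -> float:
    """Fraction of expected tool names that appear among the called tools (order ignored)."""
    if not expected:
        raise ValueError("expected must contain at least one tool name")
    called_set = set(called)
    hits = sum(1 for name in expected if name in called_set)
    return hits / len(expected)


print(tool_name_overlap(["get_tables"], ["get_tables"]))
print(tool_name_overlap(["get_tables"], ["get_tables", "get_schema"]))
```

A name-only check can pass while the final answer is still wrong: in the first test case above, the agent called `get_tables` as expected but reported that the database has no tables.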


---

# Comprehensive Evaluation Dashboard

Summary of all RAG and Agent evaluations with insights.

---

## 🎓 Key Takeaways

### RAG Evaluations:
1. **Answer Relevancy** - Ensures answers match questions
2. **Faithfulness** - Prevents hallucinations by grounding in context
3. **Contextual Relevancy** - Validates retrieval quality
4. **Correctness** - Compares with expected outputs

### Agent Evaluations:
1. **Tool Correctness** - Validates proper tool selection
2. **Tool Arguments** - Ensures correct parameters
3. **Expected Outputs** - Compares agent responses

### Best Practices:
- ✅ Set appropriate thresholds (0.7 is common)
- ✅ Use batch evaluation for efficiency
- ✅ Create diverse test cases
- ✅ Monitor evaluation metrics over time
- ✅ Combine multiple metrics for comprehensive evaluation
- ✅ Use real data from production systems when possible
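As a small illustration of combining metrics, the sketch below compares the four RAG scores recorded earlier in this notebook against the 0.7 threshold. The `summarize` helper is hypothetical, and the scores are copied from the run above rather than recomputed.

```python
from typing import Dict

THRESHOLD = 0.7

# Scores copied from the RAG evaluation output earlier in this notebook
scores = {
    "Answer Relevancy": 0.77,
    "Faithfulness": 1.00,
    "Contextual Relevancy": 0.18,
    "Correctness": 0.20,
}


def summarize(metric_scores: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """Map each metric name to True when its score meets the threshold."""
    return {name: score >= threshold for name, score in metric_scores.items()}


summary = summarize(scores, THRESHOLD)
failing = [name for name, ok in summary.items() if not ok]

for name, ok in summary.items():
    print(f"{name:<22} {scores[name]:.2f} {'PASS' if ok else 'FAIL'}")
print("Failing metrics:", failing)
```

Viewed together, the scores separate the causes: generation stayed faithful to the context, while retrieval quality and the expected answer in the test data are where this run fell short.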