# RAG Evaluation Demo
Demonstrates a RAG evaluation pipeline using the first test case from the goldens file:
1. Load golden (Q&A pair)
2. Retrieve context (AWS Knowledge Base)
3. Generate RAG response
4. Compute RAG Triad metrics (Faithfulness, Contextual Relevancy, Answer Relevancy)
5. Evaluate using GEval
6. Generate report

## Load and Analyze Test Case

Loads the first golden from `synthetic_data/goldens.json` and prints:
- Input query for RAG evaluation
- Expected output (ground truth)
- Number of context documents and the source file
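
Before indexing into a golden, it can help to confirm that it carries the fields this notebook reads. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_golden_keys` and the sample record are hypothetical, and the key list is taken from the fields accessed in the cell below.

```python
# Hypothetical helper: lists which of the fields read by this notebook
# are absent from a golden record.
REQUIRED_KEYS = ("input", "expected_output", "context", "source_file")


def missing_golden_keys(golden: dict) -> list:
    """Return the required keys that are missing from a golden record."""
    return [key for key in REQUIRED_KEYS if key not in golden]


# Made-up record for illustration only; real goldens come from goldens.json.
sample_golden = {
    "input": "What is Lisp?",
    "expected_output": "A family of programming languages.",
    "context": ["Lisp is a family of programming languages."],
}

print(missing_golden_keys(sample_golden))  # ['source_file']
```

An empty list means the record can be passed through the rest of the notebook without a `KeyError` on these fields.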

In [30]:
import json
from pathlib import Path
import pandas as pd  # for better data visualization

def load_test_cases(file_path):
    """Load the list of golden test cases from a JSON file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return json.load(f)

test_cases = load_test_cases('./synthetic_data/goldens.json')

first_case = test_cases[0]
print("First Test Case Analysis:")
print("\nInput Query:")
print(first_case['input'])
print("\nExpected Output:")
print(first_case['expected_output'])
print("\nContext Documents:", len(first_case['context']))
print("\nSource File:", first_case['source_file'])

First Test Case Analysis:

Input Query:
What aspects of Lisp contributed to its prominence over PL/I in AI programming?

Expected Output:
Lisp's prominence over PL/I in AI programming can be attributed to several key aspects that aligned well with the needs of AI development:

1. **Symbolic Computation**: Lisp was uniquely suited for symbolic computation, a cornerstone of AI programming. It allowed for easy manipulation of symbols and lists, which was essential for representing knowledge and reasoning in AI systems.

2. **Flexibility & Extensibility**: The language's inherent flexibility allowed programmers to define their own syntax and new language features easily, enabling rapid experimentation and adaptation to complex AI challenges.

3. **Dynamic Typing and Garbage Collection**: Lisp's dynamic typing and automatic memory management via garbage collection allowed AI developers to focus more on problem-solving rather than low-level memory management, which was often cumbersome in la

## Context Retrieval
Uses AWS Bedrock Agent to fetch relevant passages:
- Configurable results via `RAGConfig.NUMBER_OF_RESULTS`
- Retrieves context for input query
- Returns formatted passages from Knowledge Base
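
The passage-extraction step can be exercised without AWS credentials by feeding it a hand-written response. The sketch below assumes only the response shape that the retrieval cell itself relies on (`retrievalResults` entries with `content.text`); the function name `extract_passages` and the mock data are hypothetical.

```python
def extract_passages(retrieve_response: dict) -> list:
    """Collect non-empty passage texts from a retrieve-style response dict."""
    passages = []
    for result in retrieve_response.get("retrievalResults", []):
        text = result.get("content", {}).get("text", "")
        if text:
            passages.append(text)
    return passages


# Hand-written stand-in for a Knowledge Base response (not real output).
mock_response = {
    "retrievalResults": [
        {"content": {"text": "Lisp was regarded as the language of AI."}},
        {"content": {"text": ""}},  # empty passage, skipped
        {"content": {}},            # missing text, skipped
    ]
}

print(extract_passages(mock_response))
```

Separating extraction from the network call in this way makes the filtering of empty passages easy to unit test.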

In [31]:
import boto3
import os
from dotenv import load_dotenv
from config import RAGConfig

# Initialize Bedrock client for retrieval
load_dotenv()
bedrock_agent = boto3.client('bedrock-agent-runtime', region_name=os.getenv('AWS_BEDROCK_REGION'))

def retrieve_context(knowledge_base_id: str, prompt: str):
    """
    Retrieve relevant passages using Bedrock Agent Runtime
    """
    try:
        # Use the same retrieval configuration as in bedrock_rag.py
        retrieve_response = bedrock_agent.retrieve(
            knowledgeBaseId=knowledge_base_id,
            retrievalQuery={
                'text': prompt
            },
            retrievalConfiguration={
                'vectorSearchConfiguration': {
                    'numberOfResults': RAGConfig.NUMBER_OF_RESULTS
                }
            }
        )
        
        # Extract passages from response
        passages = []
        for result in retrieve_response.get('retrievalResults', []):
            text = result.get('content', {}).get('text', '')
            if text:
                passages.append(text)
                
        return passages
    except Exception as e:
        print(f"Error in retrieval: {str(e)}")
        return []

# Get the input query from first test case
input_query = first_case['input']
knowledge_base_id = os.getenv('KNOWLEDGE_BASE_ID')

# Retrieve context
retrieved_contexts = retrieve_context(knowledge_base_id, input_query)

print(f"Input Query: {input_query}")
print(f"\nNumber of contexts retrieved: {len(retrieved_contexts)}")
print("\nRetrieved Contexts:")
for i, context in enumerate(retrieved_contexts, 1):
    print(f"\nContext {i}:")
    print(context[:200] + "..." if len(context) > 200 else context)  # Show first 200 chars for readability

Input Query: What aspects of Lisp contributed to its prominence over PL/I in AI programming?

Number of contexts retrieved: 3

Retrieved Contexts:

Context 1:
What these programs really showed was that there's a subset of natural language that's a formal language. But a very proper subset. It was clear that there was an unbridgeable gap between what they co...

Context 2:
I started writing essays again, and wrote a bunch of new ones over the next few months. I even wrote a couple that weren't about startups. Then in March 2015 I started working on Lisp again.  The dist...

Context 3:
There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI....


## Response Generation
Generates answer using AWS Bedrock:
- Uses retrieved context and input query
- Follows RAGConfig.PROMPT_TEMPLATE format
- Parameters from ModelConfig (max_gen_len, temperature, top_p)
- Handles multiple response formats
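
The response-format handling can be sketched as a small pure function. This is an illustration under assumptions, not the notebook's exact behavior: the name `parse_generation` is hypothetical, and unlike the cell below it returns `None` for an unrecognized body instead of a placeholder string, so callers can tell a failure from a real answer.

```python
from typing import Optional


def parse_generation(response_body: dict) -> Optional[str]:
    """Extract generated text from either of the two body shapes handled here."""
    # Shape 1: {"generation": "..."}
    if "generation" in response_body:
        return response_body["generation"]
    # Shape 2: {"outputs": [{"text": "..."}]}
    outputs = response_body.get("outputs")
    if isinstance(outputs, list) and outputs:
        return outputs[0].get("text")
    # Unknown shape: signal failure explicitly
    return None


print(parse_generation({"generation": "Lisp suited symbolic computation."}))
print(parse_generation({"outputs": [{"text": "Lisp suited AI work."}]}))
print(parse_generation({"unexpected": True}))
```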

In [32]:
import boto3
import json
from config import ModelConfig, RAGConfig

# Initialize Bedrock runtime client
bedrock_runtime = boto3.client('bedrock-runtime', region_name=os.getenv('AWS_BEDROCK_REGION'))

def generate_response(model_id: str, context: list, prompt: str):
    """
    Generate response using Bedrock's LLaMA3 model
    """
    try:
        # Create enhanced prompt using the same template
        context_text = "\n".join(context)
        enhanced_prompt = RAGConfig.PROMPT_TEMPLATE.format(
            context=context_text, 
            question=prompt
        )
        
        # Print enhanced prompt for debugging
        print("\n=== Enhanced Prompt ===")
        print(enhanced_prompt)
        print("=====================")
        
        # Invoke LLaMA3 model with same configuration
        llm_response = bedrock_runtime.invoke_model(
            modelId=model_id,
            contentType="application/json",
            accept="application/json",
            body=json.dumps({
                "prompt": enhanced_prompt,
                "max_gen_len": ModelConfig.MAX_GEN_LEN,
                "temperature": ModelConfig.TEMPERATURE,
                "top_p": ModelConfig.TOP_P
            })
        )
        
        response_body = json.loads(llm_response["body"].read())
        
        # Handle response formats
        if "generation" in response_body:
            return response_body["generation"]
        # Require a non-empty list so that indexing [0] cannot raise IndexError
        elif isinstance(response_body.get("outputs"), list) and response_body["outputs"]:
            return response_body["outputs"][0].get("text", "No response generated")
        else:
            print(f"\nDebug - Received response format: {response_body.keys()}")
            return "Unexpected response format."
            
    except Exception as e:
        print(f"\nError in generate_response: {str(e)}")
        return None

# Generate response for first test case
model_id = os.getenv('AWS_BEDROCK_MODEL_ID')
input_query = first_case['input']

response = generate_response(
    model_id=model_id,
    context=retrieved_contexts,  # From the Context Retrieval cell
    prompt=input_query
)

print("\n=== Generated Response ===")
print(response)


=== Enhanced Prompt ===
Based on the following context, please answer the question.

Context:
What these programs really showed was that there's a subset of natural language that's a formal language. But a very proper subset. It was clear that there was an unbridgeable gap between what they could do and actually understanding natural language. It was not, in fact, simply a matter of teaching SHRDLU more words. That whole way of doing AI, with explicit data structures representing concepts, was not going to work. Its brokenness did, as so often happens, generate a lot of opportunities to write papers about various band-aids that could be applied to it, but it was never going to get us Mike.  So I looked around to see what I could salvage from the wreckage of my plans, and there was Lisp. I knew from experience that Lisp was interesting for its own sake and not just for its association with AI, even though that was the main reason people cared about it at the time. So I decided to focus

## RAG Triad

The RAG Triad consists of three key metrics:
1. Faithfulness – Is the answer grounded in the retrieved context?
2. Contextual Relevancy – Is the retrieved context relevant to the query?
3. Answer Relevancy – Is the answer relevant to the query?
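
As a rough illustration of how a threshold turns the three scores into a verdict, the sketch below uses the scores and the 0.7 threshold from this notebook's run. The `>=` comparison and the "all metrics must pass" aggregation are assumptions chosen to match the reported outcome, not a statement of DeepEval's internals.

```python
THRESHOLD = 0.7  # same threshold the metrics in this notebook are configured with

# Scores as reported in this notebook's run.
scores = {
    "Answer Relevancy": 0.875,
    "Faithfulness": 1.0,
    "Contextual Relevancy": 0.55,
}


def passed(score: float, threshold: float = THRESHOLD) -> bool:
    """Assumed rule: a metric passes when its score reaches the threshold."""
    return score >= threshold


per_metric = {name: passed(score) for name, score in scores.items()}
overall = all(per_metric.values())  # assumed rule: every metric must pass

print(per_metric)
print(overall)  # False
```

Under these assumptions a single weak metric, here Contextual Relevancy, is enough to fail the test case even though the answer itself scores well.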

In [34]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input=first_case['input'],
    actual_output=response,
    expected_output=first_case['expected_output'],
    context=first_case['context'],
    retrieval_context=retrieved_contexts
)

evaluate(test_cases=[test_case], metrics=[answer_relevancy_metric, faithfulness_metric, contextual_relevancy_metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:12, 12.37s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.875, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.88 because while the response provided valuable insights into Lisp's prominence in AI programming, the mention of SHRDLU's intelligence level strayed from directly comparing Lisp to PL/I, causing a slight decrease in relevance., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score is 1.00 because there are no contradictions present, indicating perfect alignment between the actual output and the information in the retrieval context., error: None)
  - ❌ Contextual Relevancy (score: 0.55, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.55 because while the retrieval context includes some relevant statements like 'The distinctive thing about Lisp is that its core is a language defined by writing an interpreter in itself' and 'But its origins a




EvaluationResult(test_results=[TestResult(name='test_case_0', success=False, metrics_data=[MetricData(name='Answer Relevancy', threshold=0.7, success=True, score=0.875, reason="The score is 0.88 because while the response provided valuable insights into Lisp's prominence in AI programming, the mention of SHRDLU's intelligence level strayed from directly comparing Lisp to PL/I, causing a slight decrease in relevance.", strict_mode=False, evaluation_model='gpt-4o-mini', error=None, evaluation_cost=0.000405, verbose_logs='Statements:\n[\n    "Lisp originated as a formal model of computation.",\n    "Lisp is regarded as the language of AI.",\n    "Lisp has power and elegance in programming.",\n    "Lisp is prominent over PL/I in AI programming.",\n    "Lisp can expand one\'s concept of a program.",\n    "Lisp is suitable for reverse-engineering SHRDLU.",\n    "The author believed SHRDLU was climbing the lower slopes of intelligence.",\n    "Lisp\'s popularity in AI programming is due to va

## Evaluation Setup
Import required packages for custom evaluator:
- LangChain AWS integration
- DeepEval base model

In [35]:
# Import required packages
# from langchain_community.chat_models import BedrockChat
from langchain_aws import ChatBedrock
from deepeval.models.base_model import DeepEvalBaseLLM
import boto3

## Custom Evaluator
DeepEval-compatible AWS Bedrock wrapper:
- Implements required LLM interface
- Supports sync/async generation
- Uses configured model name from env
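
The sync/async delegation pattern can be tried without Bedrock or DeepEval installed. In this sketch `FakeChatModel`, `FakeMessage`, and `EvaluatorWrapper` are hypothetical stand-ins: the wrapper mirrors the method shape of the class below but does not inherit from `DeepEvalBaseLLM`, and the fake model only echoes its prompt.

```python
import asyncio


class FakeMessage:
    """Hypothetical message object exposing a .content attribute."""

    def __init__(self, content: str):
        self.content = content


class FakeChatModel:
    """Hypothetical stand-in offering invoke/ainvoke like a chat model."""

    def invoke(self, prompt: str) -> FakeMessage:
        return FakeMessage(f"echo: {prompt}")

    async def ainvoke(self, prompt: str) -> FakeMessage:
        return FakeMessage(f"echo: {prompt}")


class EvaluatorWrapper:
    """Same method shape as the wrapper below, without the DeepEval base class."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.load_model().invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        res = await self.load_model().ainvoke(prompt)
        return res.content


wrapper = EvaluatorWrapper(FakeChatModel())
sync_answer = wrapper.generate("ping")
# asyncio.run works in a plain script; inside a running notebook event loop,
# use `await wrapper.a_generate("ping")` instead.
async_answer = asyncio.run(wrapper.a_generate("ping"))
print(sync_answer, async_answer)
```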

In [36]:
class AWSBedrock(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        # Fall back to a Bedrock-specific label when no display name is configured
        return os.getenv('AWS_EVALUATOR_MODEL_NAME', "Custom AWS Bedrock Model")

## Initialize Evaluator
Sets up AWS Bedrock for evaluation:
- Creates ChatBedrock instance with env config
- Wraps model in DeepEval-compatible interface
- Configures generation parameters

In [37]:
from langchain_aws import ChatBedrock
from deepeval.models.base_model import DeepEvalBaseLLM

# Initialize the custom model
custom_model = ChatBedrock(
    region_name=os.getenv('AWS_EVALUATOR_REGION'),
    model_id=os.getenv('AWS_EVALUATOR_MODEL_ID'),
    model_kwargs={
        "temperature": 0.7,
        "top_p": 0.8,
        "max_gen_len": 2048
    }
)

# Create our custom LLM class instance
aws_bedrock = AWSBedrock(model=custom_model)

## GEval Configuration
Sets up evaluation metric:
- Assesses: accuracy, completeness, clarity
- Uses: input, output, expected, context
- Evaluates via custom AWS Bedrock model

In [38]:
# Import GEval components
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define accuracy metric
accuracy_metric = GEval(
    name="response_accuracy_metric",
    criteria="""Evaluate the response for:
1. Factual Accuracy: Information provided matches the source material
2. Completeness: All key points from the question are addressed
3. Clarity: Response is well-structured and easy to understand""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.CONTEXT
    ],
    model=aws_bedrock  # Using our custom AWS Bedrock model
)

print("GEval metric configured successfully")

GEval metric configured successfully


## Test Case Creation
Prepares evaluation data:
- Input: from golden
- Actual: RAG generated response
- Expected: golden's ground truth
- Context: golden's context passages

In [39]:
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Get first test case from goldens.json
first_case = test_cases[0]

# Create LLMTestCase:
# - input from JSON
# - actual_output from RAG response only
# - expected_output from JSON
# - context from JSON only
test_case = LLMTestCase(
    input=first_case['input'],
    actual_output=response,  # From RAG's response
    expected_output=first_case['expected_output'],
    context=first_case['context']  
)

print("=== Test Case Created ===")
print(f"Input Query: {test_case.input[:100]}...")
print(f"\nActual Output Length: {len(test_case.actual_output) if test_case.actual_output else 0} chars")
print(f"Expected Output Length: {len(test_case.expected_output)} chars")
print(f"Context Passages: {len(test_case.context)}")

=== Test Case Created ===
Input Query: What aspects of Lisp contributed to its prominence over PL/I in AI programming?...

Actual Output Length: 496 chars
Expected Output Length: 1475 chars
Context Passages: 3


## Run Evaluation
Executes GEval assessment:
- Evaluates test case against metric
- Shows score, threshold, success/fail
- Provides detailed analysis rationale

In [40]:
from deepeval import evaluate

result = evaluate(
    test_cases=[test_case],
    metrics=[accuracy_metric]  # Using the custom defined 'accuracy_metric'
)

# Extract and print the detailed results
metric_result = result.test_results[0].metrics_data[0]  # Get first test case's metric

print("=== Evaluation Results ===")
print(f"Metric: {metric_result.name}")
print(f"Score: {metric_result.score:.4f}")
print(f"Threshold: {metric_result.threshold}")
print(f"Success: {metric_result.success}")
print(f"Model Used: {metric_result.evaluation_model}")
print("\nDetailed Analysis:")
print(metric_result.reason)

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.67s/test case]



Metrics Summary

  - ✅ response_accuracy_metric (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: "LLaMA3-70B Evaluator"  # Custom display name, reason: The Actual Output partially matches the Expected Output, covering some key points such as Lisp's symbolic computation and flexibility, but lacks clarity and completeness, failing to address all aspects contributing to Lisp's prominence over PL/I in AI programming., error: None)

For test case:

  - input: What aspects of Lisp contributed to its prominence over PL/I in AI programming?
  - actual output:  Lisp's origins as a formal model of computation, its power and elegance, and its ability to be used to program computers, as well as its being regarded as the language of AI, contributed to its prominence over PL/I in AI programming. Additionally, Lisp's ability to expand one's concept of a program and its suitability for reverse-engineering SHRDLU, a program that the author believed was already climbing the lower 




=== Evaluation Results ===
Metric: response_accuracy_metric (GEval)
Score: 0.8000
Threshold: 0.5
Success: True
Model Used: "LLaMA3-70B Evaluator"  # Custom display name

Detailed Analysis:
The Actual Output partially matches the Expected Output, covering some key points such as Lisp's symbolic computation and flexibility, but lacks clarity and completeness, failing to address all aspects contributing to Lisp's prominence over PL/I in AI programming.


## Report Generation
Creates an Excel report with two sheets, with column widths auto-adjusted for readability:
- Summary: timestamp, score, pass/fail
- Detailed Results: full test case results
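
The auto-width rule (longest cell text plus 2, capped at 100) can be isolated as a pure function and checked without writing a workbook. The function name `column_widths` and the sample rows are hypothetical; the padding and cap values are the ones used in the report cell.

```python
def column_widths(rows, padding=2, max_width=100):
    """Width per column: longest cell text plus padding, capped at max_width."""
    widths = []
    for column in zip(*rows):  # transpose rows into columns
        longest = max(len(str(value)) for value in column)
        widths.append(min(longest + padding, max_width))
    return widths


# Illustrative rows shaped like the Summary sheet.
rows = [
    ("Metric", "Value"),
    ("Overall Score", "0.8000"),
    ("Pass/Fail", "PASS"),
]

print(column_widths(rows))  # [15, 8]
```

The cap matters for the detailed sheet, where long answers would otherwise produce very wide columns.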

In [41]:
import pandas as pd
from datetime import datetime
import os

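def generate_excel_reports(test_case, metric_result, report_dir='reports'):
    """Write a two-sheet Excel report (Summary, Detailed Results) for one test case."""
    # Create the output directory if it does not exist yet
    os.makedirs(report_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    excel_file = os.path.join(report_dir, f'rag_evaluation_report_{timestamp}.xlsx')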
    
    # Create Summary DataFrame
    summary_data = {
        'Metric': ['Timestamp', 'Total Test Cases', 'Overall Score', 'Pass/Fail', 'Model Used'],
        'Value': [
            datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            1,  # Will be modified for production
            f"{metric_result.score:.4f}",
            'PASS' if metric_result.success else 'FAIL',
            metric_result.evaluation_model
        ]
    }
    summary_df = pd.DataFrame(summary_data)
    
    # Create Detailed Results DataFrame - Now with each test case as a row
    detailed_data = {
        'Test ID': [1],  # Will increment for multiple test cases
        'Input Query': [test_case.input],
        'Expected Output': [test_case.expected_output],
        'Actual Output': [test_case.actual_output],
        'Score': [f"{metric_result.score:.4f}"],
        'Threshold': [metric_result.threshold],
        'Success': [metric_result.success],
        'Analysis': [metric_result.reason]
    }
    detailed_df = pd.DataFrame(detailed_data)
    
    # Create Excel writer object
    with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
        # Write each dataframe to a different worksheet
        summary_df.to_excel(writer, sheet_name='Summary', index=False)
        detailed_df.to_excel(writer, sheet_name='Detailed Results', index=False)
        
        # Auto-adjust columns' width
        for sheet_name in writer.sheets:
            worksheet = writer.sheets[sheet_name]
            for col in worksheet.columns:
                # column_letter also works past column 'Z', unlike chr(64 + idx)
                column = worksheet.column_dimensions[col[0].column_letter]
                max_length = max((len(str(cell.value)) for cell in col), default=0)
                column.width = min(max_length + 2, 100)  # Cap width at 100

    print(f"Excel report generated successfully: {excel_file}")

# Generate the Excel report
generate_excel_reports(test_case, metric_result)

Excel report generated successfully: reports\rag_evaluation_report_20250530_152953.xlsx
