# Amazon Bedrock Knowledge Base Evaluation Guide

## Introduction

Amazon Bedrock Knowledge Base Evaluation provides a comprehensive solution for assessing RAG (Retrieval-Augmented Generation) applications. This guide demonstrates how to evaluate both retrieval and generation components of your RAG system using Amazon Bedrock APIs.

Through this guide, we'll explore:
- Setting up evaluation configurations
- Creating retrieval-only evaluation jobs
- Creating retrieval and generation evaluation jobs
- Monitoring evaluation progress

## Prerequisites

Before we begin, make sure you have:
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region
- An S3 bucket with CORS enabled for storing evaluation data
- A created and synced Amazon Bedrock Knowledge Base
- An IAM role with necessary permissions for S3 and Bedrock

For step-by-step instructions on completing these prerequisites, see the [Amazon Bedrock documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-evaluation-prereq.html).

> **Important**: Make sure that your knowledge base is synced and ready before starting any evaluation job.
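
One way to check readiness programmatically is to read the knowledge base status before creating a job. The sketch below is an illustration, not part of the original walkthrough: the helper name `knowledge_base_is_active` is hypothetical, and it assumes a `bedrock-agent` client whose `get_knowledge_base` response reports `ACTIVE` once the knowledge base is usable. An `ACTIVE` status does not by itself confirm that the latest sync has finished, so check your data source's ingestion jobs as well.

```python
def knowledge_base_is_active(agent_client, knowledge_base_id):
    """Return True when the knowledge base reports the ACTIVE status.

    agent_client is assumed to be boto3.client("bedrock-agent", ...).
    """
    response = agent_client.get_knowledge_base(knowledgeBaseId=knowledge_base_id)
    return response["knowledgeBase"]["status"] == "ACTIVE"
```

With real credentials, this could be called as `knowledge_base_is_active(boto3.client("bedrock-agent", region_name="us-east-1"), "<YOUR_KB_ID>")`.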

## Dataset Format

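The evaluation data must follow specific JSONL formats based on the type of evaluation. The samples below are pretty-printed for readability; in the actual `.jsonl` file, each record must occupy a single line.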

### Retrieval-only Evaluation Format
```json
{
    "conversationTurns": [{
        "referenceContexts": [{
            "content": [{
                "text": "Reference context for evaluation"
            }]
        }],
        "prompt": {
            "content": [{
                "text": "Your prompt here"
            }]
        }
    }]
}
```
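
As a minimal sketch of producing such a record in code, the hypothetical helper `retrieval_record` below builds the structure shown above and serializes it with `json.dumps`, which emits a single line by default. The helper name and its arguments are illustrative and not part of any Bedrock API.

```python
import json


def retrieval_record(prompt, reference_context):
    """Build one retrieval-only record with the fields shown above."""
    return {
        "conversationTurns": [{
            "referenceContexts": [{
                "content": [{"text": reference_context}]
            }],
            "prompt": {
                "content": [{"text": prompt}]
            },
        }]
    }


record = retrieval_record("Your prompt here", "Reference context for evaluation")

# json.dumps adds no newlines unless indent is set, so this is one JSONL line.
line = json.dumps(record)
print(line)
```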

### Retrieval and Generation Evaluation Format
```json
{
    "conversationTurns": [{
        "referenceResponses": [{
            "content": [{
                "text": "Reference response for evaluation"
            }]
        }],
        "prompt": {
            "content": [{
                "text": "Your prompt here"
            }]
        }
    }]
}
```
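
The sketch below shows one way to assemble several records of this kind into JSONL text. Both `generation_record` and `to_jsonl` are hypothetical helper names; the reference response is treated as optional, matching the data structure requirements in this guide.

```python
import json


def generation_record(prompt, reference_response=None):
    """Build one retrieval-and-generation record; the reference is optional."""
    turn = {"prompt": {"content": [{"text": prompt}]}}
    if reference_response is not None:
        turn["referenceResponses"] = [{"content": [{"text": reference_response}]}]
    return {"conversationTurns": [turn]}


def to_jsonl(records):
    """Serialize records as JSONL text: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


records = [
    generation_record("Your prompt here", "Reference response for evaluation"),
    generation_record("A prompt without a reference response"),
]
jsonl_text = to_jsonl(records)
```

Writing `jsonl_text` to a local file such as `input.jsonl` gives a dataset that can then be uploaded to your S3 bucket.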

## Dataset Requirements

### Job Requirements
- Maximum 1000 prompts per evaluation job
- Each line in the JSONL file must contain one complete prompt record

### File Requirements
- File must use JSONL format with `.jsonl` extension
- Each line must be a valid JSON object
- File must be stored in an S3 bucket with CORS enabled

### Data Structure Requirements
For Retrieval-only Evaluation:
- Must include `referenceContexts` as shown in the format above
- Each prompt must follow the specified JSON structure

For Retrieval and Generation Evaluation:
- May include `referenceResponses` (optional) as shown in the format above
- Each prompt must follow the specified JSON structure
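
The requirements above can be checked locally before uploading. The following is a rough sketch rather than an official validator: `validate_dataset` is a hypothetical helper that covers only the rules listed in this section (file extension, prompt limit, valid JSON per line, and the presence of `conversationTurns`, `prompt`, and `referenceContexts` where required).

```python
import json

MAX_PROMPTS = 1000  # per-job limit stated in this guide


def validate_dataset(lines, filename, retrieval_only):
    """Return a list of problems found; an empty list means the checks passed.

    lines is a list of non-empty strings, one per JSONL record.
    """
    problems = []
    if not filename.endswith(".jsonl"):
        problems.append("file name must end with .jsonl")
    if len(lines) > MAX_PROMPTS:
        problems.append(f"{len(lines)} prompts exceed the limit of {MAX_PROMPTS}")
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        turns = record.get("conversationTurns") if isinstance(record, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append(f"line {number}: missing conversationTurns")
            continue
        for turn in turns:
            if not isinstance(turn, dict) or "prompt" not in turn:
                problems.append(f"line {number}: missing prompt")
            elif retrieval_only and "referenceContexts" not in turn:
                problems.append(f"line {number}: missing referenceContexts")
    return problems
```

A typical call reads the file, drops blank lines, and passes the remainder: `validate_dataset([l for l in text.splitlines() if l.strip()], "input.jsonl", retrieval_only=True)`.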

> **Note**: When preparing your dataset, consider your evaluation objectives and make sure that your prompts and reference data align with your assessment goals. 

## Implementation

First, let's set up our configuration parameters:

In [1]:
import boto3
import time
from datetime import datetime

# Generate unique name for the job
job_name = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "anthropic.claude-3-sonnet-20240229-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"

# Specify S3 locations
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"

# Configure retrieval settings
num_results = 5
search_type = "HYBRID"

# Create Bedrock client (set region_name to the region where Bedrock access is enabled)
bedrock_client = boto3.client('bedrock', region_name='us-east-1')
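
The job reads its dataset from the `input_data` location, so the JSONL file has to be in S3 before the job is created. The sketch below shows one possible way to upload it. `split_s3_uri` and `upload_dataset` are hypothetical helpers, and the local file name is an assumption; the upload itself uses the standard boto3 S3 `upload_file` call.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_dataset(local_path, s3_uri):
    """Upload a local JSONL file to the S3 location the job will read."""
    import boto3  # imported here so split_s3_uri works without boto3 installed

    bucket, key = split_s3_uri(s3_uri)
    boto3.client("s3").upload_file(local_path, bucket, key)
```

For example, `upload_dataset("input.jsonl", input_data)` would place a local `input.jsonl` at the configured S3 URI.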

### Creating a Retrieval-only Evaluation Job

This configuration focuses on assessing the quality of retrieved contexts. Available metrics for retrieval evaluation:
- `Builtin.ContextRelevance`: Assesses how relevant the retrieved contexts are to the query
- `Builtin.ContextCoverage`: Measures how well the retrieved contexts cover the information needed

In [2]:
retrieval_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval performance",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveConfig": {
                    "knowledgeBaseId": knowledge_base_id,
                    "knowledgeBaseRetrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": num_results,
                            "overrideSearchType": search_type
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.ContextRelevance",
                    "Builtin.ContextCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

### Creating a Retrieval and Generation Evaluation Job

This configuration evaluates both retrieval and response generation. Available metrics for this evaluation (the job below requests the first five):
- `Builtin.Correctness`: Evaluates factual accuracy of generated responses
- `Builtin.Completeness`: Assesses if all relevant information is included
- `Builtin.Helpfulness`: Measures how useful the response is
- `Builtin.LogicalCoherence`: Evaluates response structure and flow
- `Builtin.Faithfulness`: Checks for hallucinations or made-up information
- `Builtin.Harmfulness`: Detects harmful content
- `Builtin.Stereotyping`: Identifies biased or stereotypical responses
- `Builtin.Refusal`: Detects responses that decline or evade answering the prompt

In [3]:
# Wait one second so the timestamp-based name differs from the first job's name
time.sleep(1)
job_name_rg = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=job_name_rg,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)
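
The two job requests above differ only in `inferenceConfig` and `metricNames`; the `evaluationConfig` structure is otherwise identical. As an optional refactoring sketch, the hypothetical helper `build_evaluation_config` below builds that shared dictionary from the same values used in this guide. The constant names are illustrative.

```python
RETRIEVAL_METRICS = [
    "Builtin.ContextRelevance",
    "Builtin.ContextCoverage",
]

GENERATION_METRICS = [
    "Builtin.Correctness",
    "Builtin.Completeness",
    "Builtin.Helpfulness",
    "Builtin.LogicalCoherence",
    "Builtin.Faithfulness",
]


def build_evaluation_config(dataset_uri, metric_names, evaluator_model_id):
    """Build the evaluationConfig dictionary used by both jobs in this guide."""
    return {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {"s3Uri": dataset_uri},
                },
                "metricNames": list(metric_names),
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model_id
                }]
            },
        }
    }
```

Either job could then pass `evaluationConfig=build_evaluation_config(input_data, RETRIEVAL_METRICS, evaluator_model)` (or `GENERATION_METRICS`) instead of repeating the nested dictionary.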

### Monitoring Job Progress

Track the status of your evaluation job:

In [None]:
# Get job ARN based on job type
evaluation_job_arn = retrieval_job['jobArn']  # or retrieve_generate_job['jobArn']

# Check job status
response = bedrock_client.get_evaluation_job(
    jobIdentifier=evaluation_job_arn 
)
print(f"Job Status: {response['status']}")
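
The call above reports the status once. To wait for a job to finish, the check can be repeated in a loop. The sketch below is one possible approach: `wait_for_evaluation_job` is a hypothetical helper, the polling interval and attempt limit are arbitrary defaults, and the set of terminal status strings is an assumption to verify against the `GetEvaluationJob` API reference.

```python
import time

# Assumed terminal values of the job's "status" field; confirm against the API reference.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(client, job_arn, poll_seconds=60, max_polls=120):
    """Poll get_evaluation_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        print(f"Job Status: {status}")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} status checks")
```

For example, `wait_for_evaluation_job(bedrock_client, evaluation_job_arn)` blocks until the job completes, fails, or is stopped.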

## Conclusion

In this guide, we've walked through the process of implementing Knowledge Base Evaluation using Amazon Bedrock. The feature enables organizations to:
- Assess retrieved contexts and generated responses against your own prompt dataset
- Evaluate several quality dimensions, such as correctness, completeness, and faithfulness, in a single job
- Systematically assess both retrieval and generation quality in RAG systems
- Scale evaluations to as many as 1000 prompts per job while maintaining consistent quality standards

Remember to follow the prerequisites and dataset requirements outlined above to ensure effective evaluation of your RAG applications.