# RAG Evaluation with Custom Metrics on Amazon Bedrock

## Introduction

Amazon Bedrock Evaluations now supports Custom Metrics for RAG (Retrieval-Augmented Generation) systems, enabling you to define specialized evaluation criteria tailored to your specific needs. This notebook demonstrates how to create and implement custom metrics for your RAG evaluation jobs, allowing you to measure unique aspects of your RAG system's performance beyond the built-in metrics.

Through this guide, we'll explore:
- Creating custom metrics for RAG evaluation with full configuration control
- Implementing retrieve-and-generate evaluation jobs with custom metrics
- Defining numerical and categorical scoring systems for your custom metrics
- Analyzing evaluation results with your specialized metrics alongside built-in metrics
- Monitoring evaluation progress and interpreting custom metric results

## Prerequisites

Before we begin, make sure you have:
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region
- An S3 bucket for storing evaluation data and results
- An IAM role with necessary permissions for S3 and Bedrock
- A dataset formatted according to the RAG evaluation requirements

> **Important**: The evaluation process requires access to Amazon Bedrock evaluator models. Make sure these are enabled in your account.
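For reference, a retrieve-and-generate evaluation dataset is a JSON Lines file with one record per line. The sketch below builds one such record, assuming the `conversationTurns` / `prompt` / `referenceResponses` layout described in the Bedrock RAG evaluation documentation. The helper names `make_rag_record` and `to_jsonl` and the sample question are illustrative only, so check the field names against the current documentation before uploading a dataset.

```python
import json
from typing import Optional


def make_rag_record(prompt: str, reference: Optional[str] = None) -> dict:
    """Build one dataset record (assumed schema; verify against the Bedrock docs)."""
    turn = {"prompt": {"content": [{"text": prompt}]}}
    if reference is not None:
        # Ground truth is optional; include it when the chosen metrics need it.
        turn["referenceResponses"] = [{"content": [{"text": reference}]}]
    return {"conversationTurns": [turn]}


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


# Hypothetical sample data, for illustration only.
sample_records = [
    make_rag_record(
        "What is the refund window?",
        "Refunds are accepted within 30 days.",
    )
]
jsonl_text = to_jsonl(sample_records)
print(jsonl_text)
```

The resulting text can be written to a local file and uploaded to the S3 location used as the job's input dataset.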

## Custom Metrics for RAG Evaluation

Custom metrics allow you to evaluate specific dimensions of your RAG system's performance beyond the default metrics. For example, you might want to evaluate:
- Information comprehensiveness
- Knowledge integration fidelity
- Information relevance
- Brand voice consistency
- Domain-specific accuracy criteria

Let's implement one of these, information comprehensiveness, using the AWS SDK for Python (Boto3).

## Implementation

First, let's upgrade Boto3 and confirm the installed version:

In [5]:
# Upgrade boto3 to the latest release
!pip install --upgrade boto3

In [1]:
# Verify boto3 installed successfully
import boto3
print(boto3.__version__)

1.37.36


To create a RAG evaluation job with the Python SDK, first set up the required configuration: the knowledge base ID, the model identifiers for the generator and the evaluators, an IAM role with the appropriate permissions, the S3 path of the input dataset, and the S3 output location for results.

In [None]:
import boto3
import time
from datetime import datetime

# Generate unique name for the job
job_name = f"rag-evaluation-custom-metrics-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
generator_model = "amazon.nova-lite-v1:0"
custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_BUCKET_NAME>"

# Specify S3 locations
input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
output_path = f"s3://{BUCKET_NAME}/evaluation_output/"

# Configure retrieval settings
num_results = 10
search_type = "HYBRID"

# Create Bedrock client
bedrock_client = boto3.client('bedrock', region_name='us-east-1')
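Before submitting a job, it can help to confirm that no `<YOUR_...>` placeholder was left in the configuration. The following is a minimal sketch of such a check. The helper `find_unset_placeholders` and the `example_settings` dictionary are hypothetical and not part of the Bedrock SDK; in the notebook you would pass the actual variables defined above.

```python
import re

# Matches template placeholders such as <YOUR_KB_ID> or <YOUR_BUCKET_NAME>.
PLACEHOLDER = re.compile(r"<YOUR_[A-Z_]+>")


def find_unset_placeholders(settings: dict) -> list:
    """Return the names of settings that still contain a <YOUR_...> placeholder."""
    return [
        name for name, value in settings.items()
        if PLACEHOLDER.search(str(value))
    ]


# Illustrative values only; the bucket name here is made up.
example_settings = {
    "knowledge_base_id": "<YOUR_KB_ID>",
    "role_arn": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>",
    "output_path": "s3://my-eval-bucket/evaluation_output/",
}

unset = find_unset_placeholders(example_settings)
if unset:
    print(f"Replace the placeholders in: {', '.join(unset)}")
```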

## Creating a Retrieval and Generation Evaluation Job with Custom Metrics

For this evaluation job, we'll use three key built-in metrics:
- `Builtin.Correctness`: Evaluates factual accuracy of generated responses
- `Builtin.Completeness`: Assesses if all relevant information is included  
- `Builtin.Helpfulness`: Measures how useful the response is

Additionally, we'll implement our custom metric:
- `information_comprehensiveness`: Evaluates how thoroughly the response utilizes retrieved information

In [6]:
# Define our custom information_comprehensiveness metric
information_comprehensiveness_metric = {
    "customMetricDefinition": {
        "name": "information_comprehensiveness",
        "instructions": """
Your role is to evaluate how comprehensively the response addresses the query using the retrieved information.
Assess whether the response provides a thorough treatment of the subject by effectively utilizing the available retrieved passages.

Carefully evaluate the comprehensiveness of the RAG response for the given query against all specified criteria. 
Assign a single overall score that best represents the comprehensiveness, and provide a brief explanation justifying your rating, referencing specific strengths and weaknesses observed.

When evaluating response comprehensiveness, consider the following rubrics:
- Coverage: Does the response utilize the key relevant information from the retrieved passages?
- Depth: Does the response provide sufficient detail on important aspects from the retrieved information?
- Context utilization: How effectively does the response leverage the available retrieved passages?
- Information synthesis: Does the response combine retrieved information to create a thorough treatment?

Evaluate using the following:

Query: {{prompt}}

Retrieved passages: {{context}}

Response to evaluate: {{prediction}}
""",
        "ratingScale": [
            {
                "definition": "Very comprehensive",
                "value": {
                    "floatValue": 3
                }
            },
            {
                "definition": "Moderately comprehensive",
                "value": {
                    "floatValue": 2
                }
            },
            {
                "definition": "Minimally comprehensive",
                "value": {
                    "floatValue": 1
                }
            },
            {
                "definition": "Not at all comprehensive",
                "value": {
                    "floatValue": 0
                }
            }
        ]
    }
}
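The metric above uses a numerical scale (`floatValue`). A categorical scale can be expressed with `stringValue` entries instead. The sketch below shows a hypothetical `brand_voice_consistency` metric built that way; the metric name, instructions, and labels are invented for illustration. The `scale_value_types` helper checks that a scale does not mix the two value types, which is an assumption about what the service accepts rather than a documented rule quoted here.

```python
def categorical_scale(labels: dict) -> list:
    """Turn {definition: label} pairs into a ratingScale list of stringValue entries."""
    return [
        {"definition": definition, "value": {"stringValue": label}}
        for definition, label in labels.items()
    ]


def scale_value_types(rating_scale: list) -> set:
    """Collect the value keys (floatValue / stringValue) used across a rating scale."""
    return {key for item in rating_scale for key in item["value"]}


# Hypothetical categorical metric, for illustration only.
brand_voice_metric = {
    "customMetricDefinition": {
        "name": "brand_voice_consistency",
        "instructions": (
            "Evaluate whether the response follows the brand voice guidelines.\n\n"
            "Query: {{prompt}}\n\n"
            "Response to evaluate: {{prediction}}\n"
        ),
        "ratingScale": categorical_scale({
            "Fully matches the brand voice": "consistent",
            "Matches the brand voice in places": "partially_consistent",
            "Does not match the brand voice": "inconsistent",
        }),
    }
}

scale = brand_voice_metric["customMetricDefinition"]["ratingScale"]
if len(scale_value_types(scale)) != 1:
    raise ValueError("Rating scale mixes floatValue and stringValue entries")
```

A metric defined this way would be passed in the `customMetrics` list and referenced by name in `metricNames`, in the same way as `information_comprehensiveness`.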

In [7]:
# Create the evaluation job
retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=retrieve_generate_job_name,
    jobDescription="Evaluate retrieval and generation with custom metric",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
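                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                # Apply the search type configured earlier (requires a vector store that supports it)
                                "overrideSearchType": search_type
                            }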
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "information_comprehensiveness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            },
            "customMetricConfig": {
                "customMetrics": [
                    information_comprehensiveness_metric
                ],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": custom_metrics_evaluator_model
                    }]
                }
            }
        }
    }
)

print(f"Created evaluation job: {retrieve_generate_job_name}")
print(f"Job ARN: {retrieve_generate_job['jobArn']}")

Created evaluation job: rag-evaluation-generate-2025-04-18-18-26-22
Job ARN: arn:aws:bedrock:us-east-1:968116482887:evaluation-job/sr0ocq5n2a6l


### Monitoring Job Progress
Track the status of your evaluation job:

In [9]:
# Get the ARN of the evaluation job created above
evaluation_job_arn = retrieve_generate_job['jobArn']

# Check job status
response = bedrock_client.get_evaluation_job(
    jobIdentifier=evaluation_job_arn 
)
print(f"Job Status: {response['status']}")

Job Status: Completed
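Evaluation jobs can take a while, so a single status check is often not enough. Below is a minimal polling sketch built on the same `get_evaluation_job` call. The function name `wait_for_evaluation_job`, the polling interval, and the timeout are illustrative choices, and the set of terminal statuses is an assumption based on the status values the API reports.

```python
import time

# Assumed terminal states; other statuses such as "InProgress" keep the loop going.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(client, job_arn, poll_seconds=60,
                            timeout_seconds=4 * 60 * 60, sleep=time.sleep):
    """Poll get_evaluation_job until the job reaches a terminal status or times out."""
    waited = 0
    while True:
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Job {job_arn} still in status {status} after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In the notebook, this could be called as `wait_for_evaluation_job(bedrock_client, evaluation_job_arn)`. The `sleep` parameter exists only so the wait can be replaced in tests.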


## Conclusion
This guide demonstrated how to implement Custom Metrics for RAG Evaluation on Amazon Bedrock. This feature allows organizations to:
- Create tailored evaluation criteria beyond standard metrics
- Define specialized scoring systems for unique business requirements
- Combine custom and built-in metrics for comprehensive RAG assessment
  
With these capabilities, you can systematically evaluate and optimize your RAG applications according to the dimensions that matter most for your specific use cases.