# RAG Evaluation with Bring Your Own Inference Responses (BYOI) on Amazon Bedrock

## Introduction

Amazon Bedrock RAG Evaluation now supports "Bring Your Own Inference Responses" (BYOI), which lets you assess any Retrieval-Augmented Generation system regardless of where it is deployed. This notebook demonstrates how to evaluate RAG system quality using specialized metrics, including the newly available citation metrics (Citation Precision and Citation Coverage), which show how effectively your system uses retrieved information.

Through this guide, we'll explore:
- Setting up RAG evaluation configurations with BYOI
- Creating retrieve-and-generate evaluation jobs
- Analyzing citation quality with the new precision and coverage metrics
- Monitoring evaluation progress

## Prerequisites

Before we begin, make sure you have:
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region
- An S3 bucket for storing evaluation data and results
- An IAM role with necessary permissions for S3 and Bedrock
- RAG system outputs in the required BYOI format

> **Important**: The evaluation process requires access to Amazon Bedrock evaluator models. Make sure these are enabled in your account.

## Dataset Format for RAG BYOI

### Retrieve-and-Generate Evaluation Format
```json
{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": "Your prompt here"
          }
        ]
      },
      "referenceResponses": [
        {
          "content": [
            {
              "text": "Expected ground truth answer"
            }
          ]
        }
      ],
      "output": {
        "text": "Generated response text",
        "knowledgeBaseIdentifier": "third-party-RAG",
        "retrievedPassages": {
          "retrievalResults": [
            {
              "name": "Optional passage name",
              "content": {
                "text": "Retrieved passage content"
              },
              "metadata": {
                "source": "Optional metadata"
              }
            }
          ]
        },
        "citations": [
          {
            "generatedResponsePart": {
              "textResponsePart": {
                "span": {
                  "start": 0,
                  "end": 50
                },
                "text": "Part of the response that uses cited material"
              }
            },
            "retrievedReferences": [
              {
                "name": "Optional passage name",
                "content": {
                  "text": "Source passage for the citation"
                },
                "metadata": {
                  "source": "Optional metadata"
                }
              }
            ]
          }
        ]
      }
    }
  ]
}
```
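
As a rough illustration of the format above, the following sketch assembles one record in Python and serializes it as a single line of a `.jsonl` file. The helper name `build_byoi_record` and the sample strings are hypothetical; only the field names come from the format shown above, and the one-record-per-line layout is assumed from the `.jsonl` extension used later in this notebook.

```python
import json


def build_byoi_record(prompt, reference, response, passages, rag_id="third-party-RAG"):
    """Assemble one retrieve-and-generate BYOI record (hypothetical helper)."""
    retrieval_results = [{"content": {"text": p}} for p in passages]
    return {
        "conversationTurns": [
            {
                "prompt": {"content": [{"text": prompt}]},
                "referenceResponses": [{"content": [{"text": reference}]}],
                "output": {
                    "text": response,
                    "knowledgeBaseIdentifier": rag_id,
                    "retrievedPassages": {"retrievalResults": retrieval_results},
                    # Simplification: cite the whole response against every
                    # retrieved passage. A real system would emit finer spans.
                    # Whether "end" is inclusive is not stated in the format
                    # above; len(response) is an assumption of this sketch.
                    "citations": [
                        {
                            "generatedResponsePart": {
                                "textResponsePart": {
                                    "span": {"start": 0, "end": len(response)},
                                    "text": response,
                                }
                            },
                            "retrievedReferences": retrieval_results,
                        }
                    ],
                },
            }
        ]
    }


# Made-up sample data, for illustration only.
record = build_byoi_record(
    prompt="What is the capital of France?",
    reference="Paris",
    response="The capital of France is Paris.",
    passages=["Paris is the capital and largest city of France."],
)

# json.dumps emits no newlines by default, so the record fits on one line.
jsonl_line = json.dumps(record)
print(jsonl_line)
```

Writing one such line per evaluated prompt to a file, then uploading that file to your S3 bucket, would produce a dataset in the shape this notebook expects.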

## Implementation

First, upgrade boto3 and confirm the installed version:

In [None]:
# Upgrade boto3
!pip install --upgrade boto3

In [None]:
# Verify boto3 installed successfully
import boto3
from datetime import datetime
print(boto3.__version__)

To create a RAG evaluation job with your own inference responses using the Python SDK, first set up the required configuration: the evaluator model identifier, an IAM role with appropriate permissions, the S3 path of the input data containing your inference responses, and the output location for results.

In [None]:
# Configure evaluator model, IAM role, and dataset settings
evaluator_model = "<YOUR_EVALUATOR_MODEL>"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"
RAG_dataset_custom_name = "<YOUR_RAG_BYOI_DATASET_NAME>"  # without the ".jsonl" file extension

# Specify S3 locations
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

## Configuring a Retrieve and Generate RAG Evaluation Job with BYOI

The code below creates an evaluation job that analyzes both retrieval and generation quality from your RAG system. The most significant aspect is the `precomputedRagSourceConfig` parameter, which enables the Bring Your Own Inference capability. This configuration tells Bedrock to evaluate pre-generated responses rather than generating new ones.

Note how we're configuring a rich set of evaluation metrics, including the new citation metrics:

- **CitationPrecision**: Measures how accurately your RAG system cites sources by evaluating whether cited passages actually contain the information used in the response
- **CitationCoverage**: Evaluates how well the response's content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited

The `ragSourceIdentifier` parameter must match the identifier in your dataset (in this example, "third-party-RAG"), creating the link between your evaluation configuration and the responses you've provided. The job will analyze your RAG system's performance across multiple dimensions, providing comprehensive insights into both information retrieval accuracy and generation quality.
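
Because a mismatch between the two identifiers is easy to introduce, it may be worth checking the dataset before submitting the job. The sketch below scans JSONL lines and reports any `knowledgeBaseIdentifier` value that differs from the expected one. The function name `find_identifier_mismatches` and the inline sample lines are hypothetical; in practice you would read the lines from a local copy of your dataset file.

```python
import json


def find_identifier_mismatches(lines, expected_id):
    """Return (line_number, turn_index, found_id) for every conversation turn
    whose knowledgeBaseIdentifier differs from expected_id (hypothetical helper)."""
    mismatches = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for turn_index, turn in enumerate(record.get("conversationTurns", [])):
            found_id = turn.get("output", {}).get("knowledgeBaseIdentifier")
            if found_id != expected_id:
                mismatches.append((line_number, turn_index, found_id))
    return mismatches


# Two made-up records; the second uses a different identifier on purpose.
sample_lines = [
    json.dumps({"conversationTurns": [
        {"output": {"text": "a", "knowledgeBaseIdentifier": "third-party-RAG"}}
    ]}),
    json.dumps({"conversationTurns": [
        {"output": {"text": "b", "knowledgeBaseIdentifier": "other-RAG"}}
    ]}),
]

mismatches = find_identifier_mismatches(sample_lines, "third-party-RAG")
print(mismatches)  # [(2, 0, 'other-RAG')]
```

An empty result suggests every record carries the identifier you plan to pass as `ragSourceIdentifier`.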

In [None]:
retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

retrieve_generate_job = bedrock_client.create_evaluation_job(
    jobName=retrieve_generate_job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [
            {
                "precomputedRagSourceConfig": {
                    "retrieveAndGenerateSourceConfig": {
                        "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                    }
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",  
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness",
                    "Builtin.CitationPrecision",
                    "Builtin.CitationCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

## Monitoring Your RAG Evaluation Jobs

After submitting your evaluation job, you'll want to monitor its progress. You can run the code below periodically to track the job through its lifecycle. Typical status values include "InProgress", "Completed", and "Failed". Once the job reaches "Completed", you can retrieve and analyze the evaluation results from the S3 output location you specified when creating the job.

In [None]:
# Check status of retrieve-and-generate job
retrieve_generate_job_arn = retrieve_generate_job['jobArn']
retrieve_generate_status = bedrock_client.get_evaluation_job(jobIdentifier=retrieve_generate_job_arn)
print(f"Retrieve-and-Generate Job Status: {retrieve_generate_status['status']}")
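
Rather than re-running the status cell by hand, you could wrap the check in a polling loop. The sketch below is one way to do that; `wait_for_job`, its parameters, and the set of terminal statuses are assumptions of this example rather than part of the Bedrock SDK. It accepts any zero-argument callable that returns the current status string, and it compares statuses case-insensitively with underscores removed so that it does not depend on the exact casing the API returns.

```python
import time


def normalize(status):
    """Lower-case a status and drop underscores, e.g. 'IN_PROGRESS' -> 'inprogress'."""
    return status.replace("_", "").lower()


# Assumed terminal states; adjust if your jobs report other values.
TERMINAL_STATUSES = {"completed", "failed", "stopped"}


def wait_for_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status, then return it.

    Raises TimeoutError if no terminal status is seen within max_polls checks.
    """
    for _ in range(max_polls):
        status = get_status()
        if normalize(status) in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job did not finish after {max_polls} status checks")


# Demonstration with a fake status source instead of a real Bedrock call.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)  # Completed
```

With the client from this notebook, the callable could be `lambda: bedrock_client.get_evaluation_job(jobIdentifier=retrieve_generate_job_arn)['status']`.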

## Conclusion

In this guide, we've explored how to leverage Amazon Bedrock RAG Evaluation capabilities with Bring Your Own Inference Responses to assess any RAG system's performance. Key advantages of this approach include:

- **Platform independence**: Evaluate RAG systems deployed anywhere, whether on Amazon Bedrock, with other cloud providers, or on-premises
- **Comprehensive assessment**: Analyze both retrieval and generation quality with specialized metrics
- **Citation quality insights**: Leverage the new citation metrics to ensure responses are properly grounded in source information
- **Systematic benchmarking**: Compare different RAG implementations to make data-driven optimization decisions

By implementing regular evaluation workflows using these capabilities, you can continuously improve your RAG systems to deliver more accurate, relevant, and well-attributed responses. Whether you're fine-tuning retrieval strategies, optimizing prompt engineering, or exploring different foundation models for generation, these evaluation tools provide the quantitative insights needed to guide your development process.