# LLM as a Judge Model Evaluation with Custom Metrics on Amazon Bedrock

## Introduction

Amazon Bedrock lets you evaluate foundation models with custom metrics: you write the judging instructions and rating scale, and an evaluator model (an LLM acting as a judge) scores each response against them. This notebook demonstrates how to define a custom metric and run a model evaluation job that reports it alongside the built-in metrics, so you can measure aspects of model performance specific to your use case.

Through this guide, we'll explore:
- Creating custom metrics for foundation model evaluation
- Implementing model evaluation jobs with your specialized metrics
- Defining numerical and categorical scoring systems tailored to your requirements
- Analyzing evaluation results with your custom metrics alongside built-in metrics
- Monitoring evaluation progress and interpreting results

## Prerequisites
Before we begin, make sure you have:

### AWS Account and Model Access
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region
- [Access to your selected evaluator and generator models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html) enabled in Amazon Bedrock (verify on the [Model access page](https://console.aws.amazon.com/bedrock/home#/modelaccess) of the Amazon Bedrock console)
- Confirmation of the [AWS Regions where the models are available](https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html) and their [quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html)

### IAM and S3 Configuration
- [An IAM role with necessary permissions](https://docs.aws.amazon.com/bedrock/latest/userguide/judge-service-roles.html) for S3 and Bedrock
- An [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) configured with permissions for reading the input dataset and writing output data
- [CORS enabled on your S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enabling-cors-examples.html)

### Additional Requirements
- A dataset formatted according to the [model evaluation requirements](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets-judge.html)
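
As a rough illustration of the dataset requirement, the sketch below builds a small JSON Lines dataset in memory. The field names `prompt`, `referenceResponse`, and `category` are assumed from the linked dataset requirements, and the sample records are hypothetical; check the linked page for the authoritative schema before running a job.

```python
import json

# Hypothetical sample records; replace them with prompts from your own use case.
# Field names are assumed from the linked dataset requirements page.
records = [
    {
        "prompt": "Summarize the benefits of unit testing in two sentences.",
        "referenceResponse": "Unit tests catch regressions early and document intended behavior.",
        "category": "summarization",
    },
    {
        "prompt": "What is the capital of France?",
        "referenceResponse": "Paris",
        "category": "qa",
    },
]


def to_jsonl(rows):
    """Serialize each record as one JSON object per line (JSON Lines)."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"


jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Saving `jsonl_text` as `input.jsonl` and uploading it under `evaluation_data/` in your bucket would match the input location configured later in this notebook.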


> **Important**: The evaluation process requires access to Amazon Bedrock evaluator models. Ensure these are enabled in your account.

## Custom Metrics for Model Evaluation

Custom metrics allow you to evaluate specific dimensions of model performance that standard metrics might not capture. For example, you might want to evaluate:
- Response creativity and originality
- Domain-specific accuracy
- Style and tone consistency
- Task-specific requirements
- Business-aligned performance indicators

Let's implement a custom evaluation using the AWS SDK for Python (Boto3).

## Implementation

First, let's upgrade Boto3 and set up our configuration parameters:

In [None]:
# Upgrade boto3 to the latest version
!pip install --upgrade boto3

In [None]:
# Verify boto3 installed successfully
import boto3
print(boto3.__version__)

In [None]:
import boto3
import time
from datetime import datetime

# Configure model and IAM settings
evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # judge for the built-in metrics
generator_model = "amazon.nova-lite-v1:0"  # model whose responses are evaluated
custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # judge for the custom metrics
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_BUCKET_NAME>"

# Specify S3 locations
input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
output_path = f"s3://{BUCKET_NAME}/evaluation_output/"

# Create Bedrock client
bedrock_client = boto3.client('bedrock', region_name='us-east-1')

## Creating a Model Evaluation Job with Custom Metrics

For this evaluation job, we'll use several key built-in metrics:
- `Builtin.Correctness`: Evaluates factual accuracy of model responses
- `Builtin.Completeness`: Assesses if all relevant information is included
- `Builtin.Coherence`: Measures how logical and well-structured the response is
- `Builtin.Relevance`: Assesses if the response directly addresses the input prompt
- `Builtin.FollowingInstructions`: Evaluates how well the model follows given instructions

Additionally, we'll implement our custom metric:
- `comprehensiveness`: Evaluates how thorough and complete the model's response is

In [None]:
comprehensiveness_metric = {
    "customMetricDefinition": {
        "name": "comprehensiveness",
        "instructions": """Your role is to judge the comprehensiveness of an answer based on the question and the prediction. Assess the quality, accuracy, and helpfulness of the language model response, and use these to judge how comprehensive the response is. Award higher scores to responses that are detailed and thoughtful.

Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt) against all specified criteria. Assign a single overall score that best represents the comprehensiveness, and provide a brief explanation justifying your rating, referencing specific strengths and weaknesses observed.

When evaluating the response quality, consider the following rubrics:
- Accuracy: Factual correctness of information provided
- Completeness: Coverage of important aspects of the query
- Clarity: Clear organization and presentation of information
- Helpfulness: Practical utility of the response to the user

Evaluate the following:

Query:
{{prompt}}

Response to evaluate:
{{prediction}}""",
        "ratingScale": [
            {
                "definition": "Very comprehensive",
                "value": {
                    "floatValue": 10
                }
            },
            {
                "definition": "Mildly comprehensive",
                "value": {
                    "floatValue": 3
                }
            },
            {
                "definition": "Not at all comprehensive",
                "value": {
                    "floatValue": 1
                }
            }
        ]
    }
}
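
The metric above uses a numerical scale (`floatValue`). A categorical scale can be sketched the same way with `stringValue` labels. The metric name, instructions, and labels below are hypothetical, and `check_metric` is an illustrative local helper, not part of Boto3; its rule that a scale should not mix numerical and categorical values is an assumption to verify against the Bedrock documentation.

```python
# Hypothetical categorical metric; the name, instructions, and labels are illustrative.
tone_metric = {
    "customMetricDefinition": {
        "name": "professionalism",
        "instructions": (
            "You are judging whether a response uses a professional tone.\n\n"
            "Query:\n{{prompt}}\n\n"
            "Response to evaluate:\n{{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Professional", "value": {"stringValue": "professional"}},
            {"definition": "Neutral", "value": {"stringValue": "neutral"}},
            {"definition": "Unprofessional", "value": {"stringValue": "unprofessional"}},
        ],
    }
}


def check_metric(metric):
    """Return a list of problems found in a custom metric definition (empty if none)."""
    definition = metric["customMetricDefinition"]
    problems = []
    # The judge prompt should reference the model output so there is something to score.
    if "{{prediction}}" not in definition["instructions"]:
        problems.append("instructions do not reference {{prediction}}")
    # Each rating should carry exactly one value type (assumed convention).
    value_types = set()
    for item in definition["ratingScale"]:
        keys = set(item["value"])
        if len(keys) != 1 or not keys <= {"floatValue", "stringValue"}:
            problems.append(f"invalid value for rating {item['definition']!r}")
        value_types |= keys
    # A scale mixing numbers and labels is flagged (assumed convention).
    if len(value_types) > 1:
        problems.append("rating scale mixes numerical and categorical values")
    return problems


print(check_metric(tone_metric))
```

Running a check like this before submitting a job can catch a missing template variable locally instead of after the job has started.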

In [None]:
# Create the model evaluation job
model_eval_job_name = f"model-evaluation-custom-metrics-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

model_eval_job = bedrock_client.create_evaluation_job(
    jobName=model_eval_job_name,
    jobDescription="Evaluate model performance with custom comprehensiveness metric",
    roleArn=role_arn,
    applicationType="ModelEvaluation",
    inferenceConfig={
        "models": [{
            "bedrockModel": {
                "modelIdentifier": generator_model
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "ModelEvalDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Coherence",
                    "Builtin.Relevance",
                    "Builtin.FollowingInstructions",
                    "comprehensiveness"
                ]
            }],
            "customMetricConfig": {
                "customMetrics": [
                    comprehensiveness_metric
                ],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": custom_metrics_evaluator_model
                    }]
                }
            },
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

print(f"Created model evaluation job: {model_eval_job_name}")
print(f"Job ARN: {model_eval_job['jobArn']}")

### Monitoring Job Progress
Track the status of your evaluation job:

In [None]:
# Get the ARN of the model evaluation job created above
evaluation_job_arn = model_eval_job['jobArn']

# Check job status
response = bedrock_client.get_evaluation_job(
    jobIdentifier=evaluation_job_arn
)
print(f"Job Status: {response['status']}")
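
The status check above returns a single snapshot. A minimal polling sketch, assuming `Completed`, `Failed`, and `Stopped` are the terminal statuses, might look like the following. `wait_for_job` is a hypothetical helper, not a Boto3 API; it takes any callable that returns the current status, so it can be exercised without AWS access.

```python
import time

# Statuses after which the job no longer changes (assumed from the evaluation job API).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status, poll_seconds=60, timeout_seconds=4 * 60 * 60, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example wiring for this notebook (requires AWS credentials, so it is left commented out):
# final_status = wait_for_job(
#     lambda: bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)["status"]
# )
# print(f"Final status: {final_status}")
```

Passing the status lookup as a callable keeps the loop independent of the client, and the `sleep` parameter lets you substitute a no-op when testing. Evaluation jobs can take a while, so choose the polling interval and timeout to suit your dataset size.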

## Conclusion

This guide demonstrated how to implement Custom Metrics for Model Evaluation on Amazon Bedrock. This powerful feature allows organizations to:

- Create custom evaluation metrics beyond standard benchmarks
- Define specialized scoring systems aligned with specific business requirements
- Combine custom and built-in metrics for comprehensive model assessment
- Evaluate models based on domain-specific criteria that matter to your use case
- Generate consistent, comparable results to track model improvements over time

With these capabilities, you can systematically evaluate and optimize your foundation models according to the dimensions that matter most for your specific applications, ensuring they meet your quality standards and business objectives.