# LLM-as-a-Judge with Bring Your Own Inference Responses on Amazon Bedrock

## Introduction

Amazon Bedrock Model Evaluation capabilities now support "Bring Your Own Inference responses" (BYOI), allowing you to evaluate any model's outputs regardless of where they're hosted. This notebook demonstrates how to use LLM-as-a-Judge (LLMaJ) to evaluate model responses from any source - whether from other foundation model providers or your own deployed solutions.

Through this guide, we'll explore:
- Setting up evaluation configurations with BYOI
- Creating and configuring LLM-as-a-Judge evaluation jobs
- Monitoring evaluation progress and interpreting results
- Analyzing model performance across various dimensions

## Prerequisites

Before we begin, make sure you have:
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region
- An S3 bucket for storing evaluation data and results
- An IAM role with necessary permissions for S3 and Bedrock
- Model responses in the required BYOI format

> **Important**: The evaluation process requires access to Amazon Bedrock evaluator models. Make sure these are enabled in your account.

## Dataset Format for BYOI

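The evaluation dataset is a JSON Lines (JSONL) file. Each line must be a single JSON object in the following format (pretty-printed here for readability):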

```json
{
    "prompt": "What is the discount amount for a product with original price $80 and a 25% discount?",
    "referenceResponse": "The discount amount is $20.",
    "category": "Discount Calculation",
    "modelResponses": [
        {
            "response": "To calculate the discount amount: $80 × 25% = $80 × 0.25 = $20. The discount amount is $20.",
            "modelIdentifier": "third-party-model"
        }
    ]
}
```
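
If your responses already exist as prompt/response pairs, a small helper can wrap them in this structure. The following is a minimal sketch, not part of any SDK: `make_byoi_record` and `to_jsonl` are hypothetical names introduced here for illustration, and the field names are taken from the example above.

```python
import json


def make_byoi_record(prompt, response, model_identifier, reference=None, category=None):
    """Build one BYOI record (hypothetical helper; field names follow the example above)."""
    record = {"prompt": prompt}
    # Optional fields are only emitted when a value is supplied.
    if reference is not None:
        record["referenceResponse"] = reference
    if category is not None:
        record["category"] = category
    record["modelResponses"] = [
        {"response": response, "modelIdentifier": model_identifier}
    ]
    return record


def to_jsonl(records):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


example = make_byoi_record(
    prompt="What is the discount amount for a product with original price $80 and a 25% discount?",
    response="$80 x 0.25 = $20. The discount amount is $20.",
    model_identifier="third-party-model",
    reference="The discount amount is $20.",
    category="Discount Calculation",
)
jsonl_text = to_jsonl([example])
```

Writing `jsonl_text` to a `.jsonl` file gives a dataset with one record per line, ready to upload to S3.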

## Dataset Requirements

### Job Requirements
- Each evaluation job can evaluate one model at a time
- Maximum 1000 prompts per evaluation job

### Data Structure Requirements
- Must include `prompt` and `modelResponses` fields
- `modelIdentifier` in the response must match your source configuration
- `referenceResponse` is optional but recommended for most metrics
- `category` is optional for classification of results

> **Note**: When preparing your dataset, ensure your model identifier is consistent across all entries and matches the identifier you'll use when configuring the evaluation job.
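
These requirements can be checked locally before uploading anything. The sketch below is one possible pre-flight check, assuming only the rules listed above; `validate_byoi_jsonl` is a hypothetical helper, and it does not replace the validation Amazon Bedrock performs when the job is created.

```python
import json

MAX_PROMPTS_PER_JOB = 1000  # limit stated in the job requirements above


def validate_byoi_jsonl(lines, expected_identifier):
    """Return a list of problems found; an empty list means no issues were detected."""
    problems = []
    records = [line for line in lines if line.strip()]  # ignore blank lines
    if len(records) > MAX_PROMPTS_PER_JOB:
        problems.append(
            f"{len(records)} prompts exceeds the limit of {MAX_PROMPTS_PER_JOB}"
        )
    for number, line in enumerate(records, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"record {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"record {number}: expected a JSON object")
            continue
        if not record.get("prompt"):
            problems.append(f"record {number}: missing 'prompt'")
        responses = record.get("modelResponses")
        if not isinstance(responses, list) or not responses:
            problems.append(f"record {number}: 'modelResponses' must be a non-empty list")
            continue
        for entry in responses:
            if not isinstance(entry, dict) or "response" not in entry:
                problems.append(f"record {number}: entry is missing 'response'")
            elif entry.get("modelIdentifier") != expected_identifier:
                problems.append(
                    f"record {number}: modelIdentifier {entry.get('modelIdentifier')!r} "
                    f"does not match {expected_identifier!r}"
                )
    return problems
```

A typical use is to read the local `.jsonl` file, call `validate_byoi_jsonl(open(path).readlines(), "third-party-model")`, and only upload when the returned list is empty.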

## Implementation

Let's set up our configuration parameters to get started:

In [None]:
# Upgrade boto3 so the most recent Bedrock evaluation APIs are available
!pip install --upgrade boto3

In [None]:
# Verify boto3 installed successfully
import boto3
print(boto3.__version__)

In [None]:
import boto3
import json
import random
from datetime import datetime
import botocore
import time


# AWS Configuration
REGION = "us-east-1"
ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
PREFIX = "<YOUR_BUCKET_PREFIX>"
dataset_custom_name = "<YOUR_BYOI_DATASET_NAME>"

# Initialize AWS clients
bedrock_client = boto3.client('bedrock', region_name=REGION)
s3_client = boto3.client('s3', region_name=REGION)

## OPTIONAL - Generating Synthetic Data for BYOI Demonstration

To demonstrate the BYOI evaluation capability without requiring actual third-party model integration, we'll generate a synthetic dataset of shopping math problems. The code below uses Amazon Bedrock's Nova Lite model to simulate responses from a third-party model. This approach creates a realistic evaluation scenario while keeping the demonstration self-contained. Each problem involves discount calculations with a reference answer and a model-generated response, structured according to the required BYOI format.

In [None]:
def generate_shopping_problems(num_problems=30):
    # Reuse the region from the configuration cell instead of hardcoding it
    client = boto3.client('bedrock-runtime', region_name=REGION)
    items = ["apples", "oranges", "bananas", "books", "pencils", "notebooks"]
    problems = []
    
    for i in range(num_problems):
        # Generate problem data
        item = random.choice(items)
        quantity = random.randint(3, 20)
        price_per_item = round(random.uniform(1.5, 15.0), 2)
        discount_percent = random.choice([10, 15, 20, 25, 30])
        
        # Calculate answer
        total = quantity * price_per_item
        discount = total * (discount_percent / 100)
        final = round(total - discount, 2)
        
        # Create prompt
        prompt = f"If {item} cost ${price_per_item:.2f} each and you buy {quantity} of them with a {discount_percent}% discount, how much will you pay in total?"
        
        # Get Nova response
        try:
            body = json.dumps({
                "schemaVersion": "messages-v1",
                "messages": [{"role": "user", "content": [{"text": prompt}]}],
                "inferenceConfig": {"maxTokens": 300, "temperature": 0.3, "topP": 0.1, "topK": 20}
            })
            
            response = client.invoke_model(
                body=body,
                modelId="us.amazon.nova-lite-v1:0",
                accept='application/json',
                contentType='application/json'
            )
            
            # Read the streaming response body and parse it as JSON
            response_body = json.loads(response.get('body').read())
            model_response = response_body["output"]["message"]["content"][0]["text"]
        except Exception as e:
            model_response = f"Error calculating discount price: {str(e)[:50]}..."
        
        # Create problem object
        problems.append({
            "prompt": prompt,
            "referenceResponse": f"The total price will be ${final:.2f}. Original price: ${total:.2f} minus {discount_percent}% discount (${discount:.2f})",
            "category": "Shopping Math",
            "modelResponses": [{"response": model_response, "modelIdentifier": "third-party-model"}]
        })
        
        if i < num_problems - 1:
            time.sleep(0.5)
    
    return problems

# Generate problems and save to JSONL
# Note: this overrides the dataset name set in the configuration cell for the demo
dataset_custom_name = "dummy-data-BYOI"
problems = generate_shopping_problems()
with open(f"{dataset_custom_name}.jsonl", 'w') as f:
    for problem in problems:
        f.write(json.dumps(problem) + '\n')

In [None]:
def upload_to_s3(local_file: str, bucket: str, s3_key: str) -> bool:
    """
    Upload a file to S3 with error handling.
    
    Returns:
        bool: Success status
    """
    try:
        s3_client.upload_file(local_file, bucket, s3_key)
        print(f"✓ Successfully uploaded to s3://{bucket}/{s3_key}")
        return True
    except Exception as e:
        print(f"✗ Error uploading to S3: {str(e)}")
        return False

# Upload dataset
s3_key = f"{PREFIX}/{dataset_custom_name}.jsonl"
upload_success = upload_to_s3(f"{dataset_custom_name}.jsonl", BUCKET_NAME, s3_key)

if not upload_success:
    raise RuntimeError("Failed to upload dataset to S3")

## Creating an LLM-as-a-Judge Evaluation Job

Now that we have our dataset prepared and uploaded to S3, we need to create the evaluation job that will assess our model responses. The function below handles the creation of an LLM-as-a-judge evaluation job through the Bedrock API. This function configures all aspects of the evaluation, including selecting which metrics to evaluate, specifying the evaluator model that will act as judge, and most importantly, setting up the `precomputedInferenceSource` parameter that enables the Bring Your Own Inference capability. You can customize this function to select specific metrics relevant to your use case.

In [None]:
def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    dataset_name: str = None,
    task_type: str = "General"  # must be "General" for LLM-as-a-Judge jobs
):    
    # Built-in LLM-as-a-judge metrics to compute (trim this list to fit your use case)
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness", 
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "precomputedInferenceSource": {
                            "inferenceSourceIdentifier": "third-party-model"
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response
        
    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise
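
To evaluate only a subset of metrics, one option is to build the metric list separately and validate it before it reaches the API. The sketch below assumes the eleven metric names used in the function above; `select_metrics` is a hypothetical helper introduced for illustration, not part of boto3.

```python
# Metric names copied from the function above
BUILTIN_JUDGE_METRICS = [
    "Builtin.Correctness",
    "Builtin.Completeness",
    "Builtin.Faithfulness",
    "Builtin.Helpfulness",
    "Builtin.Coherence",
    "Builtin.Relevance",
    "Builtin.FollowingInstructions",
    "Builtin.ProfessionalStyleAndTone",
    "Builtin.Harmfulness",
    "Builtin.Stereotyping",
    "Builtin.Refusal",
]


def select_metrics(short_names):
    """Map short names such as 'Correctness' to full metric names, rejecting unknown ones."""
    available = {name.split(".", 1)[1].lower(): name for name in BUILTIN_JUDGE_METRICS}
    selected = []
    for short_name in short_names:
        key = short_name.lower()
        if key not in available:
            raise ValueError(
                f"Unknown metric {short_name!r}. Choose from: {sorted(available.values())}"
            )
        if available[key] not in selected:  # drop duplicates, keep request order
            selected.append(available[key])
    return selected


# Example: a smaller set that suits the shopping-math dataset
metrics_for_math = select_metrics(["Correctness", "completeness", "Helpfulness"])
```

The resulting list could then be passed as `metricNames`, for example by adding an optional `metric_names` parameter to `create_llm_judge_evaluation` that falls back to the full list.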

## Executing the Evaluation Job

The code below launches your evaluation workflow, selecting an appropriate evaluator model from Amazon Bedrock and configuring the job with your dataset of third-party model responses. When successful, the job ARN is returned, allowing you to track progress and access results.

In [None]:
# Job Configuration
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# S3 Paths
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}"

# Create evaluation job
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=ROLE_ARN,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise

In [None]:
# Retrieve the job ARN from the create_evaluation_job response
evaluation_job_arn = llm_as_judge_response['jobArn']

# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
print(f"Job Status: {check_status['status']}")
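
Evaluation jobs run asynchronously, so a single status check is rarely enough. The sketch below polls `get_evaluation_job` until the job finishes. It assumes that `Completed`, `Failed`, and `Stopped` are the terminal status values; `wait_for_evaluation_job` and its `poll_seconds`, `max_polls`, and `sleep` parameters are hypothetical names chosen for this example.

```python
import time

# Assumed terminal status values for an evaluation job
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(client, job_arn, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_evaluation_job until a terminal status is reached or polls run out."""
    for attempt in range(max_polls):
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        print(f"Job status: {status}")
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before the next check
    raise TimeoutError(f"Job {job_arn} did not finish after {max_polls} status checks")


# Usage with the objects defined earlier in this notebook:
# final_status = wait_for_evaluation_job(bedrock_client, evaluation_job_arn)
```

Once the job reports `Completed`, the results are written under the S3 output location configured for the job.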

## Conclusion

In this guide, we've demonstrated how to use LLM-as-a-Judge evaluation in Amazon Bedrock with Bring Your Own Inference responses to evaluate any model's outputs, regardless of source. Key benefits include:

- Platform-agnostic evaluation that works with any model or AI system
- Comprehensive assessment across multiple quality dimensions simultaneously
- Consistent benchmarking framework for comparing different AI implementations
- Scalable approach for evaluating up to 1,000 prompts per job, with larger datasets split across multiple jobs

By implementing regular evaluation workflows with BYOI, you can make data-driven decisions about model selection, fine-tuning, and deployment across your entire AI portfolio, whether running on Amazon Bedrock or elsewhere.