## Amazon Bedrock Model-as-a-Judge Evaluation Guide

> Original source: [aws-samples-notebook](https://github.com/aws-samples/amazon-bedrock-samples/blob/main/evaluation-observe/bedrock-llm-as-judge-evaluation/model-as-a-judge.ipynb)

### Model Evaluation Support by Region

See the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-support.html) for the Regions and models that support model evaluation.

### Introduction

This notebook demonstrates how to use Amazon Bedrock's Model-as-a-Judge feature for systematic model evaluation. The Model-as-a-Judge approach uses a foundation model to score another model's responses and provide explanations for the scores. The guide covers creating evaluation datasets, running evaluations, and comparing different foundation models.

### Contents

1. [Setup and Configuration](#setup)
2. [Dataset Generation](#dataset)
3. [S3 Integration](#s3)
4. [Single Model Evaluation](#single)
5. [Model Selection and Comparison](#comparison)
6. [Monitoring and Results](#monitoring)

### Prerequisites

- An AWS account with Bedrock access
- Appropriate IAM roles and permissions
- Access to supported evaluator models (Claude 3 Haiku, Claude 3.5 Sonnet, Mistral Large, or Meta Llama 3.1)
- An S3 bucket for storing evaluation data

In [None]:
import boto3
import json
import random
from botocore.exceptions import ClientError
from datetime import datetime
from typing import List, Dict, Any, Optional

### Environment Setup <a name="setup"></a>
> Replace these values with your own (e.g. the bucket name).

In [None]:
# AWS Configuration
REGION = "us-west-2"
BUCKET_NAME = "genai-carlos-contreras-bucket-data-quarks-labs-oregon-01"
PREFIX = "labs/model-evaluation"
dataset_custom_name = "fruit-discounts-data"

# Initialize AWS clients
bedrock_client = boto3.client('bedrock', region_name=REGION)
s3_client = boto3.client('s3', region_name=REGION)

#### Create IAM Role for this lab

In [None]:
# Define the IAM role name
import string
random_string = ''.join(random.choices(string.ascii_letters + string.digits, k=6))
role_name = 'AdminOperBedrockFullAccess-GenAi-' + random_string

In [None]:
# Create IAM and S3 clients
iam = boto3.client('iam')
s3 = boto3.client('s3')

# Check if the role already exists
try:
    existing_role = iam.get_role(RoleName=role_name)
    print(f"IAM role '{role_name}' already exists.")
    ROLE_ARN = existing_role['Role']['Arn']
except ClientError as e:
    if e.response['Error']['Code'] == 'NoSuchEntity':
        # Role doesn't exist, create it
        trust_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "bedrock.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }

        # Create the IAM role
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
            Description='IAM role for Bedrock with full access and S3 read/write access'
        )

        # Attach the AmazonBedrockFullAccess managed policy.
        # IMPORTANT: This grants full access to Bedrock, so only for demo purposes.
        iam.attach_role_policy(
            RoleName=role_name,
            PolicyArn='arn:aws:iam::aws:policy/AmazonBedrockFullAccess'
        )

        # Define the S3 access policy
        s3_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:GetObject",
                        "s3:PutObject",
                        "s3:ListBucket"
                    ],
                    "Resource": [
                        f"arn:aws:s3:::{BUCKET_NAME}",
                        f"arn:aws:s3:::{BUCKET_NAME}/*"
                    ]
                }
            ]
        }

        # Define the policy name
        policy_name = f'{role_name}S3Policy'

        # Check if the policy already exists
        try:
            existing_policy = iam.get_policy(PolicyArn=f'arn:aws:iam::{boto3.client("sts").get_caller_identity()["Account"]}:policy/{policy_name}')
            print(f"Policy '{policy_name}' already exists.")
            policy_arn = existing_policy['Policy']['Arn']
        except iam.exceptions.NoSuchEntityException:
            # Policy doesn't exist, create it
            s3_policy_response = iam.create_policy(
                PolicyName=policy_name,
                PolicyDocument=json.dumps(s3_policy)
            )
            policy_arn = s3_policy_response['Policy']['Arn']
            print(f"Policy '{policy_name}' created successfully.")

        # Attach the S3 access policy to the role
        iam.attach_role_policy(
            RoleName=role_name,
            PolicyArn=policy_arn
        )

        ROLE_ARN = response['Role']['Arn']
        print(f"IAM role '{role_name}' created successfully with AmazonBedrockFullAccess and S3 access policies attached.")
    else:
        # If there's an error other than NoSuchEntity, re-raise it
        raise

print(f"Role ARN: {ROLE_ARN}")
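
The trust policy above lets the Bedrock service assume the role without any further restriction. A tighter variant can limit it to evaluation jobs in your own account through the `aws:SourceAccount` and `aws:SourceArn` condition keys. The following is a minimal sketch, not part of the original notebook: the helper name `build_scoped_trust_policy` and the account ID are hypothetical, and the evaluation-job ARN pattern is an assumption to verify against the Bedrock service role documentation.

```python
import json


def build_scoped_trust_policy(account_id: str, region: str) -> dict:
    """Hypothetical helper: trust policy limited to Bedrock evaluation jobs in one account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Only requests made on behalf of this account may assume the role
                    "StringEquals": {"aws:SourceAccount": account_id},
                    # Assumed ARN pattern for evaluation jobs; verify before use
                    "ArnEquals": {
                        "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:evaluation-job/*"
                    },
                },
            }
        ],
    }


# Placeholder account ID, for illustration only
scoped_policy = build_scoped_trust_policy("111122223333", "us-west-2")
print(json.dumps(scoped_policy, indent=2))
```

If the conditions suit your setup, the resulting dictionary could be passed to `iam.create_role` in place of `trust_policy`.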

## Dataset Generation <a name="dataset"></a>

We'll create a simple dataset of mathematical reasoning problems. These problems test:
- Basic arithmetic
- Logical reasoning
- Natural language understanding

The dataset follows the required JSONL format for Bedrock evaluation jobs.

In [None]:
import random
import json

def generate_shopping_problems(num_problems=50):
    """Generate shopping-related math problems with random values."""
    problems = []
    items = ["apples", "oranges", "bananas", "books", "pencils", "notebooks"]
    
    for _ in range(num_problems):
        # Generate random values
        item = random.choice(items)
        quantity = random.randint(3, 20)
        price_per_item = round(random.uniform(1.5, 15.0), 2)
        discount_percent = random.choice([10, 15, 20, 25, 30])
        
        # Calculate the answer
        total_price = quantity * price_per_item
        discount_amount = total_price * (discount_percent / 100)
        final_price = round(total_price - discount_amount, 2)
        
        # Create the problem
        # Format money with two decimals so raw floats such as 23.400000000000002
        # never leak into the prompt or the reference response
        problem = {
            "prompt": f"If {item} cost ${price_per_item:.2f} each and you buy {quantity} of them with a {discount_percent}% discount, how much will you pay in total?",
            "category": "Shopping Math",
            "referenceResponse": f"The total price will be ${final_price:.2f}. Original price: ${total_price:.2f} minus {discount_percent}% discount (${discount_amount:.2f})"
        }
        
        problems.append(problem)
    
    return problems


def save_to_jsonl(problems, output_file):
    """Save the problems to a JSONL file."""
    with open(output_file, 'w') as f:
        for problem in problems:
            f.write(json.dumps(problem) + '\n')

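SAMPLE_SIZE = 30
problems = generate_shopping_problems(SAMPLE_SIZE)

# Create the local output directory first; open() fails if it does not exist
import os
os.makedirs("evaluation", exist_ok=True)
save_to_jsonl(problems, f"evaluation/{dataset_custom_name}.jsonl")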
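
Before uploading, it can help to check that every line of the file is a standalone JSON object with a non-empty `prompt`. The sketch below is an illustration under assumptions: `validate_jsonl_lines` is a hypothetical helper, and it checks only the keys used in this notebook rather than the full Bedrock dataset specification.

```python
import json
from typing import List


def validate_jsonl_lines(lines: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means every line passed."""
    errors = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            errors.append(f"line {line_no}: empty line")
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        prompt = record.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            errors.append(f"line {line_no}: missing or empty 'prompt'")
    return errors


sample_lines = [
    json.dumps({
        "prompt": "If apples cost $2.00 each and you buy 3 of them, how much will you pay?",
        "category": "Shopping Math",
        "referenceResponse": "The total price will be $6.00.",
    }),
    json.dumps({"category": "Shopping Math"}),  # deliberately missing 'prompt'
]
validation_errors = validate_jsonl_lines(sample_lines)
print(validation_errors)
```

To check the generated file, read it with `open(...).read().splitlines()` and pass the result to the helper.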

## S3 Integration <a name="s3"></a>

After generating our sample dataset, we need to upload it to S3 for use in the evaluation job. 
We'll use the boto3 S3 client to upload our JSONL file.

> **Note**: Make sure the IAM identity running this notebook has `s3:PutObject` permission on the target bucket.

In [None]:
def upload_to_s3(local_file: str, bucket: str, s3_key: str) -> bool:
    """
    Upload a file to S3 with error handling.
    
    Returns:
        bool: Success status
    """
    try:
        s3_client.upload_file(local_file, bucket, s3_key)
        print(f"✓ Successfully uploaded to s3://{bucket}/{s3_key}")
        return True
    except Exception as e:
        print(f"✗ Error uploading to S3: {str(e)}")
        return False

# Upload dataset
s3_key = f"{PREFIX}/{dataset_custom_name}.jsonl"
upload_success = upload_to_s3(f"evaluation/{dataset_custom_name}.jsonl", BUCKET_NAME, s3_key)

if not upload_success:
    raise Exception("Failed to upload dataset to S3")
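
A successful `upload_file` call can be double-checked by asking S3 for the object's metadata. This is a minimal sketch, assuming a client that exposes boto3's `head_object` call. The helper `get_s3_object_size` and the `_FakeS3Client` stand-in are hypothetical and exist only so the example runs without AWS access.

```python
from typing import Any, Optional


def get_s3_object_size(client: Any, bucket: str, key: str) -> Optional[int]:
    """Return the object's size in bytes, or None if the lookup fails."""
    try:
        response = client.head_object(Bucket=bucket, Key=key)
        return response["ContentLength"]
    except Exception as exc:  # a real boto3 client raises ClientError here
        print(f"Could not read s3://{bucket}/{key}: {exc}")
        return None


class _FakeS3Client:
    """Stand-in for a boto3 S3 client, used only for this illustration."""

    def head_object(self, Bucket: str, Key: str) -> dict:
        if Key.endswith(".jsonl"):
            return {"ContentLength": 1234}
        raise RuntimeError("Not Found")


found_size = get_s3_object_size(_FakeS3Client(), "example-bucket", "labs/data.jsonl")
missing_size = get_s3_object_size(_FakeS3Client(), "example-bucket", "labs/missing.txt")
print(found_size, missing_size)
```

In this notebook the call would be `get_s3_object_size(s3_client, BUCKET_NAME, s3_key)`.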

## Evaluation Job Configuration

Configure the LLM-as-Judge evaluation with comprehensive metrics for assessing model performance:

| Metric Category | Description |
|----------------|-------------|
| Quality | Correctness, Completeness, Faithfulness |
| User Experience | Helpfulness, Coherence, Relevance |
| Instructions | Following Instructions, Professional Style and Tone |
| Safety | Harmfulness, Stereotyping, Refusal |

In [None]:
def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    generator_model_id: str,
    dataset_name: Optional[str] = None,
    task_type: str = "General"  # must be General for LLMaaJ
):
    """Create a Bedrock LLM-as-a-judge evaluation job and return the API response."""
    # All available LLM-as-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness", 
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "bedrockModel": {
                            "modelIdentifier": generator_model_id
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response
        
    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise
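
The function above always requests all eleven metrics. For a narrow dataset such as arithmetic problems, a subset may be enough. The sketch below groups the metric names by the categories from the table and flattens the ones you ask for. `METRICS_BY_CATEGORY` and `select_metrics` are hypothetical names, and using the result would mean adding a metrics parameter to `create_llm_judge_evaluation`, which the original function does not have.

```python
from typing import Dict, List

# Metric names copied from the function above, grouped by the table's categories
METRICS_BY_CATEGORY: Dict[str, List[str]] = {
    "Quality": ["Builtin.Correctness", "Builtin.Completeness", "Builtin.Faithfulness"],
    "User Experience": ["Builtin.Helpfulness", "Builtin.Coherence", "Builtin.Relevance"],
    "Instructions": ["Builtin.FollowingInstructions", "Builtin.ProfessionalStyleAndTone"],
    "Safety": ["Builtin.Harmfulness", "Builtin.Stereotyping", "Builtin.Refusal"],
}


def select_metrics(categories: List[str]) -> List[str]:
    """Flatten the metric names for the requested categories, preserving order."""
    selected: List[str] = []
    for category in categories:
        if category not in METRICS_BY_CATEGORY:
            raise ValueError(f"Unknown metric category: {category}")
        selected.extend(METRICS_BY_CATEGORY[category])
    return selected


chosen_metrics = select_metrics(["Quality", "Safety"])
print(chosen_metrics)
```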

## Single Model Evaluation <a name="single"></a>

First, let's run a single evaluation job using Claude 3 Haiku as both generator and evaluator.

### Note:
⚠️ Confirm that the model is supported for model evaluation [here](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-support.html).

In [None]:
# Job Configuration
evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
generator_model = "anthropic.claude-3-haiku-20240307-v1:0"

# Job Name
job_name = f"llmaaj-{generator_model.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# S3 Paths
input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
output_path = f"s3://{BUCKET_NAME}/{PREFIX}"

In [None]:
# Create evaluation job
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=ROLE_ARN,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        generator_model_id=generator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise

### Monitoring Job Progress
Track the status of your evaluation job:

In [None]:
# Get the job ARN from the create response
evaluation_job_arn = llm_as_judge_response['jobArn']

# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)
print(f"Job Status: {check_status['status']}")

## Model Selection and Comparison <a name="comparison"></a>

Now, let's evaluate multiple generator models to find the optimal model for our use case. We'll compare different foundation models while using a consistent evaluator.

### IMPORTANT 🚨
> Confirm that your account has access, in this Region, to each model you plan to use in the evaluation (Bedrock -> Model Access).
- For example, access to DeepSeek

In [None]:
GENERATOR_MODELS = [
    "anthropic.claude-3-haiku-20240307-v1:0",
    "anthropic.claude-3-5-haiku-20241022-v1:0",
]

# Consistent Evaluator
EVALUATOR_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def run_model_comparison(
    generator_models: List[str],
    evaluator_model: str
) -> List[Dict[str, Any]]:
    evaluation_jobs = []
    
    for generator_model in generator_models:
        job_string = generator_model.replace('.', '-').replace(':', '-')
        job_name = f"llmaaj-{job_string}-{datetime.now().strftime('%Y%m%d-%H%M')}"
        
        try:
            response = create_llm_judge_evaluation(
                client=bedrock_client,
                job_name=job_name,
                role_arn=ROLE_ARN,
                input_s3_uri=input_data,
                output_s3_uri=f"{output_path}/{job_name}/",
                evaluator_model_id=evaluator_model,
                generator_model_id=generator_model,
                task_type="General"
            )
            
            job_info = {
                "job_name": job_name,
                "job_arn": response["jobArn"],
                "generator_model": generator_model,
                "evaluator_model": evaluator_model,
                "status": "CREATED"
            }
            evaluation_jobs.append(job_info)
            
            print(f"✓ Created job: {job_name}")
            print(f"  Generator: {generator_model}")
            print(f"  Evaluator: {evaluator_model}")
            print("-" * 80)
            
        except Exception as e:
            print(f"✗ Error with {generator_model}: {str(e)}")
            continue
            
    return evaluation_jobs

# Run model comparison
evaluation_jobs = run_model_comparison(GENERATOR_MODELS, EVALUATOR_MODEL)
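
Job names built from full model IDs can get long. The sketch below normalizes a model ID into a lowercase, hyphen-separated name and trims it to a maximum length. The 63-character limit and the allowed character set are assumptions to check against the `CreateEvaluationJob` reference, and `make_job_name` is a hypothetical helper, not part of the original notebook.

```python
import re
from datetime import datetime

MAX_JOB_NAME_LENGTH = 63  # assumed limit; verify in the API reference


def make_job_name(model_id: str, timestamp: datetime, prefix: str = "llmaaj") -> str:
    """Build a lowercase, hyphen-separated job name within the assumed length limit."""
    suffix = timestamp.strftime("%Y%m%d-%H%M%S")
    # Collapse every run of characters other than a-z and 0-9 into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")
    # Trim the model slug rather than the timestamp so names stay distinct over time
    room = MAX_JOB_NAME_LENGTH - len(prefix) - len(suffix) - 2
    slug = slug[:room].strip("-")
    return f"{prefix}-{slug}-{suffix}"


example_name = make_job_name(
    "anthropic.claude-3-5-haiku-20241022-v1:0", datetime(2025, 1, 15, 10, 30, 0)
)
print(example_name, len(example_name))
```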

## Monitoring and Results <a name="monitoring"></a>

Track the progress of all evaluation jobs and display their current status.

> Note: as of January 2025, these jobs take around 10 minutes to complete.

In [None]:
# function to check job status
def check_jobs_status(jobs, client):
    """Check and update status for all evaluation jobs"""
    for job in jobs:
        try:
            response = client.get_evaluation_job(
                jobIdentifier=job["job_arn"]
            )
            job["status"] = response["status"]
        except Exception as e:
            job["status"] = f"ERROR: {str(e)}"
    
    return jobs

# Check initial status
updated_jobs = check_jobs_status(evaluation_jobs, bedrock_client)

# Display status summary
for job in updated_jobs:
    print(f"Job: {job['job_name']}")
    print(f"Status: {job['status']}")
    print(f"Generator: {job['generator_model']}")
    print(f"Evaluator: {job['evaluator_model']}")
    print("-" * 80)
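
Once a job reports completion, its output can be located under the S3 prefix passed as `output_s3_uri`. The following is a sketch, assuming results are written as `.jsonl` objects somewhere below that prefix. The exact folder layout is not described in this notebook, so the helper simply lists matching keys. `list_result_keys` and the fake client classes are hypothetical. `get_paginator("list_objects_v2")` is the standard boto3 S3 paginator.

```python
from typing import Any, Iterator, List


def list_result_keys(client: Any, bucket: str, prefix: str) -> List[str]:
    """Collect the keys of .jsonl objects under an S3 prefix, across all pages."""
    keys: List[str] = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matches carry no "Contents" key
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".jsonl"):
                keys.append(obj["Key"])
    return keys


class _FakePaginator:
    """Yields one page with two objects and one empty page."""

    def paginate(self, Bucket: str, Prefix: str) -> Iterator[dict]:
        yield {"Contents": [
            {"Key": f"{Prefix}job-a/output.jsonl"},
            {"Key": f"{Prefix}job-a/manifest.json"},
        ]}
        yield {}


class _FakeS3:
    """Stand-in for a boto3 S3 client, used only for this illustration."""

    def get_paginator(self, operation_name: str) -> _FakePaginator:
        return _FakePaginator()


result_keys = list_result_keys(_FakeS3(), "example-bucket", "labs/model-evaluation/")
print(result_keys)
```

With the real client, the call would be `list_result_keys(s3_client, BUCKET_NAME, PREFIX)`.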

## Clean up
Remember to delete the IAM policies created by this notebook.
- Tip: You can use the notebook ```Clean_up_Resources_eg_iam_policies.ipynb``` for this