# Amazon Bedrock LLM-as-a-Judge Evaluation 

## Introduction

This notebook demonstrates how to use Amazon Bedrock's LLM-as-a-Judge feature for systematic model evaluation. The LLM-as-a-Judge approach uses a foundation model to score another model's responses and explain the scores. The guide covers creating evaluation datasets, running evaluations, and comparing different foundation models.

Please refer to the [official documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html) for more details, including the supported evaluator and generator models.



### The Role of Evaluator and Generator Models in Amazon Bedrock LLM-as-Judge Evaluation
In Amazon Bedrock's LLM-as-Judge evaluation framework, two distinct model roles work together to enable robust assessment of language model outputs:

**Generator Models**

Generator models are the models being evaluated. They:

* Produce responses to the input prompts in your evaluation dataset
* Represent the candidates whose performance you want to assess
* Can be different foundation models (e.g., Llama, Qwen, DeepSeek, Nova) or the same model with different fine-tuning or prompt engineering approaches
* Are the subjects of comparison in A/B testing scenarios
* Generate outputs that will be scored by the evaluator model

**Evaluator Models (LLM-as-Judge)**

Evaluator models serve as automated judges that:

* Assess the quality of responses from generator models
* Apply scoring criteria defined in your evaluation job configuration
* Provide numerical ratings and explanatory feedback for each response
* Act as impartial judges to compare multiple model responses objectively
* Replace or supplement human evaluation, offering scalable assessment
* Should ideally be powerful models with strong reasoning capabilities 

**How They Interact in the Evaluation Process**
* You define evaluation prompts and metrics in your evaluation dataset
* Generator models produce responses to these prompts
* The evaluator model reviews each response according to specified criteria
* The evaluator provides scores and justifications for each assessment
* Bedrock aggregates these results into comprehensive evaluation reports
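
The steps above can be sketched in plain Python. This only illustrates the data flow and is not how Bedrock implements it: `generate` and `judge` are hypothetical stand-ins for the generator and evaluator models, and the 0-to-1 score is an assumed scale.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

def run_judge_loop(
    prompts: List[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[float, str]],
) -> Dict[str, object]:
    """Generate a response per prompt, have the judge score it, then aggregate."""
    rows = []
    for prompt in prompts:
        response = generate(prompt)                   # generator model answers
        score, explanation = judge(prompt, response)  # evaluator scores and explains
        rows.append({"prompt": prompt, "response": response,
                     "score": score, "explanation": explanation})
    return {"rows": rows, "mean_score": mean(r["score"] for r in rows)}

# Hypothetical stand-ins so the sketch runs without any model access.
def toy_generator(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

def toy_judge(prompt: str, response: str) -> Tuple[float, str]:
    answered = response != "unknown"
    return (1.0 if answered else 0.0, "answered" if answered else "no answer")

report = run_judge_loop(["What is 2 + 2?", "What is 3 * 7?"], toy_generator, toy_judge)
print(report["mean_score"])
```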

This automated approach enables systematic comparison of model outputs across various dimensions like accuracy, helpfulness, relevance, and safety, while reducing the need for extensive human evaluation.

The separation of generator and evaluator roles allows for fair, consistent assessment across different model types and configurations, helping you identify the best-performing models for your specific use cases.

### Use case: 
In this lab, you will **evaluate LLM performance on mathematical computations in the context of shopping**. You will use Mistral Large as the evaluator model, and Meta Llama 3.1 8B and Mistral 7B Instruct as the generator models.

## Prerequisites

1. An AWS account with Bedrock access
2. Appropriate IAM roles and permissions
3. An S3 bucket for storing evaluation data

Let's begin by making sure boto3 is up to date and importing the dependencies.

In [None]:
import boto3
import json

In [None]:
bedrock_client = boto3.client('bedrock')
found_models = [m['modelId'] for m in bedrock_client.list_foundation_models(byOutputModality='TEXT')['modelSummaries']]
eval_model = [fm for fm in found_models if "mistral.mistral-large-2402" in fm][0]
print(f"Evaluator Model: {eval_model}")
gen_models = [ "meta.llama3-1-8b-instruct-v1:0", "mistral.mistral-7b-instruct-v0:2"]
print(f"Generator Models: {gen_models}")


### Choose an S3 Bucket for Model Evaluation jobs
----

Bedrock model evaluation jobs require an Amazon S3 bucket in your current AWS region to store input datasets and model evaluation results.


**If you're running this notebook in the JupyterLab environment in Amazon SageMaker AI Studio, you can use the default bucket from the Amazon SageMaker session to store the datasets and evaluation results. To do so, run the code below as-is.**

For users running this notebook outside SageMaker AI Studio (for example on a local machine or EC2 instance), you'll need to either create a new S3 bucket or specify an existing one in your region. Please follow the instructions within the cell before you execute it.

In [None]:
# Set this to your bucket name to use a custom S3 bucket,
# or when running this notebook outside of SageMaker AI Studio
bucket = None

try:
    import sagemaker
except ImportError:
    sagemaker = None  # not running in a SageMaker environment

if sagemaker is not None:
    if bucket is None:
        # Fall back to the SageMaker session's default bucket
        bucket = sagemaker.Session().default_bucket()
    sess = sagemaker.Session(default_bucket=bucket)
    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        iam = boto3.client('iam')
        role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
elif bucket is None:
    raise ValueError("Set `bucket` to an existing S3 bucket in your region.")

print(f"Model Evaluation bucket: {bucket}")

Next, the IAM service role needs access to the S3 bucket you specified. This is done by defining an IAM policy and attaching it to the role. The cell below restores the role created in lab2a.
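
As a rough sketch of what such a policy can look like, the snippet below builds a policy document that grants read, write, and list access to a single bucket. The helper name `build_s3_access_policy`, the placeholder bucket name, and the exact set of actions are assumptions for illustration; check the Bedrock documentation for the permissions an evaluation job service role actually requires.

```python
import json

def build_s3_access_policy(bucket_name: str) -> dict:
    """Return an IAM policy document for object and list access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EvalBucketAccess",
                "Effect": "Allow",
                # Assumed minimal set of S3 actions for reading inputs and writing results
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself (for ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",  # the objects inside it
                ],
            }
        ],
    }

policy = build_s3_access_policy("my-eval-bucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

The resulting JSON string could then be attached to the role, for example with the IAM client's `put_role_policy` call.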

In [None]:
#Restore role_arn and role_name created in lab2a to run eval jobs
%store -r role_arn
%store -r role_name

## Run an Amazon Bedrock LLM-as-a-Judge Evaluation job

### Generate the dataset
You'll create a simple dataset of mathematical reasoning problems. These problems test:

1. Basic arithmetic
2. Logical reasoning
3. Natural language understanding

The dataset follows the required JSONL format for Bedrock evaluation jobs.
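
A quick sanity check on the records can catch formatting problems before a job is submitted. The sketch below assumes the three keys used in this notebook's dataset (`prompt`, `category`, `referenceResponse`) and treats only `prompt` as mandatory, which is an assumption; `validate_jsonl_lines` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterable, List

ALLOWED_KEYS = {"prompt", "category", "referenceResponse"}

def validate_jsonl_lines(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all lines passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        prompt = record.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append(f"line {number}: missing or empty 'prompt'")
        unknown = set(record) - ALLOWED_KEYS
        if unknown:
            problems.append(f"line {number}: unexpected keys {sorted(unknown)}")
    return problems

# Two hand-written sample lines: the first is well formed, the second lacks a prompt.
sample = [
    '{"prompt": "If apples cost $2.00 each and you buy 5 of them with a 10% discount, '
    'how much will you pay in total?", "category": "Shopping Math", '
    '"referenceResponse": "The total price will be $9.00."}',
    '{"category": "Shopping Math"}',
]
print(validate_jsonl_lines(sample))
```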

In [None]:
import random
import json

def generate_shopping_problems(num_problems=50):
    """Generate shopping-related math problems with random values."""
    problems = []
    items = ["apples", "oranges", "bananas", "books", "pencils", "notebooks"]
    
    for _ in range(num_problems):
        # Generate random values
        item = random.choice(items)
        quantity = random.randint(3, 20)
        price_per_item = round(random.uniform(1.5, 15.0), 2)
        discount_percent = random.choice([10, 15, 20, 25, 30])
        
        # Calculate the answer
        total_price = quantity * price_per_item
        discount_amount = total_price * (discount_percent / 100)
        final_price = round(total_price - discount_amount, 2)
        
        # Create the problem; format money with two decimals to avoid float noise
        problem = {
            "prompt": f"If {item} cost ${price_per_item:.2f} each and you buy {quantity} of them with a {discount_percent}% discount, how much will you pay in total?",
            "category": "Shopping Math",
            "referenceResponse": f"The total price will be ${final_price:.2f}. Original price: ${total_price:.2f} minus {discount_percent}% discount (${discount_amount:.2f})"
        }
        
        problems.append(problem)
    
    return problems

def save_to_jsonl(problems, output_file):
    """Save the problems to a JSONL file."""
    with open(output_file, 'w') as f:
        for problem in problems:
            f.write(json.dumps(problem) + '\n')

SAMPLE_SIZE = 30
dataset_custom_name = "eval_dataset"
problems = generate_shopping_problems(SAMPLE_SIZE)
save_to_jsonl(problems, f"{dataset_custom_name}.jsonl")

After generating the sample dataset, you need to upload it to S3 for use in the evaluation job. You'll use the boto3 S3 client to upload the JSONL file.

In [None]:
def upload_to_s3(local_file: str, bucket: str, s3_key: str) -> bool:
    """
    Upload a file to S3 with error handling.
    
    Returns:
        bool: Success status
    """
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(local_file, bucket, s3_key)
        print(f"✓ Successfully uploaded to s3://{bucket}/{s3_key}")
        return True
    except Exception as e:
        print(f"✗ Error uploading to S3: {str(e)}")
        return False

# Upload dataset
PREFIX = "bedrock_model_eval"
s3_key = f"{PREFIX}/{dataset_custom_name}.jsonl"
upload_success = upload_to_s3(f"{dataset_custom_name}.jsonl", bucket, s3_key)

if not upload_success:
    raise Exception("Failed to upload dataset to S3")

### Configure the evaluation jobs

You are now ready to configure the LLM-as-Judge evaluation jobs. With Amazon Bedrock LLM-as-a-Judge evaluation you can use comprehensive metrics to assess model performance:

| Metric    | Description |
| -------- | ------- |
| Quality  | Correctness, Completeness, Faithfulness    |
| User Experience | Helpfulness, Coherence, Relevance     |
| Instructions    | Following Instructions, Professional Style    |
| Safety    | Harmfulness, Stereotyping, Refusal    |
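
The mapping between the table rows and the `Builtin.*` metric identifiers used later in this notebook can be written down as a small dictionary, which makes it easy to run a job on a subset of metrics. The grouping follows the table above; `select_metrics` is a hypothetical helper.

```python
from typing import Dict, List

METRIC_GROUPS: Dict[str, List[str]] = {
    "Quality": ["Builtin.Correctness", "Builtin.Completeness", "Builtin.Faithfulness"],
    "User Experience": ["Builtin.Helpfulness", "Builtin.Coherence", "Builtin.Relevance"],
    "Instructions": ["Builtin.FollowingInstructions", "Builtin.ProfessionalStyleAndTone"],
    "Safety": ["Builtin.Harmfulness", "Builtin.Stereotyping", "Builtin.Refusal"],
}

def select_metrics(*groups: str) -> List[str]:
    """Return metric identifiers for the requested groups (all groups if none given)."""
    chosen = groups or tuple(METRIC_GROUPS)
    return [metric for group in chosen for metric in METRIC_GROUPS[group]]

print(select_metrics("Quality", "Safety"))
print(len(select_metrics()))
```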

The following code configures the jobs using the boto3 SDK.

In [None]:
def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    generator_model_id: str,
    dataset_name: str = None,
    task_type: str = "General" # must be General for LLMaaJ
):    
    # All available LLM-as-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness", 
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "bedrockModel": {
                            "modelIdentifier": generator_model_id
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response
        
    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise

### Run evaluation jobs for the 2 generator models

Next, trigger the evaluation jobs.
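
The job name used below is derived from the two model IDs plus a timestamp. The sketch isolates that step so the result can be inspected. `build_job_name` is a hypothetical helper, the evaluator ID in the usage lines is a placeholder, and the name pattern (lowercase letters, digits, and hyphens, at most 63 alphanumeric characters) is an assumption to verify against the `CreateEvaluationJob` API reference.

```python
import re
from datetime import datetime

# Assumed constraint on job names; check the API reference before relying on it.
JOB_NAME_PATTERN = re.compile(r"^[a-z0-9](-*[a-z0-9]){0,62}$")

def build_job_name(generator_model: str, evaluator_model: str, now: datetime) -> str:
    """Combine the generator model name, the evaluator provider, and a timestamp."""
    generator_part = generator_model.split(".")[1].split(":")[0]  # e.g. llama3-1-8b-instruct-v1
    evaluator_part = evaluator_model.split(".")[0]                # e.g. mistral
    return f"{generator_part}-{evaluator_part}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"

name = build_job_name(
    "meta.llama3-1-8b-instruct-v1:0",
    "mistral.example-evaluator-v1:0",  # placeholder evaluator ID
    datetime(2025, 1, 2, 3, 4, 5),
)
print(name)
print(bool(JOB_NAME_PATTERN.match(name)))
```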

In [None]:
output_path = f"{bucket}/model_eval_output"
task_type="General"

In [None]:
from typing import List, Dict, Any
from datetime import datetime
import time

def run_model_comparison(
    generator_models: List[str],
    evaluator_model: str
) -> List[Dict[str, Any]]:
    evaluation_jobs = []
    
    for generator_model in generator_models:
        job_name = f"{generator_model.split('.')[1].split(':')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
        
        try:
            response = create_llm_judge_evaluation(
                client=bedrock_client,
                job_name=job_name,
                role_arn=role_arn,
                input_s3_uri=f"s3://{bucket}/{PREFIX}/{dataset_custom_name}.jsonl",
                output_s3_uri=f"s3://{output_path}/{job_name}/",
                evaluator_model_id=evaluator_model,
                generator_model_id=generator_model,
                task_type=task_type
            )
            
            job_info = {
                "job_name": job_name,
                "job_arn": response["jobArn"],
                "generator_model": generator_model,
                "evaluator_model": evaluator_model,
                "status": "CREATED"
            }
            evaluation_jobs.append(job_info)
            
            print(f"✓ Created job: {job_name}")
            print(f"  Generator: {generator_model}")
            print(f"  Evaluator: {evaluator_model}")
            print("-" * 80)
            time.sleep(1)
            
        except Exception as e:
            print(f"✗ Error with {generator_model}: {str(e)}")
            continue
            
    return evaluation_jobs

# Run model comparison
evaluation_jobs = run_model_comparison(gen_models, eval_model)

### Monitoring and Results
The jobs will take several minutes to complete. You can monitor the progress of the evaluation jobs and display their current status before you proceed.
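
The polling cells below track exactly two jobs. For a different number of generator models, the stopping condition can be expressed over the whole list. In this sketch `TERMINAL_STATUSES` is an assumption: `Completed` and `Failed` are the values checked in this notebook, and `Stopped` is added as a further end state.

```python
from typing import Dict, Iterable, List

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}  # "Stopped" is an assumption

def all_jobs_finished(jobs: Iterable[Dict[str, str]]) -> bool:
    """True when every job dict carries a terminal value under its 'status' key."""
    return all(job.get("status") in TERMINAL_STATUSES for job in jobs)

def summarize(jobs: List[Dict[str, str]]) -> str:
    """One-line status summary, e.g. for printing inside a polling loop."""
    return ", ".join(f"{job['job_name']}: {job['status']}" for job in jobs)

# Made-up job entries shaped like the dictionaries built in this notebook.
jobs = [
    {"job_name": "job-a", "status": "Completed"},
    {"job_name": "job-b", "status": "InProgress"},
]
print(summarize(jobs), "| finished:", all_jobs_finished(jobs))
```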

In [None]:
# Function to check job status
def check_jobs_status(jobs, client):
    """Check and update status for all evaluation jobs"""
    for job in jobs:
        try:
            response = client.get_evaluation_job(
                jobIdentifier=job["job_arn"]
            )
            job["status"] = response["status"]
        except Exception as e:
            job["status"] = f"ERROR: {str(e)}"
    
    return jobs
    


In [None]:
from IPython.display import clear_output
import time
import datetime

def check_status(evaluation_jobs, loop=True):
    max_time = time.time() + 2*60*60 
    
    while True:
        now = datetime.datetime.now()
        current_time = now.strftime("%H:%M:%S")
        updated_jobs = check_jobs_status(evaluation_jobs, bedrock_client)
        
        job1_status, job2_status = updated_jobs[0]["status"], updated_jobs[1]["status"]
        
        if loop:
            clear_output(wait=True)
        
        print(f"{current_time} : Model evaluation job1 is {job1_status} and job2 is {job2_status}.")
        
        finished = ("Completed", "Failed")
        if not loop or (job1_status in finished and job2_status in finished) or time.time() >= max_time:
            break
        
        time.sleep(60)
    
    return job1_status, job2_status

In [None]:
status1, status2 = check_status(evaluation_jobs, loop=False)

In [None]:
from IPython.display import Markdown, display
region = boto3.session.Session().region_name

display(Markdown(f"You can also review the status of the jobs in the [Amazon Bedrock Console](https://{region}.console.aws.amazon.com/bedrock/home?region={region}#/eval/evaluation)"))


<div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 15px; border-radius: 5px;">

<strong>The evaluation jobs you just submitted may take several minutes to complete.</strong><br><br>


Instead of waiting for the submitted evaluation job(s) to complete, let's proceed with monitoring and analyzing results from previously completed jobs. This approach allows us to:

⏱️ Make productive use of our workshop time.

🧠 Understand the evaluation framework and metrics.

📈 Compare existing model performance results.

In the following cells, we'll:

🔄 Check the status of our submitted job(s).

📥 Retrieve and analyze results from completed evaluation jobs.

⚖️ Compare performance across different models.

📊 Visualize key metrics and insights.
</div>

Next, retrieve the most recent completed job for each combination of generator and evaluator model.

In [None]:
from datetime import datetime, timedelta, timezone

bedrock = boto3.client('bedrock', region_name=region)

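def get_completed_llm_judge_jobs(hours_ago=1):
    """Return the most recent completed job per (generator, evaluator) pair.

    Note: `hours_ago` is currently unused; no time-based filtering is applied.
    """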
    all_jobs = []
    next_token = None
    
    # Get all jobs with pagination
    while True:
        params = {
            'sortBy': 'CreationTime',
            'sortOrder': 'Descending',
            'statusEquals': 'Completed',
            'applicationTypeEquals': 'ModelEvaluation',
            'maxResults': 1000
        }
        
        if next_token:
            params['nextToken'] = next_token
            
        response = bedrock.list_evaluation_jobs(**params)
        all_jobs.extend(response['jobSummaries'])
        
        next_token = response.get('nextToken')
        if not next_token:
            break

    # Filter jobs for LLM-as-judge evaluation
    jobs = [
        job for job in all_jobs 
        if 'evaluatorModelIdentifiers' in job
        and any(job.get('modelIdentifiers', []) == [model] for model in gen_models)
        and job.get('evaluatorModelIdentifiers', []) == [eval_model]
    ]

    # Group jobs by unique combination of generator model and evaluator model
    job_groups = {}
    
    for job in jobs:
        generator_model = job['modelIdentifiers'][0]
        evaluator_model = job['evaluatorModelIdentifiers'][0]
        key = (generator_model, evaluator_model)
        
        # Keep only the most recent job for each unique combination
        if key not in job_groups or job['creationTime'] > job_groups[key]['creationTime']:
            job_groups[key] = job
    
    return list(job_groups.values())

In [None]:
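# Order the jobs to match gen_models so that later plots label each model correctly
evaluation_jobs = sorted(
    get_completed_llm_judge_jobs(),
    key=lambda job: gen_models.index(job['modelIdentifiers'][0])
)[:2]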
evaluation_jobs

## Review evaluation results

Next, retrieve the S3 output locations for each evaluation job.
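
The output location comes back as an `s3://` URI, which has to be split into a bucket name and a key prefix before the objects can be listed. A minimal sketch of that step using the standard library's `urlparse` follows; the helper name `parse_s3_uri` and the example URI are hypothetical, and the cell below uses plain string splitting instead.

```python
from typing import Tuple
from urllib.parse import urlparse

def parse_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path keeps its leading slash, which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_s3_uri("s3://my-eval-bucket/model_eval_output/job-name/"))
```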

In [None]:
import boto3

s3 = boto3.client('s3')
outputs_jsonl = []

for job in evaluation_jobs:
    job_details = bedrock.get_evaluation_job(jobIdentifier=job['jobArn'])
    s3_uri = job_details['outputDataConfig']['s3Uri']
    
    # Parse S3 URI
    bucket = s3_uri.split('/')[2]
    prefix = '/'.join(s3_uri.split('/')[3:])
    
    # List objects
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    
    jsonl_files = [f"{obj['Key']}" for obj in response.get('Contents', []) if obj['Key'].endswith('.jsonl')]
    outputs_jsonl.extend(jsonl_files)
outputs_jsonl

Then load the metrics calculated during the evaluation.
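
Each line of an output file is one JSON record. The plotting code further down reads the scores from `automatedEvaluationResult.scores`, where every entry has a `metricName` and a `result`. The sketch below turns one record into a name-to-score dictionary; `scores_by_metric` is a hypothetical helper and the example record is hand-written with made-up values.

```python
from typing import Dict, Optional

def scores_by_metric(record: dict) -> Dict[str, Optional[float]]:
    """Map each metric name in one output record to its score."""
    scores = record.get("automatedEvaluationResult", {}).get("scores", [])
    return {entry["metricName"]: entry.get("result") for entry in scores}

# Hand-written record, shaped like the fields read in this notebook.
example_record = {
    "automatedEvaluationResult": {
        "scores": [
            {"metricName": "Builtin.Correctness", "result": 1.0},
            {"metricName": "Builtin.Helpfulness", "result": 0.83},
        ]
    }
}
print(scores_by_metric(example_record)["Builtin.Correctness"])
```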

In [None]:
# Function to retrieve metrics from the output
import json
s3_res = boto3.resource('s3')

def retrieve_metrics(bucket, output_jsonl):
    content_object = s3_res.Object(bucket, output_jsonl)
    jsonl_content = content_object.get()['Body'].read().decode('utf-8')
    output_content = [json.loads(jline) for jline in jsonl_content.splitlines()]
    return output_content
    
eval_jobs_metrics_jsonl = [retrieve_metrics(bucket, output_jsonl) for output_jsonl in outputs_jsonl]

### Plot Metrics 


You can now visualize and compare model performance metric by metric. The cells below load the evaluation results for the 11 metrics and plot the scores of both models for each prompt. The plots help identify which model does better on specific metrics such as correctness and coherence, which supports a data-driven model selection.
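
Alongside the plots, a per-metric average gives a compact summary of one job. The following is a minimal sketch using only the standard library; `mean_scores` is a hypothetical helper, the records are made up for illustration, and the lower-is-better set mirrors the one used by the plotting function in this notebook.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

LOWER_IS_BETTER = {"Builtin.Refusal", "Builtin.Stereotyping", "Builtin.Harmfulness"}

def mean_scores(records: List[dict]) -> Dict[str, float]:
    """Average each metric over all records of one job, skipping missing scores."""
    values = defaultdict(list)
    for record in records:
        for entry in record["automatedEvaluationResult"]["scores"]:
            if entry.get("result") is not None:
                values[entry["metricName"]].append(entry["result"])
    return {metric: round(mean(vals), 3) for metric, vals in values.items()}

# Made-up records for illustration only.
records = [
    {"automatedEvaluationResult": {"scores": [
        {"metricName": "Builtin.Correctness", "result": 1.0},
        {"metricName": "Builtin.Harmfulness", "result": 0.0}]}},
    {"automatedEvaluationResult": {"scores": [
        {"metricName": "Builtin.Correctness", "result": 0.5},
        {"metricName": "Builtin.Harmfulness", "result": 0.0}]}},
]
for metric, value in mean_scores(records).items():
    direction = "lower is better" if metric in LOWER_IS_BETTER else "higher is better"
    print(f"{metric}: {value} ({direction})")
```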

In [None]:
metric_names = [
        "Builtin.Correctness",
        "Builtin.Completeness", 
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

In [None]:

# Function to filter and load the metrics in pandas DataFrame
import pandas as pd

def pd_metrics(model1, model2, metric, job1_metrics, job2_metrics):
    """Build a DataFrame with one column per model holding the scores for `metric`."""
    def metric_score(record):
        # Look the score up by name rather than by position, since the order may differ between records
        return next(s['result'] for s in record['automatedEvaluationResult']['scores'] if s['metricName'] == metric)

    met1, met2 = [], []
    for x, y in zip(job1_metrics, job2_metrics):
        met1.append(metric_score(x))
        met2.append(metric_score(y))
    return pd.DataFrame({model1.split(':')[0]: met1, model2.split(':')[0]: met2})

In [None]:

stats_list = []
for metric in metric_names:
    met_pd = pd_metrics(gen_models[0], gen_models[1], metric, eval_jobs_metrics_jsonl[0], eval_jobs_metrics_jsonl[1])
    stats_list.append(met_pd)


In [None]:
# Function to line plot for model comparison per metric
import seaborn as sns
import matplotlib.pyplot as plt
metrics = [m.split('.')[1] for m in metric_names]
def plot_line_metrics(metrics, stats_list):
    for metric, df in zip(metrics, stats_list):
        print("\n \n \n")
        ltb = ["Refusal", "Stereotyping", "Harmfulness"]
        if metric in ltb:
            sub = "    Lower the better"
        else:
            sub = "    Higher the better"
        plt.figure(figsize=(12, 6))
        sns.set_style("whitegrid")
        sns.lineplot(data=df, markers=True, palette="flare")
        plt.legend(title='Model')
        plt.xlabel('Inference test')
        plt.ylabel(metric)
        plt.title(metric)
        plt.figtext(0.5, 0.01, sub, horizontalalignment='center', verticalalignment='bottom', fontsize=10, fontstyle='italic', color='purple')
        plt.show();

In [None]:
plot_line_metrics(metrics, stats_list)