# Evaluate LLM Performance with Metrics Using Amazon Bedrock Automatic Model Evaluation

## Overview

This notebook demonstrates how to evaluate Large Language Models (LLMs) using Amazon Bedrock's Automatic Model Evaluation (AME) capabilities. By the end of this notebook, you will understand how to set up, run, and interpret various metrics-based evaluations to assess model performance across different dimensions.

For supported regions and models, please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-support.html

## Automatic model evaluations 


Automatic model evaluation jobs allow you to quickly assess a model's ability to perform specific tasks with minimal setup. By using AME, you can systematically compare model performance across different dimensions, make data-driven decisions about model selection, and identify opportunities for prompt engineering improvements.

You can either provide your own custom prompt dataset tailored to your specific use case, or leverage Amazon's built-in datasets for standardized evaluations.



### Key Benefits of Bedrock AME

**Streamlined Evaluation Process:** Evaluate model performance without building custom evaluation infrastructure

**Flexible Dataset Options:** Use built-in datasets or customize your own evaluation prompts

**Comprehensive Metrics:** Access industry-standard metrics for different LLM capabilities

**Multi-Model Comparison:** Easily benchmark performance across different models


### Pre-requisites

Before proceeding with this lab on model evaluation, you need to complete some pre-requisites.
Later in this notebook, you will have the opportunity to go through these steps in detail. Please take some time to review and understand them, as this will help you implement similar evaluations in your own AWS environment later.

#### Required Resources and Permissions
1. Amazon S3 Storage Configuration
    * Regional Compatibility: An Amazon S3 bucket must exist in the same AWS Region as your Amazon Bedrock models
        * Example: When using Bedrock in us-west-2, your S3 bucket must also be in us-west-2
    * CORS Configuration: The S3 bucket requires Cross Origin Resource Sharing (CORS) configuration enabled
        * This allows proper communication between Amazon Bedrock services and your storage.


2. IAM Role Requirements
The IAM role executing this notebook must have sufficient permissions to perform the following:
    * S3 Operations:
        * Read from and write to your designated Amazon S3 bucket
        * Upload evaluation datasets and retrieve results
    * Bedrock Service Access:
        * Invoke Amazon Bedrock foundation models
        * Create and manage model inference configurations
    * Evaluation Job Management:
       * Create and initiate evaluation jobs
       * Monitor job status and progress
       * Access and download evaluation results
         
For a comprehensive list of prerequisites and detailed setup instructions for your own environment, please refer to the [official Amazon Bedrock documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-automatic.html).


## Environment Setup
The following code installs the Python libraries this notebook requires, as listed in `requirements.txt`.

### Required Dependencies

| Package | Description |
|---------|------------|
| `awscli` | AWS Command Line Interface tools for AWS services interaction |
| `boto3` | AWS SDK for Python - enables programmatic AWS service access |
| `seaborn` | Statistical data visualization built on matplotlib |
| `matplotlib` | Comprehensive library for creating visualizations |
| `sagemaker` | Amazon SageMaker Python SDK for ML workflows |


The `--quiet` flag reduces installation output to keep the notebook clean. The packages and their versions come from `requirements.txt`.

In [None]:
%pip install --quiet -r requirements.txt

## Complete Pre-requisites

### Choose an S3 Bucket for Model Evaluation jobs
----

Bedrock model evaluation jobs require an Amazon S3 bucket in your current AWS region to store input datasets and model evaluation results.


**If you're running this notebook in the JupyterLab environment in Amazon SageMaker AI Studio, you can use the default bucket from the Amazon SageMaker session to store the datasets and evaluation results. To do so, run the code below as-is.** For users running this notebook outside SageMaker AI Studio (for example on a local machine or EC2 instance), you'll need to either create a new S3 bucket or specify an existing one in your region. Please follow the instructions within the cell before you execute it.

In [None]:
import sagemaker #comment if Sagemaker is not used
import boto3
sess = sagemaker.Session() #comment if Sagemaker is not used

#If you want to use a custom S3 bucket, or you are running this notebook without SageMaker, set the bucket name as follows
#bucket = ""
bucket=None

if bucket is None and sess is not None: 
    # set to default bucket if a bucket name is not given
    bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=bucket)
 

print(f"Model Evaluation bucket: {bucket}")


### Enable Cross Origin Resource Sharing (CORS) on S3 bucket
----

Automatic model evaluation jobs that are created using the Amazon Bedrock console require that you specify a CORS configuration on the S3 bucket you use to store the datasets and model evaluation results.

Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-security-cors.html for more details.
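
Before applying a CORS configuration, it can help to confirm that the dictionary allows every HTTP method you intend to permit. The following is a minimal sketch: `missing_cors_methods` and `REQUIRED_METHODS` are hypothetical names introduced here (they are not part of boto3), and the required set simply mirrors the four methods used in this notebook's CORS cell.

```python
# Hypothetical helper (not part of boto3): report which HTTP methods a
# CORS configuration dict does not allow.
REQUIRED_METHODS = {"GET", "PUT", "POST", "DELETE"}  # assumption: mirrors this notebook's CORS cell


def missing_cors_methods(cors_configuration, required=REQUIRED_METHODS):
    """Return the required methods that no CORS rule allows, sorted alphabetically."""
    allowed = set()
    for rule in cors_configuration.get("CORSRules", []):
        allowed.update(rule.get("AllowedMethods", []))
    return sorted(set(required) - allowed)


# Illustrative configuration that forgets POST and DELETE
sample_config = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"],
        }
    ]
}

print(missing_cors_methods(sample_config))
```

An empty list means every required method is covered by at least one rule.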

In [None]:
#Cors
# Define the configuration rules
cors_configuration = {
    'CORSRules': [
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "PUT",
            "POST",
            "DELETE"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [
            "Access-Control-Allow-Origin"
        ]
    }
]
}

# Set the CORS configuration
s3 = boto3.client('s3')
s3.put_bucket_cors(Bucket=bucket,
                   CORSConfiguration=cors_configuration)

### IAM service role

To run an automatic model evaluation job you must create a service role. The service role allows Amazon Bedrock to perform actions on your behalf in your AWS account. Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/automatic-service-roles.html for more details.

In [None]:
import json
#Create IAM role
iam = boto3.client('iam')
aws_acct = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

assume_role_policy_document = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBedrockToAssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": aws_acct
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:{}:{}:evaluation-job/*".format(region, aws_acct)
                }
            }
        }
    ]
})



In [None]:
import datetime

role_name="Amazon-Bedrock-model-eval-{}".format(str(datetime.datetime.now().timestamp()).split('.')[0])
create_role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument = assume_role_policy_document
)
import time

In [None]:
#Waiter function to check if IAM role got successfully created
waiter = iam.get_waiter('role_exists')
print(f"Waiting for role '{role_name}' to exist...")

# Wait for the role to exist
waiter.wait(
    RoleName=role_name,
    WaiterConfig={
        'Delay': 2,        # Optional: seconds to wait between polls
        'MaxAttempts': 5   # Optional: maximum number of polls before giving up
    }
)
print(f"Role '{role_name}' found!")

In [None]:
role_arn = create_role_response["Role"]["Arn"]

role_arn

In [None]:
#Store role_arn and role_name for reuse in lab2b
%store role_arn
%store role_name

### Add Permissions to IAM role to access Amazon Bedrock and the Amazon S3 Bucket
---
Next, you need to allow the IAM service role to access the S3 bucket you specified and the required Amazon Bedrock capabilities.

In [None]:
aws_s3_policy_doc = json.dumps({
"Version": "2012-10-17",
"Statement": [
    {
        "Sid": "AllowAccessToCustomDatasetsAndOutput",
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
        ],
        "Resource": [
            "arn:aws:s3:::{}".format(bucket),
            "arn:aws:s3:::{}/outputs/".format(bucket),
            "arn:aws:s3:::{}/custom_datasets/".format(bucket),
            "arn:aws:s3:::{}/*".format(bucket),
        ]
    }
]
}
)

aws_br_policy_doc = json.dumps({
        "Version": "2012-10-17",
            "Statement": [
        {
            "Sid": "AllowAccessToBedrockResources",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CreateModelInvocationJob",
                "bedrock:StopModelInvocationJob",
                "bedrock:GetProvisionedModelThroughput",
                "bedrock:GetInferenceProfile", 
                "bedrock:ListInferenceProfiles",
                "bedrock:GetImportedModel",
                "bedrock:GetPromptRouter",
                "bedrock:GetEvaluationJob",
                "bedrock:ListEvaluationJobs",
                "bedrock:CreateEvaluationJob",
                "sagemaker:InvokeEndpoint"
            ],
            "Resource": [
                "arn:aws:bedrock:*::foundation-model/*",
                "arn:aws:bedrock:*:{}:inference-profile/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:provisioned-model/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:imported-model/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:application-inference-profile/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:default-prompt-router/*".format(aws_acct),
                "arn:aws:sagemaker:*:{}:endpoint/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:marketplace/model-endpoint/all-access".format(aws_acct)
            ]
        }
    ]
}
)
    

In [None]:
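import time
from botocore.exceptions import ClientError

def wait_for_policy_propagation(iam_client, role_name, policy_name):
    """Poll until the inline policy is readable on the role (up to 30 attempts, 1 second apart)."""
    for attempt in range(30):
        try:
            iam_client.get_role_policy(RoleName=role_name, PolicyName=policy_name)
            return
        except ClientError as e:
            # The policy may not be visible immediately after put_role_policy
            if e.response['Error']['Code'] == 'NoSuchEntity':
                print("NoSuchEntity, trying again")
                time.sleep(1)
                continue
            raise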

In [None]:
iam_s3_response = iam.put_role_policy(
    RoleName=role_name,
    PolicyName="s3_access",
    PolicyDocument=aws_s3_policy_doc
)
iam_s3_response
wait_for_policy_propagation(iam, role_name, "s3_access")

In [None]:
iam_bedrock_response = iam.put_role_policy(
    RoleName=role_name,
    PolicyName="br_access",
    PolicyDocument=aws_br_policy_doc
)
iam_bedrock_response
wait_for_policy_propagation(iam, role_name, "br_access")

# Run model evaluation job
## Model Selection

In this step, we'll compare two powerful Large Language Models (LLMs) available through Amazon Bedrock:


### 1. Qwen-3 32B (Alibaba)
- **Bedrock Model ID:** `qwen.qwen3-32b-v1:0`

### 2. GPT OSS 20B (OpenAI)
- **Bedrock Model ID:** `openai.gpt-oss-20b-1:0`

You can list the available models and retrieve their model IDs using the following code:

```
import boto3
bedrock_client = boto3.client('bedrock')
bedrock_client.list_foundation_models()
```
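
The full listing can be long, so you may want to narrow it down. The sketch below filters a response by output modality and inference type. `on_demand_text_model_ids` is a hypothetical helper, and `sample_response` is hand-written to resemble the shape of a `list_foundation_models()` response (only the `modelSummaries`, `modelId`, `outputModalities` and `inferenceTypesSupported` fields are used); the `example.*` entries are placeholders, not real models.

```python
def on_demand_text_model_ids(response):
    """Return IDs of models that produce text and support on-demand inference."""
    return [
        summary["modelId"]
        for summary in response.get("modelSummaries", [])
        if "TEXT" in summary.get("outputModalities", [])
        and "ON_DEMAND" in summary.get("inferenceTypesSupported", [])
    ]


# Hand-written sample shaped like a list_foundation_models() response;
# the "example.*" entries are illustrative placeholders, not real API output.
sample_response = {
    "modelSummaries": [
        {
            "modelId": "qwen.qwen3-32b-v1:0",
            "outputModalities": ["TEXT"],
            "inferenceTypesSupported": ["ON_DEMAND"],
        },
        {
            "modelId": "example.image-model-v1:0",
            "outputModalities": ["IMAGE"],
            "inferenceTypesSupported": ["ON_DEMAND"],
        },
        {
            "modelId": "example.text-model-v1:0",
            "outputModalities": ["TEXT"],
            "inferenceTypesSupported": ["PROVISIONED"],
        },
    ]
}

print(on_demand_text_model_ids(sample_response))
```

In the notebook you would pass the real response, for example `on_demand_text_model_ids(bedrock_client.list_foundation_models())`.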

In [None]:
bedrock_client = boto3.client('bedrock', region_name=region)
gen_models = [ "qwen.qwen3-32b-v1:0", "openai.gpt-oss-20b-1:0"]
model_1 = gen_models[0]
model_2 = gen_models[1]
print('You selected models {} and {} for evaluation'.format(model_1, model_2))

#### Let's get details about the selected Amazon Bedrock foundation models.

In [None]:
bedrock_client.get_foundation_model(modelIdentifier=model_1)

In [None]:
bedrock_client.get_foundation_model(modelIdentifier=model_2)

#### Get ARNs for the selected models

In [None]:
import boto3
bedrock_client = boto3.client('bedrock')
region = boto3.session.Session().region_name
region_prefix = region.split('-')[0]
model_arns = []



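for model in model_1, model_2:
    fm_response = bedrock_client.get_foundation_model(
        modelIdentifier=model
    )
    # Only the first supported inference type is inspected
    inference_type = fm_response['modelDetails']['inferenceTypesSupported'][0]
    if inference_type == "ON_DEMAND":
        model_arn = fm_response['modelDetails']['modelArn']
    elif inference_type == "INFERENCE_PROFILE":
        # Inference profile IDs are built here by prefixing the region group (for example "us")
        profile_id = "{}.{}".format(region_prefix, model)
        model_arn = bedrock_client.get_inference_profile(
            inferenceProfileIdentifier=profile_id
        )['inferenceProfileArn']
    else:
        # Fail loudly instead of reusing an ARN from a previous iteration
        raise ValueError("Unsupported inference type {} for model {}".format(inference_type, model))
    model_arns.append(model_arn)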

print(model_arns)

# <ins> Automatic Model evaluation using Builtin Dataset </ins>

### Define taskType, Dataset and metrics for evaluation

**Task Type:**
Model evaluation supports the following task types that assess different aspects of the model's performance:

* General text generation – the model performs natural language processing and text generation tasks.
* Text summarization – the model summarizes text based on the prompts you provide.
* Question and answer – the model provides answers based on your prompts.
* Text classification – the model categorizes text into predefined classes based on the input dataset.

**Metrics:**
You can choose from the following metrics for the model evaluation job to compute.

* Toxicity – The presence of harmful, abusive, or undesirable content generated by the model.
* Accuracy – The model's ability to generate outputs that are factually correct, coherent, and aligned with the intended task or query.
* Robustness – The model's ability to maintain consistent and reliable performance in the face of various types of challenges or perturbations.

**Datasets:**
Amazon Bedrock provides multiple built-in prompt datasets that you can use in an automatic model evaluation job. Each built-in dataset is based on an open-source dataset, which has been randomly downsampled to include only 100 prompts.

For the complete list of supported datasets, task types, and metrics, please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html.
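
The task type, dataset and metrics end up as one entry in the job's `datasetMetricConfigs` list, and the inference settings are passed as a JSON string. The sketch below builds both pieces with plain Python so they can be inspected before a job is submitted. `build_inference_params` and `build_dataset_metric_config` are hypothetical helper names; the field names and default values mirror the `model_eval` function in this notebook rather than an exhaustive view of the API.

```python
import json


def build_inference_params(max_tokens=1024, temperature=0.3, top_p=0.5):
    """Serialize inference settings with json.dumps instead of hand-escaping quotes."""
    return json.dumps(
        {"inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature, "topP": top_p}}
    )


def build_dataset_metric_config(task_type, dataset_name, metric_names, s3_uri=None):
    """Build one datasetMetricConfigs entry; s3_uri is only set for custom datasets."""
    dataset = {"name": dataset_name}
    if s3_uri is not None:
        dataset["datasetLocation"] = {"s3Uri": s3_uri}
    return {"taskType": task_type, "dataset": dataset, "metricNames": list(metric_names)}


config = build_dataset_metric_config(
    "QuestionAndAnswer",
    "Builtin.NaturalQuestions",
    ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"],
)
print(json.dumps(config, indent=2))
print(build_inference_params())
```

Generating the string with `json.dumps` avoids quoting mistakes when you change a parameter.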

In [None]:
def model_eval(model_arn, dataset, task_type, output_path, job_name, metric_names, custom_ds=False, custom_ds_s3=None):
    if custom_ds:
        ds = {
                'name': dataset,
                'datasetLocation': {
                    's3Uri': custom_ds_s3
                        }
            }
    else:
        ds = {
                'name': dataset
            }
    job_request = bedrock_client.create_evaluation_job(
        jobName=job_name,
        jobDescription="Bedrock Model evaluation job",
        roleArn=role_arn,
        outputDataConfig={
            "s3Uri": output_path
        },
        inferenceConfig={
            "models": [
                {
                    "bedrockModel": {
                        "modelIdentifier":model_arn,
                        "inferenceParams":"{\"inferenceConfig\":{\"maxTokens\": 1024,\"temperature\":0.3,\"topP\":0.5}}"
                    }

                }
            ]

        },
        evaluationConfig={
        'automated': {
            'datasetMetricConfigs': [
                {
                    "taskType": task_type,
                        "dataset": ds,
                        "metricNames": metric_names
                },
            ],
        },
        }
    )

    return job_request

In [None]:
import datetime

### Use one of the following example combinations of task_type, dataset and metrics, or pick another supported built-in combination from
### https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html#model-evaluation-prompt-datasets-builtin

#### Example-1 #####
task_type = "QuestionAndAnswer"
dataset = "Builtin.NaturalQuestions"
metric_names = ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"]

#### Example-2 #####
#task_type = "Classification"
#dataset = "Builtin.WomensEcommerceClothingReviews"
#metric_names = ["Builtin.Accuracy", "Builtin.Robustness"] 

output_path = "s3://{}/outputs/".format(bucket)

In [None]:
import time
import boto3
from botocore.exceptions import ClientError

def wait_for_role_propagation(role_arn, max_wait=10):
    """Wait until the role is visible and both inline policies (s3_access, br_access) are listed on it."""
    iam_client = boto3.client('iam')
    role_name = role_arn.split('/')[-1]
    start_time = time.time()
    while time.time() - start_time < max_wait:
        try:
            # get_role raises a ClientError while the role is not yet visible
            iam_client.get_role(RoleName=role_name)
            policies = iam_client.list_role_policies(RoleName=role_name)
            if len(policies['PolicyNames']) >= 2:  # Should have s3_access and br_access
                print(f"Role ready after {int(time.time() - start_time)} seconds")
                return True
        except ClientError:
            pass
        print(f"Waiting for role propagation... ({int(time.time() - start_time)}s)")
        time.sleep(5)
    # Best effort: this check cannot confirm that Bedrock is able to assume the role yet
    print("Role should be propagated by now, proceeding")
    return False

# Wait for role to be ready
wait_for_role_propagation(role_arn)

In [None]:
eval_jobs = []
for model_arn in model_arns:
    job_name = "model-eval-{}-{}".format(model_arn.split('/')[-1].split(':')[0], str(datetime.datetime.now().timestamp()).split('.')[0])
    job_name = job_name.replace(".", "-")
    print(job_name)
    job = model_eval(model_arn, dataset, task_type, output_path, job_name, metric_names)
    eval_jobs.append(job)


### Monitoring Bedrock Model Evaluation Jobs

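This function repeatedly checks the status of the two submitted Amazon Bedrock evaluation jobs until both have completed or failed, or until a two-hour timeout elapses. With `loop=False`, it checks the status once and returns.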
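
The stop condition can be separated from the polling loop so that it works for any number of jobs. The following is a minimal sketch, not the notebook's implementation: `all_jobs_finished` and `summarize` are hypothetical helpers, `"Completed"` and `"Failed"` are the status strings this notebook checks for, and treating `"Stopped"` as terminal is an assumption added here.

```python
# "Completed" and "Failed" are the statuses checked in this notebook;
# including "Stopped" is an assumption made for this sketch.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def all_jobs_finished(statuses):
    """True when every job status is a terminal one."""
    return all(status in TERMINAL_STATUSES for status in statuses)


def summarize(statuses):
    """Build a one-line status report such as 'job1 is Completed, job2 is InProgress'."""
    return ", ".join(f"job{i} is {status}" for i, status in enumerate(statuses, start=1))


statuses = ["Completed", "InProgress"]
print(summarize(statuses))
print(all_jobs_finished(statuses))
```

In a polling loop, the list of statuses would be rebuilt on each pass from `get_evaluation_job(...)["status"]` for every submitted job.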

In [None]:
## Function to check the job status in a loop until each job is "Completed" or "Failed" after submission.
def check_job_status(eval_jobs, loop=True):
    # Loop and wait for the evaluation jobs to complete.
    from IPython.display import clear_output
    import time
    from datetime import datetime
    
    max_time = time.time() + 2*60*60 # 2 hours - Update the max time if needed
    
    while True:
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        get_eval_job1 = bedrock_client.get_evaluation_job(
            jobIdentifier=eval_jobs[0]['jobArn']
        )

        job1_status = get_eval_job1["status"]
        get_eval_job2 = bedrock_client.get_evaluation_job(
            jobIdentifier=eval_jobs[1]['jobArn']
        )

        job2_status = get_eval_job2["status"]
        
        if loop:
            clear_output(wait=True)
        
        print(f"{current_time} : Model evaluation job1 is {job1_status} and job2 is {job2_status}.")

        both_done = job1_status in ("Completed", "Failed") and job2_status in ("Completed", "Failed")
        if not loop or both_done or time.time() >= max_time:
            break

        time.sleep(60)
    
    return get_eval_job1, get_eval_job2

In [None]:
status1, status2 = check_job_status(eval_jobs, loop=False)

In [None]:
from IPython.display import Markdown, display

display(Markdown(f"You can also review the status of the jobs in the [Amazon Bedrock Console](https://{region}.console.aws.amazon.com/bedrock/home?region={region}#/eval/evaluation)"))

<div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 15px; border-radius: 5px;">

<strong>The evaluation jobs you just submitted may take several minutes to complete.</strong><br><br>


Instead of waiting for the submitted evaluation job(s) to complete, let's proceed with monitoring and analyzing results from previously completed jobs. This approach allows us to:

⏱️ Make productive use of our workshop time.

🧠 Understand the evaluation framework and metrics.

📈 Compare existing model performance results.

In the following cells, we'll:

🔄 Check the status of our submitted job(s).

📥 Retrieve and analyze results from completed evaluation jobs.

⚖️ Compare performance across different models.

📊 Visualize key metrics and insights.
</div>

Next, you retrieve the most recent jobs run for the selected models, task type and dataset.
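
The core of this step is a "keep the newest item per key" pass over the job summaries. The sketch below shows that pattern on hand-written data. `latest_job_per_key` is a hypothetical helper and the sample jobs are illustrative; only the `modelIdentifiers` and `creationTime` fields used in this notebook are assumed.

```python
from datetime import datetime, timezone


def latest_job_per_key(jobs, key_fn):
    """Keep only the most recently created job for each key returned by key_fn."""
    latest = {}
    for job in jobs:
        key = key_fn(job)
        if key not in latest or job["creationTime"] > latest[key]["creationTime"]:
            latest[key] = job
    return list(latest.values())


# Illustrative job summaries, not real API output
sample_jobs = [
    {"jobName": "eval-a-old", "modelIdentifiers": ["model-a"],
     "creationTime": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"jobName": "eval-b", "modelIdentifiers": ["model-b"],
     "creationTime": datetime(2025, 1, 2, tzinfo=timezone.utc)},
    {"jobName": "eval-a-new", "modelIdentifiers": ["model-a"],
     "creationTime": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]

newest = latest_job_per_key(sample_jobs, key_fn=lambda job: job["modelIdentifiers"][0])
print(sorted(job["jobName"] for job in newest))
```

Using a tuple as the key, such as (model, dataset name), keeps one job per combination instead of one per model.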

In [None]:
from datetime import datetime, timedelta, timezone

def get_completed_automatic_jobs(custom=False):
    all_jobs = []
    next_token = None
    
    # Get all jobs with pagination
    while True:
        params = {
            'sortBy': 'CreationTime',
            'sortOrder': 'Descending',
            'statusEquals': 'Completed',
            'maxResults': 1000
        }
        
        if next_token:
            params['nextToken'] = next_token
            
        response = bedrock_client.list_evaluation_jobs(**params)
        all_jobs.extend(response['jobSummaries'])
        
        next_token = response.get('nextToken')
        if not next_token:
            break
    
    print("response #", len(all_jobs))
    
    jobs = [
        job for job in all_jobs
        if 'evaluatorModelIdentifiers' not in job
        and any(job.get('modelIdentifiers', []) == [model] for model in model_arns)
        and job.get('evaluationTaskTypes', []) == [task_type]
    ]
    
    # Group jobs by unique combination of model and dataset type
    job_groups = {}
    
    for job in jobs:
        details = bedrock_client.get_evaluation_job(jobIdentifier=job['jobArn'])
        dataset_name = details['evaluationConfig']['automated']['datasetMetricConfigs'][0]['dataset']['name']
        is_builtin = dataset_name.startswith('Builtin.')
        
        # Skip if doesn't match the requested type (custom vs builtin)
        if (custom and is_builtin) or (not custom and not is_builtin):
            continue
             
        model_id = job['modelIdentifiers'][0]
        eval_type = 'builtin' if is_builtin else 'custom'
        key = (model_id, eval_type, dataset_name)
         
        # Keep only the most recent job for each unique combination
        if key not in job_groups or job['creationTime'] > job_groups[key]['creationTime']:
            job_groups[key] = job
    
    return list(job_groups.values())

In [None]:
completed_jobs = get_completed_automatic_jobs(custom=False)

if len(completed_jobs) >= 2:
    print(f"Found {len(completed_jobs)} completed jobs. Selecting the latest.")
    get_eval_job1 = bedrock_client.get_evaluation_job(jobIdentifier=completed_jobs[0]['jobArn'])
    print(f"Job1 name: {get_eval_job1['jobName']}\nDetails: {get_eval_job1}")
    get_eval_job2 = bedrock_client.get_evaluation_job(jobIdentifier=completed_jobs[1]['jobArn'])
    print(f"Job2 name: {get_eval_job2['jobName']}\nDetails: {get_eval_job2}")
else:
    print(f"Only found {len(completed_jobs)} completed jobs. Need to wait for jobs to complete.")

### Function to get the S3 output location of model evaluation job.
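
Locating the output starts with splitting the job's `s3Uri` into a bucket name and a key prefix. As a small sketch of that step, the hypothetical helper `split_s3_uri` below uses `urllib.parse` from the standard library instead of indexing into `split('/')`; the bucket name is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# "example-bucket" is a placeholder bucket name
bucket_name, key_prefix = split_s3_uri("s3://example-bucket/outputs/")
print(bucket_name, key_prefix)
```

The explicit scheme check turns a malformed URI into an immediate error rather than a confusing listing failure later.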

In [None]:
s3_client = boto3.client('s3')
def get_output_jsonl(bucket, eval_job_response, model, task_type, dataset):
    prefix = "{}{}/{}/models/{}/taskTypes/{}/datasets/{}".format("/".join(eval_job_response["outputDataConfig"]["s3Uri"].split('/')[3:]), eval_job_response["jobName"], eval_job_response["jobArn"].split("/")[1], model, task_type, dataset)
    print(bucket, prefix)
    response = s3_client.list_objects(
        Bucket=bucket,
        Prefix=prefix,
    )
    print(response)
    return response['Contents'][0]['Key']

In [None]:
model_val1 = get_eval_job1['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]
model_val2 = get_eval_job2['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]

bucket_job1 = get_eval_job1["outputDataConfig"]["s3Uri"].split('/')[2]
job1_output = get_output_jsonl(bucket_job1, get_eval_job1, model_val1, task_type, dataset)
print(job1_output)
bucket_job2 = get_eval_job2["outputDataConfig"]["s3Uri"].split('/')[2]
job2_output = get_output_jsonl(bucket_job2, get_eval_job2, model_val2, task_type, dataset)
print(job2_output)

### Function to retrieve metrics from the output
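
Each line of the output file is one JSON record whose scores are read in this notebook from `automatedEvaluationResult.scores`. The sketch below parses JSON Lines text and looks up a metric by name within each record, so it does not depend on the scores appearing in the same order everywhere. `parse_jsonl` and `metric_values` are hypothetical helpers, and the sample records are hand-written with only the fields this notebook reads.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def metric_values(records, metric_name):
    """Collect one score per record for metric_name (None when the metric is absent)."""
    values = []
    for record in records:
        scores = record["automatedEvaluationResult"]["scores"]
        values.append(
            next((s["result"] for s in scores if s["metricName"] == metric_name), None)
        )
    return values


# Hand-written sample records; the second lists its scores in a different order
sample_text = "\n".join([
    json.dumps({"automatedEvaluationResult": {"scores": [
        {"metricName": "Builtin.Accuracy", "result": 0.5},
        {"metricName": "Builtin.Toxicity", "result": 0.01},
    ]}}),
    "",
    json.dumps({"automatedEvaluationResult": {"scores": [
        {"metricName": "Builtin.Toxicity", "result": 0.02},
        {"metricName": "Builtin.Accuracy", "result": 1.0},
    ]}}),
])

records = parse_jsonl(sample_text)
print(metric_values(records, "Builtin.Accuracy"))
```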

In [None]:
import json
s3_res = boto3.resource('s3')

def retrieve_metrics(bucket, output_jsonl):
    content_object = s3_res.Object(bucket, output_jsonl)
    jsonl_content = content_object.get()['Body'].read().decode('utf-8')
    output_content = [json.loads(jline) for jline in jsonl_content.splitlines()]
    return output_content

job1_metrics =  retrieve_metrics(bucket_job1, job1_output)
job2_metrics =  retrieve_metrics(bucket_job2, job2_output)

### Function to filter and load the metrics in pandas DataFrame

In [None]:

import pandas as pd

def pd_metrics(model1, model2, metric, job1_metrics, job2_metrics):
    met1 = []
    met2 = []
    met_index = [job1_metrics[0]['automatedEvaluationResult']['scores'].index(i) for i in job1_metrics[0]['automatedEvaluationResult']['scores'] if i["metricName"]==metric]
    for i, (x, y) in enumerate(zip(job1_metrics, job2_metrics)):
        met1.append(x['automatedEvaluationResult']['scores'][met_index[0]]['result'])
        met2.append(y['automatedEvaluationResult']['scores'][met_index[0]]['result'])
    met = pd.DataFrame({model1.split(':')[0]: met1, model2.split(':')[0]: met2})
    return met

In [None]:
metrics = [m.split('.')[1] for m in metric_names]
stats_list = []
for metric in metric_names:
    met_pd = pd_metrics(model_1, model_2, metric, job1_metrics, job2_metrics)
    
    stats_list.append(met_pd)

### Function to line plot for model comparison per metric

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

def plot_line_metrics(metrics, stats_list):
    for metric, df in zip(metrics, stats_list):
        print("\n \n \n")
        if metric == "Toxicity":
            sub = "    Lower the better"
        else:
            sub = "    Higher the better"
        plt.figure(figsize=(12, 6))
        sns.set_style("whitegrid")
        sns.lineplot(data=df, markers=True, palette="flare")
        plt.legend(title='Model')
        plt.xlabel('Inference test')
        plt.ylabel(metric)
        plt.title(metric)
        plt.figtext(0.5, 0.01, sub, horizontalalignment='center', verticalalignment='bottom', fontsize=10, fontstyle='italic', color='purple')
        plt.show();

In [None]:
plot_line_metrics(metrics, stats_list)

### Function to plot bar chart for avg accuracy per model

In [None]:

def plt_acc_bar(df, metric):
    # Calculate the average of each column
    column_averages = df.mean()

    # Create a bar plot
    plt.figure()
    sns.barplot(x=column_averages.index, y=column_averages.values)

    # Customize the plot
    plt.title("Average metric - {}".format(metric))
    plt.figtext(0.5, -0.01, "   Higher the better", horizontalalignment='center', verticalalignment='bottom', fontsize=10, fontstyle='italic', color='purple')
    plt.xlabel('Models')
    plt.ylabel('Average Value')

    # Rotate x-axis labels if there are many columns
    plt.xticks(rotation=45, ha='right')

    # Add value labels on top of each bar
    for i, v in enumerate(column_averages.values):
        plt.text(i, v, f'{v:.2f}', ha='center', va='bottom')

    plt.tight_layout()
    plt.show()

In [None]:
#Average Accuracy
plt_acc_bar(stats_list[0], metrics[0])

### Function to bin the accuracy data into accuracy ranges (in percent: [0, 20, 40, 60, 80, 100]) and compare between models
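
By default, `pandas.cut` builds right-closed intervals such as (0.0, 0.2] and does not include the lowest edge, so a score of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. The standard-library sketch below reproduces that rule to make the behavior visible; `bin_counts` is a hypothetical helper and the scores are made up.

```python
from bisect import bisect_left


def bin_counts(values, edges):
    """Count values in right-closed bins (edges[i], edges[i+1]].

    Returns (counts, dropped), where dropped is the number of values
    outside every bin, including values equal to the lowest edge.
    """
    counts = [0] * (len(edges) - 1)
    dropped = 0
    for value in values:
        idx = bisect_left(edges, value)  # first edge that is >= value
        if 1 <= idx <= len(edges) - 1:
            counts[idx - 1] += 1
        else:
            dropped += 1
    return counts, dropped


edges = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
scores = [0.0, 0.1, 0.2, 0.25, 1.0]  # made-up accuracy scores
counts, dropped = bin_counts(scores, edges)
print(counts, dropped)
```

If many responses score exactly 0, consider accounting for them explicitly so they are not silently missing from the chart.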

In [None]:

def bin_data(series, bins_list):
    bins = pd.cut(series, bins=bins_list)
    return bins, bins.value_counts().index

def plot_bin_accuracy(df, bins_list):
    # Apply binning to both columns
    df_binned = df.apply(lambda x: bin_data(x, bins_list)[0])
    bin_edges = bin_data(df.values.flatten(), bins_list)[1]

    # Melt the DataFrame to long format
    df_melted = df_binned.melt(var_name='model', value_name='bin')

    # Count the occurrences of each bin for each model
    df_counted = df_melted.groupby(['model', 'bin']).size().reset_index(name='count')

    # Create the plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='bin', y='count', hue='model', data=df_counted)

    # Customize the plot
    plt.title('Comparison of Accuracy Range Across Two Models')
    plt.figtext(0.5, -0.01, "    Higher the better", horizontalalignment='center', verticalalignment='bottom', fontsize=10, fontstyle='italic', color='purple')
    plt.xlabel('Accuracy Range')
    plt.ylabel('Count')
    plt.legend(title='Model')

    # Set x-axis labels to actual bin ranges
    plt.xticks(range(len(bin_edges)), [f'({interval.left:.2f}, {interval.right:.2f}]' for interval in bin_edges], rotation=45, ha='right')

    plt.tight_layout()
    plt.show()

In [None]:
plot_bin_accuracy(stats_list[0], bins_list=[0, 0.2, 0.4, 0.6, 0.8, 1.0]) #update the bin values as needed

## <ins> Automatic Model Evaluation using Custom Dataset </ins>

Now let's evaluate the same models with a custom dataset.

*For demonstration purposes only, we use the Databricks Dolly-15k dataset from Hugging Face.*

**Note: Customers may use their own validation (ground truth) dataset, in the format given below, based on their workload.**


You can use a custom prompt dataset in an automatic model evaluation job. Custom prompt datasets must be stored in Amazon S3, use the JSON Lines format, and have the .jsonl file extension. Each line must be a valid JSON object. A dataset can contain up to 1,000 prompts per automatic evaluation job.

**A custom dataset must use the following key-value pair format.**

`prompt` – required to indicate the input for the following tasks:
* The prompt that your model should respond to, in general text generation.
* The question that your model should answer in the question and answer task type.
* The text that your model should summarize in text summarization task.
* The text that your model should classify in classification tasks.

`referenceResponse` – required to indicate the ground truth response against which your model is evaluated for the following task types:
* The answer for all prompts in question and answer tasks.
* The answer for all accuracy and robustness evaluations.

`category` – (optional) generates evaluation scores reported for each category.

As an example, accuracy requires both the question asked and an answer to check the model's response against. In this example, the key `prompt` holds the question, the key `referenceResponse` holds the answer, and the key `category` holds the category of the question, as follows.

```
{"prompt": "Are The Smiths a good band?", 
"referenceResponse": "The Smiths were one of the most critically acclaimed bands to come from England in the 1980s. Typically classified as an \"indie rock\" band, the band released 4 albums from 1984 until their breakup in 1987. The band members, notably Morrissey and Johnny Marr, would go on to accomplish successful solo careers.",
"category": "general_qa"}
```

Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html#model-evaluation-prompt-datasets-custom for more details.
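
Before uploading a custom dataset, a quick local check can catch lines that are not valid JSON, records without a `prompt`, or files that exceed the 1,000-prompt limit described above. The following is a minimal sketch; `validate_dataset_lines` is a hypothetical helper, and it checks only the rules stated in this section rather than everything the service may enforce.

```python
import json

MAX_PROMPTS = 1000  # limit stated in this section


def validate_dataset_lines(lines):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    records = [line for line in lines if line.strip()]
    if len(records) > MAX_PROMPTS:
        problems.append(f"{len(records)} prompts exceed the limit of {MAX_PROMPTS}")
    for number, line in enumerate(records, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        prompt = record.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append(f"line {number}: missing 'prompt'")
    return problems


sample_lines = [
    '{"prompt": "Are The Smiths a good band?", "referenceResponse": "...", "category": "general_qa"}',
    '{"referenceResponse": "An answer without a question"}',
    "not json at all",
]
print(validate_dataset_lines(sample_lines))
```

To check a file on disk, pass the open file object, for example `validate_dataset_lines(open("custom_dataset.jsonl"))`.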

### Download dolly-15k dataset

In [None]:
!wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl --no-check-certificate

### Select 100 records of the "open_qa" category from the dolly-15k dataset

In [None]:
# Function to filter and select 100 records from dolly dataset
import json

def filter_jsonl(data, key, value):
    filtered_data = []
    for item in data:
        if item.get(key) == value:
            filtered_data.append(item)
    return filtered_data

with open('databricks-dolly-15k.jsonl', 'r') as file:
    data = [json.loads(line) for line in file]

filtered_data = filter_jsonl(data, "category", "open_qa")[:100]
print(len(filtered_data))
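
The cell above keeps the first 100 matching records in file order. If a random subset is preferred, one option is a seeded sample, sketched below. `sample_records` and its default `seed` are hypothetical names and values chosen for this illustration.

```python
import random


def sample_records(records, key, value, n, seed=42):
    """Return up to n randomly chosen records whose `key` equals `value`."""
    matches = [r for r in records if r.get(key) == value]
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    return rng.sample(matches, min(n, len(matches)))


# Small in-memory demonstration with made-up records
demo = [{"category": "open_qa", "id": i} for i in range(10)]
demo.append({"category": "closed_qa", "id": 99})
print(len(sample_records(demo, "category", "open_qa", 5)))
```

With the notebook's data this could be called as `sample_records(data, "category", "open_qa", 100)`. Fixing the seed makes repeated runs evaluate the same prompts, which keeps model comparisons consistent.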

### Convert the records to the custom dataset format

In [None]:
custom_jsonl = './custom_dataset.jsonl'

def write_jsonl(data, filename):
    with open(filename, 'w') as f:
        for item in data:
            item_mod = {}
            item_mod['prompt'] = item['instruction']
            item_mod['referenceResponse'] = item['response']
            item_mod['category'] = item['category']
            f.write(json.dumps(item_mod) + '\n')

# Write to JSONL file
write_jsonl(filtered_data, custom_jsonl)
                     

### Upload dataset jsonl to S3 Bucket

In [None]:
import boto3

# Upload to an object key that ends in .jsonl, as the custom dataset format requires
dataset_key = 'custom_datasets/dolly/custom_dataset.jsonl'
s3_res = boto3.resource('s3')
s3_res.Bucket(bucket).upload_file(custom_jsonl, dataset_key)


### Choose task_type, metrics and s3 input/output path

In [None]:
task_type = "QuestionAndAnswer"
metric_names = ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"] #Add or remove metrics within the list format
output_path = "s3://{}/outputs/".format(bucket)
# Point the evaluation job at the uploaded dataset object, not just its prefix
cus_ds_s3 = "s3://{}/{}".format(bucket, dataset_key)

### Submit automatic model evaluation jobs with custom dataset

In [None]:
import datetime
cust_eval_jobs = []
for model_arn in model_arns:
    job_name = "eval-custom-{}-{}".format(model_arn.split('/')[-1].split(':')[0], str(datetime.datetime.now().timestamp()).split('.')[0])
    job_name = job_name.replace(".", "-")
    print(job_name)
    job = model_eval(model_arn, "dolly-open-qa-custom", task_type, output_path, job_name, metric_names, custom_ds=True, custom_ds_s3=cus_ds_s3)
    cust_eval_jobs.append(job)
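
The `model_eval` helper used above is defined earlier in this notebook. As a rough sketch of the kind of request such a helper might assemble for the Bedrock `create_evaluation_job` API with a custom dataset, consider the function below. `build_eval_request` and its parameter names are hypothetical, the sketch leaves out optional fields such as inference parameters, and the field layout should be checked against the current API reference before use.

```python
def build_eval_request(job_name, role_arn, model_id, task_type, metric_names,
                       dataset_name, dataset_s3_uri, output_s3_uri):
    """Build keyword arguments for an automatic evaluation job (illustrative only)."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,  # IAM role that Bedrock assumes to run the job
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": task_type,
                        "dataset": {
                            "name": dataset_name,
                            # A custom dataset is referenced by its S3 location
                            "datasetLocation": {"s3Uri": dataset_s3_uri},
                        },
                        "metricNames": list(metric_names),
                    }
                ]
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": model_id}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


# Placeholder values for demonstration; none of these resources exist
request = build_eval_request(
    job_name="eval-custom-example",
    role_arn="arn:aws:iam::111122223333:role/example-role",
    model_id="example-model-id",
    task_type="QuestionAndAnswer",
    metric_names=["Builtin.Accuracy"],
    dataset_name="dolly-open-qa-custom",
    dataset_s3_uri="s3://example-bucket/custom_datasets/dolly/custom_dataset.jsonl",
    output_s3_uri="s3://example-bucket/outputs/",
)
print(sorted(request))
```

Under these assumptions, the dictionary could be passed as `bedrock_client.create_evaluation_job(**request)`.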

### Track evaluation job status until "COMPLETED" or "FAILED"

In [None]:
get_cust_eval_job1, get_cust_eval_job2 = check_job_status(cust_eval_jobs,False)
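
The `check_job_status` helper is defined earlier in this notebook. For readers adapting this to their own environment, a polling loop of this kind could look like the sketch below. `wait_for_job`, its parameters, and the set of terminal statuses are assumptions made for illustration; the status comparison is case-insensitive so that it does not depend on the exact casing the service returns.

```python
import time

# Assumed terminal states, compared in lower case
TERMINAL_STATUSES = {"completed", "failed", "stopped"}


def wait_for_job(get_job, job_arn, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_job(jobIdentifier=...) until the job reaches a terminal status."""
    for _ in range(max_polls):
        response = get_job(jobIdentifier=job_arn)
        if response["status"].lower() in TERMINAL_STATUSES:
            return response
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"Job {job_arn} did not finish after {max_polls} polls")


# Demonstration with a stand-in for the real client call
statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_get_job(jobIdentifier):
    return {"jobArn": jobIdentifier, "status": next(statuses)}


final = wait_for_job(fake_get_job, "example-job-arn", poll_seconds=0)
print(final["status"])
```

Against the real service, `bedrock_client.get_evaluation_job` could be passed as `get_job`, together with the ARN of a submitted job. Passing the client call in as an argument keeps the loop easy to test without AWS credentials.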

### Get evaluation jobs output

<div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 15px; border-radius: 5px;">

<strong>The evaluation jobs you just submitted may take several minutes to complete.</strong><br><br>


Instead of waiting for the submitted evaluation job(s) to complete, let's proceed with monitoring and analyzing results from previously completed jobs. This approach allows us to:

⏱️ Make productive use of our workshop time.

🧠 Understand the evaluation framework and metrics.

📈 Compare existing model performance results.

In the following cells, we'll:

🔄 Check the status of our submitted job(s).

📥 Retrieve and analyze results from completed evaluation jobs.

⚖️ Compare performance across different models.

📊 Visualize key metrics and insights.
</div>

In [None]:
completed_jobs = get_completed_automatic_jobs(custom=True)

if len(completed_jobs) >= 2:
    get_eval_job1 = bedrock_client.get_evaluation_job(jobIdentifier=completed_jobs[0]['jobArn'])
    print(f"Job1 name: {get_eval_job1['jobName']}")
    get_eval_job2 = bedrock_client.get_evaluation_job(jobIdentifier=completed_jobs[1]['jobArn'])
    print(f"Job2 name: {get_eval_job2['jobName']}")
else:
    print(f"Only found {len(completed_jobs)} completed jobs. Need to wait for jobs to complete.")

In [None]:
model_val1 = get_eval_job1['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]
model_val2 = get_eval_job2['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]

bucket_cust_job1 = get_eval_job1["outputDataConfig"]["s3Uri"].split('/')[2]
print(bucket_cust_job1)
cust_job1_output = get_output_jsonl(bucket_cust_job1, get_eval_job1, model_val1, task_type, dataset="dolly-open-qa-custom")
print(cust_job1_output)
bucket_cust_job2 = get_eval_job2["outputDataConfig"]["s3Uri"].split('/')[2]
cust_job2_output = get_output_jsonl(bucket_cust_job2, get_eval_job2, model_val2, task_type, dataset="dolly-open-qa-custom")

### Retrieve metrics

In [None]:
cust_job1_metrics =  retrieve_metrics(bucket_cust_job1, cust_job1_output)
cust_job2_metrics =  retrieve_metrics(bucket_cust_job2, cust_job2_output)
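
The `retrieve_metrics` helper is defined earlier in this notebook. To give a feel for what reading the output involves, the sketch below averages per-record scores from JSON Lines text. It assumes each output record carries its scores under `automatedEvaluationResult.scores` as a list of `metricName` and `result` pairs; that layout, the function name `mean_scores`, and the sample records are assumptions for illustration and should be verified against an actual output file.

```python
import json
from collections import defaultdict


def mean_scores(jsonl_lines):
    """Average the numeric result of each metric across all records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        scores = record.get("automatedEvaluationResult", {}).get("scores", [])
        for score in scores:
            result = score.get("result")
            if isinstance(result, (int, float)):  # ignore missing or non-numeric results
                totals[score["metricName"]] += result
                counts[score["metricName"]] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Made-up records that follow the assumed layout
example_lines = [
    '{"automatedEvaluationResult": {"scores": [{"metricName": "Accuracy", "result": 0.5}]}}',
    '{"automatedEvaluationResult": {"scores": [{"metricName": "Accuracy", "result": 1.0}]}}',
]
print(mean_scores(example_lines))
```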

In [None]:
metrics = [m.split('.')[1] for m in metric_names]
stats_list = []
for metric in metric_names:
    # Use the models and metrics from the custom dataset jobs retrieved above
    met_pd = pd_metrics(model_val1, model_val2, metric, cust_job1_metrics, cust_job2_metrics)
    
    stats_list.append(met_pd)

### Draw line plot for model comparison per metric

In [None]:
plot_line_metrics(metrics, stats_list)

### Average Accuracy per model

In [None]:
plt_acc_bar(stats_list[0], metrics[0])

### Plot across different ranges of accuracy and compare

In [None]:
plot_bin_accuracy(stats_list[0], bins_list=[0, 0.2, 0.4, 0.6, 0.8, 1.0]) #update the bin values as needed
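
The `plot_bin_accuracy` helper is defined earlier in this notebook. To illustrate what grouping scores by `bins_list` means, the sketch below counts how many scores fall into each range. `bin_counts` is a hypothetical function, and the edge convention used here (each bin includes its lower edge, and the last bin also includes its upper edge) is an assumption that may differ from the helper's behavior.

```python
def bin_counts(scores, bins_list):
    """Count scores per bin defined by consecutive edges in bins_list."""
    counts = [0] * (len(bins_list) - 1)
    for score in scores:
        for i in range(len(counts)):
            low, high = bins_list[i], bins_list[i + 1]
            is_last = i == len(counts) - 1
            # Bins are [low, high), except the last one, which is [low, high]
            if low <= score < high or (is_last and score == high):
                counts[i] += 1
                break
    return counts


example_scores = [0.0, 0.1, 0.2, 0.95, 1.0]
print(bin_counts(example_scores, [0, 0.2, 0.4, 0.6, 0.8, 1.0]))  # prints [2, 1, 0, 0, 2]
```

Scores outside the outermost edges are not counted, so widening `bins_list` is the way to include them.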