# Evaluating Nova.lite and Nova.pro on MeetingBank Dataset

This notebook demonstrates how to evaluate Amazon Bedrock models (Nova.lite and Nova.pro) on the MeetingBank dataset for meeting summarization tasks.

In [11]:
# Import required libraries
%load_ext autoreload
%autoreload 2
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
import boto3
from datetime import datetime

# Import utility functions
from utils.dataset_utils import load_meetingbank_dataset, get_test_samples, prepare_for_bedrock_evaluation
from utils.bedrock_utils import (
    create_s3_bucket_if_not_exists,
    apply_cors_if_not_exists,
    upload_to_s3,
    create_evaluation_job,
    wait_for_job_completion,
    download_evaluation_results
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Configure AWS Credentials

Make sure you have AWS credentials configured with appropriate permissions for Amazon Bedrock and S3.

In [None]:
# Set AWS region
region = "us-east-1"  # Change to your preferred region where Bedrock is available
BEDROCK_ROLE_ARN = "arn:aws:iam::864016358360:role/service-role/Amazon-Bedrock-IAM-Role-20250531T202875"
bucket_name = 'eval-datasets-us-east-1'
NUM_SAMPLES_PER_EVAL = 50

# Set IAM role ARN with permissions for Bedrock evaluation
# This role needs permissions to access S3 and invoke Bedrock models
#os.environ["BEDROCK_ROLE_ARN"] = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_BEDROCK_ROLE"  # Replace with your role ARN
os.environ["BEDROCK_ROLE_ARN"] = BEDROCK_ROLE_ARN

# Verify AWS credentials
try:
    sts = boto3.client('sts')
    identity = sts.get_caller_identity()
    print(f"AWS Identity verified: {identity['Arn']}")
except Exception as e:
    print(f"Error verifying AWS credentials: {e}")
    raise

AWS Identity verified: arn:aws:sts::864016358360:assumed-role/Admin/gili-Isengard


## 2. Load MeetingBank Dataset

In [13]:
# Load the dataset
dataset = load_meetingbank_dataset()
print(f"Dataset structure: {dataset}")
print(f"Available splits: {dataset.keys()}")

Dataset structure: DatasetDict({
    train: Dataset({
        features: ['summary', 'uid', 'id', 'transcript'],
        num_rows: 5169
    })
    validation: Dataset({
        features: ['summary', 'uid', 'id', 'transcript'],
        num_rows: 861
    })
    test: Dataset({
        features: ['summary', 'uid', 'id', 'transcript'],
        num_rows: 862
    })
})
Available splits: dict_keys(['train', 'validation', 'test'])


In [None]:
# Get the first {NUM_SAMPLES_PER_EVAL} samples from the test set
test_samples = get_test_samples(dataset, num_samples=NUM_SAMPLES_PER_EVAL)
print(f"Number of test samples: {len(test_samples)}")

# Display sample information
for i, sample in enumerate(test_samples):
    print(f"\nSample {i+1}:")
    print(f"Transcript length: {len(sample['transcript'])} characters")
    print(f"Summary length: {len(sample['summary'])} characters")
    print(f"Summary: {sample['summary'][:200]}...")

Number of test samples: 20

Sample 1:
Transcript length: 7153 characters
Summary length: 278 characters
Summary: A RESOLUTION encouraging as a best practice the use of an individualized tenant assessment using the Fair Housing Act’s discriminatory effects standard to avoid Fair Housing Act violations when crimin...

Sample 2:
Transcript length: 2822 characters
Summary length: 643 characters
Summary: On the message and order, referred on December 1, 2021, Docket #1239, authorizing the creation of a Sheltered Market Program in conformity with the requirements of G.L.C Chapter 30 B Section 18. This ...

Sample 3:
Transcript length: 1381 characters
Summary length: 350 characters
Summary: Adopt resolution consenting to inclusion of certain properties within the jurisdiction in the California HERO Program to finance distributed generation renewable energy sources, energy and water effic...

Sample 4:
Transcript length: 37394 characters
Summary length: 644 characters
Summary: AN ORDINANCE rel

## 3. Prepare Dataset for Bedrock Evaluation

In [15]:
# Prepare the dataset for Bedrock evaluation
evaluation_dataset_path = prepare_for_bedrock_evaluation(test_samples)
print(f"Evaluation dataset created at: {evaluation_dataset_path}")

# Display the content of the evaluation dataset
with open(evaluation_dataset_path, 'r') as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(f"\nRecord {i+1}:")
        print(f"Prompt length: {len(record['prompt'])} characters")
        print(f"Reference response length: {len(record['referenceResponse'])} characters")
        print(f"Category: {record['category']}")

Evaluation dataset created at: ./data/bedrock_evaluation_dataset.jsonl

Record 1:
Prompt length: 7198 characters
Reference response length: 278 characters
Category: meeting_summarization

Record 2:
Prompt length: 2867 characters
Reference response length: 643 characters
Category: meeting_summarization

Record 3:
Prompt length: 1426 characters
Reference response length: 350 characters
Category: meeting_summarization

Record 4:
Prompt length: 37439 characters
Reference response length: 644 characters
Category: meeting_summarization

Record 5:
Prompt length: 5081 characters
Reference response length: 251 characters
Category: meeting_summarization

Record 6:
Prompt length: 3415 characters
Reference response length: 399 characters
Category: meeting_summarization

Record 7:
Prompt length: 4384 characters
Reference response length: 128 characters
Category: meeting_summarization

Record 8:
Prompt length: 5981 characters
Reference response length: 354 characters
Category: meeting_summarization


## 4. Upload Dataset to S3

In [16]:
create_s3_bucket_if_not_exists(bucket_name, region)
apply_cors_if_not_exists(bucket_name, region)

# Upload the evaluation dataset to S3
dataset_s3_key = "evaluation/meetingbank_dataset.jsonl"
dataset_s3_uri = upload_to_s3(evaluation_dataset_path, bucket_name, dataset_s3_key, region)
print(f"Dataset uploaded to: {dataset_s3_uri}")

# Define the output location in S3
output_s3_uri = f"s3://{bucket_name}/evaluation/results/"
print(f"Results will be stored at: {output_s3_uri}")

Bucket eval-datasets-us-east-1 already exists
CORS configuration already exists for bucket eval-datasets-us-east-1
Dataset uploaded to: s3://eval-datasets-us-east-1/evaluation/meetingbank_dataset.jsonl
Results will be stored at: s3://eval-datasets-us-east-1/evaluation/results/


## 5. Create and Run Bedrock Evaluation Job

In [None]:
# Define the models to evaluate
models = [
    {
        "name" : "nova-micro",
        "model_id" : "us.amazon.nova-micro-v1:0",
    },
    {
        "name" : "nova-lite",
        "model_id" : "us.amazon.nova-lite-v1:0",
    },
    {
        "name" : "nova-pro",
        "model_id" : "us.amazon.nova-pro-v1:0",
    },
    {
        "name" : "nova-premier",
        "model_id" : "us.amazon.nova-premier-v1:0",
    },
    {
        "name" : "Haiku-3",
        "model_id" : "us.anthropic.claude-3-haiku-20240307-v1:0",
    },
    {
        "name" : "Sonnet-3.5-v2",
        "model_id" : "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    },
]

for model in models:
    print(f"Model: {model}")
    model_name = model["name"]
    model_id = model["model_id"]
    # Create a unique job name
    job_name = f"meetingbank-{model_name}-{datetime.now().strftime('%Y%m%d%H%M%S')}"
    model[job_name] = job_name

    # Create the evaluation job
    try:
        job_arn = create_evaluation_job(
            job_name=job_name,
            dataset_s3_uri=dataset_s3_uri,
            output_s3_uri=output_s3_uri,
            model_id=model_id,
            region=region
        )
        print(f"Evaluation job created with ARN: {job_arn}")
        model['job_arn'] = job_arn
    except Exception as e:
        print(f"Error creating evaluation job: {e}")
        raise

Model: {'name': 'nova-micro', 'model_id': 'us.amazon.nova-micro-v1:0'}
Evaluation job created with ARN: arn:aws:bedrock:us-east-1:864016358360:evaluation-job/cgp7gjwllrkf


In [18]:
for model in models:
    job_arn=model['job_arn']
    print(f"name = {model["name"]}. Job ARN: {job_arn}")
    # Wait for the job to complete
    print("Waiting for evaluation job to complete...")
    job_details = wait_for_job_completion(job_arn, region)
    print(f"Job completed with status: {job_details['status']}")

name = nova-micro. Job ARN: arn:aws:bedrock:us-east-1:864016358360:evaluation-job/cgp7gjwllrkf
Waiting for evaluation job to complete...
Job status: InProgress. Waiting 60 seconds...


KeyboardInterrupt: 

## 6. Download and Analyze Results

In [None]:

for model in models:
    job_name = model['job_name']
    results_local_path = f'./results/{job_name}'
    # Download the evaluation results
    results_base_dir_s3 = f"{output_s3_uri}/{job_name}/"
    print(results_base_dir_s3)

    try:
        download_evaluation_results(results_base_dir_s3, results_local_path, region)
        print(f"Results downloaded to: {results_local_path}")
    except Exception as e:
        print(f"Error downloading results: {e}")
        raise

s3://gili-datasets-us-east-1/evaluation/results//meetingbank-nova-micro-20250531184651/
prefix=evaluation/results/meetingbank-nova-micro-20250531184651/
Downloading evaluation/results/meetingbank-nova-micro-20250531184651/b63z3uekqo2i/models/amazon.nova-micro-v1:0/taskTypes/General/datasets/meetingbank_dataset/70402edd-5f97-4233-8b25-24dd1f230d22_output.jsonl to ./results/meetingbank-nova-micro-20250531184651/output.jsonl
Results downloaded to: ./results/meetingbank-nova-micro-20250531184651
s3://gili-datasets-us-east-1/evaluation/results//meetingbank-nova-lite-20250531184652/
prefix=evaluation/results/meetingbank-nova-lite-20250531184652/
Downloading evaluation/results/meetingbank-nova-lite-20250531184652/xzzfgfmzni9k/models/amazon.nova-lite-v1:0/taskTypes/General/datasets/meetingbank_dataset/acb7b4d5-e825-495f-b84f-0037fb405e84_output.jsonl to ./results/meetingbank-nova-lite-20250531184652/output.jsonl
Results downloaded to: ./results/meetingbank-nova-lite-20250531184652


In [None]:
# Load and analyze the results
for model in models:
    job_name = model['job_name']
    results_local_path = f'./results/{job_name}/output.jsonl'
    with open(results_local_path, 'r') as f:
        results = json.load(f)

    # Extract and display model scores
    model_scores = {}
    for model_id in model_ids:
        model_name = "Nova.lite" if "haiku" in model_id else "Nova.pro"
        model_scores[model_name] = {
            "Relevance": 0,
            "Accuracy": 0,
            "Coherence": 0,
            "Conciseness": 0
        }
        
        # Extract scores from results (structure depends on actual output format)
        # This is a placeholder - adjust based on actual result structure
        # model_scores[model_name]["Relevance"] = results[model_id]["metrics"]["Relevance"]
        # ...

    # Display the scores
    scores_df = pd.DataFrame(model_scores)
    print(scores_df)

JSONDecodeError: Extra data: line 2 column 1 (char 11731)

In [None]:
# Visualize the results
ax = scores_df.plot(kind='bar', figsize=(10, 6))
ax.set_title('Model Evaluation Scores on MeetingBank Dataset')
ax.set_ylabel('Score')
ax.set_ylim(0, 5)  # Assuming scores are on a 0-5 scale
plt.legend(title='Models')
plt.tight_layout()
plt.savefig('evaluation_results.png')
plt.show()

## 7. Conclusion

This notebook demonstrated how to:
1. Load the MeetingBank dataset
2. Prepare the dataset for Bedrock evaluation
3. Create and run a Bedrock evaluation job
4. Analyze and visualize the evaluation results

The evaluation compared Nova.lite and Nova.pro models on meeting summarization tasks using built-in Bedrock evaluators for Relevance, Accuracy, Coherence, and Conciseness.