# Model Evaluation and Performance Analysis

## Overview

This notebook represents the final stage in our model distillation journey, where we evaluate the performance of our distilled model against the original model. We leverage Amazon Bedrock's RAG Evaluation capabilities with Bring Your Own Inference (BYOI) support to conduct a comprehensive assessment of model quality and citation capabilities.

### Learning Objectives

By the end of this notebook, you will understand:
- How to structure and format evaluation datasets for BYOI evaluation
- Advanced evaluation metrics for assessing RAG system performance
- Techniques for analyzing citation quality and knowledge transfer effectiveness

## Evaluation Metrics Deep Dive

Our evaluation framework uses several sophisticated metrics designed for RAG systems:

### Citation Quality Metrics
- **Citation Coverage**: Measures how well a response is supported by the passages it cites, graded on a 5-point Likert scale. Low scores flag responses whose claims are not backed by the cited context.

### Response Quality Metrics
- **Correctness**: Assesses factual accuracy by comparing generated content against ground truth responses and source documents.
- **Completeness**: Measures response thoroughness relative to the question's requirements and available context.
- **Faithfulness**: Evaluates how well responses align with provided context, detecting potential hallucinations or unsupported claims.
- **Helpfulness**: Analyzes practical utility by considering factors like clarity, relevance, and actionability.
- **Logical Coherence**: Examines response consistency and reasoning quality, particularly important for complex queries.

> **Advanced Note**: These metrics are calculated using specialized evaluator models that perform semantic analysis rather than simple string matching, enabling nuanced assessment of model performance.

## Prerequisites

Ensure you have completed the previous notebooks in this sequence:
1. `01_prepare_data.ipynb`: Data preparation and formatting
2. `02_distill.ipynb`: Model distillation process
3. `03_batch_inference.ipynb`: Batch inference implementation

Additional requirements:
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region ([Enable Bedrock models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html))
- An S3 bucket for storing evaluation data and results
- An IAM role with necessary permissions for S3 and Bedrock ([IAM setup guide](https://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html))
- RAG system outputs formatted according to the BYOI specification

> **Important**: The evaluation process requires access to Amazon Bedrock evaluator models. Ensure these are enabled in your account and you have sufficient [quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html) for your evaluation workload.

## BYOI Evaluation Schema

The BYOI format enables evaluation of any RAG system by providing a standardized schema for inputs and outputs. This section details the required format and structure for evaluation data.

### Schema Components

1. **Conversation Turns**: Each turn represents a complete interaction
   - Prompt: The original question or query
   - Reference Responses: Ground truth answers for comparison
   - Output: The system's response including:
     * Generated text
     * Retrieved passages
     * Citation information

2. **Citations**: Structured references linking response segments to source passages
   - Generated Response Part: The specific text segment using cited material
   - Retrieved References: The source passages supporting the citation

3. **Metadata**: Additional context about the evaluation
   - Model identifier
   - Knowledge base information
   - Optional source attribution

### Example Schema

```json
{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": "Your prompt here"
          }
        ]
      },
      "referenceResponses": [
        {
          "content": [
            {
              "text": "Expected ground truth answer"
            }
          ]
        }
      ],
      "output": {
        "text": "Generated response text",
        "knowledgeBaseIdentifier": "third-party-RAG",
        "retrievedPassages": {
          "retrievalResults": [
            {
              "name": "Optional passage name",
              "content": {
                "text": "Retrieved passage content"
              },
              "metadata": {
                "source": "Optional metadata"
              }
            }
          ]
        },
        "citations": [
          {
            "generatedResponsePart": {
              "textResponsePart": {
                "span": {
                  "start": 0,
                  "end": 50
                },
                "text": "Part of the response that uses cited material"
              }
            },
            "retrievedReferences": [
              {
                "name": "Optional passage name",
                "content": {
                  "text": "Source passage for the citation"
                },
                "metadata": {
                  "source": "Optional metadata"
                }
              }
            ]
          }
        ]
      }
    }
  ]
}
```
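Before uploading a dataset, it can help to sanity-check each record against the schema above. The following is a minimal sketch, not an official validator: the function name `find_byoi_problems` is hypothetical, and the choice of which fields to treat as required (prompt text, `output.text`, `knowledgeBaseIdentifier`, passage text) is an assumption based on the example schema rather than on the service's own validation rules.

```python
def find_byoi_problems(record):
    """Return a list of human-readable problems found in one BYOI record.

    Assumption: the fields checked here mirror the example schema; the
    Bedrock service may enforce additional or different constraints.
    """
    turns = record.get("conversationTurns")
    if not isinstance(turns, list) or not turns:
        return ["conversationTurns must be a non-empty list"]

    problems = []
    for i, turn in enumerate(turns):
        # The prompt needs at least one non-empty text part
        prompt_parts = turn.get("prompt", {}).get("content", [])
        if not any(part.get("text") for part in prompt_parts):
            problems.append(f"turn {i}: prompt.content needs a text part")

        output = turn.get("output", {})
        if not output.get("text"):
            problems.append(f"turn {i}: output.text is missing or empty")
        if not output.get("knowledgeBaseIdentifier"):
            problems.append(f"turn {i}: output.knowledgeBaseIdentifier is missing")

        # Every retrieved passage should carry its text
        passages = output.get("retrievedPassages", {}).get("retrievalResults", [])
        for j, passage in enumerate(passages):
            if not passage.get("content", {}).get("text"):
                problems.append(f"turn {i}: passage {j} has no content.text")
    return problems


sample_record = {
    "conversationTurns": [
        {
            "prompt": {"content": [{"text": "Your prompt here"}]},
            "output": {
                "text": "Generated response text",
                "knowledgeBaseIdentifier": "third-party-RAG",
                "retrievedPassages": {
                    "retrievalResults": [
                        {"content": {"text": "Retrieved passage content"}}
                    ]
                },
            },
        }
    ]
}

print(find_byoi_problems(sample_record))
```

Running a check like this over every line of the JSONL file before submitting a job is cheaper than waiting for the evaluation job itself to reject the dataset.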

## Implementation

Let's set up our evaluation pipeline by first importing required dependencies and configuring our environment:

In [None]:
#Upgrade Boto3
!pip install --upgrade boto3

In [None]:
# Import required libraries and setup environment
import boto3
import json
import os
import sys
from datetime import datetime
import re

print(boto3.__version__)
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
skip_dir = os.path.dirname(parent_dir)
sys.path.append(skip_dir)
from utils import read_jsonl_to_dataframe, upload_training_data_to_s3

In [None]:
def split_user_message(user_msg):
    # Regex pattern to extract content between tags
    pattern = r"<context>(.*?)</context>\s*<question>(.*?)</question>"
    
    # Apply regex
    match = re.search(pattern, user_msg, re.DOTALL)
    
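    if not match:
        # Fail loudly: callers unpack two values, so silently returning None
        # would surface later as a confusing TypeError
        raise ValueError("Expected <context>...</context> followed by <question>...</question> in the user message")

    context = match.group(1).strip()
    question = match.group(2).strip()
    return context, question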

In [None]:
def create_eval_record(results_row, labeled_row, model_identifier=None, kb_identifier='kb_id'):
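    """
    Builds a single BYOI evaluation record from one batch inference result and its labeled counterpart.

    Args:
        results_row:        pandas row from the batch inference results
        labeled_row:        pandas row from the labeled data with the same recordId
        model_identifier:   identifier for the model being evaluated (optional)
        kb_identifier:      identifier for the knowledge base source

    Returns:
        dict: A formatted payload dictionary ready for Bedrock Evaluations API

    """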
    result = results_row.to_dict()
    label = labeled_row.to_dict()

    retrieval_content, question = split_user_message(result['modelInput']['messages'][0]['content'][0]['text'])
    return {
        "conversationTurns": [
            {
                "prompt": {
                    "content": [
                        {
                            "text": question
                        }
                    ]
                },
                "referenceResponses": [
                    {
                        "content": label['modelInput']['messages'][1]['content'] # labeled answer content from assistant
                    }
                ],
                "output": {
                    "text": result['modelOutput']['output']['message']['content'][0]['text'],
                    "modelIdentifier": model_identifier if model_identifier else 'placeholder_model',
                    "knowledgeBaseIdentifier": kb_identifier,
                    "retrievedPassages": {
                        "retrievalResults": [ # put context from squad example here
                            {
                                "name": f"retrieval_{results_row['recordId']}",
                                "content": {
                                    "text": retrieval_content
                                }
                            }
                        ]
                    }
                }
            }
        ]
    }

In [None]:
def create_evaluation_dataset(model_identifier, dataframe, labeled_df, rag_source_id=None, max_records=1000, output_filename=None):
    """
    Create an evaluation dataset from model results and write to a JSONL file.
    
    Parameters:
    -----------
    model_identifier : str
        Identifier for the model being evaluated
    dataframe : pandas.DataFrame
        DataFrame containing the results to evaluate
    labeled_df : pandas.DataFrame
        DataFrame containing labeled data with matching recordIds
    rag_source_id : str, optional
        Identifier for the knowledge base source
    max_records : int, optional
        Maximum number of records to process (default: 1000)
    output_filename : str, optional
        Custom filename for the output JSONL file
        
    Returns:
    --------
    str
        Path to the created JSONL file
    """
    import json
    
    eval_dataset = []
    kb_identifier = rag_source_id or f"{model_identifier}_kb"
    
    for ix, results_row in dataframe.head(max_records).iterrows():
        labeled_row = labeled_df[labeled_df['recordId'] == results_row['recordId']].iloc[0]
        eval_dataset.append(create_eval_record(
            results_row=results_row, 
            labeled_row=labeled_row,
            model_identifier=model_identifier, 
            kb_identifier=kb_identifier
        ))
    
    jsonl_file = output_filename or f"evaluation_data_{model_identifier}.jsonl"
    
    # Write all records to JSONL file
    with open(jsonl_file, 'w', encoding='utf-8') as f:
        for record in eval_dataset:
            f.write(json.dumps(record) + '\n')
    
    print(f"Successfully wrote {len(eval_dataset)} records to {jsonl_file}")
    return jsonl_file

## IAM Role Configuration

https://docs.aws.amazon.com/bedrock/latest/userguide/judge-service-roles.html

To run evaluation jobs securely, we need an IAM service role with the following permissions:

1. **S3 Access**: Read/write permissions for evaluation data and results
2. **Bedrock Model Access**: Permission to invoke evaluator models
3. **Evaluation Job Management**: Ability to create and monitor jobs

### Required IAM Policies

1. S3 Access Policy:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket/*",
                "arn:aws:s3:::your-bucket"
            ]
        }
    ]
}
```

2. Bedrock Access Policy:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:CreateEvaluationJob",
                "bedrock:GetEvaluationJob",
                "bedrock:ListEvaluationJobs"
            ],
            "Resource": "*"
        }
    ]
}
```

For detailed setup instructions, visit the [Bedrock Service Roles documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/judge-service-roles.html).
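The two policies above grant permissions, but a service role also needs a trust policy that lets Bedrock assume it. The sketch below builds such a document in Python. It assumes the Bedrock service principal `bedrock.amazonaws.com` and uses an `aws:SourceAccount` condition to restrict which account can trigger the role; the function name `build_trust_policy` and the account ID are placeholders, and the linked documentation remains the authority on the exact conditions to use.

```python
import json


def build_trust_policy(account_id):
    """Return a trust policy allowing Amazon Bedrock to assume the role.

    Assumption: restricting by aws:SourceAccount is sufficient for this
    walkthrough; the service-role documentation may recommend tighter
    conditions (for example, on the evaluation job ARN).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id}
                },
            }
        ],
    }


# Placeholder account ID for illustration only
trust_policy = build_trust_policy("111122223333")
print(json.dumps(trust_policy, indent=4))
```

The resulting JSON can be supplied as the role's trust relationship when creating the role in the IAM console or through the IAM API.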

## Creating and Running Evaluation Jobs

Now let's create and submit our evaluation jobs. We'll start by making a helper function to do this:

In [None]:
def create_rag_evaluation_job(
    bedrock_client,
    model_identifier, 
    rag_source_id,
    input_data,
    role_arn,
    output_path,
    evaluator_model,
    metrics=None
):
    """
    Creates a Bedrock RAG evaluation job.

    Parameters:
    -----------
    bedrock_client : boto3.client
        The Bedrock client to use for creating the job
    model_identifier : str
        Identifier for the model being evaluated
    rag_source_id : str
        Identifier for the RAG knowledge base source
    input_data : str
        S3 URI for the input evaluation dataset
    role_arn : str
        IAM role ARN with sufficient permissions to run the evaluation
    output_path : str
        S3 URI where evaluation results will be stored
    evaluator_model : str
        Identifier for the model that will perform evaluations
    metrics : list, optional
        List of metrics to evaluate (defaults to standard RAG metrics)

    Returns:
    --------
    str
        The job ARN from the create_evaluation_job API call
    """
    from datetime import datetime
    
    # Set default RAG metrics if none provided
    if metrics is None:
        metrics = [
            "Builtin.Correctness",
            "Builtin.Completeness",
            "Builtin.Helpfulness",
            "Builtin.LogicalCoherence",
            "Builtin.Faithfulness"
        ]
    
    # Generate job name using timestamp for uniqueness
    retrieve_generate_job_name = f"citations-{model_identifier.replace('_','-')}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    # Create the evaluation job
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Evaluate retrieval and generation",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [
                {
                    "precomputedRagSourceConfig": {
                        "retrieveAndGenerateSourceConfig": {
                            "ragSourceIdentifier": rag_source_id
                        }
                    }
                }
            ]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",  
                    "dataset": {
                        "name": f"{model_identifier}_dataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": metrics
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )
    
    return retrieve_generate_job['jobArn']

## Submit Evaluation Jobs
We'll now submit an evaluation job for each model. For each one, provide the location of its batch inference results file and a model name, which is used to name the evaluation job.

Bedrock evaluation jobs, like batch inference jobs, can take several hours to run. Once the jobs have started, you can return here later to continue the evaluation.
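Because jobs run for a long time, a small polling helper can save repeated manual checks. This is a sketch under stated assumptions: `wait_for_evaluation_job` is a hypothetical name, the set of terminal statuses is an assumption about the values returned in the `status` field of `get_evaluation_job`, and the default interval (5 minutes, up to 48 polls, roughly 4 hours) is arbitrary.

```python
import time

# Assumption: these are the statuses after which the job will not change further
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(bedrock_client, job_arn, poll_seconds=300, max_polls=48):
    """Poll an evaluation job until it reaches a terminal status.

    Returns the last observed status, which may be non-terminal if
    max_polls is exhausted first.
    """
    status = None
    for _ in range(max_polls):
        response = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)
        status = response["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    return status


# Usage, once a job has been submitted and its ARN is known:
# final_status = wait_for_evaluation_job(bedrock_client, eval_job_arn)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object before pointing it at a real account.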

In [None]:
labeled_df = read_jsonl_to_dataframe('labeled_data.jsonl')

evaluation_models = [
    {
        "model_name": "nova_premier",
        "batch_inference_results_file": "./batch_inference_results/amazon-nova-premier-data.jsonl",
    },
    {
        "model_name": "nova_lite_distilled",
        "batch_inference_results_file": "./batch_inference_results/distilled_results.jsonl",
    }
]

# Configure knowledge base and model settings
evaluator_model = 'us.anthropic.claude-3-5-haiku-20241022-v1:0' # "<YOUR_EVALUATOR_MODEL>"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
BUCKET_NAME = '<YOUR_S3_BUCKET_NAME>' # Replace by your bucket name "<YOUR_S3_BUCKET_NAME>"
PREFIX = 'citations_distillation' # "<YOUR_BUCKET_PREFIX>"

output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"

In [None]:
# Create Bedrock client
bedrock_client = boto3.client('bedrock', region_name='us-east-1')

eval_details = []
for model in evaluation_models:
    dataframe = read_jsonl_to_dataframe(model['batch_inference_results_file'])
    eval_dataset_local_file = create_evaluation_dataset(model['model_name'],dataframe, labeled_df)
    eval_dataset_s3 = upload_training_data_to_s3(bucket_name=BUCKET_NAME, local_file_path=eval_dataset_local_file,prefix=PREFIX)
    eval_job_arn = create_rag_evaluation_job(
        bedrock_client,
        model_identifier=model['model_name'], 
        rag_source_id = f"{model['model_name']}_kb",
        input_data = eval_dataset_s3,
        role_arn=role_arn,
        output_path=output_path,
        evaluator_model=evaluator_model,
        metrics=None
    )
    eval_details.append({**model, 
                        'eval_dataset_s3': eval_dataset_s3,
                        'eval_job_arn': eval_job_arn})

## Download Evaluation Results
Here we've provided a helper function to download the evaluation results using only an evaluation job ARN. This will download results to a folder titled `evaluation_results`.
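The download helper in this section locates results by combining the job's output URI, job name, job ID, and dataset name. The sketch below isolates that path logic as two small pure functions so it can be checked without calling AWS. The function names are hypothetical, and the `inference_configs/0/datasets/...` layout is taken from the helper in this notebook rather than from a documented guarantee.

```python
from urllib.parse import urlparse


def build_results_uri(base_s3_uri, job_name, job_arn, dataset_name):
    """Build the S3 URI under which an evaluation job writes its results.

    Assumption: the layout mirrors the one used by this notebook's
    download helper.
    """
    job_id = job_arn.split("/")[-1]
    # Strip any trailing '/' so the joined URI never contains '//'
    base = base_s3_uri.rstrip("/")
    return f"{base}/{job_name}/{job_id}/inference_configs/0/datasets/{dataset_name}"


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder values for illustration only
results_uri = build_results_uri(
    "s3://my-bucket/citations_distillation/",
    "citations-demo-job",
    "arn:aws:bedrock:us-east-1:111122223333:evaluation-job/abc123",
    "demo_dataset",
)
bucket, prefix = split_s3_uri(results_uri)
print(bucket)
print(prefix)
```

Normalizing the trailing slash matters because the output path may be configured with or without one.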

In [None]:
from urllib.parse import urlparse

def download_evaluation_results(bedrock_client, eval_job_arn):
    # Get the evaluation job details
    eval_results_response = bedrock_client.get_evaluation_job(
        jobIdentifier=eval_job_arn
    )
    
    # Check if the job is completed
    job_status = eval_results_response.get('status', '')
    if job_status not in ['Completed']:
        print(f"Evaluation job status is {job_status}, not ready to download results yet.")
        return
    
    # Extract the S3 output URI and the pieces needed to build the results path
    base_s3_uri = eval_results_response.get('outputDataConfig', {}).get('s3Uri')
    if not base_s3_uri:
        print("Could not construct S3 output URI")
        return

    job_name = eval_results_response.get('jobName', '')
    job_id = eval_job_arn.split('/')[-1]
    dataset_configs = eval_results_response.get('evaluationConfig', {}).get('automated', {}).get('datasetMetricConfigs', [])
    dataset_name = dataset_configs[0].get('dataset', {}).get('name', '') if dataset_configs else ''

    # Normalize the base URI so it is joined with exactly one '/'
    s3_output_uri = f"{base_s3_uri.rstrip('/')}/{job_name}/{job_id}/inference_configs/0/datasets/{dataset_name}"
    
    print(f"Looking for .jsonl files in: {s3_output_uri}")
    
    # Parse the S3 URI
    parsed_uri = urlparse(s3_output_uri)
    bucket_name = parsed_uri.netloc
    # Remove leading '/' from the path to get the prefix
    prefix = parsed_uri.path.lstrip('/')
    
    # Create the evaluation_results directory if it doesn't exist
    output_dir = 'evaluation_results'
    os.makedirs(output_dir, exist_ok=True)
    
    # Initialize S3 client
    s3_client = boto3.client('s3')
    
    # List objects in the S3 bucket with the given prefix
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    
    if 'Contents' not in response:
        print(f"No files found in {s3_output_uri}")
        return
    
    # Download only .jsonl files
    jsonl_files_found = False
    
    for obj in response['Contents']:
        # Get the file path relative to the prefix
        key = obj['Key']
        
        # Only download .jsonl files
        if not key.lower().endswith('.jsonl'):
            continue
            
        jsonl_files_found = True
        
        # Name the local file after the dataset; if the prefix holds several
        # .jsonl files, later downloads overwrite earlier ones
        filename = f"eval_{dataset_name}.jsonl"
        local_file_path = os.path.join(output_dir, filename)
        
        print(f"Downloading {key} to {local_file_path}")
        s3_client.download_file(bucket_name, key, local_file_path)
    
    if jsonl_files_found:
        print(f"Downloaded all .jsonl evaluation results to {output_dir}/")
    else:
        print(f"No .jsonl files found in {s3_output_uri}")
    
    return output_dir

# Example usage:
# output_path = download_evaluation_results(bedrock_client, eval['eval_job_arn'])

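If you are returning in a new session, collect the following details for each evaluation job and build a list of dictionaries as shown here. Otherwise, you can reuse the `eval_details` list built during submission, which has the same keys.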

In [None]:
eval_details_temp = [
    {
      "model_name": "nova_lite_distilled",
      "batch_inference_results_file": "./batch_inference_results/distilled_results.jsonl", # local results.jsonl file from batch inferences
      "eval_dataset_s3": f"s3://{BUCKET_NAME}/{PREFIX}/evaluation_data.jsonl",
      "eval_job_arn": "arn:aws:bedrock:us-east-1:<account_id>:evaluation-job/<job_id>"
    }
  ]

In [None]:
eval_results = []
# Use a name other than `eval` to avoid shadowing the Python builtin
for eval_detail in eval_details_temp:
    print(eval_detail['eval_job_arn'])
    eval_results_response = bedrock_client.get_evaluation_job(
        jobIdentifier=eval_detail['eval_job_arn']
    )
    print(eval_results_response)
    # Returns None if the job has not completed yet
    output_directory = download_evaluation_results(bedrock_client=bedrock_client, eval_job_arn=eval_detail['eval_job_arn'])
    eval_results.append(
        {
            **eval_detail,
            "output_directory": output_directory
        }
    )

## Analyze Results
With our evaluations complete, we'll add one more metric using an LLM as a judge: citation coverage.

**Citation coverage** measures how well the response is supported by cited passages. The higher the score, the better the responses are supported by citations on average. Responses are graded on a 5-point Likert scale.

We will use this 5-point scale to measure the model's ability to answer a question given the passages it is citing. Note that if the model correctly states that it cannot answer the question given the passages, we assign a 5, as that is the correct answer.

We've built `eval_jsonl_parser.py` to help with this. It uses Claude 3.5 Haiku as the judge, with a prompt from the [Bedrock Evaluations documentation on evaluator prompts](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-kb-haiku35.html). Feel free to use these to build further metrics into this process.

This process uses cross-region inference and can take quite a while. Be sure to check out the `time.sleep()` rate limiting and adjust it to your quotas. More robust rate limiting would need further development before production use.
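One common step up from a fixed `time.sleep()` is retrying with exponential backoff and jitter. The sketch below is a generic illustration of that technique, not the logic inside the parser module: `call_with_backoff` and its parameters are hypothetical, and in practice `retry_on` would be narrowed to the throttling exception raised by your client.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,)):
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    The delay doubles after each failed attempt (capped at max_delay), and
    a random extra of up to half the delay is added so that concurrent
    callers do not retry in lockstep. The last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))


# Usage with a hypothetical judge call:
# score = call_with_backoff(lambda: judge_one_record(record))
```

Wrapping each judge invocation this way lets the loop slow down only when the service pushes back, instead of pausing after every call.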

We'll iterate through the 500 records for each of our evaluation sets, add in our citation coverage metrics, then aggregate across all of our results for our final analysis.
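To make the aggregation step concrete, here is a small sketch of averaging per-record citation coverage scores by model. It is not the implementation of `aggregate_metrics_by_model`; the function name is hypothetical, the 1 to 5 score range is an assumption based on the 5-point scale described above, and the sample records are invented purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_citation_coverage(records):
    """Average the citation_coverage score for each model_identifier.

    Assumption: scores are integers from 1 to 5 on the Likert scale.
    """
    scores_by_model = defaultdict(list)
    for record in records:
        score = record["citation_coverage"]
        if not 1 <= score <= 5:
            raise ValueError(f"Score out of range: {score}")
        scores_by_model[record["model_identifier"]].append(score)
    return {model: mean(scores) for model, scores in scores_by_model.items()}


# Invented scores for illustration only
illustrative_records = [
    {"model_identifier": "model_a", "citation_coverage": 5},
    {"model_identifier": "model_a", "citation_coverage": 3},
    {"model_identifier": "model_b", "citation_coverage": 2},
]
print(mean_citation_coverage(illustrative_records))
```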

In [None]:
from eval_jsonl_parser import parse_jsonl_to_df, aggregate_metrics_by_model

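for e in eval_results:
    # download_evaluation_results saves each file as eval_<dataset name>.jsonl,
    # and the dataset name is "<model_name>_dataset"
    results_file = f"{e['output_directory']}/eval_{e['model_name']}_dataset.jsonl"
    df = parse_jsonl_to_df(results_file)
    pkl_name = f"{e['output_directory']}/{e['model_name']}_dataframe_pickle.pkl"
    print(f"saving backup: {pkl_name}")
    df.to_pickle(pkl_name)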

In [1]:
from eval_jsonl_parser import parse_jsonl_to_df, aggregate_metrics_by_model

# eval_nova_distilled_results_df = parse_jsonl_to_df('evaluation_results/eval_nova_lite_distilled_dataset.jsonl')
# eval_nova_premier_results_df = parse_jsonl_to_df('evaluation_results/eval_nova_premier_dataset.jsonl')
# eval_nova_lite_results_df = parse_jsonl_to_df('evaluation_results/eval_nova_lite_dataset.jsonl')
# eval_nova_micro_results_df = parse_jsonl_to_df('evaluation_results/eval_nova_micro_dataset.jsonl')

Optional back up to .pkl

In [2]:
# save to pickle
# eval_nova_distilled_results_df.to_pickle('evaluation_results/eval_nova_lite_distilled_dataframe_pickle.pkl')
# eval_nova_premier_results_df.to_pickle('evaluation_results/eval_nova_premier_dataframe_pickle.pkl')
# eval_nova_lite_results_df.to_pickle('evaluation_results/eval_nova_lite_dataframe_pickle.pkl')
# eval_nova_micro_results_df.to_pickle('evaluation_results/eval_nova_micro_dataframe_pickle.pkl')

# Load DataFrame from pickle file
# df = pd.read_pickle('backup_data.pkl')

Optional load from .pkl

In [None]:
from pathlib import Path
import pandas as pd

aggregated_dfs = []

# Convert string path to Path object if needed
directory = Path('evaluation_results')

# Iterate through all files in the directory
for file_path in directory.iterdir():
    # Check if the file has .pkl extension
    if file_path.suffix.lower() == '.pkl':
        try:
            # Read the pickle file into a dataframe
            df = pd.read_pickle(file_path)
            # Aggregate the dataframe
            aggregated_df = aggregate_metrics_by_model(df)
            aggregated_dfs.append(aggregated_df)
        except Exception as e:
            print(f"Error processing {file_path}: {str(e)}")

# Combine all aggregated dataframes into a single dataframe
final_df = pd.concat(aggregated_dfs, ignore_index=True)


In [4]:
final_df

| | model_identifier | metric_correctness | metric_completeness | metric_helpfulness | metric_logicalcoherence | metric_faithfulness | all_citations_valid | citation_coverage |
|---|---|---|---|---|---|---|---|---|
| 0 | nova_micro | 0.838 | 0.835 | 0.7943 | 0.9217 | 0.8785 | 0.02 | 2.466 |
| 1 | nova_premier | 0.863 | 0.8415 | 0.7846 | 0.9309 | 0.9005 | 0.008 | 2.34 |
| 2 | nova_lite_distilled | 0.932 | 0.6005 | 0.5336 | 0.9709 | 0.5641 | 0.542 | 2.728 |
| 3 | nova_lite | 0.824 | 0.803 | 0.7533 | 0.9006 | 0.8465 | 0.106 | 2.302 |
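A quick way to read these results is to compute the per-metric difference between the distilled model and the original. The sketch below does this with the values reported above, assuming `nova_premier` is the original (teacher) model; the function name `metric_deltas` is hypothetical.

```python
# Values copied from the results reported above
nova_premier = {
    "metric_correctness": 0.863,
    "metric_completeness": 0.8415,
    "metric_helpfulness": 0.7846,
    "metric_logicalcoherence": 0.9309,
    "metric_faithfulness": 0.9005,
    "all_citations_valid": 0.008,
    "citation_coverage": 2.34,
}
nova_lite_distilled = {
    "metric_correctness": 0.932,
    "metric_completeness": 0.6005,
    "metric_helpfulness": 0.5336,
    "metric_logicalcoherence": 0.9709,
    "metric_faithfulness": 0.5641,
    "all_citations_valid": 0.542,
    "citation_coverage": 2.728,
}


def metric_deltas(student, teacher):
    """Return student minus teacher for every metric, rounded to 4 places."""
    return {name: round(student[name] - teacher[name], 4) for name in teacher}


for name, delta in metric_deltas(nova_lite_distilled, nova_premier).items():
    # Positive values mean the distilled model scored higher
    print(f"{name:26s} {delta:+.4f}")
```

Note that `citation_coverage` is on the 5-point scale while the other columns range from 0 to 1, so its delta is not directly comparable in magnitude to the others.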


## Conclusion

In this notebook, we've demonstrated how to:

1. **Structure Evaluation Data**: Format RAG system outputs for comprehensive evaluation using the BYOI specification
2. **Configure Evaluation Jobs**: Set up secure IAM roles and configure evaluation parameters
3. **Execute Evaluations**: Run parallel evaluations of multiple models using Amazon Bedrock
4. **Analyze Results**: Interpret evaluation metrics to assess model performance

This completes our four-notebook series on model distillation for citation-aware RAG systems. Through this series, we've covered:
- Data preparation and formatting
- Model distillation techniques
- Batch inference implementation
- Comprehensive model evaluation

For more information, explore:
- [Amazon Bedrock Evaluation Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation.html)
- [RAG Best Practices Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/rag-best-practices.html)
- [Advanced Model Evaluation Techniques](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-metrics.html)