# Model Evaluation and Performance Analysis

## Overview

This notebook is the final stage of the model distillation workflow: it evaluates the distilled model against the original model. We use Amazon Bedrock's RAG Evaluation feature with Bring Your Own Inference (BYOI) support to assess response quality and citation behavior.

### Learning Objectives

By the end of this notebook, you will understand:
- How to structure and format evaluation datasets for BYOI evaluation
- Advanced evaluation metrics for assessing RAG system performance
- Techniques for analyzing citation quality and knowledge transfer effectiveness

## Evaluation Metrics Deep Dive

The evaluation uses the following metrics, which are designed for RAG systems:

### Citation Quality Metrics
- **Citation Coverage**: Evaluates how comprehensively the model utilizes available context. This metric helps identify if the model is under-utilizing or over-relying on certain passages.

### Response Quality Metrics
- **Correctness**: Assesses factual accuracy by comparing generated content against ground truth responses and source documents.
- **Completeness**: Measures response thoroughness relative to the question's requirements and available context.
- **Faithfulness**: Evaluates how well responses align with provided context, detecting potential hallucinations or unsupported claims.
- **Helpfulness**: Analyzes practical utility by considering factors like clarity, relevance, and actionability.
- **Logical Coherence**: Examines response consistency and reasoning quality, particularly important for complex queries.

> **Advanced Note**: These metrics are calculated using specialized evaluator models that perform semantic analysis rather than simple string matching, enabling nuanced assessment of model performance.
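
When an evaluation job is configured programmatically, the metrics above are usually referred to by built-in identifiers. The mapping below is a sketch based on the `Builtin.*` naming convention; treat the exact identifier strings as assumptions and confirm them against the current Bedrock documentation. The `metric_names_for` helper is hypothetical and exists only for illustration.

```python
# Display name -> assumed Bedrock built-in metric identifier.
# NOTE: the identifier strings are assumptions; verify them in the Bedrock docs.
RAG_METRICS = {
    "Citation Coverage": "Builtin.CitationCoverage",
    "Correctness": "Builtin.Correctness",
    "Completeness": "Builtin.Completeness",
    "Faithfulness": "Builtin.Faithfulness",
    "Helpfulness": "Builtin.Helpfulness",
    "Logical Coherence": "Builtin.LogicalCoherence",
}


def metric_names_for(display_names):
    """Translate display names into identifiers, failing loudly on typos."""
    unknown = [name for name in display_names if name not in RAG_METRICS]
    if unknown:
        raise KeyError(f"Unknown metric(s): {unknown}")
    return [RAG_METRICS[name] for name in display_names]


print(metric_names_for(["Correctness", "Faithfulness"]))
```

Keeping the mapping in one place makes it easy to run the same metric set against both the original and the distilled model.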

## Prerequisites

Ensure you have completed the previous notebooks in this sequence:
1. `01_prepare_data.ipynb`: Data preparation and formatting
2. `02_distill.ipynb`: Model distillation process
3. `03_batch_inference.ipynb`: Batch inference implementation

Additional requirements:
- An active AWS account with appropriate permissions
- Amazon Bedrock access enabled in your preferred region ([Enable Bedrock models](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html))
- An S3 bucket for storing evaluation data and results
- An IAM role with necessary permissions for S3 and Bedrock ([IAM setup guide](https://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html))
- RAG system outputs formatted according to the BYOI specification
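
As a rough illustration of the last requirement, the sketch below builds one BYOI record for a retrieve-and-generate evaluation and serializes records as JSON Lines. The field names (`conversationTurns`, `referenceResponses`, `retrievedPassages`, `knowledgeBaseIdentifier`) are written from the BYOI dataset format as we understand it and should be checked against the Bedrock documentation; the optional `citations` field is omitted for brevity. `build_byoi_record` and `to_jsonl` are hypothetical helper names.

```python
import json


def build_byoi_record(question, reference_answer, model_answer,
                      passages, source_id):
    """Build one retrieve-and-generate BYOI record (assumed schema)."""
    return {
        "conversationTurns": [
            {
                "prompt": {"content": [{"text": question}]},
                "referenceResponses": [
                    {"content": [{"text": reference_answer}]}
                ],
                "output": {
                    "text": model_answer,
                    # Assumed: a label identifying the RAG source that
                    # produced this output.
                    "knowledgeBaseIdentifier": source_id,
                    "retrievedPassages": {
                        "retrievalResults": [
                            {"content": {"text": passage}}
                            for passage in passages
                        ]
                    },
                },
            }
        ]
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Placeholder strings only; substitute real questions, answers, and passages.
example = build_byoi_record(
    question="<question text>",
    reference_answer="<ground truth answer>",
    model_answer="<model response>",
    passages=["<retrieved passage 1>", "<retrieved passage 2>"],
    source_id="<rag-source-label>",
)
print(to_jsonl([example]))
```

One such file per model (original and distilled) keeps the two sets of outputs separate, so each can be evaluated as its own RAG source.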

> **Important**: The evaluation process requires access to Amazon Bedrock evaluator models. Ensure these are enabled in your account and you have sufficient [quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html) for your evaluation workload.

## Implementation

Let's set up our evaluation pipeline by first importing required dependencies and configuring our environment:

In [None]:
# Upgrade boto3 and matplotlib to their latest versions
!pip install --upgrade boto3 matplotlib

In [None]:
# Import required libraries and set up the environment
import boto3
import json
import os
import sys
from datetime import datetime
import re

print(boto3.__version__)

# Add the directory two levels above the working directory to the import path
# so that the shared `utils` module can be imported.
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
skip_dir = os.path.dirname(parent_dir)
sys.path.append(skip_dir)
from utils import read_jsonl_to_dataframe, upload_training_data_to_s3

## Aggregate Evaluation Results
We'll use the helper script `eval_results.py` to process all of the result files generated so far and compare each model's output to the ground truth for every question.

The script writes `evaluation_results.csv`, which we'll load and plot in the next section.

In [None]:
!python eval_results.py
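
`eval_results.py` is not reproduced here. As a rough sketch of the kind of aggregation such a script might perform, assume each per-question result has already been reduced to a boolean `is_correct` flag (a hypothetical field). The function below groups rows by model configuration and computes the fraction correct. The grouping keys mirror the CSV columns used in the analysis; everything else is illustrative.

```python
from collections import defaultdict


def aggregate_accuracy(rows):
    """Group per-question rows by model configuration and average correctness.

    Each row is a dict with 'prompting_approach', 'model_type', 'model_name',
    and a hypothetical boolean 'is_correct' flag.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for row in rows:
        key = (row["prompting_approach"], row["model_type"], row["model_name"])
        totals[key][0] += int(row["is_correct"])
        totals[key][1] += 1
    return [
        {
            "prompting_approach": approach,
            "model_type": model_type,
            "model_name": model_name,
            "overall_accuracy": correct / total,
        }
        for (approach, model_type, model_name), (correct, total) in totals.items()
    ]
```

The returned list of dicts can be written out with `csv.DictWriter` or passed to `pandas.DataFrame` to produce a table with the same shape as `evaluation_results.csv`.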

## Analyze Results


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load CSV into DataFrame
function_calling_results = pd.read_csv('evaluation_results.csv')
function_calling_results.head()

In [None]:
# Create a bar chart of overall accuracy for each model configuration.
# Only the numeric 'overall_accuracy' column is plotted; the other columns label the bars.

# Sort by overall_accuracy in descending order
df_sorted = function_calling_results.sort_values(by='overall_accuracy', ascending=False)

ax = df_sorted.plot(kind='bar', y='overall_accuracy', figsize=(10, 6))
plt.title('Function Calling Accuracy by Model (BFCL Dataset)')
plt.xlabel('Model configuration')
plt.ylabel('Overall accuracy')

# Create custom x-axis labels combining prompting_approach, model_type, and model_name
labels = [f"{row['prompting_approach']}\n{row['model_type']}\n{row['model_name']}"
          for _, row in df_sorted.iterrows()]
plt.xticks(range(len(df_sorted)), labels, rotation=45, ha='right')

# Add value labels on bars
for container in ax.containers:
    ax.bar_label(container)

plt.tight_layout()
plt.show()
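
Beyond the chart, the key question for this notebook is how close the distilled model comes to the original. A minimal sketch, assuming the `model_type` column distinguishes the two with labels such as `'teacher'` and `'student'` (hypothetical values; use whatever labels appear in your CSV):

```python
def accuracy_gap(rows, baseline_type="teacher", candidate_type="student"):
    """Return the candidate's best accuracy minus the baseline's best accuracy.

    `rows` is a list of dicts with 'model_type' and 'overall_accuracy' keys.
    The default type labels are hypothetical placeholders.
    """
    def best(model_type):
        scores = [row["overall_accuracy"] for row in rows
                  if row["model_type"] == model_type]
        if not scores:
            raise ValueError(f"No rows with model_type={model_type!r}")
        return max(scores)

    return best(candidate_type) - best(baseline_type)
```

With the DataFrame loaded earlier, this could be called as `accuracy_gap(function_calling_results.to_dict("records"))`. A negative value means the distilled model trails the original; a value near zero suggests the distillation preserved most of the original model's accuracy on this dataset.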

## Conclusion

In this notebook, we've demonstrated how to:

1. **Structure Evaluation Data**: Format RAG system outputs for comprehensive evaluation using the BYOI specification
2. **Configure Evaluation Jobs**: Set up secure IAM roles and configure evaluation parameters
3. **Execute Evaluations**: Run parallel evaluations of multiple models using Amazon Bedrock
4. **Analyze Results**: Interpret evaluation metrics to assess model performance

This completes our four-notebook series on model distillation for citation-aware RAG systems. Through this series, we've covered:
- Data preparation and formatting
- Model distillation techniques
- Batch inference implementation
- Comprehensive model evaluation

For more information, explore:
- [Amazon Bedrock Evaluation Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation.html)
- [RAG Best Practices Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/rag-best-practices.html)
- [Advanced Model Evaluation Techniques](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-metrics.html)