
## Comparing Fine-tuned vs Base Llama 3.2 1B Models Using Amazon Bedrock Evaluation
### Introduction
This notebook demonstrates how to conduct a systematic evaluation between two language models:

- A base Llama 3.2 1B model (meta-llama3-2-1b-base)
- The same Llama 3.2 1B model fine-tuned on biomedical data (llama3-2-1b-fine-tuned-pubmed)

We will use Amazon Bedrock for model inference and evaluation, comparing the two models on domain-specific questions about biomedical literature. The workflow covers creating evaluation datasets, calculating performance metrics, and preparing configuration files for Amazon Bedrock evaluation jobs.

### Objectives
- Set up the necessary infrastructure for model evaluation in AWS
- Create evaluation datasets in the correct format for Bedrock evaluations
- Calculate performance metrics to compare the base and fine-tuned models
- Generate evaluation job configuration files to submit to Amazon Bedrock
- Analyze the results to understand the impact of fine-tuning

### Setup and Dependencies
First, we'll install the required packages for model evaluation.

In [None]:
%pip install --quiet --upgrade sagemaker jmespath datasets transformers jinja2 ipywidgets boto3

Now, let's import the necessary libraries and set up our SageMaker and AWS environment.

In [1]:
from IPython.display import display, Markdown, Latex
import sagemaker
import boto3
import botocore
sess = sagemaker.Session()
import pprint

# Import custom functions
from utils import (
    download_artifacts, 
    remove_field_from_json, 
    upload_artifacts, 
    cleanup_local_files, 
    wait_for_model_availability, 
    test_image_processing
)
from iam_role_helper import create_or_update_role


# SageMaker session bucket, used for uploading data, models, and logs.
# SageMaker creates the default bucket automatically if it does not exist.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # Fall back to the session's default bucket when no name is given
    sagemaker_session_bucket = sess.default_bucket()
# Always define `bucket`, whether it was set explicitly or defaulted
bucket = sagemaker_session_bucket

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    #change the name of the role if you are running locally
    role = iam.get_role(RoleName='Update with your sagemaker role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=bucket)
region=sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/rivasge/.config/sagemaker/config.yaml


## Configure Amazon Bedrock Client
Set up the Amazon Bedrock clients with retry, timeout, and connection-pool settings chosen for reliable API calls.

In [None]:
import boto3
import os
from botocore.config import Config
config = Config(
    retries={
        'total_max_attempts': 100,  # Total attempts, including the initial call; takes precedence over max_attempts
        'max_attempts': 3,         # Retry attempts; ignored by botocore when total_max_attempts is also set
        'mode': 'adaptive',        # Adaptive retry mode with client-side rate limiting
    },
    connect_timeout=5,    # Reduce connection timeout from default 60s
    read_timeout=30,      # Reduce read timeout from default 60s
    max_pool_connections=50,  # Increase from default 10
    tcp_keepalive=True    # Enable TCP keepalive
)

# Bedrock clients for model inference
bedrock = boto3.client(
    service_name='bedrock',
    region_name='us-west-2',
    config=config
)

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
    config=config
)
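
The retry settings in this cell determine how many times botocore re-sends a throttled or failed request. As a rough illustration of why a capped exponential backoff keeps the total wait bounded, the sketch below computes a delay schedule. It is a generic illustration, not botocore's internal implementation: `backoff_delays`, `base`, and `cap` are hypothetical names, and real retry modes also add random jitter.

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 20.0) -> list:
    """Return capped exponential delays (seconds) for each retry, without jitter."""
    # Delay doubles with each retry until it reaches the cap
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

delays = backoff_delays(retries=6)
print(delays)       # delay before each retry
print(sum(delays))  # worst-case total wait without jitter
```

With a cap in place, raising the attempt count increases the total wait roughly linearly rather than exponentially.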




## Helper Functions for Model Inference
Define helper functions to create prompt payloads and handle model inference.

In [None]:
import json

# Defaults applied when a parameter is not supplied by the caller
DEFAULT_PARAMETERS = {
    "max_gen_len": 512,
    "temperature": 0.0,
    "top_p": 0.9
}

def create_payload(
    prompt: str,
    system_message: str = None,
    parameters: dict = None
) -> str:
    """
    Creates a payload for Llama model invocation using the instruct format.
    
    Args:
        prompt (str): The main prompt/question for the model
        system_message (str, optional): System message to set context/behavior
        parameters (dict, optional): Model parameters (max_gen_len, temperature, top_p);
            missing keys fall back to DEFAULT_PARAMETERS
    
    Returns:
        str: JSON-encoded payload for model invocation
    """
    if not prompt:
        raise ValueError("Please provide a non-empty prompt.")
    
    # Merge caller parameters over the defaults (avoids a mutable default argument)
    params = {**DEFAULT_PARAMETERS, **(parameters or {})}
    
    # Construct the prompt format using Llama style
    if system_message:
        prompt_data = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    else:
        prompt_data = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        
    # Merge the prompt with allowed parameters
    payload = {
        "prompt": prompt_data,
        "max_gen_len": params["max_gen_len"],
        "temperature": params["temperature"],
        "top_p": params["top_p"]
    }
    
    return json.dumps(payload)
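
To make the prompt layout concrete, the following self-contained sketch rebuilds the same Llama 3 instruct template piece by piece and inspects the resulting JSON body. The helper name `build_llama3_prompt` is hypothetical and exists only for this illustration; the special tokens are the ones used in `create_payload`.

```python
import json

def build_llama3_prompt(user_prompt: str, system_message: str = None) -> str:
    """Assemble a Llama 3 instruct-style prompt (illustrative helper, not part of the notebook)."""
    parts = ["<|begin_of_text|>"]
    if system_message:
        parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>")
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_prompt}<|eot_id|>")
    # Leave the assistant header open so the model generates the reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

body = json.dumps({
    "prompt": build_llama3_prompt("What is PubMed?", "You are a helpful assistant."),
    "max_gen_len": 512,
    "temperature": 0.0,
    "top_p": 0.9,
})
print(json.loads(body)["prompt"])
```

The prompt ends with an open assistant header, which signals the model to produce the assistant turn.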

## Metrics and Evaluation Framework
Now we'll set up classes and functions to track model performance and evaluation metrics.

In [None]:
import jsonlines
import json
import concurrent.futures
import time
from botocore.exceptions import ClientError
from tqdm import tqdm
import logging
from collections import Counter, defaultdict
import sys
import random
import boto3
from botocore.config import Config
import threading
import os

from datetime import datetime

# Create logs directory if it doesn't exist
os.makedirs('logs', exist_ok=True)

# Create a more robust logging setup
def setup_logging(timestamp):
    """Set up logging with both file and console handlers"""
    # Create logs directory in current working directory
    log_dir = os.path.join(os.getcwd(), 'logs')
    os.makedirs(log_dir, exist_ok=True)
    
    # Create log filename
    log_file = os.path.join(log_dir, f'model_comparison_{timestamp}.log')
    
    # Remove any existing handlers
    logger = logging.getLogger()
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    
    # Configure logging
    logger.setLevel(logging.DEBUG)
    
    # File handler
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(file_formatter)
    logger.addHandler(file_handler)
    
    # Console handler
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(logging.INFO)
    console_formatter = logging.Formatter('%(levelname)s: %(message)s')
    console_handler.setFormatter(console_formatter)
    logger.addHandler(console_handler)
    
    return log_file


class ModelMetrics:
    def __init__(self):
        self.request_times = []
        self.input_tokens = []
        self.output_tokens = []
        self.start_times = {}
        self.active_hours = set()
        self.request_count = 0
        self.successful_requests = 0
        self.failed_requests = 0
        self.total_processing_start = time.time()  # Add total processing time tracking

    def start_request(self):
        """Start timing a request"""
        self.start_times[threading.get_ident()] = time.time()
        self.active_hours.add(time.localtime().tm_hour)

    def record_request(self, input_text, output_text, success=True):
        """Record metrics for a completed request"""
        thread_id = threading.get_ident()
        if thread_id in self.start_times:
            duration = time.time() - self.start_times[thread_id]
            del self.start_times[thread_id]
        else:
            duration = 0

        # Simple token estimation (approximate)
        input_tokens = len(input_text.split())
        output_tokens = len(output_text.split()) if output_text else 0
        
        self.request_times.append(duration)
        self.input_tokens.append(input_tokens)
        self.output_tokens.append(output_tokens)
        self.request_count += 1
        
        if success:
            self.successful_requests += 1
        else:
            self.failed_requests += 1
            logger = logging.getLogger()
            logger.error(f"Failed request - Input: {input_text[:100]}... Output length: {len(output_text)}")

    def calculate_metrics(self):
        """Calculate all metrics"""
        if not self.request_times:
            return {}
            
        total_processing_time = time.time() - self.total_processing_start
        total_time_mins = total_processing_time / 60
        total_input_tokens = sum(self.input_tokens)
        total_output_tokens = sum(self.output_tokens)

        metrics = {
            'total_processing_time_seconds': total_processing_time,
            'total_processing_time_minutes': total_time_mins,
            'peak_input_tpm': max(self.input_tokens) * (60 / min([t for t in self.request_times if t > 0] or [1])),
            'peak_output_tpm': max(self.output_tokens) * (60 / min([t for t in self.request_times if t > 0] or [1])),
            'peak_load_hours': len(self.active_hours),
            'avg_input_tpm': total_input_tokens / total_time_mins if total_time_mins > 0 else 0,
            'avg_output_tpm': total_output_tokens / total_time_mins if total_time_mins > 0 else 0,
            'avg_load_hours': len(self.active_hours) / 24,
            'avg_rpm': self.request_count / total_time_mins if total_time_mins > 0 else 0,
            'avg_input_tokens_per_request': total_input_tokens / self.request_count if self.request_count > 0 else 0,
            'avg_output_tokens_per_request': total_output_tokens / self.request_count if self.request_count > 0 else 0,
            'avg_latency': sum(self.request_times) / len(self.request_times) if self.request_times else 0,
            'max_latency': max(self.request_times) if self.request_times else 0,
            'min_latency': min(t for t in self.request_times if t > 0) if any(t > 0 for t in self.request_times) else 0,
            'total_requests': self.request_count,
            'successful_requests': self.successful_requests,
            'failed_requests': self.failed_requests,
            'success_rate': (self.successful_requests / self.request_count * 100) if self.request_count > 0 else 0
        }
        return metrics


# Stream handler that flushes after every record so output appears immediately
class ImmediateLogger(logging.StreamHandler):
    def emit(self, record):
        super().emit(record)
        self.flush()

# Set up detailed logging with one file handler and one console handler.
# (Attaching an extra console handler to the module logger would print every message twice.)
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f'logs/model_comparison_{time.strftime("%Y%m%d_%H%M%S")}.log'),
        ImmediateLogger(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

# Bedrock Configuration
config = Config(
    retries={
        'total_max_attempts': 100,
        'max_attempts': 10,
        'mode': 'adaptive',
    },
    connect_timeout=5,
    read_timeout=30,
    max_pool_connections=50,
    tcp_keepalive=True
)

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2',
    config=config
)

class ProgressStats:
    def __init__(self):
        self.total = 0
        self.errors = Counter()
        self.success = 0
        self.failed_requests = defaultdict(list)
        self.skipped = 0

def process_line(line, model_id, model_name, bedrock_runtime, metrics):
    logger = logging.getLogger()
    
    if "question" not in line or not line["question"].strip():
        logger.warning(f"Skipped empty or invalid question for model {model_name}")
        return None, "skipped", None

    question = line["question"]
    reference = line.get("answers", "")
    
    try:
        metrics.start_request()
        
        payload = create_payload(
            prompt=question,
            system_message="You are an AI assistant helping to answer questions about biomedical research literature.",
            parameters={
                "max_gen_len": 512,
                "temperature": 0.0,
                "top_p": 0.9
            }
        )
        
        try:
            response = bedrock_runtime.invoke_model(
                body=payload,
                modelId=model_id,
                accept="application/json",
                contentType="application/json"
            )
            
            response_body = json.loads(response.get("body").read())
            logger.debug(f"Model {model_name} - Raw response: {response_body}")
            
            # Extract model output
            if 'generation' in response_body:
                model_output = response_body['generation'].strip()
                if model_output:
                    metrics.record_request(question, model_output, success=True)
                    
                    # Format for Bedrock evaluation with full model identifier
                    output_format = {
                        "prompt": question,
                        "referenceResponse": reference if reference else None,
                        "category": "Biomedical Literature",
                        "modelResponses": [
                            {
                                "response": model_output,
                                "modelIdentifier": model_name  # This will now be the full model name from config
                            }
                        ]
                    }
                    return output_format, "success", None
            
            logger.error(f"Model {model_name} - Invalid or empty response format")
            metrics.record_request(question, "", success=False)
            return None, "empty_response", {"question": question, "original_line": line}
                
        except ClientError as ce:
            error_code = ce.response['Error']['Code']
            error_message = ce.response['Error']['Message']
            logger.error(f"Model {model_name} - ClientError: {error_code} - {error_message}")
            metrics.record_request(question, "", success=False)
            return None, "client_error", {"question": question, "original_line": line}
            
    except Exception as e:
        logger.error(f"Model {model_name} - Unexpected error: {str(e)}")
        logger.error("Full error details:", exc_info=True)
        metrics.record_request(question, "", success=False)
        return None, "unexpected_error", {"question": question, "original_line": line}
    


def process_file(input_file, output_file, model_id, model_name, bedrock_runtime, max_workers=5):
    stats = ProgressStats()
    metrics = ModelMetrics()
    processed_results = []
    
    with jsonlines.open(input_file) as input_fh, jsonlines.open(output_file, mode='w') as output_fh:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = []
            for line in input_fh:
                if "question" not in line or not line["question"].strip():
                    stats.skipped += 1
                    continue
                future = executor.submit(process_line, line, model_id, model_name, bedrock_runtime, metrics)
                futures.append(future)
            
            with tqdm(total=len(futures), desc="Processing", file=sys.stdout) as pbar:
                for future in concurrent.futures.as_completed(futures):
                    try:
                        result, status, failed_data = future.result()
                        stats.total += 1
                        
                        if status == "success" and result:
                            stats.success += 1
                            output_fh.write(result)  # Write the properly formatted result
                            processed_results.append(result)
                        elif status == "skipped":
                            stats.skipped += 1
                        else:
                            stats.errors[status] += 1
                            if failed_data:
                                logger.error(f"Failed request: {failed_data}")
                        
                        pbar.set_description(
                            f"Processed: {stats.total} | "
                            f"Success: {stats.success} | "
                            f"Failed: {sum(stats.errors.values())} | "
                            f"Skipped: {stats.skipped}"
                        )
                        pbar.refresh()
                        
                    except Exception as e:
                        stats.errors["unexpected_error"] += 1
                        logger.error(f"Unexpected error in future processing: {str(e)}")
                    
                    pbar.update(1)
                    sys.stdout.flush()

    return {
        'metrics': metrics.calculate_metrics(),
        'processed_results': processed_results,
        'stats': {
            'total': stats.total,
            'success': stats.success,
            'skipped': stats.skipped,
            'errors': dict(stats.errors)
        }
    }

# Modified analyze_errors function
def analyze_errors(log_file):
    """Analyze errors from the log file and print a summary."""
    if not os.path.exists(log_file):
        print(f"Warning: Log file not found at {log_file}")
        return defaultdict(int), defaultdict(list)
        
    error_counts = defaultdict(int)
    error_examples = defaultdict(list)
    
    try:
        with open(log_file, 'r') as f:
            for line in f:
                if 'ERROR' in line:
                    for error_type in ['ClientError', 'empty_response', 'unexpected_error']:
                        if error_type in line:
                            error_counts[error_type] += 1
                            if len(error_examples[error_type]) < 3:  # Keep up to 3 examples
                                error_examples[error_type].append(line.strip())
    except Exception as e:
        print(f"Error reading log file: {str(e)}")
        return defaultdict(int), defaultdict(list)
    
    print("\nError Analysis:")
    print("=" * 50)
    for error_type, count in error_counts.items():
        print(f"\n{error_type}: {count} occurrences")
        print("Example errors:")
        for example in error_examples[error_type]:
            print(f"  - {example}")
            
    return error_counts, error_examples


def retry_failed_requests(retry_file, output_file, model_id, model_name, bedrock_runtime, max_workers=2):
    """Reprocess a file of failed requests with reduced concurrency after an initial delay."""
    print(f"\nProcessing retry file: {retry_file}")
    time.sleep(5)  # Initial delay before starting retries
    # process_file expects model_name before the client; results go to output_file
    return process_file(retry_file, output_file, model_id, model_name,
                        bedrock_runtime, max_workers=max_workers)
    

# Entry point: runs every configured model over the same input file
def run_model_comparison(input_file, model_configs, sample_size=None, timestamp=None):
    """
    Run comparison across multiple models with custom naming
    
    Args:
        input_file (str): Input file path
        model_configs (dict): Dictionary containing model configurations
        sample_size (int, optional): Number of samples to process (None for all)
        timestamp (str, optional): Custom timestamp for file naming
    """
    if timestamp is None:
        timestamp = time.strftime('%Y%m%d_%H%M%S')
    
    # Load and sample data if needed
    with jsonlines.open(input_file) as reader:
        all_data = list(reader)
    
    if sample_size and sample_size < len(all_data):
        sampled_data = random.sample(all_data, sample_size)
        temp_input_file = f"temp_sample_{timestamp}.jsonl"
        with jsonlines.open(temp_input_file, mode='w') as writer:
            writer.write_all(sampled_data)
        input_file_to_use = temp_input_file
        print(f"\nUsing {sample_size} samples from the dataset")
    else:
        input_file_to_use = input_file
    
    results = {}
    try:
        for model_key, config in model_configs.items():
            print(f"\nProcessing with model: {config['model_name']}")  # Use the full model name
            output_file = f"{config['output_prefix']}_{timestamp}.jsonl"
            result = process_file(
                input_file_to_use, 
                output_file, 
                config['model_id'],
                config['model_name'],  # Pass the full model name
                bedrock_runtime
            )
            results[model_key] = {
                'model_id': config['model_id'],
                'model_name': config['model_name'],
                'output_file': output_file,
                **result
            }
    finally:
        # Clean up the temporary sample file if one was created
        if input_file_to_use != input_file:
            try:
                os.remove(input_file_to_use)
            except OSError:
                pass
                
    return results
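
The throughput figures reported by `ModelMetrics` are ratios over wall-clock time, with token counts approximated by whitespace splitting (which usually undercounts what a real tokenizer would produce). A small worked sketch on made-up inputs follows; the function `summarize_throughput` and the sample values are assumptions for illustration only.

```python
def summarize_throughput(prompts, outputs, elapsed_seconds):
    """Approximate tokens by whitespace-separated words and derive per-minute rates."""
    minutes = elapsed_seconds / 60
    input_tokens = sum(len(p.split()) for p in prompts)
    output_tokens = sum(len(o.split()) for o in outputs)
    successes = sum(1 for o in outputs if o)  # an empty output counts as a failure
    return {
        "avg_input_tpm": input_tokens / minutes if minutes > 0 else 0,
        "avg_output_tpm": output_tokens / minutes if minutes > 0 else 0,
        "avg_rpm": len(prompts) / minutes if minutes > 0 else 0,
        "success_rate": successes / len(prompts) * 100 if prompts else 0,
    }

summary = summarize_throughput(
    prompts=["What causes anemia?", "Define apoptosis."],
    outputs=["Iron deficiency is a common cause.", ""],
    elapsed_seconds=30,
)
print(summary)
```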


## Model Configuration
Define the model configurations for the base and fine-tuned models.

In [None]:
# Define model configurations
MODEL_CONFIGS = {
    'student': {
        'model_id': 'arn:aws:bedrock:us-west-2:786045444066:imported-model/b407sofa52h6',  # Update with the ARN of your Custom Model Import (CMI) model; this could be passed in from the other notebook
        'output_prefix': 'student_model',
        'model_name': 'llama3-2-1b-fine-tuned-pubmed'  # This will be used as modelIdentifier
    },
    'base': {
        'model_id': 'us.meta.llama3-2-1b-instruct-v1:0',
        'output_prefix': 'base_model',
        'model_name': 'meta-llama3-2-1b-base'  # This will be used as modelIdentifier
    }
}


# Generate the timestamp once and reuse it so file names stay consistent
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
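
Both the comparison run and the later S3 upload rebuild each model's output file name from `output_prefix` and the timestamp, which is why a single shared timestamp matters. A minimal sketch of that naming rule, where the helper `expected_output_files` and the sample config are hypothetical:

```python
def expected_output_files(model_configs: dict, timestamp: str) -> dict:
    """Map each config key to the JSONL file name derived from its output_prefix."""
    return {
        key: f"{cfg['output_prefix']}_{timestamp}.jsonl"
        for key, cfg in model_configs.items()
    }

sample_configs = {
    "student": {"output_prefix": "student_model"},
    "base": {"output_prefix": "base_model"},
}
print(expected_output_files(sample_configs, "20250101_120000"))
```

If the upload step were called with a different timestamp, it would look for file names that were never written.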


## Run Model Evaluation
Execute the evaluation process for both models on the test dataset.

In [None]:

# Setup logging
log_file = setup_logging(timestamp)
logger = logging.getLogger()

input_file = "dataset.jsonl"
sample_size = 100  # Start with a small number to verify everything works

logger.info(f"Starting model comparison with sample size: {sample_size}")
logger.info(f"Log file: {log_file}")

try:
    # Run comparison with sample size
    comparison_results = run_model_comparison(
        input_file=input_file, 
        model_configs=MODEL_CONFIGS, 
        sample_size=sample_size,
        timestamp=timestamp
    )
    print(comparison_results)

    # Print results
    logger.info("\nComparison Results:")
    logger.info("=" * 50)
    for model_name, result in comparison_results.items():
        logger.info(f"\nModel: {model_name}")
        logger.info(f"Output file: {result['output_file']}")
        logger.info("\nMetrics:")
        logger.info(json.dumps(result['metrics'], indent=2))

    # Save comparison results
    comparison_file = f"model_comparison_{timestamp}.json"
    with open(comparison_file, 'w') as f:
        json.dump(comparison_results, f, indent=2)
    logger.info(f"\nDetailed comparison saved to: {comparison_file}")
    
    # Analyze errors
    logger.info("\nAnalyzing errors from log file...")
    error_counts, error_examples = analyze_errors(log_file)
    
except Exception as e:
    logger.error("Fatal error in main execution:", exc_info=True)
    raise
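
Each line written to the output files carries a prompt, an optional reference response, a category, and a list of model responses, mirroring the record built in `process_line`. Before uploading, a quick structural check can catch malformed lines. The sketch below rests on assumptions: `validate_record` is not part of the notebook or of any AWS SDK, and it checks only the fields used here, not the full Bedrock schema.

```python
import json

REQUIRED_KEYS = ("prompt", "category", "modelResponses")

def validate_record(record: dict) -> list:
    """Return a list of problems found in one evaluation record (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in record]
    for i, resp in enumerate(record.get("modelResponses", [])):
        for field in ("response", "modelIdentifier"):
            if not resp.get(field):
                problems.append(f"modelResponses[{i}] lacks {field}")
    return problems

line = json.dumps({
    "prompt": "What is apoptosis?",
    "referenceResponse": "Programmed cell death.",
    "category": "Biomedical Literature",
    "modelResponses": [{"response": "A form of programmed cell death.",
                        "modelIdentifier": "llama3-2-1b-fine-tuned-pubmed"}],
})
print(validate_record(json.loads(line)))
```

Running the check over every line of a generated `.jsonl` file and reporting non-empty results is one way to confirm the files are ready for upload.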

## Visualize Results
Create visualizations to compare model performance metrics.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def create_comparison_visualizations(data):
    # Extract metrics for both models
    models_data = {}
    
    for model_type, model_info in data.items():
        models_data[model_info['model_name']] = model_info['metrics']
    
    # Convert to DataFrame
    df = pd.DataFrame(models_data)
    
    # Round numeric values to 2 decimal places
    df = df.round(2)
    
    # Create visualizations for all metrics
    metrics = df.index.tolist()
    n_metrics = len(metrics)
    
    # Calculate number of rows and columns for subplots
    n_cols = 3
    n_rows = (n_metrics + n_cols - 1) // n_cols
    
    # Create subplots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 4*n_rows))
    axes = axes.ravel()
    
    # Plot each metric
    for idx, metric in enumerate(metrics):
        ax = axes[idx]
        df.loc[metric].plot(kind='bar', ax=ax)
        ax.set_title(f'{metric.replace("_", " ").title()}')
        ax.tick_params(axis='x', rotation=45)
        
        # Add value labels on top of each bar
        for i, v in enumerate(df.loc[metric]):
            ax.text(i, v, str(round(v, 2)), ha='center', va='bottom')
    
    # Remove extra subplots if any
    for idx in range(len(metrics), len(axes)):
        fig.delaxes(axes[idx])
    
    plt.tight_layout()
    
    return df, fig

# Create visualizations
df, fig = create_comparison_visualizations(comparison_results)

# Display DataFrame
print("\nComparative Metrics Table:")
display(df)  # For Jupyter notebook display

# Display the plot
plt.show()

# Optionally save the results
df.to_csv('model_metrics_comparison.csv')
fig.savefig('model_comparison_plots.png', bbox_inches='tight', dpi=300)
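
The bar charts show raw values; a relative difference per metric can make the gap between the two models easier to read. A minimal sketch with made-up numbers follows. The function `relative_change` and the sample metrics are illustrative assumptions, not results from this notebook.

```python
def relative_change(base: dict, candidate: dict) -> dict:
    """Percent change of candidate vs. base for shared metrics; skips zero baselines."""
    return {
        name: round((candidate[name] - base[name]) / base[name] * 100, 2)
        for name in base
        if name in candidate and base[name] != 0
    }

base_metrics = {"avg_latency": 2.0, "avg_output_tokens_per_request": 100.0, "failed_requests": 0}
tuned_metrics = {"avg_latency": 1.5, "avg_output_tokens_per_request": 120.0, "failed_requests": 1}
print(relative_change(base_metrics, tuned_metrics))
```

Metrics with a zero baseline are dropped because a percent change is undefined for them.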


    

## S3 Bucket Configuration and File Upload
This section handles the configuration of S3 buckets for evaluation data storage and manages the upload of evaluation files.

In [None]:
from sagemaker.s3 import S3Uploader  # used by upload_evaluation_files_to_s3 below

# CORS configuration
def configure_bucket_cors(bucket_name):
    """
    Configure CORS for the S3 bucket used in model evaluation
    
    Args:
        bucket_name (str): Name of the S3 bucket
    """
    try:
        s3_client = boto3.client('s3')
        
        # Define the required CORS configuration
        cors_configuration = {
            'CORSRules': [
                {
                    'AllowedHeaders': ['*'],
                    'AllowedMethods': ['GET', 'PUT', 'POST', 'DELETE'],
                    'AllowedOrigins': ['*'],
                    'ExposeHeaders': ['Access-Control-Allow-Origin']
                }
            ]
        }
        
        # Apply CORS configuration
        try:
            s3_client.put_bucket_cors(
                Bucket=bucket_name,
                CORSConfiguration=cors_configuration
            )
            print(f"Successfully configured CORS for bucket: {bucket_name}")
            
            # Verify CORS configuration
            response = s3_client.get_bucket_cors(Bucket=bucket_name)
            print("\nVerified CORS Configuration:")
            print(json.dumps(response, indent=2))
            
        except s3_client.exceptions.ClientError as e:
            if e.response['Error']['Code'] == 'NoSuchCORSConfiguration':
                print(f"No existing CORS configuration found. Setting new configuration.")
                s3_client.put_bucket_cors(
                    Bucket=bucket_name,
                    CORSConfiguration=cors_configuration
                )
                print(f"Successfully configured CORS for bucket: {bucket_name}")
            else:
                raise
                
    except Exception as e:
        print(f"Error configuring CORS for bucket {bucket_name}: {str(e)}")
        raise

#File Upload Functions
#Upload Evaluation Files
def upload_evaluation_files_to_s3(model_configs, timestamp=None):
    """
    Upload evaluation JSONL files to S3 bucket with clean paths
    
    Args:
        model_configs (dict): Dictionary containing model configurations
        timestamp (str, optional): Timestamp for file naming
        
    Returns:
        dict: Dictionary of model names and their S3 locations
    """
    if timestamp is None:
        timestamp = time.strftime('%Y%m%d_%H%M%S')
    
    # Create S3 path for evaluation
    s3_prefix = f"model-evaluation/{timestamp}"
    s3_base_path = f"s3://{bucket}/{s3_prefix}"
    
    print(f"Uploading files to base path: {s3_base_path}")
    
    # Dictionary to store S3 locations
    s3_locations = {}
    
    try:
        # Upload each model's output file
        for model_name, config in model_configs.items():
            local_file = f"{config['output_prefix']}_{timestamp}.jsonl"
            
            if os.path.exists(local_file):
                print(f"Found local file: {local_file}")
                
                # Create the S3 destination path
                s3_destination = f"{s3_base_path}/{model_name}"
                
                try:
                    # Upload and get the actual S3 location
                    actual_s3_location = S3Uploader.upload(
                        local_path=local_file,
                        desired_s3_uri=s3_destination,
                        sagemaker_session=sess
                    )
                    s3_locations[model_name] = actual_s3_location
                    print(f"Successfully uploaded {local_file} to {actual_s3_location}")
                except Exception as upload_error:
                    print(f"Error uploading {local_file}: {str(upload_error)}")
                    raise
            else:
                print(f"Warning: File {local_file} not found")
        
        # Upload comparison results if they exist
        comparison_file = f"model_comparison_{timestamp}.json"
        if os.path.exists(comparison_file):
            print(f"Found comparison file: {comparison_file}")
            
            # Create the S3 destination path for comparison file
            comparison_s3_destination = f"{s3_base_path}/comparison"
            
            try:
                # Upload and get the actual S3 location
                actual_comparison_location = S3Uploader.upload(
                    local_path=comparison_file,
                    desired_s3_uri=comparison_s3_destination,
                    sagemaker_session=sess
                )
                s3_locations['comparison'] = actual_comparison_location
                print(f"Successfully uploaded comparison results to {actual_comparison_location}")
            except Exception as upload_error:
                print(f"Error uploading comparison file: {str(upload_error)}")
                raise
        
        return s3_locations
        
    except Exception as e:
        print(f"Error in upload process: {str(e)}")
        raise

#Preparation and Verification
def prepare_and_upload_evaluation_files(model_configs, timestamp=None):
    """
    Prepare bucket and upload evaluation files
    """
    if timestamp is None:
        timestamp = time.strftime('%Y%m%d_%H%M%S')
    
    try:
        # Configure CORS for the bucket
        print("\nConfiguring CORS for S3 bucket...")
        configure_bucket_cors(bucket)
        
        # Upload files
        print("\nUploading evaluation files...")
        s3_locations = upload_and_verify_files(model_configs, timestamp)
        
        return s3_locations
        
    except Exception as e:
        print(f"Error in preparation and upload process: {str(e)}")
        raise




#Upload Verification
def upload_and_verify_files(model_configs, timestamp=None):
    """Upload files and verify they exist in S3"""
    if timestamp is None:
        timestamp = time.strftime('%Y%m%d_%H%M%S')
    
    try:
        # Upload files
        s3_locations = upload_evaluation_files_to_s3(model_configs, timestamp)
        
        # Verify uploads using boto3
        s3_client = boto3.client('s3')
        
        print("\nVerifying uploaded files:")
        for model_name, s3_uri in s3_locations.items():
            # Parse S3 URI
            s3_path = s3_uri.replace('s3://', '').split('/')
            bucket_name = s3_path[0]
            key = '/'.join(s3_path[1:])
            
            try:
                # Check if file exists
                s3_client.head_object(Bucket=bucket_name, Key=key)
                print(f"✓ Verified {model_name}: {s3_uri}")
            except Exception as e:
                print(f"✗ Failed to verify {model_name}: {s3_uri}")
                print(f"Error: {str(e)}")
        
        return s3_locations
        
    except Exception as e:
        print(f"Error in upload and verify process: {str(e)}")
        raise


In [None]:
MODEL_CONFIGS

### S3 Bucket Configuration and File Upload
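
The verification helper defined earlier splits each S3 URI by hand before calling `head_object`. A minimal sketch of the same parsing with the standard library is shown below. The helper name `split_s3_uri` and the example URI are assumptions for illustration; they are not part of this notebook's `utils` module.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # netloc is the bucket name; the path carries a leading slash we do not want in the key
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only
bucket_name, key = split_s3_uri("s3://example-bucket/evaluation/base/data.jsonl")
print(bucket_name, key)
```

Unlike a plain `split('/')`, this rejects strings that are not S3 URIs, so a malformed location fails before any S3 call is made.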

In [None]:
import sagemaker
from sagemaker.s3 import S3Uploader
# Prepare the bucket, upload the evaluation files, and verify the uploads
s3_locations = prepare_and_upload_evaluation_files(MODEL_CONFIGS, timestamp)

# Record each model's dataset location in MODEL_CONFIGS and print all locations
print("\nFinal S3 Locations:")
print("=" * 50)
for model_name, location in s3_locations.items():
    if model_name in ('student', 'base'):
        MODEL_CONFIGS[model_name]['s3_location'] = location
    print(f"{model_name}: {location}")


In [None]:
MODEL_CONFIGS

In [None]:
# Define the configuration rules
cors_configuration = {
    'CORSRules': [{
        'AllowedHeaders': ['*'],
        'AllowedMethods': ['GET', 'PUT', 'POST', 'DELETE'],
        'AllowedOrigins': ['*'],
        'ExposeHeaders': ['ETag', 'x-amz-request-id', 'Access-Control-Allow-Origin'],
        'MaxAgeSeconds': 3000
    }]
}

# Set the CORS configuration on the SageMaker session bucket
s3 = boto3.client('s3')
s3.put_bucket_cors(Bucket=sagemaker_session_bucket,
                   CORSConfiguration=cors_configuration)

In [None]:
import logging
import boto3
from botocore.exceptions import ClientError


def get_bucket_cors(bucket_name):
    """Retrieve the CORS configuration rules of an Amazon S3 bucket

    :param bucket_name: string
    :return: List of the bucket's CORS configuration rules. If no CORS
    configuration exists, return empty list. If error, return None.
    """

    # Retrieve the CORS configuration
    s3 = boto3.client('s3')
    try:
        response = s3.get_bucket_cors(Bucket=bucket_name)
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchCORSConfiguration':
            return []
        else:
            # AllAccessDisabled error == bucket not found
            logging.error(e)
            return None
    return response['CORSRules']


# Inspect the CORS rules currently applied to the session bucket
get_bucket_cors(sagemaker_session_bucket)

## IAM Role and Policy Configuration for AWS Bedrock Evaluation
This section sets up the necessary IAM role and policies required to run model evaluations using AWS Bedrock. The configuration includes permissions for both Bedrock services and S3 access.

In [None]:
# Load the local iam_role_helper.py module explicitly by file path
import sys
import importlib.util

spec = importlib.util.spec_from_file_location("iam_role_helper", "iam_role_helper.py")
iam_role_helper = importlib.util.module_from_spec(spec)
sys.modules["iam_role_helper"] = iam_role_helper
spec.loader.exec_module(iam_role_helper)
from iam_role_helper import create_or_update_role

In [None]:
import boto3
import json

# 1. Setup Basic Variables
account_id = boto3.client('sts').get_caller_identity()['Account']  # Get current AWS account ID
region = "us-west-2"  # Note: Custom Model Import (CMI) only works in us-west-2 and us-east-1
role_name = "Bedrock_Evaluation_Role"  # Name for the new IAM role we'll create

# 2. Define Trust Relationship Policy
trust_relationship = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnLike": {"aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:*"}
            }
        }
    ]
}

# 3. Define Permission Policy
# Bedrock resources access policy
bedrock_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BedrockConsole",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateEvaluationJob",
                "bedrock:GetEvaluationJob",
                "bedrock:ListEvaluationJobs",
                "bedrock:StopEvaluationJob",
                "bedrock:GetCustomModel",
                "bedrock:ListCustomModels",
                "bedrock:CreateProvisionedModelThroughput",
                "bedrock:UpdateProvisionedModelThroughput",
                "bedrock:GetProvisionedModelThroughput",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:GetImportedModel",
                "bedrock:ListImportedModels",
                "bedrock:ListTagsForResource",
                "bedrock:UntagResource",
                "bedrock:TagResource",
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                f"arn:aws:bedrock:{region}::foundation-model/*",
                f"arn:aws:bedrock:{region}:{account_id}:*"
            ]
        }
    ]
}

# S3 access policy
s3_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3AccessForModelEvaluation",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:GetBucketCORS",
                "s3:PutBucketCORS",
                "s3:ListBucket",
                "s3:ListBucketVersions",
                "s3:GetBucketLocation",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                f"arn:aws:s3:::{sagemaker_session_bucket}",
                f"arn:aws:s3:::{sagemaker_session_bucket}/*"
            ]
        }
    ]
}

# Combine policies
combined_policy = {
    "Version": "2012-10-17",
    "Statement": (
        bedrock_access_policy["Statement"] +
        s3_access_policy["Statement"]
    )
}

# 4. Create or Update the IAM Role

bedrock_evaluation_role_arn = create_or_update_role(
    role_name=role_name,
    trust_relationship=trust_relationship,
    permission_policy=combined_policy
)

print(f"Role ARN: {bedrock_evaluation_role_arn}")

In [None]:
MODEL_CONFIGS['base']['model_name']

In [None]:
evaluation_path_student = '/'.join(MODEL_CONFIGS['student']['s3_location'].rsplit('/', 1)[0].split('/')[:-1] + ['evaluation'])
evaluation_path_base = '/'.join(MODEL_CONFIGS['base']['s3_location'].rsplit('/', 1)[0].split('/')[:-1] + ['evaluation'])

## AWS Bedrock Evaluation Job Configuration - Student Model
This section creates a configuration file for evaluating the fine-tuned (student) model using AWS Bedrock's evaluation capabilities.
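
A model-as-a-judge job with precomputed responses involves two different model identifiers: the evaluator (judge) model that scores the answers, and the inference source label that names the model whose answers are already stored in the dataset. The sketch below wraps the configuration in one function so the two roles cannot be mixed up. The function name `build_evaluation_job_config`, the role ARN, and the bucket paths are placeholders assumed for illustration.

```python
import json


def build_evaluation_job_config(job_name, role_arn, dataset_s3_uri, output_s3_uri,
                                evaluator_model_id, inference_source_id,
                                metric_names=("Builtin.Correctness", "Builtin.Completeness")):
    """Return an evaluation job config as a plain dict (hypothetical helper)."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {
                        "name": "text_dataset",
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    "metricNames": list(metric_names),
                }],
                # The judge: a Bedrock model that scores the responses
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{"modelIdentifier": evaluator_model_id}]
                },
            }
        },
        # The model under test: its responses are already in the dataset
        "inferenceConfig": {
            "models": [{
                "precomputedInferenceSource": {
                    "inferenceSourceIdentifier": inference_source_id
                }
            }]
        },
        # Normalize the output prefix so it ends with exactly one slash
        "outputDataConfig": {"s3Uri": output_s3_uri.rstrip("/") + "/"},
    }


# Placeholder values for illustration only
config = build_evaluation_job_config(
    job_name="model-eval-student-example",
    role_arn="arn:aws:iam::123456789012:role/Bedrock_Evaluation_Role",
    dataset_s3_uri="s3://example-bucket/data/student/dataset.jsonl",
    output_s3_uri="s3://example-bucket/data/evaluation",
    evaluator_model_id="us.meta.llama3-1-70b-instruct-v1:0",
    inference_source_id="llama3-2-1b-fine-tuned-pubmed",
)
print(json.dumps(config, indent=4))
```

The same function could serve both the student and the base configuration, with only the job name, dataset location, output prefix, and inference source label changing between calls.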

In [None]:
import json
import os
import uuid
from datetime import datetime

# Unique job name: fixed prefix plus 8 random hex characters
job_name_student = f"model-eval-student-{uuid.uuid4().hex[:8]}"

role_arn = bedrock_evaluation_role_arn
dataset_s3_uri = MODEL_CONFIGS['student']['s3_location']
output_s3_uri = evaluation_path_student
# Judge model that scores the responses
model_identifier = "us.meta.llama3-1-70b-instruct-v1:0"
# Label of the model that produced the precomputed responses
inference_source_identifier = MODEL_CONFIGS['student']['model_name']

# Create a Python dictionary with your data
evaluation_job_dict = {
    "jobName": job_name_student,
    "roleArn": role_arn,
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "text_dataset",
                        "datasetLocation": {
                            "s3Uri": dataset_s3_uri
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness"
                    ]
                }
            ],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {
                        # Judge model that scores the responses
                        "modelIdentifier": model_identifier
                    }
                ]
            }
        }
    },
    "inferenceConfig": {
        "models": [
            {
                "precomputedInferenceSource": {
                    # Label of the model that produced the responses in the dataset
                    "inferenceSourceIdentifier": inference_source_identifier
                }
            }
        ]
    },
    "outputDataConfig": {
        "s3Uri": output_s3_uri+"/"
    }
}

# Convert the dictionary to a JSON string
evaluation_job_json = json.dumps(evaluation_job_dict, indent=4)

# Create a unique filename from the model role and a timestamp,
# so the student and base files cannot overwrite each other
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"evaluation_job_student_{timestamp}.json"

# Define the directory where you want to save the file
save_directory = "evaluation_jobs"  # You can change this to your preferred directory

# Create the directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Combine the directory and filename to get the full path
file_path = os.path.join(save_directory, filename)

# Write the JSON string to the file
with open(file_path, 'w') as f:
    f.write(evaluation_job_json)

# Store the file location in a variable
json_file_location_student = file_path
print(evaluation_job_json)

print(f"JSON file saved to: {json_file_location_student}")


## AWS Bedrock Evaluation Job Configuration - Base Model
This section creates a configuration file for evaluating the Base model using AWS Bedrock's evaluation capabilities.
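
The output prefix used in these configurations is derived from the dataset location by replacing the dataset's parent folder with `evaluation`. A more readable sketch of that derivation follows. The helper name `evaluation_prefix` and the example URI are assumptions for illustration, and the behavior for keys with no parent folder is a choice made here, not something the notebook defines.

```python
import posixpath
from urllib.parse import urlparse


def evaluation_prefix(dataset_s3_uri, folder="evaluation"):
    """Replace the dataset's parent folder with `folder` (hypothetical helper)."""
    parsed = urlparse(dataset_s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {dataset_s3_uri!r}")
    key = parsed.path.lstrip("/")
    parent_dir = posixpath.dirname(key)          # folder that holds the dataset file
    grandparent = posixpath.dirname(parent_dir)  # one level above that folder
    return "s3://" + posixpath.join(parsed.netloc, grandparent, folder)


# Placeholder URI for illustration only
print(evaluation_prefix("s3://example-bucket/run-01/base/dataset.jsonl"))
```

Deriving the prefix once per model and passing it explicitly to each job configuration keeps the student and base results in clearly separate locations.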

In [None]:
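import json
import os
import uuid
from datetime import datetime

# Unique job name: fixed prefix plus 8 random hex characters
job_name_base = f"model-eval-base-{uuid.uuid4().hex[:8]}"

role_arn = bedrock_evaluation_role_arn
dataset_s3_uri = MODEL_CONFIGS['base']['s3_location']
# Write base-model results to the base model's own evaluation prefix
output_s3_uri = evaluation_path_base
# Judge model that scores the responses
model_identifier = "us.meta.llama3-1-70b-instruct-v1:0"
# Label of the model that produced the precomputed responses
inference_source_identifier = MODEL_CONFIGS['base']['model_name']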

# Create a Python dictionary with your data
evaluation_job_dict = {
    "jobName": job_name_base,
    "roleArn": role_arn,
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "text_dataset",
                        "datasetLocation": {
                            "s3Uri": dataset_s3_uri
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness"
                    ]
                }
            ],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {
                        "modelIdentifier": model_identifier
                    }
                ]
            }
        }
    },
    "inferenceConfig": {
        "models": [
            {
                "precomputedInferenceSource": {
                    "inferenceSourceIdentifier": inference_source_identifier
                }
            }
        ]
    },
    "outputDataConfig": {
        "s3Uri": output_s3_uri+"/"
    }
}

# Convert the dictionary to a JSON string
evaluation_job_json = json.dumps(evaluation_job_dict, indent=4)

# Create a unique filename from the model role and a timestamp,
# so the student and base files cannot overwrite each other
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"evaluation_job_base_{timestamp}.json"

# Define the directory where you want to save the file
save_directory = "evaluation_jobs"  # You can change this to your preferred directory

# Create the directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Combine the directory and filename to get the full path
file_path = os.path.join(save_directory, filename)

# Write the JSON string to the file
with open(file_path, 'w') as f:
    f.write(evaluation_job_json)

# Store the file location in a variable
json_file_location_base = file_path
print(evaluation_job_json)

print(f"JSON file saved to: {json_file_location_base}")

## Executing AWS Bedrock Evaluation Job

This section demonstrates how to programmatically create an evaluation job in AWS Bedrock using the AWS CLI through Python's subprocess module.

⚠️ **Important Note**
At present, evaluation jobs should be created directly through the AWS Bedrock console. The programmatic creation shown here may not work as expected.

### Alternative Approach
To create an evaluation job:

1. Navigate to the [AWS Bedrock Console](https://console.aws.amazon.com/bedrock)
2. Select "Evaluation" from the left navigation pane
3. Click the Create button, then choose "Create automatic evaluation using model as a judge".
4. Use the configuration from the files generated in this notebook.
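
Before copying values from a generated file into the console, it can help to sanity-check the file first. The sketch below loads a configuration file and reports two easy-to-miss mistakes: an evaluator model that is the same as the inference source, and locations that are not S3 URIs. The helper name `check_evaluation_config` and all sample values are assumptions for illustration, and the checks are not an exhaustive validation of what Bedrock accepts.

```python
import json
import os
import tempfile


def check_evaluation_config(path):
    """Return a list of problems found in a generated job file (hypothetical helper)."""
    with open(path) as f:
        cfg = json.load(f)

    automated = cfg["evaluationConfig"]["automated"]
    evaluator = automated["evaluatorModelConfig"]["bedrockEvaluatorModels"][0]["modelIdentifier"]
    source = (cfg["inferenceConfig"]["models"][0]
              ["precomputedInferenceSource"]["inferenceSourceIdentifier"])
    dataset_uri = automated["datasetMetricConfigs"][0]["dataset"]["datasetLocation"]["s3Uri"]
    output_uri = cfg["outputDataConfig"]["s3Uri"]

    problems = []
    # The judge should not be the model whose answers are being judged
    if evaluator == source:
        problems.append("evaluator model and inference source are identical")
    for label, uri in (("dataset", dataset_uri), ("output", output_uri)):
        if not uri.startswith("s3://"):
            problems.append(f"{label} location is not an S3 URI: {uri}")
    return problems


# Placeholder configuration for illustration only
sample = {
    "jobName": "model-eval-base-example",
    "roleArn": "arn:aws:iam::123456789012:role/Bedrock_Evaluation_Role",
    "evaluationConfig": {"automated": {
        "datasetMetricConfigs": [{
            "taskType": "General",
            "dataset": {"name": "text_dataset",
                        "datasetLocation": {"s3Uri": "s3://example-bucket/base/data.jsonl"}},
            "metricNames": ["Builtin.Correctness", "Builtin.Completeness"],
        }],
        "evaluatorModelConfig": {"bedrockEvaluatorModels": [
            {"modelIdentifier": "us.meta.llama3-1-70b-instruct-v1:0"}]},
    }},
    "inferenceConfig": {"models": [{"precomputedInferenceSource": {
        "inferenceSourceIdentifier": "meta-llama3-2-1b-base"}}]},
    "outputDataConfig": {"s3Uri": "s3://example-bucket/evaluation/"},
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "evaluation_job_example.json")
    with open(sample_path, "w") as f:
        json.dump(sample, f)
    print(check_evaluation_config(sample_path) or "No problems found")
```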



In [None]:
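import subprocess

# Submit the base-model job (use json_file_location_student for the fine-tuned model).
# Passing the arguments as a list avoids shell quoting issues.
aws_command = ["aws", "bedrock", "create-evaluation-job",
               "--cli-input-json", f"file://{json_file_location_base}"]
result = subprocess.run(aws_command, capture_output=True, text=True)
print(result.stdout or result.stderr)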