# Response-Based Knowledge Distillation with QA Specialization

## Using Amazon SageMaker JumpStart for LLM distillation (Llama 3.2 90B → Llama 3.2 1B)

This notebook demonstrates an end-to-end workflow for knowledge distillation using Amazon SageMaker JumpStart and Amazon Bedrock. The process involves distilling knowledge from a large language model (90B parameters) to a smaller model (1B parameters) while maintaining performance on specialized QA tasks.

## 1. Introduction

### Overview of Knowledge Distillation
Knowledge distillation is a model compression technique where a smaller model (student) learns to mimic the behavior of a larger model (teacher). This approach helps:
- Reduce computational requirements
- Lower inference costs
- Maintain acceptable performance levels
- Enable deployment on resource-constrained environments
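
In the response-based variant used in this notebook, the student never sees the teacher's weights or logits; it is fine-tuned on the answers the teacher generates. The sketch below illustrates that data flow only. `build_distillation_pairs` and `fake_teacher` are hypothetical names introduced for illustration, and `fake_teacher` stands in for a real call to the large model.

```python
from typing import Callable, Dict, List


def build_distillation_pairs(
    questions: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask the teacher each question and keep (instruction, response) pairs."""
    pairs = []
    for question in questions:
        answer = teacher(question).strip()
        if answer:  # skip empty generations
            pairs.append({"instruction": question, "response": answer})
    return pairs


def fake_teacher(question: str) -> str:
    # Hypothetical stand-in: a real teacher would be a call to the large model.
    return f"Answer to: {question} "


pairs = build_distillation_pairs(["What is aspirin used for?"], fake_teacher)
print(pairs)
```

The resulting pairs are what the student model is later fine-tuned on.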
### SageMaker JumpStart Benefits
Amazon SageMaker JumpStart provides:
- Pre-trained models optimized for AWS infrastructure
- Simplified model deployment and fine-tuning workflows
- Integration with other AWS services like Amazon S3 and Amazon CloudWatch
- Built-in security features and compliance controls
- Automated model optimization and deployment pipelines
### Project Goals and Objectives
This implementation aims to:
1. Create an efficient, smaller model specialized for QA tasks
2. Maintain high accuracy on domain-specific questions
3. Reduce inference costs
4. Demonstrate AWS best practices for model optimization
5. Enable deployment through Amazon Bedrock Custom Model Import
6. Evaluate the model's performance when served through Custom Model Import rather than a SageMaker endpoint

In [None]:
%pip install --quiet --upgrade sagemaker jmespath datasets transformers jinja2 ipywidgets boto3

## 2. Environment Setup

### AWS Account Configuration 

This section configures the necessary AWS resources including:

- SageMaker session and default bucket
- IAM roles and permissions
- Region-specific settings
- Required SDK versions and dependencies

In [None]:
# Standard library imports
import json
import time
import uuid
import random
import sys
import logging
from datetime import datetime
import pprint
from IPython.display import display, Markdown, Latex

# AWS SDK imports
import boto3
import botocore
from botocore.config import Config
import sagemaker
from sagemaker.s3 import S3Uploader
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker import hyperparameters, metric_definitions
from sagemaker.parameter import ContinuousParameter, CategoricalParameter, IntegerParameter
from sagemaker.tuner import HyperparameterTuner
from sagemaker.debugger import TensorBoardOutputConfig

# Data processing and ML imports
import pandas as pd
import requests

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Custom modules (assuming these exist in your environment)
import importlib.util
spec = importlib.util.spec_from_file_location("iam_role_helper", "iam_role_helper.py")
iam_role_manager = importlib.util.module_from_spec(spec)
sys.modules["iam_role_manager"] = iam_role_manager
spec.loader.exec_module(iam_role_manager)

spec = importlib.util.spec_from_file_location("utils", "utils.py")
utils = importlib.util.module_from_spec(spec)
sys.modules["utils"] = utils
spec.loader.exec_module(utils)

# Import custom functions
from utils import (
    download_artifacts, 
    remove_field_from_json, 
    upload_artifacts, 
    cleanup_local_files, 
    wait_for_model_availability, 
    test_image_processing
)
from iam_role_helper import create_or_update_role

# Initialize key AWS clients
sess = sagemaker.Session()
sagemaker_client = boto3.client('sagemaker')
bedrock_client = boto3.client('bedrock', region_name='us-west-2')
s3_client = boto3.client('s3')
iam_client = boto3.client('iam')

# Set default configurations
config = Config(
    retries={
        'total_max_attempts': 100,
        'mode': 'standard'
    }
)

In [None]:
# SageMaker session bucket: used for uploading data, models, and logs.
# SageMaker creates this bucket automatically if it does not exist.
sagemaker_session_bucket = None
# Fall back to the session's default bucket when no bucket name is given,
# so that `bucket` is always defined
bucket = sagemaker_session_bucket or sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    #change the name of the role if you are running locally
    role = iam.get_role(RoleName='sagemaker-execution-role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=bucket)
region=sess.boto_region_name

prefix = "llama-qa-distillation"
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")

### Role Configuration
Configures IAM roles with required permissions for:
- Amazon Bedrock model access
- S3 bucket operations for model artifacts
- CloudWatch logging capabilities
- Cross-service permissions for SageMaker

Key components:
1. Trust relationships for service principals
2. Permission policies for resource access
3. Cross-account access configurations
4. Logging and monitoring permissions
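
The `create_or_update_role` helper used in the next cell comes from `iam_role_helper.py`, which is not shown in this notebook. As a rough sketch of what such a helper might do, the function below uses the standard boto3 IAM calls `get_role`, `create_role`, `update_assume_role_policy`, and `put_role_policy`. The function name, the `policy_name` default, and the overall flow are assumptions, not the actual helper.

```python
import json


def create_or_update_role_sketch(iam_client, role_name, trust_relationship,
                                 permission_policy,
                                 policy_name="bedrock-import-access"):
    """Create the role if it is missing, otherwise refresh its trust policy.

    Hypothetical sketch: the notebook's real helper lives in
    iam_role_helper.py and may behave differently.
    """
    trust_doc = json.dumps(trust_relationship)
    try:
        role = iam_client.get_role(RoleName=role_name)
        # Role already exists: replace its trust relationship
        iam_client.update_assume_role_policy(
            RoleName=role_name, PolicyDocument=trust_doc
        )
    except iam_client.exceptions.NoSuchEntityException:
        role = iam_client.create_role(
            RoleName=role_name, AssumeRolePolicyDocument=trust_doc
        )
    # Attach (or overwrite) the inline permission policy
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(permission_policy),
    )
    return role["Role"]["Arn"]


# Usage (requires AWS credentials and boto3):
# arn = create_or_update_role_sketch(boto3.client("iam"), role_name,
#                                    trust_relationship, permission_policy)
```

Passing the client in as an argument keeps the function easy to exercise with a stub before it touches a real account.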

In [None]:
# IAM Role Configuration for Amazon Bedrock Custom Model Import

# 1. Setup Basic Variables
account_id = boto3.client('sts').get_caller_identity()['Account']  # Get current AWS account ID
region = "us-west-2"  # Note: Custom Model Import (CMI) only works in us-west-2 and us-east-1
training_bucket = bucket  # S3 bucket where training artifacts are stored (sagemaker_session_bucket may be None)
role_name = "Sagemaker_Bedrock_import_role"  # Name for the IAM role we'll create

# 2. Define Trust Relationship Policy
# This policy defines which AWS services can assume this role
trust_relationship = {
    "Version": "2012-10-17",
    "Statement": [
        # Allow Bedrock service to assume this role
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                # Ensure requests only come from our account
                "StringEquals": {"aws:SourceAccount": account_id},
                # Limit to specific Bedrock model import jobs
                "ArnEquals": {"aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-import-job/*"}
            }
        },
        # Allow Lambda service to assume this role (if needed for auxiliary functions)
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

# 3. Define Permission Policy
# This policy defines what AWS resources the role can access
permission_policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Allow S3 access for model artifacts
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],  # Read-only access to S3
            "Resource": [
                f"arn:aws:s3:::{training_bucket}",  # Access to bucket
                f"arn:aws:s3:::{training_bucket}/*"  # Access to objects in bucket
            ],
            "Condition": {"StringEquals": {"aws:ResourceAccount": account_id}}  # Restrict to our account
        },
        # Allow CloudWatch Logs access for monitoring
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"  # Access to CloudWatch Logs
        }
    ]
}

# 4. Create or Update the IAM Role
bedrock_role_arn = create_or_update_role(
    role_name=role_name,
    trust_relationship=trust_relationship,
    permission_policy=permission_policy
)

print(f"Role ARN: {bedrock_role_arn}")

## 3. Teacher Model Selection in Amazon Bedrock

### Model Selection Criteria
When choosing a foundation model in Amazon Bedrock for knowledge distillation, several key factors should be considered:

#### 1. Model Architecture and Size
The Meta Llama 3.1 405B model offers several advantages as a teacher model:
- Larger parameter count provides richer knowledge representation
- Enhanced ability to capture complex patterns and relationships
- Superior performance on specialized tasks like medical QA
- Better few-shot learning capabilities
#### 2. Cost-Performance Trade-offs
Amazon Bedrock's pay-per-use pricing model enables:
- No upfront infrastructure costs
- Payment only for actual inference time
- Flexible scaling based on demand
- Cost optimization through batch processing

Reference: [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/)

#### 3. Specialized Knowledge Transfer
The 405B model is particularly suitable for knowledge distillation because:
- Higher accuracy on complex medical terminology
- Better understanding of scientific context
- More nuanced response generation
- Improved zero-shot performance on domain-specific tasks
#### 4. Operational Considerations
Benefits of using Bedrock for the teacher model:
- Serverless architecture eliminates infrastructure management
- Built-in auto-scaling
- High availability across AWS regions
- Simplified API integration
### Model Configuration
The Llama 3.1 405B model in Bedrock can be configured with:
- Temperature settings for response diversity
- Maximum token length for comprehensive answers
- Top-p and top-k sampling parameters
- Custom prompt templates for specialized tasks

Reference: [Amazon Bedrock Llama Model Configuration](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html)
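
As a small illustration of these settings, the helper below assembles a request body with the same keys this notebook sends (`prompt`, `temperature`, `top_p`, `max_gen_len`). The function name and the range checks are assumptions added for illustration; consult the model parameter reference above for the limits that apply to your model.

```python
def build_llama_body(prompt, temperature=0.5, top_p=0.9, max_gen_len=512):
    """Build a request body using the parameter names from this notebook."""
    # Assumed sanity checks; the authoritative limits are in the Bedrock docs.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0 and 1")
    if max_gen_len < 1:
        raise ValueError("max_gen_len must be positive")
    return {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_gen_len": max_gen_len,
    }


body_example = build_llama_body("Hello", temperature=0.2)
print(body_example)
```

Lower `temperature` values suit distillation data generation, where consistent teacher answers matter more than variety.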

### Integration with Knowledge Distillation
The workflow leverages Bedrock's advantages:
1. Generate high-quality training data through batch inference
2. Create specialized QA pairs for student model training
3. Maintain quality while reducing computational requirements
4. Enable seamless deployment through Custom Model Import

Reference: 
- [Amazon Bedrock Custom Model Import](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html)
- [Amazon Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html)

### Best Practices
When using the teacher model:
1. Implement proper error handling and retry mechanisms
2. Use batch processing for dataset generation
3. Monitor usage and costs through AWS CloudWatch
4. Implement appropriate security controls and encryption
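
For the first practice, one common pattern is retrying with exponential backoff and jitter. The sketch below is generic Python rather than an AWS API; `call_with_backoff` and its parameters are hypothetical names, and in practice you would restrict `retryable` to throttling or transient errors.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retryable=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
```

For example, the `invoke_model` helper defined later in this notebook could be wrapped as `call_with_backoff(lambda: invoke_model(body, modelId, accept, contentType))`.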

For more information on model selection and configuration, see:
- [Choose the best foundational model for your AI applications](https://community.aws/content/2fKJW0z9PEIKec94DZwtYigCF7i/choose-the-best-foundational-model-for-your-ai-applications?lang=en)
- [Llama Technical Documentation](https://www.llama.com/docs/overview/)
- [Amazon Bedrock Developer Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html)

### Bedrock client setup

Several models are available on Bedrock, depending on the Region. Here we focus on Llama 3.1 405B Instruct, which is available in us-west-2.

In [None]:
import boto3
bedrock_client = boto3.client('bedrock', region_name="us-west-2")
model_id='meta.llama3-1-405b-instruct-v1:0'

### Testing Inference with Bedrock Runtime

This section demonstrates how to perform inference using the Bedrock Runtime client with the Llama model.

**Note**: The Bedrock runtime client is specifically for model inference, separate from the main Bedrock client used for model management.


Inference Helper Function.

This function handles the core interaction with the Bedrock Runtime API, including error handling and response formatting.

In [None]:
brt = boto3.client(service_name='bedrock-runtime', region_name='us-west-2')

def invoke_model(body, model_id, accept, content_type):
    """Send a JSON request body to a Bedrock model and return the raw response."""
    try:
        response = brt.invoke_model(
            body=json.dumps(body),
            modelId=model_id,
            accept=accept,
            contentType=content_type
        )
        return response
    except Exception:
        print(f"Couldn't invoke {model_id}")
        raise  # re-raise with the original traceback

Query Setup and Model Parameters.

Key Parameters:

- temperature: Lower values make output more focused and deterministic
- top_p: Controls diversity of token selection
- max_gen_len: Limits response length

In [None]:
# If you'd like to try your own prompt, edit this parameter!

question = """Is a mandatory general surgery rotation necessary in the surgical clerkship?"""
user_message = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n {question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

body = {
    "prompt": user_message,
    "temperature": 0.5,
    "top_p": 0.9,
    "max_gen_len": 512,
}


Model Configuration and Invocation
- Uses the Llama 3.1 405B Instruct model
- Expects and returns JSON formatted data
- Response includes generated text in the "generation" field
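
To make the response shape concrete, the sketch below parses a simulated response body. Only the `generation` field is confirmed by this notebook; the token-count and `stop_reason` fields are assumptions about the Meta Llama response format on Bedrock and are therefore read with `.get()`. `parse_llama_response` is a hypothetical helper name.

```python
import io
import json


def parse_llama_response(response):
    """Extract the generated text and optional metadata from a response."""
    payload = json.loads(response["body"].read())
    return {
        "text": payload["generation"].strip(),
        # The fields below are assumed to be present; .get() tolerates absence
        "stop_reason": payload.get("stop_reason"),
        "prompt_tokens": payload.get("prompt_token_count"),
        "generated_tokens": payload.get("generation_token_count"),
    }


# Simulated response; a real one comes from the bedrock-runtime client
fake_response = {"body": io.BytesIO(json.dumps({
    "generation": " A general surgery rotation is ...",
    "prompt_token_count": 20,
    "generation_token_count": 7,
    "stop_reason": "stop",
}).encode("utf-8"))}
print(parse_llama_response(fake_response))
```

Checking `stop_reason` is a cheap way to spot answers that were cut off by `max_gen_len` before they are used as training targets.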

In [None]:
modelId = "meta.llama3-1-405b-instruct-v1:0"
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body["generation"])

## 4. Dataset Generation

This section explains how to prepare and process the PubMedQA dataset for knowledge distillation using AWS services.

### Overview
The PubMedQA dataset is a large-scale question-answering dataset focused on biomedical research literature. We'll use Amazon S3 for storage and SageMaker Processing Jobs for data preparation.



### Dataset Details
**PubMedQA Dataset**
- Source: [PubMedQA GitHub Repository](https://github.com/pubmedqa/pubmedqa/tree/master)
- Citation:
> Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA: A Dataset for Biomedical Research Question Answering. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pp. 2567-2577.
- Format: JSON

### Implementation Steps

#### 1. Dataset Download and Validation
This module handles downloading and processing PubMedQA dataset from GitHub for use with 
Amazon SageMaker and Amazon Bedrock knowledge distillation workflow.

In [None]:
import requests

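def get_github_json(url):
    """Download a JSON file from GitHub; returns None if the request fails."""
    try:
        # Convert regular GitHub URL to raw content URL
        raw_url = url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
        response = requests.get(raw_url, timeout=30)
        response.raise_for_status()  # surface HTTP errors such as 404
        return response.json()
    except (requests.RequestException, ValueError) as e:
        print(f"Error: {e}")
        return None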

# Example usage:
url = "https://github.com/pubmedqa/pubmedqa/blob/master/data/ori_pqal.json"
data = get_github_json(url)

#### 2. Data Processing and JSONL Conversion
This section converts the PubMedQA dataset to JSONL format.

In [None]:
dataset=[]
qa_index=list(data.keys())
for i in qa_index:
    keys_to_get = ['QUESTION', 'CONTEXTS','LONG_ANSWER']
    result = {k: data[i].get(k) for k in keys_to_get}
    dataset.append(result)

In [None]:
output_file_dataset='dataset.jsonl'
with open(output_file_dataset, 'w') as outfile:
    for sample in dataset:
        # Keep each question with its reference (ground-truth) long answer
        batch_record = {
            "question": sample['QUESTION'],
            "answers": sample['LONG_ANSWER']
        }
        
        outfile.write(json.dumps(batch_record) + '\n')

### Using Teacher Model for QA Generation

This section covers:

- Generating synthetic QA pairs
- Batch processing with Bedrock
- Data augmentation strategies
- Quality control measures

#### Batch Processing vs Real-Time Inference

For large datasets, Amazon Bedrock's batch inference offers several advantages over real-time inference:

1. **Performance Benefits**
   - Higher throughput for large-scale processing
   - Reduced risk of API throttling
   - More efficient resource utilization

2. **Cost Optimization**
   - Lower per-request costs compared to real-time inference
   - Better resource allocation and scheduling
   - Reduced overhead from connection management

3. **Operational Advantages**
   - Built-in retry mechanisms
   - Simplified monitoring and logging
   - Better handling of large datasets

For this implementation, we leverage Bedrock's batch processing to optimize both performance and cost efficiency while maintaining processing quality.

### Preparing Dataset for Bedrock Batch Processing

This code creates a JSONL file formatted specifically for Amazon Bedrock batch inference:

- **Purpose**: Converts QA dataset into Bedrock's required batch processing format
- **Key Operations**:
  - Formats prompts using Llama 3's instruction template
  - Assigns unique IDs to each record
  - Sets inference parameters (temperature, max length, etc.)
  - Creates JSONL output with required Bedrock structure

The resulting file enables efficient batch processing of multiple questions through Bedrock's batch inference API, optimizing for throughput and cost efficiency.
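
The instruction template mentioned above can be sketched as a small helper. It follows the Llama 3 chat layout, in which each header is followed by two newlines and each message ends with `<|eot_id|>`. `format_llama3_prompt` and its optional `system` argument are hypothetical additions for illustration.

```python
def format_llama3_prompt(question, system=None):
    """Wrap a question in Llama 3 chat-format special tokens."""
    parts = ["<|begin_of_text|>"]
    if system:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|>"
    )
    # Leave the assistant turn open so the model writes the answer
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(format_llama3_prompt("Do statins reduce stroke risk?"))
```

Using one helper for both single-request tests and batch records keeps the prompt identical across the two paths.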

> **Note**: The template uses Llama 3's specific tokens (`<|begin_of_text|>`, `<|eot_id|>`) for proper model instruction formatting.

In [None]:
import json
from datetime import datetime
import uuid

def create_bedrock_batch_dataset(dataset, output_file='bedrock_batch_dataset.jsonl'):
    # Llama 3 instruction format: each header is followed by a blank line,
    # and the user message ends with <|eot_id|>
    prompt_template = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    
    with open(output_file, 'w') as outfile:
        for sample in dataset:
            # Generate a unique 11-character alphanumeric record ID
            # (.hex avoids the hyphen that str(uuid4()) places at position 9)
            record_id = uuid.uuid4().hex[:11]
            
            # Format the prompt
            formatted_prompt = prompt_template.format(
                question=sample["QUESTION"]
            )

            # Create the model input body for Llama 3
            body = {
                "prompt": formatted_prompt,
                "max_gen_len": 1024,
                "temperature": 0.0,
                "top_p": 0.9
            }

            # Create the complete record for batch inference
            batch_record = {
                "recordId": record_id,
                "modelInput": body
            }
            
            outfile.write(json.dumps(batch_record) + '\n')

# Usage
create_bedrock_batch_dataset(dataset)

### Uploading Batch Dataset to Amazon S3

This code handles the upload of the prepared batch dataset to Amazon S3, a necessary step before running Bedrock batch inference:

- **Purpose**: Transfers the local JSONL file to S3 for Bedrock access
- **Components**:
  - Uses SageMaker's `S3Uploader` utility for simplified file transfer
  - Organizes files under a structured prefix (`distillation/batch/data`)
  - Automatically handles S3 path formatting and permissions

> **Note**: The S3 location will be referenced in subsequent Bedrock batch inference job configurations. Ensure the Bedrock role has appropriate S3 read permissions.


In [None]:
import sagemaker
from sagemaker.s3 import S3Uploader
# Define source and destination paths
local_path_batch_file = 'bedrock_batch_dataset.jsonl'
s3_prefix_batch = 'distillation/batch/data'  # This will be the folder in S3

# Upload the file
s3_path_batch = S3Uploader.upload(
    local_path=local_path_batch_file,
    desired_s3_uri=f's3://{bucket}/{s3_prefix_batch}',
)

print(f"File uploaded successfully to: {s3_path_batch}")

### Bedrock Batch Inference Configuration

This section configures and launches a batch inference job using Amazon Bedrock for large-scale QA processing:

#### Configuration Components
- **Input Configuration**: Points to the JSONL dataset in S3
- **Output Configuration**: Specifies where Bedrock will store inference results
- **Job Settings**: 
  - Unique job name using timestamp
  - Model ARN for Llama 3.1 405B Instruct
  - IAM role for execution permissions
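
Before launching the job, it can help to sanity-check the input file locally, since a malformed record is cheaper to catch here than after the job is queued. The sketch below checks only the structure this notebook writes (`recordId` plus a `modelInput` containing `prompt`); `validate_batch_file` is a hypothetical helper and does not cover service-side limits such as record counts or file sizes.

```python
import json


def validate_batch_file(path):
    """Return (record_count, problems) for a batch-input JSONL file."""
    seen_ids, problems, count = set(), [], 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            count += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: record is not a JSON object")
                continue
            record_id = record.get("recordId")
            if not record_id:
                problems.append(f"line {line_no}: missing recordId")
            elif record_id in seen_ids:
                problems.append(f"line {line_no}: duplicate recordId {record_id}")
            else:
                seen_ids.add(record_id)
            model_input = record.get("modelInput")
            if not isinstance(model_input, dict) or "prompt" not in model_input:
                problems.append(f"line {line_no}: modelInput has no prompt")
    return count, problems


# Usage: count, problems = validate_batch_file('bedrock_batch_dataset.jsonl')
```

An empty `problems` list means every record has a unique ID and a prompt.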

In [None]:
output_prefix="output"
inputDataConfig=({
    "s3InputDataConfig": {
        "s3Uri": s3_path_batch
    }
})

outputDataConfig=({
    "s3OutputDataConfig": {
        "s3Uri": f"s3://{bucket}/{s3_prefix_batch}/{output_prefix}/"
    }
})

Launch batch job

In [None]:

from datetime import datetime
jobName = 'batch-job-ga' + str(int(datetime.now().timestamp()))
response=bedrock_client.create_model_invocation_job(
    # The role must be assumable by Bedrock and allowed to read the input
    # and write the output S3 locations
    roleArn=role,
    modelId='arn:aws:bedrock:us-west-2::foundation-model/meta.llama3-1-405b-instruct-v1:0',
    jobName=jobName,
    inputDataConfig=inputDataConfig,
    outputDataConfig=outputDataConfig
)

For more information, see the [Amazon Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) documentation.

### Monitoring Bedrock Batch Job Status

This code implements a job status monitoring loop for the Bedrock batch inference:

- **Purpose**: Tracks batch job progress until completion or failure
- **Key Operations**:
  - Extracts job ARN and ID for tracking
  - Polls job status every 5 minutes
  - Provides real-time status updates
  - Handles completion and failure scenarios
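
The polling pattern described above can be written as a reusable function with a timeout, so a job that never reaches a terminal state does not block the notebook indefinitely. This is a generic sketch: `wait_for_job` is a hypothetical name, and the list of terminal states is an assumption that should be checked against the Bedrock batch inference documentation.

```python
import time


def wait_for_job(get_status, poll_seconds=300, timeout_seconds=6 * 3600,
                 terminal_states=("Completed", "PartiallyCompleted",
                                  "Failed", "Stopped", "Expired"),
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal state is reached or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Usage with the client from this notebook:
# final = wait_for_job(lambda: bedrock_client.get_model_invocation_job(
#     jobIdentifier=jobArn)["status"])
```

Injecting `sleep` and `clock` keeps the function testable without waiting in real time.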

> **Note**: Consider implementing this monitoring pattern in AWS Lambda or Step Functions for production workloads.

In [None]:
import time
jobArn = response.get('jobArn')
job_id = jobArn.split('/')[1]

print(jobArn)

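# Treat every terminal state as an exit condition so the loop cannot poll forever
terminal_states = ['Completed', 'PartiallyCompleted', 'Failed', 'Stopped', 'Expired']
status = ''
while status not in terminal_states:
    job_response = bedrock_client.get_model_invocation_job(jobIdentifier=jobArn)
    status = job_response['status']
    print(datetime.now(), ": ", status)
    if status in terminal_states:
        if status != 'Completed':
            print(job_response)  # includes the failure message, if any
        break
    time.sleep(300)  # poll every 5 minutes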

### Processing Bedrock Batch Results for Training

This section handles the retrieval and processing of batch inference results from S3 for model training:
#### Data Flow
1. **Retrieval**: Fetches batch results from S3
2. **Processing**: Extracts model generations from JSON responses
3. **Formatting**: Prepares data for JumpStart/Bedrock training format
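
One caveat for the processing step: pairing outputs with questions by position assumes the output file preserves input order. A more defensive option is to match on `recordId`, which each output record carries alongside `modelOutput`. The sketch below assumes you kept a `recordId` to sample mapping when writing the batch file, which the cells in this notebook do not do; `attach_teacher_answers` is a hypothetical helper.

```python
import json


def attach_teacher_answers(samples_by_id, output_lines):
    """Match batch output records to their source samples by recordId."""
    paired, skipped = [], []
    for line in output_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        record_id = record.get("recordId")
        # Failed records may carry no modelOutput at all
        generation = (record.get("modelOutput") or {}).get("generation")
        if record_id in samples_by_id and generation:
            paired.append(
                {**samples_by_id[record_id], "TEACHER_ANSWER": generation.strip()}
            )
        else:
            skipped.append(record_id)
    return paired, skipped


samples_by_id = {"a1": {"QUESTION": "Q1"}, "b2": {"QUESTION": "Q2"}}
output_lines = [
    json.dumps({"recordId": "b2", "modelOutput": {"generation": " Answer 2 "}}),
    json.dumps({"recordId": "a1", "error": {"errorMessage": "throttled"}}),
]
paired, skipped = attach_teacher_answers(samples_by_id, output_lines)
print(paired, skipped)
```

Records that come back without a generation end up in `skipped`, so they can be retried instead of silently shifting every later pair.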

Retrieve batch results from S3

In [None]:
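# Retrieve batch results from S3
# job_id comes from the monitoring cell above; set it manually only when
# resuming in a new session, e.g. job_id = 'yqlsjasksdyv'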
s3 = boto3.client('s3')
prefix = f"{s3_prefix_batch}/{output_prefix}/{job_id}/"
print(f"prefix: {bucket}/{prefix}")
object_key = f"{prefix}{local_path_batch_file}.out"
response = s3.get_object(Bucket=bucket, Key=object_key)

In [None]:
# Process and extract teacher model responses
json_data = response['Body'].read().decode('utf-8')
teacher_answer=[]
for line in json_data.splitlines():
    # Use a separate name so the PubMedQA dict in `data` is not overwritten
    record = json.loads(line)
    print(record['modelOutput']['generation'])
    teacher_answer.append(record['modelOutput']['generation'])

This code combines the original dataset with teacher model responses:

In [None]:
for data_item, teacher in zip(dataset, teacher_answer):
    data_item['TEACHER_ANSWER'] = teacher

> **Note**: This paired dataset forms the foundation for training the student model to mimic the teacher's behavior.

### Preparing Training Data for SageMaker JumpStart

This section formats the QA dataset for fine-tuning using SageMaker JumpStart's specific requirements:

#### Data Formatting Process
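1. **Template Creation**
   - Defines the prompt/completion template that JumpStart applies to each record
   - Uses `{instruction}`, `{context}`, and `{response}` placeholders
   - This generic template is overwritten by the Llama 3 chat-format template written in the next step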

In [None]:
import json

template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}
with open("template.json", "w") as f:
    json.dump(template, f)

2. **Dataset Transformation**
   - Converts QA pairs to instruction format
   - Structures teacher responses as completions
   - Creates JSONL format required by JumpStart
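
To see how a template and a record fit together, the sketch below fills the Llama 3 chat template used in this notebook with one record via `str.format`. The actual substitution happens inside the JumpStart training container, so this is only an illustration of the idea; `render_example` is a hypothetical helper and the sample record is made up.

```python
template = {
    "prompt": (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "{instruction}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
    "completion": "{response}",
}

# Made-up record in the instruction/response shape written to train.jsonl
record = {
    "instruction": "Do antibiotics treat viral infections?",
    "response": "No. Antibiotics act on bacteria, not viruses.",
}


def render_example(template, record):
    """Fill every template field with the values from one record."""
    return {key: text.format(**record) for key, text in template.items()}


example = render_example(template, record)
print(example["prompt"] + example["completion"])
```

Every placeholder in the template must exist as a key in each record, which is why the template and the JSONL file are produced together.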

In [None]:
import json

def create_jumpstart_dataset(dataset, output_file='train.jsonl', template_file='template.json'):
    # Create the template file required by JumpStart for Q&A format
    template = {
        "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "completion": "{response}"
    }
    
    # Save the template file
    with open(template_file, 'w') as f:
        json.dump(template, f)

    # Process the dataset and create the training file
    with open(output_file, 'w') as outfile:
        for sample in dataset:
            # Format the data in the same structure as the synthetic data
            training_entry = {
                "instruction": sample["QUESTION"],
                "response": sample["TEACHER_ANSWER"].strip()
            }
            
            outfile.write(json.dumps(training_entry) + '\n')
            
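def verify_jsonl(filename, required_keys=("instruction", "response")):
    """Check that every line is valid JSON with the required keys; print the first entry."""
    with open(filename, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Error in line {i+1}: {e}")
                continue
            missing = [k for k in required_keys if not data.get(k)]
            if missing:
                print(f"Line {i+1} is missing fields: {missing}")
            if i == 0:  # Print first example
                print("Sample entry:")
                print(json.dumps(data, indent=2))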

In [None]:
# Create the dataset files for JumpStart fine-tuning
create_jumpstart_dataset(dataset)
verify_jsonl('train.jsonl')

#### Data Format
- **Input**: Question-answer pairs with teacher model responses
- **Output**:
  - `train.jsonl`, with one record per line holding the question text (`instruction`) and the teacher model response (`response`)
  - `template.json`, which wraps each record in the Llama 3 special tokens (`<|begin_of_text|>`, `<|eot_id|>`)

> **Important**: Follows [JumpStart Data Format Guidelines](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-fine-tuning-instruction-based.html).

#### Validation Process
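The `verify_jsonl()` function is a quick sanity check on `train.jsonl`:
- Confirms the file parses as JSONL
- Prints the first instruction/response pair for inspection

Special tokens are not checked here; they live in `template.json`, not in the training records.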

Example format:
```json
{
  "instruction": "What is the role of antibiotics in treating viral infections?",
  "response": "Antibiotics are not effective against viral infections..."
}
```

## 5. Student Model Configuration (LLAMA 3.2 1B)

This section covers the setup and configuration of the student model using Amazon SageMaker JumpStart:

### Model Selection Criteria
- Base model: Llama 3.2 1B Instruct
- Optimized for knowledge distillation
- Suitable for QA tasks
- Efficient inference characteristics

### Training Data Upload

The following code uploads the prepared training dataset and template to Amazon S3:

In [None]:
from sagemaker.s3 import S3Uploader
import sagemaker
import random

# Configure S3 paths with SageMaker defaults
default_bucket_prefix = sagemaker.Session().default_bucket_prefix
default_bucket_prefix_path = ""

# If a default bucket prefix is specified, append it to the s3 path
if default_bucket_prefix:
    default_bucket_prefix_path = f"/{default_bucket_prefix}"

# Upload training files to S3
local_data_file = "train.jsonl"
template_file="template.json"
train_data_location = f"s3://{bucket}{default_bucket_prefix_path}/oasst_top1"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload(template_file,train_data_location)
print(f"Training data: {train_data_location}")
print(f"template saved on:{train_data_location}")

### Student Model Selection in SageMaker JumpStart

This section implements an interactive model selection interface and configures training metrics:

#### Model Selection Process
- **Purpose**: Enables selection of appropriate student model from JumpStart's catalog
- **Focus**: Text generation models suitable for knowledge distillation
- **Default**: Llama 3.2 1B Instruct model

In [None]:

from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Create interactive model selector
try:
    dropdown = Dropdown(
        options=list_jumpstart_models("search_keywords includes Text Generation"),
        value="meta-textgeneration-llama-3-2-1b-instruct",
        description="Select a JumpStart text generation model:",
        style={"description_width": "initial"},
        layout={"width": "max-content"},
    )
    display(dropdown)
except Exception:
    # Fall back to the default model ID if the widget or model listing is unavailable
    dropdown = None

In [None]:
if dropdown:
    student_model_id = dropdown.value
else:
    # Fall back to the default student model when no dropdown is available
    student_model_id = "meta-textgeneration-llama-3-2-1b-instruct"
model_version_student = "*"

#### Metric Setup
- **Purpose**: Establishes standardized metrics for training evaluation
- **Implementation**: Leverages SageMaker's built-in metric definitions
- **Scope**: Covers training, validation, and system metrics

In [None]:
from sagemaker import metric_definitions
print(metric_definitions.retrieve_default(model_id="meta-textgeneration-llama-3-2-1b-instruct", model_version='1.1.1',))

In [None]:
metric_definitions.retrieve_default(model_id="meta-textgeneration-llama-3-2-1b-instruct", model_version='1.1.1',)

### Training Job Hyperparameter Configuration

This section retrieves and configures the default hyperparameters for the student model training:

#### Hyperparameter Setup
- **Purpose**: Initializes model training configuration
- **Source**: Uses JumpStart's optimized defaults
- **Scope**: Includes learning rates, batch sizes, and model-specific parameters

In [None]:
from sagemaker import hyperparameters

my_hyperparameters_student = hyperparameters.retrieve_default(
    model_id=student_model_id, model_version=model_version_student,
)

print(my_hyperparameters_student)

### Hyperparameter Customization
This section modifies default hyperparameters for knowledge distillation training:

#### Parameter Adjustments
- **Purpose**: Customizes training configuration for instruction-based learning
- **Key Modifications**:
  - Sets single epoch for initial testing
  - Configures for instruction tuning
  - Establishes fixed random seed for reproducibility
  - Defines maximum input length constraints

In [None]:
my_hyperparameters_student["epoch"] = "1"
my_hyperparameters_student['chat_dataset']="False"
my_hyperparameters_student['instruction_tuned']="True"
my_hyperparameters_student['seed']="10"  # fixed seed for reproducibility
my_hyperparameters_student['max_input_length']="1024"  # cap on input sequence length, in tokens


hyperparameters.validate(
    model_id=student_model_id, model_version=model_version_student, hyperparameters=my_hyperparameters_student
)

In [None]:
pprint.pprint(my_hyperparameters_student)

### Hyperparameter Tuning Configuration

This section configures and executes automated hyperparameter optimization using SageMaker's Hyperparameter Tuning Jobs:

#### Parameter Search Space Configuration
- **Purpose**: Defines ranges for key training parameters
- **Implementation**: Uses SageMaker's parameter types for optimization
- **Scope**: Covers learning dynamics and LoRA-specific parameters
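
To give a feel for what the LoRA parameters in the search space control, the sketch below computes two quantities from the standard LoRA formulation: the number of trainable parameters one rank-`r` adapter adds to a weight matrix, and the `lora_alpha / lora_r` factor that scales the low-rank update. The hidden size of 2048 is an assumed value used only for illustration.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in one rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update in the standard formulation."""
    return lora_alpha / lora_r


hidden = 2048  # assumed hidden size, for illustration only
for r in (4, 8, 12, 16):
    params = lora_trainable_params(hidden, hidden, r)
    print(f"r={r}: {params} params per adapted matrix, "
          f"scaling with alpha=32: {lora_scaling(32, r):.2f}")
```

Because the scaling depends on the ratio of the two values, tuning `lora_r` and `lora_alpha` independently also varies the effective strength of the update.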

In [None]:
from sagemaker.parameter import ContinuousParameter, CategoricalParameter,IntegerParameter

# Define the search space for each tunable hyperparameter
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.00001, 0.0005, scaling_type="Logarithmic"),
    'lora_r': CategoricalParameter(['4', '8', '12', '16']),
    'lora_alpha': CategoricalParameter(['16', '32', '48', '64']),
    'lora_dropout': ContinuousParameter(0.01, 0.2),
    'per_device_train_batch_size': CategoricalParameter(['2', '4', '6', '8']),
    'gradient_accumulation_steps': CategoricalParameter(['1', '2', '3', '4']),
    'max_steps': CategoricalParameter(['50', '75', '100']),
    'warmup_steps': CategoricalParameter(['5', '7', '10']),
    'num_train_epochs': CategoricalParameter(['1', '2'])

}
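
To make the search space above more concrete, the following sketch uses plain Python (no SageMaker calls) to show two things these ranges imply: how a logarithmic scale spreads learning-rate candidates across orders of magnitude, and how `per_device_train_batch_size` and `gradient_accumulation_steps` combine into an effective batch size on one device. The helper names are illustrative, and the random sampling is a simplified stand-in, not the Bayesian strategy the tuning job applies.

```python
import math
import random

LR_MIN, LR_MAX = 0.00001, 0.0005  # same bounds as the learning_rate range above


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def effective_batch_sizes(batch_sizes, accumulation_steps):
    """Distinct effective batch sizes per optimizer step on a single device."""
    return sorted({b * a for b in batch_sizes for a in accumulation_steps})


rng = random.Random(10)
lr_samples = [sample_log_uniform(LR_MIN, LR_MAX, rng) for _ in range(5)]
batch_options = effective_batch_sizes([2, 4, 6, 8], [1, 2, 3, 4])

print([f"{lr:.2e}" for lr in lr_samples])
print(batch_options)
```

Because several combinations produce the same effective batch size, the tuner may explore configurations that differ mainly in memory use rather than in optimization behavior.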


In [None]:
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter, CategoricalParameter

# Retrieve the default metric definitions published for the student model
metric_defs = metric_definitions.retrieve_default(
    model_id="meta-textgeneration-llama-3-2-1b", model_version="1.1.1"
)
print(metric_defs)


#### Enhanced Metric Tracking
- **Purpose**: Monitors both training and resource utilization metrics
- **Implementation**: Combines default and custom GPU memory metrics
- **Scope**: Enables comprehensive performance monitoring


In [None]:
memory_metrics = [
    {'Name': 'gpu:memory_allocated', 'Regex': 'Max CUDA memory allocated was ([0-9\\.]+) GB'},
    {'Name': 'gpu:memory_reserved', 'Regex': 'Max CUDA memory reserved was ([0-9\\.]+) GB'},
    {'Name': 'gpu:peak_active_memory', 'Regex': 'Peak active CUDA memory was ([0-9\\.]+) GB'},
    {'Name': 'train:loss', 'Regex': 'train_loss = ([0-9\\.]+)'}
]

In [None]:
combined_metrics = metric_defs + memory_metrics
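
SageMaker derives each metric by applying its regular expression to the training log and reading the first capture group. Before launching a tuning job, it can help to check the patterns locally. The sketch below does that with Python's `re` module; the log lines are hypothetical examples written to match the patterns, not output captured from a real training job, and `example_metrics` is a separate name so it does not overwrite the notebook's own lists.

```python
import re

# Two of the patterns defined above, copied for a local check
example_metrics = [
    {"Name": "gpu:memory_allocated", "Regex": "Max CUDA memory allocated was ([0-9\\.]+) GB"},
    {"Name": "train:loss", "Regex": "train_loss = ([0-9\\.]+)"},
]

# Hypothetical log lines (assumed format)
sample_logs = [
    "Max CUDA memory allocated was 12.5 GB",
    "train_loss = 1.2345",
    "an unrelated log line",
]


def extract_metrics(definitions, lines):
    """Return {metric name: last matched value} for the given log lines."""
    found = {}
    for definition in definitions:
        pattern = re.compile(definition["Regex"])
        for line in lines:
            match = pattern.search(line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


extracted = extract_metrics(example_metrics, sample_logs)
print(extracted)
```

A pattern that never matches produces no data points, so an empty result here is an early warning that the objective metric may be missing during tuning.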

### Tuning Job Configuration

This section configures the hyperparameter optimization job using SageMaker's tuning capabilities:

#### Configuration Components
- **Purpose**: Automates hyperparameter optimization for model training
- **Strategy**: Uses Bayesian optimization for efficient parameter search
- **Scale**: Manages multiple training jobs in parallel

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Create the estimator
estimator = JumpStartEstimator(
    model_id=student_model_id,
    model_version=model_version_student,
    hyperparameters=my_hyperparameters_student,
    role=role,
    disable_output_compression=True,
    instance_type='ml.g5.2xlarge',
    environment={"accept_eula": "true"},
    metric_definitions=combined_metrics,  # Default plus custom GPU memory metrics
    enable_sagemaker_metrics=True,  # Enable SageMaker metrics
)

In [None]:
# Create the hyperparameter tuner
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name='huggingface-textgeneration:train-loss',
    metric_definitions=combined_metrics,
    objective_type='Minimize',
    max_jobs=20,
    max_parallel_jobs=4,  # Adjust to the number of instances available in your quota
    hyperparameter_ranges=hyperparameter_ranges,
    strategy='Bayesian',
    base_tuning_job_name='llm-llama-3-2-1b',
)


In [None]:
# Start the hyperparameter tuning job and block until it completes
tuner.fit({"training": train_data_location}, wait=True)
# Redundant after wait=True, but harmless: confirms the tuning job has finished
tuner.wait()



> **Best Practices for Hyperparameter Tuning**
>
> 1. **Resource Management**
>    - Set `max_parallel_jobs` based on quota limits
>    - Choose appropriate instance types (`ml.g5.2xlarge`)
>    - Monitor GPU memory utilization
>    - Consider cost optimization with spot instances
>
> 2. **Job Configuration**
>    - Use descriptive `base_tuning_job_name`
>    - Enable SageMaker metrics for monitoring
>    - Set appropriate stopping conditions
>    - Configure proper objective metrics
>
> 3. **Optimization Strategy**
>    - Start with Bayesian optimization
>    - Define meaningful parameter ranges
>    - Balance exploration vs exploitation
>    - Monitor convergence patterns
>
> See [Hyperparameter Tuning Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-considerations.html)
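
As a hedged illustration of the cost and stopping-condition advice above, the sketch below collects managed spot training settings in a dictionary and checks one constraint: `max_wait` must not be smaller than `max_run`. The numeric values are illustrative only, and `check_spot_settings` is a hypothetical helper. The commented lines show where these settings would be passed, assuming the estimator and tuner are configured as in the cells above.

```python
# Illustrative values only; adjust them to your quota and budget.
spot_training_kwargs = {
    "use_spot_instances": True,  # request managed spot capacity
    "max_run": 3600,             # seconds of training time allowed per job
    "max_wait": 7200,            # seconds allowed including time spent waiting for capacity
}


def check_spot_settings(kwargs):
    """Raise if the spot settings are inconsistent; return True otherwise."""
    if kwargs.get("use_spot_instances") and kwargs["max_wait"] < kwargs["max_run"]:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return True


check_spot_settings(spot_training_kwargs)

# Possible usage (not executed here):
# estimator = JumpStartEstimator(..., **spot_training_kwargs)
# tuner = HyperparameterTuner(..., early_stopping_type="Auto")
```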

### Retrieving Best Training Results

This section explains how to access and analyze the best-performing model from the hyperparameter tuning job:

#### Accessing Best Model
- **Purpose**: Retrieves optimal hyperparameters and model artifacts
- **Implementation**: Uses SageMaker's tuning job APIs
- **Output**: Best performing model configuration and metrics

#### Process Overview
1. Get best training job name from tuner
2. Retrieve detailed job information using SageMaker client
3. Extract optimized hyperparameters
4. Access performance metrics

In [None]:
# Get the best training job
best_training_job = tuner.best_training_job()
print(f"Best training job: {best_training_job}")

In [None]:
# Create a SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Get the best hyperparameters using the SageMaker client
best_hyperparameters_student_1 = sagemaker_client.describe_training_job(TrainingJobName=best_training_job)['HyperParameters']
print("Best hyperparameters: \n")
pprint.pprint(best_hyperparameters_student_1)


In [None]:
# Get the best training job
best_training_job_1 = tuner.best_training_job()
print(f"Best training job: {best_training_job_1}")


# Get the best hyperparameters using the SageMaker client
best_hyperparameters_student_1 = sagemaker_client.describe_training_job(TrainingJobName=best_training_job_1)['HyperParameters']
print(f"Best hyperparameters: {best_hyperparameters_student_1}")

In [None]:
pprint.pprint(best_hyperparameters_student_1)

> **Best Practices**:
> 1. **Result Analysis**
>    - Review convergence patterns
>    - Compare against baseline metrics
>    - Document optimal parameters
>
> 2. **Model Management**
>    - Save best configuration
>    - Track experiment metadata
>    - Document performance characteristics
>
> For more information, see [Analyzing Hyperparameter Tuning Results](https://sagemaker-examples.readthedocs.io/en/latest/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.html)

### Training with Optimized Hyperparameters

This section configures and launches a training job using the best hyperparameters from tuning:

#### Configuration Components
1. **Hyperparameter Setup**
   - Uses optimized parameters from tuning
   - Extends training epochs for full model convergence
   - Configures training environment

In [None]:
pprint.pprint(best_hyperparameters_student_1)
best_hyperparameters_student_1['num_train_epochs']=10
best_hyperparameters_student_1['epoch']=10
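
Hyperparameters read back from a training job that was launched by a tuning job can contain bookkeeping entries in addition to the tuned values. The sketch below shows one way to drop such entries and apply overrides before reuse. It assumes that bookkeeping keys start with an underscore (for example `_tuning_objective_metric`); both the helper and the sample dictionary are hypothetical, so inspect the real dictionary printed above before relying on this.

```python
def clean_tuned_hyperparameters(raw, overrides=None):
    """Drop keys that start with '_' and apply overrides; all values become strings."""
    cleaned = {
        key: str(value)
        for key, value in raw.items()
        if not key.startswith("_")
    }
    for key, value in (overrides or {}).items():
        cleaned[key] = str(value)
    return cleaned


# Hypothetical example of a dictionary returned for a tuned training job
raw_example = {
    "_tuning_objective_metric": "huggingface-textgeneration:train-loss",
    "learning_rate": "0.0001",
    "epoch": "1",
}

final_hyperparameters = clean_tuned_hyperparameters(raw_example, {"epoch": 10})
print(final_hyperparameters)
```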

2. **TensorBoard Integration**
   - **Purpose**: Enables real-time training visualization
   - **Storage**: Configures S3 location for logs
   - **Access**: Enables Studio integration

In [None]:
from sagemaker.debugger import TensorBoardOutputConfig

# Create proper TensorBoard output configuration
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f's3://{bucket}/tensorboard-logs/llama3-model-distillation',
    container_local_output_path='/opt/ml/output/tensorboard'
)


3. **Metric Tracking Configuration**
   - **Training Metrics**: Loss, perplexity, epoch statistics
   - **Resource Metrics**: GPU/CPU memory utilization
   - **Performance Metrics**: Throughput and timing data
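
Loss and perplexity in the list above are closely related: perplexity is commonly defined as the exponential of the mean cross-entropy loss. The sketch below shows that relationship, assuming the training container follows this convention. The loss values are made up for illustration.

```python
import math


def perplexity_from_loss(mean_loss):
    """Perplexity as the exponential of the mean cross-entropy loss (natural log)."""
    return math.exp(mean_loss)


for loss in (2.0, 1.0, 0.5):
    print(f"loss={loss:.2f} -> perplexity={perplexity_from_loss(loss):.2f}")
```

Under this definition, small changes in loss translate into proportionally larger changes in perplexity, which is why both are worth tracking.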

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

student_model_id = "meta-textgeneration-llama-3-2-1b"
model_version_student = "*"

estimator_student = JumpStartEstimator(
    model_id=student_model_id,
    model_version=model_version_student,
    hyperparameters=best_hyperparameters_student_1,
    role=role,
    disable_output_compression=True,
    enable_sagemaker_metrics=True,
    environment={
        "accept_eula": "true",
        "TENSORBOARD_LOGGING": "true",
    },  # setting `accept_eula` to "true" accepts the model EULA
    tensorboard_output_config=tensorboard_output_config  # TensorBoard configuration created above
)
# Define metrics to track
# (named so it does not shadow the `metric_definitions` module used earlier)
student_metric_definitions = [
    # Training Metrics
    {'Name': 'train:loss', 'Regex': 'step .* is completed and loss is ([0-9\\.]+)'},
    {'Name': 'train:perplexity', 'Regex': 'train_perplexity=([0-9\\.]+)'},
    {'Name': 'train:epoch_loss', 'Regex': 'train_epoch_loss=([0-9\\.]+)'},

    # Evaluation Metrics
    {'Name': 'eval:loss', 'Regex': 'eval_epoch_loss=tensor\\(([0-9\\.]+)'},
    {'Name': 'eval:perplexity', 'Regex': 'eval_ppl=tensor\\(([0-9\\.]+)'},

    # Performance Metrics
    {'Name': 'epoch_time', 'Regex': 'epcoh time ([0-9\\.]+)'},
    {'Name': 'training_throughput', 'Regex': '([0-9\\.]+)it/s'},

    # Memory Usage
    {'Name': 'gpu:memory_allocated', 'Regex': 'Max CUDA memory allocated was ([0-9\\.]+) GB'},
    {'Name': 'gpu:memory_reserved', 'Regex': 'Max CUDA memory reserved was ([0-9\\.]+) GB'},
    {'Name': 'gpu:peak_active_memory', 'Regex': 'Peak active CUDA memory was ([0-9\\.]+) GB'},
    {'Name': 'cpu:peak_memory', 'Regex': 'CPU Total Peak Memory consumed during the train \\(max\\): ([0-9\\.]+) GB'}
]
# Add metrics to estimator
estimator_student.metric_definitions = student_metric_definitions
# Reference to the training job name for TensorBoard in SageMaker Studio
# (this dictionary is informational; the estimator does not read it)
tensorboard_callback = {
    'Config': {
        'TrainingJobName': 'llama-3-2-1b-model-distilation'
    }
}


4. **Training Launch**
   - **Implementation**: Uses JumpStart estimator
   - **Monitoring**: Enables comprehensive logging
   - **Visualization**: Integrates with TensorBoard

In [None]:
estimator_student.fit({"training": train_data_location},
    wait=True,
    logs="All")

> **Best Practices**:
> 1. **Training Monitoring**
>    - Track all defined metrics
>    - Monitor resource utilization
>    - Review TensorBoard visualizations
>
> 2. **Resource Management**
>    - Configure appropriate instance types
>    - Monitor memory usage
>    - Track training progress
>
> 3. **Output Management**
>    - Organize TensorBoard logs
>    - Maintain training artifacts
>    - Document training results

For more information, see:
- [SageMaker Training Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html)
- [TensorBoard Integration](https://docs.aws.amazon.com/sagemaker/latest/dg/tensorboard-on-sagemaker.html)

## 6. Model Evaluation and Deployment

This section covers the deployment and testing of the trained student model using Amazon Bedrock Custom Model Import (CMI):

### Custom Model Import Process
- **Purpose**: Deploys trained model to Bedrock for serverless inference
- **Implementation**: Automates model import and configuration
- **Benefits**: Enables seamless integration with AWS AI services

#### Import Configuration
1. **Model Preparation**
   - Retrieves training artifacts

In [None]:
# Get the training job name and model URI
training_job_name = estimator_student.latest_training_job.name
model_uri_1 = estimator_student.model_data['S3DataSource']['S3Uri']

2. **Deployment Process**
   - Configures import job parameters
   - Sets up unique model identifiers
   - Creates import job

In [None]:
REGION_NAME = 'us-west-2'
bedrock = boto3.client(service_name='bedrock',
                       region_name=REGION_NAME)
# Generate a unique job name and model name
timestamp = int(time.time())
random_number = random.randint(1000, 9999)
JOB_NAME = f"meta3-import-model-{timestamp}-{random_number}"

ROLE_ARN = bedrock_role_arn
IMPORTED_MODEL_NAME = f"llama3_1_student_1_llama_1b_{timestamp}-{random_number}"
S3_URI = model_uri_1

# createModelImportJob API
create_job_response = bedrock.create_model_import_job(
    jobName=JOB_NAME,
    importedModelName=IMPORTED_MODEL_NAME,
    roleArn=ROLE_ARN,
    modelDataSource={
        "s3DataSource": {
            "s3Uri": model_uri_1
        }
    },
)
job_arn = create_job_response.get("jobArn")
print(f"Model import job created with ARN: {job_arn}")

   - Monitors deployment status
   - Validates model availability

In [None]:
model_name_filter = IMPORTED_MODEL_NAME  # Name of the model imported above
model_info = wait_for_model_availability(model_name_filter, max_attempts=30, delay=60)

if model_info:
    model_arn_1=model_info["modelArn"]
    print("Model is now available in Bedrock.")
else:
    print("Failed to find the model in Bedrock within the specified attempts.")
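
As a complement to the availability check above, the import job itself can be polled until it reaches a terminal state. The sketch below is a generic polling loop that takes a status-returning callable, so it can be demonstrated offline with a fake status source. The function name, the injected `sleep` parameter, and the status strings (`InProgress`, `Completed`, `Failed`) are assumptions to verify against the Bedrock API reference.

```python
import time


def wait_for_import_job(get_status, max_attempts=30, delay=60, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state.

    Returns the final status, or None if max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        print(f"Attempt {attempt}: status = {status}")
        if status in ("Completed", "Failed"):
            return status
        sleep(delay)
    return None


# Offline demonstration with a fake status source (no AWS calls)
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_import_job(
    lambda: next(fake_statuses), delay=0, sleep=lambda _: None
)
print(final_status)
```

With the `bedrock` client and `job_arn` from the import cell, `get_status` could be `lambda: bedrock.get_model_import_job(jobIdentifier=job_arn)["status"]`.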

3. **Testing Configuration**
   - Sets up runtime client
   - Configures retry policies
   - Implements error handling

In [None]:
from botocore.config import Config
import json

REGION_NAME = 'us-west-2'
MODEL_ID= model_arn_1
#MODEL_ID='arn:aws:bedrock:us-west-2:786045444066:imported-model/d6ky0o73eq1l'

config = Config(
    retries={
        'total_max_attempts': 100, 
        'mode': 'standard'
    }
)
message = "Hello, what is the weather in Seattle?"


session = boto3.session.Session()
br_runtime = session.client(service_name = 'bedrock-runtime', 
                                 region_name=REGION_NAME, 
                                 config=config)
    
try:
    invoke_response = br_runtime.invoke_model(modelId=MODEL_ID, 
                                            body=json.dumps({'prompt': message}), 
                                            accept="application/json", 
                                            contentType="application/json")
    invoke_response["body"] = json.loads(invoke_response["body"].read().decode("utf-8"))
    print(json.dumps(invoke_response, indent=4))
except Exception as e:
    print(e)
    print(e.__repr__())
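
The test call above sends only a `prompt` field. The sketch below shows how generation settings might be added to the request body. The field names `max_gen_len`, `temperature`, and `top_p` follow the request format Bedrock documents for Meta Llama models; whether an imported model honors each of them is an assumption to verify, and `build_llama_request` is a hypothetical helper.

```python
import json


def build_llama_request(prompt, max_gen_len=256, temperature=0.2, top_p=0.9):
    """Serialize a prompt and generation settings into a JSON request body."""
    return json.dumps({
        "prompt": prompt,
        "max_gen_len": max_gen_len,  # assumed field name: maximum tokens to generate
        "temperature": temperature,  # assumed field name: sampling temperature
        "top_p": top_p,              # assumed field name: nucleus sampling cutoff
    })


body = build_llama_request("What is the capital of France?", max_gen_len=64)
print(body)
```

The resulting string could be passed as the `body` argument of `invoke_model` in place of `json.dumps({'prompt': message})`.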

### Best Practices for Bedrock Model Deployment with Custom Model Import

> **1. Import Configuration**
> - Use descriptive, unique model names with timestamps
> - Configure appropriate IAM roles and permissions
> - Implement robust error handling mechanisms
> - Set appropriate timeout values
> - Validate model artifacts before import

> **2. Deployment Monitoring**
> - Track import job status regularly
> - Implement automated status checks
> - Set up CloudWatch alerts
> - Monitor resource utilization
> - Track deployment metrics

> **3. Testing Strategy**
> - Implement comprehensive test cases
> - Validate model responses
> - Monitor inference latency
> - Track error rates and types
> - Test with various input formats

### Key Benefits of Bedrock Deployment

#### Operational Benefits
- **Serverless Infrastructure**
  - No server management required
  - Automatic scaling capabilities
  - Pay-per-use pricing model

- **Management Simplification**
  - Automated deployments
  - Built-in monitoring
  - Simplified updates

#### Technical Benefits
- **Performance**
  - Optimized inference
  - Low-latency responses
  - Automatic resource scaling

- **Integration**
  - Seamless AWS service connectivity
  - Built-in security features
  - Standardized APIs

#### Cost Benefits
- Usage-based pricing (see the pricing documentation under Additional Resources)
- No minimum commitments
- Resource-efficient scaling

#### Additional Resources
- [Bedrock Custom Model Import Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html)
- [Bedrock Custom Model Import Pricing Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/import-model-calculate-cost.html)
- [Model Monitoring Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/monitoring.html)

### Evaluation Environment Setup

#### Configuration Requirements
1. **FMBench YML template creation**




In [None]:
def create_experiment_template(cmi_arn):
    """Returns a template string for a single experiment"""
    return f'''  - name: {cmi_arn}
    # model_id is interpreted in conjunction with the deployment_script, so if you
    # use a JumpStart model id then set the deployment_script to jumpstart.py.
    # if deploying directly from HuggingFace this would be a HuggingFace model id
    # see the DJL serving deployment script in the code repo for reference.    
    model_id: {cmi_arn}
    model_version: 
    model_name: {cmi_arn}
    ep_name: {cmi_arn}
    instance_type: {cmi_arn}
    image_uri:
    deploy: no
    instance_count:
    deployment_script:
    inference_script: bedrock_predictor.py
    inference_spec:
      split_input_and_parameters: no
      parameter_set: bedrock
      stream: True
      start_token:
      stop_token: "<|eot_id|>"
    payload_files:
    - payload_en_1-500.jsonl
    - payload_en_500-1000.jsonl
    - payload_en_1000-2000.jsonl
    - payload_en_2000-3000.jsonl
    - payload_en_3000-3840.jsonl
    concurrency_levels:
    - 1
    env:'''

def create_config_file(template_file, output_file, cmi_arn_list):
    # Read the template file
    with open(template_file, 'r') as file:
        content = file.read()
    
    # Find the experiments section
    start_marker = "experiments:"
    start_idx = content.find(start_marker)
    
    if start_idx == -1:
        raise ValueError("Could not find experiments section in template file")
    
    # Split content at experiments section
    header = content[:start_idx + len(start_marker)]
    
    # Create experiments content
    experiments = "\n"  # Start with newline after "experiments:"
    
    # Add default Llama experiment
    llama_experiment = create_experiment_template("us.meta.llama3-2-90b-instruct-v1:0")
    experiments += llama_experiment + "\n"
    
    # Add experiments for each CMI ARN
    for cmi_arn in cmi_arn_list:
        experiments += create_experiment_template(cmi_arn) + "\n"
    
    # Add the report section
    report_section = '''
report:
  latency_budget: 2
  cost_per_10k_txn_budget: 100
  error_rate_budget: 0
  per_inference_request_file: per_inference_request_results.csv
  all_metrics_file: all_metrics.csv
  txn_count_for_showing_cost: 10000
  v_shift_w_single_instance: 0.025
  v_shift_w_gt_one_instance: 0.025'''
    
    # Combine all parts
    final_content = header + experiments + report_section
    
    # Write to output file
    with open(output_file, 'w') as file:
        file.write(final_content)
    
    print(f"Created new config file: {output_file}")

# Example usage
template_file = 'config-bedrock-llama3-template.yml'
output_file = 'config-bedrock-llama3-1-8b_cmi_distilation_comp.yml'

# List of CMI ARNs to create experiments for
cmi_arn_list = [
    model_arn_1,
    
    # Add more ARNs as needed
]

create_config_file(template_file, output_file, cmi_arn_list)

Define the S3 location for reports:

In [None]:
s3_reports = f's3://{bucket}/model-evaluation'
print(f'Use the following S3 location for performance evaluations:\n{s3_reports}')


2. **SageMaker Studio Environment**
   - Use SageMaker Studio Code Editor
   - Minimum instance: `ml.t3.xlarge`
   - Storage: 50GB minimum
   - Required IAM role trust policy statement:
     ```json
     {
         "Effect": "Allow",
         "Principal": {
             "Service": "sagemaker.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
     }
     ```

3. **FMBench Environment Setup**
   ```bash
   # Create and activate conda environment
   conda create --name fmbench_python311 -y python=3.11 ipykernel
   source activate fmbench_python311
   
   # Install FMBench
   pip install -U fmbench
   ```

#### Benchmark Setup

1. **Directory Configuration**
   ```bash
   # Set working directory
   mkdir fmbench 
   export EVAL_DIR="tmp"
   mkdir -p $EVAL_DIR

   # Download FMBench dependencies
   curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- "$EVAL_DIR"
   ```

2. **Evaluation Execution**
   ```bash
   # Run evaluation
   fmbench --config-file $EVAL_DIR/fmbench-read/configs/bedrock/config-bedrock-llama3-1-8b_cmi_distilation_comp.yml \
       --local-mode yes \
       --write-bucket s3://{your-bucket}/model-evaluation \
       --tmp-dir $EVAL_DIR > $EVAL_DIR/fmbench.log 2>&1
   ```


3. **Monitor Progress**
   ```bash
   # View live logs
   tail -f $EVAL_DIR/fmbench.log
   ```

**Results Collection**
- Evaluation metrics stored in: `$EVAL_DIR/fmbench-write/`
- Results automatically uploaded to: `s3://{your-bucket}/model-evaluation/`
- Report artifacts include:
    - Performance metrics (CSV)
    - Visualization plots (PNG)
    - Interactive dashboards (HTML)
    - Raw evaluation data (JSON)

> **Note**: Replace `{your-bucket}` with your S3 bucket name. Ensure the IAM role has appropriate S3 permissions.

**Accessing Results**

In [None]:
# Example code to load evaluation results (to be implemented)
import boto3
s3 = boto3.client('s3')

def load_evaluation_results(bucket, prefix):
    # Load evaluation results from S3
    pass
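
The stub above is left unimplemented in this notebook. As a starting point, the sketch below lists result files under a prefix using the S3 `list_objects_v2` paginator. The client is passed in as a parameter so the call pattern can be demonstrated offline with a fake client; the fake classes, the bucket name, and the file names are hypothetical, and the layout FMBench actually writes should be checked in your own bucket.

```python
def list_result_keys(s3_client, bucket, prefix, suffixes=(".csv", ".json")):
    """Return object keys under prefix whose names end with one of suffixes."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects have no "Contents" entry
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffixes):
                keys.append(obj["Key"])
    return keys


# Offline stand-ins for boto3.client("s3"), used only to show the call pattern
class FakePaginator:
    def paginate(self, Bucket, Prefix):
        yield {"Contents": [{"Key": f"{Prefix}/all_metrics.csv"},
                            {"Key": f"{Prefix}/plot.png"}]}
        yield {}  # a page with no objects


class FakeS3Client:
    def get_paginator(self, operation_name):
        return FakePaginator()


result_keys = list_result_keys(FakeS3Client(), "example-bucket", "model-evaluation")
print(result_keys)
```

With real credentials, the same function could be called as `list_result_keys(boto3.client("s3"), bucket, "model-evaluation")`.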


### Comparative Testing (Teacher vs Student)
This section presents the results from FMBench evaluation comparing the teacher model (Llama 70B) and student model (Llama 1B).

#### Evaluation Metrics
| Model | Judge Accuracy (Cohere) | Judge Accuracy (Claude) | Judge Accuracy (Llama) | Majority Voting |
|-------|------------------------|------------------------|---------------------|-----------------|
| Teacher (70B) | 96.52% | 92.75% | 91.02% | 93.02% |
| Student (1B) | [Pending] | [Pending] | [Pending] | [Pending] |

> **Note**: Model evaluations performed by 3 LLM judges using ground truth comparison
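
The "Majority Voting" column combines the three judges' verdicts into one decision per question. How FMBench aggregates verdicts internally is not shown in this notebook, so the sketch below is only a minimal illustration of the idea, using hypothetical verdict labels and data.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the most common verdict among an odd number of judges."""
    return Counter(verdicts).most_common(1)[0][0]


def majority_accuracy(per_question_verdicts):
    """Fraction of questions the majority of judges marked as correct."""
    correct = sum(
        1 for verdicts in per_question_verdicts
        if majority_verdict(verdicts) == "correct"
    )
    return correct / len(per_question_verdicts)


# Hypothetical verdicts from three judges on four questions
example_verdicts = [
    ("correct", "correct", "incorrect"),
    ("correct", "correct", "correct"),
    ("incorrect", "incorrect", "correct"),
    ("correct", "incorrect", "correct"),
]
print(f"Majority-vote accuracy: {majority_accuracy(example_verdicts):.2%}")
```

With three judges and two possible verdicts, a tie cannot occur, which is one reason an odd number of judges is convenient.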


#### Testing Methodology
- Dataset: Multiple QA datasets from LongBench
- Prompt lengths: 500-3840 tokens
- Concurrency levels: 1-4
- Evaluation criteria: Accuracy, latency, cost

[Display accuracy_trajectory_per_payload.png]
*Figure 1: Accuracy across different prompt lengths*

#### Performance Comparison

**Latency Metrics**
| Model | p50 Latency | p95 Latency | p99 Latency | Transactions/min |
|-------|-------------|-------------|-------------|------------------|
| Teacher (70B) | 5.27s | 5.27s | 5.27s | 2 |
| Student (1B) | [Pending] | [Pending] | [Pending] | [Pending] |

[Display tokens_vs_latency.png]
*Figure 2: Token processing latency comparison*




### Performance Metrics Analysis

#### Latency Measurements
- Time to First Token (TTFT)
- Time Per Output Token (TPOT)
- Overall response latency

[Display concurrency_vs_inference_latency.png]
*Figure 3: Concurrency vs Inference Latency*
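
The three latency measurements listed above can be derived from per-request timestamps. The sketch below uses one common definition, in which TPOT averages the time between output tokens after the first one; FMBench may compute these values differently, and the timestamps are hypothetical.

```python
def latency_metrics(request_start, first_token_time, last_token_time, output_tokens):
    """Compute TTFT, TPOT, and total latency, all in seconds."""
    ttft = first_token_time - request_start
    total = last_token_time - request_start
    # TPOT: average gap between tokens generated after the first one
    tpot = (total - ttft) / (output_tokens - 1) if output_tokens > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "total": total}


# Hypothetical timestamps in seconds for a response with 41 output tokens
example_latency = latency_metrics(0.0, 0.5, 2.5, output_tokens=41)
print(example_latency)
```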

#### Throughput Analysis
| Model | Prompt Token Throughput | Completion Token Throughput |
|-------|------------------------|---------------------------|
| Teacher (70B) | 203 tokens/s | 2 tokens/s |
| Student (1B) | [Pending] | [Pending] |

#### Cost Comparison
| Model | Price per Transaction | Price per Token | Cost per 10k Transactions |
|-------|---------------------|----------------|------------------------|
| Teacher (70B) | $0.002875 | $0.00000072 | $28.75 |
| Student (1B) | [Pending] | [Pending] | [Pending] |

[Display business_summary.png]
*Figure 4: Price Performance Comparison*
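
The columns of the cost table are related by simple arithmetic, which the sketch below reproduces for the teacher row. It assumes the per-token price is a blended rate across prompt and completion tokens, so the implied token count per transaction is only a rough figure.

```python
price_per_transaction = 0.002875  # USD, teacher row of the cost table
price_per_token = 0.00000072      # USD, teacher row of the cost table

# Cost per 10k transactions is the per-transaction price scaled by 10,000
cost_per_10k = price_per_transaction * 10_000

# Rough number of tokens per transaction implied by the two prices
implied_tokens_per_transaction = price_per_transaction / price_per_token

print(f"Cost per 10k transactions: ${cost_per_10k:.2f}")
print(f"Implied tokens per transaction: {implied_tokens_per_transaction:.0f}")
```

The same calculation can be applied to the student row once its measurements are available.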

#### Quality Metrics
Error rates and model accuracy across different prompt lengths:

[Display error_rates.png]
*Figure 5: Error Rates by Model and Concurrency*

> **Note**: Full interactive versions of these visualizations are available in the evaluation report at `s3://{bucket}/model-evaluation/`


For more information, see the [Foundation Model Benchmarking Tool (FMBench) repository](https://github.com/aws-samples/foundation-model-benchmarking-tool).

### Model evaluation

## 7. Production Deployment (JumpStart or Bedrock)
This section compares deploying the student model on SageMaker JumpStart endpoints with deploying it through Bedrock Custom Model Import.
### Endpoint Configuration

#### SageMaker JumpStart Endpoints
- Provides complete infrastructure control through endpoint configurations
- Supports custom containers and model serving code
- Enables A/B testing through production variants
- Requires endpoint management and maintenance

#### Bedrock Custom Model Import
- Offers serverless deployment with minimal configuration
- Streamlines deployment through model import workflow
- Integrates automatically with AWS AI services
- Manages infrastructure automatically

### Scaling and Cost Management 

#### SageMaker JumpStart
- Instance-based pricing with reserved capacity
- Auto-scaling based on custom metrics
- Granular control over instance types and counts
- Best for consistent, high-throughput workloads

#### Bedrock Custom Model Import
- Usage-based pricing rather than per-instance billing
- Built-in automatic scaling
- No minimum commitment required
- Optimal for variable workload patterns
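
One way to reason about the trade-off above is a break-even estimate: the request volume at which a fixed hourly instance cost equals a usage-based cost. The sketch below treats the usage-based option as a flat cost per invocation, which simplifies how Bedrock actually bills imported models, and both prices are hypothetical placeholders rather than AWS rates.

```python
def break_even_invocations_per_hour(instance_cost_per_hour, cost_per_invocation):
    """Invocations per hour at which both pricing models cost the same."""
    return instance_cost_per_hour / cost_per_invocation


# Hypothetical prices in USD, for illustration only
instance_cost_per_hour = 1.50
cost_per_invocation = 0.0005

threshold = break_even_invocations_per_hour(instance_cost_per_hour, cost_per_invocation)
print(f"Break-even at about {threshold:.0f} invocations per hour")
```

Below the threshold the usage-based option is cheaper in this simplified model; above it, a dedicated endpoint costs less per request.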

### Monitoring Setup 

#### SageMaker JumpStart
- CloudWatch integration for custom metrics
- Model monitoring for drift detection
- Detailed logging and debugging capabilities
- Advanced endpoint metrics and alarms

#### Bedrock Custom Model Import
- Simplified monitoring through AWS Console
- Built-in performance metrics
- Automated operational monitoring
- Streamlined logging integration

## 8. Cleanup and Best Practices

### Resource Termination

1. Delete the Bedrock Custom Model
First, let's remove the custom model from Amazon Bedrock:


In [None]:
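import botocore.exceptions

def delete_bedrock_custom_model(model_name):
    # Use the same region the model was imported into
    bedrock_client = boto3.client('bedrock', region_name=REGION_NAME)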
    try:
        bedrock_client.delete_imported_model(modelIdentifier=model_name)
        print(f"Successfully deleted Bedrock custom model: {model_name}")
    except botocore.exceptions.ClientError as error:
        error_code = error.response['Error']['Code']
        if error_code == 'ValidationException':
            print(f"Error deleting Bedrock custom model: The provided model name is invalid. Model Name: {model_name}")
        elif error_code == 'ResourceNotFoundException':
            print(f"Error: The model '{model_name}' was not found in Bedrock.")
        elif error_code == 'AccessDeniedException':
            print("Error: You do not have permission to delete this model.")
        elif error_code == 'ConflictException':
            print("Error: The model is currently in use or in a state that doesn't allow deletion.")
        else:
            print(f"Error deleting Bedrock custom model: {error}")

# Replace with your actual model name (for example, the value of IMPORTED_MODEL_NAME from the import step)
MODEL_NAME = "llama3-qa-model"
delete_bedrock_custom_model(MODEL_NAME)

2. Delete IAM Roles
Now, let's remove the IAM roles we created specifically for this project:

In [None]:
def delete_iam_role(role_name):
    iam = boto3.client('iam')
    try:
        # Delete inline policies
        inline_policies = iam.list_role_policies(RoleName=role_name)['PolicyNames']
        for policy in inline_policies:
            iam.delete_role_policy(RoleName=role_name, PolicyName=policy)
            
        # Detach managed policies
        attached_policies = iam.list_attached_role_policies(RoleName=role_name)['AttachedPolicies']
        for policy in attached_policies:
            iam.detach_role_policy(RoleName=role_name, PolicyArn=policy['PolicyArn'])
            
        # Delete permissions boundary if it exists
        try:
            iam.delete_role_permissions_boundary(RoleName=role_name)
        except iam.exceptions.NoSuchEntityException:
            pass
        
        # Finally delete the role
        iam.delete_role(RoleName=role_name)
        print(f"Successfully deleted IAM role: {role_name}")
    except botocore.exceptions.ClientError as error:
        print(f"Error deleting IAM role: {error}")

# Delete LambdaBedrockExecutionRole
delete_iam_role("LambdaBedrockExecutionRole")

# Delete Sagemaker_Bedrock_import_role
delete_iam_role("Sagemaker_Bedrock_import_role")

### Best Practices
#### Cost Optimization Tips
1. Training Optimization
    - Use spot instances for training when possible
    - Implement early stopping in training jobs
    - Clean up training artifacts promptly
    - Monitor training metrics to avoid unnecessary epochs
2. Inference Optimization
    - Choose between Bedrock and SageMaker based on workload patterns
    - Use auto-scaling for SageMaker endpoints
    - Consider batch processing for large-scale inference
    - Monitor and adjust instance sizes based on utilization
3. Storage Management
    - Implement S3 lifecycle policies for training artifacts
    - Clean up temporary datasets after training
    - Use appropriate storage classes for different data types
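
As an illustration of the lifecycle-policy tip above, the sketch below builds a lifecycle configuration that expires objects under a prefix after a number of days. The helper name, the prefix, and the 30-day value are illustrative assumptions; the dictionary follows the structure accepted by the S3 `put_bucket_lifecycle_configuration` call, shown here as a comment rather than executed.

```python
def build_lifecycle_configuration(prefix, expire_after_days):
    """Build a lifecycle configuration that expires objects under prefix."""
    rule_id = f"expire-{prefix.strip('/').replace('/', '-')}"
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


lifecycle = build_lifecycle_configuration("tensorboard-logs/", 30)
print(lifecycle)

# Possible usage (not executed here):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket=bucket, LifecycleConfiguration=lifecycle
# )
```
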
#### JumpStart Best Practices
1. Model Selection
2. Training Configuration
3. Data Management
4. Knowledge Distillation Specific
5. Production Deployment
#### Security Best Practices
1. Access Control
2. Monitoring and Compliance

For more information, see:

- SageMaker Best Practices[link]

- Bedrock Security[link]

- AWS Machine Learning Security[link]




## 9. Conclusion and Next Steps

### Summary of Results 
### Lessons Learned 
### Future Improvements 