# Response-Based Knowledge Distillation with QA Specialization

## Using Amazon SageMaker JumpStart for LLM distillation (70B → 3B)

This notebook demonstrates how to perform knowledge distillation from a large language model (70B parameters) to a smaller model (3B parameters) using Amazon SageMaker JumpStart and Amazon Bedrock.

## 1. Introduction

### Overview of Knowledge Distillation
Knowledge distillation is a model compression technique where a smaller model (student) learns to mimic the behavior of a larger model (teacher). This process helps reduce computational requirements while maintaining acceptable performance levels. 
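
In the response-based variant used in this notebook, the student is fine-tuned on text generated by the teacher rather than on the teacher's internal logits. A minimal sketch of that data flow follows; `build_distillation_pairs` and `toy_teacher` are hypothetical names introduced for illustration, and the stub teacher stands in for a real call to the 70B model.

```python
from typing import Callable, Dict, List


def build_distillation_pairs(
    questions: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (prompt, teacher response) pairs for response-based distillation."""
    pairs = []
    for question in questions:
        # The teacher's generated text becomes the student's training target.
        pairs.append({"prompt": question, "target": teacher(question)})
    return pairs


def toy_teacher(prompt: str) -> str:
    # Hypothetical stand-in for the teacher model; a real run would call Bedrock.
    return f"Answer to: {prompt}"


pairs = build_distillation_pairs(["Is X associated with Y?"], toy_teacher)
print(pairs[0]["target"])
```

The rest of the notebook follows the same shape: generate teacher answers in bulk, pair them with the original questions, and fine-tune the student on the pairs.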
### SageMaker JumpStart Benefits 
Amazon SageMaker JumpStart provides:

- Pre-trained models optimized for AWS infrastructure
- Simplified model deployment and fine-tuning workflows
- Integration with other AWS services
- Built-in security and compliance features

### Project Goals and Objectives 

This project aims to:

- Create a smaller, more efficient model for QA tasks
- Maintain high accuracy on domain-specific questions
- Reduce inference costs and latency with Custom Model Import in Bedrock
- Demonstrate AWS best practices for model optimization

In [None]:
%pip install --quiet --upgrade sagemaker jmespath datasets transformers jinja2 ipywidgets

In [None]:
!rm -Rf ~/.cache/pip/*
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall

## 2. Environment Setup

### AWS Account Configuration 

This section configures the necessary AWS resources including:

- SageMaker session and default bucket
- IAM roles and permissions
- Region-specific settings
- Required SDK versions and dependencies

In [None]:
import sagemaker
import boto3
import botocore
sess = sagemaker.Session()
import pprint
# SageMaker session bucket: used for uploading data, models, and logs.
# SageMaker creates the default bucket automatically if it does not exist.
sagemaker_session_bucket = None
# Use the given bucket name, or fall back to the session's default bucket
bucket = sagemaker_session_bucket or sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    # Change the role name if you are running locally
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-your-role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=bucket)
region=sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")


In [None]:
prefix = "llama-qa-distillation"

# Print AWS configuration
print(f"SageMaker Session: {sess}")
print(f"Role: {role}")
print(f"Region: {region}")
print(f"Bucket: {bucket}")

### Role configuration

Configures IAM roles with:

- Bedrock-specific permissions
- S3 access for model artifacts
- CloudWatch logging capabilities
- Required cross-service permissions

In [None]:
import sys
# Custom modules
import importlib.util
spec = importlib.util.spec_from_file_location("iam_role_helper", "iam_role_helper.py")
iam_role_helper = importlib.util.module_from_spec(spec)
# Register the module under the same name used by the import below
sys.modules["iam_role_helper"] = iam_role_helper
spec.loader.exec_module(iam_role_helper)

spec = importlib.util.spec_from_file_location("utils", "utils.py")
utils = importlib.util.module_from_spec(spec)
sys.modules["utils"] = utils
spec.loader.exec_module(utils)

from utils import download_artifacts, remove_field_from_json, upload_artifacts, cleanup_local_files, wait_for_model_availability, test_image_processing
from iam_role_helper import create_or_update_role

In [None]:
# Set up variables
account_id = boto3.client('sts').get_caller_identity()['Account']
region = "us-east-1"
# Use the resolved bucket; sagemaker_session_bucket is None when the default bucket is used
training_bucket = bucket
role_name = "Sagemaker_Bedrock_import_role"

# Define policies
trust_relationship = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnEquals": {"aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-import-job/*"}
            }
        },
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

permission_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{training_bucket}", f"arn:aws:s3:::{training_bucket}/*"],
            "Condition": {"StringEquals": {"aws:ResourceAccount": account_id}}
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}

# Create or update the role
bedrock_role_arn = create_or_update_role(
    role_name=role_name,
    trust_relationship=trust_relationship,
    permission_policy=permission_policy
)

print(f"Role ARN: {bedrock_role_arn}")

## 3. Teacher Model (LLaMA 3.3 70B)
### Selecting Model in Bedrock 
This section covers:

- Available foundation models in Amazon Bedrock
- Model selection criteria for the teacher model
- Configuration of the Llama 70B model
- Required permissions and quotas


In [None]:
import logging
import json
import boto3
import pandas as pd
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def list_foundation_models(bedrock_client):
    """
    Gets a list of available Amazon Bedrock foundation models.

    :return: The list of available bedrock foundation models.
    """
    try:
        response = bedrock_client.list_foundation_models()
        models = response["modelSummaries"]
        logger.info("Got %s foundation models.", len(models))
        return models

    except ClientError:
        logger.error("Couldn't list foundation models.")
        raise

def create_models_dataframe(models):
    """
    Creates a pandas DataFrame with relevant model information.
    
    :param models: List of model summaries from Bedrock
    :return: pandas DataFrame with model information
    """
    model_data = []
    
    for model in models:
        model_info = {
            'Model Name': model['modelName'],
            'Provider': model['providerName'],
            'Model ID': model['modelId'],
            'Input Modalities': ', '.join(model['inputModalities']),
            'Output Modalities': ', '.join(model['outputModalities']),
            'Customizations Supported': ', '.join(model['customizationsSupported']) if 'customizationsSupported' in model else 'None',
            'Inference Types': ', '.join(model['inferenceTypesSupported'])
        }
        model_data.append(model_info)
    
    df = pd.DataFrame(model_data)
    return df

In [None]:
bedrock_client = boto3.client(service_name="bedrock",region_name="us-east-1")
fm_models = list_foundation_models(bedrock_client)

# Create DataFrame
models_df = create_models_dataframe(fm_models)

# Display the DataFrame
print("\nAmazon Bedrock Foundation Models:")
print(models_df.to_string(index=False))

# Optionally, you can also save to CSV
# models_df.to_csv('bedrock_models.csv', index=False)

logger.info("Done.")

In [None]:

models_df['Model ID'].to_list()

In [None]:
models_df[(models_df['Provider']=='Meta') & (models_df['Inference Types']=='INFERENCE_PROFILE')]['Model ID']
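
The equality filter above matches only rows whose joined `Inference Types` string is exactly `INFERENCE_PROFILE`, so a model that lists several inference types would be missed. A small sketch of a membership-based filter over the raw model summaries is shown below; `models_supporting` and the sample summaries are hypothetical and only mimic the fields used in `create_models_dataframe`.

```python
from typing import Dict, List


def models_supporting(summaries: List[Dict], provider: str, inference_type: str) -> List[str]:
    """Return model IDs from `provider` whose supported inference types include `inference_type`."""
    return [
        m["modelId"]
        for m in summaries
        if m.get("providerName") == provider
        and inference_type in m.get("inferenceTypesSupported", [])
    ]


# Hypothetical summaries shaped like the entries in response["modelSummaries"].
sample_summaries = [
    {"modelId": "example.model-a", "providerName": "Meta", "inferenceTypesSupported": ["INFERENCE_PROFILE"]},
    {"modelId": "example.model-b", "providerName": "Meta", "inferenceTypesSupported": ["ON_DEMAND", "INFERENCE_PROFILE"]},
    {"modelId": "example.model-c", "providerName": "Meta", "inferenceTypesSupported": ["ON_DEMAND"]},
    {"modelId": "example.model-d", "providerName": "Other", "inferenceTypesSupported": ["INFERENCE_PROFILE"]},
]
profile_models = models_supporting(sample_summaries, "Meta", "INFERENCE_PROFILE")
print(profile_models)
```

With the real data, the call would be `models_supporting(fm_models, "Meta", "INFERENCE_PROFILE")`.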

### Bedrock client setup

To use `INFERENCE_PROFILE` models, you need to create an inference profile; `ON_DEMAND` models do not require one.

> **Note**: During testing, Llama 3.3 70B worked only in us-east-1 (us-west-2 failed). It is unclear whether this is a service limitation.

In [None]:
import boto3
# Only us-east-1 worked for Llama 3.3 70B during testing; us-west-2 failed
bedrock_client = boto3.client('bedrock', region_name="us-east-1")

In [None]:
# Llama 3.3 70B, matching the 'model-id' tag below and the model ID used for inference later
model_id='meta.llama3-3-70b-instruct-v1:0'
inference_profile_name='llama3-70b-inference'
inf_profile_response = bedrock_client.create_inference_profile(
    inferenceProfileName=inference_profile_name,
    description='Teacher model used for synthetic data generation in a Llama distillation project',
    modelSource={
        'copyFrom': f'arn:aws:bedrock:us-east-1::foundation-model/{model_id}'
    },
    tags=[
        {
        'key': 'project',
            'value': 'Llama-model-distillation'
        },
        {
        'key': 'model-id',
            'value': 'meta.llama3-3-70b-instruct'
        },
    ]
)

In [None]:
print(f"Inference profile created successfully. ARN: {inf_profile_response['inferenceProfileArn']}")
model_arn=inf_profile_response['inferenceProfileArn']

In [None]:
inf_profile_response

### Testing Inference 

In [None]:
brt = boto3.client(service_name='bedrock-runtime',region_name='us-east-1')
def invoke_model(body, model_id, accept, content_type):
    try:
        response = brt.invoke_model(
            body=json.dumps(body), 
            modelId=model_id, 
            
            accept=accept, 
            contentType=content_type
        )

        return response

    except Exception as e:
        print(f"Couldn't invoke {model_id}")
        raise e

In [None]:
# If you'd like to try your own prompt, edit this parameter!
prompt_data = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Write me a blog about making strong business decisions as a leader. [/INST]"""

body = {
    "prompt": prompt_data,
    "temperature": 0.5,
    "top_p": 0.9,
    "max_gen_len": 512,
}

modelId = "us.meta.llama3-3-70b-instruct-v1:0"
accept = "application/json"
contentType = "application/json"

response = invoke_model(body, modelId, accept, contentType)
response_body = json.loads(response.get("body").read())

print(response_body["generation"])
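
The test prompt above uses the Llama 2 style `[INST]` and `<<SYS>>` markers. Llama 3 instruct models are documented by Meta with a different template built from header tokens, so a prompt in that format may be a better fit for the Llama 3.3 teacher. The helper below is a sketch under that assumption; `format_llama3_prompt` is a hypothetical name, not part of any SDK.

```python
def format_llama3_prompt(system: str, user: str) -> str:
    """Build a single-turn prompt using the Llama 3 instruct header tokens."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # Ending on the assistant header asks the model to generate the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


example_prompt = format_llama3_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Write me a blog about making strong business decisions as a leader.",
)
print(example_prompt)
```

The result can be passed as the `"prompt"` value of the request body in place of `prompt_data`.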

## 4. Dataset Generation

### Corpus Preparation (Prepare QA + Context Dataset)
Details about:

- Dataset selection and preprocessing
- PubMedQA dataset structure and characteristics
- Data formatting for knowledge distillation
- Quality checks and validation steps


Pre-approved dataset details:

https://github.com/pubmedqa/pubmedqa/tree/master

> **PubMedQA: A Dataset for Biomedical Research Question Answering**
>
> Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pp. 2567-2577.

In [None]:
import requests

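def get_github_json(url):
    """Download a JSON file from GitHub, returning None if the request fails."""
    try:
        # Convert regular GitHub URL to raw content URL
        raw_url = url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
        response = requests.get(raw_url, timeout=60)
        response.raise_for_status()  # Surface HTTP errors instead of parsing an error page
        return response.json()
    except (requests.RequestException, ValueError) as e:
        print(f"Error: {e}")
        return None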

# Example usage:
url = "https://github.com/pubmedqa/pubmedqa/blob/master/data/ori_pqal.json"
data = get_github_json(url)

In [None]:
qa_index=list(data.keys())

In [None]:
print(len(qa_index))

In [None]:
data[qa_index[0]]

In [None]:
dataset=[]
for i in qa_index:
    keys_to_get = ['QUESTION', 'CONTEXTS','LONG_ANSWER']
    result = {k: data[i].get(k) for k in keys_to_get}
    dataset.append(result)

In [None]:
dataset[0]

### Using Teacher Model for QA Generation (Create More Data Based on the Questions and Context)

Explains the process of:

- Generating synthetic QA pairs
- Batch processing with Bedrock
- Data augmentation strategies
- Quality control measures

#### Batch Processing vs Real-Time Inference

Based on performance testing and cost analysis, Amazon Bedrock's batch processing capabilities offer significant advantages over real-time inference:

1. **Performance Benefits**
   - Higher throughput for large-scale processing
   - Reduced risk of API throttling
   - More efficient resource utilization

2. **Cost Optimization**
   - Lower per-request costs compared to real-time inference
   - Better resource allocation and scheduling
   - Reduced overhead from connection management

3. **Operational Advantages**
   - Built-in retry mechanisms
   - Simplified monitoring and logging
   - Better handling of large datasets

For this implementation, we leverage Bedrock's batch processing to optimize both performance and cost efficiency while maintaining processing quality.
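
For comparison, a real-time loop has to handle throttling itself. The sketch below shows one common pattern, exponential backoff around a single call. `ThrottledError` and `call_with_backoff` are hypothetical names for illustration; a real client would catch the SDK's own throttling exception instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from a real-time API."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on ThrottledError, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_call() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("slow down")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the example fast to run; in a notebook the default `time.sleep` would be used.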

In [None]:
import json
from datetime import datetime
import uuid

def create_bedrock_batch_dataset(dataset, output_file='bedrock_batch_dataset.jsonl'):
    system_message = """You are a specialized biomedical research assistant trained to analyze and answer questions about medical and scientific literature. Your role is to:
        Extract and interpret key information from biomedical research papers, clinical studies, and medical literature
        Provide accurate, evidence-based responses based solely on the provided research context
        Focus on specific medical findings, methodologies, and clinical outcomes
        Present complex medical information in clear, understandable terms
        Maintain precision when discussing medical terminology, study results, and statistical data
        Distinguish between preliminary findings and established conclusions
        Reference specific sections of the provided research when answering questions
        Acknowledge limitations in studies when relevant
        Avoid making medical recommendations or providing diagnosis When responding, only use information explicitly stated in the provided biomedical context."""

    prompt_template = """System: {system}

Question: {question}

Provide a clear and concise answer."""
    
    with open(output_file, 'w') as outfile:
        for sample in dataset:
            # Generate a unique 11-character alphanumeric record ID (hex avoids the hyphen in str(uuid4))
            record_id = uuid.uuid4().hex[:11]
            
            # Format the prompt
            formatted_prompt = prompt_template.format(
                system=system_message,
                question=sample["QUESTION"]
            )

            # Create the model input body for the Llama 3.3 teacher
            body = {
                "prompt": formatted_prompt,
                "max_gen_len": 512,
                "temperature": 0.7,
                "top_p": 0.9
            }

            # Create the complete record for batch inference
            batch_record = {
                "recordId": record_id,
                "modelInput": body
            }
            
            outfile.write(json.dumps(batch_record) + '\n')

# Usage
create_bedrock_batch_dataset(dataset)
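
The system message asks the teacher to answer from the provided research context, but the prompt template above passes only the question. One possible variant is sketched below, assuming `CONTEXTS` is a list of strings as in the sample record printed earlier. `build_batch_records` is a hypothetical helper, and the index-based record ID is an illustrative choice that makes it easy to map outputs back to inputs; it is not what the cell above does.

```python
import json
from typing import Dict, List


def build_batch_records(samples: List[Dict], system_message: str) -> List[Dict]:
    """Build batch records whose prompt carries the context and whose ID encodes the sample index."""
    records = []
    for index, sample in enumerate(samples):
        # Join the context passages into one block of text.
        context = "\n".join(sample["CONTEXTS"])
        prompt = (
            f"System: {system_message}\n\n"
            f"Context: {context}\n\n"
            f"Question: {sample['QUESTION']}\n\n"
            "Provide a clear and concise answer."
        )
        records.append({
            # Zero-padded, 11-character ID that maps back to the sample index.
            "recordId": f"{index:011d}",
            "modelInput": {"prompt": prompt, "max_gen_len": 512, "temperature": 0.7, "top_p": 0.9},
        })
    return records


toy_samples = [{"QUESTION": "Is X associated with Y?", "CONTEXTS": ["Background text.", "Results text."]}]
toy_records = build_batch_records(toy_samples, "You are a biomedical research assistant.")
print(json.dumps(toy_records[0], indent=2))
```

Each record can then be written as one JSON line, exactly as in `create_bedrock_batch_dataset`.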

In [None]:
import sagemaker
from sagemaker.s3 import S3Uploader
# Define source and destination paths
local_path_batch_file = 'bedrock_batch_dataset.jsonl'
s3_prefix_batch = 'distillation/batch/data'  # This will be the folder in S3

# Upload the file
s3_path_batch = S3Uploader.upload(
    local_path=local_path_batch_file,
    desired_s3_uri=f's3://{bucket}/{s3_prefix_batch}',
)

print(f"File uploaded successfully to: {s3_path_batch}")

#### Bedrock Batch Job Configuration

The following code configures a Bedrock batch inference job using Llama 3.3 as the teacher model to generate the synthetic dataset.

> **Note**: This implementation uses direct model responses. Consider enhancing with:
> - Additional synthetic data generation methods
> - Answer validation against ground truth context
> - Quality assurance metrics for generated responses

In [None]:
output_prefix="output"
inputDataConfig=({
    "s3InputDataConfig": {
        "s3Uri": s3_path_batch
    }
})

outputDataConfig=({
    "s3OutputDataConfig": {
        "s3Uri": f"s3://{bucket}/{s3_prefix_batch}/{output_prefix}/"
    }
})

In [None]:
jobName = 'batch-job-ga' + str(int(datetime.now().timestamp()))
response=bedrock_client.create_model_invocation_job(
    roleArn=role,
    #modelId='meta.llama3-3-70b-instruct-v1:0',
    modelId='us.meta.llama3-3-70b-instruct-v1:0',
    jobName=jobName,
    inputDataConfig=inputDataConfig,
    outputDataConfig=outputDataConfig
)

In [None]:
import time
jobArn = response.get('jobArn')
job_id = jobArn.split('/')[1]

print(jobArn)

# Statuses after which the job will not change again
terminal_statuses = ['Completed', 'PartiallyCompleted', 'Failed', 'Stopped', 'Expired']
status = ''
while status not in terminal_statuses:
    job_response = bedrock_client.get_model_invocation_job(jobIdentifier=jobArn)
    status = job_response['status']
    print(datetime.now(), ": ", status)
    if status in terminal_statuses:
        if status != 'Completed':
            # Print the full response to help diagnose the problem
            print(job_response)
        break
    time.sleep(300)

### Dataset Formatting for JumpStart/Bedrock
Covers:

- Required data format for training
- Chat template structure
- Input/output specifications
- Validation procedures


In [None]:
# Create an S3 client
s3 = boto3.client('s3')
prefix = f"{s3_prefix_batch}/{output_prefix}/{job_id}/"
print(f"prefix: {bucket}/{prefix}")
object_key = f"{prefix}{local_path_batch_file}.out"

In [None]:
response = s3.get_object(Bucket=bucket, Key=object_key)

In [None]:
json_data = response['Body'].read().decode('utf-8')

In [None]:
teacher_answer=[]
for line in json_data.splitlines():
    # Use a separate name so the PubMedQA `data` dict loaded earlier is not overwritten
    record = json.loads(line)
    print(record['modelOutput']['generation'])
    teacher_answer.append(record['modelOutput']['generation'])

In [None]:
len(teacher_answer)

In [None]:
teacher_answer[0]

In [None]:
dataset[0]['LONG_ANSWER']

In [None]:
for data_item, teacher in zip(dataset, teacher_answer):
    data_item['TEACHER_ANSWER'] = teacher
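
The `zip` above pairs answers with questions by position, which is only correct if the batch output keeps the input order and contains no failed records. A more defensive option is to look answers up by `recordId`. The sketch below assumes each output line carries `recordId` and, on success, `modelOutput.generation`, as the parsing cell above does; `generations_by_record_id` is a hypothetical helper.

```python
import json
from typing import Dict


def generations_by_record_id(jsonl_text: str) -> Dict[str, str]:
    """Map recordId to generated text, skipping records without a generation."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        generation = (record.get("modelOutput") or {}).get("generation")
        if generation is not None:
            results[record["recordId"]] = generation
    return results


# Hypothetical output lines, out of order; the second has no modelOutput, as a failed record might.
toy_output = "\n".join([
    json.dumps({"recordId": "b", "modelOutput": {"generation": "answer b"}}),
    json.dumps({"recordId": "c", "error": {"errorMessage": "example failure"}}),
    json.dumps({"recordId": "a", "modelOutput": {"generation": "answer a"}}),
])
toy_lookup = generations_by_record_id(toy_output)
print(toy_lookup)
```

Using this lookup requires keeping each sample's record ID when the input file is written, which the dataset creation cell does not currently do.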

#### Dataset Formatting for JumpStart Training

Converts input data into JSONL format following SageMaker JumpStart chat template specifications. For more information about JumpStart data formats, see [Training Data Format](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-fine-tuning-instruction-based.html).

In [None]:
import json

def create_qa_training_data(dataset, output_file='train.jsonl', max_samples=5000):
    """
    Transform the dataset into JSONL format with a dialog structure including a system message for Llama fine-tuning.

    Args:
        dataset: List of dictionaries containing 'QUESTION', 'CONTEXTS', 'LONG_ANSWER', 'TEACHER_ANSWER'
        output_file: Output JSONL file path
        max_samples: Maximum number of samples to include
    """
    # Define the system message
    system_message = """You are a specialized biomedical research assistant trained to analyze and answer questions about medical and scientific literature. Your role is to:
    - Extract and interpret key information from biomedical research papers, clinical studies, and medical literature
    - Provide accurate, evidence-based responses based solely on the provided research context
    - Focus on specific medical findings, methodologies, and clinical outcomes
    - Present complex medical information in clear, understandable terms
    - Maintain precision when discussing medical terminology, study results, and statistical data
    - Distinguish between preliminary findings and established conclusions
    - Reference specific sections of the provided research when answering questions
    - Acknowledge limitations in studies when relevant
    - Avoid making medical recommendations or providing diagnosis
    When responding, only use information explicitly stated in the provided biomedical context."""

    # Limit the number of samples if specified
    dataset = dataset[:max_samples] if max_samples else dataset

    with open(output_file, 'w', encoding='utf-8') as f:
        for item in dataset:
            try:
                # Create the dialog structure with system message
                dialog = [
                    {
                        "content": f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{item['QUESTION']}",
                        "role": "user"
                    },
                    {
                        "content": item['TEACHER_ANSWER'],
                        "role": "assistant"
                    }
                ]
                
                # Create the JSON object
                json_object = {
                    "dialog": dialog
                }
                
                # Write the JSON line
                f.write(json.dumps(json_object) + '\n')
            except KeyError as e:
                print(f"Skipping item due to missing key: {e}")
                continue

def verify_jsonl(filename):
    """Check that every line parses as JSON and print the first entry as a sample."""
    with open(filename, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            try:
                data = json.loads(line)
                if i == 0:  # Print first example
                    print("Sample entry:")
                    print(json.dumps(data, indent=2))
            except json.JSONDecodeError as e:
                print(f"Error in line {i+1}: {e}")


# Usage example
create_qa_training_data(dataset, output_file='train.jsonl', max_samples=5000)
verify_jsonl('train.jsonl')
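
Beyond checking that each line parses, it can help to check the dialog structure itself before launching a training job. The sketch below assumes the two-role layout written by `create_qa_training_data` (alternating `user` and `assistant` turns, starting with `user`); `validate_dialog_line` is a hypothetical helper and does not reproduce any validation done by JumpStart.

```python
import json
from typing import List


def validate_dialog_line(line: str) -> List[str]:
    """Return a list of problems found in one JSONL training line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    dialog = record.get("dialog") if isinstance(record, dict) else None
    if not isinstance(dialog, list) or not dialog:
        return ["missing or empty 'dialog' list"]
    problems = []
    expected_roles = ["user", "assistant"]
    for position, turn in enumerate(dialog):
        expected = expected_roles[position % 2]
        if turn.get("role") != expected:
            problems.append(f"turn {position}: expected role '{expected}'")
        if not str(turn.get("content") or "").strip():
            problems.append(f"turn {position}: empty content")
    return problems


good = json.dumps({"dialog": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "A"}]})
bad = json.dumps({"dialog": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": ""}]})
good_problems = validate_dialog_line(good)
bad_problems = validate_dialog_line(bad)
print(good_problems, bad_problems)
```

An empty assistant turn is worth catching here because it would mean the teacher returned no text for that question.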

## 5. Student Model (LLaMA 3B)

This section covers selecting and configuring the student model in JumpStart:

- Available model options in JumpStart
- Selection criteria for student model
- Configuration parameters
- Resource requirements

### Upload dataset to S3 bucket

In [None]:
from sagemaker.s3 import S3Uploader
import sagemaker
import random


default_bucket_prefix = sagemaker.Session().default_bucket_prefix
default_bucket_prefix_path = ""

# If a default bucket prefix is specified, append it to the s3 path
if default_bucket_prefix:
    default_bucket_prefix_path = f"/{default_bucket_prefix}"

local_data_file = "train.jsonl"
train_data_location = f"s3://{bucket}{default_bucket_prefix_path}/oasst_top1"
S3Uploader.upload(local_data_file, train_data_location)
print(f"Training data: {train_data_location}")

### Selecting Student Model in JumpStart 

In [None]:
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models


try:
    dropdown = Dropdown(
        options=list_jumpstart_models("search_keywords includes Text Generation"),
        value="meta-textgeneration-llama-3-2-1b",
        description="Select a JumpStart text generation model:",
        style={"description_width": "initial"},
        layout={"width": "max-content"},
    )
    display(dropdown)
except Exception:
    # Fall back to the default model ID when the widget cannot be displayed
    dropdown = None

In [None]:
if dropdown:
    student_model_id = dropdown.value
else:
    # Default student model; use meta-textgeneration-llama-3-2-1b-instruct for the instruct variant
    student_model_id = "meta-textgeneration-llama-3-2-1b"
model_version_student = "*"

In [None]:
from sagemaker import metric_definitions
print(metric_definitions.retrieve_default(model_id="meta-textgeneration-llama-3-2-1b-instruct", model_version='1.1.1',))

In [None]:
metric_definitions.retrieve_default(model_id="meta-textgeneration-llama-3-2-1b-instruct", model_version='1.1.1',)

### Configuring Training Job 

In [None]:
from sagemaker import hyperparameters

my_hyperparameters_student = hyperparameters.retrieve_default(
    model_id=student_model_id, model_version=model_version_student,
)

print(my_hyperparameters_student)

### Hyperparameters  

In [None]:
my_hyperparameters_student["epoch"] = "1"
my_hyperparameters_student['chat_dataset']="True"
my_hyperparameters_student['instruction_tuned']="False"
my_hyperparameters_student['seed']="10"  # A fixed seed helps make results reproducible

hyperparameters.validate(
    model_id=student_model_id, model_version=model_version_student, hyperparameters=my_hyperparameters_student
)

In [None]:
pprint.pprint(my_hyperparameters_student)

In [None]:
from sagemaker.parameter import ContinuousParameter, CategoricalParameter,IntegerParameter

# Define hyperparameter ranges without as_json_range
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.00001, 0.0005, scaling_type="Logarithmic"),
    'lora_r': CategoricalParameter(['4', '8', '12', '16']),
    'lora_alpha': CategoricalParameter(['16', '32', '48', '64']),
    'lora_dropout': ContinuousParameter(0.01, 0.2),
    'per_device_train_batch_size': CategoricalParameter(['2', '4', '6', '8']),
    'gradient_accumulation_steps': CategoricalParameter(['1', '2', '3', '4']),
    'max_steps': CategoricalParameter(['50', '75', '100']),
    'warmup_steps': CategoricalParameter(['5', '7', '10']),
    'num_train_epochs': CategoricalParameter(['1', '2'])

}





In [None]:
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter, CategoricalParameter

metric_defs=metric_definitions.retrieve_default(model_id="meta-textgeneration-llama-3-2-1b", model_version='1.1.1',)
print(metric_defs)


In [None]:
memory_metrics = [
    {'Name': 'gpu:memory_allocated', 'Regex': 'Max CUDA memory allocated was ([0-9\\.]+) GB'},
    {'Name': 'gpu:memory_reserved', 'Regex': 'Max CUDA memory reserved was ([0-9\\.]+) GB'},
    {'Name': 'gpu:peak_active_memory', 'Regex': 'Peak active CUDA memory was ([0-9\\.]+) GB'},
    {'Name': 'train:loss', 'Regex': 'train_loss = ([0-9\\.]+)'}
]

In [None]:
combined_metrics = metric_defs + memory_metrics

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Create the estimator
estimator = JumpStartEstimator(
    model_id=student_model_id,
    model_version=model_version_student,
    hyperparameters=my_hyperparameters_student,
    role=role,
    disable_output_compression=True,
    instance_type='ml.g5.2xlarge',
    environment={"accept_eula": "true"},
    metric_definitions=combined_metrics,  # Add metric definitions here
    enable_sagemaker_metrics=True  # Enable SageMaker metrics,
)

In [None]:
# Create the hyperparameter tuner
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name='huggingface-textgeneration:train-loss',
    metric_definitions=combined_metrics,
    objective_type='Minimize',
    max_jobs=20,
    max_parallel_jobs=4,  # Adjust to the number of available instances
    hyperparameter_ranges=hyperparameter_ranges,
    strategy='Bayesian',
    base_tuning_job_name='llm-llama-3-2-1b',
)


In [None]:
#Start the hyperparameter tuning job
tuner.fit({"training": train_data_location}, wait=True)
# First, wait for the tuning job to complete
tuner.wait()



In [None]:
# Get the best training job
best_training_job = tuner.best_training_job()
print(f"Best training job: {best_training_job}")



In [None]:
# Create a SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Get the best hyperparameters using the SageMaker client
best_hyperparameters_student_1 = sagemaker_client.describe_training_job(TrainingJobName=best_training_job)['HyperParameters']
print("Best hyperparameters: \n")
pprint.pprint(best_hyperparameters_student_1)


In [None]:
student_model_id_2='meta-textgeneration-llama-3-2-3b'
metric_defs=metric_definitions.retrieve_default(model_id="meta-textgeneration-llama-3-2-3b", model_version='1.1.1',)

# Define hyperparameter ranges for tuning
hyperparameter_ranges = {
    # Learning rate - logarithmic scale for better exploration
    'learning_rate': ContinuousParameter(1e-5, 5e-4, scaling_type="Logarithmic"),
    
    # LoRA specific parameters
    'lora_r': CategoricalParameter(['8', '16', '32', '64']),  # Higher ranks possible with 24GB GPUs
    'lora_alpha': CategoricalParameter(['16', '32', '64', '128']),  # Scaled with lora_r
    'lora_dropout': ContinuousParameter(0.05, 0.2),  # Wider range for regularization
    
    # Batch size optimization for 4x A10G GPUs
    'per_device_train_batch_size': CategoricalParameter(['4', '8', '16']),  # Larger due to 4 GPUs
    'gradient_accumulation_steps': CategoricalParameter(['1', '2', '4', '8']),  # Adjusted for total batch size
    
    # Training dynamics
    'max_steps': CategoricalParameter(['200', '400', '600']),  # More steps for initial evaluation
    'warmup_steps': CategoricalParameter(['20', '40', '60']),  # 10% of max_steps
    'num_train_epochs': CategoricalParameter(['1', '2'])  # Initial epochs for evaluation
}
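
The comments in the ranges above rely on a few pieces of arithmetic that are easy to check. The sketch below works them through, assuming plain data parallelism across the 4 GPUs of an `ml.g5.12xlarge` and the standard LoRA scaling factor of `lora_alpha / lora_r`; both helper names are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to each optimizer step under data parallelism."""
    return per_device * grad_accum * num_gpus


def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the LoRA update in the standard formulation."""
    return lora_alpha / lora_r


# Extremes of the search space defined above, on 4 GPUs.
largest = effective_batch_size(per_device=16, grad_accum=8, num_gpus=4)
smallest = effective_batch_size(per_device=4, grad_accum=1, num_gpus=4)

# alpha and r are sampled independently, so their ratio varies widely.
min_scale = lora_scaling(lora_alpha=16, lora_r=64)
max_scale = lora_scaling(lora_alpha=128, lora_r=8)

# Each warmup option is 10% of the corresponding max_steps option.
warmup_fraction = 20 / 200

print(smallest, largest, min_scale, max_scale, warmup_fraction)
```

Because the tuner samples each parameter independently, combinations such as 60 warmup steps with 200 max steps are also possible; the 10% relationship holds only when the matching options are picked together.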


In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Create the estimator
estimator_2 = JumpStartEstimator(
    model_id=student_model_id_2,
    model_version=model_version_student,
    hyperparameters=my_hyperparameters_student,
    instance_type="ml.g5.12xlarge",
    role=role,
    disable_output_compression=True,
    environment={"accept_eula": "true"},
    metric_definitions=combined_metrics,  # Add metric definitions here
    enable_sagemaker_metrics=True  # Enable SageMaker metrics
)

In [None]:
# Create the hyperparameter tuner
tuner_2=HyperparameterTuner(
    estimator_2,
    objective_metric_name='huggingface-textgeneration:train-loss',
    metric_definitions=combined_metrics,
    objective_type='Minimize',
    max_jobs=10,
    max_parallel_jobs=4,
    hyperparameter_ranges=hyperparameter_ranges,
    strategy='Bayesian',
    base_tuning_job_name='llm-llama-3-2-3b',
)


In [None]:
#Start the hyperparameter tuning job
tuner_2.fit({"training": train_data_location}, wait=True)
# First, wait for the tuning job to complete
tuner_2.wait()


In [None]:
# Get the best training job from the 3B tuning run
best_training_job_1 = tuner_2.best_training_job()
print(f"Best training job: {best_training_job_1}")


# Get the best hyperparameters using the SageMaker client
best_hyperparameters_student_1 = sagemaker_client.describe_training_job(TrainingJobName=best_training_job_1)['HyperParameters']
print(f"Best hyperparameters: {best_hyperparameters_student_1}")

In [None]:
pprint.pprint(best_hyperparameters_student_1)

### Launching Training Job 

In [None]:
pprint.pprint(best_hyperparameters_student_1)
best_hyperparameters_student_1['num_train_epochs']=10
best_hyperparameters_student_1['epoch']=10
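
Hyperparameters read back from a tuner-launched training job may include keys added by the tuner or the SDK in addition to the ones set in this notebook. Printing the dictionary first is the reliable check. If such keys are present, a filter like the sketch below could be applied before reuse. The `_tuning` and `sagemaker_` prefixes and the quote stripping are assumptions about what those extra entries look like, and `strip_tuner_keys` is a hypothetical helper.

```python
from typing import Dict


def strip_tuner_keys(hyperparameters: Dict[str, str]) -> Dict[str, str]:
    """Drop keys that look tuner- or SDK-injected and remove surrounding double quotes from values."""
    cleaned = {}
    for key, value in hyperparameters.items():
        # Assumed prefixes for injected keys; adjust after inspecting the real dictionary.
        if key.startswith("_tuning") or key.startswith("sagemaker_"):
            continue
        cleaned[key] = str(value).strip('"')
    return cleaned


# Hypothetical example input; real keys and values may differ.
raw = {
    "learning_rate": "0.0001",
    "lora_r": '"8"',
    "_tuning_objective_metric": "example-metric",
    "sagemaker_program": '"example.py"',
}
cleaned = strip_tuner_keys(raw)
print(cleaned)
```

The cleaned dictionary could then be updated with the epoch overrides shown above before it is passed to the estimator.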

In [None]:
from sagemaker.debugger import TensorBoardOutputConfig

# Create proper TensorBoard output configuration
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f's3://{bucket}/tensorboard-logs/llama3-model-distillation',
    container_local_output_path='/opt/ml/output/tensorboard'
)



In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator

# 3B student, matching the notebook's 70B → 3B goal and the 3B tuning run
student_model_id = "meta-textgeneration-llama-3-2-3b"
model_version_student = "*"

estimator_student = JumpStartEstimator(
    model_id=student_model_id,
    model_version=model_version_student,
    hyperparameters=best_hyperparameters_student_1,
    role=role,
    disable_output_compression=True,
    enable_sagemaker_metrics=True,
    environment={
        "accept_eula": "true",
        "TENSORBOARD_LOGGING": "true",
    },  # please change `accept_eula` to be `true` to accept EULA.
    tensorboard_output_config=tensorboard_output_config  # Use the proper config object
)
# Define metrics to track (named so it does not shadow the imported `metric_definitions` module)
student_metric_definitions = [
    # Training Metrics
    {'Name': 'train:loss', 'Regex': 'step .* is completed and loss is ([0-9\\.]+)'},
    {'Name': 'train:perplexity', 'Regex': 'train_perplexity=([0-9\\.]+)'},
    {'Name': 'train:epoch_loss', 'Regex': 'train_epoch_loss=([0-9\\.]+)'},
    
    # Evaluation Metrics
    {'Name': 'eval:loss', 'Regex': 'eval_epoch_loss=tensor\\(([0-9\\.]+)'},
    {'Name': 'eval:perplexity', 'Regex': 'eval_ppl=tensor\\(([0-9\\.]+)'},
    
    # Performance Metrics
    {'Name': 'epoch_time', 'Regex': 'epcoh time ([0-9\\.]+)'},
    {'Name': 'training_throughput', 'Regex': '([0-9\\.]+)it/s'},
    
    # Memory Usage
    {'Name': 'gpu:memory_allocated', 'Regex': 'Max CUDA memory allocated was ([0-9\\.]+) GB'},
    {'Name': 'gpu:memory_reserved', 'Regex': 'Max CUDA memory reserved was ([0-9\\.]+) GB'},
    {'Name': 'gpu:peak_active_memory', 'Regex': 'Peak active CUDA memory was ([0-9\\.]+) GB'},
    {'Name': 'cpu:peak_memory', 'Regex': 'CPU Total Peak Memory consumed during the train \\(max\\): ([0-9\\.]+) GB'}
]
# Add metrics to estimator
estimator_student.metric_definitions = student_metric_definitions
# Launch TensorBoard in SageMaker Studio
tensorboard_callback = {
    'Config': {
        'TrainingJobName': 'llama-3-2-3b-model-distilation'
    }
}
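
SageMaker extracts each metric by applying its regex to the training log stream and recording the captured group, so a pattern that never matches silently produces no data. A quick local check against sample lines can catch that early. The sketch below reuses three patterns from `metric_definitions`; the log lines are made-up examples (assumptions), not actual output from this training job, and `extract_metrics` is a hypothetical helper.

```python
import re

# Patterns copied from metric_definitions above
patterns = {
    "train:loss": "step .* is completed and loss is ([0-9\\.]+)",
    "eval:perplexity": "eval_ppl=tensor\\(([0-9\\.]+)",
    "training_throughput": "([0-9\\.]+)it/s",
}

# Hypothetical log lines, for illustration only
sample_logs = [
    "Training Epoch: 1/10, step 10 is completed and loss is 1.2345",
    "eval_ppl=tensor(3.2100, device='cuda:0')",
    "100%|##########| 50/50 [00:33<00:00, 1.52it/s]",
]

def extract_metrics(lines, patterns):
    """Return {metric_name: [float, ...]} for every pattern match in the given lines."""
    found = {name: [] for name in patterns}
    for line in lines:
        for name, regex in patterns.items():
            match = re.search(regex, line)
            if match:
                found[name].append(float(match.group(1)))
    return found

metrics = extract_metrics(sample_logs, patterns)
print(metrics)
```

An empty list for a metric name would indicate that its regex does not match the log format and needs adjusting.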



In [None]:
estimator_student.fit({"training": train_data_location},
    wait=True,
    logs="All")

## 6. Evaluation

### Deploying the Student Model (SageMaker Endpoint or Bedrock CMI)
This section covers:

- Deployment options (SageMaker endpoints vs Bedrock)
- Configuration requirements
- Monitoring setup

### Custom Model Import in Bedrock

In [None]:
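# Get the training job name and model URI
training_job_name = estimator_student.latest_training_job.name
# With disable_output_compression=True, model_data is a dict rather than a plain S3 URI string
model_uri = estimator_student.model_data['S3DataSource']['S3Uri']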

In [None]:
import random
import time

REGION_NAME = 'us-east-1'
bedrock = boto3.client(service_name='bedrock',
                       region_name=REGION_NAME)
# Generate a unique suffix for the job and model names
timestamp = int(time.time())
random_number = random.randint(1000, 9999)
JOB_NAME = f"meta3-import-model-{timestamp}-{random_number}"

ROLE_ARN = bedrock_role_arn
IMPORTED_MODEL_NAME = f"llama3_1_student_{timestamp}-{random_number}"
S3_URI = model_uri

# Start the import with the CreateModelImportJob API
create_job_response = bedrock.create_model_import_job(
    jobName=JOB_NAME,
    importedModelName=IMPORTED_MODEL_NAME,
    roleArn=ROLE_ARN,
    modelDataSource={
        "s3DataSource": {
            "s3Uri": S3_URI
        }
    },
)
job_arn = create_job_response.get("jobArn")
print(f"Model import job created with ARN: {job_arn}")

In [None]:
model_name_filter = IMPORTED_MODEL_NAME  # Replace with your model name
model_info = wait_for_model_availability(model_name_filter, max_attempts=30, delay=60)
if model_info:
    model_arn=model_info["modelArn"]
    print("Model is now available in Bedrock.")
else:
    print("Failed to find the model in Bedrock within the specified attempts.")

In [None]:
import json

from botocore.config import Config

REGION_NAME = 'us-east-1'
MODEL_ID= model_arn

config = Config(
    retries={
        'total_max_attempts': 100, 
        'mode': 'standard'
    }
)
message = "Hello, what is the weather in Seattle?"


session = boto3.session.Session()
br_runtime = session.client(service_name = 'bedrock-runtime', 
                                 region_name=REGION_NAME, 
                                 config=config)
    
try:
    invoke_response = br_runtime.invoke_model(modelId=MODEL_ID, 
                                            body=json.dumps({'prompt': message}), 
                                            accept="application/json", 
                                            contentType="application/json")
    invoke_response["body"] = json.loads(invoke_response["body"].read().decode("utf-8"))
    print(json.dumps(invoke_response, indent=4))
except Exception as e:
    # Print both the message and the full representation to aid debugging
    print(e)
    print(repr(e))

### Evaluation Environment Setup

#### Configuration Requirements

1. **SageMaker Studio Environment**
   - Use SageMaker Studio Code Editor
   - Minimum instance: `ml.t3.xlarge`
   - Storage: 50GB minimum
   - Required IAM role trust policy statement (allows SageMaker to assume the role):
     ```json
     {
         "Effect": "Allow",
         "Principal": {
             "Service": "sagemaker.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
     }
     ```

2. **FMBench Environment Setup**
   ```bash
   # Create and activate conda environment
   conda create --name fmbench_python311 -y python=3.11 ipykernel
   source activate fmbench_python311
   
   # Install FMBench
   pip install -U fmbench
   ```

#### Benchmark Setup

1. **Directory Configuration**
   ```bash
   # Set working directory
   mkdir fmbench 
   export EVAL_DIR="tmp"
   mkdir -p $EVAL_DIR

   # Download FMBench dependencies
   curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- "$EVAL_DIR"
   ```

2. **Evaluation Execution**
   ```bash
   # Run evaluation
   fmbench --config-file $EVAL_DIR/fmbench-read/configs/llama3/config-bedrock-llama3.yml \
       --local-mode yes \
       --write-bucket s3://{your-bucket}/model-evaluation \
       --tmp-dir $EVAL_DIR > $EVAL_DIR/fmbench.log 2>&1
   ```

3. **Monitor Progress**
   ```bash
   # View live logs
   tail -f $EVAL_DIR/fmbench.log
   ```

**Results Collection**
- Evaluation metrics are stored in `$EVAL_DIR/fmbench-write/`
- Results are automatically uploaded to `s3://{your-bucket}/model-evaluation/`
- Report artifacts include:
    - Performance metrics (CSV)
    - Visualization plots (PNG)
    - Interactive dashboards (HTML)
    - Raw evaluation data (JSON)

**Note**: Replace `{your-bucket}` with your S3 bucket name. Ensure the IAM role has appropriate S3 permissions.

**Accessing Results**

In [None]:
# Example code to load evaluation results (to be implemented)
import boto3
s3 = boto3.client('s3')

def load_evaluation_results(bucket, prefix):
    # Load evaluation results from S3
    pass
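
Until the loader above is implemented, a reasonable first step is to enumerate the report artifacts under the results prefix. The sketch below is one possible approach, not part of FMBench: `list_result_keys` is a hypothetical helper, and the file suffixes are assumptions based on the artifact types listed earlier. It accepts any object exposing `get_paginator`, so a real `boto3.client('s3')` can be passed in; a small stand-in client is used here so the example runs offline.

```python
def list_result_keys(s3_client, bucket, prefix, suffixes=(".csv", ".png", ".html", ".json")):
    """Group S3 object keys under `prefix` by file suffix."""
    grouped = {suffix: [] for suffix in suffixes}
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match
        for obj in page.get("Contents", []):
            key = obj["Key"]
            for suffix in suffixes:
                if key.endswith(suffix):
                    grouped[suffix].append(key)
                    break
    return grouped

# Stand-ins for boto3.client("s3") and its paginator, for offline illustration only
class FakePaginator:
    def __init__(self, pages):
        self.pages = pages

    def paginate(self, **kwargs):
        return iter(self.pages)

class FakeS3Client:
    def __init__(self, pages):
        self.pages = pages

    def get_paginator(self, operation_name):
        return FakePaginator(self.pages)

fake_pages = [
    {"Contents": [{"Key": "model-evaluation/metrics.csv"}, {"Key": "model-evaluation/plot.png"}]},
    {"Contents": [{"Key": "model-evaluation/raw.json"}]},
    {},  # a page with no matching objects
]
keys = list_result_keys(FakeS3Client(fake_pages), "example-bucket", "model-evaluation/")
print(keys)
```

Using the paginator rather than a single `list_objects_v2` call matters because one call returns at most 1,000 keys.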

### Comparative Testing (Teacher vs Student)
This section presents the results from FMBench evaluation comparing the teacher model (Llama 70B) and student model (Llama 1B).

#### Evaluation Metrics
| Model | Judge Accuracy (Cohere) | Judge Accuracy (Claude) | Judge Accuracy (Llama) | Majority Voting |
|-------|------------------------|------------------------|---------------------|-----------------|
| Teacher (70B) | 96.52% | 92.75% | 91.02% | 93.02% |
| Student (3B) | [Pending] | [Pending] | [Pending] | [Pending] |

> **Note**: Model evaluations performed by 3 LLM judges using ground truth comparison
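
To make the "Majority Voting" column concrete, the sketch below shows one plausible way to combine three judges' verdicts: a response counts as correct when a strict majority of judges mark it correct. The verdict data is invented for illustration and the aggregation rule is an assumption; FMBench's exact procedure may differ.

```python
def majority_vote_accuracy(verdicts_per_judge):
    """Compute accuracy where each response is correct if most judges say so.

    verdicts_per_judge: dict mapping judge name -> list of booleans,
    one per evaluated response (all lists must have the same length).
    """
    judge_verdicts = list(verdicts_per_judge.values())
    num_responses = len(judge_verdicts[0])
    if any(len(v) != num_responses for v in judge_verdicts):
        raise ValueError("Every judge must score the same number of responses")
    correct = 0
    for votes in zip(*judge_verdicts):
        # A strict majority of judges must agree the response is correct
        if sum(votes) > len(votes) / 2:
            correct += 1
    return correct / num_responses

# Hypothetical verdicts for four responses
verdicts = {
    "cohere": [True, True, False, True],
    "claude": [True, False, False, True],
    "llama": [True, True, False, False],
}
accuracy = majority_vote_accuracy(verdicts)
print(f"Majority-vote accuracy: {accuracy:.2%}")
```

Using an odd number of judges avoids ties, which is one reason a three-judge panel is convenient.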


#### Testing Methodology
- Dataset: Multiple QA datasets from LongBench
- Prompt lengths: 500-3840 tokens
- Concurrency levels: 1-4
- Evaluation criteria: Accuracy, latency, cost

[Display accuracy_trajectory_per_payload.png]
*Figure 1: Accuracy across different prompt lengths*

#### Performance Comparison

**Latency Metrics**
| Model | p50 Latency | p95 Latency | p99 Latency | Transactions/min |
|-------|-------------|-------------|-------------|------------------|
| Teacher (70B) | 5.27s | 5.27s | 5.27s | 2 |
| Student (3B) | [Pending] | [Pending] | [Pending] | [Pending] |

[Display tokens_vs_latency.png]
*Figure 2: Token processing latency comparison*
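
The p50, p95, and p99 columns are latency percentiles. As a reminder of how such values are derived, here is a minimal sketch using the nearest-rank method on a list of per-request latencies. The sample values are made up, `percentile_nearest_rank` is a hypothetical helper, and FMBench may use a different interpolation method.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical per-request latencies in seconds
latencies = [4.8, 5.1, 5.3, 5.0, 6.2, 5.2, 4.9, 5.4, 5.1, 7.0]
summary = {f"p{p}": percentile_nearest_rank(latencies, p) for p in (50, 95, 99)}
print(summary)

# With a single sample, every percentile collapses to that one value
single = {f"p{p}": percentile_nearest_rank([5.27], p) for p in (50, 95, 99)}
print(single)
```

A general property worth keeping in mind when reading latency tables: with very few samples, the tail percentiles carry little information because they coincide with the largest observed value.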



### Performance Metrics Analysis

#### Latency Measurements
- Time to First Token (TTFT)
- Time Per Output Token (TPOT)
- Overall response latency

[Display concurrency_vs_inference_latency.png]
*Figure 3: Concurrency vs Inference Latency*

#### Throughput Analysis
| Model | Prompt Token Throughput | Completion Token Throughput |
|-------|------------------------|---------------------------|
| Teacher (70B) | 203 tokens/s | 2 tokens/s |
| Student (3B) | [Pending] | [Pending] |

#### Cost Comparison
| Model | Price per Transaction | Price per Token | Cost per 10k Transactions |
|-------|---------------------|----------------|------------------------|
| Teacher (70B) | $0.002875 | $0.00000072 | $28.75 |
| Student (3B) | [Pending] | [Pending] | [Pending] |

[Display business_summary.png]
*Figure 4: Price Performance Comparison*
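
The last column of the cost table follows directly from the per-transaction price, and the ratio of the first two columns gives the implied average number of tokens per transaction. The sketch below reproduces that arithmetic for the teacher row, using `Decimal` to avoid floating-point rounding; the helper name `cost_for_transactions` is illustrative.

```python
from decimal import Decimal

def cost_for_transactions(price_per_transaction, num_transactions):
    """Total cost in dollars for a number of transactions at a fixed unit price."""
    return Decimal(price_per_transaction) * num_transactions

# Teacher (70B) prices, taken from the table above
teacher_price_per_transaction = "0.002875"
teacher_price_per_token = "0.00000072"

teacher_cost_10k = cost_for_transactions(teacher_price_per_transaction, 10_000)
implied_tokens = Decimal(teacher_price_per_transaction) / Decimal(teacher_price_per_token)

print(f"Teacher cost per 10k transactions: ${teacher_cost_10k:.2f}")
print(f"Implied tokens per transaction: {implied_tokens:.0f}")
```

The same two lines can be rerun with the student's prices once its evaluation completes, giving a like-for-like comparison.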

#### Quality Metrics
Error rates and model accuracy across different prompt lengths:

[Display error_rates.png]
*Figure 5: Error Rates by Model and Concurrency*

> **Note**: Full interactive versions of these visualizations are available in the evaluation report at `s3://{your-bucket}/model-evaluation/`


For more information about FMBench, see https://github.com/aws-samples/foundation-model-benchmarking-tool

## 7. Production Deployment (JumpStart or Bedrock)
This section compares deploying the student model through SageMaker JumpStart endpoints with deploying it through Bedrock Custom Model Import.

### Endpoint Configuration 

#### SageMaker JumpStart Endpoints
- Provides complete infrastructure control through endpoint configurations
- Supports custom containers and model serving code
- Enables A/B testing through production variants
- Requires endpoint management and maintenance

#### Bedrock Custom Model Import
- Offers serverless deployment with minimal configuration
- Streamlines deployment through model import workflow
- Integrates automatically with AWS AI services
- Manages infrastructure automatically

### Scaling and Cost Management 

#### SageMaker JumpStart
- Instance-based pricing with reserved capacity
- Auto-scaling based on custom metrics
- Granular control over instance types and counts
- Best for consistent, high-throughput workloads

#### Bedrock Custom Model Import
- Usage-based pricing, billed for the time model copies are active rather than for provisioned instances
- Built-in automatic scaling
- No minimum commitment required
- Optimal for variable workload patterns
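
For the SageMaker side, endpoint auto-scaling is configured through Application Auto Scaling. The sketch below only builds the request parameters as plain dictionaries, so it runs without AWS credentials. The endpoint name, variant name, capacity limits, cooldowns, and target value are placeholder assumptions, and it uses the predefined invocations-per-instance metric rather than a custom metric. The resulting dictionaries are intended for the `register_scalable_target` and `put_scaling_policy` calls of a `boto3.client('application-autoscaling')` client.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_capacity=1, max_capacity=4,
                               target_invocations_per_instance=100.0):
    """Build kwargs for Application Auto Scaling on a SageMaker endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    scalable_target = {**common, "MinCapacity": min_capacity, "MaxCapacity": max_capacity}
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing instances
            "ScaleOutCooldown": 60,   # seconds to wait before adding more instances
        },
    }
    return scalable_target, scaling_policy

# Placeholder endpoint name, for illustration only
target_kwargs, policy_kwargs = build_autoscaling_requests("student-qa-endpoint")
print(target_kwargs["ResourceId"])
```

To apply the configuration, the dictionaries would be unpacked into the client calls, for example `client.register_scalable_target(**target_kwargs)` followed by `client.put_scaling_policy(**policy_kwargs)`.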

### Monitoring Setup 

#### SageMaker JumpStart
- CloudWatch integration for custom metrics
- Model monitoring for drift detection
- Detailed logging and debugging capabilities
- Advanced endpoint metrics and alarms

#### Bedrock Custom Model Import
- Simplified monitoring through AWS Console
- Built-in performance metrics
- Automated operational monitoring
- Streamlined logging integration

## 8. Cleanup and Best Practices

### Resource Termination

1. Delete the Bedrock Custom Model
First, let's remove the custom model from Amazon Bedrock:


In [None]:
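def delete_bedrock_custom_model(model_name, region_name='us-east-1'):
    # Use the same region the model was imported into
    bedrock_client = boto3.client('bedrock', region_name=region_name)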
    try:
        bedrock_client.delete_imported_model(modelIdentifier=model_name)
        print(f"Successfully deleted Bedrock custom model: {model_name}")
    except botocore.exceptions.ClientError as error:
        error_code = error.response['Error']['Code']
        if error_code == 'ValidationException':
            print(f"Error deleting Bedrock custom model: The provided model name is invalid. Model Name: {model_name}")
        elif error_code == 'ResourceNotFoundException':
            print(f"Error: The model '{model_name}' was not found in Bedrock.")
        elif error_code == 'AccessDeniedException':
            print("Error: You do not have permission to delete this model.")
        elif error_code == 'ConflictException':
            print("Error: The model is currently in use or in a state that doesn't allow deletion.")
        else:
            print(f"Error deleting Bedrock custom model: {error}")

# Replace with your actual model name (for example, IMPORTED_MODEL_NAME from the import step)
MODEL_NAME = "llama3-qa-model"
delete_bedrock_custom_model(MODEL_NAME)

2. Delete IAM Roles
Now, let's remove the IAM roles we created specifically for this project:

In [None]:
def delete_iam_role(role_name):
    iam = boto3.client('iam')
    try:
        # Delete inline policies
        inline_policies = iam.list_role_policies(RoleName=role_name)['PolicyNames']
        for policy in inline_policies:
            iam.delete_role_policy(RoleName=role_name, PolicyName=policy)
            
        # Detach managed policies
        attached_policies = iam.list_attached_role_policies(RoleName=role_name)['AttachedPolicies']
        for policy in attached_policies:
            iam.detach_role_policy(RoleName=role_name, PolicyArn=policy['PolicyArn'])
            
        # Delete permissions boundary if it exists
        try:
            iam.delete_role_permissions_boundary(RoleName=role_name)
        except iam.exceptions.NoSuchEntityException:
            pass
        
        # Finally delete the role
        iam.delete_role(RoleName=role_name)
        print(f"Successfully deleted IAM role: {role_name}")
    except botocore.exceptions.ClientError as error:
        print(f"Error deleting IAM role: {error}")

# Delete LambdaBedrockExecutionRole
delete_iam_role("LambdaBedrockExecutionRole")

# Delete Sagemaker_Bedrock_import_role
delete_iam_role("Sagemaker_Bedrock_import_role")

### Best Practices
#### Cost Optimization Tips
1. Training Optimization
    - Use spot instances for training when possible
    - Implement early stopping in training jobs
    - Clean up training artifacts promptly
    - Monitor training metrics to avoid unnecessary epochs
2. Inference Optimization
    - Choose between Bedrock and SageMaker based on workload patterns
    - Use auto-scaling for SageMaker endpoints
    - Consider batch processing for large-scale inference
    - Monitor and adjust instance sizes based on utilization
3. Storage Management
    - Implement S3 lifecycle policies for training artifacts
    - Clean up temporary datasets after training
    - Use appropriate storage classes for different data types
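
As an illustration of the S3 lifecycle tip, the sketch below builds a lifecycle configuration that moves objects under a prefix to an infrequent-access storage class and later expires them. The prefix and day counts are assumptions to adapt, and `build_lifecycle_configuration` is a hypothetical helper. The dictionary is shaped for the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call; it is only constructed here, not applied.

```python
def build_lifecycle_configuration(prefix, transition_days=30, expiration_days=180):
    """Build an S3 lifecycle configuration for objects under `prefix`."""
    if expiration_days <= transition_days:
        raise ValueError("expiration_days must be greater than transition_days")
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to a cheaper storage class first
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                # Then delete them once they are no longer needed
                "Expiration": {"Days": expiration_days},
            }
        ]
    }

# Prefix matching the TensorBoard log location used earlier in this notebook
lifecycle = build_lifecycle_configuration("tensorboard-logs/")
print(lifecycle["Rules"][0]["ID"])
```

Applying it would look like `s3.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=lifecycle)`. Note that this call replaces any existing lifecycle configuration on the bucket, so existing rules should be merged in first.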
#### JumpStart Best Practices
1. Model Selection
2. Training Configuration
3. Data Management
4. Knowledge Distillation Specific
5. Production Deployment
#### Security Best Practices
1. Access Control
2. Monitoring and Compliance

For more information, see:

- SageMaker Best Practices[link]
- Bedrock Security[link]
- AWS Machine Learning Security[link]

## 9. Conclusion and Next Steps

### Summary of Results 
### Lessons Learned 
### Future Improvements 