# Performance test Custom Model Import on Amazon Bedrock

This notebook illustrates how to performance test a fine-tuned model once it is hosted in Amazon Bedrock. You can view the process to import the model via [Custom Model Import]().



### License Information

In this notebook we provide a sample of how to performance test. It is by no means a definitive guide to performance testing your models, but it can serve as a starting point for your own testing.

### Installing prerequisites

- Uncomment and run this cell to install the libraries required by this notebook.

In [2]:
#!pip install boto3 numpy --upgrade --quiet

## Setup

Load the boto3 client that we will need to access our model.

In [3]:
import warnings

from io import StringIO
import sys
import textwrap
import os
from typing import Optional

# External Dependencies:
import boto3
from botocore.config import Config

warnings.filterwarnings('ignore')

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))
        

def get_boto_client_tmp_cred(
    retry_config = None,
    target_region: Optional[str] = None,
    runtime: Optional[bool] = True,
    service_name: Optional[str] = None,
):
    """Create a Bedrock client from credentials held in environment variables."""
    if not service_name:
        if runtime:
            service_name='bedrock-runtime'
        else:
            service_name='bedrock'

    bedrock_client = boto3.client(
        service_name=service_name,
        region_name=target_region,  # pass the region explicitly as well as via retry_config
        config=retry_config,
        aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        aws_session_token=os.getenv("AWS_SESSION_TOKEN"),
    )
    print("boto3 Bedrock client successfully created!")
    print(bedrock_client._endpoint)
    return bedrock_client

def get_boto_client(
    assumed_role: Optional[str] = None,
    region: Optional[str] = None,
    runtime: Optional[bool] = True,
    service_name: Optional[str] = None,
):
    """Create a boto3 client for Amazon Bedrock, with optional configuration overrides

    Parameters
    ----------
    assumed_role :
        Optional ARN of an AWS IAM role to assume for calling the Bedrock service. If not
        specified, the current active credentials will be used.
    region :
        Optional name of the AWS Region in which the service should be called (e.g. "us-east-1").
        If not specified, AWS_REGION or AWS_DEFAULT_REGION environment variable will be used.
    runtime :
        Optional choice of getting different client to perform operations with the Amazon Bedrock service.
    """
    if region is None:
        target_region = os.environ.get("AWS_REGION", os.environ.get("AWS_DEFAULT_REGION"))
    else:
        target_region = region

    print(f"Create new client\n  Using region: {target_region}")
    session_kwargs = {"region_name": target_region}
    client_kwargs = {**session_kwargs}

    profile_name = os.environ.get("AWS_PROFILE", None)
    retry_config = Config(
        region_name=target_region,
        signature_version = 'v4',
        retries={
            "max_attempts": 10,
            "mode": "standard",
        },
    )
    if profile_name:
        print(f"  Using profile: {profile_name}")
        session_kwargs["profile_name"] = profile_name
    else:  # no named profile: build the client from environment credentials (assumed_role is not applied)
        print("  Using temp credentials")

        return get_boto_client_tmp_cred(
            retry_config=retry_config,
            target_region=target_region,
            runtime=runtime,
            service_name=service_name,
        )

    session = boto3.Session(**session_kwargs)

    if assumed_role:
        print(f"  Using role: {assumed_role}", end='')
        sts = session.client("sts")
        response = sts.assume_role(
            RoleArn=str(assumed_role),
            RoleSessionName="cmi-llm-1"
        )
        print(" ... successful!")
        client_kwargs["aws_access_key_id"] = response["Credentials"]["AccessKeyId"]
        client_kwargs["aws_secret_access_key"] = response["Credentials"]["SecretAccessKey"]
        client_kwargs["aws_session_token"] = response["Credentials"]["SessionToken"]

    if not service_name:
        if runtime:
            service_name='bedrock-runtime'
        else:
            service_name='bedrock'

    bedrock_client = session.client(
        service_name=service_name,
        config=retry_config,
        **client_kwargs
    )

    print("boto3 Bedrock client successfully created!")
    print(bedrock_client._endpoint)
    return bedrock_client

### Boto3 client
- Create the runtime client that we will use to invoke the model.

In [4]:
#os.environ["AWS_PROFILE"] = '<replace with your profile if you have that set up>'
region_aws = 'us-east-1' #- replace with your region
boto3_bedrock = get_boto_client(region=region_aws, runtime=True, service_name='bedrock-runtime')

Create new client
  Using region: us-east-1
  Using temp credentials
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


### Read the prompts from a file

Prompts are read from a file so that larger context sizes are easy to test and customize.

- We have provided 2 files. The smaller one has approximately 475 tokens.
- The second, larger file has close to 2,000 tokens.
- Feel free to use either file or bring your own data.

In [5]:
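prompt_test = "Generate an article on economics"  # default prompt, overwritten by the file contents below
with open("./perf_data/perf_test_small.txt", "r") as file1:
    # Read the entire prompt file
    prompt_test = file1.read()

# Rough estimate, assuming about 4 characters per token
print(f"approx no of tokens ---- > {len(prompt_test)/4} ")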

approx no of tokens ---- > 474.5 
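The estimate above divides the character count by 4. A minimal sketch of that heuristic as a reusable helper follows; the function name `approx_token_count` and the `chars_per_token` parameter are illustrative assumptions, not part of the notebook. The real count depends on the model's tokenizer, so treat the result as a rough guide only.

```python
def approx_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate: character count divided by an assumed average token length."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return len(text) / chars_per_token


# Hypothetical sample prompt (32 characters)
sample_prompt = "Generate an article on economics"
print(f"approx no of tokens: {approx_token_count(sample_prompt)}")
```

The `usage` field of the model response reports the actual input token count, which is the better number to use once a call has been made.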


### Async method to invoke the model. Since this is an I/O-bound operation, we are not impacted by the GIL

- We cannot use the `asyncio` library directly because boto3 calls are synchronous (blocking).
- We resort to threading instead, since this is an I/O-bound operation and is not impacted by the GIL.
- However, for your own performance testing we recommend multiprocessing, assuming your CPU has enough cores, or a performance testing library designed for parallel tests.
- We create a thread pool whose maximum size is determined by the `max_parallel_invocations` variable.
- We create a batch of runs whose size is determined by the `batch_size` variable.
- For example, to run a batch of 100 invocations with 10 of them in parallel at any time, set `max_parallel_invocations` to 10 and `batch_size` to 100.
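The relationship between the batch size and the parallelism limit can be sketched without calling Bedrock at all. In the example below, `fake_blocking_call` is a hypothetical stand-in for a blocking model invocation (it only sleeps), and `demo_batch_size`, `demo_max_parallel`, and `stats` are illustrative names chosen so they do not clash with the notebook's own variables.

```python
import concurrent.futures
import threading
import time

demo_batch_size = 6      # total number of tasks to run
demo_max_parallel = 2    # upper bound on tasks running at the same time

lock = threading.Lock()
stats = {"active": 0, "peak": 0}


def fake_blocking_call(run_id: int, delay_sec: float = 0.05) -> int:
    """Hypothetical stand-in for a blocking model invocation."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(delay_sec)  # sleeping releases the GIL, much like waiting on network I/O
    with lock:
        stats["active"] -= 1
    return run_id


with concurrent.futures.ThreadPoolExecutor(max_workers=demo_max_parallel) as executor:
    futures = [executor.submit(fake_blocking_call, run_id) for run_id in range(demo_batch_size)]
    completed = sorted(f.result() for f in concurrent.futures.as_completed(futures))

print(f"completed runs: {completed}, peak concurrency: {stats['peak']}")
```

All tasks in the batch complete, but the pool never runs more than `demo_max_parallel` of them at once.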

The wrapper function works as follows:
- It receives the run id assigned when the task is submitted to the thread pool.
- It creates a new boto3 client and uses it for this run.
- It invokes the model 3 times, then returns the average wall-clock time per call in seconds, together with the latency and token metrics that the model reported for the last call.

In [25]:
import boto3
import json
import traceback
import time

def invoke_custom_model(boto_client, model_arn, prompt, run_id, max_tokens=200, temperature=0, top_p=0.9):
    response = ""
    print(f"starting run id --- > {run_id}")
    try:
        response = boto_client.converse(
            modelId=model_arn,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "text": prompt
                        }
                    ]
                }
            ],
            inferenceConfig={
                "temperature": temperature,
                "maxTokens": max_tokens,
                "topP": top_p
            }
            #additionalModelRequestFields={
            #}
        )
        #print(response)
        #print(f"returning results:: ending run id --- > {run_id}:: ", flush=True)
    except Exception:
        # Log the failure; the parsing step below then returns the error sentinel
        print(traceback.format_exc())
    try:
        # (model latency in ms, input tokens, output tokens, summary string)
        result = (
            response['metrics']['latencyMs'],
            response['usage']['inputTokens'],
            response['usage']['outputTokens'],
            f"{len(response['output']['message']['content'][0]['text'])}" \
            + '\n--- Latency: ' + str(response['metrics']['latencyMs']) \
            + 'ms - Input tokens:' + str(response['usage']['inputTokens']) \
            + ' - Output tokens:' + str(response['usage']['outputTokens']) + ' ---\n'
        )
    except Exception:
        print(traceback.format_exc())
        result = (-1, -1, -1, "Output parsing error")

    return result


def invoke_wrapper(model_arn, prompt, run_id, max_tokens=200, temperature=0, top_p=0.9):
    """Invoke the model 3 times with a fresh client and return timing and usage metrics."""
    # Each run creates its own client instead of sharing one across threads
    boto_client = get_boto_client(region=region_aws, runtime=True, service_name='bedrock-runtime')

    start_time_sec = time.time()
    for _ in range(3):
        result = invoke_custom_model(boto_client, model_arn, prompt, run_id, max_tokens, temperature, top_p)

    # Average wall-clock time per call, in seconds
    time_diff = (time.time() - start_time_sec)/3

    # (avg wall time in sec, model-reported latency in ms, input tokens, output tokens);
    # the last three values come from the final call only
    return time_diff, result[0], result[1], result[2]
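Note that `invoke_wrapper` reports the model metrics of only the last of its three calls. If you want a measurement for every call, one possible approach is sketched below. The helper `time_repeated_calls` is a hypothetical name, and the `time.sleep` call stands in for the real model invocation; this is a sketch under those assumptions, not the notebook's method.

```python
import time
from typing import Callable, List, Tuple


def time_repeated_calls(call: Callable[[], object], repeats: int = 3) -> Tuple[float, List[float]]:
    """Run `call` `repeats` times; return (mean seconds per call, per-call seconds)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock suited to measuring intervals
        call()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations), durations


# Stand-in workload; replace the lambda with the real invocation
mean_sec, per_call_sec = time_repeated_calls(lambda: time.sleep(0.01), repeats=3)
print(f"mean: {mean_sec:.4f}s, per call: {[round(d, 4) for d in per_call_sec]}")
```

Keeping the per-call durations makes it possible to spot a slow first call, which an average over three calls would hide.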

#### Set your variables as described above

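- Set `model_id` to the [model ARN](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html) returned by your import process. The values in the cell below are Amazon Bedrock foundation model identifiers, so replace them with your imported model's ARN.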

In [26]:
model_id = 'us.meta.llama3-2-1b-instruct-v1:0'
#model_id = 'us.meta.llama3-2-11b-instruct-v1:0'
#model_id = 'anthropic.claude-3-haiku-20240307-v1:0'

#- how many total runs do we want to execute
batch_size = 2

#- how many of these runs should be executed in parallel
max_parallel_invocations = 2



#### Run it via threading

In [27]:
import concurrent.futures 

result_metrics = []
with concurrent.futures.ThreadPoolExecutor(max_workers = max_parallel_invocations) as executor:

    # Submit tasks to the thread pool
    futures_cmi = [executor.submit(invoke_wrapper, model_id, prompt_test, run_id) for run_id in range(batch_size)]

    # Get the results as they are completed
    for future in concurrent.futures.as_completed(futures_cmi):
        result = future.result()
        result_metrics.append(result)
        print(f"Result: {result}")

Create new client
  Using region: us-east-1Create new client
  Using region: us-east-1
  Using temp credentials

  Using temp credentials
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)
starting run id --- > 1
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)
starting run id --- > 0
starting run id --- > 0
starting run id --- > 1
starting run id --- > 0
starting run id --- > 1
Result: (0.9432307084401449, 781, 486, 139)
Result: (0.9755787054697672, 780, 486, 139)


### Here we tabulate the results

- We display the mean and P90 values.
- We display the end-to-end latency as measured by our own clock, in seconds.
- We also display the latency reported by the model, in milliseconds.

In [24]:
import numpy as np

time_end_end = [result[0] for result in result_metrics]
time_model_latency = [result[1] for result in result_metrics]

time_mean = np.mean(time_end_end)
time_mean_by_model = np.mean(time_model_latency)


p90 = np.percentile(time_end_end, 90)
p90_by_model = np.percentile(time_model_latency, 90)

print(f" Final Mean latency results --- > by Wall time:{time_mean} secs::  by Model Latency metrics:{time_mean_by_model} ms")
print(f" Final P90 latency results --- > by Wall time:{p90} secs ::  by Model Latency metrics:{p90_by_model} ms")




 Final Mean latency results --- > by Wall time:1.0329072078069053 secs::  by Model Latency metrics:803.0 ms
 Final P90 latency results --- > by Wall time:1.0379081169764202 secs ::  by Model Latency metrics:807.8 ms
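`np.percentile` interpolates linearly between sorted samples by default, so with a batch of only two runs the P90 is simply a point nine-tenths of the way from the smaller value to the larger one. The pure-Python sketch below mirrors that default behavior to make the arithmetic visible. The function name `percentile_linear` is an illustrative assumption, and the input values are copied from the sample `Result:` lines printed by the threading cell.

```python
from typing import Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """q-th percentile (0 to 100) using linear interpolation between sorted samples."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0   # fractional index into the sorted samples
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Values taken from the two sample results shown earlier in this notebook
model_latency_ms = [781, 780]
wall_time_sec = [0.9432307084401449, 0.9755787054697672]
wall_time_ms = [t * 1000 for t in wall_time_sec]  # convert to ms to compare like with like

print(f"P90 model latency: {percentile_linear(model_latency_ms, 90):.1f} ms")
print(f"P90 wall time:     {percentile_linear(wall_time_ms, 90):.1f} ms")
```

With so few samples the P90 sits very close to the maximum, so a larger `batch_size` is needed before the percentile says much about tail latency.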
