### Import a customized model into Amazon Bedrock

You can create a custom model in Amazon Bedrock by using the Custom Model Import feature to import foundation models that you have customized in other environments, such as Amazon SageMaker AI. For example, you might have a model created in Amazon SageMaker AI that has proprietary model weights. You can import that model into Amazon Bedrock and then use Amazon Bedrock features to make inference calls to it.

You can use an imported model with on-demand throughput. Use the InvokeModel or InvokeModelWithResponseStream operations to make inference calls to the model. For more information, see "Submit a single prompt with InvokeModel" in the Amazon Bedrock User Guide.

https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html
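
For long generations, InvokeModelWithResponseStream lets you consume output as it is produced. The following is a minimal sketch rather than a definitive implementation. The helper name `stream_generation` is hypothetical, and it assumes each streamed chunk decodes to a JSON object with a `generation` field, like the non-streaming response shown later in this notebook. That chunk schema is an assumption, so inspect a raw chunk from your own model before relying on it.

```python
import json


def stream_generation(client, model_id, prompt, max_tokens=1024):
    """Yield text pieces from invoke_model_with_response_stream.

    Assumption: each chunk's bytes decode to JSON with a 'generation' key.
    """
    response = client.invoke_model_with_response_stream(
        modelId=model_id,
        body=json.dumps({"prompt": prompt, "max_gen_len": max_tokens}),
        accept="application/json",
        contentType="application/json",
    )
    # response["body"] is an iterable event stream; payload events carry a "chunk"
    for event in response["body"]:
        chunk = event.get("chunk")
        if chunk is None:
            continue  # skip events without a chunk payload
        payload = json.loads(chunk["bytes"].decode("utf-8"))
        text = payload.get("generation")
        if text:
            yield text
```

With a `bedrock-runtime` client, the pieces could be printed as they arrive, for example `for piece in stream_generation(client, model_id, prompt): print(piece, end="")`.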

In [39]:
import time
import json
import boto3
from botocore.config import Config

In [40]:
# Initialize Bedrock Runtime client
session = boto3.Session()
client = session.client(
    service_name='bedrock-runtime',
    config=Config(
        connect_timeout=300,  # 5 minutes
        read_timeout=300,     # 5 minutes
        retries={'max_attempts': 3}
    )
)

In [41]:
model_id = "arn:aws:bedrock:us-west-2:707684582322:imported-model/4mxlk57hcdv7"

In [42]:
def apply_chat_template(messages):
    """Build a single-turn prompt; only the first message (assumed to be from the user) is used."""
    return "<｜begin▁of▁sentence｜><｜User｜>" + messages[0]["content"] + "<｜Assistant｜>"

In [43]:
def chat_completion(messages, temperature=0.6, max_tokens=1024, top_p=0.9, model_id=model_id, max_retries=3):
    """
    Simplified chat completion with retry mechanism
    
    Parameters:
        messages (list): List of message dicts with 'role' and 'content'
        temperature (float): Controls randomness (0.0-1.0)
        max_tokens (int): Max tokens to generate
        top_p (float): Nucleus sampling parameter
        model_id (str): Model identifier
        max_retries (int): Max retry attempts
    
    Returns:
        dict: Model response with generated text
    """
    prompt = apply_chat_template(messages)
    
    for attempt in range(max_retries):
        try:
            response = client.invoke_model(
                modelId=model_id,
                body=json.dumps({
                    'prompt': prompt,
                    'temperature': temperature,
                    'max_gen_len': max_tokens,
                    'top_p': top_p
                }),
                accept='application/json',
                contentType='application/json'
            )
            result = json.loads(response['body'].read().decode('utf-8'))
            #print(result)
            return result
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(30)

    raise Exception("Max retries reached")
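
The retry loop above waits a fixed 30 seconds between attempts. One alternative is exponential backoff with a little jitter, which retries quickly at first and backs off if failures persist. This is a sketch of that different technique, not what the notebook does: the helper `retry_with_backoff` and its default delays are hypothetical, and the `sleep` parameter exists only so the wait can be swapped out in tests.

```python
import random
import time


def retry_with_backoff(fn, max_retries=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_error = e
            if attempt < max_retries - 1:
                # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Add up to 10% jitter so concurrent callers do not retry in lockstep
                sleep(delay + random.uniform(0, delay * 0.1))
    # Chain the last error so the root cause stays visible in the traceback
    raise RuntimeError(f"Max retries ({max_retries}) reached") from last_error
```

It could wrap the existing call as `retry_with_backoff(lambda: chat_completion(messages, max_retries=1))`. Note that the boto3 client configured earlier already retries on its own (`max_attempts: 3`), so the two layers multiply.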

In [47]:
# Test conversation
messages = [

    {
        "role": "user", 
        "content": """please help me to generate python code:  
        Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.
        The overall run time complexity should be O(log (m+n))."""
    }
]

In [48]:
print("=== Test Case 1: Normal Conversation ===")
response = chat_completion(messages)
print(response)


=== Test Case 1: Normal Conversation ===
{'generation': '<think>\nOkay, so I need to write a Python function that takes two sorted arrays, nums1 and nums2, of sizes m and n respectively, and returns their median. The catch is that the solution needs to have a time complexity of O(log(m + n)). Hmm, I remember that the median is the middle value of a sorted list, so if I can find the middle element between the two arrays efficiently, that would work.\n\nWait, both arrays are already sorted, right? That should help because I can probably use a two-pointer technique. Let me think about how that would work. So, I can start with two pointers, one at the beginning of each array, and another two at the end. Then, I compare the elements at these pointers and move the pointer with the smaller value forward. This way, I can find the median without having to sort the combined arrays, which would have been O(m + n log(m + n)) time.\n\nBut wait, how do I handle cases where m and n are odd or even? L
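
The `generation` field above starts with a `<think>` block holding the model's reasoning. A small helper can separate that reasoning from the final answer. This is a sketch under the assumption that the model closes the block with `</think>` before the answer. The output shown here is truncated, so that closing tag is not visible in this notebook. The name `split_reasoning` is hypothetical.

```python
def split_reasoning(generation):
    """Return (reasoning, answer) from text that may contain a <think>...</think> block."""
    open_tag, close_tag = "<think>", "</think>"
    start = generation.find(open_tag)
    end = generation.find(close_tag)
    if start == -1 and end == -1:
        # No reasoning markers at all: treat everything as the answer
        return "", generation.strip()
    begin = start + len(open_tag) if start != -1 else 0
    if end == -1:
        # Opening tag without a close: generation likely hit the token limit mid-reasoning
        return generation[begin:].strip(), ""
    return generation[begin:end].strip(), generation[end + len(close_tag):].strip()
```

For example, `reasoning, answer = split_reasoning(response["generation"])`. An empty `answer` suggests that `max_gen_len` was too small for the model to finish reasoning.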

### Testing throughput and latency with Locust

In [49]:
!pip install locust



In [50]:
%%writefile locustfile.py

from locust import User, task, between
import logging
import time  # required by the time.sleep() call in chat_completion

import boto3
import json
from botocore.config import Config

# Initialize Bedrock Runtime client
session = boto3.Session()
client = session.client(
    service_name='bedrock-runtime',
    config=Config(
        connect_timeout=300,  # 5 minutes
        read_timeout=300,     # 5 minutes
        retries={'max_attempts': 3}
    )
)

model_id = "arn:aws:bedrock:us-west-2:707684582322:imported-model/4mxlk57hcdv7"

def apply_chat_template(messages):
    """Build a single-turn prompt; only the first message (assumed to be from the user) is used."""
    return "<｜begin▁of▁sentence｜><｜User｜>" + messages[0]["content"] + "<｜Assistant｜>"


def chat_completion(messages, temperature=0.6, max_tokens=1024, top_p=0.9, model_id=model_id, max_retries=3):
    """
    Simplified chat completion with retry mechanism
    
    Parameters:
        messages (list): List of message dicts with 'role' and 'content'
        temperature (float): Controls randomness (0.0-1.0)
        max_tokens (int): Max tokens to generate
        top_p (float): Nucleus sampling parameter
        model_id (str): Model identifier
        max_retries (int): Max retry attempts
    
    Returns:
        dict: Model response with generated text
    """
    prompt = apply_chat_template(messages)
    
    for attempt in range(max_retries):
        try:
            response = client.invoke_model(
                modelId=model_id,
                body=json.dumps({
                    'prompt': prompt,
                    'temperature': temperature,
                    'max_gen_len': max_tokens,
                    'top_p': top_p
                }),
                accept='application/json',
                contentType='application/json'
            )
            result = json.loads(response['body'].read().decode('utf-8'))
            #print(result)
            return result
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(30)

    raise Exception("Max retries reached")

# Test conversation
messages = [

    {
        "role": "user", 
        "content": """please help me to generate python code:  
        Given two sorted arrays nums1 and nums2 of size m and n respectively, return the median of the two sorted arrays.
        The overall run time complexity should be O(log (m+n))."""
    }
]
    
class LLMUser(User):
    @task
    def generation(self):
        # Invoke the model
        with self.environment.events.request.measure("[Send]", "Prompt"):
            response = chat_completion(messages)
            # Default to '' so a response without 'generation' does not raise a TypeError
            logging.info(response.get('generation', '')[:100])
            
        logging.info("Finished generation!")            

Overwriting locustfile.py


Locust is configured here through command-line options. For the full list, see https://docs.locust.io/en/stable/configuration.html

- `--users`: Peak number of concurrent Locust users. Primarily used together with `--headless` or `--autostart`.
- `--headless`: Disable the web interface and start the test immediately.
- `--csv`: Store request stats to files in CSV format.
- `--spawn-rate`: Rate to spawn users at (users per second).

In this example, --users sets the peak number of concurrent users to 30 and --spawn-rate starts 30 users per second. Because the two values are equal, all 30 users start within roughly the first second, and the test then holds at 30 concurrent users until it ends.
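
The relationship between these options is simple arithmetic: ramp-up time is roughly users divided by spawn rate, and the rest of the run time is spent at full load. The sketch below makes that explicit. The function `load_profile` is hypothetical, and the numbers are approximations because Locust spawns users in small batches rather than perfectly evenly.

```python
def load_profile(users, spawn_rate, run_time_s):
    """Estimate ramp-up and steady-state durations (seconds) for a Locust run."""
    if users < 0 or spawn_rate <= 0 or run_time_s < 0:
        raise ValueError("users and run_time_s must be >= 0 and spawn_rate must be > 0")
    # Time to reach the peak user count, capped at the total run time
    ramp_up_s = min(users / spawn_rate, run_time_s)
    return {"ramp_up_s": ramp_up_s, "steady_s": run_time_s - ramp_up_s}


print(load_profile(30, 30, 120))  # settings used in this example
print(load_profile(30, 5, 120))   # a slower ramp, for comparison
```

A slower spawn rate shortens the steady-state window, which matters for short runs where each request takes many seconds.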

The --run-time option sets the duration of the test. A bare number is interpreted as seconds, so in this example the test runs for 120 seconds before stopping.

!locust --headless --users 30 --spawn-rate 30 --run-time 120 --csv ./benchmark_metric/benchmark_u30
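
After the run, the `--csv` prefix yields several CSV files, including a request statistics file. The sketch below pulls the headline numbers from the `Aggregated` row of that file. It assumes the column names `Request Count`, `Failure Count`, `Average Response Time`, `Median Response Time`, and `Requests/s`. Check the header of your own file, since column names can differ between Locust versions. The function `summarize_stats` is hypothetical.

```python
import csv
import io


def summarize_stats(csv_text):
    """Return key metrics from the 'Aggregated' row of a Locust stats CSV."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("Name") == "Aggregated":
            return {
                "requests": int(row["Request Count"]),
                "failures": int(row["Failure Count"]),
                "avg_ms": float(row["Average Response Time"]),
                "median_ms": float(row["Median Response Time"]),
                "req_per_s": float(row["Requests/s"]),
            }
    raise ValueError("no 'Aggregated' row found")
```

Assuming the stats file is written as `./benchmark_metric/benchmark_u30_stats.csv`, it could be read with `summarize_stats(open("./benchmark_metric/benchmark_u30_stats.csv").read())`. Response times are in milliseconds.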



In [None]:
!locust --headless --users 10 --spawn-rate 10 --run-time 60 --csv ./benchmark_metric/benchmark_u10

[2025-02-07 08:05:26,981] ip-172-16-23-227/INFO/locust.main: Starting Locust 2.32.8
[2025-02-07 08:05:26,981] ip-172-16-23-227/INFO/locust.main: Run time limit set to 60 seconds
Type     Name  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated       0     0(0.00%) |      0       0       0      0 |    0.00        0.00

[2025-02-07 08:05:26,982] ip-172-16-23-227/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-02-07 08:05:26,983] ip-172-16-23-227/INFO/locust.runners: All users spawned: {"LLMUser": 10} (10 total users)
Type     Name  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
--------||-------|-------------|-------|-------|-------|-------