https://aws.amazon.com/blogs/machine-learning/mistral-7b-foundation-models-from-mistral-ai-are-now-available-in-amazon-sagemaker-jumpstart/

If the endpoint is already deployed, copy its name into the cell below to run the load test against it.

In [47]:
endpoint_name = "jumpstart-dft-hf-llm-mistral-7b-instruct"
print(endpoint_name)

jumpstart-dft-hf-llm-mistral-7b-instruct
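Before sending load, it can help to confirm that the endpoint is actually serving. The sketch below is one way to do that, assuming a Boto3 `sagemaker` client (the control-plane client, not `sagemaker-runtime`) and its `describe_endpoint` call. The helper name `endpoint_is_in_service` is hypothetical and not part of the original notebook.

```python
def endpoint_is_in_service(sm_client, endpoint_name):
    """Return True if the endpoint reports the status 'InService'.

    sm_client is assumed to be a Boto3 "sagemaker" client, or any object
    exposing a compatible describe_endpoint(EndpointName=...) method.
    """
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"] == "InService"
```

In the notebook this could be called as `endpoint_is_in_service(boto3.client("sagemaker"), endpoint_name)`, which requires permission to call `DescribeEndpoint`.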


In [48]:
import logging

import boto3
import json

# Create a Boto3 client for SageMaker Runtime
sagemaker_client = boto3.client("sagemaker-runtime")

max_tokens_to_sample = 200
 
# Define the prompt and other parameters
prompt = f"""
Write a long and high-quality story about two dogs. Make the story longer than {max_tokens_to_sample}

Rex and Charlie were best friends who did everything together. They lived next door to each other with their human families and spent all day playing in the backyard. Rex was a golden retriever, always happy and eager for fun. Charlie was a German shepherd, more serious but very loyal. 

Every morning, Rex and Charlie would wake up and bark excitedly, ready to start the day's adventures. Their families would let them out into the backyard and they'd run around chasing each other and sniffing for interesting smells. After tiring themselves out, they'd nap in the shade of the big oak tree, Rex's tail still thumping contentedly even in his sleep. 

"""

# hyperparameters for llm
parameters = {
    "max_new_tokens": max_tokens_to_sample,
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
}

contentType = 'application/json'

body = json.dumps({
    "inputs": prompt,
    # specify the parameters as needed
    "parameters": parameters
})


In [49]:
response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name, Body=body, ContentType=contentType)

# Process the response
response_body = json.loads(response.get('Body').read())

In [50]:
print(response_body[0]['generated_text'])

As the sun started to set, Rex and Charlie would start to play again. They'd chase balls and frisbees and even play fetch. They were both great swimmers and would often go for a swim in the nearby lake. They'd swim around and splash water at each other, laughing and having the time of their lives. 

But even when they weren't playing, Rex and Charlie were still close. They'd sit next to each other and watch the birds fly by or the squirrels scurry up the trees. They'd bark at the mailman and the paper delivery guy, but only to be friendly and let them know that they were there. 

One day, Rex and Charlie's families went on a camping trip. They packed up all their gear and hit the road, Rex and Charlie eagerly following. They drove for hours, passing through rolling hills and small towns.
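A single timed call gives a rough latency baseline to compare against the load-test numbers later. The sketch below is a generic timing wrapper; `timed` is a hypothetical helper, not something the original notebook defines.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    # perf_counter is a monotonic clock intended for measuring short durations
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_seconds = time.perf_counter() - start
    return result, elapsed_seconds
```

For example, `response, seconds = timed(sagemaker_client.invoke_endpoint, EndpointName=endpoint_name, Body=body, ContentType=contentType)` would time one request made with the objects defined above. A single sample is noisy, so treat it only as a sanity check.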


### Testing throughput and latency with Locust

In [51]:
!pip install locust



In [51]:
%%writefile locustfile.py

from locust import User, task, between
import logging

import boto3
import json

# Create a Boto3 client for SageMaker Runtime
sagemaker_client = boto3.client("sagemaker-runtime")

endpoint_name = "jumpstart-dft-hf-llm-mistral-7b-instruct"
max_tokens_to_sample = 200

# Define the prompt and other parameters
prompt = f"""
Write a long and high-quality story about two dogs. Make the story longer than {max_tokens_to_sample}

Rex and Charlie were best friends who did everything together. They lived next door to each other with their human families and spent all day playing in the backyard. Rex was a golden retriever, always happy and eager for fun. Charlie was a German shepherd, more serious but very loyal. 

Every morning, Rex and Charlie would wake up and bark excitedly, ready to start the day's adventures. Their families would let them out into the backyard and they'd run around chasing each other and sniffing for interesting smells. After tiring themselves out, they'd nap in the shade of the big oak tree, Rex's tail still thumping contentedly even in his sleep. 

"""

# hyperparameters for llm
parameters = {
    "max_new_tokens": max_tokens_to_sample,
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
}

contentType = 'application/json'

body = json.dumps({
    "inputs": prompt,
    # specify the parameters as needed
    "parameters": parameters
})


class LLMUser(User):
    @task
    def generation(self):
        # measure() times the block and reports it to Locust as a request
        # of type "[Send]" named "Prompt"
        with self.environment.events.request.measure("[Send]", "Prompt"):
            # Invoke the model
            response = sagemaker_client.invoke_endpoint(
                EndpointName=endpoint_name, Body=body, ContentType=contentType)
            # Process the response
            response_body = json.loads(response.get('Body').read())
            logging.info(response_body[0]['generated_text'])

        logging.info("Finished generation!")


Overwriting locustfile.py


Locust is configured here through command-line options (see https://docs.locust.io/en/stable/configuration.html). The options used in this test are:

--users <int> Peak number of concurrent Locust users. Primarily used together with --headless or --autostart.
    
--headless Disable the web interface, and start the test immediately.
    
--csv Store request stats to files in CSV format.

--spawn-rate <float> Rate to spawn users at (users per second)

In this example, the --users option sets the total number of users to 30, and the --spawn-rate option sets the rate of user spawning to 30 users per second. By using the same value for --spawn-rate as the total number of users, all 30 users will be spawned immediately. Therefore, at any given time during the test, there will be a maximum of 30 concurrent users.

The --run-time option sets the duration of the test; a bare number is interpreted as seconds. In this example, the test runs for 120 seconds and then stops.
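The options described above can also be assembled in Python, which keeps the CSV prefix in step with the user count when several load levels are tested. This is a minimal sketch: `build_locust_command` and the `benchmark_u<users>` naming scheme are assumptions modeled on the command used in this notebook.

```python
import shlex


def build_locust_command(users, spawn_rate, run_time, csv_dir="./benchmark_metric"):
    """Assemble a headless Locust invocation as an argument list."""
    # Derive the CSV prefix from the user count so file names match the run
    csv_prefix = f"{csv_dir}/benchmark_u{users}"
    return [
        "locust",
        "--headless",
        "--users", str(users),
        "--spawn-rate", str(spawn_rate),
        "--run-time", str(run_time),
        "--csv", csv_prefix,
    ]


command = build_locust_command(users=30, spawn_rate=30, run_time=120)
# Print the equivalent shell command for inspection
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` from the directory that contains `locustfile.py`.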

!locust --headless --users 30 --spawn-rate 30 --run-time 120 --csv ./benchmark_metric/benchmark_u30
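Once a run has finished, the --csv option leaves several files that share the given prefix, including `<prefix>_stats.csv`. The sketch below reads the aggregated row from that file. The column names (`Name`, `Request Count`, `Failure Count`, `Average Response Time`, `Requests/s`) are assumed from Locust 2.x output and should be checked against the header of your own file; `read_aggregated_stats` is a hypothetical helper.

```python
import csv


def read_aggregated_stats(stats_csv_path):
    """Return summary numbers from the 'Aggregated' row, or None if absent."""
    with open(stats_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Locust appends one summary row whose Name is "Aggregated"
            if row["Name"] == "Aggregated":
                return {
                    "requests": int(row["Request Count"]),
                    "failures": int(row["Failure Count"]),
                    "avg_response_ms": float(row["Average Response Time"]),
                    "requests_per_s": float(row["Requests/s"]),
                }
    return None
```

With the command above, the file to read would be `./benchmark_metric/benchmark_u30_stats.csv`.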

In [None]:
!locust --headless --users 30 --spawn-rate 30 --run-time 120 --csv ./benchmark_metric/benchmark_u30

[2024-02-26 04:32:20,603] ip-172-16-188-233.ec2.internal/INFO/locust.main: Run time limit set to 120 seconds
[2024-02-26 04:32:20,603] ip-172-16-188-233.ec2.internal/INFO/locust.main: Starting Locust 2.23.1
Type     Name  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated       0     0(0.00%) |      0       0       0      0 |    0.00        0.00

[2024-02-26 04:32:20,604] ip-172-16-188-233.ec2.internal/INFO/locust.runners: Ramping to 30 users at a rate of 30.00 per second
[2024-02-26 04:32:20,605] ip-172-16-188-233.ec2.internal/INFO/locust.runners: All users spawned: {"LLMUser": 30} (30 total users)
Type     Name  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
------