### Mistral 7B deployment guide

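This guide follows the DJL sample notebook linked below: you deploy Mistral 7B to a SageMaker endpoint using the Large Model Inference (LMI) container from the AWS Deep Learning Containers (DLC), and then run inference against it.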


https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_mistral_7b.html

In [8]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [23]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

In [29]:
%%writefile serving.properties
engine=Python
option.model_id=mistralai/Mistral-7B-v0.1
option.dtype=fp16
option.task=text-generation
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.device_map=auto

Writing serving.properties
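
Before packaging, it can help to sanity-check the file that was just written. The sketch below parses simple `key=value` lines and runs a few checks. The helper names `parse_properties` and `check_serving_properties` are hypothetical, and the list of "required" keys is an assumption based on the file above, not an official LMI schema.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Not a key=value line: {raw_line!r}")
        props[key.strip()] = value.strip()
    return props


def check_serving_properties(props: dict) -> None:
    """Hypothetical sanity checks; the required keys are an assumption."""
    for key in ("engine", "option.model_id"):
        if key not in props:
            raise KeyError(f"missing required key: {key}")
    # tensor_parallel_degree should be a positive integer when present
    degree = int(props.get("option.tensor_parallel_degree", "1"))
    if degree < 1:
        raise ValueError("option.tensor_parallel_degree must be >= 1")
```

To use it in the notebook, read the file before the packaging cell moves it, for example `check_serving_properties(parse_properties(open("serving.properties").read()))`.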


In [30]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
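
If a shell is not available, the same tarball can be built with the standard library. This is a minimal sketch meant as an alternative to the shell cell above, not a replacement the original tutorial prescribes. `package_model_dir` is a hypothetical helper name, and it assumes the `mymodel` folder still exists (the shell cell deletes it at the end).

```python
import tarfile
from pathlib import Path


def package_model_dir(model_dir: str, tarball: str) -> list:
    """Create a gzip tarball of model_dir and return the archived member names."""
    src = Path(model_dir)
    with tarfile.open(tarball, "w:gz") as tar:
        # arcname keeps member paths relative, e.g. "mymodel/serving.properties"
        tar.add(str(src), arcname=src.name)
    # Re-open the archive to list what was actually written
    with tarfile.open(tarball, "r:gz") as tar:
        return tar.getnames()
```

Calling `package_model_dir("mymodel", "mymodel.tar.gz")` should list the same two entries that `tar czvf` printed above.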


In [31]:
image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.25.0"
    )

s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to ---> {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to ---> s3://sagemaker-us-east-1-70768********/large-model-lmi/code/mymodel.tar.gz


In [32]:
image_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.25.0-deepspeed0.11.0-cu118'

The deployment takes several minutes. Once it finishes, we can invoke the model from Python.

In [33]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name
            )

-----------!

If the endpoint is already deployed, you can copy its name here and go straight to the load test.

In [34]:
endpoint_name
print(endpoint_name)

lmi-model-2024-02-27-03-02-59-220
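
When reusing an existing endpoint, it is worth confirming that it is in service before sending traffic. The sketch below wraps the SageMaker `describe_endpoint` call and reads its `EndpointStatus` field. The helper names `endpoint_status` and `is_ready` are hypothetical, and the client is passed in so the functions stay easy to test.

```python
def endpoint_status(sm_client, endpoint_name: str) -> str:
    """Return the EndpointStatus string reported by DescribeEndpoint."""
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


def is_ready(sm_client, endpoint_name: str) -> bool:
    """True only when the endpoint reports the InService status."""
    return endpoint_status(sm_client, endpoint_name) == "InService"
```

In the notebook this could be called as `is_ready(boto3.client("sagemaker"), endpoint_name)`. Note that this uses the `sagemaker` control-plane client, not the `sagemaker-runtime` client used for invocation.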


In [50]:
import logging

import boto3
import json

# Create a Boto3 client for SageMaker Runtime
sagemaker_client = boto3.client("sagemaker-runtime")

max_tokens_to_sample = 200
 
# Define the prompt and other parameters
prompt = f"""
Write a long and high-quality story about two dogs. Make the story longer than {max_tokens_to_sample}

Rex and Charlie were best friends who did everything together. They lived next door to each other with their human families and spent all day playing in the backyard. Rex was a golden retriever, always happy and eager for fun. Charlie was a German shepherd, more serious but very loyal. 

Every morning, Rex and Charlie would wake up and bark excitedly, ready to start the day's adventures. Their families would let them out into the backyard and they'd run around chasing each other and sniffing for interesting smells. After tiring themselves out, they'd nap in the shade of the big oak tree, Rex's tail still thumping contentedly even in his sleep. 

"""

# hyperparameters for llm
parameters = {
    "max_new_tokens": max_tokens_to_sample,
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
}

contentType = 'application/json'

body = json.dumps({
    "inputs": prompt,
    # specify the parameters as needed
    "parameters": parameters
})


In [51]:
response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name, Body=body, ContentType=contentType)

# Process the response
response_body = json.loads(response.get('Body').read())

In [52]:
print(response_body['generated_text'])

In the afternoon, Rex and Charlie would go for a walk with their humans. They'd stroll down the street, stopping to greet other dogs and people along the way. Sometimes they'd stop for a treat at the local pet store, where they knew the owner and got lots of attention. 

At night, Rex and Charlie would snuggle up together on their favorite couch, watching their humans watch TV. They'd listen to the show and nudge each other when something funny happened. When their humans finally went to bed, Rex and Charlie would stay up a little longer, playing with their favorite toys or just snuggling together. 

Rex and Charlie's friendship was a joy to their families, who loved seeing them play and explore together. But more than that, their friendship was a joy to Rex and Charlie themselves. They were always there for each other, through good times and bad, and they always knew they
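
The cell above assumes the response is a JSON object with a `generated_text` key, which is what this endpoint returned. A slightly more defensive sketch is shown below. `extract_generated_text` is a hypothetical helper, and the list-shaped response it also accepts is an assumption (some text-generation servers return a list of objects), not something this tutorial confirms for LMI.

```python
import json


def extract_generated_text(raw_body: bytes) -> str:
    """Pull the generated text out of a JSON response body."""
    payload = json.loads(raw_body)
    # Assumed alternative shape: [{"generated_text": "..."}]
    if isinstance(payload, list) and payload:
        payload = payload[0]
    if isinstance(payload, dict) and "generated_text" in payload:
        return payload["generated_text"]
    raise ValueError(f"Unexpected response shape: {type(payload).__name__}")
```

With the response from `invoke_endpoint`, this would be called as `extract_generated_text(response["Body"].read())`.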


### Testing the throughput and latency with Locust

In [41]:
!pip install locust



In [53]:
%%writefile locustfile.py

from locust import User, task, between
import logging

import boto3
import json

# Create a Boto3 client for SageMaker Runtime
sagemaker_client = boto3.client("sagemaker-runtime")

endpoint_name = "lmi-model-2024-02-27-03-02-59-220"
max_tokens_to_sample = 200

# Define the prompt and other parameters
prompt = f"""
Write a long and high-quality story about two dogs. Make the story longer than {max_tokens_to_sample}

Rex and Charlie were best friends who did everything together. They lived next door to each other with their human families and spent all day playing in the backyard. Rex was a golden retriever, always happy and eager for fun. Charlie was a German shepherd, more serious but very loyal. 

Every morning, Rex and Charlie would wake up and bark excitedly, ready to start the day's adventures. Their families would let them out into the backyard and they'd run around chasing each other and sniffing for interesting smells. After tiring themselves out, they'd nap in the shade of the big oak tree, Rex's tail still thumping contentedly even in his sleep. 

"""

# hyperparameters for llm
parameters = {
    "max_new_tokens": max_tokens_to_sample,
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
}

contentType = 'application/json'

body = json.dumps({
    "inputs": prompt,
    # specify the parameters as needed
    "parameters": parameters
})


class LLMUser(User):
    @task
    def generation(self):
        # Invoke the model
        with self.environment.events.request.measure("[Send]", "Prompt"):
            response = sagemaker_client.invoke_endpoint(
                            EndpointName=endpoint_name, Body=body, ContentType=contentType)
            # Process the response
            response_body = json.loads(response.get('Body').read())
            logging.info(response_body['generated_text'])
            
        logging.info("Finished generation!")            


Overwriting locustfile.py


Locust is configured here through command-line options (see https://docs.locust.io/en/stable/configuration.html). The options used in this test are:

--users <int> Peak number of concurrent Locust users. Primarily used together with --headless or --autostart.
    
--headless Disable the web interface, and start the test immediately.
    
--csv Store request stats to files in CSV format.

--spawn-rate <float> Rate to spawn users at (users per second)
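
With `--csv <prefix>`, Locust writes several files, including `<prefix>_stats.csv` with one row per request name plus an `Aggregated` row. The sketch below reads that row to get throughput and latency. `read_aggregated_stats` is a hypothetical helper, and the column names are what Locust 2.x writes as far as we know, so check the header of your own file if a key is missing.

```python
import csv


def read_aggregated_stats(stats_csv_path: str) -> dict:
    """Return request count, failures, average latency (ms) and req/s for the Aggregated row."""
    with open(stats_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["Name"] == "Aggregated":
                return {
                    "requests": int(row["Request Count"]),
                    "failures": int(row["Failure Count"]),
                    "avg_ms": float(row["Average Response Time"]),
                    "rps": float(row["Requests/s"]),
                }
    raise ValueError("No 'Aggregated' row found in stats file")
```

For the run in this notebook, the path would be `./benchmark_metric/benchmark_u30_stats.csv`.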

In this example, the --users option sets the total number of users to 30, and the --spawn-rate option sets the rate of user spawning to 30 users per second. By using the same value for --spawn-rate as the total number of users, all 30 users will be spawned immediately. Therefore, at any given time during the test, there will be a maximum of 30 concurrent users.

Please note that the --run-time option sets the duration of the test in seconds. In this example, the test will run for 120 seconds before stopping.
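
As a rough guide for reading the results, throughput in a closed-loop test like this one is bounded by the number of concurrent users divided by the average response time (Little's law). The sketch below assumes each user sends its next request as soon as the previous one returns, which should hold here because `LLMUser` defines no `wait_time`. The function names are hypothetical, and the 6-second latency in the usage note is an illustrative value, not a measurement.

```python
def expected_requests_per_second(concurrent_users: int, avg_latency_seconds: float) -> float:
    """Little's law for a closed loop with zero think time: throughput = users / latency."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    if avg_latency_seconds <= 0:
        raise ValueError("avg_latency_seconds must be positive")
    return concurrent_users / avg_latency_seconds


def expected_total_requests(concurrent_users: int, avg_latency_seconds: float,
                            run_time_seconds: float) -> float:
    """Approximate number of requests completed over the whole run."""
    return expected_requests_per_second(concurrent_users, avg_latency_seconds) * run_time_seconds
```

For example, 30 users with a hypothetical 6-second average latency would give about 5 requests per second, or roughly 600 requests over a 120-second run. If the measured `req/s` is far below this estimate, failed requests or client-side bottlenecks are worth checking.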

!locust --headless --users 10 --spawn-rate 10 --run-time 120 --csv ./benchmark_metric/benchmark_u10

In [54]:
!locust --headless --users 30 --spawn-rate 30 --run-time 120 --csv ./benchmark_metric/benchmark_u30

[2024-02-27 03:47:43,428] ip-172-16-188-233.ec2.internal/INFO/locust.main: Run time limit set to 120 seconds
[2024-02-27 03:47:43,428] ip-172-16-188-233.ec2.internal/INFO/locust.main: Starting Locust 2.23.1
Type     Name  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated       0     0(0.00%) |      0       0       0      0 |    0.00        0.00

[2024-02-27 03:47:43,429] ip-172-16-188-233.ec2.internal/INFO/locust.runners: Ramping to 30 users at a rate of 30.00 per second
[2024-02-27 03:47:43,430] ip-172-16-188-233.ec2.internal/INFO/locust.runners: All users spawned: {"LLMUser": 30} (30 total users)
Type     Name  # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------||-------|-------------|-------|-------|-------|-------|--------|-----------
------