# Host the Mixtral-8x7B model on Amazon SageMaker using the LMI (TensorRT-LLM) container

In [3]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


## Import the relevant libraries and configure global variables

In [4]:
import boto3
import sagemaker
import json
import io
import numpy as np
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = session.boto_region_name  # region name of the current SageMaker Studio environment

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


## Create serving.properties file, upload model artifacts to S3 and specify the inference container

SageMaker Large Model Inference (LMI) containers can be used to host models without any additional inference code. You can configure the model server either through environment variables or a serving.properties file. Optionally, you can include a model.py for any pre- or post-processing and a requirements.txt file for any additional packages that need to be installed.

In this case we use the serving.properties file to configure the parameters and customize the LMI container behavior.

#### Create serving.properties 
Next, we create the serving.properties configuration file and specify the following settings:


- `engine`: The engine for DJL to use. In this case, we set it to `MPI`.
- `option.model_id`: This can be the S3 URI of the pre-trained model or the model id of a pre-trained model hosted in a model repository on huggingface.co (https://huggingface.co/models). In this case, we provide the model id for the Mixtral-8x7B model.
- `option.tensor_parallel_degree`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. We set this value to `max` (all GPUs available on the current instance).
- `option.max_rolling_batch_size`: Sets the maximum size of the continuous batch, defining how many sequences can be processed in parallel at any given time.
- `option.rolling_batch`: Set to enable continuous batching to optimize accelerator utilization and overall throughput.
- `option.model_loading_timeout`: Sets the timeout value (in seconds) for downloading and loading the model to serve inference.

For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
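As a quick sanity check before packaging, the settings described above can be parsed and inspected in plain Python. This is a minimal sketch: `parse_properties` is a hypothetical helper, and the keys it checks for are an illustrative choice rather than a requirement of the container.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only and trim whitespace around both parts
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


example = """
engine=MPI
option.model_id=mistralai/Mixtral-8x7B-v0.1
option.tensor_parallel_degree=max
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.model_loading_timeout = 7200
"""

settings = parse_properties(example)
# Illustrative check: these keys are the ones this notebook relies on
for key in ("engine", "option.model_id", "option.tensor_parallel_degree"):
    assert key in settings, f"missing setting: {key}"
print(settings)
```

Running a check like this locally catches typos in key names before the endpoint spends time starting up.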

In [5]:
%%writefile serving.properties
engine=MPI
option.model_id=mistralai/Mixtral-8x7B-v0.1
option.tensor_parallel_degree=max
option.rolling_batch=auto
option.max_rolling_batch_size=32
option.model_loading_timeout=7200

Writing serving.properties


#### We package the serving.properties configuration file in the tar.gz format, so that it meets SageMaker hosting requirements

In [6]:
%%sh
mkdir mixtral7b-model
mv serving.properties mixtral7b-model/
tar czvf mixtral7b-model.tar.gz mixtral7b-model/
rm -rf mixtral7b-model

mixtral7b-model/
mixtral7b-model/serving.properties
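If you prefer to stay in Python rather than shell out, the same packaging step can be sketched with the standard library `tarfile` module. The helper name `package_model_dir` is hypothetical, and the demo writes to a temporary directory so that it does not touch the files created above.

```python
import os
import tarfile
import tempfile


def package_model_dir(model_dir, tarball_path):
    """Create a gzip tarball with model_dir as its top-level folder and list its members."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(model_dir))
    with tarfile.open(tarball_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a temporary directory with a placeholder serving.properties
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "mixtral7b-model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "serving.properties"), "w") as f:
        f.write("engine=MPI\n")
    names = package_model_dir(model_dir, os.path.join(workdir, "mixtral7b-model.tar.gz"))

print(names)
```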


#### Configure the Image URI for the inference container

We configure the DJL LMI container with TensorRT-LLM as the backend engine (the `djl-tensorrtllm` framework). Also note that we pin the container version to 0.26.0, the latest at the time of writing.

In [7]:
image_uri = image_uris.retrieve(
        framework="djl-tensorrtllm",
        region=session.boto_session.region_name,
        version="0.26.0"
    )

#### Next we upload the local tarball (containing the serving.properties configuration file) to an S3 prefix 

In [None]:
s3_code_prefix = "large-model-lmi/code"
bucket = session.default_bucket()  # bucket to house artifacts
code_artifact = session.upload_data("mixtral7b-model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

## Create the SageMaker model object and Deploy the Model with the LMI container
 
We use the image URI for the DJL container and the S3 location to which the model serving artifacts tarball was uploaded to create the SageMaker model object.

The container downloads the model into the `/tmp` space on the container because SageMaker maps the `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB.

In [None]:
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

instance_type = "ml.p4d.24xlarge"
endpoint_name = sagemaker.utils.name_from_base("mixtral-lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             VolumeSizeInGB=30,
             container_startup_health_check_timeout=1800
            )

#### Note: Please ensure that the size of the mount is large enough to hold the model using the VolumeSizeInGB configuration above.
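To get a feel for how much space the weights need, a back-of-the-envelope estimate multiplies the parameter count by the bytes per parameter. The sketch below assumes roughly 46.7 billion total parameters for Mixtral-8x7B and 2 bytes per parameter (16-bit weights); it ignores tokenizer files, compiled engine artifacts, and other overhead, so treat the result as a lower bound.

```python
def estimate_weight_size_gb(num_parameters, bytes_per_parameter=2):
    """Rough size of the model weights in GB (10**9 bytes), excluding any overhead."""
    return num_parameters * bytes_per_parameter / 1e9


# Assumption: about 46.7 billion parameters in total for Mixtral-8x7B
mixtral_params = 46.7e9
fp16_size_gb = estimate_weight_size_gb(mixtral_params, bytes_per_parameter=2)
print(f"Approximate 16-bit weight size: {fp16_size_gb:.0f} GB")
```

Compare an estimate like this against the storage available to the container on your chosen instance type.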

## Generating log_prob and finish_reason as additional details as part of the output

As part of LMI 0.26.0, you can now use two additional fine-grained details about the generated output: `log_prob` and `finish_reason`.

* `log_prob` - the log probability assigned by the model to each token in the streamed sequence chunk. You can use these as a rough estimate of model confidence: the joint log probability of a sequence is the sum of the log probabilities of its tokens, which can be useful for scoring and ranking model outputs. Be mindful that LLM token probabilities are generally overconfident without calibration.

* `finish_reason` - the reason generation stopped, which can be reaching the maximum generation length, generating an end-of-sequence (EOS) token, or generating a user-defined stop token. This is returned with the last streamed sequence chunk.


You can enable these by passing `"details": True` within the `parameters` of your input to the model. Next, let's see how to generate these details, using a content generation example to understand their application.
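To make the two fields concrete, here is a small sketch that parses two hand-written response lines. The field names mirror the ones read later in this notebook (`token`, `log_prob`, `generated_text`, `details`, `finish_reason`); the `id` field and all values are made up for illustration and are not real model output.

```python
import json

# Illustrative lines only; the values are invented, not real model output
sample_lines = [
    '{"token": {"id": 1, "text": "Regular", "log_prob": -0.25}, "generated_text": null, "details": null}',
    '{"token": {"id": 2, "text": " exercise", "log_prob": -0.5}, '
    '"generated_text": "Regular exercise", "details": {"finish_reason": "length"}}',
]

log_probs = []
finish_reason = None
for line in sample_lines:
    chunk = json.loads(line)
    # Every chunk carries the log probability of its token
    log_probs.append(chunk["token"]["log_prob"])
    # Only the last chunk carries the details object
    details = chunk.get("details")
    if details is not None:
        finish_reason = details["finish_reason"]

print(log_probs, finish_reason)
```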

## Helper functions for processing streaming response from the model

First, we define a LineIterator class, which lazily fetches bytes from a response stream, buffers them, and breaks the buffer down into lines. The idea is to serve complete lines from the buffer and fetch more bytes from the stream only when the buffer does not yet hold a full line.

In [None]:
class LineIterator:
    """Iterate over newline-delimited lines in a streaming endpoint response."""

    def __init__(self, stream):
        # Iterator to get bytes from stream
        self.byte_iterator = iter(stream)
        # Buffer stream bytes until we get a full line
        self.buffer = io.BytesIO()
        # Track current reading position within buffer
        self.read_pos = 0

    def __iter__(self):
        # Make class iterable
        return self

    def __next__(self):
        while True:
            # Seek read position within buffer
            self.buffer.seek(self.read_pos)
            # Try reading a line from current position
            line = self.buffer.readline()
            # If we have a full line
            if line and line[-1] == ord('\n'):
                # Increment reading position past this line
                self.read_pos += len(line)
                # Return the line read without newline char
                return line[:-1]
            # Fetch next chunk from stream
            try:
                chunk = next(self.byte_iterator)
            # Handle end of stream
            except StopIteration:
                # Return trailing bytes that lack a final newline
                # (retrying here would loop forever)
                if line:
                    self.read_pos += len(line)
                    return line
                # Nothing left to read, so end the iteration
                raise
            # Add fetched bytes to end of buffer
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
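The buffering idea can be tried without an endpoint. The sketch below is a simplified, generator-based variant of the same technique (not the class above), fed with a simulated stream whose first JSON line is split across two events. The `PayloadPart`/`Bytes` layout follows the class above; the payload contents are invented.

```python
def iter_lines(stream):
    """Yield complete lines (without the newline) from PayloadPart events."""
    pending = b""
    for event in stream:
        pending += event["PayloadPart"]["Bytes"]
        # Emit every complete line currently held in the buffer
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    # Emit trailing bytes that were not terminated by a newline
    if pending:
        yield pending


# Simulated stream: the first JSON line is split across two events
fake_stream = [
    {"PayloadPart": {"Bytes": b'{"token": {"te'}},
    {"PayloadPart": {"Bytes": b'xt": "Hi"}}\n{"token": {"text": "!"}}\n'}},
]
lines = list(iter_lines(fake_stream))
print(lines)
```

Both lines come out whole even though the event boundaries do not align with the line boundaries, which is why the raw events cannot be passed to `json.loads` directly.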

### log_prob

Next, consider a use case where we are generating content. Specifically, let's assume that we are tasked with writing a brief paragraph about the benefits of exercising regularly for a lifestyle-focused website. In addition to generating the content, we want to output an indicative score of the model's confidence in it.

In [None]:
prompt="""Your task is to write a short paragraph in about 100 words about exercising regularly for a lifestyle focused website. Discuss benefits of regular exercises along with some tips for increasing exercise effectiveness"""

We invoke the model endpoint with our prompt and capture the generated response. Notice that we set details: True as a runtime parameter within the input to the model.

In [None]:
sm_client = boto3.client("sagemaker-runtime")

# set details: True as a runtime parameter within the input.
body = {"inputs": prompt, "parameters": {"max_new_tokens":512, "details": True}}
resp = sm_client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']

Since the log probability is generated for each output token, we append the individual log probabilities to a list. We also capture the complete generated text from the response.

In [None]:
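overall_log_prob = []
generated_text = ""

for line in LineIterator(event_stream):
    chunk = json.loads(line)
    token = chunk.get('token') or {}
    # Collect the log probability of every generated token
    if token.get('text') is not None:
        overall_log_prob.append(token['log_prob'])
    # The complete text can arrive in the same chunk as a token,
    # so check it independently rather than in an elif
    if chunk.get('generated_text') is not None:
        generated_text = chunk['generated_text']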

To calculate the overall confidence score, we take the mean of the individual token log probabilities and then exponentiate it. The result is the geometric mean of the token probabilities, a value between 0 and 1.

This is our inferred overall confidence score for the generated text, which in this case is a paragraph about benefits of exercising.
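As a small worked example of this score, the sketch below uses three hypothetical token probabilities (0.9, 0.5 and 0.8, chosen only for illustration) and shows that exponentiating the mean log probability gives the same number as the geometric mean of the probabilities.

```python
import math

# Hypothetical per-token log probabilities, for illustration only
token_log_probs = [math.log(0.9), math.log(0.5), math.log(0.8)]

# Joint log probability of the sequence is the sum over tokens
joint_log_prob = sum(token_log_probs)
# Dividing by the token count removes the penalty for longer sequences
mean_log_prob = joint_log_prob / len(token_log_probs)
score = math.exp(mean_log_prob)

# Same quantity computed directly as the geometric mean of the probabilities
geometric_mean = (0.9 * 0.5 * 0.8) ** (1 / 3)
print(round(score, 4), round(geometric_mean, 4))
```

Using the mean rather than the sum keeps scores comparable across outputs of different lengths, since the sum of log probabilities can only decrease as more tokens are generated.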

In [None]:
print(generated_text)
overall_score=np.exp(np.mean(overall_log_prob))      
print(f"\n\nOverall confidence score in the generated text: {overall_score}")

### finish_reason

Now let's build on the same use case, but assume that we are tasked with writing a longer article about the benefits of exercising regularly for a lifestyle-focused website. Additionally, we want to ensure that the output is not truncated because the generation length limit (maximum number of new tokens) was reached or because a stop token was encountered.

In [None]:
prompt="""Your task is to write a paragraph in about 500 words about exercising regularly for a lifestyle focused website. Discuss benefits of regular exercises along with some tips for increasing exercise effectiveness while reducing required time commitment"""

To accomplish this, we use the finish_reason attribute generated in the output, monitor its value and continue generating, until the entire output is generated.

1. We define an `inference` function that takes a `payload` input and calls the SageMaker endpoint, streams back a response, and processes the response to extract generated text.

2. The `payload` contains the prompt text as `inputs` and parameters like max tokens and details.

3. The response is read as a stream and processed line by line to collect the generated text tokens into a list. We also extract details such as `finish_reason`.

4. We call the `inference` function in a loop (chained requests), adding more context each time, and track the number of tokens generated and the number of requests sent until the model finishes.

In [None]:
def inference(payload):
    # Call SageMaker endpoint and get response stream
    resp = sm_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name, Body=json.dumps(payload), ContentType="application/json"
    )
    event_stream = resp['Body']
    text_output = []
    finish_reason = None
    for line in LineIterator(event_stream):
        chunk = json.loads(line)
        # Extract text tokens if present
        token = (chunk.get('token') or {}).get('text')
        if token is not None:
            text_output.append(token)
            print(token, end='')
        # Get finish reason if details present
        if chunk.get('details') is not None:
            finish_reason = chunk['details']['finish_reason']
            break
    # Return extracted output, finish reason and token length
    # (finish_reason stays None if the stream ended without details)
    return payload['inputs'] + ''.join(text_output), finish_reason, len(text_output)

# set details: True as a runtime parameter within the input.
payload = {"inputs": prompt,  "parameters": {"max_new_tokens":256, "details": True}} 

finish_reason = "length"
# Print initial output 
print(f"Output: {payload['inputs']}", end='')  
total_tokens = 0
total_requests = 0
while finish_reason == 'length':
    # Call inference and get extracts
    output_text, finish_reason, out_token_len = inference(payload)
    # Update payload for next request
    payload['inputs'] = output_text 
    total_tokens += out_token_len
    total_requests += 1
# Print metrics
print(f"\n\ntotal tokens generated: {total_tokens} \ntotal requests sent: {total_requests}")

As we see above, even though the `max_new_tokens` parameter is set to 256, we use the `finish_reason` detail attribute in the output to chain multiple requests to the endpoint until the entire output is generated.
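The chaining logic can be exercised offline with a stand-in for the endpoint. In the sketch below, `make_fake_inference` and `generate_until_done` are hypothetical helpers, the `"eos_token"` finish reason is an assumed value used only by the stub, and the `max_requests` cap is an added safeguard (not part of the loop above) against looping indefinitely if the model never stops on its own.

```python
def make_fake_inference(total_tokens_available):
    """Build a stand-in for the endpoint that 'generates' a fixed number of tokens."""
    state = {"remaining": total_tokens_available}

    def fake_inference(payload):
        n = min(payload["parameters"]["max_new_tokens"], state["remaining"])
        state["remaining"] -= n
        # Report "length" while more tokens would still have been generated
        reason = "length" if state["remaining"] > 0 else "eos_token"
        return payload["inputs"] + " tok" * n, reason, n

    return fake_inference


def generate_until_done(inference_fn, payload, max_requests=10):
    """Chain requests while generation stops only because of the length limit."""
    finish_reason, total_tokens, total_requests = "length", 0, 0
    while finish_reason == "length" and total_requests < max_requests:
        output_text, finish_reason, n = inference_fn(payload)
        payload["inputs"] = output_text  # feed the text so far back as context
        total_tokens += n
        total_requests += 1
    return payload["inputs"], finish_reason, total_tokens, total_requests


payload = {"inputs": "Prompt:", "parameters": {"max_new_tokens": 4, "details": True}}
text, reason, tokens, requests_sent = generate_until_done(make_fake_inference(10), payload)
print(reason, tokens, requests_sent)
```

Keep in mind that each chained request resends the full text generated so far, so the input grows with every request and is eventually bounded by the model's context window.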

## Cleanup the environment

In [None]:
session.delete_endpoint(endpoint_name)
session.delete_endpoint_config(endpoint_name)
model.delete_model()