# Run Instance Performance Benchmarking for Pixtral-12B-2409

This Jupyter notebook is designed to benchmark the performance of the Pixtral 12B model on Amazon SageMaker across multiple instance types. The primary goal is to evaluate how different instance configurations impact the model’s inference time, throughput, and efficiency. The notebook includes detailed steps for loading the Pixtral 12B model, running inference tasks, and collecting performance metrics on a variety of SageMaker instance types. By comparing these metrics, users can gain insights into the optimal instance choice based on their specific workload requirements, whether they prioritize speed, scalability, or cost-effectiveness.

### awscurl

To generate load on the SageMaker endpoints during benchmarking, this notebook uses the [awscurl](https://github.com/okigan/awscurl) tool, a command-line utility for making authenticated (SigV4-signed) HTTP requests to AWS services, including SageMaker endpoints. Sending concurrent requests with awscurl lets us stress the endpoints and measure the model’s performance under load.

For more information on running Pixtral-12B-2409, please see the [Pixtral-12b-LMI-SageMaker-realtime-inference.ipynb](https://github.com/aws-samples/mistral-on-aws/blob/main/notebooks/Pixtral-samples/Pixtral-12b-LMI-SageMaker-realtime-inference.ipynb) notebook.

If you're interested in learning more about Pixtral capabilities, please see the [pixtral_capabilities.ipynb](https://github.com/aws-samples/mistral-on-aws/blob/main/notebooks/Pixtral-samples/Pixtral_capabilities.ipynb) notebook.


### Install Dependencies


Set up the tools and Python packages used in this notebook.


In [None]:
%pip install -Uq sagemaker boto3 huggingface_hub

In [None]:
!sudo apt-get update -qq > /dev/null
!sudo apt-get install -y default-jre wget > /dev/null

In [None]:
!wget --no-check-certificate --quiet https://www.github.com/frankfliu/junkyard/releases/download/v0.3.1/awscurl
!chmod +x awscurl

In [None]:
import base64
import json
import os
import re
from typing import List

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display, HTML
from PIL import Image
from sagemaker import image_uris
from sagemaker.djl_inference import DJLModel

Capture the SageMaker role and session information for use later in the notebook.


In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.Session()  # SageMaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

## Dataset Preparation

In this section, we prepare a dataset to generate load against the SageMaker endpoint by encoding images as part of the inference prompts. Although we’re using a specific set of images here, feel free to substitute others that better match your own use case. Note that each image, once encoded, may contribute roughly 1,500–2,500 tokens to the input payload.
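
As a rough sanity check on the per-image token range mentioned above, the sketch below estimates a token count from image dimensions. It rests on assumptions that are not stated in this notebook: that the vision encoder splits an image into 16x16-pixel patches, adds one separator token per row of patches, and downsizes images whose longer side exceeds 1024 pixels. The names `estimate_image_tokens`, `patch_size`, and `max_side` are illustrative only, and the results should be treated as approximate.

```python
import math


def estimate_image_tokens(width: int, height: int,
                          patch_size: int = 16, max_side: int = 1024) -> int:
    """Rough per-image token estimate under the assumptions in the text above."""
    # Assumption: oversized images are scaled down, preserving aspect ratio.
    scale = min(1.0, max_side / max(width, height))
    scaled_w = max(1, round(width * scale))
    scaled_h = max(1, round(height * scale))

    # One token per patch, plus one separator token per row of patches.
    cols = math.ceil(scaled_w / patch_size)
    rows = math.ceil(scaled_h / patch_size)
    return rows * cols + rows


for size in [(640, 480), (800, 600), (1024, 1024)]:
    print(size, estimate_image_tokens(*size))
```

Under these assumptions, mid-sized images land in or near the range quoted above, while images close to the assumed maximum resolution cost noticeably more.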


In [None]:
# Read an image file and return it as a base64-encoded data URL
def read_and_encode_image(image_path):
    """
    Reads an image from a local file path and encodes it to a data URL.
    """
    with open(image_path, 'rb') as image_file:
        image_bytes = image_file.read()
    base64_encoded = base64.b64encode(image_bytes).decode('utf-8')

    # Determine the image MIME type (e.g., image/jpeg, image/png)
    with Image.open(image_path) as image:
        mime_type = image.get_format_mimetype()
    return f"data:{mime_type};base64,{base64_encoded}"


def prepare_prompt_file(prompt, image_path, file_name):
    """
    Writes a chat-style JSON payload (text prompt plus one image) to
    `local_dataset_path`, which must be defined before this function is called.
    """
    data_url = read_and_encode_image(image_path)
    content_list = [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": data_url}},
    ]

    payload = {
        "messages": [{"role": "user", "content": content_list}],
        "max_tokens": 2000,
        "temperature": 0.1,
        "top_p": 0.9,
    }

    # Create the dataset directory if it does not exist yet
    os.makedirs(local_dataset_path, exist_ok=True)
    file_path = os.path.join(local_dataset_path, file_name)

    with open(file_path, 'w') as json_file:
        json.dump(payload, json_file, indent=4)

In [None]:
# The load-test dataset is stored as JSON files in this directory

local_dataset_path = "./Pixtral_benchmarking_data/"

prepare_prompt_file(prompt='extract product information from the image', 
                    image_path='Pixtral_data/cleaner.jpg', 
                    file_name='prompt1.json')



prepare_prompt_file(prompt='Analyze the image and transcribe any handwritten text present. Convert the handwriting into a single, continuous string of text. Maintain the original spelling, punctuation, and capitalization as written. Ignore any printed text, drawings, or other non-handwritten elements in the image.', 
                    image_path='Pixtral_data/a01-082u-01.png', 
                    file_name='prompt2.json')


prepare_prompt_file(prompt='As an interior designer, provide your comments on the aesthetics', 
                    image_path='Pixtral_data/dresser.jpg', 
                    file_name='prompt3.json')



prepare_prompt_file(prompt='for an e-commerce catalog, generate product description for the product in the image', 
                    image_path='Pixtral_data/luggage.jpg', 
                    file_name='prompt4.json')
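
Before generating load, it can help to confirm that each payload has the expected shape. The sketch below is a minimal check, assuming the payload layout produced by `prepare_prompt_file` (a single user message whose content mixes `text` and `image_url` parts). The helper name `summarize_payload` and the placeholder image bytes are hypothetical and exist only for illustration.

```python
import base64
import json


def summarize_payload(payload: dict) -> dict:
    """Count content parts and decoded image bytes in a chat-style payload."""
    content = payload["messages"][0]["content"]
    text_parts = [part for part in content if part["type"] == "text"]
    image_parts = [part for part in content if part["type"] == "image_url"]

    image_bytes = 0
    for part in image_parts:
        # A data URL looks like "data:<mime>;base64,<encoded bytes>"
        header, _, encoded = part["image_url"]["url"].partition(",")
        if not (header.startswith("data:") and header.endswith(";base64")):
            raise ValueError(f"Unexpected data URL header: {header!r}")
        image_bytes += len(base64.b64decode(encoded, validate=True))

    return {
        "text_parts": len(text_parts),
        "image_parts": len(image_parts),
        "image_bytes": image_bytes,
        "max_tokens": payload.get("max_tokens"),
    }


# Placeholder bytes standing in for a real image file (illustration only)
fake_image = b"not-a-real-image"
sample_payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "describe the image"},
            {"type": "image_url", "image_url": {
                "url": "data:image/png;base64," + base64.b64encode(fake_image).decode("utf-8")
            }},
        ],
    }],
    "max_tokens": 2000,
}

# Round-trip through JSON to mimic reading the payload back from disk
summary = summarize_payload(json.loads(json.dumps(sample_payload)))
print(summary)
```

To apply the same check to the generated files, load each `prompt*.json` from `local_dataset_path` with `json.load` and pass the result to the helper.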

## Create SageMaker Endpoints

In this section, we create multiple SageMaker endpoints. Each endpoint runs the same model on a different instance type.


In [None]:
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124"

# You can also obtain the image_uri programmatically as follows.
# image_uri = image_uris.retrieve(framework="djl-lmi", version="0.30.0", region=region)

model = DJLModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mistralai/Pixtral-12B-2409",
        "HF_TOKEN": "<HF_Token>",  # Pixtral-12B-2409 is a gated model: request access at https://huggingface.co/mistralai/Pixtral-12B-2409 and supply your Hugging Face token here
        "OPTION_ENGINE": "Python",
        "OPTION_MPI_MODE": "true",
        "OPTION_ROLLING_BATCH": "lmi-dist",
        "OPTION_MAX_MODEL_LEN": "8192", # this can be tuned depending on instance type + memory available
        "OPTION_MAX_ROLLING_BATCH_SIZE": "16", # this can be tuned depending on instance type + memory available
        "OPTION_TOKENIZER_MODE": "mistral",
        "OPTION_ENTRYPOINT": "djl_python.huggingface",
        "OPTION_TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_LIMIT_MM_PER_PROMPT": "image=4", # this can be tuned to control how many images per prompt are allowed
    }
)

In [None]:
# Deploy endpoint

endpoint_name = 'pixtral12b-on-ml-g5-12xlarge'
predictor12xlarge = model.deploy(instance_type="ml.g5.12xlarge", initial_instance_count=1, endpoint_name=endpoint_name)

In [None]:
# Deploy endpoint

endpoint_name = 'pixtral12b-on-ml-g5-24xlarge'
predictor24xlarge = model.deploy(instance_type="ml.g5.24xlarge", initial_instance_count=1, endpoint_name=endpoint_name)

In [None]:
# Deploy endpoint

endpoint_name = 'pixtral12b-on-ml-g5-48xlarge'
predictor48xlarge = model.deploy(instance_type="ml.g5.48xlarge", initial_instance_count=1, endpoint_name=endpoint_name)

In [None]:
# Deploy endpoint

endpoint_name = 'pixtral12b-on-ml-p4d-24xlarge'
predictorp4d24xlarge = model.deploy(instance_type="ml.p4d.24xlarge", initial_instance_count=1, endpoint_name=endpoint_name)
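
The four deployment cells above follow one naming pattern, so the instance types and endpoint names can be kept in a single place. The sketch below only builds that mapping; the helper `endpoint_name_for` and the `deployment_plan` dictionary are hypothetical names introduced for illustration, and the actual `model.deploy` call is left as a comment because it requires AWS access.

```python
INSTANCE_TYPES = [
    "ml.g5.12xlarge",
    "ml.g5.24xlarge",
    "ml.g5.48xlarge",
    "ml.p4d.24xlarge",
]


def endpoint_name_for(instance_type: str, prefix: str = "pixtral12b-on") -> str:
    """Build an endpoint name by replacing dots in the instance type with hyphens."""
    return f"{prefix}-{instance_type.replace('.', '-')}"


deployment_plan = {t: endpoint_name_for(t) for t in INSTANCE_TYPES}

for instance_type, name in deployment_plan.items():
    print(f"{instance_type} -> {name}")
    # In the notebook, each entry would correspond to a call such as:
    # model.deploy(instance_type=instance_type, initial_instance_count=1, endpoint_name=name)
```

Keeping the list in one place makes it easier to add or drop an instance type without editing several cells.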

## Run Benchmarks

In this section, we benchmark each endpoint with the same dataset, concurrency, and number of iterations.


In [None]:
# Helper function to export the current AWS credentials (including the temporary session token) as environment variables

def set_credentials():
    # Resolve credentials through the boto3 session
    credentials = boto3.Session().get_credentials().get_frozen_credentials()

    # Set environment variables
    os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
    os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
    if credentials.token:  # only present for temporary credentials
        os.environ["AWS_SESSION_TOKEN"] = credentials.token


In [None]:
# Helper function to extract a numeric value from the benchmark output

def extract_values_from_benchmark_output(output_data: List[str], field_name: str):
    # Matches lines of the form "<field_name>: <number> ..."
    pattern = rf"{re.escape(field_name)}:\s([\d\.]+)\s"

    # Return the first match found in the output lines
    for line in output_data:
        match = re.search(pattern, line)
        if match:
            return float(match.group(1))

    # Fail with a clear message if the field is missing from the output
    raise ValueError(f"'{field_name}' not found in benchmark output")


# Helper function to extract instance type from the endpoint name
def get_instance_type_from_endpoint_name(endpoint_name):
    # Create a SageMaker client
    sagemaker_client = boto3.client('sagemaker')

    # Describe the endpoint to get the endpoint configuration name
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    
    # Extract the endpoint configuration name
    endpoint_config_name = response['EndpointConfigName']
    
    # Describe the endpoint configuration to get the instance type
    config_response = sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
    
    # Extract the instance types from the configuration
    instance_types = []
    for variant in config_response['ProductionVariants']:
        instance_types.append(variant['InstanceType'])

    return instance_types[0]
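
When more than one field is needed from the same benchmark output, the regex approach used by `extract_values_from_benchmark_output` extends naturally to several field names at once. The following is a minimal sketch under that assumption. The function `extract_metrics` is hypothetical, and the sample lines are synthetic placeholders, not real awscurl output.

```python
import re
from typing import Dict, Iterable, List, Optional


def extract_metrics(lines: List[str], field_names: Iterable[str]) -> Dict[str, Optional[float]]:
    """Return the first numeric value found for each field, or None if absent."""
    results: Dict[str, Optional[float]] = {}
    for name in field_names:
        # Same shape as the notebook's pattern: "<name>: <number> <anything>"
        pattern = re.compile(rf"{re.escape(name)}:\s([\d.]+)\s")
        results[name] = None
        for line in lines:
            match = pattern.search(line)
            if match:
                results[name] = float(match.group(1))
                break
    return results


# Synthetic lines for illustration only; not real awscurl output
sample_lines = [
    "Example Metric A: 12.5 units",
    "an unrelated line",
    "Example Metric B: 300 units",
]

metrics = extract_metrics(sample_lines, ["Example Metric A", "Example Metric B", "Missing Metric"])
print(metrics)
```

Returning `None` for missing fields is one possible design choice; it lets the caller decide whether an absent metric should stop the run or simply be skipped when plotting.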

In [None]:
# Helper function to run benchmarks on provided endpoint

def run_benchmark(endpoint_name):
    endpoint_url = f'https://runtime.sagemaker.{region}.amazonaws.com/endpoints/{endpoint_name}/invocations'
    print(endpoint_url)
    os.environ["ENDPOINT_URL"] = endpoint_url
    
    set_credentials()
    print('Starting Benchmarking...')
    output = !./awscurl -c 5 -N 50 -X POST $ENDPOINT_URL \
    --connect-timeout 120   -H "Content-Type: application/json" --dataset Pixtral_benchmarking_data -t -n sagemaker 

    print('Finished Benchmarking')
    return output


In [None]:
# Maintain a list of results to display in a graph

benchmark_results = []

In [None]:
# Run the benchmarks on ml.g5.12xlarge instance

endpoint_name = predictor12xlarge.endpoint_name

output = run_benchmark(endpoint_name)

instance_type = get_instance_type_from_endpoint_name(endpoint_name)
avg_latency = extract_values_from_benchmark_output(output, 'Average Latency')

benchmark_results.append({'InstanceType': instance_type, 'AverageLatency': avg_latency})

In [None]:
# Run the benchmarks on ml.g5.24xlarge instance

endpoint_name = predictor24xlarge.endpoint_name

output = run_benchmark(endpoint_name)

instance_type = get_instance_type_from_endpoint_name(endpoint_name)
avg_latency = extract_values_from_benchmark_output(output, 'Average Latency')

benchmark_results.append({'InstanceType': instance_type, 'AverageLatency': avg_latency})

In [None]:
# Run the benchmarks on ml.g5.48xlarge instance

endpoint_name = predictor48xlarge.endpoint_name

output = run_benchmark(endpoint_name)

instance_type = get_instance_type_from_endpoint_name(endpoint_name)
avg_latency = extract_values_from_benchmark_output(output, 'Average Latency')

benchmark_results.append({'InstanceType': instance_type, 'AverageLatency': avg_latency})

In [None]:
# Run the benchmarks on ml.p4d.24xlarge instance

endpoint_name = predictorp4d24xlarge.endpoint_name

output = run_benchmark(endpoint_name)

instance_type = get_instance_type_from_endpoint_name(endpoint_name)
avg_latency = extract_values_from_benchmark_output(output, 'Average Latency')

benchmark_results.append({'InstanceType': instance_type, 'AverageLatency': avg_latency})

In [None]:
# load results list in a dataframe and display as a bar chart

def display_results(results: list):

    # Create a dataframe from the list
    df = pd.DataFrame(results)
    plt.figure(figsize=(8, 6))
    
    # Plot a bar chart using the DataFrame
    plt.bar(df['InstanceType'], df['AverageLatency'], color='skyblue')
    
    # Add titles and labels
    plt.title('Average Latency by Instance Type', fontsize=14)
    plt.xlabel('Instance Type', fontsize=12)
    plt.ylabel('Average Latency (ms)', fontsize=12)
    
    # Display the graph
    plt.show()


In [None]:
# display results

display_results(benchmark_results)

## Observations

The preliminary results suggest that a larger instance does not necessarily yield lower latency under the current test conditions. We hypothesize that three factors may contribute: additional overhead from coordinating more GPUs, less-than-optimal parallelization parameters, and concurrency levels too low to fully use the available hardware. Further tests and tuning are needed to confirm these hypotheses and adjust the parameters accordingly.
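
To reason about whether concurrency is high enough, average latency can be related to throughput. The sketch below applies Little's law under the assumption that the load generator is closed-loop, meaning each concurrent client sends its next request only after the previous one returns; in that case throughput is approximately concurrency divided by average latency. The function name and the latency values are placeholders for illustration, not measured results.

```python
def estimated_throughput_rps(concurrency: int, avg_latency_ms: float) -> float:
    """Approximate requests per second for a closed-loop load test (Little's law)."""
    if concurrency <= 0 or avg_latency_ms <= 0:
        raise ValueError("concurrency and avg_latency_ms must be positive")
    return concurrency / (avg_latency_ms / 1000.0)


# Placeholder latencies (ms), not measurements; concurrency 5 matches the -c 5 flag used above
for latency_ms in (1000.0, 2000.0, 4000.0):
    rps = estimated_throughput_rps(concurrency=5, avg_latency_ms=latency_ms)
    print(f"avg latency {latency_ms:.0f} ms -> ~{rps:.2f} requests/s")
```

If average latency stays roughly flat as concurrency rises, the endpoint still has headroom, which is one way to probe the low-concurrency hypothesis above.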


### Cleanup

Do not forget to clean up your resources to avoid ongoing SageMaker endpoint charges in your account.


In [None]:
# Delete the endpoints, their configurations, and the model
predictor12xlarge.delete_endpoint()
predictor24xlarge.delete_endpoint()
predictor48xlarge.delete_endpoint()
predictorp4d24xlarge.delete_endpoint()
model.delete_model()