# Pixtral: A Comparative Analysis of Vision Models

This notebook provides a structured and in-depth comparison of Pixtral, a cutting-edge vision model, against select peers such as Amazon Nova Pro, Anthropic’s Haiku 3 (excluding version 3.5) and Llama 3.2 11b. Our primary goal is to evaluate Pixtral’s performance, identify its strengths and limitations, and establish best practices for integrating Pixtral into workflows that demand accurate and efficient image understanding.

To achieve this, we will conduct a series of controlled tests and qualitative assessments, leveraging services like the Converse API, Amazon Bedrock, and a SageMaker inference endpoint for Pixtral. In addition to exploring model outputs on various image types—ranging from general object recognition tasks to financial document analysis and handwriting transcription—we will employ a judging model (Sonnet 3.5) to systematically evaluate and rank the quality of responses.

Through this process, the notebook will:

- Demonstrate how to efficiently use Pixtral’s endpoints for real-time inference.
- Compare Pixtral’s capabilities to other leading vision models using standardized prompts and test images.
- Help you understand the relative advantages of Pixtral, guiding you in deciding when and how to deploy it in your own applications.

We have included licensing details and quick-start references for further exploration. By the end of this analysis, you should have a clear perspective on Pixtral’s performance profile and actionable insights into optimizing its use in your specific scenarios.

All example outputs have been preserved in this notebook, allowing you to review the results without needing to run the code on your own instance or pay for compute costs. 

## Use

- **License:** Apache 2.0 - Pixtral

## Getting Started

The instructions for how to get started using this notebook can be found in the [Pixtral LMI notebook](https://github.com/aws-samples/mistral-on-aws/blob/59ab4ab9736122200a2d284039cb4557782e4a20/notebooks/Pixtral-samples/Pixtral-12b-LMI-SageMaker-realtime-inference.ipynb)

Want to learn more about Pixtral? [Check out the Pixtral_capabilities notebook](https://github.com/aws-samples/mistral-on-aws/blob/main/notebooks/Pixtral-samples/Pixtral_capabilities.ipynb)

In [2]:
!pip install nvidia-ml-py3==7.352.0
!pip install dash-core-components==2.0.0
!pip install dash-html-components==2.0.0
!pip install dash-table==5.0.0
!pip install faiss-cpu==1.10.0
!pip uninstall --no-input jsonschema==4.23.0
!pip install --no-input jsonschema==4.18.0
!pip uninstall --no-input nltk==3.9
!pip install --no-input nltk==3.4.5
!pip uninstall --no-input numpy==1.26.4
!pip install --no-input numpy==2.0.0
!pip uninstall --no-input flask 3.1.0
!pip install --no-input flask==1.0.4
!pip uninstall --no-input werkzeug==3.1.3
!pip install --no-input werkzeug==3.1.1
!pip uninstall --no-input attrs==25.1.0
!pip install --no-input attrs==23.1.0
!pip uninstall --no-input pandas 2.2.3
!pip install --no-input pandas 0.17.1
!pip install mistral_common[opencv] mistral_common=="v1.4.4" numpy==1.26.4 pypdfium2==4.30.1 --force --quiet

[0mFound existing installation: jsonschema 4.23.0
Uninstalling jsonschema-4.23.0:
  Would remove:
    /opt/conda/bin/jsonschema
    /opt/conda/lib/python3.11/site-packages/jsonschema-4.23.0.dist-info/*
    /opt/conda/lib/python3.11/site-packages/jsonschema/*
[31mERROR: Exception:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/pip/_internal/cli/base_command.py", line 105, in _run_wrapper
    status = _inner_run()
             ^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pip/_internal/cli/base_command.py", line 96, in _inner_run
    return self.run(options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pip/_internal/commands/uninstall.py", line 106, in run
    uninstall_pathset = req.uninstall(
                        ^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pip/_internal/req/req_install.py", line 723, in uninstall
    uninstalled_pathset.remove(auto_confirm, verbose)
  

In [1]:
!pip uninstall -y -r jsonschema==4.23.0 omegaconf==2.3.0 numpy==2.0.0 werkzeug==3.1.1 fsspec==2025.2.0 pytz==2025.1 attrs==25.1.0 pandas==2.2.3
!pip install mistral_common[opencv] mistral_common=="v1.4.4" 'llmeter[plotting]' numpy==1.22.4 botocore==1.36.0 sqlparse==0.5.0 pypdfium2==4.30.1 omegaconf==2.1.1 numpy==1.22.4 werkzeug==3.0.0 fsspec==2023.6.0 pytz==2025.1 attrs==23.1.0 pandas==0.17.1 --force --quiet

[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'jsonschema==4.23.0'[0m[31m
[0m  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m [31m[17 lines of output][0m
  [31m   [0m   import pkg_resources
  [31m   [0m /opt/conda/lib/python3.11/site-packages/setuptools/__init__.py:94: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
  [31m   [0m !!
  [31m   [0m 
  [31m   [0m         ********************************************************************************
  [31m   [0m         Requirements should be satisfied by a PEP 517 installer.
  [31m   [0m         If you are using pip, you can try `pip install --use-pep517`.
  [31m   [0m         ********************************************************************************
  [31m   [0m 
  [31m   [0m !!
  [31m   [0m   dist.fetch

In [None]:
import re
import base64
import json
from PIL import Image
from io import BytesIO
from typing import List
import pypdfium2 as pdfium
from IPython.display import display, HTML

from llmeter.endpoints import BedrockConverseStream, SageMakerStreamEndpoint
from llmeter.experiments import LatencyHeatmap

import boto3
import sagemaker
from sagemaker.djl_inference import DJLModel

# Colors to display information
RESET = "\033[0m"
GREEN = "\033[38;5;29m"
BLUE = "\033[38;5;43m"
ORANGE = "\033[38;5;208m"
PURPLE = "\033[38;5;93m"
RED = "\033[38;5;196m"

In [None]:
bedrock_client = boto3.client('bedrock-runtime', region_name='us-west-2')

In [None]:
sess = sagemaker.Session() # sagemaker session for interacting with different AWS APIs

sagemaker_session_bucket = None # bucket to house artifacts
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role() # execution role for the endpoint
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region = sess.boto_region_name
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")

In [None]:
image_uri =f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124" 

# You can also obtain the image_uri programatically as follows.
# image_uri = image_uris.retrieve(framework="djl-lmi", version="0.30.0", region="us-west-2")

model = DJLModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mistralai/Pixtral-12B-2409",
        "HF_TOKEN": "HF_Token", #since the model "mistralai/Pixtral-12B-2409" is gated model, you need a HF_TOKEN & go to https://huggingface.co/mistralai/Pixtral-12B-2409 to be granted access
        "OPTION_ENGINE": "Python",
        "OPTION_MPI_MODE": "true",
        "OPTION_ROLLING_BATCH": "lmi-dist",
        "OPTION_MAX_MODEL_LEN": "8192", # this can be tuned depending on instance type + memory available
        "OPTION_MAX_ROLLING_BATCH_SIZE": "16", # this can be tuned depending on instance type + memory available
        "OPTION_TOKENIZER_MODE": "mistral",
        "OPTION_ENTRYPOINT": "djl_python.huggingface",
        "OPTION_TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_LIMIT_MM_PER_PROMPT": "image=4", # this can be tuned to control how many images per prompt are allowed
    }
)

In [None]:
predictor = model.deploy(instance_type="ml.g5.24xlarge", initial_instance_count=1)

In [None]:
def call_sagemaker_pdf_to_base64(file_path):
    pdf = pdfium.PdfDocument(file_path)
    images = []
    for page_index in range(len(pdf)):
        page = pdf[page_index]
        bitmap = page.render()
        images.append(bitmap)
    encoded_messages = []
    for i in range(len(images)):
        buffered = BytesIO()
        pil_image = images[i].to_pil()
        pil_image.save(buffered, format='PNG')
        img_byte = buffered.getvalue()
        base64_encoded = base64.b64encode(img_byte).decode('utf-8')
        encoded_messages.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_encoded}"
                }
            })
    return encoded_messages

def encode_image_to_data_url(image_path):
    """
    Reads an image from a local file path and encodes it to a data URL.
    """
    with open(image_path, 'rb') as image_file:
        image_bytes = image_file.read()
    base64_encoded = base64.b64encode(image_bytes).decode('utf-8')
    # Determine the image MIME type (e.g., image/jpeg, image/png)
    mime_type = Image.open(image_path).get_format_mimetype()
    data_url = f"data:{mime_type};base64,{base64_encoded}"
    return data_url

def send_images_to_model(predictor, prompt, image_paths):
    """
    Sends images and a prompt to the model and returns the response in plain text.
    """
    if isinstance(image_paths, str):
        image_paths = [image_paths]
    
    content_list = [{
        "type": "text",
        "text": prompt
    }]
    
    for image_path in image_paths:
        # Encode image to data URL
        if ".pdf" in image_path:
            content_list.extend(call_sagemaker_pdf_to_base64(image_path))
        else:
            data_url = encode_image_to_data_url(image_path)
            content_list.append({
                "type": "image_url",
                "image_url": {
                    "url": data_url
                }
                
            })
    
    payload = {
        "messages": [
            {
                "role": "user",
                "content": content_list
            }
        ],
        "max_tokens": 4000,
        "temperature": 0.1,
        "top_p": 0.9,
    }
    
    response = predictor.predict(payload)
    return response['choices'][0]['message']['content']

In [None]:
def get_image_format(image_path):
    with Image.open(image_path) as img:
        # Normalize the format to a known valid one
        fmt = img.format.lower() if img.format else 'jpeg'
        # Convert 'jpg' to 'jpeg'
        if fmt == 'jpg':
            fmt = 'jpeg'
    return fmt

def get_image_format(image_path):
    with Image.open(image_path) as img:
        # Normalize the format to a known valid one
        fmt = img.format.lower() if img.format else 'jpeg'
        # Convert 'jpg' to 'jpeg'
        if fmt == 'jpg':
            fmt = 'jpeg'
    return fmt

def call_bedrock_pdf_to_base64(file_path):
    pdf = pdfium.PdfDocument(file_path)
    images = []
    for page_index in range(len(pdf)):
        page = pdf[page_index]
        bitmap = page.render()
        images.append(bitmap)
    encoded_messages = []
    for i in range(len(images)):
        buffered = BytesIO()
        pil_image = images[i].to_pil()
        pil_image.save(buffered, format='PNG')
        img_byte = buffered.getvalue()
        encoded_messages.append({
            "image": {
                "format": "png",
                "source": {
                            "bytes": img_byte
                        }
            }
        })
    return encoded_messages

def call_bedrock_model(model_id=None, inference_arn=None, prompt="", image_paths=None, system_prompts=None, temperature=0.1, top_p=0.9, max_tokens=3000):
    if isinstance(image_paths, str):
        image_paths = [image_paths]
    if image_paths is None:
        image_paths = []
    if system_prompts is None:
        system_prompts = []

    # Start building the content array for the user message
    content_blocks = []

    # Include a text block if prompt is provided
    if prompt.strip():
        content_blocks.append({"text": prompt})

    # Add images as raw bytes
    for img_path in image_paths:
        if ".pdf" in img_path:
            content_blocks.extend(call_bedrock_pdf_to_base64(img_path))
        else:
            fmt = get_image_format(img_path)
            # Read the raw bytes of the image (no base64 encoding!)
            with open(img_path, 'rb') as f:
                image_raw_bytes = f.read()
    
            content_blocks.append({
                "image": {
                    "format": fmt,
                    "source": {
                        "bytes": image_raw_bytes
                    }
                }
            })

    # Construct the messages structure
    messages = [
        {
            "role": "user",
            "content": content_blocks
        }
    ]

    # Prepare additional kwargs if system prompts are provided
    kwargs = {}
    if system_prompts:
        kwargs["system"] = system_prompts

    # Build the arguments for the `converse` call
    converse_kwargs = {
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": 4000,
            "temperature": temperature,
            "topP": top_p
        },
        **kwargs
    }

    # Use inferenceArn if provided, otherwise use modelId
    if inference_arn:
        converse_kwargs["inferenceArn"] = inference_arn
    else:
        converse_kwargs["modelId"] = model_id

    # Call the converse API
    try:
        response = bedrock_client.converse(**converse_kwargs)
    
        # Parse the assistant response
        assistant_message = response.get('output', {}).get('message', {})
        assistant_content = assistant_message.get('content', [])
        result_text = "".join(block.get('text', '') for block in assistant_content)
    except Exception as e:
        result_text = f"Error message: {e}"
    return result_text

In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"Pixtral_data/cleaner.jpg"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = "Describe this image in a short paragraph."

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In the following step, we’ll use a LLM as a “judge” to compare the quality of each response. While this automated evaluation can offer valuable insights, it’s best supplemented with human judgment to ensure that the chosen response aligns with your specific goals. If all three outputs appear equally strong, your own criteria and preferences will guide the final decision.

For this demonstration, we’ll rely on Sonnet 3.5 as the judge. We’ll provide the original image and the three responses to determine which one emerges as the most accurate and helpful.

In [None]:
def evaluate_responses(image_path, nova_response, claude_response, llama_response, pixtral_response):
    evaluation_prompt = f"""Here is an image and three different AI models' descriptions of it. Please evaluate which model produced the best description and explain why.

Model A (Nova): {nova_response}

Model B (Claude): {claude_response}

Model C (Llama): {llama_response}

Model D (Pixtral): {pixtral_response}

Which model provided the best description? Please explain your reasoning and declare a winner."""

    judge_response = call_bedrock_model(
        model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
        prompt=evaluation_prompt,
        image_paths=image_path,
        temperature=0.1
    )
    print(f"{GREEN}Judge's Evaluation:{RESET}")
    print(f"{ORANGE}{judge_response}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Analyzing a Financial Statement

Next, we’ll examine an Amazon financial document using all three models. Note that Llama 3.2’s image input must not exceed 1120 x 1120 in resolution, which requires us to provide a lower-resolution version of the document for Llama. By contrast, Pixtral and Haiku have no such image resolution constraints. This limitation provides an early indicator of where model selection might depend on input requirements.

In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"Pixtral_data/AMZN-Q2-2024-Earning-High-Quality.png"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = """Analyze the attached image of an earnings report.

Extract Key Data: Identify and summarize main financial metrics:

Title

Revenue
Net income or loss
Earnings per share (EPS)
Operating expenses
Significant one-time items or adjustments
Diluted earnings per share
Insights:

Evaluate overall financial health based on profitability, revenue growth, or cost management.
Note any risks or positive signals impacting future performance.
Conclusion: Provide a brief summary of the company’s performance this quarter, highlighting potential growth areas or concerns for investors. If specific data isn't present, then leave blank.
"""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Handwriting Recognition

In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"Pixtral_data/a01-082u-01.png"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = """Analyze the image and transcribe any handwritten text present.
Convert the handwriting into a single, continuous string of text.
Maintain the original spelling, punctuation, and capitalization as written.
Ignore any printed text, drawings, or other non-handwritten elements in the image."""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Chart Analysis

In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"Pixtral_data/Amazon_Chart.png"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = """Analyze the attached image of the chart or graph. Your tasks are to:

Identify the type of chart or graph (e.g., bar chart, line graph, pie chart, etc.).
Extract the key data points, including labels, values, and any relevant scales or units.
Identify and describe the main trends, patterns, or significant observations presented in the chart.
Generate a clear and concise paragraph summarizing the extracted data and insights. The summary should highlight the most important information and provide an overview that would help someone understand the chart without seeing it.
Ensure that your summary is well-structured, accurately reflects the data, and is written in a professional tone.
"""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Image Captioning



In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"Pixtral_data/dresser.jpg"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = """Analyze the image and provide a detailed description of what you see. Include:

1. The main subject or focus of the image
2. Key elements or objects present
3. Colors, lighting, and overall mood
4. Spatial arrangement and composition
5. Any text or symbols visible
6. Actions or events taking place, if applicable
7. Background and setting details
8. Distinctive features or unusual aspects
9. Estimated time of day or season, if relevant
10. Overall context or type of scene (e.g., natural landscape, urban setting, indoor space)

Describe the image as if explaining it to someone who cannot see it. Be thorough but concise, focusing on the most important and interesting aspects of the image.
"""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Reasoning Over Complex Figures

In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"Pixtral_data/Amazon_Chart.png"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = """Analyze the following image and answer the following questions: 

-Which quarter had the highest net sales and which quarter had the lowest?
-What was the average net sale across quarters?
-What was the Q2 2023 & Q2 2025 net sales combined?
-Which quarter had the highest operating income and which quarter had the lowers?
-What was the average operating income across quarters?
-What was the Q2 2023 & Q2 2025 operating income combined?

"""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## FSI: Insurance Form Data Extraction

In [None]:
import base64
from IPython.display import IFrame
import IPython.display as display

# open PDF
pdf_path = f"Pixtral_data/insurance_90degree.pdf"
with open(pdf_path, "rb") as pdf:
    content = pdf.read()

# encode PDF
base64_pdf = base64.b64encode(content).decode("utf-8")

# display encoded PDF
print("PDF being analyzed:")
display.display(IFrame(f"data:application/pdf;base64,{base64_pdf}", width=1000, height=500))

prompt = """As a medical document analyzer, extract these fields from the insurance verification form and return as JSON:

Required fields:
- Patient name
- Date of birth 
- Policy number
- Insurance provider name
- Coverage start date
- Phone numbers (work/home)
- Subscriber name
- Group number
- Plan type (PPO/HMO)
- Verification date

Format the response as:
{
  'patient_info': {
    'name': string,
    'dob': string,
    'phone': {
      'work': string,
      'home': string
    }
  },
  'insurance_info': {
    'provider': string,
    'policy_number': string, 
    'group_number': string,
    'subscriber': string,
    'plan_type': string,
    'coverage_start': string,
    'verification_date': string
  }
}"""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=pdf_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=pdf_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=pdf_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=pdf_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=pdf_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Traffic Scene Analysis

In [None]:
from PIL import Image
import IPython.display as display

print("Image being analyzed:")
image_path = f"/home/sagemaker-user/pixtral/mistral-on-aws/notebooks/Pixtral-samples/Pixtral_data/airport_lanes.jpg"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = """Analyze the image and provide a detailed description of what you see in less than 200 words.
    
Additionally answer the following questions with maximum 30 words per question:
1. Which infrastructure can be identifed in the image?
2. Which lane do I need to follow if I want to depart by plane?
3. Which sports championship is advertised in the image?

Describe the image as if explaining it to someone who cannot see it. Be thorough but concise, focusing on the most important and interesting aspects of the image.
"""

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Mapping latency by input & output token counts

For many LLMs, the time to process a request can significantly depend on the length (in number of tokens) of the input provided and the output generated.

We can produce a heatmap showing how latency varies by these factors, to give an idea of how optimizing your input length or generation lengths might affect the response times observed by users.

The `LatencyHeatmap` experiment automatically generates a set of request payloads with varying (approximate) input lengths and uses it to test the endpoint.

To construct the requests, we need a base text to use as a seed. The semantic aspects are not particularly important, so any sufficiently long text can serve the purpose - but remember that many LLMs have their own internal guardrails, so it's possible that the model might decline to respond in some cases.

We'd like the generated reply to be limited by the `max_tokens` parameter (so the heatmap can measure latency for various output lengths), so will engineer a prompt that encourages the model to generate as long a response as possible from the seed text:

To use Bedrock's streaming API, we can instead connect with an LLMeter `BedrockConverseStream`. If the selected Jumpstart endpoint supports model streaming, we can instead create an LLMeter `SageMakerStreamEndpoint` to handle this.

In [None]:
def bedrock_prompt_fn(prompt, **kwargs):
    formatted_prompt = f"Create a story based on the following prompt: {prompt}"
    return BedrockConverseStream.create_payload(
        formatted_prompt, inferenceConfig={"temperature": 1.0}, **kwargs
    )

def sagemaker_prompt_fn(prompt, **kwargs):
    formatted_prompt = f"Create a story based on the following prompt: {prompt}"
    return SageMakerStreamEndpoint.create_payload(formatted_prompt, **kwargs)

With a seed text and prompt generation function, we're ready to set up our latency heatmapping experiment.

- The `source_file` and `create_payload_fn` will be used to generate requests with various input lengths.
- The set of `input_lengths` you'd like to test is approximate, since the locally-available tokenizer won't exactly match the one used internally by the model
- The set of `output_lengths` you'd like to test may not always be reached, if the model stops generating early for the given prompts.
- The `requests_per_combination` impacts both the time to run the test and the quality of your output statistics. Note for example that it doesn't make sense to consider p95 or p99 latency on a dataset with only 10 requests!
- A higher number of concurrent `clients` will speed up the overall test run, but could cause problems if you reach quota limits (on as-a-service models) or high request volumes that start to impact response latency (see the "Load testing" section below for more details!)

Similar to low-level test Runners, the `output_path` can be used to configure where the test result data should be saved (either locally or on the Cloud).

Here we'll use the same source text as LLMeter's own examples: The text of short story "Frankenstein" by Mary Shelley:

In [None]:
!curl -o Pixtral_data/MaryShelleyFrankenstein.txt \
    https://raw.githubusercontent.com/awslabs/llmeter/main/examples/MaryShelleyFrankenstein.txt

With a source text and a function (below) to format example requests from fragments of that text, we're ready to run our experiment to measure latency across various input and output lengths:

In [None]:
def get_latency_heatmap(
    model_id: str,
    endpoint_name=None,
    source_file="Pixtral_data/MaryShelleyFrankenstein.txt"
):
    if endpoint_name is None:
        endpoint_stream = BedrockConverseStream(
            model_id=model_id,
        )
        prompt_fn = bedrock_prompt_fn
    else:
        endpoint_stream = SageMakerStreamEndpoint(
            endpoint_name,
            model_id=model_id
        )
        prompt_fn = sagemaker_prompt_fn
    
    latency_heatmap = LatencyHeatmap(
        endpoint=endpoint_stream,
        clients=4,
        requests_per_combination=20,
        output_path=f"data/llmeter/{endpoint_stream.model_id}/heatmap",
        source_file=source_file,
        input_lengths=[50, 500, 1000],
        output_lengths=[128, 256, 512],
        create_payload_fn=prompt_fn,
    )

    return latency_heatmap

In [None]:
latency_heatmap_nova = get_latency_heatmap(model_id="us.amazon.nova-pro-v1:0")
heatmap_results_nova = await latency_heatmap_nova.run()

latency_heatmap_claude = get_latency_heatmap(model_id="anthropic.claude-3-haiku-20240307-v1:0")
heatmap_results_claude = await latency_heatmap_claude.run()

latency_heatmap_llama = get_latency_heatmap(model_id="us.meta.llama3-2-11b-instruct-v1:0")
heatmap_results_llama = await latency_heatmap_llama.run()

latency_heatmap_pixtral = get_latency_heatmap(model_id="Pixtral-12B-2409", endpoint_name=predictor.endpoint_name)
heatmap_results_pixtral = await latency_heatmap_pixtral.run()

Now, you'll be able to plot the heatmap results visually to explore how the latency varies with input and output token count:

In [None]:
print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
fig, axs = latency_heatmap_nova.plot_heatmap()

In [None]:
print(f"{GREEN}#### Claude Haiku Latency HeatMap:{RESET}")
fig, axs = latency_heatmap_claude.plot_heatmap()

In [None]:
print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
fig, axs = latency_heatmap_llama.plot_heatmap()

In [None]:
print(f"{GREEN}#### Pixtral Response:{RESET}")
fig, axs = latency_heatmap_pixtral.plot_heatmap()

For many models the overall `time_to_last_token` depends more strongly on the number of tokens *generated* by the model (`num_tokens_output`), while the `time_to_first_token` depends more strongly on the *input* length (`num_tokens_input`) if any significant correlation is present.

## Generate SQL from ER Diagrams

In [None]:
print("Image being analyzed:")
image_path = "/home/sagemaker-user/pixtral/mistral-on-aws/notebooks/Pixtral-samples/Pixtral_data/er-diagram.jpeg"
image = Image.open(image_path)
display.display(image)
print("\n")

prompt = "You are an expert on SQL. You have an ER diagram. Prepare PostgreSQL compatible SQL queries to create tables from this ER diagram."

response_nova = call_bedrock_model(
    model_id="us.amazon.nova-pro-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Amazon Nova Pro Response:{RESET}")
print(f"{BLUE}{response_nova}{RESET}")

response_claude = call_bedrock_model(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Claude Haiku Response:{RESET}")
print(f"{RED}{response_claude}{RESET}")

response_llama = call_bedrock_model(
    model_id="us.meta.llama3-2-11b-instruct-v1:0",
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Llama 3.2 11b Response:{RESET}")
print(f"{PURPLE}{response_llama}{RESET}")

response_pixtral = send_images_to_model(
    predictor=predictor,
    prompt=prompt,
    image_paths=image_path
)

print(f"{GREEN}#### Pixtral Response:{RESET}")
print(f"{ORANGE}{response_pixtral}{RESET}")

In [None]:
evaluate_responses(
    image_path=image_path,
    nova_response=response_nova,
    claude_response=response_claude,
    llama_response=response_llama,
    pixtral_response=response_pixtral
)

## Observations

Across multiple scenarios, the models—Nova, Claude Haiku, Llama 3.2, and Pixtral—demonstrated varying strengths and weaknesses. The evaluations were conducted with the assistance of a judging LLM (Sonnet 3.5), which assessed clarity, completeness, accuracy, and overall descriptive quality. The following summarizes the key observations from each test scenario:

**Household Cleaner Image:**

When describing two bottles of disinfectant surface cleaner, Pixtral’s response was judged superior. It provided a highly detailed, visually rich description, accurately capturing colors, branding elements, and design motifs that the other models overlooked. This suggests Pixtral’s strong capability for nuanced visual analysis of everyday objects.

**Financial Document Analysis:**

In the case of an Amazon financial statement image, Pixtral again excelled. It offered a well-structured, comprehensive breakdown of financial metrics and contextualized the company’s performance effectively. The model’s balanced approach—combining raw data extraction with insightful commentary—surpassed the more limited or less organized presentations from Claude and Llama.

**Handwriting Transcription:**

For handwriting recognition, none of the models performed perfectly. However, the judge deemed Claude’s response the least inaccurate, as it introduced fewer extraneous words compared to Llama and Pixtral. This scenario highlights a challenge for all tested models: accurately parsing handwritten text. While Claude edged ahead here, the overall quality from all three remained suboptimal. The author isn't entirely sure what the second to third to last word is. None of the models were correct, however, Claude did not introduce additional words while Pixtral and Llama did.

**Chart/Graph Interpretation:**

When analyzing a chart of North American segment results, Pixtral outperformed the others. It displayed an impressive level of detail, correctly interpreting data points, capturing year-over-year changes, and providing a clear, logical structure. The model’s ability to handle numeric data and present it contextually was a standout feature, reaffirming Pixtral’s strength in scenarios where clarity and thoroughness are paramount.

**Indoor Scene Description (Dresser Image):**

In describing a minimalist indoor space, Llama was selected as the winner. It provided granular detail, accurately noted subtle elements like a corkboard and shelf, and gave a clear sense of the environment’s purpose and ambiance. Although Pixtral and Claude produced competent descriptions, Llama’s richer detail and organization gave it the edge in this setting.

**Insurance Form Data Extraction:**

In analyzing medical insurance verification forms, Llama demonstrated a great performance by correctly identifying most information and maintaining proper JSON structure, while other models relied on generic placeholder data. Though Llama and Nova misinterpreted some information, their accurate field capture and standardized formatting outperformed Claude and Pixtral, highlighting both progress and persistent challenges in automated document processing.

**Traffic Scene Analysis**

In analyzing this traffic scene, Pixtral was selected as the winner. Compared to the other models, it excelled by providing a comprehensive and naturally flowing description of the image. When answering the prompt's questions, Pixtral spots information 'hidden' in one of the image's details, that was not detected by other models. However, it needs to be noted, that Pixtral provides incorrect information with respect to this detail.

**Mapping latency by input & output token counts**

Examining median (p50) performance, Pixtral demonstrates better first-token response times (0.07-0.94s), outperforming all other models. This advantage is expected, as Pixtral operates on Amazon SageMaker with dedicated resources and direct infrastructure access. The remaining models, running on Amazon Bedrock, are constrained by default service quotas that limit their concurrent requests and processing capacity. Claude Haiku maintains consistent performance with first-token times ranging from 0.3-2.22s. Llama 3.2 shows moderate performance with first-token times of 0.4-3.77s, while Amazon Nova Pro exhibits the highest latency, ranging from 3.01-12.2s.
The tail latencies (p99) indicate that Pixtral maintains relatively stable performance even in edge cases, while other models experience more significant performance degradation, particularly with larger token counts. Time-to-last-token metrics follow similar patterns, though with higher absolute values across all models.



### Overall Conclusions:

**Pixtral frequently delivered the most comprehensive and structured analyses, particularly for tasks requiring detailed, multi-level descriptions of data-rich images (e.g., financial statements, charts, and product details).** Its performance in these areas suggests that it is a strong candidate for use cases demanding thorough and accurate visual summarization.

Llama 3.2 excelled in capturing intricate details within certain contexts, as seen in the indoor scene description. Its strength appears to lie in careful observation and nuanced environmental portrayal, making it a good fit for tasks requiring a keen eye for subtle elements and layout.

Claude Haiku generally produced reasonable summaries but often lacked the depth or precision of the others. It performed best in the handwriting scenario, possibly due to simpler transcription logic relative to the errors the others introduced. While Claude’s descriptions are understandable and coherent, they may not always match the richer level of detail and analysis provided by Pixtral or Llama.

Nova excels at structured document processing, demonstrating superior performance in form data extraction through accurate field identification and JSON structuring. While showing limitations with dates and numbers, its performance indicates strong potential for document automation, fitting well within the broader ecosystem of specialized vision-language models.

In conclusion, all four models have their merits and shortcomings. Pixtral stands out for structured, data-heavy image analyses; Llama shines in scenario-based detail and compositional complexity; Nova excels at document processing and form data extraction; Claude is a steady if less detailed performer, excelling occasionally in simpler tasks like handwriting. Depending on the complexity of the use case and the type of image being processed, each model could be the right choice.

**Cleanup**
It's important to cleanup the provisioned resources to avoid incurring costs. You have two options to delete the endpoint created in this notebook.

Option 1 - **Cleanup using AWS Console**

In AWS Console, navigate to Amazon Bedrock service and click on Marketplace deployments under Foundation models. Here, select the deployed endpoint and click on Delete button.

Delete Endpoint

Upon clicking the Delete button, a confirmation popup shows up. Here you read the warning carefully and confirm deletion.

Option 2 - **Cleanup using Bedrock SDK**
You can run below cell to delete the endpoint.

In [None]:
bedrock_client.delete_marketplace_model_endpoint(endpointArn=endpoint_arn)