# In-Depth Exploration of Pixtral's Capabilities

This notebook explores the capabilities of **Pixtral**, a multimodal language model, and examines its performance across a variety of tasks, including:

- **Optical Character Recognition (OCR)**
- **Image Classification**
- **Object Detection**
- **Image Captioning**
- **Visual Question Answering (VQA)**
- **Handwriting Recognition**
- **Chart Analysis and Structured Data Extraction**

The objective of this notebook is to provide you with a clear understanding of where Pixtral excels and to identify areas where it may face challenges. While these insights are based on our observations, your experiences and results may vary depending on the datasets and use cases you explore.

### Pixtral 12B in Short

- **Natively multimodal:** Trained with interleaved image and text data.
- **Strong performance on multimodal tasks:** Excels in instruction following.
- **State-of-the-art text-only benchmarks:** Maintains top performance in text-based evaluations.

### Architecture

- **Vision Encoder:** New 400M parameter encoder trained from scratch.
- **Multimodal Decoder:** 12B parameter decoder based on Mistral Nemo.
- **Flexible Image Support:** Handles variable image sizes and aspect ratios.
- **Long Context Window:** Supports multiple images within a 128k token context.

### Use

- **License:** Apache 2.0

Pixtral is trained to understand both natural images and documents, achieving 52.5% on the MMMU reasoning benchmark, surpassing a number of larger models. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.

<div style="display: flex; gap: 20px; justify-content: center; align-items: center;">
  <img src="https://mistral.ai/images/news/pixtral-12b/pixtral-benchmarks.png" alt="Benchmarks" style="width: 45%; height: auto;"/>
  <img src="https://mistral.ai/images/news/pixtral-12b/pixtral-comparison.png" alt="Evals" style="width: 45%; height: auto;"/>
</div>

## Getting Started

[The instructions for how to get started using this notebook can be found in the Pixtral LMI notebook](https://github.com/aws-samples/mistral-on-aws/blob/59ab4ab9736122200a2d284039cb4557782e4a20/notebooks/Pixtral-samples/Pixtral-12b-LMI-SageMaker-realtime-inference.ipynb)



In [None]:
import boto3
import sagemaker
from sagemaker.djl_inference import DJLModel

In [None]:
sess = sagemaker.Session() # sagemaker session for interacting with different AWS APIs

sagemaker_session_bucket = None # bucket to house artifacts
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role() # execution role for the endpoint
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region = sess.boto_region_name
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")

In [None]:
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124"

# You can also obtain the image_uri programmatically as follows
# (this requires `from sagemaker import image_uris`).
# image_uri = image_uris.retrieve(framework="djl-lmi", version="0.30.0", region=region)

model = DJLModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mistralai/Pixtral-12B-2409",
        "HF_TOKEN": "<HF_Token>",  # mistralai/Pixtral-12B-2409 is a gated model, so a Hugging Face token is required
        "OPTION_ENGINE": "Python",
        "OPTION_MPI_MODE": "true",
        "OPTION_ROLLING_BATCH": "lmi-dist",
        "OPTION_MAX_MODEL_LEN": "8192", # this can be tuned depending on instance type + memory available
        "OPTION_MAX_ROLLING_BATCH_SIZE": "16", # this can be tuned depending on instance type + memory available
        "OPTION_TOKENIZER_MODE": "mistral",
        "OPTION_ENTRYPOINT": "djl_python.huggingface",
        "OPTION_TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_LIMIT_MM_PER_PROMPT": "image=4", # this can be tuned to control how many images per prompt are allowed
    }
)

## Performance Considerations

To use the Pixtral 12B model with its full capabilities, including the maximum context window of 128k tokens and support for multiple images, consider an ml.p4d.24xlarge instance. This instance provides the GPU memory and compute needed for that configuration.

If you prefer a balance between performance and cost, using the Pixtral 12B model at half precision on an ml.g5.12xlarge instance is a great choice. This setup handles context windows up to 8192 tokens efficiently and supports multiple images per prompt.

For a minimal setup that keeps costs low, you can run the Pixtral 12B model on an ml.g5.xlarge instance. This is suitable for basic tasks with smaller context windows and single-image prompts.

These recommendations are the author's back-of-the-napkin math, so feel free to experiment. The deployment cell below uses an ml.g5.24xlarge instance.
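As a rough illustration of that kind of estimate, the sketch below computes the memory taken by the model weights alone at different precisions, using the parameter counts from the Architecture section. The helper name `weight_memory_gb` is hypothetical, and the figures exclude the KV cache, activations, and framework overhead, so treat them as a lower bound rather than a sizing guarantee.

```python
# Back-of-the-napkin estimate of GPU memory needed for the model weights alone.
# Assumption: memory ~= parameter count x bytes per parameter (no KV cache, no activations).
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

DECODER_PARAMS = 12e9   # 12B multimodal decoder (from the Architecture section)
ENCODER_PARAMS = 0.4e9  # 400M vision encoder (from the Architecture section)


def weight_memory_gb(num_params, bytes_per_param):
    """Return the approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = DECODER_PARAMS + ENCODER_PARAMS
for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: ~{weight_memory_gb(total_params, nbytes):.1f} GB")
```

At half precision the weights come to roughly 25 GB before any context is processed. Compare that with the total GPU memory of the instance you choose, and leave headroom for longer context windows and additional images.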

In [None]:
predictor = model.deploy(instance_type="ml.g5.24xlarge", initial_instance_count=1)

## How many tokens make up an image?

Images are passed through the vision encoder at their native resolution and aspect ratio, producing one image token for each 16x16 patch in the image. These tokens are then flattened into a sequence, with `[IMG_BREAK]` tokens added between rows and an `[IMG_END]` token added at the end of the image. The `[IMG_BREAK]` tokens let the model distinguish between images of different aspect ratios that have the same number of tokens.

For example, a 1024x1024 image yields (1024/16)**2 = 4096 patch tokens, plus 63 `[IMG_BREAK]` tokens and one `[IMG_END]` token, for 4160 tokens in total. The image of luggage in the Pixtral data folder is 512x512 and comes to 1056 tokens by the same count (1024 patch tokens plus 32 special tokens).

[More information can be found in the Pixtral blogpost](https://mistral.ai/news/pixtral-12b/)
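The token arithmetic in this section can be sketched as a small function. This is an estimate under stated assumptions: one token per 16x16 patch, one break token between consecutive rows, one end token per image, partial patches rounded up, and no resizing before encoding. The name `estimate_image_tokens` is hypothetical and not part of any Pixtral library. The count is consistent with the 1056 tokens quoted for the 512x512 luggage image.

```python
import math

PATCH_SIZE = 16  # patch edge length in pixels, as described in this section


def estimate_image_tokens(width, height, patch_size=PATCH_SIZE):
    """Estimate how many tokens an image occupies in the context window."""
    cols = math.ceil(width / patch_size)   # patches per row (assumes rounding up)
    rows = math.ceil(height / patch_size)  # number of patch rows
    patch_tokens = rows * cols             # one token per patch
    break_tokens = rows - 1                # one break token between consecutive rows
    end_tokens = 1                         # one end token closes the image
    return patch_tokens + break_tokens + end_tokens


print(estimate_image_tokens(512, 512))    # 1056
print(estimate_image_tokens(1024, 1024))  # 4160
```

An estimate like this helps when budgeting `OPTION_MAX_MODEL_LEN`: with a limit of 8192 tokens, a single 1024x1024 image already uses about half of the context.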

## Exploring Use Cases

The use cases in this section rely on two helper functions:

- `encode_image_to_data_url(image_path)`: Converts an image file into a base64-encoded data URL.
- `send_images_to_model(predictor, prompt, image_paths)`: Sends a text prompt and images (as data URLs) to the model and returns the response text.

In [None]:
import base64

from PIL import Image


def encode_image_to_data_url(image_path):
    """
    Reads an image from a local file path and encodes it to a data URL.
    """
    with open(image_path, 'rb') as image_file:
        image_bytes = image_file.read()
    base64_encoded = base64.b64encode(image_bytes).decode('utf-8')
    # Determine the image MIME type (e.g., image/jpeg, image/png)
    with Image.open(image_path) as image:
        mime_type = image.get_format_mimetype()
    data_url = f"data:{mime_type};base64,{base64_encoded}"
    return data_url

def send_images_to_model(predictor, prompt, image_paths):
    """
    Sends images and a prompt to the model and returns the response in plain text.
    """
    if isinstance(image_paths, str):
        image_paths = [image_paths]
    
    content_list = [{
        "type": "text",
        "text": prompt
    }]
    
    for image_path in image_paths:
        # Encode image to data URL
        data_url = encode_image_to_data_url(image_path)
        
        content_list.append({
            "type": "image_url",
            "image_url": {
                "url": data_url
            }
        })
    
    payload = {
        "messages": [
            {
                "role": "user",
                "content": content_list
            }
        ],
        "max_tokens": 2000,
        "temperature": 0.1,
        "top_p": 0.9,
    }
    
    response = predictor.predict(payload)
    return response['choices'][0]['message']['content']

## OCR



In [None]:
prompt = "Extract and transcribe all text visible in the image, preserving its exact formatting, layout, and any special characters. Include line breaks and maintain the original capitalization and punctuation."
image_path = "Pixtral_data/amazon_s1_2.jpg"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

Legal and financial terminology was accurately recognized, which is crucial for documents like registration statements. The model effectively captured sections, such as headers and subheaders (e.g., "### CALCULATION OF REGISTRATION FEE"), indicating a good understanding of hierarchical text structures.

In [None]:
prompt = """
Analyze the attached image of an earnings report.

Extract Key Data: Identify and summarize main financial metrics:

Title

Revenue
Net income or loss
Earnings per share (EPS)
Operating expenses
Significant one-time items or adjustments
Diluted earnings per share
Insights:

Evaluate overall financial health based on profitability, revenue growth, or cost management.
Note any risks or positive signals impacting future performance.
Conclusion: Provide a brief summary of the company’s performance this quarter, highlighting potential growth areas or concerns for investors. If specific data isn't present, then leave blank.
"""
image_path = "Pixtral_data/AMZN-Q2-2024-Earnings-Release.jpg"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

When Pixtral is provided with a low-resolution image, the model may hallucinate or misinterpret the image data. In this instance, Pixtral incorrectly extracted the dates from our earnings report due to the poor image quality.

In [None]:
prompt = """
Analyze the attached image of an earnings report.

Extract Key Data: Identify and summarize main financial metrics:

Title

Revenue
Net income or loss
Earnings per share (EPS)
Operating expenses
Significant one-time items or adjustments
Diluted earnings per share
Insights:

Evaluate overall financial health based on profitability, revenue growth, or cost management.
Note any risks or positive signals impacting future performance.
Conclusion: Provide a brief summary of the company’s performance this quarter, highlighting potential growth areas or concerns for investors. If specific data isn't present, then leave blank.
"""
image_path = "Pixtral_data/AMZN-Q2-2024-Earning-High-Quality.png"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

In contrast, when we use the same image at a higher resolution, Pixtral generates the correct completion.
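Since resolution made the difference here, it can help to check an image's dimensions before sending it. The sketch below is one way to do that. The helper `check_resolution` and the 1000-pixel threshold are hypothetical and should be tuned for your own documents. It is a heuristic for flagging risky inputs, not a guarantee of correct extraction.

```python
MIN_SHORT_SIDE_PX = 1000  # hypothetical threshold; tune for your own documents


def check_resolution(width, height, min_short_side=MIN_SHORT_SIDE_PX):
    """Return a warning string if the image's shorter side is below the threshold, else None."""
    short_side = min(width, height)
    if short_side < min_short_side:
        return f"low resolution ({width}x{height}): small text may be misread"
    return None


# With Pillow, the dimensions can be read as: width, height = Image.open(image_path).size
print(check_resolution(640, 480))
print(check_resolution(2550, 3300))
```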

## Handwriting Recognition

In [None]:
prompt = "Analyze the image and transcribe any handwritten text present. Convert the handwriting into a single, continuous string of text. Maintain the original spelling, punctuation, and capitalization as written. Ignore any printed text, drawings, or other non-handwritten elements in the image."
image_path = "Pixtral_data/a01-082u-01.png"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

## Chart Analysis

In [None]:
prompt = """

Analyze the attached image of the chart or graph. Your tasks are to:

Identify the type of chart or graph (e.g., bar chart, line graph, pie chart, etc.).
Extract the key data points, including labels, values, and any relevant scales or units.
Identify and describe the main trends, patterns, or significant observations presented in the chart.
Generate a clear and concise paragraph summarizing the extracted data and insights. The summary should highlight the most important information and provide an overview that would help someone understand the chart without seeing it.
Ensure that your summary is well-structured, accurately reflects the data, and is written in a professional tone.
"""
image_path = "Pixtral_data/Amazon_Chart.png"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

## Image Captioning

In [None]:
prompt = """
Analyze the image and provide a detailed description of what you see. Include:

1. The main subject or focus of the image
2. Key elements or objects present
3. Colors, lighting, and overall mood
4. Spatial arrangement and composition
5. Any text or symbols visible
6. Actions or events taking place, if applicable
7. Background and setting details
8. Distinctive features or unusual aspects
9. Estimated time of day or season, if relevant
10. Overall context or type of scene (e.g., natural landscape, urban setting, indoor space)

Describe the image as if explaining it to someone who cannot see it. Be thorough but concise, focusing on the most important and interesting aspects of the image.
"""
image_path = "Pixtral_data/3a1SR_oZI0-dCEvLG7US5g.jpg"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

## Object Detection

In [None]:
prompt = """
Analyze the image and identify all distinct objects present. For each object detected:

1. Name the object
2. Specify its approximate location in the image (e.g., top-left, center, bottom-right)
3. Estimate its size relative to the image (e.g., small, medium, large)
4. Note any relevant characteristics (color, shape, condition)
5. Identify if it's partially obscured or fully visible

List all objects detected, including people, animals, vehicles, furniture, buildings, natural elements, and any other identifiable items. If multiple instances of the same object type are present, count and report them separately. Ignore very small or indistinct elements that can't be clearly identified. If applicable, note any obvious interactions or relationships between objects.
"""
image_path = "Pixtral_data/dresser.jpg"  # Replace with your local image path
response = send_images_to_model(predictor, prompt, image_path)
print('Response from the model:\n\n')
print(response)

## Structured Data Extraction

Extracting structured data from product images is essential for efficient inventory management, e-commerce listings, and data analysis. Pixtral's multimodal capabilities enable the accurate transformation of visual information into a standardized JSON format, facilitating seamless integration with databases and applications.

### Implementation

The following code demonstrates how to utilize Pixtral to analyze product images and output the information in a predefined JSON structure. This ensures consistency and accuracy in capturing essential product details.

In [None]:
def send_multiple_images_to_model(predictor, prompt, image_paths):
    """
    Sends multiple images and a prompt to the model and returns the response in plain text.
    """
    # Construct the content list starting with the prompt
    content_list = [
        {
            "type": "text",
            "text": prompt
        }
    ]

    # Add each image as an image_url entry
    for image_path in image_paths:
        data_url = encode_image_to_data_url(image_path)
        content_list.append({
            "type": "image_url",
            "image_url": {
                "url": data_url
            }
        })

    # Construct the payload matching the expected structure
    payload = {
        "messages": [
            {
                "role": "user",
                "content": content_list
            }
        ],
        "max_tokens": 3000,  # Adjusted max_tokens as needed
        "temperature": 0.0,
        "top_p": 0.9,
    }

    response = predictor.predict(payload)
    return response['choices'][0]['message']['content']

In [None]:
prompt = """
You are a product analyst. Your job is to analyze the images provided and output the information in the exact JSON structure specified below. Ensure that you populate each field accurately based on the visible details in the image. If any information is not available or cannot be determined, use 'Unknown' for string fields and an empty array [] for lists.

Use the format shown exactly, ensuring all fields and values align with the JSON schema requirements.

Use this JSON schema:

{
  "title": "string",
  "description": "string",
  "category": {
    "type": "string",
    "enum": ["Electronics", "Furniture", "Clothing", "Appliances", "Toys", "Books", "Tools", "Other"]
  },
  "metadata": {
    "color": {
      "type": "array",
      "items": { "type": "string" }
    },
    "shape": {
      "type": "string",
      "enum": ["Round", "Square", "Rectangular", "Irregular", "Other"]
    },
    "condition": {
      "type": "string",
      "enum": ["New", "Like New", "Good", "Fair", "Poor", "Unknown"]
    },
    "material": {
      "type": "array",
      "items": { "type": "string" }
    },
    "brand": { "type": "string" }
  },
  "image_quality": {
    "type": "string",
    "enum": ["High", "Medium", "Low"]
  },
  "background": "string",
  "additional_features": {
    "type": "array",
    "items": { "type": "string" }
  }
}
"""
image_paths = [
    "Pixtral_data/luggage.jpg",
    "Pixtral_data/dresser.jpg",
    "Pixtral_data/dog_bag.jpg",
]  # Replace with your actual image paths

# Send to model
response = send_multiple_images_to_model(predictor, prompt, image_paths)
print('Response from the model:\n\n')
print(response)
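Because the model returns free text, the JSON it produces may be surrounded by prose or other formatting. The sketch below shows one way to pull JSON objects out of such a response and run a basic check on the `category` field. It assumes the response contains one or more well-formed JSON objects. `extract_json_objects` and `has_valid_category` are hypothetical helpers, and the sample string stands in for a real model response.

```python
import json

# Allowed values copied from the "category" enum in the prompt above
ALLOWED_CATEGORIES = {
    "Electronics", "Furniture", "Clothing", "Appliances",
    "Toys", "Books", "Tools", "Other",
}


def extract_json_objects(text):
    """Scan text and return every top-level JSON object that parses cleanly."""
    decoder = json.JSONDecoder()
    objects = []
    idx = 0
    while idx < len(text):
        start = text.find("{", idx)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            idx = start + 1  # not valid JSON at this brace; keep scanning
            continue
        objects.append(obj)
        idx = end  # resume after the parsed object
    return objects


def has_valid_category(product):
    """Check that the category is a string drawn from the allowed enum values."""
    category = product.get("category")
    return isinstance(category, str) and category in ALLOWED_CATEGORIES


# Made-up sample text standing in for a model response
sample = 'Here you go: {"title": "Dresser", "category": "Furniture"} and {"title": "Bag", "category": "Pets"}'
for product in extract_json_objects(sample):
    print(product["title"], has_valid_category(product))
```

Validating the parsed objects before writing them to a database catches cases where the model drifts from the requested schema.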

## Visual Q&A

Visual Question Answering (Visual Q&A) lets users interact with images through natural language queries. With multi-turn conversations, Pixtral can provide detailed and contextually relevant answers based on the visual content of the images. This capability is useful for applications such as customer support, educational tools, and interactive data analysis.

### Implementation

The following code demonstrates how to utilize Pixtral's Visual Q&A functionality. Users can pass their own images or use images from the Pixtral data folder. Additionally, the max_turns parameter can be adjusted to allow for more extended conversations.

In [None]:
def visual_qa(predictor, image_paths, max_turns=2):
    """
    Performs visual Q&A with multiple images and multi-turn conversation.

    Parameters:
    - predictor: The SageMaker Predictor object.
    - image_paths: A list of local image file paths.
    - max_turns: The maximum number of conversational turns.

    Returns:
    - None. Outputs are printed directly.
    """

    # Encode images to data URLs
    data_urls = [encode_image_to_data_url(image_path) for image_path in image_paths]

    # Initialize conversation messages
    messages = []

    # Define the initial prompt within the function
    initial_prompt = ("You're an extremely friendly helper. Help the user answer questions about the images shown to you. "
                      "If the answer isn't in the image, say 'I'm sorry, the answer is not in the provided image.'")

    # Start the conversation loop
    for turn in range(max_turns):
        # Get user input
        user_question = input("\nYou: ")
        if user_question.strip() == '':
            print("Please enter a question.")
            continue

        # Build user's message content
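        if not messages:
            # Include the initial prompt and images in the first message actually sent.
            # Checking `messages` rather than `turn == 0` keeps the images when an empty input skips the first turn.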
            user_content = [{"type": "text", "text": initial_prompt + " " + user_question}]
            for data_url in data_urls:
                user_content.append({
                    "type": "image_url",
                    "image_url": {"url": data_url}
                })
        else:
            user_content = user_question

        # Append user's message to messages
        messages.append({
            "role": "user",
            "content": user_content
        })

        # Construct the payload
        payload = {
            "messages": messages,
            "max_tokens": 3000,
            "temperature": 0.0,
            "top_p": 0.9
        }

        # Send payload to model and get assistant's response
        response = predictor.predict(payload)
        assistant_response = response['choices'][0]['message']['content']

        # Append assistant's response to messages
        messages.append({
            "role": "assistant",
            "content": assistant_response
        })

        print("\nAssistant:", assistant_response)



In [None]:
# List of image paths
image_paths = [
    "Pixtral_data/trimmed_green_beans.jpg",
    "Pixtral_data/amazon_gloves.jpg",
    "Pixtral_data/cleaner.jpg"
]

# Run the visual Q&A function
visual_qa(predictor, image_paths)
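For scripted or repeatable runs, the same multi-turn pattern can be driven from a fixed list of questions instead of `input()`. The sketch below is a hypothetical variant, not part of the original notebook. `run_scripted_qa` takes any callable `send_fn` that accepts a payload dict and returns a response dict shaped like the endpoint's, and `fake_send` is a stand-in so the example runs without a deployed endpoint.

```python
def run_scripted_qa(send_fn, data_urls, questions, system_hint=""):
    """Run a multi-turn conversation from a fixed list of questions."""
    messages = []
    answers = []
    for question in questions:
        if not messages:
            # The first turn carries the instructions and all images
            content = [{"type": "text", "text": f"{system_hint} {question}".strip()}]
            content += [{"type": "image_url", "image_url": {"url": url}} for url in data_urls]
        else:
            # Later turns are plain text; the images remain in the history
            content = question
        messages.append({"role": "user", "content": content})
        response = send_fn({
            "messages": messages,
            "max_tokens": 3000,
            "temperature": 0.0,
            "top_p": 0.9,
        })
        answer = response["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers, messages


def fake_send(payload):
    """Stand-in for the endpoint: replies with the number of user turns seen so far."""
    n_user = sum(1 for m in payload["messages"] if m["role"] == "user")
    return {"choices": [{"message": {"content": f"answer {n_user}"}}]}


answers, history = run_scripted_qa(
    fake_send,
    ["data:image/png;base64,AAAA"],  # placeholder data URL
    ["What is this?", "What color is it?"],
)
print(answers)
```

With a deployed endpoint, passing `predictor.predict` as `send_fn` and the output of `encode_image_to_data_url` as `data_urls` would send the same conversation to the model.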

In [None]:
# clean up resources
predictor.delete_endpoint()
model.delete_model()

Attribution for street sign images
This image is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Creator: Unknown
Source: https://www.mapillary.com/dataset/trafficsign
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Modifications: none


U. Marti and H. Bunke. The IAM-database: An English Sentence Database for Off-line Handwriting Recognition. Int. Journal on Document Analysis and Recognition, Volume 5, pages 39 - 46, 2002.