# Using Mistral AI API (Pixtral Multimodal)

- Author: Martin Fockedey with the help of Copilot
- Based on: [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

## Overview

This tutorial explains how to effectively use Mistral AI's **Pixtral** multimodal model with **LangChain**. You'll learn to set up and work with the `ChatMistralAI` object for tasks such as generating responses, analyzing model outputs, and leveraging features like real-time response streaming. By the end of this guide, you'll have the tools to experiment with and deploy Mistral AI multimodal solutions.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Mistral AI Pixtral Models](#mistral-ai-pixtral-models)
- [Configuring Multimodal AI with System and User Prompts](#configuring-multimodal-ai-with-system-and-user-prompts)
- [Summary](#summary)

### References

- [Mistral AI Documentation](https://docs.mistral.ai/)
- [Mistral AI Models](https://docs.mistral.ai/getting-started/models/)
- [LangChain Mistral Integration](https://python.langchain.com/docs/integrations/chat/mistralai)

---

## Environment Setup

Install the required packages and load your Mistral AI API key.

**[Note]**
- You need a Mistral AI API key; one is available on your Teams group channel.
- Store your API key in a `.env` file as `MISTRAL_API_KEY`.

In [None]:
%%capture --no-stderr
%pip install -q python-dotenv langchain_mistralai




In [2]:
# Configuration file to manage the API KEY as an environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv(override=True)

True
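
If a later cell fails with an authentication error, it can help to confirm that the key was actually loaded. The following is a minimal sketch, not part of the original notebook; the helper name `api_key_status` is hypothetical, and it only reports whether `MISTRAL_API_KEY` is present, without printing the key itself.

```python
import os


def api_key_status(env=os.environ, name="MISTRAL_API_KEY"):
    """Return a short status string without revealing the key (hypothetical helper)."""
    value = env.get(name, "")
    if not value:
        return f"{name} is missing"
    return f"{name} is set ({len(value)} characters)"


# Report the status of the key in the current environment
print(api_key_status())
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.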

## Mistral AI Pixtral Models



Multimodal refers to technologies or approaches that integrate and process multiple types of information (modalities). This includes a variety of data types such as:

- Text: Information in written form, such as documents, books, or web pages.
- Image: Visual information, including photos, graphics, or illustrations.
- Audio: Auditory information, such as speech, music, or sound effects.
- Video: A combination of visual and auditory information, including video clips or real-time streaming.

Mistral AI offers powerful multimodal models under the **Pixtral** family:

| Model | Description | Capabilities |
|-------|-------------|-------------|
| `pixtral-12b-2409` | 12B multimodal model | Text + Image processing |
| `pixtral-large-latest` | Larger multimodal model | Advanced vision + text tasks |

### Key Features

- ✅ **Text Generation**: High-quality text responses
- ✅ **Image Understanding**: Analyze and describe images
- ✅ **Streaming**: Real-time response generation
- ✅ **Multiple Images**: Process multiple images in one request

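### Limitations

The Pixtral models listed above process text and images; audio and video inputs are outside the scope of this tutorial.

**Note**: This tutorial uses `pixtral-12b-2409` for cost-effectiveness. For more complex tasks, consider `pixtral-large-latest`.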

### Step 1: Setting up ChatMistralAI with Pixtral

Create a `ChatMistralAI` object with the Pixtral model for multimodal tasks.

In [3]:
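from langchain_mistralai import ChatMistralAI

# Create ChatMistralAI object with Pixtral model.
# The API key is read from the MISTRAL_API_KEY environment variable loaded above.
llm_vision = ChatMistralAI(
    temperature=0.1,  # Low temperature for more consistent descriptions
    model="pixtral-12b-2409",  # Multimodal model
)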

### Step 2: Encoding Images

Images are sent to the model as **Base64**-encoded data URLs (`data:<mime type>;base64,<data>`). The function below builds such a URL from either a web URL or a local file path.

In [6]:
import base64
import mimetypes

import requests
from IPython.display import Image, display


def encode_image(image_path_or_url):
    """Encode an image from a URL or local file as a base64 data URL."""
    if image_path_or_url.startswith(("http://", "https://")):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        # Download image from URL; the timeout avoids hanging on an unresponsive server
        response = requests.get(image_path_or_url, headers=headers, timeout=30)
        if response.status_code != 200:
            raise Exception(f"Failed to download image: {response.status_code}")
        image_content = response.content
    else:
        # Read image from local file
        try:
            with open(image_path_or_url, "rb") as image_file:
                image_content = image_file.read()
        except FileNotFoundError:
            raise Exception(f"File not found: {image_path_or_url}")

    # Guess MIME type from the URL or file extension, with a generic fallback
    mime_type, _ = mimetypes.guess_type(image_path_or_url)
    if mime_type is None:
        mime_type = "application/octet-stream"

    # Base64 encode the image and wrap it in a data URL
    return f"data:{mime_type};base64,{base64.b64encode(image_content).decode()}"
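
To see what the returned string contains, the sketch below takes a data URL apart again. It is an illustration only: `split_data_url` is a hypothetical helper, and the sample bytes are just the 8-byte PNG signature rather than a real image, so the example runs without network access.

```python
import base64


def split_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes). Hypothetical helper."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Build a data URL the same way encode_image does, from stand-in bytes
sample_bytes = b"\x89PNG\r\n\x1a\n"
sample_url = f"data:image/png;base64,{base64.b64encode(sample_bytes).decode()}"

mime_type, decoded = split_data_url(sample_url)
print(mime_type, decoded == sample_bytes)  # image/png True
```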

### Step 3: Test Image Encoding

Let's test encoding and displaying an image from a URL.

In [7]:
# Example: URL-based image
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
encoded_image_url = encode_image(IMAGE_URL)
display(Image(url=IMAGE_URL))  # Display the image

### Step 4: Creating Multimodal Messages

Create a message structure that includes both text and images.

In [8]:
def create_vision_messages(encoded_image, user_prompt="Describe this image in detail."):
    """Create messages for vision tasks."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image_url", "image_url": {"url": encoded_image}},
            ],
        }
    ]

### Step 5: Get Vision Response

Send the image to Pixtral and get a description.

In [9]:
# Create messages with the encoded image
messages = create_vision_messages(
    encoded_image_url,
    "Describe this image in detail. What do you see?"
)

# Get response from Pixtral
response = llm_vision.invoke(messages)
print("\n=== Pixtral's Description ===")
print(response.content)


=== Pixtral's Description ===
The image captures a serene scene in nature. A wooden boardwalk, constructed from wooden planks, meanders through a field of tall, green grass. The boardwalk, appearing to be well-trodden, leads the viewer's eye towards the horizon. The sky above is a clear blue, dotted with fluffy white clouds. In the distance, a line of trees forms a natural boundary for the field. The perspective of the image is from the ground, looking down the boardwalk towards the horizon, giving a sense of depth and distance. The colors in the image are vibrant, with the green of the grass contrasting with the blue of the sky. The wooden planks of the boardwalk add a touch of rustic charm to the scene. The image does not contain any discernible text or human activity. The relative positions of the objects suggest a peaceful, untouched landscape.


### Step 6: Streaming Multimodal Response

You can also stream the response so that text appears as it is generated, which gives a better user experience in interactive applications.

In [None]:
def stream_vision_response(llm, messages):
    """Stream the vision model response."""
    print("\n=== Streaming Response ===")
    for chunk in llm.stream(messages):
        print(chunk.content, end="", flush=True)
    print()  # New line at the end


# Stream the response
stream_vision_response(llm_vision, messages)
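
Streaming prints the text as it arrives, but the full answer is often needed afterwards as well. The sketch below shows one way to keep it, assuming each chunk exposes its text as a string in `.content`, as the loop above does. `collect_stream` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real model chunks so the example runs without an API call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print()  # New line at the end
    return "".join(parts)


# Stand-in chunks; with a real model this would be llm_vision.stream(messages)
fake_chunks = [SimpleNamespace(content=text) for text in ("A wooden ", "boardwalk.")]
full_text = collect_stream(fake_chunks)
```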

## Configuring Multimodal AI with System and User Prompts

Let's demonstrate how to use system and user prompts for specific tasks like analyzing charts or documents.

### Understanding Prompts

**System Prompt**
- Defines the AI's role and behavior
- Sets context for consistent responses
- Example: "You are an expert data analyst"

**User Prompt**
- Provides specific task instructions
- States what the user wants from the model
- Example: "Analyze this chart and extract key metrics"
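
The two prompt types map onto two entries in the message list. As a minimal sketch, the hypothetical helper `build_messages` below assembles them in the same dictionary format used elsewhere in this tutorial; the data URL is a placeholder, not a real image.

```python
def build_messages(system_prompt, user_prompt, image_data_url):
    """Return a system message followed by a user message with text and one image."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        },
    ]


# Placeholder data URL so the sketch runs without a real image
example = build_messages(
    "You are an expert data analyst",
    "Analyze this chart and extract key metrics",
    "data:image/png;base64,AAAA",
)
print([message["role"] for message in example])  # ['system', 'user']
```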

### Example: Financial Chart Analysis

Let's analyze an image of a financial statement, NVIDIA's income statement, using Pixtral with custom system and user prompts.

In [10]:
# Example financial chart URL
CHART_URL = "https://media.wallstreetprep.com/uploads/2022/05/24100154/NVIDIA-Income-Statement.jpg"

# Encode and display the chart
encoded_chart = encode_image(CHART_URL)
display(Image(url=CHART_URL))

In [11]:
# Define system and user prompts for financial analysis
system_prompt = """You are a financial analyst expert specializing in reading and interpreting 
financial statements and charts. Your task is to analyze financial data and provide clear, 
actionable insights."""

user_prompt = """Analyze this financial statement and provide:
1. Key revenue trends
2. Notable changes in metrics
3. Overall financial health assessment
4. Any concerning or positive patterns

Be specific with numbers when possible."""

# Create messages with system prompt
messages_with_system = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {"type": "image_url", "image_url": {"url": encoded_chart}},
        ],
    },
]

In [12]:
# Get detailed financial analysis
print("\n" + "="*60)
print("FINANCIAL ANALYSIS")
print("="*60 + "\n")

for chunk in llm_vision.stream(messages_with_system):
    print(chunk.content, end="", flush=True)

print("\n" + "="*60)


FINANCIAL ANALYSIS

### Key Revenue Trends
1. **Revenue Growth**: NVIDIA's revenue has shown significant growth over the three years presented.
   - January 30, 2022: $26,914 million
   - January 31, 2021: $16,675 million
   - January 26, 2020: $10,918 million
   - This indicates a substantial increase in revenue, particularly from 2020 to 2021 and 2021 to 2022.

### Notable Changes in Metrics
1. **Cost of Revenue**: The cost of revenue has also increased but at a slower rate compared to revenue.
   - January 30, 2022: $9,439 million
   - January 31, 2021: $6,279 million
   - January 26, 2020: $4,150 million
   - This suggests that while revenue is growing, the cost of producing or acquiring the revenue is also rising.

2. **Gross Profit**: Gross profit has seen a significant increase.
   - January 30, 2022: $17,475 million
   - January 31, 2021: $10,396 million
   - January 26, 2020: $6,768 million
   - This indicates improved profitability on a per-unit basis.

3. **Operating Expens

### Example: Multiple Images Analysis

Pixtral can analyze multiple images in a single request.

In [13]:
# Example: Compare two images
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image2_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Cat_August_2010-4.jpg/2560px-Cat_August_2010-4.jpg"

encoded_img1 = encode_image(image1_url)
encoded_img2 = encode_image(image2_url)

# Display both images
print("Image 1:")
display(Image(url=image1_url, width=400))
print("\nImage 2:")
display(Image(url=image2_url, width=400))

Image 1:



Image 2:


In [14]:
# Create message with multiple images
multi_image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare and contrast these two images. What are the main differences?"},
            {"type": "image_url", "image_url": {"url": encoded_img1}},
            {"type": "image_url", "image_url": {"url": encoded_img2}},
        ],
    }
]

# Get comparison
print("\n=== Comparison Analysis ===")
for chunk in llm_vision.stream(multi_image_messages):
    print(chunk.content, end="", flush=True)


=== Comparison Analysis ===
The two images presented are quite different in terms of content, setting, and mood. Here is a detailed comparison and contrast:

### Image 1:
- **Setting**: The image depicts a serene natural landscape with a wooden boardwalk extending into a grassy wetland or marsh area.
- **Elements**: The boardwalk is surrounded by tall, green grasses and leads towards a distant horizon with trees and a partly cloudy sky.
- **Mood**: The scene evokes a sense of tranquility, nature, and openness. The bright blue sky and the lush greenery contribute to a peaceful and refreshing atmosphere.

### Image 2:
- **Setting**: The image shows a domestic scene with a cat lying on a white ledge or window sill.
- **Elements**: The cat is stretched out, with its paws extended and eyes closed, appearing relaxed and content. The background is a plain white wall.
- **Mood**: The scene is cozy and intimate, focusing on the cat's relaxed posture, which conveys a sense of comfort and conten
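
The two-image message above generalizes to any number of images by appending one `image_url` entry per image. The following sketch assumes the same message format; `create_multi_image_messages` is a hypothetical helper, and the data URLs are placeholders.

```python
def create_multi_image_messages(encoded_images, user_prompt):
    """Build one user message holding a text prompt followed by every image."""
    content = [{"type": "text", "text": user_prompt}]
    for encoded_image in encoded_images:
        content.append({"type": "image_url", "image_url": {"url": encoded_image}})
    return [{"role": "user", "content": content}]


# Placeholder data URLs stand in for the output of encode_image
placeholders = ["data:image/jpeg;base64,AAAA", "data:image/jpeg;base64,BBBB"]
demo_messages = create_multi_image_messages(placeholders, "Compare these images.")
print(len(demo_messages[0]["content"]))  # 3
```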

## Summary

In this tutorial, you learned:

1. **Mistral AI Models**: Understanding Pixtral multimodal models
2. **Image Processing**: Encoding and sending images to Pixtral
3. **Multimodal Prompts**: Creating effective system and user prompts
4. **Streaming**: Real-time response generation
5. **Multiple Images**: Processing multiple images in one request


### Best Practices

1. **Use Specific Prompts**: Be clear about what you want to extract from images
2. **Optimize Images**: Compress large images for faster processing
3. **System Prompts**: Use system prompts to set expertise context
4. **Streaming**: Enable streaming for better UX in interactive applications
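
For the image optimization point, it helps to know how large an image becomes once it is Base64-encoded, since encoding adds roughly one third to the raw size. The sketch below computes that size; the helper names are hypothetical, and `MAX_ENCODED_BYTES` is an arbitrary example budget, not a documented Mistral AI limit.

```python
import base64

# Hypothetical budget chosen for illustration only, not an API limit
MAX_ENCODED_BYTES = 5 * 1024 * 1024


def base64_length(raw_size):
    """Length in characters of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def needs_compression(raw_size, limit=MAX_ENCODED_BYTES):
    """Return True when the encoded image would exceed the chosen budget."""
    return base64_length(raw_size) > limit


# Compare the formula with the actual encoded length for 1000 raw bytes
raw = b"x" * 1000
print(base64_length(len(raw)), len(base64.b64encode(raw)))  # 1336 1336
```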

