# Mistral OCR Workshop

This notebook demonstrates Optical Character Recognition using Mistral models on AWS.

## Introduction to Mistral OCR

**Mistral OCR (mistral-ocr-latest)** is a specialized optical character recognition model designed for extracting text, images, tables, and mathematical expressions from documents. Unlike traditional OCR models that only extract text, Mistral OCR comprehends each element of documents and returns ordered interleaved text and images in markdown format, making it ideal for multimodal document understanding and RAG systems.

### Key Features

- **State-of-the-Art Performance**: Achieves 94.89% overall accuracy on benchmarks, outperforming Google Document AI (83.42%), Azure OCR (89.52%), Gemini models (88-90%), and GPT-4o (89.77%)
- **Complex Document Understanding**: Excels at processing interleaved imagery, mathematical expressions, tables, and advanced layouts such as LaTeX formatting
- **Natively Multilingual**: Parses, understands, and transcribes thousands of scripts, fonts, and languages across all continents, scoring 99.02% on the multilingual benchmark reported below
- **Multi-format Support**: Process images (JPG, PNG, WebP, etc.) and multi-page PDF documents
- **Blazingly Fast**: Processes up to 2000 pages per minute on a single node, the fastest in its category
- **Doc-as-Prompt & Structured Output**: Use documents as prompts and extract information into structured formats like JSON for downstream function calls and agent building
- **RAG-Ready**: Ideal model for use in Retrieval-Augmented Generation (RAG) systems with multimodal documents such as slides or complex PDFs
- **Production-Ready**: Deploy on Amazon SageMaker, la Plateforme, or self-host for organizations with stringent data privacy requirements
- **Cost-Effective**: Processes 1000 pages per dollar (approximately double with batch inference)
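
The throughput and cost figures in the list above lend themselves to a quick back-of-the-envelope estimate. The sketch below is only a planning aid: it reuses the nominal numbers quoted above (1000 pages per dollar, roughly double with batch inference, and 2000 pages per minute), which were not measured on a SageMaker endpoint. The function name `estimate_ocr_job` is hypothetical.

```python
from typing import Tuple

# Nominal figures quoted in the feature list above; real throughput and
# pricing depend on the deployment, so treat these as planning inputs only
PAGES_PER_DOLLAR = 1000
PAGES_PER_MINUTE = 2000


def estimate_ocr_job(pages: int, batch: bool = False) -> Tuple[float, float]:
    """Return (estimated cost in dollars, estimated minutes) for a job."""
    # Batch inference is described as roughly doubling pages per dollar
    pages_per_dollar = PAGES_PER_DOLLAR * 2 if batch else PAGES_PER_DOLLAR
    return pages / pages_per_dollar, pages / PAGES_PER_MINUTE


cost, minutes = estimate_ocr_job(50_000)
print(f"~${cost:.2f}, ~{minutes:.1f} min")  # ~$50.00, ~25.0 min
```

With `batch=True`, the same 50,000 pages come out at about $25 under these assumptions, while the time estimate is unchanged.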

### Performance Highlights

Mistral OCR excels across multiple dimensions:

| Category | Mistral OCR 2503 | Best Competitor |
|----------|------------------|-----------------|
| Overall Accuracy | 94.89% | 90.23% (Gemini-1.5-Flash) |
| Mathematical Expressions | 94.29% | 89.11% (Gemini-1.5-Flash) |
| Scanned Documents | 98.96% | 96.15% (Gemini-1.5-Pro) |
| Tables | 96.12% | 91.70% (GPT-4o) |
| Multilingual | 99.02% | 97.31% (Azure OCR) |

### Use Cases

1. **Document Digitization**: Convert scanned documents, receipts, forms, and historical archives into searchable, AI-ready text
2. **Scientific Research**: Extract text, equations, tables, charts, and figures from scientific papers and journals
3. **Handwriting Recognition**: Process handwritten notes, whiteboard images, and forms
4. **Multimodal RAG Systems**: Build intelligent document understanding pipelines by combining Mistral OCR with LLMs like Mistral Small for analysis and summarization
5. **Multi-language Processing**: Handle documents written in many languages and scripts, whether they come from global organizations or hyperlocal businesses
6. **Structured Data Extraction**: Extract specific information from documents and format it into JSON for automated workflows and agent systems
7. **Cultural Heritage Preservation**: Digitize historical documents and artifacts for preservation and accessibility

This workshop will demonstrate how to deploy and use Mistral OCR on Amazon SageMaker for various document processing tasks, leveraging its industry-leading accuracy, speed, and versatility.

## Setup and Imports

First, let's import the necessary libraries for OCR processing.

In [None]:
import base64
import os
import boto3
import json
from typing import Optional, Dict, Any
from IPython.display import Markdown, display

## Helper Functions

These helper functions support image processing, model invocation, and post-response processing for the Mistral OCR model.

In [None]:
def encode_local_file_base64(file_path: str, file_type: Optional[str] = None) -> str:
    """
    Encode a local file (image or PDF) to base64 string.
    
    Args:
        file_path: Path to the local file
        file_type: Type of file ('image' or 'pdf'). If None, inferred from extension.
    
    Returns:
        Base64 encoded string of the file
    """
    if file_type is None:
        ext = os.path.splitext(file_path)[1].lower()
        if ext == ".pdf":
            file_type = "pdf"
        elif ext in (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"):
            file_type = "image"
        else:
            raise ValueError(f"Unsupported file type from extension: {ext}")

    try:
        with open(file_path, "rb") as file:
            encoded_data = base64.b64encode(file.read()).decode("utf-8")
            return encoded_data
    except Exception as e:
        print(f"Failed to encode {file_type} at {file_path}: {e}")
        raise

def run_inference(client, endpoint_name: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """
    Invoke the SageMaker endpoint for OCR inference.
    
    Args:
        client: SageMaker runtime client
        endpoint_name: Name of the deployed endpoint
        payload: JSON payload containing the image data
        
    Returns:
        Dictionary containing parsed OCR results
    """
    try:
        inference_out = client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload)
        )
        inference_resp_str = inference_out["Body"].read().decode("utf-8")
        return json.loads(inference_resp_str)
    except Exception as e:
        print(f"Inference error: {e}")
        raise


def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """
    Replace image placeholders in markdown with base64-encoded images.
    
    Args:
        markdown_str: Markdown string with image placeholders
        images_dict: Dictionary mapping image names to base64 strings
    
    Returns:
        Markdown string with embedded base64 images
    """
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
        )
    return markdown_str

def get_combined_markdown(ocr_response: dict) -> str:
    """
    Combine OCR text and images into a single markdown document.
    
    Args:
        ocr_response: Response dictionary from OCR model
    
    Returns:
        Combined markdown string with embedded images
    """
    markdowns = []
    for page in ocr_response["pages"]:
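        # Skip images without base64 data so their placeholders are left untouched
        image_data = {
            img["id"]: img["image_base64"]
            for img in page.get("images", [])
            if img.get("image_base64")
        }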
        markdown_with_images = replace_images_in_markdown(page["markdown"], image_data)
        markdowns.append(markdown_with_images)
    return "\n\n".join(markdowns)
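
`encode_local_file_base64` accepts PDFs as well as images, but every request in this notebook uses an `image_url` chunk. The sketch below shows one way to build the request body for either kind of file. It assumes that the endpoint accepts a `document_url` chunk carrying a `data:application/pdf;base64,...` URI, as Mistral's public OCR API does. That assumption has not been verified against this SageMaker deployment, and `build_ocr_payload` is a hypothetical helper name.

```python
import base64
import mimetypes
from typing import Any, Dict


def build_ocr_payload(file_path: str, model: str = "mistral-ocr-2505") -> Dict[str, Any]:
    """Build an OCR request body for a local image or PDF (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None:
        raise ValueError(f"Cannot infer MIME type for: {file_path}")

    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    data_uri = f"data:{mime_type};base64,{encoded}"

    if mime_type == "application/pdf":
        # Assumption: PDFs are sent as a `document_url` chunk
        document = {"type": "document_url", "document_url": data_uri}
    elif mime_type.startswith("image/"):
        # Same chunk shape as the image requests used in this notebook
        document = {"type": "image_url", "image_url": data_uri}
    else:
        raise ValueError(f"Unsupported MIME type: {mime_type}")

    return {"model": model, "document": document}
```

Because the MIME type is derived from the file extension, a PNG is labeled `image/png` and a JPEG `image/jpeg` without hardcoding either in the data URI.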

## Using Mistral OCR

Now let's use the Mistral OCR model to extract text from an image document.

In [None]:
MISTRAL_OCR_ENDPOINT_NAME = "mistral-ocr-endpoint"
image_b64 = encode_local_file_base64(file_path="images/french.png")

# Prepare the payload for Mistral OCR model
payload = {
    "model": "mistral-ocr-2505",
    "document": {
        "type": "image_url",
        # The source file is a PNG, so the data URI declares image/png
        "image_url": f"data:image/png;base64,{image_b64}"
    }
}

# Create a client and invoke the endpoint
sagemaker_client = boto3.client("sagemaker-runtime")
result_parsed = run_inference(client=sagemaker_client, endpoint_name=MISTRAL_OCR_ENDPOINT_NAME, payload=payload)

# Display final markdown content with embedded images
display(Markdown(get_combined_markdown(result_parsed)))
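
Before rendering a response, it can help to look at its shape. The sketch below counts Markdown characters and extracted images per page, using only the fields the helper functions above already read (`pages`, `markdown`, `images`). The function name `summarize_ocr_pages` and the sample response are made up for illustration and are not real model output.

```python
from typing import Any, Dict, List


def summarize_ocr_pages(ocr_response: Dict[str, Any]) -> List[Dict[str, int]]:
    """Return per-page counts of Markdown characters and extracted images."""
    summary = []
    for number, page in enumerate(ocr_response.get("pages", []), start=1):
        summary.append({
            "page": number,
            "markdown_chars": len(page.get("markdown", "")),
            "image_count": len(page.get("images", [])),
        })
    return summary


# Stand-in response with the same fields the helpers above read
sample_response = {
    "pages": [
        {"markdown": "# Facture\n\nTotal: 42 EUR", "images": []},
        {
            "markdown": "![img-0.jpeg](img-0.jpeg)",
            "images": [{"id": "img-0.jpeg", "image_base64": "..."}],
        },
    ]
}

for row in summarize_ocr_pages(sample_response):
    print(row)
```

Calling `summarize_ocr_pages(result_parsed)` on a real response gives a quick check that every page returned text before the Markdown is displayed.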

## Document Understanding Pipeline with OCR and LLM

Combine Mistral OCR with Mistral Small 3.0 to build an intelligent document understanding pipeline that can extract text and answer questions about the content.

In [None]:
# Helper functions for Bedrock Runtime
bedrock_runtime = boto3.client("bedrock-runtime")

def bedrock_converse(system_prompt: str, messages: list, endpoint_arn: str, display_usage: bool = False) -> str:
    """Invoke model using Converse API"""
    system = [{"text": system_prompt}]
    
    response = bedrock_runtime.converse(
        modelId=endpoint_arn,
        messages=messages,
        system=system,
        additionalModelRequestFields={"max_tokens": 2000, "temperature": 0.3}
    )

    output_content = ''.join(
        content['text'] for content in response['output']['message']['content']
    )

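    if display_usage:
        # Report token counts alongside latency
        token_usage = response['usage']
        print(f"\tInput tokens: {token_usage['inputTokens']}")
        print(f"\tOutput tokens: {token_usage['outputTokens']}")
        print(f"\tTotal tokens: {token_usage['totalTokens']}")
        print(f"\tLatency: {response['metrics']['latencyMs']}ms")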
    
    return output_content

def bedrock_converse_stream(system_prompt: str, messages: list, endpoint_arn: str) -> str:
    """Invoke model with streaming"""
    system = [{"text": system_prompt}]
    
    response = bedrock_runtime.converse_stream(
        modelId=endpoint_arn,
        messages=messages,
        system=system,
        additionalModelRequestFields={"max_tokens": 2000, "temperature": 0.3}
    )
    
    stream = response.get('stream')
    output_content = ''
    
    if stream:
        for event in stream:
            if 'messageStart' in event:
                print(f"\nRole: {event['messageStart']['role']}")
            
            if 'contentBlockDelta' in event:
                # Non-text deltas carry no 'text' key, so default to an empty string
                text_chunk = event['contentBlockDelta']['delta'].get('text', '')
                print(text_chunk, end="")
                output_content += text_chunk
            
            if 'messageStop' in event:
                print(f"\nStop reason: {event['messageStop']['stopReason']}")
            
            if 'metadata' in event:
                metadata = event['metadata']
                if 'metrics' in metadata:
                    print(f"Latency: {metadata['metrics']['latencyMs']}ms")
    
    return output_content
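
The streaming helper above mixes event parsing with printing, which makes it awkward to test without an AWS endpoint. The sketch below isolates the parsing step as a pure function over the same event shapes (`messageStart`, `contentBlockDelta`, `messageStop`). The name `collect_stream_text` is hypothetical, and the events are hand-written stand-ins rather than real model output.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def collect_stream_text(events: Iterable[Dict[str, Any]]) -> Tuple[str, Optional[str]]:
    """Join text deltas from Converse stream events; also return the stop reason."""
    chunks = []
    stop_reason = None
    for event in events:
        if "contentBlockDelta" in event:
            # Non-text deltas carry no "text" key, so default to an empty string
            chunks.append(event["contentBlockDelta"]["delta"].get("text", ""))
        if "messageStop" in event:
            stop_reason = event["messageStop"].get("stopReason")
    return "".join(chunks), stop_reason


# Hand-written events that imitate the shapes handled above
fake_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Bonjour"}}},
    {"contentBlockDelta": {"delta": {"text": " le monde"}}},
    {"messageStop": {"stopReason": "end_turn"}},
]

text, reason = collect_stream_text(fake_events)
print(text, reason)
```

Keeping the parsing separate from the printing means the same function can be reused in a notebook, a script, or a unit test.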

In [None]:
def document_understanding_pipeline(
    image_path: str,
    user_prompt: str,
    ocr_endpoint: str,
    llm_endpoint: str
) -> str:
    """
    Full pipeline for document understanding from image input.

    Args:
        image_path: Local path to the document image
        user_prompt: What insights the user wants to extract
        ocr_endpoint: SageMaker endpoint for OCR model
        llm_endpoint: Bedrock endpoint for document understanding LLM

    Returns:
        Model-generated response with document insights
    """

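    # Step 1: Encode local file using helper
    encoded_image = encode_local_file_base64(image_path)

    # Infer the MIME type from the file extension so that, for example,
    # a PNG is not labeled as JPEG; fall back to JPEG if it cannot be guessed
    import mimetypes
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    payload = {
        "model": "mistral-ocr-2505",
        "document": {
            "type": "image_url",
            "image_url": f"data:{mime_type};base64,{encoded_image}"
        }
    }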

    # Step 2: Run OCR model
    print("Running OCR model...")
    ocr_result = run_inference(client=sagemaker_client, endpoint_name=ocr_endpoint, payload=payload)

    # Step 3: Convert OCR output to Markdown
    print("Formatting OCR output...")
    markdown_doc = get_combined_markdown(ocr_result)

    print("----- OCR Text  -----")
    display(Markdown(markdown_doc))

    # Step 4: Prepare LLM messages
    system_prompt = (
        "You are a document understanding assistant. The user will provide structured OCR content "
        "from a scanned document. Use that information to generate clear, factual insights that "
        "answer the user's request."
    )

    messages = [
        {
            "role": "user",
            "content": [{"text": f"{user_prompt}\n\n--- Document Content ---\n{markdown_doc}"}]
        }
    ]

    # Step 5: Call Bedrock LLM with streaming
    print("Running LLM for document insights...")
    insights = bedrock_converse_stream(system_prompt, messages, llm_endpoint)

    return insights
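
`get_combined_markdown` inlines each image's base64 string into the Markdown, and the pipeline passes that same Markdown to the LLM. When the OCR response carries image data, those strings can take up a large share of the prompt without helping a text-only model. The sketch below replaces inline images with a short text placeholder before the prompt is built. It assumes the embedded targets are full data URIs starting with `data:`, and `strip_embedded_images` is a hypothetical helper name.

```python
import re

# Matches Markdown images whose target is an inline data URI
_DATA_URI_IMAGE = re.compile(r"!\[([^\]]*)\]\(data:[^)]*\)")


def strip_embedded_images(markdown_doc: str) -> str:
    """Replace inline base64 images with a short text placeholder."""
    return _DATA_URI_IMAGE.sub(r"[image: \1]", markdown_doc)


sample = "Intro ![img-0.jpeg](data:image/jpeg;base64,AAAA) outro"
print(strip_embedded_images(sample))  # Intro [image: img-0.jpeg] outro
```

One option is to keep the full Markdown for display and pass `strip_embedded_images(markdown_doc)` into the message that goes to the LLM.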

## Example: Document Summarization

Let's use the pipeline to extract and summarize a document.

In [None]:
image_path = "images/french.png"
user_prompt = "Can you summarise this document?"
ocr_endpoint = MISTRAL_OCR_ENDPOINT_NAME
llm_endpoint = "<ENDPOINT_ARN>"  # Replace with your Mistral Small 3.0 endpoint ARN

document_understanding_pipeline(image_path, user_prompt, ocr_endpoint, llm_endpoint)

## Example: Whiteboard/Handwriting OCR

Extract text from whiteboard images or handwritten notes using Mistral OCR.

In [None]:
# Process whiteboard image
whiteboard_b64 = encode_local_file_base64(file_path="images/whiteboard.png")

# Prepare the payload for Mistral OCR model
whiteboard_payload = {
    "model": "mistral-ocr-2505",
    "document": {
        "type": "image_url",
        "image_url": f"data:image/png;base64,{whiteboard_b64}"
    }
}

# Invoke the endpoint
whiteboard_result = run_inference(
    client=sagemaker_client, 
    endpoint_name=MISTRAL_OCR_ENDPOINT_NAME, 
    payload=whiteboard_payload
)

# Display the extracted text
print("Extracted Whiteboard Content:")
display(Markdown(get_combined_markdown(whiteboard_result)))

### Low-Resolution Handwriting Recognition

Now let's test Mistral OCR's ability to handle challenging, low-quality handwritten content. This demonstrates the model's robustness even when dealing with compressed or poor-quality images.

In [None]:
# Process low-resolution handwriting image
handwriting_b64 = encode_local_file_base64(file_path="images/handwriting_low_res_2_resize.jpg")

# Prepare the payload for Mistral OCR model
handwriting_payload = {
    "model": "mistral-ocr-2505",
    "document": {
        "type": "image_url",
        "image_url": f"data:image/jpeg;base64,{handwriting_b64}"
    }
}

# Invoke the endpoint
handwriting_result = run_inference(
    client=sagemaker_client, 
    endpoint_name=MISTRAL_OCR_ENDPOINT_NAME, 
    payload=handwriting_payload
)

# Display the extracted text
print("Extracted Low-Resolution Handwriting Content:")
display(Markdown(get_combined_markdown(handwriting_result)))

## Example: Invoice Processing

Process invoice images to extract structured data such as vendor details, line items, totals, and dates. This is a common use case for automating accounts payable workflows and financial document processing.

### Single Invoice Processing

First, let's process a single invoice to extract all its content.

In [None]:
# Process a single invoice
invoice_b64 = encode_local_file_base64(file_path="images/invoice_1.jpg")

# Prepare the payload for Mistral OCR model
invoice_payload = {
    "model": "mistral-ocr-2505",
    "document": {
        "type": "image_url",
        "image_url": f"data:image/jpeg;base64,{invoice_b64}"
    }
}

# Invoke the endpoint
invoice_result = run_inference(
    client=sagemaker_client, 
    endpoint_name=MISTRAL_OCR_ENDPOINT_NAME, 
    payload=invoice_payload
)

# Display the extracted invoice content
print("Extracted Invoice Content:")
display(Markdown(get_combined_markdown(invoice_result)))
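
The OCR step returns Markdown, so turning an invoice into structured fields such as vendor, dates, and totals still needs a second step. One possible approach is to ask the LLM for a JSON object and parse its reply. The sketch below covers only the prompt text and the parsing. The field names, the helper names `build_extraction_prompt` and `parse_json_reply`, and the sample reply are all hypothetical.

```python
import json
import re
from typing import Any, Dict, Sequence

# Hypothetical field names for illustration; adapt them to your own schema
INVOICE_FIELDS = ("vendor", "invoice_number", "invoice_date", "total")


def build_extraction_prompt(fields: Sequence[str] = INVOICE_FIELDS) -> str:
    """Ask for a single JSON object containing exactly the listed keys."""
    keys = ", ".join(fields)
    return (
        "Extract the following fields from the invoice and reply with one JSON "
        f"object using exactly these keys: {keys}. Use null for missing values."
    )


def parse_json_reply(reply: str) -> Dict[str, Any]:
    """Parse the JSON object in an LLM reply, ignoring any surrounding text."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the reply")
    return json.loads(match.group(0))


# Made-up reply used only to exercise the parser
sample_reply = 'Here is the result:\n{"vendor": "ACME", "total": 12.5}\nLet me know if you need more.'
print(parse_json_reply(sample_reply))
```

The prompt can be passed as `user_prompt` to `document_understanding_pipeline`, and the string it returns can then be handed to `parse_json_reply`. The parser takes everything from the first `{` to the last `}`, so it suits replies that contain a single JSON object.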

### Batch Invoice Processing

Now let's process all three invoices in one batch. The request format used throughout this notebook carries a single `document` per call, so the cell below sends one request per invoice and collects the returned pages into a single result.

In [None]:
# Process all three invoices together
invoice_files = ["images/invoice_1.jpg", "images/invoice_2.jpg", "images/Invoice_3.jpg"]

# Encode all invoices
print(f"Processing {len(invoice_files)} invoices...")
encoded_invoices = [encode_local_file_base64(file_path=invoice_file) for invoice_file in invoice_files]

# Send one request per invoice and gather every returned page into one result,
# so that each invoice appears as a separate page below
batch_result = {"pages": []}
for encoded_invoice in encoded_invoices:
    single_payload = {
        "model": "mistral-ocr-2505",
        "document": {
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{encoded_invoice}"
        }
    }
    single_result = run_inference(
        client=sagemaker_client,
        endpoint_name=MISTRAL_OCR_ENDPOINT_NAME,
        payload=single_payload
    )
    batch_result["pages"].extend(single_result.get("pages", []))

# Display results for each invoice
print(f"\n{'='*80}")
print("BATCH PROCESSING RESULTS")
print(f"{'='*80}\n")

for idx, page in enumerate(batch_result.get("pages", []), 1):
    print(f"\n--- Invoice {idx} ---")
    invoice_markdown = page.get("markdown", "")
    # Handle embedded images
    if "images" in page:
        image_data = {img["id"]: img["image_base64"] for img in page["images"]}
        invoice_markdown = replace_images_in_markdown(invoice_markdown, image_data)
    display(Markdown(invoice_markdown))
    print(f"\n{'='*80}\n")