# Testing from blog posts and docs

[https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)

In [6]:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the image
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",  # or: "flash_attention_2" if DEVICE == "cuda" else "eager"
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Some kwargs in processor config are unused and will not have any effect: image_seq_len. 


User:



Can you describe this image?
Assistant: The image depicts a cityscape featuring a prominent landmark, the Statue of Liberty, prominently positioned on Liberty Island. The statue is a green, humanoid figure with a crown atop its head and is situated on a small island surrounded by water. The statue is characterized by its detailed features, including a crown, a long, flowing robe, and a rectangular chest.

In the background, the cityscape is filled with numerous high-rise buildings, which are predominantly made of glass and steel. These buildings vary in height, with the tallest ones reaching up to several stories. The sky above is clear, suggesting a sunny day, and the lighting casts a soft glow over the entire scene.

To the left of the statue, there is a small strip of land with trees, which adds a touch of natural beauty to the urban landscape. The water surrounding the island is calm, reflecting the sky and the buildings on the mainland.

The image captures a moment of tra
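
The decoded text above echoes the prompt (`User: ... Assistant:`) before the model's answer, since the decoded sequence contains the prompt tokens as well as the completion. A minimal sketch for keeping only the reply, assuming the chat template always introduces the reply with the literal string `Assistant:` as in the output above; `extract_reply` is a hypothetical helper, not part of transformers.

```python
def extract_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last occurrence of `marker`, stripped.

    If the marker is absent, the input is returned unchanged (stripped).
    """
    # rpartition splits on the LAST marker, so a marker quoted inside the
    # user prompt does not cut the reply short. The trade-off: a reply that
    # itself contains the marker would be truncated at that point.
    _, sep, reply = decoded.rpartition(marker)
    return reply.strip() if sep else decoded.strip()


decoded = "User:\n\n\n\nCan you describe this image?\nAssistant: The image depicts a cityscape."
print(extract_reply(decoded))
```

An alternative that avoids string matching is to drop the prompt tokens before decoding, for example by slicing `generated_ids[:, inputs["input_ids"].shape[1]:]`; that variant is not tested here.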

In [7]:
receipt_image = load_image("https://miro.medium.com/v2/resize:fit:2000/1*XABefyicvTbpAARnM33BLA.jpeg")

In [None]:
# Reuse the chat prompt built above (it contains one image placeholder) with the receipt image
inputs2 = processor(text=prompt, images=[receipt_image], return_tensors="pt")
inputs2 = inputs2.to(DEVICE)

# Generate outputs
generated_ids2 = model.generate(**inputs2, max_new_tokens=500)
generated_texts2 = processor.batch_decode(
    generated_ids2,
    skip_special_tokens=True,
)

print(generated_texts2[0])

# Try with Outlines

In [3]:
!pip install outlines



In [4]:
import outlines

In [5]:
import torch
from transformers import (
    AutoModelForVision2Seq,
)
model_name="HuggingFaceTB/SmolVLM-500M-Instruct"
model_class=AutoModelForVision2Seq

def get_vision_model(model_name: str, model_class):
    model_kwargs = {
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "eager",  # alternative: "flash_attention_2"
        "device_map": "auto",
    }
    processor_kwargs = {
        "device": "cuda",
    }

    model = outlines.models.transformers_vision(
        model_name,
        model_class=model_class,
        model_kwargs=model_kwargs,
        processor_kwargs=processor_kwargs,
    )
    return model
model = get_vision_model(model_name, model_class)

Some kwargs in processor config are unused and will not have any effect: image_seq_len. 


In [6]:
from PIL import Image  # needed by the function below


def load_and_resize_image(image_path, max_size=1024):
    """
    Load and resize an image while maintaining aspect ratio

    Args:
        image_path: Path to the image file
        max_size: Maximum dimension (width or height) of the output image

    Returns:
        PIL Image: Resized image
    """
    image = Image.open(image_path)

    # Get current dimensions
    width, height = image.size

    # Calculate scaling factor
    scale = min(max_size / width, max_size / height)

    # Only resize if image is larger than max_size
    if scale < 1:
        new_width = int(width * scale)
        new_height = int(height * scale)
        image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)

    return image
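
To see what the resize step does to the dimensions without loading any pixels, the same arithmetic can be run on its own. This is a small sketch that mirrors the scaling logic above; `scaled_size` is a hypothetical helper name and the sizes are made up for illustration.

```python
def scaled_size(width: int, height: int, max_size: int = 1024) -> tuple[int, int]:
    """Return the output size load_and_resize_image would target."""
    # The smaller of the two ratios is the one that brings the LONGER side
    # down to max_size while keeping the aspect ratio.
    scale = min(max_size / width, max_size / height)

    # Images that already fit are left untouched (never upscaled).
    if scale < 1:
        return int(width * scale), int(height * scale)
    return width, height


print(scaled_size(2048, 1024))  # (1024, 512)
print(scaled_size(800, 600))    # (800, 600)
```

Because `int()` truncates, floating-point rounding can in principle leave the longer side one pixel below `max_size` for some input sizes; `round()` would be one way to avoid that if exact dimensions matter.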

In [8]:
from PIL import Image

In [9]:
import requests

# URL of the receipt image
image_url = "https://raw.githubusercontent.com/dottxt-ai/outlines/refs/heads/main/docs/cookbook/images/trader-joes-receipt.jpg"

# Download the image and fail loudly on an HTTP error
response = requests.get(image_url)
response.raise_for_status()

# The URL points to a .jpg, so save it with a matching extension
with open("receipt.jpg", "wb") as f:
    f.write(response.content)

# Load + resize the image
image = load_and_resize_image("receipt.jpg")

In [11]:
from pydantic import BaseModel, Field
from typing import Literal, Optional, List

In [12]:
class Item(BaseModel):
    name: str
    quantity: Optional[int]
    price_per_unit: Optional[float]
    total_price: Optional[float]

class ReceiptSummary(BaseModel):
    store_name: str
    store_address: str
    store_number: Optional[int]
    items: List[Item]
    tax: Optional[float]
    total: Optional[float]
    # Date is in the format YYYY-MM-DD. We can apply a regex pattern to ensure it's formatted correctly.
    date: Optional[str] = Field(pattern=r'\d{4}-\d{2}-\d{2}', description="Date in the format YYYY-MM-DD")
    payment_method: Literal["cash", "credit", "debit", "check", "other"]
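
The `pattern` on `date` only constrains the shape of the string, so a value such as `2023-02-30` has the right shape but is not a real day. A possible follow-up check on the extracted value, sketched with the standard library only; `is_real_date` is a hypothetical helper and the sample dates are invented for illustration.

```python
import re
from datetime import date

# Same shape as the Field pattern above, restricted to ASCII digits.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}", flags=re.ASCII)


def is_real_date(text: str) -> bool:
    """True only if `text` is exactly YYYY-MM-DD and names a real calendar day."""
    # fullmatch rejects extra characters before or after the date.
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        # Right shape, impossible date (e.g. month 13 or February 30).
        return False
    return True


print(is_real_date("2023-11-04"))  # True
print(is_real_date("2023-02-30"))  # False
```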

In [13]:
from transformers import AutoProcessor

In [14]:
# Set up the content you want to send to the model
messages = [
    {
        "role": "user",
        "content": [
            {
                # The image is provided as a PIL Image object
                "type": "image",
                "image": image,
            },
            {
                "type": "text",
                "text": f"""You are an expert at extracting information from receipts.
                Please extract the information from the receipt. Be as detailed as possible --
                missing or misreporting information is a crime.

                Return the information in the following JSON schema:
                {ReceiptSummary.model_json_schema()}
            """},
        ],
    }
]

# Convert the messages to the final prompt
processor = AutoProcessor.from_pretrained(model_name)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

Some kwargs in processor config are unused and will not have any effect: image_seq_len. 


In [15]:
# Prepare a function to process receipts
receipt_summary_generator = outlines.generate.json(
    model,
    ReceiptSummary,

    # Greedy sampling is a good idea for numeric
    # data extraction -- no randomness.
    sampler=outlines.samplers.greedy()
)

# Generate the receipt summary
result = receipt_summary_generator(prompt, [image])
print(result)



store_name='San Francisco, CA' store_address='401 Bay Street' store_number=94133 items=[Item(name='Banana Each', quantity=7, price_per_unit=1.61, total_price=1.61), Item(name='Barebelles Chocolate Doug', quantity=2, price_per_unit=2.29, total_price=4.29), Item(name='Barebelles Creamy Crisp', quantity=2, price_per_unit=2.29, total_price=4.29), Item(name='Spindrift Orange Mango', quantity=1, price_per_unit=7.49, total_price=8.49), Item(name='Bottle Deposit', quantity=8, price_per_unit=0.4, total_price=0.4), Item(name='Milk Organic Gallon Whole', quantity=8, price_per_unit=6.79, total_price=6.79), Item(name='Classic Greek Salad', quantity=1, price_per_unit=3.49, total_price=3.49), Item(name='Cobb Salad', quantity=1, price_per_unit=5.99, total_price=5.99), Item(name='Pepper Bell Red XL Each', quantity=1, price_per_unit=1.29, total_price=1.29), Item(name='Bag Fee', quantity=1, price_per_unit=0.25, total_price=0.25), Item(name='Bag Fee', quantity=1, price_per_unit=0.25, total_price=0.25)] ta
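
The comment in the cell above prefers greedy sampling because it removes randomness. The toy sketch below illustrates the idea only: greedy decoding always takes the highest-scoring candidate, while sampling draws in proportion to the softmax probabilities. It is not how `outlines.samplers.greedy` is implemented, and the scores are made up.

```python
import math
import random


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(scores):
    """Index of the highest score; ties resolve to the first maximum."""
    return max(range(len(scores)), key=scores.__getitem__)


def sample_pick(scores, rng):
    """Draw one index in proportion to softmax(scores)."""
    return rng.choices(range(len(scores)), weights=softmax(scores), k=1)[0]


scores = [1.0, 3.0, 2.5]  # made-up scores for three candidate tokens

# Greedy is deterministic: the same index on every call.
print([greedy_pick(scores) for _ in range(5)])  # [1, 1, 1, 1, 1]

# Sampling can return any index, depending on the random draws.
rng = random.Random(0)
print([sample_pick(scores, rng) for _ in range(5)])
```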

In [16]:
result

ReceiptSummary(store_name='San Francisco, CA', store_address='401 Bay Street', store_number=94133, items=[Item(name='Banana Each', quantity=7, price_per_unit=1.61, total_price=1.61), Item(name='Barebelles Chocolate Doug', quantity=2, price_per_unit=2.29, total_price=4.29), Item(name='Barebelles Creamy Crisp', quantity=2, price_per_unit=2.29, total_price=4.29), Item(name='Spindrift Orange Mango', quantity=1, price_per_unit=7.49, total_price=8.49), Item(name='Bottle Deposit', quantity=8, price_per_unit=0.4, total_price=0.4), Item(name='Milk Organic Gallon Whole', quantity=8, price_per_unit=6.79, total_price=6.79), Item(name='Classic Greek Salad', quantity=1, price_per_unit=3.49, total_price=3.49), Item(name='Cobb Salad', quantity=1, price_per_unit=5.99, total_price=5.99), Item(name='Pepper Bell Red XL Each', quantity=1, price_per_unit=1.29, total_price=1.29), Item(name='Bag Fee', quantity=1, price_per_unit=0.25, total_price=0.25), Item(name='Bag Fee', quantity=1, price_per_unit=0.25, tot
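
Schema-constrained generation guarantees the structure of the result, not the correctness of its values. In the output above, several items have a `total_price` that does not equal `quantity * price_per_unit`. A minimal sanity check, sketched with a stand-in dataclass so it runs without the model; `Line` and `is_consistent` are hypothetical names, and the three sample rows are copied from the printed result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    """Stand-in with the same field names as the Item model."""
    name: str
    quantity: Optional[int]
    price_per_unit: Optional[float]
    total_price: Optional[float]


def is_consistent(line, tol: float = 0.01) -> bool:
    """True if quantity * price_per_unit matches total_price within `tol`.

    Lines with any missing field are treated as consistent (nothing to check).
    """
    if None in (line.quantity, line.price_per_unit, line.total_price):
        return True
    return abs(line.quantity * line.price_per_unit - line.total_price) <= tol


lines = [
    Line("Banana Each", 7, 1.61, 1.61),
    Line("Barebelles Chocolate Doug", 2, 2.29, 4.29),
    Line("Classic Greek Salad", 1, 3.49, 3.49),
]
for line in lines:
    print(line.name, is_consistent(line))
```

Since the check only reads attributes, it should also accept the `Item` objects in `result.items`, which carry the same field names.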