# Image Captioning at Scale with Vision LLMs

Vision models excel at understanding and describing images. They can interpret the contents of an image much as a human would, find patterns and objects, and even process many images at a time.

In this notebook we'll learn how to caption images and extract insights from a large number of them.

We'll use a VLM (Vision Language Model) to create a dataset of ugly and beautiful websites. For simplicity, our dataset will consist of only a couple hundred images, but an AI lab seeking to improve LLM design ability might scale it to millions or billions of web pages.

## Setting Up Your Captioning Pipeline

First we install some libraries. We interact with the Inference.net API through OpenAI's official Python client.

In [14]:
%pip install openai requests pillow datasets -q

Note: you may need to restart the kernel to use updated packages.


Then we can connect to Inference using the OpenAI SDK.

We pass the Inference.net API base URL and read our API key from the INFERENCE_API_KEY environment variable.

In [15]:
from openai import OpenAI
import os
import base64
import requests
from PIL import Image
import json
import time
from datasets import load_dataset
from io import BytesIO

client = OpenAI(
    base_url="https://batch.inference.net/v1",
    api_key=os.getenv("INFERENCE_API_KEY"),
)

VISION_MODEL = "google/gemma-3-27b-instruct/bf-16"
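
`os.getenv` returns `None` when the variable is unset, and the resulting error may not mention `INFERENCE_API_KEY` by name. A minimal sketch of failing fast with a clearer message follows; `require_env` is a hypothetical helper written for this notebook, not part of the OpenAI SDK.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage (assumes the key was exported in your shell beforehand):
# api_key = require_env("INFERENCE_API_KEY")
```

Passing the returned value as `api_key` keeps the rest of the client setup unchanged.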

Next we'll load the training split of the website screenshots dataset.

In [16]:
N = 200
ds = load_dataset(
    "Zexanima/website_screenshots_image_dataset",
    split="train",                                   # ⚠️ use “test” later for eval
    streaming=False                                  # streaming=True uses far less RAM but is slower
).select(range(N))
print("Loaded", len(ds), "screenshots")

Loaded 200 screenshots


Let's create a helper function that turns an image into a data URI so that we can pass it to our VLM API endpoint:

In [17]:
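def to_data_uri(sample):
    """Encode a PIL image or a remote image URL as a base64 PNG data URI."""
    if not isinstance(sample, Image.Image):           # remote URL
        resp = requests.get(sample, timeout=30)
        resp.raise_for_status()                       # fail loudly on HTTP errors
        sample = Image.open(BytesIO(resp.content))
    # Re-encode as PNG so the bytes always match the image/png label below
    img = sample.convert("RGB")
    buf = BytesIO()
    img.save(buf, format="PNG", optimize=True)
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{b64}"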
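
The helper labels its output as `image/png`. If you instead pass raw bytes through unchanged (for example, JPEGs fetched from a URL), the label needs to match the actual format. One possible approach, sketched here with the standard library only, is to sniff the format from the leading magic bytes; `sniff_mime` and `bytes_to_data_uri` are hypothetical names, and the signature list covers only a few common formats.

```python
import base64

# Magic-byte prefixes for a few common image formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff_mime(data: bytes, default: str = "application/octet-stream") -> str:
    """Guess an image MIME type from the first bytes of the payload."""
    for prefix, mime in _SIGNATURES:
        if data.startswith(prefix):
            return mime
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return default

def bytes_to_data_uri(data: bytes) -> str:
    """Build a data URI whose MIME label matches the sniffed format."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{sniff_mime(data)};base64,{b64}"
```

Re-encoding everything to PNG is simpler and predictable; sniffing avoids the re-encode cost when the source bytes are already in a format the endpoint accepts.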

## Your First Image Captioning Request

Now let's build the captioning requests. We define a system prompt and a strict JSON schema, convert each screenshot to a base64 data URI, and write one chat-completion request per image as a JSONL line for the batch endpoint:


In [18]:
import textwrap

SYSTEM_PROMPT = textwrap.dedent("""
    You are a senior product‑designer assistant evaluating full‑page website
    screenshots.  Perform **two independent judgements**:

    ────────────────────────────────────────────────────────────────────
    1. landing_page (boolean)
       ▸ TRUE  – the screenshot looks like the FIRST page a visitor sees
                 (hero section or marquee visual, primary navigation bar,
                 clear top‑level call‑to‑action, little or no scroll offset).
       ▸ FALSE – any interior page, modal, or state that presumes prior
                 navigation (pricing tables, blog posts, dashboards, etc.).
    ────────────────────────────────────────────────────────────────────
    2. aesthetic  (integer 1‑5)
       Rate overall visual polish **at the moment the screenshot was taken**.
       Use the rubric below; intermediate numbers are **not permitted**.

         1 ▪︎ Poor   – chaotic layout, clashing colours, unreadable text,
                      obvious placeholder or broken assets.
         2 ▪︎ Below Avg – dated styling, inconsistent spacing/alignment,
                      low‑contrast elements, generic stock imagery.
         3 ▪︎ Average  – competent but ordinary; standard template vibes,
                      minor visual debts allowed, no major UX anti‑patterns.
         4 ▪︎ Good    – clean hierarchy, harmonious palette & typography,
                      responsive‑looking grid, purposeful imagery/icons.
         5 ▪︎ Excellent – editorial‑grade art direction, meticulous spacing,
                      delightful micro‑details, persuasive visual storytelling.

       ✱ Ignore personal taste; judge by professional UI/UX heuristics
         (legibility, balance, affordance, consistency, brand presence).

    OUTPUT FORMAT  (strict)
    ------------------------
    {
      "landing_page": <true|false>,
      "aesthetic":    <integer 1‑5>
    }

    • Return **JSON ONLY** – no comments, no extra keys, no trailing commas.
    • If unsure, choose the **more conservative** (lower) aesthetic score.
""").strip()

schema = {                                 # strict JSON schema = no post‑cleanup
    "name": "webscreen_classification",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "landing_page": {"type": "boolean"},
            "aesthetic":    {"type": "integer", "minimum": 1, "maximum": 5}
        },
        "required": ["landing_page", "aesthetic"],
        "additionalProperties": False
    }
}

lines = []
for row in ds:
    data_uri = to_data_uri(row["image"])
    
    body = {
        "model": VISION_MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text",      "text": "Classify this screenshot."}
            ]}
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": schema
        },
        "temperature": 0.1,
        "max_tokens": 300
    }
    
    lines.append(json.dumps({
        "custom_id": f"img_{row['image_id']}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body
    }))

print(len(lines))

200
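
Each request line embeds a full base64 screenshot, so the input file grows quickly once you move beyond a few hundred images. A sketch of splitting the request lines into several batch files by byte size is shown below. `chunk_jsonl` is a hypothetical helper, and `max_bytes` is a placeholder: check your provider's documented upload limit rather than relying on any value used here.

```python
from typing import Iterable, Iterator, List

def chunk_jsonl(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield groups of JSONL lines whose encoded size stays within max_bytes."""
    chunk: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1   # +1 for the newline separator
        if line_size > max_bytes:
            raise ValueError("A single request exceeds max_bytes")
        if chunk and size + line_size > max_bytes:
            yield chunk                              # current file is full
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield chunk                                  # final, partially filled file

# Usage sketch: each group would be uploaded as its own batch input file.
# for i, group in enumerate(chunk_jsonl(lines, max_bytes=50_000_000)):
#     blob = "\n".join(group).encode("utf-8")
```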


In [19]:
from io import BytesIO

jsonl_blob = "\n".join(lines).encode("utf-8")

buf = BytesIO(jsonl_blob)
buf.name = "batch_input.jsonl"          # 👈 give it a legit filename

batch_input_file = client.files.create(
    file=buf,
    purpose="batch"
)
print("Uploaded:", batch_input_file.id)

Uploaded: MvxbZCC7q22aculyfMe_o


In [20]:
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print("Batch ID:", batch.id, "Status:", batch.status)

Batch ID: XCJP4hQpY8ifQQM8EsPEJ Status: in_progress


In [21]:
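import time

# Poll while the batch is in a non-terminal state
# (status names follow the OpenAI Batch API; any other status ends the loop)
PENDING_STATUSES = {"validating", "in_progress", "finalizing", "cancelling"}
while True:
    batch = client.batches.retrieve(batch.id)
    print("Batch status:", batch.status)
    if batch.status not in PENDING_STATUSES:
        break
    time.sleep(10)  # Wait 10 seconds before polling again

print("Final batch object:")
print(batch)

# Download the successful responses as JSONL text
output_file = client.files.content(batch.output_file_id)
print(output_file.text)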

Batch status: in_progress
Batch status: completed
Final batch object:
Batch(id='XCJP4hQpY8ifQQM8EsPEJ', completion_window='24h', created_at=1752187688568, endpoint='/v1/chat/completions', input_file_id='MvxbZCC7q22aculyfMe_o', object='batch', status='completed', cancelled
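
The loop above polls at a fixed 10-second interval. For longer-running batches, one option is a capped exponential backoff with an overall timeout. The sketch below is provider-agnostic: `wait_for_batch` and `get_status` are hypothetical names, and the terminal status names are those of the OpenAI Batch API, which Inference.net is assumed (not confirmed) to share.

```python
import time
from typing import Callable

# Terminal statuses as named by the OpenAI Batch API (assumed to apply here).
TERMINAL = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(get_status: Callable[[], str],
                   initial_delay: float = 5.0,
                   max_delay: float = 60.0,
                   timeout: float = 24 * 3600,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() with capped exponential backoff until a terminal status."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Batch still '{status}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)   # double the wait, up to max_delay

# Usage sketch with the client from this notebook:
# final = wait_for_batch(lambda: client.batches.retrieve(batch.id).status)
```

The injectable `sleep` argument exists only so the helper can be exercised without real waiting.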

In [22]:
# Show the successful results (the error file is retrieved in the next cell)
print(output_file.text)

{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_174","response":{"status_code":200,"request_id":"26d7a645df0ed301","body":{"id":"943-ZLR9usIiV62NNArKe","choices":[{"finish_reason":"stop","index":0,"message":{"content":"{\"landing_page\": true, \"aesthetic\": 4}","reasoning_content":"","role":"assistant","tool_calls":[]},"logprobs":null}],"created":1752187822,"model":"google/gemma-3-27b-instruct/bf-16","system_fingerprint":"","object":"chat.completion","usage":{"prompt_tokens":731,"total_tokens":745,"completion_tokens":14}}}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_173","response":{"status_code":200,"request_id":"3a5a506a96e7fb0c","body":{"id":"aSqrAK-DWtfspHOopjQA5","choices":[{"finish_reason":"stop","index":0,"message":{"content":"{\"landing_page\": true, \"aesthetic\": 4}","reasoning_content":"","role":"assistant","tool_calls":[]},"logprobs":null}],"created":1752187822,"model":"google/gemma-3-27b-instruct/bf-16","system_fingerprint":"","object":"chat.completion","usage":{"promp
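
Each output line wraps the model's JSON answer as a string inside `response.body.choices[0].message.content`. A minimal sketch of turning that JSONL into a lookup table and a score distribution follows. `parse_results` is a hypothetical helper, and the sample records are trimmed copies of the two lines printed above, keeping only the fields the parser reads.

```python
import json
from collections import Counter

def parse_results(jsonl_text: str) -> dict:
    """Map custom_id -> parsed classification for every successful response."""
    results = {}
    for line in jsonl_text.splitlines():
        try:
            record = json.loads(line)
            response = record.get("response") or {}
            if response.get("status_code") != 200:
                continue
            content = response["body"]["choices"][0]["message"]["content"]
            results[record["custom_id"]] = json.loads(content)
        except (json.JSONDecodeError, KeyError, IndexError):
            continue   # skip blank, truncated, or unexpected lines
    return results

# Trimmed sample built from the two records shown in the output above.
sample_output = "\n".join(
    json.dumps({
        "custom_id": cid,
        "response": {"status_code": 200,
                     "body": {"choices": [{"message": {"content": content}}]}},
    })
    for cid, content in [
        ("img_174", '{"landing_page": true, "aesthetic": 4}'),
        ("img_173", '{"landing_page": true, "aesthetic": 4}'),
    ]
)

results = parse_results(sample_output)
print(Counter(r["aesthetic"] for r in results.values()))   # Counter({4: 2})
```

With the real batch, `parse_results(output_file.text)` would give one entry per successfully classified screenshot.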

In [23]:
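# Requests that failed are reported in a separate error file
# (error_file_id may be empty if every request succeeded)
if batch.error_file_id:
    error_file = client.files.content(batch.error_file_id)
    print(error_file.text)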

{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_0","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_3","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_1","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_4","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_6","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_2","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPEJ","custom_id":"img_5","response":null,"error":{"code":"inference_failed","message":"Maximum retries reached"}}
{"id":"XCJP4hQpY8ifQQM8EsPE
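
Failed requests such as the "Maximum retries reached" entries above can be resubmitted in a follow-up batch. The sketch below collects the failed `custom_id` values and selects the matching request lines. `failed_ids` and `select_retries` are hypothetical helpers, and the sample data is a shortened stand-in for the error file and the `lines` list built earlier.

```python
import json

def failed_ids(error_jsonl: str) -> set:
    """Collect the custom_id of every record that carries an error."""
    ids = set()
    for line in error_jsonl.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip blank or truncated lines
        if record.get("error"):
            ids.add(record["custom_id"])
    return ids

def select_retries(request_lines, ids):
    """Keep only the original request lines whose custom_id failed."""
    return [ln for ln in request_lines if json.loads(ln)["custom_id"] in ids]

# Shortened stand-ins for error_file.text and the request lines.
sample_errors = "\n".join([
    '{"custom_id": "img_0", "response": null, "error": {"code": "inference_failed"}}',
    '{"custom_id": "img_3", "response": null, "error": {"code": "inference_failed"}}',
    '{"custom_id": "img_',                # truncated line, ignored by the parser
])
sample_requests = [json.dumps({"custom_id": f"img_{i}", "body": {}}) for i in (0, 1, 3)]

retry_lines = select_retries(sample_requests, failed_ids(sample_errors))
print(len(retry_lines))                   # 2
```

The selected lines can then be uploaded and submitted the same way as the original batch.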