:eyes: Vision :eyes: by rumpl · Pull Request #1889 · docker/cagent

rumpl · 2026-03-02T23:56:59Z

Testing this

With this change, run an agent that has the filesystem toolset and ask it "What's in <image.png>". It should be able to read the file and describe the image for you.

Another example: with an agent that has the Playwright MCP, you can now ask "take a screenshot of random.org and tell me what you see".

Adds end-to-end image support so that LLMs can see images from tool results (read_file, MCP tools) and user attachments (@file.png).

What's new

Image ingestion

The read_file filesystem tool now detects image files (JPEG, PNG, GIF, WebP) and returns them as base64-encoded image content alongside a text description
MCP tool results containing ImageContent are extracted and forwarded to the model
User @file.png attachments are read, resized, and inlined as base64 data URLs — this works across all providers, not just Anthropic's File API
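As a rough illustration of the "inlined as base64 data URLs" step, here is a minimal sketch. The function name `toDataURL` is hypothetical and not taken from the PR; the real code also resizes the image before encoding.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// toDataURL inlines raw image bytes as a base64 data URL.
// Hypothetical helper for illustration; not the PR's actual function.
func toDataURL(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// Placeholder bytes standing in for a real, already resized image.
	fmt.Println(toDataURL("image/png", []byte("hi")))
}
```

A data URL of this shape can be carried in an image URL message part, which is presumably why the approach is not tied to one provider's file API.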

MIME detection

DetectMimeType now uses content sniffing first (magic bytes via http.DetectContentType + manual WebP check), falling back to extension only for text files
New DetectMimeTypeByContent helper for raw byte detection
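Content sniffing along these lines might look like the sketch below. `sniffMime` is a hypothetical name, and the real DetectMimeTypeByContent may differ; the only library call assumed is the standard `http.DetectContentType`, combined with an explicit check of the RIFF/WEBP container header.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// sniffMime guesses a MIME type from the leading bytes of a file.
// Hypothetical helper; the PR's DetectMimeTypeByContent may be implemented differently.
func sniffMime(data []byte) string {
	// A WebP file starts with "RIFF", a 4-byte size, then "WEBP".
	if len(data) >= 12 &&
		bytes.Equal(data[0:4], []byte("RIFF")) &&
		bytes.Equal(data[8:12], []byte("WEBP")) {
		return "image/webp"
	}
	// DetectContentType inspects at most the first 512 bytes.
	return http.DetectContentType(data)
}

func main() {
	fmt.Println(sniffMime([]byte("\x89PNG\r\n\x1a\n")))
	fmt.Println(sniffMime([]byte("RIFF\x00\x00\x00\x00WEBPVP8 ")))
	fmt.Println(sniffMime([]byte("GIF89a")))
}
```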

Image resizing

Images are automatically resized to stay within provider limits (max 2000×2000 px, max 4.5 MB)
Uses CatmullRom (bicubic) interpolation via golang.org/x/image/draw
Progressive fallback strategy: PNG vs JPEG comparison → decreasing JPEG quality (70/55/40) → reduced dimensions (75%/50%/35%/25%)
FormatDimensionNote generates coordinate mapping notes (e.g. "original 4000×3000, displayed at 2000×1500. Multiply coordinates by 2.00...") so the model can translate coordinates back to the original image
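The dimension limit and the coordinate note can be sketched with plain arithmetic. This is an assumption-laden illustration: `fitWithin` and `dimensionNote` are hypothetical names, the note wording after the scale factor is invented (the PR description elides it), and the actual scaling with draw.CatmullRom is left out.

```go
package main

import (
	"fmt"
	"math"
)

// maxDim is the per-side pixel limit quoted in the PR description.
const maxDim = 2000

// fitWithin returns dimensions that fit inside maxDim x maxDim while
// preserving the aspect ratio. Hypothetical helper, not the PR's code.
func fitWithin(w, h int) (int, int) {
	if w <= maxDim && h <= maxDim {
		return w, h
	}
	scale := math.Min(float64(maxDim)/float64(w), float64(maxDim)/float64(h))
	nw := int(math.Round(float64(w) * scale))
	nh := int(math.Round(float64(h) * scale))
	if nw < 1 {
		nw = 1
	}
	if nh < 1 {
		nh = 1
	}
	return nw, nh
}

// dimensionNote describes how to map coordinates on the resized image back
// to the original. The sentence after the factor is an illustrative guess.
func dimensionNote(origW, origH, newW, newH int) string {
	if origW == newW && origH == newH {
		return ""
	}
	factor := float64(origW) / float64(newW)
	return fmt.Sprintf("original %d×%d, displayed at %d×%d. Multiply coordinates by %.2f to map back to the original image.",
		origW, origH, newW, newH, factor)
}

func main() {
	w, h := fitWithin(4000, 3000)
	fmt.Println(dimensionNote(4000, 3000, w, h))
}
```

A note like this matters for tools that act on pixel coordinates (for example clicking in a screenshot), since the model only ever sees the resized image.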

Provider-specific encoding

Anthropic: images inline as image blocks in tool_result content (both standard and beta API)
OpenAI: injected user message with image_url parts after tool result (completions and responses API — tool messages only support text)
Gemini: multimodal function responses via FunctionResponseParts with inline image data
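To make the OpenAI variant concrete: since tool messages only carry text, the image travels in a user message placed right after the tool result. The sketch below builds that pair of messages in the Chat Completions JSON shape. The struct types, the function name, and the filler text are hypothetical; the PR uses its own provider client types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical minimal types mirroring the Chat Completions message shape.
type imageURL struct {
	URL string `json:"url"`
}

type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type message struct {
	Role       string `json:"role"`
	ToolCallID string `json:"tool_call_id,omitempty"`
	Content    any    `json:"content"`
}

// toolResultWithImage returns the text-only tool message followed by an
// injected user message that carries the image as an image_url part.
func toolResultWithImage(toolCallID, text, dataURL string) []message {
	return []message{
		{Role: "tool", ToolCallID: toolCallID, Content: text},
		{Role: "user", Content: []contentPart{
			{Type: "text", Text: "Image returned by the tool call above."},
			{Type: "image_url", ImageURL: &imageURL{URL: dataURL}},
		}},
	}
}

func main() {
	msgs := toolResultWithImage("call_1", "screenshot.png (image/png)", "data:image/png;base64,aGk=")
	out, err := json.MarshalIndent(msgs, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```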

Model capability gating

Uses modelsdev model metadata (Modalities.Input) to check if the current model supports image input
Automatically strips image content from messages when the model is text-only (e.g. gpt-3.5-turbo, o1-mini, codex-mini), preventing API errors
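The gating logic can be approximated as follows. The `part` type and both function names are hypothetical stand-ins for the PR's message types; the rule that unknown capabilities do not trigger stripping comes from a follow-up commit in this PR.

```go
package main

import "fmt"

// part is a hypothetical, simplified message part.
type part struct {
	Type string // "text" or "image_url"
	Text string
	URL  string
}

// supportsImages reports whether image input should be allowed.
// An empty modality list means "unknown", which is treated as allowed so
// that images are only stripped when the model is known to be text-only.
func supportsImages(inputModalities []string) bool {
	if len(inputModalities) == 0 {
		return true
	}
	for _, m := range inputModalities {
		if m == "image" {
			return true
		}
	}
	return false
}

// stripImages drops image parts when the model is known not to accept them.
func stripImages(parts []part, inputModalities []string) []part {
	if supportsImages(inputModalities) {
		return parts
	}
	kept := make([]part, 0, len(parts))
	for _, p := range parts {
		if p.Type != "image_url" {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	parts := []part{
		{Type: "text", Text: "What is in this picture?"},
		{Type: "image_url", URL: "data:image/png;base64,aGk="},
	}
	fmt.Println(len(stripImages(parts, []string{"text"})))          // text-only model
	fmt.Println(len(stripImages(parts, []string{"text", "image"}))) // vision model
	fmt.Println(len(stripImages(parts, nil)))                       // unknown model
}
```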

New dependency

golang.org/x/image — for draw.CatmullRom (bicubic scaling) and WebP decoding

pkg/chat/image.go

- Guard against decompression bombs in ResizeImage by rejecting decoded images exceeding 20000x20000 pixels before processing
- Fix image stripping for models with unknown capabilities: only strip images when modalities are explicitly known and exclude image input
- Check file existence before calling IsImageFile in read_file handler to provide clearer errors and avoid type detection on missing files

Assisted-By: cagent

Add the ability for the filesystem read_file tool and MCP tools to return image content (JPEG, PNG, GIF, WebP) to LLMs. Images are base64-encoded and forwarded as multimodal content through the provider-specific message formats:
- Anthropic: image blocks within tool_result content (both standard and beta API)
- OpenAI: injected user message with image after tool result (both completions and responses API)
- Gemini: FunctionResponseParts with inline image data

Changes:
- Add ImageContent type to tools.ToolCallResult for carrying image data
- Add IsImageFile/IsImageMimeType helpers to chat package
- Filesystem read_file detects image files and returns base64-encoded content
- MCP processMCPContent extracts ImageContent from tool results
- Runtime propagates tool result images as MultiContent on tool messages
- All providers updated to handle images in tool result messages

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

…lities

Use the modelsdev Model.Modalities.Input field to check if the current model supports image input. When it doesn't, strip all image-related content (ImageURL parts and image file attachments) from messages before sending them to the provider.

This prevents API errors when conversation history contains images from tool results or user attachments but the model is text-only (e.g. gpt-3.5-turbo, o1-mini, codex-mini).

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

MIME detection: DetectMimeType now falls back to content sniffing (magic bytes via http.DetectContentType + manual WebP check) when the file extension is unrecognised. This correctly identifies images even when they have wrong or missing extensions.

Image resizing: New ResizeImage/ResizeImageBase64 functions ensure images stay within provider limits (max 2000x2000 pixels, max 4.5MB). Uses CatmullRom (bicubic) interpolation via golang.org/x/image/draw. Progressive fallback: tries PNG vs JPEG, then decreasing JPEG quality (70/55/40), then reduced dimensions (75%/50%/35%/25%). FormatDimensionNote generates coordinate mapping notes so models can translate between resized and original image coordinates.

Both image ingestion points now resize:
- read_file tool (filesystem.go): resizes before base64 encoding
- @file attachments (app.go): reads, resizes, and inlines as base64 data URL (MessagePartTypeImageURL) instead of MessagePartTypeFile, making image attachments work across all providers not just Anthropic

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

Signed-off-by: David Gageot <david.gageot@docker.com>

fixed

dgageot · 2026-03-03T11:28:41Z

/review

docker-agent

Review Summary

I've reviewed this PR that adds end-to-end image support to cagent. The implementation includes image detection, base64 encoding, automatic resizing with progressive fallback, MIME type detection via content sniffing, and provider-specific encoding for Anthropic, OpenAI, and Gemini.

The code includes comprehensive safeguards:

Decompression bomb protection (maxDecodedDimension check)
Progressive resize fallback strategy
Error handling for resize failures
Model capability gating to strip images for text-only models

No critical issues found. The changes look good to merge!

rumpl requested a review from a team as a code owner March 2, 2026 23:57

docker-agent bot reviewed Mar 3, 2026

docker-agent bot reviewed Mar 3, 2026

docker-agent bot reviewed Mar 3, 2026

dgageot force-pushed the vision branch from dca950b to 93b9555 Compare March 3, 2026 09:26

dgageot force-pushed the vision branch from 6e54d4d to ace3f92 Compare March 3, 2026 10:58

rumpl and others added 5 commits March 3, 2026 12:24

Handle errors

186fa91

Signed-off-by: David Gageot <david.gageot@docker.com>

dgageot force-pushed the vision branch from ace3f92 to 709a5e1 Compare March 3, 2026 11:24

docker-agent bot approved these changes Mar 3, 2026

dgageot approved these changes Mar 3, 2026

dgageot merged commit 19db850 into docker:main Mar 3, 2026
5 checks passed

BrewTestBot mentioned this pull request Mar 3, 2026

cagent 1.28.1 Homebrew/homebrew-core#270349

Merged

dgageot merged 5 commits into docker:main from rumpl:vision

2 participants
