👀 Vision 👀 #1889

Merged
dgageot merged 5 commits into docker:main from rumpl:vision
Mar 3, 2026

rumpl commented Mar 2, 2026

Testing this

With this code, run an agent that has the filesystem toolset and ask it "What's in <image.png>". It should be able to read the file and describe the image for you.

As another example, you can now ask an agent that has the Playwright MCP: "take a screenshot of random.org and tell me what you see".


Adds end-to-end image support so that LLMs can see images from tool results (read_file, MCP tools) and user attachments (@file.png).

What's new

Image ingestion

  • The read_file filesystem tool now detects image files (JPEG, PNG, GIF, WebP) and returns them as base64-encoded image content alongside a text description
  • MCP tool results containing ImageContent are extracted and forwarded to the model
  • User @file.png attachments are read, resized, and inlined as base64 data URLs — this works across all providers, not just Anthropic's File API
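As a rough illustration of the "inlined as base64 data URL" step, here is a minimal sketch. `toDataURL` is a hypothetical helper written for this example, not a function from the PR, and it leaves out the resizing that the real code performs first.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// toDataURL inlines raw image bytes as a base64 data URL.
// Hypothetical helper for illustration only; not part of the PR.
func toDataURL(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	// The eight-byte signature that starts every PNG file.
	pngSignature := []byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}
	fmt.Println(toDataURL("image/png", pngSignature))
	// prints "data:image/png;base64,iVBORw0KGgo="
}
```

A data URL carries the MIME type and the payload in a single string, which is presumably why it works with providers that have no separate file-upload API.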

MIME detection

  • DetectMimeType now uses content sniffing first (magic bytes via http.DetectContentType + manual WebP check), falling back to extension only for text files
  • New DetectMimeTypeByContent helper for raw byte detection
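Content sniffing along these lines could look like the sketch below. `sniffImageMime` is a hypothetical name and the PR's actual `DetectMimeTypeByContent` may differ; the only assumption is the combination the description names, `http.DetectContentType` plus an explicit check of the RIFF/WEBP container header.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// sniffImageMime guesses a MIME type from the leading bytes of a file.
// Hypothetical helper for illustration; not the PR's implementation.
func sniffImageMime(data []byte) string {
	// Manual WebP check: a RIFF container whose form type at offset 8 is "WEBP".
	if len(data) >= 12 &&
		bytes.Equal(data[0:4], []byte("RIFF")) &&
		bytes.Equal(data[8:12], []byte("WEBP")) {
		return "image/webp"
	}
	// DetectContentType inspects at most the first 512 bytes and
	// returns "application/octet-stream" when nothing matches.
	return http.DetectContentType(data)
}

func main() {
	fmt.Println(sniffImageMime([]byte("\x89PNG\r\n\x1a\n")))           // PNG signature
	fmt.Println(sniffImageMime([]byte("GIF89a")))                      // GIF header
	fmt.Println(sniffImageMime([]byte("RIFF\x00\x00\x00\x00WEBPVP8 "))) // WebP header
}
```

Because only the leading bytes matter, a caller can read a small prefix of the file rather than the whole image before deciding how to handle it.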

Image resizing

  • Images are automatically resized to stay within provider limits (max 2000×2000 px, max 4.5 MB)
  • Uses CatmullRom (bicubic) interpolation via golang.org/x/image/draw
  • Progressive fallback strategy: PNG vs JPEG comparison → decreasing JPEG quality (70/55/40) → reduced dimensions (75%/50%/35%/25%)
  • FormatDimensionNote generates coordinate mapping notes (e.g. "original 4000×3000, displayed at 2000×1500. Multiply coordinates by 2.00...") so the model can translate coordinates back to the original image
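The dimension limit and the coordinate note can be sketched with plain arithmetic. `fitWithin` and `dimensionNote` are hypothetical names, and the note wording only approximates the example quoted above; the pixel resampling itself (CatmullRom via `golang.org/x/image/draw`) and the quality fallback are deliberately left out.

```go
package main

import (
	"fmt"
	"math"
)

// fitWithin scales (w, h) down so that neither side exceeds limit,
// preserving the aspect ratio. It never upscales.
// Hypothetical helper for illustration; not the PR's ResizeImage.
func fitWithin(w, h, limit int) (int, int) {
	if w <= 0 || h <= 0 || (w <= limit && h <= limit) {
		return w, h
	}
	scale := math.Min(float64(limit)/float64(w), float64(limit)/float64(h))
	nw := int(math.Round(float64(w) * scale))
	nh := int(math.Round(float64(h) * scale))
	if nw < 1 {
		nw = 1
	}
	if nh < 1 {
		nh = 1
	}
	return nw, nh
}

// dimensionNote tells the model how to map coordinates in the resized
// image back to the original. The wording is an assumption.
func dimensionNote(origW, origH, newW, newH int) string {
	factor := float64(origW) / float64(newW)
	return fmt.Sprintf("original %d×%d, displayed at %d×%d. Multiply coordinates by %.2f",
		origW, origH, newW, newH, factor)
}

func main() {
	w, h := fitWithin(4000, 3000, 2000)
	fmt.Println(dimensionNote(4000, 3000, w, h))
	// prints "original 4000×3000, displayed at 2000×1500. Multiply coordinates by 2.00"
}
```

The note matters for tools that act on coordinates, such as clicking at a position in a screenshot: without the factor, the model would answer in the resized image's coordinate space.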

Provider-specific encoding

  • Anthropic: images inline as image blocks in tool_result content (both standard and beta API)
  • OpenAI: injected user message with image_url parts after tool result (completions and responses API — tool messages only support text)
  • Gemini: multimodal function responses via FunctionResponseParts with inline image data
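For the OpenAI case, the "injected user message" approach might look roughly like this in Chat Completions JSON terms. The struct types, `toolResultMessages`, and the text that introduces the images are all assumptions made for this sketch; the PR goes through the provider client code rather than hand-built JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal, hypothetical mirrors of the Chat Completions message shape.
type imageURL struct {
	URL string `json:"url"`
}

type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type message struct {
	Role       string `json:"role"`
	ToolCallID string `json:"tool_call_id,omitempty"`
	Content    any    `json:"content"`
}

// toolResultMessages keeps the tool message text-only and, when the tool
// returned images, appends a user message that carries them as image_url parts.
func toolResultMessages(toolCallID, text string, dataURLs []string) []message {
	msgs := []message{{Role: "tool", ToolCallID: toolCallID, Content: text}}
	if len(dataURLs) == 0 {
		return msgs
	}
	// The introductory text is an assumption made for this example.
	parts := []contentPart{{Type: "text", Text: "Images returned by the tool call above:"}}
	for _, u := range dataURLs {
		parts = append(parts, contentPart{Type: "image_url", ImageURL: &imageURL{URL: u}})
	}
	return append(msgs, message{Role: "user", Content: parts})
}

func main() {
	msgs := toolResultMessages("call_1", "Read screenshot.png",
		[]string{"data:image/png;base64,iVBORw0KGgo="})
	out, err := json.MarshalIndent(msgs, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Anthropic and Gemini need no such workaround according to the list above, since both accept image data inside the tool result itself.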

Model capability gating

  • Uses modelsdev model metadata (Modalities.Input) to check if the current model supports image input
  • Automatically strips image content from messages when the model is text-only (e.g. gpt-3.5-turbo, o1-mini, codex-mini), preventing API errors
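The gating logic can be sketched as below, folding in the refinement from a follow-up commit in this PR: images are stripped only when the model's input modalities are explicitly known and do not include images. The `part` type and `prepareParts` are hypothetical stand-ins for the real message types.

```go
package main

import "fmt"

// part is a hypothetical, simplified message part.
type part struct {
	Kind string // "text" or "image"
	Text string
	URL  string
}

// prepareParts drops image parts only when the input modalities are known
// (non-empty) and "image" is not among them. Unknown capabilities leave
// the message untouched.
func prepareParts(parts []part, inputModalities []string) []part {
	if len(inputModalities) == 0 {
		return parts // capabilities unknown: do not strip
	}
	for _, m := range inputModalities {
		if m == "image" {
			return parts // model accepts images
		}
	}
	kept := make([]part, 0, len(parts))
	for _, p := range parts {
		if p.Kind != "image" {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	parts := []part{
		{Kind: "text", Text: "What is in this picture?"},
		{Kind: "image", URL: "data:image/png;base64,iVBORw0KGgo="},
	}
	fmt.Println(len(prepareParts(parts, []string{"text"})))          // 1
	fmt.Println(len(prepareParts(parts, []string{"text", "image"}))) // 2
	fmt.Println(len(prepareParts(parts, nil)))                       // 2
}
```

Treating missing metadata as "unknown" rather than "text-only" avoids silently hiding images from models that the metadata source simply has not catalogued.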

New dependency

  • golang.org/x/image — for draw.CatmullRom (bicubic scaling) and WebP decoding

rumpl requested a review from a team as a code owner March 2, 2026 23:57

dgageot added a commit to rumpl/cagent that referenced this pull request Mar 3, 2026
- Guard against decompression bombs in ResizeImage by rejecting decoded
  images exceeding 20000x20000 pixels before processing
- Fix image stripping for models with unknown capabilities: only strip
  images when modalities are explicitly known and exclude image input
- Check file existence before calling IsImageFile in read_file handler
  to provide clearer errors and avoid type detection on missing files

Assisted-By: cagent

rumpl and others added 5 commits March 3, 2026 12:24
Add the ability for the filesystem read_file tool and MCP tools to
return image content (JPEG, PNG, GIF, WebP) to LLMs. Images are
base64-encoded and forwarded as multimodal content through the
provider-specific message formats:

- Anthropic: image blocks within tool_result content (both standard and beta API)
- OpenAI: injected user message with image after tool result (both completions and responses API)
- Gemini: FunctionResponseParts with inline image data

Changes:
- Add ImageContent type to tools.ToolCallResult for carrying image data
- Add IsImageFile/IsImageMimeType helpers to chat package
- Filesystem read_file detects image files and returns base64-encoded content
- MCP processMCPContent extracts ImageContent from tool results
- Runtime propagates tool result images as MultiContent on tool messages
- All providers updated to handle images in tool result messages

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

…lities

Use the modelsdev Model.Modalities.Input field to check if the current
model supports image input. When it doesn't, strip all image-related
content (ImageURL parts and image file attachments) from messages before
sending them to the provider. This prevents API errors when conversation
history contains images from tool results or user attachments but the
model is text-only (e.g. gpt-3.5-turbo, o1-mini, codex-mini).

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

MIME detection: DetectMimeType now falls back to content sniffing
(magic bytes via http.DetectContentType + manual WebP check) when the
file extension is unrecognised. This correctly identifies images even
when they have wrong or missing extensions.

Image resizing: New ResizeImage/ResizeImageBase64 functions ensure images
stay within provider limits (max 2000x2000 pixels, max 4.5MB). Uses
CatmullRom (bicubic) interpolation via golang.org/x/image/draw.
Progressive fallback: tries PNG vs JPEG, then decreasing JPEG quality
(70/55/40), then reduced dimensions (75%/50%/35%/25%).

FormatDimensionNote generates coordinate mapping notes so models can
translate between resized and original image coordinates.

Both image ingestion points now resize:
- read_file tool (filesystem.go): resizes before base64 encoding
- @file attachments (app.go): reads, resizes, and inlines as base64
  data URL (MessagePartTypeImageURL) instead of MessagePartTypeFile,
  making image attachments work across all providers not just Anthropic

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
Signed-off-by: David Gageot <david.gageot@docker.com>

- Guard against decompression bombs in ResizeImage by rejecting decoded
  images exceeding 20000x20000 pixels before processing
- Fix image stripping for models with unknown capabilities: only strip
  images when modalities are explicitly known and exclude image input
- Check file existence before calling IsImageFile in read_file handler
  to provide clearer errors and avoid type detection on missing files

Assisted-By: cagent

dgageot commented Mar 3, 2026

/review

docker-agent bot left a comment

Review Summary

I've reviewed this PR that adds end-to-end image support to cagent. The implementation includes image detection, base64 encoding, automatic resizing with progressive fallback, MIME type detection via content sniffing, and provider-specific encoding for Anthropic, OpenAI, and Gemini.

The code includes comprehensive safeguards:

  • Decompression bomb protection (maxDecodedDimension check)
  • Progressive resize fallback strategy
  • Error handling for resize failures
  • Model capability gating to strip images for text-only models

No critical issues found. The changes look good to merge!

dgageot merged commit 19db850 into docker:main on Mar 3, 2026
5 checks passed