Merged
dgageot added a commit to rumpl/cagent that referenced this pull request Mar 3, 2026
- Guard against decompression bombs in ResizeImage by rejecting decoded images exceeding 20000x20000 pixels before processing
- Fix image stripping for models with unknown capabilities: only strip images when modalities are explicitly known and exclude image input
- Check file existence before calling IsImageFile in read_file handler to provide clearer errors and avoid type detection on missing files

Assisted-By: cagent
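The decompression-bomb guard in the first bullet can be illustrated with a minimal sketch. It is not the cagent implementation: `checkDimensions`, `errTooLarge`, and `tinyPNG` are hypothetical names, and this variant reads only the image header with `image.DecodeConfig` instead of checking the decoded image as the commit describes. Only the 20000-pixel limit comes from the commit message.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"image"
	"image/color"
	_ "image/gif"  // register the GIF decoder
	_ "image/jpeg" // register the JPEG decoder
	"image/png"    // importing also registers the PNG decoder
)

// maxDecodedDimension mirrors the 20000-pixel limit from the commit message.
const maxDecodedDimension = 20000

// errTooLarge is a hypothetical sentinel error for oversized images.
var errTooLarge = errors.New("image dimensions exceed limit")

// checkDimensions reads only the image header and rejects images whose
// width or height exceeds maxDecodedDimension, so no pixel buffer is
// allocated for an oversized input. Hypothetical helper.
func checkDimensions(data []byte) (width, height int, err error) {
	cfg, _, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return 0, 0, fmt.Errorf("decode image header: %w", err)
	}
	if cfg.Width > maxDecodedDimension || cfg.Height > maxDecodedDimension {
		return cfg.Width, cfg.Height, errTooLarge
	}
	return cfg.Width, cfg.Height, nil
}

// tinyPNG builds a w x h PNG in memory for demonstration purposes.
func tinyPNG(w, h int) []byte {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	img.Set(0, 0, color.RGBA{R: 255, A: 255})
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	w, h, err := checkDimensions(tinyPNG(4, 3))
	fmt.Println(w, h, err)
}
```

Checking the header first is one possible design choice: it is cheap, and a full decode can still be followed by a second check on the resulting bounds.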
Add the ability for the filesystem read_file tool and MCP tools to return image content (JPEG, PNG, GIF, WebP) to LLMs. Images are base64-encoded and forwarded as multimodal content through the provider-specific message formats:

- Anthropic: image blocks within tool_result content (both standard and beta API)
- OpenAI: injected user message with image after tool result (both completions and responses API)
- Gemini: FunctionResponseParts with inline image data

Changes:

- Add ImageContent type to tools.ToolCallResult for carrying image data
- Add IsImageFile/IsImageMimeType helpers to chat package
- Filesystem read_file detects image files and returns base64-encoded content
- MCP processMCPContent extracts ImageContent from tool results
- Runtime propagates tool result images as MultiContent on tool messages
- All providers updated to handle images in tool result messages

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
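As a rough illustration of the base64 step described in this commit, the sketch below wraps raw image bytes in a small struct and renders them as a data URL. The field names of `ImageContent`, the lowercase `isImageMimeType` helper, and `newImageContent` are assumptions made for this example; the real types in the tools and chat packages may look different.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// ImageContent is a hypothetical shape for image data carried by a tool
// result. The field names are assumptions, not the cagent definition.
type ImageContent struct {
	Data     string // base64-encoded image bytes
	MimeType string // e.g. "image/png"
}

// isImageMimeType reports whether mime is one of the four formats named
// in the commit message (JPEG, PNG, GIF, WebP).
func isImageMimeType(mime string) bool {
	switch strings.ToLower(strings.TrimSpace(mime)) {
	case "image/jpeg", "image/png", "image/gif", "image/webp":
		return true
	}
	return false
}

// newImageContent base64-encodes raw bytes after validating the MIME type.
func newImageContent(raw []byte, mime string) (ImageContent, error) {
	if !isImageMimeType(mime) {
		return ImageContent{}, fmt.Errorf("unsupported image type %q", mime)
	}
	return ImageContent{
		Data:     base64.StdEncoding.EncodeToString(raw),
		MimeType: mime,
	}, nil
}

// dataURL renders the content as a base64 data URL.
func (c ImageContent) dataURL() string {
	return "data:" + c.MimeType + ";base64," + c.Data
}

func main() {
	img, err := newImageContent([]byte("hi"), "image/png")
	if err != nil {
		panic(err)
	}
	fmt.Println(img.dataURL())
}
```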
…lities

Use the modelsdev Model.Modalities.Input field to check if the current model supports image input. When it doesn't, strip all image-related content (ImageURL parts and image file attachments) from messages before sending them to the provider.

This prevents API errors when conversation history contains images from tool results or user attachments but the model is text-only (e.g. gpt-3.5-turbo, o1-mini, codex-mini).

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
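The gating rule can be sketched as follows, including the later fix that leaves images alone when a model's modalities are unknown. `Part`, `PartType`, `supportsImages`, and `stripImages` are hypothetical names for this illustration, and the input modalities are modelled as a plain string slice, which is an assumption about the shape of Model.Modalities.Input.

```go
package main

import "fmt"

// PartType and Part are hypothetical stand-ins for message parts.
type PartType string

const (
	PartText  PartType = "text"
	PartImage PartType = "image_url"
)

type Part struct {
	Type PartType
	Text string
	URL  string
}

// supportsImages reports whether "image" is listed as an input modality.
// known is false when no modalities are available at all.
func supportsImages(inputModalities []string) (supported, known bool) {
	if len(inputModalities) == 0 {
		return false, false
	}
	for _, m := range inputModalities {
		if m == "image" {
			return true, true
		}
	}
	return false, true
}

// stripImages drops image parts only when the modalities are explicitly
// known and exclude image input. Unknown capabilities leave parts as-is.
func stripImages(parts []Part, inputModalities []string) []Part {
	supported, known := supportsImages(inputModalities)
	if !known || supported {
		return parts
	}
	kept := make([]Part, 0, len(parts))
	for _, p := range parts {
		if p.Type != PartImage {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	parts := []Part{
		{Type: PartText, Text: "What is in this picture?"},
		{Type: PartImage, URL: "data:image/png;base64,AAAA"},
	}
	fmt.Println(len(stripImages(parts, []string{"text"})))
	fmt.Println(len(stripImages(parts, nil)))
}
```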
MIME detection: DetectMimeType now falls back to content sniffing (magic bytes via http.DetectContentType + manual WebP check) when the file extension is unrecognised. This correctly identifies images even when they have wrong or missing extensions.

Image resizing: New ResizeImage/ResizeImageBase64 functions ensure images stay within provider limits (max 2000x2000 pixels, max 4.5MB). Uses CatmullRom (bicubic) interpolation via golang.org/x/image/draw. Progressive fallback: tries PNG vs JPEG, then decreasing JPEG quality (70/55/40), then reduced dimensions (75%/50%/35%/25%). FormatDimensionNote generates coordinate mapping notes so models can translate between resized and original image coordinates.

Both image ingestion points now resize:

- read_file tool (filesystem.go): resizes before base64 encoding
- @file attachments (app.go): reads, resizes, and inlines as base64 data URL (MessagePartTypeImageURL) instead of MessagePartTypeFile, making image attachments work across all providers not just Anthropic

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
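The extension-then-sniffing fallback described under MIME detection might look roughly like the sketch below. The lowercase `detectMimeType` and `isWebP` are hypothetical helpers written for this illustration, and the extension table is an assumption; only `http.DetectContentType` is a real standard library call.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"path/filepath"
	"strings"
)

// detectMimeType trusts a recognised image extension first and falls
// back to sniffing the file content otherwise. Hypothetical helper.
func detectMimeType(path string, data []byte) string {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".jpg", ".jpeg":
		return "image/jpeg"
	case ".png":
		return "image/png"
	case ".gif":
		return "image/gif"
	case ".webp":
		return "image/webp"
	}
	// Manual WebP check: a RIFF container whose form type is "WEBP".
	if isWebP(data) {
		return "image/webp"
	}
	// DetectContentType inspects at most the first 512 bytes.
	return http.DetectContentType(data)
}

// isWebP checks the 12-byte RIFF/WEBP signature.
func isWebP(data []byte) bool {
	return len(data) >= 12 &&
		bytes.Equal(data[0:4], []byte("RIFF")) &&
		bytes.Equal(data[8:12], []byte("WEBP"))
}

func main() {
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	// The file has no extension, so the type comes from the magic bytes.
	fmt.Println(detectMimeType("screenshot", pngHeader))
}
```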
Signed-off-by: David Gageot <david.gageot@docker.com>
/review
Review Summary
I've reviewed this PR that adds end-to-end image support to cagent. The implementation includes image detection, base64 encoding, automatic resizing with progressive fallback, MIME type detection via content sniffing, and provider-specific encoding for Anthropic, OpenAI, and Gemini.
The code includes comprehensive safeguards:
- Decompression bomb protection (maxDecodedDimension check)
- Progressive resize fallback strategy
- Error handling for resize failures
- Model capability gating to strip images for text-only models
No critical issues found. The changes look good to merge!
dgageot approved these changes Mar 3, 2026
Testing this
With this code, run an agent that has the filesystem toolset and ask it "What's in <image.png>?". It should be able to read the file and describe the image for you.
As another example, you can now ask an agent that has the playwright MCP: "take a screenshot of random.org and tell me what you see".
Adds end-to-end image support so that LLMs can see images from tool results (read_file, MCP tools) and user attachments (@file.png).
What's new
Image ingestion
MIME detection
Image resizing
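To make the dimension limit and the coordinate note concrete, here is a small sketch of the arithmetic only. It does not resample pixels (the PR uses CatmullRom from golang.org/x/image/draw for that). `fitWithin` and `dimensionNote` are hypothetical helpers, and the wording of the note is an assumption, not the output of FormatDimensionNote.

```go
package main

import "fmt"

// maxDim mirrors the 2000-pixel provider limit named in the commit message.
const maxDim = 2000

// fitWithin scales (w, h) down to fit inside limit x limit while keeping
// the aspect ratio. Images already within the limit are returned as-is.
func fitWithin(w, h, limit int) (int, int) {
	if w <= limit && h <= limit {
		return w, h
	}
	if w >= h {
		nh := h * limit / w
		if nh < 1 {
			nh = 1
		}
		return limit, nh
	}
	nw := w * limit / h
	if nw < 1 {
		nw = 1
	}
	return nw, limit
}

// dimensionNote describes how to map coordinates in the resized image
// back to the original. It returns "" when no resize happened.
func dimensionNote(origW, origH, newW, newH int) string {
	if origW == newW && origH == newH {
		return ""
	}
	scale := float64(origW) / float64(newW)
	return fmt.Sprintf(
		"[image resized from %dx%d to %dx%d; multiply coordinates by %.2f to map to the original]",
		origW, origH, newW, newH, scale)
}

func main() {
	w, h := fitWithin(4000, 3000, maxDim)
	fmt.Println(w, h)
	fmt.Println(dimensionNote(4000, 3000, w, h))
}
```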
Provider-specific encoding
Model capability gating
New dependency