Feature hasn't been suggested before.
Describe the enhancement you want to request
When using a coding-focused LLM that doesn't support image input (e.g. DeepSeek, GLM, Haiku), pasting a screenshot into the chat fails silently: the image is dropped, the model receives an error string in its place, and the user gets no indication that images aren't supported.
Proposed solution: Before calling `streamText`, if the active model can't read images but the message contains image parts, automatically find a vision-capable model from any configured provider, call it with a description prompt, and replace the image parts with the returned text description. The main model then receives the description as plain text: entirely transparent, with no model switching required.
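A minimal sketch of the substitution step, assuming the AI SDK's `generateText` and message-part types; the helper name, prompt wording, and orchestration are illustrative, not the PR's actual implementation:

```ts
import { generateText, type CoreMessage, type LanguageModel } from "ai";

// Hypothetical helper: swap every image part for a text description produced
// by a vision-capable fallback model, so a text-only main model still "sees"
// the screenshot. `visionModel` is whichever model the resolver picked.
async function replaceImagesWithDescriptions(
  messages: CoreMessage[],
  visionModel: LanguageModel,
): Promise<CoreMessage[]> {
  return Promise.all(
    messages.map(async (message) => {
      // Only user messages with multi-part content can carry images.
      if (message.role !== "user" || !Array.isArray(message.content)) {
        return message;
      }
      const content = await Promise.all(
        message.content.map(async (part) => {
          if (part.type !== "image") return part;
          // Ask the vision model to describe this single image.
          const { text } = await generateText({
            model: visionModel,
            messages: [
              {
                role: "user",
                content: [
                  {
                    type: "text",
                    text: "Describe this image in detail for a model that cannot see it.",
                  },
                  part,
                ],
              },
            ],
          });
          return { type: "text" as const, text: `[Image description] ${text}` };
        }),
      );
      return { ...message, content };
    }),
  );
}
```

At call time, the pipeline would run this substitution only when the active model lacks vision support and the messages actually contain image parts, then pass the rewritten messages to `streamText` as usual.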
A new optional `vision_model` config field lets users pin a specific model (e.g. `openai/gpt-4o`). If not set, the first image-capable model found across all configured providers is used, skipping the current model to handle cases where the active provider has billing or rate-limit issues.
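A hedged sketch of that fallback selection, with a hypothetical `ModelInfo` shape and `resolveVisionModel` helper (the real PR's config and provider types will differ):

```ts
// Hypothetical shapes; the actual provider/model representation will differ.
interface ModelInfo {
  id: string; // e.g. "openai/gpt-4o"
  supportsImages: boolean;
}

// Prefer the pinned `vision_model` from config; otherwise scan all configured
// providers and take the first image-capable model that isn't the active one,
// since the active provider may itself be failing (billing, rate limits).
function resolveVisionModel(
  configuredModels: ModelInfo[],
  activeModelId: string,
  pinnedVisionModel?: string,
): ModelInfo | undefined {
  if (pinnedVisionModel) {
    return configuredModels.find((m) => m.id === pinnedVisionModel);
  }
  return configuredModels.find(
    (m) => m.supportsImages && m.id !== activeModelId,
  );
}
```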
Implementation: PR #24382
Note on #22828: That issue proposes a similar concept but with a different scope. It focuses on transcription for non-multimodal providers, while this proposal handles the case where the specific active model lacks vision support, regardless of provider, and includes fallback to any available provider. The implementation here also adds a dedicated `vision_model` config field and integrates into the streaming pipeline rather than as a pre-processing step.