fix(openai): coalesce system messages for self-hosted and open-model endpoints#3357
Merged
dgageot
approved these changes
Jul 1, 2026
…endpoints

docker-agent emits one system message per source (the agent instruction plus each toolset's instructions). Strict server-side chat templates reject a request that carries more than one system message: Qwen 3.5/3.6 served by vLLM fails with "HTTP 400: System message must be at the beginning" because the model's Jinja chat template only allows a system message at index 0 (issues #2327, #3344).

The chat-completions path already coalesced consecutive system messages, but only for an allow-list (explicit api_type=openai_chatcompletions, baseten, ovhcloud), so the reported config (provider: openai plus a base_url pointing at a self-hosted vLLM server) fell through and hit the error.

Extend shouldMergeConsecutiveMessages to also cover an openai provider with a custom base_url (self-hosted vLLM/SGLang) and the open-model host aliases that serve strict-template models (openrouter, nebius, alongside baseten and ovhcloud). First-party APIs with a fixed model lineup (official OpenAI, Mistral, xAI, ...) tolerate multiple system messages and are left unchanged.

The merge is a safe normalization (same-role content is concatenated), runs only on the Chat Completions path, and matches what the DMR client already does.

Re #3344
Problem
docker-agent emits one system message per source: the agent instruction plus each toolset's instructions (see `session.go`, `buildInvariantSystemMessages`), and a handoff prompt in multi-agent teams. Some OpenAI-compatible backends reject a request that carries more than one system message.

Reported in #3344 (previously #2327): Qwen 3.5/3.6 served by vLLM fails with `HTTP 400: System message must be at the beginning`.

The model's Jinja chat template calls `raise_exception('System message must be at the beginning.')` for any system message that is not at index 0.

Root cause
The chat-completions path already coalesces consecutive system messages, but `shouldMergeConsecutiveMessages` gated it to a narrow allow-list, so common vLLM configs fell through:

| Endpoint config | Merged before this change |
| --- | --- |
| explicit `api_type` (`openai_chatcompletions`) | yes |
| `provider: baseten` / `ovhcloud` | yes |
| `provider: openai` + `base_url` (self-hosted vLLM) | no |
| `openrouter` / `nebius` + Qwen | no |

The config in the report (`provider: openai` with a `base_url` pointing at a self-hosted vLLM server) is the third row.

Fix
Extend `shouldMergeConsecutiveMessages` to cover the endpoints that plausibly front a strict-template model, while leaving first-party APIs untouched:

- `provider: openai` with a custom `base_url` (self-hosted vLLM / SGLang), the exact reported config.
- `openrouter`, `nebius` (alongside the existing `baseten`, `ovhcloud`).
- `api_type: openai_chatcompletions` (custom providers already merged).

First-party APIs with a fixed model lineup (official OpenAI, Mistral, xAI, MiniMax, GitHub Copilot, OpenCode) tolerate multiple system messages and are deliberately excluded, so their behavior and recorded e2e cassettes are unchanged. Merging is a safe normalization (same-role content is concatenated), runs only on the Chat Completions path (the Responses path uses a separate converter), and matches what the DMR client already does.
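As a rough illustration of the gating and the merge described in this section, the sketch below shows one way the two pieces could fit together. It is not the code from this PR: the `Message` and `Endpoint` types, the `shouldMerge` and `mergeConsecutiveSystem` names, the "non-empty `BaseURL` means custom endpoint" test, and the blank-line separator are all assumptions made for the example.

```go
package main

import "fmt"

// Message is a simplified stand-in for a chat message (hypothetical type,
// not docker-agent's own).
type Message struct {
	Role    string
	Content string
}

// Endpoint holds the config fields the gate looks at (hypothetical type).
type Endpoint struct {
	Provider string // e.g. "openai", "openrouter", "mistral"
	BaseURL  string // assumed non-empty only when the user overrides the endpoint
	APIType  string // e.g. "openai_chatcompletions"
}

// shouldMerge approximates the gating described above. It is an assumed
// reading of the PR text, not the real shouldMergeConsecutiveMessages body.
func shouldMerge(e Endpoint) bool {
	if e.APIType == "openai_chatcompletions" {
		return true // custom providers with an explicit api_type
	}
	switch e.Provider {
	case "baseten", "ovhcloud", "openrouter", "nebius":
		return true // open-model hosts that may serve strict-template models
	case "openai":
		return e.BaseURL != "" // self-hosted vLLM / SGLang behind a custom base_url
	}
	return false // first-party APIs are left unchanged
}

// mergeConsecutiveSystem concatenates adjacent system messages into one.
// The "\n\n" separator is an assumption for this sketch.
func mergeConsecutiveSystem(msgs []Message) []Message {
	out := make([]Message, 0, len(msgs))
	for _, m := range msgs {
		n := len(out)
		if n > 0 && m.Role == "system" && out[n-1].Role == "system" {
			out[n-1].Content += "\n\n" + m.Content
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	msgs := []Message{
		{Role: "system", Content: "agent instruction"},
		{Role: "system", Content: "toolset instructions"},
		{Role: "user", Content: "hello"},
	}
	e := Endpoint{Provider: "openai", BaseURL: "http://localhost:8000/v1"}
	if shouldMerge(e) {
		msgs = mergeConsecutiveSystem(msgs)
	}
	for _, m := range msgs {
		fmt.Printf("%s: %q\n", m.Role, m.Content)
	}
}
```

With the reported config (`provider: openai` plus a `base_url`), the two system messages collapse into a single leading one, which is the shape a strict chat template accepts.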
Validation
Reproduced end to end using the real binary output and the real upstream Qwen chat template:

| Request body | `chat_template.jinja` rendered via jinja2 (the engine transformers/vLLM use) |
| --- | --- |
| before the fix (`[system, system, user]`) | `System message must be at the beginning.` |
| after the fix (`[system, user]`, merged) | renders without error |

The request bodies were captured from a real `docker-agent run --exec` against a `provider: openai` + `base_url` endpoint. The template is the upstream strict version hosted at `unsloth/Qwen3.5-2B@8b63d90c32e8` (the guard cited in #2327).

Tests (no network or GPU dependency, CI-safe):
- `TestReproIssue3344_QwenViaVLLM`: fake vLLM/Qwen server enforcing the leading-system-message rule, covering `provider: openai` + `base_url` and the `openrouter` alias.
- `TestShouldMergeConsecutiveMessages_Gating`: table asserting which endpoints merge, including that first-party `mistral` and `xai` do not.

Existing `system_message_merge_test.go` (baseten, ovhcloud, explicit `api_type`) and the Mistral e2e cassette tests (`TestExec_Mistral`, `TestExec_Mistral_ToolCall`) stay green.

Note on real-hardware testing
This change was not validated against a live vLLM deployment: no GPU was available to run vLLM with model weights. Instead the exact failing code path (server-side Jinja chat-template rendering) was reproduced faithfully with the real captured request body and the real upstream Qwen chat template, plus the CI tests above.
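For readers who want a feel for how the leading-system-message rule can be exercised without a GPU, here is a minimal sketch of a fake strict endpoint built on `net/http/httptest`. It is only an approximation of the idea behind the tests above, not their actual code: `checkLeadingSystem`, `newStrictServer`, and `postRoles` are hypothetical names, and the plain-text 400 body is a simplification of whatever a real vLLM server returns.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Messages []chatMessage `json:"messages"`
}

// checkLeadingSystem imitates the template guard: a system message is only
// accepted at index 0.
func checkLeadingSystem(msgs []chatMessage) error {
	for i, m := range msgs {
		if m.Role == "system" && i != 0 {
			return errors.New("System message must be at the beginning.")
		}
	}
	return nil
}

// newStrictServer starts a local fake endpoint that answers 400 when the
// guard fails and 200 otherwise (response bodies are simplified).
func newStrictServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req chatRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad request body", http.StatusBadRequest)
			return
		}
		if err := checkLeadingSystem(req.Messages); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

// postRoles sends a request whose messages carry the given roles and
// returns the HTTP status code.
func postRoles(url string, roles ...string) (int, error) {
	var req chatRequest
	for _, role := range roles {
		req.Messages = append(req.Messages, chatMessage{Role: role, Content: "x"})
	}
	body, err := json.Marshal(req)
	if err != nil {
		return 0, err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	srv := newStrictServer()
	defer srv.Close()

	for _, roles := range [][]string{
		{"system", "system", "user"}, // unmerged shape
		{"system", "user"},           // merged shape
	} {
		status, err := postRoles(srv.URL, roles...)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(roles, "->", status)
	}
}
```

The server listens on a loopback address only, so a check of this kind needs neither network access nor model weights.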
Re #3344