
Trailing-assistant 400 on llama.cpp/vLLM with thinking-on templates (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2) #27920

@feanor5555

Description


Symptom

Local OpenAI-compatible servers running thinking-on-by-default chat templates (llama.cpp --reasoning on, vLLM with reasoning, TGI with thinking, mistral.rs, etc.) reject any opencode request whose last message is role:"assistant", with:

HTTP 400 {"error":{"message":"Assistant response prefill is incompatible with enable_thinking."}}

opencode emits a trailing-assistant message in two situations, both of which trip this error:

  1. Empty trailing assistant: message-v2.toModelMessagesEffect sometimes builds an assistant UIMessage whose only parts are [step-start, reasoning("")]. convertToModelMessages collapses that to content:"", which is sent as a trailing assistant turn (a sketch of the resulting request body follows this list).
  2. Non-empty trailing assistant: session/prompt.ts deliberately injects a MAX_STEPS wrap-up instruction as role:"assistant" (response continuation / prefill).
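For case 1, the request body that reaches the server looks roughly like the following (an illustrative sketch, not a capture; model name and user text are made up):

    {
      "model": "qwen3-local",
      "messages": [
        { "role": "user", "content": "..." },
        { "role": "assistant", "content": "" }
      ]
    }

Because the chat template branches on enable_thinking, it refuses to treat that trailing assistant turn as a prefill and the server returns the 400 above before generating anything.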

Reproduction

  1. Run a llama-server with a thinking template, e.g.:
    llama-server --model Qwen3.5-9B-...gguf --reasoning on --jinja --port 8080
    
  2. Point opencode at it via an @ai-sdk/openai-compatible provider in opencode.json (an example config follows this list).
  3. Run any agent with a steps limit small enough to trigger MAX_STEPS, or any flow that emits an empty-reasoning assistant turn.
  4. Observe HTTP 400s in the llama-server log.
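A minimal opencode.json for step 2 might look like this; it assumes the current custom-provider shape (npm package + baseURL + model map), and key names may differ between opencode versions:

    {
      "$schema": "https://opencode.ai/config.json",
      "provider": {
        "llama-local": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "llama.cpp (local)",
          "options": { "baseURL": "http://localhost:8080/v1" },
          "models": {
            "qwen3-local": { "name": "Qwen3 (thinking)" }
          }
        }
      }
    }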

Affected model families

Every 2025-2026 open-weight thinking family using enable_thinking-branching templates:

  • Qwen3 hybrid (all sizes), Qwen3-Thinking-2507, Qwen3-VL, Qwen3.5, Qwen3.6, QwQ-32B
  • DeepSeek-R1, R1-0528, V4 (when thinking on)
  • GLM-4.6, GLM-4.7 thinking
  • Kimi-K2-Thinking
  • MiniMax-M2

Not affected: Qwen2.5, Qwen3-Coder, Qwen3-Instruct-2507, all Anthropic/OpenAI/Google/Bedrock-Anthropic models (these either don't use enable_thinking branching or accept prefill natively).

Upstream / cross-framework references

Why fix it in opencode

A llama.cpp-side fix is unlikely soon and would only cover llama.cpp. opencode is the boundary where the per-provider request shape is decided, and where capability data already lives. Fixing it here also covers vLLM/TGI/mistral.rs, which have analogous behaviour but no shared upstream change to wait for.

Proposed fix (three PRs)

  1. Empty-trailing case: extend the existing transform.ts empty-content filter (currently anthropic + bedrock only) to @ai-sdk/openai-compatible, refactoring the two near-identical map+filter chains into one helper.
  2. Model.prefill capability: add an optional prefill boolean to Model and to the user-facing config schema. No consumer wiring yet.
  3. Consumer + runtime probe: ProviderTransform.canAcceptTrailingAssistant(model) with three-layer precedence (explicit config / auto-inference / default true). The MAX_STEPS path in session/prompt.ts routes between role:assistant and role:user based on it. A runtime probe of <baseURL>/props (llama.cpp) detects enable_thinking-branching templates automatically, so no user config is needed for the common case. A sketch of this logic follows the list.
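A sketch of PR 3, assuming hypothetical shapes for Model and the probe helper. Only canAcceptTrailingAssistant, the /props probe, and the MAX_STEPS role routing come from this proposal; field names and the chat_template check are illustrative:

    // Hypothetical Model shape carrying the prefill capability from PR 2.
    interface Model {
      providerID: string
      modelID: string
      prefill?: boolean // explicit override from config; undefined = unknown
    }

    // Three-layer precedence: explicit config, then auto-inference
    // (the /props probe below), then default true.
    export async function canAcceptTrailingAssistant(model: Model, baseURL?: string): Promise<boolean> {
      if (model.prefill !== undefined) return model.prefill // 1. explicit
      if (baseURL) {
        const probed = await probeLlamaCppProps(baseURL) // 2. auto-inference
        if (probed !== undefined) return probed
      }
      return true // 3. default: assume prefill works
    }

    // Ask llama.cpp's /props endpoint for the loaded chat template and look for
    // an enable_thinking branch. Illustrative: real responses carry more fields,
    // and non-llama.cpp servers simply fall through to the default.
    async function probeLlamaCppProps(baseURL: string): Promise<boolean | undefined> {
      try {
        const res = await fetch(new URL("/props", baseURL)) // absolute path drops the /v1 suffix
        if (!res.ok) return undefined
        const props = (await res.json()) as { chat_template?: string }
        if (!props.chat_template) return undefined
        return !props.chat_template.includes("enable_thinking")
      } catch {
        return undefined
      }
    }

    // In session/prompt.ts: the synthetic MAX_STEPS wrap-up keeps its text,
    // only its role changes when trailing-assistant prefill is unsupported.
    function maxStepsMessage(text: string, prefillOK: boolean) {
      return { role: prefillOK ? ("assistant" as const) : ("user" as const), content: text }
    }

The default-true layer keeps behaviour unchanged for providers that accept prefill natively; only providers explicitly configured or probed as prefill-incompatible get the role swap.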

Thinking stays enabled in the request body throughout — only the role of the synthetic MAX_STEPS message changes from assistant to user. The model thinks and writes its wrap-up normally.
