Symptom
Local OpenAI-compatible servers running thinking-on-by-default chat templates (llama.cpp --reasoning on, vLLM with reasoning, TGI with thinking, mistral.rs, etc.) reject any opencode request whose last message is role:"assistant", with:
HTTP 400 {"error":{"message":"Assistant response prefill is incompatible with enable_thinking."}}
opencode emits a trailing-assistant message in two situations, both of which trip this error:
- Empty trailing assistant — message-v2.toModelMessagesEffect sometimes builds an assistant UIMessage whose only parts are [step-start, reasoning("")]. convertToModelMessages collapses that to content:"", which is sent as a trailing assistant turn.
- Non-empty trailing assistant — session/prompt.ts deliberately injects a MAX_STEPS wrap-up instruction as role:"assistant" (response continuation / prefill). Both cases reduce to the request shape shown below.
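A minimal /v1/chat/completions illustration (the model id and contents are placeholders; only the trailing role:"assistant" entry matters):

```json
{
  "model": "qwen3-8b",
  "messages": [
    { "role": "user", "content": "..." },
    { "role": "assistant", "content": "" }
  ]
}
```

A thinking-enabled template treats that trailing assistant entry as a response prefill, and the server answers with the 400 above.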
Reproduction
- Run a llama-server with a thinking template, e.g.:
llama-server --model Qwen3.5-9B-...gguf --reasoning on --jinja --port 8080
- Point opencode at it via an @ai-sdk/openai-compatible provider in opencode.json (example config after this list).
- Run any agent with a steps limit small enough to trigger MAX_STEPS, or any flow that emits an empty-reasoning assistant turn.
- Observe HTTP 400s in the llama-server log.
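An opencode.json provider entry along these lines covers step 2 (the keys follow the documented custom-provider shape, but verify against your opencode version; the model id, display names, and port are placeholders):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": { "baseURL": "http://127.0.0.1:8080/v1" },
      "models": { "qwen3.5-9b": { "name": "Qwen3.5 9B" } }
    }
  }
}
```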
Affected model families
Every 2025-2026 open-weight thinking family using enable_thinking-branching templates:
- Qwen3 hybrid (all sizes), Qwen3-Thinking-2507, Qwen3-VL, Qwen3.5, Qwen3.6, QwQ-32B
- DeepSeek-R1, R1-0528, V4 (when thinking on)
- GLM-4.6, GLM-4.7 thinking
- Kimi-K2-Thinking
- MiniMax-M2
Not affected: Qwen2.5, Qwen3-Coder, Qwen3-Instruct-2507, all Anthropic/OpenAI/Google/Bedrock-Anthropic models (these either don't use enable_thinking branching or accept prefill natively).
Upstream / cross-framework references
Why fix it in opencode
A llama.cpp-side fix is unlikely soon and would only cover llama.cpp. opencode is the boundary where the per-provider request shape is decided, and where capability data already lives. Fixing it here also covers vLLM/TGI/mistral.rs, which have analogous behaviour but no shared upstream change to wait for.
Proposed fix (three PRs)
- Empty-trailing case — extend the existing transform.ts empty-content filter (currently anthropic + bedrock only) to @ai-sdk/openai-compatible. Refactors two near-identical map+filter chains into one helper (sketched after this list).
- Model.prefill capability — add an optional boolean prefill field on Model and on the user-facing config schema. No consumer wiring yet.
- Consumer + runtime probe — ProviderTransform.canAcceptTrailingAssistant(model) with three-layer precedence (explicit / auto-inference / default true); see the sketch below. session/prompt.ts MAX_STEPS routes between role:assistant and role:user based on it. Runtime probe of <baseURL>/props (llama.cpp) detects enable_thinking-branching templates automatically — no user config needed for the common case.
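A sketch of the shared empty-content helper from the first PR, assuming the per-provider filters operate on the AI SDK's ModelMessage array (the helper name and part shapes are illustrative, not the current transform.ts code):

```ts
import type { ModelMessage } from "ai"

// Drop assistant messages whose content is effectively empty (an empty string,
// or only blank text/reasoning parts) so that no provider receives a blank
// trailing assistant turn. Tool calls and other parts count as real content.
function stripEmptyAssistantMessages(messages: ModelMessage[]): ModelMessage[] {
  return messages.filter((msg) => {
    if (msg.role !== "assistant") return true
    if (typeof msg.content === "string") return msg.content.trim().length > 0
    return msg.content.some((part) =>
      part.type === "text" || part.type === "reasoning"
        ? part.text.trim().length > 0
        : true,
    )
  })
}
```

The anthropic, bedrock, and @ai-sdk/openai-compatible branches would then call this one helper in place of the two existing map+filter chains.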
Thinking stays enabled in the request body throughout — only the role of the synthetic MAX_STEPS message changes from assistant to user. The model thinks and writes its wrap-up normally.
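A sketch of the capability check, the llama.cpp probe, and the role routing from the third PR; the Model shape and function names are simplified illustrations, not the final API:

```ts
// Proposed optional capability flag on Model (PR 2); shape simplified here.
type Model = {
  id: string
  prefill?: boolean
}

// Hypothetical probe: llama.cpp's /props endpoint returns the raw chat template,
// so a template that branches on enable_thinking can be detected at runtime.
async function probeThinkingTemplate(baseURL: string): Promise<boolean> {
  const res = await fetch(new URL("/props", baseURL)).catch(() => undefined)
  if (!res?.ok) return false
  const props = (await res.json()) as { chat_template?: string }
  return props.chat_template?.includes("enable_thinking") ?? false
}

// Three-layer precedence: explicit config > auto-inference from the probe > default true.
function canAcceptTrailingAssistant(model: Model, probedThinkingTemplate?: boolean): boolean {
  if (model.prefill !== undefined) return model.prefill
  if (probedThinkingTemplate) return false
  return true
}

// In session/prompt.ts the MAX_STEPS text stays identical; only the role is routed.
async function maxStepsRole(model: Model, baseURL: string): Promise<"assistant" | "user"> {
  const probed = await probeThinkingTemplate(baseURL)
  return canAcceptTrailingAssistant(model, probed) ? "assistant" : "user"
}
```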