
Trailing-assistant 400 on llama.cpp/vLLM with thinking-on templates (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2) #27920

@feanor5555

Description


Symptom

Local OpenAI-compatible servers running thinking-on-by-default chat templates (llama.cpp --reasoning on, vLLM with reasoning, TGI with thinking, mistral.rs, etc.) reject any opencode request whose last message is role:"assistant", with:

HTTP 400 {"error":{"message":"Assistant response prefill is incompatible with enable_thinking."}}

opencode emits a trailing-assistant message in two situations, both of which trip this error:

  1. Empty trailing assistant: message-v2.toModelMessagesEffect sometimes builds an assistant UIMessage whose only parts are [step-start, reasoning("")]. convertToModelMessages collapses that to content:"", which is sent as a trailing assistant turn (a sketch of the resulting request body follows this list).
  2. Non-empty trailing assistant: session/prompt.ts deliberately injects a MAX_STEPS wrap-up instruction as role:"assistant" (response continuation / prefill).
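For case 1, the request body that reaches the server looks roughly like the following (an illustrative sketch, not a capture; model name and user text are made up):

    {
      "model": "qwen3-local",
      "messages": [
        { "role": "user", "content": "..." },
        { "role": "assistant", "content": "" }
      ]
    }

Because the chat template branches on enable_thinking, it refuses to treat that trailing assistant turn as a prefill and the server returns the 400 above before generating anything.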

Reproduction

  1. Run a llama-server with a thinking template, e.g.:
    llama-server --model Qwen3.5-9B-...gguf --reasoning on --jinja --port 8080
    
  2. Point opencode at it via an @ai-sdk/openai-compatible provider in opencode.json (an example config follows this list).
  3. Run any agent with a steps limit small enough to trigger MAX_STEPS, or any flow that emits an empty-reasoning assistant turn.
  4. Observe HTTP 400s in the llama-server log.
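A minimal opencode.json for step 2 might look like this; it assumes the current custom-provider shape (npm package + baseURL + model map), and key names may differ between opencode versions:

    {
      "$schema": "https://opencode.ai/config.json",
      "provider": {
        "llama-local": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "llama.cpp (local)",
          "options": { "baseURL": "http://localhost:8080/v1" },
          "models": {
            "qwen3-local": { "name": "Qwen3 (thinking)" }
          }
        }
      }
    }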

Affected model families

Every 2025-2026 open-weight thinking family using enable_thinking-branching templates:

  • Qwen3 hybrid (all sizes), Qwen3-Thinking-2507, Qwen3-VL, Qwen3.5, Qwen3.6, QwQ-32B
  • DeepSeek-R1, R1-0528, V4 (when thinking on)
  • GLM-4.6, GLM-4.7 thinking
  • Kimi-K2-Thinking
  • MiniMax-M2

Not affected: Qwen2.5, Qwen3-Coder, Qwen3-Instruct-2507, all Anthropic/OpenAI/Google/Bedrock-Anthropic models (these either don't use enable_thinking branching or accept prefill natively).

Upstream / cross-framework references

Why fix it in opencode

A llama.cpp-side fix is unlikely soon and would only cover llama.cpp. opencode is the boundary where the per-provider request shape is decided, and where capability data already lives. Fixing it here also covers vLLM/TGI/mistral.rs, which have analogous behaviour but no shared upstream change to wait for.

Proposed fix (three PRs)

  1. Empty-trailing case: extend the existing transform.ts empty-content filter (currently anthropic + bedrock only) to @ai-sdk/openai-compatible, refactoring the two near-identical map+filter chains into one helper.
  2. Model.prefill capability: add an optional prefill boolean to Model and to the user-facing config schema. No consumer wiring yet.
  3. Consumer + runtime probe: ProviderTransform.canAcceptTrailingAssistant(model) with three-layer precedence (explicit config / auto-inference / default true). The MAX_STEPS path in session/prompt.ts routes between role:assistant and role:user based on it. A runtime probe of <baseURL>/props (llama.cpp) detects enable_thinking-branching templates automatically, so no user config is needed for the common case. A sketch of this logic follows the list.
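A sketch of PR 3, assuming hypothetical shapes for Model and the probe helper. Only canAcceptTrailingAssistant, the /props probe, and the MAX_STEPS role routing come from this proposal; field names and the chat_template check are illustrative:

    // Hypothetical Model shape carrying the prefill capability from PR 2.
    interface Model {
      providerID: string
      modelID: string
      prefill?: boolean // explicit override from config; undefined = unknown
    }

    // Three-layer precedence: explicit config, then auto-inference
    // (the /props probe below), then default true.
    export async function canAcceptTrailingAssistant(model: Model, baseURL?: string): Promise<boolean> {
      if (model.prefill !== undefined) return model.prefill // 1. explicit
      if (baseURL) {
        const probed = await probeLlamaCppProps(baseURL) // 2. auto-inference
        if (probed !== undefined) return probed
      }
      return true // 3. default: assume prefill works
    }

    // Ask llama.cpp's /props endpoint for the loaded chat template and look for
    // an enable_thinking branch. Illustrative: real responses carry more fields,
    // and non-llama.cpp servers simply fall through to the default.
    async function probeLlamaCppProps(baseURL: string): Promise<boolean | undefined> {
      try {
        const res = await fetch(new URL("/props", baseURL)) // absolute path drops the /v1 suffix
        if (!res.ok) return undefined
        const props = (await res.json()) as { chat_template?: string }
        if (!props.chat_template) return undefined
        return !props.chat_template.includes("enable_thinking")
      } catch {
        return undefined
      }
    }

    // In session/prompt.ts: the synthetic MAX_STEPS wrap-up keeps its text,
    // only its role changes when trailing-assistant prefill is unsupported.
    function maxStepsMessage(text: string, prefillOK: boolean) {
      return { role: prefillOK ? ("assistant" as const) : ("user" as const), content: text }
    }

The default-true layer keeps behaviour unchanged for providers that accept prefill natively; only providers explicitly configured or probed as prefill-incompatible get the role swap.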

Thinking stays enabled in the request body throughout — only the role of the synthetic MAX_STEPS message changes from assistant to user. The model thinks and writes its wrap-up normally.
