Problem
When the model emits DSML tool calls inside a thinking block (before </think> closes), the raw
<|DSML|tool_calls> XML leaks into reasoning_content/thinking in the client session. The structured toolCall is extracted correctly, but the raw XML also appears as thinking content, causing duplication.
Example from a live session JSON captured in the Pi.dev agent harness:
{
  "type": "message",
  "id": "b2f4e976",
  "parentId": "cc7fdf27",
  "timestamp": "2026-05-16T05:55:54.300Z",
  "message": {
    "role": "assistant",
    "content": [
      {
        "type": "thinking",
        "thinking": "\n\n<|DSML|tool_calls>\n<|DSML|invoke name=\"bash\">\n<|DSML|parameter name=\"command\" string=\"true\">sed -n '7,9p' main.go</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls>",
        "thinkingSignature": "reasoning_content"
      },
      {
        "type": "toolCall",
        "id": "call_b4b3cbd94d4a0b69c20583ce56c032bc",
        "name": "bash",
        "arguments": {
          "command": "sed -n '7,9p' main.go"
        }
      }
    ],
    "api": "openai-completions",
    "provider": "ds4",
    "model": "deepseek-v4-flash",
    "usage": {
      "input": 27,
      "output": 70,
      "cacheRead": 61706,
      "cacheWrite": 27,
      "totalTokens": 61830
    },
    "stopReason": "toolUse",
    "timestamp": 1778910950189,
    "responseId": "chatcmpl-49"
  }
}
The tool call itself is correctly parsed and emitted separately as a toolCall. The issue is the raw XML also appearing in the thinking field.
Reproduction
Not reliably reproducible with a fixed prompt. Occurs during normal Pi.dev agent sessions (thinking → tool calls → tool results → regen loop). More common with longer conversations and tool-heavy workflows (e.g., file editing, shell commands).
Impact
- Client session stores raw DSML XML as thinking content (duplication)
- Context is reprocessed from the start in the next turn due to a KV cache miss. The server log shows:
  ds4-server: live kv cache miss live=61831 prompt=61927 common=61830
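To make the three counts in that log line easier to read, here is a minimal sketch of a token-prefix comparison. It assumes (this is not confirmed from the ds4-server source) that `common` is the number of leading tokens shared by the cached sequence and the new prompt, and that the cache is only reusable as-is when the whole cached sequence is a prefix of the prompt. The function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from ds4-server: count the leading tokens
 * shared by the cached ("live") sequence and the incoming prompt. */
static size_t common_prefix_len(const int *live, size_t n_live,
                                const int *prompt, size_t n_prompt)
{
    size_t n = n_live < n_prompt ? n_live : n_prompt;
    size_t i = 0;
    while (i < n && live[i] == prompt[i])
        i++;
    return i;
}

/* Assumed reuse rule: the cache can be extended in place only when every
 * cached token matches the start of the new prompt. */
static int live_cache_is_prefix(const int *live, size_t n_live,
                                const int *prompt, size_t n_prompt)
{
    return n_live <= n_prompt &&
           common_prefix_len(live, n_live, prompt, n_prompt) == n_live;
}
```

Under this reading, the logged values (live=61831, common=61830) would mean the cached sequence and the re-sent prompt diverge at the last cached token, which is consistent with the client sending back a history that differs from what the server generated.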
Environment
- Pi.dev harness
- ds4-server, Metal backend
- DeepSeek-V4-Flash Q2 imatrix
- OpenAI chat completions streaming
All interaction is through Pi.dev's agent loop (thinking → tool calls → tool results → regen). The issue was observed during normal agent sessions where the model decides to call a tool while still inside a thinking block.
Suspected cause (unconfirmed)
Two hypotheses, not yet ruled out:
- Server streaming path: The THINKING streaming state in ds4_server.c emits reasoning_content deltas without checking for DSML tool markers inside the thinking block. The parse_generated_message function correctly strips the XML for the final response, but the SSE streaming path may leak partial/full tags before the mode switches to TOOL.
- Model quality: The model may be emitting DSML tool calls before closing the thinking block, which could be a model behavior or 2-bit quantization issue rather than a server bug. I have seen thinking tags appear randomly in responses in simple chat sessions (not agent).
Potential fix
The candidate patch adds DSML tool-start detection to the THINKING streaming state for all three APIs (OpenAI, Responses, Anthropic). When a <|DSML|tool_calls> marker is found before </think>, reasoning is truncated at the tool boundary and the stream switches to TOOL mode so the raw XML is never emitted as reasoning content.
- Search for tool markers from buffer start (not just from current emit position) to catch partially emitted tags
- Retroactive detection when emit_pos has already advanced past the start of the marker (hold-back race condition)
- Increase hold-back from 11 bytes to 22 bytes for tool-enabled requests only. Each | in the marker is 3 bytes in UTF-8 (consistent with the fullwidth bar U+FF5C rather than the 1-byte ASCII pipe), making <|DSML|tool_calls> 22 bytes total.
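The three points above can be sketched as a small hold-back buffer. This is an illustration under stated assumptions, not the code in ds4_server.c: the struct, the function name, the fixed 4096-byte buffer, and the spelling of the marker with U+FF5C bars (bytes EF BD 9C) are all hypothetical choices made to match the 22-byte count.

```c
#include <stddef.h>
#include <string.h>

/* Assumed byte form of the opening marker: each bar is U+FF5C (EF BD 9C).
 * The literal is split so the hex escapes do not swallow following letters. */
#define DSML_TOOL_START "<\xEF\xBD\x9C" "DSML" "\xEF\xBD\x9C" "tool_calls>"
#define HOLD_BACK (sizeof(DSML_TOOL_START) - 1)
_Static_assert(HOLD_BACK == 22, "1 + 3 + 4 + 3 + 10 + 1 bytes");

enum stream_mode { MODE_THINKING, MODE_TOOL };

struct think_stream {
    char buf[4096];        /* all bytes seen in the THINKING state */
    size_t len;
    size_t emit_pos;       /* bytes already sent as reasoning_content */
    enum stream_mode mode;
};

/* Append a chunk and return how many bytes at *out may now be emitted as
 * reasoning. Returns 0 once in TOOL mode or when the sketch buffer is full. */
static size_t think_stream_feed(struct think_stream *s,
                                const char *chunk, size_t n,
                                const char **out)
{
    *out = s->buf + s->emit_pos;
    if (s->mode != MODE_THINKING || n >= sizeof(s->buf) - s->len)
        return 0;
    memcpy(s->buf + s->len, chunk, n);
    s->len += n;
    s->buf[s->len] = '\0';

    size_t limit;
    /* Search from the buffer start, not from emit_pos. */
    const char *hit = strstr(s->buf, DSML_TOOL_START);
    if (hit) {
        size_t at = (size_t)(hit - s->buf);
        s->mode = MODE_TOOL;
        /* Retroactive case: if the marker start was already emitted,
         * nothing more is sent; otherwise truncate at the tool boundary. */
        limit = at > s->emit_pos ? at : s->emit_pos;
    } else {
        /* Keep the last HOLD_BACK bytes unsent in case a marker is forming. */
        limit = s->len > HOLD_BACK ? s->len - HOLD_BACK : 0;
        /* Back off so a multi-byte UTF-8 character is never split. */
        while (limit > s->emit_pos &&
               ((unsigned char)s->buf[limit] & 0xC0) == 0x80)
            limit--;
        if (limit < s->emit_pos)
            limit = s->emit_pos;
    }
    size_t ready = limit - s->emit_pos;
    s->emit_pos = limit;
    return ready;
}
```

With the hold-back equal to the marker length, this sketch never emits any part of the marker, so the retroactive branch only matters if the hold-back is smaller (for example, on requests without tools). What happens to the bytes after the marker once the mode is TOOL is left to the tool-call parser and is not shown here.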
I'm continuing to test with Pi.dev agent sessions to confirm it resolves the issue end-to-end before submitting this as a PR.