
DSML tool_calls XML leaking into reasoning_content/thinking in streaming responses #167

@wanderingmeow


Problem

When the model emits DSML tool calls inside a thinking block (before </think> closes), the raw <|DSML|tool_calls> XML leaks into reasoning_content/thinking in the client session. The structured toolCall is extracted correctly, but the raw XML also appears as thinking content, so the same call is stored twice.

Example from a live session JSON captured by the Pi.dev agent harness:

{
    "type": "message",
    "id": "b2f4e976",
    "parentId": "cc7fdf27",
    "timestamp": "2026-05-16T05:55:54.300Z",
    "message":
    {
        "role": "assistant",
        "content": [
        {
            "type": "thinking",
            "thinking": "\n\n<|DSML|tool_calls>\n<|DSML|invoke name=\"bash\">\n<|DSML|parameter name=\"command\" string=\"true\">sed -n '7,9p' main.go</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls>",
            "thinkingSignature": "reasoning_content"
        },
        {
            "type": "toolCall",
            "id": "call_b4b3cbd94d4a0b69c20583ce56c032bc",
            "name": "bash",
            "arguments":
            {
                "command": "sed -n '7,9p' main.go"
            }
        }],
        "api": "openai-completions",
        "provider": "ds4",
        "model": "deepseek-v4-flash",
        "usage":
        {
            "input": 27,
            "output": 70,
            "cacheRead": 61706,
            "cacheWrite": 27,
            "totalTokens": 61830
        },
        "stopReason": "toolUse",
        "timestamp": 1778910950189,
        "responseId": "chatcmpl-49"
    }
}

The tool call itself is correctly parsed and emitted separately as a toolCall. The issue is the raw XML also appearing in the thinking field.

Reproduction

Not reliably reproducible with a fixed prompt. Occurs during normal Pi.dev agent sessions (thinking → tool calls → tool results → regen loop). More common with longer conversations and tool-heavy workflows (e.g., file editing, shell commands).

Impact

  • Client session stores raw DSML XML as thinking content (duplication)
  • The next turn reprocesses the context from the start because of a KV cache miss: ds4-server: live kv cache miss live=61831 prompt=61927 common=61830
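
To make the `common` figure in that log line concrete, here is a minimal sketch of a longest-common-prefix check between the tokens held in a live cache and the tokens of the next prompt. The function name and the plain `int` token arrays are assumptions for illustration; this is not taken from ds4-server, whose actual cache-matching logic may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the leading tokens shared by the cached
 * ("live") sequence and the incoming prompt. Only this shared prefix
 * could be reused; everything after it has to be recomputed. */
static size_t common_prefix_len(const int *live, size_t live_n,
                                const int *prompt, size_t prompt_n) {
    assert(live != NULL && prompt != NULL);
    size_t n = live_n < prompt_n ? live_n : prompt_n;
    size_t i = 0;
    while (i < n && live[i] == prompt[i]) {
        i++;
    }
    return i;
}
```

If the client sends back thinking text that differs from what the server generated, the two sequences diverge where the text differs, and the shared prefix ends there.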

Environment

  • Pi.dev harness
  • ds4-server, Metal backend
  • DeepSeek-V4-Flash Q2 imatrix
  • OpenAI chat completions streaming

All interaction is through Pi.dev's agent loop (thinking → tool calls → tool results → regen). The issue was observed during normal agent sessions where the model decides to call a tool while still inside a thinking block.

Suspected cause (unconfirmed)

Two hypotheses, neither yet ruled out:

  1. Server streaming path: The THINKING streaming state in ds4_server.c emits reasoning_content deltas without checking for DSML tool markers inside the thinking block. The parse_generated_message function correctly strips the XML for the final response, but the SSE streaming path may leak partial/full tags before the mode switches to TOOL.

  2. Model quality: The model may be emitting DSML tool calls before closing the thinking block, which could be a model behavior or 2-bit quantization issue rather than a server bug. I have seen thinking tags appear randomly in responses in simple chat sessions (not agent).
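
For contrast with the streaming path, the complete-buffer case is simple: once the whole generation is available, reasoning can be cut at the first tool marker. The sketch below illustrates that idea only. `reasoning_len` is a hypothetical name, and this is not the code of parse_generated_message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of leading bytes of `text` that belong to
 * reasoning, i.e. everything before the first occurrence of `marker`.
 * Returns the full length when the marker is absent. */
static size_t reasoning_len(const char *text, const char *marker) {
    assert(text != NULL && marker != NULL);
    const char *hit = strstr(text, marker);
    return hit ? (size_t)(hit - text) : strlen(text);
}
```

A streaming emitter cannot do this directly, because it sends bytes before it knows whether they are the start of a marker. That gap is what hypothesis 1 is about.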

Potential fix

The candidate patch adds DSML tool-start detection to the THINKING streaming state for all three APIs (OpenAI, Responses, Anthropic). When a <|DSML|tool_calls> marker is found before </think>, reasoning is truncated at the tool boundary and the stream switches to TOOL mode, so the raw XML is never emitted as reasoning content. The patch makes three changes:

  1. Search for tool markers from the start of the buffer (not just from the current emit position) to catch partially emitted tags
  2. Detect the marker retroactively when emit_pos has already advanced past it (hold-back race condition)
  3. Increase the hold-back from 11 bytes to 22 bytes for tool-enabled requests only (the ｜ delimiter in the marker, U+FF5C FULLWIDTH VERTICAL LINE, is 3 bytes in UTF-8, making <｜DSML｜tool_calls> 22 bytes total)
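
The changes above can be sketched as a small hold-back emitter. This is an illustration under stated assumptions, not the code in ds4_server.c: the struct, the function name, and the fixed 4096-byte buffer are hypothetical, and the marker is spelled with fullwidth bars (U+FF5C) because that is the spelling that gives the 22-byte length quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed marker spelling: U+FF5C is EF BD 9C in UTF-8. The literal is
 * split so the hex escapes do not swallow the following letters. */
static const char TOOL_MARKER[] = "<\xEF\xBD\x9C" "DSML\xEF\xBD\x9C" "tool_calls>";
enum { MARKER_LEN = sizeof(TOOL_MARKER) - 1 };  /* 1 + 3 + 4 + 3 + 10 + 1 = 22 */

typedef enum { MODE_THINKING, MODE_TOOL } stream_mode;

typedef struct {
    char buf[4096];    /* reasoning bytes generated so far, NUL-terminated */
    size_t len;
    size_t emit_pos;   /* bytes already sent as reasoning_content */
    stream_mode mode;
} think_stream;

/* Append `chunk` and return how many bytes, starting at the previous
 * emit_pos, are now safe to send as reasoning_content. */
static size_t think_feed(think_stream *s, const char *chunk) {
    assert(s != NULL && chunk != NULL);
    if (s->mode != MODE_THINKING) {
        return 0;  /* tool bytes are handled elsewhere */
    }
    size_t n = strlen(chunk);
    assert(s->len + n < sizeof s->buf);
    memcpy(s->buf + s->len, chunk, n);
    s->len += n;
    s->buf[s->len] = '\0';

    size_t limit;
    /* Search from the buffer start, not from emit_pos. */
    const char *hit = strstr(s->buf, TOOL_MARKER);
    if (hit != NULL) {
        limit = (size_t)(hit - s->buf);  /* truncate at the tool boundary */
        s->mode = MODE_TOOL;
    } else {
        /* Hold back the last MARKER_LEN bytes in case they begin a marker. */
        limit = s->len > MARKER_LEN ? s->len - MARKER_LEN : 0;
    }
    if (limit <= s->emit_pos) {
        return 0;
    }
    size_t out = limit - s->emit_pos;
    s->emit_pos = limit;
    return out;
}
```

With a zero-initialised `think_stream`, feeding a chunk that ends in the first half of the marker releases only the bytes outside the hold-back window. Feeding the rest of the marker then releases the remaining reasoning bytes and switches to MODE_TOOL without emitting any marker bytes. The sketch has two known gaps: a hold-back boundary can fall inside a multi-byte UTF-8 character, and bytes that were already sent cannot be retracted.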

I'm continuing to test with Pi.dev agent sessions to confirm it resolves the issue end-to-end before submitting this as a PR.
