DSML tool_calls XML leaking into reasoning_content/thinking in streaming responses #167

## Problem

When the model emits DSML tool calls inside a thinking block (before `</think>` closes), the raw `<｜DSML｜tool_calls>` XML leaks into `reasoning_content`/`thinking` in the client session. The structured `toolCall` is extracted correctly, but the raw XML also appears as thinking content, so the same call is recorded twice.

Example entry from a live session JSON recorded by the Pi.dev agent harness:

<details>

```json
{
    "type": "message",
    "id": "b2f4e976",
    "parentId": "cc7fdf27",
    "timestamp": "2026-05-16T05:55:54.300Z",
    "message":
    {
        "role": "assistant",
        "content": [
        {
            "type": "thinking",
            "thinking": "\n\n<｜DSML｜tool_calls>\n<｜DSML｜invoke name=\"bash\">\n<｜DSML｜parameter name=\"command\" string=\"true\">sed -n '7,9p' main.go</｜DSML｜parameter>\n</｜DSML｜invoke>\n</｜DSML｜tool_calls>",
            "thinkingSignature": "reasoning_content"
        },
        {
            "type": "toolCall",
            "id": "call_b4b3cbd94d4a0b69c20583ce56c032bc",
            "name": "bash",
            "arguments":
            {
                "command": "sed -n '7,9p' main.go"
            }
        }],
        "api": "openai-completions",
        "provider": "ds4",
        "model": "deepseek-v4-flash",
        "usage":
        {
            "input": 27,
            "output": 70,
            "cacheRead": 61706,
            "cacheWrite": 27,
            "totalTokens": 61830
        },
        "stopReason": "toolUse",
        "timestamp": 1778910950189,
        "responseId": "chatcmpl-49"
    }
}
```

</details>

The tool call itself is correctly parsed and emitted separately as a `toolCall`. The issue is that the raw XML also appears in the `thinking` field.
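To find out how often this happens in a stored session, a small scanner along these lines may help. It is only a sketch: it assumes the session is stored with one JSON entry per line, shaped like the excerpt above, and the function names `leaked_thinking_blocks` and `scan_session` are made up for illustration.

```python
import json

# Marker copied from the session excerpt; the bar is U+FF5C (fullwidth vertical line).
DSML_TOOL_START = "<｜DSML｜tool_calls>"


def leaked_thinking_blocks(entry: dict) -> list[int]:
    """Return indexes of thinking blocks whose text contains the DSML tool marker."""
    content = entry.get("message", {}).get("content", [])
    return [
        i
        for i, block in enumerate(content)
        if block.get("type") == "thinking"
        and DSML_TOOL_START in block.get("thinking", "")
    ]


def scan_session(lines):
    """Yield (entry id, block indexes) for each affected entry.

    Assumes one JSON object per line; blank lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        hits = leaked_thinking_blocks(entry)
        if hits:
            yield entry.get("id"), hits
```

Run against the entry shown above, `leaked_thinking_blocks` would flag the first content block (index 0), since its `thinking` text contains the marker.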

## Reproduction

Not reliably reproducible with a fixed prompt. It occurs during normal Pi.dev agent sessions (thinking → tool calls → tool results → regen loop) and is more common in longer conversations and tool-heavy workflows (e.g., file editing, shell commands).

## Impact

- Client session stores raw DSML XML as thinking content (duplication)
- Context is reprocessed from the start in the next turn due to a KV cache miss: `ds4-server: live kv cache miss live=61831 prompt=61927 common=61830`
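One way to read that log line is sketched below. The field meanings are inferred from their names and have not been checked against ds4-server: `live` is taken to be the number of cached tokens, `prompt` the number of tokens in the new request, and `common` the length of their shared prefix. The reuse rule in `can_reuse_cache` is likewise an assumption, not the server's actual policy.

```python
def common_prefix_len(live: list[int], prompt: list[int]) -> int:
    """Count leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(live, prompt):
        if a != b:
            break
        n += 1
    return n


def can_reuse_cache(live: list[int], prompt: list[int]) -> bool:
    """Assumed rule: the cache is reusable only if every cached token
    is a prefix of the new prompt."""
    return common_prefix_len(live, prompt) == len(live)


# Toy example: the cached sequence differs from the new prompt in its last token.
live = [10, 11, 12, 13]
prompt = [10, 11, 12, 99, 100]
print(common_prefix_len(live, prompt), can_reuse_cache(live, prompt))
```

Under that reading, `live=61831` with `common=61830` would mean the cached sequence and the new prompt diverge at the final cached token.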

## Environment

- [Pi.dev](https://github.com/earendil-works/pi-coding-agent) harness
- ds4-server, Metal backend
- DeepSeek-V4-Flash Q2 imatrix
- OpenAI chat completions streaming

All interaction is through Pi.dev's agent loop (thinking → tool calls → tool results → regen). The issue was observed during normal agent sessions where the model decides to call a tool while still inside a thinking block.

## Suspected cause (unconfirmed)

Two hypotheses, neither yet ruled out:

1. Server streaming path: The THINKING streaming state in `ds4_server.c` emits `reasoning_content` deltas without checking for DSML tool markers inside the thinking block. The `parse_generated_message` function correctly strips the XML for the final response, but the SSE streaming path may leak partial or full tags before the mode switches to TOOL.

2. Model quality: The model may be emitting DSML tool calls before closing the thinking block, which could be a model behavior or a 2-bit quantization issue rather than a server bug. I have also seen thinking tags appear randomly in responses in simple chat sessions (outside the agent harness).

## Potential fix

Add DSML tool-start detection to the THINKING streaming state for all three APIs (OpenAI, Responses, Anthropic). When a `<｜DSML｜tool_calls>` marker is found before `</think>`, reasoning is truncated at the tool boundary and the stream switches to TOOL mode, so the raw XML is never emitted as reasoning content. The change has three parts:

1. Search for tool markers from the start of the buffer (not just from the current emit position) to catch partially emitted tags
2. Detect the marker retroactively when `emit_pos` has already advanced past it (hold-back race condition)
3. Increase the hold-back from 11 bytes to 22 bytes for tool-enabled requests only (each `｜` character is 3 bytes in UTF-8, making `<｜DSML｜tool_calls>` 22 bytes total)
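The idea can be sketched in Python as follows. This is an illustration of the approach, not the code in `ds4_server.c`: the class `ThinkingFilter`, its fields, and the mode names are hypothetical. It covers items 1 and 3. With the hold-back set to the full marker length, the marker cannot start before `emit_pos` in this sketch, so the retroactive case from item 2 does not arise here.

```python
TOOL_START = "<｜DSML｜tool_calls>".encode("utf-8")
THINK_END = b"</think>"
HOLD_BACK = len(TOOL_START)  # 22 bytes: each "｜" (U+FF5C) is 3 bytes in UTF-8


class ThinkingFilter:
    """Hypothetical sketch: emit reasoning bytes, never raw DSML tool XML."""

    def __init__(self):
        self.buf = bytearray()   # bytes received while in THINKING
        self.emit_pos = 0        # how much of buf was already emitted
        self.mode = "THINKING"   # THINKING -> TOOL or CONTENT
        self.tail = bytearray()  # bytes handed on to the next state's parser

    def feed(self, chunk: bytes) -> bytes:
        """Return the bytes that are safe to send as reasoning content."""
        if self.mode != "THINKING":
            self.tail += chunk
            return b""
        self.buf += chunk
        # Search from the start of the buffer, not from emit_pos.
        tool = self.buf.find(TOOL_START)
        end = self.buf.find(THINK_END)
        cuts = [p for p in (tool, end) if p != -1]
        if cuts:
            cut = min(cuts)
            self.mode = "TOOL" if cut == tool else "CONTENT"
            out = bytes(self.buf[self.emit_pos:cut])  # empty if cut < emit_pos
            self.tail += self.buf[cut:]
            del self.buf[cut:]
            self.emit_pos = len(self.buf)
            return out
        # No marker yet: keep the last HOLD_BACK bytes in case one is arriving.
        safe_end = max(self.emit_pos, len(self.buf) - HOLD_BACK)
        # Do not cut in the middle of a multi-byte UTF-8 character.
        while safe_end > self.emit_pos and (self.buf[safe_end] & 0xC0) == 0x80:
            safe_end -= 1
        out = bytes(self.buf[self.emit_pos:safe_end])
        self.emit_pos = safe_end
        return out

    def flush(self) -> bytes:
        """Emit held-back bytes when the stream ends while still in THINKING."""
        out = bytes(self.buf[self.emit_pos:])
        self.emit_pos = len(self.buf)
        return out
```

Feeding the stream in small chunks, so that the marker is split across several calls, should still produce only the text before the marker as reasoning output, with the marker and everything after it collected in `tail` for the tool-call parser.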

I'm continuing to test with Pi.dev agent sessions to confirm it resolves the issue end-to-end before submitting this as a PR.
