refactor: centralize CoT parsing in backend for streaming mode #16394
base: master
Conversation
common/arg.cpp (outdated diff):
"- deepseek: puts thoughts in `message.reasoning_content`\n"
"- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
"(default: deepseek)",
could you explain why this is also changed?
There is no functional difference between auto and deepseek. Both map to the same code path: reasoning is extracted into message.reasoning_content without reinserting tags, since reasoning_in_content stays false. Only the legacy (and none) mode behaves differently. I'll remove this change since it has no effect.
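In sketch form (illustrative stand-ins, not llama.cpp's actual types; only the option names and the reasoning_in_content flag come from this thread):

```cpp
// Illustrative mapping of reasoning formats to parser behavior.
enum class reasoning_format { NONE, AUTO, DEEPSEEK, DEEPSEEK_LEGACY };

struct reasoning_behavior {
    bool extract_reasoning;    // move <think> blocks into message.reasoning_content
    bool reasoning_in_content; // also keep the raw tags in message.content
};

static reasoning_behavior behavior_for(reasoning_format f) {
    switch (f) {
        case reasoning_format::NONE:            return { false, false };
        case reasoning_format::AUTO:            // AUTO and DEEPSEEK share a path,
        case reasoning_format::DEEPSEEK:        return { true,  false }; // so swapping the default is a no-op
        case reasoning_format::DEEPSEEK_LEGACY: return { true,  true  };
    }
    return { false, false };
}
```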
Force-pushed from 84c5532 to 928af2e
I have tested all the reasoning (and non-reasoning) models in my collection multiple times, including some tool call testing, without encountering any parsing bugs. I’d like to test more models. :)
unsloth/OLMo-2-0325-32B-Instruct-GGUF/OLMo-2-0325-32B-Instruct-Q6_K.gguf
…p frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
Remove the streaming mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of stopping, enabling proper separation of message.reasoning_content and message.content in both streaming and non-streaming modes. Fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty.
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
…ll tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
…atMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Force-pushed from 32212ff to 1e6beb6
…ormat toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
@ServeurpersoCom unfortunately applying GH code suggestions bypasses the pre-commit hooks that format the code, so please run …
I did some testing and haven't spotted any issues. Let's wait for @ngxson approval and merge.
Looks good overall, just some small comments.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Refactored try_parse_reasoning() to handle incremental parsing during streaming:
Parser improvements (a minimal sketch follows this list):
Tracks partial tag detection (e.g., when stream cuts mid-tag like <thi...)
Handles multiple consecutive reasoning segments within a single response
Preserves leading whitespace while detecting reasoning blocks
Continues processing content after </think> closes instead of stopping early
Works correctly with thinking_forced_open flag for grammar-constrained generation
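A minimal sketch of that incremental state machine, assuming a custom StreamingReasoningParser type (this is not the actual common_chat_msg_parser implementation, just an illustration of the behaviors listed above):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

struct StreamingReasoningParser {
    std::string content;            // text outside <think>...</think>
    std::string reasoning_content;  // text inside <think>...</think>
    bool        in_reasoning = false;
    std::string pending;            // held-back bytes that may be a partial tag

    // Length of the longest proper prefix of `tag` that `s` ends with, or 0.
    static size_t partial_tag_suffix(const std::string & s, const std::string & tag) {
        size_t max = std::min(s.size(), tag.size() - 1);
        for (size_t n = max; n > 0; n--) {
            if (s.compare(s.size() - n, n, tag, 0, n) == 0) {
                return n;
            }
        }
        return 0;
    }

    void feed(const std::string & chunk) {
        std::string buf = pending + chunk;
        pending.clear();
        size_t pos = 0;
        while (pos < buf.size()) {
            const std::string tag  = in_reasoning ? "</think>" : "<think>";
            std::string &     sink = in_reasoning ? reasoning_content : content;
            size_t hit = buf.find(tag, pos);
            if (hit != std::string::npos) {
                sink.append(buf, pos, hit - pos);
                pos = hit + tag.size();
                in_reasoning = !in_reasoning;  // toggling supports multiple segments
            } else {
                // Hold back anything that could be the start of the next tag
                // (e.g. the stream cut mid-tag like "<thi").
                std::string tail = buf.substr(pos);
                size_t keep = partial_tag_suffix(tail, tag);
                sink.append(tail, 0, tail.size() - keep);
                pending = tail.substr(tail.size() - keep);
                pos = buf.size();
            }
        }
    }
};

int main() {
    StreamingReasoningParser p;
    // Tag split across chunks, plus content continuing after </think>.
    for (const char * chunk : {"  <thi", "nk>plan step", "s</think>final ans", "wer"}) {
        p.feed(chunk);
    }
    assert(p.reasoning_content == "plan steps");
    assert(p.content == "  final answer");  // leading whitespace preserved
    return 0;
}
```

A real implementation also has to flush `pending` at end of stream and honor thinking_forced_open; both are elided here.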
Integration changes (see the sketch after this list):
Modified common_chat_parse_content_only() and common_chat_parse_llama_3_1() to invoke reasoning parsing before tool call handling
Changed default reasoning_format from auto to deepseek for consistent behavior
Added deepseek-legacy option for backwards compatibility (inlines tags in content)
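Roughly, the reordering in those two format handlers looks like this (the function names appear in this PR; the builder type and method signatures are stand-in assumptions):

```cpp
// Stand-in stub; only the call order is the point of this sketch.
struct chat_builder {
    void try_parse_reasoning(const char *, const char *) { /* extract <think> blocks */ }
    void try_parse_tool_calls()                          { /* tool-call syntax second */ }
    void add_remaining_content()                         { /* leftovers become content */ }
};

static void common_chat_parse_content_only(chat_builder & builder) {
    builder.try_parse_reasoning("<think>", "</think>"); // new: runs first
    builder.add_remaining_content();
}

static void common_chat_parse_llama_3_1(chat_builder & builder) {
    builder.try_parse_reasoning("<think>", "</think>"); // new: before tool calls
    builder.try_parse_tool_calls();
    builder.add_remaining_content();
}
```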
Benefits
Clients no longer need custom CoT parsing logic for streaming mode
Consistent API behavior: reasoning_content and content properly separated in both streaming and non-streaming modes
Simplifies webui and SDK implementations
Universal: works across all reasoning formats, not just DeepSeek
When generation is launched from a template that ends the system prompt with `<think>` and thinking is enabled, the template sets thinking_forced_open = true; the parser then consumes all text preceding `</think>` as reasoning, even if the opening tag is never emitted by the model, as validated by the REASONINGCONTENT test.
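Concretely, seeding the StreamingReasoningParser sketch from above as already-open (a stand-in for the real thinking_forced_open flag):

```cpp
int main() {
    StreamingReasoningParser p;
    p.in_reasoning = true;  // prompt already ended with "<think>"
    p.feed("the model reasons here</think>answer");
    assert(p.reasoning_content == "the model reasons here"); // no opening tag ever seen
    assert(p.content == "answer");
    return 0;
}
```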
Without --jinja, the server prohibits the use of tools/tool_choice, sets inputs.use_jinja = false, and computes enable_thinking = false. Templates therefore immediately close the `<think>` tag and no longer force the opening, which disables the reasoning/content separation required by our new scenarios (the parser is still called on the stream but will only see regular content unless the model itself emits complete tags).
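A rough sketch of that gating, under the assumption that fields are named as in the paragraph above (the real server structs differ):

```cpp
#include <cstdio>

struct cli_params  { bool use_jinja = false; };  // set by --jinja
struct chat_inputs { bool use_jinja = false; bool enable_thinking = false; };

static chat_inputs prepare_inputs(const cli_params & params, bool request_has_tools) {
    chat_inputs inputs;
    inputs.use_jinja = params.use_jinja;
    if (!params.use_jinja && request_has_tools) {
        std::fprintf(stderr, "error: tools require --jinja\n"); // request rejected
    }
    // Thinking is only enabled on the Jinja path; otherwise the template
    // closes the <think> tag immediately and separation never triggers.
    inputs.enable_thinking = params.use_jinja;
    return inputs;
}

int main() {
    chat_inputs in = prepare_inputs(cli_params{}, false);
    std::printf("enable_thinking=%d\n", in.enable_thinking); // prints 0 without --jinja
    return 0;
}
```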
Testing
Added parser tests covering:
Inline <think>...</think> segments in CONTENT_ONLY format
Inline reasoning in LLAMA_3_X format
Multiple reasoning blocks in a single response (illustrated in the sketch after this list)
Partial tag detection during streaming
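For instance, the multiple-blocks case, expressed against the StreamingReasoningParser sketch above:

```cpp
int main() {
    StreamingReasoningParser p;
    p.feed("<think>a</think>one<think>b</think>two");
    assert(p.reasoning_content == "ab");  // both blocks accumulated in order
    assert(p.content == "onetwo");        // interleaved content preserved
    return 0;
}
```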