Conversation

@ServeurpersoCom ServeurpersoCom commented Oct 2, 2025

Refactored try_parse_reasoning() to handle incremental parsing during streaming.

Parser improvements:

Tracks partial tag detection (e.g., when the stream cuts mid-tag, like <thi...)
Handles multiple consecutive reasoning segments within a single response
Preserves leading whitespace while detecting reasoning blocks
Continues processing content after </think> closes instead of stopping early
Works correctly with the thinking_forced_open flag for grammar-constrained generation (see the sketch after this list)
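
To make the streaming behavior concrete, here is a minimal, self-contained sketch of the idea. The reasoning_splitter type and its fields are invented for illustration; this is not the actual try_parse_reasoning() implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <string>

struct reasoning_splitter {
    std::string reasoning, content;
    bool in_reasoning = false; // flip to true to emulate thinking_forced_open
    std::string held;          // possible partial tag held between chunks

    void feed(const std::string & chunk) {
        std::string buf = held + chunk;
        held.clear();
        size_t pos = 0;
        while (pos < buf.size()) {
            const std::string tag = in_reasoning ? "</think>" : "<think>";
            const size_t hit = buf.find(tag, pos);
            if (hit != std::string::npos) {
                (in_reasoning ? reasoning : content) += buf.substr(pos, hit - pos);
                in_reasoning = !in_reasoning; // keep processing after the tag
                pos = hit + tag.size();
                continue;
            }
            // no full tag: hold back a suffix that could be the start of one
            size_t keep = 0;
            for (size_t n = std::min(tag.size() - 1, buf.size() - pos); n > 0; --n) {
                if (buf.compare(buf.size() - n, n, tag, 0, n) == 0) { keep = n; break; }
            }
            (in_reasoning ? reasoning : content) += buf.substr(pos, buf.size() - pos - keep);
            held = buf.substr(buf.size() - keep);
            pos = buf.size();
        }
    }
};

int main() {
    reasoning_splitter p;
    // stream cut mid-tag ("<thi"), two segments, leading whitespace kept
    for (const char * c : { "  <thi", "nk>first</think>between<think>seco", "nd</think>after" }) {
        p.feed(c);
    }
    assert(p.reasoning == "firstsecond");
    assert(p.content == "  betweenafter");
}
```

The held buffer is the crux: a chunk ending in a tag prefix is withheld from both outputs until the next chunk confirms or refutes the tag.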

Integration changes:

Modified common_chat_parse_content_only() and common_chat_parse_llama_3_1() to invoke reasoning parsing before tool call handling
Changed default reasoning_format from auto to deepseek for consistent behavior
Added deepseek-legacy option for backwards compatibility (inlines <think> tags in content; see the example below)
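
As a rough illustration of the difference (the field layout follows the OpenAI-compatible response; the text itself is made up), the same completion comes back as:

```jsonc
// --reasoning-format deepseek (new default): tags stripped, fields separated
{"message": {"reasoning_content": "Let me check.", "content": "The answer is 4."}}

// --reasoning-format deepseek-legacy: <think> block additionally kept inline
{"message": {"reasoning_content": "Let me check.", "content": "<think>Let me check.</think>The answer is 4."}}
```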

Benefits

Clients no longer need custom CoT parsing logic for streaming mode
Consistent API behavior: reasoning_content and content properly separated in both streaming and non-streaming modes
Simplifies webui and SDK implementations
Universal: works across all reasoning formats, not just DeepSeek

When generation is launched from a template that ends the system prompt with <think> and thinking is enabled, the template sets thinking_forced_open = true; the parser then consumes all text preceding </think> as reasoning, even if the opening tag is never emitted by the model, as validated by the REASONING</think>CONTENT test.
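
In terms of the sketch above, thinking_forced_open simply means starting inside a reasoning block; again an illustration, not the real parser:

```cpp
// thinking_forced_open: the template already opened <think>, so the model's
// first tokens are reasoning even though no opening tag is ever streamed.
reasoning_splitter p;
p.in_reasoning = true;                 // forced open by the template
p.feed("REASONING</think>CONTENT");
// p.reasoning == "REASONING", p.content == "CONTENT"
```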

Without --jinja, the server prohibits the use of tools/tool_choice, sets inputs.use_jinja = false, and computes enable_thinking = false. Templates therefore immediately close the <think> tag and no longer force it open, which disables the reasoning/content separation required by our new scenarios (the parser is still called on the stream, but it will only see regular content unless the model itself emits complete tags).
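
In practice the separation is therefore only active when the template engine is enabled, e.g. with an invocation like this (the model path is a placeholder):

```sh
# --jinja enables chat-template rendering (and with it thinking_forced_open);
# deepseek is the default reasoning format after this PR
llama-server -m model.gguf --jinja --reasoning-format deepseek
```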

Testing
Added parser tests covering:

Inline <think>...</think> segments in CONTENT_ONLY format
Inline reasoning in LLAMA_3_X format
Multiple reasoning blocks in single response
Partial tag detection during streaming

common/arg.cpp (outdated), review comment on lines 3432 to 3434:

    "- deepseek: puts thoughts in `message.reasoning_content`\n"
    "- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
    "(default: deepseek)",

could you explain why this is also changed?

@ServeurpersoCom ServeurpersoCom commented Oct 3, 2025


There is no functional difference between auto and deepseek. Both map to the same code path: reasoning is extracted into message.reasoning_content without reinserting tags, since reasoning_in_content stays false. Only the legacy (and none) modes behave differently. I'll remove this change since it has no effect.

@ServeurpersoCom ServeurpersoCom force-pushed the streaming-aware-cpp-parser branch 2 times, most recently from 84c5532 to 928af2e Compare October 3, 2025 14:31
@ServeurpersoCom ServeurpersoCom commented Oct 4, 2025

Sure, you can squint at curl -N chunk dumps, but this integrated UI turns that pain into a proper workflow: it shows the raw wire stream (no backend parsing, i.e. reasoning_format=none, and no frontend Markdown, just an HTML <pre>) so you can actually inspect behavior across models in real time, all in one click, with a tiny commit.

[Screenshots: "Sans titre" (untitled raw stream view) and "Spying GPT-OSS"]
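
For reference, the manual equivalent looks something like this (port and prompt are assumptions; -N disables curl's output buffering so SSE chunks print as they arrive):

```sh
curl -N http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"stream": true, "messages": [{"role": "user", "content": "Why is the sky blue?"}]}'
```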

@ServeurpersoCom ServeurpersoCom commented Oct 4, 2025

I have tested all the reasoning (and non-reasoning) models in my collection multiple times, including some tool call testing, without encountering any parsing bugs. I’d like to test more models. :)

unsloth/OLMo-2-0325-32B-Instruct-GGUF/OLMo-2-0325-32B-Instruct-Q6_K.gguf
unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/Mistral-Small-3.2-24B-Instruct-2506-Q6_K.gguf
unsloth/Magistral-Small-2509-GGUF/Magistral-Small-2509-Q6_K.gguf
bartowski/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF/cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-Q8_0.gguf
mradermacher/BlackSheep-24B-i1-GGUF/BlackSheep-24B.Q8_0.gguf
mradermacher/XortronCriminalComputingConfig-i1-GGUF/XortronCriminalComputingConfig.Q8_0.gguf
bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF/TheDrummer_Cydonia-24B-v4.1-Q8_0.gguf
unsloth/Devstral-Small-2507-GGUF/Devstral-Small-2507-Q6_K.gguf
mradermacher/Codestral-22B-v0.1-i1-GGUF/Codestral-22B-v0.1.Q8_0.gguf
unsloth/gemma-3-27b-it-GGUF/gemma-3-27b-it-Q6_K.gguf
bartowski/TheDrummer_Big-Tiger-Gemma-27B-v3-GGUF/TheDrummer_Big-Tiger-Gemma-27B-v3-Q6_K.gguf
unsloth/Seed-OSS-36B-Instruct-GGUF/Seed-OSS-36B-Instruct-Q5_K_M.gguf
mradermacher/deepseek-coder-33b-instruct-i1-GGUF/deepseek-coder-33b-instruct.i1-Q6_K.gguf
unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
mradermacher/aya-expanse-32b-i1-GGUF/aya-expanse-32b.i1-Q6_K.gguf
unsloth/GLM-4-32B-0414-GGUF/GLM-4-32B-0414-Q6_K.gguf
unsloth/GLM-Z1-32B-0414-GGUF/GLM-Z1-32B-0414-Q6_K.gguf Need #16426 OK
unsloth/GLM-4.5-Air-GGUF/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf
bartowski/TheDrummer_GLM-Steam-106B-A12B-v1-GGUF/TheDrummer_GLM-Steam-106B-A12B-v1-Q4_K_M-00001-of-00002.gguf
mradermacher/EXAONE-4.0.1-32B-i1-GGUF/EXAONE-4.0.1-32B.i1-Q6_K.gguf
unsloth/QwQ-32B-GGUF/QwQ-32B-Q6_K.gguf
mradermacher/Qwen3-32B-i1-GGUF/Qwen3-32B.i1-Q6_K.gguf
unsloth/Qwen2.5-VL-32B-Instruct-GGUF/Qwen2.5-VL-32B-Instruct-Q5_K_M.gguf
mradermacher/Qwen3-30B-A3B-Instruct-2507-i1-GGUF/Qwen3-30B-A3B-Instruct-2507.i1-Q6_K.gguf
unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf
mradermacher/Qwen3-30B-A3B-Thinking-2507-i1-GGUF/Qwen3-30B-A3B-Thinking-2507.i1-Q6_K.gguf
lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
lmstudio-community/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf
unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00001-of-00002.gguf
unsloth/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF/Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
unsloth/OpenReasoning-Nemotron-32B-GGUF/OpenReasoning-Nemotron-32B-Q6_K.gguf
bartowski/TheDrummer_Valkyrie-49B-v2-GGUF/TheDrummer_Valkyrie-49B-v2-IQ4_NL.gguf
mradermacher/K2-Think-i1-GGUF/K2-Think.i1-Q6_K.gguf

ServeurpersoCom and others added 7 commits October 6, 2025 13:34
…p frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
…ll tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
- add a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
…atMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
@ServeurpersoCom ServeurpersoCom force-pushed the streaming-aware-cpp-parser branch from 32212ff to 1e6beb6 Compare October 6, 2025 11:35
…ormat toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
@allozaur allozaur commented Oct 6, 2025

@ServeurpersoCom unfortunately applying GH code suggestions bypasses the pre-commit hooks that format the code, so please run npm run format and push formatted code. Thank you! 🙏

@ggerganov ggerganov (Member) left a comment:

I did some testing and haven't spotted any issues. Let's wait for @ngxson approval and merge.

@ngxson ngxson (Collaborator) left a comment:

Looks good overall, just some small comments

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Labels: examples, server, testing