refactor: centralize CoT parsing in backend for streaming mode #16394
base: master
Conversation
common/arg.cpp (outdated diff):
"- deepseek: puts thoughts in `message.reasoning_content`\n"
"- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`\n"
"(default: deepseek)",
could you explain why this is also changed?
There is no functional difference between auto and deepseek. Both map to the same code path: reasoning is extracted into message.reasoning_content without reinserting tags, since reasoning_in_content stays false. Only the legacy (and none) mode behaves differently. I'll remove this change since it has no effect.
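In sketch form (illustrative stand-ins, not llama.cpp's actual types; only the option names and the reasoning_in_content flag come from this thread):

```cpp
// Illustrative mapping of reasoning formats to parser behavior.
enum class reasoning_format { NONE, AUTO, DEEPSEEK, DEEPSEEK_LEGACY };

struct reasoning_behavior {
    bool extract_reasoning;    // move <think> blocks into message.reasoning_content
    bool reasoning_in_content; // also keep the raw tags in message.content
};

static reasoning_behavior behavior_for(reasoning_format f) {
    switch (f) {
        case reasoning_format::NONE:            return { false, false };
        case reasoning_format::AUTO:            // AUTO and DEEPSEEK share a path,
        case reasoning_format::DEEPSEEK:        return { true,  false }; // so swapping the default is a no-op
        case reasoning_format::DEEPSEEK_LEGACY: return { true,  true  };
    }
    return { false, false };
}
```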
Force-pushed from 84c5532 to 928af2e
I have tested all the reasoning (and non-reasoning) models in my collection multiple times, including some tool call testing, without encountering any parsing bugs. I’d like to test more models. :)
unsloth/OLMo-2-0325-32B-Instruct-GGUF/OLMo-2-0325-32B-Instruct-Q6_K.gguf
…p frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
Remove the streaming mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of stopping, enabling proper separation of message.reasoning_content and message.content in both streaming and non-streaming modes. Fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty.
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
…ll tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
…atMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Force-pushed from 32212ff to 1e6beb6
…ormat toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
@ServeurpersoCom unfortunately applying GH code suggestions bypasses the pre-commit hooks that format the code, so please run …
I did some testing and haven't spotted any issues. Let's wait for @ngxson approval and merge.
Looks good overall, just some small comments.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Refactored try_parse_reasoning() to handle incremental parsing during streaming:
Parser improvements (a minimal sketch follows this list):
Tracks partial tag detection (e.g., when stream cuts mid-tag like <thi...)
Handles multiple consecutive reasoning segments within a single response
Preserves leading whitespace while detecting reasoning blocks
Continues processing content after </think> closes instead of stopping early
Works correctly with thinking_forced_open flag for grammar-constrained generation
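A minimal sketch of that incremental state machine, assuming a custom StreamingReasoningParser type (this is not the actual common_chat_msg_parser implementation, just an illustration of the behaviors listed above):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

struct StreamingReasoningParser {
    std::string content;            // text outside <think>...</think>
    std::string reasoning_content;  // text inside <think>...</think>
    bool        in_reasoning = false;
    std::string pending;            // held-back bytes that may be a partial tag

    // Length of the longest proper prefix of `tag` that `s` ends with, or 0.
    static size_t partial_tag_suffix(const std::string & s, const std::string & tag) {
        size_t max = std::min(s.size(), tag.size() - 1);
        for (size_t n = max; n > 0; n--) {
            if (s.compare(s.size() - n, n, tag, 0, n) == 0) {
                return n;
            }
        }
        return 0;
    }

    void feed(const std::string & chunk) {
        std::string buf = pending + chunk;
        pending.clear();
        size_t pos = 0;
        while (pos < buf.size()) {
            const std::string tag  = in_reasoning ? "</think>" : "<think>";
            std::string &     sink = in_reasoning ? reasoning_content : content;
            size_t hit = buf.find(tag, pos);
            if (hit != std::string::npos) {
                sink.append(buf, pos, hit - pos);
                pos = hit + tag.size();
                in_reasoning = !in_reasoning;  // toggling supports multiple segments
            } else {
                // Hold back anything that could be the start of the next tag
                // (e.g. the stream cut mid-tag like "<thi").
                std::string tail = buf.substr(pos);
                size_t keep = partial_tag_suffix(tail, tag);
                sink.append(tail, 0, tail.size() - keep);
                pending = tail.substr(tail.size() - keep);
                pos = buf.size();
            }
        }
    }
};

int main() {
    StreamingReasoningParser p;
    // Tag split across chunks, plus content continuing after </think>.
    for (const char * chunk : {"  <thi", "nk>plan step", "s</think>final ans", "wer"}) {
        p.feed(chunk);
    }
    assert(p.reasoning_content == "plan steps");
    assert(p.content == "  final answer");  // leading whitespace preserved
    return 0;
}
```

A real implementation also has to flush `pending` at end of stream and honor thinking_forced_open; both are elided here.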
Integration changes (see the sketch after this list):
Modified common_chat_parse_content_only() and common_chat_parse_llama_3_1() to invoke reasoning parsing before tool call handling
Changed default reasoning_format from auto to deepseek for consistent behavior
Added deepseek-legacy option for backwards compatibility (inlines tags in content)
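Roughly, the reordering in those two format handlers looks like this (the function names appear in this PR; the builder type and method signatures are stand-in assumptions):

```cpp
// Stand-in stub; only the call order is the point of this sketch.
struct chat_builder {
    void try_parse_reasoning(const char *, const char *) { /* extract <think> blocks */ }
    void try_parse_tool_calls()                          { /* tool-call syntax second */ }
    void add_remaining_content()                         { /* leftovers become content */ }
};

static void common_chat_parse_content_only(chat_builder & builder) {
    builder.try_parse_reasoning("<think>", "</think>"); // new: runs first
    builder.add_remaining_content();
}

static void common_chat_parse_llama_3_1(chat_builder & builder) {
    builder.try_parse_reasoning("<think>", "</think>"); // new: before tool calls
    builder.try_parse_tool_calls();
    builder.add_remaining_content();
}
```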
Benefits
Clients no longer need custom CoT parsing logic for streaming mode
Consistent API behavior: reasoning_content and content properly separated in both streaming and non-streaming modes
Simplifies webui and SDK implementations
Universal: works across all reasoning formats, not just DeepSeek
When generation is launched from a template that ends the system prompt with `<think>` and thinking is enabled, the template sets thinking_forced_open = true; the parser then consumes all text preceding `</think>` as reasoning, even if the opening tag is never emitted by the model, as validated by the REASONINGCONTENT test.
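Concretely, seeding the StreamingReasoningParser sketch from above as already-open (a stand-in for the real thinking_forced_open flag):

```cpp
int main() {
    StreamingReasoningParser p;
    p.in_reasoning = true;  // prompt already ended with "<think>"
    p.feed("the model reasons here</think>answer");
    assert(p.reasoning_content == "the model reasons here"); // no opening tag ever seen
    assert(p.content == "answer");
    return 0;
}
```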
Without --jinja, the server prohibits the use of tools/tool_choice, sets inputs.use_jinja = false, and computes enable_thinking = false. Templates therefore immediately close the `<think>` tag and no longer force the opening, which disables the reasoning/content separation required by our new scenarios (the parser is still called on the stream but will only see regular content unless the model itself emits complete tags).
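A rough sketch of that gating, under the assumption that fields are named as in the paragraph above (the real server structs differ):

```cpp
#include <cstdio>

struct cli_params  { bool use_jinja = false; };  // set by --jinja
struct chat_inputs { bool use_jinja = false; bool enable_thinking = false; };

static chat_inputs prepare_inputs(const cli_params & params, bool request_has_tools) {
    chat_inputs inputs;
    inputs.use_jinja = params.use_jinja;
    if (!params.use_jinja && request_has_tools) {
        std::fprintf(stderr, "error: tools require --jinja\n"); // request rejected
    }
    // Thinking is only enabled on the Jinja path; otherwise the template
    // closes the <think> tag immediately and separation never triggers.
    inputs.enable_thinking = params.use_jinja;
    return inputs;
}

int main() {
    chat_inputs in = prepare_inputs(cli_params{}, false);
    std::printf("enable_thinking=%d\n", in.enable_thinking); // prints 0 without --jinja
    return 0;
}
```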
Testing
Added parser tests covering:
Inline <think>...</think> segments in CONTENT_ONLY format
Inline reasoning in LLAMA_3_X format
Multiple reasoning blocks in a single response (illustrated in the sketch after this list)
Partial tag detection during streaming
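For instance, the multiple-blocks case, expressed against the StreamingReasoningParser sketch above:

```cpp
int main() {
    StreamingReasoningParser p;
    p.feed("<think>a</think>one<think>b</think>two");
    assert(p.reasoning_content == "ab");  // both blocks accumulated in order
    assert(p.content == "onetwo");        // interleaved content preserved
    return 0;
}
```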