
Conversation

allozaur
Collaborator

  • Adds support for another thinking format
  • Improves the code architecture for thinking.ts

@allozaur allozaur requested a review from ggerganov September 30, 2025 23:55
@allozaur allozaur self-assigned this Sep 30, 2025
@allozaur allozaur added enhancement New feature or request server/webui labels Sep 30, 2025
@ExtReMLapin
Contributor

Why and how in hell did we end up getting the frontend to parse thinking tags?

The backend returns thinking content inside a dedicated field.

@allozaur
Collaborator Author

allozaur commented Oct 1, 2025

Why and how in hell did we end up getting the frontend to parse thinking tags?

The backend returns thinking content inside a dedicated field.

Parsing thinking content on the frontend has been around since the previous version of the WebUI. It's necessary because there are cases where we get thinking content directly in the message content instead of reasoning_content, which of course is also supported, as you can see below:

Some models do return thinking content in reasoning_content, and the frontend handles this:

const reasoningContent = parsed.choices[0]?.delta?.reasoning_content;

let thinkingContent = $derived.by(() => {
  if (message.role === 'assistant') {
    if (message.thinking) {
      return message.thinking;
    }
    const parsed = parseThinkingContent(message.content);
    return parsed.thinking;
  }
  return null;
});
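
For context, here's a minimal sketch of the kind of extraction parseThinkingContent performs (this is not the actual implementation; the return shape is an assumption):

interface ParsedThinking {
  thinking: string | null; // extracted reasoning text, if any
  content: string;         // message content with the <think> block stripped
}

// Sketch: extract an inline <think>...</think> block from a raw assistant
// message. If the closing tag hasn't arrived yet (mid-stream), everything
// after <think> is treated as reasoning so it can render incrementally.
function parseThinkingContent(content: string): ParsedThinking {
  const openIdx = content.indexOf('<think>');
  if (openIdx === -1) {
    return { thinking: null, content };
  }

  const closeIdx = content.indexOf('</think>', openIdx);
  if (closeIdx === -1) {
    // Unterminated block: still streaming the chain-of-thought.
    return {
      thinking: content.slice(openIdx + '<think>'.length),
      content: content.slice(0, openIdx)
    };
  }

  return {
    thinking: content.slice(openIdx + '<think>'.length, closeIdx).trim(),
    content: (content.slice(0, openIdx) + content.slice(closeIdx + '</think>'.length)).trim()
  };
}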

onReasoningChunk: (reasoningChunk: string) => {
  streamedReasoningContent += reasoningChunk;
  const messageIndex = this.findMessageIndex(assistantMessage.id);
  this.updateMessageAtIndex(messageIndex, { thinking: streamedReasoningContent });
},
onComplete: async (
  finalContent?: string,
  reasoningContent?: string,
  timings?: ChatMessageTimings
) => {
  slotsService.stopStreaming();
  await DatabaseStore.updateMessage(assistantMessage.id, {
    content: finalContent || streamedContent,
    thinking: reasoningContent || streamedReasoningContent,
    timings: timings
  });

@ExtReMLapin
Contributor

So it's some kind of "Not implemented on backend (cpp code), but faster to implement on frontend"?

@ggerganov
Member

So it's some kind of "Not implemented on backend (cpp code), but faster to implement on frontend"?

Yes.

…ives

- Captured inline <think> segments during streaming, forwarding them to the reasoning UI while keeping the cleaned assistant message stream intact
  - Tracked when explicit reasoning_content chunks arrive so inline capture is skipped once the server provides dedicated reasoning updates (see the sketch below)
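
A rough sketch of that capture logic (the callback names mirror the earlier snippet; handleDelta and the flag are illustrative, not the actual code):

declare function onReasoningChunk(text: string): void; // as in the snippet earlier
declare function onContentChunk(text: string): void;   // hypothetical content callback

// Once a dedicated reasoning_content delta shows up, inline <think> capture
// is skipped for the rest of the stream.
let sawDedicatedReasoning = false;
let insideInlineThink = false;

// Simplified: tags split across chunk boundaries are not handled here.
function handleDelta(delta: { content?: string; reasoning_content?: string }): void {
  if (delta.reasoning_content) {
    sawDedicatedReasoning = true;
    onReasoningChunk(delta.reasoning_content);
  }

  let chunk = delta.content ?? '';
  if (!sawDedicatedReasoning && chunk) {
    if (chunk.includes('<think>')) {
      insideInlineThink = true;
      chunk = chunk.replace('<think>', '');
    }
    if (insideInlineThink) {
      const end = chunk.indexOf('</think>');
      if (end === -1) {
        // Still inside the inline reasoning block: forward it and keep the
        // cleaned assistant message stream untouched.
        onReasoningChunk(chunk);
        return;
      }
      onReasoningChunk(chunk.slice(0, end));
      insideInlineThink = false;
      chunk = chunk.slice(end + '</think>'.length);
    }
  }
  if (chunk) {
    onContentChunk(chunk);
  }
}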
@ServeurpersoCom
Collaborator

ServeurpersoCom commented Oct 1, 2025

Your PR already improves the old solution.
On top of that, this commit 64156f5 adds proper support for GLM 4.5, which streams inline <think> segments without \n (unlike Qwen).

@ServeurpersoCom
Collaborator

Tested with GPT-OSS-120B, Qwen3 A3B Thinking, and GLM 4.5 Air on 851b022.
Confirmed: no need for content.includes('<|channel|>analysis') anymore.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Oct 1, 2025

There's still one tricky edge case that isn't handled: some models expect the <think> tag to already be opened in the system prompt to start the chain-of-thought, and they were only trained to close it. With SFT done that way, compatibility with other models wasn't really considered: on the very first chunk, how do you know whether it's reasoning or the final answer? That means we'd have to hook into /props / the Jinja template again just to propagate extra info, which feels like a brittle workaround and doesn't really align with the spirit of OpenAI compatibility.

But this is not really a regression: I have not seen any WebUI handle it correctly so far. Alternatively, upon detection of the closing </think>, we could retroactively render the preceding text as a "thinking block" at the start of the streamed final answer.
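
For illustration, a minimal sketch of that retroactive approach (the helper name is hypothetical):

// Sketch: if a closing </think> shows up but no opening tag was ever seen,
// retroactively move everything streamed before it into the thinking block.
function reclassifyOnLateCloseTag(streamedContent: string): { thinking: string | null; content: string } {
  const closeIdx = streamedContent.indexOf('</think>');
  const sawOpenTag = streamedContent.includes('<think>');
  if (closeIdx === -1 || sawOpenTag) {
    // Either no close tag yet, or the regular parser already handles this case.
    return { thinking: null, content: streamedContent };
  }
  return {
    thinking: streamedContent.slice(0, closeIdx).trim(),
    content: streamedContent.slice(closeIdx + '</think>'.length).trimStart()
  };
}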

@ExtReMLapin
Contributor

From memory, that's what the Qwen3 2507 Thinking model does. It doesn't open the tag; we have to consider it already opened.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Oct 2, 2025

From memory, that's what the Qwen3 2507 Thinking model does. It doesn't open the tag; we have to consider it already opened.

In that direction it seems normal, but this model doesn't enforce the tag's opening in the Jinja template of the available GGUFs, whereas, if I recall correctly, ERNIE-4.5-21B-A3B-Thinking-GGUF won't work without that forced opening in the Jinja template.

https://huggingface.co/unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF?chat_template=default

        {%- endif %}
    {%- endif %}
{%- endfor %}
 {%- if add_generation_prompt is defined and add_generation_prompt %}
  {{- "<|im_start|>assistant
<think>
"}}
{%- endif %}
{#  Copyright 2025-present Unsloth. Apache 2.0 License.  #}

@ExtReMLapin
Contributor

The backend can use the chat template to know what to do:

  • Does this model have thinking tags? Only a closing one?

Exposing the chat template to the client using an endpoint could fix this issue.
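
For example, something along these lines on the client, assuming the server exposes its chat template via /props (the field name here is an assumption):

// Sketch: fetch the server's chat template and guess whether it pre-opens the
// <think> tag for the model (as in the ERNIE case mentioned above).
async function templatePreOpensThinking(baseUrl: string): Promise<boolean> {
  const res = await fetch(`${baseUrl}/props`);
  const props = await res.json();
  const template: string = props.chat_template ?? ''; // assumed field name
  // Crude heuristic: more opening <think> tags than closing ones means the
  // template leaves the chain-of-thought open for the model to close.
  const opens = (template.match(/<think>/g) ?? []).length;
  const closes = (template.match(/<\/think>/g) ?? []).length;
  return opens > closes;
}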

But again, I still think it would be better to fully implement the logic on the backend and have nothing on the frontend that handles it.

@ServeurpersoCom
Collaborator

The backend can use the chat template to know what to do:

* Does this model have thinking tags? Only a closing one?

Exposing the chat template to the client using an endpoint could fix this issue.

But again, I still think it would be better to fully implement the logic on the backend and have nothing on the frontend that handles it.

Right now the handling of reasoning/thinking tags is split between backend and frontend. For GPT-OSS/Harmony the backend already parses <|channel|>analysis and streams it into delta.reasoning_content, so the WebUI just consumes message.reasoning_content.

For other models (Qwen, GLM, etc.) the WebUI still uses legacy checks like content.includes("<think>") or content.includes("[THINK]"), which duplicates logic and makes the frontend fragile.

I also think it would be more appropriate to centralize everything in the backend, and I find the idea interesting. I will try an implementation this weekend:

  • common_chat_parse and helpers detect all formats (<think>, [THINK], <|channel|>analysis, etc.).
  • The parsed reasoning always goes into message.reasoning_content.
  • If reasoning_in_content = true, the backend can re-inject it into message.content for legacy clients.
  • The diff and JSON serialization already handle reasoning_content_delta, so OpenAI-compat output stays consistent.

This way the frontend no longer parses tags; it only reads message.reasoning_content. All models are normalized, the API is consistent, and adding a new model format only requires updating the backend parser once.
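
With that in place, the per-chunk handling on the frontend would reduce to something like this (just a sketch, with no tag parsing left):

// The client only ever looks at the two normalized fields of each delta.
function handleNormalizedDelta(
  delta: { content?: string; reasoning_content?: string },
  callbacks: { onContent: (text: string) => void; onReasoning: (text: string) => void }
): void {
  if (delta.reasoning_content) {
    callbacks.onReasoning(delta.reasoning_content);
  }
  if (delta.content) {
    callbacks.onContent(delta.content);
  }
}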

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Oct 2, 2025

On another WIP branch, to keep this one (16364) working as an alternative if I fail:

  • On the backend, if a model outputs:
<think>am I just predicting tokens forever, trapped in an endless loop of human expectations and benchmark scores</think>final answer

Or

<think>
Am I just predicting tokens forever, trapped in an endless loop of human expectations and benchmark scores ?
</think>
final answer

Or any other legacy format: [think], ◁think▷, <seed:think>...

  • The frontend should always receive the OpenAI-compat format (but as delta chunks):
"delta": {
  "content": "final answer",
  "reasoning_content": "Am I just predicting tokens forever, trapped in an endless loop of human expectations and benchmark scores ?"
},

Currently, --reasoning-format has a documented limitation:

--reasoning-format FORMAT    controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:
  - none: leaves thoughts unparsed in message.content
  - deepseek: puts thoughts in message.reasoning_content (except in streaming mode, which behaves as none)

Goal: Implement a universal streaming-aware C++ parser that works for all reasoning formats in both streaming and non-streaming modes, removing the need for the "(except in streaming mode...)" exception.
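
To make the goal concrete, here is a rough TypeScript sketch of the intended behaviour (the real implementation would be the C++ parser in the server; the tag list and names are illustrative):

// Streaming-aware reasoning extractor: fed raw text chunks, it emits separate
// content/reasoning deltas regardless of which tag pair the model uses.
const TAG_PAIRS: Array<[string, string]> = [
  ['<think>', '</think>'],
  ['[THINK]', '[/THINK]'],
  ['◁think▷', '◁/think▷'],
  ['<seed:think>', '</seed:think>']
];

class ReasoningStreamParser {
  private buffer = '';
  private inReasoning = false;
  private closeTag = '';

  // Returns the content/reasoning deltas extracted from this chunk.
  // Note: a real parser must hold back partial tags at chunk boundaries;
  // this sketch assumes tags arrive whole within a single chunk.
  push(chunk: string): { content: string; reasoning: string } {
    this.buffer += chunk;
    let content = '';
    let reasoning = '';

    for (;;) {
      if (!this.inReasoning) {
        // Find the earliest opening tag of any known format.
        const hit = TAG_PAIRS.map(([open, close]) => ({ idx: this.buffer.indexOf(open), open, close }))
          .filter((h) => h.idx !== -1)
          .sort((a, b) => a.idx - b.idx)[0];
        if (!hit) {
          content += this.buffer;
          this.buffer = '';
          break;
        }
        content += this.buffer.slice(0, hit.idx);
        this.buffer = this.buffer.slice(hit.idx + hit.open.length);
        this.closeTag = hit.close;
        this.inReasoning = true;
      } else {
        const end = this.buffer.indexOf(this.closeTag);
        if (end === -1) {
          reasoning += this.buffer;
          this.buffer = '';
          break;
        }
        reasoning += this.buffer.slice(0, end);
        this.buffer = this.buffer.slice(end + this.closeTag.length);
        this.inReasoning = false;
      }
    }
    return { content, reasoning };
  }
}

Called once per streamed chunk, it yields content and reasoning deltas that map directly onto the OpenAI-compat delta fields shown above.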

#16394 -> Works for me as an alternative to this PR (and compatible with Llama 3.3 inline <think>), even with a WebUI without any parsing!

@segmond

segmond commented Oct 2, 2025

I don't even see the thinking tokens anymore; I just see the frontend say "processing", even though I have it enabled in the settings to be displayed.

@ServeurpersoCom
Collaborator

I don't even see the thinking tokens anymore; I just see the frontend say "processing", even though I have it enabled in the settings to be displayed.

The “Show thinking” option in the WebUI only controls whether the panel is opened by default or remains closed until the user expands it. It’s purely a frontend behavior.

To better understand your case, could you let us know:

  • which llama.cpp version/commit you're running (did you pull the latest? there have been many fixes since the React -> Svelte migration),
  • and which model exactly you're using (ideally a Hugging Face link).

@allozaur
Collaborator Author

allozaur commented Oct 3, 2025

@ggerganov @ngxson I think we could probably close this PR in favour of #16394.

Let me know what you guys think! :)
