
Conversation

ServeurpersoCom
Collaborator

Add generic fallback to detect trailing <think> tags in Jinja templates and handle forced-open reasoning blocks:

  • Detect trailing <think> tags in generic chat templates, trim whitespace, and either append the closing tag or mark the reasoning block as forced-open based on enable_thinking (see the sketch after this list)
  • Add a regression test covering a fallback template that opens the reasoning block in the prompt and verifies prompt differences, forced-open behaviour, and reasoning parsing
  • Now compatible with models using the default Jinja chat template, such as https://huggingface.co/unsloth/GLM-Z1-32B-0414-GGUF
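
A minimal sketch of the fallback idea (hypothetical helper names and signature, not the actual implementation in the chat-template handling code):

```cpp
#include <cctype>
#include <string>

struct prompt_result {
    std::string prompt;
    bool        thinking_forced_open = false;
};

// Sketch: after rendering the Jinja template, inspect the trimmed tail of the
// prompt for an opening <think> tag and decide what to do based on
// enable_thinking.
static prompt_result apply_think_fallback(std::string prompt, bool enable_thinking) {
    prompt_result res;

    // trim trailing whitespace before inspecting the tail
    while (!prompt.empty() && std::isspace((unsigned char) prompt.back())) {
        prompt.pop_back();
    }

    const std::string open_tag = "<think>";
    const bool ends_open =
        prompt.size() >= open_tag.size() &&
        prompt.compare(prompt.size() - open_tag.size(), open_tag.size(), open_tag) == 0;

    if (ends_open) {
        if (enable_thinking) {
            // the model starts generating inside the reasoning block
            res.thinking_forced_open = true;
        } else {
            // thinking disabled: close the block so no reasoning is generated
            prompt += "</think>";
        }
    }

    res.prompt = std::move(prompt);
    return res;
}
```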


github-actions bot added the testing (Everything test related) label on Oct 4, 2025
@ServeurpersoCom
Collaborator Author

ServeurpersoCom commented Oct 4, 2025

(llama-server --jinja)

  • Before fix "reasoning_format":"auto" ; "stream":false:
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user said \"Salut\", which is French for \"Hello\". I need to respond in a friendly manner.\n\nSince they used French, maybe they want the response in French too. Let me check the history. The initial prompt was in French, but the user's previous message was \"Salut\". Wait, the user message starts with \"Tu es un assistant utile.\" which is French. Then they wrote \"Salut\". So perhaps they expect a French response.\n\nBut looking at the assistant's response, it's in English. The user set the initial instruction in French, so maybe they prefer French. However, sometimes people switch languages. Let me confirm. The user's messages are in French, so I should reply in French to be consistent.\n\nSo, the correct response would be in French: \"Salut ! Comment puis-je t'aider aujourd'hui ?\" That's friendly and helpful. Make sure the greeting matches and the offer to assist is clear.\n</think>\nSalut ! comment puis-je t'aider aujourd'hui ? 😊"}}],"created":1759601227,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":216,"prompt_tokens":19,"total_tokens":235},"id":"chatcmpl-8IbqU9EHbs3gq1r4tysaPBq4SIoJe8Az","timings":{"cache_n":14,"prompt_n":5,"prompt_ms":41.655,"prompt_per_token_ms":8.331,"prompt_per_second":120.03360941063498,"predicted_n":216,"predicted_ms":4130.0,"predicted_per_token_ms":19.12037037037037,"predicted_per_second":52.300242130750604}}
  • Before fix "reasoning_format":"none" "stream":false :
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"none","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user greeted me with \"Salut\" which is French for \"Hello\". I should respond in French to keep the conversation natural. I'll say \"Salut ! Comment puis-je t'aider aujourd'hui ?\" which means \"Hello! How can I assist you today?\" That's friendly and open for them to ask for help. Let me double-check the spelling to make sure there are no mistakes. Yep, looks good. I'll go with that.\n</think>\nSalut ! Comment puis-je t'aider aujourd'hui ?"}}],"created":1759601236,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":113,"prompt_tokens":19,"total_tokens":132},"id":"chatcmpl-gvxkBcwrkMukqSyBsVwCBAkJVgQTEzDC","timings":{"cache_n":18,"prompt_n":1,"prompt_ms":19.573,"prompt_per_token_ms":19.573,"prompt_per_second":51.090788330863944,"predicted_n":113,"predicted_ms":2150.993,"predicted_per_token_ms":19.035336283185842,"predicted_per_second":52.53387621438099}}
  • After fix, "reasoning_format":"auto", "stream":false:
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user said \"Salut\", which is French for \"Hi\". I should respond in a friendly manner. Let me greet them back and ask how I can assist them. Keep it simple and welcoming. Maybe \"Salut ! Comment puis-je vous aider aujourd'hui ?\" That's \"Hi! How can I assist you today?\" in French. They might expect a French response since they used French. Let me make sure the translation is correct. Yes, that works. Keep the tone polite and open-ended.","content":"Salut ! Comment puis-je vous aider aujourd'hui ?"}}],"created":1759605045,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":122,"prompt_tokens":19,"total_tokens":141},"id":"chatcmpl-CyUbsLJowfBgwu2oJsTmR8uYEVzTAI41","timings":{"cache_n":14,"prompt_n":5,"prompt_ms":42.02,"prompt_per_token_ms":8.404,"prompt_per_second":118.99095668729176,"predicted_n":122,"predicted_ms":2321.744,"predicted_per_token_ms":19.030688524590165,"predicted_per_second":52.54670626908048}}
  • After fix, "reasoning_format":"none", "stream":false:
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"none","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user said \"Salut\" which is French for \"Hi.\" I should respond in French to keep the conversation consistent.\n\nLet me make sure to use a friendly and welcoming tone. Maybe say \"Salut ! Comment puis-je t'aider aujourd'hui ?\" which means \"Hi! How can I assist you today?\"\n\nI should double-check the spelling and grammar to ensure it's correct. Also, keeping it simple and straightforward would be best.\n</think>\nSalut ! Comment puis-je t'aider aujourd'hui ?"}}],"created":1759605067,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":110,"prompt_tokens":19,"total_tokens":129},"id":"chatcmpl-LBffKWe3rnTJlz0u5ytmlYMXesZX9UKu","timings":{"cache_n":18,"prompt_n":1,"prompt_ms":27.242,"prompt_per_token_ms":27.242,"prompt_per_second":36.708024374128186,"predicted_n":110,"predicted_ms":2091.538,"predicted_per_token_ms":19.01398181818182,"predicted_per_second":52.592876629542474}}

Also works with "stream": true, together with the streaming-aware parser (#16394).

@ggerganov
Member

The CI seems to fail?

@ServeurpersoCom ServeurpersoCom force-pushed the generic-jinja-think-fallback branch from 0f63b5b to 0869085 on October 6, 2025 09:15
@ServeurpersoCom
Collaborator Author

Reproduced; it's on my end, I'm checking...

=== llama_chat_format_single (user message) ===

fmt_single(chatml) :
<|im_start|>user
How are you<|im_end|>
<|im_start|>assistant

-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v1) :  [INST] How are you [/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v3) : [INST] How are you[/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v3-tekken) : [INST]How are you[/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v7) : [INST] How are you[/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(llama2) : [INST] How are you [/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral) : [INST] How are you [/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(gemma) :
<start_of_turn>user
How are you<end_of_turn>
<start_of_turn>model

-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(llama3) : <|start_header_id|>user<|end_header_id|>

How are you<|eot_id|><|start_header_id|>assistant<|end_header_id|>


-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(gigachat) : user<|role_sep|>How are you<|message_sep|>available functions<|role_sep|>[]<|message_sep|>assistant<|role_sep|>
-------------------------
(root|~/llama.cpp.pascal)

@ServeurpersoCom ServeurpersoCom force-pushed the generic-jinja-think-fallback branch from 89ae131 to e1f526c on October 6, 2025 10:23
@ServeurpersoCom
Collaborator Author

ServeurpersoCom commented Oct 6, 2025

I'm getting "Errors while running CTest" on the CI, and I need to check if there's a regression somewhere else.

test-chat / test-chat-template are OK vs. master (the remaining diff lines are run-dependent json_healing_marker values, plus the new Content-only reasoning-trace case added by this PR):

(root|~/llama.cpp.pascal) ./build/bin/test-chat > /root/dev-chat.log 2>&1
(root|~/llama.cpp.pascal) ./build/bin/test-chat-template > /root/dev-chat-template.log 2>&1
(root|~/llama.cpp.pascal) cd ../llama.cpp
(root|~/llama.cpp) ./build/bin/test-chat > /root/master-chat.log 2>&1
(root|~/llama.cpp) ./build/bin/test-chat-template > /root/master-chat-template.log 2>&1
(root|~/llama.cpp) diff /root/dev-chat-template.log /root/master-chat-template.log
(root|~/llama.cpp) diff /root/dev-chat.log /root/master-chat.log
504,505d503
< Parsing input with format Content-only: Reasoning trace</think>Final answer
< Parsed message: {"role":"assistant","content":"Final answer","reasoning_content":"Reasoning trace","thinking":"Reasoning trace"}
589,590c587,588
< Parsed partial JSON: [{"name":"special_function","660260756":1}] (json_healing_marker: ,"660260756)
< Cleaned up JSON [{"name":"special_function","660260756":1}] to [{"name":"special_function"}] (json_healing_marker : ',"660260756')
---
> Parsed partial JSON: [{"name":"special_function","1687926652":1}] (json_healing_marker: ,"1687926652)
> Cleaned up JSON [{"name":"special_function","1687926652":1}] to [{"name":"special_function"}] (json_healing_marker : ',"1687926652')
596,597c594,595
< Parsed partial JSON: [{"name":"special_function","arguments":{"arg959997301":1}}] (json_healing_marker: 959997301)
< Cleaned up JSON [{"name":"special_function","arguments":{"arg959997301":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '959997301')
---
> Parsed partial JSON: [{"name":"special_function","arguments":{"arg660260756":1}}] (json_healing_marker: 660260756)
> Cleaned up JSON [{"name":"special_function","arguments":{"arg660260756":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '660260756')
603,604c601,602
< Parsed partial JSON: [{"name":"special_function","arguments":{"arg485560280":1}}] (json_healing_marker: 485560280)
< Cleaned up JSON [{"name":"special_function","arguments":{"arg485560280":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '485560280')
---
> Parsed partial JSON: [{"name":"special_function","arguments":{"arg959997301":1}}] (json_healing_marker: 959997301)
> Cleaned up JSON [{"name":"special_function","arguments":{"arg959997301":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '959997301')
611,612c609,610
< Parsed partial JSON: [{"name":"special_function","arguments":{"arg1":1},"402724286":1}] (json_healing_marker: "402724286)
< Cleaned up JSON [{"name":"special_function","arguments":{"arg1":1},"402724286":1}] to [{"name":"special_function","arguments":"{\"arg1\":1}"}] (json_healing_marker : '"402724286')
---
> Parsed partial JSON: [{"name":"special_function","arguments":{"arg1":1},"485560280":1}] (json_healing_marker: "485560280)
> Cleaned up JSON [{"name":"special_function","arguments":{"arg1":1},"485560280":1}] to [{"name":"special_function","arguments":"{\"arg1\":1}"}] (json_healing_marker : '"485560280')
634,635c632,633
< Parsed partial JSON: {"arg1364228444":1} (json_healing_marker: 364228444)
< Cleaned up JSON {"arg1364228444":1} to "{\"arg1" (json_healing_marker : '364228444')
---
> Parsed partial JSON: {"arg1894429689":1} (json_healing_marker: 894429689)
> Cleaned up JSON {"arg1894429689":1} to "{\"arg1" (json_healing_marker : '894429689')
642,643c640,641
< Parsed partial JSON: {"arg11947346619":1} (json_healing_marker: 1947346619)
< Cleaned up JSON {"arg11947346619":1} to "{\"arg1" (json_healing_marker : '1947346619')
---
> Parsed partial JSON: {"arg1364228444":1} (json_healing_marker: 364228444)
> Cleaned up JSON {"arg1364228444":1} to "{\"arg1" (json_healing_marker : '364228444')
665,666c663,664
< Parsed partial JSON: {"arg12007905771":1} (json_healing_marker: 2007905771)
< Cleaned up JSON {"arg12007905771":1} to "{\"arg1" (json_healing_marker : '2007905771')
---
> Parsed partial JSON: {"arg12114738097":1} (json_healing_marker: 2114738097)
> Cleaned up JSON {"arg12114738097":1} to "{\"arg1" (json_healing_marker : '2114738097')
(root|~/llama.cpp)
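
The two lines added at the top of the diff exercise the new behaviour: with the reasoning block forced open, output before </think> parses as reasoning and the rest as content. A minimal sketch of that split, assuming a hypothetical standalone helper (not the actual parser):

```cpp
#include <string>

struct parsed_msg {
    std::string reasoning_content;
    std::string content;
};

// Sketch: with the block forced open, the opening <think> is already in the
// prompt, so output up to </think> is reasoning and the rest is the answer.
static parsed_msg parse_forced_open(const std::string & output) {
    parsed_msg msg;
    const std::string close_tag = "</think>";
    const size_t pos = output.find(close_tag);
    if (pos == std::string::npos) {
        msg.reasoning_content = output; // generation ended mid-reasoning
    } else {
        msg.reasoning_content = output.substr(0, pos);
        msg.content = output.substr(pos + close_tag.size());
    }
    return msg;
}

// parse_forced_open("Reasoning trace</think>Final answer") yields
// reasoning_content = "Reasoning trace" and content = "Final answer",
// matching the parsed message in the diff above.
```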

@tommarques56

Hi @ServeurpersoCom, I can run an automated high-severity-only LLM review on this PR and post a single focused inline comment. Reply with "approve" or add a comment saying "@tommarques56 approve" to proceed.

@ServeurpersoCom
Collaborator Author

> Hi @ServeurpersoCom, I can run an automated high-severity-only LLM review on this PR and post a single focused inline comment. Reply with "approve" or add a comment saying "@tommarques56 approve" to proceed.

@tommarques56 approve

Hey tommarques56, I really like what you’re doing with these automated LLM reviews! That’s a great idea!
Go ahead and run it, and if it finds anything, I’ll definitely take it into account and fix it accordingly.

@ggerganov
Member

@ServeurpersoCom I just blocked this user for spamming. This is not a good way to run such experiments because it introduces a lot of noise into the discussions.

@ServeurpersoCom
Collaborator Author

> @ServeurpersoCom I just blocked this user for spamming. This is not a good way to run such experiments because it introduces a lot of noise into the discussions.

Ah, that explains the false XSS detection from the bot earlier! Perfect, thanks for clarifying!

ServeurpersoCom and others added 3 commits October 6, 2025 20:15
…mplates and handle forced-open reasoning blocks

- Detect trailing <think> tags in generic chat templates, trim whitespace, and either append
  the closing tag or mark the reasoning block as forced-open based on enable_thinking
- Add a regression test covering a fallback template that opens the reasoning block in the
  prompt and verifies prompt differences, forced-open behaviour, and reasoning parsing
- Now compatible with models using the default Jinja chat template, such as
  https://huggingface.co/unsloth/GLM-Z1-32B-0414-GGUF
…t through common_chat_params for consistent <think> handling

- Add a supports_enable_thinking field to common_chat_params, populate it during template rendering,
  and reuse it when deciding whether the generic <think> fallback should run
- Update common_chat_templates_support_enable_thinking to consult the tracked capability and expand
  the chat template tests to assert the flag for templates that do and do not react to enable_thinking
- Update chat template tests to assert the guarded fallback behaviour and to cover templates that
  conditionally open <think> blocks.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
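
A rough sketch of the capability tracking described in the second commit; only common_chat_params and common_chat_templates_support_enable_thinking come from the commit message, everything else is assumed for illustration:

```cpp
#include <string>

// Sketch of the params structure: the renderer records whether the template
// actually reacted to enable_thinking.
struct common_chat_params_sketch {
    std::string prompt;
    bool        supports_enable_thinking = false; // set during rendering
};

// Hypothetical detection: render the template twice, once with
// enable_thinking on and once off; if the two prompts differ, the template
// honours the flag and the generic <think> fallback is unnecessary.
static bool template_reacts_to_enable_thinking(const std::string & prompt_on,
                                               const std::string & prompt_off) {
    return prompt_on != prompt_off;
}
```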
@ServeurpersoCom ServeurpersoCom force-pushed the generic-jinja-think-fallback branch from 756e6ec to 6041e25 on October 6, 2025 18:15