
Conversation

ServeurpersoCom
Collaborator

Add generic fallback to detect trailing <think> tags in Jinja templates and handle forced-open reasoning blocks:

  • Detect trailing <think> tags in generic chat templates, trim whitespace, and either append the closing tag or mark the reasoning block as forced-open based on enable_thinking (see the sketch after this list)
  • Add a regression test covering a fallback template that opens the reasoning block in the prompt and verifies prompt differences, forced-open behaviour, and reasoning parsing
  • Now compatible with models using the default Jinja chat template, such as https://huggingface.co/unsloth/GLM-Z1-32B-0414-GGUF
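
A minimal sketch of the fallback idea (hypothetical helper names and signature, not the actual implementation in the chat-template handling code):

```cpp
#include <cctype>
#include <string>

struct prompt_result {
    std::string prompt;
    bool        thinking_forced_open = false;
};

// Sketch: after rendering the Jinja template, inspect the trimmed tail of the
// prompt for an opening <think> tag and decide what to do based on
// enable_thinking.
static prompt_result apply_think_fallback(std::string prompt, bool enable_thinking) {
    prompt_result res;

    // trim trailing whitespace before inspecting the tail
    while (!prompt.empty() && std::isspace((unsigned char) prompt.back())) {
        prompt.pop_back();
    }

    const std::string open_tag = "<think>";
    const bool ends_open =
        prompt.size() >= open_tag.size() &&
        prompt.compare(prompt.size() - open_tag.size(), open_tag.size(), open_tag) == 0;

    if (ends_open) {
        if (enable_thinking) {
            // the model starts generating inside the reasoning block
            res.thinking_forced_open = true;
        } else {
            // thinking disabled: close the block so no reasoning is generated
            prompt += "</think>";
        }
    }

    res.prompt = std::move(prompt);
    return res;
}
```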


github-actions bot added the testing (Everything test related) label on Oct 4, 2025
@ServeurpersoCom
Collaborator Author

ServeurpersoCom commented Oct 4, 2025

(llama-server --jinja)

  • Before fix "reasoning_format":"auto" ; "stream":false:
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user said \"Salut\", which is French for \"Hello\". I need to respond in a friendly manner.\n\nSince they used French, maybe they want the response in French too. Let me check the history. The initial prompt was in French, but the user's previous message was \"Salut\". Wait, the user message starts with \"Tu es un assistant utile.\" which is French. Then they wrote \"Salut\". So perhaps they expect a French response.\n\nBut looking at the assistant's response, it's in English. The user set the initial instruction in French, so maybe they prefer French. However, sometimes people switch languages. Let me confirm. The user's messages are in French, so I should reply in French to be consistent.\n\nSo, the correct response would be in French: \"Salut ! Comment puis-je t'aider aujourd'hui ?\" That's friendly and helpful. Make sure the greeting matches and the offer to assist is clear.\n</think>\nSalut ! comment puis-je t'aider aujourd'hui ? 😊"}}],"created":1759601227,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":216,"prompt_tokens":19,"total_tokens":235},"id":"chatcmpl-8IbqU9EHbs3gq1r4tysaPBq4SIoJe8Az","timings":{"cache_n":14,"prompt_n":5,"prompt_ms":41.655,"prompt_per_token_ms":8.331,"prompt_per_second":120.03360941063498,"predicted_n":216,"predicted_ms":4130.0,"predicted_per_token_ms":19.12037037037037,"predicted_per_second":52.300242130750604}}
  • Before fix "reasoning_format":"none" "stream":false :
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"none","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user greeted me with \"Salut\" which is French for \"Hello\". I should respond in French to keep the conversation natural. I'll say \"Salut ! Comment puis-je t'aider aujourd'hui ?\" which means \"Hello! How can I assist you today?\" That's friendly and open for them to ask for help. Let me double-check the spelling to make sure there are no mistakes. Yep, looks good. I'll go with that.\n</think>\nSalut ! Comment puis-je t'aider aujourd'hui ?"}}],"created":1759601236,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":113,"prompt_tokens":19,"total_tokens":132},"id":"chatcmpl-gvxkBcwrkMukqSyBsVwCBAkJVgQTEzDC","timings":{"cache_n":18,"prompt_n":1,"prompt_ms":19.573,"prompt_per_token_ms":19.573,"prompt_per_second":51.090788330863944,"predicted_n":113,"predicted_ms":2150.993,"predicted_per_token_ms":19.035336283185842,"predicted_per_second":52.53387621438099}}
  • After fix, "reasoning_format":"auto", "stream":false:
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user said \"Salut\", which is French for \"Hi\". I should respond in a friendly manner. Let me greet them back and ask how I can assist them. Keep it simple and welcoming. Maybe \"Salut ! Comment puis-je vous aider aujourd'hui ?\" That's \"Hi! How can I assist you today?\" in French. They might expect a French response since they used French. Let me make sure the translation is correct. Yes, that works. Keep the tone polite and open-ended.","content":"Salut ! Comment puis-je vous aider aujourd'hui ?"}}],"created":1759605045,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":122,"prompt_tokens":19,"total_tokens":141},"id":"chatcmpl-CyUbsLJowfBgwu2oJsTmR8uYEVzTAI41","timings":{"cache_n":14,"prompt_n":5,"prompt_ms":42.02,"prompt_per_token_ms":8.404,"prompt_per_second":118.99095668729176,"predicted_n":122,"predicted_ms":2321.744,"predicted_per_token_ms":19.030688524590165,"predicted_per_second":52.54670626908048}}
  • After fix, "reasoning_format":"none", "stream":false:
(root|~/llama.cpp.pascal) curl -s -N https://www.serveurperso.com/ia/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"system","content":"Tu es un assistant utile."},{"role":"user","content":"Salut"}],"stream":false,"model":"GLM-Z1-32B-0414","reasoning_format":"none","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["edkypmxt"],"timings_per_token":true}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user said \"Salut\" which is French for \"Hi.\" I should respond in French to keep the conversation consistent.\n\nLet me make sure to use a friendly and welcoming tone. Maybe say \"Salut ! Comment puis-je t'aider aujourd'hui ?\" which means \"Hi! How can I assist you today?\"\n\nI should double-check the spelling and grammar to ensure it's correct. Also, keeping it simple and straightforward would be best.\n</think>\nSalut ! Comment puis-je t'aider aujourd'hui ?"}}],"created":1759605067,"model":"GLM-Z1-32B-0414","system_fingerprint":"b6699-3fd608ce","object":"chat.completion","usage":{"completion_tokens":110,"prompt_tokens":19,"total_tokens":129},"id":"chatcmpl-LBffKWe3rnTJlz0u5ytmlYMXesZX9UKu","timings":{"cache_n":18,"prompt_n":1,"prompt_ms":27.242,"prompt_per_token_ms":27.242,"prompt_per_second":36.708024374128186,"predicted_n":110,"predicted_ms":2091.538,"predicted_per_token_ms":19.01398181818182,"predicted_per_second":52.592876629542474}}

Also works with "stream": true, together with the streaming-aware parser (#16394).

@ggerganov
Member

The CI seems to fail?

@ServeurpersoCom ServeurpersoCom force-pushed the generic-jinja-think-fallback branch from 0f63b5b to 0869085 on October 6, 2025 09:15
@ServeurpersoCom
Collaborator Author

Reproduced; it's on my end, I'm checking...

=== llama_chat_format_single (user message) ===

fmt_single(chatml) :
<|im_start|>user
How are you<|im_end|>
<|im_start|>assistant

-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v1) :  [INST] How are you [/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v3) : [INST] How are you[/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v3-tekken) : [INST]How are you[/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral-v7) : [INST] How are you[/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(llama2) : [INST] How are you [/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(mistral) : [INST] How are you [/INST]
-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(gemma) :
<start_of_turn>user
How are you<end_of_turn>
<start_of_turn>model

-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(llama3) : <|start_header_id|>user<|end_header_id|>

How are you<|eot_id|><|start_header_id|>assistant<|end_header_id|>


-------------------------
Failed to infer a tool call example (possible template bug)
fmt_single(gigachat) : user<|role_sep|>How are you<|message_sep|>available functions<|role_sep|>[]<|message_sep|>assistant<|role_sep|>
-------------------------
(root|~/llama.cpp.pascal)

@ServeurpersoCom ServeurpersoCom force-pushed the generic-jinja-think-fallback branch from 89ae131 to e1f526c on October 6, 2025 10:23
@ServeurpersoCom
Collaborator Author

ServeurpersoCom commented Oct 6, 2025

I'm getting "Errors while running CTest" on the CI, and I need to check if there's a regression somewhere else.

test-chat / test-chat-template are OK vs. master (the remaining diff lines are run-dependent json_healing_marker values, plus the new Content-only reasoning-trace case added by this PR):

(root|~/llama.cpp.pascal) ./build/bin/test-chat > /root/dev-chat.log 2>&1
(root|~/llama.cpp.pascal) ./build/bin/test-chat-template > /root/dev-chat-template.log 2>&1
(root|~/llama.cpp.pascal) cd ../llama.cpp
(root|~/llama.cpp) ./build/bin/test-chat > /root/master-chat.log 2>&1
(root|~/llama.cpp) ./build/bin/test-chat-template > /root/master-chat-template.log 2>&1
(root|~/llama.cpp) diff /root/dev-chat-template.log /root/master-chat-template.log
(root|~/llama.cpp) diff /root/dev-chat.log /root/master-chat.log
504,505d503
< Parsing input with format Content-only: Reasoning trace</think>Final answer
< Parsed message: {"role":"assistant","content":"Final answer","reasoning_content":"Reasoning trace","thinking":"Reasoning trace"}
589,590c587,588
< Parsed partial JSON: [{"name":"special_function","660260756":1}] (json_healing_marker: ,"660260756)
< Cleaned up JSON [{"name":"special_function","660260756":1}] to [{"name":"special_function"}] (json_healing_marker : ',"660260756')
---
> Parsed partial JSON: [{"name":"special_function","1687926652":1}] (json_healing_marker: ,"1687926652)
> Cleaned up JSON [{"name":"special_function","1687926652":1}] to [{"name":"special_function"}] (json_healing_marker : ',"1687926652')
596,597c594,595
< Parsed partial JSON: [{"name":"special_function","arguments":{"arg959997301":1}}] (json_healing_marker: 959997301)
< Cleaned up JSON [{"name":"special_function","arguments":{"arg959997301":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '959997301')
---
> Parsed partial JSON: [{"name":"special_function","arguments":{"arg660260756":1}}] (json_healing_marker: 660260756)
> Cleaned up JSON [{"name":"special_function","arguments":{"arg660260756":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '660260756')
603,604c601,602
< Parsed partial JSON: [{"name":"special_function","arguments":{"arg485560280":1}}] (json_healing_marker: 485560280)
< Cleaned up JSON [{"name":"special_function","arguments":{"arg485560280":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '485560280')
---
> Parsed partial JSON: [{"name":"special_function","arguments":{"arg959997301":1}}] (json_healing_marker: 959997301)
> Cleaned up JSON [{"name":"special_function","arguments":{"arg959997301":1}}] to [{"name":"special_function","arguments":"{\"arg"}] (json_healing_marker : '959997301')
611,612c609,610
< Parsed partial JSON: [{"name":"special_function","arguments":{"arg1":1},"402724286":1}] (json_healing_marker: "402724286)
< Cleaned up JSON [{"name":"special_function","arguments":{"arg1":1},"402724286":1}] to [{"name":"special_function","arguments":"{\"arg1\":1}"}] (json_healing_marker : '"402724286')
---
> Parsed partial JSON: [{"name":"special_function","arguments":{"arg1":1},"485560280":1}] (json_healing_marker: "485560280)
> Cleaned up JSON [{"name":"special_function","arguments":{"arg1":1},"485560280":1}] to [{"name":"special_function","arguments":"{\"arg1\":1}"}] (json_healing_marker : '"485560280')
634,635c632,633
< Parsed partial JSON: {"arg1364228444":1} (json_healing_marker: 364228444)
< Cleaned up JSON {"arg1364228444":1} to "{\"arg1" (json_healing_marker : '364228444')
---
> Parsed partial JSON: {"arg1894429689":1} (json_healing_marker: 894429689)
> Cleaned up JSON {"arg1894429689":1} to "{\"arg1" (json_healing_marker : '894429689')
642,643c640,641
< Parsed partial JSON: {"arg11947346619":1} (json_healing_marker: 1947346619)
< Cleaned up JSON {"arg11947346619":1} to "{\"arg1" (json_healing_marker : '1947346619')
---
> Parsed partial JSON: {"arg1364228444":1} (json_healing_marker: 364228444)
> Cleaned up JSON {"arg1364228444":1} to "{\"arg1" (json_healing_marker : '364228444')
665,666c663,664
< Parsed partial JSON: {"arg12007905771":1} (json_healing_marker: 2007905771)
< Cleaned up JSON {"arg12007905771":1} to "{\"arg1" (json_healing_marker : '2007905771')
---
> Parsed partial JSON: {"arg12114738097":1} (json_healing_marker: 2114738097)
> Cleaned up JSON {"arg12114738097":1} to "{\"arg1" (json_healing_marker : '2114738097')
(root|~/llama.cpp)
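
The two lines added at the top of the diff exercise the new behaviour: with the reasoning block forced open, output before </think> parses as reasoning and the rest as content. A minimal sketch of that split, assuming a hypothetical standalone helper (not the actual parser):

```cpp
#include <string>

struct parsed_msg {
    std::string reasoning_content;
    std::string content;
};

// Sketch: with the block forced open, the opening <think> is already in the
// prompt, so output up to </think> is reasoning and the rest is the answer.
static parsed_msg parse_forced_open(const std::string & output) {
    parsed_msg msg;
    const std::string close_tag = "</think>";
    const size_t pos = output.find(close_tag);
    if (pos == std::string::npos) {
        msg.reasoning_content = output; // generation ended mid-reasoning
    } else {
        msg.reasoning_content = output.substr(0, pos);
        msg.content = output.substr(pos + close_tag.size());
    }
    return msg;
}

// parse_forced_open("Reasoning trace</think>Final answer") yields
// reasoning_content = "Reasoning trace" and content = "Final answer",
// matching the parsed message in the diff above.
```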

@tommarques56

Hi @ServeurpersoCom, I can run an automated high-severity-only LLM review on this PR and post a single focused inline comment. Reply with "approve" or add a comment saying "@tommarques56 approve" to proceed.

@ServeurpersoCom
Collaborator Author

> Hi @ServeurpersoCom, I can run an automated high-severity-only LLM review on this PR and post a single focused inline comment. Reply with "approve" or add a comment saying "@tommarques56 approve" to proceed.

@tommarques56 approve

Hey tommarques56, I really like what you’re doing with these automated LLM reviews! That’s a great idea!
Go ahead and run it, and if it finds anything, I’ll definitely take it into account and fix it accordingly.

@ggerganov
Member

@ServeurpersoCom I just blocked this user for spamming. This is not a good way to run such experiments because it introduces a lot of noise into the discussions.

@ServeurpersoCom
Collaborator Author

> @ServeurpersoCom I just blocked this user for spamming. This is not a good way to run such experiments because it introduces a lot of noise into the discussions.

Ah, that explains the false XSS detection from the bot earlier! Perfect, thanks for clarifying!

ServeurpersoCom and others added 3 commits October 6, 2025 20:15
…mplates and handle forced-open reasoning blocks

- Detect trailing <think> tags in generic chat templates, trim whitespace, and either append
  the closing tag or mark the reasoning block as forced-open based on enable_thinking
- Add a regression test covering a fallback template that opens the reasoning block in the
  prompt and verifies prompt differences, forced-open behaviour, and reasoning parsing
- Now compatible with models using the default Jinja chat template, such as
  https://huggingface.co/unsloth/GLM-Z1-32B-0414-GGUF
…t through common_chat_params for consistent <think> handling

- Add a supports_enable_thinking field to common_chat_params, populate it during template rendering,
  and reuse it when deciding whether the generic <think> fallback should run
- Update common_chat_templates_support_enable_thinking to consult the tracked capability and expand
  the chat template tests to assert the flag for templates that do and do not react to enable_thinking
- Update chat template tests to assert the guarded fallback behaviour and to cover templates that
  conditionally open <think> blocks.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
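
A rough sketch of the capability tracking described in the second commit; only common_chat_params and common_chat_templates_support_enable_thinking come from the commit message, everything else is assumed for illustration:

```cpp
#include <string>

// Sketch of the params structure: the renderer records whether the template
// actually reacted to enable_thinking.
struct common_chat_params_sketch {
    std::string prompt;
    bool        supports_enable_thinking = false; // set during rendering
};

// Hypothetical detection: render the template twice, once with
// enable_thinking on and once off; if the two prompts differ, the template
// honours the flag and the generic <think> fallback is unnecessary.
static bool template_reacts_to_enable_thinking(const std::string & prompt_on,
                                               const std::string & prompt_off) {
    return prompt_on != prompt_off;
}
```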
@ServeurpersoCom ServeurpersoCom force-pushed the generic-jinja-think-fallback branch from 756e6ec to 6041e25 on October 6, 2025 18:15