
Conversation

@bandoti (Collaborator) commented Oct 16, 2025

This change adds a "partial formatter" that processes partially collected messages (similar to the server's streaming logic) in order to render reasoning content before the EOG token arrives.

In addition, the chat_add_and_format lambda has been converted to a functor, which now calls common_chat_templates_apply directly to allow for more robust template-application options.

Logic has been put in place to suppress the system/prompt tags to clean up output.
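
For reference, the core mechanism is roughly the following. This is a minimal sketch only: parse_partial here is a stand-in that splits "<think>...</think>" blocks, whereas the actual PR goes through the common chat parsing, and the type names are illustrative rather than the ones used in the code.

// Sketch of the partial-formatter idea, not the actual PR code.
#include <string>
#include <vector>

enum span_kind { REASONING, CONTENT };

struct span {
    std::string text;
    span_kind   kind;
};

struct parsed_msg {
    std::string reasoning;
    std::string content;
};

// Stand-in parser: treats text between "<think>" and "</think>" as reasoning.
static parsed_msg parse_partial(const std::string & accumulated) {
    parsed_msg msg;
    const std::string open  = "<think>";
    const std::string close = "</think>";

    const size_t start = accumulated.find(open);
    if (start == std::string::npos) {
        msg.content = accumulated;
        return msg;
    }
    msg.content = accumulated.substr(0, start);

    const size_t body = start + open.size();
    const size_t end  = accumulated.find(close, body);
    if (end == std::string::npos) {
        msg.reasoning = accumulated.substr(body); // reasoning still streaming
    } else {
        msg.reasoning = accumulated.substr(body, end - body);
        msg.content  += accumulated.substr(end + close.size());
    }
    return msg;
}

class partial_formatter {
public:
    // Feed the next streamed piece; return only the spans that are new
    // since the previous call, so they can be printed immediately.
    std::vector<span> operator()(const std::string & piece) {
        accumulated += piece;
        const parsed_msg cur = parse_partial(accumulated);

        std::vector<span> out;
        if (cur.reasoning.size() > prev.reasoning.size()) {
            out.push_back({cur.reasoning.substr(prev.reasoning.size()), REASONING});
        }
        if (cur.content.size() > prev.content.size()) {
            out.push_back({cur.content.substr(prev.content.size()), CONTENT});
        }
        prev = cur;
        return out;
    }

private:
    std::string accumulated;
    parsed_msg  prev;
};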

Example output:

./build/bin/llama-cli.exe -m ./models/gpt-oss-20b-mxfp4.gguf -c 2048 -sys "you are a wizard" -p "please recite me a haiku about llamas" --jinja
[screenshot: example llama-cli output for the command above]

@bandoti bandoti requested a review from ggerganov as a code owner October 16, 2025 01:32
@bandoti bandoti requested review from CISC and ggerganov and removed request for ggerganov October 16, 2025 01:32
@bandoti (Collaborator, Author) commented Oct 16, 2025

I just updated to clean up the system/prompt tags (see description changes), but I will await feedback before changing anything else! 😊

One thing I was contemplating was splitting the display block into a separate abstraction. Since more state was added here, the display could become its own type; it might be a good time to do refactors like this and encapsulate functionality incrementally.

@bandoti (Collaborator, Author) commented Oct 17, 2025

Ack, I found an issue with the logic here. When part of the template string matches the "content" part, it falsely matches. For example, "You are a wizard" will match against "You are ChatGPT" when the template below is applied. So I think it has to match the surrounding tokens exactly first.

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a wizard

<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant

@bandoti (Collaborator, Author) commented Oct 19, 2025

After testing a bit I found it was not reliable to get the system prompt tokens back exactly in all cases, so I opted to simply print the system prompt and user prompt content before jumping into the loop.
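
In rough terms the fallback looks like this (a sketch only; the variable names are illustrative and the actual change goes through the normal llama-cli output path):

// Sketch of the fallback: echo the raw prompt strings up front instead of
// trying to recover them from the templated/tokenized stream.
#include <cstdio>
#include <string>

static void print_initial_prompts(const std::string & system_prompt,
                                  const std::string & user_prompt) {
    if (!system_prompt.empty()) {
        printf("system\n\n%s\n\n", system_prompt.c_str());
    }
    if (!user_prompt.empty()) {
        printf("user\n\n%s\n\n", user_prompt.c_str());
    }
    // ...then enter the usual generation loop, which only prints the
    // assistant's streamed output.
}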

@CISC (Collaborator) left a comment

LGTM, but could be improved.

    }

private:
    common_chat_syntax syntax_;
Collaborator

This postfixed _ notation is unfamiliar and also not used anywhere else in this repo.

Collaborator Author

Sure, no problem. I am going to go through and double-check that things fit the code standards, as it's been a bit since my last contribution.

Comment on lines +106 to +116
if (!diff.reasoning_content_delta.empty()) {
    result.push_back({diff.reasoning_content_delta, REASONING});
    had_reasoning_ = true;
}
if (!diff.content_delta.empty()) {
    if (had_reasoning_) {
        result.push_back({"\n", REASONING});
        had_reasoning_ = false;
    }
    result.push_back({diff.content_delta, CONTENT});
}
Collaborator

Since the thinking tags are eaten, it becomes really hard to separate thinking from the rest.

Would it be an idea to highlight thinking in another color? That would require some additional logging API to check the color status and/or to log with g_col.
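
For illustration, something along these lines could work (raw ANSI escapes here just to show the idea; an actual patch would hook into the existing console/log color handling instead):

// Illustration only: print reasoning dimmed, regular content as-is.
#include <cstdio>
#include <string>

static void print_span(const std::string & text, bool is_reasoning, bool color_enabled) {
    if (is_reasoning && color_enabled) {
        printf("\033[90m%s\033[0m", text.c_str()); // dim gray for thinking
    } else {
        printf("%s", text.c_str());
    }
    fflush(stdout);
}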

Collaborator Author

Okay sounds good! What do you think about adding something like "Thinking..." when the reasoning starts as well?

@MaggotHATE (Contributor)

llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios. It is better to keep all special tags visible for testing/debugging purposes. In case of reasoning, it should be visibly separated from the rest of the answer, as @CISC has suggested - it's hard to tell where the reasoning is in the example screenshot you've posted.

@CISC (Collaborator) commented Oct 19, 2025

It is better to keep all special tags visible for testing/debugging purposes.

Keeping the tags would be hard. I don't think it's much of an issue as long as we have visual separation; the main improvement here is enabling --reasoning-budget.

@MaggotHATE (Contributor) commented Oct 19, 2025

Keeping the tags would be hard. I don't think it's much of an issue as long as we have visual separation; the main improvement here is enabling --reasoning-budget.

If that's intended with jinja, then it's fine, but I would still suggest improving it in the future. As long as LLMs can still hallucinate and have mismatched templates, it's always better to double-check.

@bandoti (Collaborator, Author) commented Oct 19, 2025

llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios.

@MaggotHATE Any chance you could provide an example of the intended testing scenario? Testing is of course a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.

Side note: after getting this reasoning in, I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially once MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, as that is the realm of a scripting language.

What "take two" will have is: (1) only a single toolcall.cpp/h inside the llama-cli project; (2) only support toolcalls via the stdio transport (because there are nice local nodejs proxies and so-forth).

This will add nice testability to the toolcalls.

@MaggotHATE (Contributor)

Any chance you could provide an example of the intended testing scenario? Testing is of course a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.

Any long, continuous dialog with a model would provide a good understanding of whether it works correctly and generates all the required special tokens; this is especially important with different sampling combinations and settings. For example, old Magistral used to have problems with its thinking tags, which should be fixed in 2509 (I have only tested it briefly, as the model works better without reasoning). Moreover, the idea of "hybrid" reasoning is still in the air, which makes differentiating and outlining the reasoning portions of generated text even more important.

I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).

Side note: after getting this reasoning in, I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially once MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, as that is the realm of a scripting language.

If I understood you correctly, I would advise against introducing any network-related features into llama-cli and suggest making a separate tool instead. As of right now, it is fully private, with no way to connect to a network, which is a guarantee. Changing that would make llama-cli potentially less secure/private. Ah yes, that was already changed with the remote downloading of models. Alas.
