Conversation

@aviallon
Contributor

@aviallon aviallon commented Dec 4, 2025

This adds support for limiting the maximum reasoning length by counting how many tokens have been output since entering reasoning mode, then appending a closing sentence plus the reasoning end token.

The closing sentence can be set in the GGUF metadata, on the command line, or overridden per-request.
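For illustration, here is a minimal sketch of the counting logic with stand-in names (reasoning_limiter and its fields are illustrative, not the actual server structures):

#include <string>

// Illustrative stand-in; the real implementation lives in the server slot/task
// handling and uses the tag names from the chat parser.
struct reasoning_limiter {
    int  budget          = 2048; // max reasoning tokens (<= 0 disables the limit)
    int  used            = 0;    // tokens emitted since the opening tag
    bool inside_thinking = false;

    // Called once per generated piece; returns the text to force-emit in order
    // to close the reasoning block, or an empty string to continue normally.
    std::string on_piece(const std::string & piece,
                         const std::string & open_tag,          // e.g. "<think>"
                         const std::string & close_tag,         // e.g. "</think>"
                         const std::string & closing_sentence) {
        if (!inside_thinking) {
            inside_thinking = piece.find(open_tag) != std::string::npos;
            return "";
        }
        if (piece.find(close_tag) != std::string::npos) {
            inside_thinking = false; // the model closed the block on its own
            return "";
        }
        if (budget > 0 && ++used >= budget) {
            inside_thinking = false;
            return closing_sentence + close_tag;
        }
        return "";
    }
};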

Collaborator

@ngxson ngxson left a comment

I'm wondering if we can implement this as some kind of custom sampler instead (via llama_sampler_i). We can simply force the token at position reasoning_budget to always be the closing tag.

Edit: the downside is that this approach won't play well with backend sampling (which is coming in the near future)
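Roughly, the idea in sketch form (simplified stand-in types rather than the real llama_sampler_i callbacks and llama_token_data_array):

#include <cstdint>
#include <limits>
#include <vector>

using token_id = int32_t;

struct token_candidate {
    token_id id;
    float    logit;
};

// Stand-in for a llama_sampler_i-style sampler: accept() counts generated
// tokens, apply() forces the closing-tag token once the budget is reached.
struct budget_close_sampler {
    token_id close_tag_token;  // assumes the closing tag is a single token
    int      reasoning_budget;
    int      n_generated = 0;

    void accept(token_id /*tok*/) { n_generated++; }

    void apply(std::vector<token_candidate> & candidates) {
        if (n_generated < reasoning_budget) {
            return;
        }
        // mask out everything except the closing tag so it is always sampled
        for (auto & c : candidates) {
            if (c.id != close_tag_token) {
                c.logit = -std::numeric_limits<float>::infinity();
            }
        }
    }
};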

Comment on lines 1175 to 1176
size_t start_pos = slot.generated_text.rfind(start_tag);
size_t end_pos = slot.generated_text.rfind(end_tag);
Collaborator

IMO this can decrease performance quite a lot, especially on a long generated_text. I wonder if we should only support this feature for models whose reasoning tag is a single token.

Contributor Author

@aviallon aviallon Dec 4, 2025

Indeed.
In my opinion, we should embed information about the thinking tags somewhere in the GGUF. That would remove the need for the dirty hardcoding I did.
Also, the rfind could be limited to the closing token's length.
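For example, something along these lines (a sketch; tail_contains and its parameters are illustrative):

#include <string>

// Scan only the tail of the generated text for the closing tag, so the cost
// per token is proportional to the tag length rather than the full text.
static bool tail_contains(const std::string & generated_text,
                          const std::string & end_tag,
                          size_t n_new_chars /* length of the piece just appended */) {
    // the tag may straddle the previous piece and the new one,
    // so look back a little further than the new characters alone
    const size_t window = n_new_chars + end_tag.size();
    const size_t start  = generated_text.size() > window ? generated_text.size() - window : 0;
    return generated_text.find(end_tag, start) != std::string::npos;
}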

Contributor Author

@ngxson Could we guess the think tag by analyzing the chat template perhaps?
Like, trying to render a message with reasoning so that we can extract what ought to be the thinking tags?

Collaborator

There is no reliable way. Each model handles thinking content slightly differently from the others, so it's impossible to tell.

The better solution is to simply hard-code the tag, as we already do in chat-parser.cpp.

Contributor Author

Then I think allowing GGUF to hold metadata about that would be a good thing. It could even be reported.

std::string oaicompat_model;
std::string oaicompat_cmpl_id;
common_chat_syntax oaicompat_chat_syntax;
std::optional<int32_t> reasoning_budget_override;
Contributor Author

I think I should simply save the reasoning_budget here, and apply the default in server-task.cpp.

// Track reasoning tokens when we're inside thinking blocks (<think>...</think> or similar)
// When the budget is exceeded we enqueue the closing tag tokens so they get sent to the client
// and fed back into the model before continuing normal generation
if (slot.has_next_token && reasoning_budget > 0) {
Contributor Author

I believe this could be improved by using a state machine, with the following states:

stateDiagram-v2
    [*] --> INIT
    INIT --> THINKING: reasoning forced open
    INIT --> THINKING: open tag found, tag to be guessed
    THINKING --> REGULAR_CONTENT: close tag found, we know the matching tag

That should make things quite a bit faster.
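In code, the transitions above could look roughly like this (a sketch; the tag-detection booleans stand for whatever matching the server ends up using):

enum class reasoning_state { INIT, THINKING, REGULAR_CONTENT };

// Mirrors the diagram above: one transition per generated piece.
static reasoning_state advance(reasoning_state st,
                               bool reasoning_forced_open,
                               bool open_tag_found,
                               bool close_tag_found) {
    switch (st) {
        case reasoning_state::INIT:
            return (reasoning_forced_open || open_tag_found) ? reasoning_state::THINKING : st;
        case reasoning_state::THINKING:
            // once the open tag is known, only its matching close tag is searched for
            return close_tag_found ? reasoning_state::REGULAR_CONTENT : st;
        case reasoning_state::REGULAR_CONTENT:
            return st;
    }
    return st;
}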

Collaborator

Just to mention that the grammar sampling also has this kind of state machine under the hood.

Contributor Author

Oh. Does it have any internal API I could hook into?

Collaborator

@aviallon I think the best place to look is llama-grammar.cpp's llama_grammar_accept_impl (that's the accept method for the sampler; it accepts tokens based on matching the grammar and checks the grammar's internal state).

@pwilkin
Collaborator

pwilkin commented Dec 4, 2025

I've been thinking about this and I think the best way would be to delegate to the parser to inform the server when thinking has been opened. The parser already has a similar mechanic for lazy grammar triggers.

Also, note that there is a model that natively supports a thinking budget: Seed OSS (you can look at its template here in the code to get an idea of how that works). It might not be a bad idea to actually implement this in a similar way, as just abruptly ending the reasoning process could greatly reduce the model's performance. What I mean is: insert into the model's thinking stream a string like '*** Note: I have to finish thinking in X tokens, I've already used up Y tokens. ***' - not sure how well that would work though, just an experimental idea.

@aviallon
Contributor Author

aviallon commented Dec 4, 2025

What I mean is: insert into the model's thinking stream a string like '*** Note: I have to finish thinking in X tokens, I've already used up Y tokens. ***' - not sure how well that would work though, just an experimental idea.

I've actually thought about that.
@ngxson is it feasible to have "ghost" tokens be present in the model's context, but without actually permanently saving them in the KV-cache?

@ngxson
Collaborator

ngxson commented Dec 4, 2025

I don't get it, why would you want to do that?

@pwilkin
Collaborator

pwilkin commented Dec 4, 2025

I don't get it, why would you want to do that?

Look at the Seed-OSS template - basically the idea is that the model won't really respond coherently if its reasoning is abruptly cut off, so you want the model to "soft-limit" itself earlier. Seed-OSS does that natively since it's trained to do that - if it has a reasoning budget set, it will actually stop itself every once in a while to evaluate its reasoning budget.

@ngxson
Collaborator

ngxson commented Dec 4, 2025

I'm not sure if I understand it correctly - is that something like Qwen3, where a prompt must be added before closing the thinking tag?

In any case, I think what we're discussing is pretty close to the grammar sampling system. If we need to insert such a prompt at the end of the thinking, we just need to force-sample these exact tokens.

@aviallon
Contributor Author

aviallon commented Dec 4, 2025

I'm not sure if I understand it correctly - is that something like Qwen3, where a prompt must be added before closing the thinking tag?

In any case, I think what we're discussing is pretty close to the grammar sampling system. If we need to insert such a prompt at the end of the thinking, we just need to force-sample these exact tokens.

In any case, it looks like we need to be able to configure the stop phrase per-model, with a generic one as a fallback.

@aldehir
Collaborator

aldehir commented Dec 4, 2025

Just throwing ideas out. With the introduction of the PEG parser, I can imagine a limit_tokens(parser, inputs.reasoning_budget, "I need to wrap it up") parser that consumes tokens and accumulates a token count. Once the budget is hit, it would force the message. This won't translate cleanly to a GBNF grammar, but it could be turned into a sampler that force-samples when the condition is met. It also solves the problem of knowing when we're reasoning. A lot of handwaving there, but it seems doable.
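In sketch form (everything below is hypothetical; no limit_tokens helper exists in the codebase), such a wrapper might look like:

#include <functional>
#include <optional>
#include <string>

// Hypothetical wrapper: feed() passes each piece to the wrapped parser step and
// counts tokens; once the budget is hit it returns the message that the caller
// should force-sample next (e.g. "I need to wrap it up").
struct limit_tokens {
    int         budget;
    std::string forced_message;
    std::function<void(const std::string &)> inner; // the wrapped parser step
    int         n_consumed = 0;

    std::optional<std::string> feed(const std::string & piece) {
        inner(piece);
        if (++n_consumed == budget) {
            return forced_message;
        }
        return std::nullopt;
    }
};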

@pwilkin
Collaborator

pwilkin commented Dec 4, 2025

I'm not sure if I understand it correctly - is that something like Qwen3, where a prompt must be added before closing the thinking tag?

Yup, but also earlier.

Seed does something like this:

<seed:think>I am thinking now, thinking, thinking for a long time...
<seed:cot_budget_reflect>I've been thinking for 630 tokens, with 2370 thinking tokens left</seed:cot_budget_reflect>
...

Inserting stuff like that during the thinking process could let the model finish its thinking while keeping to the budget constraints.

If we have a "force sampling after X thinking tokens" mechanism, then there shouldn't be any overhead in adding such an option, and we could even run a benchmark on IFEval or something to check which option (abrupt end of thinking, Qwen-like single end-of-thinking phrase, Seed-like mid-thinking budget reflections) performs best.
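For illustration, the reflection string itself is cheap to build; the tag is the one from the Seed-OSS example above and the wording is just a guess:

#include <string>

// Build a Seed-OSS-style budget reflection to be force-sampled into the
// thinking stream every N tokens.
static std::string budget_reflection(int used, int budget) {
    return "<seed:cot_budget_reflect>I've been thinking for " + std::to_string(used) +
           " tokens, with " + std::to_string(budget - used) +
           " thinking tokens left</seed:cot_budget_reflect>";
}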

@ngxson
Collaborator

ngxson commented Dec 4, 2025

<seed:think>I am thinking now, thinking, thinking for a long time...
<seed:cot_budget_reflect>I've been thinking for 630 tokens, with 2370 thinking tokens left</seed:cot_budget_reflect>
...

I'm just quite curious whether these "guiding" tokens are counted toward the final budget (I hope not, otherwise it will be quite complicated), but if so, I would be interested to see how they do it in their original implementation (I assume it's written in Python).

Generic token counting should be trivial to add to grammar sampling though, probably by simply extending this syntax: .{1000} (which counts 1000 characters) into (?<tokens>){1000} (which counts 1000 tokens). The same should be possible for PEG, I think.
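As a purely illustrative comparison (the (?<tokens>) syntax is only a proposal, and these rules are schematic, not tested grammars):

// Today: character-based bound on the thinking block.
const char * grammar_chars  = R"(root ::= "<think>" .{1000} "</think>" .*)";

// Proposed: the same rule, but bounded by 1000 tokens instead of characters.
const char * grammar_tokens = R"(root ::= "<think>" (?<tokens>){1000} "</think>" .*)";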

@aldehir
Collaborator

aldehir commented Dec 4, 2025

Generic token counting should be trivial to add to grammar sampling though, probably by simply extending this syntax: .{1000} (which counts 1000 characters) into (?<tokens>){1000} (which counts 1000 tokens). The same should be possible for PEG, I think.

Yes, that is pretty simple! Making the grammar token-aware is something we can benefit from as well. Right now it operates at the character level (as far as I know, from my high-level understanding).

@aviallon
Contributor Author

aviallon commented Dec 4, 2025

If you want me to implement something better (like the ideas around introducing changes to the grammar), I'd gladly do it, but I'd probably need some guidance to get there.

In any case, I updated the PR to implement the state machine idea and simplify the reasoning_budget handling.

@pwilkin
Collaborator

pwilkin commented Dec 5, 2025

@ngxson nah, their model is trained on this. You pass it a thinking budget, it outputs a neat little table of when the model should "reflect", and the model just outputs those "reflection" tokens around those parts. It's not fully deterministic - the model isn't always fully correct about the number of tokens used. But I thought it would be an interesting inspiration for how to handle a "limited reasoning budget" in a forceful way.

@aviallon aviallon marked this pull request as ready for review December 5, 2025 10:36
@aviallon aviallon requested a review from ggerganov as a code owner December 5, 2025 10:36
@aviallon
Contributor Author

aviallon commented Dec 5, 2025

@ngxson it should be much more sane now

@github-actions github-actions bot added the testing Everything test related label Dec 5, 2025
@github-actions github-actions bot added the python python script changes label Dec 5, 2025
@ngxson
Collaborator

ngxson commented Dec 5, 2025

Honestly, I feel like grammar-based reasoning control would be much more useful and powerful. Its benefits over the current PR are:

  • Cleaner approach - less server-specific code
  • More flexible - can inject mid-thinking "marks" for models like Seed-OSS, as explained earlier
  • More future-proof - if we somehow implement multi-threaded sampling in the future, performance will improve with zero changes to the sampling implementation

So I still prefer to wait for the implementation of @aldehir's idea.

@aviallon
Contributor Author

aviallon commented Dec 5, 2025

@aldehir could you help me implement your idea? I'm not sure how I should start.

@aldehir
Collaborator

aldehir commented Dec 5, 2025

@aviallon I have a working grammar example that approximates the token count. I'll share it soon; I need to get some rest first.

I also spent some time adding token support to llama-grammar. I need to iron out a few kinks, but I'm aiming to submit a PR this weekend. This should allow for accurate token counting.

@ggerganov
Member

ggerganov commented Dec 6, 2025

Edit: the downside is that this approach won't play well with backend sampling (which is coming in the near future)

One thing to keep in mind is that currently, to evaluate the grammar, we do a "shortcut" during sampling:

  • First sample the token without using the grammar
  • Check if the token fits the grammar (i.e. it's not disallowed)
  • If it does not fit, then we resample, this time starting with the grammar (slow) and sampling only from the tokens that fit

This shortcut is done to make the sampling fast, even though it biases the actual grammar constraints to some extent.
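Schematically (names are illustrative, not the actual sampler internals):

#include <cstdint>
#include <functional>

using token_id = int32_t;

// The shortcut described above: sample without the grammar first, and only pay
// the full grammar-constrained sampling cost when that token is rejected.
static token_id sample_with_shortcut(const std::function<token_id()>     & sample_unconstrained,
                                     const std::function<bool(token_id)> & grammar_allows,
                                     const std::function<token_id()>     & sample_constrained) {
    const token_id tok = sample_unconstrained(); // 1. sample without the grammar
    if (grammar_allows(tok)) {                   // 2. check the token against the grammar
        return tok;
    }
    return sample_constrained();                 // 3. slow path: resample with the grammar applied
}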

With #17004 I think we would need to remove this shortcut because it does not play well with the new backend sampling support. Already working on this and should push it soon.

The end result will be that whenever a grammar is involved:

  • it will be evaluated first, before any other samplers
  • it will force all following samplers to run on the CPU (unless someday we figure out how to run the grammar in a ggml graph - unlikely)
  • as a result, there could be a non-negligible performance impact - not sure.
