server: handle limiting maximum reasoning budget #17750
base: master
Conversation
This overrides the reasoning_budget per-request.
I'm wondering if we can implement this as some kind of custom sampler instead (via llama_sampler_i). We can simply force the token at position reasoning_budget to always be the closing tag.
Edit: the downside is that this approach won't play well with backend sampling (which is coming in the near future).
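A rough sketch of what such a custom sampler could look like, assuming the `llama_sampler_i` / `llama_sampler` layout from llama.h (an `accept` callback that counts tokens and an `apply` callback that masks logits); the context struct and the single-token `close_token` are illustrative, not part of this PR:

```cpp
// Sketch only: force the closing tag once the reasoning budget is reached.
// Assumes the closing tag is a single token (close_token is illustrative).
#include "llama.h"
#include <cmath>

struct budget_sampler_ctx {
    llama_token close_token; // token id of e.g. "</think>"
    int32_t     budget;      // maximum number of reasoning tokens
    int32_t     n_accepted;  // tokens accepted so far
};

static void budget_accept(struct llama_sampler * smpl, llama_token /*token*/) {
    auto * sctx = (budget_sampler_ctx *) smpl->ctx;
    sctx->n_accepted++;
}

static void budget_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * sctx = (budget_sampler_ctx *) smpl->ctx;
    if (sctx->n_accepted < sctx->budget) {
        return; // still within budget: leave the candidate logits untouched
    }
    // budget exhausted: mask every candidate except the closing tag token
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].id != sctx->close_token) {
            cur_p->data[i].logit = -INFINITY;
        }
    }
}
```

These two callbacks would then be wired into a `llama_sampler_i` instance and chained before the regular samplers; resetting `n_accepted` when leaving the thinking block is left out for brevity.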
tools/server/server-context.cpp
Outdated
```cpp
size_t start_pos = slot.generated_text.rfind(start_tag);
size_t end_pos   = slot.generated_text.rfind(end_tag);
```
IMO this can decrease performance quite a lot, especially on a long generated_text. Wondering if we should only support this feature for models whose reasoning tag is a single token.
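For reference, a minimal sketch of such a check, assuming the `common_tokenize` helper from common.h is available in the server code; the tag string is just an example:

```cpp
// Sketch: only enable budget enforcement when the closing tag maps to exactly
// one token in the model's vocabulary.
static bool reasoning_end_tag_is_single_token(llama_context * ctx, const std::string & end_tag) {
    // parse_special = true so e.g. "</think>" resolves to its special token if defined
    const auto toks = common_tokenize(ctx, end_tag, /*add_special=*/false, /*parse_special=*/true);
    return toks.size() == 1;
}
```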
Indeed.
In my opinion, we should embed information about the thinking tags somewhere in the GGUF metadata.
That would remove the need for the dirty hardcoding I did.
Also, the rfind could be limited to the last few tokens (roughly the closing tag's length).
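A minimal sketch of what such a tail-limited search could look like; the window size is an assumption, chosen to comfortably cover a closing tag:

```cpp
// Sketch: search only the last `window` characters of the generated text so
// the cost no longer grows with the full generation length.
static size_t rfind_in_tail(const std::string & text, const std::string & tag, size_t window = 64) {
    const size_t from = text.size() > window ? text.size() - window : 0;
    return text.find(tag, from); // std::string::npos if the tag is not in the tail
}
```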
@ngxson Could we guess the think tag by analyzing the chat template perhaps?
Like, trying to render a message with reasoning so that we can extract what ought to be the thinking tags?
There is no reliable way. Each model handles thinking content slightly differently from the others, so it's impossible to tell.
The better solution is to simply hard-code the tag, as we already do in chat-parser.cpp.
Then I think allowing GGUF to hold metadata about that would be a good thing. It could even be reported.
tools/server/server-task.h
Outdated
```cpp
std::string oaicompat_model;
std::string oaicompat_cmpl_id;
common_chat_syntax oaicompat_chat_syntax;
std::optional<int32_t> reasoning_budget_override;
```
I think I should simply save the reasoning_budget here, and apply the default in server-task.cpp.
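Assuming the field from the diff above and a server-wide default (the `params.reasoning_budget` name is assumed here), the resolution in server-task.cpp could be as simple as:

```cpp
// Sketch: the per-request override wins, otherwise fall back to the server default.
const int32_t reasoning_budget =
    task.reasoning_budget_override.value_or(params.reasoning_budget);
```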
tools/server/server-context.cpp
Outdated
```cpp
// Track reasoning tokens when we're inside thinking blocks (<think>...</think> or similar)
// When the budget is exceeded we enqueue the closing tag tokens so they get sent to the client
// and fed back into the model before continuing normal generation
if (slot.has_next_token && reasoning_budget > 0) {
```
This could be improved by using a state machine I believe, with states:
```mermaid
stateDiagram-v2
    [*] --> INIT
    INIT --> THINKING: reasoning forced open
    INIT --> THINKING: open tag found, tag to be guessed
    THINKING --> REGULAR_CONTENT: close tag found, we know the matching tag
```
Which should make things quite a bit faster.
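A minimal sketch of that state machine, with tag detection simplified to suffix checks and the default tags assumed to be `<think>`/`</think>`:

```cpp
// Sketch of the INIT -> THINKING -> REGULAR_CONTENT machine from the diagram above.
#include <cstdint>
#include <string>

enum class reasoning_state { INIT, THINKING, REGULAR_CONTENT };

struct reasoning_tracker {
    reasoning_state state     = reasoning_state::INIT;
    int32_t         n_tokens  = 0;           // tokens spent inside the thinking block
    std::string     open_tag  = "<think>";   // illustrative defaults
    std::string     close_tag = "</think>";

    // called once per generated token with the accumulated text
    void on_token(const std::string & text, bool reasoning_forced_open) {
        switch (state) {
            case reasoning_state::INIT:
                if (reasoning_forced_open || ends_with(text, open_tag)) {
                    state = reasoning_state::THINKING;
                }
                break;
            case reasoning_state::THINKING:
                n_tokens++;
                if (ends_with(text, close_tag)) {
                    state = reasoning_state::REGULAR_CONTENT;
                }
                break;
            case reasoning_state::REGULAR_CONTENT:
                break;
        }
    }

    static bool ends_with(const std::string & s, const std::string & suf) {
        return s.size() >= suf.size() &&
               s.compare(s.size() - suf.size(), suf.size(), suf) == 0;
    }
};
```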
Just to mention that grammar sampling also has this kind of state machine under the hood.
Oh. Does it have any internal API I could hook into?
@aviallon I think the best place to look is llama-grammar.cpp's llama_grammar_accept_impl (that's the accept method for the sampler, which accepts tokens based on matching the grammar and checks the grammar's internal state).
I've been thinking about this and I think the best way would be to delegate to the parser to inform the server when thinking has been opened. The parser already has a similar mechanic for lazy grammar triggers. Also, note that there is a model that natively supports a thinking budget: Seed OSS (you can look at its template here in the code to get an idea how that works). It might not be a bad idea to actually implement this in a similar way, as just abruptly ending the reasoning process could greatly reduce the model's performance. What I mean is: insert into the model's thinking stream a string like '*** Note: I have to finish thinking in X tokens, I've already used up Y tokens. ***' - not sure how well that would work though, just an experimental idea.
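For illustration only, a tiny helper for building that kind of mid-thinking note; the wording is made up, since the idea above is explicitly experimental:

```cpp
// Sketch: format a budget-reflection note to inject into the thinking stream.
#include <cstdint>
#include <string>

static std::string budget_note(int32_t used, int32_t budget) {
    return "*** Note: I have used " + std::to_string(used) + " of my " +
           std::to_string(budget) + " thinking tokens and must finish within the remaining " +
           std::to_string(budget - used) + ". ***";
}
```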
I've actually thought about that.
I don't get it, why would you want to do that?
Look at the Seed-OSS template - basically the idea is that the model won't really respond coherently if its reasoning is abruptly cut off, so you want the model to "soft-limit" itself earlier. Seed-OSS does that natively since it's trained to do that - if it has a reasoning budget set, it will actually stop itself every once in a while to evaluate its reasoning budget.
Just throwing ideas out. With the introduction of the PEG parser, I can imagine a
Yup, but also earlier. Seed does something like this: inserting stuff like that during the thinking process would possibly let the model finish thinking while keeping within budget constraints. If we have the mechanism of "force sampling after X thinking tokens", then there shouldn't be any overhead in adding such an option, and we could even run a benchmark on IFEval or something to check which option (abrupt end of thinking, Qwen-like single thinking end-phrase, Seed-like mid-thinking budget reflections) performs best.
I'm just quite curious whether these "guiding" tokens are counted toward the final budget (I hope not, otherwise it will be quite complicated), but if yes, I would be interested to see how they do it in their original implementation (I assume written in Python). Generic token counting should be trivial to add to grammar sampling though, probably by simply extending this syntax:
Yes, that is pretty simple! Making the grammar token-aware is something we can benefit from as well. Right now it operates at the character level (as far as I know, with my high-level understanding).
If you want me to implement something better (like the ideas around introducing changes to the grammar), I'd gladly do it, but I'd probably need some guidance to get there. In any case, I updated the PR to implement the state machine idea and simplify the
Can be set in:
- GGUF metadata
- CLI
- Request params
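As a sketch of the GGUF path in the list above, the closing sentence could be read through the existing `llama_model_meta_val_str` API; the metadata key used here is hypothetical, not an agreed-upon spec:

```cpp
// Sketch: read the closing sentence from GGUF metadata, falling back to the
// CLI / request value. "general.reasoning_close_text" is a hypothetical key.
#include "llama.h"
#include <string>

static std::string get_reasoning_close_text(const llama_model * model, const std::string & fallback) {
    char buf[512];
    if (llama_model_meta_val_str(model, "general.reasoning_close_text", buf, sizeof(buf)) >= 0) {
        return std::string(buf);
    }
    return fallback;
}
```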
@ngxson nah, their model is trained on this. You pass it a thinking budget, it outputs a neat little table of when the model should "reflect", and the model just outputs those "reflection" tokens around those parts. It's not fully deterministic; the model isn't always fully correct about the number of tokens used. But I thought it would be an interesting inspiration for how to handle a "limited reasoning budget" in a forceful way.
@ngxson it should be much more sane now
Honestly, I feel like the idea of grammar-based reasoning control will be much more useful and powerful. Its benefits over this current PR are:
So I still prefer to wait for the implementation of @aldehir's idea.
@aldehir could you help me implement your idea? I'm not sure I know how I should start.
@aviallon I have a working grammar example by approximating token count. I'll share it soon, need to get some rest first. Also spent some time adding token support to llama-grammar. Need to iron out a few kinks, but I'm aiming to submit a PR this weekend. This should allow for accurate token counting.
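Not the example referenced above, but to make the idea concrete: a GBNF fragment (embedded as a C++ string) that approximates a ~512-token budget by bounding the reasoning section to ~2048 characters, assuming roughly 4 characters per token and assuming GBNF's bounded repetition `{m,n}` and any-character `.` syntax. The tags and numbers are illustrative:

```cpp
// Sketch: approximate the reasoning budget at the character level via GBNF.
// Note: in this simplification the reasoning section may not contain '<'.
static const char * REASONING_BUDGET_GRAMMAR = R"(
root      ::= "<think>" reasoning "</think>" answer
reasoning ::= [^<]{0,2048}
answer    ::= .*
)";
```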
One thing to keep in mind is that currently, to evaluate the grammar, we do a "shortcut" during sampling:
This shortcut is done to make sampling fast. With #17004, I think we will need to remove this shortcut because it does not play well with the new backend sampling support. I'm already working on this and should push it soon. The end result will be that whenever we have a grammar involved:


This enables support for limiting the maximum reasoning budget by counting how many tokens have been output since entering reasoning mode, and then appending a closing sentence plus the reasoning end token.
The closing sentence can be set in the GGUF metadata, on the command line, or overridden per-request.
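A minimal sketch of that last step, assuming the `common_tokenize` helper from common.h is available; names are illustrative:

```cpp
// Sketch: when the budget is hit, tokenize the closing sentence plus the
// reasoning end tag so the tokens can be streamed to the client and fed back
// into the model.
#include <string>
#include <vector>

static std::vector<llama_token> make_forced_close(llama_context * ctx,
                                                  const std::string & closing_sentence,
                                                  const std::string & end_tag) {
    // parse_special = true so the end tag maps to its special token if defined
    return common_tokenize(ctx, closing_sentence + end_tag,
                           /*add_special=*/false, /*parse_special=*/true);
}
```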