server: handle limiting maximum reasoning budget #17750
base: master
Conversation
This overrides the reasoning_budget per-request.
I'm wondering if we can implement this as some kind of custom sampler instead (via llama_sampler_i). We can simply force the token at position reasoning_budget to always be the closing tag.
Edit: the downside is that this approach won't play well with backend sampling (which is coming in the near future).
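A rough sketch of what such a custom sampler could look like, assuming the `llama_sampler_i` / `llama_sampler` layout from llama.h (an `accept` callback that counts tokens and an `apply` callback that masks logits); the context struct and the single-token `close_token` are illustrative, not part of this PR:

```cpp
// Sketch only: force the closing tag once the reasoning budget is reached.
// Assumes the closing tag is a single token (close_token is illustrative).
#include "llama.h"
#include <cmath>

struct budget_sampler_ctx {
    llama_token close_token; // token id of e.g. "</think>"
    int32_t     budget;      // maximum number of reasoning tokens
    int32_t     n_accepted;  // tokens accepted so far
};

static void budget_accept(struct llama_sampler * smpl, llama_token /*token*/) {
    auto * sctx = (budget_sampler_ctx *) smpl->ctx;
    sctx->n_accepted++;
}

static void budget_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * sctx = (budget_sampler_ctx *) smpl->ctx;
    if (sctx->n_accepted < sctx->budget) {
        return; // still within budget: leave the candidate logits untouched
    }
    // budget exhausted: mask every candidate except the closing tag token
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].id != sctx->close_token) {
            cur_p->data[i].logit = -INFINITY;
        }
    }
}
```

These two callbacks would then be wired into a `llama_sampler_i` instance and chained before the regular samplers; resetting `n_accepted` when leaving the thinking block is left out for brevity.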
tools/server/server-context.cpp
Outdated
```cpp
size_t start_pos = slot.generated_text.rfind(start_tag);
size_t end_pos   = slot.generated_text.rfind(end_tag);
```
IMO this can decrease performance quite a lot, especially on a long generated_text. Wondering if we should only support this feature for models whose reasoning tag is a single token.
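For reference, a minimal sketch of such a check, assuming the `common_tokenize` helper from common.h is available in the server code; the tag string is just an example:

```cpp
// Sketch: only enable budget enforcement when the closing tag maps to exactly
// one token in the model's vocabulary.
static bool reasoning_end_tag_is_single_token(llama_context * ctx, const std::string & end_tag) {
    // parse_special = true so e.g. "</think>" resolves to its special token if defined
    const auto toks = common_tokenize(ctx, end_tag, /*add_special=*/false, /*parse_special=*/true);
    return toks.size() == 1;
}
```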
Indeed.
In my opinion, we should embed information about the thinking tags somewhere in the GGUF metadata.
That would remove the need for the dirty hardcoding I did.
Also, the rfind could be limited to the last few tokens (roughly the closing tag's length).
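A minimal sketch of what such a tail-limited search could look like; the window size is an assumption, chosen to comfortably cover a closing tag:

```cpp
// Sketch: search only the last `window` characters of the generated text so
// the cost no longer grows with the full generation length.
static size_t rfind_in_tail(const std::string & text, const std::string & tag, size_t window = 64) {
    const size_t from = text.size() > window ? text.size() - window : 0;
    return text.find(tag, from); // std::string::npos if the tag is not in the tail
}
```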
@ngxson Could we guess the think tag by analyzing the chat template perhaps?
Like, trying to render a message with reasoning so that we can extract what ought to be the thinking tags?
There is no reliable way. Each model handles thinking content slightly differently from the others, so it's impossible to tell.
The better solution is to simply hard-code the tag, as we already do in chat-parser.cpp.
Then I think allowing GGUF to hold metadata about that would be a good thing. It could even be reported.
tools/server/server-task.h
Outdated
```cpp
std::string oaicompat_model;
std::string oaicompat_cmpl_id;
common_chat_syntax oaicompat_chat_syntax;
std::optional<int32_t> reasoning_budget_override;
```
I think I should simply save the reasoning_budget here, and apply the default in server-task.cpp.
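Assuming the field from the diff above and a server-wide default (the `params.reasoning_budget` name is assumed here), the resolution in server-task.cpp could be as simple as:

```cpp
// Sketch: the per-request override wins, otherwise fall back to the server default.
const int32_t reasoning_budget =
    task.reasoning_budget_override.value_or(params.reasoning_budget);
```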
tools/server/server-context.cpp
Outdated
```cpp
// Track reasoning tokens when we're inside thinking blocks (<think>...</think> or similar)
// When the budget is exceeded we enqueue the closing tag tokens so they get sent to the client
// and fed back into the model before continuing normal generation
if (slot.has_next_token && reasoning_budget > 0) {
```
This could be improved by using a state machine I believe, with states:
```mermaid
stateDiagram-v2
    [*] --> INIT
    INIT --> THINKING: reasoning forced open
    INIT --> THINKING: open tag found, tag to be guessed
    THINKING --> REGULAR_CONTENT: close tag found, we know the matching tag
```
Which should make things quite a bit faster.
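A minimal sketch of that state machine, with tag detection simplified to suffix checks and the default tags assumed to be `<think>`/`</think>`:

```cpp
// Sketch of the INIT -> THINKING -> REGULAR_CONTENT machine from the diagram above.
#include <cstdint>
#include <string>

enum class reasoning_state { INIT, THINKING, REGULAR_CONTENT };

struct reasoning_tracker {
    reasoning_state state     = reasoning_state::INIT;
    int32_t         n_tokens  = 0;           // tokens spent inside the thinking block
    std::string     open_tag  = "<think>";   // illustrative defaults
    std::string     close_tag = "</think>";

    // called once per generated token with the accumulated text
    void on_token(const std::string & text, bool reasoning_forced_open) {
        switch (state) {
            case reasoning_state::INIT:
                if (reasoning_forced_open || ends_with(text, open_tag)) {
                    state = reasoning_state::THINKING;
                }
                break;
            case reasoning_state::THINKING:
                n_tokens++;
                if (ends_with(text, close_tag)) {
                    state = reasoning_state::REGULAR_CONTENT;
                }
                break;
            case reasoning_state::REGULAR_CONTENT:
                break;
        }
    }

    static bool ends_with(const std::string & s, const std::string & suf) {
        return s.size() >= suf.size() &&
               s.compare(s.size() - suf.size(), suf.size(), suf) == 0;
    }
};
```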
Just to mention that grammar sampling also has this kind of state machine under the hood.
Oh. Does it have any internal API I could hook into?
@aviallon I think the best place to look is llama-grammar.cpp's llama_grammar_accept_impl (that's the accept method for the sampler, which accepts tokens based on matching the grammar and checks the grammar's internal state).
I've been thinking about this and I think the best way would be to delegate to the parser to inform the server when thinking has been opened. The parser already has a similar mechanic for lazy grammar triggers. Also, note that there is a model that natively supports a thinking budget: Seed OSS (you can look at its template here in the code to get an idea how that works). It might not be a bad idea to actually implement this in a similar way, as just abruptly ending the reasoning process could greatly reduce the model's performance. What I mean is: insert into the model's thinking stream a string like '*** Note: I have to finish thinking in X tokens, I've already used up Y tokens. ***' - not sure how well that would work though, just an experimental idea.
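For illustration only, a tiny helper for building that kind of mid-thinking note; the wording is made up, since the idea above is explicitly experimental:

```cpp
// Sketch: format a budget-reflection note to inject into the thinking stream.
#include <cstdint>
#include <string>

static std::string budget_note(int32_t used, int32_t budget) {
    return "*** Note: I have used " + std::to_string(used) + " of my " +
           std::to_string(budget) + " thinking tokens and must finish within the remaining " +
           std::to_string(budget - used) + ". ***";
}
```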
I've actually thought about that.
I don't get it, why would you want to do that?
Look at the Seed-OSS template - basically the idea is that the model won't really respond coherently if its reasoning is abruptly cut off, so you want the model to "soft-limit" itself earlier. Seed-OSS does that natively since it's trained to do that - if it has a reasoning budget set, it will actually stop itself every once in a while to evaluate its reasoning budget.
Just throwing ideas out. With the introduction of the PEG parser, I can imagine a
Yup, but also earlier. Seed does something like this: inserting stuff like that during the thinking process would possibly let the model finish thinking while keeping within budget constraints. If we have the mechanism of "force sampling after X thinking tokens", then there shouldn't be any overhead in adding such an option, and we could even run a benchmark on IFEval or something to check which option (abrupt end of thinking, Qwen-like single thinking end-phrase, Seed-like mid-thinking budget reflections) performs best.
I'm just quite curious whether these "guiding" tokens are counted toward the final budget (I hope not, otherwise it will be quite complicated), but if yes, I would be interested to see how they do it in their original implementation (I assume written in Python). Generic token counting should be trivial to add to grammar sampling though, probably by simply extending this syntax:
Yes, that is pretty simple! Making the grammar token-aware is something we can benefit from as well. Right now it operates at the character level (as far as I know, with my high-level understanding).
If you want me to implement something better (like the ideas around introducing changes to the grammar), I'd gladly do it, but I'd probably need some guidance to get there. In any case, I updated the PR to implement the state machine idea and simplify the
Can be set in:
- GGUF metadata
- CLI
- Request params
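As a sketch of the GGUF path in the list above, the closing sentence could be read through the existing `llama_model_meta_val_str` API; the metadata key used here is hypothetical, not an agreed-upon spec:

```cpp
// Sketch: read the closing sentence from GGUF metadata, falling back to the
// CLI / request value. "general.reasoning_close_text" is a hypothetical key.
#include "llama.h"
#include <string>

static std::string get_reasoning_close_text(const llama_model * model, const std::string & fallback) {
    char buf[512];
    if (llama_model_meta_val_str(model, "general.reasoning_close_text", buf, sizeof(buf)) >= 0) {
        return std::string(buf);
    }
    return fallback;
}
```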
@ngxson nah, their model is trained on this. You pass it a thinking budget, it outputs a neat little table of when the model should "reflect", and the model just outputs those "reflection" tokens around those parts. It's not fully deterministic; the model isn't always fully correct about the number of tokens used. But I thought it would be an interesting inspiration for how to handle a "limited reasoning budget" in a forceful way.
@ngxson it should be much more sane now
Honestly, I feel like the idea of grammar-based reasoning control will be much more useful and powerful. Its benefits over this current PR are:
So I still prefer to wait for the implementation of @aldehir's idea.
@aldehir could you help me implement your idea? I'm not sure I know how I should start.
@aviallon I have a working grammar example by approximating token count. I'll share it soon, need to get some rest first. Also spent some time adding token support to llama-grammar. Need to iron out a few kinks, but I'm aiming to submit a PR this weekend. This should allow for accurate token counting.
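Not the example referenced above, but to make the idea concrete: a GBNF fragment (embedded as a C++ string) that approximates a ~512-token budget by bounding the reasoning section to ~2048 characters, assuming roughly 4 characters per token and assuming GBNF's bounded repetition `{m,n}` and any-character `.` syntax. The tags and numbers are illustrative:

```cpp
// Sketch: approximate the reasoning budget at the character level via GBNF.
// Note: in this simplification the reasoning section may not contain '<'.
static const char * REASONING_BUDGET_GRAMMAR = R"(
root      ::= "<think>" reasoning "</think>" answer
reasoning ::= [^<]{0,2048}
answer    ::= .*
)";
```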
One thing to keep in mind is that currently, to evaluate the grammar, we do a "shortcut" during sampling:
This shortcut is done to make sampling fast. With #17004, I think we will need to remove this shortcut because it does not play well with the new backend sampling support. I'm already working on this and should push it soon. The end result will be that whenever we have a grammar involved:


This enables support for limiting the maximum reasoning budget by counting how many tokens have been output since entering reasoning mode, and then appending a closing sentence plus the reasoning end token.
The closing sentence can be set in the GGUF metadata, on the command line, or overridden per-request.
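A minimal sketch of that last step, assuming the `common_tokenize` helper from common.h is available; names are illustrative:

```cpp
// Sketch: when the budget is hit, tokenize the closing sentence plus the
// reasoning end tag so the tokens can be streamed to the client and fed back
// into the model.
#include <string>
#include <vector>

static std::vector<llama_token> make_forced_close(llama_context * ctx,
                                                  const std::string & closing_sentence,
                                                  const std::string & end_tag) {
    // parse_special = true so the end tag maps to its special token if defined
    return common_tokenize(ctx, closing_sentence + end_tag,
                           /*add_special=*/false, /*parse_special=*/true);
}
```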