server : dynamic token limit for prompt cache #16560
Conversation
a6be3f7 to 07d6954
Not quite related to this, but do you think we should make a tiny recurrent model for testing the checkpoint-related features?
Yes, I was looking for a small recurrent model recently, but the smallest one I could find with quantization was ~200MB, which is too much IMO. If we make a tiny model, we can add some checkpoint tests.
The smallest recurrent model I know of is https://huggingface.co/delphi-suite/v0-mamba-100k, but it may be too small (its embedding size is 52). There are also slightly bigger ones like https://huggingface.co/delphi-suite/v0-mamba-12.8m, but it has a weird embedding size too (326).
@compilade I'm wondering if the mentioned models are recurrent or hybrid. No matter the vocab size, if its output is deterministic across backends, I think it's fine to use on CI. IMO a hybrid one would be a nice-to-have. I'll see if I can run a fine-tuning on a stripped-down version of LFM2 (they have a ready-to-use script for fine-tuning).
@ngxson The
Agreed.
Testing this out now. It looks like it's starting at a lower limit, but from my read of the PR it should increase when the cap is hit?
I have a long test running now that should fill it up, probably in an hour or two, so I'll report back later.
It looks like the token counter for the cache state might not be updating (still shows
Yes, after the commit I just pushed, the log should show both the configured token limit and the dynamically estimated one based on the memory limit.
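A minimal sketch of what such a log line could look like; the helper name, parameter names, and log format below are illustrative assumptions, not the PR's actual code:

```cpp
#include <cstdio>
#include <cstddef>

// Hypothetical helper: print both the user-configured token limit and the
// limit estimated from the memory budget, so the log makes clear which one
// is currently in effect.
static void log_cache_limits(size_t limit_tokens_config, size_t limit_tokens_est, size_t limit_bytes) {
    fprintf(stderr,
            "prompt cache: limit_tokens = %zu (configured), limit_tokens_est = %zu (from %zu MiB memory limit)\n",
            limit_tokens_config, limit_tokens_est, limit_bytes >> 20);
}
```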
Yep, looks good to me now! I bumped the
ref #16391 (comment)
Increase the token limit of the prompt cache if it can fit in the specified memory limit.
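A minimal sketch of the idea, assuming a roughly fixed cache-state size per token; the function and parameter names are hypothetical and not taken from the PR's implementation:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: derive a token limit from the configured memory limit
// and let the effective limit grow beyond the configured token cap whenever
// the memory budget allows more tokens.
static size_t effective_token_limit(size_t limit_tokens_config, // user-configured token cap
                                    size_t limit_bytes,         // prompt-cache memory limit, in bytes
                                    size_t bytes_per_token) {   // estimated cache-state bytes per token
    if (bytes_per_token == 0) {
        return limit_tokens_config; // no estimate available, keep the configured cap
    }
    const size_t limit_tokens_mem = limit_bytes / bytes_per_token;
    return std::max(limit_tokens_config, limit_tokens_mem);
}
```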