
Conversation

ggerganov
Member

ref #16391 (comment)

Increase the token limit of the prompt cache if it can fit in the specified memory limit.
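For context, here is a minimal sketch of the idea, assuming the dynamic limit is derived from the average state size per cached token; the names and structure below are hypothetical and are not the server's actual code:

```cpp
// Hypothetical sketch: derive a dynamic token limit from a memory limit by
// measuring the average state size per cached token. Not the PR's actual code.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct cache_stats {
    int64_t n_tokens; // total tokens currently held in the prompt cache
    int64_t n_bytes;  // total state size of the cached prompts, in bytes
};

// Returns how many tokens fit in mem_limit_bytes, never going below the
// user-configured token limit. Falls back to the configured limit when the
// cache is empty and no per-token measurement is available.
static int64_t estimate_token_limit(const cache_stats & stats,
                                    int64_t mem_limit_bytes,
                                    int64_t configured_token_limit) {
    if (stats.n_tokens <= 0 || stats.n_bytes <= 0) {
        return configured_token_limit;
    }
    const double bytes_per_token = (double) stats.n_bytes / (double) stats.n_tokens;
    const int64_t est = (int64_t) ((double) mem_limit_bytes / bytes_per_token);
    return std::max(configured_token_limit, est);
}

int main() {
    // numbers loosely taken from the logs in this thread: ~0.36 MiB per token
    const cache_stats stats   = { 7558, (int64_t) (2716.245 * 1024.0 * 1024.0) };
    const int64_t mem_limit   = 262144LL * 1024 * 1024; // --cache-ram 262144 (MiB)
    const int64_t token_limit = 70144;                  // configured token limit

    printf("estimated token limit: %lld\n",
           (long long) estimate_token_limit(stats, mem_limit, token_limit));
    return 0;
}
```

Plugging in the numbers from the logs further down the thread (~0.36 MiB of state per token, 262144 MiB of budget) gives an estimate in the same ballpark as the value the server reports.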

@ggerganov force-pushed the gg/server-dynamic-token-limit-for-prompt-cache branch from a6be3f7 to 07d6954 on October 13, 2025 10:10
@ngxson
Collaborator

ngxson commented Oct 13, 2025

Not quite related to this, do you think we should make a tiny recurrent model for testing the features related to checkpoint?

@ggerganov
Member Author

Yes, I was looking for a small recurrent model recently, but the smallest one I could find with quantization was ~200 MB, which is too much IMO. If we make a tiny model, we can add some checkpoint tests.

@compilade
Collaborator

compilade commented Oct 13, 2025

I was looking for a small recurrent model recently,

The smallest recurrent model I know is https://huggingface.co/delphi-suite/v0-mamba-100k, but it may be too small (its embedding size is 52). There are also slightly bigger ones like https://huggingface.co/delphi-suite/v0-mamba-12.8m, but it has a weird embedding size too (326).

@ngxson
Collaborator

ngxson commented Oct 13, 2025

@compilade I'm wondering if the mentioned models are recurrent or hybrid. No matter the vocab size, if its output is deterministic across backends, I think it's fine to use on CI.

IMO a hybrid one will be a nice-to-have. I'll see if I can run a fine-tuning on a stripped-down version of LFM2 (they have a ready-to-use script for fine-tuning).

@compilade
Collaborator

compilade commented Oct 13, 2025

@compilade I'm wondering if the mentioned models are recurrent or hybrid.

@ngxson The delphi-suite/v0-mamba-* models are purely Mamba1 models, and so they are not hybrid.

IMO a hybrid one will be a nice-to-have.

Agreed.

@AesSedai

Testing this out now. It looks like it's starting at a lower limit, but from my read of the PR it should increase when the cap is hit?

update:  - cache state: 1 prompts, 233.244 MiB (limits: 65536.000 MiB, 66048 tokens)

I have a long test running now that should fill it up probably in an hour or two, so I'll report back later.

@AesSedai

AesSedai commented Oct 13, 2025

It looks like the token counter for the cache state might not be updating (still shows 66048?), but it's definitely holding more prompts now:

158.17.066.643 W srv  get_availabl: updating prompt cache
158.17.066.812 W srv   prompt_save:  - saving prompt with length 1014, total state size = 364.420 MiB
158.17.370.267 W srv          load:  - looking for better prompt, base f_keep = 0.004, sim = 0.001
158.17.370.314 W srv          load:  - found better prompt with f_keep = 0.876, sim = 1.000
158.17.510.421 W srv        update:  - cache state: 21 prompts, 59006.627 MiB (limits: 65536.000 MiB, 66048 tokens)
158.17.510.437 W srv        update:    - prompt 0x3f469120:   14357 tokens, checkpoints:  0,  5159.713 MiB
158.17.510.438 W srv        update:    - prompt 0x3f461400:    9292 tokens, checkpoints:  0,  3339.421 MiB
158.17.510.440 W srv        update:    - prompt 0x3f43d070:    4502 tokens, checkpoints:  0,  1617.960 MiB
158.17.510.441 W srv        update:    - prompt 0x48d34800:   14922 tokens, checkpoints:  0,  5362.767 MiB
158.17.510.441 W srv        update:    - prompt 0x49030e80:   14467 tokens, checkpoints:  0,  5199.246 MiB
158.17.510.442 W srv        update:    - prompt 0x3f464330:    4319 tokens, checkpoints:  0,  1552.192 MiB
158.17.510.444 W srv        update:    - prompt 0x3f436ce0:   10190 tokens, checkpoints:  0,  3662.150 MiB
158.17.510.445 W srv        update:    - prompt 0x48c96d50:    9396 tokens, checkpoints:  0,  3376.797 MiB
158.17.510.445 W srv        update:    - prompt 0x3f45daf0:    1790 tokens, checkpoints:  0,   643.304 MiB
158.17.510.447 W srv        update:    - prompt 0x34822b40:    6443 tokens, checkpoints:  0,  2315.529 MiB
158.17.510.448 W srv        update:    - prompt 0x3f47c8a0:    3607 tokens, checkpoints:  0,  1296.309 MiB
158.17.510.448 W srv        update:    - prompt 0x44834220:   10307 tokens, checkpoints:  0,  3704.198 MiB
158.17.510.450 W srv        update:    - prompt 0x3f445b80:    6917 tokens, checkpoints:  0,  2485.878 MiB
158.17.510.451 W srv        update:    - prompt 0x48c3c9d0:    4939 tokens, checkpoints:  0,  1775.012 MiB
158.17.510.451 W srv        update:    - prompt 0x3f449030:   14756 tokens, checkpoints:  0,  5303.108 MiB
158.17.510.452 W srv        update:    - prompt 0x3f439920:    1443 tokens, checkpoints:  0,   518.597 MiB
158.17.510.455 W srv        update:    - prompt 0x3f4323b0:   21494 tokens, checkpoints:  0,  7724.654 MiB
158.17.510.455 W srv        update:    - prompt 0x3f4a7910:    1197 tokens, checkpoints:  0,   430.188 MiB
158.17.510.456 W srv        update:    - prompt 0x4473a470:    7833 tokens, checkpoints:  0,  2815.076 MiB
158.17.510.456 W srv        update:    - prompt 0x468af840:    1002 tokens, checkpoints:  0,   360.107 MiB
158.17.510.456 W srv        update:    - prompt 0x448340c0:    1014 tokens, checkpoints:  0,   364.420 MiB
158.17.510.460 W srv  get_availabl: prompt cache update took 443.82 ms
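
Rough math for context (not from the PR): the per-prompt sizes in this log work out to roughly 0.36 MiB of state per token (e.g., 5159.713 MiB / 14357 tokens ≈ 0.359 MiB), so the 65536 MiB budget corresponds to roughly 180k tokens of cacheable state, well above the 66048-token figure shown in the header line.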

@ggerganov
Member Author

Yes, after the commit I just pushed, the log should show both the configured token limit and the dynamically estimated one based on the memory limit.

@AesSedai

Yep, looks good to me now! I bumped the --cache-ram up further since I have the space for it (and I'm doing some DSPy testing atm, so it's running a lot of validation prompts with only small context changes) and I see the new estimate:

15.15.917.828 W srv  get_availabl: updating prompt cache
15.15.918.234 W srv   prompt_save:  - saving prompt with length 2185, total state size = 785.262 MiB
15.16.592.843 W srv          load:  - looking for better prompt, base f_keep = 0.371, sim = 0.054
15.16.592.865 W srv        update:  - cache state: 8 prompts, 16553.716 MiB (limits: 262144.000 MiB, 70144 tokens, 729420 est)
15.16.592.867 W srv        update:    - prompt 0x293dcff0:   18544 tokens, checkpoints:  0,  6664.464 MiB
15.16.592.868 W srv        update:    - prompt 0x293e4560:    1735 tokens, checkpoints:  0,   623.538 MiB
15.16.592.868 W srv        update:    - prompt 0x293e56f0:    7558 tokens, checkpoints:  0,  2716.245 MiB
15.16.592.869 W srv        update:    - prompt 0x293f07c0:    4950 tokens, checkpoints:  0,  1778.965 MiB
15.16.592.870 W srv        update:    - prompt 0x293f3c70:    1438 tokens, checkpoints:  0,   516.800 MiB
15.16.592.871 W srv        update:    - prompt 0x293e7cb0:    8446 tokens, checkpoints:  0,  3035.380 MiB
15.16.592.871 W srv        update:    - prompt 0x293e1920:    1205 tokens, checkpoints:  0,   433.063 MiB
15.16.592.872 W srv        update:    - prompt 0x294325a0:    2185 tokens, checkpoints:  0,   785.262 MiB
15.16.592.873 W srv  get_availabl: prompt cache update took 675.04 ms
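
Sanity check on the new estimate (rough math, not from the PR): at ~0.359 MiB per token (2716.245 MiB / 7558 tokens), a 262144 MiB budget corresponds to about 262144 / 0.359 ≈ 729k tokens, which matches the 729420 est value in the header line.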

@ggerganov merged commit bc07349 into master on Oct 14, 2025
68 of 70 checks passed
@ggerganov deleted the gg/server-dynamic-token-limit-for-prompt-cache branch on October 14, 2025 05:48