
Conversation

ggerganov
Member

ref #16391 (comment)

Increase the token limit of the prompt cache if it can fit in the specified memory limit.
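For context, here is a minimal sketch of the idea, assuming the dynamic limit is derived from the average state size per cached token; the names and structure below are hypothetical and are not the server's actual code:

```cpp
// Hypothetical sketch: derive a dynamic token limit from a memory limit by
// measuring the average state size per cached token. Not the PR's actual code.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct cache_stats {
    int64_t n_tokens; // total tokens currently held in the prompt cache
    int64_t n_bytes;  // total state size of the cached prompts, in bytes
};

// Returns how many tokens fit in mem_limit_bytes, never going below the
// user-configured token limit. Falls back to the configured limit when the
// cache is empty and no per-token measurement is available.
static int64_t estimate_token_limit(const cache_stats & stats,
                                    int64_t mem_limit_bytes,
                                    int64_t configured_token_limit) {
    if (stats.n_tokens <= 0 || stats.n_bytes <= 0) {
        return configured_token_limit;
    }
    const double bytes_per_token = (double) stats.n_bytes / (double) stats.n_tokens;
    const int64_t est = (int64_t) ((double) mem_limit_bytes / bytes_per_token);
    return std::max(configured_token_limit, est);
}

int main() {
    // numbers loosely taken from the logs in this thread: ~0.36 MiB per token
    const cache_stats stats   = { 7558, (int64_t) (2716.245 * 1024.0 * 1024.0) };
    const int64_t mem_limit   = 262144LL * 1024 * 1024; // --cache-ram 262144 (MiB)
    const int64_t token_limit = 70144;                  // configured token limit

    printf("estimated token limit: %lld\n",
           (long long) estimate_token_limit(stats, mem_limit, token_limit));
    return 0;
}
```

Plugging in the numbers from the logs further down the thread (~0.36 MiB of state per token, 262144 MiB of budget) gives an estimate in the same ballpark as the value the server reports.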

@ggerganov force-pushed the gg/server-dynamic-token-limit-for-prompt-cache branch from a6be3f7 to 07d6954 on October 13, 2025 10:10
@ngxson
Collaborator

ngxson commented Oct 13, 2025

Not quite related to this, do you think we should make a tiny recurrent model for testing the features related to checkpoint?

@ggerganov
Member Author

Yes, I was looking for a small recurrent model recently, but the smallest one I could find with quantization was ~200 MB, which is too much IMO. If we make a tiny model, we can add some checkpoint tests.

@compilade
Collaborator

compilade commented Oct 13, 2025

I was looking for a small recurrent model recently,

The smallest recurrent model I know is https://huggingface.co/delphi-suite/v0-mamba-100k, but it may be too small (its embedding size is 52). There are also slightly bigger ones like https://huggingface.co/delphi-suite/v0-mamba-12.8m, but it has a weird embedding size too (326).

@ngxson
Collaborator

ngxson commented Oct 13, 2025

@compilade I'm wondering if the mentioned models are recurrent or hybrid. No matter the vocab size, if its output is deterministic across backends, I think it's fine to use on CI.

IMO a hybrid one will be a nice-to-have. I'll see if I can run a fine-tuning on a stripped-down version of LFM2 (they have a ready-to-use script for fine-tuning).

@compilade
Collaborator

compilade commented Oct 13, 2025

@compilade I'm wondering if the mentioned models are recurrent or hybrid.

@ngxson The delphi-suite/v0-mamba-* models are purely Mamba1 models, and so they are not hybrid.

IMO a hybrid one will be a nice-to-have.

Agreed.

@AesSedai

Testing this out now. It looks like it's starting at a lower limit, but from my read of the PR it should increase when the cap is hit?

update:  - cache state: 1 prompts, 233.244 MiB (limits: 65536.000 MiB, 66048 tokens)

I have a long test running now that should fill it up probably in an hour or two, so I'll report back later.

@AesSedai

AesSedai commented Oct 13, 2025

It looks like the token counter for the cache state might not be updating (still shows 66048?), but it's definitely holding more prompts now:

158.17.066.643 W srv  get_availabl: updating prompt cache
158.17.066.812 W srv   prompt_save:  - saving prompt with length 1014, total state size = 364.420 MiB
158.17.370.267 W srv          load:  - looking for better prompt, base f_keep = 0.004, sim = 0.001
158.17.370.314 W srv          load:  - found better prompt with f_keep = 0.876, sim = 1.000
158.17.510.421 W srv        update:  - cache state: 21 prompts, 59006.627 MiB (limits: 65536.000 MiB, 66048 tokens)
158.17.510.437 W srv        update:    - prompt 0x3f469120:   14357 tokens, checkpoints:  0,  5159.713 MiB
158.17.510.438 W srv        update:    - prompt 0x3f461400:    9292 tokens, checkpoints:  0,  3339.421 MiB
158.17.510.440 W srv        update:    - prompt 0x3f43d070:    4502 tokens, checkpoints:  0,  1617.960 MiB
158.17.510.441 W srv        update:    - prompt 0x48d34800:   14922 tokens, checkpoints:  0,  5362.767 MiB
158.17.510.441 W srv        update:    - prompt 0x49030e80:   14467 tokens, checkpoints:  0,  5199.246 MiB
158.17.510.442 W srv        update:    - prompt 0x3f464330:    4319 tokens, checkpoints:  0,  1552.192 MiB
158.17.510.444 W srv        update:    - prompt 0x3f436ce0:   10190 tokens, checkpoints:  0,  3662.150 MiB
158.17.510.445 W srv        update:    - prompt 0x48c96d50:    9396 tokens, checkpoints:  0,  3376.797 MiB
158.17.510.445 W srv        update:    - prompt 0x3f45daf0:    1790 tokens, checkpoints:  0,   643.304 MiB
158.17.510.447 W srv        update:    - prompt 0x34822b40:    6443 tokens, checkpoints:  0,  2315.529 MiB
158.17.510.448 W srv        update:    - prompt 0x3f47c8a0:    3607 tokens, checkpoints:  0,  1296.309 MiB
158.17.510.448 W srv        update:    - prompt 0x44834220:   10307 tokens, checkpoints:  0,  3704.198 MiB
158.17.510.450 W srv        update:    - prompt 0x3f445b80:    6917 tokens, checkpoints:  0,  2485.878 MiB
158.17.510.451 W srv        update:    - prompt 0x48c3c9d0:    4939 tokens, checkpoints:  0,  1775.012 MiB
158.17.510.451 W srv        update:    - prompt 0x3f449030:   14756 tokens, checkpoints:  0,  5303.108 MiB
158.17.510.452 W srv        update:    - prompt 0x3f439920:    1443 tokens, checkpoints:  0,   518.597 MiB
158.17.510.455 W srv        update:    - prompt 0x3f4323b0:   21494 tokens, checkpoints:  0,  7724.654 MiB
158.17.510.455 W srv        update:    - prompt 0x3f4a7910:    1197 tokens, checkpoints:  0,   430.188 MiB
158.17.510.456 W srv        update:    - prompt 0x4473a470:    7833 tokens, checkpoints:  0,  2815.076 MiB
158.17.510.456 W srv        update:    - prompt 0x468af840:    1002 tokens, checkpoints:  0,   360.107 MiB
158.17.510.456 W srv        update:    - prompt 0x448340c0:    1014 tokens, checkpoints:  0,   364.420 MiB
158.17.510.460 W srv  get_availabl: prompt cache update took 443.82 ms
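
Rough math for context (not from the PR): the per-prompt sizes in this log work out to roughly 0.36 MiB of state per token (e.g., 5159.713 MiB / 14357 tokens ≈ 0.359 MiB), so the 65536 MiB budget corresponds to roughly 180k tokens of cacheable state, well above the 66048-token figure shown in the header line.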

@ggerganov
Member Author

Yes, after the commit I just pushed, the log should show both the configured token limit and the dynamically estimated one based on the memory limit.

@AesSedai

Yep, looks good to me now! I bumped the --cache-ram up further since I have the space for it (and I'm doing some DSPy testing atm, so it's running a lot of validation prompts with only small context changes) and I see the new estimate:

15.15.917.828 W srv  get_availabl: updating prompt cache
15.15.918.234 W srv   prompt_save:  - saving prompt with length 2185, total state size = 785.262 MiB
15.16.592.843 W srv          load:  - looking for better prompt, base f_keep = 0.371, sim = 0.054
15.16.592.865 W srv        update:  - cache state: 8 prompts, 16553.716 MiB (limits: 262144.000 MiB, 70144 tokens, 729420 est)
15.16.592.867 W srv        update:    - prompt 0x293dcff0:   18544 tokens, checkpoints:  0,  6664.464 MiB
15.16.592.868 W srv        update:    - prompt 0x293e4560:    1735 tokens, checkpoints:  0,   623.538 MiB
15.16.592.868 W srv        update:    - prompt 0x293e56f0:    7558 tokens, checkpoints:  0,  2716.245 MiB
15.16.592.869 W srv        update:    - prompt 0x293f07c0:    4950 tokens, checkpoints:  0,  1778.965 MiB
15.16.592.870 W srv        update:    - prompt 0x293f3c70:    1438 tokens, checkpoints:  0,   516.800 MiB
15.16.592.871 W srv        update:    - prompt 0x293e7cb0:    8446 tokens, checkpoints:  0,  3035.380 MiB
15.16.592.871 W srv        update:    - prompt 0x293e1920:    1205 tokens, checkpoints:  0,   433.063 MiB
15.16.592.872 W srv        update:    - prompt 0x294325a0:    2185 tokens, checkpoints:  0,   785.262 MiB
15.16.592.873 W srv  get_availabl: prompt cache update took 675.04 ms
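
Sanity check on the new estimate (rough math, not from the PR): at ~0.359 MiB per token (2716.245 MiB / 7558 tokens), a 262144 MiB budget corresponds to about 262144 / 0.359 ≈ 729k tokens, which matches the 729420 est value in the header line.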

@ggerganov merged commit bc07349 into master on Oct 14, 2025
68 of 70 checks passed
@ggerganov deleted the gg/server-dynamic-token-limit-for-prompt-cache branch on October 14, 2025 05:48