server: fix --cache-ram not preventing RAM OOM by zzhenyao · Pull Request #23561 · ggml-org/llama.cpp

zzhenyao · 2026-05-23T04:18:08Z

Overview

Fix a RAM OOM crash in server_prompt_cache when saving to the prompt cache with a --cache-ram limit set.

A previous fix added pre-allocation checks, but checkpoint data was not included in the cache size calculation, so the real cache size could still exceed --cache-ram.

Also, update() did not account for the pending allocation, and the states.size() > 1 guard skipped eviction when only one entry existed.

Fixed:

include checkpoint data in the cache size calculation
pass pending allocation size and token count to update() so it evicts until the new entry fits
remove states.size() > 1 guard
remove catch(bad_alloc) recovery
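
The eviction change can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `cache_entry`, `prompt_cache_sketch`, `make_room()` and `save()` are hypothetical names, the sketch tracks bytes only (the PR also passes a token count), and it assumes the oldest entry is evicted first.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical stand-in for one cached prompt state (not the llama.cpp type).
struct cache_entry {
    size_t n_bytes;  // total bytes held by this entry, checkpoints included
};

struct prompt_cache_sketch {
    std::list<cache_entry> states;  // oldest entry at the front
    size_t limit_bytes = 0;         // 0 means "no limit" in this sketch

    // Sum of the sizes of all entries currently held.
    size_t size() const {
        size_t total = 0;
        for (const auto & e : states) {
            total += e.n_bytes;
        }
        return total;
    }

    // Evict oldest entries until `pending_bytes` more would fit.
    // Returns false when the pending entry alone exceeds the limit;
    // in that case nothing is evicted and the caller skips caching.
    bool make_room(size_t pending_bytes) {
        if (limit_bytes == 0) {
            return true;
        }
        if (pending_bytes > limit_bytes) {
            return false;
        }
        // No `states.size() > 1` guard: a lone entry may be evicted too.
        while (!states.empty() && size() + pending_bytes > limit_bytes) {
            states.pop_front();
        }
        assert(size() + pending_bytes <= limit_bytes);
        return true;
    }

    // Store the entry only after room has been made for it.
    bool save(const cache_entry & e) {
        if (!make_room(e.n_bytes)) {
            return false;
        }
        states.push_back(e);
        return true;
    }
};
```

Because the limit is checked before anything is stored, an oversized entry is rejected up front instead of being allocated and recovered from afterwards, which is the reasoning behind dropping the catch(bad_alloc) path.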

Additional information

alloc() checked the limit using only KV cache and draft state sizes, and the try block only allocated those. But the cache entry also included checkpoint data, so actual RAM usage could still exceed the configured limit and trigger OOM.
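
To illustrate the accounting gap described above, here is a small sketch under stated assumptions: `entry_sizes` and its field names are hypothetical and do not mirror the actual llama.cpp structures. It only shows how a limit check that ignores checkpoint data can pass while the real footprint is over the limit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical size breakdown of one cache entry (illustrative names only).
struct entry_sizes {
    size_t kv_bytes;                       // main KV cache state
    size_t draft_bytes;                    // draft model state
    std::vector<size_t> checkpoint_bytes;  // one size per checkpoint
};

// Size as the pre-fix check saw it: checkpoint data is ignored.
size_t size_without_checkpoints(const entry_sizes & e) {
    return e.kv_bytes + e.draft_bytes;
}

// Size including checkpoint data, which is what actually occupies RAM.
size_t size_with_checkpoints(const entry_sizes & e) {
    size_t total = e.kv_bytes + e.draft_bytes;
    for (size_t n : e.checkpoint_bytes) {
        total += n;
    }
    return total;
}

// True when an entry of `entry_bytes` fits under `limit_bytes`.
bool fits(size_t entry_bytes, size_t limit_bytes) {
    return entry_bytes <= limit_bytes;
}
```

With `kv_bytes = 300`, `draft_bytes = 50`, two checkpoints of 200 each and a limit of 400 (arbitrary units), the first function reports 350 and the check passes, while the second reports 750 and the check fails.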

Before fix:

saving prompt with length 146764, total state size = 5332.813 MiB (draft: 307.363 MiB)
oom-kill: ... task=llama-server,pid=92787
Out of memory: Killed process 92787 (llama-server) ...

After fix:

saving prompt with length 72820, total state size = 2721.372 MiB (draft: 152.505 MiB)
single state (2721.372 MiB) exceeds cache limit (360.000 MiB), skipping cache

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, llama.cpp + claude code, used for checking code style


zzhenyao · 2026-05-24T07:19:45Z

@aldehir @ggerganov could you please take a look when convenient?
Possibly related: #22925, #22629, #21690

Follow-up: if --cache-ram is supposed to cover checkpoint data as well, then this fix should stop the prompt cache from exceeding the limit and causing RAM OOM. But that also means the default value may now be too small. Should it be adjusted?

server : fix OOM crash in prompt cache by checking size limit before allocation

ec67a3d

zzhenyao requested a review from a team as a code owner May 23, 2026 04:18

github-actions Bot added examples server labels May 23, 2026

server: fix prompt cache eviction and include checkpoint in cache-ram

3b0d2b6

zzhenyao changed the title ~~server : fix OOM crash in prompt cache by checking size limit before allocation~~ server: fix --cache-ram not preventing RAM OOM May 24, 2026

zzhenyao mentioned this pull request May 24, 2026

Eval bug: Checkpoints and MMProj on Gemma 4 consume abnormal amounts of RAM, leading to llama-server going OOM #21690

Open

zzhenyao wants to merge 2 commits into ggml-org:master from zzhenyao:fix/server-cache-ram-oom
