RFC: MoE expert cache, VRAM caching of hot CPU-resident experts with hybrid hit/miss execution #24528

leloch · 2026-06-12T16:10:15Z

leloch
Jun 12, 2026

Follow-up to #24524, which was closed as too large to review — @am17an suggested
an RFC laying out the benefits vs the maintenance burden, and explicitly suggested
having the AI draft it. So, full disclosure: this document and the code it
describes were predominantly generated by an AI (Anthropic's Claude Fable 5),
working under my direction on my hardware. Code: leloch:moe-cache-pr.

Problem

When a MoE model spills experts to system RAM (--cpu-moe / --n-cpu-moe /
auto-fit), decode is dominated by the CPU reading expert weights at RAM
bandwidth while the GPUs idle. Prior attempts (#20757, #21609, #21614, #21620,
#23170) all moved MUL_MAT_ID to the GPU and tried to make the weight copies
cheaper. That puts every cache miss on the critical path as a synchronous PCIe
transfer: @batot1 measured ~3× decode regression from forced decode offload
without residency (#20757), and the Metal slot-pool experiment in the same
thread was 2× slower than vanilla even at 97–99% hit rate, purely from
per-layer sync points. #23170's post-mortem concluded the staging-buffer
approach is a no-op when made correct, and that the fix is a separate
persistent expert-cache buffer with explicit expert→slot bookkeeping.

On the discussions side, #22584 proposes the static version of this idea
(offline expert profiling + strategic placement), #19030 and PR #21067 the
prefetch version (latency-hiding without residency), and #22183 asks the
placement question a runtime cache answers automatically ("use the idle fast
device for experts") — but no runtime expert cache has been proposed as an
RFC before.

Proposed design

That persistent buffer, plus an inverted execution model: MUL_MAT_ID stays on
the CPU. Inside the CPU kernel, thread 0 dispatches one batched matvec over
the cached (hit) rows on the GPU while the other threads compute the miss rows
as they normally would. Misses cost nothing extra, hit rate is pure upside,
worst case degrades to the vanilla CPU path. Fill is decode-only (prompt
routing is far flatter and thrashes the cache); decode routing is skewed
enough to make this work — measured on Qwen3.5-122B, the top 10% of experts
take ~80% of hits (independent corroboration in #20757: Gini ≈ 0.76, ~99%
simulated hit rate at 69% expert budget). Complementary to #21067: prefetch
hides cold transfers at large ubatch (prompt); this removes hot
re-transfers at ubatch 1 (decode).

Measured benefit

4× RTX 3090, EPYC 7R13 48c, 8-ch DDR4-3200 256GB; llama-bench tg300, identical command
lines both arms:

Model	Regime	cache	vanilla	gain
GLM-5.1 754B IQ2_M (220 GB)	zero-config	17.49	13.96	+25%
Qwen3.5 397B Q3_K_XL (167 GB)	zero-config	30.25	28.18	+7%
Qwen3.5 122B (fits in VRAM)	zero-config	74.98	74.98	parity (cache dormant)
13 models / 9 architectures	forced `-ngl 99 -ncmoe 99`			+10%…+57%, 16/16 ≥ parity

Quality: decode-path perplexity statistically identical to pure CPU;
prompt/batch path bit-untouched; test-backend-ops MUL_MAT_ID 789/789 both
modes. (GPU rounding can flip a near-tie token under greedy decoding, same
class as any -ngl change.)

Maintenance burden:

Where the code lives. ~1,700 lines in one new CUDA file pair
(moe-cache.cu/.cuh); a backend-agnostic function-pointer API table
(ggml-backend-moe-cache.h); small null-checked hooks in ggml-cpu.c/ops.cpp
(mul_mat_id + GLU) and ggml-backend.cpp (scheduler offer/redirect); a fit
rule + --moe-cache flag in common. Non-CUDA builds see a zero-initialized
table and no-op hooks; --moe-cache 0 restores stock behavior exactly.

The real costs

It hooks the CPU mul_mat_id hot path. The hooks are null-checked
pointers, but anyone refactoring the CPU MoE kernels or the scheduler
split logic now has a second consumer to think about. This is the largest
structural cost.
Heuristics that maintainers didn't choose. Shape-census stability
window, baseline-sampled bail-out (EWMA, 4 strikes at +5%), VRAM reserve
default, decode-only fill. They're measured, but they're policy, and
policy invites tuning requests.
Hard to test in CI. The cache only engages on spilling MoE models
during decode; CI can verify it stays dormant and that the dispatch path
is numerically correct (a built-in selftest does this model-free), but the
perf claims need big-RAM multi-GPU hardware. Realistically, regressions
would be caught by users, not CI.
CUDA-only. The API table is backend-agnostic, but today only CUDA
registers it; Metal/Vulkan users will ask.
State. A persistent hot-set file (~/.cache/llama.cpp/) and an
admission/eviction state machine — more surface than a stateless
optimization.

What would reduce the burden, if there's interest:

A minimal core (plain per-node dispatch only: no fused gate+up+SwiGLU,
no GPU-resident handoff, no hot-set persistence) is roughly half the code
and keeps most of the gain; I can measure each layer's exact contribution
on this hardware.
Landing the API table + CPU hooks alone first (~small, inert without a
backend registration) would let any backend implement caching out-of-tree
before committing to the CUDA implementation.
Or: nobody merges anything, and this stays a documented design +
benchmark set that informs ggml: allow prefetching tensor overrides #21067 or a future maintainer-built version.
That's a fine outcome too.

Offer

The 4×3090 / 48-core / 8-channel 256GB DDR4-3200 box is available for any benchmark,
ablation, or A/B against #21067 that would help evaluate this. All results
above are reproducible with the commands in the branch's commit messages.

Lidenburg · 2026-06-14T14:08:36Z

Lidenburg
Jun 14, 2026

I've been spending some time implementing MoE expert caching as well, except offloading all matmuls to the GPU, mainly for fun to see how high I could make the t/s number go, I have no intention of submitting a PR with production code or to own the code or maintain it.

I have uploaded the code to a new fork at https://github.com/Lidenburg/llama.cpp incase anyone wants to use it as a reference point. I tried to keep my changes to a single file (ggml-backend.cpp) to make porting it easier and to make it easier to understand. The code is partially LLM generated and partially hand-written, but the theories and strategies tested were all manually guided. That said I wouldn't recommend manually reading the code.

In general I saw around ~60% performance gain using expert caching with the same VRAM usage as when using the --n-cpu-moe option. There's some comparisons in the README in the fork.

My goal was to try to get larger models (100b+) running at reasonable speeds on my 32gb + 16gb system, which i still think is probably achievable, but not using the spare SATA SSD that I had lying around. I would be curious to see what others with faster SSDs can get out of it in terms of t/s.

0 replies

batot1 · 2026-06-14T22:12:22Z

batot1
Jun 14, 2026

MoE cache regression test on GTX 1080 Ti

Comment intended for ggml-org/llama.cpp Discussion #24528
Branch tested: leloch/llama.cpp, moe-cache-pr

Summary

I tested the moe-cache-pr branch on a much smaller and older setup than the 4× RTX 3090 system described in the RFC.

This may simply be outside the hardware regime where the cache is expected to help, but on my GTX 1080 Ti single-GPU setup I consistently observed a regression versus a hard-disabled MoE cache path.

The useful result here is the single-GPU GTX 1080 Ti sweep. A dual-GPU attempt was also made, but that run is not a valid MoE-cache benchmark because the binary was built only for sm_61, while the first visible CUDA device was sm_75.

Tested branch

Item	Value
Repository	`leloch/llama.cpp`
Branch	`moe-cache-pr`
HEAD	`8853f0535`
Commit title	`common: MoE-cache-aware fit placement and --moe-cache flag`

Model

/home/bartosz/models/qwen36/Qwen3.6-35B-A3B-UDT-Q8_K_XL_MTP.gguf

Single-GPU hardware

Item	Value
GPU	NVIDIA GeForce GTX 1080 Ti
VRAM	11 GB
CUDA compute capability	6.1
Runtime device selection	`CUDA_VISIBLE_DEVICES=1`
Build CUDA arch	`CMAKE_CUDA_ARCHITECTURES=61-real`

Common runtime parameters

-c 8192
-n 512
-t 6
-b 512
-ub 512
-ngl all
-sm none
-fa on
-ctk q8_0
-ctv q8_0
--cpu-moe
--cache-ram 0
GGML_CUDA_MOE_CACHE_HOTSET=0

Decode throughput results

Mode	MoE cache budget	TG tok/s	Relative to hard OFF	Result
hard OFF	0 MB	19.32	100.0%	baseline
ON	4096 MB	13.25	68.6%	regression
ON	1024 MB	14.38	74.4%	regression
ON	512 MB	17.26	89.3%	regression
ON	256 MB	17.19	89.0%	regression
ON	128 MB	18.64	96.5%	regression
ON	64 MB	18.71	96.9%	regression
ON	32 MB	18.72	96.9%	regression

Throughput trend

hard OFF   -> 19.32 tok/s

4096 MB    -> 13.25 tok/s
1024 MB    -> 14.38 tok/s
512 MB     -> 17.26 tok/s
256 MB     -> 17.19 tok/s
128 MB     -> 18.64 tok/s
64 MB      -> 18.71 tok/s
32 MB      -> 18.72 tok/s

On this setup, larger cache budgets caused larger regressions. Smaller budgets reduced the regression, but none of the tested cache-enabled configurations outperformed the hard-disabled cache path.

CLI / control observations

`--moe-cache auto`

--moe-cache auto failed with:

error while handling argument "--moe-cache": stoi

The help text suggests that auto is a valid/default mode, but the CLI argument parser appears to parse the value as an integer.

`--moe-cache 0` versus hard OFF

In earlier runs, --moe-cache 0 did not appear to behave like a reliable hard-disabled cache path. The reliable hard-OFF configuration was:

GGML_CUDA_MOE_CACHE=0
GGML_CUDA_MOE_CACHE_HOTSET=0

The hard-OFF result was:

eval time = 26500.62 ms / 512 tokens
TG        = 19.32 tok/s

Bail-out / trim observations

Some cache-enabled runs logged bail-out / trim messages such as:

[moe-cache] bail-out: cache-engaged nodes average ... vs ... pure-CPU — disabling the cache and freeing its VRAM for this run
[moe-cache] dev=0 TRIMMED ... MB under VRAM pressure — cache off on this device

Examples:

Budget	Relevant log observation	TG tok/s
1024 MB	`cache-engaged nodes average 285us vs 266us pure-CPU` + `TRIMMED 1022 MB`	14.38
512 MB	`cache-engaged nodes average 290us vs 261us pure-CPU` + `TRIMMED 510 MB`	17.26
256 MB	`cache-engaged nodes average 310us vs 263us pure-CPU` + `TRIMMED 255 MB`	17.19

Even after bail-out / trim, throughput did not return to the hard-OFF baseline.

Dual-GPU attempt

I also attempted a dual-GPU run with both GPUs visible:

CUDA device	GPU	Compute capability
CUDA0	GTX 1660 Ti	7.5
CUDA1	GTX 1080 Ti	6.1

The binary was built only for sm_61:

CUDA : ARCHS = 610

The run failed during warmup with:

CUDA error: no kernel image is available for execution on the device
current device: 0
ggml_cuda_kernel_can_use_pdl

This dual-GPU attempt is not a valid MoE-cache benchmark. The likely reason is that CUDA device 0 was the GTX 1660 Ti (sm_75), while the tested binary only contained kernels for sm_61.

A proper dual-GPU test would require a binary built for both architectures, for example sm_61 and sm_75. I did not include such a result here to avoid mixing the MoE-cache behavior with a separate dual-architecture build variable.

Conclusion

On this GTX 1080 Ti single-GPU setup, MoE cache regresses decode throughput versus:

GGML_CUDA_MOE_CACHE=0
GGML_CUDA_MOE_CACHE_HOTSET=0

This does not disprove the reported 4× RTX 3090 results from the RFC. It may indicate that the current cache path is beneficial only in a specific hardware/model regime, while on smaller or older GPUs the overhead can dominate.

The main actionable observations are:

--moe-cache auto fails with stoi.
--moe-cache 0 did not appear equivalent to GGML_CUDA_MOE_CACHE=0 in my tests.
On GTX 1080 Ti, cache-enabled runs regress versus hard OFF.
Larger cache budgets cause larger regressions.
Bail-out / trim does not restore hard-OFF throughput.
A dual-GPU test with mixed sm_75 + sm_61 requires a binary built for both architectures; the sm_61-only binary cannot benchmark that setup correctly.

0 replies

leloch · 2026-06-15T10:58:09Z

leloch
Jun 15, 2026
Author

Thanks for testing, the improvement is indeed quite hardware dependent.

I imagine on a relatively slow GPU without tensor cores (like 1080), and not much spare VRAM after placing all dense layers on a GPU, for MoE it could be just faster to compute expert activation on CPU + RAM (depending on the exact CPU and RAM bandwidth available), as there is some overhead on the cache path.

Ideally you want enough VRAM spare for cache (after placing ALL dense layers on a GPU - otherwise there is no point), that you can fit experts working set there (so around 20%-30% of MoE expert weights).

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: MoE expert cache, VRAM caching of hot CPU-resident experts with hybrid hit/miss execution #24528

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

RFC: MoE expert cache, VRAM caching of hot CPU-resident experts with hybrid hit/miss execution #24528

Uh oh!

leloch Jun 12, 2026

Problem

Proposed design

Measured benefit

Maintenance burden:

Offer

Replies: 3 comments

Uh oh!

Lidenburg Jun 14, 2026

Uh oh!

batot1 Jun 14, 2026

MoE cache regression test on GTX 1080 Ti

Summary

Tested branch

Model

Single-GPU hardware

Common runtime parameters

Decode throughput results

Throughput trend

CLI / control observations

--moe-cache auto

--moe-cache 0 versus hard OFF

Bail-out / trim observations

Dual-GPU attempt

Conclusion

Uh oh!

leloch Jun 15, 2026 Author

leloch
Jun 12, 2026

Lidenburg
Jun 14, 2026

batot1
Jun 14, 2026

`--moe-cache auto`

`--moe-cache 0` versus hard OFF

leloch
Jun 15, 2026
Author