Replies: 3 comments
-
|
I've been spending some time implementing MoE expert caching as well, except offloading all matmuls to the GPU, mainly for fun to see how high I could make the t/s number go, I have no intention of submitting a PR with production code or to own the code or maintain it. I have uploaded the code to a new fork at https://github.com/Lidenburg/llama.cpp incase anyone wants to use it as a reference point. I tried to keep my changes to a single file (ggml-backend.cpp) to make porting it easier and to make it easier to understand. The code is partially LLM generated and partially hand-written, but the theories and strategies tested were all manually guided. That said I wouldn't recommend manually reading the code. In general I saw around ~60% performance gain using expert caching with the same VRAM usage as when using the My goal was to try to get larger models (100b+) running at reasonable speeds on my 32gb + 16gb system, which i still think is probably achievable, but not using the spare SATA SSD that I had lying around. I would be curious to see what others with faster SSDs can get out of it in terms of t/s. |
Beta Was this translation helpful? Give feedback.
-
MoE cache regression test on GTX 1080 TiComment intended for SummaryI tested the This may simply be outside the hardware regime where the cache is expected to help, but on my GTX 1080 Ti single-GPU setup I consistently observed a regression versus a hard-disabled MoE cache path. The useful result here is the single-GPU GTX 1080 Ti sweep. A dual-GPU attempt was also made, but that run is not a valid MoE-cache benchmark because the binary was built only for Tested branch
ModelSingle-GPU hardware
Common runtime parametersDecode throughput results
Throughput trendOn this setup, larger cache budgets caused larger regressions. Smaller budgets reduced the regression, but none of the tested cache-enabled configurations outperformed the hard-disabled cache path. CLI / control observations
|
| Budget | Relevant log observation | TG tok/s |
|---|---|---|
| 1024 MB | cache-engaged nodes average 285us vs 266us pure-CPU + TRIMMED 1022 MB |
14.38 |
| 512 MB | cache-engaged nodes average 290us vs 261us pure-CPU + TRIMMED 510 MB |
17.26 |
| 256 MB | cache-engaged nodes average 310us vs 263us pure-CPU + TRIMMED 255 MB |
17.19 |
Even after bail-out / trim, throughput did not return to the hard-OFF baseline.
Dual-GPU attempt
I also attempted a dual-GPU run with both GPUs visible:
| CUDA device | GPU | Compute capability |
|---|---|---|
| CUDA0 | GTX 1660 Ti | 7.5 |
| CUDA1 | GTX 1080 Ti | 6.1 |
The binary was built only for sm_61:
CUDA : ARCHS = 610
The run failed during warmup with:
CUDA error: no kernel image is available for execution on the device
current device: 0
ggml_cuda_kernel_can_use_pdl
This dual-GPU attempt is not a valid MoE-cache benchmark. The likely reason is that CUDA device 0 was the GTX 1660 Ti (sm_75), while the tested binary only contained kernels for sm_61.
A proper dual-GPU test would require a binary built for both architectures, for example sm_61 and sm_75. I did not include such a result here to avoid mixing the MoE-cache behavior with a separate dual-architecture build variable.
Conclusion
On this GTX 1080 Ti single-GPU setup, MoE cache regresses decode throughput versus:
GGML_CUDA_MOE_CACHE=0
GGML_CUDA_MOE_CACHE_HOTSET=0
This does not disprove the reported 4× RTX 3090 results from the RFC. It may indicate that the current cache path is beneficial only in a specific hardware/model regime, while on smaller or older GPUs the overhead can dominate.
The main actionable observations are:
--moe-cache autofails withstoi.--moe-cache 0did not appear equivalent toGGML_CUDA_MOE_CACHE=0in my tests.- On GTX 1080 Ti, cache-enabled runs regress versus hard OFF.
- Larger cache budgets cause larger regressions.
- Bail-out / trim does not restore hard-OFF throughput.
- A dual-GPU test with mixed
sm_75+sm_61requires a binary built for both architectures; thesm_61-only binary cannot benchmark that setup correctly.
Beta Was this translation helpful? Give feedback.
-
|
Thanks for testing, the improvement is indeed quite hardware dependent. I imagine on a relatively slow GPU without tensor cores (like 1080), and not much spare VRAM after placing all dense layers on a GPU, for MoE it could be just faster to compute expert activation on CPU + RAM (depending on the exact CPU and RAM bandwidth available), as there is some overhead on the cache path. Ideally you want enough VRAM spare for cache (after placing ALL dense layers on a GPU - otherwise there is no point), that you can fit experts working set there (so around 20%-30% of MoE expert weights). |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Follow-up to #24524, which was closed as too large to review — @am17an suggested
an RFC laying out the benefits vs the maintenance burden, and explicitly suggested
having the AI draft it. So, full disclosure: this document and the code it
describes were predominantly generated by an AI (Anthropic's Claude Fable 5),
working under my direction on my hardware. Code:
leloch:moe-cache-pr.Problem
When a MoE model spills experts to system RAM (
--cpu-moe/--n-cpu-moe/auto-fit), decode is dominated by the CPU reading expert weights at RAM
bandwidth while the GPUs idle. Prior attempts (#20757, #21609, #21614, #21620,
#23170) all moved MUL_MAT_ID to the GPU and tried to make the weight copies
cheaper. That puts every cache miss on the critical path as a synchronous PCIe
transfer: @batot1 measured ~3× decode regression from forced decode offload
without residency (#20757), and the Metal slot-pool experiment in the same
thread was 2× slower than vanilla even at 97–99% hit rate, purely from
per-layer sync points. #23170's post-mortem concluded the staging-buffer
approach is a no-op when made correct, and that the fix is a separate
persistent expert-cache buffer with explicit expert→slot bookkeeping.
On the discussions side, #22584 proposes the static version of this idea
(offline expert profiling + strategic placement), #19030 and PR #21067 the
prefetch version (latency-hiding without residency), and #22183 asks the
placement question a runtime cache answers automatically ("use the idle fast
device for experts") — but no runtime expert cache has been proposed as an
RFC before.
Proposed design
That persistent buffer, plus an inverted execution model: MUL_MAT_ID stays on
the CPU. Inside the CPU kernel, thread 0 dispatches one batched matvec over
the cached (hit) rows on the GPU while the other threads compute the miss rows
as they normally would. Misses cost nothing extra, hit rate is pure upside,
worst case degrades to the vanilla CPU path. Fill is decode-only (prompt
routing is far flatter and thrashes the cache); decode routing is skewed
enough to make this work — measured on Qwen3.5-122B, the top 10% of experts
take ~80% of hits (independent corroboration in #20757: Gini ≈ 0.76, ~99%
simulated hit rate at 69% expert budget). Complementary to #21067: prefetch
hides cold transfers at large ubatch (prompt); this removes hot
re-transfers at ubatch 1 (decode).
Measured benefit
4× RTX 3090, EPYC 7R13 48c, 8-ch DDR4-3200 256GB; llama-bench tg300, identical command
lines both arms:
-ngl 99 -ncmoe 99Quality: decode-path perplexity statistically identical to pure CPU;
prompt/batch path bit-untouched;
test-backend-opsMUL_MAT_ID 789/789 bothmodes. (GPU rounding can flip a near-tie token under greedy decoding, same
class as any
-nglchange.)Maintenance burden:
Where the code lives. ~1,700 lines in one new CUDA file pair
(
moe-cache.cu/.cuh); a backend-agnostic function-pointer API table(
ggml-backend-moe-cache.h); small null-checked hooks inggml-cpu.c/ops.cpp(mul_mat_id + GLU) and
ggml-backend.cpp(scheduler offer/redirect); a fitrule +
--moe-cacheflag in common. Non-CUDA builds see a zero-initializedtable and no-op hooks;
--moe-cache 0restores stock behavior exactly.The real costs
pointers, but anyone refactoring the CPU MoE kernels or the scheduler
split logic now has a second consumer to think about. This is the largest
structural cost.
window, baseline-sampled bail-out (EWMA, 4 strikes at +5%), VRAM reserve
default, decode-only fill. They're measured, but they're policy, and
policy invites tuning requests.
during decode; CI can verify it stays dormant and that the dispatch path
is numerically correct (a built-in selftest does this model-free), but the
perf claims need big-RAM multi-GPU hardware. Realistically, regressions
would be caught by users, not CI.
registers it; Metal/Vulkan users will ask.
~/.cache/llama.cpp/) and anadmission/eviction state machine — more surface than a stateless
optimization.
What would reduce the burden, if there's interest:
no GPU-resident handoff, no hot-set persistence) is roughly half the code
and keeps most of the gain; I can measure each layer's exact contribution
on this hardware.
backend registration) would let any backend implement caching out-of-tree
before committing to the CUDA implementation.
benchmark set that informs ggml: allow prefetching tensor overrides #21067 or a future maintainer-built version.
That's a fine outcome too.
Offer
The 4×3090 / 48-core / 8-channel 256GB DDR4-3200 box is available for any benchmark,
ablation, or A/B against #21067 that would help evaluate this. All results
above are reproducible with the commands in the branch's commit messages.
Beta Was this translation helpful? Give feedback.
All reactions