feat: add hybrid GPU/CPU offloading support for MoE models #281
Merged
Conversation
Add first-class CRD fields for hybrid GPU/CPU inference, enabling large Mixture-of-Experts models to run on VRAM-constrained hardware by offloading expert weights and KV cache to system RAM.

New InferenceServiceSpec fields:
- moeCPUOffload: offload all MoE expert layers to CPU (--cpu-moe)
- moeCPULayers: offload N MoE layers to CPU (--n-cpu-moe)
- noKvOffload: keep KV cache in system RAM (--no-kv-offload)

New InferenceResourceRequirements field:
- hostMemory: explicit system RAM request for hybrid pods; takes precedence over memory to ensure accurate K8s scheduling

The reconciler emits a Warning event when offloading is enabled but neither memory nor hostMemory is set, preventing silent OOM kills.

Ref #280

Signed-off-by: Christopher Maher <chris@mahercode.io>
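A manifest using the new fields might look like the following sketch. The apiVersion, kind, metadata, and model-related layout are assumptions for illustration, not copied from the repo; only the offload fields themselves come from this PR:

```yaml
# Hypothetical InferenceService manifest; group/version and surrounding
# structure are assumed. Offload fields map to the llama.cpp flags noted.
apiVersion: serving.llmkube.io/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: qwen3-30b-hybrid
spec:
  moeCPULayers: 24      # offload 24 MoE layers to CPU (--n-cpu-moe 24)
  noKvOffload: true     # keep KV cache in system RAM (--no-kv-offload)
  resources:
    hostMemory: 48Gi    # system RAM request; takes precedence over memory
```

Setting moeCPUOffload: true instead of moeCPULayers would offload all expert layers (--cpu-moe). Declaring hostMemory here is what keeps the scheduler honest about the pod's real RAM footprint.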
Replace deprecated record.EventRecorder (GetEventRecorderFor) with events.EventRecorder (GetEventRecorder) to satisfy staticcheck SA1019. Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added a commit that referenced this pull request on Apr 17, 2026:
Phase 2 of hybrid GPU/CPU offloading support (ref #280). Adds three new InferenceServiceSpec fields for fine-grained inference tuning:

- tensorOverrides: regex-based tensor placement (--override-tensor)
- batchSize: prompt processing batch size (--batch-size)
- uBatchSize: decoding micro-batch size (--ubatch-size)

These controls are particularly useful for hybrid MoE workloads, where batch size tuning amortizes PCIe overhead and tensor overrides enable expert-level placement decisions beyond the typed moeCPUOffload flag.

This feature was implemented by Qwen3.6-35B-A3B running via llama.cpp with hybrid CPU/GPU offloading on dual RTX 5060 Ti GPUs, deployed by LLMKube itself with the Phase 1 offloading support from #281.

Signed-off-by: Christopher Maher <chris@mahercode.io>
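The Phase 2 fields could be combined with the Phase 1 offload fields like this. This is a hedged fragment: the field placement under spec and the specific values and regex are illustrative assumptions, not from the repo:

```yaml
# Hypothetical Phase 2 tuning fragment; values and regex are illustrative.
spec:
  batchSize: 2048              # --batch-size: larger batches amortize PCIe overhead
  uBatchSize: 512              # --ubatch-size: decoding micro-batch size
  tensorOverrides: "ffn_.*_exps=CPU"  # --override-tensor: pin expert tensors to CPU
```

The tensorOverrides regex shown would keep feed-forward expert tensors in system RAM while everything else follows the default placement, which is the kind of expert-level decision the typed moeCPUOffload flag cannot express.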
Summary
Adds first-class CRD fields for hybrid GPU/CPU inference (Phase 1 of #280), enabling large MoE models to run on VRAM-constrained hardware by offloading expert weights and KV cache to system RAM.
- moeCPUOffload — offloads all MoE expert layers to CPU (--cpu-moe)
- moeCPULayers — offloads N MoE layers to CPU (--n-cpu-moe)
- noKvOffload — keeps KV cache in system RAM (--no-kv-offload)
- hostMemory — explicit system RAM request for hybrid pods; takes precedence over memory for accurate K8s scheduling
- Warning event when offloading is enabled but memory is not declared

Use case
A Qwen3-30B-A3B (30B total parameters, ~3B active per token) can run on dual RTX 5060 Ti 16GB GPUs with 90K context by keeping attention layers on GPU while expert weights sit in system RAM — previously only possible via extraArgs.

Test plan
- needsOffloadMemoryWarning helper (both flags × memory/hostMemory set/unset)
- hostMemory resource precedence
- make test passes (controller coverage 82.6%)
- make generate && make manifests clean

Ref #280 (Phase 1 only — Phases 2 & 3 tracked in the issue)
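The warning condition exercised by the test plan can be sketched as a small predicate. The signature and logic below are assumptions inferred from the PR description, not the controller's actual code:

```go
package main

import "fmt"

// needsOffloadMemoryWarning sketches the reconciler's check described in
// this PR: emit a Warning when an offload flag is set but neither memory
// nor hostMemory is declared. Parameter names are illustrative; memory
// quantities are simplified to strings ("" means unset).
func needsOffloadMemoryWarning(moeCPUOffload bool, moeCPULayers int, memory, hostMemory string) bool {
	offloading := moeCPUOffload || moeCPULayers > 0
	return offloading && memory == "" && hostMemory == ""
}

func main() {
	fmt.Println(needsOffloadMemoryWarning(true, 0, "", ""))     // offloading, nothing declared -> true
	fmt.Println(needsOffloadMemoryWarning(true, 0, "", "48Gi")) // hostMemory declared -> false
	fmt.Println(needsOffloadMemoryWarning(false, 0, "", ""))    // no offload flags -> false
}
```

The four test-plan combinations (both flags × memory/hostMemory set/unset) fall directly out of this truth table, which is why the helper is cheap to cover exhaustively.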