feat: add hybrid GPU/CPU offloading support for MoE models#281

Merged
Defilan merged 2 commits into main from feat/hybrid-gpu-cpu-offloading
Apr 17, 2026
Conversation

@Defilan Defilan commented Apr 17, 2026

Summary

Adds first-class CRD fields for hybrid GPU/CPU inference (Phase 1 of #280), enabling large MoE models to run on VRAM-constrained hardware by offloading expert weights and KV cache to system RAM.

  • moeCPUOffload — offloads all MoE expert layers to CPU (--cpu-moe)
  • moeCPULayers — offloads N MoE layers to CPU (--n-cpu-moe)
  • noKvOffload — keeps KV cache in system RAM (--no-kv-offload)
  • hostMemory — explicit system RAM request for hybrid pods, takes precedence over memory for accurate K8s scheduling
  • Emits a Warning event when offloading is enabled but memory is not declared
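As a sketch, a manifest using the new fields might look like the following. Only the offloading fields themselves come from this PR; the apiVersion, kind, and surrounding field names are assumptions for illustration:

```yaml
# Hypothetical manifest; only moeCPUOffload, noKvOffload, and hostMemory
# are fields added by this PR — everything else is illustrative.
apiVersion: llmkube.io/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: qwen3-30b-hybrid
spec:
  moeCPUOffload: true        # all expert layers on CPU (--cpu-moe)
  noKvOffload: true          # KV cache stays in system RAM (--no-kv-offload)
  resources:
    hostMemory: 64Gi         # used as the pod's RAM request for scheduling
```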

Use case

Qwen3-30B-A3B (30B total parameters, ~3B active per token) can run on dual RTX 5060 Ti 16GB GPUs with 90K context by keeping attention layers on the GPUs while expert weights sit in system RAM; previously this was only possible via extraArgs.

Test plan

  • 9 unit tests for flag generation (3 per flag: true/nil/false)
  • 6 tests for needsOffloadMemoryWarning helper (both flags × memory/hostMemory set/unset)
  • 3 tests for hostMemory resource precedence
  • Updated full-configuration integration test with all new fields
  • make test passes (controller coverage 82.6%)
  • make generate && make manifests clean
  • Helm chart CRD synced

Ref #280 (Phase 1 only — Phases 2 & 3 tracked in the issue)
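The flag generation and the memory warning described above can be sketched as follows. Type and function names here (`OffloadSpec`, `buildOffloadArgs`, `needsOffloadMemoryWarning` matching the test plan's helper name) are illustrative, not the operator's actual code; only the flag names and the nil/false semantics come from this PR:

```go
package main

import "fmt"

// OffloadSpec is a hypothetical mirror of the new CRD fields.
type OffloadSpec struct {
	MoECPUOffload *bool  // --cpu-moe
	MoECPULayers  *int32 // --n-cpu-moe
	NoKvOffload   *bool  // --no-kv-offload
	Memory        string
	HostMemory    string
}

// buildOffloadArgs translates the typed fields into llama.cpp flags,
// emitting nothing when a field is nil or false (matching the
// true/nil/false cases in the test plan).
func buildOffloadArgs(s OffloadSpec) []string {
	var args []string
	if s.MoECPUOffload != nil && *s.MoECPUOffload {
		args = append(args, "--cpu-moe")
	}
	if s.MoECPULayers != nil {
		args = append(args, "--n-cpu-moe", fmt.Sprint(*s.MoECPULayers))
	}
	if s.NoKvOffload != nil && *s.NoKvOffload {
		args = append(args, "--no-kv-offload")
	}
	return args
}

// needsOffloadMemoryWarning reports whether offloading is requested but
// neither memory nor hostMemory is declared — the condition that makes
// the reconciler emit a Warning event.
func needsOffloadMemoryWarning(s OffloadSpec) bool {
	offloading := (s.MoECPUOffload != nil && *s.MoECPUOffload) || s.MoECPULayers != nil
	return offloading && s.Memory == "" && s.HostMemory == ""
}

func main() {
	on := true
	layers := int32(24)
	s := OffloadSpec{MoECPUOffload: &on, MoECPULayers: &layers}
	fmt.Println(buildOffloadArgs(s))          // [--cpu-moe --n-cpu-moe 24]
	fmt.Println(needsOffloadMemoryWarning(s)) // true: no RAM declared
}
```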

Defilan added 2 commits April 16, 2026 20:37
Add first-class CRD fields for hybrid GPU/CPU inference, enabling large
Mixture-of-Experts models to run on VRAM-constrained hardware by
offloading expert weights and KV cache to system RAM.

New InferenceServiceSpec fields:
- moeCPUOffload: offload all MoE expert layers to CPU (--cpu-moe)
- moeCPULayers: offload N MoE layers to CPU (--n-cpu-moe)
- noKvOffload: keep KV cache in system RAM (--no-kv-offload)

New InferenceResourceRequirements field:
- hostMemory: explicit system RAM request for hybrid pods, takes
  precedence over memory to ensure accurate K8s scheduling

The reconciler emits a Warning event when offloading is enabled but
neither memory nor hostMemory is set, preventing silent OOM kills.

Ref #280

Signed-off-by: Christopher Maher <chris@mahercode.io>
Replace deprecated record.EventRecorder (GetEventRecorderFor) with
events.EventRecorder (GetEventRecorder) to satisfy staticcheck SA1019.

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit 2287f66 into main Apr 17, 2026
16 checks passed
@Defilan Defilan deleted the feat/hybrid-gpu-cpu-offloading branch April 17, 2026 04:08
@github-actions github-actions bot mentioned this pull request Apr 17, 2026
Defilan added a commit that referenced this pull request Apr 17, 2026
Phase 2 of hybrid GPU/CPU offloading support (ref #280). Adds three new
InferenceServiceSpec fields for fine-grained inference tuning:

- tensorOverrides: regex-based tensor placement (--override-tensor)
- batchSize: prompt processing batch size (--batch-size)
- uBatchSize: decoding micro-batch size (--ubatch-size)

These controls are particularly useful for hybrid MoE workloads where
batch size tuning amortizes PCIe overhead and tensor overrides enable
expert-level placement decisions beyond the typed moeCPUOffload flag.

This feature was implemented by Qwen3.6-35B-A3B running via llama.cpp
with hybrid CPU/GPU offloading on dual RTX 5060 Ti GPUs, deployed by
LLMKube itself with the Phase 1 offloading support from #281.

Signed-off-by: Christopher Maher <chris@mahercode.io>
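A minimal sketch of how the Phase 2 fields could map onto llama.cpp arguments, in the same spirit as the Phase 1 flag generation. The type and helper names are assumptions; only the flag names come from the commit above, and the tensor-override value shown follows llama.cpp's `pattern=backend` convention for pinning expert tensors to CPU:

```go
package main

import "fmt"

// TuningSpec is a hypothetical mirror of the Phase 2 fields.
type TuningSpec struct {
	TensorOverrides []string // --override-tensor, one "regex=backend" pair each
	BatchSize       *int32   // --batch-size
	UBatchSize      *int32   // --ubatch-size
}

// buildTuningArgs emits flags only for the fields that are set.
func buildTuningArgs(s TuningSpec) []string {
	var args []string
	for _, ot := range s.TensorOverrides {
		args = append(args, "--override-tensor", ot)
	}
	if s.BatchSize != nil {
		args = append(args, "--batch-size", fmt.Sprint(*s.BatchSize))
	}
	if s.UBatchSize != nil {
		args = append(args, "--ubatch-size", fmt.Sprint(*s.UBatchSize))
	}
	return args
}

func main() {
	b, ub := int32(2048), int32(512)
	s := TuningSpec{
		// keep MoE expert tensors in system RAM; illustrative pattern
		TensorOverrides: []string{`\.ffn_.*_exps\.=CPU`},
		BatchSize:       &b,
		UBatchSize:      &ub,
	}
	fmt.Println(buildTuningArgs(s))
}
```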
