feat: add hybrid GPU/CPU offloading support for MoE models #281
Merged
Conversation
Add first-class CRD fields for hybrid GPU/CPU inference, enabling large Mixture-of-Experts models to run on VRAM-constrained hardware by offloading expert weights and KV cache to system RAM.

New InferenceServiceSpec fields:
- moeCPUOffload: offload all MoE expert layers to CPU (--cpu-moe)
- moeCPULayers: offload N MoE layers to CPU (--n-cpu-moe)
- noKvOffload: keep KV cache in system RAM (--no-kv-offload)

New InferenceResourceRequirements field:
- hostMemory: explicit system RAM request for hybrid pods; takes precedence over memory to ensure accurate K8s scheduling

The reconciler emits a Warning event when offloading is enabled but neither memory nor hostMemory is set, preventing silent OOM kills.

Ref #280

Signed-off-by: Christopher Maher <chris@mahercode.io>
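A manifest using the new fields might look like the following sketch. The apiVersion, kind, metadata, and model-related layout are assumptions for illustration, not copied from the repo; only the offload fields themselves come from this PR:

```yaml
# Hypothetical InferenceService manifest; group/version and surrounding
# structure are assumed. Offload fields map to the llama.cpp flags noted.
apiVersion: serving.llmkube.io/v1alpha1   # assumed group/version
kind: InferenceService
metadata:
  name: qwen3-30b-hybrid
spec:
  moeCPULayers: 24      # offload 24 MoE layers to CPU (--n-cpu-moe 24)
  noKvOffload: true     # keep KV cache in system RAM (--no-kv-offload)
  resources:
    hostMemory: 48Gi    # system RAM request; takes precedence over memory
```

Setting moeCPUOffload: true instead of moeCPULayers would offload all expert layers (--cpu-moe). Declaring hostMemory here is what keeps the scheduler honest about the pod's real RAM footprint.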
Replace deprecated record.EventRecorder (GetEventRecorderFor) with events.EventRecorder (GetEventRecorder) to satisfy staticcheck SA1019. Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added a commit that referenced this pull request on Apr 17, 2026:
Phase 2 of hybrid GPU/CPU offloading support (ref #280). Adds three new InferenceServiceSpec fields for fine-grained inference tuning:

- tensorOverrides: regex-based tensor placement (--override-tensor)
- batchSize: prompt processing batch size (--batch-size)
- uBatchSize: decoding micro-batch size (--ubatch-size)

These controls are particularly useful for hybrid MoE workloads, where batch size tuning amortizes PCIe overhead and tensor overrides enable expert-level placement decisions beyond the typed moeCPUOffload flag.

This feature was implemented by Qwen3.6-35B-A3B running via llama.cpp with hybrid CPU/GPU offloading on dual RTX 5060 Ti GPUs, deployed by LLMKube itself with the Phase 1 offloading support from #281.

Signed-off-by: Christopher Maher <chris@mahercode.io>
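The Phase 2 fields could be combined with the Phase 1 offload fields like this. This is a hedged fragment: the field placement under spec and the specific values and regex are illustrative assumptions, not from the repo:

```yaml
# Hypothetical Phase 2 tuning fragment; values and regex are illustrative.
spec:
  batchSize: 2048              # --batch-size: larger batches amortize PCIe overhead
  uBatchSize: 512              # --ubatch-size: decoding micro-batch size
  tensorOverrides: "ffn_.*_exps=CPU"  # --override-tensor: pin expert tensors to CPU
```

The tensorOverrides regex shown would keep feed-forward expert tensors in system RAM while everything else follows the default placement, which is the kind of expert-level decision the typed moeCPUOffload flag cannot express.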
Summary
Adds first-class CRD fields for hybrid GPU/CPU inference (Phase 1 of #280), enabling large MoE models to run on VRAM-constrained hardware by offloading expert weights and KV cache to system RAM.
- moeCPUOffload — offloads all MoE expert layers to CPU (--cpu-moe)
- moeCPULayers — offloads N MoE layers to CPU (--n-cpu-moe)
- noKvOffload — keeps KV cache in system RAM (--no-kv-offload)
- hostMemory — explicit system RAM request for hybrid pods; takes precedence over memory for accurate K8s scheduling
- Warning event when offloading is enabled but memory is not declared

Use case
A Qwen3-30B-A3B (30B total parameters, ~3B active per token) can run on dual RTX 5060 Ti 16GB GPUs with 90K context by keeping attention layers on GPU while expert weights sit in system RAM — previously only possible via extraArgs.

Test plan
- needsOffloadMemoryWarning helper (both flags × memory/hostMemory set/unset)
- hostMemory resource precedence
- make test passes (controller coverage 82.6%)
- make generate && make manifests clean

Ref #280 (Phase 1 only — Phases 2 & 3 tracked in the issue)
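The warning condition exercised by the test plan can be sketched as a small predicate. The signature and logic below are assumptions inferred from the PR description, not the controller's actual code:

```go
package main

import "fmt"

// needsOffloadMemoryWarning sketches the reconciler's check described in
// this PR: emit a Warning when an offload flag is set but neither memory
// nor hostMemory is declared. Parameter names are illustrative; memory
// quantities are simplified to strings ("" means unset).
func needsOffloadMemoryWarning(moeCPUOffload bool, moeCPULayers int, memory, hostMemory string) bool {
	offloading := moeCPUOffload || moeCPULayers > 0
	return offloading && memory == "" && hostMemory == ""
}

func main() {
	fmt.Println(needsOffloadMemoryWarning(true, 0, "", ""))     // offloading, nothing declared -> true
	fmt.Println(needsOffloadMemoryWarning(true, 0, "", "48Gi")) // hostMemory declared -> false
	fmt.Println(needsOffloadMemoryWarning(false, 0, "", ""))    // no offload flags -> false
}
```

The four test-plan combinations (both flags × memory/hostMemory set/unset) fall directly out of this truth table, which is why the helper is cheap to cover exhaustively.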