Add a CPU profile and a kind-based reconciler CPU canary by EdHasNoLife · Pull Request #21 · cachebox-project/inference-cache

EdHasNoLife · 2026-05-27T21:36:44Z

Summary

Lets the CacheBackend reconciler stand up a healthy, serving backend without a GPU, so the substrate can be validated off-GPU (closing the one C2 DoD item — "apply CR → healthy pods" — that previously needed a GPU host).

CPU profile (backendConfig.profile: cpu): renders a GPU-free vLLM engine — no nvidia.com/gpu limit and no LMCache connector, but prefix caching + the KV-event publisher stay on, plus CPU flags (--dtype=bfloat16 --max-model-len=8192 --enforce-eager) and VLLM_CPU_KVCACHE_SPACE, defaulting to the vLLM CPU image + a tiny ungated model (Qwen/Qwen2.5-0.5B-Instruct). The default gpu profile is unchanged and still owns real LMCache offload (which requires a GPU). Selection is case-insensitive; unknown/empty falls back to gpu.
Kind-based reconciler CPU canary (docs/reference-stack/scripts/canary_c2_reconcile.sh + an on-demand workflow): brings up kind, runs the controller, applies a profile=cpu CacheBackend, and asserts the reconciler reports Ready with a published endpoint, an engine prefix-cache hit through the Service, and owner-ref GC on delete — the real-pod path envtest can't cover.
A profile=cpu sample and the backendConfig key reference in docs/design/cachebackend-api.md.

This complements PR #17 (the C1-chain CPU canary); once C1 lands on main, the two canaries can be unified.

Test plan

make ci / make pre-pr green; GPU rendering is byte-identical (existing builder + controller unit tests unchanged and passing).
New unit tests: CPU profile render (no GPU limit, no LMCache connector, CPU flags/env, optional HF_TOKEN), CPU overrides, and default-profile-is-GPU.
Canary script bash -n clean; focused diff review found no defects.
Full canary run (kind + multi-GB vLLM CPU image + ~10 GiB RAM) is on-demand, not a CI gate — run via the c2-reconciler-canary workflow dispatch or locally.

Lets the CacheBackend reconciler stand up a healthy, serving backend without a GPU so the substrate can be validated off-GPU. - backendConfig profile=cpu renders a GPU-free vLLM engine: no nvidia.com/gpu limit and no LMCache connector, but prefix caching and the KV-event publisher stay on, plus CPU flags (--dtype/--max-model-len/--enforce-eager) and VLLM_CPU_KVCACHE_SPACE, defaulting to the vLLM CPU image + a tiny ungated model. The default (gpu) profile is unchanged and still owns real LMCache offload, which requires a GPU. Selection is case-insensitive; unknown/empty falls back to gpu. Covered by unit tests; the GPU rendering is byte-identical. - docs/reference-stack/scripts/canary_c2_reconcile.sh + an on-demand workflow: bring up kind, run the controller, apply a profile=cpu CacheBackend, and assert the reconciler reports Ready with a published endpoint, an engine prefix-cache hit through the Service, and owner-ref GC on delete — the real-pod path envtest can't cover. - A profile=cpu sample and the backendConfig key reference in the design doc.

github-actions · 2026-05-27T21:39:00Z

Codex review

Blocking

pkg/adapters/backend/lmcache.go adds a CPU profile that changes GPU resources and /dev/shm sizing, but the controller update path only reconciles image/command/args/env. See internal/controller/cachebackend_controller.go and internal/controller/cachebackend_controller.go. If an existing backend is changed from default gpu to cpu, the live Deployment keeps the nvidia.com/gpu limit and 8Gi shm volume; switching back from CPU also will not add the GPU limit. That breaks the advertised profile=cpu behavior for normal spec updates. Reconcile the profile-owned container resources/volumes, or force a safe pod-template replacement, and add a controller test for GPU<->CPU profile changes.

Should-fix

pkg/adapters/backend/lmcache.go defaults CPU profile to vllm/vllm-openai-cpu:latest, while the project’s own version table documents only arch-tagged CPU images at docs/reference-stack/VERSIONS.md, and the canary has to select latest-arm64/latest-x86_64 explicitly at docs/reference-stack/scripts/canary_c2_reconcile.sh. A profile: cpu CR without an image override will render a default inconsistent with the validated stack. Either use a known multi-arch default, require/validate the image override for CPU, or document that no safe default exists.

Nit

None.

No vendor-neutral naming issues found in the PR diff, and there are no proto/contract changes to mirror.

Verdict: changes-requested.

Address review feedback on the CPU profile: - The update path now reconciles the container Resources (the GPU limit differs by profile) and the pod Volumes (the shm size differs), so switching an existing backend gpu<->cpu actually reaches the live Deployment instead of only taking effect on create. Both fields are builder-owned and not API-server-defaulted, so reconciling them stays churn-free. Adds a controller test for the profile switch. - profile=cpu now requires backendConfig.image: the upstream CPU image is arch- tagged (latest-arm64 / latest-x86_64) with no safe multi-arch default, so the builder errors rather than rendering a bogus :latest. Tests + design doc updated.

github-actions · 2026-05-27T21:45:21Z

Codex review

Blocking
None.

Should-Fix

config/samples/cachebackend-lmcache-cpu.yaml:25 hardcodes vllm/vllm-openai-cpu:latest-arm64 in a generic sample, while line 9 says x86 hosts must use latest-x86_64. Since config/samples is user-facing applyable config, this will fail on common x86_64 kind/GitHub runner environments unless edited first. The canary script handles this correctly with arch detection; the sample should either use a safer generic convention, split arch-specific samples, or make the required substitution impossible to miss before apply.

Nit
None.

I did not find vendor-neutral naming violations, proto/contract drift, or generated-code issues in this diff. I could not run the focused Go tests because this review sandbox is read-only and Go could not create /tmp/go-cache.

Verdict: changes-requested.

The CPU image is arch-tagged with no multi-arch tag, so a hardcoded latest-arm64 would fail on x86 hosts. Ship a deliberate :latest-<arch> placeholder (matching the reference stack's non-applyable-image convention) so applying as-is fails fast rather than silently running the wrong arch.

github-actions · 2026-05-27T21:50:36Z

Codex review

Blocking: None.

Should-fix: None.

Nit: None.

Verdict: approve.

The PR stays vendor-neutral in the changed surfaces, does not touch proto/gRPC contracts, keeps the CPU-specific behavior isolated in the backend adapter/controller rendering path, and adds focused unit coverage plus an on-demand real-cluster canary. I did not run the test suite in this read-only review environment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a CPU profile and a kind-based reconciler CPU canary#21

Add a CPU profile and a kind-based reconciler CPU canary#21
EdHasNoLife wants to merge 3 commits into
mainfrom
sunxuedward/c2-cpu-profile-canary

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EdHasNoLife commented May 27, 2026

Summary

Test plan

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant