Skip to content

Add a CPU profile and a kind-based reconciler CPU canary#21

Open
EdHasNoLife wants to merge 3 commits into
mainfrom
sunxuedward/c2-cpu-profile-canary
Open

Add a CPU profile and a kind-based reconciler CPU canary#21
EdHasNoLife wants to merge 3 commits into
mainfrom
sunxuedward/c2-cpu-profile-canary

Conversation

@EdHasNoLife
Copy link
Copy Markdown
Collaborator

Summary

Lets the CacheBackend reconciler stand up a healthy, serving backend without a GPU, so the substrate can be validated off-GPU (closing the one C2 DoD item — "apply CR → healthy pods" — that previously needed a GPU host).

  • CPU profile (backendConfig.profile: cpu): renders a GPU-free vLLM engine — no nvidia.com/gpu limit and no LMCache connector, but prefix caching + the KV-event publisher stay on, plus CPU flags (--dtype=bfloat16 --max-model-len=8192 --enforce-eager) and VLLM_CPU_KVCACHE_SPACE, defaulting to the vLLM CPU image + a tiny ungated model (Qwen/Qwen2.5-0.5B-Instruct). The default gpu profile is unchanged and still owns real LMCache offload (which requires a GPU). Selection is case-insensitive; unknown/empty falls back to gpu.
  • Kind-based reconciler CPU canary (docs/reference-stack/scripts/canary_c2_reconcile.sh + an on-demand workflow): brings up kind, runs the controller, applies a profile=cpu CacheBackend, and asserts the reconciler reports Ready with a published endpoint, an engine prefix-cache hit through the Service, and owner-ref GC on delete — the real-pod path envtest can't cover.
  • A profile=cpu sample and the backendConfig key reference in docs/design/cachebackend-api.md.

This complements PR #17 (the C1-chain CPU canary); once C1 lands on main, the two canaries can be unified.

Test plan

  • make ci / make pre-pr green; GPU rendering is byte-identical (existing builder + controller unit tests unchanged and passing).
  • New unit tests: CPU profile render (no GPU limit, no LMCache connector, CPU flags/env, optional HF_TOKEN), CPU overrides, and default-profile-is-GPU.
  • Canary script bash -n clean; focused diff review found no defects.
  • Full canary run (kind + multi-GB vLLM CPU image + ~10 GiB RAM) is on-demand, not a CI gate — run via the c2-reconciler-canary workflow dispatch or locally.

Lets the CacheBackend reconciler stand up a healthy, serving backend without a
GPU so the substrate can be validated off-GPU.

- backendConfig profile=cpu renders a GPU-free vLLM engine: no nvidia.com/gpu
  limit and no LMCache connector, but prefix caching and the KV-event publisher
  stay on, plus CPU flags (--dtype/--max-model-len/--enforce-eager) and
  VLLM_CPU_KVCACHE_SPACE, defaulting to the vLLM CPU image + a tiny ungated model.
  The default (gpu) profile is unchanged and still owns real LMCache offload,
  which requires a GPU. Selection is case-insensitive; unknown/empty falls back
  to gpu. Covered by unit tests; the GPU rendering is byte-identical.
- docs/reference-stack/scripts/canary_c2_reconcile.sh + an on-demand workflow:
  bring up kind, run the controller, apply a profile=cpu CacheBackend, and assert
  the reconciler reports Ready with a published endpoint, an engine prefix-cache
  hit through the Service, and owner-ref GC on delete — the real-pod path envtest
  can't cover.
- A profile=cpu sample and the backendConfig key reference in the design doc.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking

  • pkg/adapters/backend/lmcache.go adds a CPU profile that changes GPU resources and /dev/shm sizing, but the controller update path only reconciles image/command/args/env. See internal/controller/cachebackend_controller.go and internal/controller/cachebackend_controller.go. If an existing backend is changed from default gpu to cpu, the live Deployment keeps the nvidia.com/gpu limit and 8Gi shm volume; switching back from CPU also will not add the GPU limit. That breaks the advertised profile=cpu behavior for normal spec updates. Reconcile the profile-owned container resources/volumes, or force a safe pod-template replacement, and add a controller test for GPU<->CPU profile changes.

Should-fix

Nit

  • None.

No vendor-neutral naming issues found in the PR diff, and there are no proto/contract changes to mirror.

Verdict: changes-requested.

Address review feedback on the CPU profile:
- The update path now reconciles the container Resources (the GPU limit differs by
  profile) and the pod Volumes (the shm size differs), so switching an existing
  backend gpu<->cpu actually reaches the live Deployment instead of only taking
  effect on create. Both fields are builder-owned and not API-server-defaulted, so
  reconciling them stays churn-free. Adds a controller test for the profile switch.
- profile=cpu now requires backendConfig.image: the upstream CPU image is arch-
  tagged (latest-arm64 / latest-x86_64) with no safe multi-arch default, so the
  builder errors rather than rendering a bogus :latest. Tests + design doc updated.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-Fix

  • config/samples/cachebackend-lmcache-cpu.yaml:25 hardcodes vllm/vllm-openai-cpu:latest-arm64 in a generic sample, while line 9 says x86 hosts must use latest-x86_64. Since config/samples is user-facing applyable config, this will fail on common x86_64 kind/GitHub runner environments unless edited first. The canary script handles this correctly with arch detection; the sample should either use a safer generic convention, split arch-specific samples, or make the required substitution impossible to miss before apply.

Nit
None.

I did not find vendor-neutral naming violations, proto/contract drift, or generated-code issues in this diff. I could not run the focused Go tests because this review sandbox is read-only and Go could not create /tmp/go-cache.

Verdict: changes-requested.

The CPU image is arch-tagged with no multi-arch tag, so a hardcoded latest-arm64
would fail on x86 hosts. Ship a deliberate :latest-<arch> placeholder (matching
the reference stack's non-applyable-image convention) so applying as-is fails
fast rather than silently running the wrong arch.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking: None.

Should-fix: None.

Nit: None.

Verdict: approve.

The PR stays vendor-neutral in the changed surfaces, does not touch proto/gRPC contracts, keeps the CPU-specific behavior isolated in the backend adapter/controller rendering path, and adds focused unit coverage plus an on-demand real-cluster canary. I did not run the test suite in this read-only review environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant