Add full-chain CPU e2e canary for the cache substrate #17

Draft
EdHasNoLife wants to merge 3 commits into sunxuedward/cac-21-c1-vllm-kv-event-subscriber from sunxuedward/c1-cpu-e2e-canary

@EdHasNoLife
Collaborator

Summary

A one-shot, GPU-free canary that exercises the whole substrate end to end and asserts it works:

CPU vLLM engine ──ZMQ KV events──▶ kvevent-subscriber ──gRPC──▶ policy server ──▶ index

docs/reference-stack/scripts/canary_e2e.sh builds server + kvevent-subscriber, starts a CPU vLLM engine (with KV events), the policy server, and the subscriber, fires a repeated long prefix, then asserts:

  1. an engine prefix-cache hit (vllm:prefix_cache_hits_total increases), and
  2. the index populated end-to-end (inferencecache_index_entries{model} > 0).
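The two assertions above both come down to reading a value out of Prometheus text-format output and comparing it. A minimal sketch of how that could look follows; the helper names `metric_sum` and `assert_increased` are hypothetical and not taken from canary_e2e.sh, and the parser assumes samples carry no trailing timestamp.

```shell
#!/usr/bin/env bash
# Sum every sample of one metric from Prometheus text-format input on stdin.
# Works for bare series (name 3) and labelled series (name{a="b"} 3).
# Assumption: no optional timestamp follows the value, so the last field is the value.
metric_sum() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                                   # skip HELP/TYPE comment lines
    {
      i = index($1, "{")                            # position of the label set, if any
      series = (i > 0) ? substr($1, 1, i - 1) : $1  # metric name without labels
      if (series == name) total += $NF
    }
    END { printf "%d\n", total }                    # prints 0 when the metric is absent
  '
}

# Succeed only if the "after" reading is strictly greater than the "before" reading.
assert_increased() {
  local label="$1" before="$2" after="$3"
  if [ "$after" -gt "$before" ]; then
    echo "PASS: $label $before -> $after"
  else
    echo "FAIL: $label did not increase ($before -> $after)" >&2
    return 1
  fi
}

# Hypothetical call site (ENGINE_URL is a placeholder):
#   before=$(curl -fsS "$ENGINE_URL/metrics" | metric_sum vllm:prefix_cache_hits_total)
#   ... fire the repeated long prefix ...
#   after=$(curl -fsS "$ENGINE_URL/metrics" | metric_sum vllm:prefix_cache_hits_total)
#   assert_increased prefix_cache_hits "$before" "$after" || exit 1
```

Summing across label sets keeps the check independent of which model label the engine attaches to the series.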

It manages/cleans up the engine container, exits non-zero on failure, and is arch-aware (arm64/x86_64 image tag). Documented in the reference-stack README.
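Arch-awareness of this kind is usually a small mapping from `uname -m` to a tag suffix. The sketch below is an assumption about the shape of that logic, not the script's actual code; `arch_tag` is a hypothetical name and the image reference in the comment is a placeholder.

```shell
#!/usr/bin/env bash
# Map a machine name (as printed by `uname -m`) to an image-tag suffix.
# Unknown architectures fail loudly instead of pulling the wrong image.
arch_tag() {
  case "$1" in
    x86_64|amd64)  echo "x86_64" ;;
    arm64|aarch64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical usage; the image name is a placeholder, not the one the canary pulls:
#   IMAGE="example.invalid/vllm-cpu:latest-$(arch_tag "$(uname -m)")" || exit 1
```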

How it's run

On-demand (not a blocking CI gate): it needs Docker and a Docker VM with ~10+ GiB RAM, and it pulls the multi-GB vLLM CPU image. Run locally, or wire into a scheduled/dispatch job on a Docker-capable runner.

Verification

Ran locally end-to-end: engine healthy, requests 200/200, prefix_cache_hits 0→2560, index_entries{model=canary}=20, PASS.

Note

Stacked on #15 (C1 — the kvevent-subscriber the full chain needs). Base is the C1 branch; will retarget to main once C1 merges. make pre-pr green.

scripts/canary_e2e.sh brings up the whole stack on a CPU vLLM engine (no GPU) —
engine → ZMQ KV events → kvevent-subscriber → policy server → index — fires a
repeated long prefix, and asserts both an engine prefix-cache hit
(vllm:prefix_cache_hits increases) and that the server index populated
(inferencecache_index_entries{model} > 0). Builds the binaries, manages/cleans up
the engine container, and exits non-zero on failure. On-demand (needs Docker +
the vLLM CPU image + adequate VM RAM), not a blocking CI gate. Documented in the
reference-stack README.

Verified locally: prefix_cache_hits 0->2560, index_entries{model=canary}=20, PASS.
@github-actions

Codex review

Blocking
None.

Should-fix
None.

Nit
None.

Verdict
Approve.

I reviewed the full PR diff and surrounding server/subscriber/index code. No vendor-neutral naming violations, proto/CRD contract changes, or fail-open semantic regressions were introduced. Static checks passed: bash -n, make verify-naming, and make verify-no-internal-refs. Shellcheck only reported informational A && B || true cleanup-style notes, not a correctness issue here. Full Docker/vLLM canary was not run in this review environment.

cpu-substrate-canary.yml runs docs/reference-stack/scripts/canary_e2e.sh on a
nightly cron and on manual dispatch (with a runner-label input to target a
self-hosted Docker host). Not a per-PR gate — it pulls a multi-GB image and needs
~10 GiB RAM. Uploads server/subscriber logs on failure. Depends on
cmd/kvevent-subscriber, so it only functions once C1 is on the default branch.
@EdHasNoLife marked this pull request as draft May 27, 2026 19:18
@github-actions

Codex review

Blocking
None.

Should-fix

  • docs/reference-stack/scripts/canary_e2e.sh: the canary logs the two completion HTTP status codes but never asserts they are 200. A run can still reach the later metric assertions even if one request failed, which weakens the “full-chain works” signal described in the PR body. Capture both codes and fail immediately unless both are 200.
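One way the suggested status check could look is sketched below. `require_all_200` is a hypothetical helper, and the `ENGINE_URL` and `PAYLOAD` names and the `/v1/completions` path in the comment are assumptions about the call site rather than details taken from the script.

```shell
#!/usr/bin/env bash
# Fail unless every HTTP status code passed as an argument is exactly 200.
# Reports the position of the first failing request on stderr.
require_all_200() {
  local i=0 code
  for code in "$@"; do
    i=$((i + 1))
    if [ "$code" != "200" ]; then
      echo "FAIL: completion request $i returned HTTP $code" >&2
      return 1
    fi
  done
}

# Hypothetical call site; curl prints only the status code via --write-out:
#   code1=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H 'Content-Type: application/json' -d "$PAYLOAD" "$ENGINE_URL/v1/completions")
#   code2=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H 'Content-Type: application/json' -d "$PAYLOAD" "$ENGINE_URL/v1/completions")
#   require_all_200 "$code1" "$code2" || exit 1
```

Comparing as strings means curl's `000` (no response received) is also treated as a failure.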

Nit

  • docs/reference-stack/scripts/canary_e2e.sh: cleanup/log upload only preserves server and subscriber logs. Engine startup/request failures are likely to be the common failure mode, so dumping or saving docker logs "$CONTAINER" on failure would make the scheduled canary much easier to diagnose.
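A sketch of how engine logs could be preserved on failure, assuming an EXIT trap drives cleanup. Only `$CONTAINER` comes from the review comment; `LOG_DIR`, the default values, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
CONTAINER="${CONTAINER:-canary-engine}"  # placeholder default, not the script's real name
LOG_DIR="${LOG_DIR:-canary-logs}"        # hypothetical directory for uploaded logs

# Save the engine container's output next to the server/subscriber logs.
save_engine_logs() {
  mkdir -p "$LOG_DIR"
  # `docker logs` replays the container's stdout and stderr; keep both.
  # `|| true` keeps a missing container from masking the original failure.
  docker logs "$CONTAINER" >"$LOG_DIR/engine.log" 2>&1 || true
}

# Exit handler: keep engine logs only when the run failed, then always remove the container.
on_exit() {
  local status=$?
  [ "$status" -eq 0 ] || save_engine_logs
  docker rm -f "$CONTAINER" >/dev/null 2>&1 || true
  exit "$status"
}

# In the script proper this would be registered once, near the top:
#   trap on_exit EXIT
```

Capturing the logs before `docker rm -f` matters, since the container's output is gone once it is removed.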

Vendor-neutral check passed for the PR diff; no proto/CRD/core API contract changes detected.

Verdict: changes-requested.

@github-actions

Codex review

Blocking

  • docs/reference-stack/scripts/canary_e2e.sh: go build -o bin/server and the next line assume bin/ already exists. It is ignored by .gitignore and not tracked, so a fresh GitHub Actions checkout will fail immediately with a missing parent directory. Add mkdir -p bin before building, or build through a target that guarantees the output directory exists. This also breaks the new workflow invocation at .github/workflows/cpu-substrate-canary.yml.
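The suggested fix could be as small as the sketch below. `build_binaries` is a hypothetical wrapper, and the `./cmd/server` package path is an assumption (only `cmd/kvevent-subscriber` is named in this PR). `mkdir -p` is idempotent, so it is safe whether or not the directory already exists.

```shell
#!/usr/bin/env bash
# Build both binaries into an output directory that is guaranteed to exist.
build_binaries() {
  local out_dir="${1:-bin}"
  mkdir -p "$out_dir"   # no-op if already present; creates it on a fresh checkout
  go build -o "$out_dir/server" ./cmd/server                           # package path assumed
  go build -o "$out_dir/kvevent-subscriber" ./cmd/kvevent-subscriber
}

# Hypothetical usage from the repository root:
#   build_binaries bin || exit 1
```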

Should-fix

None.

Nit

None.

Verdict: changes-requested.
