Add CacheIndex CRD + controller status surface (B6 status half) #16
EdHasNoLife wants to merge 16 commits into
Expose the server's in-memory cache aggregate as a cluster-scoped,
status-only CacheIndex CR (`kubectl get cacheindex`) — the other half of
B6 (CAC-50).
- api/v1alpha1: cluster-scoped, status-only CacheIndex CRD. Status mirrors
the tech spec — replicas[]{id,cacheMemoryBytes,hitRate,pressure,
lastUpdate}, prefixes.summary{total,hot=0}, tenants[]{id,memoryUsed,
hitRate}. Rates are decimal strings (CRDs avoid floats for cross-language
portability). Generated manifests + deepcopy + RBAC committed.
- pkg/index: Snapshot() returns the cluster-wide aggregate (latest stats
per replica; tenant memory/hit-rate dedup replicas within a tenant).
- pkg/server: internal HTTP /snapshot endpoint serving the snapshot as
JSON (metadata only — replica/tenant stats + prefix counts, never KV or
prompt data).
- internal/controller: CacheIndexPoller, a leader-elected manager Runnable
that scrapes /snapshot on a timer, maintains the singleton cluster-default
CR, and writes status only when the meaningful aggregate changed
(timestamps ignored for change detection to avoid churn).
- cmd/controller: wire the poller with --server-snapshot-url /
--cacheindex-refresh-interval flags.
Design per PROJECT decision: server exposes the data, controller writes the
CR (reuses its k8s client/RBAC); single server instance assumed (Phase 1).
Codex review
Should-Fix
Nit
Verdict: changes-requested. No vendor-neutral naming or proto-contract violations found.
Address PR review:
- refresh() now ensures the cluster-default CacheIndex exists BEFORE scraping /snapshot, so `kubectl get cacheindex` shows the CR even when the server is unreachable (status fills on the next successful tick). Regression test covers the server-down path.
- grpc-contract.md: note the CacheIndex status surface + /snapshot are now implemented (no longer a future follow-up).
- cmd/server flag help: list /snapshot alongside the other HTTP endpoints.
Addressed in fa77fe3.
Codex review
Blocking
Should-Fix
Nit
Verdict
I verified the diff for vendor-neutral naming, internal tracked references, and whitespace drift.
Address PR review nit: the poller never updates/patches/deletes the CacheIndex resource itself — only its status subresource. Drop those verbs from the main-resource RBAC marker, keeping get;list;watch;create (list+watch are required because the manager's cached client backs Get with an informer). Regenerated role.yaml.
Addressed in 87effb8. Tightened the CacheIndex RBAC: the main resource is now
Codex review
Blocking
Should-fix
Nit
Verdict: changes-requested. I could not run tests locally because the review environment filesystem is read-only, including
Address PR review:
- Blocking: the CacheIndex status used a flattened status.prefixes.{total,hot};
the locked contract is status.prefixes.summary.{total,hot}. Introduce
PrefixStatus{Summary PrefixSummary} so the new v1alpha1 surface ships the
correct nested shape (avoids a breaking field change later); printcolumn
and controller updated to match.
- Register the cluster-scoped CacheIndex resource in the kubebuilder PROJECT
metadata (was only declaring CacheBackend).
Addressed in 76571cb.
Codex review
Blocking
Should-Fix
Nit
Verdict
I did not find vendor-neutral naming violations, proto/contract breakage, gRPC fail-open regressions, or hand-edited generated drift from inspection. I could not run Go tests in this environment because the filesystem is read-only and Go could not create its build cache.
Address PR review nits:
- Add cacheindex_viewer_role / cacheindex_editor_role convenience ClusterRoles (mirroring CacheBackend) so operators get RBAC for the new `kubectl get cacheindex` surface; reference them in the rbac kustomization.
- Correct the CacheIndexPoller doc: the server exposes the aggregate via /snapshot and the controller scrapes it (pull), not "server pushes".
Addressed in c626ecf.
Resolve the regenerated zz_generated.deepcopy.go conflict (CacheBackend API from #6 landed on main) by regenerating deepcopy + manifests against the merged api types, so both CacheBackend and CacheIndex are present.
Codex review
Blocking
Should-fix
Nit
Verdict
I could not run tests because the sandbox is fully read-only, including Go’s build cache location, so this review is static.
Address PR review:
- Drop omitempty on prefixes.summary.{total,hot} so an empty index renders
total: 0 / hot: 0 explicitly, matching the contract shape (instead of
omitting the zero-valued summary). Test asserts the zero summary serializes.
- Document lastUpdated as "last time the aggregate changed" (the controller
writes status only on change, so it marks the last data change, not the
last poll) and rename the print column Updated → Changed so operators
aren't misled into reading it as poller freshness.
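The effect of dropping omitempty can be seen with encoding/json alone. This sketch uses the PrefixStatus and PrefixSummary names from this PR; the int64 field types and the `summaryWithOmitEmpty` comparison type are assumptions added for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PrefixSummary without omitempty: zero values are serialized explicitly.
type PrefixSummary struct {
	Total int64 `json:"total"`
	Hot   int64 `json:"hot"`
}

type PrefixStatus struct {
	Summary PrefixSummary `json:"summary"`
}

// summaryWithOmitEmpty is a hypothetical comparison type showing the
// earlier behavior, where zero-valued fields disappear from the output.
type summaryWithOmitEmpty struct {
	Total int64 `json:"total,omitempty"`
	Hot   int64 `json:"hot,omitempty"`
}

func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(PrefixStatus{}))         // {"summary":{"total":0,"hot":0}}
	fmt.Println(marshal(summaryWithOmitEmpty{})) // {}
}
```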
Addressed in 0fc84e7.
Codex review
Blocking
Should-fix
Nit
Verdict
I could not run tests because the sandbox filesystem is read-only and Go needs a writable build cache. The diff itself shows no vendor-neutral naming violation, no proto contract change, and the new behavior has focused unit and over-the-wire coverage.
Address PR review nit: fetchSnapshot returned the raw http.NewRequestWithContext error; wrap it with context (incl. the URL) to match the project's error-wrapping standard, consistent with the other returns in the function.
Addressed in e68b6ba
Codex review
Blocking
Should-fix
Nit
Verdict
I did not find vendor-neutral naming violations, proto/contract drift, fail-open regressions, or missing generated artifacts in the PR diff. I did not run tests in this read-only review environment.
Codex review
Blocking
Should-fix
Nit
Verdict
I also checked the PR diff for banned vendor identity tokens and ran
Codex review
Blocking
Should-fix
Nit
Verdict
Vendor-neutral guard passed (
Address PR review nit: pkg/index is server-owned, but the controller's CacheIndex poller imports its Snapshot* types to decode the /snapshot endpoint. Clarify in doc.go that the index engine runs only in the server while the Snapshot* types are the deliberate, read-only wire contract shared with the controller — resolving the ownership-doc inconsistency.
Codex review
Blocking
Should-fix
Nit
Verdict
I verified the diff for vendor-neutral naming and internal tracker references with
main switched SchemeBuilder to &runtime.SchemeBuilder{} (func-slice), so
the old SchemeBuilder.Register(&obj{}) form no longer compiles. Convert
CacheIndex's init() to the closure pattern used by CacheBackend
(AddKnownTypes), fixing the semantic merge break the PR test-merge caught.
Codex review
Findings
Blocking: none. Should-fix: none. Nit: none.
I did not find violations in the PR diff against the vendor-neutral naming rule, proto/gRPC contract constraints, fail-open hot-path semantics, metadata-only requirements, or generated-manifest expectations. The new
Verification note: I attempted focused
Verdict: approve.
The merge with main raised COVER_MIN to 79% and the aggregate sat at 78.7%. Cover the previously-untested unary handlers and helpers in pkg/server (LookupPDRoute, GetCacheState, PublishEvent→index, eventTypeFromProto, microsToTime). Logic coverage is now 81.6%.
Codex review
Blocking
Should-Fix
Nit
I did not find vendor-specific identity leakage or proto contract drift. I could not run tests in this environment because the sandbox is read-only and Go cannot create its module/build cache.
Verdict: changes-requested.
Address PR review: statusEqual strips per-replica LastUpdate for churn suppression, so it advances only when a replica's reported stats change. Document the field to match that behavior (consistent with the top-level status.lastUpdated), so operators don't read a steady-state value as staleness. Reporter liveness lives in the server's /metrics.
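Timestamp-insensitive comparison can be sketched by clearing the time fields on copies before a deep comparison. The `status` and `replica` types and the `stripTimes` helper are simplified assumptions; only the idea that statusEqual ignores LastUpdate and lastUpdated comes from the description above.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the real status types.
type replica struct {
	ID         string
	HitRate    string
	LastUpdate string
}

type status struct {
	Replicas    []replica
	LastUpdated string
}

// stripTimes returns a copy with every timestamp cleared, leaving the
// input untouched.
func stripTimes(s status) status {
	out := status{Replicas: make([]replica, len(s.Replicas))}
	for i, r := range s.Replicas {
		r.LastUpdate = "" // r is a copy, so the caller's slice is not modified
		out.Replicas[i] = r
	}
	return out
}

// statusEqual reports whether two statuses match once timestamps are ignored.
func statusEqual(a, b status) bool {
	return reflect.DeepEqual(stripTimes(a), stripTimes(b))
}

func main() {
	old := status{LastUpdated: "10:00", Replicas: []replica{{ID: "r0", HitRate: "0.78", LastUpdate: "10:00"}}}
	sameData := status{LastUpdated: "10:05", Replicas: []replica{{ID: "r0", HitRate: "0.78", LastUpdate: "10:05"}}}
	newData := status{LastUpdated: "10:05", Replicas: []replica{{ID: "r0", HitRate: "0.80", LastUpdate: "10:05"}}}

	fmt.Println(statusEqual(old, sameData)) // true: no write, timestamps stay at 10:00
	fmt.Println(statusEqual(old, newData))  // false: status is written
}
```

Because an equal comparison skips the write, the stored timestamps only move when the data does, which is why a steady-state value does not indicate staleness.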
docs/reference-stack/DEMO.md was swept into the previous commit by a git add -A; it's untracked local scratch (not on main and unrelated to CAC-50). Untrack it (kept on disk) so it's not part of this PR.
Codex review
Blocking
Should-fix
Nit
Verdict: approve. I verified the diff against the vendor-neutral naming guard and internal-reference guard; both passed. I could not run the touched Go tests because the sandbox is read-only and Go could not create a module cache under
Summary
The status-surface half of B6 (CAC-50): exposes the server's in-memory cache aggregate as a cluster-scoped, status-only `CacheIndex` CR so operators can `kubectl get cacheindex`. (Engine half landed in CAC-20/#7.)
Design (per the locked decision): server exposes the data, controller writes the CR. No proto/contract change; reuses the controller's existing k8s client + RBAC.
- CacheIndex CRD (`api/v1alpha1/cacheindex_types.go`): cluster-scoped, status-only. Status mirrors the tech spec: `replicas[]{id,cacheMemoryBytes,hitRate,pressure,lastUpdate}`, `prefixes.summary{total,hot=0}`, `tenants[]{id,memoryUsed,hitRate}`. Rates are decimal strings ("0.78") because CRDs avoid floats for cross-language portability (controller-gen rejects floats without `allowDangerousTypes`, and a Java client is coming). `hot=0` until per-prefix access-counting exists.
- `pkg/index` `Snapshot()`: cluster-wide aggregate with the latest stats per replica id; tenant memory/hit-rate dedup replicas within a tenant (documented approximation).
- `pkg/server` `/snapshot`: internal HTTP JSON endpoint (alongside `/healthz`, `/readyz`, `/metrics`). Metadata only: replica/tenant stats + prefix counts, never KV tensors or prompt text.
- `internal/controller` `CacheIndexPoller`: leader-elected manager `Runnable` (not an event-driven reconciler, since the data source is the server, not the CR). Maintains the singleton `cluster-default` CR; writes status only when the meaningful aggregate changed (timestamps ignored for change detection → no churn under steady traffic).
- `cmd/controller`: wired with `--server-snapshot-url` (default `http://inference-cache-server:8080/snapshot`) and `--cacheindex-refresh-interval` (default 30s).
Notes / scope
- `/snapshot` is unauthenticated on the internal `:8080` HTTP port, like `/metrics`: metadata only, scraped in-cluster by the controller.
- No `grpc-contract.md` change. New CRD is additive/v1alpha1-safe.
Test plan
- `make pre-pr` green (naming, buf lint, fmt, vet, race, build, no generated drift)
- `make cover-check`: logic coverage 79.6% (≥65 gate)
- `Snapshot()` (dedup/sort/totals); server `/snapshot` end-to-end (ingest→JSON); controller `buildCacheIndexStatus`, `statusEqual` (timestamp-insensitive), `fetchSnapshot` (+non-200), `refresh` (fake client: create → no-op-on-unchanged → update-on-change)
- `kubectl get cacheindex` on a kind cluster shows live aggregate