Add CacheIndex CRD + controller status surface (B6 status half) by EdHasNoLife · Pull Request #16 · cachebox-project/inference-cache

EdHasNoLife · 2026-05-27T19:00:04Z

Summary

The status-surface half of B6 (CAC-50) — exposes the server's in-memory cache aggregate as a cluster-scoped, status-only CacheIndex CR so operators can kubectl get cacheindex. (Engine half landed in CAC-20/#7.)

Design (per the locked decision): server exposes the data, controller writes the CR — no proto/contract change, reuses the controller's existing k8s client + RBAC.

CRD (api/v1alpha1/cacheindex_types.go) — cluster-scoped, status-only. Status mirrors the tech spec: replicas[]{id,cacheMemoryBytes,hitRate,pressure,lastUpdate}, prefixes.summary{total,hot=0}, tenants[]{id,memoryUsed,hitRate}. Rates are decimal strings ("0.78") — CRDs avoid floats for cross-language portability (controller-gen rejects floats without allowDangerousTypes, and a Java client is coming). hot=0 until per-prefix access-counting exists.
pkg/index Snapshot() — cluster-wide aggregate: latest stats per replica id; tenant memory/hit-rate dedup replicas within a tenant (documented approximation).
pkg/server /snapshot — internal HTTP JSON endpoint (alongside /healthz,/readyz,/metrics). Metadata only — replica/tenant stats + prefix counts, never KV tensors or prompt text.
internal/controller CacheIndexPoller — leader-elected manager Runnable (not an event-driven reconciler — the data source is the server, not the CR). Maintains the singleton cluster-default CR; writes status only when the meaningful aggregate changed (timestamps ignored for change detection → no churn under steady traffic).
cmd/controller — wired with --server-snapshot-url (default http://inference-cache-server:8080/snapshot) and --cacheindex-refresh-interval (default 30s).

Notes / scope

Single server instance assumed (Phase-1 in-memory index); HA cross-instance aggregation is later.
/snapshot is unauthenticated on the internal :8080 HTTP port, like /metrics — metadata only, scraped in-cluster by the controller.
No proto change → no grpc-contract.md change. New CRD is additive/v1alpha1-safe.

Test plan

make pre-pr green (naming, buf lint, fmt, vet, race, build, no generated drift)
make cover-check — logic coverage 79.6% (≥65 gate)
Unit tests: index Snapshot() (dedup/sort/totals); server /snapshot end-to-end (ingest→JSON); controller buildCacheIndexStatus, statusEqual (timestamp-insensitive), fetchSnapshot (+non-200), refresh (fake client: create → no-op-on-unchanged → update-on-change)
CI green on the PR
(manual, later) kubectl get cacheindex on a kind cluster shows live aggregate

Expose the server's in-memory cache aggregate as a cluster-scoped, status-only CacheIndex CR (`kubectl get cacheindex`) — the other half of B6 (CAC-50). - api/v1alpha1: cluster-scoped, status-only CacheIndex CRD. Status mirrors the tech spec — replicas[]{id,cacheMemoryBytes,hitRate,pressure, lastUpdate}, prefixes.summary{total,hot=0}, tenants[]{id,memoryUsed, hitRate}. Rates are decimal strings (CRDs avoid floats for cross-language portability). Generated manifests + deepcopy + RBAC committed. - pkg/index: Snapshot() returns the cluster-wide aggregate (latest stats per replica; tenant memory/hit-rate dedup replicas within a tenant). - pkg/server: internal HTTP /snapshot endpoint serving the snapshot as JSON (metadata only — replica/tenant stats + prefix counts, never KV or prompt data). - internal/controller: CacheIndexPoller, a leader-elected manager Runnable that scrapes /snapshot on a timer, maintains the singleton cluster-default CR, and writes status only when the meaningful aggregate changed (timestamps ignored for change detection to avoid churn). - cmd/controller: wire the poller with --server-snapshot-url / --cacheindex-refresh-interval flags. Design per PROJECT decision: server exposes the data, controller writes the CR (reuses its k8s client/RBAC); single server instance assumed (Phase 1).

github-actions · 2026-05-27T19:03:29Z

Codex review

Should-Fix

docs/design/grpc-contract.md:79 still says the CacheIndex CRD status surface is a future follow-up. This PR lands that surface plus /snapshot, but no docs/design/ update was made. That conflicts with the project standard that behavior/spec changes match the design docs.
internal/controller/cacheindex_controller.go:80 fetches /snapshot before ensuring the singleton CacheIndex exists. If the server endpoint is unavailable, which is also likely with the current default controller install because cmd/controller/main.go:53 points at inference-cache-server but config/default/kustomization.yaml:3 only deploys CRD/RBAC/manager, the controller logs errors and never creates cluster-default. That undercuts the stated “kubectl get cacheindex” status surface. Consider creating the singleton first and then updating status when snapshots succeed, or shipping/configuring the server service with the default install.

Nit

cmd/server/main.go:18 flag help says the HTTP listener serves /healthz, /readyz, and /metrics, but the PR adds /snapshot. Update the help text so operator-visible CLI docs are accurate.

Verdict: changes-requested. No vendor-neutral naming or proto-contract violations found.

Address PR review: - refresh() now ensures the cluster-default CacheIndex exists BEFORE scraping /snapshot, so `kubectl get cacheindex` shows the CR even when the server is unreachable (status fills on the next successful tick). Regression test covers the server-down path. - grpc-contract.md: note the CacheIndex status surface + /snapshot are now implemented (no longer a future follow-up). - cmd/server flag help: list /snapshot alongside the other HTTP endpoints.

EdHasNoLife · 2026-05-27T19:07:47Z

Addressed in fa77fe3.

Should-fix (create singleton before scrape): fixed — refresh() now ensures the cluster-default CacheIndex exists before scraping /snapshot, so kubectl get cacheindex shows the CR (empty status) even when the server is unreachable; status fills on the next successful tick. New test TestRefreshCreatesSingletonEvenWhenServerDown covers the server-down path.
- On the related point — the default install doesn't deploy the server (config/default only ships CRD/RBAC/manager): that's a real but separate gap (there's no server Deployment/Service manifest in config/ yet at all). It's deployment scope, not this status-surface PR; the create-first fix makes the controller behave gracefully until the server is deployed. I'll file a follow-up ticket to add the server Deployment + Service to the install.
Should-fix (design doc): updated docs/design/grpc-contract.md — the CacheIndex status surface + /snapshot are now documented as implemented (no longer a future follow-up).
Nit (server flag help): cmd/server --http-bind-address help now lists /snapshot.

github-actions · 2026-05-27T19:10:54Z

Codex review

Blocking
None.

Should-Fix
None.

Nit

internal/controller/cacheindex_controller.go and config/rbac/role.yaml: CacheIndex RBAC is broader than the poller needs. It only does get, create, and status update, but the generated role also grants list, watch, update, patch, and delete on cacheindices. Consider tightening the kubebuilder marker for this status-only singleton.

Verdict
approve-with-nits.

I verified the diff for vendor-neutral naming, internal tracked references, and whitespace drift. go test ./... could not run in this environment because the filesystem is read-only and Go could not create /tmp/go-cache.

Address PR review nit: the poller never updates/patches/deletes the CacheIndex resource itself — only its status subresource. Drop those verbs from the main-resource RBAC marker, keeping get;list;watch;create (list+watch are required because the manager's cached client backs Get with an informer). Regenerated role.yaml.

EdHasNoLife · 2026-05-27T19:14:17Z

Addressed in 87effb8. Tightened the CacheIndex RBAC: the main resource is now create;get;list;watch (dropped update;patch;delete, which the poller never uses on the resource itself — it only writes the status subresource). Kept list;watch because the poller reads through the manager's cached client, where a Get is informer-backed and needs them; noted that on the marker. cacheindices/status stays get;update;patch.

github-actions · 2026-05-27T19:16:38Z

Codex review

Blocking

api/v1alpha1/cacheindex_types.go:70 exposes prefixes as status.prefixes.total/hot, and the generated CRD/print column follows that shape at config/crd/bases/inferencecache.io_cacheindices.yaml:75. The PR body says the locked/status spec is prefixes.summary{total,hot=0}. Because this is a new v1alpha1 CRD API surface, merging the flatter shape would publish the wrong contract and make correcting it later a breaking field change.

Should-fix

PROJECT:6 still only declares the CacheBackend resource. This PR adds a new API/CRD, but the Kubebuilder project metadata was not updated for cluster-scoped CacheIndex, which leaves scaffold/generation metadata stale.

Nit

None.

Verdict: changes-requested.

I could not run tests locally because the review environment filesystem is read-only, including /tmp, so Go could not initialize its build cache.

Address PR review: - Blocking: the CacheIndex status used a flattened status.prefixes.{total,hot}; the locked contract is status.prefixes.summary.{total,hot}. Introduce PrefixStatus{Summary PrefixSummary} so the new v1alpha1 surface ships the correct nested shape (avoids a breaking field change later); printcolumn and controller updated to match. - Register the cluster-scoped CacheIndex resource in the kubebuilder PROJECT metadata (was only declaring CacheBackend).

EdHasNoLife · 2026-05-27T19:21:27Z

Addressed in 76571cb.

Blocking (prefix shape): fixed. Introduced PrefixStatus{ Summary PrefixSummary } so the status surface is status.prefixes.summary.{total,hot} — matching the locked contract — instead of the flattened status.prefixes.{total,hot}. The CRD now generates the nested shape, the print column points at .status.prefixes.summary.total, and the controller + tests were updated. Getting this right now avoids a breaking v1alpha1 field change later.
Should-fix (PROJECT metadata): registered the cluster-scoped CacheIndex resource (namespaced: false) in the kubebuilder PROJECT file alongside CacheBackend.

github-actions · 2026-05-27T19:24:23Z

Codex review

Blocking
None.

Should-Fix
None.

Nit

config/rbac/kustomization.yaml:7: the user-facing viewer/editor RBAC still only includes CacheBackend. Since this feature is explicitly for operators to kubectl get cacheindex, consider adding cacheindex_viewer_role / editor role or extending the existing convenience roles to include cacheindices.
internal/controller/cacheindex_controller.go:29-31: comment says “server pushes the data,” but the implementation polls/scrapes /snapshot. Worth correcting to avoid confusing the contract.

Verdict
approve-with-nits.

I did not find vendor-neutral naming violations, proto/contract breakage, gRPC fail-open regressions, or hand-edited generated drift from inspection. I could not run Go tests in this environment because the filesystem is read-only and Go could not create its build cache.

Address PR review nits: - Add cacheindex_viewer_role / cacheindex_editor_role convenience ClusterRoles (mirroring CacheBackend) so operators get RBAC for the new `kubectl get cacheindex` surface; reference them in the rbac kustomization. - Correct the CacheIndexPoller doc: the server exposes the aggregate via /snapshot and the controller scrapes it (pull), not "server pushes".

EdHasNoLife · 2026-05-27T19:28:09Z

Addressed in c626ecf.

Nit (operator RBAC): added cacheindex-viewer-role and cacheindex-editor-role ClusterRoles (mirroring the CacheBackend convenience roles) and wired them into config/rbac/kustomization.yaml, so operators get RBAC for the new kubectl get cacheindex surface.
Nit (comment): corrected the CacheIndexPoller doc — the server exposes the aggregate via /snapshot and the controller scrapes it (pull), rather than "server pushes."

Resolve the regenerated zz_generated.deepcopy.go conflict (CacheBackend API from #6 landed on main) by regenerating deepcopy + manifests against the merged api types, so both CacheBackend and CacheIndex are present.

github-actions · 2026-05-27T19:36:37Z

Codex review

Blocking
None.

Should-fix

api/v1alpha1/cacheindex_types.go: total and hot are tagged omitempty, so the required zero-valued status surface is omitted from JSON/YAML. This contradicts the PR/spec shape prefixes.summary{total,hot=0} and makes an empty index show no explicit 0 values. Same concern applies to zero-valued cacheMemoryBytes and memoryUsed if those are meant to mirror the status contract. Use pointers or remove omitempty for fields where explicit zero is meaningful.
api/v1alpha1/cacheindex_types.go / internal/controller/cacheindex_controller.go: lastUpdated is documented and printed as the last refresh time, but statusEqual ignores timestamps and skips the status write when only timestamps changed. Under steady state, successful polls will not advance lastUpdated, so operators cannot distinguish a healthy poller from a stale status. Either change the field/column semantics to “last aggregate change” or add a separate freshness signal.

Nit
None.

Verdict
changes-requested

I could not run tests because the sandbox is fully read-only, including Go’s build cache location, so this review is static.

Address PR review: - Drop omitempty on prefixes.summary.{total,hot} so an empty index renders total: 0 / hot: 0 explicitly, matching the contract shape (instead of omitting the zero-valued summary). Test asserts the zero summary serializes. - Document lastUpdated as "last time the aggregate changed" (the controller writes status only on change, so it marks the last data change, not the last poll) and rename the print column Updated → Changed so operators aren't misled into reading it as poller freshness.

EdHasNoLife · 2026-05-27T19:40:05Z

Addressed in 0fc84e7.

Should-fix (hidden zeros): dropped omitempty on prefixes.summary.{total,hot} so an empty index renders total: 0 / hot: 0 explicitly, matching the contract shape. New test asserts the zero summary serializes. (Left omitempty on the per-replica/tenant numeric fields — those are per-list-item optional metrics, not the always-present summary; happy to drop it there too if you'd prefer fully explicit zeros.)
Should-fix (lastUpdated semantics): kept the write-on-change design (the locked decision) but made the field honest — lastUpdated is now documented as "when the aggregate last changed" (not last poll), and the print column is renamed Updated → Changed so operators don't misread it as poller freshness. Poller liveness is observable via the controller's own health/metrics; if you'd like an explicit freshness heartbeat on the CR, I can add one as a follow-up (it would mean a periodic write even when unchanged).

github-actions · 2026-05-27T19:42:39Z

Codex review

Blocking
None.

Should-fix
None.

Nit
internal/controller/cacheindex_controller.go:139-142: the malformed snapshot URL path returns the raw http.NewRequestWithContext error. Wrap it with context, ideally including the URL, to match the project’s error-wrapping standard.

Verdict
approve-with-nits

I could not run tests because the sandbox filesystem is read-only and Go needs a writable build cache. The diff itself shows no vendor-neutral naming violation, no proto contract change, and the new behavior has focused unit and over-the-wire coverage.

Address PR review nit: fetchSnapshot returned the raw http.NewRequestWithContext error; wrap it with context (incl. the URL) to match the project's error-wrapping standard, consistent with the other returns in the function.

EdHasNoLife · 2026-05-27T19:44:05Z

Addressed in e68b6ba — fetchSnapshot now wraps the http.NewRequestWithContext error with context (incl. the URL), consistent with the other wrapped returns in the function.

github-actions · 2026-05-27T19:46:34Z

Codex review

Blocking
None.

Should-fix
None.

Nit

internal/controller/cacheindex_controller.go imports pkg/index for the /snapshot DTO, but pkg/index/doc.go says that package belongs to inferencecache-server. Either move the snapshot wire structs to a small shared/internal contract package or update the package ownership docs if pkg/index is now intentionally shared by controller and server.

Verdict
approve-with-nits.

I did not find vendor-neutral naming violations, proto/contract drift, fail-open regressions, or missing generated artifacts in the PR diff. I did not run tests in this read-only review environment.

github-actions · 2026-05-27T20:53:44Z

Codex review

Blocking
None.

Should-fix

cmd/controller/main.go / cmd/controller/main.go: the CacheIndex poller is enabled unconditionally and defaults to http://inference-cache-server:8080/snapshot, but the default install still only includes the controller manager resources, not an inference-cache-server Service/Deployment. A fresh config/default install will create the singleton CacheIndex but then log scrape failures every 30s and never populate status unless the user separately creates a matching service or overrides the flag. Add the server Service/manifest wiring, set the manager arg to a shipped endpoint, or make this poller opt-in until the server endpoint is actually installed.

Nit
None.

Verdict
changes-requested

I also checked the PR diff for banned vendor identity tokens and ran git diff --check; both were clean. I could not run go test ./... because this review environment has a read-only filesystem, so Go could not create a build cache.

github-actions · 2026-05-27T21:33:54Z

Codex review

Blocking
None.

Should-fix
None.

Nit
None.

Verdict
Approve.

Vendor-neutral guard passed (make verify-naming), and no tracked internal refs were introduced (make verify-no-internal-refs). I could not run Go tests because the sandbox filesystem is read-only and Go could not create a build cache.

Address PR review nit: pkg/index is server-owned, but the controller's CacheIndex poller imports its Snapshot* types to decode the /snapshot endpoint. Clarify in doc.go that the index engine runs only in the server while the Snapshot* types are the deliberate, read-only wire contract shared with the controller — resolving the ownership-doc inconsistency.

github-actions · 2026-05-27T22:11:54Z

Codex review

Blocking
None.

Should-fix
None.

Nit
None.

Verdict
Approve.

I verified the diff for vendor-neutral naming and internal tracker references with make verify-naming verify-no-internal-refs; both passed. I could not run the Go tests because this environment is read-only and go test failed creating /home/runner/go for the module cache.

main switched SchemeBuilder to &runtime.SchemeBuilder{} (func-slice), so the old SchemeBuilder.Register(&obj{}) form no longer compiles. Convert CacheIndex's init() to the closure pattern used by CacheBackend (AddKnownTypes), fixing the semantic merge break the PR test-merge caught.

github-actions · 2026-05-27T22:18:57Z

Codex review

Findings

Blocking: none.

Should-fix: none.

Nit: none.

I did not find violations in the PR diff against the vendor-neutral naming rule, proto/gRPC contract constraints, fail-open hot-path semantics, metadata-only requirements, or generated-manifest expectations. The new CacheIndex CRD/poller/snapshot endpoint is additive and kept out of the proto surface.

Verification note: I attempted focused go test runs, but the sandbox is read-only and Go could not create module/build cache directories, even under /tmp.

Verdict: approve.

The merge with main raised COVER_MIN to 79% and the aggregate sat at 78.7%. Cover the previously-untested unary handlers and helpers in pkg/server (LookupPDRoute, GetCacheState, PublishEvent→index, eventTypeFromProto, microsToTime). Logic coverage is now 81.6%.

github-actions · 2026-05-27T22:24:37Z

Codex review

Blocking
None.

Should-Fix

api/v1alpha1/cacheindex_types.go:29 vs internal/controller/cacheindex_controller.go:191: status.replicas[].lastUpdate is documented as “when the replica's state was last observed”, but statusEqual strips every replica LastUpdate before comparison. A replica that keeps reporting identical memory/hit/pressure will advance in /snapshot, but the CR status will never be written, so operators can see an indefinitely stale lastUpdate for a healthy reporter. Either stop exposing per-replica lastUpdate, update it on a bounded cadence, or make the field’s semantics match the churn-suppression behavior.
cmd/controller/main.go:53 / cmd/controller/main.go:113: the controller always starts the CacheIndexPoller with default URL http://inference-cache-server:8080/snapshot, but the default install only includes CRD/RBAC/manager (config/default/kustomization.yaml:3) and the manager args only pass --leader-elect (config/manager/manager.yaml:22). There is no default inference-cache-server Service/Deployment in config/, so the shipped controller deployment creates an empty singleton and logs scrape failures forever. Add the server Service/Deployment/default wiring, or make the poller opt-in until the server endpoint is actually deployed.

Nit
None.

I did not find vendor-specific identity leakage or proto contract drift. I could not run tests in this environment because the sandbox is read-only and Go cannot create its module/build cache.

Verdict: changes-requested.

Address PR review: statusEqual strips per-replica LastUpdate for churn suppression, so it advances only when a replica's reported stats change. Document the field to match that behavior (consistent with the top-level status.lastUpdated), so operators don't read a steady-state value as staleness. Reporter liveness lives in the server's /metrics.

docs/reference-stack/DEMO.md was swept into the previous commit by a git add -A; it's untracked local scratch (not on main and unrelated to CAC-50). Untrack it (kept on disk) so it's not part of this PR.

github-actions · 2026-05-27T22:30:34Z

Codex review

Blocking
None found.

Should-fix
None found.

Nit
None found.

Verdict: approve.

I verified the diff against the vendor-neutral naming guard and internal-reference guard; both passed. I could not run the touched Go tests because the sandbox is read-only and Go could not create a module cache under /home/runner/go or /tmp.

Merge main into cac-50-cacheindex-crd

b3ab2b4

Resolve the regenerated zz_generated.deepcopy.go conflict (CacheBackend API from #6 landed on main) by regenerating deepcopy + manifests against the merged api types, so both CacheBackend and CacheIndex are present.

Wrap snapshot request build error

e68b6ba

Address PR review nit: fetchSnapshot returned the raw http.NewRequestWithContext error; wrap it with context (incl. the URL) to match the project's error-wrapping standard, consistent with the other returns in the function.

Merge branch 'main' into cac-50-cacheindex-crd

88dcff9

Merge branch 'main' into cac-50-cacheindex-crd

ee79bcb

Edward Sun and others added 2 commits May 27, 2026 15:14

Merge remote-tracking branch 'origin/main' into cac-50-cacheindex-crd

db3e782

EdHasNoLife added 2 commits May 27, 2026 15:26

Drop accidentally-committed local scratch file

14e3f7b

docs/reference-stack/DEMO.md was swept into the previous commit by a git add -A; it's untracked local scratch (not on main and unrelated to CAC-50). Untrack it (kept on disk) so it's not part of this PR.

Conversation

EdHasNoLife commented May 27, 2026

Summary

Notes / scope

Test plan

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

EdHasNoLife commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

github-actions Bot commented May 27, 2026

Codex review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant