Skip to content

Add CacheIndex CRD + controller status surface (B6 status half)#16

Open
EdHasNoLife wants to merge 16 commits into
mainfrom
cac-50-cacheindex-crd
Open

Add CacheIndex CRD + controller status surface (B6 status half)#16
EdHasNoLife wants to merge 16 commits into
mainfrom
cac-50-cacheindex-crd

Conversation

@EdHasNoLife
Copy link
Copy Markdown
Collaborator

Summary

The status-surface half of B6 (CAC-50) — exposes the server's in-memory cache aggregate as a cluster-scoped, status-only CacheIndex CR so operators can kubectl get cacheindex. (Engine half landed in CAC-20/#7.)

Design (per the locked decision): server exposes the data, controller writes the CR — no proto/contract change, reuses the controller's existing k8s client + RBAC.

  • CRD (api/v1alpha1/cacheindex_types.go) — cluster-scoped, status-only. Status mirrors the tech spec: replicas[]{id,cacheMemoryBytes,hitRate,pressure,lastUpdate}, prefixes.summary{total,hot=0}, tenants[]{id,memoryUsed,hitRate}. Rates are decimal strings ("0.78") — CRDs avoid floats for cross-language portability (controller-gen rejects floats without allowDangerousTypes, and a Java client is coming). hot=0 until per-prefix access-counting exists.
  • pkg/index Snapshot() — cluster-wide aggregate: latest stats per replica id; tenant memory/hit-rate dedup replicas within a tenant (documented approximation).
  • pkg/server /snapshot — internal HTTP JSON endpoint (alongside /healthz,/readyz,/metrics). Metadata only — replica/tenant stats + prefix counts, never KV tensors or prompt text.
  • internal/controller CacheIndexPoller — leader-elected manager Runnable (not an event-driven reconciler — the data source is the server, not the CR). Maintains the singleton cluster-default CR; writes status only when the meaningful aggregate changed (timestamps ignored for change detection → no churn under steady traffic).
  • cmd/controller — wired with --server-snapshot-url (default http://inference-cache-server:8080/snapshot) and --cacheindex-refresh-interval (default 30s).

Notes / scope

  • Single server instance assumed (Phase-1 in-memory index); HA cross-instance aggregation is later.
  • /snapshot is unauthenticated on the internal :8080 HTTP port, like /metrics — metadata only, scraped in-cluster by the controller.
  • No proto change → no grpc-contract.md change. New CRD is additive/v1alpha1-safe.

Test plan

  • make pre-pr green (naming, buf lint, fmt, vet, race, build, no generated drift)
  • make cover-check — logic coverage 79.6% (≥65 gate)
  • Unit tests: index Snapshot() (dedup/sort/totals); server /snapshot end-to-end (ingest→JSON); controller buildCacheIndexStatus, statusEqual (timestamp-insensitive), fetchSnapshot (+non-200), refresh (fake client: create → no-op-on-unchanged → update-on-change)
  • CI green on the PR
  • (manual, later) kubectl get cacheindex on a kind cluster shows live aggregate

Expose the server's in-memory cache aggregate as a cluster-scoped,
status-only CacheIndex CR (`kubectl get cacheindex`) — the other half of
B6 (CAC-50).

- api/v1alpha1: cluster-scoped, status-only CacheIndex CRD. Status mirrors
  the tech spec — replicas[]{id,cacheMemoryBytes,hitRate,pressure,
  lastUpdate}, prefixes.summary{total,hot=0}, tenants[]{id,memoryUsed,
  hitRate}. Rates are decimal strings (CRDs avoid floats for cross-language
  portability). Generated manifests + deepcopy + RBAC committed.
- pkg/index: Snapshot() returns the cluster-wide aggregate (latest stats
  per replica; tenant memory/hit-rate dedup replicas within a tenant).
- pkg/server: internal HTTP /snapshot endpoint serving the snapshot as
  JSON (metadata only — replica/tenant stats + prefix counts, never KV or
  prompt data).
- internal/controller: CacheIndexPoller, a leader-elected manager Runnable
  that scrapes /snapshot on a timer, maintains the singleton cluster-default
  CR, and writes status only when the meaningful aggregate changed
  (timestamps ignored for change detection to avoid churn).
- cmd/controller: wire the poller with --server-snapshot-url /
  --cacheindex-refresh-interval flags.

Design per PROJECT decision: server exposes the data, controller writes the
CR (reuses its k8s client/RBAC); single server instance assumed (Phase 1).
@github-actions
Copy link
Copy Markdown

Codex review

Should-Fix

  • docs/design/grpc-contract.md:79 still says the CacheIndex CRD status surface is a future follow-up. This PR lands that surface plus /snapshot, but no docs/design/ update was made. That conflicts with the project standard that behavior/spec changes match the design docs.

  • internal/controller/cacheindex_controller.go:80 fetches /snapshot before ensuring the singleton CacheIndex exists. If the server endpoint is unavailable, which is also likely with the current default controller install because cmd/controller/main.go:53 points at inference-cache-server but config/default/kustomization.yaml:3 only deploys CRD/RBAC/manager, the controller logs errors and never creates cluster-default. That undercuts the stated “kubectl get cacheindex” status surface. Consider creating the singleton first and then updating status when snapshots succeed, or shipping/configuring the server service with the default install.

Nit

  • cmd/server/main.go:18 flag help says the HTTP listener serves /healthz, /readyz, and /metrics, but the PR adds /snapshot. Update the help text so operator-visible CLI docs are accurate.

Verdict: changes-requested. No vendor-neutral naming or proto-contract violations found.

Address PR review:
- refresh() now ensures the cluster-default CacheIndex exists BEFORE
  scraping /snapshot, so `kubectl get cacheindex` shows the CR even when
  the server is unreachable (status fills on the next successful tick).
  Regression test covers the server-down path.
- grpc-contract.md: note the CacheIndex status surface + /snapshot are now
  implemented (no longer a future follow-up).
- cmd/server flag help: list /snapshot alongside the other HTTP endpoints.
@EdHasNoLife
Copy link
Copy Markdown
Collaborator Author

Addressed in fa77fe3.

  • Should-fix (create singleton before scrape): fixed — refresh() now ensures the cluster-default CacheIndex exists before scraping /snapshot, so kubectl get cacheindex shows the CR (empty status) even when the server is unreachable; status fills on the next successful tick. New test TestRefreshCreatesSingletonEvenWhenServerDown covers the server-down path.
    • On the related point — the default install doesn't deploy the server (config/default only ships CRD/RBAC/manager): that's a real but separate gap (there's no server Deployment/Service manifest in config/ yet at all). It's deployment scope, not this status-surface PR; the create-first fix makes the controller behave gracefully until the server is deployed. I'll file a follow-up ticket to add the server Deployment + Service to the install.
  • Should-fix (design doc): updated docs/design/grpc-contract.md — the CacheIndex status surface + /snapshot are now documented as implemented (no longer a future follow-up).
  • Nit (server flag help): cmd/server --http-bind-address help now lists /snapshot.

@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-Fix
None.

Nit

  • internal/controller/cacheindex_controller.go and config/rbac/role.yaml: CacheIndex RBAC is broader than the poller needs. It only does get, create, and status update, but the generated role also grants list, watch, update, patch, and delete on cacheindices. Consider tightening the kubebuilder marker for this status-only singleton.

Verdict
approve-with-nits.

I verified the diff for vendor-neutral naming, internal tracked references, and whitespace drift. go test ./... could not run in this environment because the filesystem is read-only and Go could not create /tmp/go-cache.

Address PR review nit: the poller never updates/patches/deletes the
CacheIndex resource itself — only its status subresource. Drop those
verbs from the main-resource RBAC marker, keeping get;list;watch;create
(list+watch are required because the manager's cached client backs Get
with an informer). Regenerated role.yaml.
@EdHasNoLife
Copy link
Copy Markdown
Collaborator Author

Addressed in 87effb8. Tightened the CacheIndex RBAC: the main resource is now create;get;list;watch (dropped update;patch;delete, which the poller never uses on the resource itself — it only writes the status subresource). Kept list;watch because the poller reads through the manager's cached client, where a Get is informer-backed and needs them; noted that on the marker. cacheindices/status stays get;update;patch.

@github-actions
Copy link
Copy Markdown

Codex review

Blocking

  • api/v1alpha1/cacheindex_types.go:70 exposes prefixes as status.prefixes.total/hot, and the generated CRD/print column follows that shape at config/crd/bases/inferencecache.io_cacheindices.yaml:75. The PR body says the locked/status spec is prefixes.summary{total,hot=0}. Because this is a new v1alpha1 CRD API surface, merging the flatter shape would publish the wrong contract and make correcting it later a breaking field change.

Should-fix

  • PROJECT:6 still only declares the CacheBackend resource. This PR adds a new API/CRD, but the Kubebuilder project metadata was not updated for cluster-scoped CacheIndex, which leaves scaffold/generation metadata stale.

Nit

  • None.

Verdict: changes-requested.

I could not run tests locally because the review environment filesystem is read-only, including /tmp, so Go could not initialize its build cache.

Address PR review:
- Blocking: the CacheIndex status used a flattened status.prefixes.{total,hot};
  the locked contract is status.prefixes.summary.{total,hot}. Introduce
  PrefixStatus{Summary PrefixSummary} so the new v1alpha1 surface ships the
  correct nested shape (avoids a breaking field change later); printcolumn
  and controller updated to match.
- Register the cluster-scoped CacheIndex resource in the kubebuilder PROJECT
  metadata (was only declaring CacheBackend).
@EdHasNoLife
Copy link
Copy Markdown
Collaborator Author

Addressed in 76571cb.

  • Blocking (prefix shape): fixed. Introduced PrefixStatus{ Summary PrefixSummary } so the status surface is status.prefixes.summary.{total,hot} — matching the locked contract — instead of the flattened status.prefixes.{total,hot}. The CRD now generates the nested shape, the print column points at .status.prefixes.summary.total, and the controller + tests were updated. Getting this right now avoids a breaking v1alpha1 field change later.
  • Should-fix (PROJECT metadata): registered the cluster-scoped CacheIndex resource (namespaced: false) in the kubebuilder PROJECT file alongside CacheBackend.

@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-Fix
None.

Nit

  • config/rbac/kustomization.yaml:7: the user-facing viewer/editor RBAC still only includes CacheBackend. Since this feature is explicitly for operators to kubectl get cacheindex, consider adding cacheindex_viewer_role / editor role or extending the existing convenience roles to include cacheindices.
  • internal/controller/cacheindex_controller.go:29-31: comment says “server pushes the data,” but the implementation polls/scrapes /snapshot. Worth correcting to avoid confusing the contract.

Verdict
approve-with-nits.

I did not find vendor-neutral naming violations, proto/contract breakage, gRPC fail-open regressions, or hand-edited generated drift from inspection. I could not run Go tests in this environment because the filesystem is read-only and Go could not create its build cache.

Address PR review nits:
- Add cacheindex_viewer_role / cacheindex_editor_role convenience
  ClusterRoles (mirroring CacheBackend) so operators get RBAC for the new
  `kubectl get cacheindex` surface; reference them in the rbac kustomization.
- Correct the CacheIndexPoller doc: the server exposes the aggregate via
  /snapshot and the controller scrapes it (pull), not "server pushes".
@EdHasNoLife
Copy link
Copy Markdown
Collaborator Author

Addressed in c626ecf.

  • Nit (operator RBAC): added cacheindex-viewer-role and cacheindex-editor-role ClusterRoles (mirroring the CacheBackend convenience roles) and wired them into config/rbac/kustomization.yaml, so operators get RBAC for the new kubectl get cacheindex surface.
  • Nit (comment): corrected the CacheIndexPoller doc — the server exposes the aggregate via /snapshot and the controller scrapes it (pull), rather than "server pushes."

Resolve the regenerated zz_generated.deepcopy.go conflict (CacheBackend
API from #6 landed on main) by regenerating deepcopy + manifests against
the merged api types, so both CacheBackend and CacheIndex are present.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-fix

  • api/v1alpha1/cacheindex_types.go: total and hot are tagged omitempty, so the required zero-valued status surface is omitted from JSON/YAML. This contradicts the PR/spec shape prefixes.summary{total,hot=0} and makes an empty index show no explicit 0 values. Same concern applies to zero-valued cacheMemoryBytes and memoryUsed if those are meant to mirror the status contract. Use pointers or remove omitempty for fields where explicit zero is meaningful.

  • api/v1alpha1/cacheindex_types.go / internal/controller/cacheindex_controller.go: lastUpdated is documented and printed as the last refresh time, but statusEqual ignores timestamps and skips the status write when only timestamps changed. Under steady state, successful polls will not advance lastUpdated, so operators cannot distinguish a healthy poller from a stale status. Either change the field/column semantics to “last aggregate change” or add a separate freshness signal.

Nit
None.

Verdict
changes-requested

I could not run tests because the sandbox is fully read-only, including Go’s build cache location, so this review is static.

Address PR review:
- Drop omitempty on prefixes.summary.{total,hot} so an empty index renders
  total: 0 / hot: 0 explicitly, matching the contract shape (instead of
  omitting the zero-valued summary). Test asserts the zero summary serializes.
- Document lastUpdated as "last time the aggregate changed" (the controller
  writes status only on change, so it marks the last data change, not the
  last poll) and rename the print column Updated → Changed so operators
  aren't misled into reading it as poller freshness.
@EdHasNoLife
Copy link
Copy Markdown
Collaborator Author

Addressed in 0fc84e7.

  • Should-fix (hidden zeros): dropped omitempty on prefixes.summary.{total,hot} so an empty index renders total: 0 / hot: 0 explicitly, matching the contract shape. New test asserts the zero summary serializes. (Left omitempty on the per-replica/tenant numeric fields — those are per-list-item optional metrics, not the always-present summary; happy to drop it there too if you'd prefer fully explicit zeros.)
  • Should-fix (lastUpdated semantics): kept the write-on-change design (the locked decision) but made the field honest — lastUpdated is now documented as "when the aggregate last changed" (not last poll), and the print column is renamed Updated → Changed so operators don't misread it as poller freshness. Poller liveness is observable via the controller's own health/metrics; if you'd like an explicit freshness heartbeat on the CR, I can add one as a follow-up (it would mean a periodic write even when unchanged).

@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-fix
None.

Nit
internal/controller/cacheindex_controller.go:139-142: the malformed snapshot URL path returns the raw http.NewRequestWithContext error. Wrap it with context, ideally including the URL, to match the project’s error-wrapping standard.

Verdict
approve-with-nits

I could not run tests because the sandbox filesystem is read-only and Go needs a writable build cache. The diff itself shows no vendor-neutral naming violation, no proto contract change, and the new behavior has focused unit and over-the-wire coverage.

Address PR review nit: fetchSnapshot returned the raw
http.NewRequestWithContext error; wrap it with context (incl. the URL) to
match the project's error-wrapping standard, consistent with the other
returns in the function.
@EdHasNoLife
Copy link
Copy Markdown
Collaborator Author

Addressed in e68b6bafetchSnapshot now wraps the http.NewRequestWithContext error with context (incl. the URL), consistent with the other wrapped returns in the function.

@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-fix
None.

Nit

  • internal/controller/cacheindex_controller.go imports pkg/index for the /snapshot DTO, but pkg/index/doc.go says that package belongs to inferencecache-server. Either move the snapshot wire structs to a small shared/internal contract package or update the package ownership docs if pkg/index is now intentionally shared by controller and server.

Verdict
approve-with-nits.

I did not find vendor-neutral naming violations, proto/contract drift, fail-open regressions, or missing generated artifacts in the PR diff. I did not run tests in this read-only review environment.

@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-fix

  • cmd/controller/main.go / cmd/controller/main.go: the CacheIndex poller is enabled unconditionally and defaults to http://inference-cache-server:8080/snapshot, but the default install still only includes the controller manager resources, not an inference-cache-server Service/Deployment. A fresh config/default install will create the singleton CacheIndex but then log scrape failures every 30s and never populate status unless the user separately creates a matching service or overrides the flag. Add the server Service/manifest wiring, set the manager arg to a shipped endpoint, or make this poller opt-in until the server endpoint is actually installed.

Nit
None.

Verdict
changes-requested

I also checked the PR diff for banned vendor identity tokens and ran git diff --check; both were clean. I could not run go test ./... because this review environment has a read-only filesystem, so Go could not create a build cache.

@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-fix
None.

Nit
None.

Verdict
Approve.

Vendor-neutral guard passed (make verify-naming), and no tracked internal refs were introduced (make verify-no-internal-refs). I could not run Go tests because the sandbox filesystem is read-only and Go could not create a build cache.

Address PR review nit: pkg/index is server-owned, but the controller's
CacheIndex poller imports its Snapshot* types to decode the /snapshot
endpoint. Clarify in doc.go that the index engine runs only in the server
while the Snapshot* types are the deliberate, read-only wire contract
shared with the controller — resolving the ownership-doc inconsistency.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-fix
None.

Nit
None.

Verdict
Approve.

I verified the diff for vendor-neutral naming and internal tracker references with make verify-naming verify-no-internal-refs; both passed. I could not run the Go tests because this environment is read-only and go test failed creating /home/runner/go for the module cache.

Edward Sun and others added 2 commits May 27, 2026 15:14
main switched SchemeBuilder to &runtime.SchemeBuilder{} (func-slice), so
the old SchemeBuilder.Register(&obj{}) form no longer compiles. Convert
CacheIndex's init() to the closure pattern used by CacheBackend
(AddKnownTypes), fixing the semantic merge break the PR test-merge caught.
@github-actions
Copy link
Copy Markdown

Codex review

Findings

Blocking: none.

Should-fix: none.

Nit: none.

I did not find violations in the PR diff against the vendor-neutral naming rule, proto/gRPC contract constraints, fail-open hot-path semantics, metadata-only requirements, or generated-manifest expectations. The new CacheIndex CRD/poller/snapshot endpoint is additive and kept out of the proto surface.

Verification note: I attempted focused go test runs, but the sandbox is read-only and Go could not create module/build cache directories, even under /tmp.

Verdict: approve.

The merge with main raised COVER_MIN to 79% and the aggregate sat at
78.7%. Cover the previously-untested unary handlers and helpers in
pkg/server (LookupPDRoute, GetCacheState, PublishEvent→index,
eventTypeFromProto, microsToTime). Logic coverage is now 81.6%.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None.

Should-Fix

  • api/v1alpha1/cacheindex_types.go:29 vs internal/controller/cacheindex_controller.go:191: status.replicas[].lastUpdate is documented as “when the replica's state was last observed”, but statusEqual strips every replica LastUpdate before comparison. A replica that keeps reporting identical memory/hit/pressure will advance in /snapshot, but the CR status will never be written, so operators can see an indefinitely stale lastUpdate for a healthy reporter. Either stop exposing per-replica lastUpdate, update it on a bounded cadence, or make the field’s semantics match the churn-suppression behavior.

  • cmd/controller/main.go:53 / cmd/controller/main.go:113: the controller always starts the CacheIndexPoller with default URL http://inference-cache-server:8080/snapshot, but the default install only includes CRD/RBAC/manager (config/default/kustomization.yaml:3) and the manager args only pass --leader-elect (config/manager/manager.yaml:22). There is no default inference-cache-server Service/Deployment in config/, so the shipped controller deployment creates an empty singleton and logs scrape failures forever. Add the server Service/Deployment/default wiring, or make the poller opt-in until the server endpoint is actually deployed.

Nit
None.

I did not find vendor-specific identity leakage or proto contract drift. I could not run tests in this environment because the sandbox is read-only and Go cannot create its module/build cache.

Verdict: changes-requested.

Address PR review: statusEqual strips per-replica LastUpdate for churn
suppression, so it advances only when a replica's reported stats change.
Document the field to match that behavior (consistent with the top-level
status.lastUpdated), so operators don't read a steady-state value as
staleness. Reporter liveness lives in the server's /metrics.
docs/reference-stack/DEMO.md was swept into the previous commit by a
git add -A; it's untracked local scratch (not on main and unrelated to
CAC-50). Untrack it (kept on disk) so it's not part of this PR.
@github-actions
Copy link
Copy Markdown

Codex review

Blocking
None found.

Should-fix
None found.

Nit
None found.

Verdict: approve.

I verified the diff against the vendor-neutral naming guard and internal-reference guard; both passed. I could not run the touched Go tests because the sandbox is read-only and Go could not create a module cache under /home/runner/go or /tmp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant