Back LookupRoute with an in-memory CacheIndex (B6 engine) #7

Merged
heymrbox merged 7 commits into main from cac-20-cacheindex on May 27, 2026
@EdHasNoLife
Collaborator

@EdHasNoLife EdHasNoLife commented May 27, 2026

Summary

Implements the index-engine half of B6 (CAC-20) — real ingestion + ranked lookups behind the previously fail-open stubs. The CacheIndex CRD + controller status surface is split to CAC-50 (kept this PR server-only and focused).

New package pkg/index: a concurrent, soft-state aggregator keyed by (tenant, model, hash_scheme, prefix_hash) → replica entries (+ per-replica stats).

  • ReportCacheState ingests authoritative updates — idempotent on (replica, hash_scheme, prefix_hash).
  • PublishEvent applies PREFIX_EVICTED / REPLICA_UPDATED / ALL_CLEARED deltas (PREFIX_ADDED refreshes known entries; ReportCacheState is authoritative for additions).
  • LookupRoute returns replicas holding the exact prefix within a matching hash_scheme, ranked by matched-tokens × freshness; no match → empty + NO_HINT (fail open, side-effect-free apart from metrics).
  • GetCacheState returns the (tenant, model) aggregate (replica stats + prefix count).
  • Soft state: TTL eviction goroutine (tied to Serve's ctx) + a max-entries cap bound memory. prefix_hash stays engine-opaque.
  • Ops: /readyz now consults index.Ready(); /metrics adds inferencecache_index_entries{model}, inferencecache_lookup_route_calls_total{model,reason_code,hint_used}, inferencecache_lookup_route_latency_seconds{model}. A drained model's index_entries series is zeroed rather than left stale.

Design notes

  • pkg/index is decoupled from proto (pure domain types); the service translates proto ↔ index, so the index stays unit-testable in isolation.
  • Engine-warm /readyz gating (waiting for initial KV-event sync) arrives with the C1 hook; today the index is ready as soon as Serve starts it.
  • The full §4.3 metric schema is standardized in F3/CAC-38; this lands the index/lookup subset.

Request Flow

A new request arrives whose stable prefix hashes (in vLLM's scheme) to X:

Gateway → LookupRoute{
  model_id: "llama-3-70b", tenant_id: "acme",
  prefix_hash: X, prefix_token_count: 2048, hash_scheme: "vllm"
}
InferenceCache → {
  reason_code: "PREFIX_MATCH",
  replica_scores: [
    {replica_id: "replica-1", score: 0.95, matched_tokens: 2048, est_cache_hit_prob: 0.95},
    {replica_id: "replica-2", score: 0.80, matched_tokens: 2048, ...}
  ],
  lookup_latency_us: 90
}

Test plan

  • make pre-pr green (naming + buf lint + fmt + vet + test + build + drift)
  • go test ./pkg/... -race: index unit tests (ranking/top-K, idempotency, tenant + hash-scheme isolation, TTL eviction, max-entries cap, drained-gauge zeroing, ready lifecycle) + over-the-wire ReportCacheState → LookupRoute PREFIX_MATCH
  • Live smoke test (grpcurl): ingest → PREFIX_MATCH (replica ranked, matched_tokens=128), cross-scheme → NO_HINT, GetCacheState aggregate, /metrics shows index_entries + lookup_route_calls_total
  • CI green on the PR

Implement the index engine half of B6: a concurrent, soft-state cache
aggregator in pkg/index keyed by (tenant, model, hash_scheme,
prefix_hash) → replica entries, plus per-replica stats.

- ReportCacheState ingests authoritative updates (idempotent on
  replica/scheme/prefix); PublishEvent applies PREFIX_EVICTED /
  REPLICA_UPDATED / ALL_CLEARED deltas (PREFIX_ADDED refreshes known
  entries — ReportCacheState is authoritative for additions).
- LookupRoute returns replicas holding the exact prefix within a
  matching hash_scheme, ranked by matched-tokens x freshness; no match
  fails open with NO_HINT. GetCacheState returns the (tenant, model)
  aggregate.
- TTL eviction goroutine (tied to Serve's ctx) + a max-entries cap bound
  memory; prefix_hash stays engine-opaque.
- /readyz now consults index.Ready(); /metrics adds index_entries{model}
  and lookup_route_calls_total / lookup_route_latency_seconds. Drained
  models have their index_entries gauge zeroed rather than left stale.

The CacheIndex CRD + controller status surface is split to CAC-50.
@github-actions

Codex review

Blocking

  • pkg/index/index.go: Ingest treats ReportCacheState as append/refresh only. The PR and code document these updates as authoritative, but a replica that first reports p1,p2 and later reports only p2 will still be returned for p1 until TTL or an explicit event. That makes LookupRoute return PREFIX_MATCH for a prefix the replica no longer holds, and GetCacheState/metrics overcount. Fix by replacing the replica’s prior prefix set for the update’s (tenant, model, hash_scheme, replica) and deleting absent entries, with a regression test for shrinking/empty reports.

Should-fix

  • None found.

Nit

  • None found.

I did not find vendor-neutral naming issues or proto/CRD drift in this PR. I could not run tests because the environment is read-only and Go could not create its build cache.

Verdict: changes-requested.

Address PR review: the code described ReportCacheState ingestion as
"authoritative" while behaving as additive/refresh, which is contradictory
if read as snapshot semantics. Make the contract explicit instead:
CacheStateUpdate is an incremental delta — a prefix's absence from a later
update does not remove it. Removals arrive via CacheEvent (PREFIX_EVICTED /
ALL_CLEARED) or TTL expiry, matching the engine KV-event model
(vLLM BlockStored / BlockRemoved). Documented in code and grpc-contract.md.
@EdHasNoLife
Collaborator Author

Thanks — this is a real contract ambiguity, addressed in 430b1b4 by making the semantics explicit rather than switching behavior. The key decision: CacheStateUpdate is an incremental delta, not a snapshot.

Why delta (not snapshot-replace):

  • The engine integration (C1/CAC-21) is KV events — vLLM emits BlockStored / BlockRemoved / AllBlocksCleared. Removals are explicit events, not "absent from the next batch."
  • A replica can hold 10k–100k+ cached prefixes; resending the full set on every update would be bandwidth-prohibitive and defeats the streaming/event design.
  • Soft state (tech spec): a stale hint causes a cache miss, never a wrong answer, and TTL bounds the staleness. So p1 lingering until its PREFIX_EVICTED event or TTL is intentional and fail-soft, not a correctness bug.

So the defect you caught was the wording: the code said "authoritative" (reads as snapshot) while behaving as additive/refresh. Fixed by nailing the contract in three places:

  • pkg/index/index.go Ingest doc — "additive deltas, NOT snapshots; removals via CacheEvent / TTL."
  • pkg/server/inferencecache_service.go ReportCacheState doc.
  • docs/design/grpc-contract.md — new "Additive CacheStateUpdate" contract guarantee.

Removals are already handled: PublishEvent PREFIX_EVICTED / ALL_CLEARED (tested in TestApplyEventEvictAndClear) + the TTL eviction sweep. No snapshot-replace, by design.
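
To make the delta contract concrete, here is a toy sketch of the semantics described above. It is not the pkg/index implementation: `Index`, `Ingest`, `Evict`, `Clear`, and `Holds` are hypothetical names, and TTL expiry is omitted.

```go
package main

import "fmt"

// Index is a toy stand-in for pkg/index: replica -> set of prefix hashes.
type Index struct {
	prefixes map[string]map[string]bool
}

func NewIndex() *Index { return &Index{prefixes: map[string]map[string]bool{}} }

// Ingest applies an additive delta: listed prefixes are added or refreshed,
// and prefixes missing from the update are left untouched.
func (ix *Index) Ingest(replica string, hashes ...string) {
	set := ix.prefixes[replica]
	if set == nil {
		set = map[string]bool{}
		ix.prefixes[replica] = set
	}
	for _, h := range hashes {
		set[h] = true
	}
}

// Evict models a PREFIX_EVICTED event: an explicit removal of one prefix.
func (ix *Index) Evict(replica, hash string) { delete(ix.prefixes[replica], hash) }

// Clear models an ALL_CLEARED event for one replica.
func (ix *Index) Clear(replica string) { delete(ix.prefixes, replica) }

// Holds reports whether the index believes the replica holds the prefix.
func (ix *Index) Holds(replica, hash string) bool { return ix.prefixes[replica][hash] }

func main() {
	ix := NewIndex()
	ix.Ingest("replica-1", "p1", "p2")
	ix.Ingest("replica-1", "p2") // p1 is absent, but this is a delta, not a snapshot
	fmt.Println(ix.Holds("replica-1", "p1"))

	ix.Evict("replica-1", "p1") // removal arrives as an explicit event
	fmt.Println(ix.Holds("replica-1", "p1"))
}
```

Under snapshot-replace semantics the second `Ingest` would have dropped p1. Under the delta contract p1 stays until an eviction event or TTL expiry removes it.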

@github-actions

Codex review

Blocking
None found.

Should-fix

  • pkg/server/inferencecache_service.go:154: updateFromProto stores ReplicaStats.ReplicaID from the nested stats.replica_id, while the index keys the stats by the top-level CacheStateUpdate.replica_id. If an adapter omits or mismatches the nested field, GetCacheState can return an empty/wrong replica_id for a replica keyed under a different ID. Prefer the top-level update replica ID, or reject/mark inconsistent updates.

Nit

  • pkg/server/inferencecache_service.go:109: non-EOF stream receive errors are returned raw. Project quality guidance says errors should be wrapped; consider fmt.Errorf("receive cache state update: %w", err) if it preserves the gRPC status behavior you want.

Vendor-neutral naming check: no new core identity violations found.

I could not run tests because the environment is read-only and Go cannot create a build cache, even with GOCACHE=/tmp/....

Verdict: changes-requested

Address PR review: CacheStateUpdate carries a redundant nested
stats.replica_id alongside the top-level replica_id that the index keys
on. If an adapter mismatches or omits the nested field, GetCacheState
would report a replica whose stats payload disagrees with its key.

Enforce the top-level replica_id as authoritative in the index (and at
the proto translation boundary), so the stored stats always match their
key. Regression test covers a deliberately mismatched nested id.
@EdHasNoLife
Collaborator Author

Addressed in c8e5642.

Should-fix (stats replica id): fixed. The top-level CacheStateUpdate.replica_id is now authoritative — the index overrides ReplicaStats.ReplicaID with the top-level id when storing (so the payload always matches its key, regardless of what an adapter puts in the nested field), and updateFromProto reads the top-level id at the translation boundary. Regression test TestStatsKeyedByTopLevelReplicaID ingests a deliberately mismatched nested id and asserts CacheState reports the top-level one (with the stats payload intact).
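
A rough sketch of the override, assuming simplified stand-in types: `Update`, `StatsStore`, and the `CachedBytes` field are hypothetical and only illustrate the "top-level id wins" rule, not the real structs.

```go
package main

import "fmt"

// ReplicaStats is a simplified stand-in; CachedBytes is a hypothetical field.
type ReplicaStats struct {
	ReplicaID   string
	CachedBytes int64
}

// Update stands in for CacheStateUpdate: a top-level replica id plus a
// nested stats payload that redundantly carries its own id.
type Update struct {
	ReplicaID string
	Stats     ReplicaStats
}

// StatsStore keeps the latest stats per replica.
type StatsStore struct {
	byReplica map[string]ReplicaStats
}

// Put keys the stats by the top-level id and overwrites the nested id with
// it, so the stored payload can never disagree with its key.
func (s *StatsStore) Put(u Update) {
	st := u.Stats
	st.ReplicaID = u.ReplicaID
	s.byReplica[u.ReplicaID] = st
}

func main() {
	store := &StatsStore{byReplica: map[string]ReplicaStats{}}
	store.Put(Update{
		ReplicaID: "replica-1",
		Stats:     ReplicaStats{ReplicaID: "mismatched-id", CachedBytes: 4096},
	})
	fmt.Printf("%+v\n", store.byReplica["replica-1"])
}
```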

Nit (wrap Recv error): intentionally leaving it raw. The non-EOF Recv error is already a gRPC status error (e.g. codes.Canceled when a client aborts). Returning it unwrapped preserves that code over the wire — which TestReportCacheStateClientCancel asserts (status.Code(err) == codes.Canceled). Wrapping with fmt.Errorf("...: %w", err) risks the server remapping it to codes.Unknown, degrading the status the client sees. Propagating an already-structured status error verbatim is the behavior we want here.

@github-actions

Codex review

Should-Fix

  • docs/design/grpc-contract.md:1 and docs/design/grpc-contract.md:73: the design doc still describes the service as “Phase 1 — proto + fail-open server stubs” and says index-backed LookupRoute is out of scope/later work, while this PR implements index-backed LookupRoute, ReportCacheState, PublishEvent, and GetCacheState. The contract/spec standard asks behavior changes to match docs/design/; update the status/scope text so the doc no longer contradicts the implementation.

Nit

  • pkg/index/index.go:179: Start will panic if a caller uses WithSweepInterval(0) because time.NewTicker(0) panics. Current production config uses the default positive interval, so this is not blocking, but validating/clamping the option would make the public option safer.

No vendor-neutral naming violations found in the PR diff. No proto/generated-code drift concerns because proto/ and generated stubs were not changed.

I could not run tests because the sandbox filesystem is read-only and Go could not create its build cache.

Verdict: approve-with-nits.

Address PR review nits:
- grpc-contract.md no longer says LookupRoute et al. are fail-open stubs
  / out of scope: note that B6 backs LookupRoute, ReportCacheState,
  PublishEvent, and GetCacheState with the in-memory CacheIndex.
- index.New clamps non-positive TTL / sweep interval to defaults so a
  misconfigured WithSweepInterval(0) can't panic time.NewTicker(0) (and
  TTL can't divide-by-zero in freshness).
@EdHasNoLife
Collaborator Author

Both addressed in adcc348.

  • Should-fix (doc out of sync): updated docs/design/grpc-contract.md — the status line now lists B6, and the scope section has a "B6 (cache index)" update noting LookupRoute / ReportCacheState / PublishEvent / GetCacheState are index-backed (with RenderTemplate / LookupPDRoute / streams still fail-open stubs). No longer contradicts the implementation.
  • Nit (WithSweepInterval(0) panic): index.New now clamps non-positive TTL and sweep interval to their defaults, so a misconfigured option can't panic time.NewTicker(0) (or divide-by-zero in freshness). Covered by TestNonPositiveDurationsClampToDefaults.

@github-actions

Codex review

Blocking

  • pkg/index/index.go: PREFIX_ADDED refreshes matching prefix_hash across all hash schemes because CacheEvent has no hash_scheme. Since prefix_hash is engine-opaque and only meaningful within a scheme, an event for one scheme can keep another scheme’s stale entry fresh and later produce a PREFIX_MATCH for that scheme. Either carry hash_scheme in the event contract or do not prefix-refresh from PREFIX_ADDED; rely on ReportCacheState for authoritative refresh/addition.

Should-fix

  • pkg/server/server.go: the index lifecycle is tied only to the caller’s ctx. If Serve exits through the internal error branch at line 130 while ctx remains live, the eviction goroutine keeps running and Ready() can stay true after the service has stopped. Use a derived context with defer cancel() around s.index.Start(...).

Nit

  • docs/design/grpc-contract.md: the B4 scope paragraph still says GetCacheState is empty and “real metrics/events” are out of scope, which now conflicts with the B6 update and the new metrics in this PR.

Vendor-neutral naming looks clean in the changed core surfaces. I could not run tests because the sandbox is read-only and Go cannot create its build cache.

Verdict: changes-requested.

Address PR review:
- CacheEvent carries no hash_scheme and prefix_hash is only meaningful
  within a scheme, so events no longer refresh scheme-specific prefix
  freshness (that risked keeping one engine's stale entry fresh via
  another engine's event → a false PREFIX_MATCH). PREFIX_ADDED is now a
  no-op (ReportCacheState is authoritative for adds/refreshes);
  REPLICA_UPDATED only refreshes replica stats liveness; PREFIX_EVICTED /
  ALL_CLEARED removals stay (removal is conservative/fail-soft).
- Serve now derives a cancelable context with defer cancel() for
  index.Start, so the eviction goroutine and Ready() are torn down on any
  return (including the internal error branch), not just caller ctx done.
- Tighten the grpc-contract.md B4/B6 scope text.
@EdHasNoLife
Collaborator Author

All three addressed in 8f99dd0.

  • Blocking (cross-scheme refresh): fixed. Since CacheEvent has no hash_scheme and prefix_hash is only meaningful within a scheme, events no longer touch scheme-specific prefix freshness. PREFIX_ADDED is now a no-op (ReportCacheState is authoritative for adds/refreshes); REPLICA_UPDATED refreshes only the replica's stats liveness; PREFIX_EVICTED / ALL_CLEARED removals stay (removal is conservative — at worst a cache miss, soft state). New test TestPrefixAddedEventDoesNotRefreshAcrossSchemes proves a PREFIX_ADDED event lets both schemes' entries expire on schedule.
  • Should-fix (index lifecycle): Serve now derives serveCtx, cancel := context.WithCancel(ctx) with defer cancel() and starts the index on serveCtx, so the eviction goroutine and Ready() are torn down on any return — including the internal error branch — not just when the caller's ctx is done.
  • Nit (doc scope): tightened grpc-contract.md — the "still out of scope" line now scopes to the event/metric streams (M10), and the B6 update spells out the scheme-safe event behavior + that /metrics carries the index/lookup metrics.

@github-actions

Codex review

Blocking

  • pkg/server/inferencecache_service.go:48 / pkg/index/index.go:223 / pkg/index/index.go:297: hash_scheme is allowed to be empty on both ingest and lookup, so a CacheStateUpdate and LookupRouteRequest that omit it can still produce PREFIX_MATCH. That violates the contract that prefix_hash is engine-opaque and only safe within a known matching hash_scheme; empty/default proto values collapse all engines into the same compatibility domain. The safe behavior should fail open: ignore entries without hash_scheme and return NO_HINT for lookups without it.

Should-fix

  • None.

Nit

  • None.

Verdict: changes-requested.

I could not run tests: the sandbox is read-only, so go test failed before building because it could not create GOCACHE under /tmp.

Address PR review (Blocking): an empty/unspecified hash_scheme would
collapse all engines into one compatibility domain, letting an ingest or
lookup that omits the tag produce a PREFIX_MATCH on engine-opaque bytes.
The index now drops ingest prefixes without a hash_scheme and returns no
hint for lookups without one (stats stay scheme-independent). Documented
the guarantee in grpc-contract.md; regression test covers both paths.
@EdHasNoLife
Collaborator Author

Addressed in ea216a6.

Blocking (empty hash_scheme): fixed — the index now fails open on a missing scheme:

  • Ingest drops prefix entries when hash_scheme == "" (they can't be matched safely; stats stay scheme-independent).
  • Lookup returns no hint when hash_scheme == "", even if a real scoped entry exists.

So a CacheStateUpdate/LookupRouteRequest that omits the tag can never collapse engines into one domain. Documented under the "Engine-opaque prefix_hash" guarantee in grpc-contract.md, with TestEmptyHashSchemeFailsOpen covering both the ingest-drop and lookup-NO_HINT paths.
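
A small sketch of the two guards, using hypothetical types (`key`, `Index`) and plain strings for the reason codes. The real index also keys on tenant and model and tracks per-entry freshness, all omitted here.

```go
package main

import "fmt"

// key scopes a prefix hash to the scheme that produced it.
type key struct {
	scheme, hash string
}

// Index maps (scheme, hash) to the replicas reporting that prefix.
type Index struct {
	replicas map[key][]string
}

func NewIndex() *Index { return &Index{replicas: map[key][]string{}} }

// Ingest drops prefixes that arrive without a hash_scheme, since they cannot
// be matched safely. It reports whether the entry was stored.
// De-duplication of repeated reports is left out for brevity.
func (ix *Index) Ingest(scheme, hash, replica string) bool {
	if scheme == "" {
		return false
	}
	k := key{scheme, hash}
	ix.replicas[k] = append(ix.replicas[k], replica)
	return true
}

// Lookup fails open: a missing scheme or a miss both yield NO_HINT.
func (ix *Index) Lookup(scheme, hash string) ([]string, string) {
	if scheme == "" {
		return nil, "NO_HINT"
	}
	r := ix.replicas[key{scheme, hash}]
	if len(r) == 0 {
		return nil, "NO_HINT"
	}
	return r, "PREFIX_MATCH"
}

func main() {
	ix := NewIndex()
	fmt.Println(ix.Ingest("", "X", "replica-1"))     // dropped
	fmt.Println(ix.Ingest("vllm", "X", "replica-1")) // stored
	fmt.Println(ix.Lookup("", "X"))                  // no hint despite a scoped entry
	fmt.Println(ix.Lookup("vllm", "X"))
}
```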

@github-actions

Codex review

Blocking
None.

Should-fix

  • pkg/index/index.go:445: reportEntries() snapshots counts before acquiring reportMu, and callers invoke it after releasing i.mu. Concurrent mutations can publish metrics out of order: ingest A snapshots {m:1}, ingest B snapshots {m:2} and sets 2, then A acquires reportMu and sets 1 last. With no later mutation, /metrics reports stale inferencecache_index_entries. The metrics update needs to be ordered with the index state snapshot, or otherwise discard stale snapshots.

Nit
None.

Verification
git diff --check passed. gofmt -l on changed Go files was clean. I could not run go test ./pkg/index ./pkg/server because the read-only sandbox cannot create the Go build cache.

Verdict: changes-requested.

Address PR review: reportEntries snapshotted per-model counts before
acquiring reportMu, so two concurrent reporters could publish out of
order and leave a stale inferencecache_index_entries gauge. Take the
snapshot while holding reportMu, so reporters are serialized and the last
one always publishes the live count (lock order stays reportMu -> i.mu).
Adds a concurrent-ingest regression test asserting the final gauge.
@EdHasNoLife
Collaborator Author

Addressed in ee089a3.

Should-fix (stale index_entries under concurrent reporters): fixed. reportEntries now takes the per-model count snapshot while holding reportMu instead of before it. Reporters are serialized by reportMu, and each snapshot reads live index state at publish time (mutations complete under i.mu before reportEntries is called), so whichever reporter runs last always writes the current count — no stale earlier snapshot can win. Lock order stays reportMu → i.mu (no caller holds i.mu when calling reportEntries), so no deadlock. Added TestConcurrentIngestReportsFinalCount (50 concurrent ingests, asserts the final gauge equals the live count; passes repeatedly under -race).

@github-actions

Codex review

Blocking
None.

Should-Fix
None.

Nit
None.

Verdict
Approve.

I reviewed the PR diff against the vendor-neutral rule, gRPC contract/design doc, fail-open lookup semantics, and generated-code scope. I did not find issues in the introduced changes.

Verification note: I attempted go test ./pkg/..., but this environment is read-only and Go could not create a build cache under /tmp, so tests were not runnable here.

@heymrbox heymrbox merged commit c6470ac into main May 27, 2026
4 checks passed
@EdHasNoLife EdHasNoLife deleted the cac-20-cacheindex branch May 27, 2026 19:39
2 participants