feat(main): wire keyviz sampler end-to-end (flags, RunFlusher, coordinator + admin)#647
Wires the keyviz.Sampler from PR #639 into ShardedCoordinator (design doc §5.1):
- Adds an unexported `sampler keyviz.Sampler` field plus a `WithSampler(s keyviz.Sampler) *ShardedCoordinator` post-construction option, mirroring WithLeaseReadObserver.
- groupMutations now calls `c.observeMutation(routeID, mut)` once per resolved (RouteID, mutation) pair before grouping by GroupID. Reads are not dispatched through this path; all calls use OpWrite.
- observeMutation guards against an interface-nil sampler at the call site; the keyviz contract also tolerates a typed-nil *MemSampler, so disabled keyviz wires through to a no-op without branching deep in the hot path.
- DelPrefix is intentionally not observed: dispatchDelPrefixBroadcast doesn't resolve a single RouteID (it broadcasts to every shard), and the design treats per-route attribution there as out of scope.

Tests:
- TestShardedCoordinatorObservesEveryDispatchedMutation: cross-shard Put batch, verifies one Observe per element with the engine's resolved RouteID, OpWrite, and exact key/value lengths.
- TestShardedCoordinatorWithoutSamplerStaysSafe: dispatches successfully both with no WithSampler call (interface-nil) and with a typed-nil *MemSampler.
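The interface-nil versus typed-nil distinction described above can be sketched in a few lines of Go. This is a minimal illustration, not the PR's code: `Coordinator`, the `Observe` signature, and the `writes` counter are simplified stand-ins assumed for the example.

```go
package main

import "fmt"

// Op and Sampler are simplified stand-ins for the keyviz types; the real
// signatures are assumptions made for this sketch.
type Op int

const OpWrite Op = iota

type Sampler interface {
	Observe(routeID uint64, op Op, keyLen, valueLen int)
}

// MemSampler tolerates a nil receiver, so a typed-nil *MemSampler stored in
// the interface behaves as a no-op.
type MemSampler struct{ writes int }

func (s *MemSampler) Observe(routeID uint64, op Op, keyLen, valueLen int) {
	if s == nil {
		return
	}
	s.writes++
}

// Coordinator is a hypothetical stand-in for ShardedCoordinator.
type Coordinator struct{ sampler Sampler }

// WithSampler stores the value as supplied and returns the receiver so the
// call can be chained after other options.
func (c *Coordinator) WithSampler(s Sampler) *Coordinator {
	c.sampler = s
	return c
}

// observeMutation returns early on an interface-nil sampler. A typed-nil
// pointer is not caught by this check and falls through to Observe.
func (c *Coordinator) observeMutation(routeID uint64, key, value []byte) {
	if c.sampler == nil {
		return
	}
	c.sampler.Observe(routeID, OpWrite, len(key), len(value))
}

func main() {
	// No WithSampler call: the field is interface-nil and the guard returns.
	(&Coordinator{}).observeMutation(1, []byte("k"), []byte("v"))

	// Typed-nil: passes the guard, then Observe no-ops on the nil receiver.
	var disabled *MemSampler
	(&Coordinator{}).WithSampler(disabled).observeMutation(1, []byte("k"), []byte("v"))

	live := &MemSampler{}
	(&Coordinator{}).WithSampler(live).observeMutation(1, []byte("k"), []byte("v"))
	fmt.Println("writes observed:", live.writes)
}
```

Both disabled configurations stay safe only because `Observe` checks its own receiver; an implementation that dereferences a nil receiver would panic in the typed-nil case.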
Implements the `GetKeyVizMatrix` admin gRPC RPC declared in `proto/admin.proto`, completing the read-side of the key visualizer flow (design doc §5.2 / §6).
- New `KeyVizSampler` interface in adapter (Snapshot only) so tests can pass a deterministic in-memory fake while production wires `*keyviz.MemSampler` directly.
- `AdminServer.RegisterSampler` mirrors `RegisterGroup`. Without it, GetKeyVizMatrix returns codes.Unavailable so callers can distinguish "keyviz disabled on this node" from "no data yet" (which is a successful empty response).
- Pivots the column-major MatrixColumn slice into the row-major proto layout: one KeyVizRow per RouteID with values aligned to a parallel column_unix_ms slice. KeyVizSeries selection picks the matching per-row counter; UNSPECIFIED defaults to Reads.
- bucket_id encodes "route:<id>" for individual slots and "virtual:<syntheticID>" for aggregate buckets. Aggregate rows carry MemberRoutes verbatim through route_ids and route_count.

Tests:
- TestGetKeyVizMatrixReturnsUnavailableWhenSamplerNotRegistered
- TestGetKeyVizMatrixPivotsColumnsToRows: two-column / two-route fixture, verifies the missing-row-becomes-zero contract.
- TestGetKeyVizMatrixSeriesSelection: table-driven across all five enum values including UNSPECIFIED defaulting to Reads.
- TestGetKeyVizMatrixEncodesAggregateBucket: virtual bucket layout.
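The column-to-row pivot and its missing-row-becomes-zero contract can be sketched as follows. The types here are reduced stand-ins (a single `Writes` counter, no aggregate buckets), and ordering rows by route ID is an assumption made to keep the output deterministic; it is not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// MatrixRow and MatrixColumn are simplified stand-ins for the sampler's
// column-major snapshot types.
type MatrixRow struct {
	RouteID uint64
	Writes  uint64
}

type MatrixColumn struct {
	AtUnixMs int64
	Rows     []MatrixRow
}

// KeyVizRow is a simplified stand-in for the row-major proto message.
type KeyVizRow struct {
	BucketID string
	Values   []uint64
}

// pivot turns column-major snapshots into one row per route. Values is
// pre-sized to len(cols), so a route missing from a column keeps a zero there.
func pivot(cols []MatrixColumn) ([]int64, []KeyVizRow) {
	columnUnixMs := make([]int64, len(cols))
	byRoute := map[uint64]*KeyVizRow{}
	for i, col := range cols {
		columnUnixMs[i] = col.AtUnixMs
		for _, r := range col.Rows {
			row, ok := byRoute[r.RouteID]
			if !ok {
				row = &KeyVizRow{
					BucketID: fmt.Sprintf("route:%d", r.RouteID),
					Values:   make([]uint64, len(cols)),
				}
				byRoute[r.RouteID] = row
			}
			row.Values[i] = r.Writes
		}
	}
	// Map iteration order is random, so sort the route IDs before emitting.
	ids := make([]uint64, 0, len(byRoute))
	for id := range byRoute {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool { return ids[a] < ids[b] })
	rows := make([]KeyVizRow, 0, len(ids))
	for _, id := range ids {
		rows = append(rows, *byRoute[id])
	}
	return columnUnixMs, rows
}

func main() {
	cols := []MatrixColumn{
		{AtUnixMs: 1000, Rows: []MatrixRow{{RouteID: 1, Writes: 3}}},
		{AtUnixMs: 2000, Rows: []MatrixRow{{RouteID: 1, Writes: 4}, {RouteID: 2, Writes: 5}}},
	}
	ts, rows := pivot(cols)
	fmt.Println(ts)
	for _, r := range rows {
		fmt.Println(r.BucketID, r.Values)
	}
}
```

In this fixture route 2 is absent from the first column, so its row carries a zero in position 0 while staying aligned with the timestamp slice.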
Wires the in-memory key visualizer end-to-end on each node binary (design doc §5 / §10):
- Four new flags: --keyvizEnabled (off by default, opt-in), --keyvizStep, --keyvizMaxTrackedRoutes, --keyvizMaxMemberRoutesPerSlot. Defaults pull from keyviz.Default* so flag help and code stay in sync.
- buildKeyVizSampler() returns *keyviz.MemSampler when enabled, nil otherwise: the single decision point for "keyviz on/off."
- seedKeyVizRoutes() copies the engine's startup route catalogue into the sampler so the first matrix snapshots carry metadata. Route-watch propagation post-startup is a follow-up (Phase 3 in the design).
- startKeyVizFlusher() runs RunFlusher in the existing errgroup and calls one final Flush after ctx fires so the in-progress step is harvested at graceful shutdown.
- Coordinator: chained .WithSampler(keyVizSamplerForCoordinator(s)) onto the existing .WithLeaseReadObserver(...) call so the dispatch hot path observes mutations into the sampler.
- AdminServer: setupAdminService now takes the *MemSampler and only calls RegisterSampler when it's non-nil, so operators with keyviz disabled get codes.Unavailable on GetKeyVizMatrix instead of a spurious empty success.

Tests:
- TestBuildKeyVizSamplerHonorsEnabledFlag: flag on/off contract.
- TestSeedKeyVizRoutesCopiesEngineCatalogue: engine.Stats path.
- TestSeedKeyVizRoutesNoOpOnNilSampler: disabled-sampler safety.
- TestStartKeyVizFlusherReturnsAfterCancel: graceful shutdown (RunFlusher exits cleanly + final Flush harvests pending counters).
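The flusher's shutdown behaviour (periodic flush, then one final Flush once the context is cancelled) can be sketched with the standard library alone. This is a simplified illustration: `countingSampler`, `startFlusher`, and `flushesAfterImmediateCancel` are hypothetical names, and a plain goroutine with a done channel stands in for the errgroup the PR uses.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// countingSampler is a hypothetical stand-in that only counts Flush calls.
type countingSampler struct{ flushes atomic.Int64 }

func (s *countingSampler) Flush() { s.flushes.Add(1) }

// startFlusher flushes once per step until ctx is cancelled, then performs one
// final Flush so the in-progress step is not lost. The returned channel is
// closed when the goroutine has exited.
func startFlusher(ctx context.Context, s *countingSampler, step time.Duration) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(step)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				s.Flush() // final harvest at graceful shutdown
				return
			case <-ticker.C:
				s.Flush()
			}
		}
	}()
	return done
}

// flushesAfterImmediateCancel cancels before the (one hour) ticker can fire,
// so the only Flush observed is the final one.
func flushesAfterImmediateCancel() int64 {
	ctx, cancel := context.WithCancel(context.Background())
	s := &countingSampler{}
	done := startFlusher(ctx, s, time.Hour)
	cancel()
	<-done
	return s.flushes.Load()
}

func main() {
	fmt.Println("flushes after cancel:", flushesAfterImmediateCancel())
}
```

Waiting on the done channel before reading the counter mirrors what a shutdown test needs to assert: the goroutine has returned and the pending step was harvested.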
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7bc737ae79
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
  sampler := buildKeyVizSampler()
  seedKeyVizRoutes(sampler, cfg.engine)
  coordinate := kv.NewShardedCoordinator(cfg.engine, shardGroups, cfg.defaultGroup, clock, shardStore).
-     WithLeaseReadObserver(metricsRegistry.LeaseReadObserver())
+     WithLeaseReadObserver(metricsRegistry.LeaseReadObserver()).
+     WithSampler(keyVizSamplerForCoordinator(sampler))
  distCatalog, err := setupDistributionCatalog(ctx, runtimes, cfg.engine)
Seed keyviz routes after catalog snapshot is applied
seedKeyVizRoutes runs before setupDistributionCatalog, but setupDistributionCatalog calls distribution.EnsureCatalogSnapshot, which applies a catalog snapshot back into the engine (including durable non-zero RouteIDs). That means the sampler is seeded with pre-snapshot IDs (typically 0 from buildEngine), while later dispatches observe post-snapshot IDs, so Observe misses and drops samples for normal traffic. In practice this can leave keyviz mostly empty right after startup even when writes are flowing.
  cols := sampler.Snapshot(from, to)
  pickValue := matrixSeriesPicker(req.GetSeries())
  return matrixToProto(cols, pickValue), nil
Apply the request row budget in GetKeyVizMatrix
This handler never reads req.GetRows(), so it always returns every row from the snapshot regardless of caller-requested resolution. On large route counts, this can blow up response size and processing time for admin queries that are supposed to request a bounded Y-axis (for example 256 rows), making the endpoint much more expensive than requested.
@claude review
Claude finished @bootjp's task in 2m 34s

Code Review: PR #647

Summary

The overall architecture is clean: single construction point (

Critical: Bug

Seed fires before
Round-1 review fixes for PR #645:
- Claude bot Issue 1: WithSampler stores the interface value as supplied; the guard at the call site only checks interface-nil, not typed-nil. Document the nil-receiver requirement on the Sampler interface so a future implementor cannot land a panic-on-nil Observe and silently break dispatch.
- Claude bot Issue 2: trim observeMutation comment to the two non-obvious points (reads bypass this path; the early return keeps the disabled hot path branch-only). Per CLAUDE.md, comments for the "what" duplicate the code.
- Claude / CodeRabbit nit: TestShardedCoordinatorWithoutSamplerStaysSafe now asserts c.sampler stays the zero interface value for the "no WithSampler call" subcase, so a future refactor that silently initialises the field would fail the test.
Round-1 review fixes for PR #646:

Codex P1: GetKeyVizMatrix never read req.GetRows() and always returned every row, breaking the documented row-budget contract for deployments with many tracked routes. Add applyKeyVizRowBudget: when budget > 0, sort by per-row activity total (sum of values across all columns of the requested series) and truncate to the top-N before the final Start-order sort.

Codex P2: route_count was set to len(MemberRoutes), but MemberRoutes is intentionally capped at MaxMemberRoutesPerSlot in the sampler while extra routes still drive the bucket counters. Capped buckets under-reported their contributors. Add MemberRoutesTotal to the keyviz routeSlot/MatrixRow surface, increment it in foldIntoBucket / RegisterRoute / bucket creation regardless of the visible cap, and decrement in pruneMemberRoute. The adapter now surfaces it as route_count and flips route_ids_truncated when the visible list is shorter than the total so consumers can tell their drill-down list is partial.

Tests:
- TestGetKeyVizMatrixHonorsRowsBudget: 4 routes, rows=2 returns top-2 by activity, sorted by Start.
- TestGetKeyVizMatrixSurfacesRouteCountTruncation: visible cap=2, total=9 → route_count=9, route_ids_truncated=true.
- TestGetKeyVizMatrixEncodesAggregateBucket updated for the new MemberRoutesTotal field on the test fixture.
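The row-budget step described under Codex P1 (rank by activity, keep the top N, then restore Start order) can be sketched like this. It is an illustration under assumptions, not the PR's `applyKeyVizRowBudget`: `Row`, `total`, and `applyRowBudget` are names invented for the example, and rows carry only a start key and one value series.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Row is a simplified stand-in for a matrix row: a start key plus the values
// of the requested series, one per column.
type Row struct {
	Start  []byte
	Values []uint64
}

// total sums a row's values across all columns.
func total(r Row) uint64 {
	var sum uint64
	for _, v := range r.Values {
		sum += v
	}
	return sum
}

// applyRowBudget keeps the budget most active rows, then sorts the survivors
// by Start. A budget <= 0 means "no limit". The input slice is not modified.
func applyRowBudget(rows []Row, budget int) []Row {
	out := append([]Row(nil), rows...)
	if budget > 0 && len(out) > budget {
		sort.SliceStable(out, func(i, j int) bool { return total(out[i]) > total(out[j]) })
		out = out[:budget]
	}
	sort.Slice(out, func(i, j int) bool { return bytes.Compare(out[i].Start, out[j].Start) < 0 })
	return out
}

func main() {
	rows := []Row{
		{Start: []byte("a"), Values: []uint64{1, 0}},
		{Start: []byte("b"), Values: []uint64{4, 6}},
		{Start: []byte("c"), Values: []uint64{2, 0}},
		{Start: []byte("d"), Values: []uint64{9, 11}},
	}
	for _, r := range applyRowBudget(rows, 2) {
		fmt.Printf("%s total=%d\n", r.Start, total(r))
	}
}
```

Sorting twice keeps the two concerns separate: the first sort decides which rows survive, the second decides how they are laid out on the Y-axis.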
Codex round-1 P1 on PR #647: seedKeyVizRoutes ran BEFORE setupDistributionCatalog. setupDistributionCatalog calls distribution.EnsureCatalogSnapshot which applies a catalog snapshot back into the engine — including the durable non-zero RouteIDs from the catalog. Seeding earlier registered the placeholder RouteID=0 entries from buildEngine; subsequent dispatches (which resolve to post-snapshot RouteIDs) found no slot and Observe dropped every mutation, leaving the heatmap empty even when traffic was flowing. Move the seed call to after setupDistributionCatalog returns so the sampler sees the catalog's authoritative IDs. Existing tests cover the seed mechanics; the ordering bug surfaced only in the live startup path which the unit tests don't exercise.
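The failure mode above can be reproduced with a toy sampler that drops observations for routes it has never been told about. Everything in this sketch is hypothetical (the `sampler` type, its fields, and route ID 42); it only illustrates why seeding with placeholder IDs leaves the matrix empty.

```go
package main

import "fmt"

// sampler is a toy stand-in: it counts writes per registered route and drops
// observations for routes that were never registered.
type sampler struct {
	writes  map[uint64]uint64
	dropped uint64
}

func newSampler() *sampler { return &sampler{writes: map[uint64]uint64{}} }

// RegisterRoute creates a slot for the route if it does not exist yet.
func (s *sampler) RegisterRoute(id uint64) {
	if _, ok := s.writes[id]; !ok {
		s.writes[id] = 0
	}
}

// Observe increments the route's counter, or records a drop when no slot exists.
func (s *sampler) Observe(id uint64) {
	if _, ok := s.writes[id]; !ok {
		s.dropped++
		return
	}
	s.writes[id]++
}

func main() {
	// Seeded too early: only the placeholder RouteID 0 is known.
	early := newSampler()
	early.RegisterRoute(0)
	early.Observe(42) // dispatch resolves to the post-snapshot ID
	fmt.Println("early seed: writes =", early.writes[42], "dropped =", early.dropped)

	// Seeded after the catalog snapshot: the authoritative ID is known.
	late := newSampler()
	late.RegisterRoute(42)
	late.Observe(42)
	fmt.Println("late seed:  writes =", late.writes[42], "dropped =", late.dropped)
}
```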
@claude review
Claude finished @bootjp's task in 4m 17s

Code Review: PR #647

Round-1 Fixes: Both Addressed ✅

Seed ordering (

Row budget (

Moderate: Read path not wired

And the Phase 2 success criterion (§10) includes "leader/follower reads." Since

If this is intentional scoping (writes-first in this phase, reads in the next), add a comment or a TODO in

Minor: Test gap in
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7e1475104a
  if removed && bucket.MemberRoutesTotal > 0 {
      bucket.MemberRoutesTotal--
  }
Decrement total for pruned routes beyond visible cap
When a virtual bucket exceeds MaxMemberRoutesPerSlot, removing one of the hidden members will enqueue a prune but pruneMemberRoute only decrements MemberRoutesTotal if the route was present in MemberRoutes. In that common churn path removed is false, so the total never drops, and GetKeyVizMatrix later reports inflated route_count (and stale truncation) for that bucket. This makes the admin heatmap metadata drift over time even though the route has been removed.
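One way to keep the total accurate for hidden members is to decide membership from an index that is independent of the capped visible list. The sketch below assumes a hypothetical route-to-bucket map (`owner`); whether the real sampler has such an index, and how the PR actually resolved this, is not stated in the thread.

```go
package main

import "fmt"

// bucket is a simplified stand-in for a virtual bucket: a capped visible
// member list plus an uncapped total.
type bucket struct {
	MemberRoutes      []uint64
	MemberRoutesTotal int
}

// sampler tracks which bucket owns each route. The owner map is an assumption
// of this sketch, used so that hidden members can still be identified.
type sampler struct {
	maxVisible int
	owner      map[uint64]*bucket
}

// fold adds a route to the bucket: the total always grows, the visible list
// only grows while it is under the cap.
func (s *sampler) fold(b *bucket, routeID uint64) {
	s.owner[routeID] = b
	b.MemberRoutesTotal++
	if len(b.MemberRoutes) < s.maxVisible {
		b.MemberRoutes = append(b.MemberRoutes, routeID)
	}
}

// prune removes a route. The total is decremented for every known member,
// whether or not the route appeared in the visible list.
func (s *sampler) prune(routeID uint64) {
	b, ok := s.owner[routeID]
	if !ok {
		return
	}
	delete(s.owner, routeID)
	for i, id := range b.MemberRoutes {
		if id == routeID {
			b.MemberRoutes = append(b.MemberRoutes[:i], b.MemberRoutes[i+1:]...)
			break
		}
	}
	if b.MemberRoutesTotal > 0 {
		b.MemberRoutesTotal--
	}
}

func main() {
	s := &sampler{maxVisible: 2, owner: map[uint64]*bucket{}}
	b := &bucket{}
	for _, id := range []uint64{1, 2, 3} {
		s.fold(b, id)
	}
	s.prune(3) // route 3 was hidden behind the cap
	truncated := len(b.MemberRoutes) < b.MemberRoutesTotal
	fmt.Println("visible:", b.MemberRoutes, "total:", b.MemberRoutesTotal, "truncated:", truncated)
}
```

With the decrement keyed on membership rather than on visibility, route_count and the truncation flag stay in step with the bucket's actual contributors as routes churn.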
@claude review
Claude finished @bootjp's task in 4m 32s

Code Review: PR #647 (Round 3)

Round 1+2 Fixes: Confirmed ✅

Seed ordering (

Row budget (

Moderate:
Round-2 review fixes for PR #647 (Claude bot):
- Add --keyvizHistoryColumns flag so operators can shorten the ring buffer for high-cardinality clusters; defaults to keyviz.DefaultHistoryColumns (1440 = 24h at 60s).
- startKeyVizFlusher now early-returns when the sampler is nil instead of spawning a goroutine that just parks on ctx.Done.
- Document the Phase-2 read-sampling milestone on observeMutation so future readers don't think the missing read path is a bug. The design (§5.1, §10) calls for read attribution from the node that serves the read (LeaseRead / LinearizableRead / follower reads), and that wiring is intentionally deferred to the next slice. Until then the Reads/ReadBytes series stay zero.
… read TODO) (#651)

## Summary

Follow-up to PR #647, which was merged at the round-1 commit before round-2 review fixes propagated. This PR carries the orphaned round-2 changes:

- New `--keyvizHistoryColumns` flag (defaults to `keyviz.DefaultHistoryColumns = 1440`, i.e. 24h at 60s) so operators can shorten the ring buffer for high-cardinality clusters without rebuilding.
- `startKeyVizFlusher` early-returns when the sampler is nil instead of spawning a goroutine that just parks on `ctx.Done`. The goroutine was harmless but had no signal.
- TODO on `observeMutation` documenting the Phase-2 read-sampling milestone (design §5.1, §10) so future readers don't think the missing read path is a regression. Until that wiring lands the matrix's `Reads`/`ReadBytes` series stay zero.

These items came out of Claude bot's round-2 review of #647 but landed after the merge button was pressed.

## Test plan

- [x] `go build .`, `go vet .`, `golangci-lint run ./...` clean.
- [x] `go test -race -count=1 -run 'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' .` clean.
## Summary

Phase 2-B of the Key Visualizer design (`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into the existing `web/admin/` SPA rather than building a parallel dashboard. Doc + implementation in one PR (doc commit first, per CLAUDE.md design-doc-first workflow).

- New page at `/keyviz`: canvas heatmap polling `/admin/api/v1/keyviz/matrix` with series picker (writes / reads / write_bytes / read_bytes), row-budget input (clamped at 1024), and off / 5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue, so spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate, route_count, route_ids (with truncation indicator). `start` / `end` decode through a printable-or-hex preview so binary keys do not render as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in `lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208 kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 / #646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` (proposed status, lands with this PR). Parent design §12 phase table is split into 2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out, open).

## Five-lens self-review

1. **Data loss**: n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed**: n/a; single browser tab polling a single handler. Sampler concurrency was already covered by Phase 2-A tests.
3. **Performance**: Canvas + per-cell `fillRect` runs under §10 budget at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower bound (sampler flush is 1 s).
4. **Data consistency**: SPA renders whatever the handler returns; consistency guarantees come from the existing leader-issued counters in the sampler.
5. **Test coverage**: `tsc -b --noEmit` clean; `vite build` clean; `go build ./internal/admin/...` clean (embed glob unaffected); `go test ./internal/admin/...` clean. Manual verification documented in the design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`): clean
- [x] `npm run build` (Vite): clean, output goes to `internal/admin/dist`
- [x] `go build ./internal/admin/...`: clean
- [x] `go test ./internal/admin/...`: clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out**: handler is currently node-local. Phase 2-C will add a cross-node admin RPC; this PR will pick up the aggregate view automatically once that ships.
- **Drill-down per-route sparkline**: Phase 3.
- **Routes / Raft Groups correlation**: Phase 1 SPA pages not yet built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh**: punt to follow-up.
Summary
End-to-end keyviz wiring on each node binary. Stacked on top of #645 (coordinator `WithSampler`) and #646 (admin server `RegisterSampler`). Those two PRs are merged into this branch via merge commits, so this PR's incremental diff is `main.go` + `main_keyviz_test.go`.

- `--keyvizEnabled` (off by default, opt-in), `--keyvizStep`, `--keyvizMaxTrackedRoutes`, `--keyvizMaxMemberRoutesPerSlot`. Defaults pull from `keyviz.Default*` so flag help and code stay in sync.
- `buildKeyVizSampler()` returns `*keyviz.MemSampler` when enabled, `nil` otherwise: the single decision point for "keyviz on/off."
- `seedKeyVizRoutes()` copies the engine's startup route catalogue into the sampler. Route-watch propagation post-startup is a follow-up (Phase 3 in the design doc).
- `startKeyVizFlusher()` runs `RunFlusher` in the existing errgroup and calls one final `Flush()` after ctx fires so the in-progress step is harvested at graceful shutdown.
- Coordinator: chains `.WithSampler(...)` onto the existing `.WithLeaseReadObserver(...)` call so the dispatch hot path observes mutations.
- `setupAdminService` now takes the `*MemSampler` and only calls `RegisterSampler` when it's non-nil, so operators with keyviz disabled get `codes.Unavailable` on `GetKeyVizMatrix` instead of a spurious empty success.

Implements `docs/admin_ui_key_visualizer_design.md` §5 / §10. After this lands the heatmap is fully usable end-to-end.

Test plan

- `TestBuildKeyVizSamplerHonorsEnabledFlag`: flag on/off contract.
- `TestSeedKeyVizRoutesCopiesEngineCatalogue`: `engine.Stats()` seed path.
- `TestSeedKeyVizRoutesNoOpOnNilSampler`: disabled-sampler safety.
- `TestStartKeyVizFlusherReturnsAfterCancel`: graceful shutdown drains the pre-cancel counters via the final `Flush`.
- `go build .`, `go vet .`, `golangci-lint run ./...` clean.
- `go test -race -count=1 -run 'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' .` passes.

Merge order

main (the merge commits in this PR collapse to no-ops) and merge.