feat(main): wire keyviz sampler end-to-end (flags, RunFlusher, coordinator + admin) by bootjp · Pull Request #647 · bootjp/elastickv

bootjp · 2026-04-25T22:16:24Z

Summary

End-to-end keyviz wiring on each node binary. Stacked on top of #645 (coordinator WithSampler) and #646 (admin server RegisterSampler) — those two PRs are merged into this branch via merge commits, so this PR's incremental diff is main.go + main_keyviz_test.go.

Three new flags: --keyvizEnabled (off by default — opt-in), --keyvizStep, --keyvizMaxTrackedRoutes, --keyvizMaxMemberRoutesPerSlot. Defaults pull from keyviz.Default* so flag help and code stay in sync.
buildKeyVizSampler() returns *keyviz.MemSampler when enabled, nil otherwise — the single decision point for "keyviz on/off."
seedKeyVizRoutes() copies the engine's startup route catalogue into the sampler. Route-watch propagation post-startup is a follow-up (Phase 3 in the design doc).
startKeyVizFlusher() runs RunFlusher in the existing errgroup and calls one final Flush() after ctx fires so the in-progress step is harvested at graceful shutdown.
Coordinator: chained .WithSampler(...) onto the existing .WithLeaseReadObserver(...) call so the dispatch hot path observes mutations.
AdminServer: setupAdminService now takes the *MemSampler and only calls RegisterSampler when it's non-nil — operators with keyviz disabled get codes.Unavailable on GetKeyVizMatrix instead of a spurious empty success.

Implements docs/admin_ui_key_visualizer_design.md §5 / §10. After this lands the heatmap is fully usable end-to-end.

Test plan

TestBuildKeyVizSamplerHonorsEnabledFlag — flag on/off contract.
TestSeedKeyVizRoutesCopiesEngineCatalogue — engine.Stats() seed path.
TestSeedKeyVizRoutesNoOpOnNilSampler — disabled-sampler safety.
TestStartKeyVizFlusherReturnsAfterCancel — graceful shutdown drains the pre-cancel counters via the final Flush.
go build ., go vet ., golangci-lint run ./... clean.
go test -race -count=1 -run 'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' . passes.

Merge order

Merge feat(kv): wire keyviz.Sampler into ShardedCoordinator dispatch path #645 (coordinator wiring).
Merge feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix #646 (admin server wiring).
Rebase this branch on main (the merge commits in this PR collapse to no-ops) and merge.

Wires the keyviz.Sampler from PR #639 into ShardedCoordinator (design doc §5.1): - Adds an unexported `sampler keyviz.Sampler` field plus `WithSampler(s keyviz.Sampler) *ShardedCoordinator` post- construction option, mirroring WithLeaseReadObserver. - groupMutations now calls `c.observeMutation(routeID, mut)` once per resolved (RouteID, mutation) pair before grouping by GroupID. Reads are not dispatched through this path; all calls use OpWrite. - observeMutation guards against an interface-nil sampler at the call site; the keyviz contract also tolerates a typed-nil *MemSampler, so disabled keyviz wires through to a no-op without branching deep in the hot path. - DelPrefix is intentionally not observed: dispatchDelPrefixBroadcast doesn't resolve a single RouteID (it broadcasts to every shard), and the design treats per-route attribution there as out of scope. Tests: - TestShardedCoordinatorObservesEveryDispatchedMutation: cross-shard Put batch, verifies one Observe per element with the engine's resolved RouteID, OpWrite, and exact key/value lengths. - TestShardedCoordinatorWithoutSamplerStaysSafe: dispatches successfully both with no WithSampler call (interface-nil) and with a typed-nil *MemSampler.

Implements the `GetKeyVizMatrix` admin gRPC RPC declared in `proto/admin.proto`, completing the read-side of the key visualizer flow (design doc §5.2 / §6). - New `KeyVizSampler` interface in adapter (Snapshot only) so tests can pass a deterministic in-memory fake while production wires `*keyviz.MemSampler` directly. - `AdminServer.RegisterSampler` mirrors `RegisterGroup`. Without it, GetKeyVizMatrix returns codes.Unavailable so callers can distinguish "keyviz disabled on this node" from "no data yet" (which is a successful empty response). - Pivots the column-major MatrixColumn slice into the row-major proto layout: one KeyVizRow per RouteID with values aligned to a parallel column_unix_ms slice. KeyVizSeries selection picks the matching per-row counter; UNSPECIFIED defaults to Reads. - bucket_id encodes "route:<id>" for individual slots and "virtual:<syntheticID>" for aggregate buckets. Aggregate rows carry MemberRoutes verbatim through route_ids and route_count. Tests: - TestGetKeyVizMatrixReturnsUnavailableWhenSamplerNotRegistered - TestGetKeyVizMatrixPivotsColumnsToRows — two-column / two-route fixture, verifies the missing-row-becomes-zero contract. - TestGetKeyVizMatrixSeriesSelection — table-driven across all five enum values including UNSPECIFIED defaulting to Reads. - TestGetKeyVizMatrixEncodesAggregateBucket — virtual bucket layout.

Wires the in-memory key visualizer end-to-end on each node binary (design doc §5 / §10): - Three new flags: --keyvizEnabled (off by default — opt-in), --keyvizStep, --keyvizMaxTrackedRoutes, --keyvizMaxMemberRoutesPerSlot. Defaults pull from keyviz.Default* so flag help and code stay in sync. - buildKeyVizSampler() returns *keyviz.MemSampler when enabled, nil otherwise — the single decision point for "keyviz on/off." - seedKeyVizRoutes() copies the engine's startup route catalogue into the sampler so the first matrix snapshots carry metadata. Route-watch propagation post-startup is a follow-up (Phase 3 in the design). - startKeyVizFlusher() runs RunFlusher in the existing errgroup and calls one final Flush after ctx fires so the in-progress step is harvested at graceful shutdown. - Coordinator: chained .WithSampler(keyVizSamplerForCoordinator(s)) onto the existing .WithLeaseReadObserver(...) call so the dispatch hot path observes mutations into the sampler. - AdminServer: setupAdminService now takes the *MemSampler and only calls RegisterSampler when it's non-nil — operators with keyviz disabled get codes.Unavailable on GetKeyVizMatrix instead of a spurious empty success. Tests: - TestBuildKeyVizSamplerHonorsEnabledFlag — flag on/off contract. - TestSeedKeyVizRoutesCopiesEngineCatalogue — engine.Stats path. - TestSeedKeyVizRoutesNoOpOnNilSampler — disabled-sampler safety. - TestStartKeyVizFlusherReturnsAfterCancel — graceful shutdown (RunFlusher exits cleanly + final Flush harvests pending counters).

gemini-code-assist · 2026-04-25T22:16:29Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-04-25T22:16:31Z

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 31 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 52 minutes and 31 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: edb0b4a1-0eda-45c8-a606-7c66d2da0ab1

📥 Commits

Reviewing files that changed from the base of the PR and between 29011bb and 7e14751.

📒 Files selected for processing (7)

adapter/admin_grpc.go
adapter/admin_grpc_keyviz_test.go
keyviz/sampler.go
kv/sharded_coordinator.go
kv/sharded_coordinator_sampler_test.go
main.go
main_keyviz_test.go

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/keyviz-main-wiring

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7bc737ae79

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-25T22:26:02Z

+	sampler := buildKeyVizSampler()
+	seedKeyVizRoutes(sampler, cfg.engine)
 	coordinate := kv.NewShardedCoordinator(cfg.engine, shardGroups, cfg.defaultGroup, clock, shardStore).
-		WithLeaseReadObserver(metricsRegistry.LeaseReadObserver())
+		WithLeaseReadObserver(metricsRegistry.LeaseReadObserver()).
+		WithSampler(keyVizSamplerForCoordinator(sampler))
 	distCatalog, err := setupDistributionCatalog(ctx, runtimes, cfg.engine)


Seed keyviz routes after catalog snapshot is applied

seedKeyVizRoutes runs before setupDistributionCatalog, but setupDistributionCatalog calls distribution.EnsureCatalogSnapshot, which applies a catalog snapshot back into the engine (including durable non-zero RouteIDs). That means the sampler is seeded with pre-snapshot IDs (typically 0 from buildEngine), while later dispatches observe post-snapshot IDs, so Observe misses and drops samples for normal traffic. In practice this can leave keyviz mostly empty right after startup even when writes are flowing.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-25T22:26:02Z

+	cols := sampler.Snapshot(from, to)
+	pickValue := matrixSeriesPicker(req.GetSeries())
+	return matrixToProto(cols, pickValue), nil


Apply the request row budget in GetKeyVizMatrix

This handler never reads req.GetRows(), so it always returns every row from the snapshot regardless of caller-requested resolution. On large route counts, this can blow up response size and processing time for admin queries that are supposed to request a bounded Y-axis (for example 256 rows), making the endpoint much more expensive than requested.

Useful? React with 👍 / 👎.

bootjp · 2026-04-26T06:25:32Z

@claude review

claude · 2026-04-26T06:25:47Z

Claude finished @bootjp's task in 2m 34s —— View job

Code Review — PR #647

Tasks

Gather context and understand request
Read changed files (main.go, main_keyviz_test.go, adapter/admin_grpc.go, kv/sharded_coordinator.go, test files)
Review keyviz design doc for alignment
Analyze correctness, concurrency, error handling, test coverage
Post comprehensive review

Summary

The overall architecture is clean: single construction point (buildKeyVizSampler), nil-safe threading through coordinator and admin server, and the KeyVizSampler interface in adapter/ giving good test isolation. The comment/doc quality is high. However there is one functional correctness bug and one incomplete implementation that need addressing before merge.

Critical — Bug

Seed fires before `EnsureCatalogSnapshot` assigns real RouteIDs (`main.go:292`)

// main.go (current order)
seedKeyVizRoutes(sampler, cfg.engine)          // line 292 — RouteIDs are all 0 here
distCatalog, err := setupDistributionCatalog(…) // line 296 — EnsureCatalogSnapshot runs here

UpdateRoute (called during engine construction) deliberately leaves RouteID = 0 — see distribution/engine.go:123-135. EnsureCatalogSnapshot inside setupDistributionCatalog is what assigns non-zero RouteIDs (1, 2, …) and writes them back into the engine via engine.ApplySnapshot. So when seedKeyVizRoutes calls engine.Stats() at line 292, every route has RouteID = 0.

The sampler registers one slot under RouteID = 0. All coordinator-dispatched mutations arrive with RouteIDs 1, 2, … — which miss the slot and are silently dropped. The heatmap will be empty for the lifetime of the process until route-watch propagation lands (Phase 3).

The fix is one-line: move seedKeyVizRoutes to after setupDistributionCatalog returns successfully.

distCatalog, err := setupDistributionCatalog(ctx, runtimes, cfg.engine)
if err != nil {
    return err
}
seedKeyVizRoutes(sampler, cfg.engine)   // ← move here; engine now has real RouteIDs

Fix this →

This also invalidates the existing test TestSeedKeyVizRoutesCopiesEngineCatalogue: it calls engine.UpdateRoute (which leaves RouteID=0) and then observes the same RouteID=0, so it passes today but does not guard against this regression. After fixing the ordering, add a test that calls engine.ApplySnapshot with non-zero RouteIDs before seeding, confirming the sampler registers those IDs correctly.

Moderate — Incomplete

`req.GetRows()` budget never applied (`adapter/admin_grpc.go:509-537`)

The proto defines uint32 rows = 4 on GetKeyVizMatrixRequest as a Y-axis row budget, but GetKeyVizMatrix and matrixToProto never read it. On a cluster with, say, 50 000 routes, a caller requesting rows=256 for a dashboard gets back all 50 000 rows.

The missing enforcement point is in matrixToProto:

// after sortKeyVizRowsByStart(resp.Rows), add:
if req.GetRows() > 0 && len(resp.Rows) > int(req.GetRows()) {
    resp.Rows = resp.Rows[:req.GetRows()]
}

Since matrixToProto already sorts by Start, truncating at the end is deterministic. Alternatively, the cap could be applied before the row-major pivot loop to avoid building rows that will be discarded. Either way, the current behaviour violates the proto contract.

Fix this →

Minor

`--keyvizHistoryColumns` flag missing

buildKeyVizSampler hardcodes HistoryColumns at zero (falling back to DefaultHistoryColumns = 1440, 24 h at 60 s). Operators cannot tune the retention window without recompiling. Given that the other three options are all exposed as flags, this asymmetry seems unintentional. Not a blocker, but worth a follow-up flag or at least a // TODO comment.

Flusher docstring describes the wrong shutdown signal

The startKeyVizFlusher docstring says "calls one final Flush() after ctx fires so the in-progress step is harvested at graceful shutdown". The actual code does RunFlusher(ctx, s, s.Step()); s.Flush() — the Flush() runs after RunFlusher returns, not after the eg drains. That's fine semantically (it's what the comment means), but the phrasing "after ctx fires" slightly understates that the goroutine exits first, then Flush() is called synchronously before the goroutine return. The behaviour is correct; only the wording is slightly imprecise.

`sortKeyVizRowsByStart` tie-breaking on equal Start keys

The secondary sort key is BucketId (string order: "route:N" vs "virtual:N"). During a split, two routes can briefly share the same Start. The tie-break by BucketId string-sorts virtual buckets before real routes ("r" < "v"), which may or may not match what the UI expects. Worth calling out in the design doc comment.

What's good

The KeyVizSampler interface in adapter/ (rather than a concrete dep on *keyviz.MemSampler) is the right call for testability, and the fakeKeyVizSampler in the test is a clean demonstration.
keyVizSamplerForCoordinator correctly returns a nil interface (not a typed nil) so the hot-path if c.sampler == nil guard works as expected. The comment explaining this is valuable.
RegisterSampler guarded by the existing groupsMu RWMutex is correct; concurrent RPC reads take RLock so there's no data race.
The observeMutation placement (after route resolution, before group fan-out) is exactly right — it fires once per resolved (RouteID, key) pair regardless of cross-shard dispatch count.
Test coverage for series selection, aggregate encoding, and the nil-sampler Unavailable signal is thorough.

Round-1 review fixes for PR #645: - Claude bot Issue 1: WithSampler stores the interface value as supplied; the guard at the call site only checks interface-nil, not typed-nil. Document the nil-receiver requirement on the Sampler interface so a future implementor cannot land a panic-on-nil Observe and silently break dispatch. - Claude bot Issue 2: trim observeMutation comment to the two non-obvious points (reads bypass this path; the early return keeps the disabled hot path branch-only). Per CLAUDE.md, comments for the "what" duplicate the code. - Claude / CodeRabbit nit: TestShardedCoordinatorWithoutSamplerStaysSafe now asserts c.sampler stays the zero interface value for the "no WithSampler call" subcase, so a future refactor that silently initialises the field would fail the test.

Round-1 review fixes for PR #646: Codex P1: GetKeyVizMatrix never read req.GetRows() and always returned every row, breaking the documented row-budget contract for deployments with many tracked routes. Add applyKeyVizRowBudget — when budget > 0, sort by per-row activity total (sum of values across all columns of the requested series) and truncate to the top-N before the final Start-order sort. Codex P2: route_count was set to len(MemberRoutes), but MemberRoutes is intentionally capped at MaxMemberRoutesPerSlot in the sampler while extra routes still drive the bucket counters. Capped buckets under-reported their contributors. Add MemberRoutesTotal to the keyviz routeSlot/MatrixRow surface, increment it in foldIntoBucket / RegisterRoute / bucket creation regardless of the visible cap, and decrement in pruneMemberRoute. The adapter now surfaces it as route_count and flips route_ids_truncated when the visible list is shorter than the total so consumers can tell their drill-down list is partial. Tests: - TestGetKeyVizMatrixHonorsRowsBudget — 4 routes, rows=2 returns top-2 by activity, sorted by Start. - TestGetKeyVizMatrixSurfacesRouteCountTruncation — visible cap=2, total=9 → route_count=9, route_ids_truncated=true. - TestGetKeyVizMatrixEncodesAggregateBucket updated for the new MemberRoutesTotal field on the test fixture.

Codex round-1 P1 on PR #647: seedKeyVizRoutes ran BEFORE setupDistributionCatalog. setupDistributionCatalog calls distribution.EnsureCatalogSnapshot which applies a catalog snapshot back into the engine — including the durable non-zero RouteIDs from the catalog. Seeding earlier registered the placeholder RouteID=0 entries from buildEngine; subsequent dispatches (which resolve to post-snapshot RouteIDs) found no slot and Observe dropped every mutation, leaving the heatmap empty even when traffic was flowing. Move the seed call to after setupDistributionCatalog returns so the sampler sees the catalog's authoritative IDs. Existing tests cover the seed mechanics; the ordering bug surfaced only in the live startup path which the unit tests don't exercise.

bootjp · 2026-04-26T08:58:27Z

@claude review

claude · 2026-04-26T08:58:40Z

Claude finished @bootjp's task in 4m 17s —— View job

Code Review — PR #647

Tasks

Gather context and understand request
Read changed files (main.go, main_keyviz_test.go, adapter/admin_grpc.go, keyviz/sampler.go, kv/sharded_coordinator.go + test files)
Analyze diff vs origin/main
Five-lens review (data loss, concurrency, performance, consistency, test coverage)
Post comprehensive review

Round-1 Fixes — Both Addressed ✅

Seed ordering (main.go:299–305): seedKeyVizRoutes is now correctly placed after setupDistributionCatalog, with a clear explanatory comment. The comment is excellent — it cites exactly why seeding earlier would register RouteID=0 and cause observations to miss.

Row budget (adapter/admin_grpc.go:603): applyKeyVizRowBudget is implemented, called in matrixToProto, and has a dedicated test (TestGetKeyVizMatrixHonorsRowsBudget). The activity-descending sort + truncate + final Start-key sort is the correct order for this operation.

Moderate — Read path not wired

observeMutation fires only from groupMutations (the write dispatch path). Reads — LeaseReadForKey, Coordinate.ReadForKey, and any follower-local read paths — never call Observe. The design doc §5.1 explicitly calls for read and write sampling:

"Writes are sampled exactly once by the current Raft leader before proposal. Reads are sampled by the node that actually serves the read."

And the Phase 2 success criterion (§10) includes "leader/follower reads." Since keyviz.OpRead is defined, MatrixRow.Reads is populated by the sampler, and the proto response includes KEYVIZ_SERIES_READS — operators enabling --keyvizEnabled today will get a write-only heatmap with the Reads series permanently zero. That mismatch with the design doc contract and the flag's description ("feeds AdminServer.GetKeyVizMatrix") should be called out explicitly.

If this is intentional scoping (writes-first in this phase, reads in the next), add a comment or a TODO in observeMutation and call it out in the PR description so reviewers and future contributors know Phase 2 is still open on this point.

Fix or document this →

Minor — Test gap in `TestSeedKeyVizRoutesCopiesEngineCatalogue`

The test uses engine.UpdateRoute(...) which leaves RouteID=0, then calls s.Observe(r.RouteID, keyviz.OpRead, 1, 1) against the same RouteID=0. This passes because seedKeyVizRoutes registered RouteID=0, and Observe(0, ...) finds it.

In production, EnsureCatalogSnapshot assigns RouteID=1 (first route) via routeDescriptorsFromEngine → routeID := uint64(1). The test doesn't guard the regression where a future change seeds before EnsureCatalogSnapshot (all IDs 0 again) — the test would still pass because it exercises the same 0 → 0 path throughout. Adding a subtest that calls engine.ApplySnapshot with a non-zero RouteID snapshot before seeding would close this gap. This is not blocking but was flagged in round-1 and remains open.

Minor — Missing `--keyvizHistoryColumns` flag

HistoryColumns is hardcoded to 0 (→ DefaultHistoryColumns=1440, 24 h). The other three tunables (--keyvizStep, --keyvizMaxTrackedRoutes, --keyvizMaxMemberRoutesPerSlot) are all flags. An operator who wants a shorter retention window (e.g., 2 h for a high-cardinality cluster) has no knob. Not a blocker for this PR, but add a // TODO: expose via --keyvizHistoryColumns comment in buildKeyVizSampler or open a follow-up issue.

Minor — `startKeyVizFlusher` always spawns a goroutine when sampler is nil

func startKeyVizFlusher(ctx context.Context, eg *errgroup.Group, s *keyviz.MemSampler) {
    eg.Go(func() error {
        keyviz.RunFlusher(ctx, s, s.Step()) // parks on <-ctx.Done() when s==nil
        s.Flush()                            // no-op when s==nil
        return nil
    })
}

When --keyvizEnabled=false, this goroutine just waits for context cancellation. The overhead is negligible (one parked goroutine), but a nil guard would make the "disabled" path explicit:

if s == nil {
    return
}

Not a blocker — the current code is correct and the docstring says "Nil-safe: a disabled sampler reduces the goroutine to ctx-wait + a no-op Flush."

What's good

Both round-1 issues fixed and tested. TestGetKeyVizMatrixHonorsRowsBudget and the seed-ordering comment are well-written.
keyVizSamplerForCoordinator returns an interface-nil (not a typed nil) so the if c.sampler == nil guard in observeMutation fires correctly. The comment explaining this invariant is valuable.
applyKeyVizRowBudget ordering: activity-descending sort → truncate → re-sort by Start is correct and matches the test expectation.
newKeyVizRowFrom correctly uses MemberRoutesTotal for RouteCount and sets RouteIdsTruncated when total > len(MemberRoutes). The TestGetKeyVizMatrixSurfacesRouteCountTruncation test is exactly right.
RegisterSampler guarded by groupsMu — consistent with RegisterGroup and the read side in GetKeyVizMatrix. No data race.
Seed ordering comment at main.go:299–305 is precise: it explains the exact mechanism (EnsureCatalogSnapshot → ApplySnapshot → non-zero IDs), names the failure mode (RouteID=0 → Observe misses), and references the Phase 3 follow-up. This is the kind of comment future contributors will thank you for.
fakeKeyVizSampler in the test is a clean, deep-copying implementation that mirrors the real contract without standing up a full sampler.
9 new tests are well-structured with clear doc comments that pin both positive contracts and failure modes.

Summary

The PR is in mergeable shape. Both round-1 findings are fixed. The read-path gap is the most substantive remaining item — it's either a deliberate phase boundary (in which case document it) or an oversight (in which case wire the read paths before merge). The other items are low-priority follow-ups.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e1475104a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-26T09:13:34Z

+	if removed && bucket.MemberRoutesTotal > 0 {
+		bucket.MemberRoutesTotal--
+	}


Decrement total for pruned routes beyond visible cap

When a virtual bucket exceeds MaxMemberRoutesPerSlot, removing one of the hidden members will enqueue a prune but pruneMemberRoute only decrements MemberRoutesTotal if the route was present in MemberRoutes. In that common churn path removed is false, so the total never drops, and GetKeyVizMatrix later reports inflated route_count (and stale truncation) for that bucket. This makes the admin heatmap metadata drift over time even though the route has been removed.

Useful? React with 👍 / 👎.

bootjp · 2026-04-26T09:15:13Z

@claude review

claude · 2026-04-26T09:15:28Z

Claude finished @bootjp's task in 4m 32s —— View job

Code Review — PR #647 (Round 3)

Tasks

Gather context from PR and prior review rounds
Read all 7 changed files + supporting keyviz helpers (flusher.go, ring_buffer.go)
Five-lens review pass (data loss, concurrency, perf, consistency, coverage)
Post comprehensive review

Round 1+2 Fixes — Confirmed ✅

Seed ordering (main.go:299–306): seedKeyVizRoutes correctly placed after setupDistributionCatalog with a precise comment explaining the RouteID=0 failure mode. ✅

Row budget (adapter/admin_grpc.go:603): applyKeyVizRowBudget implemented, called in matrixToProto, covered by TestGetKeyVizMatrixHonorsRowsBudget. ✅

Moderate — `MemberRoutesTotal` drift on past-cap route churn

This is the Codex P2 finding (keyviz/sampler.go:751). The code comment at line 735 partially acknowledges it:

"routes pruned past the visible cap stay in the total because we don't track individual past-cap members"

But the drift is worse than the comment implies — re-registration also double-counts:

Route A registered above MaxTrackedRoutes cap → MemberRoutesTotal++, not appended to MemberRoutes (capped)
Route A removed → deferred pruneMemberRoute fires after grace window → removed=false (route never was in MemberRoutes) → MemberRoutesTotal unchanged (should decrement)
Route A re-registered → memberRoutesContains returns false (not in MemberRoutes) → MemberRoutesTotal++ again

Each remove+re-register cycle for a past-cap route increments MemberRoutesTotal by 1 permanently. On a cluster where routes above the cap churn (splits, migrations), MemberRoutesTotal grows without bound:

route_count in the heatmap becomes permanently inflated
route_ids_truncated stays true forever even after all past-cap routes have departed
Operators drilling into hot buckets see wildly wrong attribution counts

A conservative fix for the prune side: maintain a pastCapMembers map[uint64]struct{} alongside MemberRoutes inside routeSlot, so pruneMemberRoute can detect and decrement past-cap removals. This adds memory per virtual bucket but bounds the drift. A simpler band-aid: cap MemberRoutesTotal at len(MemberRoutes) + some_sentinel and document the approximation clearly.

Fix or track this →

This was flagged by Codex in round 2 but not addressed before merge. Follow-up PR recommended.

Minor — `applyKeyVizRowBudget` uses unstable sort

// adapter/admin_grpc.go:615
sort.Slice(rows, func(i, j int) bool {
    return rowActivityTotal(rows[i]) > rowActivityTotal(rows[j])
})

sort.Slice is not guaranteed to be stable. When two routes have identical activity totals, the selection is non-deterministic across calls. On a dashboard polling GetKeyVizMatrix at regular intervals, this can cause "flickering" — different route sets returned for the same time window depending on internal sort order.

One-word fix: sort.SliceStable.

Fix this →

Still Open from Round 2

Read path not wired: observeMutation at kv/sharded_coordinator.go:978 includes a comment "reads never reach this path" — this is correct as a writes-only Phase 2 MVP. But the design doc §5.1 includes reads in the contract. The observeMutation comment should explicitly note this is deferred to Phase 3 (or equivalent) rather than just saying reads don't reach here, so the limitation is visible to future contributors.

Missing --keyvizHistoryColumns flag: HistoryColumns is hardcoded to DefaultHistoryColumns=1440 (24 h). All other tunables are flags. Add a // TODO: expose via --keyvizHistoryColumns in buildKeyVizSampler if not exposing in this PR.

What's confirmed correct

keyVizSamplerForCoordinator returns an interface-nil (not a typed-nil) so c.sampler == nil fires correctly on the hot path. Comment explaining this is exactly right.
Lock ordering: routesMu → retiredMu in RegisterRoute/RemoveRoute; Flush takes only retiredMu then historyMu. No cycle, no deadlock.
COW routeTable with atomic.Pointer.Store: Observe is a single Load + map lookup with no mutex on the hot path. Mutations build a fresh copy under routesMu, then publish atomically.
RunFlusher nil-safe (if s == nil { <-ctx.Done(); return }). startKeyVizFlusher with a nil sampler reduces to a parked goroutine that exits on cancel — correct and documented.
cancelPendingPruneFor acquires retiredMu while holding routesMu, which is the consistent order everywhere. Flush never touches routesMu. No deadlock.
rebuildSorted uses a seen map to deduplicate virtual-bucket pointers so slots don't appear twice in sortedSlots when both slots and virtualForRoute reference the same *routeSlot. Correct.
ringBuffer.Range calls cloneColumn so the returned slice is independent of the stored history — concurrent flushing cannot corrupt a Snapshot response in flight.
9 tests well-structured with clear doc comments pinning both positive contracts and failure modes. fakeKeyVizSampler correctly deep-copies rows so test assertions don't share memory with stored state.

Summary

Both round-1 critical issues are fixed and well-tested. The main open item is the MemberRoutesTotal drift for past-cap virtual bucket members — the code acknowledges the prune side but the re-registration double-count makes it compound on every churn cycle. This needs a follow-up PR before keyviz is used in any high-cardinality production deployment. The sort.SliceStable fix in applyKeyVizRowBudget is a trivial improvement worth applying at any time.

Round-2 review fixes for PR #647 (Claude bot): - Add --keyvizHistoryColumns flag so operators can shorten the ring buffer for high-cardinality clusters; defaults to keyviz.DefaultHistoryColumns (1440 = 24h at 60s). - startKeyVizFlusher now early-returns when the sampler is nil instead of spawning a goroutine that just parks on ctx.Done. - Document the Phase-2 read-sampling milestone on observeMutation so future readers don't think the missing read path is a bug — the design (§5.1, §10) calls for read attribution from the node that serves the read (LeaseRead / LinearizableRead / follower reads), and that wiring is intentionally deferred to the next slice. Until then the Reads/ReadBytes series stay zero.

… read TODO) (#651) ## Summary Follow-up to PR #647, which was merged at the round-1 commit before round-2 review fixes propagated. This PR carries the orphaned round-2 changes: - New `--keyvizHistoryColumns` flag (defaults to `keyviz.DefaultHistoryColumns = 1440`, i.e. 24h at 60s) so operators can shorten the ring buffer for high-cardinality clusters without rebuilding. - `startKeyVizFlusher` early-returns when the sampler is nil instead of spawning a goroutine that just parks on `ctx.Done` — the goroutine was harmless but had no signal. - TODO on `observeMutation` documenting the Phase-2 read-sampling milestone (design §5.1, §10) so future readers don't think the missing read path is a regression. Until that wiring lands the matrix's `Reads`/`ReadBytes` series stay zero. These items came out of Claude bot's round-2 review of #647 but landed after the merge button was pressed. ## Test plan - [x] `go build .`, `go vet .`, `golangci-lint run ./...` clean. - [x] `go test -race -count=1 -run 'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' .` clean.

## Summary Phase 2-B of the Key Visualizer design (`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into the existing `web/admin/` SPA rather than building a parallel dashboard. Doc + implementation in one PR (doc commit first, per CLAUDE.md design-doc-first workflow). - New page at `/keyviz` — canvas heatmap polling `/admin/api/v1/keyviz/matrix` with series picker (writes / reads / write_bytes / read_bytes), row-budget input (clamped at 1024), and off / 5 s / 30 s auto-refresh. - Cold cells (value 0) render as the page background, not a faint blue — spotting actually-cold routes stays the dominant visual signal. - Row-detail flyout on hover: bucket_id, start, end, aggregate, route_count, route_ids (with truncation indicator). `start` / `end` decode through a printable-or-hex preview so binary keys do not render as mojibake. - Dependency-free: hand-rolled five-stop colour ramp in `lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208 kB (raw), 64 kB gzipped. Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 / #646 / #647 / #651 / #660 / #661 / #672). ## Design `docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed status, lands with this PR. Parent design §12 phase table is split into 2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out, open). ## Five-lens self-review 1. **Data loss** — n/a; SPA is read-only against an existing handler. 2. **Concurrency / distributed** — n/a; single browser tab polling a single handler. Sampler concurrency was already covered by Phase 2-A tests. 3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower bound (sampler flush is 1 s). 4. **Data consistency** — SPA renders whatever the handler returns; consistency guarantees come from the existing leader-issued counters in the sampler. 5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go build ./internal/admin/...` clean (embed glob unaffected); `go test ./internal/admin/...` clean. Manual verification documented in the design doc §5. ## Test plan - [x] `npm run lint` (`tsc -b --noEmit`) — clean - [x] `npm run build` (Vite) — clean, output goes to `internal/admin/dist` - [x] `go build ./internal/admin/...` — clean - [x] `go test ./internal/admin/...` — clean - [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot routes light up red within ~5 s of write traffic - [ ] Manual: series picker swaps the displayed counter; row-budget input clamps at 1024; auto-refresh polls without flicker ## Out of scope - **Cluster fan-out** — handler is currently node-local. Phase 2-C will add a cross-node admin RPC; this PR will pick up the aggregate view automatically once that ships. - **Drill-down per-route sparkline** — Phase 3. - **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet built; correlation lands when those pages do. - **`localStorage` for series / rows / refresh** — punt to follow-up.

bootjp added 5 commits April 26, 2026 06:43

Merge feat/keyviz-coordinator-wiring (PR #645)

e703885

Merge feat/keyviz-admin-server (PR #646)

8c70bc1

chatgpt-codex-connector Bot reviewed Apr 25, 2026

View reviewed changes

bootjp added 5 commits April 26, 2026 17:50

Merge feat/keyviz-coordinator-wiring round-1 fixes

fb91638

Merge feat/keyviz-admin-server round-1 fixes

8344cc0

bootjp merged commit b9d24f6 into main Apr 26, 2026
8 checks passed

bootjp deleted the feat/keyviz-main-wiring branch April 26, 2026 09:12

chatgpt-codex-connector Bot reviewed Apr 26, 2026

View reviewed changes

bootjp mentioned this pull request Apr 26, 2026

feat(main): keyviz wiring follow-up (--keyvizHistoryColumns + Phase-2 read TODO) #651

Merged

2 tasks

bootjp mentioned this pull request Apr 26, 2026

feat(admin): KeyViz heatmap page in web/admin SPA (Phase 2-B) #680

Merged

6 tasks

Conversation

bootjp commented Apr 25, 2026

Summary

Test plan

Merge order

Uh oh!

gemini-code-assist Bot commented Apr 25, 2026

Uh oh!

coderabbitai Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 26, 2026

Uh oh!

claude Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #647

Summary

Critical — Bug

Seed fires before EnsureCatalogSnapshot assigns real RouteIDs (main.go:292)

Moderate — Incomplete

req.GetRows() budget never applied (adapter/admin_grpc.go:509-537)

Minor

--keyvizHistoryColumns flag missing

Flusher docstring describes the wrong shutdown signal

sortKeyVizRowsByStart tie-breaking on equal Start keys

What's good

Uh oh!

bootjp commented Apr 26, 2026

Uh oh!

claude Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #647

Round-1 Fixes — Both Addressed ✅

Moderate — Read path not wired

Minor — Test gap in TestSeedKeyVizRoutesCopiesEngineCatalogue

Minor — Missing --keyvizHistoryColumns flag

Minor — startKeyVizFlusher always spawns a goroutine when sampler is nil

What's good

Summary

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 26, 2026

Uh oh!

claude Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — PR #647 (Round 3)

Round 1+2 Fixes — Confirmed ✅

Moderate — MemberRoutesTotal drift on past-cap route churn

Minor — applyKeyVizRowBudget uses unstable sort

Still Open from Round 2

What's confirmed correct

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Apr 25, 2026 •

edited

Loading

claude Bot commented Apr 26, 2026 •

edited

Loading

Seed fires before `EnsureCatalogSnapshot` assigns real RouteIDs (`main.go:292`)

`req.GetRows()` budget never applied (`adapter/admin_grpc.go:509-537`)

`--keyvizHistoryColumns` flag missing

`sortKeyVizRowsByStart` tie-breaking on equal Start keys

claude Bot commented Apr 26, 2026 •

edited

Loading

Minor — Test gap in `TestSeedKeyVizRoutesCopiesEngineCatalogue`

Minor — Missing `--keyvizHistoryColumns` flag

Minor — `startKeyVizFlusher` always spawns a goroutine when sampler is nil

claude Bot commented Apr 26, 2026 •

edited

Loading

Moderate — `MemberRoutesTotal` drift on past-cap route churn

Minor — `applyKeyVizRowBudget` uses unstable sort