Skip to content

feat(main): wire keyviz sampler end-to-end (flags, RunFlusher, coordinator + admin)#647

Merged
bootjp merged 10 commits intomainfrom
feat/keyviz-main-wiring
Apr 26, 2026
Merged

feat(main): wire keyviz sampler end-to-end (flags, RunFlusher, coordinator + admin)#647
bootjp merged 10 commits intomainfrom
feat/keyviz-main-wiring

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 25, 2026

Summary

End-to-end keyviz wiring on each node binary. Stacked on top of #645 (coordinator WithSampler) and #646 (admin server RegisterSampler) — those two PRs are merged into this branch via merge commits, so this PR's incremental diff is main.go + main_keyviz_test.go.

  • Three new flags: --keyvizEnabled (off by default — opt-in), --keyvizStep, --keyvizMaxTrackedRoutes, --keyvizMaxMemberRoutesPerSlot. Defaults pull from keyviz.Default* so flag help and code stay in sync.
  • buildKeyVizSampler() returns *keyviz.MemSampler when enabled, nil otherwise — the single decision point for "keyviz on/off."
  • seedKeyVizRoutes() copies the engine's startup route catalogue into the sampler. Route-watch propagation post-startup is a follow-up (Phase 3 in the design doc).
  • startKeyVizFlusher() runs RunFlusher in the existing errgroup and calls one final Flush() after ctx fires so the in-progress step is harvested at graceful shutdown.
  • Coordinator: chained .WithSampler(...) onto the existing .WithLeaseReadObserver(...) call so the dispatch hot path observes mutations.
  • AdminServer: setupAdminService now takes the *MemSampler and only calls RegisterSampler when it's non-nil — operators with keyviz disabled get codes.Unavailable on GetKeyVizMatrix instead of a spurious empty success.

Implements docs/admin_ui_key_visualizer_design.md §5 / §10. After this lands the heatmap is fully usable end-to-end.

Test plan

  • TestBuildKeyVizSamplerHonorsEnabledFlag — flag on/off contract.
  • TestSeedKeyVizRoutesCopiesEngineCatalogueengine.Stats() seed path.
  • TestSeedKeyVizRoutesNoOpOnNilSampler — disabled-sampler safety.
  • TestStartKeyVizFlusherReturnsAfterCancel — graceful shutdown drains the pre-cancel counters via the final Flush.
  • go build ., go vet ., golangci-lint run ./... clean.
  • go test -race -count=1 -run 'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' . passes.

Merge order

  1. Merge feat(kv): wire keyviz.Sampler into ShardedCoordinator dispatch path #645 (coordinator wiring).
  2. Merge feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix #646 (admin server wiring).
  3. Rebase this branch on main (the merge commits in this PR collapse to no-ops) and merge.

bootjp added 5 commits April 26, 2026 06:43
Wires the keyviz.Sampler from PR #639 into ShardedCoordinator
(design doc §5.1):

- Adds an unexported `sampler keyviz.Sampler` field plus
  `WithSampler(s keyviz.Sampler) *ShardedCoordinator` post-
  construction option, mirroring WithLeaseReadObserver.
- groupMutations now calls `c.observeMutation(routeID, mut)` once
  per resolved (RouteID, mutation) pair before grouping by
  GroupID. Reads are not dispatched through this path; all calls
  use OpWrite.
- observeMutation guards against an interface-nil sampler at the
  call site; the keyviz contract also tolerates a typed-nil
  *MemSampler, so disabled keyviz wires through to a no-op without
  branching deep in the hot path.
- DelPrefix is intentionally not observed: dispatchDelPrefixBroadcast
  doesn't resolve a single RouteID (it broadcasts to every shard),
  and the design treats per-route attribution there as out of scope.

Tests:
- TestShardedCoordinatorObservesEveryDispatchedMutation: cross-shard
  Put batch, verifies one Observe per element with the engine's
  resolved RouteID, OpWrite, and exact key/value lengths.
- TestShardedCoordinatorWithoutSamplerStaysSafe: dispatches
  successfully both with no WithSampler call (interface-nil) and
  with a typed-nil *MemSampler.
Implements the `GetKeyVizMatrix` admin gRPC RPC declared in
`proto/admin.proto`, completing the read-side of the key visualizer
flow (design doc §5.2 / §6).

- New `KeyVizSampler` interface in adapter (Snapshot only) so tests
  can pass a deterministic in-memory fake while production wires
  `*keyviz.MemSampler` directly.
- `AdminServer.RegisterSampler` mirrors `RegisterGroup`. Without it,
  GetKeyVizMatrix returns codes.Unavailable so callers can
  distinguish "keyviz disabled on this node" from "no data yet"
  (which is a successful empty response).
- Pivots the column-major MatrixColumn slice into the row-major proto
  layout: one KeyVizRow per RouteID with values aligned to a parallel
  column_unix_ms slice. KeyVizSeries selection picks the matching
  per-row counter; UNSPECIFIED defaults to Reads.
- bucket_id encodes "route:<id>" for individual slots and
  "virtual:<syntheticID>" for aggregate buckets. Aggregate rows carry
  MemberRoutes verbatim through route_ids and route_count.

Tests:
- TestGetKeyVizMatrixReturnsUnavailableWhenSamplerNotRegistered
- TestGetKeyVizMatrixPivotsColumnsToRows — two-column / two-route
  fixture, verifies the missing-row-becomes-zero contract.
- TestGetKeyVizMatrixSeriesSelection — table-driven across all five
  enum values including UNSPECIFIED defaulting to Reads.
- TestGetKeyVizMatrixEncodesAggregateBucket — virtual bucket layout.
Wires the in-memory key visualizer end-to-end on each node binary
(design doc §5 / §10):

- Three new flags: --keyvizEnabled (off by default — opt-in),
  --keyvizStep, --keyvizMaxTrackedRoutes,
  --keyvizMaxMemberRoutesPerSlot. Defaults pull from
  keyviz.Default* so flag help and code stay in sync.
- buildKeyVizSampler() returns *keyviz.MemSampler when enabled, nil
  otherwise — the single decision point for "keyviz on/off."
- seedKeyVizRoutes() copies the engine's startup route catalogue
  into the sampler so the first matrix snapshots carry metadata.
  Route-watch propagation post-startup is a follow-up (Phase 3 in
  the design).
- startKeyVizFlusher() runs RunFlusher in the existing errgroup and
  calls one final Flush after ctx fires so the in-progress step is
  harvested at graceful shutdown.
- Coordinator: chained .WithSampler(keyVizSamplerForCoordinator(s))
  onto the existing .WithLeaseReadObserver(...) call so the dispatch
  hot path observes mutations into the sampler.
- AdminServer: setupAdminService now takes the *MemSampler and only
  calls RegisterSampler when it's non-nil — operators with keyviz
  disabled get codes.Unavailable on GetKeyVizMatrix instead of a
  spurious empty success.

Tests:
- TestBuildKeyVizSamplerHonorsEnabledFlag — flag on/off contract.
- TestSeedKeyVizRoutesCopiesEngineCatalogue — engine.Stats path.
- TestSeedKeyVizRoutesNoOpOnNilSampler — disabled-sampler safety.
- TestStartKeyVizFlusherReturnsAfterCancel — graceful shutdown
  (RunFlusher exits cleanly + final Flush harvests pending counters).
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 31 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 52 minutes and 31 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: edb0b4a1-0eda-45c8-a606-7c66d2da0ab1

📥 Commits

Reviewing files that changed from the base of the PR and between 29011bb and 7e14751.

📒 Files selected for processing (7)
  • adapter/admin_grpc.go
  • adapter/admin_grpc_keyviz_test.go
  • keyviz/sampler.go
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_sampler_test.go
  • main.go
  • main_keyviz_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/keyviz-main-wiring

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7bc737ae79

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread main.go
Comment on lines +291 to 296
sampler := buildKeyVizSampler()
seedKeyVizRoutes(sampler, cfg.engine)
coordinate := kv.NewShardedCoordinator(cfg.engine, shardGroups, cfg.defaultGroup, clock, shardStore).
WithLeaseReadObserver(metricsRegistry.LeaseReadObserver())
WithLeaseReadObserver(metricsRegistry.LeaseReadObserver()).
WithSampler(keyVizSamplerForCoordinator(sampler))
distCatalog, err := setupDistributionCatalog(ctx, runtimes, cfg.engine)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Seed keyviz routes after catalog snapshot is applied

seedKeyVizRoutes runs before setupDistributionCatalog, but setupDistributionCatalog calls distribution.EnsureCatalogSnapshot, which applies a catalog snapshot back into the engine (including durable non-zero RouteIDs). That means the sampler is seeded with pre-snapshot IDs (typically 0 from buildEngine), while later dispatches observe post-snapshot IDs, so Observe misses and drops samples for normal traffic. In practice this can leave keyviz mostly empty right after startup even when writes are flowing.

Useful? React with 👍 / 👎.

Comment thread adapter/admin_grpc.go Outdated
Comment on lines +536 to +538
cols := sampler.Snapshot(from, to)
pickValue := matrixSeriesPicker(req.GetSeries())
return matrixToProto(cols, pickValue), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply the request row budget in GetKeyVizMatrix

This handler never reads req.GetRows(), so it always returns every row from the snapshot regardless of caller-requested resolution. On large route counts, this can blow up response size and processing time for admin queries that are supposed to request a bounded Y-axis (for example 256 rows), making the endpoint much more expensive than requested.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 34s —— View job


Code Review — PR #647

Tasks

  • Gather context and understand request
  • Read changed files (main.go, main_keyviz_test.go, adapter/admin_grpc.go, kv/sharded_coordinator.go, test files)
  • Review keyviz design doc for alignment
  • Analyze correctness, concurrency, error handling, test coverage
  • Post comprehensive review

Summary

The overall architecture is clean: single construction point (buildKeyVizSampler), nil-safe threading through coordinator and admin server, and the KeyVizSampler interface in adapter/ giving good test isolation. The comment/doc quality is high. However there is one functional correctness bug and one incomplete implementation that need addressing before merge.


Critical — Bug

Seed fires before EnsureCatalogSnapshot assigns real RouteIDs (main.go:292)

// main.go (current order)
seedKeyVizRoutes(sampler, cfg.engine)          // line 292 — RouteIDs are all 0 here
distCatalog, err := setupDistributionCatalog(…) // line 296 — EnsureCatalogSnapshot runs here

UpdateRoute (called during engine construction) deliberately leaves RouteID = 0 — see distribution/engine.go:123-135. EnsureCatalogSnapshot inside setupDistributionCatalog is what assigns non-zero RouteIDs (1, 2, …) and writes them back into the engine via engine.ApplySnapshot. So when seedKeyVizRoutes calls engine.Stats() at line 292, every route has RouteID = 0.

The sampler registers one slot under RouteID = 0. All coordinator-dispatched mutations arrive with RouteIDs 1, 2, … — which miss the slot and are silently dropped. The heatmap will be empty for the lifetime of the process until route-watch propagation lands (Phase 3).

The fix is one-line: move seedKeyVizRoutes to after setupDistributionCatalog returns successfully.

distCatalog, err := setupDistributionCatalog(ctx, runtimes, cfg.engine)
if err != nil {
    return err
}
seedKeyVizRoutes(sampler, cfg.engine)   // ← move here; engine now has real RouteIDs

Fix this →

This also invalidates the existing test TestSeedKeyVizRoutesCopiesEngineCatalogue: it calls engine.UpdateRoute (which leaves RouteID=0) and then observes the same RouteID=0, so it passes today but does not guard against this regression. After fixing the ordering, add a test that calls engine.ApplySnapshot with non-zero RouteIDs before seeding, confirming the sampler registers those IDs correctly.


Moderate — Incomplete

req.GetRows() budget never applied (adapter/admin_grpc.go:509-537)

The proto defines uint32 rows = 4 on GetKeyVizMatrixRequest as a Y-axis row budget, but GetKeyVizMatrix and matrixToProto never read it. On a cluster with, say, 50 000 routes, a caller requesting rows=256 for a dashboard gets back all 50 000 rows.

The missing enforcement point is in matrixToProto:

// after sortKeyVizRowsByStart(resp.Rows), add:
if req.GetRows() > 0 && len(resp.Rows) > int(req.GetRows()) {
    resp.Rows = resp.Rows[:req.GetRows()]
}

Since matrixToProto already sorts by Start, truncating at the end is deterministic. Alternatively, the cap could be applied before the row-major pivot loop to avoid building rows that will be discarded. Either way, the current behaviour violates the proto contract.

Fix this →


Minor

--keyvizHistoryColumns flag missing

buildKeyVizSampler hardcodes HistoryColumns at zero (falling back to DefaultHistoryColumns = 1440, 24 h at 60 s). Operators cannot tune the retention window without recompiling. Given that the other three options are all exposed as flags, this asymmetry seems unintentional. Not a blocker, but worth a follow-up flag or at least a // TODO comment.

Flusher docstring describes the wrong shutdown signal

The startKeyVizFlusher docstring says "calls one final Flush() after ctx fires so the in-progress step is harvested at graceful shutdown". The actual code does RunFlusher(ctx, s, s.Step()); s.Flush() — the Flush() runs after RunFlusher returns, not after the eg drains. That's fine semantically (it's what the comment means), but the phrasing "after ctx fires" slightly understates that the goroutine exits first, then Flush() is called synchronously before the goroutine return. The behaviour is correct; only the wording is slightly imprecise.

sortKeyVizRowsByStart tie-breaking on equal Start keys

The secondary sort key is BucketId (string order: "route:N" vs "virtual:N"). During a split, two routes can briefly share the same Start. The tie-break by BucketId string-sorts virtual buckets before real routes ("r" < "v"), which may or may not match what the UI expects. Worth calling out in the design doc comment.


What's good

  • The KeyVizSampler interface in adapter/ (rather than a concrete dep on *keyviz.MemSampler) is the right call for testability, and the fakeKeyVizSampler in the test is a clean demonstration.
  • keyVizSamplerForCoordinator correctly returns a nil interface (not a typed nil) so the hot-path if c.sampler == nil guard works as expected. The comment explaining this is valuable.
  • RegisterSampler guarded by the existing groupsMu RWMutex is correct; concurrent RPC reads take RLock so there's no data race.
  • The observeMutation placement (after route resolution, before group fan-out) is exactly right — it fires once per resolved (RouteID, key) pair regardless of cross-shard dispatch count.
  • Test coverage for series selection, aggregate encoding, and the nil-sampler Unavailable signal is thorough.

bootjp added 5 commits April 26, 2026 17:50
Round-1 review fixes for PR #645:

- Claude bot Issue 1: WithSampler stores the interface value as
  supplied; the guard at the call site only checks interface-nil,
  not typed-nil. Document the nil-receiver requirement on the
  Sampler interface so a future implementor cannot land a
  panic-on-nil Observe and silently break dispatch.
- Claude bot Issue 2: trim observeMutation comment to the two
  non-obvious points (reads bypass this path; the early return
  keeps the disabled hot path branch-only). Per CLAUDE.md, comments
  for the "what" duplicate the code.
- Claude / CodeRabbit nit: TestShardedCoordinatorWithoutSamplerStaysSafe
  now asserts c.sampler stays the zero interface value for the
  "no WithSampler call" subcase, so a future refactor that silently
  initialises the field would fail the test.
Round-1 review fixes for PR #646:

Codex P1: GetKeyVizMatrix never read req.GetRows() and always
returned every row, breaking the documented row-budget contract for
deployments with many tracked routes. Add applyKeyVizRowBudget — when
budget > 0, sort by per-row activity total (sum of values across
all columns of the requested series) and truncate to the top-N
before the final Start-order sort.

Codex P2: route_count was set to len(MemberRoutes), but MemberRoutes
is intentionally capped at MaxMemberRoutesPerSlot in the sampler
while extra routes still drive the bucket counters. Capped buckets
under-reported their contributors. Add MemberRoutesTotal to the
keyviz routeSlot/MatrixRow surface, increment it in
foldIntoBucket / RegisterRoute / bucket creation regardless of the
visible cap, and decrement in pruneMemberRoute. The adapter now
surfaces it as route_count and flips route_ids_truncated when the
visible list is shorter than the total so consumers can tell their
drill-down list is partial.

Tests:
- TestGetKeyVizMatrixHonorsRowsBudget — 4 routes, rows=2 returns
  top-2 by activity, sorted by Start.
- TestGetKeyVizMatrixSurfacesRouteCountTruncation — visible cap=2,
  total=9 → route_count=9, route_ids_truncated=true.
- TestGetKeyVizMatrixEncodesAggregateBucket updated for the new
  MemberRoutesTotal field on the test fixture.
Codex round-1 P1 on PR #647: seedKeyVizRoutes ran BEFORE
setupDistributionCatalog. setupDistributionCatalog calls
distribution.EnsureCatalogSnapshot which applies a catalog snapshot
back into the engine — including the durable non-zero RouteIDs from
the catalog. Seeding earlier registered the placeholder RouteID=0
entries from buildEngine; subsequent dispatches (which resolve to
post-snapshot RouteIDs) found no slot and Observe dropped every
mutation, leaving the heatmap empty even when traffic was flowing.

Move the seed call to after setupDistributionCatalog returns so the
sampler sees the catalog's authoritative IDs. Existing tests cover
the seed mechanics; the ordering bug surfaced only in the live
startup path which the unit tests don't exercise.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 4m 17s —— View job


Code Review — PR #647

Tasks

  • Gather context and understand request
  • Read changed files (main.go, main_keyviz_test.go, adapter/admin_grpc.go, keyviz/sampler.go, kv/sharded_coordinator.go + test files)
  • Analyze diff vs origin/main
  • Five-lens review (data loss, concurrency, performance, consistency, test coverage)
  • Post comprehensive review

Round-1 Fixes — Both Addressed ✅

Seed ordering (main.go:299–305): seedKeyVizRoutes is now correctly placed after setupDistributionCatalog, with a clear explanatory comment. The comment is excellent — it cites exactly why seeding earlier would register RouteID=0 and cause observations to miss.

Row budget (adapter/admin_grpc.go:603): applyKeyVizRowBudget is implemented, called in matrixToProto, and has a dedicated test (TestGetKeyVizMatrixHonorsRowsBudget). The activity-descending sort + truncate + final Start-key sort is the correct order for this operation.


Moderate — Read path not wired

observeMutation fires only from groupMutations (the write dispatch path). Reads — LeaseReadForKey, Coordinate.ReadForKey, and any follower-local read paths — never call Observe. The design doc §5.1 explicitly calls for read and write sampling:

"Writes are sampled exactly once by the current Raft leader before proposal. Reads are sampled by the node that actually serves the read."

And the Phase 2 success criterion (§10) includes "leader/follower reads." Since keyviz.OpRead is defined, MatrixRow.Reads is populated by the sampler, and the proto response includes KEYVIZ_SERIES_READS — operators enabling --keyvizEnabled today will get a write-only heatmap with the Reads series permanently zero. That mismatch with the design doc contract and the flag's description ("feeds AdminServer.GetKeyVizMatrix") should be called out explicitly.

If this is intentional scoping (writes-first in this phase, reads in the next), add a comment or a TODO in observeMutation and call it out in the PR description so reviewers and future contributors know Phase 2 is still open on this point.

Fix or document this →


Minor — Test gap in TestSeedKeyVizRoutesCopiesEngineCatalogue

The test uses engine.UpdateRoute(...) which leaves RouteID=0, then calls s.Observe(r.RouteID, keyviz.OpRead, 1, 1) against the same RouteID=0. This passes because seedKeyVizRoutes registered RouteID=0, and Observe(0, ...) finds it.

In production, EnsureCatalogSnapshot assigns RouteID=1 (first route) via routeDescriptorsFromEnginerouteID := uint64(1). The test doesn't guard the regression where a future change seeds before EnsureCatalogSnapshot (all IDs 0 again) — the test would still pass because it exercises the same 0 → 0 path throughout. Adding a subtest that calls engine.ApplySnapshot with a non-zero RouteID snapshot before seeding would close this gap. This is not blocking but was flagged in round-1 and remains open.


Minor — Missing --keyvizHistoryColumns flag

HistoryColumns is hardcoded to 0 (→ DefaultHistoryColumns=1440, 24 h). The other three tunables (--keyvizStep, --keyvizMaxTrackedRoutes, --keyvizMaxMemberRoutesPerSlot) are all flags. An operator who wants a shorter retention window (e.g., 2 h for a high-cardinality cluster) has no knob. Not a blocker for this PR, but add a // TODO: expose via --keyvizHistoryColumns comment in buildKeyVizSampler or open a follow-up issue.


Minor — startKeyVizFlusher always spawns a goroutine when sampler is nil

func startKeyVizFlusher(ctx context.Context, eg *errgroup.Group, s *keyviz.MemSampler) {
    eg.Go(func() error {
        keyviz.RunFlusher(ctx, s, s.Step()) // parks on <-ctx.Done() when s==nil
        s.Flush()                            // no-op when s==nil
        return nil
    })
}

When --keyvizEnabled=false, this goroutine just waits for context cancellation. The overhead is negligible (one parked goroutine), but a nil guard would make the "disabled" path explicit:

if s == nil {
    return
}

Not a blocker — the current code is correct and the docstring says "Nil-safe: a disabled sampler reduces the goroutine to ctx-wait + a no-op Flush."


What's good

  • Both round-1 issues fixed and tested. TestGetKeyVizMatrixHonorsRowsBudget and the seed-ordering comment are well-written.
  • keyVizSamplerForCoordinator returns an interface-nil (not a typed nil) so the if c.sampler == nil guard in observeMutation fires correctly. The comment explaining this invariant is valuable.
  • applyKeyVizRowBudget ordering: activity-descending sort → truncate → re-sort by Start is correct and matches the test expectation.
  • newKeyVizRowFrom correctly uses MemberRoutesTotal for RouteCount and sets RouteIdsTruncated when total > len(MemberRoutes). The TestGetKeyVizMatrixSurfacesRouteCountTruncation test is exactly right.
  • RegisterSampler guarded by groupsMu — consistent with RegisterGroup and the read side in GetKeyVizMatrix. No data race.
  • Seed ordering comment at main.go:299–305 is precise: it explains the exact mechanism (EnsureCatalogSnapshotApplySnapshot → non-zero IDs), names the failure mode (RouteID=0 → Observe misses), and references the Phase 3 follow-up. This is the kind of comment future contributors will thank you for.
  • fakeKeyVizSampler in the test is a clean, deep-copying implementation that mirrors the real contract without standing up a full sampler.
  • 9 new tests are well-structured with clear doc comments that pin both positive contracts and failure modes.

Summary

The PR is in mergeable shape. Both round-1 findings are fixed. The read-path gap is the most substantive remaining item — it's either a deliberate phase boundary (in which case document it) or an oversight (in which case wire the read paths before merge). The other items are low-priority follow-ups.

@bootjp bootjp merged commit b9d24f6 into main Apr 26, 2026
8 checks passed
@bootjp bootjp deleted the feat/keyviz-main-wiring branch April 26, 2026 09:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e1475104a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread keyviz/sampler.go
Comment on lines +749 to +751
if removed && bucket.MemberRoutesTotal > 0 {
bucket.MemberRoutesTotal--
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Decrement total for pruned routes beyond visible cap

When a virtual bucket exceeds MaxMemberRoutesPerSlot, removing one of the hidden members will enqueue a prune but pruneMemberRoute only decrements MemberRoutesTotal if the route was present in MemberRoutes. In that common churn path removed is false, so the total never drops, and GetKeyVizMatrix later reports inflated route_count (and stale truncation) for that bucket. This makes the admin heatmap metadata drift over time even though the route has been removed.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 4m 32s —— View job


Code Review — PR #647 (Round 3)

Tasks

  • Gather context from PR and prior review rounds
  • Read all 7 changed files + supporting keyviz helpers (flusher.go, ring_buffer.go)
  • Five-lens review pass (data loss, concurrency, perf, consistency, coverage)
  • Post comprehensive review

Round 1+2 Fixes — Confirmed ✅

Seed ordering (main.go:299–306): seedKeyVizRoutes correctly placed after setupDistributionCatalog with a precise comment explaining the RouteID=0 failure mode. ✅

Row budget (adapter/admin_grpc.go:603): applyKeyVizRowBudget implemented, called in matrixToProto, covered by TestGetKeyVizMatrixHonorsRowsBudget. ✅


Moderate — MemberRoutesTotal drift on past-cap route churn

This is the Codex P2 finding (keyviz/sampler.go:751). The code comment at line 735 partially acknowledges it:

"routes pruned past the visible cap stay in the total because we don't track individual past-cap members"

But the drift is worse than the comment implies — re-registration also double-counts:

  1. Route A registered above MaxTrackedRoutes cap → MemberRoutesTotal++, not appended to MemberRoutes (capped)
  2. Route A removed → deferred pruneMemberRoute fires after grace window → removed=false (route never was in MemberRoutes) → MemberRoutesTotal unchanged (should decrement)
  3. Route A re-registered → memberRoutesContains returns false (not in MemberRoutes) → MemberRoutesTotal++ again

Each remove+re-register cycle for a past-cap route increments MemberRoutesTotal by 1 permanently. On a cluster where routes above the cap churn (splits, migrations), MemberRoutesTotal grows without bound:

  • route_count in the heatmap becomes permanently inflated
  • route_ids_truncated stays true forever even after all past-cap routes have departed
  • Operators drilling into hot buckets see wildly wrong attribution counts

A conservative fix for the prune side: maintain a pastCapMembers map[uint64]struct{} alongside MemberRoutes inside routeSlot, so pruneMemberRoute can detect and decrement past-cap removals. This adds memory per virtual bucket but bounds the drift. A simpler band-aid: cap MemberRoutesTotal at len(MemberRoutes) + some_sentinel and document the approximation clearly.

Fix or track this →

This was flagged by Codex in round 2 but not addressed before merge. Follow-up PR recommended.


Minor — applyKeyVizRowBudget uses unstable sort

// adapter/admin_grpc.go:615
sort.Slice(rows, func(i, j int) bool {
    return rowActivityTotal(rows[i]) > rowActivityTotal(rows[j])
})

sort.Slice is not guaranteed to be stable. When two routes have identical activity totals, the selection is non-deterministic across calls. On a dashboard polling GetKeyVizMatrix at regular intervals, this can cause "flickering" — different route sets returned for the same time window depending on internal sort order.

One-word fix: sort.SliceStable.

Fix this →


Still Open from Round 2

Read path not wired: observeMutation at kv/sharded_coordinator.go:978 includes a comment "reads never reach this path" — this is correct as a writes-only Phase 2 MVP. But the design doc §5.1 includes reads in the contract. The observeMutation comment should explicitly note this is deferred to Phase 3 (or equivalent) rather than just saying reads don't reach here, so the limitation is visible to future contributors.

Missing --keyvizHistoryColumns flag: HistoryColumns is hardcoded to DefaultHistoryColumns=1440 (24 h). All other tunables are flags. Add a // TODO: expose via --keyvizHistoryColumns in buildKeyVizSampler if not exposing in this PR.


What's confirmed correct

  • keyVizSamplerForCoordinator returns an interface-nil (not a typed-nil) so c.sampler == nil fires correctly on the hot path. Comment explaining this is exactly right.
  • Lock ordering: routesMu → retiredMu in RegisterRoute/RemoveRoute; Flush takes only retiredMu then historyMu. No cycle, no deadlock.
  • COW routeTable with atomic.Pointer.Store: Observe is a single Load + map lookup with no mutex on the hot path. Mutations build a fresh copy under routesMu, then publish atomically.
  • RunFlusher nil-safe (if s == nil { <-ctx.Done(); return }). startKeyVizFlusher with a nil sampler reduces to a parked goroutine that exits on cancel — correct and documented.
  • cancelPendingPruneFor acquires retiredMu while holding routesMu, which is the consistent order everywhere. Flush never touches routesMu. No deadlock.
  • rebuildSorted uses a seen map to deduplicate virtual-bucket pointers so slots don't appear twice in sortedSlots when both slots and virtualForRoute reference the same *routeSlot. Correct.
  • ringBuffer.Range calls cloneColumn so the returned slice is independent of the stored history — concurrent flushing cannot corrupt a Snapshot response in flight.
  • 9 tests well-structured with clear doc comments pinning both positive contracts and failure modes. fakeKeyVizSampler correctly deep-copies rows so test assertions don't share memory with stored state.

Summary

Both round-1 critical issues are fixed and well-tested. The main open item is the MemberRoutesTotal drift for past-cap virtual bucket members — the code acknowledges the prune side but the re-registration double-count makes it compound on every churn cycle. This needs a follow-up PR before keyviz is used in any high-cardinality production deployment. The sort.SliceStable fix in applyKeyVizRowBudget is a trivial improvement worth applying at any time.

bootjp added a commit that referenced this pull request Apr 26, 2026
Round-2 review fixes for PR #647 (Claude bot):

- Add --keyvizHistoryColumns flag so operators can shorten the ring
  buffer for high-cardinality clusters; defaults to
  keyviz.DefaultHistoryColumns (1440 = 24h at 60s).
- startKeyVizFlusher now early-returns when the sampler is nil
  instead of spawning a goroutine that just parks on ctx.Done.
- Document the Phase-2 read-sampling milestone on observeMutation
  so future readers don't think the missing read path is a bug —
  the design (§5.1, §10) calls for read attribution from the node
  that serves the read (LeaseRead / LinearizableRead / follower
  reads), and that wiring is intentionally deferred to the next
  slice. Until then the Reads/ReadBytes series stay zero.
bootjp added a commit that referenced this pull request Apr 26, 2026
… read TODO) (#651)

## Summary

Follow-up to PR #647, which was merged at the round-1 commit before
round-2 review fixes propagated. This PR carries the orphaned round-2
changes:

- New `--keyvizHistoryColumns` flag (defaults to
`keyviz.DefaultHistoryColumns = 1440`, i.e. 24h at 60s) so operators can
shorten the ring buffer for high-cardinality clusters without
rebuilding.
- `startKeyVizFlusher` early-returns when the sampler is nil instead of
spawning a goroutine that just parks on `ctx.Done` — the goroutine was
harmless but had no signal.
- TODO on `observeMutation` documenting the Phase-2 read-sampling
milestone (design §5.1, §10) so future readers don't think the missing
read path is a regression. Until that wiring lands the matrix's
`Reads`/`ReadBytes` series stay zero.

These items came out of Claude bot's round-2 review of #647 but landed
after the merge button was pressed.

## Test plan

- [x] `go build .`, `go vet .`, `golangci-lint run ./...` clean.
- [x] `go test -race -count=1 -run
'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' .`
clean.
bootjp added a commit that referenced this pull request Apr 26, 2026
## Summary

Phase 2-B of the Key Visualizer design
(`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into
the existing `web/admin/` SPA rather than building a parallel dashboard.
Doc + implementation in one PR (doc commit first, per CLAUDE.md
design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling
`/admin/api/v1/keyviz/matrix` with series picker (writes / reads /
write_bytes / read_bytes), row-budget input (clamped at 1024), and off /
5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue —
spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate,
route_count, route_ids (with truncation indicator). `start` / `end`
decode through a printable-or-hex preview so binary keys do not render
as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in
`lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208
kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and
its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 /
#646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed
status, lands with this PR. Parent design §12 phase table is split into
2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out,
open).

## Five-lens self-review

1. **Data loss** — n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; single browser tab polling a
single handler. Sampler concurrency was already covered by Phase 2-A
tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget
at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower
bound (sampler flush is 1 s).
4. **Data consistency** — SPA renders whatever the handler returns;
consistency guarantees come from the existing leader-issued counters in
the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go
build ./internal/admin/...` clean (embed glob unaffected); `go test
./internal/admin/...` clean. Manual verification documented in the
design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to
`internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot
routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget
input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — handler is currently node-local. Phase 2-C will
add a cross-node admin RPC; this PR will pick up the aggregate view
automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet
built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant