
feat(kv): observe lease + linearizable reads via keyviz Sampler #661

Merged
bootjp merged 2 commits into main from feat/keyviz-read-observe on Apr 26, 2026
Conversation

bootjp (Owner) commented Apr 26, 2026

Summary

Wires read sampling into the shard coordinator's two key-routed read entry points so the heatmap's Reads / ReadBytes series get populated (design §5.1: "Reads are sampled by the node that actually serves the read"). Mirrors the write-side observeMutation wiring; both feed the same keyviz.Sampler the coordinator already holds.

  • New observeReadKey helper, called from LinearizableReadForKey and LeaseReadForKey after the route is resolved (call-site shape sketched below).
  • valueLen is always 0 at this layer — the consistency check doesn't fetch data; the actual MVCCStore.GetAt happens further down the stack and is sampled separately. Adapter direct-read paths (Redis / DynamoDB / S3 hitting MVCCStore.GetAt without going through the coordinator) still bypass keyviz; sampling those is a future slice — see the observeReadKey doc comment.
  • Leadership-check entries (IsLeaderForKey, VerifyLeaderForKey, RaftLeaderForKey) intentionally do NOT call observeReadKey — they're internal routing checks, not user-facing data reads.

Closes the read-side gap noted on the observeMutation Phase-2 TODO. The matrix's Reads and ReadBytes series are no longer permanently zero.
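
For orientation, the call-site shape in the two read entry points looks roughly like this (a sketch assembled from snippets quoted later in this thread; the surrounding signature, return values, and the downstream group call are assumptions, not the merged source):

func (c *ShardedCoordinator) LeaseReadForKey(ctx context.Context, key []byte) (uint64, error) {
    g, ok := c.groupForKey(key)
    if !ok {
        return 0, errors.WithStack(ErrLeaderNotFound)
    }
    // Sample the read once the route is resolved; valueLen stays 0 because
    // this layer only performs the consistency check, not the data fetch.
    c.observeReadKey(key)
    return g.LeaseRead(ctx) // downstream call is hypothetical, shown for shape only
}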

Test plan

  • TestShardedCoordinatorObservesLeaseAndLinearizableReads — drives one of each through a real raft + store, asserts exactly one Observe(routeID, OpRead, len(key), 0) per call (a sketch of the kind of recording double this implies follows the list).
  • TestShardedCoordinatorSkipsObserveForLeadershipChecks — pins the negative contract that the three leadership checks stay silent.
  • go test -race -count=1 -run TestShardedCoordinator ./kv/... clean.
  • golangci-lint run ./kv/... clean.
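
For context, the recording double these tests rely on can be as small as a mutex-guarded slice of captured calls. A sketch, assuming the Observe signature shown above (sync and keyviz imports omitted; the real recordingSampler in kv/sharded_coordinator_sampler_test.go may differ):

// observeCall captures one Observe invocation for later assertions.
type observeCall struct {
    routeID  uint64
    op       keyviz.Op
    keyLen   int
    valueLen int
}

// recordingSampler is a thread-safe keyviz.Sampler test double.
type recordingSampler struct {
    mu    sync.Mutex
    calls []observeCall
}

func (r *recordingSampler) Observe(routeID uint64, op keyviz.Op, keyLen, valueLen int) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.calls = append(r.calls, observeCall{routeID, op, keyLen, valueLen})
}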



coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 37 minutes and 48 seconds before requesting another review.

📒 Files selected for processing (2)
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_sampler_test.go

gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request implements read-side sampling for key visualization by integrating the observeReadKey helper into the LinearizableReadForKey and LeaseReadForKey methods of the ShardedCoordinator. It also includes unit tests to verify that user-facing reads are correctly recorded while internal leadership checks are ignored. Feedback was provided regarding the performance overhead of redundant routing lookups within the observeReadKey helper and the potential for typed-nil interface issues when checking the sampler.

Comment thread on kv/sharded_coordinator.go (outdated), lines +1006 to +1015:
func (c *ShardedCoordinator) observeReadKey(key []byte) {
    if c.sampler == nil {
        return
    }
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok {
        return
    }
    c.sampler.Observe(route.RouteID, keyviz.OpRead, len(key), 0)
}


Severity: medium

The observeReadKey helper introduces a redundant routing lookup on the read hot path. Since the caller (LinearizableReadForKey or LeaseReadForKey) has already performed a lookup via groupForKey, performing it again here doubles the overhead (lock acquisition and binary search) when sampling is enabled.

Additionally, the c.sampler == nil check does not account for typed-nil interface values (e.g., a *keyviz.MemSampler(nil)), which are explicitly supported and tested. For such cases, the expensive GetRoute lookup is still performed before the sampler's Observe method (which is nil-safe) returns.

While the comment mentions that this keeps the helper nil-safe, it contradicts the goal of keeping the hot path efficient. A better approach would be to refactor the routing logic to return the Route object and pass the RouteID directly to the observation helper, mirroring the efficient pattern used in observeMutation.

References
  1. For frequently accessed fields and hot paths, prioritize performance by minimizing overhead such as redundant lookups and lock acquisitions.

bootjp force-pushed the feat/keyviz-read-observe branch from 9bca535 to 254b9cf on April 26, 2026 at 10:19
bootjp (Owner, Author) commented Apr 26, 2026

@claude review


claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 7s.


Code Review


The change correctly wires read-side keyviz sampling and the test structure is solid. One structural issue merits addressing before merge, and there are a few smaller observations.


Performance: redundant GetRoute on the sampled read path

This is the same concern Gemini raised and it is real. The current code does two full GetRoute calls (each acquires e.mu.RLock() + binary-search over the route slice) per sampled read:

LinearizableReadForKey / LeaseReadForKey
  → groupForKey(key)              // 1st RLock + routeIndex
      c.engine.GetRoute(routeKey(key))  ← discards route.RouteID
  → observeReadKey(key)           // 2nd RLock + routeIndex (when c.sampler != nil)
      c.engine.GetRoute(routeKey(key))

The observeMutation path avoids this entirely because groupMutations already resolves the route while sharding mutations, and passes routeID uint64 directly:

func (c *ShardedCoordinator) observeMutation(routeID uint64, mut *pb.Mutation) {

The cleanest fix is to mirror that pattern for reads. groupForKey already holds the route internally — it just discards it:

func (c *ShardedCoordinator) groupForKey(key []byte) (*ShardGroup, bool) {
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok { return nil, false }
    g, ok := c.groups[route.GroupID]
    return g, ok   // route.RouteID thrown away here
}

Suggested refactor: add a variant that also returns routeID, or change groupForKey to return a (routeID uint64, *ShardGroup, bool) triple. Then:

// observeReadKey(routeID, keyLen) — mirrors observeMutation
func (c *ShardedCoordinator) observeReadKey(routeID uint64, keyLen int) {
    if c.sampler == nil { return }
    c.sampler.Observe(routeID, keyviz.OpRead, keyLen, 0)
}

And call sites become:

routeID, g, ok := c.groupForKeyWithRouteID(key)
if !ok { return 0, errors.WithStack(ErrLeaderNotFound) }
c.observeReadKey(routeID, len(key))

This also eliminates the unreachable !ok branch inside observeReadKey — since the caller has already confirmed the route exists, the second lookup can never fail under a stable routing table.



Unreachable !ok arm in observeReadKey

func (c *ShardedCoordinator) observeReadKey(key []byte) {
    if c.sampler == nil { return }
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok {   // ← can never fire: caller already confirmed ok=true
        return
    }
    ...
}

This is only reachable if the routing table changes between groupForKey and observeReadKey. In practice that window is nanoseconds and route changes are rare, but the branch is dead under the current call pattern and will confuse future readers. The refactor above eliminates it naturally.


Doc comment accuracy

// Caller must have already passed groupForKey(key) so the GetRoute
// lookup almost always hits in the table — keeping it inside this
// helper keeps observeReadKey nil-safe at the lookup level too

"Almost always hits" reads like a cache-hit hint. What it actually means is "this is a duplicate lookup that we justify by saying it'll succeed." The comment inadvertently papers over the double-lookup rather than explaining why it is acceptable. Once the refactor is done, this comment goes away entirely.


Typed-nil sampler (c.sampler == nil vs. typed-nil)

Gemini's concern here is valid in theory but the Sampler interface contract explicitly handles this:

// Implementations MUST be nil-receiver-safe: a typed-nil
// implementation passed through this interface ... must not panic.
// The coordinator stores the interface value as supplied and dispatches
// through it on the hot path; a guard at the call site only checks for
// an interface-nil, not a typed-nil.

So observeReadKey checking c.sampler == nil is consistent with observeMutation, and the contract is correctly documented. The concern also loses its teeth once the double GetRoute is fixed: a typed-nil sampler is not caught by the interface-nil guard, but with the route lookup removed the only remaining work is the nil-safe Observe call itself, which returns immediately.


Minor observations

LeaseRead (no-key variant) intentionally unsampled — The default-group LeaseRead at line 763 has no key context and is correctly left without observeReadKey. Worth a brief comment for the next reviewer's benefit, since it sits right above the sampled LeaseReadForKey.

RouteID: 0 in tests — engine.UpdateRoute (used in the new tests) assigns RouteID: 0. This exercises the wiring correctly but if someone changes the test's assertion to verify a non-zero RouteID they'll be surprised. The existing mutation tests have the same pattern so this is not a new problem, just worth being aware of.

Test key "hot-key" vs. route start "a""hot-key" starts with 'h' which is ≥ 'a', so it correctly falls in the [a, ∞) route. Not a bug, but if the test key ever changes to something starting with a character < 'a' (e.g. '!' or a capital letter) it would silently miss the route and observeReadKey would be a no-op, making require.Len(t, calls, 2) the only failure signal. A comment noting the dependency would prevent a future head-scratch.


Summary

The read-side wiring is conceptually correct and well-tested. The double GetRoute lookup is the only issue worth fixing before merge — it doubles hot-path lock acquisitions whenever keyviz is enabled and leaves dead code behind. The suggested refactor is small (add a groupForKeyWithRouteID variant and change observeReadKey's signature) and mirrors an already-established pattern in the file.

Round-1 review fix for PR #661 (Gemini medium):

observeReadKey was repeating the GetRoute lookup that the calling
LinearizableReadForKey / LeaseReadForKey had already done via
groupForKey — doubling the per-read routing cost when sampling
was enabled. Refactor:

- New routeAndGroupForKey returns both the route ID and the
  *ShardGroup in one engine.GetRoute pass; reads use it instead
  of groupForKey.
- observeRead now takes the precomputed routeID directly. The
  c.sampler == nil guard is the sole short-circuit; we never pay
  GetRoute for the keyviz call.
- Leadership-only entries (IsLeaderForKey, VerifyLeaderForKey,
  RaftLeaderForKey) keep using groupForKey since they don't need
  the route ID.
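
A minimal sketch of the refactored pair described above, reusing identifiers quoted elsewhere in this thread (engine.GetRoute, routeKey, c.groups, keyviz.OpRead); the merged code may differ in detail:

// routeAndGroupForKey resolves the key's route once, returning both the
// route ID (for keyviz observation) and the owning shard group.
func (c *ShardedCoordinator) routeAndGroupForKey(key []byte) (uint64, *ShardGroup, bool) {
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok {
        return 0, nil, false
    }
    g, ok := c.groups[route.GroupID]
    return route.RouteID, g, ok
}

// observeRead records a sampled read against an already-resolved route.
// The interface-nil guard is the only short-circuit; no second GetRoute.
func (c *ShardedCoordinator) observeRead(routeID uint64, keyLen int) {
    if c.sampler == nil {
        return
    }
    c.sampler.Observe(routeID, keyviz.OpRead, keyLen, 0)
}
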
bootjp (Owner, Author) commented Apr 26, 2026

@claude review


claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 26s.


Code Review


The PR is well-structured and the second commit (b8fe219) already addressed Gemini's round-1 performance nit cleanly. Here are findings across the five review lenses:

Data Loss

No issues. observeRead fires before the underlying operation (pre-barrier for LinearizableReadForKey, pre-lease-check for LeaseReadForKey), which is consistent with observeMutation's offered-load semantics. Mutations that fail their Raft proposal are still counted; reads that fail their barrier will be similarly counted. This is intentional and fine.

Concurrency / Distributed Failures

No issues. routeAndGroupForKey is a straightforward composition of engine.GetRoute + groups map lookup — same concurrency profile as groupForKey. The sampler field is set once via WithSampler before concurrent use, consistent with the pre-existing write-path pattern.

The typed-nil concern Gemini raised in round 1 is not a bug: MemSampler.Observe has an explicit if s == nil { return } guard at keyviz/sampler.go:355, so a (*keyviz.MemSampler)(nil) receiver is safe by contract. The existing TestShardedCoordinatorWithoutSamplerStaysSafe already covers this case.
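
The nil-receiver guard that contract describes is the standard Go idiom of checking the pointer receiver itself before touching any state. A toy illustration (sync import omitted; the type and its fields are hypothetical stand-ins, not the real keyviz.MemSampler):

// Op stands in for keyviz.Op in this toy sketch.
type Op int

// memSampler is a hypothetical stand-in used only to show the guard.
type memSampler struct {
    mu    sync.Mutex
    reads map[uint64]int
}

// Observe is nil-receiver-safe: a (*memSampler)(nil) stored in a Sampler
// interface still dispatches here and returns instead of panicking.
func (s *memSampler) Observe(routeID uint64, op Op, keyLen, valueLen int) {
    if s == nil {
        return
    }
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.reads == nil {
        s.reads = make(map[uint64]int)
    }
    s.reads[routeID] += keyLen + valueLen
}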

Performance

The second commit resolves the double-lookup. routeAndGroupForKey performs exactly one GetRoute + one map lookup, same as the old groupForKey. observeRead with an interface-nil sampler is a single branch and return — allocation-free. No regressions.

Data Consistency

No issues. valueLen = 0 at this layer is correct and well-documented: the coordinator's consistency check doesn't fetch data; the actual MVCCStore.GetAt happens further down the stack. The asymmetry between keyLen > 0 (actual key size) and valueLen = 0 is noted in the comment and the test asserts valueLen: 0 explicitly.

Test Coverage

Tests are solid. recordingSampler is thread-safe (mutex-guarded), table-driven where appropriate, and the negative contract (TestShardedCoordinatorSkipsObserveForLeadershipChecks) is pinned explicitly. Two issues worth addressing:


Issues

1. PR/review-tool references in production comments — [fix before merge]

kv/sharded_coordinator.go embeds "Gemini round-1 nit on PR #661" in two places:

  • routeAndGroupForKey comment (line ~843): "...so the GetRoute lookup runs once instead of twice (Gemini round-1 nit on PR #661)."
  • observeRead comment (line ~1017–1018): "...would double the per-read overhead when sampling is enabled (Gemini round-1 nit on PR #661)."

Per CLAUDE.md conventions: "Do not reference the current task, fix, or callers...since those belong in the PR description and rot as the codebase evolves." These references will be meaningless in six months. Replace with just the rationale:

// routeAndGroupForKey is groupForKey extended to return the RouteID so
// read entry points that need to observe into keyviz can do so without
// a second GetRoute lookup. Leadership-only callers (IsLeaderForKey /
// VerifyLeaderForKey / RaftLeaderForKey) keep using groupForKey because
// they don't need the route ID.


2. observeRead comment omits offered-load rationale — [minor]

observeMutation explicitly documents that pre-operation counting is intentional ("heatmap reflects offered load, not just committed writes"). observeRead fires before the read barrier too, but doesn't say so. A reader will wonder: "what if the read fails?" Add one line:

// Counted pre-operation so failed reads (context-cancelled barrier,
// lease expiry) still contribute to offered-load traffic — intentional,
// matching observeMutation's semantics.

Minor Observations (not blocking)

  • groupForKey duplicates 3 of 4 lines from routeAndGroupForKey: groupForKey could delegate to routeAndGroupForKey and discard the routeID. Removes duplication at the cost of one extra ignored return value — reasonable trade-off but not urgent.

  • TestShardedCoordinatorObservesLeaseAndLinearizableReads only tests the happy path: there's no test that observeRead fires even when the underlying read fails (e.g., context canceled). Not strictly required, but would strengthen the offered-load contract.


Overall: Two commits, clean separation of concerns, the performance concern was resolved before review. The only required fix is the review-tool attribution in comments. Everything else is nitpicks.

bootjp merged commit 5e4fe67 into main on Apr 26, 2026
9 checks passed
bootjp deleted the feat/keyviz-read-observe branch on April 26, 2026 at 11:09
bootjp added a commit that referenced this pull request Apr 26, 2026
## Summary

Phase 2-B of the Key Visualizer design
(`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into
the existing `web/admin/` SPA rather than building a parallel dashboard.
Doc + implementation in one PR (doc commit first, per CLAUDE.md
design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling
`/admin/api/v1/keyviz/matrix` with series picker (writes / reads /
write_bytes / read_bytes), row-budget input (clamped at 1024), and off /
5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue —
spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate,
route_count, route_ids (with truncation indicator). `start` / `end`
decode through a printable-or-hex preview so binary keys do not render
as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in
`lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208
kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and
its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 /
#646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed
status, lands with this PR. Parent design §12 phase table is split into
2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out,
open).

## Five-lens self-review

1. **Data loss** — n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; single browser tab polling a
single handler. Sampler concurrency was already covered by Phase 2-A
tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget
at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower
bound (sampler flush is 1 s).
4. **Data consistency** — SPA renders whatever the handler returns;
consistency guarantees come from the existing leader-issued counters in
the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go
build ./internal/admin/...` clean (embed glob unaffected); `go test
./internal/admin/...` clean. Manual verification documented in the
design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to
`internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot
routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget
input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — handler is currently node-local. Phase 2-C will
add a cross-node admin RPC; this PR will pick up the aggregate view
automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet
built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.
