
feat(kv): observe lease + linearizable reads via keyviz Sampler #661

Merged
bootjp merged 2 commits into main from feat/keyviz-read-observe on Apr 26, 2026
Conversation

bootjp (Owner) commented Apr 26, 2026

Summary

Wires read sampling into the shard coordinator's two key-routed read entry points so the heatmap's Reads / ReadBytes series get populated (design §5.1: "Reads are sampled by the node that actually serves the read"). Mirrors the write-side observeMutation wiring; both feed the same keyviz.Sampler the coordinator already holds.

  • New observeReadKey helper, called from LinearizableReadForKey and LeaseReadForKey after the route is resolved (call-site shape sketched below).
  • valueLen is always 0 at this layer — the consistency check doesn't fetch data; the actual MVCCStore.GetAt happens further down the stack and is sampled separately. Adapter direct-read paths (Redis / DynamoDB / S3 hitting MVCCStore.GetAt without going through the coordinator) still bypass keyviz; sampling those is a future slice — see the observeReadKey doc comment.
  • Leadership-check entries (IsLeaderForKey, VerifyLeaderForKey, RaftLeaderForKey) intentionally do NOT call observeReadKey — they're internal routing checks, not user-facing data reads.

Closes the read-side gap noted on the observeMutation Phase-2 TODO. The matrix's Reads and ReadBytes series are no longer permanently zero.
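
For orientation, the call-site shape in the two read entry points looks roughly like this (a sketch assembled from snippets quoted later in this thread; the surrounding signature, return values, and the downstream group call are assumptions, not the merged source):

func (c *ShardedCoordinator) LeaseReadForKey(ctx context.Context, key []byte) (uint64, error) {
    g, ok := c.groupForKey(key)
    if !ok {
        return 0, errors.WithStack(ErrLeaderNotFound)
    }
    // Sample the read once the route is resolved; valueLen stays 0 because
    // this layer only performs the consistency check, not the data fetch.
    c.observeReadKey(key)
    return g.LeaseRead(ctx) // downstream call is hypothetical, shown for shape only
}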

Test plan

  • TestShardedCoordinatorObservesLeaseAndLinearizableReads — drives one of each through a real raft + store, asserts exactly one Observe(routeID, OpRead, len(key), 0) per call (a sketch of the kind of recording double this implies follows the list).
  • TestShardedCoordinatorSkipsObserveForLeadershipChecks — pins the negative contract that the three leadership checks stay silent.
  • go test -race -count=1 -run TestShardedCoordinator ./kv/... clean.
  • golangci-lint run ./kv/... clean.
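
For context, the recording double these tests rely on can be as small as a mutex-guarded slice of captured calls. A sketch, assuming the Observe signature shown above (sync and keyviz imports omitted; the real recordingSampler in kv/sharded_coordinator_sampler_test.go may differ):

// observeCall captures one Observe invocation for later assertions.
type observeCall struct {
    routeID  uint64
    op       keyviz.Op
    keyLen   int
    valueLen int
}

// recordingSampler is a thread-safe keyviz.Sampler test double.
type recordingSampler struct {
    mu    sync.Mutex
    calls []observeCall
}

func (r *recordingSampler) Observe(routeID uint64, op keyviz.Op, keyLen, valueLen int) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.calls = append(r.calls, observeCall{routeID, op, keyLen, valueLen})
}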



coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 37 minutes and 48 seconds before requesting another review.

📒 Files selected for processing (2)
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_sampler_test.go

gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request implements read-side sampling for key visualization by integrating the observeReadKey helper into the LinearizableReadForKey and LeaseReadForKey methods of the ShardedCoordinator. It also includes unit tests to verify that user-facing reads are correctly recorded while internal leadership checks are ignored. Feedback was provided regarding the performance overhead of redundant routing lookups within the observeReadKey helper and the potential for typed-nil interface issues when checking the sampler.

Comment thread on kv/sharded_coordinator.go (outdated), lines +1006 to +1015:
func (c *ShardedCoordinator) observeReadKey(key []byte) {
    if c.sampler == nil {
        return
    }
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok {
        return
    }
    c.sampler.Observe(route.RouteID, keyviz.OpRead, len(key), 0)
}


Severity: medium

The observeReadKey helper introduces a redundant routing lookup on the read hot path. Since the caller (LinearizableReadForKey or LeaseReadForKey) has already performed a lookup via groupForKey, performing it again here doubles the overhead (lock acquisition and binary search) when sampling is enabled.

Additionally, the c.sampler == nil check does not account for typed-nil interface values (e.g., a *keyviz.MemSampler(nil)), which are explicitly supported and tested. For such cases, the expensive GetRoute lookup is still performed before the sampler's Observe method (which is nil-safe) returns.

While the comment mentions that this keeps the helper nil-safe, it contradicts the goal of keeping the hot path efficient. A better approach would be to refactor the routing logic to return the Route object and pass the RouteID directly to the observation helper, mirroring the efficient pattern used in observeMutation.

References
  1. For frequently accessed fields and hot paths, prioritize performance by minimizing overhead such as redundant lookups and lock acquisitions.

bootjp force-pushed the feat/keyviz-read-observe branch from 9bca535 to 254b9cf on April 26, 2026 at 10:19
bootjp (Owner, Author) commented Apr 26, 2026

@claude review


claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 7s.


Code Review


The change correctly wires read-side keyviz sampling and the test structure is solid. One structural issue merits addressing before merge, and there are a few smaller observations.


Performance: redundant GetRoute on the sampled read path

This is the same concern Gemini raised and it is real. The current code does two full GetRoute calls (each acquires e.mu.RLock() + binary-search over the route slice) per sampled read:

LinearizableReadForKey / LeaseReadForKey
  → groupForKey(key)              // 1st RLock + routeIndex
      c.engine.GetRoute(routeKey(key))  ← discards route.RouteID
  → observeReadKey(key)           // 2nd RLock + routeIndex (when c.sampler != nil)
      c.engine.GetRoute(routeKey(key))

The observeMutation path avoids this entirely because groupMutations already resolves the route while sharding mutations, and passes routeID uint64 directly:

func (c *ShardedCoordinator) observeMutation(routeID uint64, mut *pb.Mutation) {

The cleanest fix is to mirror that pattern for reads. groupForKey already holds the route internally — it just discards it:

func (c *ShardedCoordinator) groupForKey(key []byte) (*ShardGroup, bool) {
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok { return nil, false }
    g, ok := c.groups[route.GroupID]
    return g, ok   // route.RouteID thrown away here
}

Suggested refactor: add a variant that also returns routeID, or change groupForKey to return a (routeID uint64, *ShardGroup, bool) triple. Then:

// observeReadKey(routeID, keyLen) — mirrors observeMutation
func (c *ShardedCoordinator) observeReadKey(routeID uint64, keyLen int) {
    if c.sampler == nil { return }
    c.sampler.Observe(routeID, keyviz.OpRead, keyLen, 0)
}

And call sites become:

routeID, g, ok := c.groupForKeyWithRouteID(key)
if !ok { return 0, errors.WithStack(ErrLeaderNotFound) }
c.observeReadKey(routeID, len(key))

This also eliminates the unreachable !ok branch inside observeReadKey — since the caller has already confirmed the route exists, the second lookup can never fail under a stable routing table.



Unreachable !ok arm in observeReadKey

func (c *ShardedCoordinator) observeReadKey(key []byte) {
    if c.sampler == nil { return }
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok {   // ← can never fire: caller already confirmed ok=true
        return
    }
    ...
}

This is only reachable if the routing table changes between groupForKey and observeReadKey. In practice that window is nanoseconds and route changes are rare, but the branch is dead under the current call pattern and will confuse future readers. The refactor above eliminates it naturally.


Doc comment accuracy

// Caller must have already passed groupForKey(key) so the GetRoute
// lookup almost always hits in the table — keeping it inside this
// helper keeps observeReadKey nil-safe at the lookup level too

"Almost always hits" reads like a cache-hit hint. What it actually means is "this is a duplicate lookup that we justify by saying it'll succeed." The comment inadvertently papers over the double-lookup rather than explaining why it is acceptable. Once the refactor is done, this comment goes away entirely.


Typed-nil sampler (c.sampler == nil vs. typed-nil)

Gemini's concern here is valid in theory but the Sampler interface contract explicitly handles this:

// Implementations MUST be nil-receiver-safe: a typed-nil
// implementation passed through this interface ... must not panic.
// The coordinator stores the interface value as supplied and dispatches
// through it on the hot path; a guard at the call site only checks for
// an interface-nil, not a typed-nil.

So observeReadKey checking c.sampler == nil is consistent with observeMutation, and the contract is correctly documented. The concern also loses its teeth once the double GetRoute is fixed: a typed-nil sampler is not caught by the interface-nil guard, but with the route lookup removed the only remaining work is the nil-safe Observe call itself, which returns immediately.


Minor observations

LeaseRead (no-key variant) intentionally unsampled — The default-group LeaseRead at line 763 has no key context and is correctly left without observeReadKey. Worth a brief comment for the next reviewer's benefit, since it sits right above the sampled LeaseReadForKey.

RouteID: 0 in tests — engine.UpdateRoute (used in the new tests) assigns RouteID: 0. This exercises the wiring correctly but if someone changes the test's assertion to verify a non-zero RouteID they'll be surprised. The existing mutation tests have the same pattern so this is not a new problem, just worth being aware of.

Test key "hot-key" vs. route start "a""hot-key" starts with 'h' which is ≥ 'a', so it correctly falls in the [a, ∞) route. Not a bug, but if the test key ever changes to something starting with a character < 'a' (e.g. '!' or a capital letter) it would silently miss the route and observeReadKey would be a no-op, making require.Len(t, calls, 2) the only failure signal. A comment noting the dependency would prevent a future head-scratch.


Summary

The read-side wiring is conceptually correct and well-tested. The double GetRoute lookup is the only issue worth fixing before merge — it doubles hot-path lock acquisitions whenever keyviz is enabled and leaves dead code behind. The suggested refactor is small (add a groupForKeyWithRouteID variant and change observeReadKey's signature) and mirrors an already-established pattern in the file.

Round-1 review fix for PR #661 (Gemini medium):

observeReadKey was repeating the GetRoute lookup that the calling
LinearizableReadForKey / LeaseReadForKey had already done via
groupForKey — doubling the per-read routing cost when sampling
was enabled. Refactor:

- New routeAndGroupForKey returns both the route ID and the
  *ShardGroup in one engine.GetRoute pass; reads use it instead
  of groupForKey.
- observeRead now takes the precomputed routeID directly. The
  c.sampler == nil guard is the sole short-circuit; we never pay
  GetRoute for the keyviz call.
- Leadership-only entries (IsLeaderForKey, VerifyLeaderForKey,
  RaftLeaderForKey) keep using groupForKey since they don't need
  the route ID.
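
A minimal sketch of the refactored pair described above, reusing identifiers quoted elsewhere in this thread (engine.GetRoute, routeKey, c.groups, keyviz.OpRead); the merged code may differ in detail:

// routeAndGroupForKey resolves the key's route once, returning both the
// route ID (for keyviz observation) and the owning shard group.
func (c *ShardedCoordinator) routeAndGroupForKey(key []byte) (uint64, *ShardGroup, bool) {
    route, ok := c.engine.GetRoute(routeKey(key))
    if !ok {
        return 0, nil, false
    }
    g, ok := c.groups[route.GroupID]
    return route.RouteID, g, ok
}

// observeRead records a sampled read against an already-resolved route.
// The interface-nil guard is the only short-circuit; no second GetRoute.
func (c *ShardedCoordinator) observeRead(routeID uint64, keyLen int) {
    if c.sampler == nil {
        return
    }
    c.sampler.Observe(routeID, keyviz.OpRead, keyLen, 0)
}
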
bootjp (Owner, Author) commented Apr 26, 2026

@claude review


claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 26s.


Code Review


The PR is well-structured and the second commit (b8fe219) already addressed Gemini's round-1 performance nit cleanly. Here are findings across the five review lenses:

Data Loss

No issues. observeRead fires before the underlying operation (pre-barrier for LinearizableReadForKey, pre-lease-check for LeaseReadForKey), which is consistent with observeMutation's offered-load semantics. Mutations that fail their Raft proposal are still counted; reads that fail their barrier will be similarly counted. This is intentional and fine.

Concurrency / Distributed Failures

No issues. routeAndGroupForKey is a straightforward composition of engine.GetRoute + groups map lookup — same concurrency profile as groupForKey. The sampler field is set once via WithSampler before concurrent use, consistent with the pre-existing write-path pattern.

The typed-nil concern Gemini raised in round 1 is not a bug: MemSampler.Observe has an explicit if s == nil { return } guard at keyviz/sampler.go:355, so a (*keyviz.MemSampler)(nil) receiver is safe by contract. The existing TestShardedCoordinatorWithoutSamplerStaysSafe already covers this case.
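
The nil-receiver guard that contract describes is the standard Go idiom of checking the pointer receiver itself before touching any state. A toy illustration (sync import omitted; the type and its fields are hypothetical stand-ins, not the real keyviz.MemSampler):

// Op stands in for keyviz.Op in this toy sketch.
type Op int

// memSampler is a hypothetical stand-in used only to show the guard.
type memSampler struct {
    mu    sync.Mutex
    reads map[uint64]int
}

// Observe is nil-receiver-safe: a (*memSampler)(nil) stored in a Sampler
// interface still dispatches here and returns instead of panicking.
func (s *memSampler) Observe(routeID uint64, op Op, keyLen, valueLen int) {
    if s == nil {
        return
    }
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.reads == nil {
        s.reads = make(map[uint64]int)
    }
    s.reads[routeID] += keyLen + valueLen
}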

Performance

The second commit resolves the double-lookup. routeAndGroupForKey performs exactly one GetRoute + one map lookup, same as the old groupForKey. observeRead with an interface-nil sampler is a single branch and return — allocation-free. No regressions.

Data Consistency

No issues. valueLen = 0 at this layer is correct and well-documented: the coordinator's consistency check doesn't fetch data; the actual MVCCStore.GetAt happens further down the stack. The asymmetry between keyLen > 0 (actual key size) and valueLen = 0 is noted in the comment and the test asserts valueLen: 0 explicitly.

Test Coverage

Tests are solid. recordingSampler is thread-safe (mutex-guarded), table-driven where appropriate, and the negative contract (TestShardedCoordinatorSkipsObserveForLeadershipChecks) is pinned explicitly. Two issues worth addressing:


Issues

1. PR/review-tool references in production comments — [fix before merge]

kv/sharded_coordinator.go embeds "Gemini round-1 nit on PR #661" in two places:

  • routeAndGroupForKey comment (line ~843): "...so the GetRoute lookup runs once instead of twice (Gemini round-1 nit on PR #661)."
  • observeRead comment (line ~1017–1018): "...would double the per-read overhead when sampling is enabled (Gemini round-1 nit on PR #661)."

Per CLAUDE.md conventions: "Do not reference the current task, fix, or callers...since those belong in the PR description and rot as the codebase evolves." These references will be meaningless in six months. Replace with just the rationale:

// routeAndGroupForKey is groupForKey extended to return the RouteID so
// read entry points that need to observe into keyviz can do so without
// a second GetRoute lookup. Leadership-only callers (IsLeaderForKey /
// VerifyLeaderForKey / RaftLeaderForKey) keep using groupForKey because
// they don't need the route ID.


2. observeRead comment omits offered-load rationale — [minor]

observeMutation explicitly documents that pre-operation counting is intentional ("heatmap reflects offered load, not just committed writes"). observeRead fires before the read barrier too, but doesn't say so. A reader will wonder: "what if the read fails?" Add one line:

// Counted pre-operation so failed reads (context-cancelled barrier,
// lease expiry) still contribute to offered-load traffic — intentional,
// matching observeMutation's semantics.

Minor Observations (not blocking)

  • groupForKey duplicates 3 of 4 lines from routeAndGroupForKey: groupForKey could delegate to routeAndGroupForKey and discard the routeID. Removes duplication at the cost of one extra ignored return value — reasonable trade-off but not urgent.

  • TestShardedCoordinatorObservesLeaseAndLinearizableReads only tests the happy path: there's no test that observeRead fires even when the underlying read fails (e.g., context canceled). Not strictly required, but would strengthen the offered-load contract.


Overall: Two commits, clean separation of concerns, the performance concern was resolved before review. The only required fix is the review-tool attribution in comments. Everything else is nitpicks.

bootjp merged commit 5e4fe67 into main on Apr 26, 2026
9 checks passed
bootjp deleted the feat/keyviz-read-observe branch on April 26, 2026 at 11:09
bootjp added a commit that referenced this pull request Apr 26, 2026
## Summary

Phase 2-B of the Key Visualizer design
(`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into
the existing `web/admin/` SPA rather than building a parallel dashboard.
Doc + implementation in one PR (doc commit first, per CLAUDE.md
design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling
`/admin/api/v1/keyviz/matrix` with series picker (writes / reads /
write_bytes / read_bytes), row-budget input (clamped at 1024), and off /
5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue —
spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate,
route_count, route_ids (with truncation indicator). `start` / `end`
decode through a printable-or-hex preview so binary keys do not render
as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in
`lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208
kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and
its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 /
#646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed
status, lands with this PR. Parent design §12 phase table is split into
2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out,
open).

## Five-lens self-review

1. **Data loss** — n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; single browser tab polling a
single handler. Sampler concurrency was already covered by Phase 2-A
tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget
at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower
bound (sampler flush is 1 s).
4. **Data consistency** — SPA renders whatever the handler returns;
consistency guarantees come from the existing leader-issued counters in
the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go
build ./internal/admin/...` clean (embed glob unaffected); `go test
./internal/admin/...` clean. Manual verification documented in the
design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to
`internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot
routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget
input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — handler is currently node-local. Phase 2-C will
add a cross-node admin RPC; this PR will pick up the aggregate view
automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet
built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.
