obs(redis-lua): per-cmd fast-path outcome counter #571

Merged
bootjp merged 4 commits into main from feat/lua-fastpath-metrics
Apr 21, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 21, 2026

Summary

  • Add an elastickv_lua_cmd_fastpath_total{cmd,outcome} counter that records, per command, how often a redis.call() inside a Lua script takes the fast path versus falling back. Outcome values: hit / skip_loaded / skip_cached_type / fallback.
  • Wire LuaFastPathObserver through Registry → RedisServer option → luaScriptContext.observeFastPathOutcome.
  • Instrument cmdZRangeByScore as the first caller.

Motivation

Post-deploy CPU profile on a BullMQ-style workload shows cmdZRangeByScoreSlow consuming ~14x the CPU of zsetRangeByScoreFast (860ms vs 60ms in a 30 s profile), despite the fast path being eligible on paper. Without per-outcome instrumentation it is not possible to tell whether scripts are hitting luaZSetAlreadyLoaded (a prior ZCARD/ZCOUNT on the same key), cachedType (SET/DEL earlier in the script), or the server-side hit=false fallback (legacy-blob zset, truncation, large-offset short-circuit).

Shipping this counter first so we can decide whether the follow-up should loosen the luaZSetAlreadyLoaded guard, or target a different cause.
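
The four outcomes can be read as a small decision ladder. The sketch below is illustrative only: `classifyOutcome` and its three boolean parameters are hypothetical stand-ins for the luaZSetAlreadyLoaded guard, the cachedType guard, and the server-side hit flag, and the order in which the guards are checked is an assumption rather than something taken from the PR.

```go
package main

import "fmt"

// Outcome label values as listed in the summary above.
const (
	outcomeHit            = "hit"
	outcomeSkipLoaded     = "skip_loaded"
	outcomeSkipCachedType = "skip_cached_type"
	outcomeFallback       = "fallback"
)

// classifyOutcome is a hypothetical helper, not code from the PR.
// alreadyLoaded models the luaZSetAlreadyLoaded guard, hasCachedType
// models the cachedType guard, and serverHit models the hit flag
// returned by the server-side fast path. The guard order is assumed.
func classifyOutcome(alreadyLoaded, hasCachedType, serverHit bool) string {
	switch {
	case alreadyLoaded:
		return outcomeSkipLoaded
	case hasCachedType:
		return outcomeSkipCachedType
	case serverHit:
		return outcomeHit
	default:
		return outcomeFallback
	}
}

func main() {
	fmt.Println(classifyOutcome(true, false, false))  // skip_loaded
	fmt.Println(classifyOutcome(false, true, false))  // skip_cached_type
	fmt.Println(classifyOutcome(false, false, true))  // hit
	fmt.Println(classifyOutcome(false, false, false)) // fallback
}
```

Counting each branch separately is what lets the metric distinguish a Lua-layer skip from a server-side rejection.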

Test plan

  • go test ./adapter/... ./monitoring/... green
  • New TestLuaFastPathObserver* in monitoring/hotpath_test.go
  • Zero-value observer safe (already used by tests that do not wire a registry)
  • Deploy to prod, watch sum by (outcome) (rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m])) for ~5 minutes

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics to monitor Lua script fast-path execution performance, tracking per-command outcomes including successful hits, skipped operations from cached data, and fallback scenarios.
  • Tests

    • Added tests validating Lua fast-path metrics accuracy and zero-value safety.

Instrument redis.call() fast-path decisions inside Lua scripts so
operators can see how often a given command actually takes the fast
path vs falls back. Triggered by a production profile showing
cmdZRangeByScoreSlow consuming ~14x the CPU of zsetRangeByScoreFast
on a BullMQ-style workload, despite the fast path being eligible
on paper.

- Add elastickv_lua_cmd_fastpath_total{cmd,outcome} counter, with
  outcome = hit | skip_loaded | skip_cached_type | fallback.
- Wire LuaFastPathObserver through Registry, RedisServer option, and
  luaScriptContext.observeFastPathOutcome.
- Instrument cmdZRangeByScore as the first caller; other fast-path
  commands (ZSCORE, HGET, HEXISTS, SISMEMBER) will follow once this
  metric validates the instrumentation shape.
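
The wiring described above follows the functional-option pattern. A minimal sketch of that shape, using only the standard library: `fastPathObserver`, `redisServer`, `withLuaFastPathObserver`, and `newRedisServer` are hypothetical lower-case stand-ins for the PR's types, and the `sink` callback replaces the real Prometheus counter.

```go
package main

import "fmt"

// fastPathObserver is a hypothetical stand-in for
// monitoring.LuaFastPathObserver. Its zero value is usable: with no
// sink attached, observations are silently dropped.
type fastPathObserver struct {
	sink func(cmd, outcome string)
}

func (o fastPathObserver) observe(cmd, outcome string) {
	if o.sink == nil {
		return
	}
	o.sink(cmd, outcome)
}

// redisServer is a stand-in for the adapter's server type.
type redisServer struct {
	luaFastPathObserver fastPathObserver
}

type serverOption func(*redisServer)

// withLuaFastPathObserver mirrors the option named in the PR.
func withLuaFastPathObserver(o fastPathObserver) serverOption {
	return func(s *redisServer) { s.luaFastPathObserver = o }
}

func newRedisServer(opts ...serverOption) *redisServer {
	s := &redisServer{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	seen := map[string]int{}
	obs := fastPathObserver{sink: func(cmd, outcome string) { seen[cmd+"/"+outcome]++ }}

	wired := newRedisServer(withLuaFastPathObserver(obs))
	wired.luaFastPathObserver.observe("zrangebyscore", "hit")

	unwired := newRedisServer() // zero-value observer: the call is a no-op
	unwired.luaFastPathObserver.observe("zrangebyscore", "hit")

	fmt.Println(seen["zrangebyscore/hit"]) // 1
}
```

Because the option is optional, servers built without it keep a zero-value observer, which is why tests that do not wire a registry need no changes.
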
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

The PR adds Lua fast-path monitoring instrumentation for ZRANGEBYSCORE operations. It introduces a new LuaFastPathObserver type to track outcomes (hit, skip_loaded, skip_cached_type, fallback), integrates the observer into the Redis server adapter, instruments the cmdZRangeByScore function with metric calls, and provides test coverage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Redis Adapter Configuration: adapter/redis.go, adapter/redis_lua_context.go | Adds luaFastPathObserver and luaFastPathZRange fields to RedisServer, introduces WithLuaFastPathObserver option function, and instruments cmdZRangeByScore with four observer outcome metrics (ObserveHit, ObserveSkipLoaded, ObserveSkipCachedType, ObserveFallback). |
| Monitoring Infrastructure: monitoring/hotpath.go, monitoring/hotpath_test.go, monitoring/registry.go | Defines new LuaFastPathObserver and LuaFastPathCmd types with outcome label constants, adds Prometheus counter metric luaFastPathTotal, provides ForCommand method for per-command metric resolution, implements zero-safe observer methods, and adds unit tests for observer behavior and counter emission. |
| Server Integration: main.go | Wires LuaFastPathObserver into Redis server initialization via metricsRegistry.LuaFastPathObserver(). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 Fast paths shimmer in the Lua night,
Each hop tracked with metrics bright,
Hit or skip, cached type's call—
Observers count them all! 🎯✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'obs(redis-lua): per-cmd fast-path outcome counter' accurately and concisely summarizes the main change: adding per-command Lua fast-path outcome metrics instrumentation via a Prometheus counter. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a monitoring framework to track fast-path execution outcomes for Redis commands within Lua scripts, starting with ZRANGEBYSCORE. It adds a new LuaFastPathObserver, defines Prometheus metrics for various outcomes like hits and fallbacks, and integrates these into the Redis adapter and monitoring registry. Feedback was provided regarding the performance impact of using Prometheus label lookups on hot paths, suggesting the use of atomic operations or pre-resolved counters to minimize mutex contention.

Comment thread adapter/redis_lua_context.go Outdated
Comment on lines +2466 to +2471
func (c *luaScriptContext) observeFastPathOutcome(cmd, outcome string) {
	if c.server == nil {
		return
	}
	c.server.luaFastPathObserver.ObserveLuaFastPath(cmd, outcome)
}
Contributor

medium

The observeFastPathOutcome method is called on every ZRANGEBYSCORE execution within a Lua script. Frequent map lookups in Prometheus (WithLabelValues) involve internal mutexes and can introduce overhead on hot paths. Following repository guidelines for hot paths, consider using atomic operations or pre-resolving the counter to avoid mutex contention and improve performance.

References
  1. For frequently accessed fields that require monotonic updates, prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.

Address gemini review on PR #571: CounterVec.WithLabelValues takes a
mutex-guarded map lookup, which runs inside every redis.call() on
the ZRANGEBYSCORE fast path. Resolve the four outcome counters once
at server construction (WithLuaFastPathObserver) and call the
pre-resolved handles from the Lua context so the hot path becomes a
single atomic Inc per call.

- LuaFastPathObserver.ForCommand(cmd) returns a LuaFastPathCmd
  bundle with pre-resolved Counter handles for hit / skip_loaded /
  skip_cached_type / fallback.
- RedisServer stores the ZRANGEBYSCORE bundle and calls
  ObserveHit/etc. directly, no map lookups on the hot path.
- Zero-value LuaFastPathCmd still safe: each ObserveXxx checks the
  handle and no-ops when nil (tests without a wired registry).
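
A minimal sketch of the pre-resolution idea, using only the standard library. `counterVec` is a hypothetical imitation of a labelled counter family (mutex-guarded map), `*atomic.Uint64` stands in for prometheus.Counter, and `cmdCounters` stands in for LuaFastPathCmd; none of these names come from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counterVec imitates a labelled counter family: resolving a child by
// label values takes a mutex and a map lookup.
type counterVec struct {
	mu       sync.Mutex
	counters map[string]*atomic.Uint64
}

func newCounterVec() *counterVec {
	return &counterVec{counters: make(map[string]*atomic.Uint64)}
}

func (v *counterVec) withLabelValues(cmd, outcome string) *atomic.Uint64 {
	key := cmd + "\x00" + outcome
	v.mu.Lock()
	defer v.mu.Unlock()
	c, ok := v.counters[key]
	if !ok {
		c = new(atomic.Uint64)
		v.counters[key] = c
	}
	return c
}

// cmdCounters holds handles resolved once, so the hot path never
// touches the mutex or the map. The zero value is safe to use.
type cmdCounters struct {
	hit, skipLoaded, skipCachedType, fallback *atomic.Uint64
}

func (v *counterVec) forCommand(cmd string) cmdCounters {
	return cmdCounters{
		hit:            v.withLabelValues(cmd, "hit"),
		skipLoaded:     v.withLabelValues(cmd, "skip_loaded"),
		skipCachedType: v.withLabelValues(cmd, "skip_cached_type"),
		fallback:       v.withLabelValues(cmd, "fallback"),
	}
}

// inc no-ops on a nil handle, which keeps unwired bundles safe.
func inc(c *atomic.Uint64) {
	if c != nil {
		c.Add(1)
	}
}

func (c cmdCounters) observeHit()            { inc(c.hit) }
func (c cmdCounters) observeSkipLoaded()     { inc(c.skipLoaded) }
func (c cmdCounters) observeSkipCachedType() { inc(c.skipCachedType) }
func (c cmdCounters) observeFallback()       { inc(c.fallback) }

func main() {
	vec := newCounterVec()
	zrange := vec.forCommand("zrangebyscore") // resolved once, at construction

	zrange.observeHit()
	zrange.observeFallback()
	zrange.observeFallback()

	var unwired cmdCounters // zero value: every call is a no-op
	unwired.observeHit()

	fmt.Println(zrange.hit.Load(), zrange.fallback.Load()) // 1 2
}
```

The trade-off is a slightly larger struct on the server in exchange for a hot path that performs one nil check and one atomic add.
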
@bootjp
Owner Author

bootjp commented Apr 21, 2026

Addressed gemini's review (1cd79de):

  • Medium — WithLabelValues mutex on hot path: the observer API now pre-resolves counter handles at server construction via LuaFastPathObserver.ForCommand(cmd), which returns a LuaFastPathCmd bundle with four resolved prometheus.Counter handles. cmdZRangeByScore calls .ObserveHit() / .ObserveSkipAlreadyLoaded() / .ObserveSkipCachedType() / .ObserveFallback() directly — each is a single atomic Inc, no map lookups on the redis.call() path.

Zero-value LuaFastPathCmd remains safe (each ObserveXxx nil-checks the handle), so tests that do not wire a registry continue to work without change.

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a LuaFastPathObserver to monitor the performance and hit ratios of Redis command fast paths within Lua scripts, specifically targeting ZRANGEBYSCORE and ZREVRANGEBYSCORE. The implementation includes new monitoring structures, pre-resolved Prometheus counters for hot-path efficiency, and integration within the Redis adapter. Review feedback focuses on correcting a documentation inaccuracy regarding method names and improving naming consistency between internal fields and metric labels.

Comment thread monitoring/hotpath.go Outdated
// inside Lua scripts. The zero value is safe and silently drops
// samples so tests can pass LuaFastPathObserver{} as a stub.
//
// Hot-path shape: each ObserveLuaFastPath-on-handle call is a single
Contributor

medium

This comment is inaccurate as it refers to a non-existent method ObserveLuaFastPath-on-handle. The correct observer methods on LuaFastPathCmd are named ObserveHit, ObserveSkip, etc. Updating this improves code clarity by removing references to non-existent paths. Additionally, ensure these methods are non-blocking as they are on the hot path; if any blocking behavior is intended, it must be explicitly documented in the interface description.

Suggested change
- // Hot-path shape: each ObserveLuaFastPath-on-handle call is a single
+ // Hot-path shape: each Observe* call on a LuaFastPathCmd handle is a single
References
  1. Removing references to non-existent code paths improves clarity. Interface methods intended to block must have their blocking semantics explicitly documented to avoid ambiguity.

Comment thread monitoring/hotpath.go Outdated
// Observe* methods per redis.call(). Safe to copy.
type LuaFastPathCmd struct {
	hit             prometheus.Counter
	skipAlreadyLoad prometheus.Counter
Contributor

medium

The name skipAlreadyLoad is verbose and inconsistent with the metric label "skip_loaded". Consider renaming the field to skipLoaded and updating the associated constants and methods for consistency. Note that using prometheus.Counter for these monotonic updates on a hot path is preferred over mutexes as it leverages atomic operations, which aligns with performance best practices for frequently accessed fields.

Suggested change
- skipAlreadyLoad prometheus.Counter
+ skipLoaded prometheus.Counter
References
  1. For frequently accessed fields requiring monotonic updates on hot paths, atomic operations are preferred over mutexes to improve performance.

Address gemini review on PR #571:

- Fix the doc comment on LuaFastPathObserver that referenced a
  non-existent "ObserveLuaFastPath-on-handle" method; describe the
  actual Observe* call shape and note the counters are non-blocking
  atomic Inc (client_golang's default Counter uses sync/atomic).
- Rename the skipAlreadyLoad field, LuaFastPathOutcomeSkipAlreadyLoad
  constant, and ObserveSkipAlreadyLoaded method to skipLoaded /
  LuaFastPathOutcomeSkipLoaded / ObserveSkipLoaded so every layer
  agrees with the metric label value "skip_loaded".
@bootjp
Owner Author

bootjp commented Apr 21, 2026

Addressed gemini's review round 2 (e75442e):

  • Fixed the doc comment on LuaFastPathObserver that referenced a non-existent ObserveLuaFastPath-on-handle method. Now describes the actual Observe* call shape on LuaFastPathCmd and notes the counters are non-blocking atomic Inc (client_golang's default Counter uses sync/atomic internally).
  • Renamed the field skipAlreadyLoad / constant LuaFastPathOutcomeSkipAlreadyLoad / method ObserveSkipAlreadyLoaded to skipLoaded / LuaFastPathOutcomeSkipLoaded / ObserveSkipLoaded so every layer agrees with the metric label value skip_loaded.

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a monitoring infrastructure to track Lua fast-path outcomes for Redis commands, specifically implementing it for ZRANGEBYSCORE and ZREVRANGEBYSCORE. It adds a LuaFastPathObserver to the monitoring package, integrates it into the RedisServer adapter, and includes unit tests to ensure correct metric reporting and zero-value safety. I have no feedback to provide.


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
adapter/redis.go (1)

269-346: ⚠️ Potential issue | 🟡 Minor

Keep the cmd label command-specific.

cmdZRevRangeByScore also reaches the Lua fast-path code, but this single bundle records it as cmd="zrangebyscore". That makes the new “per-command” metric conflate ZRANGEBYSCORE and ZREVRANGEBYSCORE, which can skew the production validation query.

Suggested direction
-	// luaFastPathZRange is the pre-resolved counter bundle for the
-	// ZRANGEBYSCORE / ZREVRANGEBYSCORE Lua fast path. Resolved once in
+	// luaFastPathZRangeByScore and luaFastPathZRevRangeByScore are
+	// pre-resolved counter bundles for the Lua fast path. Resolved once in
 	// WithLuaFastPathObserver so the hot path does not pay for
 	// CounterVec.WithLabelValues on every redis.call().
-	luaFastPathZRange monitoring.LuaFastPathCmd
+	luaFastPathZRangeByScore    monitoring.LuaFastPathCmd
+	luaFastPathZRevRangeByScore monitoring.LuaFastPathCmd
...
 	return func(r *RedisServer) {
 		r.luaFastPathObserver = observer
-		r.luaFastPathZRange = observer.ForCommand(luaFastPathCmdZRangeByScore)
+		r.luaFastPathZRangeByScore = observer.ForCommand(luaFastPathCmdZRangeByScore)
+		r.luaFastPathZRevRangeByScore = observer.ForCommand(luaFastPathCmdZRevRangeByScore)
 	}
 }
 
-// luaFastPathCmdZRangeByScore is the shared label for ZRANGEBYSCORE
-// and ZREVRANGEBYSCORE fast-path outcomes. Both directions take the
-// same branch through zsetRangeByScoreFast so sharing one label
-// keeps the counter cardinality bounded.
-const luaFastPathCmdZRangeByScore = "zrangebyscore"
+const (
+	luaFastPathCmdZRangeByScore    = "zrangebyscore"
+	luaFastPathCmdZRevRangeByScore = "zrevrangebyscore"
+)

Then select the appropriate bundle in cmdZRangeByScore based on reverse.
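
One way to read that last step, sketched with the standard library only: keep both bundles pre-resolved and pick one with a plain branch on `reverse`, so the hot path still avoids a label lookup. `cmdBundle`, `server`, and `fastPathBundle` are hypothetical names, and the thread does not show whether the merged code adopted this shape.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const (
	cmdZRangeByScore    = "zrangebyscore"
	cmdZRevRangeByScore = "zrevrangebyscore"
)

// cmdBundle is a hypothetical stand-in for monitoring.LuaFastPathCmd,
// reduced to a single counter to keep the sketch short.
type cmdBundle struct {
	cmd string
	hit *atomic.Uint64
}

func newBundle(cmd string) cmdBundle {
	return cmdBundle{cmd: cmd, hit: new(atomic.Uint64)}
}

func (b cmdBundle) observeHit() {
	if b.hit != nil {
		b.hit.Add(1)
	}
}

// server holds one pre-resolved bundle per command label.
type server struct {
	zRangeByScore    cmdBundle
	zRevRangeByScore cmdBundle
}

// fastPathBundle selects the bundle for the direction being served.
func (s *server) fastPathBundle(reverse bool) cmdBundle {
	if reverse {
		return s.zRevRangeByScore
	}
	return s.zRangeByScore
}

func main() {
	s := &server{
		zRangeByScore:    newBundle(cmdZRangeByScore),
		zRevRangeByScore: newBundle(cmdZRevRangeByScore),
	}
	s.fastPathBundle(false).observeHit()
	s.fastPathBundle(true).observeHit()
	s.fastPathBundle(true).observeHit()
	fmt.Println(s.zRangeByScore.hit.Load(), s.zRevRangeByScore.hit.Load()) // 1 2
}
```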

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/redis.go` around lines 269 - 346, The fast-path counter currently
conflates ZRANGEBYSCORE and ZREVRANGEBYSCORE by always resolving
r.luaFastPathZRange to luaFastPathCmdZRangeByScore in WithLuaFastPathObserver;
add a distinct constant (e.g. luaFastPathCmdZRevRangeByScore) and stop using one
pre-resolved bundle for both. In practice, remove/stop relying on the single
luaFastPathZRange field and instead select the correct counter from
r.luaFastPathObserver inside cmdZRangeByScore based on the reverse boolean (use
luaFastPathCmdZRevRangeByScore when reverse is true and
luaFastPathCmdZRangeByScore otherwise), ensuring each command has its own label.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between f1dcc14 and 138cb73.

📒 Files selected for processing (6)
  • adapter/redis.go
  • adapter/redis_lua_context.go
  • main.go
  • monitoring/hotpath.go
  • monitoring/hotpath_test.go
  • monitoring/registry.go

@bootjp bootjp merged commit e067ec5 into main Apr 21, 2026
8 checks passed
@bootjp bootjp deleted the feat/lua-fastpath-metrics branch April 21, 2026 14:22
bootjp added a commit that referenced this pull request Apr 21, 2026
Production metric from PR #571 on n1 (leader) after 15 min of BullMQ
traffic:
    fallback         1463  (96.4%)
    hit                54  (3.6%)
    skip_loaded         0
    skip_cached_type    0

The Lua-layer guards (luaZSetAlreadyLoaded / cachedType) are not the
issue; the fast path is being rejected server-side. The single
"fallback" bucket is too coarse to tell whether that is the
legacy-blob eligibility check, the truncation guard, the large-offset
short-circuit, the empty-key missing-zset path, or the
string-priority wrong-type guard.

Subdivide the outcome label and thread the reason back from
zsetRangeByScoreFast:

  - fallback_ineligible   (legacy-blob zset exists)
  - fallback_large_offset (offset >= maxWideScanLimit short-circuit)
  - fallback_truncated    (scan saturated)
  - fallback_missing_key  (zsetRangeEmptyFastResult: zsetExists=false)
  - fallback_wrong_type   (hasHigherPriorityStringEncoding=true)
  - fallback_other        (error / fallback we didn't classify)

All counters are pre-resolved at ForCommand() time (same hot-path
shape as #571) so routing by reason is still a single atomic Inc per
call. ObserveFallback(reason) dispatches via a small switch.

The hypothesis from the metric pattern is that most fallbacks are
"fallback_missing_key" (BullMQ polling empty delayed queues). Ship
this first, measure in prod, then write the fix accordingly.
bootjp added a commit that referenced this pull request Apr 21, 2026
…572)

## Summary

Follow-up to PR #571. The counter it added showed (on n1 / leader, 15
min of BullMQ traffic):

```
fallback         1463  (96.4%)
hit                54  (3.6%)
skip_loaded         0
skip_cached_type    0
```

The Lua-layer guards are not the issue; the fast path is being rejected
server-side. The current single `fallback` bucket is too coarse to tell
*why*: legacy-blob eligibility, large-offset short-circuit, truncation,
empty-key missing-zset, or wrong-type guard.

This PR subdivides the `fallback` outcome label so the real cause is
visible in 5 minutes of prod scrapes.

## Changes

- New outcome labels (replacing single `fallback`):
  - `fallback_ineligible`: `zsetFastPathEligible` rejected (legacy-blob zset)
  - `fallback_large_offset`: `offset >= maxWideScanLimit` short-circuit
  - `fallback_truncated`: scan hit `scanLimit` with request unsatisfied
  - `fallback_missing_key`: `zsetRangeEmptyFastResult` saw `zsetExists=false`
  - `fallback_wrong_type`: `hasHigherPriorityStringEncoding=true`
  - `fallback_other`: error / unclassified
- `zsetRangeByScoreFast` grows a `reason string` return (empty when
`hit=true`); all `hit=false` sites populate a specific reason.
- `zsetRangeEmptyFastResult` now returns `(hit, reason, err)`; the old
`[]redisZSetEntry` return was always `nil`.
- `LuaFastPathCmd` pre-resolves 6 new counters (same hot-path shape as
#571: one atomic `Inc` per redis.call()).
- `ObserveFallback(reason)` dispatches via a small switch; unknown
reasons land on `fallback_other`.
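
The reason dispatch can be sketched as follows. This is an assumption-laden illustration using only the standard library: `fallbackCounters` and `observeFallback` are hypothetical lower-case stand-ins, and `atomic.Uint64` fields replace the pre-resolved Prometheus counters.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Reason strings match the outcome labels listed above.
const (
	reasonIneligible  = "fallback_ineligible"
	reasonLargeOffset = "fallback_large_offset"
	reasonTruncated   = "fallback_truncated"
	reasonMissingKey  = "fallback_missing_key"
	reasonWrongType   = "fallback_wrong_type"
	reasonOther       = "fallback_other"
)

// fallbackCounters holds one counter per fallback reason.
type fallbackCounters struct {
	ineligible, largeOffset, truncated, missingKey, wrongType, other atomic.Uint64
}

// observeFallback routes a reason to its counter. Anything not
// recognised, including reasonOther itself, lands on the "other" bucket.
func (c *fallbackCounters) observeFallback(reason string) {
	switch reason {
	case reasonIneligible:
		c.ineligible.Add(1)
	case reasonLargeOffset:
		c.largeOffset.Add(1)
	case reasonTruncated:
		c.truncated.Add(1)
	case reasonMissingKey:
		c.missingKey.Add(1)
	case reasonWrongType:
		c.wrongType.Add(1)
	default:
		c.other.Add(1)
	}
}

func main() {
	var c fallbackCounters
	c.observeFallback(reasonMissingKey)
	c.observeFallback(reasonMissingKey)
	c.observeFallback("not_a_known_reason")
	c.observeFallback(reasonOther)
	fmt.Println(c.missingKey.Load(), c.other.Load()) // 2 2
}
```

Routing unknown strings to a catch-all keeps label cardinality fixed even if a new reason is returned before the switch is updated.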

## Motivation

Pre-deploy hypothesis: most fallbacks are `fallback_missing_key` (BullMQ
polling empty delayed queues). Ship this first, measure in prod, then
write the actual fix accordingly.

## Test plan

- [x] `go test -race -short ./adapter/...` (53s) green
- [x] `go test -race ./monitoring/...` green
- [x] `TestLuaFastPathObserverCountsByCmdAndOutcome` updated for new
labels + unknown-reason dispatch
- [ ] Deploy to prod; watch `sum by (outcome)
(rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m]))`


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Improved query performance monitoring with categorized fallback
reasons for sorted set operations, providing enhanced visibility into
query optimization effectiveness and performance diagnostics.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->