obs(redis-lua): per-cmd fast-path outcome counter#571
Conversation
Instrument redis.call() fast-path decisions inside Lua scripts so
operators can see how often a given command actually takes the fast
path vs falls back. Triggered by a production profile showing
cmdZRangeByScoreSlow consuming ~14x the CPU of zsetRangeByScoreFast
on a BullMQ-style workload, despite the fast path being eligible
on paper.
- Add elastickv_lua_cmd_fastpath_total{cmd,outcome} counter, with
outcome = hit | skip_loaded | skip_cached_type | fallback.
- Wire LuaFastPathObserver through Registry, RedisServer option, and
luaScriptContext.observeFastPathOutcome.
- Instrument cmdZRangeByScore as the first caller; other fast-path
commands (ZSCORE, HGET, HEXISTS, SISMEMBER) will follow once this
metric validates the instrumentation shape.
📝 WalkthroughWalkthroughThe PR adds Lua fast-path monitoring instrumentation for ZRANGEBYSCORE operations. It introduces a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a monitoring framework to track fast-path execution outcomes for Redis commands within Lua scripts, starting with ZRANGEBYSCORE. It adds a new LuaFastPathObserver, defines Prometheus metrics for various outcomes like hits and fallbacks, and integrates these into the Redis adapter and monitoring registry. Feedback was provided regarding the performance impact of using Prometheus label lookups on hot paths, suggesting the use of atomic operations or pre-resolved counters to minimize mutex contention.
| func (c *luaScriptContext) observeFastPathOutcome(cmd, outcome string) { | ||
| if c.server == nil { | ||
| return | ||
| } | ||
| c.server.luaFastPathObserver.ObserveLuaFastPath(cmd, outcome) | ||
| } |
There was a problem hiding this comment.
The observeFastPathOutcome method is called on every ZRANGEBYSCORE execution within a Lua script. Frequent map lookups in Prometheus (WithLabelValues) involve internal mutexes and can introduce overhead on hot paths. Following repository guidelines for hot paths, consider using atomic operations or pre-resolving the counter to avoid mutex contention and improve performance.
References
- For frequently accessed fields that require monotonic updates, prefer atomic operations (e.g., CAS) over mutexes to improve performance on hot paths.
Address gemini review on PR #571: CounterVec.WithLabelValues takes a mutex-guarded map lookup, which runs inside every redis.call() on the ZRANGEBYSCORE fast path. Resolve the four outcome counters once at server construction (WithLuaFastPathObserver) and call the pre-resolved handles from the Lua context so the hot path becomes a single atomic Inc per call. - LuaFastPathObserver.ForCommand(cmd) returns a LuaFastPathCmd bundle with pre-resolved Counter handles for hit / skip_loaded / skip_cached_type / fallback. - RedisServer stores the ZRANGEBYSCORE bundle and calls ObserveHit/etc. directly, no map lookups on the hot path. - Zero-value LuaFastPathCmd still safe: each ObserveXxx checks the handle and no-ops when nil (tests without a wired registry).
|
Addressed gemini's review (1cd79de):
Zero-value /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a LuaFastPathObserver to monitor the performance and hit ratios of Redis command fast paths within Lua scripts, specifically targeting ZRANGEBYSCORE and ZREVRANGEBYSCORE. The implementation includes new monitoring structures, pre-resolved Prometheus counters for hot-path efficiency, and integration within the Redis adapter. Review feedback focuses on correcting a documentation inaccuracy regarding method names and improving naming consistency between internal fields and metric labels.
| // inside Lua scripts. The zero value is safe and silently drops | ||
| // samples so tests can pass LuaFastPathObserver{} as a stub. | ||
| // | ||
| // Hot-path shape: each ObserveLuaFastPath-on-handle call is a single |
There was a problem hiding this comment.
This comment is inaccurate as it refers to a non-existent method ObserveLuaFastPath-on-handle. The correct observer methods on LuaFastPathCmd are named ObserveHit, ObserveSkip, etc. Updating this improves code clarity by removing references to non-existent paths. Additionally, ensure these methods are non-blocking as they are on the hot path; if any blocking behavior is intended, it must be explicitly documented in the interface description.
| // Hot-path shape: each ObserveLuaFastPath-on-handle call is a single | |
| // Hot-path shape: each Observe* call on a LuaFastPathCmd handle is a single |
References
- Removing references to non-existent code paths improves clarity. Interface methods intended to block must have their blocking semantics explicitly documented to avoid ambiguity.
| // Observe* methods per redis.call(). Safe to copy. | ||
| type LuaFastPathCmd struct { | ||
| hit prometheus.Counter | ||
| skipAlreadyLoad prometheus.Counter |
There was a problem hiding this comment.
The name skipAlreadyLoad is verbose and inconsistent with the metric label "skip_loaded". Consider renaming the field to skipLoaded and updating the associated constants and methods for consistency. Note that using prometheus.Counter for these monotonic updates on a hot path is preferred over mutexes as it leverages atomic operations, which aligns with performance best practices for frequently accessed fields.
| skipAlreadyLoad prometheus.Counter | |
| skipLoaded prometheus.Counter |
References
- For frequently accessed fields requiring monotonic updates on hot paths, atomic operations are preferred over mutexes to improve performance.
Address gemini review on PR #571: - Fix the doc comment on LuaFastPathObserver that referenced a non-existent "ObserveLuaFastPath-on-handle" method; describe the actual Observe* call shape and note the counters are non-blocking atomic Inc (client_golang's default Counter uses sync/atomic). - Rename the skipAlreadyLoad field, LuaFastPathOutcomeSkipAlreadyLoad constant, and ObserveSkipAlreadyLoaded method to skipLoaded / LuaFastPathOutcomeSkipLoaded / ObserveSkipLoaded so every layer agrees with the metric label value "skip_loaded".
|
Addressed gemini's review round 2 (e75442e):
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a monitoring infrastructure to track Lua fast-path outcomes for Redis commands, specifically implementing it for ZRANGEBYSCORE and ZREVRANGEBYSCORE. It adds a LuaFastPathObserver to the monitoring package, integrates it into the RedisServer adapter, and includes unit tests to ensure correct metric reporting and zero-value safety. I have no feedback to provide.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
adapter/redis.go (1)
269-346:⚠️ Potential issue | 🟡 MinorKeep the
cmdlabel command-specific.
cmdZRevRangeByScorealso reaches the Lua fast-path code, but this single bundle records it ascmd="zrangebyscore". That makes the new “per-command” metric conflateZRANGEBYSCOREandZREVRANGEBYSCORE, which can skew the production validation query.Suggested direction
- // luaFastPathZRange is the pre-resolved counter bundle for the - // ZRANGEBYSCORE / ZREVRANGEBYSCORE Lua fast path. Resolved once in + // luaFastPathZRangeByScore and luaFastPathZRevRangeByScore are + // pre-resolved counter bundles for the Lua fast path. Resolved once in // WithLuaFastPathObserver so the hot path does not pay for // CounterVec.WithLabelValues on every redis.call(). - luaFastPathZRange monitoring.LuaFastPathCmd + luaFastPathZRangeByScore monitoring.LuaFastPathCmd + luaFastPathZRevRangeByScore monitoring.LuaFastPathCmd ... return func(r *RedisServer) { r.luaFastPathObserver = observer - r.luaFastPathZRange = observer.ForCommand(luaFastPathCmdZRangeByScore) + r.luaFastPathZRangeByScore = observer.ForCommand(luaFastPathCmdZRangeByScore) + r.luaFastPathZRevRangeByScore = observer.ForCommand(luaFastPathCmdZRevRangeByScore) } } -// luaFastPathCmdZRangeByScore is the shared label for ZRANGEBYSCORE -// and ZREVRANGEBYSCORE fast-path outcomes. Both directions take the -// same branch through zsetRangeByScoreFast so sharing one label -// keeps the counter cardinality bounded. -const luaFastPathCmdZRangeByScore = "zrangebyscore" +const ( + luaFastPathCmdZRangeByScore = "zrangebyscore" + luaFastPathCmdZRevRangeByScore = "zrevrangebyscore" +)Then select the appropriate bundle in
cmdZRangeByScorebased onreverse.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/redis.go` around lines 269 - 346, The fast-path counter currently conflates ZRANGEBYSCORE and ZREVRANGEBYSCORE by always resolving r.luaFastPathZRange to luaFastPathCmdZRangeByScore in WithLuaFastPathObserver; add a distinct constant (e.g. luaFastPathCmdZRevRangeByScore) and stop using one pre-resolved bundle for both. In practice, remove/stop relying on the single luaFastPathZRange field and instead select the correct counter from r.luaFastPathObserver inside cmdZRangeByScore based on the reverse boolean (call r.luaFastPathObserver.ForCommand(reverse ? luaFastPathCmdZRevRangeByScore : luaFastPathCmdZRangeByScore)), ensuring each command has its own label.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@adapter/redis.go`:
- Around line 269-346: The fast-path counter currently conflates ZRANGEBYSCORE
and ZREVRANGEBYSCORE by always resolving r.luaFastPathZRange to
luaFastPathCmdZRangeByScore in WithLuaFastPathObserver; add a distinct constant
(e.g. luaFastPathCmdZRevRangeByScore) and stop using one pre-resolved bundle for
both. In practice, remove/stop relying on the single luaFastPathZRange field and
instead select the correct counter from r.luaFastPathObserver inside
cmdZRangeByScore based on the reverse boolean (call
r.luaFastPathObserver.ForCommand(reverse ? luaFastPathCmdZRevRangeByScore :
luaFastPathCmdZRangeByScore)), ensuring each command has its own label.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 20cd12ec-978e-4359-a4ed-9967158c078f
📒 Files selected for processing (6)
adapter/redis.goadapter/redis_lua_context.gomain.gomonitoring/hotpath.gomonitoring/hotpath_test.gomonitoring/registry.go
Production metric from PR #571 on n1 (leader) after 15 min of BullMQ traffic: fallback 1463 (96.4%) hit 54 (3.6%) skip_loaded 0 skip_cached_type 0 The Lua-layer guards (luaZSetAlreadyLoaded / cachedType) are not the issue; the fast path is being rejected server-side. The single "fallback" bucket is too coarse to tell whether that is the legacy-blob eligibility check, the truncation guard, the large-offset short-circuit, the empty-key missing-zset path, or the string-priority wrong-type guard. Subdivide the outcome label and thread the reason back from zsetRangeByScoreFast: - fallback_ineligible (legacy-blob zset exists) - fallback_large_offset (offset >= maxWideScanLimit short-circuit) - fallback_truncated (scan saturated) - fallback_missing_key (zsetRangeEmptyFastResult: zsetExists=false) - fallback_wrong_type (hasHigherPriorityStringEncoding=true) - fallback_other (error / fallback we didn't classify) All counters are pre-resolved at ForCommand() time (same hot-path shape as #571) so routing by reason is still a single atomic Inc per call. ObserveFallback(reason) dispatches via a small switch. The hypothesis from the metric pattern is that most fallbacks are "fallback_missing_key" (BullMQ polling empty delayed queues). Ship this first, measure in prod, then write the fix accordingly.
Production metric from PR #571 on n1 (leader) after 15 min of BullMQ traffic: fallback 1463 (96.4%) hit 54 (3.6%) skip_loaded 0 skip_cached_type 0 The Lua-layer guards (luaZSetAlreadyLoaded / cachedType) are not the issue; the fast path is being rejected server-side. The single "fallback" bucket is too coarse to tell whether that is the legacy-blob eligibility check, the truncation guard, the large-offset short-circuit, the empty-key missing-zset path, or the string-priority wrong-type guard. Subdivide the outcome label and thread the reason back from zsetRangeByScoreFast: - fallback_ineligible (legacy-blob zset exists) - fallback_large_offset (offset >= maxWideScanLimit short-circuit) - fallback_truncated (scan saturated) - fallback_missing_key (zsetRangeEmptyFastResult: zsetExists=false) - fallback_wrong_type (hasHigherPriorityStringEncoding=true) - fallback_other (error / fallback we didn't classify) All counters are pre-resolved at ForCommand() time (same hot-path shape as #571) so routing by reason is still a single atomic Inc per call. ObserveFallback(reason) dispatches via a small switch. The hypothesis from the metric pattern is that most fallbacks are "fallback_missing_key" (BullMQ polling empty delayed queues). Ship this first, measure in prod, then write the fix accordingly.
…572) ## Summary Follow-up to PR #571. The counter it added showed (on n1 / leader, 15 min of BullMQ traffic): ``` fallback 1463 (96.4%) hit 54 (3.6%) skip_loaded 0 skip_cached_type 0 ``` The Lua-layer guards are not the issue; the fast path is being rejected server-side. The current single `fallback` bucket is too coarse to tell *why*: legacy-blob eligibility, large-offset short-circuit, truncation, empty-key missing-zset, or wrong-type guard. This PR subdivides the `fallback` outcome label so the real cause is visible in 5 minutes of prod scrapes. ## Changes - New outcome labels (replacing single `fallback`): - `fallback_ineligible` — `zsetFastPathEligible` rejected (legacy-blob zset) - `fallback_large_offset` — `offset >= maxWideScanLimit` short-circuit - `fallback_truncated` — scan hit `scanLimit` with request unsatisfied - `fallback_missing_key` — `zsetRangeEmptyFastResult` saw `zsetExists=false` - `fallback_wrong_type` — `hasHigherPriorityStringEncoding=true` - `fallback_other` — error / unclassified - `zsetRangeByScoreFast` grows a `reason string` return (empty when `hit=true`); all `hit=false` sites populate a specific reason. - `zsetRangeEmptyFastResult` now returns `(hit, reason, err)`; the old `[]redisZSetEntry` return was always `nil`. - `LuaFastPathCmd` pre-resolves 6 new counters (same hot-path shape as #571: one atomic `Inc` per redis.call()). - `ObserveFallback(reason)` dispatches via a small switch; unknown reasons land on `fallback_other`. ## Motivation Pre-deploy hypothesis: most fallbacks are `fallback_missing_key` (BullMQ polling empty delayed queues). Ship this first, measure in prod, then write the actual fix accordingly. ## Test plan - [x] `go test -race -short ./adapter/...` (53s) green - [x] `go test -race ./monitoring/...` green - [x] `TestLuaFastPathObserverCountsByCmdAndOutcome` updated for new labels + unknown-reason dispatch - [ ] Deploy to prod; watch `sum by (outcome) (rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m]))` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved query performance monitoring with categorized fallback reasons for sorted set operations, providing enhanced visibility into query optimization effectiveness and performance diagnostics. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
elastickv_lua_cmd_fastpath_total{cmd,outcome}counter to measure how often Lua-side fast paths actually take the fast path vs fall back (hit / skip_loaded / skip_cached_type / fallback).LuaFastPathObserverthrough Registry → RedisServer option →luaScriptContext.observeFastPathOutcome.cmdZRangeByScoreas the first caller.Motivation
Post-deploy CPU profile on a BullMQ-style workload shows
cmdZRangeByScoreSlowconsuming ~14x the CPU ofzsetRangeByScoreFast(860ms vs 60ms in a 30 s profile), despite the fast path being eligible on paper. Without per-outcome instrumentation it is not possible to tell whether scripts are hittingluaZSetAlreadyLoaded(a prior ZCARD/ZCOUNT on the same key),cachedType(SET/DEL earlier in the script), or the server-sidehit=falsefallback (legacy-blob zset, truncation, large-offset short-circuit).Shipping this counter first so we can decide whether the follow-up should loosen the
luaZSetAlreadyLoadedguard, or target a different cause.Test plan
go test ./adapter/... ./monitoring/...greenTestLuaFastPathObserver*in monitoring/hotpath_test.gorate(elastickv_lua_cmd_fastpath_total{cmd=zrangebyscore}[5m])by outcome for ~5 minutesSummary by CodeRabbit
New Features
Tests