obs(redis-lua): subdivide ZRANGEBYSCORE fast-path fallback by reason#572
obs(redis-lua): subdivide ZRANGEBYSCORE fast-path fallback by reason#572
Conversation
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThis PR introduces typed fallback reasons for Lua fast-path query tracking. It replaces a single boolean fallback indicator with a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Production metric from PR #571 on n1 (leader) after 15 min of BullMQ traffic: fallback 1463 (96.4%) hit 54 (3.6%) skip_loaded 0 skip_cached_type 0 The Lua-layer guards (luaZSetAlreadyLoaded / cachedType) are not the issue; the fast path is being rejected server-side. The single "fallback" bucket is too coarse to tell whether that is the legacy-blob eligibility check, the truncation guard, the large-offset short-circuit, the empty-key missing-zset path, or the string-priority wrong-type guard. Subdivide the outcome label and thread the reason back from zsetRangeByScoreFast: - fallback_ineligible (legacy-blob zset exists) - fallback_large_offset (offset >= maxWideScanLimit short-circuit) - fallback_truncated (scan saturated) - fallback_missing_key (zsetRangeEmptyFastResult: zsetExists=false) - fallback_wrong_type (hasHigherPriorityStringEncoding=true) - fallback_other (error / fallback we didn't classify) All counters are pre-resolved at ForCommand() time (same hot-path shape as #571) so routing by reason is still a single atomic Inc per call. ObserveFallback(reason) dispatches via a small switch. The hypothesis from the metric pattern is that most fallbacks are "fallback_missing_key" (BullMQ polling empty delayed queues). Ship this first, measure in prod, then write the fix accordingly.
1d6d6ef to
c275eeb
Compare
There was a problem hiding this comment.
Code Review
This pull request enhances the monitoring of the Lua fast-path by subdividing the generic 'fallback' outcome into specific, granular reasons such as ineligibility, missing keys, wrong types, and truncated results. The changes involve updating function signatures across the adapter layer to propagate these reasons and expanding the monitoring package to support individual counters for each fallback type. A review comment suggests improving type safety by introducing a custom type for the fallback reason instead of using a raw string.
| func (c LuaFastPathCmd) ObserveFallback(reason string) { | ||
| var counter prometheus.Counter | ||
| switch reason { | ||
| case LuaFastPathOutcomeFallbackIneligible: | ||
| counter = c.fallbackIneligible | ||
| case LuaFastPathOutcomeFallbackMissingKey: | ||
| counter = c.fallbackMissingKey | ||
| case LuaFastPathOutcomeFallbackWrongType: | ||
| counter = c.fallbackWrongType | ||
| case LuaFastPathOutcomeFallbackTruncated: | ||
| counter = c.fallbackTruncated | ||
| case LuaFastPathOutcomeFallbackLargeOffset: | ||
| counter = c.fallbackLargeOffset | ||
| default: | ||
| counter = c.fallbackOther | ||
| } | ||
| if counter != nil { | ||
| counter.Inc() | ||
| } | ||
| } |
There was a problem hiding this comment.
The ObserveFallback method uses a switch on a string to route to the correct counter. While this is safe because unknown values land in the fallback_other bucket, consider using a custom type for the reason parameter (e.g., type LuaFastPathFallbackReason string) to improve type safety and documentation of the expected values across the adapter and monitoring packages.
Address gemini review on PR #572: ObserveFallback took a bare string, so call sites in adapter/ could silently drift away from the constants defined in monitoring/. Introduce a named string type so typos now fail at compile time. - New type monitoring.LuaFastPathFallbackReason. - Fallback reason constants move to the new type; the Outcome* names (which were only used as fallback labels) are renamed to LuaFastPathFallback* to match. - Signatures of zsetRangeByScoreFast, zsetRangeEmptyFastResult, finalizeZSetFastRange, and zrangeByScoreFastPath return the typed value instead of bare string. - ObserveFallback(reason LuaFastPathFallbackReason) switch is unchanged in behaviour; unknown values still land on fallback_other. - LuaFastPathFallbackNone = "" is the sentinel returned alongside hit=true; callers must not observe it. Label values on the wire are unchanged (string(reason)), so the metric output and dashboards are untouched.
|
Addressed gemini's review (aa0c0de):
The corresponding constants are renamed /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces granular monitoring for Lua fast-path fallbacks by replacing the generic fallback outcome with specific reasons such as ineligibility, missing keys, wrong types, truncated results, and large offsets. The monitoring package now defines a LuaFastPathFallbackReason type and associated constants, while the Redis adapter and Lua context have been updated to propagate and record these detailed metrics. I have no feedback to provide.
…#573) ## Summary Fix the top source of slow-path fallbacks on BullMQ-style workloads. Production metric from PR #572 after 8 minutes of traffic on n1 (leader): ``` fallback_missing_key 3288 (96.25%) hit 128 ( 3.75%) fallback_ineligible 0 fallback_large_offset 0 fallback_truncated 0 fallback_wrong_type 0 fallback_other 0 skip_loaded 0 skip_cached_type 0 ``` Every ZRANGEBYSCORE that hit an absent zset key was punting to the slow path *purely* to distinguish "missing" from "wrong type" — then the slow path re-probed the same member / meta / delta rows and the same string / list / hash / set prefixes, only to return an empty array. Pure duplicate I/O on a poll loop that is empty most of the time. ## Change In `zsetRangeEmptyFastResult`'s `!zsetExists` branch, classify the key via `keyTypeAt`: - `redisTypeNone` → return `hit=true` with empty result (the correct Redis answer; saves the slow-path round-trip). - Any other type → keep `hit=false` with `fallback_wrong_type` so the slow path still surfaces `WRONGTYPE`. `keyTypeAt` (post-TTL) is used so a TTL-expired string is treated as logically absent, matching Redis semantics. `rawKeyTypeAt` would incorrectly surface `WRONGTYPE` for that case. ## Expected prod impact - `fallback_missing_key` → ~0 - `hit` → ~96% - Slow-path dispatch rate for ZRANGEBYSCORE drops by ~20x - Per-call I/O on the missing-key hot path: ~13–16 → ~11 ## Test plan - [x] `go test -race -short ./adapter/...` (59 s) green - [x] New `TestLua_ZRANGEBYSCORE_MissingKeyReturnsEmptyViaFastPath` pins the short-circuit - [x] `TestLua_ZRANGEBYSCORE_StringKeyReturnsWrongType` and `...ListKeyReturnsWrongType` verify WRONGTYPE is preserved for non-zset encodings - [ ] Deploy; watch `sum by (outcome) (rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m]))` and `elastickv_lua_redis_call_duration_seconds` p50/p95
Summary
Follow-up to PR #571. The counter it added showed (on n1 / leader, 15 min of BullMQ traffic):
The Lua-layer guards are not the issue; the fast path is being rejected server-side. The current single
fallbackbucket is too coarse to tell why: legacy-blob eligibility, large-offset short-circuit, truncation, empty-key missing-zset, or wrong-type guard.This PR subdivides the
fallbackoutcome label so the real cause is visible in 5 minutes of prod scrapes.Changes
fallback):fallback_ineligible—zsetFastPathEligiblerejected (legacy-blob zset)fallback_large_offset—offset >= maxWideScanLimitshort-circuitfallback_truncated— scan hitscanLimitwith request unsatisfiedfallback_missing_key—zsetRangeEmptyFastResultsawzsetExists=falsefallback_wrong_type—hasHigherPriorityStringEncoding=truefallback_other— error / unclassifiedzsetRangeByScoreFastgrows areason stringreturn (empty whenhit=true); allhit=falsesites populate a specific reason.zsetRangeEmptyFastResultnow returns(hit, reason, err); the old[]redisZSetEntryreturn was alwaysnil.LuaFastPathCmdpre-resolves 6 new counters (same hot-path shape as obs(redis-lua): per-cmd fast-path outcome counter #571: one atomicIncper redis.call()).ObserveFallback(reason)dispatches via a small switch; unknown reasons land onfallback_other.Motivation
Pre-deploy hypothesis: most fallbacks are
fallback_missing_key(BullMQ polling empty delayed queues). Ship this first, measure in prod, then write the actual fix accordingly.Test plan
go test -race -short ./adapter/...(53s) greengo test -race ./monitoring/...greenTestLuaFastPathObserverCountsByCmdAndOutcomeupdated for new labels + unknown-reason dispatchsum by (outcome) (rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m]))Summary by CodeRabbit