Skip to content

obs(redis-lua): subdivide ZRANGEBYSCORE fast-path fallback by reason#572

Merged
bootjp merged 2 commits intomainfrom
feat/lua-fastpath-fallback-reason
Apr 21, 2026
Merged

obs(redis-lua): subdivide ZRANGEBYSCORE fast-path fallback by reason#572
bootjp merged 2 commits intomainfrom
feat/lua-fastpath-fallback-reason

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 21, 2026

Summary

Follow-up to PR #571. The counter it added showed (on n1 / leader, 15 min of BullMQ traffic):

fallback         1463  (96.4%)
hit                54  (3.6%)
skip_loaded         0
skip_cached_type    0

The Lua-layer guards are not the issue; the fast path is being rejected server-side. The current single fallback bucket is too coarse to tell why: legacy-blob eligibility, large-offset short-circuit, truncation, empty-key missing-zset, or wrong-type guard.

This PR subdivides the fallback outcome label so the real cause is visible in 5 minutes of prod scrapes.

Changes

  • New outcome labels (replacing single fallback):
    • fallback_ineligiblezsetFastPathEligible rejected (legacy-blob zset)
    • fallback_large_offsetoffset >= maxWideScanLimit short-circuit
    • fallback_truncated — scan hit scanLimit with request unsatisfied
    • fallback_missing_keyzsetRangeEmptyFastResult saw zsetExists=false
    • fallback_wrong_typehasHigherPriorityStringEncoding=true
    • fallback_other — error / unclassified
  • zsetRangeByScoreFast grows a reason string return (empty when hit=true); all hit=false sites populate a specific reason.
  • zsetRangeEmptyFastResult now returns (hit, reason, err); the old []redisZSetEntry return was always nil.
  • LuaFastPathCmd pre-resolves 6 new counters (same hot-path shape as obs(redis-lua): per-cmd fast-path outcome counter #571: one atomic Inc per redis.call()).
  • ObserveFallback(reason) dispatches via a small switch; unknown reasons land on fallback_other.

Motivation

Pre-deploy hypothesis: most fallbacks are fallback_missing_key (BullMQ polling empty delayed queues). Ship this first, measure in prod, then write the actual fix accordingly.

Test plan

  • go test -race -short ./adapter/... (53s) green
  • go test -race ./monitoring/... green
  • TestLuaFastPathObserverCountsByCmdAndOutcome updated for new labels + unknown-reason dispatch
  • Deploy to prod; watch sum by (outcome) (rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m]))

Summary by CodeRabbit

  • New Features
    • Improved query performance monitoring with categorized fallback reasons for sorted set operations, providing enhanced visibility into query optimization effectiveness and performance diagnostics.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 21, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR introduces typed fallback reasons for Lua fast-path query tracking. It replaces a single boolean fallback indicator with a LuaFastPathFallbackReason type and seven distinct reason constants, threading the reason value through ZSET range helpers and updating the metrics observer to record reason-specific counters rather than a single aggregated fallback metric.

Changes

Cohort / File(s) Summary
Fallback Reason Monitoring Infrastructure
monitoring/hotpath.go
Introduced LuaFastPathFallbackReason type and seven exported constants (Ineligible, MissingKey, WrongType, Truncated, LargeOffset, Other, None). Replaced single fallback counter in LuaFastPathCmd with reason-specific counters. Updated ObserveFallback() signature to accept a reason argument and increment the corresponding counter via switch logic.
ZSET Fast-Path Query Helpers
adapter/redis_compat_commands.go
Updated zsetRangeByScoreFast, finalizeZSetFastRange, and zsetRangeEmptyFastResult return signatures to include monitoring.LuaFastPathFallbackReason. Replaced all hit=false fallthroughs with explicit reason codes (LuaFastPathFallbackIneligible, LuaFastPathFallbackMissingKey, LuaFastPathFallbackWrongType, LuaFastPathFallbackTruncated, LuaFastPathFallbackLargeOffset, LuaFastPathFallbackOther). Empty results now return reason="" for successful hits and coded reasons for fallback cases.
Command Handler Integration
adapter/redis_lua_context.go
Updated cmdZRangeByScore to unpack the new fallbackReason return value from zrangeByScoreFastPath and pass it to zrangeMetrics.ObserveFallback(fallbackReason) when fast path reports hit=false.
Test Coverage
monitoring/hotpath_test.go
Updated test assertions to verify reason-specific fallback counters instead of a single outcome="fallback" counter. Added test cases for specific reasons (LuaFastPathFallbackMissingKey, LuaFastPathFallbackTruncated) and unknown reason fallback routing. Updated metric help text validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 Hops through code with glee,
"Seven reasons now," cries ye!
No more vague fallback calls—
Each metric reason sprawls!
Observability, oh what a thrill!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: subdividing the ZRANGEBYSCORE fast-path fallback metric by specific fallback reasons for better observability.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/lua-fastpath-fallback-reason

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Production metric from PR #571 on n1 (leader) after 15 min of BullMQ
traffic:
    fallback         1463  (96.4%)
    hit                54  (3.6%)
    skip_loaded         0
    skip_cached_type    0

The Lua-layer guards (luaZSetAlreadyLoaded / cachedType) are not the
issue; the fast path is being rejected server-side. The single
"fallback" bucket is too coarse to tell whether that is the
legacy-blob eligibility check, the truncation guard, the large-offset
short-circuit, the empty-key missing-zset path, or the
string-priority wrong-type guard.

Subdivide the outcome label and thread the reason back from
zsetRangeByScoreFast:

  - fallback_ineligible   (legacy-blob zset exists)
  - fallback_large_offset (offset >= maxWideScanLimit short-circuit)
  - fallback_truncated    (scan saturated)
  - fallback_missing_key  (zsetRangeEmptyFastResult: zsetExists=false)
  - fallback_wrong_type   (hasHigherPriorityStringEncoding=true)
  - fallback_other        (error / fallback we didn't classify)

All counters are pre-resolved at ForCommand() time (same hot-path
shape as #571) so routing by reason is still a single atomic Inc per
call. ObserveFallback(reason) dispatches via a small switch.

The hypothesis from the metric pattern is that most fallbacks are
"fallback_missing_key" (BullMQ polling empty delayed queues). Ship
this first, measure in prod, then write the fix accordingly.
@bootjp bootjp force-pushed the feat/lua-fastpath-fallback-reason branch from 1d6d6ef to c275eeb Compare April 21, 2026 14:38
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the monitoring of the Lua fast-path by subdividing the generic 'fallback' outcome into specific, granular reasons such as ineligibility, missing keys, wrong types, and truncated results. The changes involve updating function signatures across the adapter layer to propagate these reasons and expanding the monitoring package to support individual counters for each fallback type. A review comment suggests improving type safety by introducing a custom type for the fallback reason instead of using a raw string.

Comment thread monitoring/hotpath.go Outdated
Comment on lines 182 to 201
func (c LuaFastPathCmd) ObserveFallback(reason string) {
var counter prometheus.Counter
switch reason {
case LuaFastPathOutcomeFallbackIneligible:
counter = c.fallbackIneligible
case LuaFastPathOutcomeFallbackMissingKey:
counter = c.fallbackMissingKey
case LuaFastPathOutcomeFallbackWrongType:
counter = c.fallbackWrongType
case LuaFastPathOutcomeFallbackTruncated:
counter = c.fallbackTruncated
case LuaFastPathOutcomeFallbackLargeOffset:
counter = c.fallbackLargeOffset
default:
counter = c.fallbackOther
}
if counter != nil {
counter.Inc()
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ObserveFallback method uses a switch on a string to route to the correct counter. While this is safe because unknown values land in the fallback_other bucket, consider using a custom type for the reason parameter (e.g., type LuaFastPathFallbackReason string) to improve type safety and documentation of the expected values across the adapter and monitoring packages.

Address gemini review on PR #572: ObserveFallback took a bare
string, so call sites in adapter/ could silently drift away from
the constants defined in monitoring/. Introduce a named string
type so typos now fail at compile time.

- New type monitoring.LuaFastPathFallbackReason.
- Fallback reason constants move to the new type; the Outcome*
  names (which were only used as fallback labels) are renamed to
  LuaFastPathFallback* to match.
- Signatures of zsetRangeByScoreFast, zsetRangeEmptyFastResult,
  finalizeZSetFastRange, and zrangeByScoreFastPath return the
  typed value instead of bare string.
- ObserveFallback(reason LuaFastPathFallbackReason) switch is
  unchanged in behaviour; unknown values still land on
  fallback_other.
- LuaFastPathFallbackNone = "" is the sentinel returned alongside
  hit=true; callers must not observe it.

Label values on the wire are unchanged (string(reason)), so the
metric output and dashboards are untouched.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 21, 2026

Addressed gemini's review (aa0c0de):

  • Medium — bare string for fallback reason: introduced monitoring.LuaFastPathFallbackReason (named string type) and propagated it through zsetRangeByScoreFastfinalizeZSetFastRangezsetRangeEmptyFastResultzrangeByScoreFastPathObserveFallback. Typos in call sites now fail at compile time instead of silently routing to fallback_other.

The corresponding constants are renamed LuaFastPathOutcomeFallback*LuaFastPathFallback* (on the new type); LuaFastPathOutcomeHit/SkipLoaded/SkipCachedType stay as bare-string constants since they're non-fallback outcomes. Label values on the wire are unchanged — string(reason) preserves dashboards.

/gemini review

@bootjp bootjp merged commit 626bcbe into main Apr 21, 2026
7 of 8 checks passed
@bootjp bootjp deleted the feat/lua-fastpath-fallback-reason branch April 21, 2026 15:07
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces granular monitoring for Lua fast-path fallbacks by replacing the generic fallback outcome with specific reasons such as ineligibility, missing keys, wrong types, truncated results, and large offsets. The monitoring package now defines a LuaFastPathFallbackReason type and associated constants, while the Redis adapter and Lua context have been updated to propagate and record these detailed metrics. I have no feedback to provide.

bootjp added a commit that referenced this pull request Apr 21, 2026
…#573)

## Summary

Fix the top source of slow-path fallbacks on BullMQ-style workloads.

Production metric from PR #572 after 8 minutes of traffic on n1
(leader):

```
fallback_missing_key  3288  (96.25%)
hit                    128  ( 3.75%)
fallback_ineligible     0
fallback_large_offset   0
fallback_truncated      0
fallback_wrong_type     0
fallback_other          0
skip_loaded             0
skip_cached_type        0
```

Every ZRANGEBYSCORE that hit an absent zset key was punting to the slow
path *purely* to distinguish "missing" from "wrong type" — then the slow
path re-probed the same member / meta / delta rows and the same string /
list / hash / set prefixes, only to return an empty array. Pure
duplicate I/O on a poll loop that is empty most of the time.

## Change

In `zsetRangeEmptyFastResult`'s `!zsetExists` branch, classify the key
via `keyTypeAt`:

- `redisTypeNone` → return `hit=true` with empty result (the correct
Redis answer; saves the slow-path round-trip).
- Any other type → keep `hit=false` with `fallback_wrong_type` so the
slow path still surfaces `WRONGTYPE`.

`keyTypeAt` (post-TTL) is used so a TTL-expired string is treated as
logically absent, matching Redis semantics. `rawKeyTypeAt` would
incorrectly surface `WRONGTYPE` for that case.

## Expected prod impact

- `fallback_missing_key` → ~0
- `hit` → ~96%
- Slow-path dispatch rate for ZRANGEBYSCORE drops by ~20x
- Per-call I/O on the missing-key hot path: ~13–16 → ~11

## Test plan

- [x] `go test -race -short ./adapter/...` (59 s) green
- [x] New `TestLua_ZRANGEBYSCORE_MissingKeyReturnsEmptyViaFastPath` pins
the short-circuit
- [x] `TestLua_ZRANGEBYSCORE_StringKeyReturnsWrongType` and
`...ListKeyReturnsWrongType` verify WRONGTYPE is preserved for non-zset
encodings
- [ ] Deploy; watch `sum by (outcome)
(rate(elastickv_lua_cmd_fastpath_total{cmd="zrangebyscore"}[5m]))` and
`elastickv_lua_redis_call_duration_seconds` p50/p95
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant