Conversation
- proxy/dualwrite.go: add elapsed, attempts, sha, noscript_fallback fields to the "secondary write failed" warn log for EVALSHA commands; extract failure reporting to recordSecondaryWriteFailure to keep cyclomatic complexity within lint limits - proxy/command.go: define cmdNameAUTH / cmdNameSELECT constants - proxy/proxy.go, proxy/pubsub.go: use the new constants (goconst fix) - adapter/redis_retry.go: add debug log on each write-conflict retry (attempt number, backoff, error kind) - adapter/redis_lua.go: add warn log on Lua script failure with elapsed time and total attempt count from retryRedisWrite
…oard Add a new row with two panels to elastickv-redis-summary.json: - Secondary Write Error Rate by Command: rate of proxy_command_total errors on the secondary backend, broken down by command name - EVALSHA Secondary Latency (p50/p95/p99): duration histogram for EVALSHA/EVALSHA_RO on the secondary, using proxy_command_duration_seconds
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 28 minutes and 15 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR enhances observability for secondary write operations in the dual-write proxy by introducing instrumentation (timing, retry counts, noscript fallback detection) in Lua script execution paths, centralizing secondary write failure logging via a new helper method, adding debug retry telemetry, and expanding the Grafana dashboard with secondary write diagnostics. Command-name constants are also standardized across proxy components. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request enhances diagnostics for secondary writes and Lua script executions by adding structured logging for failures, including elapsed time and retry attempts. It also introduces a 'Secondary Write Diagnostics' section to the Grafana dashboard. Feedback suggests including the SHA-1 digest in Lua failure logs to better identify failing scripts and refining Grafana queries to specifically isolate secondary backend metrics, as current expressions aggregate data across all backends.
| slog.Default().Warn("lua script execution failed", | ||
| "elapsed", elapsed, "attempts", attempts, "err", err) |
There was a problem hiding this comment.
The log message for Lua script execution failure is missing the script's SHA-1 digest. Including the SHA is essential for identifying which specific script is failing in an environment with multiple Lua scripts. Since the SHA is already calculated or looked up in the callers (eval/evalsha), it should be included in the structured log here.
| slog.Default().Warn("lua script execution failed", | |
| "elapsed", elapsed, "attempts", attempts, "err", err) | |
| slog.Default().Warn("lua script execution failed", | |
| "sha", luaScriptSHA(script), | |
| "elapsed", elapsed, "attempts", attempts, "err", err) |
| { | ||
| "datasource": "$datasource", | ||
| "editorMode": "code", | ||
| "expr": "sum by (command) (rate(proxy_command_total{job=\"elastickv\", status=\"error\"}[5m]))", |
There was a problem hiding this comment.
This Prometheus expression aggregates proxy_command_total across all backends. Since the DualWriter records this metric for both primary and secondary backends (see proxy/dualwrite.go), this panel will incorrectly include errors from the primary backend in a section titled 'Secondary Write Diagnostics'. Consider filtering by the backend label to isolate secondary failures.
| "expr": "histogram_quantile(0.50, sum by (le) (rate(proxy_command_duration_seconds_bucket{job=\"elastickv\", command=~\"EVALSHA|EVALSHA_RO\"}[5m])))", | ||
| "legendFormat": "p50", | ||
| "range": true, | ||
| "refId": "A" | ||
| }, | ||
| { | ||
| "datasource": "$datasource", | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.95, sum by (le) (rate(proxy_command_duration_seconds_bucket{job=\"elastickv\", command=~\"EVALSHA|EVALSHA_RO\"}[5m])))", | ||
| "legendFormat": "p95", | ||
| "range": true, | ||
| "refId": "B" | ||
| }, | ||
| { | ||
| "datasource": "$datasource", | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.99, sum by (le) (rate(proxy_command_duration_seconds_bucket{job=\"elastickv\", command=~\"EVALSHA|EVALSHA_RO\"}[5m])))", | ||
| "legendFormat": "p99", | ||
| "range": true, | ||
| "refId": "C" | ||
| } |
There was a problem hiding this comment.
The latency queries for EVALSHA/EVALSHA_RO aggregate data from all backends. This mixes primary and secondary latency distributions, which makes it difficult to isolate performance issues specifically related to the secondary backend (such as Raft round-trip latency or write conflicts). Filtering by the secondary backend label is recommended to ensure the 'Secondary Write Diagnostics' section remains accurate.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
proxy/proxy.go (1)
336-359: LGTM — consider extending the constant usage to remaining hard-coded sites.The
SELECT/AUTHcomparisons now use the shared constants. Two related sites still use string literals and could be updated in a follow-up for consistency:
proxy/proxy.goline 363:if name == "HELLO"(consider acmdNameHELLOconstant if more HELLO handling is anticipated).proxy/leader_aware_backend_test.golines 117-121:case "CLIENT", "AUTH", "SELECT", "PING":— could referencecmdNameAUTH/cmdNameSELECT/cmdPingfor consistency (test file, low priority).Not blocking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/proxy.go` around lines 336 - 359, Replace remaining hard-coded command literals with shared constants: add a new constant cmdNameHELLO = "HELLO" alongside the existing command constants and change the comparison in the if that currently uses "HELLO" in proxy/proxy.go to use cmdNameHELLO; likewise update the test case switch in leader_aware_backend_test.go to reference cmdNameAUTH, cmdNameSELECT, and cmdPing instead of the literal strings "AUTH", "SELECT", "PING" (import or reference the existing constants from the same package), ensuring names match the existing cmdName* naming convention.proxy/command.go (1)
18-21: LGTM — constants are correctly defined.The
cmdNameAUTH/cmdNameSELECTconstants mirror the existingcmdSubscribe,cmdPing, etc. block inproxy/pubsub.go(lines 25-33). Consider consolidating all of these command-name constants into a single block inproxy/command.goin a follow-up so there's one canonical location — not required for this PR.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/command.go` around lines 18 - 21, Constants for command names are split across files; consolidate them into a single canonical block in proxy/command.go by moving the other command-name constants (e.g., cmdSubscribe, cmdPing, and any similar cmd* symbols found in proxy/pubsub.go) into the same const block alongside cmdNameAUTH and cmdNameSELECT, update any references to those constants (no renaming) so they import from the same package scope, and remove the duplicate const block from proxy/pubsub.go to avoid duplication.adapter/redis_retry.go (1)
100-101: Debug log message is misleading for txn-locked retries.
retryRedisWritehandles bothstore.ErrWriteConflictandkv.ErrTxnLocked, but the log message hardcodes "write conflict". Consider using the error text or including the policy name so operators can distinguish the two retry paths in telemetry.Proposed tweak
- slog.Default().Debug("retryRedisWrite: retrying on write conflict", - "attempt", attempt+1, "backoff", backoff, "err", err) + slog.Default().Debug("retryRedisWrite: retrying on retryable txn error", + "attempt", attempt+1, "backoff", backoff, "err", err)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/redis_retry.go` around lines 100 - 101, The debug log in retryRedisWrite misleadingly hardcodes "write conflict" though the function can retry for both store.ErrWriteConflict and kv.ErrTxnLocked; update the slog.Default().Debug call in retryRedisWrite to include the actual error (err.Error() or err) and/or the retry policy name so telemetry shows which path triggered (e.g., include fields "error", err, and "policy", policyName) rather than a fixed "write conflict" message.proxy/dualwrite.go (1)
320-337: Prefer the existing command-name constants over string literals.Elsewhere in the package EVALSHA/EVALSHA_RO are referenced through constants (e.g.
cmdEvalSHA,cmdEvalSHAROinproxy/script_cache.go). Using the same constants here keeps the surface consistent and guards against typo drift.Proposed tweak
- if (cmd == "EVALSHA" || cmd == "EVALSHA_RO") && len(iArgs) > 1 { + if (cmd == cmdEvalSHA || cmd == cmdEvalSHARO) && len(iArgs) > 1 {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/dualwrite.go` around lines 320 - 337, In recordSecondaryWriteFailure (method DualWriter.recordSecondaryWriteFailure) replace the string-literal checks for "EVALSHA" and "EVALSHA_RO" with the package constants used elsewhere (e.g., cmdEvalSHA and cmdEvalSHARO from proxy/script_cache.go) so the conditional becomes if (cmd == cmdEvalSHA || cmd == cmdEvalSHARO) && len(iArgs) > 1 { ... }; this keeps command names consistent across the package and avoids typo drift—no other behavior changes required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@monitoring/grafana/dashboards/elastickv-redis-summary.json`:
- Around line 628-691: The EVALSHA Secondary Latency panel is querying
proxy_command_duration_seconds_bucket without restricting the backend label, so
p50/p95/p99 are mixed with primary latencies; update the three target
expressions (the expr strings for refId "A", "B", "C" in the "EVALSHA Secondary
Latency (p50 / p95 / p99)" panel) to include a backend matcher like
backend="secondary" (or use your templated backend variable, e.g.
backend=~"$backend") in the metric selector
proxy_command_duration_seconds_bucket{job="elastickv",
command=~"EVALSHA|EVALSHA_RO", backend="secondary"} so the histogram_quantile is
computed only for secondary backends.
- Around line 579-627: The panel "Secondary Write Error Rate by Command" uses
the metric proxy_command_total without filtering the backend, so it currently
includes primary errors; update the PromQL in the panel's targets (expr for
refId "A") to include a backend label filter matching the secondary backend
(e.g. add backend="<secondary-name>"), using the actual value emitted by
d.secondary.Name() or expose that name as a dashboard variable and reference it,
ensuring proxy_command_total{job="elastickv", backend="<secondary>",
status="error"} is used to scope the query to the secondary backend.
---
Nitpick comments:
In `@adapter/redis_retry.go`:
- Around line 100-101: The debug log in retryRedisWrite misleadingly hardcodes
"write conflict" though the function can retry for both store.ErrWriteConflict
and kv.ErrTxnLocked; update the slog.Default().Debug call in retryRedisWrite to
include the actual error (err.Error() or err) and/or the retry policy name so
telemetry shows which path triggered (e.g., include fields "error", err, and
"policy", policyName) rather than a fixed "write conflict" message.
In `@proxy/command.go`:
- Around line 18-21: Constants for command names are split across files;
consolidate them into a single canonical block in proxy/command.go by moving the
other command-name constants (e.g., cmdSubscribe, cmdPing, and any similar cmd*
symbols found in proxy/pubsub.go) into the same const block alongside
cmdNameAUTH and cmdNameSELECT, update any references to those constants (no
renaming) so they import from the same package scope, and remove the duplicate
const block from proxy/pubsub.go to avoid duplication.
In `@proxy/dualwrite.go`:
- Around line 320-337: In recordSecondaryWriteFailure (method
DualWriter.recordSecondaryWriteFailure) replace the string-literal checks for
"EVALSHA" and "EVALSHA_RO" with the package constants used elsewhere (e.g.,
cmdEvalSHA and cmdEvalSHARO from proxy/script_cache.go) so the conditional
becomes if (cmd == cmdEvalSHA || cmd == cmdEvalSHARO) && len(iArgs) > 1 { ... };
this keeps command names consistent across the package and avoids typo drift—no
other behavior changes required.
In `@proxy/proxy.go`:
- Around line 336-359: Replace remaining hard-coded command literals with shared
constants: add a new constant cmdNameHELLO = "HELLO" alongside the existing
command constants and change the comparison in the if that currently uses
"HELLO" in proxy/proxy.go to use cmdNameHELLO; likewise update the test case
switch in leader_aware_backend_test.go to reference cmdNameAUTH, cmdNameSELECT,
and cmdPing instead of the literal strings "AUTH", "SELECT", "PING" (import or
reference the existing constants from the same package), ensuring names match
the existing cmdName* naming convention.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7d073945-9d33-41f6-90e7-3a8c8ed3eeef
📒 Files selected for processing (7)
adapter/redis_lua.goadapter/redis_retry.gomonitoring/grafana/dashboards/elastickv-redis-summary.jsonproxy/command.goproxy/dualwrite.goproxy/proxy.goproxy/pubsub.go
Add per-phase Prometheus metrics to EVAL/EVALSHA execution so dashboards
can distinguish Lua VM slowness, Raft commit latency, and write-conflict
retries as separate root causes.
monitoring/lua.go (new):
- LuaScriptReport / LuaScriptObserver interface
- LuaMetrics with three histograms:
elastickv_lua_exec_duration_seconds{outcome} — pure Lua VM (PCall)
elastickv_lua_commit_duration_seconds{outcome} — Raft + storage dispatch
elastickv_lua_conflict_retries — retry count distribution
monitoring/registry.go:
- instantiate LuaMetrics in NewRegistry
- expose LuaObserver() accessor
adapter/redis.go:
- add luaObserver field and WithLuaObserver() option
adapter/redis_lua.go:
- time PCall and commit() separately inside retryRedisWrite closure
- call luaObserver.ObserveLuaScript after each invocation
main.go:
- wire WithLuaObserver(metricsRegistry.LuaObserver())
grafana/elastickv-redis-summary.json:
- new "Lua Script Internals" row with three panels:
Lua VM Execution Time (p50/p95/p99)
Raft Commit Time (p50/p95/p99)
Write-Conflict Retries per Script (p50/p95/p99)
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces detailed monitoring for Lua script execution and secondary backend writes, including new Prometheus metrics and Grafana dashboard panels for execution timing and retry counts. It also refactors command handling to use constants and improves error logging in the dual-write proxy. Feedback includes a suggestion to centralize complex retry logic for better maintainability and a recommendation to use safer type assertions when logging script identifiers to prevent potential runtime issues.
| var usedNOSCRIPTFallback bool | ||
| args := iArgs | ||
| for attempt := 0; ; attempt++ { | ||
| for ; ; attempt++ { |
There was a problem hiding this comment.
The retry logic here handles multiple error types (isReadTSCompactedError and NOSCRIPT) with nested calls. To improve maintainability and ensure consistent telemetry (like the attempts count), consider centralizing this complex retry/backoff logic into a shared helper function as per repository guidelines.
References
- Centralize common complex logic, such as retry/backoff mechanisms, into shared helper functions to avoid code duplication and improve maintainability.
| } | ||
| warnArgs := []any{"cmd", cmd, "err", sErr, "elapsed", elapsed, "attempts", attempts} | ||
| if (cmd == "EVALSHA" || cmd == "EVALSHA_RO") && len(iArgs) > 1 { | ||
| if sha, ok := iArgs[1].([]byte); ok { |
There was a problem hiding this comment.
The type assertion iArgs[1].([]byte) assumes that the second argument of EVALSHA is always a byte slice. While the proxy typically handles [][]byte, if any part of the pipeline converts arguments to strings before reaching this point, the assertion will fail and the SHA will not be logged. Consider adding a fallback for string type or using fmt.Sprintf("%s", iArgs[1]) for safer logging.
| if sha, ok := iArgs[1].([]byte); ok { | |
| if sha, ok := iArgs[1].([]byte); ok { | |
| warnArgs = append(warnArgs, "sha", string(sha)) | |
| } else if sha, ok := iArgs[1].(string); ok { | |
| warnArgs = append(warnArgs, "sha", sha) | |
| } |
adapter/redis_lua.go:
- add sha to "lua script execution failed" warn log so failing scripts
can be identified without cross-referencing EVALSHA args
proxy/dualwrite.go (recordSecondaryWriteFailure):
- add string fallback in SHA type assertion (iArgs[1] may be string
or []byte depending on caller path)
grafana/elastickv-redis-summary.json:
- add backend="elastickv" filter to "Secondary Write Error Rate by
Command" and "EVALSHA Secondary Latency" panels so primary backend
metrics are not mixed in with secondary diagnostics
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive monitoring and diagnostics for Lua script execution and secondary write operations. Key changes include the addition of a LuaScriptObserver to track execution and commit durations, conflict retries, and errors. These metrics are exposed via Prometheus and visualized in new Grafana dashboard panels. Additionally, the PR improves logging for write conflicts and secondary write failures, providing better visibility into system performance and potential bottlenecks. I have no feedback to provide.
Summary by CodeRabbit
New Features
Bug Fixes