Skip to content

enhance redis tracing #531

Merged
bootjp merged 5 commits intomainfrom
feat/trace-redis
Apr 19, 2026
Merged

enhance redis tracing #531
bootjp merged 5 commits intomainfrom
feat/trace-redis

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 19, 2026

Summary by CodeRabbit

  • New Features

    • Added secondary write diagnostics to Grafana dashboard with panels for failed proxy command rates and EVALSHA latency metrics.
  • Bug Fixes

    • Enhanced execution telemetry and logging for Lua script operations, now capturing retry attempts and elapsed time for improved error diagnostics.

bootjp added 2 commits April 19, 2026 16:38
- proxy/dualwrite.go: add elapsed, attempts, sha, noscript_fallback fields
  to the "secondary write failed" warn log for EVALSHA commands; extract
  failure reporting to recordSecondaryWriteFailure to keep cyclomatic
  complexity within lint limits
- proxy/command.go: define cmdNameAUTH / cmdNameSELECT constants
- proxy/proxy.go, proxy/pubsub.go: use the new constants (goconst fix)
- adapter/redis_retry.go: add debug log on each write-conflict retry
  (attempt number, backoff, error kind)
- adapter/redis_lua.go: add warn log on Lua script failure with elapsed
  time and total attempt count from retryRedisWrite
…oard

Add a new row with two panels to elastickv-redis-summary.json:
- Secondary Write Error Rate by Command: rate of proxy_command_total
  errors on the secondary backend, broken down by command name
- EVALSHA Secondary Latency (p50/p95/p99): duration histogram for
  EVALSHA/EVALSHA_RO on the secondary, using proxy_command_duration_seconds
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 19, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 28 minutes and 15 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 28 minutes and 15 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ef6b6ce-3d87-4f78-84a1-26fc62088e49

📥 Commits

Reviewing files that changed from the base of the PR and between decd7a8 and df10380.

📒 Files selected for processing (7)
  • adapter/redis.go
  • adapter/redis_lua.go
  • main.go
  • monitoring/grafana/dashboards/elastickv-redis-summary.json
  • monitoring/lua.go
  • monitoring/registry.go
  • proxy/dualwrite.go
📝 Walkthrough

Walkthrough

This PR enhances observability for secondary write operations in the dual-write proxy by introducing instrumentation (timing, retry counts, noscript fallback detection) in Lua script execution paths, centralizing secondary write failure logging via a new helper method, adding debug retry telemetry, and expanding the Grafana dashboard with secondary write diagnostics. Command-name constants are also standardized across proxy components.

Changes

Cohort / File(s) Summary
Lua & Retry Instrumentation
adapter/redis_lua.go, adapter/redis_retry.go
Added execution telemetry (elapsed time, retry attempt counting) and structured logging at failure points for Lua script execution and retry operations.
Secondary Write Monitoring
proxy/dualwrite.go
Refactored failure handling via new recordSecondaryWriteFailure helper method that centralizes metrics increments, Sentry reporting, and enriched logging with elapsed time, attempt counts, and NOSCRIPT fallback flags.
Command-Name Constants
proxy/command.go, proxy/proxy.go, proxy/pubsub.go
Introduced unexported constants cmdNameAUTH and cmdNameSELECT and replaced hard-coded string literals with these constants in admin and pubsub command dispatch paths.
Dashboard Configuration
monitoring/grafana/dashboards/elastickv-redis-summary.json
Added "Secondary Write Diagnostics" section with two panels: failed proxy command rate by command type and EVALSHA/EVALSHA_RO tail latency quantiles on secondary backend.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 Hops through the logs with glee,
Metrics dancing, times we see,
Retries counted, fallbacks bright,
Secondary writes shining in the light!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'enhance redis tracing' is partially related to the changeset. While the PR adds logging and monitoring for Redis operations, the term 'tracing' is broader and more specific than what is actually implemented. The changes focus on structured logging, retry diagnostics, and Grafana dashboard enhancements rather than comprehensive distributed tracing. Consider a more precise title such as 'Add Redis operation logging and retry diagnostics' to better reflect the actual implementation of structured logs and monitoring enhancements rather than generic 'tracing'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/trace-redis

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances diagnostics for secondary writes and Lua script executions by adding structured logging for failures, including elapsed time and retry attempts. It also introduces a 'Secondary Write Diagnostics' section to the Grafana dashboard. Feedback suggests including the SHA-1 digest in Lua failure logs to better identify failing scripts and refining Grafana queries to specifically isolate secondary backend metrics, as current expressions aggregate data across all backends.

Comment thread adapter/redis_lua.go
Comment on lines +155 to +156
slog.Default().Warn("lua script execution failed",
"elapsed", elapsed, "attempts", attempts, "err", err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The log message for Lua script execution failure is missing the script's SHA-1 digest. Including the SHA is essential for identifying which specific script is failing in an environment with multiple Lua scripts. Since the SHA is already calculated or looked up in the callers (eval/evalsha), it should be included in the structured log here.

Suggested change
slog.Default().Warn("lua script execution failed",
"elapsed", elapsed, "attempts", attempts, "err", err)
slog.Default().Warn("lua script execution failed",
"sha", luaScriptSHA(script),
"elapsed", elapsed, "attempts", attempts, "err", err)

{
"datasource": "$datasource",
"editorMode": "code",
"expr": "sum by (command) (rate(proxy_command_total{job=\"elastickv\", status=\"error\"}[5m]))",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This Prometheus expression aggregates proxy_command_total across all backends. Since the DualWriter records this metric for both primary and secondary backends (see proxy/dualwrite.go), this panel will incorrectly include errors from the primary backend in a section titled 'Secondary Write Diagnostics'. Consider filtering by the backend label to isolate secondary failures.

Comment on lines +668 to +688
"expr": "histogram_quantile(0.50, sum by (le) (rate(proxy_command_duration_seconds_bucket{job=\"elastickv\", command=~\"EVALSHA|EVALSHA_RO\"}[5m])))",
"legendFormat": "p50",
"range": true,
"refId": "A"
},
{
"datasource": "$datasource",
"editorMode": "code",
"expr": "histogram_quantile(0.95, sum by (le) (rate(proxy_command_duration_seconds_bucket{job=\"elastickv\", command=~\"EVALSHA|EVALSHA_RO\"}[5m])))",
"legendFormat": "p95",
"range": true,
"refId": "B"
},
{
"datasource": "$datasource",
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum by (le) (rate(proxy_command_duration_seconds_bucket{job=\"elastickv\", command=~\"EVALSHA|EVALSHA_RO\"}[5m])))",
"legendFormat": "p99",
"range": true,
"refId": "C"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The latency queries for EVALSHA/EVALSHA_RO aggregate data from all backends. This mixes primary and secondary latency distributions, which makes it difficult to isolate performance issues specifically related to the secondary backend (such as Raft round-trip latency or write conflicts). Filtering by the secondary backend label is recommended to ensure the 'Secondary Write Diagnostics' section remains accurate.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
proxy/proxy.go (1)

336-359: LGTM — consider extending the constant usage to remaining hard-coded sites.

The SELECT/AUTH comparisons now use the shared constants. Two related sites still use string literals and could be updated in a follow-up for consistency:

  • proxy/proxy.go line 363: if name == "HELLO" (consider a cmdNameHELLO constant if more HELLO handling is anticipated).
  • proxy/leader_aware_backend_test.go lines 117-121: case "CLIENT", "AUTH", "SELECT", "PING": — could reference cmdNameAUTH/cmdNameSELECT/cmdPing for consistency (test file, low priority).

Not blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/proxy.go` around lines 336 - 359, Replace remaining hard-coded command
literals with shared constants: add a new constant cmdNameHELLO = "HELLO"
alongside the existing command constants and change the comparison in the if
that currently uses "HELLO" in proxy/proxy.go to use cmdNameHELLO; likewise
update the test case switch in leader_aware_backend_test.go to reference
cmdNameAUTH, cmdNameSELECT, and cmdPing instead of the literal strings "AUTH",
"SELECT", "PING" (import or reference the existing constants from the same
package), ensuring names match the existing cmdName* naming convention.
proxy/command.go (1)

18-21: LGTM — constants are correctly defined.

The cmdNameAUTH / cmdNameSELECT constants mirror the existing cmdSubscribe, cmdPing, etc. block in proxy/pubsub.go (lines 25-33). Consider consolidating all of these command-name constants into a single block in proxy/command.go in a follow-up so there's one canonical location — not required for this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/command.go` around lines 18 - 21, Constants for command names are split
across files; consolidate them into a single canonical block in proxy/command.go
by moving the other command-name constants (e.g., cmdSubscribe, cmdPing, and any
similar cmd* symbols found in proxy/pubsub.go) into the same const block
alongside cmdNameAUTH and cmdNameSELECT, update any references to those
constants (no renaming) so they import from the same package scope, and remove
the duplicate const block from proxy/pubsub.go to avoid duplication.
adapter/redis_retry.go (1)

100-101: Debug log message is misleading for txn-locked retries.

retryRedisWrite handles both store.ErrWriteConflict and kv.ErrTxnLocked, but the log message hardcodes "write conflict". Consider using the error text or including the policy name so operators can distinguish the two retry paths in telemetry.

Proposed tweak
-		slog.Default().Debug("retryRedisWrite: retrying on write conflict",
-			"attempt", attempt+1, "backoff", backoff, "err", err)
+		slog.Default().Debug("retryRedisWrite: retrying on retryable txn error",
+			"attempt", attempt+1, "backoff", backoff, "err", err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/redis_retry.go` around lines 100 - 101, The debug log in
retryRedisWrite misleadingly hardcodes "write conflict" though the function can
retry for both store.ErrWriteConflict and kv.ErrTxnLocked; update the
slog.Default().Debug call in retryRedisWrite to include the actual error
(err.Error() or err) and/or the retry policy name so telemetry shows which path
triggered (e.g., include fields "error", err, and "policy", policyName) rather
than a fixed "write conflict" message.
proxy/dualwrite.go (1)

320-337: Prefer the existing command-name constants over string literals.

Elsewhere in the package EVALSHA/EVALSHA_RO are referenced through constants (e.g. cmdEvalSHA, cmdEvalSHARO in proxy/script_cache.go). Using the same constants here keeps the surface consistent and guards against typo drift.

Proposed tweak
-	if (cmd == "EVALSHA" || cmd == "EVALSHA_RO") && len(iArgs) > 1 {
+	if (cmd == cmdEvalSHA || cmd == cmdEvalSHARO) && len(iArgs) > 1 {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/dualwrite.go` around lines 320 - 337, In recordSecondaryWriteFailure
(method DualWriter.recordSecondaryWriteFailure) replace the string-literal
checks for "EVALSHA" and "EVALSHA_RO" with the package constants used elsewhere
(e.g., cmdEvalSHA and cmdEvalSHARO from proxy/script_cache.go) so the
conditional becomes if (cmd == cmdEvalSHA || cmd == cmdEvalSHARO) && len(iArgs)
> 1 { ... }; this keeps command names consistent across the package and avoids
typo drift—no other behavior changes required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@monitoring/grafana/dashboards/elastickv-redis-summary.json`:
- Around line 628-691: The EVALSHA Secondary Latency panel is querying
proxy_command_duration_seconds_bucket without restricting the backend label, so
p50/p95/p99 are mixed with primary latencies; update the three target
expressions (the expr strings for refId "A", "B", "C" in the "EVALSHA Secondary
Latency (p50 / p95 / p99)" panel) to include a backend matcher like
backend="secondary" (or use your templated backend variable, e.g.
backend=~"$backend") in the metric selector
proxy_command_duration_seconds_bucket{job="elastickv",
command=~"EVALSHA|EVALSHA_RO", backend="secondary"} so the histogram_quantile is
computed only for secondary backends.
- Around line 579-627: The panel "Secondary Write Error Rate by Command" uses
the metric proxy_command_total without filtering the backend, so it currently
includes primary errors; update the PromQL in the panel's targets (expr for
refId "A") to include a backend label filter matching the secondary backend
(e.g. add backend="<secondary-name>"), using the actual value emitted by
d.secondary.Name() or expose that name as a dashboard variable and reference it,
ensuring proxy_command_total{job="elastickv", backend="<secondary>",
status="error"} is used to scope the query to the secondary backend.

---

Nitpick comments:
In `@adapter/redis_retry.go`:
- Around line 100-101: The debug log in retryRedisWrite misleadingly hardcodes
"write conflict" though the function can retry for both store.ErrWriteConflict
and kv.ErrTxnLocked; update the slog.Default().Debug call in retryRedisWrite to
include the actual error (err.Error() or err) and/or the retry policy name so
telemetry shows which path triggered (e.g., include fields "error", err, and
"policy", policyName) rather than a fixed "write conflict" message.

In `@proxy/command.go`:
- Around line 18-21: Constants for command names are split across files;
consolidate them into a single canonical block in proxy/command.go by moving the
other command-name constants (e.g., cmdSubscribe, cmdPing, and any similar cmd*
symbols found in proxy/pubsub.go) into the same const block alongside
cmdNameAUTH and cmdNameSELECT, update any references to those constants (no
renaming) so they import from the same package scope, and remove the duplicate
const block from proxy/pubsub.go to avoid duplication.

In `@proxy/dualwrite.go`:
- Around line 320-337: In recordSecondaryWriteFailure (method
DualWriter.recordSecondaryWriteFailure) replace the string-literal checks for
"EVALSHA" and "EVALSHA_RO" with the package constants used elsewhere (e.g.,
cmdEvalSHA and cmdEvalSHARO from proxy/script_cache.go) so the conditional
becomes if (cmd == cmdEvalSHA || cmd == cmdEvalSHARO) && len(iArgs) > 1 { ... };
this keeps command names consistent across the package and avoids typo drift—no
other behavior changes required.

In `@proxy/proxy.go`:
- Around line 336-359: Replace remaining hard-coded command literals with shared
constants: add a new constant cmdNameHELLO = "HELLO" alongside the existing
command constants and change the comparison in the if that currently uses
"HELLO" in proxy/proxy.go to use cmdNameHELLO; likewise update the test case
switch in leader_aware_backend_test.go to reference cmdNameAUTH, cmdNameSELECT,
and cmdPing instead of the literal strings "AUTH", "SELECT", "PING" (import or
reference the existing constants from the same package), ensuring names match
the existing cmdName* naming convention.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d073945-9d33-41f6-90e7-3a8c8ed3eeef

📥 Commits

Reviewing files that changed from the base of the PR and between 9a0a973 and decd7a8.

📒 Files selected for processing (7)
  • adapter/redis_lua.go
  • adapter/redis_retry.go
  • monitoring/grafana/dashboards/elastickv-redis-summary.json
  • proxy/command.go
  • proxy/dualwrite.go
  • proxy/proxy.go
  • proxy/pubsub.go

Comment thread monitoring/grafana/dashboards/elastickv-redis-summary.json
Comment thread monitoring/grafana/dashboards/elastickv-redis-summary.json
Add per-phase Prometheus metrics to EVAL/EVALSHA execution so dashboards
can distinguish Lua VM slowness, Raft commit latency, and write-conflict
retries as separate root causes.

monitoring/lua.go (new):
  - LuaScriptReport / LuaScriptObserver interface
  - LuaMetrics with three histograms:
    elastickv_lua_exec_duration_seconds{outcome}   — pure Lua VM (PCall)
    elastickv_lua_commit_duration_seconds{outcome} — Raft + storage dispatch
    elastickv_lua_conflict_retries                 — retry count distribution

monitoring/registry.go:
  - instantiate LuaMetrics in NewRegistry
  - expose LuaObserver() accessor

adapter/redis.go:
  - add luaObserver field and WithLuaObserver() option

adapter/redis_lua.go:
  - time PCall and commit() separately inside retryRedisWrite closure
  - call luaObserver.ObserveLuaScript after each invocation

main.go:
  - wire WithLuaObserver(metricsRegistry.LuaObserver())

grafana/elastickv-redis-summary.json:
  - new "Lua Script Internals" row with three panels:
    Lua VM Execution Time (p50/p95/p99)
    Raft Commit Time (p50/p95/p99)
    Write-Conflict Retries per Script (p50/p95/p99)
@bootjp bootjp changed the title Feat/trace redis enhance redis tracing Apr 19, 2026
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 19, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces detailed monitoring for Lua script execution and secondary backend writes, including new Prometheus metrics and Grafana dashboard panels for execution timing and retry counts. It also refactors command handling to use constants and improves error logging in the dual-write proxy. Feedback includes a suggestion to centralize complex retry logic for better maintainability and a recommendation to use safer type assertions when logging script identifiers to prevent potential runtime issues.

Comment thread proxy/dualwrite.go
var usedNOSCRIPTFallback bool
args := iArgs
for attempt := 0; ; attempt++ {
for ; ; attempt++ {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The retry logic here handles multiple error types (isReadTSCompactedError and NOSCRIPT) with nested calls. To improve maintainability and ensure consistent telemetry (like the attempts count), consider centralizing this complex retry/backoff logic into a shared helper function as per repository guidelines.

References
  1. Centralize common complex logic, such as retry/backoff mechanisms, into shared helper functions to avoid code duplication and improve maintainability.

Comment thread proxy/dualwrite.go Outdated
}
warnArgs := []any{"cmd", cmd, "err", sErr, "elapsed", elapsed, "attempts", attempts}
if (cmd == "EVALSHA" || cmd == "EVALSHA_RO") && len(iArgs) > 1 {
if sha, ok := iArgs[1].([]byte); ok {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The type assertion iArgs[1].([]byte) assumes that the second argument of EVALSHA is always a byte slice. While the proxy typically handles [][]byte, if any part of the pipeline converts arguments to strings before reaching this point, the assertion will fail and the SHA will not be logged. Consider adding a fallback for string type or using fmt.Sprintf("%s", iArgs[1]) for safer logging.

Suggested change
if sha, ok := iArgs[1].([]byte); ok {
if sha, ok := iArgs[1].([]byte); ok {
warnArgs = append(warnArgs, "sha", string(sha))
} else if sha, ok := iArgs[1].(string); ok {
warnArgs = append(warnArgs, "sha", sha)
}

adapter/redis_lua.go:
  - add sha to "lua script execution failed" warn log so failing scripts
    can be identified without cross-referencing EVALSHA args

proxy/dualwrite.go (recordSecondaryWriteFailure):
  - add string fallback in SHA type assertion (iArgs[1] may be string
    or []byte depending on caller path)

grafana/elastickv-redis-summary.json:
  - add backend="elastickv" filter to "Secondary Write Error Rate by
    Command" and "EVALSHA Secondary Latency" panels so primary backend
    metrics are not mixed in with secondary diagnostics
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 19, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive monitoring and diagnostics for Lua script execution and secondary write operations. Key changes include the addition of a LuaScriptObserver to track execution and commit durations, conflict retries, and errors. These metrics are exposed via Prometheus and visualized in new Grafana dashboard panels. Additionally, the PR improves logging for write conflicts and secondary write failures, providing better visibility into system performance and potential bottlenecks. I have no feedback to provide.

@bootjp bootjp merged commit f09c4d5 into main Apr 19, 2026
9 checks passed
@bootjp bootjp deleted the feat/trace-redis branch April 19, 2026 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant