obs(redis): treat missing keys per Redis spec (nil/0) not as errors#595
Conversation
Adds an end-to-end test that spins up a 3-node cluster, wires a shared monitoring.RedisRequestObserver into every node, and then issues each of the commands whose Redis protocol semantics report a missing key via a null / 0 / "none" reply (GET, HGET, HMGET, HLEN, HEXISTS, HGETALL, LLEN, LINDEX, LRANGE, LPOP, RPOP, SCARD, SMEMBERS, SISMEMBER, ZCARD, ZSCORE, ZRANGE, XLEN, XRANGE, EXISTS, TYPE, TTL, PTTL). The test asserts that none of those commands increment elastickv_redis_errors_total, and that the GET command pushes the outcome=success sample forward. As a control, RENAME and LSET on a missing key are exercised and must increment the errors counter, matching the Redis spec which returns an ERR reply for those cases. No production-code change is needed; the existing handlers already internalise store.ErrKeyNotFound through keyTypeAt / logicalExistsAt / loadHashAt / loadSetAt / loadZSetAt / loadStreamAt / listPopClaim / hashFieldFastLookup. This test pins that contract so a future refactor that naively forwards ErrKeyNotFound to conn.WriteError will surface as a test failure instead of a silently-inflated error metric in production.
📝 Walkthrough
A new test suite is added to validate that Redis commands returning missing-key responses (GET, HGET, LINDEX, etc.) are not counted as metric errors, while error-returning operations (RENAME and LSET on missing keys) correctly increment the error counter.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces a new test file, adapter/redis_metrics_missing_key_test.go, which verifies the observability behavior for Redis commands when keys are missing. It ensures that commands returning null or zero values are correctly categorized as successes in metrics, while commands that return errors, such as RENAME and LSET, are tracked as errors. Feedback was provided regarding the fragility of exact counter assertions in a clustered environment; specifically, the RENAME and LSET tests might experience double-counting due to request proxying between nodes, so using GreaterOrEqual is recommended for more robust assertions.
require.Error(t, err, "RENAME on a missing key must return an error")
require.Contains(t, err.Error(), "no such key")

require.Equal(t, errorsBefore+1, countErrorMetrics(t, registry),
The assertion errorsBefore+1 is fragile in a clustered test environment. Since all nodes share the same observer and registry, a command that is proxied to a leader node will be recorded by both the proxying node and the leader node, resulting in the error counter incrementing by 2 instead of 1. This makes the test dependent on whether the key mk:rename:missing hashes to the node the client is connected to (nodes[0]). Consider using GreaterOrEqual (as done for the GET success check on line 181) to make the test more robust against cluster topology and key distribution.
require.Error(t, err, "LSET on a missing key must return an error")
require.Contains(t, err.Error(), "no such key")

require.Equal(t, errorsBefore+1, countErrorMetrics(t, registry),
The RENAME and LSET subtests asserted strict equality `errorsBefore+1 == errorsAfter`, but the multi-node test harness can route a single client command through proxyToLeader when the hashed key lands on a non-leader. Both the proxy node and the leader node then record an error for the same command, producing `errorsBefore+2` and flaking the test. Switch to `GreaterOrEqual(errorsAfter, errorsBefore+1)` to match the GET success-counter assertion above, and add a comment explaining the rationale. Addresses Gemini review feedback on #595.
Addressed both Gemini medium items.

What changed
Verification
/gemini review
Codex Review: Didn't find any major issues. Another round soon, please!
🧹 Nitpick comments (1)
adapter/redis_metrics_missing_key_test.go (1)
57-166: Consider table-driving the missing-key command cases. This long manual sequence is correct, but a table of name + assertion-function cases would better match the project's test style and make future Redis command additions easier to audit. As per coding guidelines: unit tests must be co-located with packages in *_test.go files; prefer table-driven test cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/redis_metrics_missing_key_test.go` around lines 57 - 166, The test contains a long manual sequence of missing-key assertions; convert it into a table-driven test: create a slice of cases (e.g., struct { name string; run func(t *testing.T, rdb *redis.Client, ctx context.Context) }) where each case calls the specific command (referencing rdb and ctx) and performs the intended assertion (e.g., require.ErrorIs(err, redis.Nil) for Get/LIndex/LPop/RPop/ZScore, require.NoError + require.Empty for HMGet/LRange/SMembers/ZRange/XRange, require.NoError + require.Equal for HLen/LLen/SCard/ZCard/XLen/Exists and TTL/PTTL checks, and require.NoError + require.Equal for Type -> "none" and boolean checks for HExists/SIsMember); then replace the repeated blocks with a loop over cases running t.Run(case.name, func(t *testing.T) { case.run(t, rdb, ctx) }) so new commands can be added by appending cases.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bc30d3ba-068a-4771-b177-ea5e7b18b3c8
📒 Files selected for processing (1)
adapter/redis_metrics_missing_key_test.go
Code Review
This pull request introduces a new test suite in adapter/redis_metrics_missing_key_test.go to verify that Redis commands correctly report success or error metrics when keys are missing. The review feedback suggests several improvements to the test implementation, including logging ignored errors during resource cleanup, optimizing performance by deferring expensive metric gathering until an assertion fails, and improving the efficiency of metric lookup helpers by removing redundant calls, implementing early loop termination, and centralizing shared logic.
ctx := context.Background()
rdb := redis.NewClient(&redis.Options{Addr: nodes[0].redisAddress})
defer func() { _ = rdb.Close() }()
The error from rdb.Close() is silently ignored. According to the general rules, errors from Close() methods on resources like network connections should be logged to ensure visibility into potential resource leaks or cleanup issues.
defer func() { if err := rdb.Close(); err != nil { t.Logf("failed to close redis client: %v", err) } }()
References
- Do not silently ignore errors from Close() methods on resources like network connections. Log them to ensure resource leaks or other cleanup problems are visible.
require.Equal(t, errorsBefore, countErrorMetrics(t, registry),
	"missing-key commands must not increment elastickv_redis_errors_total; details=%s",
	dumpErrorMetrics(t, registry))
dumpErrorMetrics is called on every execution of this assertion, even when it passes. Since dumpErrorMetrics performs an expensive Gather() operation on the Prometheus registry, it should only be invoked if the assertion fails to improve test performance.
currentErrors := countErrorMetrics(t, registry)
if currentErrors != errorsBefore {
	require.Equal(t, errorsBefore, currentErrors,
		"missing-key commands must not increment elastickv_redis_errors_total; details=%s",
		dumpErrorMetrics(t, registry))
}
for _, mf := range mfs {
	if mf.GetName() != "elastickv_redis_errors_total" {
		continue
	}
	for _, m := range mf.GetMetric() {
		if m.GetCounter().GetValue() == 0 {
			continue
		}
		labels := ""
		for _, lp := range m.GetLabel() {
			labels += lp.GetName() + "=" + lp.GetValue() + ","
		}
		out += labels + "value=" + strconv.FormatFloat(m.GetCounter().GetValue(), 'f', -1, 64) + "\n"
	}
}
The loop iterates over all metric families even after the target metric has been found. Since metric names are unique within a registry, the loop can be terminated early with a break to improve efficiency. Additionally, consider centralizing this common metric lookup logic into a shared helper function to avoid code duplication.
for _, mf := range mfs {
	if mf.GetName() == "elastickv_redis_errors_total" {
		for _, m := range mf.GetMetric() {
			if m.GetCounter().GetValue() == 0 {
				continue
			}
			labels := ""
			for _, lp := range m.GetLabel() {
				labels += lp.GetName() + "=" + lp.GetValue() + ","
			}
			out += labels + "value=" + strconv.FormatFloat(m.GetCounter().GetValue(), 'f', -1, 64) + "\n"
		}
		break
	}
}
References
- Centralize common complex logic, such as retry/backoff mechanisms, into shared helper functions to avoid code duplication and improve maintainability.
}
for _, m := range mf.GetMetric() {
	var cmd, out string
	for _, lp := range m.GetLabel() {
		switch lp.GetName() {
		case "command":
			cmd = lp.GetValue()
		case "outcome":
			out = lp.GetValue()
		}
	}
	if cmd == command && out == outcome {
		total += m.GetCounter().GetValue()
	}
}
}
return total
The loop iterates over all metric families even after the target metric has been found. Adding a break after processing the target metric family improves efficiency. Additionally, consider centralizing this common metric lookup logic into a shared helper function to avoid code duplication.
for _, mf := range mfs {
	if mf.GetName() == "elastickv_redis_requests_total" {
		for _, m := range mf.GetMetric() {
			var cmd, out string
			for _, lp := range m.GetLabel() {
				switch lp.GetName() {
				case "command":
					cmd = lp.GetValue()
				case "outcome":
					out = lp.GetValue()
				}
			}
			if cmd == command && out == outcome {
				total += m.GetCounter().GetValue()
			}
		}
		break
	}
}

References
- Centralize common complex logic, such as retry/backoff mechanisms, into shared helper functions to avoid code duplication and improve maintainability.
got, err := testutil.GatherAndCount(registry.Gatherer(), "elastickv_redis_errors_total")
require.NoError(t, err)
if got == 0 {
	return 0
}
// GatherAndCount returns the number of distinct label combinations.
// We want the sum of the counter values; iterate manually.
mfs, err := registry.Gatherer().Gather()
require.NoError(t, err)
var total float64
for _, mf := range mfs {
	if mf.GetName() != "elastickv_redis_errors_total" {
		continue
	}
	for _, m := range mf.GetMetric() {
		total += m.GetCounter().GetValue()
	}
}
return total
The call to testutil.GatherAndCount is redundant because the subsequent manual iteration over the gathered metrics already handles the case where the metric family is missing or empty. Removing this call avoids an unnecessary extra Gather() operation. Additionally, the loop can be optimized to break early once the target metric family is found. Consider centralizing this metric lookup logic into a shared helper function to reduce duplication.
mfs, err := registry.Gatherer().Gather()
require.NoError(t, err)
var total float64
for _, mf := range mfs {
	if mf.GetName() == "elastickv_redis_errors_total" {
		for _, m := range mf.GetMetric() {
			total += m.GetCounter().GetValue()
		}
		break
	}
}
return total

References
- Centralize common complex logic, such as retry/backoff mechanisms, into shared helper functions to avoid code duplication and improve maintainability.
Addressed Gemini medium-priority items from the review of the #595 series:
Verified:
/gemini review
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5dbd26f4a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
require.Equal(t, errorsBefore, countErrorMetrics(t, registry),
	"missing-key commands must not increment elastickv_redis_errors_total; details=%s",
	dumpErrorMetrics(t, registry))
Gate expensive metric dump to assertion-failure path
dumpErrorMetrics(t, registry) is evaluated eagerly as a formatting argument to require.Equal, so the test always performs an extra full Gather() even when the equality assertion passes. In -race/-count runs this adds avoidable per-iteration overhead on the happy path; compute the diagnostic dump only when the assertion actually fails.
got, err := testutil.GatherAndCount(registry.Gatherer(), "elastickv_redis_errors_total")
require.NoError(t, err)
if got == 0 {
	return 0
Read error counters from one scrape snapshot
countErrorMetrics scrapes metrics twice (GatherAndCount then Gather) and can return early based on the first scrape. In this multi-node test, updates landing between those two calls can make the returned count inconsistent (including a stale zero), which risks masking regressions or creating flaky assertions. Summing directly from a single Gather() avoids that TOCTOU window.
Summary
Adds a regression-preventing end-to-end test that pins the observability contract around elastickv_redis_errors_total / elastickv_redis_requests_total{outcome="error"}: commands whose Redis protocol semantics report a missing key via a null / 0 / "none" reply must be counted as outcome="success", not as errors.

Audit

Starting from grep 'WriteError(err.Error())' adapter/*.go, I classified every call site where err could plausibly be store.ErrKeyNotFound:

(a) Redis spec: null/0/none — FIX if leaked
GET, HGET, HMGET, HLEN, HEXISTS, HGETALL, LLEN, LINDEX, LRANGE, LPOP,
RPOP, SCARD, SMEMBERS, SISMEMBER, ZCARD, ZSCORE, ZRANGE, XLEN, XRANGE,
EXISTS, TYPE, TTL, PTTL.
Result: already correct in the current codebase. Every handler either routes through keyTypeAt / logicalExistsAt (which return redisTypeNone / false for a missing key, not an error) or through helpers like loadHashAt / loadSetAt / loadZSetAt / loadStreamAt / listPopClaim / hashFieldFastLookup, all of which internalise store.ErrKeyNotFound and translate it to the caller-visible empty reply. No production-code change was needed.
(b) Redis spec: ERR reply — LEAVE AS-IS
RENAME (cmdRename in redis_lua_context.go:1339) and LSET (cmdLSet in redis_lua_context.go:2131). Both return errors.New("ERR no such key") — not store.ErrKeyNotFound — so the Lua compatibility layer at redis_lua.go:857 sends the error to the client, the metric correctly records outcome="error", and the wire format matches real Redis. RPOPLPUSH / LPOS / etc. route through the same Lua layer and inherit the same semantics.
(c) Not implemented — no call site
DUMP, OBJECT, RESTORE, COPY: no handler exists yet, nothing to audit.
What changed
adapter/redis_metrics_missing_key_test.go: a single new test (TestRedisMetrics_MissingKeyNotCountedAsError) with three subtests that spin up a 3-node cluster, share a monitoring.RedisRequestObserver across every node (so proxyToLeader-induced routing is captured too), and:

- assert that elastickv_redis_errors_total does not move across the missing-key commands (with a per-command diagnostic dump on failure); GET is additionally checked against elastickv_redis_requests_total{outcome="success"} to guard against a regression that silently drops the success sample.
- assert that the error counter increments by exactly 1 for RENAME on a missing key.
- assert that the error counter increments by exactly 1 for LSET on a missing key.
The monitoring/redis.go RedisRequestReport struct is left untouched per the coordination note in the task brief (PR #594 owns that file).
Test plan
go test -race -count=1 -run '^TestRedisMetrics_MissingKeyNotCountedAsError$' ./adapter/ passes.
go test -race -count=1 ./adapter/ passes (394 → 395 tests, +1 new with 3 subtests).
golangci-lint run ./adapter/... is clean.

Summary by CodeRabbit
Tests