fix(replication): add first_failed_key to audit-failure log #132
Merged
The audit-failure ERROR line added in #129 already includes the reason and key counts (confirmed/challenged/absent/digest-mismatch), but it identifies the *target peer* only via the message prefix and gives no way to identify the *specific chunk* that was missing. The peer is a 64-hex string that can be matched in ES via a runtime regex, but a sample key prefix in the structured message is much cheaper to group by and query.

Add a `first_failed_key=0x<16-hex>` field that takes the first 8 bytes of the first confirmed-failed key (XorName) and formats them as `0x` + 16 hex chars. This is short enough to keep log volume low on the hot path but long enough to disambiguate distinct chunks in the same close group.

The formatting lives in a small `first_failed_key_label` helper with three unit tests covering truncation, the empty-fallback case, and first-key selection.

No behaviour change beyond the log line.

Refs: production investigation of ant-prod-01 (2026-06-07)
Pull request overview
Adds a compact first_failed_key=0x<16-hex> label to the replication audit-failure ERROR log line to enable easier cross-host correlation and grouping by the (truncated) chunk key prefix.
Changes:
- Introduces `first_failed_key_label()` to hex-encode the first 8 bytes of the first confirmed-failed key (or `"0x"` when empty).
- Extends the audit-failure `error!` log line in `handle_audit_result` to include `first_failed_key=...`.
- Adds unit tests covering truncation behavior, empty input fallback, and "first key only" behavior.
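A minimal sketch of what such a helper could look like, for readers without the diff open. It assumes the key type is a 32-byte array (the `XorName` alias below is a stand-in, not the crate's real type) and that the helper takes a slice of keys; the merged implementation may differ in signature and detail.

```rust
/// Stand-in for the real key type; assumed here to be a 32-byte XorName.
type XorName = [u8; 32];

/// Formats the first 8 bytes of the first key as `0x` + 16 hex chars.
/// Returns a bare `0x` when the slice is empty, so the log line never
/// carries a misleading default key.
fn first_failed_key_label(confirmed_failed_keys: &[XorName]) -> String {
    let mut label = String::from("0x");
    if let Some(key) = confirmed_failed_keys.first() {
        // Only the leading 8 bytes are encoded; the remaining 24 are dropped.
        for byte in &key[..8] {
            label.push_str(&format!("{:02x}", byte));
        }
    }
    label
}
```

Under these assumptions, a key whose first eight bytes are `18 87 8f 1d 2d 9e 06 12` yields `0x18878f1d2d9e0612` regardless of the remaining bytes or of any later keys in the slice.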
Comments suppressed due to low confidence (1)
src/replication/mod.rs:2601
`first_failed_key` is computed unconditionally before the `error!` call. When the `logging` feature is disabled, the logging macros compile to no-ops and do not evaluate their arguments (see `src/logging.rs`), but this precomputed `String` will still allocate/hex-encode on every audit failure. Inline the `first_failed_key_label(...)` call in the `error!` invocation so it's skipped when logging is compiled out.
let first_failed_key = first_failed_key_label(confirmed_failed_keys);
error!(
"Audit failure for {challenged_peer}: reason={reason:?}, confirmed_failed_keys={}, challenged_keys={}, absent_keys={}, digest_mismatch_keys={}, first_failed_key={first_failed_key}",
confirmed_failed_keys.len(),
summary.challenged_keys,
summary.absent_keys,
summary.digest_mismatch_keys,
);
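The difference the review comment describes can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `error_disabled!` models a logging macro that has been compiled to a no-op, and `label_stub` stands in for `first_failed_key_label` with a call counter added so the effect is observable. It is not the crate's actual macro from `src/logging.rs`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how often the stand-in label helper actually runs.
static LABEL_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `first_failed_key_label`; only the call count matters.
fn label_stub() -> String {
    LABEL_CALLS.fetch_add(1, Ordering::SeqCst);
    String::from("0x0000000000000000")
}

/// Hypothetical no-op logging macro, modelling logging compiled out:
/// the tokens are discarded, so argument expressions are never evaluated.
macro_rules! error_disabled {
    ($($arg:tt)*) => {{}};
}

/// Precomputed variant: the helper runs even though nothing is logged.
#[allow(unused_variables)]
fn precomputed() {
    let first_failed_key = label_stub();
    error_disabled!("first_failed_key={}", first_failed_key);
}

/// Inlined variant: the call sits inside the macro arguments and vanishes with them.
fn inlined() {
    error_disabled!("first_failed_key={}", label_stub());
}
```

Calling `precomputed()` increments the counter while `inlined()` leaves it unchanged, which is the allocation the suggestion aims to avoid when logging is disabled.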
Comment on lines +2717 to +2719
// Should never happen in production (handle_audit_failure rejects
// empty sets), but the formatter must still produce a valid label
// so the log line doesn't contain a misleading default.
jacderida approved these changes on Jun 7, 2026
Summary
Adds a `first_failed_key=0x<16-hex>` field to the audit-failure ERROR line in `replication::handle_audit_result`. This is a follow-up to #129, which already gave us structured counts but no way to identify the specific chunk that was missing.

The line currently looks like:
After this PR:
`first_failed_key` is the high-order 8 bytes of the first `confirmed_failed_keys` entry, hex-encoded with a `0x` prefix. 16 hex chars is short enough to keep the log volume low on the hot path but long enough to disambiguate distinct chunks inside the same close group.

Why
- The line's only identifier is the `Audit failure for <hex>` prefix, which comes from the target peer (that is the peer's XorName, not the chunk XorName). Adding `first_failed_key` lets us group audit failures by chunk and by peer without a runtime regex.
- `first_failed_key=0x18878f1d2d9e0612` + `challenged_peer=<peer-hex>` forms a stable `(chunk-prefix, target-peer)` pair that an operator can paste into a chat thread and anyone can grep for.

Test plan
Three new unit tests in `replication::tests` cover the helper:
- `first_failed_key_label_truncates_to_16_hex_chars`: sets the first 8 bytes to a known pattern and the rest to `0xAA`, then asserts the label is exactly the 16 hex chars of the first 8 bytes and the lower bytes are dropped.
- `first_failed_key_label_falls_back_when_empty`: empty input produces `"0x"` (no misleading default).
- `first_failed_key_label_uses_first_key_only`: only the first entry of the list is used.

Verified locally:
Sample deployed log (before)
Sample deployed log (after, in 0.13.0)
Out of scope (intentionally)
This PR does not address the underlying issue: a peer that passes responsibility confirmation but has zero of its assigned keys. That's a separate concern: the responsibility check in `handle_audit_failure` should only return `confirmed_failures` after a grace window has elapsed since the peer joined the close group, otherwise a fresh joiner will fail every audit until its first sync completes. That fix needs its own PR; the new log line just makes the issue visible enough to measure and triage.
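For illustration only, the grace-window idea could be sketched roughly as below. This is not part of this PR and not the crate's code: the function name, the `time_in_close_group` parameter, and the 32-byte `XorName` alias are all assumptions, and a real fix would need to decide where join time is tracked.

```rust
use std::time::Duration;

/// Stand-in for the real key type; assumed here to be a 32-byte XorName.
type XorName = [u8; 32];

/// Hypothetical gate: report confirmed failures only once the peer has been
/// in the close group for at least `grace`.
fn failures_after_grace(
    time_in_close_group: Duration,
    grace: Duration,
    confirmed_failures: Vec<XorName>,
) -> Vec<XorName> {
    if time_in_close_group < grace {
        // Fresh joiner: its first sync may not have completed, so report nothing.
        Vec::new()
    } else {
        confirmed_failures
    }
}
```

With a grace window of ten minutes, a peer that joined thirty seconds ago would report no confirmed failures under this sketch, while one that joined fifteen minutes ago would report them unchanged.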