
fix(replication): add first_failed_key to audit-failure log #132

Merged
jacderida merged 1 commit into main from fix/audit-failure-log-add-first-failed-key on Jun 7, 2026

Conversation

@dirvine dirvine commented Jun 7, 2026

Collaborator

Summary

Adds a first_failed_key=0x<16-hex> field to the audit-failure ERROR line in replication::handle_audit_result. This is a follow-up to #129, which already gave us structured counts but no way to identify the specific chunk that was missing.

The line currently looks like:

Audit failure for <peer>: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0

After this PR:

Audit failure for <peer>: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0, first_failed_key=0x18878f1d2d9e0612

first_failed_key is the high-order 8 bytes of the first confirmed_failed_keys entry, hex-encoded with a 0x prefix. 16 hex chars is short enough to keep the log volume low on the hot path but long enough to disambiguate distinct chunks inside the same close group.
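A minimal sketch of how such a label could be produced, for readers who want to reproduce the format. The `Key` alias and the slice-based signature are assumptions for illustration; the merged helper in src/replication/mod.rs may differ in types and implementation.

```rust
/// Hypothetical stand-in for a 32-byte XorName (assumption, not the real type).
type Key = [u8; 32];

/// Sketch of the label: "0x" followed by the first 8 bytes of the first key
/// as lowercase hex. An empty list yields just "0x".
fn first_failed_key_label(keys: &[Key]) -> String {
    let mut label = String::from("0x");
    if let Some(first) = keys.first() {
        // Only the high-order 8 bytes are kept; the remaining 24 are dropped.
        for byte in &first[..8] {
            label.push_str(&format!("{byte:02x}"));
        }
    }
    label
}

fn main() {
    // Known pattern in the first 8 bytes, 0xAA everywhere else.
    let mut key = [0xAA_u8; 32];
    key[..8].copy_from_slice(&[0x18, 0x87, 0x8f, 0x1d, 0x2d, 0x9e, 0x06, 0x12]);
    let other = [0xFF_u8; 32];

    // Only the first entry contributes to the label.
    println!("first_failed_key={}", first_failed_key_label(&[key, other]));
    // Empty input falls back to a bare prefix.
    println!("first_failed_key={}", first_failed_key_label(&[]));
}
```

Under these assumptions the first call prints first_failed_key=0x18878f1d2d9e0612 and the second prints first_failed_key=0x.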

Why

  • Top-failing-key dashboard: today the line identifies only the target peer, and only via the Audit failure for <hex> prefix, which has to be regex-extracted. That hex is the peer's XorName, not the chunk's, so there is no way to tell which chunk was missing. Adding first_failed_key lets us group audit failures by chunk and by peer without a runtime regex.
  • Reproducible cross-host correlation: first_failed_key=0x18878f1d2d9e0612 + challenged_peer=<peer-hex> forms a stable (chunk-prefix, target-peer) pair that an operator can paste into a chat thread and anyone can grep for.
  • No behaviour change: the only modification is the format string. Trust-event emission, responsibility confirmation, and bootstrap-claim handling are untouched.
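The grouping described above can be sketched with plain string splitting on the new line format. This is an illustration only: `chunk_peer_pair` and `count_by_pair` are hypothetical names, and a real dashboard would most likely do this in the log store rather than in Rust.

```rust
use std::collections::HashMap;

/// Pull (first_failed_key, peer) out of one audit-failure message.
/// Returns None for lines without the prefix or without the new field.
fn chunk_peer_pair(message: &str) -> Option<(String, String)> {
    let rest = message.trim().strip_prefix("Audit failure for ")?;
    let (peer, fields) = rest.split_once(": ")?;
    let key = fields
        .split(", ")
        .find_map(|field| field.strip_prefix("first_failed_key="))?;
    Some((key.to_string(), peer.to_string()))
}

/// Count audit failures per (chunk-prefix, target-peer) pair.
fn count_by_pair(messages: &[&str]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for pair in messages.iter().filter_map(|m| chunk_peer_pair(m)) {
        *counts.entry(pair).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let new_format = "Audit failure for 18878f1d2d9e061239fe77159f783cff10af2745bddcd8d05dc3ef9522f8528a: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0, first_failed_key=0x18878f1d2d9e0612";
    // Lines emitted before this PR carry no first_failed_key and are skipped.
    let old_format = "Audit failure for 18878f1d2d9e061239fe77159f783cff10af2745bddcd8d05dc3ef9522f8528a: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0";

    let counts = count_by_pair(&[new_format, new_format, old_format]);
    for ((key, peer), n) in &counts {
        println!("{key} on {peer}: {n} failure(s)");
    }
}
```

Because the field is a fixed `key=value` token, the same pair can also be found with a plain grep for first_failed_key=0x18878f1d2d9e0612.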

Test plan

Three new unit tests in replication::tests cover the helper:

  • first_failed_key_label_truncates_to_16_hex_chars — sets the first 8 bytes to a known pattern and the rest to 0xAA, asserts the label is exactly the 16-hex of the first 8 bytes and the lower bytes are dropped.
  • first_failed_key_label_falls_back_when_empty — empty input produces "0x" (no misleading default).
  • first_failed_key_label_uses_first_key_only — only the first entry of the list is used.

Verified locally:

$ cargo test --lib replication::
test result: ok. 228 passed; 0 failed; 0 ignored; 0 measured; 285 filtered out

$ cargo clippy --lib --all-features
    Finished `dev` profile [optimized + debuginfo] target(s) in 2.70s
# zero warnings

Sample deployed log (before)

2026-06-07T12:17:37.515658Z  ERROR  ant_node::replication
  Audit failure for 18878f1d2d9e061239fe77159f783cff10af2745bddcd8d05dc3ef9522f8528a: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0

Sample deployed log (after, in 0.13.0)

2026-06-07T12:17:37.515658Z  ERROR  ant_node::replication
  Audit failure for 18878f1d2d9e061239fe77159f783cff10af2745bddcd8d05dc3ef9522f8528a: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0, first_failed_key=0x18878f1d2d9e0612

Out of scope (intentionally)

This PR does not address the underlying issue: a peer that passes responsibility confirmation but has zero of its assigned keys. That's a separate concern — the responsibility check in handle_audit_failure should only return confirmed_failures after a grace window has elapsed since the peer joined the close group, otherwise a fresh-joiner will fail every audit until its first sync completes. That fix needs its own PR; the new log line just makes the issue visible enough to measure and triage.
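For discussion in that follow-up, the grace-window idea might look roughly like the sketch below. Nothing here exists in the codebase: `failures_after_grace`, the `Key` alias, and the 5-minute window are all hypothetical, and how the node tracks close-group join time is not covered by this PR.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a 32-byte XorName (assumption).
type Key = [u8; 32];

/// Hypothetical gate: while the peer has been in the close group for less
/// than `grace_window`, report no confirmed failures; afterwards pass the
/// confirmed failures through unchanged.
fn failures_after_grace(
    confirmed: &[Key],
    in_close_group_for: Duration,
    grace_window: Duration,
) -> &[Key] {
    if in_close_group_for < grace_window {
        return &[];
    }
    confirmed
}

fn main() {
    let confirmed = [[0x18_u8; 32]];
    // Illustrative value only; the right window would need measuring.
    let grace = Duration::from_secs(300);

    let fresh = failures_after_grace(&confirmed, Duration::from_secs(30), grace);
    let settled = failures_after_grace(&confirmed, Duration::from_secs(900), grace);
    println!("fresh joiner: {} failure(s) counted", fresh.len());
    println!("settled peer: {} failure(s) counted", settled.len());
}
```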

The audit-failure ERROR line added in #129 already includes reason and
key counts (confirmed/challenged/absent/digest-mismatch), but it
identifies the *target peer* only via the message prefix and gives no
way to identify the *specific chunk* that was missing. The peer is a
64-hex string that can be matched in ES via a runtime regex, but a
sample key prefix in the structured message is much cheaper to group by
and query.

Add a `first_failed_key=0x<16-hex>` field that takes the first 8 bytes
of the first confirmed-failed key (XorName) and formats it as
`0x` + 16 hex chars. This is short enough to keep log volume low on
the hot path but long enough to disambiguate distinct chunks in the
same close group.

The new field is also exposed as a small `first_failed_key_label`
helper with three unit tests covering truncation, the empty-fallback
case, and first-key selection.

No behaviour change beyond the log line.

Refs: production investigation of ant-prod-01 (2026-06-07)
Copilot AI review requested due to automatic review settings June 7, 2026 12:42

Copilot AI left a comment


Pull request overview

Adds a compact first_failed_key=0x<16-hex> label to the replication audit-failure ERROR log line to enable easier cross-host correlation and grouping by the (truncated) chunk key prefix.

Changes:

  • Introduces first_failed_key_label() to hex-encode the first 8 bytes of the first confirmed-failed key (or "0x" when empty).
  • Extends the audit-failure error! log line in handle_audit_result to include first_failed_key=....
  • Adds unit tests covering truncation behavior, empty input fallback, and “first key only” behavior.
Comments suppressed due to low confidence (1)

src/replication/mod.rs:2601

  • first_failed_key is computed unconditionally before the error! call. When the logging feature is disabled, the logging macros compile to no-ops and do not evaluate their arguments (see src/logging.rs), but this precomputed String will still allocate/hex-encode on every audit failure. Inline the first_failed_key_label(...) call in the error! invocation so it’s skipped when logging is compiled out.
                let first_failed_key = first_failed_key_label(confirmed_failed_keys);
                error!(
                    "Audit failure for {challenged_peer}: reason={reason:?}, confirmed_failed_keys={}, challenged_keys={}, absent_keys={}, digest_mismatch_keys={}, first_failed_key={first_failed_key}",
                    confirmed_failed_keys.len(),
                    summary.challenged_keys,
                    summary.absent_keys,
                    summary.digest_mismatch_keys,
                );
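The difference the suppressed comment points at can be shown with a stand-in macro. This is a sketch under assumptions: `error_disabled!` only imitates a logging macro that has been compiled out (it is not the macro from src/logging.rs), and `label` is a hypothetical helper that counts how often it runs.

```rust
use std::cell::Cell;

// Stand-in for a compiled-out logging macro: it expands to nothing,
// so its arguments are never evaluated.
macro_rules! error_disabled {
    ($($arg:tt)*) => {};
}

/// Hypothetical label helper that records each call in `calls`.
fn label(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    String::from("0x18878f1d2d9e0612")
}

/// Precomputed: the String is built even though the macro discards it.
#[allow(unused_variables)]
fn precomputed(calls: &Cell<u32>) {
    let first_failed_key = label(calls);
    error_disabled!("first_failed_key={first_failed_key}");
}

/// Inlined: the call sits inside the macro arguments and vanishes with them.
#[allow(unused_variables)]
fn inlined(calls: &Cell<u32>) {
    error_disabled!("first_failed_key={}", label(calls));
}

fn main() {
    let calls = Cell::new(0);
    precomputed(&calls);
    println!("after precomputed: {} call(s)", calls.get());
    inlined(&calls);
    println!("after inlined: {} call(s)", calls.get());
}
```

With logging enabled both forms do the same work, so the inlined form only matters for builds where the macros are no-ops.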

Comment thread src/replication/mod.rs
Comment on lines +2717 to +2719
// Should never happen in production (handle_audit_failure rejects
// empty sets), but the formatter must still produce a valid label
// so the log line doesn't contain a misleading default.
@jacderida jacderida merged commit fb7494e into main Jun 7, 2026
13 checks passed
@jacderida jacderida deleted the fix/audit-failure-log-add-first-failed-key branch June 7, 2026 16:11