fix(replication): add first_failed_key to audit-failure log by dirvine · Pull Request #132 · WithAutonomi/ant-node

dirvine · 2026-06-07T12:42:45Z

Summary

Adds a first_failed_key=0x<16-hex> field to the audit-failure ERROR line in replication::handle_audit_result. This is a follow-up to #129, which already gave us structured counts but no way to identify the specific chunk that was missing.

The line currently looks like:

Audit failure for <peer>: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0

After this PR:

Audit failure for <peer>: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0, first_failed_key=0x18878f1d2d9e0612

first_failed_key is the high-order 8 bytes of the first confirmed_failed_keys entry, hex-encoded with a 0x prefix. 16 hex chars is short enough to keep the log volume low on the hot path but long enough to disambiguate distinct chunks inside the same close group.
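For reference, the label format described above can be sketched as follows. This is a minimal illustration, not the code from this PR: it assumes keys are plain 32-byte arrays (the `XorName` alias here is a stand-in) and that the helper takes a slice of them; the real signature may differ.

```rust
/// Hypothetical stand-in for the crate's XorName: a 32-byte key.
type XorName = [u8; 32];

/// Sketch of the label described above: "0x" followed by the hex of the
/// first 8 bytes of the first key, or a bare "0x" when the list is empty.
/// The parameter type is an assumption made for this example.
fn first_failed_key_label(keys: &[XorName]) -> String {
    let mut label = String::from("0x");
    if let Some(first) = keys.first() {
        for byte in &first[..8] {
            // Two lowercase hex digits per byte, zero-padded.
            label.push_str(&format!("{byte:02x}"));
        }
    }
    label
}
```

With a key whose first 8 bytes are `18 87 8f 1d 2d 9e 06 12`, this sketch yields the same `0x18878f1d2d9e0612` value shown in the sample line.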

Why

Top-failing-key dashboard: today, the only identifier in the line is the target peer's hex in the Audit failure for <hex> prefix, which has to be regex-extracted and is the peer's XorName, not the chunk's XorName, so it cannot say which chunk is missing. Adding first_failed_key lets us group audit failures by chunk and by peer without a runtime regex.
Reproducible cross-host correlation: first_failed_key=0x18878f1d2d9e0612 + challenged_peer=<peer-hex> forms a stable (chunk-prefix, target-peer) pair that an operator can paste into a chat thread and anyone can grep for.
No behaviour change: the only modification is the format string. Trust-event emission, responsibility confirmation, and bootstrap-claim handling are untouched.
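To illustrate the grouping use case, here is a rough sketch of counting audit failures per chunk-key prefix from raw log lines using plain string search. The function names are hypothetical and this is not part of the PR; in practice the grouping would more likely happen in the log backend.

```rust
use std::collections::BTreeMap;

/// Pull the value of `first_failed_key=` out of one audit-failure line.
/// Returns None for lines that predate the new field.
fn extract_first_failed_key(line: &str) -> Option<&str> {
    const FIELD: &str = "first_failed_key=";
    let start = line.find(FIELD)? + FIELD.len();
    let rest = &line[start..];
    // The value runs until the next comma or whitespace, or to end of line.
    let end = rest
        .find(|c: char| c == ',' || c.is_whitespace())
        .unwrap_or(rest.len());
    Some(&rest[..end])
}

/// Count audit failures per chunk-key prefix (hypothetical helper).
fn count_by_first_failed_key<'a>(
    lines: impl IntoIterator<Item = &'a str>,
) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for line in lines {
        if let Some(key) = extract_first_failed_key(line) {
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}
```

Because the field is a fixed `key=value` token at the end of the line, a substring search is enough here; no regex engine is involved.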

Test plan

Three new unit tests in replication::tests cover the helper:

first_failed_key_label_truncates_to_16_hex_chars — sets the first 8 bytes to a known pattern and the rest to 0xAA, then asserts the label is exactly the 16 hex chars of the first 8 bytes and that the remaining bytes are dropped.
first_failed_key_label_falls_back_when_empty — empty input produces "0x" (no misleading default).
first_failed_key_label_uses_first_key_only — only the first entry of the list is used.
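The three cases above could look roughly like the sketch below. The helper body is a stand-in written for this example (assuming 32-byte array keys) so the tests compile on their own; the test names are shortened and the byte patterns are illustrative, not the ones used in the PR.

```rust
/// Stand-in helper so the tests below compile on their own.
/// Assumes keys are 32-byte arrays; the real signature may differ.
fn first_failed_key_label(keys: &[[u8; 32]]) -> String {
    let hex = keys
        .first()
        .map(|k| k[..8].iter().map(|b| format!("{b:02x}")).collect::<String>())
        .unwrap_or_default();
    format!("0x{hex}")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn truncates_to_16_hex_chars() {
        // Known pattern in the first 8 bytes, 0xAA everywhere else.
        let mut key = [0xAAu8; 32];
        key[..8].copy_from_slice(&[0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef]);
        assert_eq!(first_failed_key_label(&[key]), "0x0123456789abcdef");
    }

    #[test]
    fn falls_back_when_empty() {
        // Empty input yields a bare prefix rather than a made-up key.
        assert_eq!(first_failed_key_label(&[]), "0x");
    }

    #[test]
    fn uses_first_key_only() {
        let keys = [[0x11u8; 32], [0x22u8; 32]];
        assert_eq!(first_failed_key_label(&keys), "0x1111111111111111");
    }
}
```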

Verified locally:

$ cargo test --lib replication::
test result: ok. 228 passed; 0 failed; 0 ignored; 0 measured; 285 filtered out

$ cargo clippy --lib --all-features
    Finished `dev` profile [optimized + debuginfo] target(s) in 2.70s
# zero warnings

Sample deployed log (before)

2026-06-07T12:17:37.515658Z  ERROR  ant_node::replication
  Audit failure for 18878f1d2d9e061239fe77159f783cff10af2745bddcd8d05dc3ef9522f8528a: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0

Sample deployed log (after, in 0.13.0)

2026-06-07T12:17:37.515658Z  ERROR  ant_node::replication
  Audit failure for 18878f1d2d9e061239fe77159f783cff10af2745bddcd8d05dc3ef9522f8528a: reason=KeyAbsent, confirmed_failed_keys=1, challenged_keys=1, absent_keys=1, digest_mismatch_keys=0, first_failed_key=0x18878f1d2d9e0612

Out of scope (intentionally)

This PR does not address the underlying issue: a peer that passes responsibility confirmation but has zero of its assigned keys. That's a separate concern — the responsibility check in handle_audit_failure should only return confirmed_failures after a grace window has elapsed since the peer joined the close group, otherwise a fresh-joiner will fail every audit until its first sync completes. That fix needs its own PR; the new log line just makes the issue visible enough to measure and triage.
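As a rough sketch of what such a grace window might look like (not implemented here, and not a proposal for the final design): the function names, the generic key type, and the 10-minute value are all assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical grace window; no value has been decided in this PR.
const JOIN_GRACE: Duration = Duration::from_secs(600);

/// True once the peer has been in the close group for at least JOIN_GRACE.
fn past_join_grace(joined_at: Instant, now: Instant) -> bool {
    // saturating_duration_since returns zero if `now` is before `joined_at`.
    now.saturating_duration_since(joined_at) >= JOIN_GRACE
}

/// Report confirmed failures only for peers that are past the grace window;
/// a fresh joiner gets an empty list until its first sync has had time to run.
fn failures_to_report<K: Clone>(confirmed: &[K], joined_at: Instant, now: Instant) -> Vec<K> {
    if past_join_grace(joined_at, now) {
        confirmed.to_vec()
    } else {
        Vec::new()
    }
}
```

Passing `now` in explicitly keeps the check easy to unit test without sleeping.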

The audit-failure ERROR line added in #129 already includes reason and key counts (confirmed/challenged/absent/digest-mismatch), but it identifies the *target peer* only via the message prefix and gives no way to identify the *specific chunk* that was missing. The peer is a 64-hex string that can be matched in ES via a runtime regex, but a sample key prefix in the structured message is much cheaper to group by and query.

Add a `first_failed_key=0x<16-hex>` field that takes the first 8 bytes of the first confirmed-failed key (XorName) and formats it as `0x` + 16 hex chars. This is short enough to keep log volume low on the hot path but long enough to disambiguate distinct chunks in the same close group.

The new field is also exposed as a small `first_failed_key_label` helper with three unit tests covering truncation, the empty-fallback case, and first-key selection. No behaviour change beyond the log line.

Refs: production investigation of ant-prod-01 (2026-06-07)

Copilot

Pull request overview

Adds a compact first_failed_key=0x<16-hex> label to the replication audit-failure ERROR log line to enable easier cross-host correlation and grouping by the (truncated) chunk key prefix.

Changes:

Introduces first_failed_key_label() to hex-encode the first 8 bytes of the first confirmed-failed key (or "0x" when empty).
Extends the audit-failure error! log line in handle_audit_result to include first_failed_key=....
Adds unit tests covering truncation behavior, empty input fallback, and “first key only” behavior.

Comments suppressed due to low confidence (1)

src/replication/mod.rs:2601

first_failed_key is computed unconditionally before the error! call. When the logging feature is disabled, the logging macros compile to no-ops and do not evaluate their arguments (see src/logging.rs), but this precomputed String will still allocate/hex-encode on every audit failure. Inline the first_failed_key_label(...) call in the error! invocation so it’s skipped when logging is compiled out.

                let first_failed_key = first_failed_key_label(confirmed_failed_keys);
                error!(
                    "Audit failure for {challenged_peer}: reason={reason:?}, confirmed_failed_keys={}, challenged_keys={}, absent_keys={}, digest_mismatch_keys={}, first_failed_key={first_failed_key}",
                    confirmed_failed_keys.len(),
                    summary.challenged_keys,
                    summary.absent_keys,
                    summary.digest_mismatch_keys,
                );
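The difference between the two forms can be sketched with a stand-in macro. This example assumes, as the comment states, that a disabled logging macro expands to nothing; `disabled_error!` and `label` are hypothetical names used only here, and a call counter makes the evaluation observable.

```rust
use std::cell::Cell;

/// Stand-in for a logging macro that is compiled out: it discards its
/// tokens, so argument expressions are never evaluated.
macro_rules! disabled_error {
    ($($arg:tt)*) => {};
}

/// Stand-in for `first_failed_key_label`; bumps `calls` on every invocation.
fn label(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    String::from("0x18878f1d2d9e0612")
}

/// Precomputed form: the label is built even though the macro is a no-op.
#[allow(unused_variables)]
fn precomputed(calls: &Cell<u32>) {
    let first_failed_key = label(calls);
    disabled_error!("first_failed_key={}", first_failed_key);
}

/// Inlined form: the call sits inside the macro arguments, so it
/// disappears together with the macro body.
#[allow(unused_variables)]
fn inlined(calls: &Cell<u32>) {
    disabled_error!("first_failed_key={}", label(calls));
}
```

Calling `precomputed` increments the counter once, while `inlined` leaves it unchanged, which is the allocation the comment suggests avoiding.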

Copilot AI review requested due to automatic review settings June 7, 2026 12:42

Copilot started reviewing on behalf of dirvine June 7, 2026 12:42

Copilot AI reviewed Jun 7, 2026

Comment thread src/replication/mod.rs

Comment on lines +2717 to +2719

// Should never happen in production (handle_audit_failure rejects
// empty sets), but the formatter must still produce a valid label
// so the log line doesn't contain a misleading default.

jacderida approved these changes Jun 7, 2026

jacderida merged commit fb7494e into main Jun 7, 2026
13 checks passed

jacderida deleted the fix/audit-failure-log-add-first-failed-key branch June 7, 2026 16:11
