fix: answer Merkle closeness check from local routing table #111

Open
jacderida wants to merge 1 commit into WithAutonomi:rc-2026.5.4 from jacderida:fix/merkle-closeness-local-lookup

Conversation

@jacderida
Collaborator

Summary

The Merkle pay-yourself defence verified candidate closeness with an iterative Kademlia network lookup (find_closest_nodes_network) on the PUT-handling hot path. That lookup runs up to MAX_ITERATIONS rounds bounded by CLOSENESS_LOOKUP_TIMEOUT (240s) and is the dominant term in slow per-chunk store times; its instability (fresh transient peers pulled in on every call) also contributes to the closeness disagreements that cause outright rejections.

  • Answer the closeness check from the local routing table (find_closest_nodes_local — a pure in-memory k-bucket read, no network I/O), matching the precedent already used for the close-group responsibility check (find_closest_nodes_local_with_self).
  • Fall back to the network lookup only when the local table is genuinely too sparse to be authoritative (fewer than CLOSENESS_LOOKUP_WIDTH peers near the midpoint).
  • The fallback is gated on local table size, not match outcome, so a forged pool cannot force the expensive 240s path — an attacker cannot make a victim's local routing table sparse (DoS-safe).
  • check_closeness_match and the single-flight pass-cache wrapper are unchanged.
  • Node-side only, no wire/protocol change → backwards compatible across a mixed-version fleet.
  • The fallback decision is extracted into a pure const fn (closeness_should_fall_back_to_network) so its CLOSENESS_LOOKUP_WIDTH boundary is unit tested without standing up a P2PNode.
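
A minimal sketch of what the extracted fallback predicate might look like. The function name and the `CLOSENESS_LOOKUP_WIDTH` constant come from this PR; the single `usize` parameter and the value `32` are assumptions made for illustration and may not match the code in src/payment/verifier.rs.

```rust
/// Assumed value, for illustration only; the real constant is defined in ant-node.
const CLOSENESS_LOOKUP_WIDTH: usize = 32;

/// Returns `true` only when the local routing table yielded fewer than
/// `CLOSENESS_LOOKUP_WIDTH` peers near the midpoint, i.e. when the local
/// answer is too sparse to be treated as authoritative.
///
/// The decision depends on the local peer count alone, never on whether the
/// candidate pool matched, so a forged pool cannot trigger the network path.
/// (Hypothetical signature: the parameter name and type are assumptions.)
const fn closeness_should_fall_back_to_network(local_peer_count: usize) -> bool {
    local_peer_count < CLOSENESS_LOOKUP_WIDTH
}
```

Because the predicate is a pure `const fn` over a count, the boundary can be tested directly: one peer below the width falls back, exactly the width does not.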

Test plan

  • cargo fmt -- --check — clean
  • cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used — no warnings
  • cargo test --lib payment::verifier — 67 passed, 0 failed (incl. new boundary test closeness_falls_back_to_network_only_below_lookup_width)
  • e2e test target (--test e2e --features test-utils) compiles
  • Follow-up (not in this PR): prod uploaders run comparing per-chunk store_durations_ms p50/p99 against baseline

🤖 Generated with Claude Code

The Merkle pay-yourself defence verified candidate closeness with an iterative Kademlia
*network* lookup (find_closest_nodes_network) on the PUT-handling hot path. That lookup runs
up to MAX_ITERATIONS rounds bounded by CLOSENESS_LOOKUP_TIMEOUT (240s) and is the dominant
term in slow per-chunk store times; its instability (fresh transient peers pulled in on every
call) also contributes to the closeness disagreements that cause outright rejections.

Answer instead from the local routing table (find_closest_nodes_local, a pure in-memory
k-bucket read with no network I/O), matching the precedent already used for the close-group
responsibility check (find_closest_nodes_local_with_self). Fall back to the network lookup
only when the local table is genuinely too sparse to be authoritative (fewer than
CLOSENESS_LOOKUP_WIDTH peers near the midpoint). The fallback is gated on local table size,
not match outcome, so a forged pool cannot force the expensive 240s path -- an attacker
cannot make a victim's local routing table sparse.

check_closeness_match and the single-flight pass-cache wrapper are unchanged. Node-side only,
no wire/protocol change, so this is backwards compatible across a mixed-version fleet. The
fallback decision is extracted into a pure const fn (closeness_should_fall_back_to_network)
so its CLOSENESS_LOOKUP_WIDTH boundary is unit tested without standing up a P2PNode.

Test results:
- cargo fmt -- --check: clean
- cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used
  -D clippy::expect_used: no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed (incl. new boundary test
  closeness_falls_back_to_network_only_below_lookup_width)
- e2e test target (--test e2e --features test-utils): compiles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@dirvine dirvine left a comment

Review

Overall: The local-table-first approach is a sound performance optimization (avoids 240s network lookup on the hot path). The DoS-safe argument (gate on table size, not match outcome) is correct. No visible CI checks.

Coordination concern with saorsa-core PR #121

saorsa-core PR #121 (fix/closegroup-reachability) modifies find_closest_nodes_local to re-rank by (reachability_tier, xor_distance) before truncation. This PR (#111) calls find_closest_nodes_local for the Merkle closeness check.

check_closeness_match does a set-membership check (HashSet<PeerId>) — it verifies that a quorum of the candidate pool peer IDs appear in the returned network peers set. Because find_closest_nodes_local now over-fetches 2×, re-ranks by reachability, then truncates, the set of peers returned can differ from a pure XOR-distance set:

  • The top 32 XOR-closest peers from the routing table might include some relay-only nodes
  • Reachability re-ranking promotes direct peers (even if XOR-farther) into the top 32
  • A valid candidate pool containing a relay-only peer could be rejected because that peer is no longer in the truncated set

The uploader still constructs its candidate pool via network lookup (XOR-only), not via the reachability-aware local path — so the uploader and storer could disagree on which peers belong in the close-group set.
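
To make the disagreement concrete, here is a toy model, not the saorsa-core implementation. Peer IDs are single bytes, the reachability tier is reduced to a `relay_only` flag, and the helper names (`closest_by_distance`, `closest_with_reachability_rerank`, `sample_peers`) are hypothetical. It only illustrates how over-fetching 2×, re-ranking, and truncating can return a different set than pure XOR ordering.

```rust
use std::collections::HashSet;

/// Toy peer: a one-byte ID plus a simplified reachability tier.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Peer {
    id: u8,
    /// `false` = directly reachable (ranked first), `true` = relay-only.
    relay_only: bool,
}

fn xor_distance(id: u8, target: u8) -> u8 {
    id ^ target
}

/// XOR-only view: the `k` peers closest to `target` by distance.
fn closest_by_distance(peers: &[Peer], target: u8, k: usize) -> HashSet<u8> {
    let mut sorted = peers.to_vec();
    sorted.sort_by_key(|p| xor_distance(p.id, target));
    sorted.into_iter().take(k).map(|p| p.id).collect()
}

/// Reachability-aware view: over-fetch `2 * k` by distance, re-rank by
/// (tier, distance), then truncate to `k`.
fn closest_with_reachability_rerank(peers: &[Peer], target: u8, k: usize) -> HashSet<u8> {
    let mut sorted = peers.to_vec();
    sorted.sort_by_key(|p| xor_distance(p.id, target));
    sorted.truncate(2 * k);
    sorted.sort_by_key(|p| (p.relay_only, xor_distance(p.id, target)));
    sorted.into_iter().take(k).map(|p| p.id).collect()
}

/// Four peers around target 0; the XOR-closest one (ID 1) is relay-only.
fn sample_peers() -> Vec<Peer> {
    vec![
        Peer { id: 1, relay_only: true },
        Peer { id: 2, relay_only: false },
        Peer { id: 4, relay_only: false },
        Peer { id: 8, relay_only: false },
    ]
}
```

With target `0` and `k = 2`, the XOR-only view contains peers 1 and 2, while the re-ranked view contains peers 2 and 4. A candidate pool that legitimately includes peer 1 would miss in the second set.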

Recommendation: Before merging, either:

  1. Have this PR use XOR-only ordering for the closeness check (e.g. sort network_peers by XOR distance before the set-membership check), or
  2. Coordinate with PR #121 to ensure the ordering semantics are compatible

The ideal ordering for the closeness check should be XOR-only, while the close-group selection for storage benefits from the reachability re-rank. These are different concerns.

jacderida added a commit to jacderida/saorsa-core that referenced this pull request May 25, 2026
The routing table now serves two consumers with opposite ordering needs. Close-group and
candidate *selection* for storage benefits from the (reachability_tier, xor_distance) re-rank
in find_closest_nodes_local — a directly-reachable peer is a better place to put data than an
XOR-equal relay-only one. Closeness *verification* (ant-node's Merkle pay-yourself defence,
WithAutonomi/ant-node#111) is the opposite: it must mirror the uploader's pure XOR-distance
network view. If verification re-ranked by reachability it could demote an XOR-close relay-only
peer out of the compared window and falsely reject an honest candidate pool that legitimately
contains that peer.

Add find_closest_nodes_local_by_distance: the distance-pure counterpart to
find_closest_nodes_local. No over-fetch, no reachability re-rank — it returns the routing
table's XOR-distance order as-is, still excluding self and stamping the real trust score. This
gives the verification path the raw XOR ordering it needs while every selection caller keeps
the reachability re-rank.

No change to existing functions or public-API behaviour; purely additive. The ant-node
closeness check will switch to this method once this PR (saorsa-labs#121) merges and ant-node bumps its
saorsa-core pin (tracked on ant-node#111).

Tested: cargo fmt clean; cargo clippy --lib + --test two_node_messaging
(-D warnings -D unwrap_used -D expect_used) clean; cargo test --lib 485 passed/0 failed; new
two-node integration test local_by_distance_returns_peer_and_stamps_neutral_trust passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida
Collaborator Author

Thanks @dirvine — good catch on the interaction with saorsa-core#121. We're addressing it at the source.

The concern. Once saorsa-core#121 re-ranks find_closest_nodes_local by (reachability_tier, xor_distance) + truncate, the set of peers it returns can differ from the XOR-closest set. Since check_closeness_match is a set-membership test, a reachability re-rank could push an XOR-close relay-only peer out of the compared window and falsely reject an honest pool whose candidates the uploader (XOR-only) legitimately picked. (Worth noting the lever is which peers survive truncation, not their order — a HashSet membership check is order-insensitive — so the fix has to change the selection, not sort after the fact.)
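
The order-insensitivity point can be sketched as follows. `quorum_matches` is a hypothetical stand-in for check_closeness_match, assuming only what is stated above (a quorum of candidate IDs must appear in the returned set); the `u8` IDs and the quorum parameter are illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a set-membership closeness check: passes when at
/// least `quorum` of the candidate IDs appear among the returned peers.
fn quorum_matches(candidates: &[u8], returned: &[u8], quorum: usize) -> bool {
    // Collecting into a HashSet discards ordering entirely.
    let returned_set: HashSet<u8> = returned.iter().copied().collect();
    let hits = candidates
        .iter()
        .filter(|id| returned_set.contains(*id))
        .count();
    hits >= quorum
}
```

Passing the same returned peers in any order gives the same verdict, so re-sorting an already truncated list cannot recover a peer that truncation dropped. Only changing which peers are selected changes the result.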

How we're addressing it. Rather than special-casing in ant-node, we've added a distance-pure variant in saorsa-core#121: find_closest_nodes_local_by_distance (commit a04f428). It returns the routing table's pure XOR-distance order (no over-fetch, no re-rank) for the verification path, while all selection callers keep the reachability re-rank. This matches the "closeness check should be XOR-only, storage selection benefits from the re-rank — different concerns" framing exactly, and keeps the two concerns cleanly separated.

Why this PR is safe to merge as-is today. ant-node currently pins published saorsa-core 0.24.4, where find_closest_nodes_local is already XOR-only (the re-rank is unreleased, on the #121 branch). So this PR's closeness check compares against an XOR set today, exactly like the uploader.

Follow-up. When saorsa-core#121 merges, we'll update this PR to reference the saorsa-core RC branch and switch the closeness check (src/payment/verifier.rs) from find_closest_nodes_local to find_closest_nodes_local_by_distance in the same change — so the bump and the call-site switch land together and the hazard never arms.
