fix: answer Merkle closeness check from local routing table #111
jacderida wants to merge 1 commit into
The Merkle pay-yourself defence verified candidate closeness with an iterative Kademlia *network* lookup (find_closest_nodes_network) on the PUT-handling hot path. That lookup runs up to MAX_ITERATIONS rounds bounded by CLOSENESS_LOOKUP_TIMEOUT (240s) and is the dominant term in slow per-chunk store times; its instability (fresh transient peers pulled in on every call) also contributes to the closeness disagreements that cause outright rejections.

Answer instead from the local routing table (find_closest_nodes_local, a pure in-memory k-bucket read with no network I/O), matching the precedent already used for the close-group responsibility check (find_closest_nodes_local_with_self). Fall back to the network lookup only when the local table is genuinely too sparse to be authoritative (fewer than CLOSENESS_LOOKUP_WIDTH peers near the midpoint).

The fallback is gated on local table size, not match outcome, so a forged pool cannot force the expensive 240s path -- an attacker cannot make a victim's local routing table sparse.

check_closeness_match and the single-flight pass-cache wrapper are unchanged. Node-side only, no wire/protocol change, so this is backwards compatible across a mixed-version fleet.

The fallback decision is extracted into a pure const fn (closeness_should_fall_back_to_network) so its CLOSENESS_LOOKUP_WIDTH boundary is unit tested without standing up a P2PNode.

Test results:
- cargo fmt -- --check: clean
- cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used: no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed (incl. new boundary test closeness_falls_back_to_network_only_below_lookup_width)
- e2e test target (--test e2e --features test-utils): compiles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
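The size-gated fallback described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: only the names closeness_should_fall_back_to_network and CLOSENESS_LOOKUP_WIDTH come from the text above, while the function signature, the value 32, and the ClosenessSource and choose_source helpers are assumptions made for the example.

```rust
/// Assumed value for illustration only; the real constant is defined in ant-node.
const CLOSENESS_LOOKUP_WIDTH: usize = 32;

/// Hypothetical label for where the closeness check gets its peer set from.
#[derive(Debug, PartialEq, Eq)]
enum ClosenessSource {
    LocalTable,
    NetworkLookup,
}

/// Pure decision: fall back only when the local table is too sparse.
/// The only input is the number of local peers near the midpoint,
/// never the outcome of the closeness match, so a forged candidate
/// pool has no way to influence the result.
const fn closeness_should_fall_back_to_network(local_peer_count: usize) -> bool {
    local_peer_count < CLOSENESS_LOOKUP_WIDTH
}

/// Hypothetical wrapper showing how a caller might branch on the decision.
fn choose_source(local_peer_count: usize) -> ClosenessSource {
    if closeness_should_fall_back_to_network(local_peer_count) {
        ClosenessSource::NetworkLookup
    } else {
        ClosenessSource::LocalTable
    }
}
```

Because the decision is a pure function of one integer, the boundary (one below the width versus exactly the width) can be asserted directly in a unit test without constructing a node.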
dirvine
left a comment
Review
Overall: The local-table-first approach is a sound performance optimization (avoids 240s network lookup on the hot path). The DoS-safe argument (gate on table size, not match outcome) is correct. No visible CI checks.
Coordination concern with saorsa-core PR #121
saorsa-core PR #121 (fix/closegroup-reachability) modifies find_closest_nodes_local to re-rank by (reachability_tier, xor_distance) before truncation. This PR (#111) calls find_closest_nodes_local for the Merkle closeness check.
check_closeness_match does a set-membership check (HashSet<PeerId>) — it verifies that a quorum of the candidate pool peer IDs appear in the returned network peers set. Because find_closest_nodes_local now over-fetches 2×, re-ranks by reachability, then truncates, the set of peers returned can differ from a pure XOR-distance set:
- The top 32 XOR-closest peers from the routing table might include some relay-only nodes
- Reachability re-ranking promotes direct peers (even if XOR-farther) into the top 32
- A valid candidate pool containing a relay-only peer could be rejected because that peer is no longer in the truncated set
The uploader still constructs its candidate pool via network lookup (XOR-only), not via the reachability-aware local path — so the uploader and storer could disagree on which peers belong in the close-group set.
Recommendation: Before merging, either:
- Have this PR use XOR-only ordering for the closeness check (e.g. sort network_peers by XOR distance before the set-membership check), or
- Coordinate with PR #121 to ensure the ordering semantics are compatible
The ideal ordering for the closeness check should be XOR-only, while the close-group selection for storage benefits from the reachability re-rank. These are different concerns.
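The disagreement described in this review can be reproduced on a toy routing table. The sketch below is an assumption-laden illustration: peer IDs are single bytes, the tier encoding (0 for direct, 1 for relay-only) and all function names are hypothetical, and the re-rank merely follows the "over-fetch 2×, re-rank by (reachability_tier, xor_distance), truncate" description above rather than the saorsa-core code.

```rust
use std::collections::HashSet;

/// Toy peer: a one-byte ID and a reachability tier (0 = direct, 1 = relay-only).
#[derive(Clone, Copy, Debug)]
struct Peer {
    id: u8,
    tier: u8,
}

/// Top-k by XOR distance only (the uploader's network view).
fn closest_by_xor(peers: &[Peer], target: u8, k: usize) -> Vec<u8> {
    let mut sorted = peers.to_vec();
    sorted.sort_by_key(|p| p.id ^ target);
    sorted.into_iter().take(k).map(|p| p.id).collect()
}

/// Over-fetch 2k by XOR, re-rank by (tier, xor), then truncate to k.
fn closest_reranked(peers: &[Peer], target: u8, k: usize) -> Vec<u8> {
    let mut sorted = peers.to_vec();
    sorted.sort_by_key(|p| p.id ^ target);
    sorted.truncate(2 * k);
    sorted.sort_by_key(|p| (p.tier, p.id ^ target));
    sorted.into_iter().take(k).map(|p| p.id).collect()
}

/// Set-membership count: how many candidates appear in the storer's view.
fn count_matches(candidates: &[u8], storer_view: &[u8]) -> usize {
    let view: HashSet<u8> = storer_view.iter().copied().collect();
    candidates.iter().filter(|id| view.contains(id)).count()
}

/// Six peers around target 0; the XOR-closest one (id 1) is relay-only.
fn example_peers() -> Vec<Peer> {
    vec![
        Peer { id: 1, tier: 1 },
        Peer { id: 2, tier: 0 },
        Peer { id: 3, tier: 0 },
        Peer { id: 4, tier: 0 },
        Peer { id: 8, tier: 0 },
        Peer { id: 16, tier: 0 },
    ]
}
```

With target 0 and k = 3, the XOR-only view is peers 1, 2, 3, while the re-ranked view is peers 2, 3, 4: the relay-only peer 1 drops out, so an honest pool built from the XOR view matches only two of three entries on the storer side. Whether that fails the check depends on the quorum threshold, which is not stated here.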
The routing table now serves two consumers with opposite ordering needs.

Close-group and candidate *selection* for storage benefits from the (reachability_tier, xor_distance) re-rank in find_closest_nodes_local: a directly-reachable peer is a better place to put data than an XOR-equal relay-only one.

Closeness *verification* (ant-node's Merkle pay-yourself defence, WithAutonomi/ant-node#111) is the opposite: it must mirror the uploader's pure XOR-distance network view. If verification re-ranked by reachability it could demote an XOR-close relay-only peer out of the compared window and falsely reject an honest candidate pool that legitimately contains that peer.

Add find_closest_nodes_local_by_distance: the distance-pure counterpart to find_closest_nodes_local. No over-fetch, no reachability re-rank; it returns the routing table's XOR-distance order as-is, still excluding self and stamping the real trust score. This gives the verification path the raw XOR ordering it needs while every selection caller keeps the reachability re-rank.

No change to existing functions or public-API behaviour; purely additive. The ant-node closeness check will switch to this method once this PR (saorsa-labs#121) merges and ant-node bumps its saorsa-core pin (tracked on ant-node#111).

Tested: cargo fmt clean; cargo clippy --lib + --test two_node_messaging (-D warnings -D unwrap_used -D expect_used) clean; cargo test --lib 485 passed/0 failed; new two-node integration test local_by_distance_returns_peer_and_stamps_neutral_trust passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
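A distance-pure selection of the kind this commit describes might look like the sketch below. It is not the saorsa-core implementation: the 4-byte Id type, the flat slice standing in for the routing table, and the function names are all hypothetical, and trust-score stamping is left out entirely.

```rust
/// Hypothetical 4-byte peer ID; real peer IDs are longer.
type Id = [u8; 4];

/// XOR distance as a byte array. Comparing these arrays lexicographically
/// gives the same order as comparing the distances as big-endian integers.
fn xor_distance(a: &Id, b: &Id) -> Id {
    let mut out = [0u8; 4];
    for i in 0..4 {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Distance-pure selection: exclude self, sort by XOR distance to the
/// target, keep the closest `count`. No over-fetch and no reachability
/// re-rank, so the result depends on IDs alone.
fn local_by_distance(table: &[Id], self_id: &Id, target: &Id, count: usize) -> Vec<Id> {
    let mut peers: Vec<Id> = table.iter().copied().filter(|id| id != self_id).collect();
    peers.sort_by_key(|id| xor_distance(id, target));
    peers.truncate(count);
    peers
}
```

Since the ordering uses only the IDs and the target, two nodes holding the same peers would compute the same window regardless of how each of them classifies reachability, which is the property the verification path needs.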
Thanks @dirvine, good catch on the interaction with saorsa-core#121. We're addressing it at the source.
The concern. Once saorsa-core#121 re-ranks
How we're addressing it. Rather than special-casing in ant-node, we've added a distance-pure variant in saorsa-core#121:
Why this PR is safe to merge as-is today. ant-node currently pins published
Follow-up. When saorsa-core#121 merges, we'll update this PR to reference the saorsa-core RC branch and switch the closeness check (
Summary
The Merkle pay-yourself defence verified candidate closeness with an iterative Kademlia network lookup (find_closest_nodes_network) on the PUT-handling hot path. That lookup runs up to MAX_ITERATIONS rounds bounded by CLOSENESS_LOOKUP_TIMEOUT (240s) and is the dominant term in slow per-chunk store times; its instability (fresh transient peers pulled in on every call) also contributes to the closeness disagreements that cause outright rejections.
- Answer instead from the local routing table (find_closest_nodes_local, a pure in-memory k-bucket read with no network I/O), matching the precedent already used for the close-group responsibility check (find_closest_nodes_local_with_self).
- Fall back to the network lookup only when the local table is genuinely too sparse to be authoritative (fewer than CLOSENESS_LOOKUP_WIDTH peers near the midpoint).
- check_closeness_match and the single-flight pass-cache wrapper are unchanged.
- The fallback decision is extracted into a pure const fn (closeness_should_fall_back_to_network) so its CLOSENESS_LOOKUP_WIDTH boundary is unit tested without standing up a P2PNode.
Test plan
- cargo fmt -- --check: clean
- cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used: no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed (incl. new boundary test closeness_falls_back_to_network_only_below_lookup_width)
- e2e test target (--test e2e --features test-utils) compiles
- prod uploaders run comparing per-chunk store_durations_ms p50/p99 against baseline
🤖 Generated with Claude Code