Skip to content

fix(payment): verify closeness against pure-XOR view; escalate Merkle on mismatch#114

Merged
jacderida merged 3 commits into
rc-2026.5.4from
fix/single-node-and-merkle-closeness-verification
May 26, 2026
Merged

fix(payment): verify closeness against pure-XOR view; escalate Merkle on mismatch#114
jacderida merged 3 commits into
rc-2026.5.4from
fix/single-node-and-merkle-closeness-verification

Conversation

@jacderida
Copy link
Copy Markdown
Collaborator

Summary

Fixes two upload-breaking regressions seen on testnets with a meaningful NAT fraction (e.g. STG-01: ~900 nodes, 30% NAT-simulated — every upload failed). Both stem from the same root cause: storer-side closeness verification compares the uploader's network-walk-derived peer set against the storer's local routing-table view, and those views diverge on a real network.

1. Single-node payment close-group check (regressed by saorsa-core #121)

The check (introduced in #107) calls find_closest_nodes_local_with_self. When saorsa-core #121 re-ranked that lookup (reachability_tier, xor_distance) — correct for storage selection — the verifier silently inherited it. The re-rank demotes XOR-close relay-only / NAT'd peers out of the local top-CLOSE_GROUP_SIZE, while the uploader's quoted set (get_store_quotes, pure XOR) still contains them. Result: only 3–4 of 7 quoted peers match, breaching the >=5 threshold → every single-node payment rejected.

Fix: switch to the XOR-only find_closest_nodes_local_by_distance_with_self (WithAutonomi/saorsa-core#122), exactly mirroring the Merkle closeness path. Still a pure local lookup → no added network cost.

2. Merkle candidate-pool closeness check (regressed by #111)

#111 moved this check off the authoritative iterative-Kademlia network lookup onto the local routing table (a real per-chunk latency win), with the network fallback gated on local-table size, not match outcome. The DoS argument was sound — an attacker can't make a victim's table sparse — but it also means that when the local k-bucket sample legitimately disagrees with the uploader's network-walked candidates (which include reachable responders from positions 17–32), honest pools are hard-rejected with no escalation. On STG-01 this was 100% of Merkle uploads (candidate pub_keys do not match; node logs: all match-failures, zero sparse-defers, zero timeouts).

Fix: keep #111's local fast path — accept immediately on a local match (the common case on a healthy, converged network) — but escalate to the authoritative network lookup on match failure as well as sparsity. This restores the pre-#111 correctness only on the (rare, on a healthy net) mismatch path, not per chunk.

DoS bound: the reopened network-fallback path is capped by a new closeness_fallback_permits semaphore (CLOSENESS_NETWORK_FALLBACK_CONCURRENCY = 16). inflight_closeness already collapses same-pool concurrency; this caps the distinct-pool_hash case so a forged-pool flood cannot spawn unbounded CLOSENESS_LOOKUP_TIMEOUT-long walks. Excess concurrent walks get a fast, retryable rejection rather than queueing — addressing the exact DoS rationale #111 gave for the size-only gate.

Dependency / merge order

⚠️ Requires WithAutonomi/saorsa-core#122 (find_closest_nodes_local_by_distance_with_self) to land on rc-2026.5.4 first. The saorsa-core git pin then resolves it via cargo update -p saorsa-core; until then CI here will not compile. (Verified locally against #122 via a path patch.)

Test plan

  • cargo check --lib — clean (against saorsa-core#122)
  • cargo test --lib payment::verifier — 67 passed, 0 failed
  • cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used — clean
  • cargo fmt --all -- --check — clean
  • Follow-up (not in this PR): testnet upload run on a NAT-simulated network confirming single-node + Merkle uploads succeed, plus per-chunk store_durations_ms p50/p99 vs baseline to confirm the fast path is preserved

🤖 Generated with Claude Code

jacderida and others added 2 commits May 26, 2026 21:22
… on mismatch

Two upload-breaking regressions on testnets with a meaningful NAT
fraction, both from storer-side closeness verification diverging from
the uploader's network-walked peer selection.

Single-node close-group check (introduced in #107): switch off the
reachability-reranked find_closest_nodes_local_with_self onto the
XOR-only find_closest_nodes_local_by_distance_with_self. The re-rank
(saorsa-core #121) demoted XOR-close relay-only / NAT'd peers out of the
local top-CLOSE_GROUP_SIZE, dropping 2-3 of the uploader's 7 quoted
peers and breaching the >=5 threshold. This mirrors the fix already
applied to the Merkle path; it remains a pure local lookup, so no added
network cost.

Merkle candidate-pool check (changed in #111): #111 moved the check off
the authoritative network lookup onto the local routing table, with the
fallback gated on local-table *size*, not match *outcome*. On a real
network the local k-bucket sample legitimately diverges from the
uploader's network-walked candidates (which include reachable responders
from positions 17-32), so honest pools were hard-rejected with no
escalation. Keep #111's local fast path (accept on a local match), but
escalate to the authoritative network lookup on match *failure* too.

Bound the reopened network-fallback path with a new
closeness_fallback_permits semaphore (CLOSENESS_NETWORK_FALLBACK_CONCURRENCY
= 16): inflight_closeness already collapses same-pool concurrency, and
this caps the distinct-pool case so a forged-pool flood cannot spawn
unbounded 240s Kademlia walks -- addressing the DoS rationale #111 gave
for the size-only gate.

Requires saorsa-core's find_closest_nodes_local_by_distance_with_self
(WithAutonomi/saorsa-core#122) on rc-2026.5.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up find_closest_nodes_local_by_distance_with_self
(WithAutonomi/saorsa-core#122, now merged to rc-2026.5.4) that the
single-node close-group verification change depends on. The crate is
pinned to `branch = "rc-2026.5.4"`, so this only advances Cargo.lock
from 1be7352 to 82bb541; no manifest change. ant-node now compiles
against the published branch without a local patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI only triggered for push/pull_request against `main`, so PRs targeting
release branches (e.g. rc-2026.5.4) ran no checks. Add `rc-*` to both
branch filters.

Note: the pull_request branch filter is evaluated against the PR's base
branch, so this only starts firing for rc-targeted PRs once it has landed
on the rc-2026.5.4 branch itself (i.e. after this PR merges).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida jacderida merged commit 59da17b into rc-2026.5.4 May 26, 2026
11 checks passed
@jacderida jacderida deleted the fix/single-node-and-merkle-closeness-verification branch May 26, 2026 21:10
jacderida added a commit that referenced this pull request May 27, 2026
The reverts of #114/#111 also dropped the rc-* branch filters from the CI
workflow. Restore them so push/PR CI still runs for rc-* base branches.
jacderida added a commit that referenced this pull request May 27, 2026
…le-prs

Revert #114 + #111; drop abandoned saorsa-core/ant-protocol rc pins
jacderida added a commit to jacderida/ant-node that referenced this pull request May 28, 2026
…gle-node-and-merkle-closeness-verification"

This reverts commit 59da17b, reversing
changes made to ed939af.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant