
fix(payment): widen merkle closeness lookup to K=32, raise timeout to 240s #89

Merged

jacderida merged 4 commits into WithAutonomi:rc-2026.5.1 from grumbach:grumbach/fix/merkle-closeness-A-G on May 7, 2026

Conversation

@grumbach (Collaborator) commented May 7, 2026

Summary

Two related, single-file changes in src/payment/verifier.rs that together restore merkle batch payment uploads on under-warm networks. Validated experimentally on STG-01 (1k-node testnet, 30% NAT-simulated peers, 2026-05-01).

Constant                        Before                        After
CLOSENESS_LOOKUP_TIMEOUT        60s                           240s
CLOSENESS_LOOKUP_WIDTH          16 (implicit, = pool size)    32 (explicit constant)
CANDIDATE_CLOSENESS_REQUIRED    13                            13 (unchanged)
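
For orientation, a hedged sketch of how these knobs might read in src/payment/verifier.rs (names and values from the table; the surrounding code, including the Duration import, is assumed):

    use std::time::Duration;

    const CANDIDATES_PER_POOL: usize = 16; // protocol constant, stubbed here
    const CLOSENESS_LOOKUP_TIMEOUT: Duration = Duration::from_secs(240); // was 60
    const CLOSENESS_LOOKUP_WIDTH: usize = 32; // was an implicit 16 (= pool size)
    const CANDIDATE_CLOSENESS_REQUIRED: usize = 13; // unchanged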

Why A — bump timeout 60s → 240s

The iterative DHT lookup runs MAX_ITERATIONS = 20 with ALPHA = 3 and a 15s per-RPC request_timeout. On a young network, iterations average ~10s each, because dial cascades to NAT-simulated and unresponsive peers consume the full per-RPC budget. Captured trace from STG-01 EWR-3 ant-node-1 immediately before a pre-fix timeout:

Iter 0: +0.0s  | Iter 1: +0.2s  | Iter 2: +6.6s  | Iter 3: +13.1s
Iter 4: +20.9s | Iter 5: +39.8s | Iter 6: +50.8s | [60s wall]

The lookup was making real forward progress through 7 of 20 iterations and then hit the wall. 240s gives ~1.2× headroom over the ~200s natural worst-case runtime on a 1k-node testnet.

DoS amplification stays bounded by the per-pool single-flight cache (closeness_pass_cache + inflight_closeness): one in-flight lookup per pool_hash regardless of concurrency.
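
The headroom arithmetic, spelled out as a back-of-envelope sketch (numbers from the trace above, not repo code):

    const MAX_ITERATIONS: u64 = 20;
    const AVG_ITERATION_SECS: u64 = 10; // young-network average under dial cascades
    const NATURAL_WORST_CASE_SECS: u64 = MAX_ITERATIONS * AVG_ITERATION_SECS; // 200s
    // 240s / 200s = 1.2x headroom; the old 60s wall admitted only ~7 of 20 iterations.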

Why G — widen storer's lookup K=16 → K=32

The client's get_merkle_candidate_pool over-queries 2 * CANDIDATES_PER_POOL = 32 peers via find_closest_peers(addr, 32), then picks the 16 closest valid responders by XOR distance. Legitimate pools therefore routinely include peers from positions 17–32 of the network's true ranking when the closer peers are slow or NAT-stuck. The storer was fetching only top-16, so it never even observed those legitimate candidates and rejected the pool with no security benefit. Widening the storer to K=32 lets its view be a superset of the client's selection space.

The client code itself documents this exact failure mode in ant-client/ant-core/src/data/client/merkle.rs:455-461:

the storing-node verifier re-runs a network closest-peers lookup of the pool midpoint and rejects the pool if fewer than 13 of the 16 candidate pub_keys appear in that authoritative close-set. Pools built from the fastest-to-respond quoters fail this check whenever truly-close peers are slower (NAT/relay paths) than farther peers.
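
To make the superset argument concrete, here is a self-contained sketch of the membership check with integers standing in for PeerIds (a hypothetical helper; the real logic lives in src/payment/verifier.rs):

    use std::collections::HashSet;

    /// Accept a pool iff at least `required` of its candidates appear in the
    /// storer's authoritative top-k close-set (position 1 = closest).
    fn pool_accepted(candidates: &[u32], storer_top_k: &[u32], required: usize) -> bool {
        let close: HashSet<&u32> = storer_top_k.iter().collect();
        candidates.iter().filter(|c| close.contains(c)).count() >= required
    }

    // A legitimate candidate at true position 17 is inside a K=32 storer's
    // close-set but invisible to a K=16 storer, which is the failure described above.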

Empirical effect (STG-01, 5-min window)

Metric                                   60s baseline                             240s + K=32
Storer-side lookup-timeout rejections    dominant                                 0
Client-side closeness mismatches         115                                      31 (73% reduction)
Storer-replication mismatches (ES)       78                                       reduced
Lookup completes?                        no, cuts off after 7 of 20 iterations    yes, ≤200s natural runtime

Security argument

CANDIDATE_CLOSENESS_REQUIRED (13/16) is unchanged. The pay-yourself attack still requires the attacker's fabricated PeerIds to land in the storer's authoritative top-K. K=32 vs K=16 doubles the attacker's target window, making grinding roughly one bit cheaper, but the dominant cost remains Sybil-grinding midpoint addresses or running real nodes near the target; the security floor is the same.
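
Back-of-envelope for the bit claim, under an illustrative uniform-grinding model that is an assumption of this sketch, not part of the PR:

    /// Expected grinds to land one fabricated PeerId in the storer's top-k of an
    /// n-node network, assuming ground PeerIds fall uniformly over the keyspace.
    fn expected_grinds(n: f64, k: f64) -> f64 {
        n / k
    }
    // expected_grinds(n, 32.0) == expected_grinds(n, 16.0) / 2.0 for any n:
    // the attack gets one bit cheaper while the 13/16 threshold stays put.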

DoS bound is unchanged: the per-pool single-flight cache ensures one in-flight lookup per pool_hash regardless of how many concurrent verifications hit a single storer.

Compatibility: no protocol or wire-format change. Old clients work with new storers (their proofs verify within the wider window) and new clients work with old storers (the existing 16-window storers still accept the same pools they did before).

Tests

Four new constant-pinning tests + one compile-time const-assert in the existing closeness test module (a hedged sketch of their shape follows the list):

  • closeness_lookup_timeout_is_240s
  • closeness_lookup_width_is_32
  • closeness_required_threshold_unchanged_at_13
  • const _: () = assert!(WIDTH >= CANDIDATES_PER_POOL) (compile-time invariant)
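
A hedged sketch of their likely shape (test names from the list above; exact assertion bodies are assumed):

    // Compile-time invariant: the storer never looks narrower than the pool.
    const _: () = assert!(CLOSENESS_LOOKUP_WIDTH >= CANDIDATES_PER_POOL);

    #[test]
    fn closeness_lookup_timeout_is_240s() {
        assert_eq!(CLOSENESS_LOOKUP_TIMEOUT, Duration::from_secs(240));
    }

    #[test]
    fn closeness_lookup_width_is_32() {
        assert_eq!(CLOSENESS_LOOKUP_WIDTH, 2 * CANDIDATES_PER_POOL);
    }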

These pin both knobs so any future patch silently changing them breaks CI with a one-line failure pointing back to this PR's empirical evidence.

Existing 47/47 verifier tests + 452/452 lib tests pass. Clippy clean (--all-features --all-targets -D warnings). Format clean.

Test plan

  • cargo test --lib (452 passed)
  • cargo clippy --all-features --all-targets -- -D warnings
  • cargo fmt --all -- --check
  • CI green
  • Adversarial review (security argument + correctness)
  • Re-deploy to a fresh testnet and re-measure (STG-01 was decommissioned mid-investigation; will re-validate when the next testnet is up)

Copilot AI review requested due to automatic review settings May 7, 2026 03:55

Copilot AI left a comment


Pull request overview

This PR adjusts the merkle-payment “candidate closeness” verification in PaymentVerifier to better tolerate slower/under-warm networks by (1) increasing the authoritative DHT lookup timeout and (2) widening the lookup window so the storer verifies against the same candidate-selection space the client uses.

Changes:

  • Increased CLOSENESS_LOOKUP_TIMEOUT from 60s to 240s for iterative DHT lookups.
  • Introduced CLOSENESS_LOOKUP_WIDTH = 32 and updated the closeness lookup to query at least that many peers.
  • Added regression tests that pin the timeout/width and assert invariants around these constants.


Comment thread src/payment/verifier.rs Outdated
Comment on lines +637 to +641
/// over the ~200s natural worst-case runtime on a 1k-node testnet
/// while still capping `DoS` amplification at roughly one bounded
/// lookup per forged `pool_hash` (the per-pool single-flight cache
/// at `closeness_pass_cache` + `inflight_closeness` ensures at most
/// one in-flight lookup per `pool_hash` regardless of concurrency).
Collaborator Author

Addressed in commit 8a91f7a: tightened the doc to acknowledge that inflight_closeness is an LRU and a flood of unique pool_hashes can evict an in-flight slot, allowing a second leader to race for the same pool. The new wording calls out the bound as "typical" and references InflightGuard::drop for the existing caveat. A global semaphore would be a separate, larger concern (would also affect the existing replication path) — flagging for a follow-up.

Comment thread src/payment/verifier.rs Outdated
Comment on lines +664 to +668
/// K=32 doubles the window vs K=16 (≈1 extra bit of grinding), but
/// the dominant cost is still Sybil-grinding midpoint addresses or
/// running real nodes near the target — same security floor.
/// `CANDIDATE_CLOSENESS_REQUIRED` (13/16) is unchanged.
const CLOSENESS_LOOKUP_WIDTH: usize = 32;
Collaborator Author

Addressed in commit 1f95c6b: CLOSENESS_LOOKUP_WIDTH is now defined as 2 * evmlib::merkle_payments::CANDIDATES_PER_POOL so it tracks the protocol's over-query factor automatically. The pinning test still asserts equality with 2 * CANDIDATES_PER_POOL — same value, but the constant is now derived rather than coincidental.
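
That is, roughly (a sketch; the evmlib path is as quoted in the comment, everything else assumed):

    const CLOSENESS_LOOKUP_WIDTH: usize =
        2 * evmlib::merkle_payments::CANDIDATES_PER_POOL; // still 32 today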

grumbach added 2 commits May 7, 2026 13:04
…elper

Addresses adversarial-review findings on PR WithAutonomi#89.

Correctness fixes
   - MAX_LEADER_RETRIES: 4 -> 1. Worst-case waiter wall-clock under a
     leader-cancellation cascade was (4+1) * CLOSENESS_LOOKUP_TIMEOUT
     = 5 * 240s = 20min user-visible. Capped at 2 * 240s = 8min by
     reducing the retry budget. The only realistic trigger is leader
     future-cancellation which should be extraordinarily rare;
     adversarial cancellation does not benefit from a higher cap (it
     just hides the symptom).
   - Doc note on CLOSENESS_LOOKUP_TIMEOUT: acknowledge that
     inflight_closeness is an LRU and a flood of unique pool_hash
     entries can evict an in-flight slot, allowing a second leader to
     race for the same pool. Previously the doc claimed a hard
     'one in-flight per pool_hash' bound; the InflightGuard::drop
     comment already documented this caveat. Aligns the public-facing
     doc with reality.

Testability
   - Extract closeness_lookup_count(pool_len) -> usize as a const fn so
     the WIDTH.max(pool_len) formula can be exercised in unit tests
     without standing up a DhtNetworkManager. Adds
     closeness_lookup_count_uses_max_of_width_and_pool_len pinning all
     three behaviours: standard (16-pool -> 32 lookup), future-proof
     (64-pool -> 64 lookup), lower-bound (any short pool -> WIDTH).

Doc clarifications
   - Performance note on CLOSENESS_LOOKUP_WIDTH: count does not just
     truncate the lookup; saorsa-core's stagnation rule keeps iterating
     until best_nodes.len() >= count. K=32 can extend lookups by a few
     iterations vs K=16, which reinforces the timeout bump rather than
     undermining it.
   - Tighten .max(...) doc to acknowledge that pool.candidate_nodes is
     a fixed-size array today, so the .max() form is belt-and-braces
     for a hypothetical Vec-based future protocol bump rather than
     doing useful work today.

Tests
   - 48/48 verifier tests pass (1 new). Clippy clean, format clean.
   - The deeper behavioural gaps (proving 80s lookup succeeds at 240s,
     proving 16/16-at-position-17 pool accepts at K=32) require a
     trait seam on DhtNetworkManager that doesn't exist yet — flagged
     for a follow-up PR.
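
A hedged sketch of the helper described under Testability above (name and formula from the commit message; the explicit branch is assumed, since Ord::max is not callable in a const fn):

    /// At least WIDTH peers; a hypothetical future pool larger than WIDTH
    /// widens the lookup instead of truncating it.
    const fn closeness_lookup_count(pool_len: usize) -> usize {
        if pool_len > CLOSENESS_LOOKUP_WIDTH {
            pool_len // future-proof: a 64-pool drives a 64-peer lookup
        } else {
            CLOSENESS_LOOKUP_WIDTH // standard 16-pool and any short pool: 32
        }
    }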
…fixed

Refactor the matched-count check out of verify_merkle_candidate_closeness_inner
into a pure-logic helper check_closeness_match(candidate_ids, network_ids,
pool_address) and add 7 new regression tests against synthetic peer-ID
sets. The helper is a 1:1 extraction — same sparse-DHT short-circuit,
same set-membership check, same error strings. Production call site
goes through the helper; only the network-lookup wiring stays in the
async outer function.

This unlocks unit-test coverage for the original STG-01 failure modes
without standing up a real DHT, which previously required either
integration tests or a saorsa-core trait extraction.

New regression tests (all under #[cfg(test)] in the existing module):

  closeness_match_passes_when_all_16_candidates_in_top_16
    Happy path: trivial all-match case still works post-refactor.

  closeness_match_passes_when_candidates_span_positions_1_to_15_and_17
    A pool with 15 candidates at network-true positions 1..=15 plus
    one at position 17 verifies cleanly under K=32 — the simplest
    'one peer in the 17–32 window' shape.

  closeness_match_fails_at_k_16_passes_at_k_32_for_honest_skew
    Headline regression test: candidates at positions {1..=12, 17,
    19, 21, 23} are REJECTED at K=16 (the original STG-01 bug) and
    ACCEPTED at K=32 (the fix). Asserts both directions in one test; a
    standalone sketch follows this commit message.

  closeness_match_rejects_forged_pool_at_k_32
    Security floor: a pool with 16 candidates fully disjoint from the
    network's top-32 must still be rejected at K=32. Pins the
    pay-yourself defence.

  closeness_match_rejects_pool_at_exactly_12_of_16_match
  closeness_match_accepts_pool_at_exactly_13_of_16_match
    Threshold sanity at the 13/16 boundary, in both directions.

  closeness_match_returns_sparse_dht_error_when_lookup_too_small
    The sparse-DHT short-circuit fires correctly when network_peers
    falls below CANDIDATE_CLOSENESS_REQUIRED.

460/460 lib tests pass (was 452). Clippy clean
(--all-features --all-targets -D warnings) — added clippy::panic to
the test-module allow list alongside the existing expect_used (panic
is appropriate for asserting unexpected match arms in tests). Format
clean.

The refactor preserves all existing behaviour by construction: the
new check_closeness_match is the same code that was inlined in
verify_merkle_candidate_closeness_inner, just reachable from tests.
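
The headline K=16-vs-K=32 case from that list, as a self-contained sketch (synthetic integer positions stand in for PeerIds; the real test goes through check_closeness_match):

    use std::collections::HashSet;

    /// Does `pool` place at least `required` members in the top-`k` true
    /// network positions (1 = closest)? Synthetic stand-in for the helper.
    fn matches(pool: &[u32], k: u32, required: usize) -> bool {
        let top_k: HashSet<u32> = (1..=k).collect();
        pool.iter().filter(|&p| top_k.contains(p)).count() >= required
    }

    #[test]
    fn honest_skew_fails_at_k16_passes_at_k32() {
        let mut pool: Vec<u32> = (1..=12).collect();
        pool.extend([17, 19, 21, 23]); // honest peers from the 17..=32 window
        assert!(!matches(&pool, 16, 13)); // the original STG-01 rejection
        assert!(matches(&pool, 32, 13)); // accepted once the storer widens to 32
    }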
Copilot AI review requested due to automatic review settings May 7, 2026 04:13
…R_POOL

Per Copilot review on PR WithAutonomi#89: defining the constant as a literal 32
duplicates the relationship the pinning test enforces. Define it as
'2 * CANDIDATES_PER_POOL' so the constant tracks the protocol's
over-query factor automatically. The pinning test still asserts
equality with '2 * CANDIDATES_PER_POOL' — same value, just the
constant is now derived rather than coincidental.

No functional change. Tests still 15/15 pass for closeness, lib full
suite still 460/460. Clippy + fmt clean.

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment thread src/payment/verifier.rs
Comment on lines +691 to +695
/// cancellation a higher cap doesn't add resilience, it just hides
/// the symptom. With `CLOSENESS_LOOKUP_TIMEOUT = 240s` this caps a
/// single user-visible verification at ~8 min worst case (vs ~20 min
/// at the previous value of 4).
const MAX_LEADER_RETRIES: usize = 1;
Comment thread src/payment/verifier.rs
Comment on lines +2757 to +2761
#[test]
fn closeness_lookup_count_uses_max_of_width_and_pool_len() {
// The honest case: a 16-candidate pool must trigger a 32-peer
// network lookup. This is the K=16-rejects-honest-pool fix from
// the STG-01 investigation — without it, the storer never
grumbach added a commit to grumbach/ant-node that referenced this pull request May 7, 2026
The new bad-binding regression tests use panic!() in match arms to flag
unexpected outcomes — same shape as the existing tests on PR WithAutonomi#89. The
workspace clippy config has -D clippy::panic, so the test module needs
the explicit allow alongside the existing clippy::expect_used.

Fixes the Clippy CI failure on PR WithAutonomi#90.
@jacderida jacderida changed the base branch from main to rc-2026.5.1 May 7, 2026 15:38
@jacderida jacderida merged commit ada65b3 into WithAutonomi:rc-2026.5.1 May 7, 2026
18 checks passed