Skip to content

feat: Kademlia DHT lookup for closest-peer chunk routing#8

Merged
mickvandijke merged 3 commits intomainfrom
feat/kademlia-chunk-routing
Feb 12, 2026
Merged

feat: Kademlia DHT lookup for closest-peer chunk routing#8
mickvandijke merged 3 commits intomainfrom
feat/kademlia-chunk-routing

Conversation

@mickvandijke
Copy link
Copy Markdown
Collaborator

Summary

  • Replace naive first-connected-peer selection with iterative Kademlia network queries (find_closest_nodes) so chunks are routed to the peer whose BLAKE3-derived DHT key is closest to the chunk address by XOR distance
  • Exclude the local node from results by comparing against its own BLAKE3 DHT key hex

This is still a WIP — further testing and refinement needed before merging.

Test plan

  • Verify chunk routing targets the closest peer by XOR distance
  • Confirm local node is excluded from routing candidates
  • E2E test with multi-node setup
  • Run full test suite (cargo test)

🤖 Generated with Claude Code

mickvandijke and others added 2 commits February 10, 2026 21:18
Replace naive first-connected-peer selection with iterative Kademlia
network queries (find_closest_nodes) so chunks are routed to the peer
whose BLAKE3-derived DHT key is closest to the chunk address by XOR
distance. Excludes the local node from results by comparing against
its own BLAKE3 DHT key hex.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pick_target_peer was comparing DHT results (transport-level peer IDs)
against the app-level peer ID, which never matched. This caused the
client to route chunks to itself and timeout after 30s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mickvandijke mickvandijke marked this pull request as ready for review February 11, 2026 16:47
Copilot AI review requested due to automatic review settings February 11, 2026 16:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Kademlia-based closest-peer selection for chunk routing, replacing naive “first connected peer” targeting.

Changes:

  • Adds iterative DHT lookup (find_closest_nodes) for selecting a routing peer near a chunk address.
  • Introduces xor_distance and peer_id_to_xor_name helpers (with unit tests).
  • Updates public re-exports to expose the new helpers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/lib.rs Re-exports new client helper functions for external use.
src/client/quantum.rs Switches peer selection to Kademlia closest-nodes lookup and excludes local peer.
src/client/mod.rs Re-exports new data-type utilities from data_types.
src/client/data_types.rs Adds XOR-distance + peer-id conversion helpers and unit tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +340 to +348
let closest = closest_nodes
.into_iter()
.next()
.ok_or_else(|| Error::Network("No connected peers available".into()))
.find(|n| {
n.peer_id != *local_peer_id
&& local_transport_id
.as_ref()
.map_or(true, |tid| n.peer_id != *tid)
})
.ok_or_else(|| Error::Network("No remote peers found near target address".into()))?;
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find(...) selects the first non-local entry from closest_nodes, but this assumes the returned list is already sorted by XOR distance (and in the same ordering you want). If find_closest_nodes returns an unsorted set (or a differently-sorted set), this can select a non-closest peer, contradicting the function docstring. Consider explicitly selecting the minimum-distance node from the filtered set (compute XOR distance to target and choose the smallest), or document/assert that the DHT API guarantees stable closest-first ordering.

Copilot uses AI. Check for mistakes.
Comment on lines +331 to +333
let local_peer_id = node.peer_id();
let local_transport_id = node.transport_peer_id();

Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self-exclusion logic relies on n.peer_id being directly comparable to both node.peer_id() and node.transport_peer_id(). If find_closest_nodes returns records keyed by a different identifier (e.g., DHT node ID vs transport peer ID string formatting), this filter may fail to exclude the local node. Prefer comparing the same canonical identifier type/bytes on both sides (or use a dedicated ‘is local’/node-id equality if available from the DHT/node APIs).

Copilot uses AI. Check for mistakes.
Comment on lines +342 to +347
.find(|n| {
n.peer_id != *local_peer_id
&& local_transport_id
.as_ref()
.map_or(true, |tid| n.peer_id != *tid)
})
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self-exclusion logic relies on n.peer_id being directly comparable to both node.peer_id() and node.transport_peer_id(). If find_closest_nodes returns records keyed by a different identifier (e.g., DHT node ID vs transport peer ID string formatting), this filter may fail to exclude the local node. Prefer comparing the same canonical identifier type/bytes on both sides (or use a dedicated ‘is local’/node-id equality if available from the DHT/node APIs).

Copilot uses AI. Check for mistakes.
Comment on lines +36 to +37
/// Number of closest peers to consider for chunk routing.
const CLOSE_GROUP_SIZE: usize = 8;
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLOSE_GROUP_SIZE is a routing-critical constant but is hard-coded. Consider making it configurable via QuantumConfig (or a node/DHT config) so deployments can tune lookup breadth without code changes.

Copilot uses AI. Check for mistakes.
Comment on lines +326 to +330
/// Pick the closest peer to `target` using an iterative Kademlia network lookup.
///
/// Queries the DHT for the `CLOSE_GROUP_SIZE` closest nodes to the target
/// address and returns the single closest remote peer (excluding ourselves).
async fn pick_target_peer(node: &P2PNode, target: &XorName) -> Result<String> {
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new routing behavior (closest-peer selection + local exclusion) is core logic but doesn’t appear to be covered by tests here. Adding a unit test with a mocked/stubbed DHT result set (including the local node and multiple remotes with known distances) would lock in the intended selection semantics and prevent regressions.

Copilot uses AI. Check for mistakes.
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

Replaced naive first-connected-peer selection with Kademlia DHT network queries for chunk routing. The implementation adds XOR distance calculation utilities and self-node filtering, but contains a critical bug in the peer selection logic.

Key changes:

  • Added xor_distance and peer_id_to_xor_name helper functions with comprehensive test coverage
  • Modified pick_target_peer to call find_closest_nodes DHT lookup instead of using connected_peers()
  • Filters local node using both peer_id() and transport_peer_id() for proper self-exclusion
  • Moved address computation before peer selection in put_chunk to enable routing by content address

Critical issue:
The peer selection logic (src/client/quantum.rs:340-348) uses .find() which returns the first non-self peer from the DHT results, not the actual closest peer by XOR distance. The defined xor_distance helper is never used in the routing logic. This means chunks aren't routed to the closest peer as intended - just to the first peer that isn't the local node.

Confidence Score: 2/5

  • Critical logical bug prevents correct Kademlia routing - chunks won't route to closest peer
  • The implementation has a fundamental logic error where .find() is used instead of proper XOR distance sorting, defeating the entire purpose of the Kademlia DHT lookup. While the helper functions are well-tested and the self-filtering logic appears correct, the core routing algorithm doesn't work as described in the PR summary.
  • src/client/quantum.rs requires the peer selection logic to be fixed before merging

Important Files Changed

Filename Overview
src/client/quantum.rs Replaced naive peer selection with Kademlia DHT lookup, but the closest-peer logic uses .find() instead of actual XOR distance sorting
src/client/data_types.rs Added xor_distance and peer_id_to_xor_name helpers with comprehensive tests - implementation looks correct

Sequence Diagram

sequenceDiagram
    participant Client as QuantumClient
    participant Node as P2PNode
    participant DHT as Kademlia DHT
    participant Peer as Target Peer

    Client->>Client: compute_address(content)
    Client->>Node: dht()
    Node-->>Client: DHT handle
    Client->>DHT: find_closest_nodes(target, 8)
    DHT-->>Client: closest_nodes[]
    Client->>Client: filter self from results
    Client->>Client: select first non-self peer
    Note over Client: BUG: Should sort by XOR distance,<br/>currently just uses first peer
    Client->>Peer: ChunkPutRequest/ChunkGetRequest
    Peer-->>Client: ChunkPutResponse/ChunkGetResponse
Loading

Copy link
Copy Markdown

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +340 to +348
let closest = closest_nodes
.into_iter()
.next()
.ok_or_else(|| Error::Network("No connected peers available".into()))
.find(|n| {
n.peer_id != *local_peer_id
&& local_transport_id
.as_ref()
.map_or(true, |tid| n.peer_id != *tid)
})
.ok_or_else(|| Error::Network("No remote peers found near target address".into()))?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.find() returns the first non-self peer, not the closest by XOR distance

The find_closest_nodes DHT call returns nodes sorted by XOR distance, but using .find() just picks the first non-self peer instead of the actual closest. The defined xor_distance helper (line 25 in data_types.rs) is never used.

To select the true closest peer, you need to:

  1. Parse peer IDs to XorNames using peer_id_to_xor_name
  2. Compute XOR distance to target for each peer
  3. Sort by distance and pick the minimum
Suggested change
let closest = closest_nodes
.into_iter()
.next()
.ok_or_else(|| Error::Network("No connected peers available".into()))
.find(|n| {
n.peer_id != *local_peer_id
&& local_transport_id
.as_ref()
.map_or(true, |tid| n.peer_id != *tid)
})
.ok_or_else(|| Error::Network("No remote peers found near target address".into()))?;
let closest = closest_nodes
.into_iter()
.filter(|n| {
n.peer_id != *local_peer_id
&& local_transport_id
.as_ref()
.map_or(true, |tid| n.peer_id != *tid)
})
.filter_map(|n| {
crate::client::peer_id_to_xor_name(&n.peer_id)
.map(|xor| (n, crate::client::xor_distance(&xor, target)))
})
.min_by(|(_, dist_a), (_, dist_b)| dist_a.cmp(dist_b))
.map(|(n, _)| n)
.ok_or_else(|| Error::Network("No remote peers found near target address".into()))?;

Copy link
Copy Markdown
Collaborator

@dirvine dirvine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Thanks @mickvandijke for this contribution! The approach of using Kademlia DHT lookup for closest-peer chunk routing is exactly right. However, there are a few issues that need to be addressed before merging:

🔴 Blocking: Compilation Error

The CI is failing because node.dht() returns Option<&Arc<DhtNetworkManager>>, not &Arc<DhtNetworkManager> directly. You need to handle the Option:

// Current (broken):
let closest_nodes = node
    .dht()
    .find_closest_nodes(target, CLOSE_GROUP_SIZE)
    .await
    .map_err(...)?;

// Fixed:
let dht = node.dht().ok_or_else(|| {
    Error::Network("DHT not available".into())
})?;
let closest_nodes = dht
    .find_closest_nodes(target, CLOSE_GROUP_SIZE)
    .await
    .map_err(|e| Error::Network(format!("Kademlia closest-nodes lookup failed: {e}")))?;

🟢 Good News: XOR Sorting Concern is Invalid

Both Copilot and Greptile raised concerns about not explicitly sorting by XOR distance. I verified that find_closest_nodes in saorsa-core does return results already sorted by XOR distance (see src/dht/core_engine.rs:224: candidates.sort_by(|a, b| a.1.cmp(&b.1))). So your .find() approach is correct - the first non-self peer IS the closest.

📝 Suggestions (non-blocking)

  1. Consider adding a unit test with mocked DHT results to verify the peer selection logic
  2. Consider making CLOSE_GROUP_SIZE configurable via QuantumConfig for deployment flexibility

Once the compilation error is fixed, I'll be happy to approve this. Great work! 🎉

Copy link
Copy Markdown
Collaborator

@dirvine dirvine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Simpler Fix Available! 🎉

I just realized the root cause: the PR uses saorsa-core = "0.10.4" but v0.11.0 (just published) changed the dht() API from returning Option to returning &Arc<DhtNetworkManager> directly.

The Fix

Simply update Cargo.toml:

-saorsa-core = "0.10.4"
+saorsa-core = "0.11.0"

This should make the code compile as-is without any changes to the pick_target_peer logic. The current implementation is correct for v0.11.0!

v0.11.0 was just published from PR #25 (feat/dht-transport-refactor) which cleaned up the DHT API.

cc @davidirvine

The DHT transport refactor in saorsa-core v0.11.0 changed the `dht()`
method to return `&Arc<DhtNetworkManager>` directly instead of `Option`,
which fixes the compilation error in the Kademlia chunk routing code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@dirvine dirvine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved!

I've pushed a commit that bumps saorsa-core to v0.11.0, which fixes the compilation error. The new version has a non-optional dht() API that returns &Arc<DhtNetworkManager> directly.

Changes Made

  • Updated Cargo.toml: saorsa-core = "0.10.4""0.11.0"

Verification

  • cargo check passes
  • cargo clippy -- -D warnings passes
  • ✅ All 138 unit tests pass

Summary

This is a great contribution @mickvandijke! 🎉

  • Correct approach: Using Kademlia DHT lookup for closest-peer chunk routing
  • Correct implementation: find_closest_nodes in saorsa-core already returns results sorted by XOR distance, so .find() correctly selects the closest non-self peer
  • Good tests: The xor_distance and peer_id_to_xor_name helpers have solid unit test coverage

Thanks for contributing to saorsa-node! 🙌

@dirvine
Copy link
Copy Markdown
Collaborator

dirvine commented Feb 11, 2026

Thanks so much for this contribution @mickvandijke! 🙌

The Kademlia-based closest-peer selection is exactly the right approach for chunk routing. Great work on:

  • Adding the XOR distance utilities with comprehensive tests
  • Properly filtering out the local node from routing candidates
  • Moving address computation before peer selection to enable routing by content address

Looking forward to more contributions from you! 🚀

@mickvandijke mickvandijke merged commit 3233cd5 into main Feb 12, 2026
8 of 12 checks passed
@mickvandijke mickvandijke deleted the feat/kademlia-chunk-routing branch February 12, 2026 13:52
mickvandijke added a commit that referenced this pull request Apr 1, 2026
Add unit and e2e tests covering the remaining Section 18 scenarios:

Unit tests (32 new):
- Quorum: #4 fail→abandoned, #16 timeout→inconclusive, #27 single-round
  dual-evidence, #28 dynamic threshold undersized, #33 batched per-key,
  #34 partial response unresolved, #42 quorum-derived paid-list auth
- Admission: #5 unauthorized peer, #7 out-of-range rejected
- Config: #18 invalid config rejected, #26 dynamic paid threshold
- Scheduling: #8 dedup safety, #8 replica/paid collapse
- Neighbor sync: #35 round-robin cooldown skip, #36 cycle completion,
  #38 snapshot stability mid-join, #39 unreachable removal + slot fill,
  #40 cooldown peer removed, #41 cycle termination guarantee,
  consecutive rounds, cycle preserves sync times
- Pruning: #50 hysteresis prevents premature delete, #51 timestamp reset
  on heal, #52 paid/record timestamps independent, #23 entry removal
- Audit: #19/#53 partial failure mixed responsibility, #54 all pass,
  #55 empty failure discard, #56 repair opportunity filter,
  response count validation, digest uses full record bytes
- Types: #13 bootstrap drain, repair opportunity edge cases,
  terminal state variants
- Bootstrap claims: #46 first-seen recorded, #49 cleared on normal

E2e tests (4 new):
- #2 fresh offer with empty PoP rejected
- #5/#37 neighbor sync request returns response
- #11 audit challenge multi-key (present + absent)
- Fetch not-found for non-existent key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants