Collaborator

@sanity sanity commented Nov 26, 2025

Problem

NAT peers cannot know their public address at startup (issue #2156). Previously, they would advertise 127.0.0.1 as their address, which caused connection failures when other peers tried to connect back to them.

The root cause: when a NAT peer starts, it doesn't know its public IP. The existing code stored a placeholder address (127.0.0.1:port), but this placeholder was treated as a real address and propagated through the network.

Previous Approach

An earlier PR (#2088) attempted to fix this by rewriting addresses at the transport boundary - a hacky approach that treated the symptom rather than the cause.

This Solution

Introduce a Joiner enum that explicitly models the two states:

pub(crate) enum Joiner {
    /// Peer that doesn't yet know its public address (NAT peer before discovery)
    Unknown(TransportPublicKey),
    /// Peer with a known address (gateway, or NAT peer after ObservedAddress)
    Known(PeerId),
}

Key insight: If we don't know our external address, we shouldn't store a placeholder - we should model that uncertainty explicitly. This forces correct handling throughout the codebase.

Flow for NAT peers:

  1. NAT peer sends ConnectRequest with joiner: Unknown(pub_key)
  2. Gateway receives request and observes the real source address from UDP
  3. Gateway sends ObservedAddress message with the discovered address
  4. Gateway upgrades the joiner to Known(PeerId) with the correct address
  5. Peer receives ObservedAddress and updates its own address
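
The flow above can be sketched in isolation as follows. This is a minimal illustration, not the code from connect.rs: `TransportPublicKey` and `PeerId` are simplified stand-ins for the real types, and the `addr()` helper is hypothetical.

```rust
use std::net::SocketAddr;

/// Stand-in for the real `TransportPublicKey` (assumption: an opaque key).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct TransportPublicKey(pub String);

/// Stand-in for the real `PeerId`: an address paired with a public key.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PeerId {
    pub addr: SocketAddr,
    pub pub_key: TransportPublicKey,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Joiner {
    /// Peer that doesn't yet know its public address.
    Unknown(TransportPublicKey),
    /// Peer with a known address.
    Known(PeerId),
}

impl Joiner {
    /// Steps 2-4: the gateway attaches the UDP source address it observed.
    pub fn with_observed_address(&self, addr: SocketAddr) -> Self {
        let pub_key = match self {
            Joiner::Unknown(key) => key.clone(),
            Joiner::Known(peer) => peer.pub_key.clone(),
        };
        Joiner::Known(PeerId { addr, pub_key })
    }

    /// Hypothetical helper: an address exists only once the joiner is `Known`.
    pub fn addr(&self) -> Option<SocketAddr> {
        match self {
            Joiner::Unknown(_) => None,
            Joiner::Known(peer) => Some(peer.addr),
        }
    }
}
```

Because `Unknown` carries no address at all, any code that needs one has to handle the `None` case instead of silently using a placeholder.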

Changes:

  • Added Joiner enum in connect.rs with helper methods
  • Changed ConnectRequest.joiner from PeerKeyLocation to Joiner
  • Updated RelayState::handle_request to upgrade joiner when observed_addr is available
  • Added is_gateway parameter to initiate_join_request to determine if peer knows its address
  • Updated all call sites (3 places) to pass is_gateway

Testing

Unit tests: All 244 tests pass

E2E test with technic (NAT peer behind residential NAT) connecting to nova (gateway):

# NAT peer sends Unknown joiner
joiner: Unknown(v6MWKgqHgBpgJJmn)

# Gateway observes address and sends ObservedAddress
ObservedAddress { target: v6MWKgqHgBpgJJmn, address: 136.62.52.28:31337 }

# Peer updates from placeholder to real address
Updating peer address from placeholder to observed address, 
  old_addr: 127.0.0.1:31337, new_addr: 136.62.52.28:31337

Fixes

Closes #2156

[AI-assisted - Claude]

Copilot finished reviewing on behalf of sanity November 26, 2025 20:32
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes NAT peer address discovery by introducing a Joiner enum that explicitly models whether a peer knows its public address. NAT peers start as Unknown (just a public key) and are upgraded to Known (with full PeerId) when a gateway observes their address.

Key changes:

  • Introduces the Joiner enum to distinguish between peers with known vs unknown addresses
  • Gateways now observe and communicate the real source address to NAT peers via ObservedAddress messages
  • NAT peers update their internal address when receiving ObservedAddress

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| crates/core/src/operations/connect.rs | Adds Joiner enum and updates connection logic to handle address discovery |
| crates/core/src/ring/connection_manager.rs | Adds update_peer_address method for NAT peers to update their address |
| crates/core/src/ring/mod.rs | Passes is_gateway flag to initiate_join_request |
| crates/core/src/node/p2p_impl.rs | Passes is_gateway flag to initiate_join_request |
| crates/core/src/node/network_bridge/p2p_protoc.rs | Updates peer initialization logic based on gateway status and NAT configuration |

pub fn with_observed_address(&self, addr: SocketAddr) -> Self {
    match self {
        Joiner::Unknown(key) => Joiner::Known(PeerId::new(addr, key.clone())),
        Joiner::Known(peer_id) => Joiner::Known(PeerId::new(addr, peer_id.pub_key.clone())),
Copilot AI Nov 26, 2025

The Joiner::Known branch creates a new PeerId even if the address hasn't changed. Consider checking if the address is different before creating a new instance to avoid unnecessary allocations: if peer_id.addr == addr { self.clone() } else { Joiner::Known(PeerId::new(addr, peer_id.pub_key.clone())) }

Suggested change
- Joiner::Known(peer_id) => Joiner::Known(PeerId::new(addr, peer_id.pub_key.clone())),
+ Joiner::Known(peer_id) => {
+     if peer_id.addr == addr {
+         self.clone()
+     } else {
+         Joiner::Known(PeerId::new(addr, peer_id.pub_key.clone()))
+     }
+ }

Collaborator Author

Good catch! Applied - avoids unnecessary allocation when address hasn't changed.

let mut this_peer = self.peer_key.lock();
if let Some(ref mut peer) = *this_peer {
    // Only update if current address is a placeholder (localhost)
    if peer.addr.ip().is_loopback() {
Copilot AI Nov 26, 2025

The check is_loopback() only validates the IP, but the comment mentions "placeholder (localhost)" suggesting a specific address. Consider also checking for the specific 127.0.0.1 placeholder or documenting why any loopback address should be treated as a placeholder. This could prevent unintended updates if a peer legitimately uses a non-127.0.0.1 loopback address.

Collaborator Author

The loopback check is correct - we use 127.0.0.1:port as the placeholder when a peer doesn't know its public address. Any loopback IP (127.x.x.x) should be treated as a placeholder since peers shouldn't legitimately have a loopback public address. The comment and code are consistent.
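
As a small illustration of what the loopback check accepts, here is a sketch using only the standard library. The `is_placeholder` function is a hypothetical helper, not a function from this PR; it wraps the same `is_loopback()` call. Note that for a `SocketAddr` the check covers the whole 127.0.0.0/8 range and also the IPv6 loopback `::1`.

```rust
use std::net::SocketAddr;

/// Hypothetical helper mirroring the check discussed above: an address is
/// treated as a placeholder when its IP is a loopback address.
pub fn is_placeholder(addr: &SocketAddr) -> bool {
    addr.ip().is_loopback()
}
```

With this helper, `127.0.0.1:31337` and `127.5.6.7:31337` both count as placeholders, while a routable address such as `136.62.52.28:31337` does not.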

Collaborator

I don't like this placeholder business; it seems a bit hackish. Can't we make it explicit that the public address is not known by encoding it as an Option?

Comment on lines +392 to +401
tracing::warn!(
    joiner = %self.request.joiner,
    "connect: cannot accept joiner without known address"
Copilot AI Nov 26, 2025

This warning is logged when a joiner without a known address reaches the acceptance logic, but this should never happen after the observed_addr upgrade at line 361. This suggests a logic issue: if observed_addr is Some, the joiner is upgraded to Known at line 361, so the else branch at line 391 should be unreachable. Consider adding a debug assertion or restructuring the logic to make this invariant clearer.

Suggested change
- tracing::warn!(
-     joiner = %self.request.joiner,
-     "connect: cannot accept joiner without known address"
+ debug_assert!(
+     false,
+     "connect: joiner should always have a known address after observed_addr upgrade; joiner: {}",
+     self.request.joiner

Collaborator Author

This warning is actually valid and not unreachable. The observed_addr upgrade at line 361 only happens when req.observed_addr is Some. But in early protocol stages (before gateway sends ObservedAddress), the joiner may still be Unknown. This warning helps identify cases where acceptance logic is reached without address discovery - useful for debugging protocol issues.

NAT peers cannot know their public address at startup. Previously they
would advertise 127.0.0.1 which caused connection failures when other
peers tried to connect to them.

This PR introduces a `Joiner` enum with two variants:
- `Unknown(TransportPublicKey)` - NAT peer before address discovery
- `Known(PeerId)` - peer with known address (gateway or after discovery)

When a gateway receives a ConnectRequest with `Unknown` joiner, it:
1. Observes the real public address from the UDP source
2. Sends an ObservedAddress message back to the joiner
3. Upgrades the joiner to `Known` with the discovered address

E2E tested with technic (NAT peer) connecting to nova (gateway):
- Peer sends: `joiner: Unknown(v6MWKgqHgBpgJJmn)`
- Gateway observes and sends ObservedAddress
- Peer updates: `old_addr: 127.0.0.1:31337, new_addr: 136.62.52.28:31337`

Closes #2156

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity sanity force-pushed the fix-2156-nat-address branch from 6123273 to d9dfc64 Compare November 26, 2025 20:39
Collaborator Author

sanity commented Nov 26, 2025

@claude Please carefully review this PR, both at a high level (is it the right approach broadly), and the details.

claude bot commented Nov 26, 2025

Claude finished @sanity's task


Comprehensive PR Review

I've completed a thorough review of PR #2158 covering both the high-level architectural approach and implementation details. Here are my findings:


✅ High-Level Approach: Excellent Design

The introduction of the Joiner enum is architecturally sound and represents a significant improvement over the previous approach. Key strengths:

  1. Type Safety: The enum explicitly models the uncertainty of NAT peer addresses at the type level, making invalid states unrepresentable. This is superior to using placeholder addresses (like 127.0.0.1) which get treated as real addresses.

  2. Clear Protocol Flow: The upgrade path from Unknown(TransportPublicKey) → Known(PeerId) mirrors the actual physical constraints of NAT traversal. A peer behind NAT genuinely doesn't know its public address until observed.

  3. Addresses Root Cause: Unlike PR #2088 ("fix: advertise observed addresses for private joiners"), which rewrote addresses at boundaries (treating the symptom), this solution models the problem domain correctly. The Joiner enum forces correct handling throughout the codebase.

  4. Well-Tested: The test suite is comprehensive, including regression tests for the specific issue (#2141, "ConnectResponse sent to joiner's private address instead of observed external address") and edge cases like unknown joiners reaching acceptance logic.


🔍 Implementation Review: Strong with Minor Concerns

Critical Issues: None Found

The implementation is solid. All 244 tests pass, E2E testing validates the flow, and the code handles the state transitions correctly.

Areas of Concern

1. Placeholder Address Pattern (iduartgomez's concern is valid)

Location: connection_manager.rs:410

pub fn update_peer_address(&self, new_addr: SocketAddr) -> bool {
    // Only update if current address is a placeholder (localhost)
    if peer.addr.ip().is_loopback() { ... }
}

Issue: While the Joiner enum eliminates placeholders in the protocol layer, the connection manager still uses placeholder addresses internally. This creates a semantic mismatch:

  • Protocol layer: "We don't know the address" (modeled explicitly with Joiner::Unknown)
  • Connection manager: "We know the address, it's 127.0.0.1" (placeholder treated as real address)

Recommendation: Consider making PeerId.addr an Option<SocketAddr> or introducing a PeerId variant that can represent "address unknown". This would make the placeholder pattern unnecessary and align both layers.

However, this change would be invasive (affecting many parts of the codebase). If you decide to keep the current approach:

  • ✅ Document why placeholders are used at the connection manager level
  • ✅ Add a comment explaining that the Joiner enum handles protocol-level unknowns while connection manager uses placeholders for internal bookkeeping
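
For comparison, the Option-based alternative could look roughly like the sketch below. This is only an illustration under assumptions: `OwnPeer` and `set_observed_address` are hypothetical names, and the real `PeerId` has a different shape.

```rust
use std::net::SocketAddr;

/// Hypothetical record of this node's own identity, with the address
/// modeled as optional instead of a 127.0.0.1 placeholder.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct OwnPeer {
    pub pub_key: String,
    pub addr: Option<SocketAddr>,
}

impl OwnPeer {
    /// Stores the observed address only if none is known yet.
    /// Returns true when the address was actually set.
    pub fn set_observed_address(&mut self, new_addr: SocketAddr) -> bool {
        if self.addr.is_none() {
            self.addr = Some(new_addr);
            true
        } else {
            false
        }
    }
}
```

In this shape the "is this a placeholder?" question becomes `addr.is_none()`, so no loopback check is needed, at the cost of every reader of `addr` having to handle `None`.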

2. Optimization Opportunity (Already Applied)

Location: connect.rs:186-198

pub fn with_observed_address(&self, addr: SocketAddr) -> Self {
    match self {
        Joiner::Unknown(key) => Joiner::Known(PeerId::new(addr, key.clone())),
        Joiner::Known(peer_id) => {
            // Avoid allocation if address hasn't changed
            if peer_id.addr == addr {
                self.clone()
            } else {
                Joiner::Known(PeerId::new(addr, peer_id.pub_key.clone()))
            }
        }
    }
}

Good: You already applied the copilot suggestion to avoid unnecessary allocations when the address hasn't changed. This is a solid optimization.

3. Warning Log Reachability (copilot concern, but your response is correct)

Location: connect.rs:399-402

} else {
    tracing::warn!(
        joiner = %self.request.joiner,
        "connect: cannot accept joiner without known address"
    );
}

The copilot reviewer suggested this is unreachable after the observed_addr upgrade at line 361. Your response is correct: this branch is reachable when req.observed_addr is None (early protocol stages). The warning is valid for debugging protocol issues.

However, the code could be clearer. Consider adding a comment:

} else {
    // This can happen in early protocol stages before ObservedAddress is sent.
    // Useful for debugging protocol issues where acceptance logic is reached
    // before address discovery completes.
    tracing::warn!(
        joiner = %self.request.joiner,
        "connect: cannot accept joiner without known address"
    );
}

4. Panic Message Clarity (copilot suggestion, but not critical)

Location: p2p_protoc.rs:163

.expect("Gateway must have public_address configured");

The copilot suggestion to make this more specific ("in network_api settings or peer_id") is reasonable but not critical. The current message is clear enough for developers. Low priority.


📋 Code Quality Observations

Strengths

  1. Consistent Naming: Joiner::Unknown and Joiner::Known are self-documenting
  2. Helper Methods: pub_key(), peer_id(), to_peer_key_location(), location() provide ergonomic access
  3. Test Coverage: Excellent regression tests including the specific NAT address issue (#2141, "ConnectResponse sent to joiner's private address instead of observed external address")
  4. Documentation: Inline comments explain the why (e.g., "NAT peers don't know their address until observed")

Areas for Improvement

  1. update_peer_address Semantics: The loopback check is correct but deserves a comment explaining why any loopback IP is treated as a placeholder
  2. Protocol Invariants: Consider adding debug assertions to catch protocol violations (e.g., debug_assert!(joiner.has_known_address()) before operations requiring addresses)
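
A debug assertion of the kind suggested in item 2 might look like the following sketch. The simplified `Joiner` (holding a `String` key or a bare `SocketAddr`), `has_known_address`, and `response_target` are all assumptions made for illustration; whether the real enum exposes such a method is not confirmed here.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for the PR's `Joiner` enum.
pub enum Joiner {
    Unknown(String),
    Known(SocketAddr),
}

impl Joiner {
    /// Hypothetical predicate: true once the address has been discovered.
    pub fn has_known_address(&self) -> bool {
        matches!(self, Joiner::Known(_))
    }
}

/// Hypothetical operation that requires an address. In debug builds the
/// assertion panics on an `Unknown` joiner; in release builds the function
/// falls through and returns `None`.
pub fn response_target(joiner: &Joiner) -> Option<SocketAddr> {
    debug_assert!(
        joiner.has_known_address(),
        "joiner must have a known address before a response is routed"
    );
    match joiner {
        Joiner::Known(addr) => Some(*addr),
        Joiner::Unknown(_) => None,
    }
}
```

Since `debug_assert!` is compiled out in release builds by default, the `None` fallback still matters there.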

🎯 Specific Code Review Notes

connect.rs:731-736 - is_gateway Parameter

let joiner = if is_gateway {
    Joiner::Known(own.peer.clone())
} else {
    Joiner::Unknown(own.peer.pub_key.clone())
};

Correct: Gateways know their address from config (required for them to function), while NAT peers start as Unknown.

Observation: All 3 call sites correctly pass is_gateway from connection_manager.is_gateway(). The threading of this parameter is clean and consistent.

connect.rs:364-375 - Address Upgrade Logic

if let Some(joiner_addr) = self.request.observed_addr {
    if !self.observed_sent {
        self.request.joiner = self.request.joiner.with_observed_address(joiner_addr);
        self.observed_sent = true;
        if let Some(joiner_pkl) = self.request.joiner.to_peer_key_location() {
            actions.observed_address = Some((joiner_pkl, joiner_addr));
        }
    }
}

Well-Designed:

  • Idempotent (only upgrades once via observed_sent flag)
  • Safe (uses Option chaining)
  • Correct (upgrades joiner before sending ObservedAddress)
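
The idempotence property can be checked with a reduced model of the relay state. This is a sketch under assumptions: the types are simplified stand-ins, and this `handle_request` returns only the address to report rather than the real action set.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for the PR's `Joiner` enum.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Joiner {
    Unknown(String),
    Known(SocketAddr),
}

/// Reduced model of the relay state: only the fields the upgrade touches.
pub struct RelayState {
    pub joiner: Joiner,
    pub observed_addr: Option<SocketAddr>,
    pub observed_sent: bool,
}

impl RelayState {
    /// Upgrades the joiner and returns the address to report, at most once.
    pub fn handle_request(&mut self) -> Option<SocketAddr> {
        // No observed address yet: nothing to upgrade, joiner stays as-is.
        let addr = self.observed_addr?;
        if self.observed_sent {
            return None;
        }
        // Upgrade first, then mark the ObservedAddress as sent.
        self.joiner = Joiner::Known(addr);
        self.observed_sent = true;
        Some(addr)
    }
}
```

Calling `handle_request` a second time returns `None` and leaves the joiner untouched, which is the behavior the `observed_sent` flag is meant to guarantee.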

connection_manager.rs:406-433 - Address Update

The logic is correct but could benefit from better documentation per iduartgomez's concern:

/// Updates the address of the peer key.
///
/// This is used when a peer behind NAT learns its actual public address from a gateway
/// via the ObservedAddress message. The peer initially starts with a placeholder address
/// (127.0.0.1) and updates it here when the real address is discovered.
///
/// **Design Note**: While the Joiner enum models "address unknown" at the protocol level,
/// the connection manager uses a placeholder loopback address for internal bookkeeping.
/// This is a pragmatic choice to avoid making PeerId.addr optional throughout the codebase.
/// Any loopback address (127.x.x.x) is treated as a placeholder.

🔄 Response to Existing Review Comments

iduartgomez's Concern: "Placeholder business seems hackish"

Valid observation. The tension is between:

  • Protocol layer (elegant): Joiner enum explicitly models unknown state
  • Connection manager (pragmatic): Uses placeholder 127.0.0.1 to avoid Option<SocketAddr> throughout

Options:

  1. Accept the pragmatism (recommended for this PR): Document the design decision clearly
  2. Refactor PeerId (future work): Make addr optional or create a PeerId variant for unknown addresses

I recommend Option 1 for this PR since:

  • The solution already solves the NAT address discovery bug
  • A full refactor of PeerId would be invasive and high-risk
  • The placeholder pattern is localized to connection manager
  • Proper documentation makes the tradeoff explicit

Suggested follow-up: Create an issue to track "Eliminate placeholder addresses by making PeerId.addr optional" for future consideration.


✅ Final Recommendation

Approve with minor documentation improvements.

This PR is ready to merge with the following optional enhancements:

Required: None (code is functionally correct)

Recommended:

  1. Add documentation to update_peer_address explaining why loopback check covers all placeholder cases
  2. Add a comment at connect.rs:399 clarifying the warning is reachable in early protocol stages
  3. Consider creating a follow-up issue to eliminate placeholders entirely (non-blocking)

Test Coverage: ✅ Excellent

Breaking Changes: None

Security Issues: None

Performance: Improved (optimization applied to avoid unnecessary allocations)


Great work on this PR! The Joiner enum is a clean architectural solution that addresses the root cause rather than patching symptoms. The implementation is solid, well-tested, and ready for production.


Collaborator

@iduartgomez iduartgomez left a comment

This PR seems extremely hackish (like a lot of the connect/handshake changes), but since all of this is being reworked in other PRs, which I guess are based on this one, I will merge.

@iduartgomez iduartgomez added this pull request to the merge queue Nov 29, 2025
Merged via the queue into main with commit ce20db3 Nov 29, 2025
23 checks passed
@iduartgomez iduartgomez deleted the fix-2156-nat-address branch November 29, 2025 06:47
sanity added a commit that referenced this pull request Nov 30, 2025
sanity added a commit that referenced this pull request Nov 30, 2025
…way param

The update_peer_address method was added by PR #2158, which was reverted.
Remove the call and add the missing is_gateway parameter to initiate_join_request.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Linked issue

fix: Peers behind NAT use 127.0.0.1 as advertised address, breaking connectivity

3 participants