Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 24, 2025

Problem

test_three_node_network_connectivity was flaking on main (see CI run 19622111181) with only one live connection and PUT timeouts. The gateway sometimes mis-identified inbound peers when multiple peers shared the same IP because the inbound expectation tracker fell back to any entry with the same IP. That caused peer1 dials to be recorded as peer2, leaving peer1 disconnected.

Solution

  • Track expected inbound connections by full socket address (no IP-wide fallback) and log matches by socket.
  • Add a regression test to ensure peers on the same IP stay distinct.
  • Keep deterministic locations in the three-node test to make failures reproducible.

Testing

  • cargo test -p freenet tracker_keeps_peers_separate_on_same_ip -- --nocapture
  • Baseline before fix: 1/5 failures (logs: /tmp/three-node-pre-1.log; passes: /tmp/three-node-pre-[2-5].log)
  • After fix: 5/5 passes (logs: /tmp/three-node-post-*.log) plus manual run with --nocapture

@sanity
Copy link
Collaborator Author

sanity commented Nov 24, 2025

@claude Please review this PR

@claude
Copy link

claude bot commented Nov 24, 2025

Claude finished @sanity's task —— View job


PR Review for fix-three-node-flake

Tasks

  • Read changed files
  • Analyze handshake.rs changes
  • Review test additions
  • Provide comprehensive feedback

Review Summary

This PR effectively fixes the root cause of the three-node connectivity flake. The solution is clean, well-tested, and properly scoped.

✅ Strengths

  1. Correct root cause identification: The issue was clearly diagnosed - the IP-wide fallback in ExpectedInboundTracker caused peer misidentification when multiple peers shared an IP address.

  2. Clean simplification: The change from HashMap<IpAddr, Vec<ExpectedInbound>> to HashMap<SocketAddr, ExpectedInbound> removes unnecessary complexity:

    • Eliminates the per-IP vector management
    • Removes the problematic fallback logic (matched by IP fallback)
    • Makes lookups O(1) instead of O(n) per IP
  3. Excellent test coverage: The new tracker_keeps_peers_separate_on_same_ip unit test directly validates the fix and would have caught the original bug.

  4. Improved logging: The debug message now says "matched by socket address" instead of ambiguous "matched by exact port" vs "matched by IP fallback" messages.

  5. Deterministic test setup: Using node_locations_fn = fixed_three_node_locations makes failures reproducible, which is critical for debugging flakes.

🔍 Code Analysis

handshake.rs changes (crates/core/src/node/network_bridge/handshake.rs:127-173):

  • ✅ The simplified register() method correctly replaces entries by full socket address
  • drop_peer() is now a simple HashMap remove - no edge cases
  • consume() eliminates the dual-mode matching (exact port vs IP fallback)
  • ✅ The #[cfg(test)] helper contains() correctly checks by socket address

connectivity.rs changes (crates/core/tests/connectivity.rs:11-14, 262, 281-284):

  • fixed_three_node_locations() provides deterministic locations (0.1, 0.5, 0.9)
  • ✅ The debug output in the test helps with troubleshooting

🤔 Minor Observations

  1. Unused import removed: The change removes use std::net::IpAddr which was only needed for the old IP-keyed HashMap. Good cleanup.

  2. Semantics preserved: The comment "Replace any existing expectation for the same socket to ensure the newest registration wins" correctly describes the new behavior, which maintains the same overwrite semantics as before (just simpler).

  3. Race condition behavior unchanged: The original code had last-write-wins semantics for the same socket, and this is preserved. The tracker is only accessed from the single-threaded run_driver task, so no additional synchronization is needed.

💡 Questions / Suggestions

  1. Should there be a warning for unexpected inbound connections? Currently, when consume() returns None (no registered expectation), the code silently accepts the connection with peer=None, transaction=None, transient=false (handshake.rs:212-214). Consider logging a warning for unexpected inbound connections to aid debugging:

    let entry = expected_inbound.consume(remote_addr);
    if entry.is_none() {
        tracing::warn!(%remote_addr, "Received unexpected inbound connection (no matching expectation)");
    }
  2. Test naming: The test tracker_keeps_peers_separate_on_same_ip is clear, but technically both test peers use IpAddr::LOCALHOST with different ports. The name is accurate since they're on the same IP.

🎯 Testing Validation

The PR description mentions:

  • Baseline: 1/5 failures before fix
  • After fix: 5/5 passes plus manual --nocapture run

This is solid evidence the fix works. The test is now deterministic and reproducible.

✅ Verdict

LGTM - This is a high-quality fix that:

  • Solves the root cause with a simplification rather than adding complexity
  • Has excellent test coverage (both unit and integration)
  • Improves code maintainability
  • Makes failures reproducible

The only suggestion is the optional logging enhancement for unexpected inbound connections, but that's not blocking.

--- | Branch: fix-three-node-flake

Copilot finished reviewing on behalf of sanity November 24, 2025 21:18
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in three-node connectivity tests where peers sharing the same IP address were being mis-identified during inbound connection handshakes. The gateway's expectation tracker previously fell back to any peer with a matching IP, causing peer1 connections to be incorrectly recorded as peer2.

Key Changes:

  • Refactored ExpectedInboundTracker to use exact socket address matching instead of IP-based lookup with port fallback
  • Added regression test ensuring peers on the same IP remain distinct during handshake
  • Introduced deterministic node locations for reproducible three-node test failures

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
crates/core/src/node/network_bridge/handshake.rs Replaced IP-based tracking (HashMap<IpAddr, Vec<ExpectedInbound>>) with exact socket tracking (HashMap<SocketAddr, ExpectedInbound>), simplified register/drop_peer/consume methods, removed IP fallback logic, and added regression test
crates/core/tests/connectivity.rs Added fixed_three_node_locations() helper for deterministic ring positions and configured test_three_node_network_connectivity to use it for reproducible failures

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sanity
Copy link
Collaborator Author

sanity commented Nov 24, 2025

Thanks for the thorough review! I added a warning when we get an inbound connection without a matching expectation (socket lookup still handles the peer correctly). Patch is in the latest push. 🙌}

@sanity sanity enabled auto-merge November 24, 2025 21:30
@sanity sanity added this pull request to the merge queue Nov 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 24, 2025
@sanity sanity added this pull request to the merge queue Nov 24, 2025
@sanity sanity removed this pull request from the merge queue due to a manual request Nov 24, 2025
@sanity sanity merged commit 0d73345 into main Nov 24, 2025
12 checks passed
@sanity sanity deleted the fix-three-node-flake branch November 24, 2025 22:30
sanity added a commit that referenced this pull request Nov 24, 2025
…cture

PR #2136 changed ExpectedInboundTracker from HashMap<IpAddr, Vec<ExpectedInbound>>
to HashMap<SocketAddr, ExpectedInbound>. The transactions_for() test helper was
added in this branch before that change and wasn't updated during the merge.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants