Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Oct 2, 2025

Summary

Fixes a race condition where non-gateway peers could receive and attempt to process client operations (PUT, GET, etc.) before completing their initial network handshake and having their peer_id set.

This issue was exposed by PR #1898 (removal of legacy client management), which changed the initialization order such that the SessionActor starts earlier, allowing clients to connect before the peer is fully initialized.

Changes

Core Implementation

  • Added peer_ready flag to OpManager: Arc<AtomicBool> that tracks when a peer is ready to process client operations

    • Gateways: Set to true immediately (peer_id set from config)
    • Non-gateway peers: Set to false initially, becomes true after first successful handshake
  • Updated HandshakeHandler:

    • Added peer_ready field to track readiness state
    • Sets peer_ready to true after successfully calling try_set_peer_key() (handshake.rs:302-308)
  • Added readiness check in client operations:

    • Before processing PUT operations, non-gateway peers check peer_ready flag (client_events/mod.rs:407-419)
    • Gateways bypass this check (always ready)
    • Returns clear error message if client attempts operation before peer is ready

Key Files Modified

  • crates/core/src/node/op_state_manager.rs - Added peer_ready and is_gateway fields
  • crates/core/src/node/network_bridge/handshake.rs - Set peer_ready after handshake
  • crates/core/src/node/network_bridge/p2p_protoc.rs - Pass peer_ready to HandshakeHandler
  • crates/core/src/client_events/mod.rs - Check peer_ready before operations

Why This Fix is Safe

  1. Gateway compatibility: Gateways are completely unaffected

  2. Non-gateway peer behavior:

    • Only affects the specific race condition window
    • Clear, actionable error message if hit
    • No performance impact (single atomic load)
  3. No conflicts with PR Remove legacy actor client management #1898:

Test Results

Multi-machine test passed: Local 2-peer River integration test

  • Gateway startup: 1.07s
  • Peer connection: 7.05s
  • Room creation: 0.12s
  • Message exchange verified between both users
  • No "peer id not found" errors

Full test output available in test-results/river-test-20251002-172303/

Related Issues

Implementation Notes

Per @iduartgomez and @netsirius guidance:

[AI-assisted debugging and comment]

This fixes a race condition where clients could connect and send operations
(PUT, GET, etc.) before a non-gateway peer completes its initial network
handshake and has its peer_id set.

Changes:
- Add peer_ready AtomicBool to OpManager (true for gateways, false initially for peers)
- Add peer_ready to HandshakeHandler to set flag after first successful handshake
- Check peer_ready before processing client operations (non-gateways only)
- Gateways are unaffected (always ready since peer_id set from config)

This ensures non-gateway peers only accept client operations after establishing
their network identity, preventing "peer id not found" errors.

Addresses race condition exposed by PR #1898 (removal of legacy client management).
Per @iduartgomez and @netsirius guidance: implemented as separate fix to avoid
conflicts with PR #1898, which will be rebased after this is merged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@iduartgomez iduartgomez enabled auto-merge October 2, 2025 17:34
@iduartgomez iduartgomez added this pull request to the merge queue Oct 2, 2025
Merged via the queue into main with commit 1fd1be3 Oct 2, 2025
8 checks passed
@iduartgomez iduartgomez deleted the fix/peer-id-race-condition branch October 2, 2025 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants