Remove legacy actor client management #1898
Conversation
crates/core/src/node/testing_impl.rs (Outdated)

```diff
 /// Starts listening to incoming events. Will attempt to join the ring if any gateways have been provided.
 async fn run_event_listener<NB, UsrEv>(
-    cli_response_sender: contract::ClientResponsesSender,
+    _cli_response_sender: contract::ClientResponsesSender,
```
this can be removed, and maybe the `contract::ClientResponsesSender` type too?
**Review Comments**

This PR successfully removes legacy client management code and consolidates on actor-based client management. The changes are well-structured and comprehensive. A few observations:

**Strengths:**

**Observations:**

**Testing:**

The changes look solid and achieve the stated goal of making actor-based client management the default and only mode.

[AI-assisted debugging and comment]
Claude finished @iduartgomez's task

**Investigation Complete: Issues Found in PR #1898**

I've thoroughly analyzed the PR against issue #1836 to identify what went wrong. Here's the complete analysis.

**Summary**

The PR successfully removed the legacy client management system and made actor-based client management the only mode. However, outdated documentation gives the false impression that legacy mode still exists, which is causing confusion during code review.

**Issues Found**

**1. Outdated Documentation ❌ NEEDS FIX**

Location: The documentation still describes a "dual-mode" system:

```rust
/// This function operates in one of two modes based on `op_manager.actor_clients`:
///
/// **Actor Mode** (default, `actor_clients=true`):
/// - Uses ResultRouter → SessionActor...
///
/// **Legacy Mode** (`actor_clients=false`):
/// - Uses direct client callbacks...
```
Problem: This is completely outdated. There is no `actor_clients` flag anymore; actor mode is the only mode.

Impact: This misleading documentation is likely what caused the reviewer's confusion about "still passing options."
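A corrected doc comment might read roughly like this (a sketch of the intent, not the author's final wording):

```rust
/// Routes operation results back to connected clients.
///
/// All client communication flows through the actor-based pipeline
/// (ResultRouter -> SessionActor); the legacy callback mode and the
/// `actor_clients` flag no longer exist.
```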
**2. Unused Sender in testing_impl.rs ✅ Actually Correct**

Location:

```rust
let (client_responses, _cli_response_sender) = contract::client_responses_channel();
```

The reviewer asked: "this can be removed, and maybe the `contract::ClientResponsesSender` type too?"

Analysis:

Verdict: This is actually correct as-is. The unused sender is expected in testing mode.

**3. Optional Types Successfully Removed ✅ Correctly Fixed**

The PR correctly removed the `Option` wrappers:
```rust
// Before: pub result_router_tx: Option<mpsc::Sender<(Transaction, HostResult)>>,
// After:  pub result_router_tx: mpsc::Sender<(Transaction, HostResult)>,

// Before: message_processor: Option<Arc<MessageProcessor>>,
// After:  message_processor: Arc<MessageProcessor>,
```

Status: ✅ Correctly fixed - these are now required (non-optional) because actor mode is always enabled.

**What Went Wrong in This PR?**

The core implementation is actually correct! What went wrong is:
The reviewer's comment "I still see us passing an option here" was likely because:
**What the PR Did Successfully**

✅ Removed

**Recommendations**

Must Fix:
Optional:

**Final Verdict**

The PR implementation is sound. The migration from legacy to actor-based client management is complete and correct. The only issue is outdated documentation that creates the false impression that legacy mode still exists.
**Multi-Machine Test Results**

I've run the multi-machine test suite on this PR. The test FAILED with a critical error.

**Test Configuration**
**Critical Error Found**

The test failed during room creation with the following error:

This error caused the WebSocket connection to be dropped immediately:

**Diagnostic Details**

Timeline of events:
Key log entries:

**Analysis**

The actor-based client management system is properly initialized and running (confirmed by log messages):
However, there appears to be a regression where the peer ID is not being set during PUT operations. This is likely a bug introduced during the removal of legacy code paths.

**Recommendation**

This PR should not be merged until this issue is resolved. The error at

Full test logs are available in:

[AI-assisted debugging and comment]
@claude in particular, what was changed that may have caused this?
Claude finished @iduartgomez's task

**Root Cause Analysis: "peer id not found" Error**

I've identified the root cause of the test failure. The issue is NOT a bug introduced by removing legacy code, but rather a pre-existing race condition that was being masked by the legacy initialization order.

**The Issue**

Error location:

```rust
let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key() else {
    tracing::error!("peer id not found at put op, it should be set");
    return Err(Error::Disconnected);
};
```

This code checks if the node's peer ID has been set. The peer ID gets initialized in one of two ways:
**What Changed in This PR**

The PR refactored the initialization order.

BEFORE (with the conditional `actor_clients` flag):

```text
1. Create ConnectionManager
2. Create OpManager with result_router_tx: None
3. IF actor_clients flag enabled:
   - Create result router channel
   - Spawn SessionActor
   - Spawn ResultRouter
4. Create P2pConnManager
5. Start client event handling
```

AFTER (actor mode always on):

```text
1. Create result router channel
2. Spawn SessionActor
3. Spawn ResultRouter
4. Create ConnectionManager
5. Create OpManager with result_router_tx (non-optional)
6. Create P2pConnManager
7. Start client event handling
```

**The Race Condition**

The change in initialization order has subtly affected timing:
This is a timing-sensitive race condition - it likely only manifests under certain conditions:
**Why This Wasn't Caught Before**

The legacy initialization order had actors spawning LATER, which accidentally gave more time for the initial handshake to complete before clients could connect. The PR's optimization (spawning actors earlier) exposed this pre-existing vulnerability.

**The Fix**

There are several possible solutions:

Option 1: Defer client connections until peer_id is set (Recommended)
Option 2: Handle None peer_id gracefully in operations (sketched just after this list)
Option 3: Initialize peer_id earlier for non-gateway nodes
Option 4: Revert to legacy spawn order (Quick fix)
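For contrast, Option 2 could surface a retriable error at the failing check (a sketch; `Error::NotReady` is a hypothetical variant, not an existing one):

```rust
// Sketch of Option 2 at the check shown above: signal "try again later"
// instead of Disconnected. `Error::NotReady` is hypothetical.
let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key() else {
    tracing::warn!("peer id not yet set; client should retry shortly");
    return Err(Error::NotReady);
};
```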
**Recommendation**

I recommend Option 1 combined with better documentation. The peer_id check exists for a reason - operations should not proceed until the node has established its network identity. The PR correctly exposed a timing assumption that wasn't properly enforced.

Would you like me to implement Option 1, or would you prefer a different approach?
@claude Can you re-run @sanity's and @iduartgomez's tests again?
Claude finished @netsirius's task

**Re-running Tests Analysis**

I've analyzed the current state of the code to determine if the race condition identified by @sanity's multi-machine test has been fixed.

**Findings**

**1. The Race Condition Still Exists ❌**

The error at the failing location is unchanged:

```rust
let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key() else {
    tracing::error!("peer id not found at put op, it should be set");
    return Err(Error::Disconnected);
};
```

This is the exact same code that failed in @sanity's multi-machine test.

**2. Initialization Order Unchanged**

The initialization sequence in
This means SessionActor starts before the peer has a chance to complete its initial network handshake, allowing clients to connect and send PUT requests before the peer_id is set.

**3. Recent "fix" Commit**

The most recent commit is

**About Re-running Tests**

I cannot re-run @sanity's multi-machine test because it requires:
This is a custom test setup that I don't have access to replicate.

**The Root Cause**

As identified in the previous Claude analysis, the issue is a race condition:
**Summary**

The race condition identified by @sanity's multi-machine test has NOT been fixed. The code at the error location remains unchanged, and the initialization order that exposes this race condition is still in place.

To properly verify this PR:
Would you like me to implement the fix for the race condition?
@claude Do you think this fix might interact poorly with the recently merged PR #1871?
Claude finished @sanity's task

**Analysis: Interaction Between Option 1 Fix and PR #1871**

Good question! After thoroughly examining PR #1871, I can confirm that Option 1 will work well with PR #1871 and actually complements it nicely. Here's why:

**What PR #1871 Changed**

PR #1871 ("fix: enable gateway bootstrap when starting with zero connections") solved a critical gateway bootstrapping issue.

The Problem: Gateways starting with zero connections couldn't accept their first joiner, because the normal flow requires existing connections to forward through.

The Solution: Added a special "bootstrap acceptance" path that:

Key code locations:
**Why Option 1 Won't Conflict**

The critical insight: PR #1871's bootstrap path only affects gateways, and gateways already have their peer_id set from the start!

Looking at the peer_id initialization:
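A minimal sketch of the two identity paths (all names illustrative, not the actual freenet-core types):

```rust
// Illustrative only: gateways know their key from config at startup,
// while regular peers learn theirs during the join handshake.
fn initial_peer_key(is_gateway: bool, configured_key: Option<String>) -> Option<String> {
    if is_gateway {
        configured_key // gateways: identity set from the start
    } else {
        None // regular peers: set later, once the handshake completes
    }
}
```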
This means:
There's no overlap! The two fixes operate on completely separate code paths.

**Actually, Option 1 Makes PR #1871 Even Better**

The proposed Option 1 fix would improve the correctness guarantees that PR #1871 introduced.

Current state with PR #1871:

With Option 1 added:
**Implementation Recommendation**

I recommend implementing Option 1 as originally proposed:

```rust
// Pseudo-code for Option 1
pub struct OpManager {
    // ... existing fields ...
    peer_ready: Arc<AtomicBool>, // Set to true when peer_id is initialized
}

// In client_events/mod.rs, before processing client requests:
if !op_manager.peer_ready.load(Ordering::SeqCst) {
    // Either return error, or wait until ready
    tracing::warn!("Client attempted operation before peer initialization complete");
    return Err(Error::NotReady);
}
```

Where to set the ready flag:
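For instance, it could be flipped after the first successful handshake (a sketch; `on_handshake_complete` is a hypothetical hook, and the real wiring would live in the handshake handler):

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Sketch: flip the readiness flag once the first successful handshake
// has established this node's network identity.
fn on_handshake_complete(peer_ready: &Arc<AtomicBool>) {
    peer_ready.store(true, Ordering::SeqCst);
}
```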
This approach:
**Final Verdict**

No conflict. PR #1871 and Option 1 are complementary fixes that operate on different parts of the initialization sequence. Option 1 can be safely implemented without worrying about PR #1871.
@netsirius we will fix the issue in a separate PR and then rebase this one, since it is a separate issue; let's wait until then for a final test before merging this.
This fixes a race condition where clients could connect and send operations (PUT, GET, etc.) before a non-gateway peer completes its initial network handshake and has its peer_id set.

Changes:
- Add peer_ready AtomicBool to OpManager (true for gateways, false initially for peers)
- Add peer_ready to HandshakeHandler to set the flag after the first successful handshake
- Check peer_ready before processing client operations (non-gateways only)
- Gateways are unaffected (always ready since peer_id is set from config)

This ensures non-gateway peers only accept client operations after establishing their network identity, preventing "peer id not found" errors.

Addresses the race condition exposed by PR #1898 (removal of legacy client management). Per @iduartgomez and @netsirius guidance: implemented as a separate fix to avoid conflicts with PR #1898, which will be rebased after this is merged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
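A minimal sketch of the gating this commit describes (assuming `is_gateway` is available where `OpManager` is constructed):

```rust
use std::sync::{atomic::AtomicBool, Arc};

// Gateways are ready immediately (their peer_id comes from config);
// regular peers start not-ready and flip the flag after the handshake.
fn new_peer_ready_flag(is_gateway: bool) -> Arc<AtomicBool> {
    Arc::new(AtomicBool::new(is_gateway))
}
```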
…urations

# Conflicts:
#	crates/core/src/node/op_state_manager.rs
**Multi-Machine Test Results for PR #1898**

Status: ❌ FAILED (but NOT a regression - main also fails)

**Test Configuration**
**Key Finding: Not a Regression**

Both main and PR #1898 fail identically - this is a pre-existing issue with peer-to-gateway connections in the multi-machine test environment, not a bug introduced by this PR.

**Test Results Comparison**

Main Branch (with PR #1899, without PR #1898)

PR #1898 Branch
**Root Cause Analysis**

The multi-machine test waits a fixed 40 seconds for the peer connection, then immediately attempts River operations:

```rust
// From enhanced_test.rs:1188-1190
println!("  ⏳ Waiting for peer to join the ring network (40 seconds)...");
std::thread::sleep(Duration::from_secs(40)); // ← Just sleeps, doesn't verify!
println!("  ✅ Peer connected in {:?}", peer_start.elapsed());
```

The test doesn't actually verify the connection completed - it just assumes 40 seconds is enough. In practice, the peer-to-gateway handshake is not completing within this time in the multi-machine environment.

Evidence from logs:
**Secondary Issue: PR #1898 May Have Weakened Error Handling**

While not causing the test failure, there's a concerning difference:
This suggests PR #1898 may have inadvertently broken the

**Recommendations**
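One likely fix for the fixed-sleep problem is to poll for the connection with a deadline (a sketch; `peer_is_connected` is a hypothetical stand-in for whatever status check the test harness exposes):

```rust
use std::time::{Duration, Instant};

// Hypothetical status check; the real test would query the node's API.
fn peer_is_connected() -> bool {
    false
}

fn wait_for_peer_connection() {
    // Poll instead of sleeping a fixed 40 seconds.
    let deadline = Instant::now() + Duration::from_secs(120);
    while !peer_is_connected() {
        assert!(
            Instant::now() < deadline,
            "peer failed to join the ring within 120s"
        );
        std::thread::sleep(Duration::from_secs(2));
    }
}
```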
**Test Artifacts**
[AI-assisted debugging and comment]
Fixes issue #1836
This pull request removes support for legacy client management, making actor-based client management the default and only mode. The changes simplify configuration, environment handling, and related code paths, ensuring all client communication uses the actor-based system. Tests and documentation related to the legacy mode and its feature flag have been removed.
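Concretely, configuration gates of roughly this shape disappear (a reconstruction for illustration; the exact removed code and env-var semantics may have differed):

```rust
// Illustrative shape of the removed gate. FREENET_ACTOR_CLIENTS and the
// `actor_clients` option are the identifiers this PR deletes.
fn legacy_actor_clients_gate() -> bool {
    std::env::var("FREENET_ACTOR_CLIENTS")
        .map(|v| v == "true" || v == "1")
        .unwrap_or(true) // actor mode was already the default
}
```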
**Client Management System Simplification**
- Removed the `actor_clients` configuration option and all related environment variable handling (`FREENET_ACTOR_CLIENTS`). Actor-based client management is now always enabled, and legacy mode is no longer supported. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L79-L83), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L114), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L377), [[4]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L461-L463), [[5]](https://github.com/freenet/freenet-core/pull/1898/files#diff-9087957d5f7ad95ca762ede82d9fc3adf17684b08f19785fb53dda3ead4d5bfaL69), [[6]](https://github.com/freenet/freenet-core/pull/1898/files#diff-9087957d5f7ad95ca762ede82d9fc3adf17684b08f19785fb53dda3ead4d5bfaL115), [[7]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL226-R226))
- Client requests are now always routed through the `RequestRouter`, removing conditional logic for legacy mode. ([crates/core/src/client_events/mod.rs L216-R217](https://github.com/freenet/freenet-core/pull/1898/files#diff-3b08f38e7b62f928a42cb1e50f46d185ae36543ea4206af15246bc82e97a7ac1L216-R217))
- ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL226-R226), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL251-R250), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL267-R259), [[4]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL277-R276))

**Test and Workflow Cleanup**
- Removed tests exercising the `FREENET_ACTOR_CLIENTS` flag, including environment variable manipulation and related assertions. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-95579cd60108ff762c8cf1bcc7561347053566c8558f0e677738b3cbb6d19a23L120-L147), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-95579cd60108ff762c8cf1bcc7561347053566c8558f0e677738b3cbb6d19a23L249-L282))
- ([.github/workflows/ci.yml L44-L49](https://github.com/freenet/freenet-core/pull/1898/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fL44-L49))

**Codebase and Logging Updates**
- ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L392-R392), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L426-R426), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L435-L483))