Conversation

@netsirius
Collaborator

Fixes issue #1836

This pull request removes support for legacy client management, making actor-based client management the default and only mode. The changes simplify configuration, environment handling, and related code paths, ensuring all client communication uses the actor-based system. Tests and documentation related to the legacy mode and its feature flag have been removed.

Client Management System Simplification

  • Removed the actor_clients configuration option and all related environment variable handling (FREENET_ACTOR_CLIENTS). Actor-based client management is now always enabled, and legacy mode is no longer supported. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L79-L83), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L114), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L377), [[4]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L461-L463), [[5]](https://github.com/freenet/freenet-core/pull/1898/files#diff-9087957d5f7ad95ca762ede82d9fc3adf17684b08f19785fb53dda3ead4d5bfaL69), [[6]](https://github.com/freenet/freenet-core/pull/1898/files#diff-9087957d5f7ad95ca762ede82d9fc3adf17684b08f19785fb53dda3ead4d5bfaL115), [[7]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL226-R226))
  • Updated the client event handling logic to always create and use the actor-based RequestRouter, removing conditional logic for legacy mode. ([crates/core/src/client_events/mod.rsL216-R217](https://github.com/freenet/freenet-core/pull/1898/files#diff-3b08f38e7b62f928a42cb1e50f46d185ae36543ea4206af15246bc82e97a7ac1L216-R217))
  • Simplified the P2P implementation to always install the actor-based session adapter and result router, removing conditional code and related logs for legacy mode. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL226-R226), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL251-R250), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL267-R259), [[4]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL277-R276))

Test and Workflow Cleanup

  • Removed all tests and verification logic for legacy mode and the FREENET_ACTOR_CLIENTS flag, including environment variable manipulation and related assertions. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-95579cd60108ff762c8cf1bcc7561347053566c8558f0e677738b3cbb6d19a23L120-L147), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-95579cd60108ff762c8cf1bcc7561347053566c8558f0e677738b3cbb6d19a23L249-L282))
  • Updated CI workflow to eliminate legacy and actor mode test jobs, consolidating testing under the default actor-based configuration. ([.github/workflows/ci.ymlL44-L49](https://github.com/freenet/freenet-core/pull/1898/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fL44-L49))

Codebase and Logging Updates

  • Removed legacy client delivery code paths and related logging from result reporting, ensuring only actor-based delivery is performed. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L392-R392), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L426-R426), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L435-L483))

@netsirius netsirius requested a review from iduartgomez October 2, 2025 12:03
/// Starts listening to incoming events. Will attempt to join the ring if any gateways have been provided.
async fn run_event_listener<NB, UsrEv>(
-    cli_response_sender: contract::ClientResponsesSender,
+    _cli_response_sender: contract::ClientResponsesSender,
this can be removed, and maybe the contract::ClientResponsesSender type too?

@sanity
Collaborator

sanity commented Oct 2, 2025

Review Comments

This PR successfully removes legacy client management code and consolidates on actor-based client management. The changes are well-structured and comprehensive. A few observations:

Strengths:

  • Clean removal of the FREENET_ACTOR_CLIENTS flag and related configuration
  • Proper cleanup of test code verifying legacy mode behavior
  • Simplification of P2P implementation by removing conditional logic
  • CI workflow updated to remove redundant test jobs

Observations:

  • The PR includes a large Cargo.lock update (255 additions, 422 deletions overall) - dependency updates appear to be noise from normal version bumps
  • All conditional code paths checking actor_clients flag have been removed
  • The RequestRouter is now unconditionally created in client event handling (crates/core/src/client_events/mod.rs:216-217)

Testing:

  • Removed 62 lines of integration verification tests that were specifically testing the flag behavior
  • The remaining tests verify the actor-based system functionality

The changes look solid and achieve the stated goal of making actor-based client management the default and only mode.

[AI-assisted debugging and comment]

@netsirius netsirius force-pushed the remove-legacy-preactor-client-mgmt branch from 954c5e6 to c187d76 Compare October 2, 2025 16:33
@iduartgomez
Collaborator

@claude can you take a look at #1836 comments and attempt to find what went wrong in the PR?

@claude

claude bot commented Oct 2, 2025

Claude finished @iduartgomez's task


Investigation Complete: Issues Found in PR #1898

I've thoroughly analyzed the PR against issue #1836 to identify what went wrong. Here's the complete analysis:


Summary

The PR successfully removed the legacy client management system and made actor-based client management the only mode. However, outdated documentation gives the false impression that legacy mode still exists, which is causing confusion during code review.


Issues Found

1. Outdated Documentation (NEEDS FIX)

Location: crates/core/src/client_events/mod.rs:206-225

The documentation still describes a "dual-mode" system:

/// This function operates in one of two modes based on `op_manager.actor_clients`:
///
/// **Actor Mode** (default, `actor_clients=true`):
/// - Uses ResultRouter → SessionActor...
///
/// **Legacy Mode** (`actor_clients=false`):
/// - Uses direct client callbacks...

Problem: This is completely outdated. There is no actor_clients flag anymore, no legacy mode, no dual-mode system. Only actor mode exists now.

Impact: This misleading documentation is likely what caused the reviewer's confusion about "still passing options."


2. Unused Sender in testing_impl.rs (Actually Correct)

Location: crates/core/src/node/testing_impl.rs:784

let (client_responses, _cli_response_sender) = contract::client_responses_channel();

The reviewer asked: "this can be removed, and maybe the contract::ClientResponsesSender type too?"

Analysis:

  • The client_responses receiver IS used - it's passed to client_event_handling() at line 805
  • The _cli_response_sender is intentionally unused (hence the _ prefix) because in actor mode, the SessionActor provides its own sender
  • The ClientResponsesSender type must be kept - it's actively used by the SessionActor system

Verdict: This is actually correct as-is. The unused sender is expected in testing mode.


3. Optional Types Successfully Removed (Correctly Fixed)

The PR correctly removed Option<> wrappers:

op_state_manager.rs:68:

// Before: pub result_router_tx: Option<mpsc::Sender<(Transaction, HostResult)>>,
// After:  pub result_router_tx: mpsc::Sender<(Transaction, HostResult)>,

network_bridge/p2p_protoc.rs:130:

// Before: message_processor: Option<Arc<MessageProcessor>>,
// After:  message_processor: Arc<MessageProcessor>,

Status: ✅ Correctly fixed - these are now required (non-optional) because actor mode is always enabled.
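
To illustrate why dropping the Option<> wrapper simplifies callers, here is a minimal sketch. It is not the project's code: it uses std::sync::mpsc instead of whatever channel the crate actually uses, and Transaction, HostResult, LegacyOpManager, and the report methods are placeholders invented for this example.

```rust
use std::sync::mpsc;

// Placeholder types for illustration only; the real ones live in freenet-core.
type Transaction = u64;
type HostResult = Result<String, String>;

/// Shape of the field before the PR: the router may be absent.
pub struct LegacyOpManager {
    pub result_router_tx: Option<mpsc::Sender<(Transaction, HostResult)>>,
}

/// Shape of the field after the PR: the router is always present.
pub struct OpManager {
    pub result_router_tx: mpsc::Sender<(Transaction, HostResult)>,
}

impl LegacyOpManager {
    /// Every caller has to branch on whether a router was installed.
    pub fn report(&self, tx: Transaction, res: HostResult) -> bool {
        match &self.result_router_tx {
            Some(sender) => sender.send((tx, res)).is_ok(),
            None => false, // legacy path: nothing to route to
        }
    }
}

impl OpManager {
    /// No branch needed: the sender is guaranteed by construction.
    pub fn report(&self, tx: Transaction, res: HostResult) -> bool {
        self.result_router_tx.send((tx, res)).is_ok()
    }
}
```

With the non-optional field, a missing router becomes a compile-time construction error rather than a runtime branch.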


What Went Wrong in This PR?

The core implementation is actually correct! What went wrong is:

  1. Documentation wasn't updated - it still describes the old dual-mode system that no longer exists
  2. This caused reviewer confusion - reading docs that say "two modes" while seeing code that's been simplified to one mode creates cognitive dissonance

The reviewer's comment "I still see us passing an option here" was likely because:

  • The documentation incorrectly described dual modes
  • An earlier commit may have had Options that were later removed
  • The final code actually DID remove the Options correctly

What the PR Did Successfully

✅ Removed FREENET_ACTOR_CLIENTS environment variable
✅ Removed actor_clients config field
✅ Removed all conditional legacy mode logic
✅ Made MessageProcessor required (removed Option<>)
✅ Made result_router_tx required (removed Option<>)
✅ Removed legacy client delivery paths
✅ Cleaned up tests and CI workflows
✅ Removed 459 lines, added 293 lines (net -166 lines)


Recommendations

Must Fix:

  1. Update documentation in crates/core/src/client_events/mod.rs:206-225 to remove all references to:
    • "dual-mode"
    • "legacy mode"
    • actor_clients flag
    • The false choice between modes

Optional:
2. Add a comment in testing_impl.rs:784 explaining why _cli_response_sender is unused (it's intentional, not an oversight)


Final Verdict

The PR implementation is sound. The migration from legacy to actor-based client management is complete and correct. The only issue is outdated documentation that creates the false impression that legacy mode still exists.


Job Run | Branch: remove-legacy-preactor-client-mgmt

@sanity
Collaborator

sanity commented Oct 2, 2025

Multi-Machine Test Results

I've run the multi-machine test suite on this PR. The test FAILED with a critical error.

Test Configuration

  • Branch: remove-legacy-preactor-client-mgmt (commit 954c5e6)
  • Test: River multi-user chat test (gateway on vega.locut.us, local peer)
  • Date: 2025-10-02 16:32:00

Critical Error Found

The test failed during room creation with the following error:

ERROR freenet::client_events: peer id not found at put op, it should be set
    at crates/core/src/client_events/mod.rs:390

This error caused the WebSocket connection to be dropped immediately:

Error: Failed to receive response: client error: client disconnected

Diagnostic Details

Timeline of events:

  1. ✅ Gateway started successfully (70.6s)
  2. ✅ Peer connected to gateway (55.0s)
  3. ✅ Actor-based infrastructure confirmed active on both gateway and peer
  4. ❌ Room creation PUT operation failed with "peer id not found" error
  5. ❌ WebSocket connection dropped
  6. ❌ Test failed

Key log entries:

16:34:05.966 [RIVER] Sending PUT request for room creation
16:34:05.966 [PEER] ERROR freenet::client_events: peer id not found at put op, it should be set
16:34:05.966 [PEER] INFO dropped connection to client #1
16:34:05.966 [RIVER] Error: Failed to receive response: client error: client disconnected

Analysis

The actor-based client management system is properly initialized and running (confirmed by log messages):

  • Actor-based client management infrastructure installed with result router
  • Session actor starting
  • Result router starting

However, there appears to be a regression where peer ID is not being set during PUT operations. This is likely a bug introduced during the removal of legacy code paths.

Recommendation

This PR should not be merged until this issue is resolved. The error at crates/core/src/client_events/mod.rs:390 suggests that the peer ID tracking mechanism may have been inadvertently affected by the legacy code removal.

Full test logs are available in: ./test-results/river-test-20251002-163200/

[AI-assisted debugging and comment]

@iduartgomez
Collaborator

@claude in particular what was changed that may have caused this?

@claude

claude bot commented Oct 2, 2025

Claude finished @iduartgomez's task


Root Cause Analysis: "peer id not found" Error

I've identified the root cause of the test failure. The issue is NOT a bug introduced by removing legacy code, but rather a pre-existing race condition that was being masked by the legacy initialization order.


The Issue

Error Location: crates/core/src/client_events/mod.rs:405

let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key()
else {
    tracing::error!("peer id not found at put op, it should be set");
    return Err(Error::Disconnected);
};

This code checks if the node's peer ID has been set. The peer ID gets initialized in one of two ways:

  1. Gateway nodes: Set immediately from config.peer_id during ConnectionManager::new()
  2. Regular peers: Set to None initially, then populated by try_set_peer_key() during the first successful network handshake
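
The two initialization paths and the guard can be modelled with a small self-contained sketch. Everything here is a simplified stand-in, not the crate's actual types: PeerId, this ConnectionManager, the "set only if unset" behaviour of try_set_peer_key, and handle_put are assumptions made for illustration.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the node's network identity.
#[derive(Clone, Debug, PartialEq)]
pub struct PeerId(pub String);

/// Toy model of the slot that holds the peer key.
pub struct ConnectionManager {
    peer_key: Mutex<Option<PeerId>>,
}

impl ConnectionManager {
    /// Gateways pass `Some(id)` from config; regular peers pass `None`.
    pub fn new(configured: Option<PeerId>) -> Self {
        Self { peer_key: Mutex::new(configured) }
    }

    /// Returns the peer key if it has been set.
    pub fn get_peer_key(&self) -> Option<PeerId> {
        self.peer_key.lock().unwrap().clone()
    }

    /// Sets the key only if it is still unset (assumed semantics).
    /// Returns true when this call stored the key.
    pub fn try_set_peer_key(&self, id: PeerId) -> bool {
        let mut slot = self.peer_key.lock().unwrap();
        if slot.is_none() {
            *slot = Some(id);
            true
        } else {
            false
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Error {
    Disconnected,
}

/// Mirrors the let-else guard quoted above: fail if no peer key yet.
pub fn handle_put(cm: &ConnectionManager) -> Result<PeerId, Error> {
    let Some(peer_id) = cm.get_peer_key() else {
        return Err(Error::Disconnected);
    };
    Ok(peer_id)
}
```

In this model a gateway's handle_put succeeds immediately, while a regular peer's fails until try_set_peer_key has run once, which is the window the failing test hit.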

What Changed in This PR

The PR refactored the initialization order in crates/core/src/node/p2p_impl.rs:

BEFORE (with conditional actor_clients flag):

1. Create ConnectionManager
2. Create OpManager with result_router_tx: None
3. IF actor_clients flag enabled:
     - Create result router channel
     - Spawn SessionActor  
     - Spawn ResultRouter
4. Create P2pConnManager
5. Start client event handling

AFTER (actor mode always on):

1. Create result router channel
2. Spawn SessionActor
3. Spawn ResultRouter  
4. Create ConnectionManager
5. Create OpManager with result_router_tx (non-optional)
6. Create P2pConnManager
7. Start client event handling

The Race Condition

The change in initialization order has subtly affected timing:

  1. SessionActor and ResultRouter are now spawned earlier in the startup sequence
  2. This means client connections can be accepted slightly earlier
  3. A client can now connect and send PUT requests BEFORE the peer completes its initial network handshake with the gateway
  4. When a PUT arrives before handshake completion, get_peer_key() returns None → error

This is a timing-sensitive race condition - it likely only manifests under certain conditions:

  • Fast client connections
  • Slow network handshakes
  • The specific test environment (multi-machine test with remote gateway)

Why This Wasn't Caught Before

The legacy initialization order had actors spawning LATER, which accidentally gave more time for the initial handshake to complete before clients could connect. The PR's optimization (spawning actors earlier) exposed this pre-existing vulnerability.


The Fix

There are several possible solutions:

Option 1: Defer client connections until peer_id is set (Recommended)

  • Add a "ready" flag that blocks client event handling until first handshake completes
  • Cleanest solution but requires coordination between network and client layers

Option 2: Handle None peer_id gracefully in operations

  • Instead of returning Error::Disconnected, wait/retry until peer_id is available
  • Less invasive but adds complexity to operation handlers

Option 3: Initialize peer_id earlier for non-gateway nodes

  • Set a temporary peer_id based on listening address during ConnectionManager construction
  • Simpler but may have other implications

Option 4: Revert to legacy spawn order (Quick fix)

  • Move SessionActor/ResultRouter spawning after OpManager creation
  • Masks the issue rather than fixing it, but would unblock the PR
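
As a rough illustration of Option 2 (wait/retry instead of failing immediately), the following sketch polls for the peer key with a deadline. It is a blocking, std-only approximation: the real handlers are async, so an actual implementation would presumably use an async sleep instead of thread::sleep. The function name and parameters are hypothetical.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls `get_peer_key` until it yields a value or `timeout` elapses.
/// Returns `None` on timeout so the caller can still report an error.
pub fn wait_for_peer_key<T>(
    mut get_peer_key: impl FnMut() -> Option<T>,
    timeout: Duration,
    poll_interval: Duration,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Check first, so an already-set key returns without sleeping.
        if let Some(key) = get_peer_key() {
            return Some(key);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(poll_interval);
    }
}
```

The bounded timeout matters: without it, a peer that never completes its handshake would hang client requests instead of failing them.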

Recommendation

I recommend Option 1 combined with better documentation. The peer_id check exists for a reason - operations should not proceed until the node has established its network identity. The PR correctly exposed a timing assumption that wasn't properly enforced.

Would you like me to implement Option 1, or would you prefer a different approach?


Job Run | Branch: remove-legacy-preactor-client-mgmt

@netsirius
Collaborator Author

@claude Can you re-run @sanity's and @iduartgomez's tests?

@claude

claude bot commented Oct 2, 2025

Claude finished @netsirius's task


Re-running Tests Analysis

I've analyzed the current state of the code to determine if the race condition identified by @sanity's multi-machine test has been fixed.


Findings

1. The Race Condition Still Exists

The error location at crates/core/src/client_events/mod.rs:403-407 has not been fixed:

let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key()
else {
    tracing::error!("peer id not found at put op, it should be set");
    return Err(Error::Disconnected);
};

This is the exact same code that failed in @sanity's multi-machine test.

2. Initialization Order Unchanged

The initialization sequence in crates/core/src/node/p2p_impl.rs:221-275 remains:

  1. Create result router channel (line 228)
  2. Spawn SessionActor (line 230-237)
  3. Spawn ResultRouter (line 239-246)
  4. Create ConnectionManager (line 250)
  5. Create OpManager (line 251-258)
  6. Start client event handling (line 298+)

This means SessionActor starts before the peer has a chance to complete its initial network handshake, allowing clients to connect and send PUT requests before peer_id is set.

3. Recent "fix" Commit

The most recent commit is e9e6465 ("fix"), but it did not address the race condition identified in the test failure.


About Re-running Tests

I cannot re-run @sanity's multi-machine test because it requires:

  • A gateway running on a remote machine (vega.locut.us)
  • A local peer connecting to that gateway
  • The River chat application test harness

This is a custom test setup that I don't have access to replicate.


The Root Cause

As identified in the previous Claude analysis, the issue is a race condition:

  1. Non-gateway peers start with peer_id = None
  2. The peer_id only gets set during the first successful network handshake with another peer
  3. SessionActor now starts before ConnectionManager is fully initialized
  4. Clients can connect via WebSocket before the initial handshake completes
  5. When a PUT request arrives before peer_id is set → Error: "peer id not found"

Summary

The race condition identified by @sanity's multi-machine test has NOT been fixed. The code at the error location remains unchanged, and the initialization order that exposes this race condition is still in place.

To properly verify this PR:

  1. Either @sanity needs to re-run their multi-machine test setup manually
  2. Or the race condition needs to be fixed first (I can implement this if requested)

Would you like me to implement the fix for the race condition?

@sanity
Collaborator

sanity commented Oct 2, 2025

Recommendation

I recommend Option 1 combined with better documentation. The peer_id check exists for a reason - operations should not proceed until the node has established its network identity. The PR correctly exposed a timing assumption that wasn't properly enforced.

Would you like me to implement Option 1, or would you prefer a different approach?

@claude Do you think this fix might interact poorly with this recently merged PR? #1871

@claude

claude bot commented Oct 2, 2025

Claude finished @sanity's task


Analysis: Interaction Between Option 1 Fix and PR #1871

Good question! After thoroughly examining PR #1871, I can confirm that Option 1 will work well with PR #1871 and actually complements it nicely. Here's why:


What PR #1871 Changed

PR #1871 ("fix: enable gateway bootstrap when starting with zero connections") solved a critical gateway bootstrapping issue:

The Problem: Gateways starting with zero connections couldn't accept their first joiner because the normal flow requires existing connections to forward through.

The Solution: Added a special "bootstrap acceptance" path that:

  1. Detects when a gateway has 0 connections and only 1 reserved connection (the first joiner)
  2. Marks this connection with is_bootstrap_acceptance=true
  3. Immediately registers the connection in the ring (bypassing normal CheckConnectivity flow)
  4. This allows the gateway to respond to FindOptimalPeer requests from subsequent joiners

Key code locations:

  • crates/core/src/operations/connect.rs:1055-1091 - Bootstrap detection logic
  • crates/core/src/node/network_bridge/p2p_protoc.rs:863-890 - Immediate ring registration for bootstrap connections

Why Option 1 Won't Conflict

The critical insight: PR #1871's bootstrap path only affects gateways, and gateways already have their peer_id set from the start!

Looking at the peer_id initialization:

  • Gateways: peer_id is set immediately from config.peer_id during ConnectionManager::new() (crates/core/src/ring/connection_manager.rs)
  • Regular peers: peer_id starts as None, then gets set via try_set_peer_key() during first outbound gateway handshake (crates/core/src/node/network_bridge/handshake.rs:296)

This means:

  1. Gateways never experience the "peer id not found" error because their peer_id is always set
  2. Regular peers need to complete an outbound connection to get their peer_id
  3. Option 1 (deferring client connections until peer_id is set) only affects regular peers
  4. PR #1871's bootstrap logic only affects gateways

There's no overlap! The two fixes operate on completely separate code paths.


Actually, Option 1 Makes PR #1871 Even Better

The proposed Option 1 fix would improve the correctness guarantees that PR #1871 introduced:

Current state with PR #1871:

  • Gateway bootstrap works correctly
  • Regular peers can still hit the race condition (client connects before first handshake completes)

With Option 1 added:

  • Gateway bootstrap still works correctly (no change)
  • Regular peers also protected from the race condition
  • Clean separation of concerns: peer_id must be established before accepting client operations

Implementation Recommendation

I recommend implementing Option 1 as originally proposed:

// Pseudo-code for Option 1 (fragments, not a complete program)
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

pub struct OpManager {
    // ... existing fields ...
    /// Set to true once peer_id is initialized.
    peer_ready: Arc<AtomicBool>,
}

// In client_events/mod.rs, before processing client requests:
if !op_manager.peer_ready.load(Ordering::SeqCst) {
    // Either return an error, or wait until ready.
    // `Error::NotReady` is a proposed variant, not an existing one.
    tracing::warn!("Client attempted operation before peer initialization complete");
    return Err(Error::NotReady);
}

Where to set the ready flag:

  1. For gateways: Set immediately after ConnectionManager::new() (peer_id already known)
  2. For regular peers: Set in HandshakeHandler::get_events() after try_set_peer_key() succeeds (line 296)
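
Put together, the flag placement described above might look like the following self-contained sketch. The types are simplified stand-ins that only share names with the real ones; ClientError::NotReady, on_first_handshake, and check_ready are hypothetical names chosen for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub enum ClientError {
    NotReady,
}

/// Simplified stand-in holding only the readiness flag.
pub struct OpManager {
    pub peer_ready: Arc<AtomicBool>,
}

impl OpManager {
    /// Gateways know their peer_id from config, so they start ready.
    /// Regular peers start not ready.
    pub fn new(is_gateway: bool) -> Self {
        Self { peer_ready: Arc::new(AtomicBool::new(is_gateway)) }
    }
}

/// Simplified stand-in for the handshake side, sharing the same flag.
pub struct HandshakeHandler {
    peer_ready: Arc<AtomicBool>,
}

impl HandshakeHandler {
    pub fn new(peer_ready: Arc<AtomicBool>) -> Self {
        Self { peer_ready }
    }

    /// Called once the first handshake has set the peer key.
    pub fn on_first_handshake(&self) {
        self.peer_ready.store(true, Ordering::SeqCst);
    }
}

/// Gate to run before processing a client operation.
pub fn check_ready(op_manager: &OpManager) -> Result<(), ClientError> {
    if op_manager.peer_ready.load(Ordering::SeqCst) {
        Ok(())
    } else {
        Err(ClientError::NotReady)
    }
}
```

Sharing one Arc<AtomicBool> between the two sides keeps the network layer and the client layer decoupled: neither needs a reference to the other, only to the flag.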

This approach:


Final Verdict

No conflict. PR #1871 and Option 1 are complementary fixes that operate on different parts of the initialization sequence. Option 1 can be safely implemented without worrying about PR #1871.


Job Run | Branch: remove-legacy-preactor-client-mgmt

@iduartgomez
Collaborator

@netsirius we will fix the issue in a separate PR and then rebase this one, since it is a separate issue. Let's wait until then for a final test before merging this.

sanity added a commit that referenced this pull request Oct 2, 2025
This fixes a race condition where clients could connect and send operations
(PUT, GET, etc.) before a non-gateway peer completes its initial network
handshake and has its peer_id set.

Changes:
- Add peer_ready AtomicBool to OpManager (true for gateways, false initially for peers)
- Add peer_ready to HandshakeHandler to set flag after first successful handshake
- Check peer_ready before processing client operations (non-gateways only)
- Gateways are unaffected (always ready since peer_id set from config)

This ensures non-gateway peers only accept client operations after establishing
their network identity, preventing "peer id not found" errors.

Addresses race condition exposed by PR #1898 (removal of legacy client management).
Per @iduartgomez and @netsirius guidance: implemented as separate fix to avoid
conflicts with PR #1898, which will be rebased after this is merged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@netsirius netsirius force-pushed the remove-legacy-preactor-client-mgmt branch from e9e6465 to d8ca97e Compare October 2, 2025 17:57
@sanity
Collaborator

sanity commented Oct 2, 2025

Multi-Machine Test Results for PR #1898

Status: ❌ FAILED (but NOT a regression - main also fails)

Test Configuration

  • Test Type: River multi-user chat (remote gateway + local peer)
  • Gateway: vega.locut.us:31338
  • Test Tool: freenet-testing-tools/multi-machine-test
  • Binary: Clean rebuild from PR 1898 HEAD (d8ca97e)

Key Finding: Not a Regression

Both main and PR 1898 fail identically - this is a pre-existing issue with peer-to-gateway connections in the multi-machine test environment, not a bug introduced by this PR.

Test Results Comparison

Main Branch (with PR #1899, without PR #1898)

PR #1898 Branch

Root Cause Analysis

The multi-machine test waits a fixed 40 seconds for peer connection, then immediately attempts River operations:

// From enhanced_test.rs:1188-1190
println!("  ⏳ Waiting for peer to join the ring network (40 seconds)...");
std::thread::sleep(Duration::from_secs(40));  // ← Just sleeps, doesn't verify!
println!("  ✅ Peer connected in {:?}", peer_start.elapsed());

The test doesn't actually verify the connection completed - it just assumes 40 seconds is enough. In practice, the peer-to-gateway handshake is not completing within this time in the multi-machine environment.

Evidence from logs:

  • No "Outbound gateway connection successful" message
  • No "Peer initialization complete: peer_ready set to true" message
  • Connection attempt starts but never completes
  • After 55+ seconds, riverctl attempts PUT and fails

Secondary Issue: PR #1898 May Have Weakened Error Handling

While not causing the test failure, there's a concerning difference:

This suggests PR #1898 may have inadvertently broken the peer_ready check from PR #1899, though the exact mechanism is unclear. The check code appears identical, but the behavior differs.

Recommendations

  1. For PR #1898:

  2. For the Test:

    • Replace fixed 40-second sleep with actual connection verification
    • Check for "Outbound gateway connection successful" in logs
    • Verify peer_ready flag is set before proceeding
    • Add timeout with clear error message if connection takes too long
    • Add binary verification - check git commit hash matches expectation
  3. For the Connection Issue:

    • Investigate why peer-to-gateway connections are slow/failing
    • Check if this is specific to the vega.locut.us environment
    • Consider if network configuration or firewall rules are interfering
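
For the test recommendations above, replacing the fixed sleep with a bounded poll could look roughly like this. It is a sketch under assumptions, not the test tool's code: wait_for_log_marker and its read_log callback are hypothetical, and how the real harness reads peer logs is not shown in this thread.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls the log text until `marker` appears or `timeout` elapses.
/// On success returns how long the wait took; on timeout returns a
/// descriptive error instead of silently continuing.
pub fn wait_for_log_marker(
    mut read_log: impl FnMut() -> String,
    marker: &str,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<Duration, String> {
    let start = Instant::now();
    loop {
        if read_log().contains(marker) {
            return Ok(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return Err(format!(
                "timed out after {:?} waiting for log line {:?}",
                timeout, marker
            ));
        }
        thread::sleep(poll_interval);
    }
}
```

A test could call this with the "Outbound gateway connection successful" marker and abort with the returned error, so a slow or failed handshake is reported as such rather than surfacing later as a failed PUT.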

Test Artifacts

[AI-assisted debugging and comment]

@iduartgomez iduartgomez added this pull request to the merge queue Oct 2, 2025
Merged via the queue into main with commit dbc66f5 Oct 2, 2025
7 checks passed
@iduartgomez iduartgomez deleted the remove-legacy-preactor-client-mgmt branch October 2, 2025 21:36