Remove legacy actor client management #1898
Conversation
crates/core/src/node/testing_impl.rs (Outdated)

```diff
 /// Starts listening to incoming events. Will attempt to join the ring if any gateways have been provided.
 async fn run_event_listener<NB, UsrEv>(
-    cli_response_sender: contract::ClientResponsesSender,
+    _cli_response_sender: contract::ClientResponsesSender,
```
this can be removed, and maybe the `contract::ClientResponsesSender` type too?
**Review Comments**

This PR successfully removes legacy client management code and consolidates on actor-based client management. The changes are well-structured and comprehensive. A few observations:

**Strengths:**

**Observations:**

**Testing:**

The changes look solid and achieve the stated goal of making actor-based client management the default and only mode.

[AI-assisted debugging and comment]
Claude finished @iduartgomez's task

**Investigation Complete: Issues Found in PR #1898**

I've thoroughly analyzed the PR against issue #1836 to identify what went wrong. Here's the complete analysis.

**Summary**

The PR successfully removed the legacy client management system and made actor-based client management the only mode. However, outdated documentation gives the false impression that legacy mode still exists, which is causing confusion during code review.

**Issues Found**

**1. Outdated Documentation ❌ NEEDS FIX**

Location: The documentation still describes a "dual-mode" system:

```rust
/// This function operates in one of two modes based on `op_manager.actor_clients`:
///
/// **Actor Mode** (default, `actor_clients=true`):
/// - Uses ResultRouter → SessionActor...
///
/// **Legacy Mode** (`actor_clients=false`):
/// - Uses direct client callbacks...
```
Problem: This is completely outdated. There is no `actor_clients` flag anymore; actor mode is the only mode.

Impact: This misleading documentation is likely what caused the reviewer's confusion about "still passing options."
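A corrected doc comment might read roughly like this (a sketch of the intent, not the author's final wording):

```rust
/// Routes operation results back to connected clients.
///
/// All client communication flows through the actor-based pipeline
/// (ResultRouter -> SessionActor); the legacy callback mode and the
/// `actor_clients` flag no longer exist.
```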
**2. Unused Sender in testing_impl.rs ✅ Actually Correct**

Location:

```rust
let (client_responses, _cli_response_sender) = contract::client_responses_channel();
```

The reviewer asked: "this can be removed, and maybe the `contract::ClientResponsesSender` type too?"

Analysis:

Verdict: This is actually correct as-is. The unused sender is expected in testing mode.

**3. Optional Types Successfully Removed ✅ Correctly Fixed**

The PR correctly removed the `Option` wrappers:
```rust
// Before: pub result_router_tx: Option<mpsc::Sender<(Transaction, HostResult)>>,
// After:  pub result_router_tx: mpsc::Sender<(Transaction, HostResult)>,

// Before: message_processor: Option<Arc<MessageProcessor>>,
// After:  message_processor: Arc<MessageProcessor>,
```

Status: ✅ Correctly fixed - these are now required (non-optional) because actor mode is always enabled.

**What Went Wrong in This PR?**

The core implementation is actually correct! What went wrong is:
The reviewer's comment "I still see us passing an option here" was likely because:
**What the PR Did Successfully**

✅ Removed

**Recommendations**

Must Fix:
Optional:

**Final Verdict**

The PR implementation is sound. The migration from legacy to actor-based client management is complete and correct. The only issue is outdated documentation that creates the false impression that legacy mode still exists.
**Multi-Machine Test Results**

I've run the multi-machine test suite on this PR. The test FAILED with a critical error.

**Test Configuration**
**Critical Error Found**

The test failed during room creation with the following error:

This error caused the WebSocket connection to be dropped immediately:

**Diagnostic Details**

Timeline of events:
Key log entries:

**Analysis**

The actor-based client management system is properly initialized and running (confirmed by log messages):
However, there appears to be a regression where the peer ID is not being set during PUT operations. This is likely a bug introduced during the removal of legacy code paths.

**Recommendation**

This PR should not be merged until this issue is resolved. The error at

Full test logs are available in:

[AI-assisted debugging and comment]
@claude in particular, what was changed that may have caused this?
Claude finished @iduartgomez's task

**Root Cause Analysis: "peer id not found" Error**

I've identified the root cause of the test failure. The issue is NOT a bug introduced by removing legacy code, but rather a pre-existing race condition that was being masked by the legacy initialization order.

**The Issue**

Error location:

```rust
let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key() else {
    tracing::error!("peer id not found at put op, it should be set");
    return Err(Error::Disconnected);
};
```

This code checks if the node's peer ID has been set. The peer ID gets initialized in one of two ways:
**What Changed in This PR**

The PR refactored the initialization order.

BEFORE (with the conditional `actor_clients` flag):

```text
1. Create ConnectionManager
2. Create OpManager with result_router_tx: None
3. IF actor_clients flag enabled:
   - Create result router channel
   - Spawn SessionActor
   - Spawn ResultRouter
4. Create P2pConnManager
5. Start client event handling
```

AFTER (actor mode always on):

```text
1. Create result router channel
2. Spawn SessionActor
3. Spawn ResultRouter
4. Create ConnectionManager
5. Create OpManager with result_router_tx (non-optional)
6. Create P2pConnManager
7. Start client event handling
```

**The Race Condition**

The change in initialization order has subtly affected timing:
This is a timing-sensitive race condition - it likely only manifests under certain conditions:
**Why This Wasn't Caught Before**

The legacy initialization order had actors spawning LATER, which accidentally gave more time for the initial handshake to complete before clients could connect. The PR's optimization (spawning actors earlier) exposed this pre-existing vulnerability.

**The Fix**

There are several possible solutions:

Option 1: Defer client connections until peer_id is set (Recommended)
Option 2: Handle None peer_id gracefully in operations (sketched just after this list)
Option 3: Initialize peer_id earlier for non-gateway nodes
Option 4: Revert to legacy spawn order (Quick fix)
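For contrast, Option 2 could surface a retriable error at the failing check (a sketch; `Error::NotReady` is a hypothetical variant, not an existing one):

```rust
// Sketch of Option 2 at the check shown above: signal "try again later"
// instead of Disconnected. `Error::NotReady` is hypothetical.
let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key() else {
    tracing::warn!("peer id not yet set; client should retry shortly");
    return Err(Error::NotReady);
};
```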
**Recommendation**

I recommend Option 1 combined with better documentation. The peer_id check exists for a reason - operations should not proceed until the node has established its network identity. The PR correctly exposed a timing assumption that wasn't properly enforced.

Would you like me to implement Option 1, or would you prefer a different approach?
@claude Can you re-run @sanity's and @iduartgomez's tests again?
Claude finished @netsirius's task

**Re-running Tests Analysis**

I've analyzed the current state of the code to determine if the race condition identified by @sanity's multi-machine test has been fixed.

**Findings**

**1. The Race Condition Still Exists ❌**

The error at the failing location is unchanged:

```rust
let Some(peer_id) = op_manager.ring.connection_manager.get_peer_key() else {
    tracing::error!("peer id not found at put op, it should be set");
    return Err(Error::Disconnected);
};
```

This is the exact same code that failed in @sanity's multi-machine test.

**2. Initialization Order Unchanged**

The initialization sequence in
This means SessionActor starts before the peer has a chance to complete its initial network handshake, allowing clients to connect and send PUT requests before the peer_id is set.

**3. Recent "fix" Commit**

The most recent commit is

**About Re-running Tests**

I cannot re-run @sanity's multi-machine test because it requires:
This is a custom test setup that I don't have access to replicate.

**The Root Cause**

As identified in the previous Claude analysis, the issue is a race condition:
**Summary**

The race condition identified by @sanity's multi-machine test has NOT been fixed. The code at the error location remains unchanged, and the initialization order that exposes this race condition is still in place.

To properly verify this PR:
Would you like me to implement the fix for the race condition?
@claude Do you think this fix might interact poorly with the recently merged PR #1871?
Claude finished @sanity's task

**Analysis: Interaction Between Option 1 Fix and PR #1871**

Good question! After thoroughly examining PR #1871, I can confirm that Option 1 will work well with PR #1871 and actually complements it nicely. Here's why:

**What PR #1871 Changed**

PR #1871 ("fix: enable gateway bootstrap when starting with zero connections") solved a critical gateway bootstrapping issue.

The Problem: Gateways starting with zero connections couldn't accept their first joiner, because the normal flow requires existing connections to forward through.

The Solution: Added a special "bootstrap acceptance" path that:

Key code locations:
**Why Option 1 Won't Conflict**

The critical insight: PR #1871's bootstrap path only affects gateways, and gateways already have their peer_id set from the start!

Looking at the peer_id initialization:
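A minimal sketch of the two identity paths (all names illustrative, not the actual freenet-core types):

```rust
// Illustrative only: gateways know their key from config at startup,
// while regular peers learn theirs during the join handshake.
fn initial_peer_key(is_gateway: bool, configured_key: Option<String>) -> Option<String> {
    if is_gateway {
        configured_key // gateways: identity set from the start
    } else {
        None // regular peers: set later, once the handshake completes
    }
}
```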
This means:
There's no overlap! The two fixes operate on completely separate code paths.

**Actually, Option 1 Makes PR #1871 Even Better**

The proposed Option 1 fix would improve the correctness guarantees that PR #1871 introduced.

Current state with PR #1871:

With Option 1 added:
**Implementation Recommendation**

I recommend implementing Option 1 as originally proposed:

```rust
// Pseudo-code for Option 1
pub struct OpManager {
    // ... existing fields ...
    peer_ready: Arc<AtomicBool>, // Set to true when peer_id is initialized
}

// In client_events/mod.rs, before processing client requests:
if !op_manager.peer_ready.load(Ordering::SeqCst) {
    // Either return error, or wait until ready
    tracing::warn!("Client attempted operation before peer initialization complete");
    return Err(Error::NotReady);
}
```

Where to set the ready flag:
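For instance, it could be flipped after the first successful handshake (a sketch; `on_handshake_complete` is a hypothetical hook, and the real wiring would live in the handshake handler):

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Sketch: flip the readiness flag once the first successful handshake
// has established this node's network identity.
fn on_handshake_complete(peer_ready: &Arc<AtomicBool>) {
    peer_ready.store(true, Ordering::SeqCst);
}
```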
This approach:
**Final Verdict**

No conflict. PR #1871 and Option 1 are complementary fixes that operate on different parts of the initialization sequence. Option 1 can be safely implemented without worrying about PR #1871.
@netsirius we will fix the issue in a separate PR and then rebase this one, since it is a separate issue; let's wait until then for a final test before merging this.
This fixes a race condition where clients could connect and send operations (PUT, GET, etc.) before a non-gateway peer completes its initial network handshake and has its peer_id set.

Changes:
- Add peer_ready AtomicBool to OpManager (true for gateways, false initially for peers)
- Add peer_ready to HandshakeHandler to set the flag after the first successful handshake
- Check peer_ready before processing client operations (non-gateways only)
- Gateways are unaffected (always ready since peer_id is set from config)

This ensures non-gateway peers only accept client operations after establishing their network identity, preventing "peer id not found" errors.

Addresses the race condition exposed by PR #1898 (removal of legacy client management). Per @iduartgomez and @netsirius guidance: implemented as a separate fix to avoid conflicts with PR #1898, which will be rebased after this is merged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
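A minimal sketch of the gating this commit describes (assuming `is_gateway` is available where `OpManager` is constructed):

```rust
use std::sync::{atomic::AtomicBool, Arc};

// Gateways are ready immediately (their peer_id comes from config);
// regular peers start not-ready and flip the flag after the handshake.
fn new_peer_ready_flag(is_gateway: bool) -> Arc<AtomicBool> {
    Arc::new(AtomicBool::new(is_gateway))
}
```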
…urations

# Conflicts:
#	crates/core/src/node/op_state_manager.rs
**Multi-Machine Test Results for PR #1898**

Status: ❌ FAILED (but NOT a regression - main also fails)

**Test Configuration**
**Key Finding: Not a Regression**

Both main and PR #1898 fail identically - this is a pre-existing issue with peer-to-gateway connections in the multi-machine test environment, not a bug introduced by this PR.

**Test Results Comparison**

Main Branch (with PR #1899, without PR #1898)

PR #1898 Branch
**Root Cause Analysis**

The multi-machine test waits a fixed 40 seconds for the peer connection, then immediately attempts River operations:

```rust
// From enhanced_test.rs:1188-1190
println!("  ⏳ Waiting for peer to join the ring network (40 seconds)...");
std::thread::sleep(Duration::from_secs(40)); // ← Just sleeps, doesn't verify!
println!("  ✅ Peer connected in {:?}", peer_start.elapsed());
```

The test doesn't actually verify the connection completed - it just assumes 40 seconds is enough. In practice, the peer-to-gateway handshake is not completing within this time in the multi-machine environment.

Evidence from logs:
**Secondary Issue: PR #1898 May Have Weakened Error Handling**

While not causing the test failure, there's a concerning difference:
This suggests PR #1898 may have inadvertently broken the

**Recommendations**
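One likely fix for the fixed-sleep problem is to poll for the connection with a deadline (a sketch; `peer_is_connected` is a hypothetical stand-in for whatever status check the test harness exposes):

```rust
use std::time::{Duration, Instant};

// Hypothetical status check; the real test would query the node's API.
fn peer_is_connected() -> bool {
    false
}

fn wait_for_peer_connection() {
    // Poll instead of sleeping a fixed 40 seconds.
    let deadline = Instant::now() + Duration::from_secs(120);
    while !peer_is_connected() {
        assert!(
            Instant::now() < deadline,
            "peer failed to join the ring within 120s"
        );
        std::thread::sleep(Duration::from_secs(2));
    }
}
```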
**Test Artifacts**
[AI-assisted debugging and comment]
Fixes issue #1836
This pull request removes support for legacy client management, making actor-based client management the default and only mode. The changes simplify configuration, environment handling, and related code paths, ensuring all client communication uses the actor-based system. Tests and documentation related to the legacy mode and its feature flag have been removed.
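Concretely, configuration gates of roughly this shape disappear (a reconstruction for illustration; the exact removed code and env-var semantics may have differed):

```rust
// Illustrative shape of the removed gate. FREENET_ACTOR_CLIENTS and the
// `actor_clients` option are the identifiers this PR deletes.
fn legacy_actor_clients_gate() -> bool {
    std::env::var("FREENET_ACTOR_CLIENTS")
        .map(|v| v == "true" || v == "1")
        .unwrap_or(true) // actor mode was already the default
}
```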
**Client Management System Simplification**
- Removed the `actor_clients` configuration option and all related environment variable handling (`FREENET_ACTOR_CLIENTS`). Actor-based client management is now always enabled, and legacy mode is no longer supported. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L79-L83), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L114), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L377), [[4]](https://github.com/freenet/freenet-core/pull/1898/files#diff-db321b7ff69f10066fb41af08cd55e5257842ff0d9752f08312f2a7dcaba3437L461-L463), [[5]](https://github.com/freenet/freenet-core/pull/1898/files#diff-9087957d5f7ad95ca762ede82d9fc3adf17684b08f19785fb53dda3ead4d5bfaL69), [[6]](https://github.com/freenet/freenet-core/pull/1898/files#diff-9087957d5f7ad95ca762ede82d9fc3adf17684b08f19785fb53dda3ead4d5bfaL115), [[7]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL226-R226))
- Client requests are now always routed through the `RequestRouter`, removing conditional logic for legacy mode. ([crates/core/src/client_events/mod.rs L216-R217](https://github.com/freenet/freenet-core/pull/1898/files#diff-3b08f38e7b62f928a42cb1e50f46d185ae36543ea4206af15246bc82e97a7ac1L216-R217))
- ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL226-R226), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL251-R250), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL267-R259), [[4]](https://github.com/freenet/freenet-core/pull/1898/files#diff-04e0f92a03bc6a97819d3431883aafde2e47828192f88c40fbd11f489cc4901bL277-R276))

**Test and Workflow Cleanup**
- Removed tests exercising the `FREENET_ACTOR_CLIENTS` flag, including environment variable manipulation and related assertions. ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-95579cd60108ff762c8cf1bcc7561347053566c8558f0e677738b3cbb6d19a23L120-L147), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-95579cd60108ff762c8cf1bcc7561347053566c8558f0e677738b3cbb6d19a23L249-L282))
- ([.github/workflows/ci.yml L44-L49](https://github.com/freenet/freenet-core/pull/1898/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fL44-L49))

**Codebase and Logging Updates**
- ([[1]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L392-R392), [[2]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L426-R426), [[3]](https://github.com/freenet/freenet-core/pull/1898/files#diff-2758408631487d6e8f21a2609cb2c08104104c674dbb69965462d882159a7570L435-L483))