fix: Prevent notification channel starvation in P2P event loop #1903
Conversation
## Problem

The `tokio::select!` in the P2P event loop's `wait_for_event()` was experiencing channel starvation, causing PUT operation notifications to be lost in busy networks. The `notification_channel` would never be polled while `peer_connections` had constant activity, leading to operation timeouts.

## Root Cause

Since the initial implementation in September 2024 (commit 605ff70), the `select!` branches were in an order that prioritized network traffic over internal notifications:

1. `peer_connections` (network) - checked FIRST
2. `notification_channel` (internal) - checked second

Without the `biased` annotation, `tokio::select!` polls branches in random order. However, in busy networks `peer_connections` is constantly ready, so even with random polling the high volume of network traffic effectively starves `notification_channel`.

## Solution

1. Added the `biased;` annotation to force sequential polling in source order
2. Reordered branches to prioritize `notification_channel` FIRST:
   - `notification_channel.notifications_receiver` (internal) - FIRST
   - `notification_channel.op_execution_receiver` (internal) - SECOND
   - `peer_connections` (network) - after internal channels

This ensures internal operation state machine transitions are processed before more network traffic is handled, preventing the deadlock where operations wait for their own state transitions that never get processed.

## Testing

- ✅ test_put_contract - Verifies basic PUT operations work
- ✅ test_put_with_subscribe_flag - Verifies PUT with subscription
- ✅ Tested in multi-peer scenarios (ubertest) - 47 notifications received vs 0 before

## Context

This fix emerged from debugging the ubertest, where PUT operations would consistently time out. Investigation revealed notifications were sent successfully (`OpManager::notify_op_change` returned `Ok`) but never received by the event loop. Channel ID tracking confirmed sender/receiver were correctly paired.

The fix is minimal and surgical - it only reorders `select!` branches and adds the `biased` annotation. No logic changes, no API changes.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
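To make the change concrete, here is a minimal sketch of the reordered `wait_for_event()` loop, assuming plain `tokio::sync::mpsc` receivers; the type and handler names are illustrative, not the actual freenet code:

```rust
use tokio::sync::mpsc;

// Hypothetical stand-ins for the real event types.
struct Notification;
struct OpResult;
struct PeerEvent;

async fn wait_for_event(
    notifications_rx: &mut mpsc::Receiver<Notification>,
    op_execution_rx: &mut mpsc::Receiver<OpResult>,
    peer_rx: &mut mpsc::Receiver<PeerEvent>,
) {
    tokio::select! {
        // `biased` disables random branch selection: branches are polled
        // top-to-bottom, so a constantly ready network branch can no longer
        // starve the internal channels.
        biased;

        // Internal operation state-machine notifications - checked first.
        Some(_notification) = notifications_rx.recv() => {
            // drive the operation state machine forward
        }
        // Internal op-execution results - checked second.
        Some(_result) = op_execution_rx.recv() => {
            // complete the pending operation
        }
        // Network traffic from peer connections - only polled when neither
        // internal channel has a message ready.
        Some(_event) = peer_rx.recv() => {
            // handle inbound network traffic
        }
    }
}
```

With `biased;`, branch order is priority order, so placing `peer_connections` last is what guarantees the internal channels are checked before more network traffic is accepted on each loop iteration.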
Pull Request Overview
This PR fixes a critical starvation issue in the P2P event loop where notification channels were being starved by high network traffic, causing PUT operation timeouts in busy networks.
Key changes:
- Added `biased;` annotation to `tokio::select!` to force sequential polling
- Reordered branches to prioritize internal notification channels before network traffic
- Added detailed comments explaining the fix rationale
…issue

- Applied notification channel starvation fix from PR #1903
- Re-enabled test_three_node_network_connectivity (removed #[ignore])
- Removed debug prints added during investigation
- Created issue #1904 documenting the peer-to-peer connection failure

The test now runs but times out waiting for full mesh formation. Gateway successfully connects to both peers, but peers don't connect to each other. This is the core mesh formation issue that needs investigation.

Related: #1904, #1903, #1897

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
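As a side note, re-enabling the test only involves dropping the `#[ignore]` attribute; a hypothetical sketch of the shape (the body is a placeholder, not the real connectivity check):

```rust
// Previously skipped by default:
//
//     #[ignore]
//     #[tokio::test]
//     async fn test_three_node_network_connectivity() { ... }
//
// After removing the attribute, the test runs with the normal suite.
#[tokio::test]
async fn test_three_node_network_connectivity() {
    // Placeholder: the real test starts a gateway plus two peers and
    // waits for full mesh formation before asserting connectivity.
}
```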
Good catch. Not the definitive fix for this kind of issue but good for now.