Conversation

@sanity sanity commented Oct 3, 2025

Problem

The tokio::select! in the P2P event loop's wait_for_event() suffered from channel starvation, causing PUT operation notifications to be lost in busy networks: the notification_channel was never polled while peer_connections had constant activity, leading to operation timeouts.

Root Cause

Since the initial implementation in September 2024 (commit 605ff70cb), the select! branches prioritized network traffic over internal notifications:

  1. peer_connections (network) - checked FIRST
  2. notification_channel (internal) - checked second

Without the biased annotation, tokio::select! polls its branches in random order. In a busy network, however, peer_connections is almost always ready, so notification_channel is effectively starved even under random polling because of the sheer volume of network traffic.

Solution

  1. Added biased; annotation to force sequential polling in source order
  2. Reordered branches to prioritize notification_channel FIRST:
    • notification_channel.notifications_receiver (internal) - FIRST
    • notification_channel.op_execution_receiver (internal) - SECOND
    • peer_connections (network) - after internal channels

This ensures internal operation state-machine transitions are processed before more network traffic is handled, preventing a deadlock in which operations wait for their own state transitions that never get processed.
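For illustration, here is a minimal sketch of the reordered loop. The channel names follow the description above, but the types, signatures, and handler bodies are simplified placeholders rather than the actual freenet-core code:

```rust
use tokio::sync::mpsc;

// Placeholder message types standing in for the real internal and network events.
struct Notification;
struct OpExecution;
struct NetworkEvent;

async fn event_loop(
    mut notifications_rx: mpsc::Receiver<Notification>,
    mut op_execution_rx: mpsc::Receiver<OpExecution>,
    mut network_rx: mpsc::Receiver<NetworkEvent>,
) {
    loop {
        tokio::select! {
            // `biased;` disables tokio's random branch selection: branches are
            // polled strictly top-to-bottom, so the internal channels are always
            // checked before more network traffic is accepted. Before the fix,
            // the network branch came first and there was no `biased;`.
            biased;

            Some(_notification) = notifications_rx.recv() => {
                // drive the operation state machine forward (internal, FIRST)
            }
            Some(_op) = op_execution_rx.recv() => {
                // execute pending operation transitions (internal, SECOND)
            }
            Some(_event) = network_rx.recv() => {
                // handle network traffic only after internal channels are drained
            }
            else => break, // all senders dropped; shut down the loop
        }
    }
}
```

In the actual event loop, peer_connections is not a plain mpsc channel, but the biased top-to-bottom ordering applies the same way.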

Testing

  • ✅ test_put_contract - Verifies basic PUT operations work
  • ✅ test_put_with_subscribe_flag - Verifies PUT with subscription
  • ✅ Tested in multi-peer scenarios (ubertest) - 47 notifications received vs 0 before

Context

This fix emerged from debugging the ubertest, where PUT operations would consistently time out. Investigation showed that notifications were sent successfully (OpManager::notify_op_change returned Ok) but never received by the event loop. Channel ID tracking confirmed that sender and receiver were correctly paired.

The fix is minimal and surgical - only reorders select! branches and adds the biased annotation. No logic changes, no API changes.

Impact

  • Severity: High - affects all PUT/UPDATE/GET operations in busy networks
  • Scope: P2P event loop core - all network operations
  • Risk: Low - minimal change, well-tested select! pattern

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude noreply@anthropic.com

@sanity sanity requested review from Copilot and iduartgomez October 3, 2025 15:31

Copilot AI left a comment

Pull Request Overview

This PR fixes a critical issue in the P2P event loop where notification channels were starved by high network traffic, causing PUT operation timeouts in busy networks.

Key changes:

  • Added biased; annotation to tokio::select! to force sequential polling
  • Reordered branches to prioritize internal notification channels before network traffic
  • Added detailed comments explaining the fix rationale

sanity added a commit that referenced this pull request Oct 3, 2025
…issue

- Applied notification channel starvation fix from PR #1903
- Re-enabled test_three_node_network_connectivity (removed #[ignore])
- Removed debug prints added during investigation
- Created issue #1904 documenting the peer-to-peer connection failure

The test now runs but times out waiting for full mesh formation. Gateway
successfully connects to both peers, but peers don't connect to each other.
This is the core mesh formation issue that needs investigation.

Related: #1904, #1903, #1897

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@iduartgomez iduartgomez left a comment

Good catch. Not the definitive fix for this kind of issue but good for now.

@sanity sanity added this pull request to the merge queue Oct 3, 2025
Merged via the queue into main with commit c5d25c9 Oct 3, 2025
8 checks passed
@sanity sanity deleted the fix/select-starvation-put-notifications branch October 3, 2025 16:20
netsirius pushed a commit that referenced this pull request Oct 3, 2025
iduartgomez pushed a commit that referenced this pull request Oct 5, 2025