Conversation

@sanity sanity commented Oct 3, 2025

Problem

The tokio::select! in the P2P event loop's wait_for_event() suffered from channel starvation, causing PUT operation notifications to be lost in busy networks: the notification_channel was never polled while peer_connections had constant activity, leading to operation timeouts.

Root Cause

Since the initial implementation in September 2024 (commit 605ff70cb), the select! branches prioritized network traffic over internal notifications:

  1. peer_connections (network) - checked FIRST
  2. notification_channel (internal) - checked second

Without the biased annotation, tokio::select! polls its branches in random order. In a busy network, however, peer_connections is almost always ready, so notification_channel is effectively starved even under random polling because of the sheer volume of network traffic.

Solution

  1. Added biased; annotation to force sequential polling in source order
  2. Reordered branches to prioritize notification_channel FIRST:
    • notification_channel.notifications_receiver (internal) - FIRST
    • notification_channel.op_execution_receiver (internal) - SECOND
    • peer_connections (network) - after internal channels

This ensures internal operation state-machine transitions are processed before more network traffic is handled, preventing a deadlock in which operations wait for their own state transitions that never get processed.
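For illustration, here is a minimal sketch of the reordered loop. The channel names follow the description above, but the types, signatures, and handler bodies are simplified placeholders rather than the actual freenet-core code:

```rust
use tokio::sync::mpsc;

// Placeholder message types standing in for the real internal and network events.
struct Notification;
struct OpExecution;
struct NetworkEvent;

async fn event_loop(
    mut notifications_rx: mpsc::Receiver<Notification>,
    mut op_execution_rx: mpsc::Receiver<OpExecution>,
    mut network_rx: mpsc::Receiver<NetworkEvent>,
) {
    loop {
        tokio::select! {
            // `biased;` disables tokio's random branch selection: branches are
            // polled strictly top-to-bottom, so the internal channels are always
            // checked before more network traffic is accepted. Before the fix,
            // the network branch came first and there was no `biased;`.
            biased;

            Some(_notification) = notifications_rx.recv() => {
                // drive the operation state machine forward (internal, FIRST)
            }
            Some(_op) = op_execution_rx.recv() => {
                // execute pending operation transitions (internal, SECOND)
            }
            Some(_event) = network_rx.recv() => {
                // handle network traffic only after internal channels are drained
            }
            else => break, // all senders dropped; shut down the loop
        }
    }
}
```

In the actual event loop, peer_connections is not a plain mpsc channel, but the biased top-to-bottom ordering applies the same way.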

Testing

  • ✅ test_put_contract - Verifies basic PUT operations work
  • ✅ test_put_with_subscribe_flag - Verifies PUT with subscription
  • ✅ Tested in multi-peer scenarios (ubertest) - 47 notifications received vs 0 before

Context

This fix emerged from debugging the ubertest, where PUT operations would consistently time out. Investigation showed that notifications were sent successfully (OpManager::notify_op_change returned Ok) but never received by the event loop. Channel ID tracking confirmed that sender and receiver were correctly paired.

The fix is minimal and surgical - only reorders select! branches and adds the biased annotation. No logic changes, no API changes.

Impact

  • Severity: High - affects all PUT/UPDATE/GET operations in busy networks
  • Scope: P2P event loop core - all network operations
  • Risk: Low - minimal change, well-tested select! pattern

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude noreply@anthropic.com

@sanity sanity requested review from Copilot and iduartgomez October 3, 2025 15:31

Copilot AI left a comment

Pull Request Overview

This PR fixes a critical issue in the P2P event loop where notification channels were starved by high network traffic, causing PUT operation timeouts in busy networks.

Key changes:

  • Added biased; annotation to tokio::select! to force sequential polling
  • Reordered branches to prioritize internal notification channels before network traffic
  • Added detailed comments explaining the fix rationale

sanity added a commit that referenced this pull request Oct 3, 2025
…issue

- Applied notification channel starvation fix from PR #1903
- Re-enabled test_three_node_network_connectivity (removed #[ignore])
- Removed debug prints added during investigation
- Created issue #1904 documenting the peer-to-peer connection failure

The test now runs but times out waiting for full mesh formation. Gateway
successfully connects to both peers, but peers don't connect to each other.
This is the core mesh formation issue that needs investigation.

Related: #1904, #1903, #1897

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@iduartgomez iduartgomez left a comment

Good catch. Not the definitive fix for this kind of issue but good for now.

@sanity sanity added this pull request to the merge queue Oct 3, 2025
Merged via the queue into main with commit c5d25c9 Oct 3, 2025
8 checks passed
@sanity sanity deleted the fix/select-starvation-put-notifications branch October 3, 2025 16:20
netsirius pushed a commit that referenced this pull request Oct 3, 2025
iduartgomez pushed a commit that referenced this pull request Oct 5, 2025