Collaborator

@sanity sanity commented Nov 30, 2025

Context

PR #2174 initially added inline timeout checks to the connect operation's process_message method. However, Nacho pointed out in review that this is redundant.

Why This Is Redundant

The op_state_manager module already has a background task that handles transaction timeouts via ttl_set (see op_state_manager.rs lines 708-785). This handles timeouts for all operation types including Connect, so adding inline timeout checks in individual operation handlers is unnecessary duplication.
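To make the idea concrete, here is a minimal sketch of a TTL-ordered set with a periodic sweep. It is not the code in op_state_manager.rs: the `TtlSet` type, the `TxId` alias, and the `insert`/`sweep` methods are hypothetical names chosen for illustration, and the real ttl_set may be structured differently.

```rust
use std::collections::BTreeSet;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a transaction id.
type TxId = u64;

/// Deadlines ordered by expiry time, so a sweep only visits expired entries.
#[derive(Default)]
struct TtlSet {
    deadlines: BTreeSet<(Instant, TxId)>,
}

impl TtlSet {
    /// Register a transaction that should expire `ttl` after `start`.
    fn insert(&mut self, tx: TxId, start: Instant, ttl: Duration) {
        self.deadlines.insert((start + ttl, tx));
    }

    /// Remove and return every transaction whose deadline is at or before `now`.
    fn sweep(&mut self, now: Instant) -> Vec<TxId> {
        let mut expired = Vec::new();
        while let Some(&(deadline, tx)) = self.deadlines.first() {
            if deadline > now {
                break; // everything left expires later
            }
            self.deadlines.pop_first();
            expired.push(tx);
        }
        expired
    }
}
```

A background task would call `sweep` on a fixed interval and fail each returned transaction. Because the sweep runs on its own schedule, it fires even when no message ever arrives for an operation, regardless of the operation type.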

This PR

Removes the timeout additions that were added in the previous PR, per Nacho's feedback.

Stack

This PR stacks on #2174.

[AI-assisted - Claude]

Copilot finished reviewing on behalf of sanity November 30, 2025 05:00
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a 30-second timeout mechanism to detect and fail stalled connect operations when a joiner receives no acceptances or progress updates. The timeout prevents operations from waiting indefinitely when no peers respond.

Key changes:

  • Added JOINER_PROGRESS_TIMEOUT constant (30 seconds) and has_timed_out() method to JoinerState
  • Added timeout check in process_message that returns OpError::Timeout when timeout is exceeded
  • Added Timeout variant to the OpError enum

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/core/src/operations/mod.rs | Adds Timeout error variant to OpError enum for operation timeout failures |
| crates/core/src/operations/connect.rs | Implements timeout detection for joiner operations in WaitingForResponses state with 30-second progress timeout |

Comment on lines 817 to 830
// Check for joiner timeout before processing any message
if self.gateway.is_some() {
    if let Some(ConnectState::WaitingForResponses(ref state)) = self.state {
        if state.has_timed_out(Instant::now()) {
            tracing::warn!(
                tx = %self.id,
                last_progress_secs = state.last_progress.elapsed().as_secs(),
                accepted_count = state.accepted.len(),
                "connect: joiner timed out waiting for responses"
            );
            return Err(OpError::Timeout);
        }
    }
}

Copilot AI Nov 30, 2025

The timeout check only executes when a message is received. If a joiner receives no acceptances and no messages for 30 seconds (the exact scenario described in the PR description), process_message will never be called, so this timeout check will never execute and the operation will wait indefinitely.

Consider implementing the timeout via one of these approaches:

  1. A periodic background task that checks for timed-out operations
  2. A tokio timer/timeout wrapper around the operation
  3. A timeout future that races with message processing

The current implementation only protects against stalls between messages, not the absence of messages entirely.
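As an illustration of the third approach, the sketch below races message receipt against a deadline. It uses the blocking `std::sync::mpsc::Receiver::recv_timeout` so that it needs no async runtime; an async version would use the runtime's own timeout facility instead. The `WaitOutcome` enum and the `wait_for_message` function are hypothetical names, not part of the codebase under review.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of waiting for the next message.
#[derive(Debug, PartialEq)]
enum WaitOutcome<T> {
    Message(T),
    TimedOut,
    ChannelClosed,
}

/// Wait for a message, but give up once `timeout` elapses with no traffic.
fn wait_for_message<T>(rx: &Receiver<T>, timeout: Duration) -> WaitOutcome<T> {
    match rx.recv_timeout(timeout) {
        Ok(msg) => WaitOutcome::Message(msg),
        // No message arrived in time: the case an inline check can never see.
        Err(RecvTimeoutError::Timeout) => WaitOutcome::TimedOut,
        // All senders were dropped, so no message can ever arrive.
        Err(RecvTimeoutError::Disconnected) => WaitOutcome::ChannelClosed,
    }
}
```

Unlike a check placed at the top of a message handler, the `TimedOut` branch is reached precisely when nothing is received, which is the stalled-joiner scenario.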

Collaborator

Agreed, this is not the place for this. We have a background task that checks for transactions which may have timed out, so this is actually redundant. If we are not cleaning up, that is where we should be looking.

Collaborator Author

Understood - I'll remove the inline timeout check. The existing op_state_manager background task already handles transaction timeouts via ttl_set (lines 708-785 in op_state_manager.rs), which covers Connect operations along with all other transaction types.

Removing the redundant code now.

[AI-assisted - Claude]

Comment on lines 519 to 522
/// Returns true if no progress has been made within the timeout period.
pub(crate) fn has_timed_out(&self, now: Instant) -> bool {
    now.duration_since(self.last_progress) >= JOINER_PROGRESS_TIMEOUT
}

Copilot AI Nov 30, 2025

The new timeout functionality lacks test coverage. Consider adding a unit test that verifies:

  1. has_timed_out() returns false when last_progress is recent
  2. has_timed_out() returns true when last_progress exceeds JOINER_PROGRESS_TIMEOUT
  3. last_progress is updated correctly when progress occurs (acceptance or observed address update)

Example test structure:

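#[test]
fn joiner_state_timeout_detection() {
    use std::collections::HashSet;
    use std::time::{Duration, Instant};

    // Take one reference instant so both checks compare against the same clock reading.
    let now = Instant::now();

    // Last progress is one second beyond the timeout window: must report a timeout.
    let stale_state = JoinerState {
        target_connections: 1,
        observed_address: None,
        accepted: HashSet::new(),
        last_progress: now - JOINER_PROGRESS_TIMEOUT - Duration::from_secs(1),
    };
    assert!(stale_state.has_timed_out(now));

    // Progress was recorded just now: must not report a timeout.
    let recent_state = JoinerState {
        target_connections: 1,
        observed_address: None,
        accepted: HashSet::new(),
        last_progress: now,
    };
    assert!(!recent_state.has_timed_out(now));
}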
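
The example above does not exercise the third point, that last_progress is refreshed when progress occurs. A self-contained sketch of that check follows. It uses a trimmed stand-in for JoinerState with only the timing field, an assumed 30-second constant matching the value described in this review, and a hypothetical `record_progress` helper; the real code may update last_progress differently.

```rust
use std::time::{Duration, Instant};

/// Assumed value; mirrors the 30-second constant described in the review.
const JOINER_PROGRESS_TIMEOUT: Duration = Duration::from_secs(30);

/// Trimmed stand-in for the real JoinerState (timing field only).
struct JoinerState {
    last_progress: Instant,
}

impl JoinerState {
    /// Returns true if no progress has been made within the timeout period.
    fn has_timed_out(&self, now: Instant) -> bool {
        now.duration_since(self.last_progress) >= JOINER_PROGRESS_TIMEOUT
    }

    /// Hypothetical helper: call on an acceptance or observed-address update.
    fn record_progress(&mut self, now: Instant) {
        self.last_progress = now;
    }
}
```

Passing `now` in explicitly lets a test move time forward with `start + duration` instead of sleeping, and avoids subtracting from `Instant::now()`.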

@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch from d596c99 to 872bee8 Compare November 30, 2025 22:20
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from 67ec62c to f67d5c0 Compare November 30, 2025 22:20
@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch from 872bee8 to 26e7660 Compare November 30, 2025 23:04
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from f67d5c0 to 8596569 Compare November 30, 2025 23:05
@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch from 26e7660 to e960382 Compare November 30, 2025 23:14
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from 8596569 to ef7454b Compare November 30, 2025 23:16
@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch from e960382 to 99ab273 Compare November 30, 2025 23:37
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from ef7454b to 86c1367 Compare November 30, 2025 23:39
@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch from 99ab273 to 01943b7 Compare November 30, 2025 23:54
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from 86c1367 to 605d86e Compare November 30, 2025 23:55
@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch from 01943b7 to 3359306 Compare December 1, 2025 00:51
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from 605d86e to 3c6f81f Compare December 1, 2025 00:58
@sanity sanity force-pushed the fix/parallel-connect-priority1-2173 branch 2 times, most recently from d01029d to 0be445d Compare December 1, 2025 02:01
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from 3c6f81f to ef6bce1 Compare December 1, 2025 02:02
sanity and others added 18 commits December 1, 2025 17:15
The JoinerState tracks last_progress but never enforced a timeout.
If a connect operation received no acceptances, it would wait indefinitely.

Now we check for timeout at the start of process_message:
- If gateway is set (joiner) and in WaitingForResponses state
- And last_progress exceeds JOINER_PROGRESS_TIMEOUT (30s)
- Return OpError::Timeout to fail the operation

This prevents indefinitely stalled connect operations from blocking
new connection attempts.

Fixes the WaitingForResponses timeout issue from #2173.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity sanity force-pushed the fix/parallel-connect-2173 branch from 3e9a61f to 29ccbbd Compare December 1, 2025 23:15
Per Nacho's review: the op_state_manager already has a background task
that handles transaction timeouts via ttl_set. Adding inline timeout
checks in process_message is redundant and not the right place for this.

Removes the timeout additions from the previous commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@sanity sanity changed the title fix: add timeout to WaitingForResponses state (30s no progress = fail) refactor: remove redundant inline timeout check from connect op Dec 1, 2025
Base automatically changed from fix/parallel-connect-priority1-2173 to fix/seeding-subscriber-nat-2164 December 2, 2025 00:17
@sanity sanity merged commit 110e291 into fix/seeding-subscriber-nat-2164 Dec 2, 2025
8 checks passed
@sanity sanity deleted the fix/parallel-connect-2173 branch December 2, 2025 00:18
sanity added a commit that referenced this pull request Dec 2, 2025
Consolidates changes from PRs #2172, #2174, and #2175:

This builds on PR #2191 (wire protocol cleanup) and adds:
- Fix seeding/subscribe operations to handle PeerAddr::Unknown for NAT scenarios
- Gateway properly fills in observed addresses from packet source
- Improved subscriber address tracking in seeding manager
- Update live_tx and connection tests for new address model

NOTE: This PR requires review - previous PRs (#2174, #2175) had
CHANGES_REQUESTED from Nacho. Submitting consolidated changes for
fresh review.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
sanity added a commit that referenced this pull request Dec 2, 2025
Adds NAT address handling to subscribe/seeding operations:

- Subscribers with PeerAddr::Unknown have their address filled in by gateway
- Gateway observes real UDP source address and updates subscriber address
- SeedingManager tracks subscriber addresses properly
- live_tx tests updated for new address model
- In-memory testing infrastructure updated for PeerAddr

Supersedes PRs #2172, #2174, #2175 (which had changes requested).
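
The address fill-in described above can be sketched as follows. Only `PeerAddr::Unknown` is named in this conversation; the `Known` variant and the `fill_observed` function are assumptions made for illustration. Whether a gateway keeps or overrides an address the peer already claims is not stated here, so this sketch simply keeps it.

```rust
use std::net::SocketAddr;

/// Stand-in for the address model; the `Known` variant is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PeerAddr {
    Unknown,
    Known(SocketAddr),
}

/// Gateway-side step: replace an unknown address with the packet's source address.
fn fill_observed(addr: PeerAddr, packet_source: SocketAddr) -> PeerAddr {
    match addr {
        // A peer behind NAT cannot know its public address, so the gateway supplies it.
        PeerAddr::Unknown => PeerAddr::Known(packet_source),
        // An address that is already set is left untouched in this sketch.
        known => known,
    }
}
```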

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
3 participants