fix(tests): bump chunk-op timeout to 90s for macOS CI runners #75

Merged
mickvandijke merged 1 commit into main from fix/e2e-macos-chunk-timeout on Apr 21, 2026

@grumbach
Collaborator

Context

data_types::chunk::tests::test_chunk_store_on_remote_node has been flaking on Test (macos-latest) with:

thread 'data_types::chunk::tests::test_chunk_store_on_remote_node' panicked at tests/e2e/data_types/chunk.rs:260:14:
Failed to store max-size chunk on remote node: Storage("Timeout waiting for remote store response after 30s")

The test transfers a 4 MiB chunk over QUIC on loopback inside a 5-node testnet. The 30 s budget covers the QUIC+PQC handshake, the payload transfer, and the storage confirmation. That is enough on Linux CI but not on macOS runners (nested-virt, roughly half the CPU throughput of the Linux pool). Under the concurrent handshake burst, the PQC exchange (ML-KEM-768 + ML-DSA-65), the 4 MiB transfer, and the disk write can together exceed 30 s.

This is the same root cause that ant-client#50 fixed for the client test suite — CPU-constrained macOS runners + a timing budget sized for Linux.

Fix

Bump DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS from 30 s to 90 s. Test-only: no production code path reads this constant. The constant carries a comment explaining why it's larger than the happy-path needs, so future readers don't shrink it back.
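
For readers unfamiliar with how such a budget behaves, here is a minimal sketch using only the standard library. It is not the harness code: `TestnetError`, `wait_for_store_response`, and the channel-based "remote node" are hypothetical stand-ins, and only the constant name, its value, and the timeout message come from this PR. It illustrates that a larger timeout is an upper bound only, so a fast confirmation returns immediately.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Name and value taken from this PR; everything else here is hypothetical.
const DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS: u64 = 90;

/// Hypothetical stand-in for the error type seen in the panic message.
#[derive(Debug, PartialEq)]
enum TestnetError {
    Storage(String),
}

/// Waits for a store confirmation, giving up once `timeout` has elapsed.
fn wait_for_store_response(
    rx: &Receiver<()>,
    timeout: Duration,
) -> Result<(), TestnetError> {
    match rx.recv_timeout(timeout) {
        Ok(()) => Ok(()),
        Err(RecvTimeoutError::Timeout) => Err(TestnetError::Storage(format!(
            "Timeout waiting for remote store response after {}s",
            timeout.as_secs()
        ))),
        Err(RecvTimeoutError::Disconnected) => Err(TestnetError::Storage(
            "Remote store channel closed".to_string(),
        )),
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Simulated remote node that confirms the store after 50 ms.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let _ = tx.send(());
    });

    let started = Instant::now();
    let result = wait_for_store_response(
        &rx,
        Duration::from_secs(DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS),
    );

    // The call returns as soon as the confirmation arrives, not after 90 s.
    println!("result = {:?}, elapsed = {:?}", result, started.elapsed());
}
```

In this sketch the 90 s value is only reached when no confirmation arrives at all, which is consistent with the happy-path timing reported in the test plan.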

Test plan

  • cargo fmt --all --check: clean
  • cargo clippy --all-targets --all-features -- -D warnings -D clippy::unwrap_used -D clippy::expect_used: clean
  • cargo test --features test-utils --test e2e data_types::chunk::tests::test_chunk_store_on_remote_node: passes locally in 1.86 s (happy path unchanged, new budget only matters on the slow runner)
  • Full CI will run on this PR; expect macOS Test matrix to go green.

`data_types::chunk::tests::test_chunk_store_on_remote_node` has been
flaking on `Test (macos-latest)` with:

  Storage("Timeout waiting for remote store response after 30s")

The test transfers a 4 MiB chunk over QUIC on loopback inside a 5-node
testnet, with the 30 s budget covering QUIC+PQC handshake, payload
transfer, and storage confirmation. Linux runners fit comfortably;
macOS runners (nested-virt, roughly half the CPU throughput of the
Linux pool) saturate under the concurrent handshake burst and blow
through 30 s on bad days.

Mirrors the ant-client#50 root cause. 90 s is conservative - happy-path
loopback transfers complete in under a second, so the larger budget
only shows up on flakes. Test-only; no production code path reads
DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS.

Verified locally: test completes in 1.86 s with the new constant.
Copilot AI review requested due to automatic review settings April 21, 2026 06:18
Copilot AI left a comment

Pull request overview

This PR addresses a flaky E2E test on macOS CI runners by increasing the default timeout budget used for chunk store/get operations in the E2E testnet harness.

Changes:

  • Increased DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS from 30s to 90s.
  • Added an explanatory comment documenting why the larger timeout is needed on macOS CI.

Comment thread tests/e2e/testnet.rs
Comment on lines +94 to +101
/// Covers the full round-trip: QUIC handshake, up to a 4 MiB payload
/// transfer, and storage confirmation. 30 s was enough on Linux CI but
/// flaked on `macos-latest` runners (nested-virt, roughly half the CPU
/// throughput of the Linux pool) when the 5-node testnet's concurrent
/// QUIC+PQC handshake burst collided with the 4 MiB
/// `test_chunk_store_on_remote_node` fixture. 90 s is deliberately
/// conservative; the happy path completes in well under a second on
/// loopback, so the larger budget only shows up on flakes. Test-only —

Copilot AI Apr 21, 2026

The doc comment mentions covering a full QUIC handshake + payload transfer, but this timeout is also used for purely local operations (store_chunk/get_chunk call protocol.try_handle_request without any network/handshake). Consider rewording the comment to reflect that it’s a shared default for both local protocol handling and remote P2P chunk ops, with the 90s budget primarily justified by the remote macOS CI case.

Suggested change
- /// Covers the full round-trip: QUIC handshake, up to a 4 MiB payload
- /// transfer, and storage confirmation. 30 s was enough on Linux CI but
- /// flaked on `macos-latest` runners (nested-virt, roughly half the CPU
- /// throughput of the Linux pool) when the 5-node testnet's concurrent
- /// QUIC+PQC handshake burst collided with the 4 MiB
- /// `test_chunk_store_on_remote_node` fixture. 90 s is deliberately
- /// conservative; the happy path completes in well under a second on
- /// loopback, so the larger budget only shows up on flakes. Test-only —
+ /// Shared default for both local protocol handling and remote P2P chunk
+ /// operations in the E2E harness. Some call sites only execute
+ /// `protocol.try_handle_request` locally, so they do not involve any network
+ /// transfer or QUIC handshake; others cover the full remote round-trip,
+ /// including handshake, up to a 4 MiB payload transfer, and storage
+ /// confirmation. 30 s was enough on Linux CI but flaked on `macos-latest`
+ /// runners (nested-virt, roughly half the CPU throughput of the Linux pool)
+ /// when the 5-node testnet's concurrent QUIC+PQC handshake burst collided
+ /// with the 4 MiB `test_chunk_store_on_remote_node` fixture. 90 s is
+ /// deliberately conservative; the happy path completes in well under a second
+ /// on loopback, so the larger budget only shows up on flakes. Test-only —
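
To make the "shared default" point in this review comment concrete, the following is a rough sketch of one constant bounding both a purely local operation and a simulated remote one. It assumes nothing about the real harness: `run_with_timeout`, `store_chunk_local`, and `store_chunk_remote` are hypothetical names, the worker thread and sleep merely stand in for real work, and only the constant and the 4 MiB chunk size come from the PR.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Shared default for local and remote chunk operations (value from this PR).
const DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS: u64 = 90;

/// Hypothetical helper: runs `op` on a worker thread and waits at most `timeout`.
fn run_with_timeout<T, F>(op: F, timeout: Duration) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up, so a send error is ignored.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => {
            Err(format!("operation timed out after {}s", timeout.as_secs()))
        }
        Err(RecvTimeoutError::Disconnected) => {
            Err("worker exited without a result".to_string())
        }
    }
}

/// Hypothetical local path: handled in-process, no handshake or network transfer.
fn store_chunk_local(data: Vec<u8>) -> Result<usize, String> {
    let timeout = Duration::from_secs(DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS);
    run_with_timeout(move || data.len(), timeout)
}

/// Hypothetical remote path: the sleep stands in for handshake, transfer,
/// and storage confirmation.
fn store_chunk_remote(data: Vec<u8>, simulated_latency: Duration) -> Result<usize, String> {
    let timeout = Duration::from_secs(DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS);
    run_with_timeout(
        move || {
            thread::sleep(simulated_latency);
            data.len()
        },
        timeout,
    )
}

fn main() {
    // 4 MiB, the max-size chunk used by the flaky test.
    let chunk = vec![0u8; 4 * 1024 * 1024];
    println!("local:  {:?}", store_chunk_local(chunk.clone()));
    println!("remote: {:?}", store_chunk_remote(chunk, Duration::from_millis(20)));
}
```

Both paths read the same constant, so a doc comment that describes only the remote round-trip would under-describe the local call sites, which is the gap the suggested wording addresses.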

@mickvandijke mickvandijke merged commit 5a5d7d4 into main Apr 21, 2026
15 checks passed
@mickvandijke mickvandijke deleted the fix/e2e-macos-chunk-timeout branch April 21, 2026 07:43

3 participants