fix(tests): bump chunk-op timeout to 90s for macOS CI runners #75
mickvandijke merged 1 commit into main
`data_types::chunk::tests::test_chunk_store_on_remote_node` has been
flaking on `Test (macos-latest)` with:
Storage("Timeout waiting for remote store response after 30s")
The test transfers a 4 MiB chunk over QUIC on loopback inside a 5-node
testnet, with the 30 s budget covering QUIC+PQC handshake, payload
transfer, and storage confirmation. Linux runners fit comfortably;
macOS runners (nested-virt, roughly half the CPU throughput of the
Linux pool) saturate under the concurrent handshake burst and blow
through 30 s on bad days.
Mirrors the ant-client#50 root cause. 90 s is conservative - happy-path
loopback transfers complete in under a second, so the larger budget
only shows up on flakes. Test-only; no production code path reads
DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS.
Verified locally: test completes in 1.86 s with the new constant.
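The failure mode described above can be reproduced in miniature. What follows is a minimal sketch, using only the standard library, of a bounded wait that produces the same error text as the CI log. Only the constant name and the message wording come from this PR; the `wait_for_store_response` helper and the channel that stands in for the QUIC path are hypothetical, since the harness code itself is not shown here.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Test-only budget in seconds (30 before this PR, 90 after).
const DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS: u64 = 90;

/// Hypothetical helper: block until a store confirmation arrives or the
/// budget elapses. The timeout message mirrors the one in the CI log.
fn wait_for_store_response(rx: &mpsc::Receiver<()>, budget: Duration) -> Result<(), String> {
    rx.recv_timeout(budget).map_err(|e| match e {
        RecvTimeoutError::Timeout => format!(
            "Timeout waiting for remote store response after {}s",
            budget.as_secs()
        ),
        RecvTimeoutError::Disconnected => "remote store channel closed".to_string(),
    })
}

fn main() {
    // Happy path: the simulated remote confirms after 10 ms, so the wait
    // returns long before the 90 s budget is consumed.
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        let _ = tx.send(());
    });
    let budget = Duration::from_secs(DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS);
    assert!(wait_for_store_response(&rx, budget).is_ok());

    // Slow path: the sender stays alive but never confirms, so a short
    // budget expires and the helper reports a timeout.
    let (_tx_kept_open, rx_slow) = mpsc::channel::<()>();
    let err = wait_for_store_response(&rx_slow, Duration::from_secs(1)).unwrap_err();
    println!("{err}");
}
```

The point of the sketch is that the budget is an upper bound rather than a delay: raising it changes nothing for a run that confirms quickly, and only matters when the response is genuinely late.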
Pull request overview
This PR addresses a flaky E2E test on macOS CI runners by increasing the default timeout budget used for chunk store/get operations in the E2E testnet harness.
Changes:
- Increased `DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS` from 30s to 90s.
- Added an explanatory comment documenting why the larger timeout is needed on macOS CI.
/// Covers the full round-trip: QUIC handshake, up to a 4 MiB payload
/// transfer, and storage confirmation. 30 s was enough on Linux CI but
/// flaked on `macos-latest` runners (nested-virt, roughly half the CPU
/// throughput of the Linux pool) when the 5-node testnet's concurrent
/// QUIC+PQC handshake burst collided with the 4 MiB
/// `test_chunk_store_on_remote_node` fixture. 90 s is deliberately
/// conservative; the happy path completes in well under a second on
/// loopback, so the larger budget only shows up on flakes. Test-only —
The doc comment mentions covering a full QUIC handshake + payload transfer, but this timeout is also used for purely local operations (`store_chunk`/`get_chunk` call `protocol.try_handle_request` without any network/handshake). Consider rewording the comment to reflect that it's a shared default for both local protocol handling and remote P2P chunk ops, with the 90s budget primarily justified by the remote macOS CI case.
- /// Covers the full round-trip: QUIC handshake, up to a 4 MiB payload
- /// transfer, and storage confirmation. 30 s was enough on Linux CI but
- /// flaked on `macos-latest` runners (nested-virt, roughly half the CPU
- /// throughput of the Linux pool) when the 5-node testnet's concurrent
- /// QUIC+PQC handshake burst collided with the 4 MiB
- /// `test_chunk_store_on_remote_node` fixture. 90 s is deliberately
- /// conservative; the happy path completes in well under a second on
- /// loopback, so the larger budget only shows up on flakes. Test-only —
+ /// Shared default for both local protocol handling and remote P2P chunk
+ /// operations in the E2E harness. Some call sites only execute
+ /// `protocol.try_handle_request` locally, so they do not involve any network
+ /// transfer or QUIC handshake; others cover the full remote round-trip,
+ /// including handshake, up to a 4 MiB payload transfer, and storage
+ /// confirmation. 30 s was enough on Linux CI but flaked on `macos-latest`
+ /// runners (nested-virt, roughly half the CPU throughput of the Linux pool)
+ /// when the 5-node testnet's concurrent QUIC+PQC handshake burst collided
+ /// with the 4 MiB `test_chunk_store_on_remote_node` fixture. 90 s is
+ /// deliberately conservative; the happy path completes in well under a second
+ /// on loopback, so the larger budget only shows up on flakes. Test-only —
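To make the reviewer's point concrete, here is a minimal sketch of one constant feeding both kinds of call site. Everything except `DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS` is an assumption made for illustration: `run_with_budget` and the two closures stand in for the harness's local and remote paths, which are not shown in this PR.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Shared test-only default, in seconds.
const DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS: u64 = 90;

/// Hypothetical wrapper: run `op` on a worker thread and wait at most
/// `budget` for its result. Returns `None` if the budget elapses or the
/// worker exits without sending a value.
fn run_with_budget<T, F>(op: F, budget: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the budget expired; ignore that.
        let _ = tx.send(op());
    });
    rx.recv_timeout(budget).ok()
}

fn main() {
    let budget = Duration::from_secs(DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS);

    // Stand-in for a local call site: no network, returns immediately.
    let started = Instant::now();
    let local = run_with_budget(|| "stored locally", budget);
    assert_eq!(local, Some("stored locally"));

    // Stand-in for a remote call site: a short sleep simulates the
    // handshake, transfer, and confirmation.
    let remote = run_with_budget(
        || {
            thread::sleep(Duration::from_millis(20));
            "stored remotely"
        },
        budget,
    );
    assert_eq!(remote, Some("stored remotely"));

    // Both calls share the same budget, and neither comes close to using it.
    assert!(started.elapsed() < budget);
}
```

Under these assumptions the local call site pays nothing for the larger default, which is why a single shared constant remains workable even though only the remote path motivates the 90 s value.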
Context
`data_types::chunk::tests::test_chunk_store_on_remote_node` has been flaking on `Test (macos-latest)` with:
Storage("Timeout waiting for remote store response after 30s")
The test transfers a 4 MiB chunk via QUIC on loopback inside a 5-node testnet. The 30 s budget covers QUIC+PQC handshake, payload transfer, and storage confirmation: enough on Linux CI, not enough on macOS runners (nested-virt, roughly half the CPU throughput of the Linux pool). Under the concurrent handshake burst, the PQC (ML-KEM-768 + ML-DSA-65) exchange, a 4 MiB transfer, and a disk write can together spill past 30 s on a bad day.
This is the same root cause that ant-client#50 fixed for the client test suite — CPU-constrained macOS runners + a timing budget sized for Linux.
Fix
Bump `DEFAULT_CHUNK_OPERATION_TIMEOUT_SECS` from 30 s to 90 s. Test-only: no production code path reads this constant. The constant carries a comment explaining why it's larger than the happy-path needs, so future readers don't shrink it back.
Test plan
- `cargo fmt --all --check`: clean
- `cargo clippy --all-targets --all-features -- -D warnings -D clippy::unwrap_used -D clippy::expect_used`: clean
- `cargo test --features test-utils --test e2e data_types::chunk::tests::test_chunk_store_on_remote_node`: passes locally in 1.86 s (happy path unchanged, new budget only matters on the slow runner)