fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242
Conversation
…r_integration Replace timing-dependent assertions with condition-based waits and add retry logic for HTTP calls in the registry replicator integration tests. Root cause: The tests used sleep(poll_delay + 200ms) and assumed the background poller had completed, but under CI load the 200ms leeway was insufficient. Additionally, HTTP requests to PocketIC could time out under parallel test load. Specific failures: polling loop hadn't exited yet within the sleep window - assert_eq left=3 right=4 - the background poll hadn't updated the registry client to the expected version within the sleep window - get_latest_certified_time > time_after_init - the poll hadn't completed to update the certified time within the sleep window - get_certified_changes_since HTTP timeout - PocketIC HTTP requests timing out under parallel test load Changes: - Add retry logic (5 attempts) to get_all_certified_records to handle transient HTTP timeouts - Add wait_for_replicator_version helper that polls until the registry client reaches the expected version - Add wait_for_not_polling helper that polls until is_polling() returns false - Add wait_for_certified_time_gt helper that polls until certified time exceeds the given threshold - Replace sleep+assert patterns with condition-based waits where the test expects a positive outcome (e.g. replicator is up to date) - Keep sleep-based assertions only for negative checks (asserting something has NOT happened), which are safe because the poll interval is 1 second This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.
There was a problem hiding this comment.
Pull request overview
Deflakes the registry_replicator_integration tests by replacing timing-dependent sleep(...) assertions with condition-based waiting, and by adding small retry logic around PocketIC registry queries to better tolerate transient CI/network load.
Changes:
- Add retry-with-backoff to
PocketIcHelper::get_all_certified_recordsto handle transientget_certified_changes_sincefailures/timeouts. - Introduce condition-wait helpers (
wait_for_replicator_version,wait_for_not_polling,wait_for_certified_time_gt) with a 30s timeout. - Update integration tests to use the new condition-waits instead of fixed sleeps for positive assertions.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Only retry on transient RegistryUnreachable errors in get_all_certified_records, fail fast on deterministic errors. Skip backoff sleep on final attempt. - Include current version in wait_for_replicator_version timeout message. - Replace ad-hoc version wait loops with wait_for_replicator_version helper.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The two assert_replicator_not_up_to_date_yet calls that run while background polling is active are inherently non-deterministic: with retry logic in get_all_certified_records(), random_mutate() can take longer than the 1s TEST_POLL_DELAY, allowing the background poller to update before the assertion runs. Remove these two assertions. The test still has 4 deterministic negative assertions (before start_polling and after stop_polling) plus positive assertions that background polling eventually picks up changes.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
RegistryCanister::get_certified_changes_since maps query failures/timeouts (and missing responses) to Error::UnknownError, so only retrying on RegistryUnreachable would still fail immediately on the PocketIC HTTP timeouts this PR targets.
|
Thanks Bas, I improved the PR a bit more to my taste. In particular, I reverted the early return on certain errors and on the last attempt: I would trade the simplicity/readability over winning a few seconds in an edge-case. |
|
Thanks @pierugo-dfinity then I'll open it up for review. |
Co-authored-by: kpop-dfinity <125868903+kpop-dfinity@users.noreply.github.com>
kpop-dfinity
left a comment
There was a problem hiding this comment.
LGTM
Let's wait for @pierugo-dfinity 's approval before merging, though
Root Cause
The integration tests in
registry_replicator_integrationusedsleep(poll_delay + 200ms)and assumed the background poller had completed within that window. Under CI load, the 200ms leeway was insufficient, causing multiple failure modes:assert_eq left=3 right=4— The background poll hadn't updated the registry client to the expected version within the sleep window.get_latest_certified_time > time_after_init— The poll hadn't completed to update the certified time within the sleep window.get_certified_changes_since— PocketIC HTTP requests timing out under parallel test load.Fix
Replace timing-dependent assertions with condition-based waits:
get_all_certified_records: Add retry logic (5 attempts with 500ms backoff) to handle transient HTTP timeouts.wait_for_replicator_version: Polls until the registry client reaches the expected version (30s timeout).wait_for_not_polling: Polls untilis_polling()returns false (30s timeout).wait_for_certified_time_gt: Polls until the certified time exceeds the given threshold (30s timeout).Sleep-based assertions are kept only for negative checks (asserting something has NOT happened), where they are safe because the poll interval is 1 second and the operations between mutation and assertion are fast.
This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.