fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration by basvandijk · Pull Request #9242 · dfinity/ic

basvandijk · 2026-03-07T20:25:48Z

Root Cause

The integration tests in registry_replicator_integration used sleep(poll_delay + 200ms) and assumed the background poller had completed within that window. Under CI load, the 200ms leeway was insufficient, causing multiple failure modes:

assert_eq left=3 right=4 — The background poll hadn't updated the registry client to the expected version within the sleep window.
get_latest_certified_time > time_after_init — The poll hadn't completed to update the certified time within the sleep window.
HTTP timeout on get_certified_changes_since — PocketIC HTTP requests timing out under parallel test load.

Fix

Replace timing-dependent assertions with condition-based waits:

get_all_certified_records: Add retry logic (5 attempts with 500ms backoff) to handle transient HTTP timeouts.
wait_for_replicator_version: Polls until the registry client reaches the expected version (30s timeout).
wait_for_not_polling: Polls until is_polling() returns false (30s timeout).
wait_for_certified_time_gt: Polls until the certified time exceeds the given threshold (30s timeout).

Sleep-based assertions are kept only for negative checks (asserting something has NOT happened), where they are safe because the poll interval is 1 second and the operations between mutation and assertion are fast.
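The condition-based waits described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the name wait_for_condition appears in a later commit title, but the signature, the polling interval, and the error type here are assumptions.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `condition` every `interval` until it returns true or `timeout` elapses.
/// Hypothetical helper: the signature is an assumption, not the PR's actual code.
fn wait_for_condition<F>(
    description: &str,
    timeout: Duration,
    interval: Duration,
    mut condition: F,
) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Check first, so an already-satisfied condition returns without sleeping.
        if condition() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(format!(
                "timed out after {timeout:?} waiting for {description}"
            ));
        }
        sleep(interval);
    }
}
```

A test would then call something like wait_for_condition("replicator to stop polling", Duration::from_secs(30), Duration::from_millis(100), || !replicator.is_polling()) in place of a fixed sleep, where replicator is a placeholder for the object under test. The wait ends as soon as the condition holds, so a generous timeout costs nothing in the passing case.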

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…r_integration

Replace timing-dependent assertions with condition-based waits and add retry logic for HTTP calls in the registry replicator integration tests.

Root cause: The tests used sleep(poll_delay + 200ms) and assumed the background poller had completed, but under CI load the 200ms leeway was insufficient. Additionally, HTTP requests to PocketIC could time out under parallel test load.

Specific failures:
- polling loop hadn't exited yet within the sleep window
- assert_eq left=3 right=4 - the background poll hadn't updated the registry client to the expected version within the sleep window
- get_latest_certified_time > time_after_init - the poll hadn't completed to update the certified time within the sleep window
- get_certified_changes_since HTTP timeout - PocketIC HTTP requests timing out under parallel test load

Changes:
- Add retry logic (5 attempts) to get_all_certified_records to handle transient HTTP timeouts
- Add wait_for_replicator_version helper that polls until the registry client reaches the expected version
- Add wait_for_not_polling helper that polls until is_polling() returns false
- Add wait_for_certified_time_gt helper that polls until certified time exceeds the given threshold
- Replace sleep+assert patterns with condition-based waits where the test expects a positive outcome (e.g. replicator is up to date)
- Keep sleep-based assertions only for negative checks (asserting something has NOT happened), which are safe because the poll interval is 1 second

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Copilot

Pull request overview

Deflakes the registry_replicator_integration tests by replacing timing-dependent sleep(...) assertions with condition-based waiting, and by adding lightweight retry logic around PocketIC registry queries to better tolerate transient failures under CI/network load.

Changes:

Add retry-with-backoff to PocketIcHelper::get_all_certified_records to handle transient get_certified_changes_since failures/timeouts.
Introduce condition-wait helpers (wait_for_replicator_version, wait_for_not_polling, wait_for_certified_time_gt) with a 30s timeout.
Update integration tests to use the new condition-waits instead of fixed sleeps for positive assertions.
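For illustration, the retry-with-backoff idea can be sketched as a generic helper. This is a sketch under stated assumptions, not the body of PocketIcHelper::get_all_certified_records: the name retry_with_backoff and its signature are hypothetical, and the attempt count and delay are parameters rather than the PR's constants.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` up to `max_attempts` times, sleeping `backoff` after each failure.
/// Returns the first success, or the last error once all attempts are used.
/// Hypothetical helper: name and signature are assumptions for illustration.
fn retry_with_backoff<T, E, F>(max_attempts: u32, backoff: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Debug,
{
    assert!(max_attempts > 0, "need at least one attempt");
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Log intermediate errors so a flaky run stays diagnosable.
                eprintln!("attempt {attempt}/{max_attempts} failed: {err:?}");
                last_err = Some(err);
                // Simple variant: sleeps after every failure, including the last.
                sleep(backoff);
            }
        }
    }
    Err(last_err.expect("loop ran at least once"))
}
```

With 5 attempts and a 500ms delay, as in the PR description, the worst case adds about 2.5 seconds before the error is reported.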

rs/orchestrator/registry_replicator/tests/test.rs

- Only retry on transient RegistryUnreachable errors in get_all_certified_records, fail fast on deterministic errors. Skip backoff sleep on final attempt.
- Include current version in wait_for_replicator_version timeout message.
- Replace ad-hoc version wait loops with wait_for_replicator_version helper.
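The fail-fast behaviour described in this commit message might look roughly like the sketch below. It is an illustration only: retry_transient and the is_transient predicate are hypothetical names, and the real helper may be structured differently.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retries `op` only while `is_transient` classifies the error as retryable.
/// Deterministic errors are returned immediately, and no backoff sleep is
/// performed after the final attempt.
/// Hypothetical helper: name and signature are assumptions for illustration.
fn retry_transient<T, E, F, P>(
    max_attempts: u32,
    backoff: Duration,
    is_transient: P,
    mut op: F,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    P: Fn(&E) -> bool,
{
    assert!(max_attempts > 0, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Fail fast on deterministic errors or when attempts are used up.
                if !is_transient(&err) || attempt == max_attempts {
                    return Err(err);
                }
                sleep(backoff);
                attempt += 1;
            }
        }
    }
}
```

The trade-off is a slightly more involved loop in exchange for saving one backoff interval on exhausted retries and avoiding pointless retries on errors that cannot succeed.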

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

rs/orchestrator/registry_replicator/tests/test.rs

The two assert_replicator_not_up_to_date_yet calls that run while background polling is active are inherently non-deterministic: with retry logic in get_all_certified_records(), random_mutate() can take longer than the 1s TEST_POLL_DELAY, allowing the background poller to update before the assertion runs. Remove these two assertions. The test still has 4 deterministic negative assertions (before start_polling and after stop_polling) plus positive assertions that background polling eventually picks up changes.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

rs/orchestrator/registry_replicator/tests/test.rs

RegistryCanister::get_certified_changes_since maps query failures/timeouts (and missing responses) to Error::UnknownError, so only retrying on RegistryUnreachable would still fail immediately on the PocketIC HTTP timeouts this PR targets.
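To make the point concrete, a retry predicate would have to treat both error kinds as retryable. The enum below is a local stand-in written for this sketch, not the real registry error type: only the variant names RegistryUnreachable and UnknownError come from the comment above, while the String payloads, the MalformedMessage variant, and is_retryable are assumptions.

```rust
/// Local stand-in for the registry client's error type (an assumption for
/// illustration; the real type lives in the ic codebase and may differ).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum RegistryError {
    /// The registry could not be reached at all.
    RegistryUnreachable(String),
    /// Per the review comment, query failures and timeouts end up here.
    UnknownError(String),
    /// Hypothetical example of a deterministic error that retrying cannot fix.
    MalformedMessage(String),
}

/// Returns true for errors that a retry might plausibly resolve.
/// Matching only on RegistryUnreachable would miss timeouts reported as UnknownError.
fn is_retryable(err: &RegistryError) -> bool {
    matches!(
        err,
        RegistryError::RegistryUnreachable(_) | RegistryError::UnknownError(_)
    )
}
```

A predicate like this errs on the side of retrying, which costs a few extra attempts when UnknownError wraps a genuinely permanent failure.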

pierugo-dfinity · 2026-03-10T09:55:08Z

Thanks Bas, I improved the PR a bit more to my taste. In particular, I reverted the early return on certain errors and on the last attempt: I would rather keep the simplicity/readability than win a few seconds in an edge case.

basvandijk · 2026-03-10T10:00:37Z

Thanks @pierugo-dfinity, then I'll open it up for review.

rs/orchestrator/registry_replicator/tests/test.rs

Co-authored-by: kpop-dfinity <125868903+kpop-dfinity@users.noreply.github.com>

kpop-dfinity

LGTM

Let's wait for @pierugo-dfinity's approval before merging, though.

github-actions bot added the fix label Mar 7, 2026

basvandijk requested a review from Copilot March 7, 2026 20:26

Copilot started reviewing on behalf of basvandijk March 7, 2026 20:26

Copilot AI reviewed Mar 7, 2026

basvandijk requested a review from Copilot March 7, 2026 20:42

Copilot started reviewing on behalf of basvandijk March 7, 2026 20:43

Copilot AI reviewed Mar 7, 2026

basvandijk requested a review from Copilot March 7, 2026 21:12

Copilot started reviewing on behalf of basvandijk March 7, 2026 21:13

Copilot AI reviewed Mar 7, 2026

basvandijk and others added 6 commits March 8, 2026 13:38

style: simplify canister calls retry loop

df08838

style: avoid code duplication

0a017de

style: use wait_for_condition for one more use-case

ada53f6

chore: bump constants

5d41729

style: minor style improvs

0ca7e8d

basvandijk marked this pull request as ready for review March 10, 2026 10:00

basvandijk requested a review from a team as a code owner March 10, 2026 10:00

github-actions bot added the @consensus label Mar 10, 2026

kpop-dfinity reviewed Mar 10, 2026

basvandijk and others added 3 commits March 10, 2026 12:49

Apply suggestions from code review

c8b0e02

Co-authored-by: kpop-dfinity <125868903+kpop-dfinity@users.noreply.github.com>

log intermediate errors in get_all_certified_records

8fa2d27

fix

4eac23e

kpop-dfinity requested a review from pierugo-dfinity March 10, 2026 12:33

kpop-dfinity approved these changes Mar 10, 2026

pierugo-dfinity approved these changes Mar 10, 2026

pierugo-dfinity added this pull request to the merge queue Mar 10, 2026

Merged via the queue into master with commit 48578ab Mar 10, 2026
41 checks passed

pierugo-dfinity deleted the ai/deflake-registry_replicator_integration-2026-03-07 branch March 10, 2026 13:41

fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration #9242
pierugo-dfinity merged 12 commits into master from ai/deflake-registry_replicator_integration-2026-03-07

4 participants
