fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration #9242

Merged
pierugo-dfinity merged 12 commits into master from
ai/deflake-registry_replicator_integration-2026-03-07
Mar 10, 2026

Conversation

@basvandijk

Root Cause

The integration tests in registry_replicator_integration used sleep(poll_delay + 200ms) and assumed the background poller had completed within that window. Under CI load, the 200ms leeway was insufficient, causing multiple failure modes:

  • assert_eq left=3 right=4 — The background poll hadn't updated the registry client to the expected version within the sleep window.
  • get_latest_certified_time > time_after_init — The poll hadn't completed to update the certified time within the sleep window.
  • HTTP timeout on get_certified_changes_since — PocketIC HTTP requests timing out under parallel test load.

Fix

Replace timing-dependent assertions with condition-based waits:

  • get_all_certified_records: Add retry logic (5 attempts with 500ms backoff) to handle transient HTTP timeouts.
  • wait_for_replicator_version: Polls until the registry client reaches the expected version (30s timeout).
  • wait_for_not_polling: Polls until is_polling() returns false (30s timeout).
  • wait_for_certified_time_gt: Polls until the certified time exceeds the given threshold (30s timeout).
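
The condition-based waits listed above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `wait_until` and `wait_for_version` are hypothetical names, an `AtomicU64` stands in for the registry client's latest version, and the 10ms polling interval is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `condition` every `interval` until it returns `true`.
/// Returns `false` if `timeout` elapses first.
fn wait_until<F>(mut condition: F, timeout: Duration, interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

/// Hypothetical stand-in for a version wait helper: `latest_version`
/// plays the role of the registry client's latest version.
fn wait_for_version(
    latest_version: &AtomicU64,
    expected: u64,
    timeout: Duration,
) -> Result<(), String> {
    let reached = wait_until(
        || latest_version.load(Ordering::SeqCst) == expected,
        timeout,
        Duration::from_millis(10), // assumed polling interval
    );
    if reached {
        Ok(())
    } else {
        // Report the current version so a timeout is easier to diagnose.
        Err(format!(
            "timed out after {:?} waiting for version {}, current version {}",
            timeout,
            expected,
            latest_version.load(Ordering::SeqCst)
        ))
    }
}
```

Unlike a fixed `sleep`, the wait returns as soon as the condition holds, so a generous timeout costs nothing in the common case and only matters when the test is about to fail anyway.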

Sleep-based assertions are kept only for negative checks (asserting something has NOT happened), where they are safe because the poll interval is 1 second and the operations between mutation and assertion are fast.


This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…r_integration

Replace timing-dependent assertions with condition-based waits and add
retry logic for HTTP calls in the registry replicator integration tests.

Root cause: The tests used sleep(poll_delay + 200ms) and assumed the
background poller had completed, but under CI load the 200ms leeway was
insufficient. Additionally, HTTP requests to PocketIC could time out under
parallel test load.

Specific failures:
  polling loop hadn't exited yet within the sleep window
- assert_eq left=3 right=4 - the background poll hadn't updated the
  registry client to the expected version within the sleep window
- get_latest_certified_time > time_after_init - the poll hadn't completed
  to update the certified time within the sleep window
- get_certified_changes_since HTTP timeout - PocketIC HTTP requests timing
  out under parallel test load

Changes:
- Add retry logic (5 attempts) to get_all_certified_records to handle
  transient HTTP timeouts
- Add wait_for_replicator_version helper that polls until the registry
  client reaches the expected version
- Add wait_for_not_polling helper that polls until is_polling() returns
  false
- Add wait_for_certified_time_gt helper that polls until certified time
  exceeds the given threshold
- Replace sleep+assert patterns with condition-based waits where the test
  expects a positive outcome (e.g. replicator is up to date)
- Keep sleep-based assertions only for negative checks (asserting something
  has NOT happened), which are safe because the poll interval is 1 second

This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.

Copilot AI left a comment

Pull request overview

Deflakes the registry_replicator_integration tests by replacing timing-dependent sleep(...) assertions with condition-based waiting, and by adding small retry logic around PocketIC registry queries to better tolerate transient CI/network load.

Changes:

  • Add retry-with-backoff to PocketIcHelper::get_all_certified_records to handle transient get_certified_changes_since failures/timeouts.
  • Introduce condition-wait helpers (wait_for_replicator_version, wait_for_not_polling, wait_for_certified_time_gt) with a 30s timeout.
  • Update integration tests to use the new condition-waits instead of fixed sleeps for positive assertions.

- Only retry on transient RegistryUnreachable errors in get_all_certified_records,
  fail fast on deterministic errors. Skip backoff sleep on final attempt.
- Include current version in wait_for_replicator_version timeout message.
- Replace ad-hoc version wait loops with wait_for_replicator_version helper.
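
The retry behaviour this commit describes (retry only on transient errors, fail fast otherwise, no backoff sleep after the final attempt) might look roughly like the sketch below. `FetchError` and `retry_transient` are hypothetical names introduced for illustration; the real code works with the registry canister's own error type, which is not reproduced here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type: `Transient` stands in for errors worth retrying
/// (e.g. an HTTP timeout), `Permanent` for deterministic failures.
#[derive(Debug, PartialEq)]
enum FetchError {
    Transient(String),
    Permanent(String),
}

/// Calls `op` up to `max_attempts` times, sleeping `backoff` between attempts.
/// Only transient errors are retried; the last error is returned as-is.
fn retry_transient<T, F>(max_attempts: u32, backoff: Duration, mut op: F) -> Result<T, FetchError>
where
    F: FnMut() -> Result<T, FetchError>,
{
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retry only while attempts remain, so the final failed
            // attempt falls through without an extra backoff sleep.
            Err(FetchError::Transient(_)) if attempt < max_attempts => {
                sleep(backoff);
                attempt += 1;
            }
            // Permanent errors, or a transient error on the last attempt.
            Err(err) => return Err(err),
        }
    }
}
```

With the values from the PR description, a call would pass `5` attempts and a `Duration::from_millis(500)` backoff around the registry query.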

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



The two assert_replicator_not_up_to_date_yet calls that run while
background polling is active are inherently non-deterministic: with
retry logic in get_all_certified_records(), random_mutate() can take
longer than the 1s TEST_POLL_DELAY, allowing the background poller to
update before the assertion runs.

Remove these two assertions. The test still has 4 deterministic negative
assertions (before start_polling and after stop_polling) plus positive
assertions that background polling eventually picks up changes.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



basvandijk and others added 6 commits March 8, 2026 13:38
RegistryCanister::get_certified_changes_since maps query failures/timeouts
(and missing responses) to Error::UnknownError, so only retrying on
RegistryUnreachable would still fail immediately on the PocketIC HTTP
timeouts this PR targets.
@pierugo-dfinity

pierugo-dfinity commented Mar 10, 2026

Thanks Bas, I improved the PR a bit more to my taste. In particular, I reverted the early return on certain errors and on the last attempt: I prefer simplicity/readability over winning a few seconds in an edge case.

@basvandijk

Thanks @pierugo-dfinity, then I'll open it up for review.

@basvandijk basvandijk marked this pull request as ready for review March 10, 2026 10:00
@basvandijk basvandijk requested a review from a team as a code owner March 10, 2026 10:00
basvandijk and others added 3 commits March 10, 2026 12:49
Co-authored-by: kpop-dfinity <125868903+kpop-dfinity@users.noreply.github.com>

@kpop-dfinity kpop-dfinity left a comment

LGTM

Let's wait for @pierugo-dfinity's approval before merging, though.

@pierugo-dfinity pierugo-dfinity added this pull request to the merge queue Mar 10, 2026
Merged via the queue into master with commit 48578ab Mar 10, 2026
41 checks passed
@pierugo-dfinity pierugo-dfinity deleted the ai/deflake-registry_replicator_integration-2026-03-07 branch March 10, 2026 13:41