ci/test: deflake retry tests under merge-queue load #813
Closed
vikrantpuppala wants to merge 1 commit into
Two compounding fixes for the flake observed on PR #812's merge_group run, where test_oserror_retries and test_retry_max_count_not_exceeded failed with `assert mock_validate_conn.call_count == 6` because unexpected `/telemetry-ext` requests had been counted alongside the intended session-endpoint retries.

1. tests/e2e/common/retry_test_mixins.py: strengthen `_isolated_from_telemetry()` with two additional defensive patches:
   - TelemetryClient._send_telemetry → no-op
   - TelemetryClient._export_event → no-op

   The existing factory swap installs NoopTelemetryClient for connections created during the test, but doesn't cover real TelemetryClient instances that slip in via other paths (stale module-global, pre-existing client created before the test entered, or code that bypasses initialize_telemetry_client). Patching at the class level for the duration of the context catches all of them.

   Verified locally: test_oserror_retries goes from flaky-on-CI to 5/5 green in consecutive runs. (test_retry_max_count_not_exceeded still fails on this branch but also fails on baseline main: pre-existing `'SimpleHttpResponse' object has no attribute 'version_string'` issue, unrelated.)

2. .github/workflows/code-coverage.yml: serialise merge_group runs.

   Previous concurrency group was keyed on github.ref, which is per-PR in the queue (`gh-readonly-queue/main/pr-N-…`). That allowed multiple queue entries to hammer the same warehouse in parallel, stressing telemetry / retry paths that single-PR runs don't exercise.

   Group merge_group + workflow_dispatch under a single fixed name (`e2e-mq-serial`) so they run one at a time. PR-event runs keep per-ref grouping + cancel-in-progress for fast author feedback.

   Trade-off: queue throughput drops to one ~17-min run at a time. For this repo's PR volume that's acceptable, and the alternative is flaky merges.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Folding these changes into #812 instead so the deflake fixes ship together with the telemetry test rewrite.
Summary
Two compounding fixes for the flake observed on PR #812's `merge_group` run, where `test_oserror_retries` and `test_retry_max_count_not_exceeded` failed with `assert mock_validate_conn.call_count == 6`: unexpected `/telemetry-ext` requests were counted alongside the intended session-endpoint retries, inflating the count.
Why it was happening
Two interacting causes:
1. `_isolated_from_telemetry()` only patches `TelemetryClientFactory.initialize_telemetry_client`. That covers new connections created during the test, but not real `TelemetryClient` instances that slip in via:
   - a stale module-global
   - a pre-existing client created before the test entered
   - code that bypasses `initialize_telemetry_client` (e.g. `connection_failure_log` already passes through the patch, but other paths might not in the future)
2. The merge queue ran multiple entries in parallel against the same warehouse. The previous concurrency group was keyed on `github.ref` (per-PR in queue: `gh-readonly-queue/main/pr-N-…`), so the queue entries for PR #810 (ci(code-coverage): move push:main trigger to merge_group) and PR #812 (test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load) ran concurrently. The warehouse was fine for individual queue entries but couldn't handle two simultaneous loads: telemetry/retry paths started intermittently failing on `/telemetry-ext`, and those failures got counted by `mock_validate_conn`.
What this PR changes
1. `tests/e2e/common/retry_test_mixins.py`
Strengthens `_isolated_from_telemetry()` with two backstop patches:
- `TelemetryClient._send_telemetry` → no-op
- `TelemetryClient._export_event` → no-op
These cover any path that could reach the telemetry HTTP layer, regardless of how the `TelemetryClient` instance was created. Layer 1 (factory swap) catches the common case; layers 2 and 3 catch edge cases.
The existing `@patch("...TelemetryClient._send_telemetry")` decorators on individual tests stay in place. They're now redundant with the context manager's layer-2 patch, but removing them would expand the diff for no functional benefit.
Local verification: `test_oserror_retries` runs 5/5 green in a row (previously: flaky on CI, deterministic locally because no warehouse contention).
2. `.github/workflows/code-coverage.yml`
Serialises `merge_group` runs under a single fixed concurrency group (`e2e-mq-serial`). Only one queue entry runs the suite at a time; subsequent entries queue up behind it. `pull_request` runs keep their per-ref + `cancel-in-progress` behaviour for fast author feedback.
Trade-off: queue throughput drops to one ~17-min run at a time. Acceptable for this repo's PR volume. The alternative is the flake we've been chasing.
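For reviewers who want to see why the class-level patches in change 1 also cover clients that already exist, here is a minimal sketch, not the code from `retry_test_mixins.py`. The `TelemetryClient` class below is a hypothetical stand-in (only the two method names come from this PR), `isolated_from_telemetry` is an illustrative name, and the factory swap (layer 1) is left out.

```python
from contextlib import ExitStack, contextmanager
from unittest.mock import patch


class TelemetryClient:
    """Hypothetical stand-in; only the two method names come from the PR."""

    sent = []  # records every call that would have reached the HTTP layer

    def _send_telemetry(self, events):
        TelemetryClient.sent.append(("send", events))

    def _export_event(self, event):
        TelemetryClient.sent.append(("export", event))


@contextmanager
def isolated_from_telemetry():
    """Replace both methods on the class so every instance becomes a no-op."""
    with ExitStack() as stack:
        for name in ("_send_telemetry", "_export_event"):
            stack.enter_context(
                patch.object(TelemetryClient, name, lambda self, *a, **kw: None)
            )
        yield  # originals are restored when the ExitStack unwinds


# A client created before the context is entered (the "pre-existing" case).
stale_client = TelemetryClient()

with isolated_from_telemetry():
    stale_client._send_telemetry(["e1"])
    stale_client._export_event("e2")
calls_inside = list(TelemetryClient.sent)  # nothing recorded while patched

stale_client._export_event("e3")  # real method is back after the context exits
calls_after = list(TelemetryClient.sent)
```

Because method lookup goes through the class, patching the class attribute silences `stale_client` even though it was constructed earlier; a factory-only swap would not reach it.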
What's NOT in this PR
`test_retry_max_count_not_exceeded` still fails on this branch, but it also fails on baseline main, with `'SimpleHttpResponse' object has no attribute 'version_string'`. That's an unrelated pre-existing issue (looks like a urllib3 version drift or `mocked_server_response` mock breakage). Worth tracking separately; out of scope here.
Test plan
`test_oserror_retries` flakes under concurrent queue load.
This pull request and its description were written by Isaac.