Skip to content

ci/test: deflake retry tests under merge-queue load#813

Closed
vikrantpuppala wants to merge 1 commit into
mainfrom
vikrant/retry-tests-and-mq-serialization
Closed

ci/test: deflake retry tests under merge-queue load#813
vikrantpuppala wants to merge 1 commit into
mainfrom
vikrant/retry-tests-and-mq-serialization

Conversation

@vikrantpuppala
Copy link
Copy Markdown
Contributor

Summary

Two compounding fixes for the flake observed on PR #812's merge_group run where test_oserror_retries and test_retry_max_count_not_exceeded failed with assert mock_validate_conn.call_count == 6 — unexpected /telemetry-ext requests were counted alongside the intended session-endpoint retries, inflating the count.

Why it was happening

Two interacting causes:

  1. _isolated_from_telemetry() only patches TelemetryClientFactory.initialize_telemetry_client. That covers new connections created during the test, but not real TelemetryClient instances that slip in via:

    • Stale module-global state from a previous test.
    • Code paths that bypass initialize_telemetry_client (e.g. connection_failure_log already passes through the patch, but other paths might not in the future).
    • Pre-existing clients created before the context manager entered.
  2. The merge queue ran multiple entries in parallel against the same warehouse. The previous concurrency group was keyed on github.ref (per-PR in queue: gh-readonly-queue/main/pr-N-…), so PR ci(code-coverage): move push:main trigger to merge_group #810's and PR test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load #812's queue entries ran concurrently. The warehouse was fine for individual queue entries but couldn't handle two simultaneous loads — telemetry/retry paths started intermittently failing on /telemetry-ext, and those failures got counted by mock_validate_conn.

What this PR changes

1. tests/e2e/common/retry_test_mixins.py

Strengthens _isolated_from_telemetry() with two backstop patches:

  • TelemetryClient._send_telemetry → no-op
  • TelemetryClient._export_event → no-op

These cover any path that could reach the telemetry HTTP layer, regardless of how the TelemetryClient instance was created. Layer 1 (factory swap) catches the common case; layers 2 and 3 catch edge cases.

The existing @patch("...TelemetryClient._send_telemetry") decorators on individual tests stay in place. They're now redundant with the context manager's layer-2 patch, but removing them would expand the diff for no functional benefit.

Local verification: test_oserror_retries runs 5/5 green in a row (previously: flaky on CI, deterministic locally because no warehouse contention).

2. .github/workflows/code-coverage.yml

Serialises merge_group runs under a single fixed concurrency group (e2e-mq-serial). Only one queue entry runs the suite at a time; subsequent entries queue up behind it. pull_request runs keep their per-ref + cancel-in-progress behaviour for fast author feedback.

Trade-off: queue throughput drops to one ~17-min run at a time. Acceptable for this repo's PR volume. The alternative is the flake we've been chasing.

What's NOT in this PR

test_retry_max_count_not_exceeded still fails on this branch — but it also fails on baseline main, with 'SimpleHttpResponse' object has no attribute 'version_string'. That's an unrelated pre-existing issue (looks like an urllib3 version drift or mocked_server_response mock breakage). Worth tracking separately; out of scope here.

Test plan

This pull request and its description were written by Isaac.

Two compounding fixes for the flake observed on PR #812's merge_group
run, where test_oserror_retries and test_retry_max_count_not_exceeded
failed with `assert mock_validate_conn.call_count == 6` because
unexpected `/telemetry-ext` requests had been counted alongside the
intended session-endpoint retries.

1. tests/e2e/common/retry_test_mixins.py — strengthen
   `_isolated_from_telemetry()` with two additional defensive patches:

      - TelemetryClient._send_telemetry → no-op
      - TelemetryClient._export_event → no-op

   The existing factory swap installs NoopTelemetryClient for connections
   created during the test, but doesn't cover real TelemetryClient
   instances that slip in via other paths (stale module-global,
   pre-existing client created before the test entered, or code that
   bypasses initialize_telemetry_client). Patching at the class level
   for the duration of the context catches all of them.

   Verified locally: test_oserror_retries goes from flaky-on-CI to 5/5
   green in consecutive runs.

   (test_retry_max_count_not_exceeded still fails on this branch but
   also fails on baseline main — pre-existing `'SimpleHttpResponse'
   object has no attribute 'version_string'` issue, unrelated.)

2. .github/workflows/code-coverage.yml — serialise merge_group runs.

   Previous concurrency group was keyed on github.ref, which is per-PR
   in the queue (`gh-readonly-queue/main/pr-N-…`). That allowed multiple
   queue entries to hammer the same warehouse in parallel, stressing
   telemetry / retry paths that single-PR runs don't exercise.

   Group merge_group + workflow_dispatch under a single fixed name
   (`e2e-mq-serial`) so they run one at a time. PR-event runs keep
   per-ref grouping + cancel-in-progress for fast author feedback.

   Trade-off: queue throughput drops to one ~17-min run at a time. For
   this repo's PR volume that's acceptable, and the alternative is
   flaky merges.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala
Copy link
Copy Markdown
Contributor Author

Folding these changes into #812 instead so the deflake fixes ship together with the telemetry test rewrite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant