ci/test: deflake retry tests under merge-queue load by vikrantpuppala · Pull Request #813 · databricks/databricks-sql-python

vikrantpuppala · 2026-05-27T17:00:09Z

Summary

Two compounding fixes for the flake observed on PR #812's merge_group run where test_oserror_retries and test_retry_max_count_not_exceeded failed with assert mock_validate_conn.call_count == 6 — unexpected /telemetry-ext requests were counted alongside the intended session-endpoint retries, inflating the count.

Why it was happening

Two interacting causes:

_isolated_from_telemetry() only patches TelemetryClientFactory.initialize_telemetry_client. That covers new connections created during the test, but not real TelemetryClient instances that slip in via:
- Stale module-global state from a previous test.
- Code paths that bypass initialize_telemetry_client (e.g. connection_failure_log already passes through the patch, but other paths might not in the future).
- Pre-existing clients created before the context manager entered.
The merge queue ran multiple entries in parallel against the same warehouse. The previous concurrency group was keyed on github.ref (per-PR in queue: gh-readonly-queue/main/pr-N-…), so PR ci(code-coverage): move push:main trigger to merge_group #810's and PR test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load #812's queue entries ran concurrently. The warehouse was fine for individual queue entries but couldn't handle two simultaneous loads — telemetry/retry paths started intermittently failing on /telemetry-ext, and those failures got counted by mock_validate_conn.

What this PR changes

1. `tests/e2e/common/retry_test_mixins.py`

Strengthens _isolated_from_telemetry() with two backstop patches:

TelemetryClient._send_telemetry → no-op
TelemetryClient._export_event → no-op

These cover any path that could reach the telemetry HTTP layer, regardless of how the TelemetryClient instance was created. Layer 1 (factory swap) catches the common case; layers 2 and 3 catch edge cases.

The existing @patch("...TelemetryClient._send_telemetry") decorators on individual tests stay in place. They're now redundant with the context manager's layer-2 patch, but removing them would expand the diff for no functional benefit.

Local verification: test_oserror_retries runs 5/5 green in a row (previously: flaky on CI, deterministic locally because no warehouse contention).

2. `.github/workflows/code-coverage.yml`

Serialises merge_group runs under a single fixed concurrency group (e2e-mq-serial). Only one queue entry runs the suite at a time; subsequent entries queue up behind it. pull_request runs keep their per-ref + cancel-in-progress behaviour for fast author feedback.

Trade-off: queue throughput drops to one ~17-min run at a time. Acceptable for this repo's PR volume. The alternative is the flake we've been chasing.

What's NOT in this PR

test_retry_max_count_not_exceeded still fails on this branch — but it also fails on baseline main, with 'SimpleHttpResponse' object has no attribute 'version_string'. That's an unrelated pre-existing issue (looks like an urllib3 version drift or mocked_server_response mock breakage). Worth tracking separately; out of scope here.

Test plan

CI passes on this PR (the queue-serialisation kicks in once this lands; until then PR-event runs use the existing pattern).
PR test(telemetry/e2e): make TestTelemetryE2E deterministic + deflake retry tests under merge-queue load #812 (telemetry deflake) re-queues and merges cleanly.
Future merge_group runs show no test_oserror_retries flakes under concurrent queue load.

This pull request and its description were written by Isaac.

Two compounding fixes for the flake observed on PR #812's merge_group run, where test_oserror_retries and test_retry_max_count_not_exceeded failed with `assert mock_validate_conn.call_count == 6` because unexpected `/telemetry-ext` requests had been counted alongside the intended session-endpoint retries. 1. tests/e2e/common/retry_test_mixins.py — strengthen `_isolated_from_telemetry()` with two additional defensive patches: - TelemetryClient._send_telemetry → no-op - TelemetryClient._export_event → no-op The existing factory swap installs NoopTelemetryClient for connections created during the test, but doesn't cover real TelemetryClient instances that slip in via other paths (stale module-global, pre-existing client created before the test entered, or code that bypasses initialize_telemetry_client). Patching at the class level for the duration of the context catches all of them. Verified locally: test_oserror_retries goes from flaky-on-CI to 5/5 green in consecutive runs. (test_retry_max_count_not_exceeded still fails on this branch but also fails on baseline main — pre-existing `'SimpleHttpResponse' object has no attribute 'version_string'` issue, unrelated.) 2. .github/workflows/code-coverage.yml — serialise merge_group runs. Previous concurrency group was keyed on github.ref, which is per-PR in the queue (`gh-readonly-queue/main/pr-N-…`). That allowed multiple queue entries to hammer the same warehouse in parallel, stressing telemetry / retry paths that single-PR runs don't exercise. Group merge_group + workflow_dispatch under a single fixed name (`e2e-mq-serial`) so they run one at a time. PR-event runs keep per-ref grouping + cancel-in-progress for fast author feedback. Trade-off: queue throughput drops to one ~17-min run at a time. For this repo's PR volume that's acceptable, and the alternative is flaky merges. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

vikrantpuppala · 2026-05-27T17:00:39Z

Folding these changes into #812 instead so the deflake fixes ship together with the telemetry test rewrite.

vikrantpuppala temporarily deployed to azure-prod May 27, 2026 17:00 — with GitHub Actions Inactive

vikrantpuppala closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci/test: deflake retry tests under merge-queue load#813

ci/test: deflake retry tests under merge-queue load#813
vikrantpuppala wants to merge 1 commit into
mainfrom
vikrant/retry-tests-and-mq-serialization

vikrantpuppala commented May 27, 2026

Uh oh!

vikrantpuppala commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vikrantpuppala commented May 27, 2026

Summary

Why it was happening

What this PR changes

1. tests/e2e/common/retry_test_mixins.py

2. .github/workflows/code-coverage.yml

What's NOT in this PR

Test plan

Uh oh!

vikrantpuppala commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `tests/e2e/common/retry_test_mixins.py`

2. `.github/workflows/code-coverage.yml`