
Reduce noise in the daily CI duration trend alert#69113

Merged
potiuk merged 1 commit into apache:main from potiuk:ci-duration-alert-reduce-noise on Jul 3, 2026

@potiuk potiuk commented Jun 29, 2026

Member

The daily CI Duration Trend Alert (.github/workflows/ci-duration-monitor.yml + scripts/ci/analyze_ci_job_durations.py) has been firing most nights with a wildly varying set of "jobs that got slower", carrying little real signal.

Root cause: the monitor compared a single nightly canary run against the median of the preceding ~24 runs (LATEST_RUNS defaulted to 1). The two sides were asymmetric — a raw, unsmoothed point against a robust median — so any one unlucky run (slow PyPI, GitHub runner queue pressure, a cold cache) tripped the alert. Because a different run is "latest" each day, a different set of jobs got flagged each day. Network-bound constraint-resolution jobs (Finalize tests / Deps 3.x:constraints, Generate constraints, provider installs), which legitimately swing tens of minutes run-to-run, dominated nearly every alert and only needed to clear a 3-minute absolute floor.
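To make the asymmetry concrete, here is a minimal sketch with made-up durations for a single job. The helper `increase_vs_baseline` and the 8-run baseline window are assumptions for illustration only; they are not taken from `analyze_ci_job_durations.py`.

```python
from statistics import median

# Hypothetical durations (minutes) for one network-bound job, oldest first.
# The last run is the "unlucky" one (e.g. slow PyPI).
durations = [30, 31, 29, 32, 30, 31, 30, 29, 31, 30, 45]


def increase_vs_baseline(durations, latest_runs, baseline_runs=8):
    """Median of the latest window minus median of the window preceding it."""
    latest = durations[-latest_runs:]
    baseline = durations[-(latest_runs + baseline_runs):-latest_runs]
    return median(latest) - median(baseline)


# One raw point against a median: the outlier shows up in full.
single = increase_vs_baseline(durations, latest_runs=1)   # 15.0

# Median against median: the same outlier is absorbed.
smoothed = increase_vs_baseline(durations, latest_runs=3)  # 1.0
```

With a 3-minute floor the first comparison alerts and the second does not, even though the underlying job has not changed.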

Fix:

  • LATEST_RUNS: "3" — compare the median of the last 3 nightly runs against the baseline so the two sides are symmetric and one unlucky run no longer trips it.
  • JOB_MIN_ABS_INCREASE_MINUTES: "6" — require a larger sustained absolute jump before flagging an individual job, so ordinary network variance on the long constraint jobs stops alerting.
  • ONLY_SUCCESSFUL: "true" — pin the baseline to successful (green) canary runs. The script already defaulted to this, but a failed/cancelled canary stops partway and its truncated durations would skew the trend downwards and mask regressions; setting it explicitly keeps the green-only guarantee visible and unchangeable-by-accident at the call site.
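
As a rough illustration of how the first two knobs interact, the sketch below flags a job only when the median of its latest runs exceeds the baseline median by the absolute floor. `flag_slower_jobs`, the `>=` comparison, the "Build docs" job name and all durations are hypothetical; the real script may apply additional conditions that are not modelled here.

```python
from statistics import median

LATEST_RUNS = 3                     # value set in this PR
JOB_MIN_ABS_INCREASE_MINUTES = 6.0  # value set in this PR


def flag_slower_jobs(job_history,
                     latest_runs=LATEST_RUNS,
                     min_abs_increase=JOB_MIN_ABS_INCREASE_MINUTES):
    """Return {job: increase_in_minutes} for jobs above the absolute floor.

    job_history maps a job name to its durations in minutes, oldest first.
    """
    flagged = {}
    for job, durations in job_history.items():
        if len(durations) <= latest_runs:
            continue  # not enough history to form a baseline
        latest = median(durations[-latest_runs:])
        baseline = median(durations[:-latest_runs])
        increase = latest - baseline
        if increase >= min_abs_increase:
            flagged[job] = increase
    return flagged


# Made-up history: one job with a single slow run, one with a sustained jump.
history = {
    "Generate constraints": [40, 42, 41, 43, 40, 41, 52, 42, 44],
    "Build docs": [20, 21, 20, 22, 21, 20, 29, 30, 28],
}

flagged_now = flag_slower_jobs(history)
flagged_with_old_floor = flag_slower_jobs(history, min_abs_increase=3.0)
```

In this example only the job with the sustained jump is flagged at the 6-minute floor, while the 3-minute floor would also flag the job that had one slow run inside the latest window.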

No script logic changes — the existing env knobs already support this, so the behaviour is config-only and the script's tests are unaffected. Genuine sustained regressions still trip the alert.
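
For readers unfamiliar with the pattern, a config-only change of this kind could look roughly like the sketch below on the script side. This is not the code from `analyze_ci_job_durations.py`: `read_knobs`, `select_runs` and the run-record shape are hypothetical, and the fallback of `"3"` for the per-job floor is an assumption based on the 3-minute floor mentioned above.

```python
import os


def read_knobs(env=os.environ):
    """Parse the three knobs touched by this PR from environment variables."""
    return {
        # The description states LATEST_RUNS previously defaulted to 1.
        "latest_runs": int(env.get("LATEST_RUNS", "1")),
        # Assumed fallback; the source only says the old floor was 3 minutes.
        "job_min_abs_increase_minutes": float(
            env.get("JOB_MIN_ABS_INCREASE_MINUTES", "3")
        ),
        # The description states the script already defaulted to green-only.
        "only_successful": env.get("ONLY_SUCCESSFUL", "true").lower() == "true",
    }


def select_runs(runs, only_successful):
    """Drop failed or cancelled runs when only_successful is set.

    Each run is assumed to be a dict with a "conclusion" key.
    """
    if not only_successful:
        return list(runs)
    return [run for run in runs if run["conclusion"] == "success"]


knobs = read_knobs({
    "LATEST_RUNS": "3",
    "JOB_MIN_ABS_INCREASE_MINUTES": "6",
    "ONLY_SUCCESSFUL": "true",
})
runs = [
    {"id": 1, "conclusion": "success"},
    {"id": 2, "conclusion": "cancelled"},  # truncated durations, excluded
    {"id": 3, "conclusion": "success"},
]
green_runs = select_runs(runs, knobs["only_successful"])
```

Because the values arrive through the environment, the workflow file is the only place that has to change.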


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.8)

Generated-by: Claude Code (Opus 4.8) following the guidelines

@potiuk potiuk marked this pull request as ready for review June 29, 2026 06:47
The duration monitor flagged jobs by comparing a single nightly canary run
against the median of the preceding runs, so any one slow run — slow PyPI,
runner queue pressure, a cold cache — tripped the alert. Because a different
run was "latest" each day, a different set of jobs was flagged each day, and
network-bound constraint-resolution jobs that legitimately swing tens of
minutes dominated nearly every alert. The result was a near-daily alert whose
contents swung wildly and carried little signal.

Compare the median of the last few nightly runs against the baseline so the
two sides are symmetric and one unlucky run no longer trips it, and require a
larger absolute jump before flagging individual jobs.

Pin the monitor to successful (green) canary runs only. A failed or cancelled
canary stops partway, so its truncated wall-clock and per-job durations would
skew the baseline downwards and mask real regressions. The script already
defaults to this, but the guarantee is now explicit at the call site so it
cannot be silently changed.
@potiuk potiuk force-pushed the ci-duration-alert-reduce-noise branch from 7252b10 to c58fa46 Compare July 3, 2026 11:41
@potiuk potiuk merged commit e99daee into apache:main Jul 3, 2026
66 checks passed
@potiuk potiuk deleted the ci-duration-alert-reduce-noise branch July 3, 2026 13:45
github-actions Bot commented Jul 3, 2026

Contributor

Backport successfully created: v3-3-test

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

| Status | Branch | Result |
| --- | --- | --- |
| | v3-3-test | PR Link |

potiuk added a commit that referenced this pull request Jul 3, 2026
(cherry picked from commit e99daee)
@potiuk potiuk added this to the Airflow 3.3.1 milestone Jul 3, 2026