ci: extend merge_group fail-fast and auto-dequeue failed PRs #21498

Merged
yperbasis merged 2 commits into main from yperbasis/mergequeue-failfast-dequeue
May 29, 2026

@yperbasis
Member

Background

#21483 added fail-fast: ${{ github.event_name == 'merge_group' }} to test-hive-eest.yml only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has — and there is also a second problem #21483 didn't address: GitHub does not auto-remove the failed PR from the merge queue.

CI Gate run 26573584442 for PR #21374 demonstrated both gaps:

  • hive-eest / rlp, serial failed at 14:29:49 from a transient Docker Hub blip (alpine:latest manifest HEAD returned unknown: while building hive/hiveproxy).
  • hive-eest's fail-fast cancelled siblings in 1 second; hive / test-hive (still on fail-fast: false) kept dispatching matrix legs — engine, api, serial and engine, cancun, parallel started at 14:36:01 / 14:36:17 (~7 min after gh run cancel) and ran to success. ci-gate couldn't reach a terminal state until those finished at 14:43:54, delaying eviction by ~14 minutes.
  • Even once ci-gate reported conclusion: failure at 14:44:05, GitHub did not remove PR #21374 from the queue: the entry stayed at position 2 with state UNMERGEABLE. The queue only advanced because PR #21483 was manually jumped over it.
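
A stuck entry like the one described above can be observed from the command line. The following is a minimal sketch, assuming an authenticated GitHub CLI (`gh`) and that the GraphQL `mergeQueue` entries expose `position`, `state`, and `pullRequest.number`. The helper names `list_queue_entries` and `unmergeable_only` are hypothetical and not part of this PR.

```shell
# Hypothetical helper: print merge-queue entries as
# "position<TAB>state<TAB>PR number", one per line.
# Arguments: <owner> <repo> <base branch>. Requires an authenticated gh.
list_queue_entries() {
  gh api graphql \
    -f query='query($owner:String!,$repo:String!,$branch:String!){
      repository(owner:$owner,name:$repo){
        mergeQueue(branch:$branch){
          entries(first:50){nodes{position state pullRequest{number}}}
        }
      }
    }' \
    -f owner="$1" -f repo="$2" -f branch="$3" \
    --jq '.data.repository.mergeQueue.entries.nodes[] | [.position, .state, .pullRequest.number] | @tsv'
}

# Hypothetical filter: keep only the lines whose state column is UNMERGEABLE.
unmergeable_only() {
  awk -F '\t' '$2 == "UNMERGEABLE"'
}

# Usage (owner and repo are placeholders to fill in):
#   list_queue_entries <owner> <repo> main | unmergeable_only
```

An entry that still appears in this output after ci-gate has reported failure is the situation the dequeue step is meant to clear.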

Changes

1. Roll out merge_group fail-fast to the remaining matrix workflows

Same gating as #21483 (${{ github.event_name == 'merge_group' }}), applied to:

  • test-hive.yml
  • test-all-erigon.yml
  • test-all-erigon-race.yml
  • test-eest-spec.yml
  • test-bench.yml
  • test-kurtosis-assertoor.yml

Behaviour matches #21483: in merge_group runs, the first failed shard cancels its siblings at the GitHub API layer (no waiting for runner drain); in pull_request / schedule / workflow_dispatch runs, all shards continue so authors keep the full per-shard breakdown.

2. Auto-dequeue UNMERGEABLE PRs whose required check failed

New step at the end of ci-gate.yml's ci-gate job:

- name: Dequeue failed merge-queue PR
  if: failure() && github.event_name == 'merge_group'
  ...

The step:

  1. Inspects needs.*.result and skips when all are cancelled with no failure. That pattern is a queue reshuffle (a PR ahead of us merged, our merge-group SHA is stale), where GitHub re-creates a new merge_group event for us; dequeuing here would be wrong. Confirmed by run 26573568764, where ci-gate's job conclusion was failure (needs cancelled → Check all required jobs exits 1) but the run was cancelled by GitHub during a reshuffle.
  2. Parses the PR number from gh-readonly-queue/<base>/pr-<N>-<sha> (handles multi-segment bases like release/3.4).
  3. Resolves the PR number to a GraphQL node ID and calls the dequeuePullRequest mutation. Soft-fails on errors (warning, not non-zero exit) so a dequeue glitch never masks ci-gate's own failure signal.
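
The first two steps above can be sketched in POSIX shell. This is an illustration under stated assumptions, not the script merged in ci-gate.yml: the function names are hypothetical, the results are assumed to arrive as a space-separated list (for example from `${{ join(needs.*.result, ' ') }}`), and the sketch simplifies the skip rule to "no job reported failure", whereas the description above only spells out the all-cancelled case.

```shell
# Hypothetical helper: decide what to do from a space-separated list of
# needs.*.result values. Prints "dequeue" if any job failed, else "skip".
# Assumption: a run with cancellations but no failure is a queue reshuffle.
classify_results() {
  for result in $1; do
    if [ "$result" = failure ]; then
      echo dequeue
      return 0
    fi
  done
  echo skip
}

# Hypothetical helper: print the PR number from a merge-queue ref of the form
# gh-readonly-queue/<base>/pr-<N>-<sha>; return 1 for any other ref.
parse_pr_number() {
  ref=${1#refs/heads/}
  case $ref in
    gh-readonly-queue/*/pr-[0-9]*-*) ;;
    *) return 1 ;;
  esac
  # Strip through the LAST "/pr-" so bases with slashes (release/3.4) work.
  tail=${ref##*/pr-}
  num=${tail%%-*}
  case $num in
    ''|*[!0-9]*) return 1 ;;
  esac
  printf '%s\n' "$num"
}

# Usage:
#   classify_results "cancelled cancelled"                          # skip
#   parse_pr_number "gh-readonly-queue/release/3.4/pr-21374-0a1b2c" # 21374
```

Stripping up to the last `/pr-` rather than splitting on the first slash is what lets multi-segment base branches pass through unchanged.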

Permissions bumped from pull-requests: read to pull-requests: write for the mutation.
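
The node-ID lookup and mutation could look roughly like the sketch below. It assumes an authenticated GitHub CLI (`gh`) whose token carries `pull-requests: write`; the function name `dequeue_pr` and its argument order are hypothetical, and the real step may differ. Errors are reported through the `::warning::` workflow command and the function always returns 0, mirroring the soft-fail behaviour described above.

```shell
# Hypothetical helper: dequeue a PR from the merge queue, soft-failing.
# Arguments: <owner> <repo> <PR number>. Always returns 0 so that a dequeue
# problem never changes the job's own exit status.
dequeue_pr() {
  owner=$1
  repo=$2
  number=$3

  # Resolve the PR number to its GraphQL node ID.
  node_id=$(gh api graphql \
    -f query='query($owner:String!,$repo:String!,$number:Int!){
      repository(owner:$owner,name:$repo){pullRequest(number:$number){id}}
    }' \
    -f owner="$owner" -f repo="$repo" -F number="$number" \
    --jq '.data.repository.pullRequest.id') || node_id=
  if [ -z "$node_id" ]; then
    echo "::warning::could not resolve node ID for PR #$number"
    return 0
  fi

  # Ask GitHub to remove the PR from the merge queue.
  gh api graphql \
    -f query='mutation($id:ID!){
      dequeuePullRequest(input:{id:$id}){mergeQueueEntry{id}}
    }' \
    -f id="$node_id" >/dev/null \
    || echo "::warning::dequeuePullRequest failed for PR #$number"
  return 0
}

# Usage (owner and repo are placeholders to fill in):
#   dequeue_pr <owner> <repo> 21374
```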

Why both in one PR

Both target the same incident class (broken PR sits at the head of the queue blocking everything else). The fail-fast change shrinks time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually evicts the failed PR. Either alone is a partial fix — having both means a broken PR's run goes red fast and the queue advances without anyone needing to manually jump over it.

🤖 Generated with Claude Code

PR #21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}`
to `test-hive-eest.yml` only, with the explicit note that other
matrix-bearing reusable workflows could get the same treatment "if
another workflow becomes the bottleneck." It has — and there is also
a second problem #21483 didn't address: GitHub doesn't auto-remove
the failed PR from the queue.

CI Gate run 26573584442 for PR #21374 showed both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker
  Hub blip (alpine:latest manifest HEAD returned "unknown:" while
  building hive/hiveproxy).
- hive-eest's `fail-fast` cancelled siblings in 1 sec; `hive /
  test-hive` kept dispatching matrix legs and only finished at
  14:43:54, delaying ci-gate's terminal state by ~14 minutes.
- Even once ci-gate reported `conclusion: failure` at 14:44:05,
  GitHub did not remove PR #21374 from the queue: the entry stayed
  at position 2 with state UNMERGEABLE. The queue only advanced
  because PR #21483 was manually `jump`ed over it.

Changes:

1. Apply `fail-fast: ${{ github.event_name == 'merge_group' }}` to
   the remaining ci-gate reusable workflows: `test-hive.yml`,
   `test-all-erigon.yml`, `test-all-erigon-race.yml`,
   `test-eest-spec.yml`, `test-bench.yml`, `test-kurtosis-assertoor.yml`.
   `test-hive-eest.yml` already has it from #21483.

2. Add a `Dequeue failed merge-queue PR` step to `ci-gate.yml` that
   runs on `failure() && github.event_name == 'merge_group'`. It:

   - Inspects `needs.*.result` and skips when all are `cancelled`
     with no `failure` — that pattern is a queue reshuffle (PR ahead
     of us merged, our SHA is stale), where GitHub re-creates a fresh
     merge_group event for us; dequeuing would be wrong. Confirmed by
     run 26573568764, where ci-gate's job conclusion was `failure` but
     the run was cancelled by GitHub during a reshuffle.
   - Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>`,
     resolves it to a GraphQL node ID, and calls the
     `dequeuePullRequest` mutation. Soft-fails on errors so a dequeue
     glitch never masks ci-gate's own failure signal.

   Permissions bumped from `pull-requests: read` to `pull-requests:
   write` for the mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Extends the merge-queue fast-fail pattern from #21483 to all remaining matrix-bearing reusable workflows called by ci-gate, and adds automatic dequeue of UNMERGEABLE PRs whose required checks failed, since GitHub does not evict them automatically.

Changes:

  • Switches fail-fast to ${{ github.event_name == 'merge_group' }} in six matrix workflows (test-hive, test-all-erigon, test-all-erigon-race, test-eest-spec, test-bench, test-kurtosis-assertoor), keeping false-equivalent behavior for PR/schedule/dispatch runs.
  • Adds a Dequeue failed merge-queue PR step to ci-gate.yml that skips on all-cancelled (reshuffle) results, parses the PR number from the merge-queue ref, resolves it to a node ID, and calls the dequeuePullRequest GraphQL mutation, soft-failing on errors.
  • Bumps pull-requests permission from read to write in ci-gate.yml to authorize the mutation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/ci-gate.yml | Adds auto-dequeue step on failure in merge_group; widens pull-requests permission to write; documents rationale. |
| .github/workflows/test-hive.yml | Gates matrix fail-fast to merge_group only. |
| .github/workflows/test-all-erigon.yml | Same merge_group-gated fail-fast change. |
| .github/workflows/test-all-erigon-race.yml | Same merge_group-gated fail-fast change on the race matrix. |
| .github/workflows/test-eest-spec.yml | Same merge_group-gated fail-fast change on EEST spec shards. |
| .github/workflows/test-bench.yml | Same merge_group-gated fail-fast change on bench matrix. |
| .github/workflows/test-kurtosis-assertoor.yml | Same merge_group-gated fail-fast change on Kurtosis suites matrix. |

Contributor

@Giulio2002 Giulio2002 left a comment


LGTM — straightforward CI workflow update to fail fast in merge_group runs and auto-dequeue failed merge-queue PRs.

@yperbasis yperbasis added this pull request to the merge queue May 29, 2026
Merged via the queue into main with commit 64b55bc May 29, 2026
93 checks passed
@yperbasis yperbasis deleted the yperbasis/mergequeue-failfast-dequeue branch May 29, 2026 12:22