ci: extend merge_group fail-fast and auto-dequeue failed PRs #21498
Merged
PR #21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to `test-hive-eest.yml` only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has, and there is also a second problem #21483 didn't address: GitHub doesn't auto-remove the failed PR from the queue.

CI Gate run 26573584442 for PR #21374 showed both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker Hub blip (alpine:latest manifest HEAD returned "unknown:" while building hive/hiveproxy).
- hive-eest's `fail-fast` cancelled siblings in 1 sec; `hive / test-hive` kept dispatching matrix legs and only finished at 14:43:54, delaying ci-gate's terminal state by ~14 minutes.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub did not remove PR #21374 from the queue: the entry stayed at position 2 with state UNMERGEABLE. The queue only advanced because PR #21483 was manually `jump`ed over it.

Changes:

1. Apply `fail-fast: ${{ github.event_name == 'merge_group' }}` to the remaining ci-gate reusable workflows: `test-hive.yml`, `test-all-erigon.yml`, `test-all-erigon-race.yml`, `test-eest-spec.yml`, `test-bench.yml`, `test-kurtosis-assertoor.yml`. `test-hive-eest.yml` already has it from #21483.
2. Add a `Dequeue failed merge-queue PR` step to `ci-gate.yml` that runs on `failure() && github.event_name == 'merge_group'`. It:
   - Inspects `needs.*.result` and skips when all are `cancelled` with no `failure`. That pattern is a queue reshuffle (PR ahead of us merged, our SHA is stale), where GitHub re-creates a fresh merge_group event for us; dequeuing would be wrong. Confirmed by run 26573568764, where ci-gate's job conclusion was `failure` but the run was cancelled by GitHub during a reshuffle.
   - Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>`, resolves it to a GraphQL node ID, and calls the `dequeuePullRequest` mutation.

   Soft-fails on errors so a dequeue glitch never masks ci-gate's own failure signal.

Permissions bumped from `pull-requests: read` to `pull-requests: write` for the mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Extends the merge-queue fast-fail pattern from #21483 to all remaining matrix-bearing reusable workflows called by ci-gate, and adds automatic dequeue of UNMERGEABLE PRs whose required checks failed, since GitHub does not evict them automatically.
Changes:
- Switches `fail-fast` to `${{ github.event_name == 'merge_group' }}` in six matrix workflows (test-hive, test-all-erigon, test-all-erigon-race, test-eest-spec, test-bench, test-kurtosis-assertoor), keeping `false`-equivalent behavior for PR/schedule/dispatch runs.
- Adds a `Dequeue failed merge-queue PR` step to `ci-gate.yml` that skips on all-cancelled (reshuffle) results, parses the PR number from the merge-queue ref, resolves it to a node ID, and calls the `dequeuePullRequest` GraphQL mutation, soft-failing on errors.
- Bumps `pull-requests` permission from `read` to `write` in `ci-gate.yml` to authorize the mutation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/ci-gate.yml | Adds auto-dequeue step on failure in merge_group; widens pull-requests permission to write; documents rationale. |
| .github/workflows/test-hive.yml | Gates matrix fail-fast to merge_group only. |
| .github/workflows/test-all-erigon.yml | Same merge_group-gated fail-fast change. |
| .github/workflows/test-all-erigon-race.yml | Same merge_group-gated fail-fast change on the race matrix. |
| .github/workflows/test-eest-spec.yml | Same merge_group-gated fail-fast change on EEST spec shards. |
| .github/workflows/test-bench.yml | Same merge_group-gated fail-fast change on bench matrix. |
| .github/workflows/test-kurtosis-assertoor.yml | Same merge_group-gated fail-fast change on Kurtosis suites matrix. |
Giulio2002
approved these changes
May 28, 2026
LGTM — straightforward CI workflow update to fail fast in merge_group runs and auto-dequeue failed merge-queue PRs.
Background
#21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to `test-hive-eest.yml` only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has, and there is also a second problem #21483 didn't address: GitHub does not auto-remove the failed PR from the merge queue.

CI Gate run 26573584442 for PR #21374 demonstrated both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker Hub blip (`alpine:latest` manifest HEAD returned `unknown:` while building `hive/hiveproxy`).
- hive-eest's `fail-fast` cancelled siblings in 1 second; `hive / test-hive` (still on `fail-fast: false`) kept dispatching matrix legs: `engine, api, serial` and `engine, cancun, parallel` started at 14:36:01 / 14:36:17 (~7 min after `gh run cancel`) and ran to `success`. ci-gate couldn't reach a terminal state until those finished at 14:43:54, delaying eviction by ~14 minutes.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub did not remove PR #21374 from the queue: the entry stayed at position 2 with state `UNMERGEABLE`. The queue only advanced because PR #21483 was manually `jump`ed over it.

Changes
1. Roll out merge_group fail-fast to the remaining matrix workflows
Same gating as #21483 (`${{ github.event_name == 'merge_group' }}`), applied to:

- `test-hive.yml`
- `test-all-erigon.yml`
- `test-all-erigon-race.yml`
- `test-eest-spec.yml`
- `test-bench.yml`
- `test-kurtosis-assertoor.yml`

Behaviour matches #21483: in `merge_group`, the first failed shard cancels its siblings at the GitHub API layer (no waiting for runner drain); in `pull_request` / `schedule` / `workflow_dispatch`, all shards continue so authors keep the full per-shard breakdown.

2. Auto-dequeue UNMERGEABLE PRs whose required check failed
New step at the end of `ci-gate.yml`'s ci-gate job. The step:

- Inspects `needs.*.result` and skips when all are `cancelled` with no `failure`. That pattern is a queue reshuffle (a PR ahead of us merged, our merge-group SHA is stale), where GitHub re-creates a new merge_group event for us; dequeuing here would be wrong. Confirmed by run 26573568764, where ci-gate's job conclusion was `failure` (needs cancelled → `Check all required jobs` exits 1) but the run was cancelled by GitHub during a reshuffle.
- Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>` (handles multi-segment bases like `release/3.4`).
- Resolves it to a GraphQL node ID and calls the `dequeuePullRequest` mutation. Soft-fails on errors (warning, not non-zero exit) so a dequeue glitch never masks ci-gate's own failure signal.

Permissions bumped from `pull-requests: read` to `pull-requests: write` for the mutation.

Why both in one PR
Both target the same incident class (broken PR sits at the head of the queue blocking everything else). The fail-fast change shrinks time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually evicts the failed PR. Either alone is a partial fix — having both means a broken PR's run goes red fast and the queue advances without anyone needing to manually jump over it.
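For readers who want a concrete picture of the dequeue step from change 2, here is a minimal bash sketch of its three parts. It is not the step as merged: the function names `should_dequeue`, `pr_number_from_ref`, and `dequeue_pr` are hypothetical, the GraphQL query text is an assumption, and treating "at least one `failure`" as the dequeue condition is one reading of the skip rule described above. It also assumes the `gh` CLI is on the runner and a token with `pull-requests: write` is exported as `GH_TOKEN`.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; all function names here are hypothetical.

# should_dequeue RESULT...
# Succeeds only if at least one needed job reported "failure".
# An all-cancelled set (queue reshuffle) therefore returns non-zero.
should_dequeue() {
  local r
  for r in "$@"; do
    if [ "$r" = "failure" ]; then
      return 0
    fi
  done
  return 1
}

# pr_number_from_ref REF
# Prints N from .../gh-readonly-queue/<base>/pr-<N>-<sha>.
# Only the last path segment is inspected, so a base such as
# release/3.4 (which adds extra slashes) does not matter.
pr_number_from_ref() {
  local ref="$1" leaf num
  leaf="${ref##*/}"            # e.g. pr-21374-0123abcd
  case "$leaf" in
    pr-[0-9]*-*) ;;
    *) return 1 ;;
  esac
  num="${leaf#pr-}"            # drop the "pr-" prefix
  num="${num%%-*}"             # drop "-<sha>"
  case "$num" in
    ''|*[!0-9]*) return 1 ;;   # reject anything that is not all digits
  esac
  printf '%s\n' "$num"
}

# dequeue_pr OWNER REPO NUMBER
# Resolves the PR to a node ID, then calls dequeuePullRequest.
# Always returns 0: errors become workflow warnings (soft-fail).
dequeue_pr() {
  local owner="$1" repo="$2" number="$3" node_id
  node_id="$(gh api graphql \
    -f query='query($owner:String!,$name:String!,$number:Int!){repository(owner:$owner,name:$name){pullRequest(number:$number){id}}}' \
    -f owner="$owner" -f name="$repo" -F number="$number" \
    --jq '.data.repository.pullRequest.id')" || {
      echo "::warning::could not resolve PR #$number to a node ID"
      return 0
    }
  gh api graphql \
    -f query='mutation($id:ID!){dequeuePullRequest(input:{id:$id}){clientMutationId}}' \
    -f id="$node_id" >/dev/null \
    || echo "::warning::dequeuePullRequest failed for PR #$number"
  return 0
}
```

In a workflow step, the results passed to `should_dequeue` would come from `needs.*.result` and the ref from the merge-group run's ref; only when both helpers succeed would `dequeue_pr` be called. Keeping the two pure helpers separate from the network call makes the reshuffle and ref-parsing rules easy to check without touching the queue.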
🤖 Generated with Claude Code