ci: extend merge_group fail-fast and auto-dequeue failed PRs by yperbasis · Pull Request #21498 · erigontech/erigon

yperbasis · 2026-05-28T15:12:55Z

Background

#21483 added fail-fast: ${{ github.event_name == 'merge_group' }} to test-hive-eest.yml only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has — and there is also a second problem #21483 didn't address: GitHub does not auto-remove the failed PR from the merge queue.

CI Gate run 26573584442 for PR #21374 demonstrated both gaps:

- hive-eest / rlp, serial failed at 14:29:49 from a transient Docker Hub blip (alpine:latest manifest HEAD returned unknown: while building hive/hiveproxy).
- hive-eest's fail-fast cancelled siblings in 1 second, but hive / test-hive (still on fail-fast: false) kept dispatching matrix legs: engine, api, serial and engine, cancun, parallel started at 14:36:01 / 14:36:17 (~7 min after gh run cancel) and ran to success. ci-gate couldn't reach a terminal state until those finished at 14:43:54, delaying eviction by ~14 minutes.
- Even once ci-gate reported conclusion: failure at 14:44:05, GitHub did not remove PR #21374 ("performance: cherry-pick 5 improvements to main") from the queue: the entry stayed at position 2 with state UNMERGEABLE. The queue only advanced because PR #21483 ("ci: fail-fast hive-eest matrix on merge_group so broken PRs evict quickly") was manually jumped over it.

Changes

1. Roll out merge_group fail-fast to the remaining matrix workflows

Same gating as #21483 (${{ github.event_name == 'merge_group' }}), applied to:

test-hive.yml
test-all-erigon.yml
test-all-erigon-race.yml
test-eest-spec.yml
test-bench.yml
test-kurtosis-assertoor.yml

Behaviour matches #21483: in merge_group, the first failed shard cancels its siblings at the GitHub API layer (no waiting for runner drain); in pull_request / schedule / workflow_dispatch, all shards keep running so authors still get the full per-shard breakdown.

2. Auto-dequeue UNMERGEABLE PRs whose required check failed

New step at the end of ci-gate.yml's ci-gate job:

- name: Dequeue failed merge-queue PR
  if: failure() && github.event_name == 'merge_group'
  ...

The step:

- Inspects needs.*.result and skips when all are cancelled with no failure. That pattern is a queue reshuffle (a PR ahead of us merged, so our merge-group SHA is stale), after which GitHub creates a new merge_group event for us; dequeuing here would be wrong. Confirmed by run 26573568764, where ci-gate's job conclusion was failure (needs cancelled → Check all required jobs exits 1) but the run itself was cancelled by GitHub during a reshuffle.
- Parses the PR number from gh-readonly-queue/<base>/pr-<N>-<sha> (handles multi-segment bases like release/3.4).
- Resolves the PR number to a GraphQL node ID and calls the dequeuePullRequest mutation. Soft-fails on errors (emits a warning instead of a non-zero exit) so a dequeue glitch never masks ci-gate's own failure signal.
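
The first two bullets above can be sketched in shell as follows. This is a minimal illustration, not the step's actual script (which the description elides). The function names should_dequeue and pr_number_from_ref are hypothetical, the job results are assumed to be passed one per argument, and the "any explicit failure" test is one reading of the skip rule described above.

```shell
# Hypothetical helpers; names and input shapes are assumptions.

# Decide whether to dequeue, given the results of the needed jobs
# (each one of: success, failure, cancelled, skipped).
# Returns 0 (dequeue) only when at least one job reports "failure".
# A run with cancellations but no failure is treated as a queue reshuffle.
should_dequeue() {
  for sd_result in "$@"; do
    if [ "$sd_result" = "failure" ]; then
      return 0
    fi
  done
  return 1
}

# Extract <N> from a ref shaped like gh-readonly-queue/<base>/pr-<N>-<sha>.
# Only the last path segment is inspected, so a multi-segment base such as
# release/3.4 does not confuse the parse.
pr_number_from_ref() {
  pn_leaf="${1##*/}"            # last path segment: pr-<N>-<sha>
  case "$pn_leaf" in
    pr-*-*) ;;
    *) return 1 ;;
  esac
  pn_rest="${pn_leaf#pr-}"      # <N>-<sha>
  pn_number="${pn_rest%%-*}"    # <N>
  case "$pn_number" in
    ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric values
  esac
  printf '%s\n' "$pn_number"
}
```

In a workflow step these would be called with the needed jobs' results and the merge-queue ref; any non-zero return means "do nothing".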

Permissions bumped from pull-requests: read to pull-requests: write for the mutation.
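
The node-ID lookup, the mutation call, and the soft-fail behaviour might look roughly like the sketch below. It assumes the gh CLI is on the runner and authenticated with a token that has pull-requests: write. The helper name dequeue_pr and its arguments are hypothetical, and the actual step may resolve the node ID differently.

```shell
# Hypothetical helper: dequeue PR number $2 in repository $1 (owner/name).
# Always returns 0 so a dequeue problem never changes the job's outcome.
dequeue_pr() {
  dq_repo="$1"
  dq_number="$2"

  # REST pull request objects expose the GraphQL node ID as "node_id".
  if ! dq_node_id="$(gh api "repos/${dq_repo}/pulls/${dq_number}" --jq '.node_id')"; then
    echo "::warning::could not resolve node ID for PR #${dq_number}"
    return 0
  fi

  # Call the dequeuePullRequest mutation; a failure only emits a warning.
  if ! gh api graphql \
      -f query='mutation($id: ID!) { dequeuePullRequest(input: {id: $id}) { clientMutationId } }' \
      -f id="$dq_node_id" >/dev/null; then
    echo "::warning::dequeuePullRequest failed for PR #${dq_number}"
  fi
  return 0
}
```

Returning 0 on every path is what keeps the dequeue best-effort: the job's red status still comes only from the required-jobs check.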

Why both in one PR

Both target the same incident class (broken PR sits at the head of the queue blocking everything else). The fail-fast change shrinks time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually evicts the failed PR. Either alone is a partial fix — having both means a broken PR's run goes red fast and the queue advances without anyone needing to manually jump over it.

🤖 Generated with Claude Code

PR #21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to `test-hive-eest.yml` only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has, and there is also a second problem #21483 didn't address: GitHub doesn't auto-remove the failed PR from the queue.

CI Gate run 26573584442 for PR #21374 showed both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker Hub blip (alpine:latest manifest HEAD returned "unknown:" while building hive/hiveproxy).
- hive-eest's `fail-fast` cancelled siblings in 1 sec; `hive / test-hive` kept dispatching matrix legs and only finished at 14:43:54, delaying ci-gate's terminal state by ~14 minutes.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub did not remove PR #21374 from the queue: the entry stayed at position 2 with state UNMERGEABLE. The queue only advanced because PR #21483 was manually `jump`ed over it.

Changes:

1. Apply `fail-fast: ${{ github.event_name == 'merge_group' }}` to the remaining ci-gate reusable workflows: `test-hive.yml`, `test-all-erigon.yml`, `test-all-erigon-race.yml`, `test-eest-spec.yml`, `test-bench.yml`, `test-kurtosis-assertoor.yml`. `test-hive-eest.yml` already has it from #21483.
2. Add a `Dequeue failed merge-queue PR` step to `ci-gate.yml` that runs on `failure() && github.event_name == 'merge_group'`. It:
   - Inspects `needs.*.result` and skips when all are `cancelled` with no `failure`. That pattern is a queue reshuffle (PR ahead of us merged, our SHA is stale), where GitHub re-creates a fresh merge_group event for us; dequeuing would be wrong. Confirmed by run 26573568764, where ci-gate's job conclusion was `failure` but the run was cancelled by GitHub during a reshuffle.
   - Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>`, resolves it to a GraphQL node ID, and calls the `dequeuePullRequest` mutation.

   Soft-fails on errors so a dequeue glitch never masks ci-gate's own failure signal.

Permissions bumped from `pull-requests: read` to `pull-requests: write` for the mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Extends the merge-queue fast-fail pattern from #21483 to all remaining matrix-bearing reusable workflows called by ci-gate, and adds automatic dequeue of UNMERGEABLE PRs whose required checks failed, since GitHub does not evict them automatically.

Changes:

- Switches fail-fast to ${{ github.event_name == 'merge_group' }} in six matrix workflows (test-hive, test-all-erigon, test-all-erigon-race, test-eest-spec, test-bench, test-kurtosis-assertoor), keeping false-equivalent behavior for PR/schedule/dispatch runs.
- Adds a Dequeue failed merge-queue PR step to ci-gate.yml that skips on all-cancelled (reshuffle) results, parses the PR number from the merge-queue ref, resolves it to a node ID, and calls the dequeuePullRequest GraphQL mutation, soft-failing on errors.
- Bumps pull-requests permission from read to write in ci-gate.yml to authorize the mutation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.


File	Description
.github/workflows/ci-gate.yml	Adds auto-dequeue step on failure in merge_group; widens pull-requests permission to write; documents rationale.
.github/workflows/test-hive.yml	Gates matrix `fail-fast` to merge_group only.
.github/workflows/test-all-erigon.yml	Same merge_group-gated `fail-fast` change.
.github/workflows/test-all-erigon-race.yml	Same merge_group-gated `fail-fast` change on the race matrix.
.github/workflows/test-eest-spec.yml	Same merge_group-gated `fail-fast` change on EEST spec shards.
.github/workflows/test-bench.yml	Same merge_group-gated `fail-fast` change on bench matrix.
.github/workflows/test-kurtosis-assertoor.yml	Same merge_group-gated `fail-fast` change on Kurtosis suites matrix.


Giulio2002

LGTM — straightforward CI workflow update to fail fast in merge_group runs and auto-dequeue failed merge-queue PRs.

yperbasis requested review from lystopad and mriccobene as code owners May 28, 2026 15:12

yperbasis requested review from anacrolix, Copilot and taratorio May 28, 2026 15:15

yperbasis added the QA label May 28, 2026

Copilot started reviewing on behalf of yperbasis May 28, 2026 15:15

Copilot AI reviewed May 28, 2026


Merge remote-tracking branch 'origin/main' into yperbasis/mergequeue-failfast-dequeue

0b49788

Giulio2002 approved these changes May 28, 2026


yperbasis added this pull request to the merge queue May 29, 2026

Merged via the queue into main with commit 64b55bc May 29, 2026
93 checks passed

yperbasis deleted the yperbasis/mergequeue-failfast-dequeue branch May 29, 2026 12:22

yperbasis merged 2 commits into main from yperbasis/mergequeue-failfast-dequeue
