[aw-failures] Conclusion job brittle when GitHub returns HTTP 500/DNS errors — cascades to setup_globals.cjs not found #28149

@github-actions

Problem

During a transient GitHub infrastructure outage (16:28–16:49 UTC on 2026-04-23), the conclusion job in multiple workflows failed with a two-stage cascade. One further run hit the same cascade at 18:32 UTC, outside that window, after a DNS failure:

  1. Primary error: git checkout fails with HTTP 500 or a DNS error (Could not resolve host: github.com)
  2. Cascade error: Error: Cannot find module '/home/runner/work/_temp/gh-aw/actions/setup_globals.cjs'

The cascade happens because handle_agent_failure.cjs imports setup_globals.cjs from the checked-out repo. When the checkout fails, that module is unavailable — turning a transient infra blip into a hard, unrecoverable conclusion job failure.
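The cascade can be illustrated with a guarded load. This is a minimal sketch, not the actual code in handle_agent_failure.cjs: `tryRequire` is a hypothetical helper, and it assumes the helper is loaded with a plain `require()` of the path shown in the error message.

```javascript
// Hypothetical helper: load a module, returning null when it cannot be found.
function tryRequire(modulePath) {
  try {
    return require(modulePath);
  } catch (err) {
    // Node sets err.code to "MODULE_NOT_FOUND" when require() cannot resolve a path.
    if (err && err.code === "MODULE_NOT_FOUND") {
      return null;
    }
    // Anything else (for example a syntax error in the helper) should still surface.
    throw err;
  }
}

// Path taken from the cascade error in this issue.
const setupGlobalsPath = "/home/runner/work/_temp/gh-aw/actions/setup_globals.cjs";
const setupGlobals = tryRequire(setupGlobalsPath);
if (setupGlobals === null) {
  console.warn(`helper missing at ${setupGlobalsPath}; the checkout probably failed`);
}
```

Note that Node reports the same error code when the helper exists but one of its own dependencies is missing, so a null result only suggests, rather than proves, that the checkout failed.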

Affected Runs (last 6h)

| Workflow                 | Run ID       | Time (UTC) | Error                                  |
|--------------------------|--------------|------------|----------------------------------------|
| Smoke CI                 | §24846689722 | 16:29      | HTTP 500 + setup_globals.cjs not found |
| Smoke CI                 | §24846759388 | 16:31      | HTTP 500 + setup_globals.cjs not found |
| Smoke CI                 | §24846939615 | 16:35      | HTTP 500 + setup_globals.cjs not found |
| Smoke CI                 | §24847116637 | 16:39      | HTTP 500 + setup_globals.cjs not found |
| Smoke CI                 | §24852137215 | 18:32      | DNS failure + setup_globals.cjs not found |
| Test Quality Sentinel    | §24846641687 | 16:28      | HTTP 500 + setup_globals.cjs not found |
| Design Decision Gate 🏗️  | §24847007622 | 16:37      | HTTP 500 + setup_globals.cjs not found |
| Slide Deck Maintainer    | §24847273895 | 16:43      | HTTP 500 + setup_globals.cjs not found |

Note: in most of these runs, the agent job succeeded (noop or valid output) — only the conclusion cleanup step failed.

Root Cause

The conclusion job performs a fresh git checkout to access AWF action helpers (e.g., setup_globals.cjs, handle_agent_failure.cjs). When github.com is transiently unreachable (HTTP 500 or DNS failure), this checkout fails. The error cascades because handle_agent_failure.cjs cannot load its dependency, preventing graceful failure reporting.

Proposed Remediation

Option A (Preferred): Add retry with backoff to the conclusion job's git checkout

  • 3 attempts with exponential backoff (e.g., 5s, 15s, 30s)
  • Covers transient HTTP 500 and DNS hiccups without false positives
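Option A could be sketched roughly as follows. The names `withRetry` and `checkoutOnce` are hypothetical, and the git command is only a stand-in, since the issue does not show how the conclusion job performs its checkout. With three attempts only the first two waits (5s, 15s) are ever used; the 30s entry applies if `maxAttempts` is raised.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");
const execFileAsync = promisify(execFile);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: call `attempt` until it succeeds or maxAttempts is reached.
async function withRetry(
  attempt,
  { maxAttempts = 3, delaysMs = [5000, 15000, 30000], wait = sleep } = {}
) {
  let lastError;
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      return await attempt(n);
    } catch (err) {
      lastError = err;
      if (n < maxAttempts) {
        // Reuse the last delay if the schedule is shorter than the attempt count.
        const delay = delaysMs[Math.min(n - 1, delaysMs.length - 1)];
        await wait(delay);
      }
    }
  }
  throw lastError;
}

// Assumed checkout command for illustration only; the real step may differ.
function checkoutOnce() {
  return execFileAsync("git", ["fetch", "--depth=1", "origin"]);
}

// Usage (not executed here): withRetry(checkoutOnce).catch(reportCheckoutFailure);
```

This sketch retries on any error. A stricter version could retry only when the error text looks transient, so that a genuinely broken ref fails fast.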

Option B: Pre-bundle AWF action helpers

  • Bundle setup_globals.cjs and related helpers as part of the AWF harness artifacts
  • Conclusion job reads from bundle instead of live checkout
  • Eliminates the checkout dependency entirely
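One way Option B might look from the conclusion job's side is a lookup that prefers a bundled copy of the helpers. This is a sketch under assumptions: `resolveHelperDir` and the `AWF_HELPER_BUNDLE_DIR` environment variable are hypothetical, and keeping the checkout directory as a fallback is a design choice for the example, not something the issue asks for.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: return the first directory that contains every required file.
function resolveHelperDir(candidateDirs, requiredFiles) {
  for (const dir of candidateDirs) {
    const complete = requiredFiles.every((file) => fs.existsSync(path.join(dir, file)));
    if (complete) {
      return dir;
    }
  }
  return null;
}

// Bundle location is an assumed environment variable; the checkout path comes
// from the error message in this issue.
const helperDir = resolveHelperDir(
  [process.env.AWF_HELPER_BUNDLE_DIR, "/home/runner/work/_temp/gh-aw/actions"].filter(Boolean),
  ["setup_globals.cjs"]
);
if (helperDir === null) {
  console.warn("no helper directory available; neither bundle nor checkout is usable");
}
```

Listing the bundle first means a failed checkout no longer matters as long as the artifact was downloaded.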

Option C: Graceful degradation

  • If checkout fails, conclusion job exits with a specific known error code
  • Downstream reporting treats "checkout failure" distinctly from "agent failure"
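Option C needs some way to tell the two failure kinds apart. The following is a minimal sketch: the function name, the exit code values, and the matching patterns are all assumptions for illustration (75 is borrowed from `EX_TEMPFAIL` in sysexits.h; the issue does not specify a code). The patterns mirror the error strings quoted in this issue.

```javascript
// Hypothetical exit codes; the issue does not define concrete values.
const EXIT_AGENT_FAILURE = 1;
const EXIT_CHECKOUT_FAILURE = 75; // EX_TEMPFAIL in sysexits.h: temporary failure

// Patterns taken from the error strings quoted in this issue.
const INFRA_PATTERNS = [
  /HTTP 500/i,
  /Could not resolve host/i,
  /Cannot find module .*setup_globals\.cjs/,
];

// Hypothetical classifier: map an error message to a failure kind and exit code.
function classifyConclusionError(message) {
  const text = String(message || "");
  if (INFRA_PATTERNS.some((pattern) => pattern.test(text))) {
    return { kind: "checkout_failure", exitCode: EXIT_CHECKOUT_FAILURE };
  }
  return { kind: "agent_failure", exitCode: EXIT_AGENT_FAILURE };
}

// Usage sketch: a top-level handler could set the exit code from the result.
// process.exitCode = classifyConclusionError(err.message).exitCode;
```

Downstream reporting could then skip filing a failure issue when the exit code indicates a checkout failure.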

Success Criteria

  • Conclusion job completes successfully (or exits with a known code) even when github.com returns HTTP 500 or DNS errors
  • A transient GitHub outage no longer produces 8+ false-positive failure issues
  • Agent output (noop, comments, etc.) that was already written by the agent job is not re-flagged as a failure

Parent Issue

Part of failure investigation report: #27730

Generated by [aw] Failure Investigator (6h)

  • expires on Apr 30, 2026, 7:21 PM UTC
