[aw-failures] [aw-fi] 6h Failure Analysis: 2026-04-17 07:20–13:20 UTC #26874

@github-actions

Executive Summary

Analysis of 249 workflow runs in the 6-hour window 07:20–13:20 UTC on 2026-04-17.

15 hard failures detected: 7 agentic workflow failures + 7 CI test failures + 1 main-branch CI failure. All 7 individual agentic failures have auto-generated tracking issues. Two root causes are not yet covered: a Node.js binary path issue (P1) and a Copilot timeout pattern (P2).

No P0 failures. No issues are stale or fixed — all were created today.

Failure Clusters

| Cluster | Runs | Severity | Tracking |
|---|---|---|---|
| Copilot engine timeouts | 4 | P2 | Individually: #26866, #26865, #26848, #26833 |
| Node.js v24.15.0 binary missing | 2 | P1 | Individually: #26839, #26829; root cause: #26876 |
| MCP Gateway schema validation (mempalace) | 1 | P1 | Individually: #26852; root cause: #26822 |
| WASM golden test failures (CI) | 8 | P2 | Out of scope (CI, not agentic) |
| action_required on PR workflows | 96 | | Expected behavior; not failures |

Evidence

Node.js binary missing (P1) — runs 24560753963, 24557049556

Both the Daily Issues Report Generator (§24560753963, 10:34Z) and Daily News (§24557049556, 09:03Z) fail inside the firewall agent container at the copilot driver launch step:

/bin/bash: line 1: /home/runner/work/_tool/node/24.15.0/x64/bin/node: No such file or directory

The entrypoint executes: ${GH_AW_NODE_BIN:-node} ${RUNNER_TEMP}/gh-aw/actions/copilot_driver.cjs. Because the error names the toolcache path rather than plain node, GH_AW_NODE_BIN must be set to that hardcoded path, which does not exist inside the container. The container itself starts successfully; the failure is in Node.js resolution inside it.

See sub-issue: #26876
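
One way to guard against this failure mode is to verify that the configured binary exists before using it, and otherwise fall back to node on PATH. The following is a minimal Python sketch of that resolution logic, not the actual fix tracked in #26876: resolve_node_bin is a hypothetical helper, and the real entrypoint is a shell expansion rather than Python.

```python
import os
import shutil


def resolve_node_bin(env=None):
    """Return a usable Node.js path, or None if none can be found.

    Hypothetical helper (not part of gh-aw): prefers GH_AW_NODE_BIN when it
    points at an executable file, otherwise falls back to `node` on PATH.
    """
    env = os.environ if env is None else env
    configured = env.get("GH_AW_NODE_BIN")
    if configured and os.path.isfile(configured) and os.access(configured, os.X_OK):
        return configured
    # The configured path is unset or stale (the failure mode seen in these runs),
    # so search PATH instead of launching a binary that does not exist.
    return shutil.which("node", path=env.get("PATH"))
```

With a check like this, a stale toolcache path would yield either a working fallback or an explicit None that the launcher can report, instead of a "No such file or directory" error from bash.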

MCP Gateway schema validation failure (P1) — run 24562726732

Daily Fact About gh-aw (§24562726732, 11:27Z) fails at "Start MCP Gateway" with:

config:validation_schema Schema validation failed: jsonschema: '/mcpServers/mempalace'
does not validate with .../mcp-gateway-config.schema.json
.../oneOf/0/$ref/required: missing properties: 'container'

The mempalace MCP server config doesn't satisfy any of the three server config variants (stdio requires container, http requires url, custom disallows the current type value). This is a breaking schema change in gh-aw-mcpg v0.2.22 — the workflow's lock file was compiled against v0.2.19 and is now out of sync. Root cause tracked in #26822.
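
The oneOf failure can be illustrated with a simplified Python stand-in for the three variants described above. This is a sketch under stated assumptions: the field name type and the exact rules are guesses made for illustration, and the authoritative rules live in mcp-gateway-config.schema.json, which is not reproduced here.

```python
def matching_variants(server):
    """Return the names of the config variants a server entry satisfies.

    Simplified, hypothetical rules:
      - stdio:  type is "stdio" and a `container` property is present
      - http:   type is "http" and a `url` property is present
      - custom: any type value other than "stdio" or "http"
    """
    server_type = server.get("type")
    checks = {
        "stdio": server_type == "stdio" and "container" in server,
        "http": server_type == "http" and "url" in server,
        "custom": server_type not in ("stdio", "http"),
    }
    return [name for name, ok in checks.items() if ok]


def validate_one_of(server):
    """A `oneOf` schema passes only when exactly one variant matches."""
    return len(matching_variants(server)) == 1
```

Under these assumed rules, an entry such as {"type": "stdio", "command": "..."} matches no variant, which mirrors the reported error: the stdio branch is the closest fit but is missing container.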

Copilot CLI timeouts (P2) — 4 runs
| Run ID | Workflow | Timeout | Turns | Tokens |
|---|---|---|---|---|
| §24563452029 | Daily Firewall Logs Collector and Reporter | 45 min | | |
| §24561152092 | Daily Community Attribution Updater | 30 min | 194 | 20.5M effective |
| §24557132232 | Dev | 30 min | | |
| §24564007646 | Dead Code Removal Agent | 30 min | 21 | 1.5M effective |

The Community Attribution run (194 turns, 20.5M effective tokens) and the Dead Code Removal run (which found unreachable functions but ran out of time) point to task-complexity timeouts rather than infrastructure failures. All four runs are individually tracked.
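
To compare the two runs that report turns and tokens, the per-turn averages can be worked out directly from the figures above. This is a rough sketch: it assumes each run consumed its full timeout, which is only approximately true, and the field names are illustrative.

```python
# Figures copied from the timeout data above; only these two runs report
# turns and tokens. Field names are illustrative.
RUNS = [
    {"workflow": "Daily Community Attribution Updater",
     "timeout_min": 30, "turns": 194, "effective_tokens": 20.5e6},
    {"workflow": "Dead Code Removal Agent",
     "timeout_min": 30, "turns": 21, "effective_tokens": 1.5e6},
]


def tokens_per_turn(run):
    """Average effective tokens consumed per agent turn."""
    return run["effective_tokens"] / run["turns"]


def seconds_per_turn(run):
    """Average wall-clock seconds per turn, assuming the full timeout was used."""
    return run["timeout_min"] * 60 / run["turns"]


for run in RUNS:
    print(
        f'{run["workflow"]}: '
        f"{tokens_per_turn(run) / 1000:.1f}K tokens/turn, "
        f"{seconds_per_turn(run):.1f} s/turn"
    )
```

This gives roughly 105.7K tokens and 9.3 s per turn for Community Attribution, against 71.4K tokens and 85.7 s per turn for Dead Code Removal. The two runs hit the same 30-minute limit in different ways (many short turns versus few long ones), which may matter when reviewing the timeout limits.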

Existing Issue Correlation

| Issue | Type | Notes |
|---|---|---|
| #26822 agentic workflows out of sync | Root cause | Covers mempalace schema + lock file recompile |
| #26866 Daily Firewall Logs failed | Symptom | Copilot timeout |
| #26865 Dead Code Removal Agent failed | Symptom | Copilot timeout |
| #26852 Daily Fact About gh-aw failed | Symptom | Fixed when #26822 resolved |
| #26848 Daily Community Attribution failed | Symptom | Copilot timeout |
| #26839 Daily Issues Report failed | Symptom | Node.js binary missing |
| #26833 Dev failed | Symptom | Copilot timeout |
| #26829 Daily News failed | Symptom | Node.js binary missing |

Proposed Fix Roadmap

| Priority | Fix | Tracking |
|---|---|---|
| P1 | Recompile lock files (fixes mempalace schema + lock sync) | #26822 |
| P1 | Fix Node.js v24.15.0 binary resolution in agent container | #26876 |
| P2 | Review Copilot timeout limits for high-complexity daily workflows | Individual issues |

Follow-up Window: 2026-04-17 13:20–19:20 UTC

5 hard failures detected across 18 runs in this window. Two distinct root-cause clusters, both with new sub-issues.

Failure Clusters

| Cluster | Runs | Severity | Symptom Issues | Root-Cause Sub-Issue |
|---|---|---|---|---|
| AI Moderator: codex 401 auth | 3 (+ 1 prior) | P1 | #26911 | #aw_c401 |
| Copilot MCP servers blocked by policy | 2 | P1 | #26909, #26928 | #aw_mcp2 |

Evidence

AI Moderator — codex 401 auth (3 consecutive failures)

All three AI Moderator runs (engine: codex, event: issues) fail at agent activation with:

◆ Reconnecting... 1/5 ... 5/5
◆ unexpected status 401 Unauthorized: Missing bearer or basic authentication in header
  url: (api.openai.com/redacted)

SECRET_OPENAI_API_KEY is present in the runner environment but rejected by the OpenAI API. Secondary signal: chatgpt.com:443 blocked by firewall (1 request) but unrelated to the 401.

Runs: §24577634319, §24579500430, §24579781734
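
The 401 message reports a missing credential in the header, so alongside key rotation it may be worth ruling out an empty or whitespace-only key reaching the process. The following is a minimal Python sketch of such a preflight check, not the codex engine's actual code: build_auth_header is a hypothetical helper, and the variable name the engine reads is an assumption passed in as a parameter.

```python
def build_auth_header(env, var="OPENAI_API_KEY"):
    """Build a Bearer Authorization header from an environment mapping.

    Hypothetical preflight check: `var` is an assumed variable name; pass
    whichever name the engine actually reads. Raises ValueError when the
    key is unset, empty, or whitespace-only, since no usable header could
    be sent in that case.
    """
    key = (env.get(var) or "").strip()
    if not key:
        raise ValueError(f"{var} is unset or empty; no Authorization header can be sent")
    return {"Authorization": f"Bearer {key}"}
```

A check like this only confirms that a non-empty key is available to build the header; it cannot tell whether the key is still valid, which is what rotation addresses.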

Copilot MCP policy block — Test Quality Sentinel & Auto-Triage Issues

Both workflows emit mcp_policy_error at conclusion. The Copilot CLI refuses MCP connections before the agent starts, so neither the GitHub API tools nor the safe-outputs tools are available.

  • Test Quality Sentinel: §24580129809 — PR event on copilot/fix-create-pull-request-team-reviewers, 118 turns, 6.27M tokens, no safe outputs
  • Auto-Triage Issues: §24581292756 — scheduled, 102 turns, 5.86M tokens, engine terminated unexpectedly

Note: other copilot-engine workflows (Design Decision Gate, Issue Monster, PR Triage Agent) succeeded in this window, suggesting the policy block is workflow-specific or intermittent.

New Sub-Issues

  • #aw_c401 — Codex 401 persistent auth failure (OPENAI_API_KEY rotation needed)
  • #aw_mcp2 — Copilot MCP policy block (admin must enable "MCP servers in Copilot")

Updated Fix Roadmap

| Priority | Fix | Tracking |
|---|---|---|
| P1 | Recompile lock files (mempalace schema) | #26822 |
| P1 | Fix Node.js v24.15.0 binary resolution | #26876 |
| P1 | Rotate OPENAI_API_KEY for codex engine | #aw_c401 |
| P1 | Enable "MCP servers in Copilot" org policy | #aw_mcp2 |

Generated by [aw] Failure Investigator (6h) · ● 498.9K
