[aw] Workflow Health Dashboard — 2026-05-04 #30076

@github-actions

Workflow Health Dashboard — 2026-05-04

Overview

211 workflows total (+2 new). All 211 lock files are present ✅. A transient failure wave at ~01:49 UTC affected Smoke Claude, Pi, Codex, Copilot ARM64, and OpenCode. Smoke Copilot recovered (success at 00:56). Smoke Gemini and Smoke macOS ARM64 remain chronically broken. Daily Model Inventory Checker is a new P0 (Copilot CLI silent crash).

Health Score: 65/100 (→ stable from yesterday)


Critical Issues 🚨

Smoke Gemini (0% success) — P0 — Chronic

Daily Model Inventory Checker (100% failure) — P0 — New

Smoke CI (100% action_required) — P0 — Chronic

Smoke macOS ARM64 (100% failure) — P0 — Chronic since Feb 2026

  • Status: All runs failing since 2026-02-20; no recent issue filed
  • Impact: macOS ARM64 agent compatibility untested

Warnings ⚠️

Transient Failure Wave — 01:49 UTC May 4

Multiple smoke tests failed in the same run batch at 01:49 UTC: Smoke Claude, Pi, Codex, Copilot ARM64, and OpenCode.

The pattern suggests a transient infrastructure issue in that time slot. Smoke Copilot succeeded at 00:56; most other engines showed one failure and then recovered.
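
The reasoning here (several distinct workflows failing in the same time slot points to a shared cause rather than independent bugs) can be sketched as a small bucketing check. This is only an illustration under stated assumptions: the function name `find_failure_waves`, the 10-minute window, the threshold of three workflows, and the sample timestamps are all hypothetical and are not taken from the Workflow Health Manager itself.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_failure_waves(failures, window_minutes=10, min_workflows=3):
    """Group (workflow_name, failed_at) pairs into fixed time windows.

    Returns (window_start, sorted_workflow_names) for every window in which
    at least `min_workflows` distinct workflows failed.
    """
    window = timedelta(minutes=window_minutes)
    epoch = datetime(1970, 1, 1)  # naive datetimes, assumed to be UTC
    buckets = defaultdict(set)
    for name, failed_at in failures:
        # Floor the timestamp to the start of its window.
        index = (failed_at - epoch) // window
        buckets[epoch + index * window].add(name)
    return [
        (start, sorted(names))
        for start, names in sorted(buckets.items())
        if len(names) >= min_workflows
    ]


# Hypothetical sample data; the timestamps are illustrative only.
failures = [
    ("Smoke Claude", datetime(2026, 5, 4, 1, 49)),
    ("Smoke Pi", datetime(2026, 5, 4, 1, 49)),
    ("Smoke Codex", datetime(2026, 5, 4, 1, 49)),
    ("Smoke Gemini", datetime(2026, 5, 4, 0, 30)),
]
waves = find_failure_waves(failures)
for start, names in waves:
    print(start.isoformat(), names)
```

Fixed windows are the simplest choice but can split a wave that straddles a window boundary; a sliding window would avoid that at the cost of a little more code.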

PR-Review Agent Backlog

/cloclo, Archie, Scout, Q, AI Moderator, Content Moderation — all showing action_required (approval-gated). Expected for PR-triggered workflows; worth auditing if volume is growing.

Additional Failures (P1)


Systemic Issues


Recommendations

High (P0):

  1. Investigate macOS ARM64 chronic failures (no issue filed — file one)
  2. Fix Copilot CLI silent crash in Daily Model Inventory Checker (#30043: Copilot CLI silent startup crash, exit code 1)
  3. Resolve Smoke Gemini proxy issue (#29852: Gemini CLI proxy architecture blocks all agent traffic; localhost:8080 not reachable)

Medium (P1/P2):

  4. Investigate the 01:49 UTC wave: check runner logs for a common cause
  5. Audit the PR-review agent approval queue backlog
  6. Migrate to Node.js 22 before the Node.js 20 deprecation deadline (Sep 16, 2026)

Low (P3):

  7. MCP gateway session timeout risk (#23153) for long-running workflows


Trends

  • Health score: 65/100 (→ stable)
  • New failures: Smoke Claude, Pi, Codex (transient wave) + Model Inventory (new P0)
  • Fixed from yesterday: Smoke Copilot regression resolved ✅
  • Avg success rate (active smoke tests): ~60%
  • 211 workflows total (+2 from 209)

Actions Taken This Run

  • Created health dashboard issue (this)
  • Updated shared memory
  • No new issues created (existing issues cover all P0/P1 items)

Last updated: 2026-05-04T05:39Z | Run: §25302920193

Generated by Workflow Health Manager - Meta-Orchestrator · ● 2M · expires on May 5, 2026, 5:46 AM UTC
