Copilot-engine workflows broken: two distinct failures across v0.67.2 and v0.67.4 #25680

@corygehr

Summary

Our Copilot-engine workflows have been failing since the evening of April 9 across two different gh-aw versions with two different failure modes. We upgraded to v0.67.4 expecting it to fix the first problem, but hit a second one instead. Neither version currently works.

| gh-aw version | Copilot CLI | Failure mode |
|---|---|---|
| v0.67.2 | v1.0.22 ("latest") | Hangs indefinitely → workflow timeout |
| v0.67.4 | v1.0.20 (pinned) | Silent crash: exitCode=1, 0B output, ~1s |

Key evidence: Our last successful runs (April 9, v0.67.2) used Copilot CLI v1.0.21 — the exact version v0.67.4's release notes blamed for crashes — and they worked fine. This suggests:

  1. Bug A: Copilot CLI v1.0.22 introduced a hang/freeze on startup (broke v0.67.2 when "latest" rolled forward on the evening of April 9)
  2. Bug B: Under the v0.67.4 runtime environment, v1.0.20 exits almost immediately after launch (suspects: the new copilot_driver.cjs wrapper, the updated sandbox, or the chroot changes)

Evidence

Bug A: Copilot CLI v1.0.22 hangs on startup (v0.67.2, no code changes on our side)

| Run | Workflow | CLI Version | Failure |
|---|---|---|---|
| (run ID available on request) | Copilot-engine workflow | v1.0.22 ("latest") | Timeout after 5m |
| (run ID available on request) | Copilot-engine workflow | v1.0.22 ("latest") | Timeout after 15m |

These runs were on v0.67.2 with no changes to our workflows. The only difference from our successful runs earlier that day: the latest tag for Copilot CLI rolled from v1.0.21 → v1.0.22 sometime on the evening of April 9.

Bug B: v0.67.4 runtime crashes v1.0.20 on startup

| Run | Workflow | CLI Version | Failure |
|---|---|---|---|
| (run ID available on request) | Copilot-engine workflow | v1.0.20 (pinned) | exitCode=1, 1s, 0B output |
| (run ID available on request) | Copilot-engine workflow | v1.0.20 (pinned) | exitCode=1, 1s, 0B output |

copilot-driver output (identical in both):

[copilot-driver] attempt 1: process started (pid=156)
[copilot-driver] attempt 1: process closed exitCode=1 duration=1s stdout=0B stderr=0B hasOutput=false
[copilot-driver] attempt 1 failed: exitCode=1 isCAPIError400=false hasOutput=false retriesRemaining=3
[copilot-driver] attempt 1: no output produced — not retrying
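
For anyone triaging many runs, the "process closed" line above can be parsed mechanically to flag this failure signature. The following is a minimal sketch, assuming the line format is exactly what appears in this excerpt (seconds for duration, bytes for stdout/stderr); the function names `parse_closed_line` and `is_silent_crash` are hypothetical helpers, not part of gh-aw or copilot-driver.

```python
import re

# Pattern derived only from the log excerpt in this issue; other units or
# extra fields (if the driver ever emits them) are not handled.
CLOSED_RE = re.compile(
    r"\[copilot-driver\] attempt (?P<attempt>\d+): process closed "
    r"exitCode=(?P<exit_code>-?\d+) duration=(?P<duration>\d+)s "
    r"stdout=(?P<stdout>\d+)B stderr=(?P<stderr>\d+)B "
    r"hasOutput=(?P<has_output>true|false)"
)


def parse_closed_line(line):
    """Return the fields of a 'process closed' line, or None if it does not match."""
    m = CLOSED_RE.search(line)
    if m is None:
        return None
    return {
        "attempt": int(m["attempt"]),
        "exit_code": int(m["exit_code"]),
        "duration_s": int(m["duration"]),
        "stdout_bytes": int(m["stdout"]),
        "stderr_bytes": int(m["stderr"]),
        "has_output": m["has_output"] == "true",
    }


def is_silent_crash(record):
    """Bug B signature: non-zero exit with nothing on stdout or stderr."""
    return (
        record["exit_code"] != 0
        and record["stdout_bytes"] == 0
        and record["stderr_bytes"] == 0
    )
```

Feeding each line of a downloaded job log through `parse_closed_line` and filtering with `is_silent_crash` should separate Bug B runs from runs that failed with real error output.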

Last known good: v0.67.2 + Copilot CLI v1.0.21

| Run | Workflow | CLI Version | Result |
|---|---|---|---|
| (run ID available on request) | Copilot-engine workflow | v1.0.21 ("latest") | ✅ Success |
| (run ID available on request) | Copilot-engine workflow | v1.0.21 ("latest") | ✅ Success |

Timeline

| When | gh-aw | Copilot CLI | Result | Notes |
|---|---|---|---|---|
| Apr 9 16:15 | v0.67.2 | v1.0.21 (latest) | ✅ Success | Last known good |
| Apr 9 17:40 | v0.67.2 | v1.0.21 (latest) | ✅ Success | Last known good |
| Apr 9 22:00 | v0.67.2 | v1.0.22 (latest) | ❌ Timeout | Bug A: "latest" rolled to v1.0.22, broke everything |
| Apr 10 16:07 | v0.67.4 | v1.0.20 (pinned) | ❌ Silent crash | Bug B: upgraded to fix Bug A, hit new failure |
| Apr 10 16:10 | v0.67.4 | v1.0.20 (pinned) | ❌ Silent crash | Bug B: confirmed reproducible |
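
The timeline covers only three of the six possible gh-aw / Copilot CLI pairings. The sketch below encodes the observed results and lists the pairings we have not run. It only rearranges data already in this report; whether each untested pairing can actually be forced (for example, overriding the CLI version under a given gh-aw release) is an assumption we have not verified.

```python
from itertools import product

# Observed outcomes, copied from the timeline in this issue.
observed = {
    ("v0.67.2", "v1.0.21"): "success",
    ("v0.67.2", "v1.0.22"): "timeout",
    ("v0.67.4", "v1.0.20"): "silent crash",
}

gh_aw_versions = sorted({gh_aw for gh_aw, _ in observed})
cli_versions = sorted({cli for _, cli in observed})

# Pairings with no data yet; each would help separate Bug A from Bug B.
untested = [
    pair for pair in product(gh_aw_versions, cli_versions) if pair not in observed
]

for gh_aw, cli in untested:
    print(f"untested: gh-aw {gh_aw} + Copilot CLI {cli}")
```

If it can be arranged, v0.67.2 + v1.0.20 would show whether v1.0.20 starts at all outside the v0.67.4 runtime, and v0.67.4 + v1.0.21 would show whether the v0.67.4 runtime also breaks the last known good CLI.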

Additional observations

  • The entrypoint logs a warning in v0.67.4 runs that doesn't appear in v0.67.2 runs:

    [entrypoint][WARN] Failed to transfer /host/home/runner/work/_temp/gh-aw/safeoutputs ownership to chroot user
    
  • The permission-discussions input warning on create-github-app-token appears in both passing and failing runs, so it is likely unrelated.

  • Checksum verification passes for v1.0.20 — the binary is intact.

  • v0.67.4's release notes attributed the crash to Copilot CLI v1.0.21, but our evidence shows v1.0.21 was the last version that worked. The real break came from v1.0.22 (Bug A) and the v0.67.4 runtime itself (Bug B).

What we think is happening

Bug A — Copilot CLI v1.0.22 hang

  • latest tag rolled from v1.0.21 → v1.0.22 on the evening of April 9
  • v1.0.22 hangs on startup (no crash, no output, just freezes until the workflow timeout fires)
  • Affects any gh-aw version that installs latest (that is, every version before v0.67.4, which pins to v1.0.20)

Bug B — v0.67.4 runtime silent crash

  • v0.67.4 pins Copilot CLI to v1.0.20 to avoid v1.0.22, but the v0.67.4 runtime itself prevents v1.0.20 from starting
  • The CLI exits in ~1 second with code 1 and zero output
  • Possible culprits in v0.67.4:
    1. New copilot_driver.cjs wrapper — the CLI is no longer invoked directly; it goes through a Node.js driver. On v0.67.2, copilot was called directly.
    2. Updated AWF sandbox — Firewall v0.25.18, MCP Gateway v0.2.17.
    3. Chroot entrypoint changes — the safeoutputs ownership transfer failure suggests filesystem permission changes in the sandbox.

Impact

  • 22 Copilot-engine workflows are completely blocked
  • No workaround available — v0.67.2 + latest (v1.0.22) hangs, v0.67.4 + v1.0.20 crashes
  • We'd need a gh-aw version that either (a) pins to v1.0.21 or (b) fixes the v0.67.4 runtime to work with v1.0.20

Reproduction

  • Bug A: Compile any Copilot-engine workflow with v0.67.2 (which installs latest, currently v1.0.22) and run it. The agent step will hang until the workflow timeout.
  • Bug B: Compile any Copilot-engine workflow with v0.67.4 and run it. The agent step will fail within about a second with exitCode=1 and no output.
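
To tell the two failure modes apart outside a full workflow run, a command can be wrapped with a timeout and classified by exit code and output size. This is a rough sketch, not the logic copilot-driver uses: `probe` is a hypothetical helper, the command to run is supplied by the caller, and the example uses a stand-in Python child process because the exact Copilot CLI invocation is not shown in this report.

```python
import subprocess
import sys
import time


def probe(cmd, timeout_s):
    """Run cmd once and classify the outcome as ok, error, silent-crash, or hang."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return {
            "mode": "hang",
            "exit_code": None,
            "duration_s": time.monotonic() - start,
        }
    if proc.returncode == 0:
        mode = "ok"
    elif not proc.stdout and not proc.stderr:
        mode = "silent-crash"  # Bug B signature: non-zero exit, 0B output
    else:
        mode = "error"
    return {
        "mode": mode,
        "exit_code": proc.returncode,
        "duration_s": time.monotonic() - start,
        "stdout_bytes": len(proc.stdout),
        "stderr_bytes": len(proc.stderr),
    }


if __name__ == "__main__":
    # Stand-in command: a child that exits 1 without printing anything.
    print(probe([sys.executable, "-c", "import sys; sys.exit(1)"], timeout_s=10))
```

A result of "hang" would correspond to the Bug A symptom and "silent-crash" to the Bug B symptom, which makes it easier to compare behavior across CLI versions on the same runner.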

Environment

  • Runner: ubuntu-latest (GitHub-hosted)
  • gh-aw tested: v0.67.2 (03e31e064a68e8d5ad890c92f303cfb5a3536006), v0.67.4 (9d6ae06250fc0ec536a0e5f35de313b35bad7246)
  • Copilot CLI versions tested: v1.0.20 (pinned), v1.0.21 (latest, worked), v1.0.22 (latest, hangs)
  • Run IDs and repository details available on request
