Bug Report — Copilot CLI retry loop in copilot_driver.cjs is futile after entrypoint unsets COPILOT_GITHUB_TOKEN
Summary
When the first Copilot CLI invocation fails with Authentication failed (Request ID: ...), copilot_driver.cjs retries up to 3 more times with --resume. But between attempt 1 and attempt 2 the entrypoint runs Unset COPILOT_GITHUB_TOKEN from /proc/1/environ. Every subsequent retry therefore fails with Error: No authentication information found. — a different, more permanent error than what the retry logic was designed to recover from. Net result: a single transient auth failure burns through all four attempts and produces a "fatal" job failure (instead of partial success), and the failure log makes diagnosis confusing because the visible error on attempts 2–4 looks like a misconfiguration rather than a follow‑on side‑effect of the entrypoint.
29 runs in microsoft/aspire's pr-docs-check exhibit this exact pattern (Apr 17 – May 18).
Reproduction
- Run any gh-aw workflow using engine: copilot.
- Force a transient auth failure on attempt 1 (e.g., narrow the GITHUB_TOKEN permissions, or use a moment when the GitHub auth service is flaky).
- Observe the log sequence (verbatim from run 24540322242):
[copilot-driver] attempt 1: spawning: /usr/local/bin/copilot ... --prompt <redacted>
Retrying up to 120 times with 1s delay (120s total timeout)
Error: Authentication failed (Request ID: 56F8:5F23:3D5BDA:494128:69E11A60)
[copilot-driver] attempt 1: partial execution — will retry with --resume (attempt 2/4)
[copilot-driver] retry 1/3: sleeping 5000ms before next attempt with --resume
[entrypoint] Unset COPILOT_GITHUB_TOKEN from /proc/1/environ ← side effect between attempts
[copilot-driver] attempt 2: spawning: /usr/local/bin/copilot ... --prompt <redacted> --resume
Error: No authentication information found. ← different error now
[copilot-driver] attempt 2: partial execution — will retry with --resume (attempt 3/4)
...
[copilot-driver] attempt 4: partial execution — will retry with --resume (attempt 4/4)
[copilot-driver] all 4 attempts failed
Attempts 2, 3, and 4 cannot possibly succeed: the env variable they need is gone.
Root cause
The Copilot CLI entrypoint is designed to clear COPILOT_GITHUB_TOKEN from process state once it has been read by the first CLI invocation, to keep secrets out of /proc/N/environ for subprocesses. That's a sensible security posture for normal execution. But copilot_driver.cjs's retry loop spawns a fresh copilot subprocess for each attempt (--resume), and each fresh process tries to re‑read COPILOT_GITHUB_TOKEN from its environment. The two behaviors are incompatible.
The retry logic was written assuming the failure modes were transient network/server errors that benefit from --resume. An auth failure is a different class of error and (a) doesn't benefit from --resume, and (b) is guaranteed to cascade into a worse failure because the entrypoint has already invalidated the token.
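The interaction can be reproduced in miniature. The sketch below is a toy model, not the real driver or entrypoint: `runAttempt`, `entrypointUnset`, `runWithRetries`, and the `sharedEnv` object are hypothetical stand-ins. The only behavior taken from this report is the ordering: attempt 1 reads the token and hits a transient auth failure, the cleanup then removes the token, and every later attempt re-reads an environment that no longer has it.

```javascript
// Toy model of the retry loop colliding with the entrypoint's token cleanup.
// All names are hypothetical; this is not code from copilot_driver.cjs.

const MAX_ATTEMPTS = 4;

// Stand-in for the environment each fresh CLI process reads from.
const sharedEnv = { COPILOT_GITHUB_TOKEN: "example-token" };

// Stand-in for the entrypoint side effect seen between attempts 1 and 2.
function entrypointUnset(env) {
  delete env.COPILOT_GITHUB_TOKEN;
}

// Stand-in for one CLI invocation. Attempt 1 simulates a transient auth
// failure; any attempt without a token fails before it can authenticate.
function runAttempt(attempt, env) {
  if (!env.COPILOT_GITHUB_TOKEN) {
    return { ok: false, error: "No authentication information found." };
  }
  if (attempt === 1) {
    return { ok: false, error: "Authentication failed" };
  }
  return { ok: true, error: null };
}

function runWithRetries(env) {
  const errors = [];
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const result = runAttempt(attempt, env);
    if (result.ok) return { ok: true, errors };
    errors.push(result.error);
    // The cleanup runs once, after the first invocation has read the token.
    if (attempt === 1) entrypointUnset(env);
  }
  return { ok: false, errors };
}

const outcome = runWithRetries(sharedEnv);
console.log(outcome.errors);
```

In this toy model, the first recorded error is the transient one and the remaining three are the follow-on error, which mirrors the log sequence in the reproduction.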
Affected runs
All 29 are pr-docs-check runs in microsoft/aspire exhibiting the exact pattern (attempt 1: Authentication failed; attempts 2–4: No authentication information found.):
Note: the high April cluster suggests there was a (now resolved) auth secret rotation issue at microsoft/aspire's end on those days that triggered the initial auth failure. The retry‑loop behavior described here amplified the impact: instead of failing once and surfacing a clear "auth expired" error, the workflow burned through 4 attempts and emitted a confusing "all retries exhausted" failure.
Suggested fixes (in order of preference)
- Detect auth-class errors and skip retries. When attempt 1 fails with Authentication failed or similar, the driver should mark the run failed immediately rather than retrying, because there's nothing for --resume to recover.
- Capture COPILOT_GITHUB_TOKEN in copilot_driver.cjs before spawning, and re-inject it into the child's env on each retry. This is the most invasive fix but preserves the security posture (token is held only by the driver process, not in /proc/1).
- Delay entrypoint unset until the driver signals "done with auth". The entrypoint could read a sentinel file written by the driver after the last attempt completes.
- Surface the underlying error. At minimum, on a futile-retry failure the driver's final error message should attribute it correctly: "Auth failed on attempt 1; retries cannot recover (entrypoint clears token between attempts)" — rather than reporting "No authentication information found" which makes it look like a configuration problem.
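A minimal sketch of how the first, second, and fourth suggestions might look in the driver. Every function name below (`isAuthClassError`, `shouldRetry`, `captureToken`, `buildChildEnv`, `describeFailure`) is hypothetical and not taken from copilot_driver.cjs; the two error strings are the ones quoted in the logs above. Whether re-injecting the token is sufficient depends on how the entrypoint performs the unset, which this report does not establish.

```javascript
// Hypothetical helpers illustrating the suggested fixes; not the driver's real code.

// Error substrings taken from the logs quoted in this report.
const AUTH_ERROR_PATTERNS = [
  /Authentication failed/i,
  /No authentication information found/i,
];

// Suggestion 1: true when the CLI output looks like an auth-class failure.
function isAuthClassError(output) {
  return AUTH_ERROR_PATTERNS.some((pattern) => pattern.test(output));
}

// Suggestion 1: retry only while attempts remain and the error is not auth-class.
function shouldRetry(output, attempt, maxAttempts) {
  if (attempt >= maxAttempts) return false;
  return !isAuthClassError(output);
}

// Suggestion 2: capture the token once, before the first spawn.
function captureToken(env) {
  return env.COPILOT_GITHUB_TOKEN;
}

// Suggestion 2: build each child's env from the driver's captured copy,
// not from whatever the ambient environment holds at retry time.
function buildChildEnv(currentEnv, capturedToken) {
  const childEnv = { ...currentEnv };
  if (capturedToken !== undefined) {
    childEnv.COPILOT_GITHUB_TOKEN = capturedToken;
  }
  return childEnv;
}

// Suggestion 4: attribute the final failure to the first error, not the last.
function describeFailure(errors) {
  if (errors.length === 0) return "No attempts were recorded";
  const first = errors[0];
  if (isAuthClassError(first)) {
    return `Auth failed on attempt 1 (${first}); retries cannot recover`;
  }
  const last = errors[errors.length - 1];
  return `All ${errors.length} attempts failed; last error: ${last}`;
}
```

If the driver spawns the CLI through Node's `child_process.spawn`, the object returned by `buildChildEnv` could be passed as the `env` option, so the token lives only in the driver's memory and in the child it starts.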
Workaround for affected users
Set engine.retries: 0 (or 1) in the workflow frontmatter for now, so a transient auth failure surfaces immediately as a clear error rather than masquerading as a configuration problem.
Environment
- gh-aw: v0.71.x – v0.72.1
- Copilot CLI: 1.0.31 – 1.0.40
- Affected file(s):
  - actions/setup/js/copilot_driver.cjs (retry loop)
  - actions/setup/sh/copilot_entrypoint.sh (token unset side effect)