Copilot CLI retry loop in copilot_driver.cjs is futile after entrypoint unsets COPILOT_GITHUB_TOKEN between attempts #33069

@IEvangelist

Bug Report — Copilot CLI retry loop in copilot_driver.cjs is futile after entrypoint unsets COPILOT_GITHUB_TOKEN

Summary

When the first Copilot CLI invocation fails with Authentication failed (Request ID: ...), copilot_driver.cjs retries up to 3 more times with --resume. But between attempt 1 and attempt 2 the entrypoint runs Unset COPILOT_GITHUB_TOKEN from /proc/1/environ. Every subsequent retry therefore fails with Error: No authentication information found. — a different, more permanent error than what the retry logic was designed to recover from. Net result: a single transient auth failure burns through all four attempts and produces a "fatal" job failure (instead of partial success), and the failure log makes diagnosis confusing because the visible error on attempts 2–4 looks like a misconfiguration rather than a follow‑on side‑effect of the entrypoint.

29 run attempts (28 distinct runs) in microsoft/aspire's pr-docs-check workflow exhibit this exact pattern (Apr 17 – May 18).

Reproduction

  1. Run any gh-aw workflow using engine: copilot.
  2. Force a transient auth failure on attempt 1 (e.g., narrow the GITHUB_TOKEN permissions, or use a moment when the GitHub auth service is flaky).
  3. Observe the log sequence (verbatim from run 24540322242):
[copilot-driver] attempt 1: spawning: /usr/local/bin/copilot ... --prompt <redacted>
Retrying up to 120 times with 1s delay (120s total timeout)
Error: Authentication failed (Request ID: 56F8:5F23:3D5BDA:494128:69E11A60)
[copilot-driver] attempt 1: partial execution — will retry with --resume (attempt 2/4)
[copilot-driver] retry 1/3: sleeping 5000ms before next attempt with --resume

[entrypoint] Unset COPILOT_GITHUB_TOKEN from /proc/1/environ          ← side effect between attempts

[copilot-driver] attempt 2: spawning: /usr/local/bin/copilot ... --prompt <redacted> --resume
Error: No authentication information found.                            ← different error now
[copilot-driver] attempt 2: partial execution — will retry with --resume (attempt 3/4)
...
[copilot-driver] attempt 4: partial execution — will retry with --resume (attempt 4/4)
[copilot-driver] all 4 attempts failed

Attempts 2, 3, and 4 cannot possibly succeed: the environment variable they need (COPILOT_GITHUB_TOKEN) is gone.
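To check whether a given job log shows this signature, a small scanner along the following lines may help. It is only a sketch: `findFutileRetryPattern` is a hypothetical helper, not part of gh-aw, and the regular expressions are derived solely from the log excerpt above, so they may need adjusting if the driver's log format differs.

```javascript
// Hypothetical helper (not part of gh-aw). Patterns are taken from the
// log excerpt in this report and are assumptions about the log format.
const AUTH_FAILED = /Error: Authentication failed \(Request ID: [^)]*\)/;
const NO_AUTH_INFO = /Error: No authentication information found\./;
const ATTEMPT_SPAWN = /\[copilot-driver\] attempt (\d+): spawning:/;

// Returns whether attempt 1 failed with "Authentication failed" and at
// least one later attempt failed with "No authentication information found."
function findFutileRetryPattern(logText) {
  let currentAttempt = 0;
  let firstAttemptAuthFailed = false;
  const attemptsWithoutToken = [];

  for (const line of logText.split(/\r?\n/)) {
    const spawn = ATTEMPT_SPAWN.exec(line);
    if (spawn) {
      // A new attempt starts; later error lines belong to it.
      currentAttempt = Number(spawn[1]);
      continue;
    }
    if (currentAttempt === 1 && AUTH_FAILED.test(line)) {
      firstAttemptAuthFailed = true;
    } else if (currentAttempt > 1 && NO_AUTH_INFO.test(line)) {
      attemptsWithoutToken.push(currentAttempt);
    }
  }

  return {
    matches: firstAttemptAuthFailed && attemptsWithoutToken.length > 0,
    attemptsWithoutToken,
  };
}
```

Passing the raw text of a downloaded job log to `findFutileRetryPattern` yields a `matches` flag plus the list of attempts that ran without a token, which is enough to separate this failure mode from a genuinely missing secret.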

Root cause

The Copilot CLI entrypoint is designed to clear COPILOT_GITHUB_TOKEN from process state once it has been read by the first CLI invocation, to keep secrets out of /proc/N/environ for subprocesses. That's a sensible security posture for normal execution. But copilot_driver.cjs's retry loop spawns a fresh copilot subprocess for each attempt (--resume), and each fresh process tries to re‑read COPILOT_GITHUB_TOKEN from its environment. The two behaviors are incompatible.

The retry logic was written assuming the failure modes were transient network/server errors that benefit from --resume. An auth failure is a different class of error: it (a) doesn't benefit from --resume, and (b) is guaranteed to cascade into a worse failure because the entrypoint has already removed the token.
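The distinction between error classes can be made explicit in the retry decision. The following is a minimal sketch under stated assumptions: `isAuthClassError` and `shouldRetry` are hypothetical names, the driver's real retry predicate is not shown in this report, and the only patterns used are the two error strings visible in the log excerpt.

```javascript
// Hypothetical classification. The two messages are copied from the log
// excerpt in this report; treating them as "auth-class" is an assumption.
const AUTH_CLASS_PATTERNS = [
  /Authentication failed/i,
  /No authentication information found/i,
];

// True when the CLI output looks like an authentication problem rather
// than a transient network/server error.
function isAuthClassError(output) {
  return AUTH_CLASS_PATTERNS.some((pattern) => pattern.test(output));
}

// Decide whether another attempt with --resume is worthwhile.
function shouldRetry({ attempt, maxAttempts, output }) {
  if (attempt >= maxAttempts) return false; // retry budget exhausted
  if (isAuthClassError(output)) return false; // --resume cannot recover this
  return true; // assume anything else may be transient
}
```

With a predicate like this, the run in the log above would stop after attempt 1 and report the original `Authentication failed` message instead of three follow-on failures.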

Affected runs

All 29 entries (28 distinct runs; run 26019861837 is listed once per attempt) are pr-docs-check runs in microsoft/aspire exhibiting the exact pattern (attempt 1: Authentication failed; attempts 2–4: No authentication information found.):

Run                      Date (UTC)
24540322242              2026‑04‑17
24541374877              2026‑04‑17
24558289333              2026‑04‑17
24567976724              2026‑04‑17
24590170144              2026‑04‑17
24596000277              2026‑04‑18
24596525798              2026‑04‑18
24598328452              2026‑04‑18
24598362640              2026‑04‑18
24599895506              2026‑04‑18
24642148749              2026‑04‑19
24644884189              2026‑04‑20
24644897167              2026‑04‑20
24645605554              2026‑04‑20
24647670927              2026‑04‑20
24648197357              2026‑04‑20
24649903125              2026‑04‑20
24650389071              2026‑04‑20
24650829118              2026‑04‑20
24663948569              2026‑04‑20
24678356268              2026‑04‑20
24678611950              2026‑04‑20
24686244424              2026‑04‑20
24694445938              2026‑04‑20
24785093922              2026‑04‑22
25064632217              2026‑04‑28
25068752056              2026‑04‑28
26019861837 (attempt 1)  2026‑05‑18
26019861837 (attempt 2)  2026‑05‑18

Note: the high April cluster suggests there was a (now resolved) auth secret rotation issue at microsoft/aspire's end on those days that triggered the initial auth failure. The retry‑loop behavior described here amplified the impact: instead of failing once and surfacing a clear "auth expired" error, the workflow burned through 4 attempts and emitted a confusing "all retries exhausted" failure.

Suggested fixes (in order of preference)

  1. Detect auth‑class errors and skip retries. When attempt 1 fails with Authentication failed or similar, the driver should mark the run failed immediately rather than retrying — there's nothing for --resume to recover.
  2. Capture COPILOT_GITHUB_TOKEN in copilot_driver.cjs before spawning, and re‑inject it into the child's env on each retry. This is the most invasive fix but preserves the security posture (token is held only by the driver process, not in /proc/1).
  3. Delay entrypoint unset until the driver signals "done with auth". The entrypoint could read a sentinel file written by the driver after the last attempt completes.
  4. Surface the underlying error. At minimum, on a futile-retry failure the driver's final error message should attribute it correctly: "Auth failed on attempt 1; retries cannot recover (entrypoint clears token between attempts)" — rather than reporting "No authentication information found" which makes it look like a configuration problem.
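As an illustration of fix 2, the driver could read the token once at start-up and put it back into each child's environment. This is a rough sketch, not the actual driver code: `captureToken`, `buildChildEnv`, and `runAttempt` are hypothetical names, and how copilot_driver.cjs really spawns the CLI is not shown in this report. Only `child_process.spawnSync` and its `env` option are standard Node.js.

```javascript
const { spawnSync } = require("child_process");

// Read the token once, at driver start-up, before anything can clear it.
// (Hypothetical helper; the real driver may obtain the token differently.)
function captureToken(env = process.env) {
  return env.COPILOT_GITHUB_TOKEN || null;
}

// Build the environment for one attempt: a copy of the current env with
// the captured token restored if it has been removed in the meantime.
// The input object is never mutated.
function buildChildEnv(capturedToken, env = process.env) {
  const childEnv = { ...env };
  if (capturedToken && !childEnv.COPILOT_GITHUB_TOKEN) {
    childEnv.COPILOT_GITHUB_TOKEN = capturedToken;
  }
  return childEnv;
}

// Spawn one attempt with the restored environment. `command` and `args`
// stand in for whatever the driver already passes to the Copilot CLI.
function runAttempt(command, args, capturedToken) {
  return spawnSync(command, args, {
    env: buildChildEnv(capturedToken),
    encoding: "utf8",
  });
}
```

The driver would call `captureToken()` once before attempt 1 and pass the result to every `runAttempt` call. The token then lives only in the driver's memory and in the short-lived child environment, which appears consistent with the security goal described under Root cause, though that would need review by the maintainers.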

Workaround for affected users

Set engine.retries: 0 (or 1) in the workflow frontmatter for now, so a transient auth failure surfaces immediately as a clear error rather than masquerading as a configuration problem.

Environment

  • gh-aw: v0.71.x to v0.72.1
  • Copilot CLI: 1.0.31 to 1.0.40
  • Affected file(s):
    • actions/setup/js/copilot_driver.cjs (retry loop)
    • actions/setup/sh/copilot_entrypoint.sh (token unset side effect)
