ci(diag): isolate OAuth failure to agent SDK vs runner#698
Conversation
Add a temporary diagnostic step that runs `claude --print` directly on the runner with the OAuth token from the secret, bypassing the action / agent SDK entirely. The same invocation is known to succeed locally (clean HOME, only CLAUDE_CODE_OAUTH_TOKEN as env), so: - if this step also succeeds in CI: the bug is isolated to how the agent SDK forwards auth to its spawned child, and we have a minimal upstream repro. - if this step also fails in CI: the failure is environmental (network / runner) rather than SDK-only, and the issue moves up one layer. Token-leak hardening is documented inline. The step is intended to be removed (and the token rotated) once the question is answered.
|
Note Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported. |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd3ae28858
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - name: Sanity-check OAuth on the runner (diag only) | ||
| env: | ||
| CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }} | ||
| run: | |
There was a problem hiding this comment.
Allow diagnostic failure without skipping main Claude step
If claude --print fails in this diagnostic step, the job stops immediately and Run Claude Code is skipped by default, which prevents the pass/fail matrix in the commit message from ever being observed in the failure case. In the exact scenario this diagnostic is meant to detect (runner-side OAuth failure), the workflow won’t execute the comparison step, so the isolation logic is incomplete.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 800a99a: marked the verify step continue-on-error: true so a runner-side OAuth failure no longer aborts the job. Run Claude Code still runs, both step results are observable, and the (diag fail / main fail) → "environmental" matrix entry can be detected.
| env: | ||
| CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }} |
There was a problem hiding this comment.
Limit OAuth secret exposure to the claude command
Setting CLAUDE_CODE_OAUTH_TOKEN at step scope makes it available to the curl | bash installer and any subprocesses, even though only the final claude --print call needs it. This unnecessarily broadens the secret exposure window; if the installer script or its execution path is compromised/noisy, the token can be leaked before the diagnostic command runs.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 800a99a: split into two steps — Install Claude CLI for diag (no secret in env, runs curl | bash) and Verify OAuth on the runner (secret scoped only to this step, used only by the single claude --print call). Installer subprocesses now have zero exposure to the OAuth token.
…error
P1: with set -e the diag step would abort the job on its own failure,
skipping `Run Claude Code` — exactly the (diag fail / main fail)
case the comparison matrix is meant to detect. Mark the verify
step `continue-on-error: true` so both step results are always
observed.
P2: Token at step env scope is also visible to the curl|bash installer
and any subprocesses it spawns. Split into two steps: the installer
has no secret in env; only the verify step has it, scoped to a
single claude invocation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 800a99aa43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - name: Install Claude CLI for diag (no secrets in env) | ||
| run: | | ||
| set -euo pipefail | ||
| set +x | ||
| curl -fsSL https://claude.ai/install.sh | bash -s -- 2.1.119 >/dev/null 2>&1 |
There was a problem hiding this comment.
Allow installer diagnostic step to fail without aborting job
The new Install Claude CLI for diag step is still mandatory, so a transient failure downloading https://claude.ai/install.sh will stop the workflow before Run Claude Code executes. That means temporary diagnostic infrastructure can block normal @claude handling and you still lose the intended pass/fail isolation matrix whenever install fails. Make this installer step non-blocking (or otherwise ensure the main Claude step always runs) so diagnostics cannot short-circuit the primary workflow.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in d3d05ba: marked the installer step continue-on-error: true as well. Now neither a transient install failure (e.g. fetching https://claude.ai/install.sh) nor a runner-side OAuth failure can abort the job — Run Claude Code always runs, and the (install fail / main fail) case is observable from the run summary just like (verify fail / main fail).
Round-2 P1: a transient failure of curl https://claude.ai/install.sh would otherwise abort the job and skip Run Claude Code, letting the temporary diagnostic infrastructure block normal @claude handling. Both diag steps are now continue-on-error: true. If install fails, the verify step then fails too (no claude on PATH) but workflow proceeds to the main step, and the (install fail / main fail) case is also observable from the run summary.
Purpose (temporary diagnostic)
Insert a step that runs
claude --printdirectly on the runner with the OAuth token from the secret, bypassinganthropics/claude-code-actionand the agent SDK entirely. The same invocation is known-good locally (clean HOME, onlyCLAUDE_CODE_OAUTH_TOKENin env), so the result of this step in CI tells us where the bug lives:Run Claude CodestepCould not resolve [authentication])Scope
Run Claude Codestep.Run Claude Codestep (still v1.0.107 pinned + env).Token-leak hardening (inline rationale)
set -euo pipefailonly — neverset -x; explicitset +xdefends againstACTIONS_STEP_DEBUG=truealso enabling xtrace.CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets... }}) and is never expanded into a command line.claudeinherits it via fork, not viaargv.install.shstdout/stderr are dropped (>/dev/null 2>&1).--debugor--verboseonclaude.HOME=$(mktemp -d)is a fresh tmp dir, so cached creds and~/.claude/.credentials.jsongo to an ephemeral location that is destroyed with the runner.Operational note. As an additional safety net, after the diag PR runs once we plan to rotate
CLAUDE_CODE_OAUTH_TOKEN(re-issue viaclaude setup-token, update the secret, revoke the old session at claude.ai). Even in the unlikely event the diag step's logs leaked the token in some unmasked form, rotation makes the leaked value useless.Test plan
@claude review-ing this PR (or any PR).Sanity-check OAuth on the runner (diag only)— pass/failRun Claude Code— pass/failCLAUDE_CODE_OAUTH_TOKEN.Self-review
CI workflow change only — single new step. 5-lens collapses to: no data-loss / concurrency / consistency surface, no perf impact (one extra ~10s step on
@clauderuns only), no test coverage applicable.The risk surface is "could leak the token in logs"; mitigations are documented inline and in this body, plus the rotation step in the test plan keeps a safety margin even against unknown-unknowns.