ci(diag): isolate OAuth failure to agent SDK vs runner #698

Merged
bootjp merged 3 commits into main from fix/claude-action-oauth-diag
Apr 28, 2026

Owner

@bootjp bootjp commented Apr 28, 2026

Purpose (temporary diagnostic)

Insert a step that runs claude --print directly on the runner with the OAuth token from the secret, bypassing anthropics/claude-code-action and the agent SDK entirely. The same invocation is known-good locally (clean HOME, only CLAUDE_CODE_OAUTH_TOKEN in env), so the result of this step in CI tells us where the bug lives:

| Diag step | Run Claude Code step | Conclusion |
|---|---|---|
| ✅ pass | ❌ fail (Could not resolve [authentication]) | bug is in the action / agent SDK's auth forwarding to its child claude process; minimal upstream repro |
| ❌ fail | ❌ fail | environmental: GitHub-hosted runner ↔ Anthropic OAuth backend (network, region, IP) |
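
The table above can be read as a small lookup. The following is a minimal sketch of that lookup in shell, not part of the workflow; the function name `classify_oauth_failure` and the returned labels are hypothetical, and any combination the table does not list is reported as such rather than guessed.

```shell
#!/usr/bin/env bash
set -eu

# classify_oauth_failure DIAG MAIN
# DIAG and MAIN are "pass" or "fail": the outcome of the diag step and of
# the Run Claude Code step. Name and labels are illustrative only.
classify_oauth_failure() {
  local diag="$1" main="$2"
  case "${diag}/${main}" in
    pass/fail) echo "action-or-sdk" ;;   # auth forwarding to the child claude process
    fail/fail) echo "environmental" ;;   # runner <-> OAuth backend (network, region, IP)
    *)         echo "not-in-table" ;;    # combinations the table does not cover
  esac
}

classify_oauth_failure pass fail
classify_oauth_failure fail fail
```

The two calls at the end correspond to the two rows of the table, in order.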

Scope

  • Adds one step before the existing Run Claude Code step.
  • Does NOT touch the Run Claude Code step (still v1.0.107 pinned + env).
  • Step is marked TEMPORARY in comments. Will be removed in a follow-up once the question is answered.

Token-leak hardening (inline rationale)

  • set -euo pipefail only — never set -x; explicit set +x defends against ACTIONS_STEP_DEBUG=true also enabling xtrace.
  • The token is only referenced via env (CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets... }}) and is never expanded into a command line. claude inherits it via fork, not via argv.
  • install.sh stdout/stderr are dropped (>/dev/null 2>&1).
  • No --debug or --verbose on claude.
  • HOME=$(mktemp -d) is a fresh tmp dir, so cached creds and ~/.claude/.credentials.json go to an ephemeral location that is destroyed with the runner.
  • GitHub Actions' secret-masking covers any literal occurrence in stdout/stderr regardless.
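
Two of the points above (xtrace kept off, and a throwaway HOME) can be sketched in isolation. This is an illustration under assumptions, not the step's actual script: the helper name `run_with_clean_home` is hypothetical, and `sh -c` stands in for the real CLI call so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
set -eu
set +x  # keep xtrace off even if something upstream turned it on

# run_with_clean_home CMD [ARGS...]
# Runs CMD with HOME pointing at a fresh temporary directory. Anything the
# caller has already exported is inherited through the environment, so a
# secret never has to appear in ARGS.
run_with_clean_home() {
  local tmp_home
  tmp_home="$(mktemp -d)"
  HOME="$tmp_home" "$@"
}

# Stand-in for the real CLI call: show the HOME the child process sees.
run_with_clean_home sh -c 'echo "child HOME: $HOME"'
```

In the workflow the wrapped command would be the `claude --print` invocation, with the token supplied only through the step's `env` block.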

Operational note. As an additional safety net, after the diag PR runs once we plan to rotate CLAUDE_CODE_OAUTH_TOKEN (re-issue via claude setup-token, update the secret, revoke the old session at claude.ai). Even in the unlikely event the diag step's logs leaked the token in some unmasked form, rotation makes the leaked value useless.

Test plan

  • Run the workflow once by asking @claude to review this PR (or any PR).
  • Read both step results:
    • Sanity-check OAuth on the runner (diag only) — pass/fail
    • Run Claude Code — pass/fail
  • Decide next move from the conclusion table above.
  • Rotate CLAUDE_CODE_OAUTH_TOKEN.
  • Open a follow-up PR that removes the diag step.

Self-review

CI workflow change only — single new step. 5-lens collapses to: no data-loss / concurrency / consistency surface, no perf impact (one extra ~10s step on @claude runs only), no test coverage applicable.

The risk surface is "could leak the token in logs"; mitigations are documented inline and in this body, plus the rotation step in the test plan keeps a safety margin even against unknown-unknowns.

Add a temporary diagnostic step that runs `claude --print` directly on
the runner with the OAuth token from the secret, bypassing the action /
agent SDK entirely. The same invocation is known to succeed locally
(clean HOME, only CLAUDE_CODE_OAUTH_TOKEN as env), so:

- if this step also succeeds in CI: the bug is isolated to how the
  agent SDK forwards auth to its spawned child, and we have a minimal
  upstream repro.
- if this step also fails in CI: the failure is environmental (network
  / runner) rather than SDK-only, and the issue moves up one layer.

Token-leak hardening is documented inline. The step is intended to be
removed (and the token rotated) once the question is answered.
@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 25 minutes and 1 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bfd25fc-e6ef-4871-af67-c14904a30198

📥 Commits

Reviewing files that changed from the base of the PR and between e019a21 and d3d05ba.

📒 Files selected for processing (1)
  • .github/workflows/claude.yml

@bootjp bootjp enabled auto-merge April 28, 2026 13:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd3ae28858

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/workflows/claude.yml Outdated
Comment on lines +50 to +53
- name: Sanity-check OAuth on the runner (diag only)
  env:
    CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
  run: |

P1: Allow diagnostic failure without skipping main Claude step

If claude --print fails in this diagnostic step, the job stops immediately and Run Claude Code is skipped by default, which prevents the pass/fail matrix in the commit message from ever being observed in the failure case. In the exact scenario this diagnostic is meant to detect (runner-side OAuth failure), the workflow won’t execute the comparison step, so the isolation logic is incomplete.

Owner Author

Addressed in 800a99a: marked the verify step continue-on-error: true so a runner-side OAuth failure no longer aborts the job. Run Claude Code still runs, both step results are observable, and the (diag fail / main fail) → "environmental" matrix entry can be detected.
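
As a shell analogy for what `continue-on-error: true` buys here (the workflow key itself is the real mechanism), the sketch below records the outcome of a failing command without letting `set -e` stop the script, so the next command still runs. The helper name `run_nonblocking` is hypothetical, and `false` stands in for both the diag step and the main step.

```shell
#!/usr/bin/env bash
set -eu

# run_nonblocking CMD [ARGS...]
# Prints "pass" or "fail" for CMD and always returns 0, so a failing
# command does not trip `set -e` in the caller.
run_nonblocking() {
  if "$@" >/dev/null 2>&1; then
    echo "pass"
  else
    echo "fail"
  fi
}

# `false` stands in for a failing diag step and a failing main step.
diag_result="$(run_nonblocking false)"
main_result="$(run_nonblocking false)"
echo "diag=${diag_result} main=${main_result}"
```

Both outcomes are available at the end, which is the property the comparison matrix depends on.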

Comment on lines +51 to +52
  env:
    CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}

P2: Limit OAuth secret exposure to the claude command

Setting CLAUDE_CODE_OAUTH_TOKEN at step scope makes it available to the curl | bash installer and any subprocesses, even though only the final claude --print call needs it. This unnecessarily broadens the secret exposure window; if the installer script or its execution path is compromised/noisy, the token can be leaked before the diagnostic command runs.

Owner Author

Addressed in 800a99a: split into two steps — Install Claude CLI for diag (no secret in env, runs curl | bash) and Verify OAuth on the runner (secret scoped only to this step, used only by the single claude --print call). Installer subprocesses now have zero exposure to the OAuth token.
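
The effect of the split can be illustrated with plain environment scoping in shell. This is a sketch of the principle only, not the workflow: `DEMO_TOKEN` and its value are hypothetical placeholders, and `sh -c` stands in for the installer and for the `claude` call.

```shell
#!/usr/bin/env bash
set -eu
unset DEMO_TOKEN  # hypothetical variable name; start from a clean slate

# "Installer" phase: no secret in the environment.
installer_view="$(sh -c 'echo "${DEMO_TOKEN:-unset}"')"

# "Verify" phase: the secret is set for this one command only.
verify_view="$(DEMO_TOKEN=dummy-value sh -c 'echo "${DEMO_TOKEN:-unset}"')"

# Anything that runs afterwards is back to having no secret.
after_view="$(sh -c 'echo "${DEMO_TOKEN:-unset}"')"

echo "installer sees: ${installer_view}"
echo "verify sees:    ${verify_view}"
echo "later sees:     ${after_view}"
```

Only the middle command ever has the value in its environment, which mirrors giving the secret to the verify step alone.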

…error

P1: with set -e the diag step would abort the job on its own failure,
    skipping `Run Claude Code` — exactly the (diag fail / main fail)
    case the comparison matrix is meant to detect. Mark the verify
    step `continue-on-error: true` so both step results are always
    observed.

P2: Token at step env scope is also visible to the curl|bash installer
    and any subprocesses it spawns. Split into two steps: the installer
    has no secret in env; only the verify step has it, scoped to a
    single claude invocation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 800a99aa43

Comment on lines +58 to +62
- name: Install Claude CLI for diag (no secrets in env)
  run: |
    set -euo pipefail
    set +x
    curl -fsSL https://claude.ai/install.sh | bash -s -- 2.1.119 >/dev/null 2>&1

P1: Allow installer diagnostic step to fail without aborting job

The new Install Claude CLI for diag step is still mandatory, so a transient failure downloading https://claude.ai/install.sh will stop the workflow before Run Claude Code executes. That means temporary diagnostic infrastructure can block normal @claude handling and you still lose the intended pass/fail isolation matrix whenever install fails. Make this installer step non-blocking (or otherwise ensure the main Claude step always runs) so diagnostics cannot short-circuit the primary workflow.

Owner Author

Addressed in d3d05ba: marked the installer step continue-on-error: true as well. Now neither a transient install failure (e.g. fetching https://claude.ai/install.sh) nor a runner-side OAuth failure can abort the job — Run Claude Code always runs, and the (install fail / main fail) case is observable from the run summary just like (verify fail / main fail).

Round-2 P1: a transient failure of curl https://claude.ai/install.sh
would otherwise abort the job and skip Run Claude Code, letting the
temporary diagnostic infrastructure block normal @claude handling.

Both diag steps are now continue-on-error: true. If install fails, the
verify step then fails too (no claude on PATH) but workflow proceeds to
the main step, and the (install fail / main fail) case is also
observable from the run summary.
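
The "install fails, so verify fails too" chain described above comes down to whether the CLI resolves on PATH. A minimal sketch of that check, assuming hypothetical helper names `cli_available` and `verify_outcome`, with `sh` and a made-up command name standing in for a present and a missing `claude`:

```shell
#!/usr/bin/env bash
set -eu

# cli_available NAME: succeeds only if NAME resolves on PATH.
cli_available() {
  command -v "$1" >/dev/null 2>&1
}

# verify_outcome NAME: "fail" when the CLI is missing (the install step
# did not succeed), otherwise "ready" to attempt the real check.
verify_outcome() {
  if cli_available "$1"; then
    echo "ready"
  else
    echo "fail"
  fi
}

verify_outcome sh                    # a command that exists
verify_outcome no-such-cli-for-demo  # a command that does not
```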
@bootjp bootjp disabled auto-merge April 28, 2026 14:06
@bootjp bootjp merged commit 25ec963 into main Apr 28, 2026
11 checks passed
@bootjp bootjp deleted the fix/claude-action-oauth-diag branch April 28, 2026 14:09