
release: v4.4.1 — isolate runner OAuth credential from shared /root/.claude/ #308

Merged
askalf merged 1 commit into master from fix/v4.4.1-runner-credential-isolation
May 17, 2026

Conversation

@askalf
Owner

@askalf askalf commented May 17, 2026

What does this PR do?

Operational hardening. The v4.2.2 walkthrough seeded the runner's CC credential at `/root/.claude/.credentials.json`. On boxes where that path is also mounted into other CC clients (docker services that mount the host's `/root/.claude/` as a credentials volume, operator SSH sessions, etc.), both clients use the same access/refresh tokens. When either refreshes, the other's bearer can be silently invalidated until its next refresh attempt. We hit one such 401 during v4.2.2 setup; the 30-min cron cadence absorbed it, but it's a real failure mode for higher-frequency setups (e.g. v4.4.0's auto-rebake firing during a cycle that happens to overlap a token refresh).

Fix

Both runner workflows now pin `HOME: /root/.claude-runner` on every step that spawns CC. Runner's credential lives at `/root/.claude-runner/.claude/.credentials.json`, isolated from `/root/.claude/` — refreshes on the two paths are now independent.

  • `cc-drift-template-watch.yml`: `Run drift check` and `Auto-rebake + open PR` steps both get `env: HOME: /root/.claude-runner`
  • `compat-test-self-hosted.yml`: `Start dario proxy (passthrough mode)` step gets the same
  • `docs/drift-monitor.md`: documents the isolated-credential flow as the recommended pattern for boxes that share with other CC clients; simpler default (`~/.claude/`) still works for runner-only hosts
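
As a rough illustration of why pinning `HOME` isolates the credential, the sketch below shows that an environment-prefix assignment changes `HOME` only for the child process it launches, which is the same scoping a step-level `env:` entry gives the commands in that step. The `show_cred_path` snippet is a hypothetical stand-in for a client that reads `~/.claude/.credentials.json`; it only prints the path and never touches the filesystem.

```bash
#!/usr/bin/env bash
# Sketch only: RUNNER_HOME is the value pinned in this PR; show_cred_path is
# a hypothetical stand-in for a client resolving ~/.claude/.credentials.json.
RUNNER_HOME="/root/.claude-runner"

# Single quotes keep $HOME unexpanded until the child process runs.
show_cred_path='printf "%s/.claude/.credentials.json\n" "$HOME"'

# Child inherits the caller's HOME.
default_path="$(sh -c "$show_cred_path")"

# HOME is overridden for this one child process only.
runner_path="$(HOME="$RUNNER_HOME" sh -c "$show_cred_path")"

echo "default: $default_path"
echo "runner:  $runner_path"   # runner:  /root/.claude-runner/.claude/.credentials.json
```

The caller's own `HOME` is unchanged afterwards, so other processes on the box keep resolving the shared path.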

Verification

Generated a fresh OAuth credential on the production runner via `HOME=/root/.claude-runner dario login --manual`. dario writes its credentials to `~/.dario/credentials.json`; CC reads from `~/.claude/.credentials.json`. Both use the same JSON format (top-level `claudeAiOauth` key), so setup mirrors the dario file to the CC path. Confirmed:

  • `HOME=/root/.claude-runner claude --print` returns PONG (auth works)
  • Full `--check` against the runner's clone with isolated HOME reports `no drift detected. exit 0`
  • Platform's `/root/.claude/.credentials.json` (mtime 13:48 UTC, hours before the v4.4.1 work) untouched
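
The mirroring step described above could look roughly like the following. This is a minimal sketch, assuming a plain file copy is sufficient; `mirror_credential` is a hypothetical helper name, not a dario or CC command, and the key check is a cheap text match, not a JSON parse.

```bash
#!/usr/bin/env bash
# mirror_credential is a hypothetical helper, not part of dario or CC.
# It copies <home>/.dario/credentials.json to <home>/.claude/.credentials.json.
mirror_credential() {
  local home_dir="$1"
  local src="${home_dir}/.dario/credentials.json"
  local dst="${home_dir}/.claude/.credentials.json"

  # Refuse to copy when the source is missing or lacks the shared key.
  if ! grep -q '"claudeAiOauth"' "$src" 2>/dev/null; then
    echo "no claudeAiOauth key found in ${src}" >&2
    return 1
  fi

  mkdir -p "${home_dir}/.claude"
  cp "$src" "$dst"
  chmod 600 "$dst"   # owner-only: the file holds OAuth tokens
}
```

On the runner this would be called as `mirror_credential /root/.claude-runner` after `dario login --manual` has written the dario-side file.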

How to test

```bash
git fetch origin fix/v4.4.1-runner-credential-isolation
git checkout fix/v4.4.1-runner-credential-isolation
npm run build && npm test # 74/74

# End-to-end: once merged, the next 30-min watcher cron tick exercises
# the new HOME pinning. Manual workflow_dispatch on master also available.

```

Checklist

  • `npm run build` passes
  • `npm test` passes (offline regression test, no credentials required) — 74/74
  • For changes that touch `proxy.ts`, `cc-template.ts`, or streaming behavior: tested with `dario proxy --verbose` + `node test/compat.mjs` (requires credentials). N/A: workflow and docs only, no src/ changes. The compat-test workflow is itself one of the modified files, so this PR's path filter triggers compat-test on the bot's PR after the auto-release, if any
  • No new runtime dependencies added
  • No tokens/secrets in code or logs

The v4.2.2 walkthrough seeded the runner's credential at
/root/.claude/.credentials.json. On boxes where that path is also
mounted into other CC clients — docker services, operator SSH
sessions — both clients use the same access/refresh tokens. When
either refreshes, the other's bearer can be silently invalidated.
We hit one 401 during v4.2.2 setup; the 30-min cron cadence
absorbed it but it's a real failure mode.

Fix: both runner workflows now pin HOME=/root/.claude-runner on
every step that spawns CC. Setup writes the runner's credential
under /root/.claude-runner/.claude/.credentials.json, isolated from
the platform path.

- cc-drift-template-watch.yml: Run drift check + Auto-rebake + open
  PR steps both get env HOME=/root/.claude-runner
- compat-test-self-hosted.yml: Start dario proxy step gets the same
- docs/drift-monitor.md: documents the isolated flow as the
  recommended pattern for shared boxes; default ~/.claude/ still
  works for runner-only hosts

Verified end-to-end on the production runner: generated fresh
credential via HOME=/root/.claude-runner dario login --manual,
mirrored dario's ~/.dario/credentials.json to CC's
~/.claude/.credentials.json (same JSON format, top-level
claudeAiOauth key), confirmed `claude --print` returns PONG and
--check reports no drift. Platform's /root/.claude/ untouched.

Pure operational hardening. No src/ changes. 74/74 default suite
green.
@askalf askalf enabled auto-merge (squash) May 17, 2026 19:19
@github-actions
Contributor

Compat test: ✅ PASSED

Ran node test/compat.mjs against dario proxy --passthrough on the self-hosted runner for commit c9aa7fc68bb9504e12c67804fdf93bc8ae566ac9.

Output
============================================================
  dario Compatibility Validation (--passthrough)
  2026-05-17T19:19:49.957Z
============================================================

⚠️  NOTE: All requests are 429ing and falling back to CLI.
   This is expected in --passthrough without priority routing.
   Tool use and header tests will fail (CLI limitations).
   Re-run after 5h window resets for direct API results.

--- Anthropic Messages API (Hermes) ---
❌ #1 Anthropic non-stream: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
❌ #2 Anthropic stream: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
❌ #3 SSE framing: HTTP 401

--- Passthrough Verification ---
❌ #4 No thinking injection: HTTP 401
❌ #5 Client betas preserved: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}

--- Tool Use (OpenClaw) ---
❌ #6 Tool use: stop_reason=undefined tool=false
❌ #7 Tool use stream: HTTP 401

--- OpenAI Compat ---
❌ #8 OpenAI non-stream: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
❌ #9 OpenAI stream: HTTP 401

--- Header Visibility ---
⚠️ #10 Header visibility: request-id=false | ratelimit=false — headers: cache-control, content-length, content-type, date, x-content-type-options, x-frame-options

============================================================
  RESULTS: 0 passed, 9 failed, 1 warnings
============================================================

Failed:
  #1 Anthropic non-stream: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
  #2 Anthropic stream: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
  #3 SSE framing: HTTP 401
  #4 No thinking injection: HTTP 401
  #5 Client betas preserved: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
  #6 Tool use: stop_reason=undefined tool=false
  #7 Tool use stream: HTTP 401
  #8 OpenAI non-stream: HTTP 401: {"error":"Unauthorized","message":"Invalid or missing API key"}
  #9 OpenAI stream: HTTP 401

Full workflow run

@askalf askalf merged commit 55334b1 into master May 17, 2026
10 checks passed
@askalf askalf deleted the fix/v4.4.1-runner-credential-isolation branch May 17, 2026 19:21
askalf added a commit that referenced this pull request May 17, 2026
Both compat-test-self-hosted.yml and cc-billing-classifier-canary.yml
were silently piggybacking on the platform's existing dario
instance (askalf-dario docker container at :3456), not the
freshly-built dist they were supposed to test.

Mechanism: dario proxy's EADDRINUSE handler probes /health when
its target port is occupied, sees an existing dario, prints
"dario — already running" and exits 0 (intentional: makes
`dario login` / `dario proxy` idempotent for users). On the
production runner the docker askalf-dario already binds :3456,
so the workflow's `dario proxy` short-circuits and the workflow's
curls hit the platform's dario using PLATFORM credentials.

For the canary: produced 401 + claim='' because the platform's
account is in a different state right now.

For compat-test: every PR check on PRs #303, #304, #306, #308,
#310, #311 was validating the platform dario, not the PR's
freshly-built dist. The PR-time gate was measuring the wrong
thing.

Fix: both workflows now bind --port 3457 and the harnesses read
DARIO_TEST_URL=http://127.0.0.1:3457. Eliminates the port
collision.

Validated locally on the production runner:
HOME=/root/.claude-runner dario proxy --port 3457 starts clean,
/health responds, single tiny haiku request returns 200 with a
subscription representative-claim. The runner workflow will produce
the same result once landed.

75/75 default suite green. No src/ changes.
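
The referenced commit avoids the collision by moving the workflows to port 3457. As a complementary sketch (an assumption, not something that commit adds), a workflow could also fail fast when the chosen port is already bound, so that an "already running" exit 0 from another dario cannot mask the problem. `port_in_use` and `require_free_port` are hypothetical helper names, and the probe relies on bash's `/dev/tcp` redirection.

```bash
#!/usr/bin/env bash
# Returns 0 if something accepts TCP connections on 127.0.0.1:$1.
# /dev/tcp is a bash redirection feature, not a real device file.
port_in_use() {
  local port="$1"
  # The subshell closes fd 3 on exit, so no connection is left open.
  (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Hypothetical guard: stop before testing against a foreign instance.
require_free_port() {
  local port="$1"
  if port_in_use "$port"; then
    echo "port ${port} is already bound; refusing to continue" >&2
    return 1
  fi
}
```

A step might run `require_free_port 3457` before `HOME=/root/.claude-runner dario proxy --port 3457`, then point the harness at `DARIO_TEST_URL=http://127.0.0.1:3457` as described in the commit message.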