Describe the bug
When multiple eca server processes run on the same machine (e.g. one per workspace in a multi-project workflow), they race during OAuth access-token refresh against https://console.anthropic.com/v1/oauth/token. The losing process receives:
Anthropic refresh token failed: "{\"error\": \"invalid_grant\", \"error_description\": \"Refresh token not found or invalid\"}"
Error: Auth token renew failed
After this fires, the affected session cannot recover automatically. The user has to /login again, which destroys the active chat from the user's perspective.
To Reproduce
- Authenticate ECA with Anthropic Max OAuth (/login -> max).
- Start a second eca server process for a different workspace (e.g. open ECA in another project, or use emacs --daemon and spawn ECA from a second workspace).
- Let both processes idle ~1 hour until access tokens approach expiry.
- Send a prompt in both near-simultaneously.
- Observe: one prompt succeeds and rotates the refresh token; the other fails with Anthropic refresh token failed: invalid_grant.
More concurrent processes make the race more likely. With 3 or more processes, I observe failures daily.
Expected behavior
Token refresh should be safe under concurrent processes. The losers of the race should detect the rotation that already happened, adopt the new tokens from disk, and continue without surfacing an error.
Additional context
Root cause
All eca server processes share ~/.cache/eca/db.transit.json for OAuth state (refresh-token, access-token, expires-at). On startup each process reads this file into its in-memory db* atom. There is no file lock around the refresh flow:
- src/eca/llm_providers/anthropic.clj oauth-refresh (around line 548) calls POST /v1/oauth/token with the in-memory refresh-token.
- On success it swaps new tokens into db* and writes to disk via db/update-global-cache!.
When access tokens expire (~1 hour TTL), multiple processes detect expiry near-simultaneously and each call oauth-refresh with the same refresh-token they loaded at startup. Anthropic rotates the refresh token on every successful refresh; the first call invalidates the old token server-side and subsequent calls within the same window receive invalid_grant. The losers' in-memory and on-disk refresh-token becomes permanently invalid until manual /login.
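The sequence above can be illustrated with a small in-process simulation. This is only a sketch: the `server` atom and `fake-refresh!` are hypothetical stand-ins for the token endpoint, and `process-a` / `process-b` stand in for two eca server processes; none of these names exist in ECA.

```clojure
;; Hypothetical stand-in for the OAuth server: it remembers the single
;; refresh token that is currently valid and rotates it on every success.
(def server (atom {:valid-refresh-token "rt-0" :counter 0}))

(defn fake-refresh!
  "Simulates POST /v1/oauth/token. Returns new tokens when `refresh-token`
  is still valid, otherwise an invalid_grant error map."
  [refresh-token]
  (let [[old new] (swap-vals! server
                              (fn [{:keys [valid-refresh-token counter] :as s}]
                                (if (= refresh-token valid-refresh-token)
                                  {:valid-refresh-token (str "rt-" (inc counter))
                                   :counter (inc counter)}
                                  s)))]
    (if (= old new)
      {:error "invalid_grant"}
      {:refresh-token (:valid-refresh-token new)
       :access-token  (str "at-" (:counter new))})))

;; Both processes loaded the same refresh token at startup.
(def process-a (atom {:refresh-token "rt-0"}))
(def process-b (atom {:refresh-token "rt-0"}))

;; A refreshes first and rotates the token server-side.
(def result-a (fake-refresh! (:refresh-token @process-a)))
;; B still presents the old token and is rejected.
(def result-b (fake-refresh! (:refresh-token @process-b)))
```

In this model `result-a` carries the rotated token while `result-b` is the `invalid_grant` error, mirroring what the losing process sees.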
Likely the same root cause as an earlier report
editor-code-assistant/eca-emacs#177 (closed) included a user report from @snoopier: "Get a lot of 401 today and it's absolutely random". The repo owner replied "I noticed anthropic and other models are throwing this randomly. I believe we can have a way to configure in ECA a match for status-code and body to consider as retry". The fix added retryRules config.
retryRules helps for transient 401s (the access token is stale; retry triggers a fresh refresh). It does not recover from invalid_grant on the refresh endpoint itself, because the refresh-token is permanently invalid. The "absolutely random" pattern matches what you would expect from a refresh race.
Diagnostic data from my setup
$ ps -eo pid,etime,command | grep "[e]ca server"
99996 1-23:55 /opt/homebrew/bin/eca server
97576 2-00:35 /opt/homebrew/bin/eca server
10027 04:35 /opt/homebrew/bin/eca server
29094 59 /opt/homebrew/bin/eca server
29434 55 /opt/homebrew/bin/eca server
5 concurrent eca server processes against a single shared ~/.cache/eca/db.transit.json. Failure observed daily. Error message verbatim:
Anthropic refresh token failed: "{\"error\": \"invalid_grant\", \"error_description\": \"Refresh token not found or invalid\"}"
Anthropic auth in db.transit.json:
{"anthropic" {:step :login/done
              :mode :max
              :type :auth/oauth
              :refresh-token "sk-ant-ort01-..."
              :api-key "sk-ant-oat01-..."
              :expires-at <epoch>}}
Proposed fix
Wrap oauth-refresh (and the equivalent in openai.clj and oauth.clj's refresh-token!) with a file-lock plus re-read pattern:
;; Assumes (:require [clojure.java.io :as io]) in the namespace.
(defn ^:private with-token-refresh-lock [cache-file f]
  (let [lock-file (io/file (str cache-file ".lock"))]
    (io/make-parents lock-file)
    ;; with-open releases the lock first, then closes the channel and the file.
    (with-open [raf (java.io.RandomAccessFile. lock-file "rw")
                ch  (.getChannel raf)
                _lk (.lock ch)]
      (f))))

;; Inside the refresh path, after acquiring the lock:
;; 1. Re-read db.transit.json from disk
;; 2. If on-disk refresh-token differs from in-memory: another process
;;    refreshed; adopt disk values, skip HTTP, return success
;; 3. Otherwise: call HTTP refresh, write new tokens, return
Java FileLock on the JVM works; the GraalVM native image also supports it. The lock is exclusive and held only across the refresh call (sub-second), so contention is negligible.
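The three steps in the comments above could be wired together roughly as follows. This is a sketch under stated assumptions, not ECA's implementation: `refresh-with-reread!` and the injected functions `read-disk`, `http-refresh!`, and `write-disk!` are hypothetical names, and the token maps are simplified to plain Clojure maps rather than the real transit cache.

```clojure
(require '[clojure.java.io :as io])

(defn with-file-lock
  "Runs `f` while holding an exclusive java.nio FileLock on `lock-path`."
  [lock-path f]
  (let [lock-file (io/file lock-path)]
    (io/make-parents lock-file)
    (with-open [raf (java.io.RandomAccessFile. lock-file "rw")
                ch  (.getChannel raf)
                _lk (.lock ch)]
      (f))))

(defn refresh-with-reread!
  "Hypothetical refresh path. `in-memory` is this process's token map;
  `read-disk`, `http-refresh!`, and `write-disk!` are supplied by the caller."
  [lock-path in-memory {:keys [read-disk http-refresh! write-disk!]}]
  (with-file-lock lock-path
    (fn []
      (let [on-disk (read-disk)]
        (if (and on-disk
                 (not= (:refresh-token on-disk) (:refresh-token in-memory)))
          ;; Another process already rotated the token: adopt it, skip HTTP.
          (assoc on-disk :source :adopted-from-disk)
          ;; Token unchanged and we hold the lock: refresh and persist.
          (let [fresh (http-refresh! (:refresh-token in-memory))]
            (write-disk! fresh)
            (assoc fresh :source :refreshed)))))))
```

One caveat worth keeping in mind: a `FileLock` is held on behalf of the whole JVM, so it coordinates separate processes but not threads inside one process. If several threads in the same eca server could reach the refresh path at once, an in-process lock around the call would also be needed.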
Simpler alternative (lower bar): on invalid_grant response, re-read db.transit.json once and retry with the disk-fresh refresh token before surfacing the error. This does not prevent the race but auto-recovers from it.
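A minimal sketch of that recovery path, assuming the refresh call returns a map with an `:error` key on failure. `refresh-with-recovery!`, `read-disk`, and `http-refresh!` are hypothetical names used for illustration only.

```clojure
(defn refresh-with-recovery!
  "Tries the in-memory refresh token first. On invalid_grant, re-reads the
  shared cache once and retries with the token found there."
  [in-memory {:keys [read-disk http-refresh!]}]
  (let [first-try (http-refresh! (:refresh-token in-memory))]
    (if (not= "invalid_grant" (:error first-try))
      first-try
      (let [disk-token (:refresh-token (read-disk))]
        (if (and disk-token
                 (not= disk-token (:refresh-token in-memory)))
          ;; A newer token is on disk: retry exactly once with it.
          (http-refresh! disk-token)
          ;; Nothing newer on disk: surface the original error.
          first-try)))))
```

Because every successful refresh rotates the token, the retry itself invalidates the token the winning process holds in memory, so each process would need the same recovery logic for this approach to converge.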
Affected files
- src/eca/llm_providers/anthropic.clj (oauth-refresh, :login/renew-token step)
- src/eca/llm_providers/openai.clj (parallel oauth-refresh)
- src/eca/oauth.clj (refresh-token! for MCP server OAuth, same race possible)
- src/eca/db.clj (good place for the lock wrapper)
Workaround until fixed
- Run only one eca server process at a time, or
- Use ANTHROPIC_API_KEY (loses Max subscription billing).
Severity
For users with multi-workspace or emacs --daemon workflows: daily auth failures, repeated browser-based re-logins, lost chat context. The issue scales with concurrent process count and is invisible to single-session users, which is likely why it has gone undiagnosed despite the symptom appearing in eca-emacs#177.