OAuth refresh race condition when multiple eca server processes coexist (Anthropic Max, likely others)

**Describe the bug**

When multiple `eca server` processes run on the same machine (e.g. one per workspace in a multi-project workflow), they race during OAuth access-token refresh against `https://console.anthropic.com/v1/oauth/token`. The losing process receives:

```
Anthropic refresh token failed: "{\"error\": \"invalid_grant\", \"error_description\": \"Refresh token not found or invalid\"}"
Error: Auth token renew failed
```

After this fires, the affected session cannot recover automatically. The user has to `/login` again, which destroys the active chat from the user's perspective.

**To Reproduce**

1. Authenticate ECA with Anthropic Max OAuth (`/login` -> `max`).
2. Start a second `eca server` process for a different workspace (e.g. open ECA in another project, or use `emacs --daemon` and spawn ECA from a second workspace).
3. Let both processes idle ~1 hour until access tokens approach expiry.
4. Send a prompt in both near-simultaneously.
5. Observe: one prompt succeeds and rotates the refresh token; the other fails with `Anthropic refresh token failed: invalid_grant`.

The more processes are running, the more likely a collision becomes. With three or more processes I observe failures daily.

**Expected behavior**

Token refresh should be safe under concurrent processes. The losers of the race should detect the rotation that already happened, adopt the new tokens from disk, and continue without surfacing an error.

**Additional context**

Root cause

All `eca server` processes share `~/.cache/eca/db.transit.json` for OAuth state (`refresh-token`, `access-token`, `expires-at`). On startup each process reads this file into its in-memory `db*` atom. There is no file lock around the refresh flow:

- `src/eca/llm_providers/anthropic.clj` `oauth-refresh` (around line 548) calls `POST /v1/oauth/token` with the in-memory `refresh-token`.
- On success it swaps new tokens into `db*` and writes to disk via `db/update-global-cache!`.

When access tokens expire (~1 hour TTL), multiple processes detect expiry near-simultaneously and each call `oauth-refresh` with the same `refresh-token` they loaded at startup. Anthropic rotates the refresh token on every successful refresh; the first call invalidates the old token server-side and subsequent calls within the same window receive `invalid_grant`. The losers' in-memory and on-disk `refresh-token` becomes permanently invalid until manual `/login`.
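To make the failure mode concrete, here is a minimal in-process simulation of a token endpoint that rotates the refresh token on every success. It is only a sketch: `make-server` and `refresh!` are hypothetical stand-ins, not ECA or Anthropic code, and the token strings are made up.

```clojure
;; Hypothetical stand-in for the token endpoint: it accepts only the
;; currently valid refresh token and rotates it on every success.
(defn make-server [initial-refresh-token]
  (atom {:valid-refresh-token initial-refresh-token
         :counter 0}))

(defn refresh!
  "Simulates POST /v1/oauth/token. Returns new tokens on success, or an
  invalid_grant error map when the presented token was already rotated."
  [server refresh-token]
  (let [[old new] (swap-vals! server
                              (fn [{:keys [valid-refresh-token counter] :as state}]
                                (if (= refresh-token valid-refresh-token)
                                  {:valid-refresh-token (str "rt-" (inc counter))
                                   :counter (inc counter)}
                                  state)))]
    (if (= refresh-token (:valid-refresh-token old))
      {:refresh-token (:valid-refresh-token new)
       :access-token  (str "at-" (:counter new))}
      {:error "invalid_grant"})))

;; Two processes loaded the same refresh token at startup.
(def server (make-server "rt-0"))
(def process-a (refresh! server "rt-0")) ; wins and rotates the token
(def process-b (refresh! server "rt-0")) ; loses: "rt-0" is no longer valid
```

In this simulation `process-a` ends up holding `"rt-1"`, while `process-b` gets `{:error "invalid_grant"}` and still holds only the dead `"rt-0"`.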

Likely the same root cause as an earlier report

editor-code-assistant/eca-emacs#177 (closed) included a user report from @snoopier: "Get a lot of 401 today and it's absolutely random". The repo owner replied "I noticed anthropic and other models are throwing this randomly. I believe we can have a way to configure in ECA a match for status-code and body to consider as retry". The fix added `retryRules` config.

`retryRules` helps for transient 401s (the access token is stale; retry triggers a fresh refresh). It does not recover from `invalid_grant` on the refresh endpoint itself, because the `refresh-token` is permanently invalid. The "absolutely random" pattern matches what you would expect from a refresh race.

Diagnostic data from my setup

```
$ ps -eo pid,etime,command | grep "[e]ca server"
99996  1-23:55  /opt/homebrew/bin/eca server
97576  2-00:35  /opt/homebrew/bin/eca server
10027    04:35  /opt/homebrew/bin/eca server
29094       59  /opt/homebrew/bin/eca server
29434       55  /opt/homebrew/bin/eca server
```

Five concurrent `eca server` processes share a single `~/.cache/eca/db.transit.json`. Failures occur daily. The error message, verbatim:

```
Anthropic refresh token failed: "{\"error\": \"invalid_grant\", \"error_description\": \"Refresh token not found or invalid\"}"
```

Anthropic auth in `db.transit.json`:

```clojure
{"anthropic" {:step :login/done
              :mode :max
              :type :auth/oauth
              :refresh-token "sk-ant-ort01-..."
              :api-key "sk-ant-oat01-..."
              :expires-at <epoch>}}
```

Proposed fix

Wrap `oauth-refresh` (and the equivalent in `openai.clj` and `oauth.clj`'s `refresh-token!`) with a file-lock plus re-read pattern:

```clojure
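;; Assumes (:require [clojure.java.io :as io]) in the namespace.
(defn ^:private with-token-refresh-lock
  "Runs f while holding an exclusive lock on <cache-file>.lock.
  with-open releases the lock, then closes the channel and the file."
  [cache-file f]
  (let [lock-file (io/file (str cache-file ".lock"))]
    (io/make-parents lock-file)
    (with-open [raf (java.io.RandomAccessFile. lock-file "rw")
                ch  (.getChannel raf)
                lk  (.lock ch)]
      (f))))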

;; Inside the refresh path, after acquiring the lock:
;;   1. Re-read db.transit.json from disk
;;   2. If on-disk refresh-token differs from in-memory: another process
;;      refreshed; adopt disk values, skip HTTP, return success
;;   3. Otherwise: call HTTP refresh, write new tokens, return
```

Java `FileLock` on the JVM works; the GraalVM native image also supports it. The lock is exclusive and held only across the refresh call (sub-second), so contention is negligible.
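One caveat: a `FileLock` is held on behalf of the whole JVM, so two threads in the same process that both try to lock the file get an `OverlappingFileLockException` rather than waiting. Below is a sketch of a variant of the wrapper above that adds an in-process monitor and spells out the three re-read steps. The callbacks `read-disk`, `http-refresh!`, and `write-disk!` are hypothetical placeholders for whatever ECA uses to read the cache, call the token endpoint, and persist the result. They are not existing ECA functions.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io RandomAccessFile))

;; FileLock only guards against other processes, so threads inside this
;; JVM are serialized on a plain monitor first.
(defonce ^:private refresh-monitor (Object.))

(defn with-token-refresh-lock
  "Runs f while holding an exclusive lock on <cache-file>.lock."
  [cache-file f]
  (let [lock-file (io/file (str cache-file ".lock"))]
    (io/make-parents lock-file)
    (locking refresh-monitor
      (with-open [raf (RandomAccessFile. lock-file "rw")
                  ch  (.getChannel raf)
                  _lk (.lock ch)]
        (f)))))

(defn refresh-with-reread!
  "Returns the auth map the caller should adopt. All three callbacks are
  hypothetical and injected by the caller."
  [cache-file in-memory-auth {:keys [read-disk http-refresh! write-disk!]}]
  (with-token-refresh-lock cache-file
    (fn []
      ;; 1. Re-read the shared cache now that we hold the lock.
      (let [on-disk (read-disk)]
        (if (and on-disk
                 (not= (:refresh-token on-disk)
                       (:refresh-token in-memory-auth)))
          ;; 2. Another process already rotated the token: adopt its
          ;;    result and skip the HTTP call.
          on-disk
          ;; 3. Our token is still current: refresh and persist.
          (let [fresh (http-refresh! (:refresh-token in-memory-auth))]
            (write-disk! fresh)
            fresh))))))
```

The caller would then swap the returned map into `db*`, whichever branch produced it.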

Simpler alternative (lower bar): on `invalid_grant` response, re-read `db.transit.json` once and retry with the disk-fresh refresh token before surfacing the error. This does not prevent the race but auto-recovers from it.
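A minimal sketch of that fallback, assuming the refresh call returns a map with `:error "invalid_grant"` on failure. `read-disk-token` and `http-refresh!` are hypothetical injected callbacks, not existing ECA functions.

```clojure
(defn refresh-with-disk-retry!
  "Tries the in-memory refresh token first. On invalid_grant, re-reads the
  token from disk once and retries only if it differs from the one that
  just failed."
  [in-memory-token {:keys [read-disk-token http-refresh!]}]
  (let [first-try (http-refresh! in-memory-token)]
    (if (not= "invalid_grant" (:error first-try))
      first-try
      (let [disk-token (read-disk-token)]
        (if (and disk-token (not= disk-token in-memory-token))
          ;; Disk holds a token this process has not tried: retry once.
          (http-refresh! disk-token)
          ;; Nothing newer on disk: surface the original error.
          first-try)))))
```

Note that the retry rotates the token again, so the process that won the original race will itself hit `invalid_grant` on its next refresh and needs the same fallback. Adopting the on-disk access token directly while it is still unexpired would avoid that second rotation.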

Affected files

- `src/eca/llm_providers/anthropic.clj` (`oauth-refresh`, `:login/renew-token` step)
- `src/eca/llm_providers/openai.clj` (parallel `oauth-refresh`)
- `src/eca/oauth.clj` (`refresh-token!` for MCP server OAuth, same race possible)
- `src/eca/db.clj` (good place for the lock wrapper)

Workaround until fixed

- Run only one `eca server` process at a time, or
- Use `ANTHROPIC_API_KEY` (loses Max subscription billing).

Severity

For users with multi-workspace or `emacs --daemon` workflows: daily auth failures, repeated browser-based re-logins, lost chat context. The issue scales with concurrent process count and is invisible to single-session users, which is likely why it has gone undiagnosed despite the symptom appearing in eca-emacs#177.

