feat(coordination): action acquires per-repo lock for cross-surface mutual exclusion #548
Plan was written before the Unit 1 extraction settled on its final shape. Action source still lives at `src/`; only main.ts and post.ts moved to `apps/action/src/` as thin re-export shims.
Capture the 5 decisions that scope Unit 3 implementation:
- self-test: gateway-only (skip on Action invocations)
- heartbeat: not in v1 Action (15-min TTL with stale takeover)
- run-state: lock-only (no createRun/transitionRun in Action)
- S3-disabled: skip lock cleanly, preserve single-surface compat
- event-type scope: lock all events (PR, issue, schedule, dispatch)
…utual exclusion
Add an acquire-lock phase that runs after dedup so the GitHub Action and the
Discord gateway never execute concurrently against the same repo. Lock release
runs in cleanup's finally block so the next surface waits at most for the run's
critical section, not the 15-min TTL.
- New phase `runAcquireLock` returns a discriminated union: acquired (with
lockEtag), held-by-other (skip cleanly with holder details), s3-disabled (no
coordination configured, proceed), or error (log and proceed to preserve
single-surface behavior).
- Cleanup's CleanupPhaseOptions takes `lockEtag` and releases via runtime's
`releaseLock` after S3 sync and cache save. Release errors are non-fatal.
- holder_id encodes `action:{run_id}:{run_attempt}` so Discord-side log
inspection and operator recovery (`/fro-bot force-release-lock`) can
identify the holder unambiguously.
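As a rough illustration of the outcome type described above, the sketch below models the four outcomes as a TypeScript discriminated union and handles them exhaustively. Only the outcome names and `lockEtag` come from this commit message; the `kind` tag, the `holderId` and `message` fields, and the `shouldProceed` helper are assumptions made for the example.

```typescript
// Hypothetical shape: only the four outcome names and `lockEtag` appear in the PR.
type AcquireLockResult =
  | { kind: "acquired"; lockEtag: string }
  | { kind: "held-by-other"; holderId: string }
  | { kind: "s3-disabled" }
  | { kind: "error"; message: string };

// Returns true when the run should continue past the acquire-lock phase.
function shouldProceed(result: AcquireLockResult): boolean {
  switch (result.kind) {
    case "acquired":
      return true; // we hold the lock
    case "held-by-other":
      return false; // another surface is running; skip cleanly
    case "s3-disabled":
      return true; // coordination is opt-in
    case "error":
      return true; // log and proceed to preserve single-surface behavior
    default: {
      // Compile-time exhaustiveness check: adding a fifth outcome without
      // handling it makes this assignment a type error.
      const unreachable: never = result;
      return unreachable;
    }
  }
}
```

The `never` assignment in the default branch is what turns a forgotten outcome into a compile error rather than a silent fall-through.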
Decisions captured in plan Unit 3 (2026-04-25):
- Self-test (`validateProviderSemantics`) skipped on Action invocations —
the gateway is the long-lived process and validates provider semantics at
startup.
- No heartbeat in v1 — the 15-min TTL covers the median ~2-min Action run;
rare long runs recover through stale takeover.
- No RunState record — the lock alone provides cross-surface mutual exclusion;
GitHub already tracks workflow run state.
- S3-disabled is graceful: coordination is opt-in.
- All event types are locked (PR, issue, schedule, dispatch).
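The no-heartbeat decision relies on a contender treating an old lock as stale. A minimal sketch of that check follows, assuming a lock record that stores its acquisition time; the `LockRecord` fields and both function names are hypothetical, and only the 15-minute TTL is taken from the decisions above.

```typescript
// Assumed lock record; the real record layout is not shown in this PR.
interface LockRecord {
  holderId: string;
  acquiredAtMs: number;
}

const LOCK_TTL_MS = 15 * 60 * 1000; // 15-min TTL from the decision above

// Without a heartbeat, a lock is stale once its age reaches the TTL.
function isStale(lock: LockRecord, nowMs: number): boolean {
  return nowMs - lock.acquiredAtMs >= LOCK_TTL_MS;
}

// What a contender does when it finds an existing lock.
function onExistingLock(lock: LockRecord, nowMs: number): "take-over" | "skip" {
  return isStale(lock, nowMs) ? "take-over" : "skip";
}
```

Under these assumptions a typical 2-minute run never looks stale to a contender, while a holder that crashed without releasing blocks the repo for at most the TTL.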
fro-bot
left a comment
Verdict: PASS
Clean implementation of per-repo coordination locking. The discriminated union on AcquireLockResult guarantees exhaustive outcome handling, the finally-block release in cleanup ensures the lock is always freed, and all error paths are correctly non-fatal. Phase positioning (after dedup, before acknowledge) is the right call — no lock window wasted on deduplicated runs.
Blocking issues
None
Non-blocking concerns
- Cleanup lock release is untested. All three cleanup tests pass `lockEtag: null`, so the `finally` block (lines 202-236 of `cleanup.ts`) with `releaseLock`, its success/failure branches, and its catch handler are never exercised. A test with a non-null `lockEtag` that asserts `releaseLock` is called (and one where it rejects to confirm non-fatal handling) would close the gap. Low urgency since the logic is straightforward and release failures are tolerated by TTL, but worth adding.
- Minor heartbeat interval inconsistency. `acquire-lock.ts` passes `heartbeatIntervalMs: 0` (intentional: no heartbeat in v1), while `cleanup.ts` passes `DEFAULT_HEARTBEAT_INTERVAL_MS` to `releaseLock`. The value is likely irrelevant for release, but using `0` in both places would keep the intent consistent and avoid confusion for future readers.
Missing tests
- Cleanup path when `lockEtag` is non-null (release called, release succeeds)
- Cleanup path when `releaseLock` returns `{success: false}` (warning logged, run still succeeds)
- Cleanup path when `releaseLock` throws (caught, warning logged)
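The three missing cases could be exercised along the following lines. This is only a sketch: `cleanupWithRelease` is a simplified stand-in for the real cleanup phase, and the `ReleaseLock` signature and `CleanupOutcome` shape are assumptions, not the actual `cleanup.ts` or runtime API.

```typescript
// Hypothetical stand-ins: the real cleanup phase and releaseLock signature may differ.
type ReleaseLock = (lockEtag: string) => Promise<{ success: boolean }>;

interface CleanupOutcome {
  released: boolean;
  warnings: string[];
}

async function cleanupWithRelease(
  lockEtag: string | null,
  releaseLock: ReleaseLock,
): Promise<CleanupOutcome> {
  const warnings: string[] = [];
  let released = false;
  try {
    // S3 sync and cache save would run here.
  } finally {
    if (lockEtag !== null) {
      try {
        const result = await releaseLock(lockEtag);
        released = result.success;
        if (!result.success) warnings.push("lock release reported failure");
      } catch (err) {
        // Non-fatal: the TTL is the safety net.
        warnings.push(`lock release threw: ${String(err)}`);
      }
    }
  }
  return { released, warnings };
}

// Runs the three cases from the list above: success, {success: false}, and a throw.
async function runMissingCases(): Promise<[CleanupOutcome, CleanupOutcome, CleanupOutcome]> {
  const ok = await cleanupWithRelease("etag-1", async () => ({ success: true }));
  const failed = await cleanupWithRelease("etag-1", async () => ({ success: false }));
  const threw = await cleanupWithRelease("etag-1", async () => {
    throw new Error("network down");
  });
  return [ok, failed, threw];
}
```

In the real suite each case would be its own test against the actual phase, with `releaseLock` mocked and the logger asserted on rather than a returned `warnings` array.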
Risk assessment (LOW): likelihood of regression, security exposure, or blast radius
- Regression: Low. The lock is opt-in (S3 must be enabled), and failure modes are non-fatal. Existing single-surface deployments without S3 are unaffected: `s3-disabled` short-circuits cleanly.
- Security: None. Lock metadata (`holderId`, `surface`, `runId`) is public GitHub Actions data. Conditional S3 writes via ETags prevent lock-stomping.
- Blast radius: Confined to the new phase and the cleanup `finally` block. No changes to existing phase interfaces beyond the additive `lockEtag` property on `CleanupPhaseOptions`.
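To make the ETag point concrete, the sketch below uses an in-memory stand-in for an object store with conditional writes. It is not the S3 API and not the lock primitives from #547; the `EtagStore` class and its method names are hypothetical and only mimic create-if-absent and delete-if-ETag-matches semantics.

```typescript
// In-memory stand-in for an object store with ETag-conditional writes.
class EtagStore {
  private value: string | null = null;
  private etag: string | null = null;
  private version = 0;

  // Create the object only if it does not exist yet; returns the new ETag or null.
  createIfAbsent(body: string): string | null {
    if (this.etag !== null) return null;
    this.version += 1;
    this.value = body;
    this.etag = `v${this.version}`;
    return this.etag;
  }

  // Delete only if the caller presents the current ETag.
  deleteIfMatch(expectedEtag: string): boolean {
    if (this.etag !== expectedEtag) return false;
    this.value = null;
    this.etag = null;
    return true;
  }

  read(): string | null {
    return this.value;
  }
}
```

Because both operations are conditional, a second surface cannot overwrite a live lock, and a holder whose ETag is out of date cannot delete a lock that someone else has since acquired.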
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 24940169557 |
| Cache | hit |
| Session | ses_2399fe6d4ffeDfYAcDJ3QZpDMc |
…ck (#634)
* docs(plans): reconcile statuses against shipped reality
Four plans previously marked 'active' have shipped:
- agent-cohesion-session-continuity (deterministic session titles + buildLogicalKey)
- compounding-wiki (vault, schedule, seed pages via PRs #489 #491 #494)
- manual-delivery-mode (output-mode input via PR #517)
- gateway-discord-v1 Units 1-3 (PRs #541 #547 #548)
Gateway v1 plan stays active with Units 4-8 unshipped.
* ci(release): fall back to main's tree on unresolvable merge conflicts
The reset-and-merge step uses 'git merge --no-ff -Xtheirs origin/main' to synthesize next from the last release tag. -Xtheirs handles content conflicts but cannot resolve rename/rename conflicts, which fire every time main's bundle artifact hash changes (dist/artifact-*.js).
On conflict, take main's tree verbatim via 'git checkout origin/main -- .' and commit it as the merge result. The release branch's purpose is to mirror main; biasing fully to main on conflicts preserves that intent without manual intervention.
The GitHub Action and the Discord gateway will share the same repos. Without
coordination, two surfaces can run against the same repo concurrently, race on
S3 writes, and produce conflicting branches/comments. This wires the Action
into the lock primitives shipped in #547 so that surface mutual exclusion is
enforced at the entry point.
What this does
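A new `acquire-lock` phase runs after dedup. Its outcome drives behavior:
- acquired: the run proceeds, holding the `lockEtag`
- held-by-other: the run skips cleanly, reporting holder details
- s3-disabled: no coordination is configured, so the run proceeds
- error: the failure is logged and the run proceeds, preserving single-surface behavior
The holder ID is `action:{run_id}:{run_attempt}` so the Discord-side `/fro-bot force-release-lock` operator command can identify exactly which Action holds the lock when manual recovery is needed.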
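A small sketch of how the holder ID could be built and read back. The `action:{run_id}:{run_attempt}` format is from this PR; the two function names, the numeric-only pattern, and the idea of parsing on the operator side are assumptions. In a workflow the inputs would presumably come from the `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` environment variables.

```typescript
// Builds the holder ID in the documented `action:{run_id}:{run_attempt}` form.
function formatHolderId(runId: string, runAttempt: string): string {
  return `action:${runId}:${runAttempt}`;
}

// Hypothetical parser for operator tooling; returns null for non-Action holders.
function parseHolderId(holderId: string): { runId: string; runAttempt: string } | null {
  const match = /^action:(\d+):(\d+)$/.exec(holderId);
  if (match === null) return null;
  const [, runId, runAttempt] = match;
  if (runId === undefined || runAttempt === undefined) return null;
  return { runId, runAttempt };
}
```

Including the attempt number means a re-run of the same workflow run gets a distinct holder ID from the original attempt.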
Notes
- Lock release runs in cleanup's `finally` so it always runs, even on cleanup failure. Release errors are non-fatal; the 15-min TTL is the safety net.
- No heartbeat in v1: the 15-min TTL covers the median ~2-min Action run, and rare long runs recover via stale-takeover on the next acquisition attempt.
- The Action skips `validateProviderSemantics`; the gateway, as the long-lived process, owns provider validation at startup.
- No `RunState` record is created; GitHub already tracks workflow run state.
Test plan
same-surface contention, S3 unavailable, and the defensive no-ETag case.
`lockEtag: null` (the new option).