feat(coordination): action acquires per-repo lock for cross-surface mutual exclusion by marcusrbrown · Pull Request #548 · fro-bot/agent

marcusrbrown · 2026-04-25T20:39:27Z

The GitHub Action and the Discord gateway will share the same repos. Without
coordination, two surfaces can run against the same repo concurrently, race on
S3 writes, and produce conflicting branches/comments. This wires the Action
into the lock primitives shipped in #547 so that surface mutual exclusion is
enforced at the entry point.

What this does

A new acquire-lock phase runs after dedup. Its outcome drives behavior:

Outcome	Behavior
acquired	Run proceeds; cleanup releases the lock after S3 sync and cache save.
held-by-other	Run skips cleanly (exit 0). Holder ID and surface are logged for operator visibility.
s3-disabled	Object store not configured — coordination is opt-in; run proceeds without a lock.
error	Log and proceed to preserve single-surface behavior. Stale lock recovers via takeover.

The holder ID is action:{run_id}:{run_attempt} so the Discord-side
/fro-bot force-release-lock operator command can identify exactly which
Action holds the lock when manual recovery is needed.

Notes

Lock release lives in cleanup's finally so it always runs, even on cleanup
failure. Release errors are non-fatal — the 15-min TTL is the safety net.
The 15-min TTL with no heartbeat covers the ~2-min median Action run; rare
long runs recover via stale-takeover on the next acquisition attempt.
The Action does not run validateProviderSemantics — the gateway, as the
long-lived process, owns provider validation at startup.
The lock alone provides cross-surface mutual exclusion. No RunState
record is created; GitHub already tracks workflow run state.

Test plan

8 new test scenarios covering all four outcomes plus stale-takeover race,
same-surface contention, S3 unavailable, and the defensive no-ETag case.
Existing cleanup tests updated to pass lockEtag: null (the new option).
Full suite: 1290 tests pass, 0 lint errors, types clean, build clean.

Plan was written before the Unit 1 extraction settled on its final shape. Action source still lives at `src/` — only main.ts and post.ts moved to `apps/action/src/` as thin re-export shims. Capture the 5 decisions that scope Unit 3 implementation: - self-test: gateway-only (skip on Action invocations) - heartbeat: not in v1 Action (15-min TTL with stale takeover) - run-state: lock-only (no createRun/transitionRun in Action) - S3-disabled: skip lock cleanly, preserve single-surface compat - event-type scope: lock all events (PR, issue, schedule, dispatch)

…utual exclusion Add an acquire-lock phase that runs after dedup so the GitHub Action and the Discord gateway never execute concurrently against the same repo. Lock release runs in cleanup's finally block so the next surface waits at most for the run's critical section, not the 15-min TTL. - New phase `runAcquireLock` returns a discriminated union: acquired (with lockEtag), held-by-other (skip cleanly with holder details), s3-disabled (no coordination configured, proceed), or error (log and proceed to preserve single-surface behavior). - Cleanup's CleanupPhaseOptions takes `lockEtag` and releases via runtime's `releaseLock` after S3 sync and cache save. Release errors are non-fatal. - holder_id encodes `action:{run_id}:{run_attempt}` so Discord-side log inspection and operator recovery (`/fro-bot force-release-lock`) can identify the holder unambiguously. Decisions captured in plan Unit 3 (2026-04-25): - Self-test (`validateProviderSemantics`) skipped on Action invocations — the gateway is the long-lived process and validates provider semantics at startup. - No heartbeat in v1 — the 15-min TTL covers the median ~2-min Action run; rare long runs recover through stale takeover. - No RunState record — the lock alone provides cross-surface mutual exclusion; GitHub already tracks workflow run state. - S3-disabled is graceful: coordination is opt-in. - All event types are locked (PR, issue, schedule, dispatch).

fro-bot

Verdict: PASS

Clean implementation of per-repo coordination locking. The discriminated union on AcquireLockResult guarantees exhaustive outcome handling, the finally-block release in cleanup ensures the lock is always freed, and all error paths are correctly non-fatal. Phase positioning (after dedup, before acknowledge) is the right call — no lock window wasted on deduplicated runs.

Blocking issues

None

Non-blocking concerns

Cleanup lock release is untested. All three cleanup tests pass lockEtag: null, so the finally block (lines 202-236 of cleanup.ts) with releaseLock, its success/failure branches, and its catch handler are never exercised. A test with a non-null lockEtag that asserts releaseLock is called (and one where it rejects to confirm non-fatal handling) would close the gap. Low urgency since the logic is straightforward and release failures are tolerated by TTL, but worth adding.
Minor heartbeat interval inconsistency. acquire-lock.ts passes heartbeatIntervalMs: 0 (intentional — no heartbeat in v1), while cleanup.ts passes DEFAULT_HEARTBEAT_INTERVAL_MS to releaseLock. The value is likely irrelevant for release, but using 0 in both places would keep the intent consistent and avoid confusion for future readers.

Missing tests

Cleanup path when lockEtag is non-null (release called, release succeeds)
Cleanup path when releaseLock returns {success: false} (warning logged, run still succeeds)
Cleanup path when releaseLock throws (caught, warning logged)

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

Regression: Low. The lock is opt-in (S3 must be enabled), and failure modes are non-fatal. Existing single-surface deployments without S3 are unaffected — s3-disabled short-circuits cleanly.
Security: None. Lock metadata (holderId, surface, runId) is public GitHub Actions data. Conditional S3 writes via ETags prevent lock-stomping.
Blast radius: Confined to the new phase and the cleanup finally block. No changes to existing phase interfaces beyond the additive lockEtag property on CleanupPhaseOptions.

Run Summary

Field	Value
Event	pull_request
Repository	fro-bot/agent
Run ID	24940169557
Cache	hit
Session	ses_2399fe6d4ffeDfYAcDJ3QZpDMc

…ck (#634) * docs(plans): reconcile statuses against shipped reality Four plans previously marked 'active' have shipped: - agent-cohesion-session-continuity (deterministic session titles + buildLogicalKey) - compounding-wiki (vault, schedule, seed pages via PRs #489 #491 #494) - manual-delivery-mode (output-mode input via PR #517) - gateway-discord-v1 Units 1-3 (PRs #541 #547 #548) Gateway v1 plan stays active with Units 4-8 unshipped. * ci(release): fall back to main's tree on unresolvable merge conflicts The reset-and-merge step uses 'git merge --no-ff -Xtheirs origin/main' to synthesize next from the last release tag. -Xtheirs handles content conflicts but cannot resolve rename/rename conflicts, which fire every time main's bundle artifact hash changes (dist/artifact-*.js). On conflict, take main's tree verbatim via 'git checkout origin/main -- .' and commit it as the merge result. The release branch's purpose is to mirror main; biasing fully to main on conflicts preserves that intent without manual intervention.

marcusrbrown added 2 commits April 25, 2026 13:20

marcusrbrown requested a review from fro-bot as a code owner April 25, 2026 20:39

fro-bot added the agent: working label Apr 25, 2026

fro-bot approved these changes Apr 25, 2026

View reviewed changes

fro-bot removed the agent: working label Apr 25, 2026

marcusrbrown merged commit fbcb62a into main Apr 25, 2026
10 checks passed

marcusrbrown deleted the feat/gateway-v1-unit-3 branch April 25, 2026 21:14

fro-bot Bot mentioned this pull request Apr 25, 2026

chore(release): pending release v0.42.0 #546

Merged

fro-bot mentioned this pull request Apr 26, 2026

Daily Maintenance Report #252

Open

47 tasks

marcusrbrown mentioned this pull request May 17, 2026

chore: reconcile plan statuses + fix release workflow conflict fallback #634

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(coordination): action acquires per-repo lock for cross-surface mutual exclusion#548

feat(coordination): action acquires per-repo lock for cross-surface mutual exclusion#548
marcusrbrown merged 2 commits into
mainfrom
feat/gateway-v1-unit-3

marcusrbrown commented Apr 25, 2026

Uh oh!

fro-bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

marcusrbrown commented Apr 25, 2026

What this does

Notes

Test plan

Uh oh!

fro-bot left a comment

Choose a reason for hiding this comment

Verdict: PASS

Blocking issues

Non-blocking concerns

Missing tests

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants