Skip to content

feat(coordination): action acquires per-repo lock for cross-surface mutual exclusion#548

Merged
marcusrbrown merged 2 commits into
mainfrom
feat/gateway-v1-unit-3
Apr 25, 2026
Merged

feat(coordination): action acquires per-repo lock for cross-surface mutual exclusion#548
marcusrbrown merged 2 commits into
mainfrom
feat/gateway-v1-unit-3

Conversation

@marcusrbrown
Copy link
Copy Markdown
Collaborator

The GitHub Action and the Discord gateway will share the same repos. Without
coordination, two surfaces can run against the same repo concurrently, race on
S3 writes, and produce conflicting branches/comments. This wires the Action
into the lock primitives shipped in #547 so that surface mutual exclusion is
enforced at the entry point.

What this does

A new acquire-lock phase runs after dedup. Its outcome drives behavior:

Outcome Behavior
acquired Run proceeds; cleanup releases the lock after S3 sync and cache save.
held-by-other Run skips cleanly (exit 0). Holder ID and surface are logged for operator visibility.
s3-disabled Object store not configured — coordination is opt-in; run proceeds without a lock.
error Log and proceed to preserve single-surface behavior. Stale lock recovers via takeover.

The holder ID is action:{run_id}:{run_attempt} so the Discord-side
/fro-bot force-release-lock operator command can identify exactly which
Action holds the lock when manual recovery is needed.

Notes

  • Lock release lives in cleanup's finally so it always runs, even on cleanup
    failure. Release errors are non-fatal — the 15-min TTL is the safety net.
  • The 15-min TTL with no heartbeat covers the ~2-min median Action run; rare
    long runs recover via stale-takeover on the next acquisition attempt.
  • The Action does not run validateProviderSemantics — the gateway, as the
    long-lived process, owns provider validation at startup.
  • The lock alone provides cross-surface mutual exclusion. No RunState
    record is created; GitHub already tracks workflow run state.

Test plan

  • 8 new test scenarios covering all four outcomes plus stale-takeover race,
    same-surface contention, S3 unavailable, and the defensive no-ETag case.
  • Existing cleanup tests updated to pass lockEtag: null (the new option).
  • Full suite: 1290 tests pass, 0 lint errors, types clean, build clean.

Plan was written before the Unit 1 extraction settled on its final shape.
Action source still lives at `src/` — only main.ts and post.ts moved to
`apps/action/src/` as thin re-export shims.

Capture the 5 decisions that scope Unit 3 implementation:
- self-test: gateway-only (skip on Action invocations)
- heartbeat: not in v1 Action (15-min TTL with stale takeover)
- run-state: lock-only (no createRun/transitionRun in Action)
- S3-disabled: skip lock cleanly, preserve single-surface compat
- event-type scope: lock all events (PR, issue, schedule, dispatch)
…utual exclusion

Add an acquire-lock phase that runs after dedup so the GitHub Action and the
Discord gateway never execute concurrently against the same repo. Lock release
runs in cleanup's finally block so the next surface waits at most for the run's
critical section, not the 15-min TTL.

- New phase `runAcquireLock` returns a discriminated union: acquired (with
  lockEtag), held-by-other (skip cleanly with holder details), s3-disabled (no
  coordination configured, proceed), or error (log and proceed to preserve
  single-surface behavior).
- Cleanup's CleanupPhaseOptions takes `lockEtag` and releases via runtime's
  `releaseLock` after S3 sync and cache save. Release errors are non-fatal.
- holder_id encodes `action:{run_id}:{run_attempt}` so Discord-side log
  inspection and operator recovery (`/fro-bot force-release-lock`) can
  identify the holder unambiguously.

Decisions captured in plan Unit 3 (2026-04-25):
- Self-test (`validateProviderSemantics`) skipped on Action invocations —
  the gateway is the long-lived process and validates provider semantics at
  startup.
- No heartbeat in v1 — the 15-min TTL covers the median ~2-min Action run;
  rare long runs recover through stale takeover.
- No RunState record — the lock alone provides cross-surface mutual exclusion;
  GitHub already tracks workflow run state.
- S3-disabled is graceful: coordination is opt-in.
- All event types are locked (PR, issue, schedule, dispatch).
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

Clean implementation of per-repo coordination locking. The discriminated union on AcquireLockResult guarantees exhaustive outcome handling, the finally-block release in cleanup ensures the lock is always freed, and all error paths are correctly non-fatal. Phase positioning (after dedup, before acknowledge) is the right call — no lock window wasted on deduplicated runs.

Blocking issues

None

Non-blocking concerns

  1. Cleanup lock release is untested. All three cleanup tests pass lockEtag: null, so the finally block (lines 202-236 of cleanup.ts) with releaseLock, its success/failure branches, and its catch handler are never exercised. A test with a non-null lockEtag that asserts releaseLock is called (and one where it rejects to confirm non-fatal handling) would close the gap. Low urgency since the logic is straightforward and release failures are tolerated by TTL, but worth adding.

  2. Minor heartbeat interval inconsistency. acquire-lock.ts passes heartbeatIntervalMs: 0 (intentional — no heartbeat in v1), while cleanup.ts passes DEFAULT_HEARTBEAT_INTERVAL_MS to releaseLock. The value is likely irrelevant for release, but using 0 in both places would keep the intent consistent and avoid confusion for future readers.

Missing tests

  • Cleanup path when lockEtag is non-null (release called, release succeeds)
  • Cleanup path when releaseLock returns {success: false} (warning logged, run still succeeds)
  • Cleanup path when releaseLock throws (caught, warning logged)

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

  • Regression: Low. The lock is opt-in (S3 must be enabled), and failure modes are non-fatal. Existing single-surface deployments without S3 are unaffected — s3-disabled short-circuits cleanly.
  • Security: None. Lock metadata (holderId, surface, runId) is public GitHub Actions data. Conditional S3 writes via ETags prevent lock-stomping.
  • Blast radius: Confined to the new phase and the cleanup finally block. No changes to existing phase interfaces beyond the additive lockEtag property on CleanupPhaseOptions.

Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 24940169557
Cache hit
Session ses_2399fe6d4ffeDfYAcDJ3QZpDMc

@marcusrbrown marcusrbrown merged commit fbcb62a into main Apr 25, 2026
10 checks passed
@marcusrbrown marcusrbrown deleted the feat/gateway-v1-unit-3 branch April 25, 2026 21:14
@fro-bot fro-bot mentioned this pull request Apr 26, 2026
47 tasks
marcusrbrown added a commit that referenced this pull request May 17, 2026
…ck (#634)

* docs(plans): reconcile statuses against shipped reality

Four plans previously marked 'active' have shipped:
- agent-cohesion-session-continuity (deterministic session titles + buildLogicalKey)
- compounding-wiki (vault, schedule, seed pages via PRs #489 #491 #494)
- manual-delivery-mode (output-mode input via PR #517)
- gateway-discord-v1 Units 1-3 (PRs #541 #547 #548)

Gateway v1 plan stays active with Units 4-8 unshipped.

* ci(release): fall back to main's tree on unresolvable merge conflicts

The reset-and-merge step uses 'git merge --no-ff -Xtheirs origin/main' to
synthesize next from the last release tag. -Xtheirs handles content
conflicts but cannot resolve rename/rename conflicts, which fire every
time main's bundle artifact hash changes (dist/artifact-*.js).

On conflict, take main's tree verbatim via 'git checkout origin/main -- .'
and commit it as the merge result. The release branch's purpose is to
mirror main; biasing fully to main on conflicts preserves that intent
without manual intervention.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants