Skip to content

[Optimus] fix: eliminate TOCTOU race in Meta-Cron concurrency guard (fixes #511)#513

Merged
cloga merged 1 commit intomasterfrom
fix/511-patrol-overlap-guard
Mar 20, 2026
Merged

[Optimus] fix: eliminate TOCTOU race in Meta-Cron concurrency guard (fixes #511)#513
cloga merged 1 commit intomasterfrom
fix/511-patrol-overlap-guard

Conversation

@cloga
Copy link
Copy Markdown
Owner

@cloga cloga commented Mar 20, 2026

Summary

Diagnoses and fixes the recurring hourly patrol overlap bug (Issue #511) where concurrency_policy: "Forbid" was violated, allowing Run #213 and Run #214 to execute concurrently.

Root Cause (3 compounding issues)

  1. TOCTOU race: tick() used isLocked() (read-only) for the concurrency check, then fire() called createLock() separately. Two ticks within the same minute could both pass isLocked() → false before either created the lock.

  2. Wrong PID in lock: Lock files recorded process.pid (the MCP server), not the spawned child worker PID. When the MCP server restarted, isPidRunning() saw the old server PID as dead and treated the lock as stale — even though the child worker was still running.

  3. Staleness threshold == cron period: The 1-hour time-based staleness threshold exactly matched the hourly cron period (0 * * * *), creating an edge case where the lock was declared stale at the exact moment the next tick fired.

Changes

  • Atomic guard in tick(): Replaced the isLocked() check + separate createLock() pattern with a single createLock() call in tick() as the atomic concurrency gate (uses O_CREAT|O_EXCL / 'wx' flag).
  • updateLockPid(): New function that rewrites the lock file with the child process PID after spawn(), so isPidRunning() checks the actual worker.
  • 2h staleness threshold: Increased from 1 hour to 2 hours to prevent edge-case overlap with the hourly cron period. Aligns with the existing 2-hour safety timer.
  • Lock cleanup on early exit: Added deleteLock() calls for tick-dedup rejection and dry-run code paths.
  • Exported lock functions: For test access.

Test Results

  • Build: npm run build — exit 0, zero errors
  • Unit tests: 9 new vitest tests covering lock lifecycle, stale lock recovery, PID update, and time-based threshold — all pass
  • Files changed: src/mcp/meta-cron-engine.ts, src/test/meta-cron-locks.test.ts, optimus-plugin/dist/mcp-server.js (rebuilt)

Closes #511


🤖 Created by senior-full-stack-builder via Optimus Spartan Swarm

The hourly patrol overlap bug occurred because:
1. tick() used isLocked() (read-only check) then fire() called createLock()
   separately — a TOCTOU gap allowing two ticks to both see unlocked state
2. Lock files recorded the MCP server PID, not the child worker PID, causing
   isPidRunning() to check the wrong process after server restarts
3. The 1-hour time-based staleness threshold exactly matched the hourly cron
   period, creating an edge case at the tick boundary

Changes:
- Move atomic createLock() into tick() as the single concurrency guard
- Add updateLockPid() to write child process PID after spawn
- Increase time-based staleness threshold from 1h to 2h (matches safety timer)
- Release lock on tick-dedup rejection and dry-run paths
- Add 9 vitest unit tests for lock lifecycle

Co-Authored-By: Claude <noreply@anthropic.com>
@cloga cloga merged commit 7d87459 into master Mar 20, 2026
@cloga cloga deleted the fix/511-patrol-overlap-guard branch March 20, 2026 14:16
cloga pushed a commit that referenced this pull request Mar 20, 2026
…follow-up)

The overlap recurrence at 15:00Z (Runs #215/#216) was caused by the old
MCP server process still running pre-fix code after PR #513 merged. This
is not a code bug — the fix requires a server restart to take effect.

However, this commit also hardens the leader election against a latent
TOCTOU race where two processes simultaneously detect a stale leader,
both unlink+create, and one deletes the other's freshly created lock:

- Add unique nonce to scheduler leader lock data
- After stale-leader reclaim (unlink+wx create), re-read the lock and
  verify the nonce matches before confirming leadership
- If nonce mismatches, another process won the race — back off

Also cleaned up the dead scheduler-leader.lock (PID 362356).

Co-Authored-By: Claude <noreply@anthropic.com>
cloga added a commit that referenced this pull request Mar 20, 2026
…follow-up)

The overlap recurrence at 15:00Z (Runs #215/#216) was caused by the old
MCP server process still running pre-fix code after PR #513 merged. This
is not a code bug — the fix requires a server restart to take effect.

However, this commit also hardens the leader election against a latent
TOCTOU race where two processes simultaneously detect a stale leader,
both unlink+create, and one deletes the other's freshly created lock:

- Add unique nonce to scheduler leader lock data
- After stale-leader reclaim (unlink+wx create), re-read the lock and
  verify the nonce matches before confirming leadership
- If nonce mismatches, another process won the race — back off

Also cleaned up the dead scheduler-leader.lock (PID 362356).

Co-authored-by: Long Chen (from Dev Box) <lochen@microsoft.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Optimus] Bug: hourly patrol concurrency guard still permits overlapping runs

1 participant