[Optimus] fix: eliminate TOCTOU race in Meta-Cron concurrency guard (fixes #511) by cloga · Pull Request #513 · cloga/optimus-code

cloga · 2026-03-20T14:16:27Z

Summary

Diagnoses and fixes the recurring hourly patrol overlap bug (Issue #511) where concurrency_policy: "Forbid" was violated, allowing Run #213 and Run #214 to execute concurrently.

Root Cause (3 compounding issues)

TOCTOU race: tick() used isLocked() (read-only) for the concurrency check, then fire() called createLock() separately. Two ticks within the same minute could both pass isLocked() → false before either created the lock.
Wrong PID in lock: Lock files recorded process.pid (the MCP server), not the spawned child worker PID. When the MCP server restarted, isPidRunning() saw the old server PID as dead and treated the lock as stale — even though the child worker was still running.
Staleness threshold == cron period: The 1-hour time-based staleness threshold exactly matched the hourly cron period (0 * * * *), creating an edge case where the lock was declared stale at the exact moment the next tick fired.

Changes

Atomic guard in tick(): Replaced the isLocked() check + separate createLock() pattern with a single createLock() call in tick() as the atomic concurrency gate (uses O_CREAT|O_EXCL / 'wx' flag).
updateLockPid(): New function that rewrites the lock file with the child process PID after spawn(), so isPidRunning() checks the actual worker.
2h staleness threshold: Increased from 1 hour to 2 hours to prevent edge-case overlap with the hourly cron period. Aligns with the existing 2-hour safety timer.
Lock cleanup on early exit: Added deleteLock() calls for tick-dedup rejection and dry-run code paths.
Exported lock functions: For test access.

Test Results

Build: npm run build — exit 0, zero errors
Unit tests: 9 new vitest tests covering lock lifecycle, stale lock recovery, PID update, and time-based threshold — all pass
Files changed: src/mcp/meta-cron-engine.ts, src/test/meta-cron-locks.test.ts, optimus-plugin/dist/mcp-server.js (rebuilt)

Closes #511

🤖 Created by senior-full-stack-builder via Optimus Spartan Swarm

The hourly patrol overlap bug occurred because: 1. tick() used isLocked() (read-only check) then fire() called createLock() separately — a TOCTOU gap allowing two ticks to both see unlocked state 2. Lock files recorded the MCP server PID, not the child worker PID, causing isPidRunning() to check the wrong process after server restarts 3. The 1-hour time-based staleness threshold exactly matched the hourly cron period, creating an edge case at the tick boundary Changes: - Move atomic createLock() into tick() as the single concurrency guard - Add updateLockPid() to write child process PID after spawn - Increase time-based staleness threshold from 1h to 2h (matches safety timer) - Release lock on tick-dedup rejection and dry-run paths - Add 9 vitest unit tests for lock lifecycle Co-Authored-By: Claude <noreply@anthropic.com>

…follow-up) The overlap recurrence at 15:00Z (Runs #215/#216) was caused by the old MCP server process still running pre-fix code after PR #513 merged. This is not a code bug — the fix requires a server restart to take effect. However, this commit also hardens the leader election against a latent TOCTOU race where two processes simultaneously detect a stale leader, both unlink+create, and one deletes the other's freshly created lock: - Add unique nonce to scheduler leader lock data - After stale-leader reclaim (unlink+wx create), re-read the lock and verify the nonce matches before confirming leadership - If nonce mismatches, another process won the race — back off Also cleaned up the dead scheduler-leader.lock (PID 362356). Co-Authored-By: Claude <noreply@anthropic.com>

…follow-up) The overlap recurrence at 15:00Z (Runs #215/#216) was caused by the old MCP server process still running pre-fix code after PR #513 merged. This is not a code bug — the fix requires a server restart to take effect. However, this commit also hardens the leader election against a latent TOCTOU race where two processes simultaneously detect a stale leader, both unlink+create, and one deletes the other's freshly created lock: - Add unique nonce to scheduler leader lock data - After stale-leader reclaim (unlink+wx create), re-read the lock and verify the nonce matches before confirming leadership - If nonce mismatches, another process won the race — back off Also cleaned up the dead scheduler-leader.lock (PID 362356). Co-authored-by: Long Chen (from Dev Box) <lochen@microsoft.com> Co-authored-by: Claude <noreply@anthropic.com>

cloga added the optimus-bot label Mar 20, 2026

cloga merged commit 7d87459 into master Mar 20, 2026

cloga deleted the fix/511-patrol-overlap-guard branch March 20, 2026 14:16

cloga added the agent-merged label Mar 20, 2026

cloga mentioned this pull request Mar 20, 2026

[Task] senior-full-stack-builder: Re-attempt Issue #511 with session continuity. The previous fix was delivered an... #514

Closed

cloga mentioned this pull request Mar 20, 2026

[Optimus] fix: add nonce verification to scheduler leader election (Issue #511 follow-up) #515

Merged

cloga mentioned this pull request Mar 22, 2026

[Optimus] fix: defend cron lock against server-crash stale-PID bypass (Issue #511) #521

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Optimus] fix: eliminate TOCTOU race in Meta-Cron concurrency guard (fixes #511)#513

[Optimus] fix: eliminate TOCTOU race in Meta-Cron concurrency guard (fixes #511)#513
cloga merged 1 commit intomasterfrom
fix/511-patrol-overlap-guard

cloga commented Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cloga commented Mar 20, 2026

Summary

Root Cause (3 compounding issues)

Changes

Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant