Background
`atomic_agents/_locks.py` uses `fcntl` filesystem locks (`AgentLock` writes a `.lock` file at the agent root and acquires it via `fcntl.flock`). Every agent call holds the lock for the duration of the run.
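For reference, a minimal sketch of that pattern. Only `AgentLock`, the `.lock` file at the agent root, and `fcntl.flock` come from the code; the constructor and method names are assumptions:

```python
import fcntl
import os


class AgentLock:
    """Sketch of the current lock: an exclusive flock on a .lock file
    at the agent root, held for the duration of the run."""

    def __init__(self, agent_root: str):
        self._path = os.path.join(agent_root, ".lock")
        self._fd: int | None = None

    def acquire(self) -> None:
        # O_CREAT: the first caller creates the lock file.
        self._fd = os.open(self._path, os.O_RDWR | os.O_CREAT)
        # LOCK_EX blocks until no other process on *this kernel* holds
        # the flock, which is exactly the guarantee NFS does not
        # reliably extend across hosts.
        fcntl.flock(self._fd, fcntl.LOCK_EX)

    def release(self) -> None:
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None
```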
This works perfectly on a single box. It breaks the moment you try:
- Multiple processes on different hosts (NFS doesn't reliably honor `fcntl`)
- Containerized deployments where the lock dir is shared but the kernel isn't
- Cloud Run / Lambda / serverless where filesystems are ephemeral
- Redis-backed scale-out where locks should be Redis advisory locks
This is the most urgent of the protocol-pattern abstractions — every other primitive has a single-box workaround, but locks are the cliff for multi-process deployments. Quote from internal scaling review (2026-05-08): "the one that actually breaks first if anyone tries to run atomic-agents on more than one box."
Why it matters
Tier 1 of the framework is single-tenant single-box. Tier 2 is multi-process or multi-host. Without a `LockBackend` protocol, Tier 2 is structurally impossible without forking the framework or replacing every lock site individually.
Concrete users blocked: Meridian wants to run atomic-agents-driven workflows on Cloud Run. Bishop's gizmo deployment wants to run multiple agents in parallel without race conditions on shared memory. Any future SaaS deployment.
What to change
Mirror the `MemoryBackend` pattern (#57):
- New module `atomic_agents/locks/` with `backend.py` (Protocol) and `filesystem.py` (default `FilesystemLockBackend` wrapping the current `fcntl` logic).
- `LockBackend` protocol exposes: `acquire(name, timeout)`, `release(handle)`, `is_held(name)`, and capability advertisement (single-host vs distributed); see the sketch after this list.
- Replace direct `AgentLock` instantiation in `agent.py`, `dream.py`, and any other lock site with `agent.lock_backend.acquire(...)`.
- Backend registry: `register_backend("filesystem", FilesystemLockBackend)`. Future `RedisLockBackend` and `PostgresAdvisoryLockBackend` plug in identically.
- Spec doc `docs/spec/21-lock-backend.md` describing the protocol + acceptable backends.
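A minimal sketch of how the protocol and registry could line up, mirroring the `MemoryBackend` shape. Everything beyond the operations named above (the `LockHandle` type, the registry internals) is an assumption, not settled API:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class LockHandle:
    """Opaque token returned by acquire() and passed back to release().
    The fields here are illustrative only."""
    name: str
    token: object


class LockBackend(Protocol):
    # Capability advertisement: lets callers refuse to run a
    # multi-host workload on a single-host backend.
    distributed: bool

    def acquire(self, name: str, timeout: float | None = None) -> LockHandle:
        """Block up to `timeout` seconds for the named lock;
        timeout=0 raises immediately if the lock is held."""
        ...

    def release(self, handle: LockHandle) -> None:
        ...

    def is_held(self, name: str) -> bool:
        ...


# Registry, mirroring the MemoryBackend pattern (#57).
_BACKENDS: dict[str, type[LockBackend]] = {}


def register_backend(name: str, backend: type[LockBackend]) -> None:
    _BACKENDS[name] = backend
```

Call sites would then go through the agent's configured backend, e.g. `handle = agent.lock_backend.acquire(name, timeout=30.0)`, instead of constructing `AgentLock` directly.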
Acceptance
- All existing dream/agent tests pass with `FilesystemLockBackend` as default.
- Protocol conformance test suite (~15 tests): `acquire` returns a handle, concurrent acquires of the same name block one, `release` releases, `acquire` with `timeout=0` raises if held, `is_held` reflects state, etc. Reusable for any future backend; sketched below.
- A Redis-shaped mock backend implements the protocol correctly to prove distributed-shaped locks fit the contract.
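A sketch of what a few of those conformance tests could look like as a pytest module. The import path, the constructor taking a lock directory, and the choice of `TimeoutError` are all assumptions:

```python
import pytest

from atomic_agents.locks.filesystem import FilesystemLockBackend  # assumed path


@pytest.fixture
def backend(tmp_path):
    # Assumed constructor: a directory to hold the .lock files. A future
    # Redis backend would join via fixture parametrization, which is what
    # makes the suite reusable across backends.
    return FilesystemLockBackend(tmp_path)


def test_acquire_returns_handle(backend):
    handle = backend.acquire("agent-a", timeout=1.0)
    assert handle is not None
    backend.release(handle)


def test_timeout_zero_raises_if_held(backend):
    handle = backend.acquire("agent-a", timeout=1.0)
    with pytest.raises(TimeoutError):  # exact exception type is an open choice
        backend.acquire("agent-a", timeout=0)
    backend.release(handle)


def test_is_held_reflects_state(backend):
    assert not backend.is_held("agent-a")
    handle = backend.acquire("agent-a", timeout=1.0)
    assert backend.is_held("agent-a")
    backend.release(handle)
    assert not backend.is_held("agent-a")
```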
Open questions for design
- Lock granularity: agent-level (current) vs note-level vs run-level. Memory backend's optimistic concurrency (`expected_content_sha256`) reduces some lock pressure; does that change the granularity story?
- Reentrancy: current `fcntl` lock is per-process; Redis locks would need explicit reentrancy. Protocol contract?
- Lease + heartbeat: Redis advisory locks need TTL + renewal. How does that surface in the protocol without leaking Redis-isms?
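One possible shape, purely a design sketch: an optional lease duration on `acquire` plus a backend-neutral `renew` heartbeat. Every name below is an assumption:

```python
from typing import Any, Protocol


class LeasedLockBackend(Protocol):
    """Hypothetical lease-aware variant of LockBackend. A filesystem
    backend can implement renew() as a no-op (the kernel holds the
    flock until release); a Redis backend maps it to extending the
    key's TTL. No Redis term appears in the API itself."""

    def acquire(self, name: str, timeout: float | None = None,
                lease_seconds: float | None = None) -> Any:
        # lease_seconds=None means "hold until release" where the
        # backend can guarantee it; lease-only backends may reject None.
        ...

    def renew(self, handle: Any, lease_seconds: float) -> None:
        # Heartbeat: extend the lease before it expires. Raises if the
        # lease already lapsed and the lock was lost.
        ...
```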
Context
- `MemoryBackend` from PR "refactor(memory): extract MemoryBackend protocol; FilesystemBackend default" (#57)