
[backend] LockBackend — abstract filesystem locks for multi-process / distributed deployments #60

@dep0we

Description


Background

atomic_agents/_locks.py uses fcntl filesystem locks (AgentLock writes a .lock file at the agent root and acquires it via fcntl.flock). Every agent call holds the lock for the duration of the run.

This works perfectly on a single box. It breaks the moment you try:

  • Multiple processes on different hosts (NFS doesn't reliably honor fcntl)
  • Containerized deployments where the lock dir is shared but the kernel isn't
  • Cloud Run / Lambda / serverless where filesystems are ephemeral
  • Redis-backed scale-out where locks should be Redis advisory locks

This is the most urgent of the protocol-pattern abstractions — every other primitive has a single-box workaround, but locks are the cliff for multi-process deployments. Quote from internal scaling review (2026-05-08): "the one that actually breaks first if anyone tries to run atomic-agents on more than one box."

Why it matters

Tier 1 of the framework is single-tenant single-box. Tier 2 is multi-process or multi-host. Without a LockBackend protocol, Tier 2 is structurally impossible without forking the framework or replacing every lock site individually.

Concrete users blocked: Meridian wants to run atomic-agents-driven workflows on Cloud Run. Bishop's gizmo deployment wants to run multiple agents in parallel without race conditions on shared memory. Any future SaaS deployment.

What to change

Mirror the MemoryBackend pattern (#57):

  1. New module atomic_agents/locks/ with backend.py (Protocol) and filesystem.py (default FilesystemLockBackend wrapping current fcntl logic).
  2. LockBackend protocol exposes: acquire(name, timeout), release(handle), is_held(name), capability advertisement (single-host vs distributed).
  3. Replace direct AgentLock instantiation in agent.py, dream.py, and any other lock site with agent.lock_backend.acquire(...).
  4. Backend registry: register_backend("filesystem", FilesystemLockBackend). Future RedisLockBackend and PostgresAdvisoryLockBackend plug in identically.
  5. Spec doc docs/spec/21-lock-backend.md describing the protocol + acceptable backends.

Acceptance

  • All existing dream/agent tests pass with FilesystemLockBackend as default.
  • Protocol conformance test suite (~15 tests): acquire returns a handle; concurrent acquires of the same name block all but one; release releases; acquire with timeout=0 raises if held; is_held reflects state; etc. Reusable for any future backend.
  • A Redis-shaped mock backend implements the protocol correctly to prove distributed-shaped locks fit the contract.
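The conformance checks above could be written as a single backend-agnostic function, so the same suite runs against `FilesystemLockBackend` today and any Redis-shaped backend later. This is a sketch; the specific exception raised on timeout is undecided (a real suite would pin down a dedicated error type in the protocol), so it catches broadly.

```python
# Sketch of a reusable conformance check, runnable against any backend
# that implements the protocol. Exception type on timeout is TBD.


def check_lock_conformance(backend) -> None:
    # acquire returns a handle, and is_held reflects the held state
    handle = backend.acquire("agent-a", timeout=1.0)
    assert backend.is_held("agent-a")

    # acquire with timeout=0 must raise while the lock is held
    raised = False
    try:
        backend.acquire("agent-a", timeout=0)
    except Exception:  # should become a specific LockTimeout-style error
        raised = True
    assert raised

    # release releases: the name is free and re-acquirable
    backend.release(handle)
    assert not backend.is_held("agent-a")
    backend.release(backend.acquire("agent-a", timeout=0))
```

Running this against an in-memory Redis-shaped mock is exactly the third acceptance bullet: if the mock passes, distributed-shaped locks fit the contract.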

Open questions for design

  • Lock granularity: agent-level (current) vs note-level vs run-level. Memory backend's optimistic concurrency (expected_content_sha256) reduces some lock pressure — does that change the granularity story?
  • Reentrancy: current fcntl lock is per-process; Redis locks would need explicit reentrancy. Protocol contract?
  • Lease + heartbeat: Redis advisory locks need TTL + renewal. How does that surface in the protocol without leaking Redis-isms?
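One possible answer to the lease question, purely as a strawman: make renewal an optional protocol method that lease-less backends implement as a no-op, so callers heartbeat uniformly and never see Redis-isms. All names here (`renew`, `lease_deadline`, `ttl`) are illustrative, not decided.

```python
# Strawman for the lease/heartbeat question: renewal as a uniform
# protocol method. Filesystem locks never expire, so renew() is a no-op
# there; a Redis backend would push the deadline forward (PEXPIRE).
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class LeasedHandle:
    name: str
    # None means "held until released" (filesystem semantics);
    # a float is a monotonic-clock expiry deadline (lease semantics).
    lease_deadline: float | None = None


class FilesystemLeaseMixin:
    """Lease-less backend: renew() is a harmless no-op."""

    def renew(self, handle: LeasedHandle) -> None:
        pass  # fcntl locks are held until released; nothing to extend


class RedisShapedLeaseMixin:
    """Lease-ful backend: renew() extends the TTL."""

    ttl = 30.0  # seconds; a real backend would PEXPIRE the Redis key

    def renew(self, handle: LeasedHandle) -> None:
        handle.lease_deadline = time.monotonic() + self.ttl
```

Under this shape the agent run loop calls `renew()` on a timer regardless of backend, and only lease-ful backends do real work.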


    Labels

    backend: Protocol-pattern backend abstractions (memory, logs, locks, etc.)
    enhancement: New feature or request
