
fix: resolve LockException under concurrent multi-process execution #4804

Merged
greysonlalonde merged 11 commits into main from gl/fix/concurrent-storage-locking
Mar 11, 2026

@greysonlalonde greysonlalonde commented Mar 11, 2026

Summary

  • Increase the portalocker.Lock timeout in the ChromaDB factory from the default 5s to 120s; the short timeout was the direct cause of LockException(BlockingIOError(11, 'Resource temporarily unavailable')) under concurrent Airflow workers
  • Enable PRAGMA journal_mode=WAL on both SQLite databases (kickoff task outputs + flow persistence) for concurrent read/write support
  • Add a cross-process file lock (portalocker) around all LanceDB write paths to prevent contention across Airflow worker processes
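
To illustrate why a longer lock timeout matters, here is a minimal sketch of a file lock that polls until a deadline. It is not the code from this PR: it uses the standard library's `fcntl.flock` as a stand-in for portalocker (Unix only), and the names `file_lock` and `poll_interval` are hypothetical. A non-blocking attempt against a held lock raises `BlockingIOError`, the same error wrapped in the reported `LockException`; a short timeout turns that into a failure, while a longer one lets the waiter outlast a slow holder.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(path, timeout=120.0, poll_interval=0.1):
    """Hold an exclusive advisory lock on `path`, waiting up to `timeout` seconds.

    Hypothetical helper for illustration; stands in for a portalocker-based lock.
    """
    deadline = time.monotonic() + timeout
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        while True:
            try:
                # Non-blocking attempt: raises BlockingIOError if the lock is held elsewhere.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {path} within {timeout}s")
                time.sleep(poll_interval)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap the contended section in `with file_lock("/path/to/storage.lock"):`. The path is a placeholder, and every process has to agree on the same lock file for the lock to have any effect.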

Test plan

  • Run concurrent crew kickoffs from multiple Airflow workers against the same storage directory
  • Verify no LockException errors in logs
  • Verify memory read/write correctness under concurrent load
  • Confirm existing unit tests pass

Note

Medium Risk
Touches multiple persistence backends (SQLite, LanceDB, ChromaDB) to change locking and journal settings; mistakes could cause deadlocks or data loss/corruption under load.

Overview
Improves robustness under concurrent multi-process execution by standardizing cross-process locking and increasing timeouts.

SQLite persistence (flow state + kickoff task outputs) now connects with a longer timeout and enables PRAGMA journal_mode=WAL to better support concurrent readers/writers.
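
As a rough sketch of what that connection setup can look like with Python's built-in `sqlite3` module (the helper name `connect_wal` and the 30-second value are assumptions, not taken from this PR):

```python
import sqlite3


def connect_wal(db_path, timeout=30.0):
    """Open a SQLite connection with a longer busy timeout and WAL journaling.

    Hypothetical helper; the timeout value is an assumption for illustration.
    """
    # `timeout` is how long a connection waits on a locked database
    # before raising sqlite3.OperationalError (sqlite3's default is 5 seconds).
    conn = sqlite3.connect(db_path, timeout=timeout)
    # The PRAGMA returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    return conn
```

In WAL mode readers do not block the writer and the writer does not block readers, but there is still only one writer at a time, so the busy timeout remains relevant for concurrent writes. WAL is stored in the database file, so later connections see it without re-issuing the PRAGMA.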

LanceDB writes are now serialized across processes using a shared lock_store lock around table creation/indexing, compaction/optimize, and all write operations, while keeping commit-conflict retries. ChromaDB client creation switches from ad-hoc lockfiles to the new centralized lock_store (with a longer default lock wait and optional Redis-backed distributed locks).
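
The combination of a cross-process lock and commit-conflict retries might be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `CommitConflict` and `locked_write` are hypothetical names, and `fcntl.flock` (Unix only) stands in for the shared lock_store lock.

```python
import fcntl
import time


class CommitConflict(Exception):
    """Hypothetical stand-in for a storage engine's commit-conflict error."""


def locked_write(lock_path, write_fn, max_retries=3, backoff=0.05):
    """Run `write_fn` under an exclusive file lock, retrying on CommitConflict."""
    with open(lock_path, "a") as lock_file:
        # Blocks until no other process holds the lock on this file.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            for attempt in range(max_retries + 1):
                try:
                    return write_fn()
                except CommitConflict:
                    if attempt == max_retries:
                        raise
                    # Exponential backoff before the next attempt.
                    time.sleep(backoff * (2 ** attempt))
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each write path (table creation, optimize, inserts) would pass its operation as `write_fn`. Note that this sketch keeps the lock held during backoff, which is simple but delays other writers; releasing between attempts is an alternative design.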

Written by Cursor Bugbot for commit 8794776. This will update automatically on new commits.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@greysonlalonde greysonlalonde merged commit 534f070 into main Mar 11, 2026
45 checks passed
@greysonlalonde greysonlalonde deleted the gl/fix/concurrent-storage-locking branch March 11, 2026 15:15
3 participants