feat: managed-SQLite storage layer (Local + Azure Blob backends)#112
Merged
Conversation
Design for a storage abstraction over a managed SQLite file with pluggable LocalBackend / AzureBlobBackend, a thin DatabaseStore orchestrator (for_write / ensure_fresh, no SQL surface), wiring of SqliteCorpus + SqliteVecVectorStore via shared batch sessions and generation-based read-connection refresh, and corpus_search E2E parameterisation over backend kind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD task list covering: storage types/errors, StorageBackend ABC, LocalBackend, DatabaseStore orchestrator, AzureBlobBackend + [storage-azure] extra, wiring of SqliteCorpus + SqliteVecVectorStore, shared-store ingest batch, corpus_search backend env toggle, and Azurite-parameterised E2E tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements Task 4: DatabaseStore lock+sync+upload state machine over StorageBackend, with RetryPolicy-driven upload retries, crash-recovery via dirty sidecar flag, and an InMemoryBackend fake for unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add AzureBlobBackend backed by Azure Blob Storage, a new [storage-azure] optional-dependency extra, nightly Azurite integration tests, and a lazy __getattr__ re-export in the storage package so the SDK is only imported on demand. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…se_lock BlobClient.get_blob_client_lease does not exist in the Azure SDK. Real leases were never released, so subsequent acquire_lock calls within the lease TTL would see 409 lease-busy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire SqliteVecVectorStore to the DatabaseStore backend: constructor now accepts a path (auto-wrapped, back-compat) or a pre-built DatabaseStore; public upsert/delete overrides forward the session= kwarg to _upsert/_delete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CorpusAgent now builds a single DatabaseStore shared by SqliteCorpus and SqliteVecVectorStore; the ingest pipeline detects identity (corpus._store is vector_store._store) and wraps both upserts in one for_write batch, halving backend uploads on the typical corpus_search path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ckends Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AzureBlobBackend records renew-loop failures; check before each operation so they surface as StorageLeaseError instead of being silently ignored. Spec contract; matches no-silent-errors guardrail. - Pyright fix: assert target is not None after the runtime TypeError guard so the Path(target) call type-checks. - LockToken.expires_at now reflects the lease duration instead of 'already expired now'. - New regression test ensures lease errors propagate through DatabaseStore.for_write. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Session-scoped azurite_connection_string fixture in tests/conftest.py: uses AZURITE_CONNECTION_STRING when set, otherwise auto-starts a Docker Azurite container for the session, otherwise skips. Local devs no longer need a manual docker run step. - Parametrised db_store fixtures lazily request the fixture only for the "azurite" param via getfixturevalue, so "local" runs on machines without Docker. - Removed module-level pytest.skip in test_azure_backend_azurite since the fixture handles availability detection. - Nightly workflow declares Azurite as a services: container and exports AZURITE_CONNECTION_STRING so the fixture short-circuits Docker startup. Adds the storage-azure extra to uv sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckend - pipeline.py: vector_store is typed against VectorStoreProtocol which doesn't expose ._store or session=. Add targeted type: ignore on the dynamic-only attrs guarded at runtime by hasattr(...). - storage/__init__.py: import AzureBlobBackend under TYPE_CHECKING so pyright sees the symbol, while __getattr__ keeps the runtime lazy to avoid forcing [storage-azure] on every consumer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-gate doesn't install [storage-azure], so importing azure.storage.blob inside the [azurite] param raised ModuleNotFoundError instead of being cleanly skipped. Guard with pytest.importorskip in the db_store fixtures and in the azurite_connection_string conftest fixture.
- corpus.py / sqlite_vec_store.py: open a fresh sqlite3 connection per read query instead of caching across coroutines. The cached-conn approach would close a connection in use by another reader on a generation bump (race -> sqlite3.InterfaceError). WAL + cheap connect cost makes per-call connections fine. Adds PRAGMA busy_timeout=30s so concurrent writers wait on each other instead of failing fast. - azure_backend.py: track _active_lease_id in acquire/release_lock and forward it as lease= on every subsequent upload so Azure accepts writes on the leased blob. Map MatchConditions correctly: IfNotModified -> If-Match, IfMissing -> If-None-Match: *. Always re-stat after upload to canonicalise the etag (response and get_blob_properties may format it differently). Add --skipApiVersionCheck on Docker-launched Azurite to handle SDK/image version drift; same flag in CI services container. - test_benchmark_smoke: corpus doc count went from 12 to 14 after the Acme Corp datasets landed on main; update the assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-review fixes: - backend.py: replace ... abstract method bodies with explicit raise NotImplementedError (CodeQL: 'statement has no effect'). - local_backend.py: sentinel mode 0o644 -> 0o600 (CodeQL: 'overly permissive file permissions'). - storage/__init__.py: eager-import AzureBlobBackend when storage-azure is installed; on ImportError stash and re-raise from __getattr__. Static analysers see a real symbol; runtime callers without the extra get a clear install-suggesting error. - test_database_store.py: replace pytest.raises block with try/except so the post-block assertion is reachable to static analysis. Concurrency regression fix: - corpus.py / sqlite_vec_store.py: revert to a long-lived cached sqlite3 connection serialised by _lock for both reads and writes, reopened on cache-file generation bump. Per-call open/close caused intermittent 'database is locked' under concurrent ingest. Long- lived conn matches pre-refactor behaviour and is reliably green. - timeout=30 on sqlite3.connect (Python module's busy timeout, which overrides the PRAGMA) so generation-bump conn replacement waits instead of failing fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
services.options accepts docker create flags only — --entrypoint takes a single executable, not a command string. Run Azurite via docker run in a workflow step so we can pass --skipApiVersionCheck to handle SDK/image version drift, with a curl-based readiness wait.
- examples/corpus_search/cli.py: replace duplicated literals with module-level constants for the CORPUS_SEARCH_BACKEND env var name, the 'local' / 'azure' kind values, the Azure URL/blob env var names, and the default 'corpus.sqlite' blob name. - tests/conftest.py: expose AZURITE_CONNECTION_STRING_ENV, DB_STORE_LOCAL, DB_STORE_AZURITE, and DB_STORE_BACKENDS as module-level constants. The two parametrised db_store fixtures (corpus_search E2E and the real-vectorstore integration test) now import these instead of repeating the labels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
miguelgfierro
approved these changes
May 6, 2026
Contributor
miguelgfierro
left a comment
There was a problem hiding this comment.
No me funciona el review the github. Lo único que comentaba era meter la connection string en key vault
2 tasks
ancongui
pushed a commit
that referenced
this pull request
May 31, 2026
feat: managed-SQLite storage layer (Local + Azure Blob backends)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
DatabaseStoreorchestrator over a pluggableStorageBackendABC (LocalBackend,AzureBlobBackend) for managing the read/write lifecycle of a single SQLite file. Optimises for sole/dominant-writer batches: skip-download on etag match, single upload per batch, persistent local cache + sidecar with crash-recovery via adirtyflag.SqliteCorpusandSqliteVecVectorStorethrough the new layer (constructor accepts a path or aDatabaseStore; existing call sites work unchanged). Read connections refresh via a generation counter when the cache file is replaced.for_writeblock so a SharePoint-style batch produces one upload, not two.corpus_searchexample gains aCORPUS_SEARCH_BACKEND=local|azureenv toggle; existing E2E tests now run parameterised over local (default) and Azurite (whenAZURITE_CONNECTION_STRINGis set).[storage-azure]extra (azure-storage-blob,azure-identity).Spec:
docs/superpowers/specs/2026-05-06-db-storage-backend-design.mdPlan:
docs/superpowers/plans/2026-05-06-db-storage-backend.mdTest plan
uv run pytest -m "not nightly" -q— 1269 passed, 5 skipped, 0 faileduv run pytest tests/unit/storage/ -q— allLocalBackend+DatabaseStoreunit tests green (lifecycle, retry exhaustion, crash recovery, conditional uploads, generation bumping, lease-error propagation)uv run pytest tests/unit/rag/ tests/unit/vectorstores/ tests/unit/rag/ingest/ -q— back-compat path-based constructors still pass; newDatabaseStore-backed constructor smoke tests pass; coordinated-batch single-upload regression test pinuv run pyright src/fireflyframework_agentic/storage src/fireflyframework_agentic/rag/corpus.py src/fireflyframework_agentic/vectorstores/sqlite_vec_store.py— cleanAZURITE_CONNECTION_STRING. Skipped on the PR gate; runnable locally viadocker run -p 10000:10000 mcr.microsoft.com/azure-storage/azuritethenAZURITE_CONNECTION_STRING=... uv run pytest tests/integration/storage/ -m nightlyCORPUS_SEARCH_BACKEND=local uv run python -m examples.corpus_search ingest --folder ... --root ./kg ...CORPUS_SEARCH_AZURE_CONTAINER_URL+DefaultAzureCredential, the same command runs against Azure Blob Storage with no other code changes🤖 Generated with Claude Code