feat(corpus_rag): pluggable storage backend via CorpusBackendRegistry #197
Merged
The MCP corpus tools (list_corpora, _assert_corpus_exists, corpus_sql, …)
used to assume LocalBackend semantics: walk CORPUS_ROOT directly, build
sqlite paths from the filesystem, open sqlite by path. Open `# TODO`s
in corpus_rag.py called this out as the blocker for any non-filesystem
storage backend.
Introduce a CorpusBackendRegistry abstraction:
- Per-corpus lookup → StorageBackend for the corpus_id
- Enumeration → list of corpora with size + modified
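The two responsibilities listed above could be expressed as a typing `Protocol`. The following is only a sketch under stated assumptions: the method names `backend_for` and `list_corpora`, the `CorpusInfo` record, and the `LocalFileBackend` / `LocalRegistrySketch` classes are hypothetical and are not taken from the framework; only the `<root>/<corpus_id>/corpus.sqlite` layout comes from the PR description.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


class StorageBackend(Protocol):
    """Assumed minimal shape of a per-corpus storage backend."""

    def exists(self) -> bool: ...


@dataclass(frozen=True)
class CorpusInfo:
    """Hypothetical enumeration record: corpus id plus size and modified time."""

    corpus_id: str
    size_bytes: int
    modified: float  # POSIX timestamp


class CorpusBackendRegistry(Protocol):
    """Sketch of the registry surface: per-corpus lookup and enumeration."""

    def backend_for(self, corpus_id: str) -> StorageBackend: ...

    def list_corpora(self) -> list[CorpusInfo]: ...


@dataclass
class LocalFileBackend:
    """Toy filesystem backend used only to make the sketch runnable."""

    path: Path

    def exists(self) -> bool:
        return self.path.is_file()


class LocalRegistrySketch:
    """Toy filesystem registry over <root>/<corpus_id>/corpus.sqlite."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def backend_for(self, corpus_id: str) -> LocalFileBackend:
        return LocalFileBackend(self.root / corpus_id / "corpus.sqlite")

    def list_corpora(self) -> list[CorpusInfo]:
        infos: list[CorpusInfo] = []
        if not self.root.is_dir():
            return infos
        for child in sorted(self.root.iterdir()):
            db = child / "corpus.sqlite"
            if db.is_file():
                stat = db.stat()
                infos.append(CorpusInfo(child.name, stat.st_size, stat.st_mtime))
        return infos
```

Because `Protocol` uses structural typing, `LocalRegistrySketch` satisfies `CorpusBackendRegistry` without inheriting from it, which is one way an example-side registry could stay decoupled from framework code.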
The framework ships LocalCorpusBackendRegistry (default, filesystem
under CORPUS_ROOT). Alternative implementations live in examples/ and
are activated via CORPUS_BACKEND_REGISTRY_FACTORY="module:callable",
mirroring FIREFLY_MCP_TOKEN_STORE_FACTORY. Keeps the framework
vendor-neutral — no Azure SDKs reach the storage/ namespace.
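A `"module:callable"` spec of this kind is typically resolved with `importlib`. The sketch below illustrates that general technique only; `resolve_factory_sketch` is a hypothetical name, and the framework's own `resolve_registry_factory` may validate input and report errors differently.

```python
import importlib
from typing import Any, Callable


def resolve_factory_sketch(spec: str) -> Callable[[], Any]:
    """Resolve a 'module:callable' string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:callable', got {spec!r}")
    # ImportError from a missing module is left to propagate unchanged.
    module = importlib.import_module(module_name)
    try:
        factory = getattr(module, attr)
    except AttributeError as exc:
        raise ValueError(f"module {module_name!r} has no attribute {attr!r}") from exc
    if not callable(factory):
        raise TypeError(f"{spec!r} does not name a callable")
    return factory
```

A caller would read the spec from the environment (for example `os.environ.get("CORPUS_BACKEND_REGISTRY_FACTORY")`), fall back to the local registry when it is unset, and otherwise call the resolved factory once.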
examples/corpus_search/ gains an AzureCorpusBackendRegistry that maps
each corpus to a <corpus_id>.sqlite blob in a configured container and
enumerates by listing the container. Authenticates via
DefaultAzureCredential so the same code runs under managed identity
(Container Apps) and az login (local dev).
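The corpus-to-blob mapping described above can be illustrated without any Azure SDK calls. This is a sketch of the naming convention only: both helper names are hypothetical, and the input list stands in for whatever blob names a container listing would return.

```python
from typing import Iterable

SUFFIX = ".sqlite"


def blob_name_for(corpus_id: str) -> str:
    """Map a corpus id to its blob name, following <corpus_id>.sqlite."""
    return f"{corpus_id}{SUFFIX}"


def corpus_ids_from_blob_names(names: Iterable[str]) -> list[str]:
    """Recover corpus ids from a container listing, ignoring unrelated blobs."""
    return sorted(
        name.removesuffix(SUFFIX)
        for name in names
        if name.endswith(SUFFIX) and name != SUFFIX
    )
```

With this convention, enumeration needs only a listing of blob names, and per-corpus lookup needs only the corpus id.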
Other changes:
- DatabaseStore.exists() — small public helper so callers don't need
to reach into ._backend to ask "does this corpus exist yet"
- _assert_corpus_exists is now async (routes through ensure_fresh) so
the local working copy is synced from the canonical artifact under
remote backends; four call sites updated to await
- Fix stale imports in cli.py and two test files that referenced a
framework-namespaced AzureBlobBackend that never existed (the class
lives in examples/corpus_search/azure_backend.py); imports now point
at the actual location
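The async existence check described in this list might look roughly like the following. It is a sketch under assumptions: `FakeStore` is a hypothetical stand-in for `DatabaseStore`, `ensure_fresh` is assumed to be awaitable, and the exception type is illustrative rather than the framework's.

```python
import asyncio


class FakeStore:
    """Hypothetical stand-in for DatabaseStore with a remote canonical artifact."""

    def __init__(self, remote_has_artifact: bool) -> None:
        self._remote_has_artifact = remote_has_artifact
        self._local_copy_present = False

    async def ensure_fresh(self) -> None:
        # Pretend to sync the local working copy from the canonical artifact.
        await asyncio.sleep(0)
        self._local_copy_present = self._remote_has_artifact

    def exists(self) -> bool:
        # Public helper, so callers do not reach into a private backend field.
        return self._local_copy_present


async def assert_corpus_exists(store: FakeStore, corpus_id: str) -> None:
    """Sync first, then check, so remote-only corpora are not reported missing."""
    await store.ensure_fresh()
    if not store.exists():
        raise LookupError(f"corpus {corpus_id!r} not found")
```

Syncing before checking matters under a remote backend: the local working copy may not exist yet even though the canonical artifact does.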
Tests added in test_corpus_rag_list_filter.py cover:
- list_corpora delegates to the registry and forwards the source label
- contextvar filter applies after registry enumeration (per-caller
authorisation still enforced at the tool boundary)
- resolve_registry_factory error paths
- LocalCorpusBackendRegistry output shape matches the pre-pivot
list_corpora contract
The PR gate caught two issues:
- corpus_backend.py wasn't ruff-formatted (long resolve_registry_factory
error message). ``ruff format`` reflows it.
- Pyright wouldn't narrow ``_REGISTRY`` to non-None across the
early-return inside ``_registry()``. Read into a local first and
return the local; pyright tracks that without complaint.
    registry: CorpusBackendRegistry = factory()
else:
    registry = LocalCorpusBackendRegistry()
_REGISTRY = registry
…kend.py

Two file moves following PR-review feedback:

* Framework: merge `corpus_backend.py` into `corpus.py`. SqliteCorpus (per-corpus persistence) and CorpusBackendRegistry (index across corpora) are different abstraction layers but answer the same overall question: "the corpus layer". The combined file stays under 530 lines, and one place to look beats two for the framework's storage surface.
* Example: merge `azure_corpus_registry.py` into `azure_backend.py`. The registry holds an AzureBlobBackend per corpus, so the two are tightly coupled; examples can be denser than framework code, and a single ``examples/corpus_search/azure_backend.py`` keeps "all Azure bits" in one place.

Import paths updated in `corpus_rag.py` and the unit test; factory spec in the error message updated to the merged location.

No behaviour change; same 283 tests pass.
a local directory, a blob container, or something else. Free-form;
no parser depends on it.
"""
...

implementation can hold long-lived clients / credentials in the
returned backend.
"""
...

sorts and filters the result; implementations don't apply the
authorisation contextvar themselves.
"""
...
This was referenced May 19, 2026
ancongui pushed a commit that referenced this pull request May 31, 2026
…ckend feat(corpus_rag): pluggable storage backend via CorpusBackendRegistry
ancongui pushed a commit that referenced this pull request May 31, 2026
PR #197 added azure-storage-blob to the corpus-search extra but didn't refresh uv.lock. With `uv sync --frozen`, the runtime image silently skipped the new dep — visible at first AzureCorpusBackendRegistry call: `No module named 'azure.storage'`. Regenerating adds `azure-storage-blob v12.29.0` to the lock; deploy picks it up on next build.
Summary
- Adds a `CorpusBackendRegistry` abstraction so the MCP corpus tools (`list_corpora`, `corpus_query`, `corpus_sql`, …) can serve corpora from non-filesystem storage without code in the framework knowing how to talk to that storage.
- The framework ships `LocalCorpusBackendRegistry` (the existing filesystem-under-`CORPUS_ROOT` behaviour). Alternative registries (e.g. blob containers) live in `examples/` and are selected by `CORPUS_BACKEND_REGISTRY_FACTORY="module:callable"`, mirroring `FIREFLY_MCP_TOKEN_STORE_FACTORY`.
- `AzureCorpusBackendRegistry` lands in `examples/corpus_search/`, mapping each corpus to a `<corpus_id>.sqlite` blob in a configured container and enumerating by listing the container. Authenticates via `DefaultAzureCredential`.

Why
The MCP corpus tools used to assume `LocalBackend` semantics (walk `CORPUS_ROOT` for enumeration, build sqlite paths from the filesystem, open sqlite by `Path`). Two open `# TODO`s in `corpus_rag.py` flagged this as the blocker for any backend whose canonical artifact isn't a local file (`AzureBlobBackend`, future S3/GCS variants).

This PR removes those `# TODO`s by:
- Routing the existence check through `DatabaseStore.ensure_fresh()` so the local working copy is always synced from the canonical artifact, regardless of where that artifact lives.
- Keeping alternative implementations in `examples/`, behind a factory-string lookup.

Changes
| File | Change |
| --- | --- |
| `fireflyframework_agentic/rag/corpus_backend.py` | `CorpusBackendRegistry` Protocol, `LocalCorpusBackendRegistry`, `resolve_registry_factory` |
| `fireflyframework_agentic/storage/database_store.py` | `DatabaseStore.exists()` helper |
| `fireflyframework_agentic/tools/builtins/corpus_rag.py` | `list_corpora` delegates to registry; `_assert_corpus_exists` is async and routes through `ensure_fresh`; four call sites updated to `await`; removed filesystem-specific `# TODO`s |
| `examples/corpus_search/azure_corpus_registry.py` | `AzureCorpusBackendRegistry` + `build_registry()` factory |
| `examples/corpus_search/cli.py` | Stale import fixed: `AzureBlobBackend` lives in the example, not in `fireflyframework_agentic.storage` |
| `tests/integration/test_ingest_with_real_vectorstore.py`, `tests/examples/corpus_search/test_query_path.py` | Same stale-import fix |
| `pyproject.toml` | Adds `azure-storage-blob` and `azure-identity` to the `corpus-search` extra (the example's runtime stack already pulls these in spirit; the new registry makes the dependency explicit) |
| `tests/unit/tools/test_corpus_rag_list_filter.py` | New tests for the registry-backed `list_corpora` |

Backwards compatibility
- With `CORPUS_BACKEND_REGISTRY_FACTORY` unset, the framework uses `LocalCorpusBackendRegistry` against `CORPUS_ROOT`. Existing on-disk layout (`<root>/<corpus_id>/corpus.sqlite`) is forward-compatible.
- `_corpus_root()` is kept as a thin shim over `LocalCorpusBackendRegistry().root` so the existing structured-ingest test suite continues to work unchanged.
- The `from fireflyframework_agentic.storage import AzureBlobBackend` imports in `cli.py` and two integration tests were already broken on `main` (the class never lived in that namespace) but were guarded by `pytest.importorskip`. They now point at the correct location.

Test plan
- `pytest tests/unit/tools/ tests/unit/storage/ tests/unit/exposure/`: 276 pass
- `pytest tests/integration/test_mcp_corpus_e2e.py tests/integration/test_mcp_corpus_concurrency.py tests/integration/test_corpus_agent_structured.py tests/integration/test_corpus_query_grounding.py`: 7 pass
- `ruff check` over changed files clean
- Live run with `CORPUS_BACKEND_REGISTRY_FACTORY=examples.corpus_search.azure_corpus_registry:build_registry` + `CORPUS_AZURE_CONTAINER_URL=<container>`: ingest, query, check RBAC denial (outside this PR)

Follow-ups
- The `agent.py:_ensure_query_ready` TODO about `StructuredRetriever` consuming `self.root / "corpus.sqlite"` directly is still open; structured corpora on remote backends will need that fixed before they work. Out of scope here: this PR moves the easier-and-more-common unstructured path to the registry.
- `examples/corpus_search/cli.py` still has its own `CORPUS_SEARCH_BACKEND` env-var dance; a future cleanup could route it through the same registry mechanism to keep the example and the production code path identical.