
feat(corpus_rag): pluggable storage backend via CorpusBackendRegistry #197

Merged
miguelgfierro merged 3 commits into main from feat/corpus-rag-blob-backend
May 19, 2026


@miguelgfierro
Contributor

Summary

  • Introduces a CorpusBackendRegistry abstraction so the MCP corpus tools (list_corpora, corpus_query, corpus_sql, …) can serve corpora from non-filesystem storage without code in the framework knowing how to talk to that storage.
  • The framework ships LocalCorpusBackendRegistry (the existing filesystem-under-CORPUS_ROOT behaviour). Alternative registries — e.g. blob containers — live in examples/ and are selected by CORPUS_BACKEND_REGISTRY_FACTORY="module:callable", mirroring FIREFLY_MCP_TOKEN_STORE_FACTORY.
  • An AzureCorpusBackendRegistry lands in examples/corpus_search/, mapping each corpus to a <corpus_id>.sqlite blob in a configured container and enumerating by listing the container. Authenticates via DefaultAzureCredential.
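For readers unfamiliar with the "module:callable" convention, here is a minimal sketch of how such a spec can be resolved. The function name `resolve_factory_spec` and its error types are assumptions made for illustration; the PR's actual `resolve_registry_factory` may differ in signature and behaviour.

```python
import importlib
from typing import Any, Callable


def resolve_factory_spec(spec: str) -> Callable[..., Any]:
    """Resolve a "module:callable" string to the callable it names.

    Hypothetical helper: it illustrates the convention, not the PR's code.
    """
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:callable', got {spec!r}")
    # An unknown module raises ImportError, which is left to propagate.
    module = importlib.import_module(module_name)
    target = getattr(module, attr, None)
    if not callable(target):
        raise TypeError(f"{spec!r} does not name a callable")
    return target


# Usage with a standard-library target standing in for a registry factory.
dumps = resolve_factory_spec("json:dumps")
print(dumps({"ok": True}))
```

Keeping the lookup string-based means the framework never has to import a vendor SDK unless the operator opts in through the environment variable.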

Why

The MCP corpus tools used to assume LocalBackend semantics (walk CORPUS_ROOT for enumeration, build sqlite paths from the filesystem, open sqlite by Path). Two open # TODOs in corpus_rag.py flagged this as the blocker for any backend whose canonical artifact isn't a local file (AzureBlobBackend, future S3/GCS variants).

This PR removes those # TODOs by:

  1. Routing existence + freshness through DatabaseStore.ensure_fresh() so the local working copy is always synced from the canonical artifact, regardless of where that artifact lives.
  2. Delegating enumeration to the registry, so the question "what corpora exist?" is answered by whichever backend owns them.
  3. Keeping the framework free of vendor SDKs — Azure-specific code lives in examples/, behind a factory-string lookup.

Changes

| File | Change |
| --- | --- |
| fireflyframework_agentic/rag/corpus_backend.py | New module: CorpusBackendRegistry Protocol, LocalCorpusBackendRegistry, resolve_registry_factory |
| fireflyframework_agentic/storage/database_store.py | New DatabaseStore.exists() helper |
| fireflyframework_agentic/tools/builtins/corpus_rag.py | list_corpora delegates to registry; _assert_corpus_exists is async and routes through ensure_fresh; four call sites updated to await; removed filesystem-specific # TODOs |
| examples/corpus_search/azure_corpus_registry.py | New: AzureCorpusBackendRegistry + build_registry() factory |
| examples/corpus_search/cli.py | Fix stale import: AzureBlobBackend lives in the example, not in fireflyframework_agentic.storage |
| tests/integration/test_ingest_with_real_vectorstore.py, tests/examples/corpus_search/test_query_path.py | Same import fix |
| pyproject.toml | Add azure-storage-blob and azure-identity to the corpus-search extra (the example's runtime stack already pulls these in spirit; the new registry makes the dependency explicit) |
| tests/unit/tools/test_corpus_rag_list_filter.py | Add coverage for the registry pivot |

Backwards compatibility

  • Default behaviour unchanged: with CORPUS_BACKEND_REGISTRY_FACTORY unset, the framework uses LocalCorpusBackendRegistry against CORPUS_ROOT. Existing on-disk layout (<root>/<corpus_id>/corpus.sqlite) is forward-compatible.
  • _corpus_root() kept as a thin shim over LocalCorpusBackendRegistry().root so the existing structured-ingest test suite continues to work unchanged.
  • The stale from fireflyframework_agentic.storage import AzureBlobBackend imports in cli.py and two integration tests were already broken on main (the class never lived in that namespace) but were guarded by pytest.importorskip. They now point at the correct location.

Test plan

  • pytest tests/unit/tools/ tests/unit/storage/ tests/unit/exposure/ — 276 pass
  • pytest tests/integration/test_mcp_corpus_e2e.py tests/integration/test_mcp_corpus_concurrency.py tests/integration/test_corpus_agent_structured.py tests/integration/test_corpus_query_grounding.py — 7 pass
  • ruff check over changed files clean
  • Operator-side verification on the integrator's MCP deployment: set CORPUS_BACKEND_REGISTRY_FACTORY=examples.corpus_search.azure_corpus_registry:build_registry + CORPUS_AZURE_CONTAINER_URL=<container>, ingest, query, check RBAC denial — outside this PR

Follow-ups

  • The agent.py:_ensure_query_ready TODO about StructuredRetriever consuming self.root / "corpus.sqlite" directly is still open; structured corpora on remote backends will need that fixed before they work. Out of scope here — this PR moves the easier-and-more-common unstructured path to the registry.
  • examples/corpus_search/cli.py still has its own CORPUS_SEARCH_BACKEND env-var dance; a future cleanup could route it through the same registry mechanism to keep the example and the production code path identical.

The MCP corpus tools (list_corpora, _assert_corpus_exists, corpus_sql, …)
used to assume LocalBackend semantics: walk CORPUS_ROOT directly, build
sqlite paths from the filesystem, open sqlite by path. Open `# TODO`s
in corpus_rag.py called this out as the blocker for any non-filesystem
storage backend.

Introduce a CorpusBackendRegistry abstraction:

  - Per-corpus lookup → StorageBackend for the corpus_id
  - Enumeration → list of corpora with size + modified

The framework ships LocalCorpusBackendRegistry (default, filesystem
under CORPUS_ROOT). Alternative implementations live in examples/ and
are activated via CORPUS_BACKEND_REGISTRY_FACTORY="module:callable",
mirroring FIREFLY_MCP_TOKEN_STORE_FACTORY. Keeps the framework
vendor-neutral — no Azure SDKs reach the storage/ namespace.
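A minimal sketch of the two responsibilities named above, lookup and
enumeration. The names `RegistrySketch`, `LocalRegistrySketch`,
`backend_for` and `CorpusInfo` are assumptions made for illustration,
and a plain Path stands in for the StorageBackend the real registry
returns. Only the on-disk layout `<root>/<corpus_id>/corpus.sqlite`
comes from this PR.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class CorpusInfo:
    corpus_id: str
    size: int        # bytes
    modified: float  # POSIX timestamp


class RegistrySketch(Protocol):
    """Hypothetical shape: one lookup method, one enumeration method."""

    def backend_for(self, corpus_id: str) -> Path: ...

    def list_corpora(self) -> list[CorpusInfo]: ...


class LocalRegistrySketch:
    """Filesystem variant using the <root>/<corpus_id>/corpus.sqlite layout."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def backend_for(self, corpus_id: str) -> Path:
        # A Path stands in for the per-corpus storage backend.
        return self.root / corpus_id / "corpus.sqlite"

    def list_corpora(self) -> list[CorpusInfo]:
        found = []
        for db in sorted(self.root.glob("*/corpus.sqlite")):
            stat = db.stat()
            found.append(CorpusInfo(db.parent.name, stat.st_size, stat.st_mtime))
        return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "corpus.sqlite").write_bytes(b"demo")
    print(LocalRegistrySketch(root).list_corpora())
```

A structural Protocol lets an example-side registry satisfy the contract
without inheriting from anything in the framework.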

examples/corpus_search/ gains an AzureCorpusBackendRegistry that maps
each corpus to a <corpus_id>.sqlite blob in a configured container and
enumerates by listing the container. Authenticates via
DefaultAzureCredential so the same code runs under managed identity
(Container Apps) and az login (local dev).
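The blob naming and enumeration rule can be sketched without the Azure
SDK. The helpers below are hypothetical and operate on any objects that
expose `name`, `size` and `last_modified`; the assumption is that the
properties yielded by the SDK's container listing have that shape, so
fakes are used here to keep the sketch runnable offline.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Iterable

SUFFIX = ".sqlite"


def blob_name_for(corpus_id: str) -> str:
    """Map a corpus id to its blob name inside the container."""
    return f"{corpus_id}{SUFFIX}"


def corpora_from_blobs(blobs: Iterable[Any]) -> list[dict]:
    """Turn a container listing into corpus rows, ignoring unrelated blobs."""
    rows = []
    for blob in blobs:
        if not blob.name.endswith(SUFFIX):
            continue
        corpus_id = blob.name[: -len(SUFFIX)]
        if not corpus_id:
            continue  # a blob named exactly ".sqlite" has no corpus id
        rows.append(
            {"corpus_id": corpus_id, "size": blob.size, "modified": blob.last_modified}
        )
    return rows


# Fake listing standing in for what a container client would return.
stamp = datetime(2026, 5, 19, tzinfo=timezone.utc)
listing = [
    SimpleNamespace(name="handbook.sqlite", size=4096, last_modified=stamp),
    SimpleNamespace(name="README.txt", size=12, last_modified=stamp),
]
print(corpora_from_blobs(listing))
```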

Other changes:
  - DatabaseStore.exists() — small public helper so callers don't need
    to reach into ._backend to ask "does this corpus exist yet"
  - _assert_corpus_exists is now async (routes through ensure_fresh) so
    the local working copy is synced from the canonical artifact under
    remote backends; four call sites updated to await
  - Fix stale imports in cli.py and two test files that referenced a
    framework-namespaced AzureBlobBackend that never existed (the class
    lives in examples/corpus_search/azure_backend.py); imports now point
    at the actual location
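The point of the exists() helper is encapsulation. A rough sketch, with
hypothetical `DictBackend` and `StoreSketch` classes standing in for the
real backend and DatabaseStore:

```python
class DictBackend:
    """Hypothetical backend: a dict stands in for files or blobs."""

    def __init__(self, objects: dict[str, bytes], key: str) -> None:
        self._objects = objects
        self._key = key

    def exists(self) -> bool:
        return self._key in self._objects


class StoreSketch:
    """Hypothetical store that hides its backend behind a public method."""

    def __init__(self, backend: DictBackend) -> None:
        self._backend = backend  # private: callers should not reach in here

    def exists(self) -> bool:
        """Answer "has this corpus been written yet?" without exposing the backend."""
        return self._backend.exists()


objects = {"handbook.sqlite": b"..."}
print(StoreSketch(DictBackend(objects, "handbook.sqlite")).exists())
print(StoreSketch(DictBackend(objects, "missing.sqlite")).exists())
```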

Tests added in test_corpus_rag_list_filter.py cover:
  - list_corpora delegates to the registry and forwards the source label
  - contextvar filter applies after registry enumeration (per-caller
    authorisation still enforced at the tool boundary)
  - resolve_registry_factory error paths
  - LocalCorpusBackendRegistry output shape matches the pre-pivot
    list_corpora contract
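The second bullet describes an ordering: enumerate first, then filter
per caller. A minimal sketch of that ordering, where `ALLOWED_CORPORA`,
`FakeRegistry` and `list_corpora_tool` are hypothetical names and the
row shape is an assumption:

```python
from contextvars import ContextVar
from typing import Optional

# None means "no restriction"; a frozenset limits what the caller may see.
ALLOWED_CORPORA: ContextVar[Optional[frozenset[str]]] = ContextVar(
    "allowed_corpora", default=None
)


class FakeRegistry:
    """Stand-in registry: enumeration knows nothing about callers."""

    def list_corpora(self) -> list[dict]:
        return [{"corpus_id": "zeta"}, {"corpus_id": "alpha"}, {"corpus_id": "secret"}]


def list_corpora_tool(registry: FakeRegistry) -> list[dict]:
    rows = registry.list_corpora()
    allowed = ALLOWED_CORPORA.get()
    if allowed is not None:
        # Authorisation is applied here, at the tool boundary.
        rows = [row for row in rows if row["corpus_id"] in allowed]
    return sorted(rows, key=lambda row: row["corpus_id"])


token = ALLOWED_CORPORA.set(frozenset({"alpha", "zeta"}))
try:
    print(list_corpora_tool(FakeRegistry()))
finally:
    ALLOWED_CORPORA.reset(token)
```

Filtering at the boundary keeps every registry implementation free of
authorisation logic, so a new backend cannot accidentally skip it.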
Comment thread fireflyframework_agentic/rag/corpus_backend.py Fixed
PR gate caught two issues:
  - corpus_backend.py wasn't ruff-formatted (long resolve_registry_factory
    error message). ``ruff format`` reflows it.
  - Pyright wouldn't narrow ``_REGISTRY`` to non-None across the
    early-return inside ``_registry()``. Read into a local first and
    return the local; pyright tracks that without complaint.
    registry: CorpusBackendRegistry = factory()
else:
    registry = LocalCorpusBackendRegistry()
_REGISTRY = registry
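The narrowing pattern from the second gate fix, as a self-contained
sketch. `LocalRegistry` and `get_registry` are hypothetical stand-ins;
only the idea of reading the module global into a local before the
early return comes from the commit message.

```python
from typing import Optional


class LocalRegistry:
    """Hypothetical stand-in for the default registry."""


_REGISTRY: Optional[LocalRegistry] = None


def get_registry() -> LocalRegistry:
    """Return the process-wide registry, creating it on first use."""
    global _REGISTRY
    cached = _REGISTRY  # read the global into a local first
    if cached is not None:
        return cached  # a local narrows to non-None for the type checker
    registry = LocalRegistry()
    _REGISTRY = registry
    return registry  # returning the local avoids re-reading the Optional global


print(get_registry() is get_registry())
```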
…kend.py

Two file moves in response to PR-review feedback:

* Framework: merge `corpus_backend.py` into `corpus.py`. SqliteCorpus
  (per-corpus persistence) and CorpusBackendRegistry (index across
  corpora) are different abstraction layers but answer the same overall
  question — "the corpus layer". The combined file stays under 530
  lines and one place to look beats two for the framework's storage
  surface.
* Example: merge `azure_corpus_registry.py` into `azure_backend.py`.
  The registry holds an AzureBlobBackend per corpus, so the two are
  tightly coupled; examples can be denser than framework code, and a
  single ``examples/corpus_search/azure_backend.py`` keeps "all Azure
  bits" in one place.

Import paths updated in `corpus_rag.py` and the unit test; factory
spec in the error message updated to the merged location.

No behaviour change; same 283 tests pass.
a local directory, a blob container, or something else. Free-
form; no parser depends on it.
"""
...
implementation can hold long-lived clients / credentials in the
returned backend.
"""
...
sorts and filters the result; implementations don't apply the
authorisation contextvar themselves.
"""
...
@miguelgfierro miguelgfierro merged commit b6fda6e into main May 19, 2026
9 checks passed
@miguelgfierro miguelgfierro deleted the feat/corpus-rag-blob-backend branch May 19, 2026 13:20
ancongui pushed a commit that referenced this pull request May 31, 2026
…ckend

feat(corpus_rag): pluggable storage backend via CorpusBackendRegistry
ancongui pushed a commit that referenced this pull request May 31, 2026
PR #197 added azure-storage-blob to the corpus-search extra but didn't
refresh uv.lock. With `uv sync --frozen`, the runtime image silently
skipped the new dep — visible at first AzureCorpusBackendRegistry call:
`No module named 'azure.storage'`. Regenerating adds
`azure-storage-blob v12.29.0` to the lock; deploy picks it up on next
build.