feat: add embeddings module and vector store backends #42

Merged
ancongui merged 13 commits into main from feat/embeddings-vectorstores
Feb 24, 2026
@ancongui
Contributor

Summary

  • Add embeddings module with 8 providers (OpenAI, Azure, Cohere, Google, Mistral, Voyage, Bedrock, Ollama), auto-batching, similarity utilities, and a provider registry
  • Add vector stores module with 4 backends (In-Memory, ChromaDB, Pinecone, Qdrant), auto-embedding, search_text convenience method, namespace scoping, and a store registry
  • Add pipeline integration via EmbeddingStep and RetrievalStep for RAG workflows
  • Add 108 tests covering all providers, backends, types, and pipeline steps (full suite of 1383 tests passing)
  • Add docs/embeddings.md and docs/vectorstores.md user guides
  • Update pyproject.toml with all optional dependency groups

Test plan

  • 108 embedding/vectorstore tests passing
  • 1383 total test suite passing (zero regressions)
  • ruff lint clean across all new files
  • All providers tested with mocked SDKs
  • Auto-embedding, search_text, namespace isolation verified
  • Pipeline EmbeddingStep and RetrievalStep integration tested

ancongui enabled auto-merge (squash) February 24, 2026 13:15
async def test_upsert(self, mock_point_struct, mock_client_cls):
    mock_client = AsyncMock()
    mock_client_cls.return_value = mock_client
    mock_point_struct.side_effect = lambda **kw: MagicMock(**kw)

    return result


def _match_filter(doc: VectorDocument, f: SearchFilter) -> bool:
Comment on lines +10 to +18
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointIdsList,
    PointStruct,
    VectorParams,
)

class VectorStoreProtocol(Protocol):
    """Structural protocol for vector stores."""

    async def upsert(self, documents: list[VectorDocument], namespace: str = "default") -> None: ...

        top_k: int = 5,
        namespace: str = "default",
        filters: list[SearchFilter] | None = None,
    ) -> list[SearchResult]: ...

        top_k: int = 5,
        namespace: str = "default",
        filters: list[SearchFilter] | None = None,
    ) -> list[SearchResult]: ...

        filters: list[SearchFilter] | None = None,
    ) -> list[SearchResult]: ...

    async def delete(self, ids: list[str], namespace: str = "default") -> None: ...

            raise VectorStoreError(f"Delete failed: {exc}") from exc

    @abstractmethod
    async def _upsert(self, documents: list[VectorDocument], namespace: str) -> None: ...

        top_k: int,
        namespace: str,
        filters: list[SearchFilter] | None,
    ) -> list[SearchResult]: ...

    ) -> list[SearchResult]: ...

    @abstractmethod
    async def _delete(self, ids: list[str], namespace: str) -> None: ...

ancongui merged commit 7f3ef8d into main Feb 24, 2026
12 checks passed
ancongui deleted the feat/embeddings-vectorstores branch February 24, 2026 14:42
ancongui added a commit that referenced this pull request May 31, 2026
* feat(embeddings): add exceptions and config fields for embeddings and vector stores

* feat(embeddings): add embedding types and similarity utility functions

Add core data models (EmbeddingResult, EmbeddingUsage) using Pydantic v2
and pure-Python vector math functions (cosine_similarity, dot_product,
euclidean_distance) with no external dependencies. Includes full test
coverage for both modules.
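
For readers unfamiliar with these utilities, here is a minimal pure-Python sketch of the three functions named above. It is an illustration, not the code from this PR: the signatures, the length check, and the choice to return 0.0 for a zero vector are all assumptions.

```python
import math
from collections.abc import Sequence


def _check_lengths(a: Sequence[float], b: Sequence[float]) -> None:
    # Assumption: mismatched dimensions are treated as a caller error.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products."""
    _check_lengths(a, b)
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    _check_lengths(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: similarity with a zero vector is defined as 0.0
        # instead of raising ZeroDivisionError.
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)
```

Keeping these in plain Python avoids a NumPy dependency for the core package, at the cost of speed on large vectors.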

* feat(embeddings): add embedding protocol and base class with auto-batching

Introduce the core extensibility layer for the embeddings system:
- EmbeddingProtocol: runtime-checkable Protocol for duck typing
- BaseEmbedder: abstract base class with auto-batching, error wrapping,
  and configurable batch_size/max_retries from framework config
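
The protocol plus base-class split described above could be sketched roughly as follows. The method names `embed` and `_embed_batch`, the plain list-of-floats return type, and the toy `LengthEmbedder` are assumptions made for illustration. The real classes return `EmbeddingResult` objects and also handle `max_retries`, which this sketch omits.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProtocol(Protocol):
    """Anything with a compatible async embed() method satisfies this."""

    async def embed(self, texts: list[str]) -> list[list[float]]: ...


class EmbeddingProviderError(Exception):
    """Stand-in for the framework's provider error."""


class BaseEmbedder(ABC):
    def __init__(self, batch_size: int = 2) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size

    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Split the input into batches and wrap provider failures."""
        vectors: list[list[float]] = []
        for start in range(0, len(texts), self.batch_size):
            batch = texts[start : start + self.batch_size]
            try:
                vectors.extend(await self._embed_batch(batch))
            except Exception as exc:
                raise EmbeddingProviderError(f"Embedding failed: {exc}") from exc
        return vectors

    @abstractmethod
    async def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Provider-specific call for a single batch."""


class LengthEmbedder(BaseEmbedder):
    """Toy provider: embeds each text as [len(text)] and records batch sizes."""

    def __init__(self, batch_size: int = 2) -> None:
        super().__init__(batch_size)
        self.batch_sizes: list[int] = []

    async def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        self.batch_sizes.append(len(texts))
        return [[float(len(t))] for t in texts]


if __name__ == "__main__":
    embedder = LengthEmbedder(batch_size=2)
    print(asyncio.run(embedder.embed(["a", "bb", "ccc"])))
```

Subclasses only implement the single-batch call; batching and error wrapping live in one place in the base class.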

* feat(embeddings): add embedder registry for named embedder instances

Simple registry following the ToolRegistry pattern, supporting
register, get, unregister, and list_names operations keyed by name.
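
A name-keyed registry with these four operations might look like the sketch below. The class name `NamedRegistry`, the overwrite-on-reregister behavior, and the use of `KeyError` for unknown names are assumptions. The actual `EmbedderRegistry` may differ on all three.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class NamedRegistry(Generic[T]):
    """Hypothetical registry of instances keyed by name."""

    def __init__(self) -> None:
        self._items: dict[str, T] = {}

    def register(self, name: str, item: T) -> None:
        # Assumption: registering an existing name replaces the old entry.
        self._items[name] = item

    def get(self, name: str) -> T:
        # Assumption: unknown names raise KeyError.
        try:
            return self._items[name]
        except KeyError:
            raise KeyError(f"No entry registered under {name!r}") from None

    def unregister(self, name: str) -> None:
        # Removing a name that is not present is treated as a no-op here.
        self._items.pop(name, None)

    def list_names(self) -> list[str]:
        return sorted(self._items)
```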

* feat(embeddings): add OpenAI embedding provider

Implement OpenAIEmbedder that subclasses BaseEmbedder to generate
embeddings via the OpenAI API. Supports text-embedding-3-small,
text-embedding-3-large, and text-embedding-ada-002 models with
optional output dimensions. Includes full test coverage with mocked
AsyncOpenAI client.

* feat(embeddings): add Cohere, Google, Mistral, Ollama, Bedrock, Azure, and Voyage embedding providers

Add 7 embedding providers subclassing BaseEmbedder with try/except
SDK imports for safe module loading when packages aren't installed.
Each provider wraps errors in EmbeddingProviderError. Includes full
test coverage with mocked SDK calls for all providers.
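
The guarded-import pattern mentioned above can be illustrated with a placeholder module. `hypothetical_vendor_sdk` and `HypotheticalEmbedder` are invented names standing in for a real provider SDK and its embedder. The exact error message and the point at which the PR's providers raise are assumptions.

```python
try:
    # Placeholder module name; it is not expected to be installed.
    import hypothetical_vendor_sdk
except ImportError:
    # Importing this module must not fail when the optional SDK is missing.
    hypothetical_vendor_sdk = None


class HypotheticalEmbedder:
    """Illustrative provider that needs an optional SDK at construction time."""

    def __init__(self, model: str = "example-model") -> None:
        if hypothetical_vendor_sdk is None:
            # Fail only when the provider is actually used, with a hint
            # about what to install.
            raise ImportError(
                "hypothetical_vendor_sdk is required for HypotheticalEmbedder; "
                "install the matching optional dependency group"
            )
        self.model = model
```

Deferring the failure to `__init__` lets a package export every provider from one namespace while users install only the SDKs they need.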

* feat(embeddings): wire up public API exports in embeddings __init__.py

Export all core symbols (BaseEmbedder, EmbeddingProtocol, EmbedderRegistry,
EmbeddingResult, EmbeddingUsage, cosine_similarity, dot_product,
euclidean_distance) so users can import directly from
fireflyframework_genai.embeddings. Add test verifying public API imports.

* feat(vectorstores): add vector store types, protocol/base class, and in-memory backend

Implements Tasks 8-10: the vector store core module with VectorDocument,
SearchResult, SearchFilter types, BaseVectorStore ABC with auto-embedding
and error wrapping, VectorStoreProtocol for duck typing, and a brute-force
cosine similarity InMemoryVectorStore for dev/testing use.
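
To make "brute-force cosine similarity" concrete, here is a simplified sketch of an in-memory store with namespace scoping. It uses dataclasses instead of the PR's Pydantic models, leaves out filters and auto-embedding, and the field names and the `query_embedding` parameter are assumptions.

```python
import asyncio
import math
from dataclasses import dataclass, field


@dataclass
class VectorDocument:
    id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


@dataclass
class SearchResult:
    document: VectorDocument
    score: float


def _cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryVectorStore:
    """Scores every stored document on each query; fine for dev and tests."""

    def __init__(self) -> None:
        # namespace -> document id -> document
        self._data: dict[str, dict[str, VectorDocument]] = {}

    async def upsert(self, documents: list[VectorDocument], namespace: str = "default") -> None:
        bucket = self._data.setdefault(namespace, {})
        for doc in documents:
            bucket[doc.id] = doc  # same id overwrites the previous document

    async def search(
        self, query_embedding: list[float], top_k: int = 5, namespace: str = "default"
    ) -> list[SearchResult]:
        bucket = self._data.get(namespace, {})
        scored = [SearchResult(d, _cosine(query_embedding, d.embedding)) for d in bucket.values()]
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]

    async def delete(self, ids: list[str], namespace: str = "default") -> None:
        bucket = self._data.get(namespace, {})
        for doc_id in ids:
            bucket.pop(doc_id, None)


if __name__ == "__main__":
    store = InMemoryVectorStore()
    asyncio.run(store.upsert([VectorDocument("a", "hello", [1.0, 0.0])]))
    print(asyncio.run(store.search([1.0, 0.0], top_k=1)))
```

Each query is linear in the number of stored documents, which is why a store like this suits development and testing and not production workloads.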

* feat(vectorstores): add ChromaDB, Pinecone, Qdrant backends and VectorStoreRegistry

Add three external vector store backends that subclass BaseVectorStore
and wrap their respective SDKs (chromadb, pinecone, qdrant-client),
with graceful ImportError handling when SDKs are not installed.
Add VectorStoreRegistry for named store instance management.
All 28 new tests mock external SDKs and pass alongside existing suite.

* feat: wire up vectorstores public API, add EmbeddingStep/RetrievalStep pipeline steps, and add optional deps

- Task 12: Export all vectorstores types, base classes, and registry from __init__.py
- Task 13: Add EmbeddingStep and RetrievalStep to pipeline steps module for
  embedding text and retrieving from vector stores within DAG pipelines
- Task 14: Add optional dependency groups for embeddings, provider-specific
  embeddings (openai, cohere, google, mistral, voyage), and vector store
  backends (chroma, pinecone, qdrant) in pyproject.toml

* style: fix lint errors and apply ruff formatting to embeddings and vectorstores

Remove unused VectorStoreConnectionError/VectorStoreError imports from
external vector store backends and apply consistent formatting.

* fix: resolve audit issues and add documentation for embeddings/vectorstores

- Fix ChromaVectorStore._delete() to respect namespace via where clause
- Fix base.py search_text() to raise VectorStoreError instead of ValueError
- Fix AzureEmbedder: make azure_endpoint required (no empty default)
- Fix MistralEmbedder: pass api_key directly (no empty string fallback)
- Add all provider exports to embeddings/providers/__init__.py
- Add external backend exports to vectorstores/__init__.py
- Add embeddings/vectorstores to top-level __init__.py exports
- Add proper type hints (EmbeddingProtocol, VectorStoreProtocol) to pipeline steps
- Add missing pyproject.toml optional deps (azure, bedrock, ollama embeddings)
- Complete the 'all' extra with all embedding and vectorstore groups
- Add docs/embeddings.md and docs/vectorstores.md user guides
- Update docs/README.md with Embeddings & Vector Stores section
- Fix lint issues in test files (unused imports, unused variable)
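
The error-handling fixes in this list follow a common template-method shape: public methods on the base class wrap backend hooks and raise one store-level exception type. The sketch below is an assumption-laden illustration, not the PR's `base.py`. In particular, the condition under which `search_text()` raises (here, no embedder configured) and the `embed` method on the embedder are guesses, and `BrokenStore` exists only to trigger the failure path.

```python
import asyncio
from abc import ABC, abstractmethod


class VectorStoreError(Exception):
    """Stand-in for the framework's vector store exception."""


class BaseVectorStore(ABC):
    def __init__(self, embedder=None) -> None:
        # Assumed: any object exposing an async embed(texts) method.
        self._embedder = embedder

    async def search_text(self, text: str, top_k: int = 5, namespace: str = "default"):
        if self._embedder is None:
            # A store-level error instead of a bare ValueError.
            raise VectorStoreError("search_text requires an embedder")
        try:
            [query_embedding] = await self._embedder.embed([text])
            return await self._search(query_embedding, top_k, namespace)
        except Exception as exc:
            raise VectorStoreError(f"Search failed: {exc}") from exc

    async def delete(self, ids: list[str], namespace: str = "default") -> None:
        try:
            await self._delete(ids, namespace)
        except Exception as exc:
            raise VectorStoreError(f"Delete failed: {exc}") from exc

    @abstractmethod
    async def _search(self, query_embedding: list[float], top_k: int, namespace: str): ...

    @abstractmethod
    async def _delete(self, ids: list[str], namespace: str) -> None: ...


class BrokenStore(BaseVectorStore):
    """Toy backend whose delete hook always fails."""

    async def _search(self, query_embedding, top_k, namespace):
        return []

    async def _delete(self, ids, namespace):
        raise RuntimeError("backend unavailable")


if __name__ == "__main__":
    try:
        asyncio.run(BrokenStore().delete(["a"]))
    except VectorStoreError as exc:
        print(type(exc.__cause__).__name__, exc)
```

Callers then catch a single exception type regardless of which backend SDK failed, and the original error stays available through `__cause__`.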

* fix: resolve all pyright type-check errors in embeddings and vectorstores

- Add type: ignore[import-untyped] for optional SDK imports (google,
  voyage, chromadb, pinecone, qdrant) not installed in CI
- Guard Cohere response.embeddings and .float_ against None
- Filter out None embeddings in Mistral response
- Add type: ignore[union-attr] for genai.embed_content (guarded in __init__)
- Add type: ignore[misc] for Qdrant model constructors (guarded in __init__)