feat: add embeddings module and vector store backends #42
Merged
Add core data models (EmbeddingResult, EmbeddingUsage) using Pydantic v2 and pure-Python vector math functions (cosine_similarity, dot_product, euclidean_distance) with no external dependencies. Includes full test coverage for both modules.
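As an illustration of the dependency-free vector math this commit describes, a minimal sketch could look like the following. The three function names come from the commit message; the signatures, the length check, and the choice to return 0.0 for zero vectors are assumptions, not the PR's actual code.

```python
import math
from collections.abc import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    # Assumption: mismatched dimensions are treated as a caller error.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products."""
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector scores 0.0 instead of raising.
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)
```

Cosine similarity ignores vector magnitude, which is why it is a common default for comparing embeddings of texts with different lengths.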
…ching
Introduce the core extensibility layer for the embeddings system:
- EmbeddingProtocol: runtime-checkable Protocol for duck typing
- BaseEmbedder: abstract base class with auto-batching, error wrapping, and configurable batch_size/max_retries from framework config
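The protocol-plus-base-class design described above could be sketched roughly as follows. Only the names EmbeddingProtocol, BaseEmbedder, EmbeddingProviderError, and batch_size appear in the PR; the method names `embed` and `_embed_batch`, the plain list return type, and the `LengthEmbedder` toy subclass are hypothetical, and retries are omitted for brevity.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable


class EmbeddingProviderError(Exception):
    """Raised when a provider call fails (name taken from the PR)."""


@runtime_checkable
class EmbeddingProtocol(Protocol):
    # Hypothetical signature: anything with an async `embed` qualifies.
    async def embed(self, texts: list[str]) -> list[list[float]]: ...


class BaseEmbedder(ABC):
    def __init__(self, batch_size: int = 2) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size

    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Split the input into batches and wrap provider failures."""
        vectors: list[list[float]] = []
        for start in range(0, len(texts), self.batch_size):
            batch = texts[start : start + self.batch_size]
            try:
                vectors.extend(await self._embed_batch(batch))
            except Exception as exc:
                raise EmbeddingProviderError(f"Embedding failed: {exc}") from exc
        return vectors

    @abstractmethod
    async def _embed_batch(self, texts: list[str]) -> list[list[float]]: ...


class LengthEmbedder(BaseEmbedder):
    """Toy provider: embeds each text as a one-dimensional length vector."""

    def __init__(self, batch_size: int = 2) -> None:
        super().__init__(batch_size)
        self.calls: list[int] = []  # records the size of each batch

    async def _embed_batch(self, texts: list[str]) -> list[list[float]]:
        self.calls.append(len(texts))
        return [[float(len(t))] for t in texts]


embedder = LengthEmbedder(batch_size=2)
result = asyncio.run(embedder.embed(["a", "bb", "ccc"]))
```

With this split, a provider only implements the per-batch call, while batching and error wrapping live in one place. Note that `isinstance` against a runtime-checkable Protocol only verifies that the method exists, not its signature.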
Simple registry following the ToolRegistry pattern, supporting register, get, unregister, and list_names operations keyed by name.
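A name-keyed registry with the four operations listed above might be sketched like this. The operation names come from the commit message; the error behavior for duplicate or unknown names and the sorted output of list_names are assumptions.

```python
from typing import Any


class EmbedderRegistry:
    """Minimal name-to-instance registry (illustrative sketch)."""

    def __init__(self) -> None:
        self._embedders: dict[str, Any] = {}

    def register(self, name: str, embedder: Any) -> None:
        # Assumption: re-registering an existing name is an error.
        if name in self._embedders:
            raise ValueError(f"Embedder '{name}' is already registered")
        self._embedders[name] = embedder

    def get(self, name: str) -> Any:
        # Assumption: unknown names raise KeyError.
        if name not in self._embedders:
            raise KeyError(f"No embedder registered under '{name}'")
        return self._embedders[name]

    def unregister(self, name: str) -> None:
        # Assumption: removing a missing name is a silent no-op.
        self._embedders.pop(name, None)

    def list_names(self) -> list[str]:
        return sorted(self._embedders)


registry = EmbedderRegistry()
fake_embedder = object()  # stands in for a real embedder instance
registry.register("openai", fake_embedder)
```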
Implement OpenAIEmbedder that subclasses BaseEmbedder to generate embeddings via the OpenAI API. Supports text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002 models with optional output dimensions. Includes full test coverage with mocked AsyncOpenAI client.
…, and Voyage embedding providers
Add 7 embedding providers subclassing BaseEmbedder with try/except SDK imports for safe module loading when packages aren't installed. Each provider wraps errors in EmbeddingProviderError. Includes full test coverage with mocked SDK calls for all providers.
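The guarded-import pattern mentioned above generally looks like the sketch below. The module name `hypothetical_embedding_sdk` and the class `HypotheticalEmbedder` are placeholders chosen so the example runs without any provider SDK; the actual providers in the PR import their own SDKs and may report the missing package differently.

```python
try:
    # Placeholder name: stands in for an optional provider SDK.
    import hypothetical_embedding_sdk
except ImportError:
    # Importing this module still succeeds; the failure is deferred
    # until someone actually tries to construct the provider.
    hypothetical_embedding_sdk = None


class HypotheticalEmbedder:
    """Provider that needs the optional SDK only at construction time."""

    def __init__(self, api_key: str) -> None:
        if hypothetical_embedding_sdk is None:
            raise ImportError(
                "The optional SDK for this provider is not installed."
            )
        self._api_key = api_key


try:
    HypotheticalEmbedder(api_key="example-key")
    constructed = True
except ImportError:
    constructed = False
```

Deferring the failure to construction time means a package-level import of all providers does not break for users who installed only one SDK.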
Export all core symbols (BaseEmbedder, EmbeddingProtocol, EmbedderRegistry, EmbeddingResult, EmbeddingUsage, cosine_similarity, dot_product, euclidean_distance) so users can import directly from fireflyframework_genai.embeddings. Add test verifying public API imports.
…in-memory backend
Implements Tasks 8-10: the vector store core module with VectorDocument, SearchResult, SearchFilter types, BaseVectorStore ABC with auto-embedding and error wrapping, VectorStoreProtocol for duck typing, and a brute-force cosine similarity InMemoryVectorStore for dev/testing use.
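To make "brute-force cosine similarity" concrete, here is a minimal sketch of an in-memory store with namespace scoping. The `Doc` dataclass and `TinyInMemoryStore` are hypothetical stand-ins for VectorDocument and InMemoryVectorStore; their fields, the tuple return type, and the absence of filters are simplifying assumptions.

```python
import asyncio
import math
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for VectorDocument."""
    id: str
    embedding: list[float]
    text: str = ""


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyInMemoryStore:
    def __init__(self) -> None:
        # namespace -> document id -> document
        self._data: dict[str, dict[str, Doc]] = {}

    async def upsert(self, documents: list[Doc], namespace: str = "default") -> None:
        bucket = self._data.setdefault(namespace, {})
        for doc in documents:
            bucket[doc.id] = doc  # insert or overwrite by id

    async def search(
        self, query: list[float], top_k: int = 5, namespace: str = "default"
    ) -> list[tuple[Doc, float]]:
        # Brute force: score every document in the namespace, then sort.
        bucket = self._data.get(namespace, {})
        scored = [(doc, _cosine(query, doc.embedding)) for doc in bucket.values()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    async def delete(self, ids: list[str], namespace: str = "default") -> None:
        bucket = self._data.get(namespace, {})
        for doc_id in ids:
            bucket.pop(doc_id, None)


async def _demo() -> tuple[list[tuple[Doc, float]], list[tuple[Doc, float]]]:
    store = TinyInMemoryStore()
    await store.upsert([Doc("a", [1.0, 0.0]), Doc("b", [0.0, 1.0])])
    await store.upsert([Doc("c", [1.0, 0.0])], namespace="other")
    best = await store.search([1.0, 0.1], top_k=1)
    await store.delete(["a"])
    remaining = await store.search([1.0, 0.1])
    return best, remaining


best, remaining = asyncio.run(_demo())
```

Every search scans the whole namespace, so cost grows linearly with the number of documents, which fits the dev/testing use the commit describes.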
…rStoreRegistry
Add three external vector store backends that subclass BaseVectorStore and wrap their respective SDKs (chromadb, pinecone, qdrant-client), with graceful ImportError handling when SDKs are not installed. Add VectorStoreRegistry for named store instance management. All 28 new tests mock external SDKs and pass alongside existing suite.

…p pipeline steps, and add optional deps
- Task 12: Export all vectorstores types, base classes, and registry from __init__.py
- Task 13: Add EmbeddingStep and RetrievalStep to pipeline steps module for embedding text and retrieving from vector stores within DAG pipelines
- Task 14: Add optional dependency groups for embeddings, provider-specific embeddings (openai, cohere, google, mistral, voyage), and vector store backends (chroma, pinecone, qdrant) in pyproject.toml

…ctorstores
Remove unused VectorStoreConnectionError/VectorStoreError imports from external vector store backends and apply consistent formatting.

…stores
- Fix ChromaVectorStore._delete() to respect namespace via where clause
- Fix base.py search_text() to raise VectorStoreError instead of ValueError
- Fix AzureEmbedder: make azure_endpoint required (no empty default)
- Fix MistralEmbedder: pass api_key directly (no empty string fallback)
- Add all provider exports to embeddings/providers/__init__.py
- Add external backend exports to vectorstores/__init__.py
- Add embeddings/vectorstores to top-level __init__.py exports
- Add proper type hints (EmbeddingProtocol, VectorStoreProtocol) to pipeline steps
- Add missing pyproject.toml optional deps (azure, bedrock, ollama embeddings)
- Complete the 'all' extra with all embedding and vectorstore groups
- Add docs/embeddings.md and docs/vectorstores.md user guides
- Update docs/README.md with Embeddings & Vector Stores section
- Fix lint issues in test files (unused imports, unused variable)
async def test_upsert(self, mock_point_struct, mock_client_cls):
    mock_client = AsyncMock()
    mock_client_cls.return_value = mock_client
    mock_point_struct.side_effect = lambda **kw: MagicMock(**kw)
    return result


def _match_filter(doc: VectorDocument, f: SearchFilter) -> bool:
Comment on lines +10 to +18
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointIdsList,
    PointStruct,
    VectorParams,
)
class VectorStoreProtocol(Protocol):
    """Structural protocol for vector stores."""

    async def upsert(self, documents: list[VectorDocument], namespace: str = "default") -> None: ...

        top_k: int = 5,
        namespace: str = "default",
        filters: list[SearchFilter] | None = None,
    ) -> list[SearchResult]: ...

        top_k: int = 5,
        namespace: str = "default",
        filters: list[SearchFilter] | None = None,
    ) -> list[SearchResult]: ...

        filters: list[SearchFilter] | None = None,
    ) -> list[SearchResult]: ...

    async def delete(self, ids: list[str], namespace: str = "default") -> None: ...
            raise VectorStoreError(f"Delete failed: {exc}") from exc

    @abstractmethod
    async def _upsert(self, documents: list[VectorDocument], namespace: str) -> None: ...

        top_k: int,
        namespace: str,
        filters: list[SearchFilter] | None,
    ) -> list[SearchResult]: ...

    ) -> list[SearchResult]: ...

    @abstractmethod
    async def _delete(self, ids: list[str], namespace: str) -> None: ...
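The fragments above suggest a template-method layout: a public method on the base class delegates to an abstract underscore-prefixed hook and converts backend failures into VectorStoreError. A minimal sketch of that pattern for delete follows; the `raise ... from exc` line mirrors the diff, while the surrounding try/except structure and the `FailingStore` subclass are assumptions for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class VectorStoreError(Exception):
    """Raised when a vector store operation fails (name taken from the PR)."""


class BaseVectorStore(ABC):
    async def delete(self, ids: list[str], namespace: str = "default") -> None:
        """Public entry point: delegates to the backend hook and wraps errors."""
        try:
            await self._delete(ids, namespace)
        except Exception as exc:
            raise VectorStoreError(f"Delete failed: {exc}") from exc

    @abstractmethod
    async def _delete(self, ids: list[str], namespace: str) -> None: ...


class FailingStore(BaseVectorStore):
    """Hypothetical backend whose SDK call always fails."""

    async def _delete(self, ids: list[str], namespace: str) -> None:
        raise RuntimeError("backend unavailable")


try:
    asyncio.run(FailingStore().delete(["doc-1"]))
    caught = None
except VectorStoreError as err:
    caught = err
```

Callers then handle a single exception type regardless of backend, and the original SDK error stays reachable through `__cause__`.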
…ores
- Add type: ignore[import-untyped] for optional SDK imports (google, voyage, chromadb, pinecone, qdrant) not installed in CI
- Guard Cohere response.embeddings and .float_ against None
- Filter out None embeddings in Mistral response
- Add type: ignore[union-attr] for genai.embed_content (guarded in __init__)
- Add type: ignore[misc] for Qdrant model constructors (guarded in __init__)
ancongui added a commit that referenced this pull request on May 31, 2026
* feat(embeddings): add exceptions and config fields for embeddings and vector stores

* feat(embeddings): add embedding types and similarity utility functions

  Add core data models (EmbeddingResult, EmbeddingUsage) using Pydantic v2 and pure-Python vector math functions (cosine_similarity, dot_product, euclidean_distance) with no external dependencies. Includes full test coverage for both modules.

* feat(embeddings): add embedding protocol and base class with auto-batching

  Introduce the core extensibility layer for the embeddings system:
  - EmbeddingProtocol: runtime-checkable Protocol for duck typing
  - BaseEmbedder: abstract base class with auto-batching, error wrapping, and configurable batch_size/max_retries from framework config

* feat(embeddings): add embedder registry for named embedder instances

  Simple registry following the ToolRegistry pattern, supporting register, get, unregister, and list_names operations keyed by name.

* feat(embeddings): add OpenAI embedding provider

  Implement OpenAIEmbedder that subclasses BaseEmbedder to generate embeddings via the OpenAI API. Supports text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002 models with optional output dimensions. Includes full test coverage with mocked AsyncOpenAI client.

* feat(embeddings): add Cohere, Google, Mistral, Ollama, Bedrock, Azure, and Voyage embedding providers

  Add 7 embedding providers subclassing BaseEmbedder with try/except SDK imports for safe module loading when packages aren't installed. Each provider wraps errors in EmbeddingProviderError. Includes full test coverage with mocked SDK calls for all providers.

* feat(embeddings): wire up public API exports in embeddings __init__.py

  Export all core symbols (BaseEmbedder, EmbeddingProtocol, EmbedderRegistry, EmbeddingResult, EmbeddingUsage, cosine_similarity, dot_product, euclidean_distance) so users can import directly from fireflyframework_genai.embeddings. Add test verifying public API imports.

* feat(vectorstores): add vector store types, protocol/base class, and in-memory backend

  Implements Tasks 8-10: the vector store core module with VectorDocument, SearchResult, SearchFilter types, BaseVectorStore ABC with auto-embedding and error wrapping, VectorStoreProtocol for duck typing, and a brute-force cosine similarity InMemoryVectorStore for dev/testing use.

* feat(vectorstores): add ChromaDB, Pinecone, Qdrant backends and VectorStoreRegistry

  Add three external vector store backends that subclass BaseVectorStore and wrap their respective SDKs (chromadb, pinecone, qdrant-client), with graceful ImportError handling when SDKs are not installed. Add VectorStoreRegistry for named store instance management. All 28 new tests mock external SDKs and pass alongside existing suite.

* feat: wire up vectorstores public API, add EmbeddingStep/RetrievalStep pipeline steps, and add optional deps

  - Task 12: Export all vectorstores types, base classes, and registry from __init__.py
  - Task 13: Add EmbeddingStep and RetrievalStep to pipeline steps module for embedding text and retrieving from vector stores within DAG pipelines
  - Task 14: Add optional dependency groups for embeddings, provider-specific embeddings (openai, cohere, google, mistral, voyage), and vector store backends (chroma, pinecone, qdrant) in pyproject.toml

* style: fix lint errors and apply ruff formatting to embeddings and vectorstores

  Remove unused VectorStoreConnectionError/VectorStoreError imports from external vector store backends and apply consistent formatting.

* fix: resolve audit issues and add documentation for embeddings/vectorstores

  - Fix ChromaVectorStore._delete() to respect namespace via where clause
  - Fix base.py search_text() to raise VectorStoreError instead of ValueError
  - Fix AzureEmbedder: make azure_endpoint required (no empty default)
  - Fix MistralEmbedder: pass api_key directly (no empty string fallback)
  - Add all provider exports to embeddings/providers/__init__.py
  - Add external backend exports to vectorstores/__init__.py
  - Add embeddings/vectorstores to top-level __init__.py exports
  - Add proper type hints (EmbeddingProtocol, VectorStoreProtocol) to pipeline steps
  - Add missing pyproject.toml optional deps (azure, bedrock, ollama embeddings)
  - Complete the 'all' extra with all embedding and vectorstore groups
  - Add docs/embeddings.md and docs/vectorstores.md user guides
  - Update docs/README.md with Embeddings & Vector Stores section
  - Fix lint issues in test files (unused imports, unused variable)

* fix: resolve all pyright type-check errors in embeddings and vectorstores

  - Add type: ignore[import-untyped] for optional SDK imports (google, voyage, chromadb, pinecone, qdrant) not installed in CI
  - Guard Cohere response.embeddings and .float_ against None
  - Filter out None embeddings in Mistral response
  - Add type: ignore[union-attr] for genai.embed_content (guarded in __init__)
  - Add type: ignore[misc] for Qdrant model constructors (guarded in __init__)
Summary
- search_text convenience method, namespace scoping, and a store registry
- EmbeddingStep and RetrievalStep for RAG workflows
- docs/embeddings.md and docs/vectorstores.md user guides
- pyproject.toml with all optional dependency groups

Test plan