This destination does three things:
- Splits records into chunks and separates metadata from text data
- Embeds the text data into embedding vectors
- Stores the metadata and embedding vectors in a vector database
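The three steps above can be sketched in plain Python. The record shape, chunk size, and fake embedding below are purely illustrative assumptions, not the destination's actual code:

```python
# Illustrative sketch of the split -> embed -> store flow
# (hypothetical record shape, not the destination's implementation).

def process_record(record: dict, chunk_size: int = 20) -> list[dict]:
    # 1. Separate metadata from text data and split the text into chunks.
    text = record["text"]
    metadata = {k: v for k, v in record.items() if k != "text"}
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    documents = []
    for chunk in chunks:
        # 2. Embed each chunk (a fake fixed-size vector stands in for a real model).
        embedding = [float(len(chunk)), float(sum(map(ord, chunk)) % 100)]
        # 3. Keep metadata and embedding together as one vector-store entry.
        documents.append({"metadata": metadata, "text": chunk, "embedding": embedding})
    return documents

docs = process_record({"id": 1, "author": "a", "text": "x" * 45})
print(len(docs))  # 3 -- 45 characters split into chunks of at most 20
```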
Record processing uses the text splitter components from https://python.langchain.com/docs/modules/data_connection/document_transformers/.
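In spirit, a character-based splitter with overlap works like this. This is a simplified stdlib sketch of the idea, not langchain's actual implementation:

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Minimal character splitter with overlapping windows, in the spirit of
    langchain's text splitters (simplified sketch, not the library code)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap keeps context that straddles a chunk boundary available in both neighboring chunks, which tends to improve retrieval quality.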
There are two providers for generating embeddings: OpenAI, and Fake embeddings for testing purposes.
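A fake embedding provider only needs to be deterministic and fixed-dimensional so the pipeline can be exercised without API calls. A possible stand-in (illustrative only, not langchain's FakeEmbeddings implementation) could look like:

```python
import hashlib

def fake_embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic stand-in for a real embedding provider: hash the text
    # and scale the first `dim` bytes into floats. Useful for tests because
    # the same input always produces the same vector.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

v1, v2 = fake_embed("hello"), fake_embed("hello")
print(v1 == v2, len(v1))  # True 8
```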
Embedded documents are stored in a vector database. Currently, Pinecone and the locally stored DocArrayHnswSearch are supported.
For all three components, new integrations can easily be added on top of the existing abstractions of the langchain library. In some cases (like the Pinecone integration), it is necessary to use the underlying APIs directly to support more features or improve performance.
The Pinecone integration adds the stream name and primary key to the vector metadata, which allows for deduped incremental and full refresh syncs. It uses the official Pinecone Python client.
You can use the test_pinecone.py file to check whether the pipeline works as expected.
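The role of the stream name and primary key in deduplication can be sketched in plain Python. The field names and dictionary shapes here are illustrative assumptions, not the integration's actual schema:

```python
# Sketch of dedup via stream + primary key in the vector metadata
# (hypothetical field names; not the destination's actual code).

index: dict[str, dict] = {}  # stands in for the Pinecone index

def upsert(stream: str, primary_key: str, embedding: list[float]) -> None:
    # Deriving the vector id from stream + primary key means a re-synced
    # record overwrites its previous version instead of duplicating it.
    vector_id = f"{stream}_{primary_key}"
    index[vector_id] = {
        "values": embedding,
        "metadata": {"_ab_stream": stream, "_ab_record_id": primary_key},
    }

upsert("users", "42", [0.1, 0.2])
upsert("users", "42", [0.3, 0.4])  # same record synced again
print(len(index))  # 1 -- the second upsert replaced the first
```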
The DocArrayHnswSearch integration stores the vector metadata in a local file in the local root (/local in the container, /tmp/airbyte_local on the host). It is not possible to dedupe records, so only full refresh syncs are supported. DocArrayHnswSearch uses hnswlib under the hood, but the integration relies fully on the langchain abstraction.
The Chroma integration stores the vector metadata in a local file in the local root (/local in the container, /tmp/airbyte_local on the host), similar to DocArrayHnswSearch. This is called the "persistent client" mode in Chroma. The integration mostly uses langchain's abstraction, but it can also dedupe records and reset streams independently.
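Keeping the stream name in the metadata is also what makes per-stream resets possible: one stream's vectors can be wiped without touching the others. A minimal sketch (illustrative shapes only; the real integration goes through Chroma's API):

```python
# Sketch of resetting a single stream by filtering on stream metadata
# (hypothetical field names, not the integration's actual code).

store = [
    {"id": "users_1", "metadata": {"_ab_stream": "users"}},
    {"id": "orders_1", "metadata": {"_ab_stream": "orders"}},
]

def reset_stream(store: list[dict], stream: str) -> list[dict]:
    # Drop every vector written by the given stream, keep the rest.
    return [v for v in store if v["metadata"]["_ab_stream"] != stream]

store = reset_stream(store, "users")
print([v["id"] for v in store])  # ['orders_1']
```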
You can use the test_local.py file to check whether the pipeline works as expected.