🎉 New Destination Vector Database (powered by LangChain) #26184
Conversation
Before Merging a Connector Pull Request
Wow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
I think this is a great start
My main advice would be to keep in mind how this should be structured to achieve maximum throughput, since the bandwidths we're currently hitting seem quite poor. Not something for v1, but probably an important thing to keep in mind.
I noticed a couple of things seem hard-bound to specific providers (e.g. OpenAI embeddings); it would probably make our life easier if we distinguished between the generic parts and the concrete implementations.
Also we should have way more unit tests ;)
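To make the unit-test point concrete, here is a minimal pytest-style sketch of the kind of coverage meant here (the splitter settings are illustrative, not the connector's actual configuration):

```python
# Illustrative test: a long text field should be split into multiple bounded chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter


def test_long_text_is_split_into_bounded_chunks():
    splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
    chunks = splitter.split_text("lorem ipsum dolor sit amet " * 50)
    assert len(chunks) > 1
    assert all(len(chunk) <= 100 for chunk in chunks)
```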
schema_extra = {"group": "processing"}

class EmbeddingConfigModel(BaseModel):
should probably be called OpenAI embedding Config or something
I added a second option for embeddings (fake embeddings that can be used to test the pipeline) and generalized these parts
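For illustration, the generic/concrete split could look roughly like this (hypothetical class names, not necessarily the PR's exact structure):

```python
# Sketch: a provider-agnostic Embedder interface with a real OpenAI implementation
# and a fake one for testing the pipeline without API calls.
from abc import ABC, abstractmethod
from typing import List

from langchain.embeddings.fake import FakeEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings


class Embedder(ABC):
    @abstractmethod
    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        ...

    @property
    @abstractmethod
    def embedding_dimensions(self) -> int:
        ...


class OpenAIEmbedder(Embedder):
    def __init__(self, api_key: str):
        self.embeddings = OpenAIEmbeddings(openai_api_key=api_key)

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        return self.embeddings.embed_documents(texts)

    @property
    def embedding_dimensions(self) -> int:
        return 1536  # fixed vector size of text-embedding-ada-002


class FakeEmbedder(Embedder):
    def __init__(self):
        self.embeddings = FakeEmbeddings(size=1536)

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        return self.embeddings.embed_documents(texts)

    @property
    def embedding_dimensions(self) -> int:
        return 1536
```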
import re

class ProcessingConfigModel(BaseModel):
is spec.json generated from this file?
It's generated on the fly; I removed the old spec.json.
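As a rough sketch of what "generated on the fly" means with pydantic (the field names here are illustrative):

```python
# The connector spec's JSON schema can be derived from the pydantic config model
# instead of maintaining a hand-written spec.json.
import json

from pydantic import BaseModel, Field


class ProcessingConfigModel(BaseModel):
    chunk_size: int = Field(..., title="Chunk size", description="Size of text chunks in characters")

    class Config:
        schema_extra = {"group": "processing"}


class ConfigModel(BaseModel):
    processing: ProcessingConfigModel


if __name__ == "__main__":
    # This schema is what ends up embedded in the connector specification.
    print(json.dumps(ConfigModel.schema(), indent=2))
```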
    ) -> Iterable[AirbyteMessage]:
        config_model = ConfigModel.parse_obj(config)
        self._init_indexer(config_model)
        self.processor = Processor(config_model.processing, configured_catalog, max_metadata_size=self.indexer.max_metadata_size)
nit: the name `Processor` is ambiguous
renamed to `document_processor`
        return relevant_fields

    def _extract_metadata(self, record: AirbyteRecordMessage) -> dict:
        metadata = record.data.copy()
this might be heavy performance-wise (a problem for tomorrow, though)
            relevant_fields = record.data
        return relevant_fields

    def _extract_metadata(self, record: AirbyteRecordMessage) -> dict:
If I'm reading this correctly, metadata is everything in the input record that isn't a text/data field?
right, we could also make it configurable. Technically you could use the column selection feature for this.
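Roughly, the split being discussed amounts to something like this (an illustrative helper, not the connector's actual code):

```python
# Everything that is not one of the configured text fields becomes metadata.
from typing import Dict, List, Tuple


def split_record(data: Dict, text_fields: List[str]) -> Tuple[str, Dict]:
    text = "\n".join(str(data[field]) for field in text_fields if field in data)
    metadata = {key: value for key, value in data.items() if key not in text_fields}
    return text, metadata


assert split_record({"title": "hello", "views": 3}, ["title"]) == ("hello", {"views": 3})
```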
@@ -0,0 +1,26 @@
import time

def measure_time(func):
neat use of decorators
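For reference, a decorator along these lines could look like the following generic sketch (not the PR's exact implementation):

```python
import time
from functools import wraps


def measure_time(func):
    """Print how long the wrapped function took, even if it raises."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {time.perf_counter() - start:.3f}s")
    return wrapper


@measure_time
def slow_add(a, b):
    time.sleep(0.1)
    return a + b
```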
METADATA_STREAM_FIELD = "_airbyte_stream"
METADATA_NATURAL_ID_FIELD = "_natural_id"

class Processor:
naive question. Why do we need this if embedding APIs can tokenize themselves?
Tokenization is just turning UTF-8-encoded text into a sequence of token representations (effectively one-hot vectors), and as you say the embedding API takes care of that: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
What the processor is doing is turning the records into "documents":
- Split them up appropriately
- Separate text and metadata
- Add bookkeeping metadata for incremental sync (id and stream name)
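Put together, the processing step is roughly the following (a sketch with assumed names; langchain's splitter does the actual chunking):

```python
# Turn one Airbyte record into one or more langchain Documents, keeping non-text
# fields as metadata plus the bookkeeping fields needed for incremental sync.
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

METADATA_STREAM_FIELD = "_airbyte_stream"
METADATA_NATURAL_ID_FIELD = "_natural_id"


def record_to_documents(data: dict, stream: str, record_id: str, text_fields: list, chunk_size: int = 1000):
    text = "\n".join(str(data[field]) for field in text_fields if field in data)
    metadata = {key: value for key, value in data.items() if key not in text_fields}
    metadata[METADATA_STREAM_FIELD] = stream
    metadata[METADATA_NATURAL_ID_FIELD] = record_id
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    return splitter.split_documents([Document(page_content=text, metadata=metadata)])
```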
    @property
    def embedding_dimensions(self) -> int:
        return 1536
why?
That's the size of the embedding vectors produced by the OpenAI model; added a comment to explain.
destination-langchain test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/destination-langchain/metadata.yaml | ❌ |
Connector version semver check | ✅ |
Connector version increment check | ✅ |
QA checks | ❌ |
Code format checks | ❌ |
Connector package install | ❌ |
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-langchain test
destination-langchain test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/destination-langchain/metadata.yaml | ✅ |
Connector version semver check | ✅ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build destination-langchain docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-langchain test
Just a couple of comments here and there while taking a break. I can recheck later if you want
@maxi297 Thanks for your review, addressed your points. Could you take another look?
destination-langchain test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/destination-langchain/metadata.yaml | ✅ |
Connector version semver check | ✅ |
QA checks | ✅ |
Code format checks | ❌ |
Connector package install | ✅ |
Build destination-langchain docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-langchain test
destination-langchain test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/destination-langchain/metadata.yaml | ✅ |
Connector version semver check | ✅ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build destination-langchain docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-langchain test
""" | ||
Generate documents from records. | ||
:param records: List of AirbyteRecordMessages | ||
:return: Tuple of (List of document chunks, Natural id to delete if applicable) |
can you comment, either here or in destination.py, what the IDs to delete are about? It's not clear what natural ids are
It's an identifier of a record so we can remove documents belonging to a record before inserting new documents when in append-dedup mode. As one record might be split across multiple documents, it has to be made part of the metadata. It's leveraged here:
airbyte/airbyte-integrations/connectors/destination-langchain/destination_langchain/indexer.py
Line 75 in b3b556f
self.pinecone_index.delete(filter={METADATA_NATURAL_ID_FIELD: {"$in": delete_ids}})
I renamed it to record_id instead of natural id to make it easier to understand and added a comment.
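In other words, the append-dedup write path boils down to something like this sketch (the index name, helper names, and chunk-ID scheme are assumptions; only the delete filter mirrors the snippet above):

```python
# Delete every chunk that belongs to an incoming record, then upsert the new chunks.
import pinecone

METADATA_RECORD_ID_FIELD = "_record_id"

pinecone.init(api_key="...", environment="...")  # placeholders
index = pinecone.Index("my-index")  # hypothetical index name


def write_batch(chunks, embeddings, delete_ids):
    # 1) remove all documents belonging to records we are about to re-insert
    if delete_ids:
        index.delete(filter={METADATA_RECORD_ID_FIELD: {"$in": delete_ids}})
    # 2) insert the freshly embedded document chunks with their metadata
    vectors = [
        (f"{chunk.metadata[METADATA_RECORD_ID_FIELD]}_{i}", embedding, chunk.metadata)
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ]
    index.upsert(vectors=vectors)
```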
        try:
            self.embeddings.embed_query("test")
        except Exception as e:
            return str(e)
it's generally better to use `repr` than `str` because it returns the "printable representation of the object".
Example with a ValueError:
>>> try:
... raise ValueError("hello")
... except Exception as e:
... print(str(e))
...
hello
>>> try:
... raise ValueError("hello")
... except Exception as e:
... print(repr(e))
...
ValueError('hello')
That being said, we should be able to do even better and return the stacktrace. Here's how we do it in sources
honestly I feel like in this case, `str(e)` plus a stacktrace is more readable? The `ValueError` piece makes it harder to read for an end user
Good call, added this
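For reference, combining the plain message with a stacktrace could look like this generic sketch (not necessarily the exact helper used in sources):

```python
import traceback
from typing import Optional


def check_embedding_connection(embeddings) -> Optional[str]:
    """Return None on success, otherwise the plain error message plus the stacktrace."""
    try:
        embeddings.embed_query("test")
        return None
    except Exception as e:
        return f"{e}\n{traceback.format_exc()}"
```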
:::caution

DocArrayHnswSearch is meant to be used on a local workstation and won't work on Kubernetes
doesn't have to be done now, but we'll need a way to disable `DocArrayHnswSearch` before we can add this destination to Cloud
Good callout, how does that work? Similarly to the `*-strict-encrypt` variants on connectors? Or can we adjust the spec somehow?
Awesome @flash1293! Left a couple of comments.
I ran a langchain destination locally per the instructions in the PR description, using `DocArrayHnswSearch` + OpenAI, and got a green sync, but fyi there were a couple of unexpected error messages in the logs:
errors: $.mode: must be a constant value pinecone, $.mode: does not have a value in the enumeration [pinecone], $.pinecone_key: is missing but it is required, $.pinecone_environment: is missing but it is required, $.index: is missing but it is required
errors: $.openai_key: object found, string expected
errors: $.mode: must be a constant value fake, $.mode: does not have a value in the enumeration [fake]
                raise Exception(
                    f"DocArrayHnswSearchIndexer only supports overwrite mode, got {stream.destination_sync_mode} for stream {stream.stream.name}"
                )
        for file in os.listdir(self.config.destination_path):
Looks like there's some asymmetry between this and the PineconeIndexer, which only deletes an index for the specific stream that's being re-synced. Is it worth putting a warning in the connector UI that the files in this directory will be deleted before every sync?
    @property
    def embedding_dimensions(self) -> int:
        # vector size produced by text-embedding-ada-002 model
Should the vector size and/or model be configurable? Is `text-embedding-ada-002` the default? I'm not seeing it mentioned anywhere else.
It is mentioned in the documentation and as a description as part of the spec:
airbyte/airbyte-integrations/connectors/destination-langchain/destination_langchain/config.py
Line 44 in 0a571f1
"description": "Use the OpenAI API to embed text. This option is using the text-embedding-ada-002 model with 1536 embedding dimensions."

* Embedding - convert the text into a vector representation using a pre-trained model (currently only OpenAI `text-embedding-ada-002` is supported)
I think it makes sense to start like this (OpenAI recently deprecated all other embedding models, as this one is preferable in all situations). We probably want to add other embeddings at a later stage, though those might have different numbers of dimensions (the number of dimensions is an intrinsic part of the embedding model).
I've noticed that too, but for me those show up on other syncs as well (e.g. this one is from Slack to local JSON), so I didn't think it was a problem.
destination-langchain test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/destination-langchain/metadata.yaml | ✅ |
Connector version semver check | ✅ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build destination-langchain docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-langchain test
code looks good to me 🎸
be sure to go through and update the PR checklist before merging https://github.com/airbytehq/airbyte/actions/runs/5518023163/jobs/10061428323?pr=26184
* required for the testing need to go to `TEST_REQUIREMENTS` list

### Publishing a new version of the connector

You've checked out the repo, implemented a million dollar feature, and you're ready to share your changes with the world. Now what?
💰
@@ -0,0 +1,42 @@
#
do we want to add acceptance tests for this connector?
destination-langchain test report (commit
Step | Result |
---|---|
Validate airbyte-integrations/connectors/destination-langchain/metadata.yaml | ✅ |
Connector version semver check | ✅ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build destination-langchain docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-langchain test
Excellent work on this integration! It would be great to see this destination appear in the Connectors list of Airbyte. Consider categorizing this destination under 'LangChain' and 'Pinecone' since Pinecone is unlocked thanks to LangChain.
Thanks for the feedback @slavakurilyak, we have a bunch of plans in this direction, stay tuned!
…6184)

* basic version
* polish
* iterate
* keep working
* fix spec
* wip
* improve destination
* basic unit tests
* move embedding dimensionality into embedder
* improve several things
* adjust documentation
* remove unnecessary call
* add some debug information
* fix local destination
* various small fixes
* bring tests into order
* document and add batching to pinecone
* checklist
* improve performance a bit and add test
* fix formatting
* fix metadata
* install C++ 11 on python base
* no more alpine for ci-connector-ops
* remove hard-to-run test
* more documentation
* better documentation
* add icon
* some small adjustments
* review comments
* format
* review comments

---------

Co-authored-by: alafanechere <augustin.lafanechere@gmail.com>
Co-authored-by: Augustin <augustin@airbyte.io>
How to test
In `airbyte-integrations/connectors/destination-langchain`, build the connector image:
docker build . -t airbyte/destination-langchain:dev
Then configure the destination in Airbyte using the image `airbyte/destination-langchain`, image tag is `dev`.
What
Closes #27821
Closes #27971
This PR adds the "Langchain" destination, which combines the langchain helpers for processing, embedding and indexing into a unified destination for various vector stores. It takes care of the embedding on the fly and makes it easy to get started with building integrations on top of langchain.
https://www.loom.com/share/db63e997d01d49118bf382dff876f7c1
How
General class setup implementing the main functionality:
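The class diagram isn't reproduced here; as a rough, hypothetical sketch of the composition described above (document processor + embedder + indexer, with batching in between):

```python
# Sketch: the destination streams records through a processor (records -> document
# chunks + record ids to delete), an embedder (chunks -> vectors) and an indexer
# (vectors + metadata -> vector store such as Pinecone or DocArrayHnswSearch).
class LangchainDestinationSketch:
    def __init__(self, processor, embedder, indexer, batch_size: int = 100):
        self.processor = processor
        self.embedder = embedder
        self.indexer = indexer
        self.batch_size = batch_size

    def write(self, records):
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) >= self.batch_size:
                self._flush(batch)
                batch = []
        if batch:
            self._flush(batch)

    def _flush(self, batch):
        chunks, delete_ids = self.processor.process(batch)
        embeddings = self.embedder.embed_texts([chunk.page_content for chunk in chunks])
        self.indexer.index(chunks, embeddings, delete_ids)
```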
Unclear parts / TODOs
New Connector
Todos
- Integration tests pass: `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`
- Connector version is `0.0.1` (`Dockerfile` has version `0.0.1`)
- Documentation: `README.md`, `bootstrap.md`. See description and examples
- `docs/integrations/<source or destination>/<name>.md` including changelog with an entry for the initial version. See changelog example
- `docs/integrations/README.md`