Add support for CrateDB to LangChain LLM framework #1
base: release-v0.2.3
Conversation
Initial commit of rl_chain code
docs/docs/modules/data_connection/document_loaders/sqlalchemy.mdx
libs/langchain/tests/integration_tests/document_loaders/test_sqlalchemy_cratedb.py
The implementation is based on the generic `SQLChatMessageHistory`.
When no embeddings were added upfront, the runtime model factory was not able to derive the vector dimension size, because the SQLAlchemy models had not been initialized correctly.
From now on, _all_ instances of SQLAlchemy model types are created at runtime through the `ModelFactory` utility. By using `__table_args__ = {"keep_existing": True}` on the ORM entity definitions, this seems to work well, even with multiple invocations of `CrateDBVectorSearch.from_texts()` using different `collection_name` argument values. While at it, this patch also fixes a few linter errors.
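As a hedged illustration of the `keep_existing` technique described above (the `model_factory` function and `Embedding` model here are invented for this sketch, not taken from the patch):

```python
# Sketch only: shows how `__table_args__ = {"keep_existing": True}` lets a
# runtime model factory re-declare the same table on repeated invocations
# without raising "Table ... is already defined for this MetaData instance".
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

def model_factory(dimensions: int):
    """Build the ORM model at runtime, once parameters are known."""
    class Embedding(Base):
        __tablename__ = "embedding"
        # keep_existing: on a second declaration, reuse the Table object
        # already registered in the metadata instead of raising.
        __table_args__ = {"keep_existing": True}
        id = Column(Integer, primary_key=True)
        document = Column(String)
    return Embedding

# Calling the factory twice must not raise; the Table object is shared.
ModelA = model_factory(3)
ModelB = model_factory(5)
```

With `keep_existing`, the second declaration silently reuses the first table definition, which is what makes repeated `from_texts()` invocations safe.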
When deleting a collection, also delete its associated embeddings.
It is a special adapter which provides similarity search across multiple collections. It cannot be used for indexing documents.
The CrateDB adapter works a bit differently from the pgvector adapter it builds upon: the dimensionality of the vector field must be specified at table creation time, but in LangChain it is a runtime parameter, so table creation needs to be delayed. In some cases, the tables do not exist yet, but this is only relevant when the user requests to pre-delete the collection, using the `pre_delete_collection` argument. So, the error handling happens only there, and _not_ in the generic data model utility functions.
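The delayed-creation logic can be sketched as follows; `FakeStore`, `TableMissingError`, and `initialize` are hypothetical stand-ins for illustration, not the adapter's actual API:

```python
# Sketch: tolerate a missing table only in the pre-delete path, while the
# generic utilities stay strict and propagate errors.
class TableMissingError(Exception):
    """Stand-in for the database's 'relation unknown' error."""

class FakeStore:
    def __init__(self):
        self.tables = set()

    def drop_table(self, name):
        # Generic utility: strict, raises when the table does not exist.
        if name not in self.tables:
            raise TableMissingError(name)
        self.tables.remove(name)

    def create_table(self, name):
        self.tables.add(name)

def initialize(store, name, pre_delete_collection=False):
    if pre_delete_collection:
        try:
            # Table creation is delayed, so it may not exist yet.
            # Only here is that condition expected and tolerated.
            store.drop_table(name)
        except TableMissingError:
            pass
    store.create_table(name)

store = FakeStore()
initialize(store, "embedding", pre_delete_collection=True)  # no table yet: ok
```

The design point is that the "table may not exist" condition is an expected state only at pre-delete time, so swallowing the error anywhere else would mask real bugs.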
…eddings
The performance gains can be substantial.
The test cases can be written substantially more elegantly.
No worries. The patch is constantly being tested by both CI/GHA and real users invoking Jupyter Notebooks at https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/llm-langchain.
@surister: Within the original post, there is a backlog item:
If you want to spend a few cycles here, you may want to have a look and evaluate whether you can contribute by improving the documentation correspondingly. The relevant Jupyter Notebooks for CrateDB are surely not top-notch, or even up to speed, yet. This patch will be around for a while, and everyone is welcome to contribute corresponding refinements. Meanwhile, I will be working on a few relevant patches also needed before upstreaming.
docs/docs/integrations/memory/cratedb_chat_message_history.ipynb
A few more suggestions coming from a self-review.
docs/docs/integrations/memory/cratedb_chat_message_history.ipynb
That documentation may already exist upstream. Please check.
docs/docs/integrations/memory/cratedb_chat_message_history.ipynb
```diff
 def embed_query(self, text: str) -> List[float]:
     """Return consistent embeddings for the text, if seen before, or a constant
     one if the text is unknown."""
-    return self.embed_documents([text])[0]
+    if text not in self.known_texts:
+        return [float(1.0)] * (self.dimensionality - 1) + [float(0.0)]
+    return [float(1.0)] * (self.dimensionality - 1) + [
+        float(self.known_texts.index(text))
+    ]
```
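For context, here is a self-contained sketch of how such deterministic fake embeddings behave. The `ConsistentFakeEmbeddings` class below is an illustrative reconstruction around the diffed method, not necessarily the test suite's exact code:

```python
from typing import List

class ConsistentFakeEmbeddings:
    """Deterministic embeddings for tests (illustrative sketch)."""

    def __init__(self, dimensionality: int = 10):
        self.known_texts: List[str] = []
        self.dimensionality = dimensionality

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Remember each text; its index becomes the last vector component,
        # so the same text always maps to the same vector.
        out = []
        for text in texts:
            if text not in self.known_texts:
                self.known_texts.append(text)
            out.append([1.0] * (self.dimensionality - 1)
                       + [float(self.known_texts.index(text))])
        return out

    def embed_query(self, text: str) -> List[float]:
        # Known texts reuse their stable index; unknown texts get a
        # constant vector ending in 0.0.
        if text not in self.known_texts:
            return [1.0] * (self.dimensionality - 1) + [0.0]
        return [1.0] * (self.dimensionality - 1) + [
            float(self.known_texts.index(text))
        ]

emb = ConsistentFakeEmbeddings(dimensionality=3)
emb.embed_documents(["foo", "bar"])
print(emb.embed_query("bar"))  # [1.0, 1.0, 1.0]
```

The point of the change in the diff is that queries for previously seen texts now reproduce the exact vector stored for the document, instead of re-running `embed_documents` and mutating state.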
If this patch still needs those adjustments, they will need to be vendorized.
Available per sqlalchemy-cratedb. This file can be removed.
libs/community/langchain_community/vectorstores/cratedb/extended.py
The CrateDB SQLAlchemy dialect needs more love, so it was separated from the DBAPI HTTP driver.
About
Discussing the patch to add support for CrateDB to LangChain, to be submitted upstream. Do not merge.
What's inside
- `FLOAT_VECTOR`/`KNN_MATCH` functionality through LangChain's vector store subsystem.
- Documentation
- Notebooks
- Backlog: `FIXME` and `TODO` remarks.

/cc @matriv, @seut, @marijaselakovic, @karynzv: You may also want to review it? Thanks!