Add support for CrateDB to LangChain LLM framework #1

Draft
wants to merge 24 commits into
base: release-v0.2.3

@amotl amotl commented Sep 16, 2023

About

Discussing the patch to add support for CrateDB to LangChain, to be submitted upstream. Do not merge.

What's inside

Documentation

Notebooks

Backlog

/cc @matriv, @seut, @marijaselakovic, @karynzv: You may also want to review this. Thanks!

@amotl amotl force-pushed the cratedb branch 5 times, most recently from 29cf863 to f75a3d7 Compare October 27, 2023 20:39
amotl added 16 commits June 7, 2024 03:23
The implementation is based on the generic `SQLChatMessageHistory`.
When not adding any embeddings upfront, the runtime model factory was
not able to derive the vector dimension size, because the SQLAlchemy
models have not been initialized correctly.
From now on, _all_ instances of SQLAlchemy model types will be created
at runtime through the `ModelFactory` utility.

By using `__table_args__ = {"keep_existing": True}` on the ORM entity
definitions, this seems to work well, even with multiple invocations
of `CrateDBVectorSearch.from_texts()` using different `collection_name`
argument values.

While at it, this patch also fixes a few linter errors.
When deleting a collection, also delete its associated embeddings.
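
A minimal sketch of the idea in the commit message above, assuming a two-table layout where embeddings reference their collection by id. The table names, the column names, and the `delete_collection` helper are hypothetical, and SQLite from the standard library stands in for CrateDB so the example runs without a server.

```python
import sqlite3


def delete_collection(conn: sqlite3.Connection, name: str) -> int:
    """Delete a collection and its embeddings; return the number of embeddings removed."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT id FROM collection WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            return 0
        # Remove the dependent rows first, then the collection itself.
        removed = conn.execute(
            "DELETE FROM embedding WHERE collection_id = ?", (row[0],)
        ).rowcount
        conn.execute("DELETE FROM collection WHERE id = ?", (row[0],))
    return removed


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE embedding (id INTEGER PRIMARY KEY, collection_id INTEGER, document TEXT);
    INSERT INTO collection (id, name) VALUES (1, 'docs'), (2, 'notes');
    INSERT INTO embedding (collection_id, document) VALUES (1, 'a'), (1, 'b'), (2, 'c');
    """
)
removed = delete_collection(conn, "docs")
remaining = conn.execute("SELECT COUNT(*) FROM embedding").fetchone()[0]
```

Doing both deletes explicitly in one transaction keeps the sketch independent of any cascade support in the database.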
It is a special adapter which provides similarity search across multiple
collections. It cannot be used for indexing documents.
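
To illustrate the shape of such an adapter, here is a small in-memory sketch. The class name `MultiCollectionSearch`, its methods, and the cosine scoring are assumptions made for illustration only; the adapter in the patch runs its search in the database rather than in Python.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MultiCollectionSearch:
    """Hypothetical read-only adapter: searches several collections, never indexes."""

    def __init__(self, collections: Dict[str, List[Tuple[str, Vector]]]):
        self.collections = collections

    def add_texts(self, texts: List[str]) -> None:
        # Indexing is deliberately unsupported on this adapter.
        raise NotImplementedError("This adapter cannot be used for indexing documents")

    def similarity_search(self, query: Vector, k: int = 2) -> List[Tuple[str, str]]:
        # Score every document of every collection, then keep the k best overall.
        scored = [
            (cosine_similarity(query, vector), name, text)
            for name, items in self.collections.items()
            for text, vector in items
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(name, text) for _, name, text in scored[:k]]


search = MultiCollectionSearch(
    {
        "docs": [("apple", [1.0, 0.0]), ("banana", [0.0, 1.0])],
        "notes": [("apricot", [0.9, 0.1])],
    }
)
hits = search.similarity_search([1.0, 0.0], k=2)
```

Here `hits` contains one result from each collection, ranked together rather than per collection.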
The CrateDB adapter works a bit differently from the pgvector adapter
it builds upon: the dimensionality of the vector field must be specified
at table creation time, but it is also a runtime parameter in LangChain,
so table creation needs to be delayed.
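
The delayed table creation described above could look roughly like the following sketch. `LazyVectorStore` and its methods are hypothetical names, and SQLite stands in for CrateDB; because SQLite has no vector type, the sketch emulates a fixed-size vector field with one `REAL` column per dimension, so that the DDL really depends on the dimensionality.

```python
import sqlite3
from typing import List, Optional


class LazyVectorStore:
    """Hypothetical store that creates its table only once the vector size is known."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.dimensions: Optional[int] = None  # unknown until the first vectors arrive

    def table_exists(self) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'embedding'"
        ).fetchone()
        return row is not None

    def _create_table(self, dimensions: int) -> None:
        # The dimensionality is baked into the DDL, so it must be known here.
        columns = ", ".join(f"v{i} REAL NOT NULL" for i in range(dimensions))
        self.conn.execute(f"CREATE TABLE embedding (document TEXT, {columns})")
        self.dimensions = dimensions

    def add_embeddings(self, texts: List[str], vectors: List[List[float]]) -> None:
        if not vectors:
            return
        if self.dimensions is None:
            self._create_table(len(vectors[0]))
        if any(len(vector) != self.dimensions for vector in vectors):
            raise ValueError("all vectors must match the table's dimensionality")
        placeholders = ", ".join("?" for _ in range(self.dimensions + 1))
        with self.conn:
            self.conn.executemany(
                f"INSERT INTO embedding VALUES ({placeholders})",
                [(text, *vector) for text, vector in zip(texts, vectors)],
            )


store = LazyVectorStore(sqlite3.connect(":memory:"))
exists_before = store.table_exists()
store.add_embeddings(["foo", "bar"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
exists_after = store.table_exists()
```

Constructing the store does not touch the schema; the table only appears with the first call to `add_embeddings`.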

In some cases, the tables do not exist yet, but this is only relevant
for the case when the user requests to pre-delete the collection, using
the `pre_delete_collection` argument. So, do the error handling only
there instead, and _not_ on the generic data model utility functions.
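
As a rough sketch of that split, the generic helper below stays free of error handling, and only the pre-delete path tolerates missing tables. The function names and the single-table layout are hypothetical, and SQLite stands in for CrateDB; matching on the "no such table" message is specific to SQLite and would look different against another database.

```python
import sqlite3


def delete_collection(conn: sqlite3.Connection, name: str) -> None:
    """Generic helper without error handling: a missing table raises."""
    with conn:
        conn.execute("DELETE FROM embedding WHERE collection = ?", (name,))


def prepare_store(
    conn: sqlite3.Connection, name: str, pre_delete_collection: bool = False
) -> None:
    """Hypothetical setup step that runs before any embeddings are added."""
    if not pre_delete_collection:
        return
    try:
        delete_collection(conn, name)
    except sqlite3.OperationalError as exc:
        # Tables are created lazily, so there may be nothing to delete yet.
        if "no such table" not in str(exc):
            raise


conn = sqlite3.connect(":memory:")
prepare_store(conn, "docs", pre_delete_collection=True)  # tables missing: tolerated

conn.execute("CREATE TABLE embedding (collection TEXT, document TEXT)")
conn.execute("INSERT INTO embedding VALUES ('docs', 'a'), ('notes', 'b')")
conn.commit()
prepare_store(conn, "docs", pre_delete_collection=True)  # tables present: rows deleted
remaining = conn.execute("SELECT collection FROM embedding").fetchall()
```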
…eddings

The performance gains can be substantial.
The test cases can be written substantially more elegantly.
@amotl amotl changed the base branch from release-v0.2.2 to cratedb-v0.2.3 June 7, 2024 01:33
@amotl amotl closed this Jun 7, 2024
@amotl amotl reopened this Jun 7, 2024
@amotl amotl changed the base branch from cratedb-v0.2.3 to release-v0.2.3 June 7, 2024 01:37

amotl commented Jun 7, 2024

No worries. The patch is continuously tested, both by CI/GHA and by real users running the Jupyter Notebooks at https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/llm-langchain.

amotl commented Jun 7, 2024

@surister: Within the original post, there is a backlog item:

Bring documentation up to speed. Have a look at those blueprints: [...]

If you want to spend a few cycles here, you may want to have a look and evaluate whether you can contribute by improving the documentation correspondingly. Relevant Jupyter Notebooks for CrateDB are surely not top-notch, or even up-to-speed, yet. This patch will be there for a while and everyone is welcome to contribute corresponding refinements.

In the meantime, I will be working on a few relevant patches that are also needed before upstreaming.

@amotl amotl left a comment

A few more suggestions coming from a self-review.

docs/docs/integrations/document_loaders/cratedb.ipynb Outdated

That documentation may already exist upstream. Please check.

docs/docs/integrations/vectorstores/cratedb.ipynb Outdated
Comment on lines 52 to +59
     def embed_query(self, text: str) -> List[float]:
         """Return consistent embeddings for the text, if seen before, or a constant
         one if the text is unknown."""
-        return self.embed_documents([text])[0]
+        if text not in self.known_texts:
+            return [float(1.0)] * (self.dimensionality - 1) + [float(0.0)]
+        return [float(1.0)] * (self.dimensionality - 1) + [
+            float(self.known_texts.index(text))
+        ]

If this patch still needs those adjustments, they will need to be vendored.
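
For reference, a self-contained sketch of what a vendored variant might look like. The class name `ConsistentFakeEmbeddingsSketch` is made up, and the `embed_documents` method is an assumption about the surrounding class (it is not part of the hunk above); only `embed_query` follows the reviewed lines.

```python
from typing import List


class ConsistentFakeEmbeddingsSketch:
    """Hypothetical stand-alone fake embeddings for tests; not the upstream class."""

    def __init__(self, dimensionality: int = 10) -> None:
        self.known_texts: List[str] = []
        self.dimensionality = dimensionality

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Assumed behaviour: unseen texts are registered, and the last
        # vector component is the text's position in `known_texts`.
        vectors = []
        for text in texts:
            if text not in self.known_texts:
                self.known_texts.append(text)
            vectors.append(
                [1.0] * (self.dimensionality - 1)
                + [float(self.known_texts.index(text))]
            )
        return vectors

    def embed_query(self, text: str) -> List[float]:
        # Unknown texts get a constant vector and are NOT registered.
        if text not in self.known_texts:
            return [1.0] * (self.dimensionality - 1) + [0.0]
        return [1.0] * (self.dimensionality - 1) + [
            float(self.known_texts.index(text))
        ]


fake = ConsistentFakeEmbeddingsSketch(dimensionality=3)
documents = fake.embed_documents(["foo", "bar"])
known = fake.embed_query("bar")
unknown = fake.embed_query("baz")
```

Under these assumptions, the difference to delegating to `embed_documents` is that querying an unknown text leaves `known_texts` untouched.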

Available through sqlalchemy-cratedb. This file can be removed.

Available through sqlalchemy-cratedb. This file can be removed.

docs/docs/integrations/providers/cratedb.mdx Outdated
The CrateDB SQLAlchemy dialect needs more love, so it was separated from
the DBAPI HTTP driver.

3 participants