Add support for CrateDB to LangChain LLM framework #1

amotl · 2023-09-16T18:10:01Z

About

Discussing the patch to add support for CrateDB to LangChain, to be submitted upstream. Do not merge.

What's inside

Support for CrateDB's FLOAT_VECTOR / KNN_MATCH functionality through LangChain's vector store subsystem.
Support for loading documents from CrateDB through LangChain's document loader subsystem.
Support for managing "chat" history using LangChain's conversational memory subsystem.

Documentation

integrations/providers/cratedb.mdx

Notebooks

Backlog

See comments below.
Review all FIXME and TODO remarks.
Bring documentation up to speed. Have a look at those blueprints:

/cc @matriv, @seut, @marijaselakovic, @karynzv: You may also want to review this. Thanks!

Initial commit of rl_chain code

docs/docs/integrations/document_loaders/cratedb.ipynb

docs/docs/integrations/document_loaders/sqlalchemy.ipynb

docs/docs/integrations/providers/cratedb.mdx

docs/docs/modules/data_connection/document_loaders/sqlalchemy.mdx

libs/langchain/tests/integration_tests/document_loaders/test_sqlalchemy_cratedb.py

libs/langchain/langchain/vectorstores/cratedb/base.py

The implementation is based on the generic `SQLChatMessageHistory`.

When not adding any embeddings upfront, the runtime model factory was not able to derive the vector dimension size, because the SQLAlchemy models had not been initialized correctly.

From now on, _all_ SQLAlchemy model types are created at runtime through the `ModelFactory` utility. Using `__table_args__ = {"keep_existing": True}` on the ORM entity definitions seems to work well, even with multiple invocations of `CrateDBVectorSearch.from_texts()` using different `collection_name` argument values. While at it, this patch also fixes a few linter errors.

When deleting a collection, also delete its associated embeddings.

It is a special adapter which provides similarity search across multiple collections. It cannot be used for indexing documents.

The CrateDB adapter works a bit differently from the pgvector adapter it builds upon: The dimensionality of the vector field must be specified at table creation time, but it is also a runtime parameter in LangChain, so table creation needs to be delayed. As a result, the tables sometimes do not exist yet. However, this is only relevant when the user requests to pre-delete the collection, using the `pre_delete_collection` argument. So, do the error handling only there, and _not_ in the generic data model utility functions.
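The deferred table creation described above can be sketched roughly as follows. This is an illustration only, not the adapter's code: it uses Python's built-in `sqlite3` module instead of CrateDB, stores vectors as JSON text because SQLite has no vector column type, and the class name `LazyVectorTable` and its methods are hypothetical.

```python
import json
import sqlite3
from typing import List, Optional


class LazyVectorTable:
    """Hypothetical store that creates its table only once the vector size is known."""

    def __init__(self, conn: sqlite3.Connection, pre_delete_collection: bool = False):
        self.conn = conn
        self.dimensions: Optional[int] = None
        if pre_delete_collection:
            self._pre_delete()

    def _pre_delete(self) -> None:
        # The table may not exist yet. Tolerate that here, and only here,
        # instead of inside the generic data access helpers.
        try:
            self.conn.execute("DELETE FROM embedding")
        except sqlite3.OperationalError:
            pass

    def _create_table(self, dimensions: int) -> None:
        # With CrateDB, the column would be declared with a fixed vector size.
        # SQLite has no vector type, so this sketch stores JSON text instead.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embedding (document TEXT, vector TEXT)"
        )
        self.dimensions = dimensions

    def add_embeddings(self, texts: List[str], embeddings: List[List[float]]) -> None:
        if not embeddings:
            return  # Nothing to store, and no way to derive the vector size.
        if self.dimensions is None:
            # First write: derive the vector size and create the table now.
            self._create_table(len(embeddings[0]))
        for vector in embeddings:
            if len(vector) != self.dimensions:
                raise ValueError("vector size does not match the table definition")
        self.conn.executemany(
            "INSERT INTO embedding (document, vector) VALUES (?, ?)",
            [(text, json.dumps(vector)) for text, vector in zip(texts, embeddings)],
        )
```

In this sketch, constructing `LazyVectorTable(conn, pre_delete_collection=True)` against an empty database does not fail, and the table only appears after the first call to `add_embeddings()` with a non-empty list.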

The performance gains can be substantial.

The test cases can be written substantially more elegantly.

amotl · 2024-06-07T01:38:45Z

No worries. The patch is constantly being tested, both by CI/GHA and by real users running the Jupyter Notebooks at https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/llm-langchain.

amotl · 2024-06-07T01:48:41Z

@surister: Within the original post, there is a backlog item:

Bring documentation up to speed. Have a look at those blueprints: [...]

If you want to spend a few cycles here, you may want to have a look and evaluate whether you can contribute by improving the documentation correspondingly. Relevant Jupyter Notebooks for CrateDB are surely not top-notch, or even up-to-speed, yet. This patch will be there for a while and everyone is welcome to contribute corresponding refinements.

In the meantime, I will be working on a few related patches that are also needed before upstreaming.

Get rid of dependency to cratedb-toolkit #25

docs/docs/integrations/memory/cratedb_chat_message_history.ipynb

amotl

A few more suggestions coming from a self-review.

docs/docs/integrations/document_loaders/cratedb.ipynb

docs/docs/integrations/memory/cratedb_chat_message_history.ipynb

amotl · 2024-06-25T16:04:35Z

docs/docs/integrations/document_loaders/sqlalchemy.ipynb

That documentation may already exist upstream. Please check.

community[minor]: Add SQLDatabaseLoader document loader langchain-ai/langchain#18281

docs/docs/integrations/memory/cratedb_chat_message_history.ipynb

docs/docs/integrations/vectorstores/cratedb.ipynb

amotl · 2024-06-25T16:17:16Z

libs/langchain/tests/integration_tests/cache/fake_embeddings.py

    def embed_query(self, text: str) -> List[float]:
        """Return consistent embeddings for the text, if seen before, or a constant
        one if the text is unknown."""
-        return self.embed_documents([text])[0]
+        if text not in self.known_texts:
+            return [float(1.0)] * (self.dimensionality - 1) + [float(0.0)]
+        return [float(1.0)] * (self.dimensionality - 1) + [
+            float(self.known_texts.index(text))
+        ]


If this patch still needs those adjustments, they will need to be vendorized.
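To make the effect of this change easier to follow, here is a self-contained sketch of such a fake embedding model. It is an approximation written for illustration, not the code from the LangChain test suite: the class name `ConsistentFakeEmbeddingsSketch` and the body of `embed_documents` are assumptions; only `embed_query` mirrors the diff above.

```python
from typing import List


class ConsistentFakeEmbeddingsSketch:
    """Hypothetical fake embedding model returning reproducible vectors."""

    def __init__(self, dimensionality: int = 10) -> None:
        self.known_texts: List[str] = []
        self.dimensionality = dimensionality

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Assumed behavior: register unseen texts, then encode the position
        # of each text within `known_texts` into the last vector component.
        vectors = []
        for text in texts:
            if text not in self.known_texts:
                self.known_texts.append(text)
            index = float(self.known_texts.index(text))
            vectors.append([1.0] * (self.dimensionality - 1) + [index])
        return vectors

    def embed_query(self, text: str) -> List[float]:
        # Mirrors the patched method: unknown texts yield a constant vector
        # and are NOT registered as a side effect of querying.
        if text not in self.known_texts:
            return [1.0] * (self.dimensionality - 1) + [0.0]
        return [1.0] * (self.dimensionality - 1) + [
            float(self.known_texts.index(text))
        ]
```

Under these assumptions, querying for an unknown text leaves `known_texts` untouched, whereas delegating to `embed_documents()` would have registered the query text.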

amotl · 2024-06-25T16:18:43Z

libs/community/langchain_community/vectorstores/cratedb/sqlalchemy_type.py

Available through sqlalchemy-cratedb. This file can be removed.

amotl · 2024-06-25T16:18:54Z

libs/community/langchain_community/vectorstores/cratedb/sqlalchemy_patch.py

Available through sqlalchemy-cratedb. This file can be removed.

libs/community/langchain_community/vectorstores/cratedb/extended.py

docs/docs/integrations/providers/cratedb.mdx

The CrateDB SQLAlchemy dialect needs more love, so it was separated from the DBAPI HTTP driver.

amotl force-pushed the cratedb branch from 466775a to 99c0a1f Compare September 16, 2023 18:14

amotl mentioned this pull request Jun 16, 2024

SQLAlchemy: Polyfill for transparently synchronizing data with REFRESH TABLE crate/sqlalchemy-cratedb#83

Open

amotl force-pushed the cratedb branch 4 times, most recently from f8d8d49 to 85bf8c7 Compare September 16, 2023 22:37

amotl mentioned this pull request Sep 18, 2023

SQLAlchemy: Polyfill for AUTOINCREMENT columns crate/sqlalchemy-cratedb#77

Open

amotl force-pushed the cratedb branch 3 times, most recently from 9fd7cd7 to c139332 Compare September 17, 2023 22:01

amotl mentioned this pull request Sep 17, 2023

[LangChain] Add example programs and notebooks crate/cratedb-examples#85

Merged

1 task

amotl force-pushed the cratedb branch 2 times, most recently from b35e066 to b4289e6 Compare September 18, 2023 11:09

amotl force-pushed the cratedb branch from b4289e6 to 1843ce4 Compare September 25, 2023 20:08

amotl force-pushed the cratedb branch from 083a1e6 to 9d968fe Compare October 3, 2023 21:22

amotl mentioned this pull request Oct 10, 2023

Contrib: Add a few SQLAlchemy patches and polyfills crate/cratedb-toolkit#59

Merged

amotl pushed a commit that referenced this pull request Oct 11, 2023

Merge pull request #1 from VowpalWabbit/add_rl_chain

e942330

Initial commit of rl_chain code

amotl force-pushed the cratedb branch 7 times, most recently from 8a8fc4e to e3d07c4 Compare October 17, 2023 14:58

amotl commented Oct 19, 2023

amotl force-pushed the cratedb branch 5 times, most recently from 29cf863 to f75a3d7 Compare October 27, 2023 20:39

amotl added 16 commits June 7, 2024 03:23

Generalize SQLChatMessageHistory to make code a bit more reusable

ac5b89c

CrateDB memory: Add conversational memory support

0a959e0

The implementation is based on the generic `SQLChatMessageHistory`.

CrateDB vector: Fix usage when only reading, and not storing

ca519f0

When not adding any embeddings upfront, the runtime model factory was not able to derive the vector dimension size, because the SQLAlchemy models had not been initialized correctly.

CrateDB vector: Unable to invoke add_embeddings without embeddings

04587eb

CrateDB vector: Fix cascading deletes

cd2c9ab

When deleting a collection, also delete its associated embeddings.
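As a rough illustration of deleting a collection together with its embeddings in application code, consider the sketch below. It assumes a database that does not cascade deletes on its own, uses Python's built-in `sqlite3` module rather than CrateDB, and the table layout and the `delete_collection` function are hypothetical.

```python
import sqlite3


def delete_collection(conn: sqlite3.Connection, name: str) -> None:
    """Delete a collection and, explicitly, all embeddings that belong to it."""
    row = conn.execute(
        "SELECT id FROM collection WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return  # Unknown collection: nothing to delete.
    collection_id = row[0]
    # Remove the dependent records first, then the collection itself.
    conn.execute("DELETE FROM embedding WHERE collection_id = ?", (collection_id,))
    conn.execute("DELETE FROM collection WHERE id = ?", (collection_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE embedding "
    "(id INTEGER PRIMARY KEY, collection_id INTEGER, document TEXT)"
)
conn.execute("INSERT INTO collection (id, name) VALUES (1, 'docs'), (2, 'notes')")
conn.execute(
    "INSERT INTO embedding (collection_id, document) "
    "VALUES (1, 'a'), (1, 'b'), (2, 'c')"
)

delete_collection(conn, "docs")
remaining = conn.execute("SELECT document FROM embedding").fetchall()
```

After the call, only the embedding that belongs to the `notes` collection is left in the `embedding` table.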

CrateDB vector: Add CrateDBVectorSearchMultiCollection

651d840

It is a special adapter which provides similarity search across multiple collections. It cannot be used for indexing documents.
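The idea of a search-only adapter that spans several collections can be sketched in plain Python. This is not the `CrateDBVectorSearchMultiCollection` implementation: the class below is hypothetical, keeps its data in memory, and ranks results by Euclidean distance as an assumed similarity measure.

```python
import math
from typing import Dict, List, Tuple

Record = Tuple[str, List[float]]  # (document, vector)


class MultiCollectionSearchSketch:
    """Hypothetical read-only search across a selected set of collections."""

    def __init__(
        self, collections: Dict[str, List[Record]], collection_names: List[str]
    ) -> None:
        self.collections = collections
        self.collection_names = collection_names

    def add_texts(self, texts: List[str]) -> None:
        # Indexing is deliberately unsupported: it would be ambiguous
        # which of the collections should receive the new documents.
        raise NotImplementedError("This adapter is for searching only")

    def similarity_search(
        self, query_vector: List[float], k: int = 4
    ) -> List[Tuple[str, float]]:
        # Gather candidates from all selected collections, then rank them
        # together by distance to the query vector (smaller is closer).
        candidates = []
        for name in self.collection_names:
            for document, vector in self.collections.get(name, []):
                candidates.append((document, math.dist(query_vector, vector)))
        candidates.sort(key=lambda item: item[1])
        return candidates[:k]
```

Collections that are not listed in `collection_names` are ignored, so the same underlying data can be searched with different selections.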

CrateDB vector: Improve testing when initialized without dimensionality

c56144a

CrateDB vector: Use SA's bulk_save_objects method for inserting embeddings

73b570c

The performance gains can be substantial.

CrateDB vector: Test non-deterministic values by using pytest.approx

7941059

The test cases can be written substantially more elegantly.
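A brief sketch of how `pytest.approx` helps when asserting on computed scores. The scoring function `fake_scores` is a hypothetical stand-in; in the real test suite, the similarity scores come from the database.

```python
import math
from typing import List

import pytest


def fake_scores(query: List[float], vectors: List[List[float]]) -> List[float]:
    """Hypothetical scoring function: 1 / (1 + Euclidean distance)."""
    return [1.0 / (1.0 + math.dist(query, vector)) for vector in vectors]


def test_scores_are_close_enough() -> None:
    scores = fake_scores([0.1, 0.2], [[0.1, 0.2], [0.4, 0.6]])
    # Comparing against exact float literals would be brittle.
    # `pytest.approx` compares the whole list within a tolerance instead.
    assert scores == pytest.approx([1.0, 0.6667], abs=1e-4)
```

Note that `pytest.approx` handles flat sequences of numbers; for results shaped like `(document, score)` pairs, one option is to compare the documents exactly and the scores approximately in separate assertions.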

CrateDB vector: Fix initialization of vector dimensionality

c176dc4

CrateDB: Refactor to langchain_community

300c4d8

CrateDB vector: Adjustments for updates to pgvector adapter

6807074

CrateDB vector: Relax test constraint

de3ff7f

CrateDB: SQLAlchemyLoader has been superseded by SQLDatabaseLoader

0a75262

amotl changed the base branch from release-v0.2.2 to cratedb-v0.2.3 June 7, 2024 01:33

amotl closed this Jun 7, 2024

amotl force-pushed the cratedb branch from 40bcf25 to 0a75262 Compare June 7, 2024 01:33

amotl reopened this Jun 7, 2024

amotl changed the base branch from cratedb-v0.2.3 to release-v0.2.3 June 7, 2024 01:37

amotl force-pushed the cratedb branch from 9cb7140 to 0a75262 Compare June 7, 2024 01:37

amotl commented Jun 18, 2024

docs/docs/integrations/memory/cratedb_chat_message_history.ipynb (outdated, resolved)

amotl commented Jun 25, 2024

Dependencies: Migrate from crate[sqlalchemy] to sqlalchemy-cratedb

e3b0e13

The CrateDB SQLAlchemy dialect needs more love, so it was separated from the DBAPI HTTP driver.

amotl force-pushed the cratedb branch from 783c029 to e3b0e13 Compare June 25, 2024 17:12

amotl added 2 commits June 26, 2024 13:24

CrateDB: Stop using CrateDB Toolkit. Use sqlalchemy-cratedb 0.38.0.

080fc94

CrateDB: Stop using local FloatVector implementation

f5f26d8
