[Bug]: chromadb.api.configuration.InvalidConfigurationError: batch_size must be less than or equal to sync_threshold #2574

Open
dddxst opened this issue Jul 25, 2024 · 7 comments
Labels
bug Something isn't working

dddxst commented Jul 25, 2024

What happened?

from typing import List

import chromadb
from chromadb.api.configuration import HNSWConfiguration
from chromadb.api.models.Collection import Collection
from chromadb.utils.embedding_functions.sentence_transformer_embedding_function import (
    SentenceTransformerEmbeddingFunction,
)

from read_word import extract_titles

# Local model path; assumed to be the one mentioned in the docstring below.
m3_model = "D:/models/BGE_models"


class EmbeddingDB:
    def __init__(self, db, embedding_function=None):
        """
        docker pull chromadb/chroma
        docker run -p 8000:8000 chromadb/chroma

        m3_model = "D:/models/BGE_models"
        model = SentenceTransformer(m3_model)
        client = chromadb.HttpClient(host='localhost', port=8000)
        :param db:
        :param embedding_function:
        """
        self.db = db
        self.embedding_function = embedding_function

    def get_or_create_collection(self, name) -> Collection:
        # Currently unused: passing it to the call below is commented out.
        configuration = HNSWConfiguration(batch_size=100, sync_threshold=100)
        if self.embedding_function:
            collection = self.db.get_or_create_collection(
                name=name,
                # embedding_function=self.embedding_function,
                # configuration=configuration
            )
        else:
            collection = self.db.get_or_create_collection(name=name)

        return collection

    def add(self, collection_name: str, string: List[str]):
        """

        :param collection_name: name of the collection
        :param string:
        :return:
        """
        collection = self.get_or_create_collection(collection_name)
        collection.add(
            embeddings=self.embedding_function(string),
            documents=string,
            ids=[f"id{num}" for num in range(len(string))]
        )
        return collection

    def delete_collection(self, name: str) -> None:
        self.db.delete_collection(name=name)


embedding_function1 = SentenceTransformerEmbeddingFunction(model_name=m3_model)
client = chromadb.HttpClient(host='xx.xx.xx.xx', port=8000)

eDB = EmbeddingDB(client, embedding_function1)
titles, docs = extract_titles('wt.docx')


def load_data():
    # eDB.delete_collection('docs')
    # eDB.delete_collection('titles')

    eDB.add("docs", docs)
    eDB.add("titles", titles)


if __name__ == '__main__':
    load_data()

The error occurs on Ubuntu, but it does not occur on Windows.

Versions

v0.5.4, ubuntu22 (or centos7.9), python3.11.9

Relevant log output

File "/root/proj/datautils.py", line 72, in load_data
    eDB.add("docs", docs)
  File "/root/proj/datautils.py", line 47, in add
    collection = self.get_or_create_collection(collection_name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/proj/datautils.py", line 30, in get_or_create_collection
    collection = self.db.get_or_create_collection(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/client.py", line 166, in get_or_create_collection
    model = self._server.get_or_create_collection(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 146, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/fastapi.py", line 247, in get_or_create_collection
    return self.create_collection(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 146, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/fastapi.py", line 206, in create_collection
    model = CollectionModel.from_json(resp_json)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/types.py", line 139, in from_json
    configuration = CollectionConfigurationInternal.from_json(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 217, in from_json
    return cls(parameters=parameters)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 115, in __init__
    parameter.value = child_type.from_json(parameter.value)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 217, in from_json
    return cls(parameters=parameters)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 130, in __init__
    self.configuration_validator()
  File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 286, in configuration_validator
    raise InvalidConfigurationError(
chromadb.api.configuration.InvalidConfigurationError: batch_size must be less than or equal to sync_threshold

dddxst added the bug (Something isn't working) label Jul 25, 2024

@mikethemerry

I've just spent three evenings tracking down the same bug and have managed to figure this out in the last half hour or so.

I think this is a regression introduced by https://github.com/chroma-core/chroma/pull/2526/files

I'm still figuring out the reproduction steps, but I think the process is:

  1. Deploy chroma and create a collection using <=0.5.4 with metadata={"hnsw:space": "cosine"} or similar. Specifically, for me:

        self.collection = self.vdb.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_function,
            metadata={"hnsw:space": "cosine"},
        )

     This creates the collection with the 0.5.4 defaults, where sync_threshold=100 and batch_size=1000.
  2. Upgrade your client to 0.5.5.
  3. The client now checks sync_threshold and batch_size against the stored defaults and throws the error.
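
The failure mode in these steps can be sketched without Chroma at all. The snippet below is a simplified stand-in, not the library's code: `HNSWParams`, `validate` and `is_valid` are hypothetical names, and the only rule modelled is the one stated in the error message. The value pairs are the ones quoted in this thread.

```python
from dataclasses import dataclass


class InvalidConfigurationError(ValueError):
    """Stand-in for chromadb.api.configuration.InvalidConfigurationError."""


@dataclass
class HNSWParams:
    """Hypothetical container for the two settings named in the error."""

    batch_size: int
    sync_threshold: int

    def validate(self) -> None:
        # The rule from the error message: batch_size <= sync_threshold.
        if self.batch_size > self.sync_threshold:
            raise InvalidConfigurationError(
                "batch_size must be less than or equal to sync_threshold"
            )


def is_valid(params: HNSWParams) -> bool:
    """Return True if the parameters pass the check, False otherwise."""
    try:
        params.validate()
    except InvalidConfigurationError:
        return False
    return True


# Values reported above as the stored 0.5.4 defaults.
stored_by_0_5_4 = HNSWParams(batch_size=1000, sync_threshold=100)
# The same two numbers with the roles swapped.
swapped = HNSWParams(batch_size=100, sync_threshold=1000)

print(is_valid(stored_by_0_5_4))  # False
print(is_valid(swapped))          # True
```

Equal values, such as batch_size=100 and sync_threshold=100, also pass this check, since the comparison is "less than or equal".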

I haven't read through all of the other changes to the HNSW work in 0.5.5, but it looks like there are some changes to persistent properties and similar. I was actually trying to change the configured properties with different metadata definitions, but was having a lot of trouble. Specifically, the error was not fixed by changing that code to:

    self.collection = self.vdb.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_function,
            metadata={"hnsw:space": "cosine", "sync_threshold":1000, "batch_size":100},
        )

As a short-term workaround, I would suggest downgrading to 0.5.4 (this has worked for me) and waiting for a patch, as 0.5.5 is still in pre-release.

Contributor

tazarov commented Jul 25, 2024

@dddxst and @mikethemerry, thanks for reporting and investigating this. Indeed, it was a bug (#2338) released with 0.5.4, which was fixed (#2526) in 0.5.5. The issue is that any DB created with 0.5.4 triggers the validation error you reported.

To fix the problem (ideally, we should've added a migration script to do that, but alas):

If in Docker, connect to your Docker container and run:

apt update && apt install sqlite3
# 'test' is the collection name here; substitute your own
sqlite3 /chroma/chroma/chroma.sqlite3 "update collections set config_json_str=json_set(config_json_str,'$.hnsw_configuration.batch_size',100,'$.hnsw_configuration.sync_threshold',1000) where name='test';"
# you don't have to run the below, but for consistency reasons:
sqlite3 /chroma/chroma/chroma.sqlite3 "update collection_metadata set int_value = 100 where key='hnsw:batch_size' and collection_id in (select id from collections where name='test');"
sqlite3 /chroma/chroma/chroma.sqlite3 "update collection_metadata set int_value = 1000 where key='hnsw:sync_threshold' and collection_id in (select id from collections where name='test');"
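
If installing the sqlite3 CLI in the container is not an option, the first update can be approximated with Python's standard library. This is only a sketch: `fix_hnsw_config` is a hypothetical helper, it assumes the `collections` table has the `id`, `name` and `config_json_str` columns used in the commands above, and it is demonstrated here on an in-memory stand-in table rather than a real Chroma database.

```python
import json
import sqlite3


def fix_hnsw_config(conn: sqlite3.Connection, collection_name: str,
                    batch_size: int = 100, sync_threshold: int = 1000) -> int:
    """Rewrite batch_size/sync_threshold in config_json_str.

    Returns the number of rows that were updated.
    """
    if batch_size > sync_threshold:
        raise ValueError("batch_size must be less than or equal to sync_threshold")
    rows = conn.execute(
        "SELECT id, config_json_str FROM collections WHERE name = ?",
        (collection_name,),
    ).fetchall()
    updated = 0
    for row_id, raw in rows:
        if not raw:
            continue  # nothing stored for this collection, leave it alone
        config = json.loads(raw)
        hnsw = config.setdefault("hnsw_configuration", {})
        hnsw["batch_size"] = batch_size
        hnsw["sync_threshold"] = sync_threshold
        conn.execute(
            "UPDATE collections SET config_json_str = ? WHERE id = ?",
            (json.dumps(config), row_id),
        )
        updated += 1
    conn.commit()
    return updated


# Demo on an in-memory stand-in table (assumed minimal schema).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE collections (id TEXT PRIMARY KEY, name TEXT, config_json_str TEXT)"
)
demo.execute(
    "INSERT INTO collections VALUES (?, ?, ?)",
    ("c1", "test", json.dumps(
        {"hnsw_configuration": {"batch_size": 1000, "sync_threshold": 100}}
    )),
)
changed = fix_hnsw_config(demo, "test")
fixed = json.loads(
    demo.execute(
        "SELECT config_json_str FROM collections WHERE name = 'test'"
    ).fetchone()[0]
)
print(changed, fixed["hnsw_configuration"])
```

Against a real deployment you would connect to the /chroma/chroma/chroma.sqlite3 path from the commands above instead of ":memory:". Taking a copy of the file first seems a sensible precaution.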

@dodeeric

@mikethemerry, thanks to you, solving my problem took me only three minutes instead of three evenings...

Author

dddxst commented Jul 26, 2024

@tazarov Thanks, it works after updating to 0.5.5, but an error occurs on Windows ...

Author

dddxst commented Jul 26, 2024

@mikethemerry Thanks.

Contributor

tazarov commented Jul 26, 2024


Can you share the error you get on Windows?

@codetheweb
Contributor

Hey everyone, I believe this is caused by a version mismatch; it shouldn't happen if your client and server are on the same version. Please make sure that your server and client are both on 0.5.5 and let us know if this is still happening.
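
One way to check for such a mismatch is to compare the two version strings before doing anything else. The sketch below is illustrative only: `same_version` and `describe` are hypothetical helpers that compare plain strings, so the block runs without Chroma installed. The commented lines at the end show how the two strings could be obtained, assuming `chromadb.__version__` and the client's `get_version()` method.

```python
def _normalize(version: str) -> str:
    """Strip whitespace, surrounding quotes and a leading 'v'."""
    return version.strip().strip('"').lstrip("v")


def same_version(client_version: str, server_version: str) -> bool:
    """Return True when both strings denote the same release."""
    return _normalize(client_version) == _normalize(server_version)


def describe(client_version: str, server_version: str) -> str:
    """Build a short human-readable report of the comparison."""
    if same_version(client_version, server_version):
        return f"client and server are both on {_normalize(client_version)}"
    return (
        f"version mismatch: client {_normalize(client_version)}, "
        f"server {_normalize(server_version)}"
    )


print(describe("0.5.5", "0.5.4"))
print(describe("0.5.5", "0.5.5"))

# With chromadb installed and a server reachable (not executed here):
#   import chromadb
#   client = chromadb.HttpClient(host="localhost", port=8000)
#   print(describe(chromadb.__version__, client.get_version()))
```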
