Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: ValueError is not raised when adding a document with an existing id #1062

Open
yuhuishi-convect opened this issue Aug 30, 2023 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@yuhuishi-convect
Copy link

What happened?

According to the doc, adding a document with an existing id should raise an ValueError.

Somehow, the ValueError is not raised when running the following commands

import chromadb 
c = chromadb.Client()
cc = c.get_or_create_collection("test")
cc.add(ids=['id1'], documents=['some document'], embeddings=[[1,2,3]])
cc.get()
cc.add(ids=['id1'], documents=['updated document'], embeddings=[[2,3,4]])   # this prints a warning instead of throwing an error
cc.get()

The above code only prints a warning message instead of throwing the actual error

In [14]: cc.add(ids=['id1'], documents=['updated document'], embeddings=[[2,3,4]])
Insert of existing embedding ID: id1
Add of existing embedding ID: id1

If this is expected, then maybe the doc should be modified to reflect that.

Versions

chroma -- 0.4.8
python -- 3.8
os -- ubuntu 20.04

Relevant log output

No response

@yuhuishi-convect yuhuishi-convect added the bug Something isn't working label Aug 30, 2023
@a-atchuta-ram
Copy link

a-atchuta-ram commented Sep 2, 2023

Hello,

Thank you for providing a detailed report. I've been able to reproduce the issue based on your description.

According to the documentation, it is expected that inserting a document with an existing ID should raise a ValueError. However, based on the code you've provided, it seems that only a warning is printed instead.

Upon investigation, I have identified two potential solutions:

Implementation Change: The implementation of the seen and dup sets in the validate_ids function located at chromadb/api/types.py needs revision. Currently, these sets become empty every time the function is called, resulting in the behaviour you've observed. A change to persist these sets between function calls or an alternative method to track duplicates should address the issue.

Documentation Update and Clarity:

import chromadb
client = chromadb.Client()

collection = client.create_collection("sample_collection")
collection.add(
documents=["This is document1", "This is document2"],
metadatas=[{"source": "notion"}, {"source": "google-docs"}],
ids=["doc1", "doc1"]
)

When you try running this code, it does give a DuplicateIDError. So the documentation needs to be updated accordingly.
I've submitted a pull request for the doc update chroma-core/docs#130

@urig
Copy link

urig commented Sep 17, 2023

I'd like to ask whether the change should be done in validate_ids()?

This function seems to be correctly identifying and rejecting duplicate id values in a single call. This looks like the correct implementation when it's called directly from get() and delete() as well as when called indirectly from add().

Code changes to prevent adding the same id in separate calls to add() should likely not interfere with this area?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants