[Feature Request]: remove the duplicate data #1947
Comments
I actually fixed this by writing my own filter function in the code where I retrieve my top K docs.
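The filter code itself was not posted in this thread. As a rough illustration only, a post-retrieval filter of this kind might look like the sketch below. The function name `dedupe_top_k` and the whitespace/case normalization are assumptions, not the commenter's actual code. It takes the documents already returned by a query (in ranked order) and keeps the first occurrence of each distinct text.

```python
import hashlib


def dedupe_top_k(docs, k):
    """Return up to k documents, skipping texts already seen (hypothetical helper).

    `docs` is a list of strings in ranked order, e.g. the documents returned
    by a similarity query. Two documents count as duplicates when their text
    matches after collapsing whitespace and lowercasing (an assumption).
    """
    seen = set()
    unique = []
    for doc in docs:
        if len(unique) >= k:
            break
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # same content as a higher-ranked document
        seen.add(key)
        unique.append(doc)
    return unique
```

Because duplicates are dropped after retrieval, it probably makes sense to request more than K results from the query and then trim, otherwise the filtered list can come back shorter than K.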
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
I don't know if it's still relevant, but can I see the code?
@zyu, how are you tracking your data in Chroma? Is that via the document ID or metadata? I think it is essential to understand how you determine that content is a duplicate of something previously added.
I've been using the hash code of a document as its ID and leveraging the upsert function to add or update documents in the vector store. This approach effectively removes duplicate documents, as only one instance of each unique document is stored. However, I've encountered significant challenges when it comes to updating or deleting documents. Here's a detailed breakdown of the issue:

Scenario:

Problems:

Delete Issues: If I want to delete a specific file and all documents originating from that file, I face difficulties. Since only one instance of the document exists in the vector store, deleting the document could inadvertently remove it from other files that also reference it.

This challenge is particularly difficult because the metadata doesn't support lists or sets. A workaround is to store lists as strings and then convert these strings back to lists every time I .get() the data. However, this workaround necessitates fetching all the data upfront, which makes the process unnecessarily cumbersome and inefficient.
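To make the approach and the workaround above concrete, here is a minimal sketch in plain Python. The helper names (`content_id`, `add_source`, `remove_source`) and the `"sources"` metadata key are hypothetical. The ID is a SHA-256 hash of the text, and the list of originating files is stored as a JSON string because, as the comment notes, metadata values cannot be lists. The Chroma calls themselves are left out; the resulting ID and metadata would be passed to `collection.upsert(...)` or `collection.delete(...)`.

```python
import hashlib
import json


def content_id(text):
    """Derive a stable document ID from the text, so identical texts share one ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def add_source(metadata, source):
    """Record that `source` references this document (list kept as a JSON string)."""
    sources = set(json.loads(metadata.get("sources", "[]")))
    sources.add(source)
    return {**metadata, "sources": json.dumps(sorted(sources))}


def remove_source(metadata, source):
    """Drop `source` from the reference list.

    Returns the updated metadata, or None when no file references the
    document any more and it is therefore safe to delete.
    """
    sources = set(json.loads(metadata.get("sources", "[]")))
    sources.discard(source)
    if not sources:
        return None
    return {**metadata, "sources": json.dumps(sorted(sources))}
```

This only avoids the accidental deletion of shared documents. It does not remove the cost described above: finding every document that references a given file still means reading and decoding the stored strings.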
Describe the problem
I want to use Chroma to remove duplicate data. My data contains a lot of duplicates, but I looked at the interface and found no way to deduplicate it, so I have no way to remove the duplicate entries.
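For data that is already stored, one possible manual cleanup is sketched below, assuming "duplicate" means byte-identical document text. The function name `find_duplicate_ids` is hypothetical. It takes parallel lists of IDs and documents, such as the `"ids"` and `"documents"` entries returned by `collection.get()`, and returns the IDs of every repeat after the first occurrence, which could then be passed to `collection.delete(ids=...)`.

```python
import hashlib


def find_duplicate_ids(ids, documents):
    """Return IDs whose document text repeats an earlier entry (hypothetical helper)."""
    first_seen = {}   # content hash -> ID of the first document with that text
    duplicates = []
    for doc_id, text in zip(ids, documents):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append(doc_id)
        else:
            first_seen[digest] = doc_id
    return duplicates
```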
Describe the proposed solution
Provide a way to remove the duplicate data.
Alternatives considered
remove the duplicate data
Importance
Would make my life easier
Additional Information
No response