
[Feature Request]: remove the duplicate data #1947

Open
zyu opened this issue Mar 29, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@zyu

zyu commented Mar 29, 2024

Describe the problem

I want to use Chroma to remove duplicate data. My dataset contains many duplicates, but I couldn't find anything in the API for deduplication, so there is currently no way for me to remove the duplicate entries.

Describe the proposed solution

remove the duplicate data

Alternatives considered

remove the duplicate data

Importance

would make my life easier

Additional Information

No response

@zyu zyu added the enhancement New feature or request label Mar 29, 2024
@ceyhuncakir

I actually fixed this by writing my own filter function in the code where I retrieve my top-k docs.

def filter_top_k_docs(
    top_k_docs: dict
) -> dict:
    """
    Filter duplicate entries out of chromadb query results.

    Args:
        top_k_docs (dict): A dictionary containing the various lists from the query results

    Returns:
        dict: The query results with entries that have duplicate embeddings removed
    """

    embeddings = top_k_docs['embeddings'][0]
    unique_embeddings = set()
    new_indices = []

    # Keep only the index of the first occurrence of each distinct embedding.
    for idx, embedding in enumerate(embeddings):
        embedding_tuple = tuple(embedding)
        if embedding_tuple not in unique_embeddings:
            unique_embeddings.add(embedding_tuple)
            new_indices.append(idx)

    # Apply the surviving indices to every non-None list in the result dictionary.
    for key in top_k_docs:
        if top_k_docs[key] is not None:
            top_k_docs[key] = [[item for idx, item in enumerate(top_k_docs[key][0]) if idx in new_indices]]

    return top_k_docs
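
For what it's worth, a minimal usage sketch, assuming a collection handle named collection and an arbitrary query (neither is from the thread). Embeddings must be requested explicitly via include, since the filter keys off them:

# Hypothetical usage; "collection" and the query text are placeholders.
results = collection.query(
    query_texts=["example query"],
    n_results=10,
    include=["embeddings", "documents", "metadatas", "distances"],
)
deduplicated = filter_top_k_docs(results)
print(deduplicated["documents"][0])  # entries with duplicate embeddings are dropped

Note that newer client versions may return extra keys that are not nested per-query lists; those would need to be skipped before applying the filter.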

@zyu
Author

zyu commented Apr 7, 2024

Traceback (most recent call last):
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 652, in raise_chroma_error
resp.raise_for_status()
File "/opt/miniconda3/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8090/api/v1/collections/183b4fe9-b24a-4136-ad77-0943677de6a5/get

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/zyu/code/repeat/updatetime.py", line 38, in
r = collection.get(
^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/models/Collection.py", line 211, in get
get_results = self._client._get(
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/init.py", line 127, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 436, in _get
raise_chroma_error(resp)
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 654, in raise_chroma_error
raise (Exception(resp.text))
Exception: {"error":"IndexError('list assignment index out of range')"}

@ceyhuncakir

I don't know if it's still relevant, but can I see the code?

@tazarov
Contributor

tazarov commented Apr 16, 2024

@zyu, how are you tracking your data in Chroma? Is that via the document ID or metadata? I think it is essential to understand how you make sure that content is a duplicate of something previously added.

@syshin0116

I've been using the hash code of a document as its ID and leveraging the upsert function to add or update documents in the vector store. This approach effectively removes duplicate documents, as only one instance of each unique document is stored.
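
A minimal sketch of that approach, assuming an in-process client and a collection name of "docs" (both placeholders, not from the comment); the SHA-256 of the chunk text serves as the ID, so re-adding identical content only ever stores one copy:

import hashlib
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")  # collection name is a placeholder

def upsert_chunks(chunks: list[str]) -> None:
    # Collapse duplicates inside the batch as well, so IDs stay unique within
    # a single upsert call; identical text always maps to the same ID.
    unique = {hashlib.sha256(c.encode("utf-8")).hexdigest(): c for c in chunks}
    collection.upsert(ids=list(unique.keys()), documents=list(unique.values()))

upsert_chunks(["same text", "same text", "other text"])  # stores two entries, not three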

However, I've encountered significant challenges when it comes to updating or deleting documents. Here’s a detailed breakdown of the issue:

Scenario:
Document Duplication Across Multiple Files: Suppose a specific document (or chunk) is duplicated across several files.
Single Instance in Vector Store: Using the hash-based ID and upsert method, only one instance of this document is added to the vector store, regardless of how many files it appears in.
Maintaining File References: Ideally, I need to maintain a list of files that reference this document. This list would allow me to track all files that contain the document.

Problems:
Update Issues: When a document needs to be updated, there is no straightforward way to identify all the files that reference the document. This complicates ensuring that the document’s associations remain accurate.

Delete Issues: If I want to delete a specific file and all documents originating from that file, I face difficulties. Since only one instance of the document exists in the vector store, deleting the document could inadvertently remove it from other files that also reference it.

This challenge is particularly difficult because Chroma metadata values don't support lists or sets. A workaround is to store lists as strings and then convert these strings back to lists every time I .get() the data. However, this workaround necessitates fetching all the data upfront, which makes the process unnecessarily cumbersome and inefficient.
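
To make the workaround concrete, here is a rough sketch under the assumptions that the document already exists in the collection and that the file list lives in a made-up "source_files" metadata field stored as a JSON string:

import json

def add_file_reference(collection, doc_id: str, file_path: str) -> None:
    # Metadata values must be scalars, so the list of referencing files is
    # kept as a JSON-encoded string and round-tripped on every update.
    current = collection.get(ids=[doc_id], include=["metadatas"])
    if not current["ids"]:
        return  # document not in the collection; nothing to update
    metadata = current["metadatas"][0] or {}
    files = set(json.loads(metadata.get("source_files", "[]")))
    files.add(file_path)
    metadata["source_files"] = json.dumps(sorted(files))
    collection.update(ids=[doc_id], metadatas=[metadata])

Deleting a file would then mean reading each document's source_files string, removing the path, and only deleting documents whose list becomes empty, which is exactly the extra fetching described above.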
