
[Feature Request]: remove the duplicate data #1947

Open
zyu opened this issue Mar 29, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@zyu

zyu commented Mar 29, 2024

Describe the problem

I want to use Chroma to remove duplicate data. My dataset contains many duplicates, but I couldn't find anything in the API for deduplication, so there is currently no way for me to remove the duplicate entries.

Describe the proposed solution

remove the duplicate data

Alternatives considered

remove the duplicate data

Importance

would make my life easier

Additional Information

No response

@zyu zyu added the enhancement New feature or request label Mar 29, 2024
@ceyhuncakir

I actually fixed this by writing my own filter function in the code where I retrieve my top-k docs.

def filter_top_k_docs(
    top_k_docs: dict
) -> dict:
    """
    Filter duplicate entries out of chromadb query results.

    Args:
        top_k_docs (dict): A dictionary containing the various lists from the query results

    Returns:
        dict: The query results with entries that have duplicate embeddings removed
    """

    embeddings = top_k_docs['embeddings'][0]
    unique_embeddings = set()
    new_indices = []

    # Keep only the index of the first occurrence of each distinct embedding.
    for idx, embedding in enumerate(embeddings):
        embedding_tuple = tuple(embedding)
        if embedding_tuple not in unique_embeddings:
            unique_embeddings.add(embedding_tuple)
            new_indices.append(idx)

    # Apply the surviving indices to every non-None list in the result dictionary.
    for key in top_k_docs:
        if top_k_docs[key] is not None:
            top_k_docs[key] = [[item for idx, item in enumerate(top_k_docs[key][0]) if idx in new_indices]]

    return top_k_docs
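
For what it's worth, a minimal usage sketch, assuming a collection handle named collection and an arbitrary query (neither is from the thread). Embeddings must be requested explicitly via include, since the filter keys off them:

# Hypothetical usage; "collection" and the query text are placeholders.
results = collection.query(
    query_texts=["example query"],
    n_results=10,
    include=["embeddings", "documents", "metadatas", "distances"],
)
deduplicated = filter_top_k_docs(results)
print(deduplicated["documents"][0])  # entries with duplicate embeddings are dropped

Note that newer client versions may return extra keys that are not nested per-query lists; those would need to be skipped before applying the filter.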

@zyu
Author

zyu commented Apr 7, 2024

Traceback (most recent call last):
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 652, in raise_chroma_error
resp.raise_for_status()
File "/opt/miniconda3/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8090/api/v1/collections/183b4fe9-b24a-4136-ad77-0943677de6a5/get

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/zyu/code/repeat/updatetime.py", line 38, in
r = collection.get(
^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/models/Collection.py", line 211, in get
get_results = self._client._get(
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/init.py", line 127, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 436, in _get
raise_chroma_error(resp)
File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 654, in raise_chroma_error
raise (Exception(resp.text))
Exception: {"error":"IndexError('list assignment index out of range')"}

@ceyhuncakir

I don't know if it's still relevant, but can I see the code?

@tazarov
Contributor

tazarov commented Apr 16, 2024

@zyu, how are you tracking your data in Chroma? Is that via the document ID or metadata? I think it is essential to understand how you make sure that content is a duplicate of something previously added.

@syshin0116

I've been using the hash code of a document as its ID and leveraging the upsert function to add or update documents in the vector store. This approach effectively removes duplicate documents, as only one instance of each unique document is stored.
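
A minimal sketch of that approach, assuming an in-process client and a collection name of "docs" (both placeholders, not from the comment); the SHA-256 of the chunk text serves as the ID, so re-adding identical content only ever stores one copy:

import hashlib
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")  # collection name is a placeholder

def upsert_chunks(chunks: list[str]) -> None:
    # Collapse duplicates inside the batch as well, so IDs stay unique within
    # a single upsert call; identical text always maps to the same ID.
    unique = {hashlib.sha256(c.encode("utf-8")).hexdigest(): c for c in chunks}
    collection.upsert(ids=list(unique.keys()), documents=list(unique.values()))

upsert_chunks(["same text", "same text", "other text"])  # stores two entries, not three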

However, I've encountered significant challenges when it comes to updating or deleting documents. Here’s a detailed breakdown of the issue:

Scenario:
Document Duplication Across Multiple Files: Suppose a specific document (or chunk) is duplicated across several files.
Single Instance in Vector Store: Using the hash-based ID and upsert method, only one instance of this document is added to the vector store, regardless of how many files it appears in.
Maintaining File References: Ideally, I need to maintain a list of files that reference this document. This list would allow me to track all files that contain the document.

Problems:
Update Issues: When a document needs to be updated, there is no straightforward way to identify all the files that reference the document. This complicates ensuring that the document’s associations remain accurate.

Delete Issues: If I want to delete a specific file and all documents originating from that file, I face difficulties. Since only one instance of the document exists in the vector store, deleting the document could inadvertently remove it from other files that also reference it.

This challenge is particularly difficult because Chroma metadata values don't support lists or sets. A workaround is to store lists as strings and then convert these strings back to lists every time I .get() the data. However, this workaround necessitates fetching all the data upfront, which makes the process unnecessarily cumbersome and inefficient.
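
To make the workaround concrete, here is a rough sketch under the assumptions that the document already exists in the collection and that the file list lives in a made-up "source_files" metadata field stored as a JSON string:

import json

def add_file_reference(collection, doc_id: str, file_path: str) -> None:
    # Metadata values must be scalars, so the list of referencing files is
    # kept as a JSON-encoded string and round-tripped on every update.
    current = collection.get(ids=[doc_id], include=["metadatas"])
    if not current["ids"]:
        return  # document not in the collection; nothing to update
    metadata = current["metadatas"][0] or {}
    files = set(json.loads(metadata.get("source_files", "[]")))
    files.add(file_path)
    metadata["source_files"] = json.dumps(sorted(files))
    collection.update(ids=[doc_id], metadatas=[metadata])

Deleting a file would then mean reading each document's source_files string, removing the path, and only deleting documents whose list becomes empty, which is exactly the extra fetching described above.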
