Describe the bug
In haystack/haystack/components/preprocessors/recursive_splitter.py (line 426 in a28b285):
    new_doc = Document(content=chunk, meta=deepcopy(doc.meta))
Chunks with the same content (and the same initial metadata) are assigned the same id in the RecursiveDocumentSplitter, because the split-specific metadata is only added after the Document has been created. As a result, the run method of the RecursiveDocumentSplitter might return documents with the same id. That looks like a bug to me too.
A possible fix is to build the new metadata first, including the assignment currently done in the line new_doc.meta["split_id"] = split_nr, and only afterward create the new document. In addition, we should add the id of the parent document. I have in mind something like:
    meta = deepcopy(doc.meta)
    meta["parent_id"] = doc.id
    meta["split_id"] = split_nr
    meta["split_idx_start"] = current_position
    meta["_split_overlap"] = [] if self.split_overlap > 0 else None
    new_doc = Document(content=chunk, meta=meta)
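To illustrate why the order matters, here is a minimal, self-contained sketch. It does not use Haystack: the `Document` class below is a simplified stand-in that, by assumption, hashes only `content` and `meta` once at creation time, and the helper names `split_meta_after` and `split_meta_first` are hypothetical. It only mirrors the two orderings discussed above.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Simplified stand-in (NOT Haystack's Document): id = hash of content + meta at creation."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        if not self.id:
            data = f"{self.content}{self.meta}"
            self.id = hashlib.sha256(data.encode("utf-8")).hexdigest()


def split_meta_after(parent: Document, chunks: List[str]) -> List[Document]:
    """Current ordering: create the document, then add split metadata."""
    docs = []
    for split_nr, chunk in enumerate(chunks):
        new_doc = Document(content=chunk, meta=deepcopy(parent.meta))
        new_doc.meta["split_id"] = split_nr  # too late: the id was already computed
        docs.append(new_doc)
    return docs


def split_meta_first(parent: Document, chunks: List[str]) -> List[Document]:
    """Proposed ordering: build the metadata first, then create the document."""
    docs = []
    for split_nr, chunk in enumerate(chunks):
        meta = deepcopy(parent.meta)
        meta["parent_id"] = parent.id
        meta["split_id"] = split_nr
        docs.append(Document(content=chunk, meta=meta))
    return docs


parent = Document(content="same text. same text.", meta={"source": "example.txt"})
chunks = ["same text.", "same text."]  # two chunks with identical content

ids_after = [d.id for d in split_meta_after(parent, chunks)]
ids_first = [d.id for d in split_meta_first(parent, chunks)]

print(len(set(ids_after)))  # 1: both chunks share one id
print(len(set(ids_first)))  # 2: each chunk gets its own id
```

With the metadata built first, the differing split_id is part of what gets hashed, so identical chunks no longer collide.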
Error message
None. Documents with the same id might be handled as duplicates later in a pipeline.
Expected behavior
Different chunks with the same content but differing metadata should have different document ids.