RecursiveDocumentSplitter updates Document's meta field after initializing it #9508

@julian-risch

Describe the bug
RecursiveDocumentSplitter creates each chunk document with

new_doc = Document(content=chunk, meta=deepcopy(doc.meta))

and only afterward updates new_doc.meta. Because the id is assigned when the Document is initialized, chunks with the same content and the same initial meta data are assigned the same id. As a result, the run method of the RecursiveDocumentSplitter might return several documents with the same id. That looks like a bug to me.

A possible fix is to build the new meta data first, including the values currently set after initialization in lines such as new_doc.meta["split_id"] = split_nr, and only afterward create the new document. In addition, we should add the id of the parent document. I have in mind something like:

meta = deepcopy(doc.meta)
meta["parent_id"] = doc.id
meta["split_id"] = split_nr
meta["split_idx_start"] = current_position
meta["_split_overlap"] = [] if self.split_overlap > 0 else None
new_doc = Document(content=chunk, meta=meta)
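
The effect of the two orderings can be illustrated with a minimal, self-contained sketch. It does not use Haystack: `SketchDocument` is a hypothetical stand-in that, as an assumption matching the behavior described above, derives its id once at initialization from content and meta. The helper names `split_meta_after_init` and `split_meta_before_init` are likewise made up for illustration.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document; not Haystack's class."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Assumption: the id is derived once, at init time, from content and meta.
        if not self.id:
            payload = f"{self.content}{self.meta}".encode("utf-8")
            self.id = hashlib.sha256(payload).hexdigest()


def split_meta_after_init(parent: SketchDocument, chunks: List[str]) -> List[SketchDocument]:
    """Current ordering: create the document, then update its meta."""
    docs = []
    for split_nr, chunk in enumerate(chunks):
        new_doc = SketchDocument(content=chunk, meta=deepcopy(parent.meta))
        new_doc.meta["split_id"] = split_nr  # too late: the id is already fixed
        docs.append(new_doc)
    return docs


def split_meta_before_init(parent: SketchDocument, chunks: List[str]) -> List[SketchDocument]:
    """Proposed ordering: build the full meta first, then create the document."""
    docs = []
    for split_nr, chunk in enumerate(chunks):
        meta = deepcopy(parent.meta)
        meta["parent_id"] = parent.id
        meta["split_id"] = split_nr
        docs.append(SketchDocument(content=chunk, meta=meta))
    return docs


parent = SketchDocument(content="abc abc", meta={"source": "demo"})
chunks = ["abc", "abc"]  # two chunks with identical content

buggy = split_meta_after_init(parent, chunks)
fixed = split_meta_before_init(parent, chunks)

print(len({d.id for d in buggy}))  # 1: both chunks share one id
print(len({d.id for d in fixed}))  # 2: split_id makes the ids differ
```

In the first variant the meta of the two chunks does end up different, but the ids were already computed from identical inputs, which is the situation this issue describes.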

Error message
None. Documents with the same id might be handled as duplicates later in a pipeline.

Expected behavior
Different chunks with the same content but differing meta data should have different document ids.

Labels

Contributions wanted! (Looking for external contributions)
P1 (High priority, add to the next sprint)

Status

Done