fix: make Document.id deterministic regardless of meta key order #11445

@Aarkin7

Describe the bug
Document.id is supposed to be a content fingerprint, but it actually depends on the insertion order of keys in meta. So two Documents with the same content and the same metadata can end up with different IDs depending on how the meta dict was built, which silently breaks duplicate detection in document stores and any cache keyed on the document ID.

The cause: _create_id builds the hash input with an f-string that includes {meta}, and dict's repr reflects insertion order since Python 3.7.
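The mechanism can be shown without Haystack at all. The sketch below is a simplified stand-in for the pattern described above, not the library's actual code: `naive_id` is a hypothetical helper, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib


def naive_id(content: str, meta: dict) -> str:
    # Hypothetical stand-in for the pattern described above:
    # the dict is interpolated into an f-string, so its repr
    # (which follows insertion order) becomes part of the hash input.
    data = f"{content}{meta}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


meta_a = {"a": 1, "b": 2}
meta_b = {"b": 2, "a": 1}

print(meta_a == meta_b)              # True: dict equality ignores order
print(repr(meta_a) == repr(meta_b))  # False: repr follows insertion order
print(naive_id("hello", meta_a) == naive_id("hello", meta_b))  # False
```

Dict equality and dict repr disagree about whether order matters, and any hash built on the repr inherits that disagreement.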

Error message
No error. The same logical document just gets written twice (or N times), and you only notice later when retrieval metrics drift or storage/embedding costs are higher than expected.

Expected behavior
Two Documents with equal meta (regardless of how the dict was constructed) should produce the same id. DuplicatePolicy.SKIP / FAIL should actually catch these as duplicates.
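One possible way to get this behavior is to serialize meta in a canonical form before hashing. This is only a sketch under assumptions, not a proposed patch: `stable_id` is a hypothetical helper, SHA-256 is assumed, and `default=str` is an arbitrary choice for values that JSON cannot encode directly.

```python
import hashlib
import json


def stable_id(content: str, meta: dict) -> str:
    # Hypothetical helper, not Haystack's API.
    # sort_keys=True orders keys at every nesting level, so the
    # serialized form no longer depends on insertion order.
    # default=str is an assumed fallback for non-JSON values (e.g. datetime).
    canonical_meta = json.dumps(meta, sort_keys=True, default=str)
    data = f"{content}{canonical_meta}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


a = stable_id("hello", {"a": 1, "b": {"x": 1, "y": 2}})
b = stable_id("hello", {"b": {"y": 2, "x": 1}, "a": 1})
print(a == b)  # True
```

Note that `sort_keys=True` raises a `TypeError` when keys of different types cannot be compared, so this sketch assumes string keys. Any change to the hash input would also presumably change the IDs of documents that are already stored, which a real fix would need to account for.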

Additional context
Any code path that builds meta dynamically is exposed: different converters, JSON parsed by different libraries, code paths that prepend vs. append default fields, etc. The embedding and sparse_embedding fields are also stringified raw, so the same issue could in principle show up there if those values aren't insertion-stable.

To Reproduce

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

a = Document(content="hello", meta={"a": 1, "b": 2})
b = Document(content="hello", meta={"b": 2, "a": 1})

print(a.meta == b.meta)  # True
print(a.id == b.id)      # False  <- bug

store = InMemoryDocumentStore()
store.write_documents([a, b], policy=DuplicatePolicy.SKIP)
print(store.count_documents())  # 2  (expected 1)

System:

  • OS: macOS 26.4.1
  • Haystack version: main (reproduced on the current main branch)

Labels: P2 (Medium priority, add to the next sprint if no P1 available)