Describe the bug
Document.id is supposed to be a content fingerprint, but it actually depends on the insertion order of keys in meta. So two Documents with the same content and the same metadata can end up with different IDs depending on how the meta dict was built, which silently breaks duplicate detection in document stores and any cache keyed on the document ID.
The cause: _create_id builds the hash input with an f-string that includes {meta}, and dict's repr reflects insertion order since Python 3.7.
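The mechanism can be shown without Haystack at all. The sketch below uses a hypothetical `naive_fingerprint` helper as a stand-in for the f-string approach described above; it is not the actual `_create_id` body, and the hash function and field layout are assumptions made for illustration.

```python
import hashlib


def naive_fingerprint(content: str, meta: dict) -> str:
    # Hypothetical stand-in for an f-string based hash input.
    # Interpolating the dict calls str(meta), which follows insertion order.
    data = f"{content}{meta}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


meta_ab = {"a": 1, "b": 2}
meta_ba = {"b": 2, "a": 1}

# Dict equality ignores key order...
print(meta_ab == meta_ba)
# ...but the string forms differ, so the hashes differ too.
print(str(meta_ab) == str(meta_ba))
print(naive_fingerprint("hello", meta_ab) == naive_fingerprint("hello", meta_ba))
```

The first comparison is true while the other two are false, which is the same mismatch the reproduction below triggers through `Document.id`.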
Error message
No error. The same logical document just gets written twice (or N times), and you only notice later when retrieval metrics drift or storage/embedding costs are higher than expected.
Expected behavior
Two Documents with equal meta (regardless of how the dict was constructed) should produce the same id. DuplicatePolicy.SKIP / FAIL should actually catch these as duplicates.
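One possible way to get this behavior is to serialize meta in a canonical form before hashing. The sketch below is only an illustration under stated assumptions: `canonical_meta` and `stable_fingerprint` are hypothetical names, meta keys are assumed to be strings, and values that are not JSON-serializable fall back to `str()`. It is not a proposal for the exact patch.

```python
import hashlib
import json


def canonical_meta(meta: dict) -> str:
    # sort_keys=True orders keys at every nesting level, so insertion
    # order no longer affects the serialized form.
    # default=str is an assumed fallback for non-JSON values (e.g. datetime).
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), default=str)


def stable_fingerprint(content: str, meta: dict) -> str:
    # Hypothetical order-independent variant of the hash input.
    data = f"{content}{canonical_meta(meta)}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


first = stable_fingerprint("hello", {"a": 1, "b": {"x": 1, "y": 2}})
second = stable_fingerprint("hello", {"b": {"y": 2, "x": 1}, "a": 1})
print(first == second)
```

Any change to the hash input, including this one, would change the IDs of documents that were already written, so a real fix would likely need a migration note.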
Additional context
Any code path that builds meta dynamically is exposed: different converters, JSON parsed by different libraries, code paths that prepend rather than append default fields, and so on. The embedding and sparse_embedding fields are also stringified raw, so the same issue could in principle show up there if their string representations aren't stable.
To Reproduce
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
a = Document(content="hello", meta={"a": 1, "b": 2})
b = Document(content="hello", meta={"b": 2, "a": 1})
print(a.meta == b.meta) # True
print(a.id == b.id) # False <- bug
store = InMemoryDocumentStore()
store.write_documents([a, b], policy=DuplicatePolicy.SKIP)
print(store.count_documents()) # 2 (expected 1)
FAQ Check
System:
- OS: macOS 26.4.1
- Haystack version: main (reproduced on the current main branch)