Using text hash as id to prevent document duplication #1000
Conversation
docs/_src/api/api/retriever.md
@@ -355,6 +355,7 @@ train a DensePassageRetrieval model
- `train_filename`: training filename
- `dev_filename`: development set filename, file to be used by model in eval step of training
- `test_filename`: test set filename, file to be used by model in test step after training
- `max_sample`: maximum number of input samples to convert. Can be used for debugging a smaller dataset.
Not sure why these changes are in this PR. I have not added these.
Thanks, @lalitpagaria. I will review it later today. BTW this is #1000 🎉
I like the implementation. Looks easy and solid.
One thing that is not clear to me yet: what's the behaviour if I now add a second "duplicate" document with the same hash? Will it replace the existing doc, or will the new doc be ignored? Is this behavior consistent across all document stores? We can probably cover it by adding a few additional test cases and, ideally, a warning message in write_documents() to inform users about duplicates.
(Sorry, didn't have time today to test the behavior myself.)
I have added a test for that, but there is a consistency issue:
Even though I added the test, I think it is better to make the memory store consistent too, i.e. throw an exception when writing a document with a duplicate id. Check here to see what happens when we write documents with duplicate ids: https://github.com/deepset-ai/haystack/runs/2442736307
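The collision behavior under discussion can be sketched in a few lines. This uses Python's stdlib `hashlib` as a stand-in for the 128-bit MurmurHash the PR actually uses, and the minimal `Document` class is hypothetical, not the Haystack one:

```python
import hashlib

class Document:
    """Hypothetical minimal document: the id is derived from the text hash."""

    def __init__(self, text: str):
        self.text = text
        # Stand-in for a 128-bit MurmurHash of the text; same text -> same id.
        self.id = hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()

doc_a = Document("What is Haystack?")
doc_b = Document("What is Haystack?")   # same passage ingested a second time
doc_c = Document("A different passage.")

print(doc_a.id == doc_b.id)  # True: duplicate text collides on the same id
print(doc_a.id == doc_c.id)  # False
```

Because the second ingestion reproduces the same id, the document store must decide whether to overwrite, skip, or raise, which is exactly the consistency question raised above.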
Thanks for checking the behavior and adding the test. We should definitely make it consistent across doc stores. I think just throwing an exception is not an ideal user experience here: imagine I add 100 docs in a batch. IMO it would be nicer to issue a warning including the problematic duplicate document and make sure that the rest of the documents get indexed correctly. What do you think @lalitpagaria @oryx1729 ?
I agree with you @tholor.
In all cases write_documents should return the inserted document ids (same for the REST API), so the user knows what was written and what was skipped. We can tackle this in a separate PR, as the resulting changes will be bigger, and handling this in the ES, memory, and SQL based stores would differ.
I am fine with tackling the above behavior in a separate PR. However, let's at least make sure that the DocumentStores have consistent behavior in the meantime. So I'd suggest:
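The warn-and-skip behavior proposed above could take roughly this shape. This is a sketch only: the dict-backed store and this `write_documents` signature are simplified stand-ins, not the actual Haystack API:

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

def write_documents(store: Dict[str, dict], docs: List[dict]) -> List[str]:
    """Index docs into a toy dict-backed store, skipping duplicates with a warning.

    Returns the ids that were actually written, so the caller can tell
    what was indexed and what was skipped (as suggested in the discussion).
    """
    written = []
    for doc in docs:
        doc_id = doc["id"]
        if doc_id in store:
            logger.warning("Duplicate document id %s, skipping: %r", doc_id, doc["text"])
            continue
        store[doc_id] = doc
        written.append(doc_id)
    return written

store: Dict[str, dict] = {}
docs = [
    {"id": "a1", "text": "first passage"},
    {"id": "a1", "text": "first passage"},   # duplicate, should be skipped
    {"id": "b2", "text": "second passage"},
]
print(write_documents(store, docs))  # ['a1', 'b2']
```

With this design a batch of 100 docs containing one duplicate still indexes the other 99, and the returned id list tells the user exactly what made it in.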
I agree, this makes sense.
Implementing this is not easy, as Elasticsearch always returns a bulk exception, even in case of a network error during the transaction. Similarly, the SQL store can throw another constraint error during the write_documents function.
…_keys from object attribute
Ready to merge.
We will address the aforementioned limitations in a separate PR.
Proposed changes:
This PR is in response to a Slack discussion.
New Haystack users frequently encounter duplicate answers in their results. This happens due to duplicate ingestion of the same text passage multiple times, because each time Haystack generates a new id via uuid. To prevent this, we use the text hash as the id to prevent document duplication. We also provide a way to customize this: the user can pass values such as the cleaned text, file hash, paragraph number, or page number to customize id generation via hashing. The fast hashing algorithm MurmurHash is used for 128-bit hash generation.

Status (please check what you already did):
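The id-generation scheme described above can be sketched as follows. Stdlib `hashlib` stands in for the 128-bit MurmurHash (mmh3) the PR mentions, and the function name and its `id_hash_keys`/`meta` parameters are illustrative, not the actual Haystack API:

```python
import hashlib
from typing import List, Optional

def generate_doc_id(text: str,
                    id_hash_keys: Optional[List[str]] = None,
                    meta: Optional[dict] = None) -> str:
    """Derive a stable 128-bit hex id from the text, or from custom keys.

    By default the text alone determines the id, so re-ingesting the same
    passage reproduces the same id instead of a fresh uuid. If id_hash_keys
    is given, the listed meta fields (e.g. file hash, paragraph number,
    page number) are hashed instead, letting the user customize what counts
    as a "duplicate".
    """
    meta = meta or {}
    if id_hash_keys:
        material = "\n".join(str(meta.get(key, "")) for key in id_hash_keys)
    else:
        material = text
    # Stand-in for a 128-bit MurmurHash: 16 bytes = 128 bits.
    return hashlib.blake2b(material.encode("utf-8"), digest_size=16).hexdigest()

# Same text -> same id, so a duplicate ingestion is detectable.
print(generate_doc_id("some passage") == generate_doc_id("some passage"))  # True
```

Hashing the content instead of drawing a random uuid is what makes duplicates visible to the document store in the first place: two ingestions of the same passage now target the same id.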