
Using text hash as id to prevent document duplication #1000

Merged: 11 commits, May 17, 2021
Conversation

lalitpagaria (Contributor) commented Apr 26, 2021

Proposed changes:

This PR is in response to a Slack discussion.

New Haystack users often encounter duplicate answers in their results. This happens when the same text passage is ingested multiple times, because Haystack generates a new id via uuid on every write. To prevent this, we use a hash of the text as the document id. Users can also customize id generation by passing their own values to hash, such as the cleaned text, a file hash, a paragraph number, or a page number. The fast MurmurHash algorithm is used to generate a 128-bit hash (see the sketch below).
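For illustration, a minimal sketch of the hashing idea, assuming the mmh3 package (MurmurHash3 bindings); the helper name and fields below are illustrative and may differ from the actual Document implementation:

```python
# Minimal sketch of deriving a deterministic document id from the text
# (or from user-chosen keys) with MurmurHash, assuming the `mmh3` package.
import mmh3

def generate_doc_id(text: str, id_hash_keys: list = None) -> str:
    """Hash the text, or user-supplied keys (cleaned text, file hash,
    paragraph number, page number, ...), into a stable 128-bit hex id."""
    content = str(id_hash_keys) if id_hash_keys else text
    return "{:02x}".format(mmh3.hash128(content, signed=False))

# Ingesting the same passage twice now yields the same id, so the second
# write can be detected as a duplicate instead of getting a fresh uuid.
assert generate_doc_id("Berlin is the capital of Germany.") == \
       generate_doc_id("Berlin is the capital of Germany.")
```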

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@@ -355,6 +355,7 @@ train a DensePassageRetrieval model
- `train_filename`: training filename
- `dev_filename`: development set filename, file to be used by model in eval step of training
- `test_filename`: test set filename, file to be used by model in test step after training
- `max_sample`: maximum number of input samples to convert. Can be used for debugging a smaller dataset.
lalitpagaria (Contributor, Author) commented on this diff:

Not sure why these changes are in this PR. I have not added them.

lalitpagaria (Contributor, Author) commented:

@oryx1729 @tholor, can you please review?
The test is failing due to an unrelated error (it fails to download the model Helsinki-NLP/opus-mt-en-de).
I think a rerun of the tests will solve this.

tholor (Member) commented Apr 27, 2021

Thanks, @lalitpagaria. I will review it later today.

BTW this is #1000 🎉

tholor (Member) left a comment:

I like the implementation. Looks easy and solid.

One thing that is not clear to me yet: what's the behavior if I now add a second "duplicate" document with the same hash? Will it replace the existing doc with the new one, or will the new doc be ignored? Is this behavior consistent across all document stores? We can probably cover it by adding a few additional test cases and ideally adding a warning message in write_documents() to inform users about duplicates.

(Sorry, I didn't have time today to test the behavior myself.)

lalitpagaria (Contributor, Author) commented:

I have added a test for that, but there is a consistency issue:

  • Memory store: allows duplicates
  • ES-based store: throws a BulkIndexError
  • SQL-based store: throws an IntegrityError because of the UNIQUE constraint

Even though I added the test, I think it would be better to make the memory store consistent as well, i.e. throw an exception when writing a document with a duplicate id. A rough sketch of the test idea is below.

To see what happens when we write documents with duplicate ids, check https://github.com/deepset-ai/haystack/runs/2442736307
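For context, a rough pytest-style sketch of such a duplicate-write test; the document_store fixture is an assumption about the test setup, and the backend-specific errors are the ones listed above:

```python
# Rough sketch of a duplicate-write test; `document_store` is an assumed
# pytest fixture providing one of the store implementations.
def test_write_duplicate_documents(document_store):
    doc = {"text": "Berlin is the capital of Germany."}
    document_store.write_documents([doc])
    try:
        # Same text -> same hash-based id -> a duplicate write.
        document_store.write_documents([doc])
    except Exception as exc:
        # ES-based store: BulkIndexError; SQL-based store: IntegrityError.
        print(f"Duplicate write rejected: {exc!r}")
    # The in-memory store currently reaches this point without raising,
    # which is the inconsistency described above.
    assert document_store.get_document_count() >= 1
```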

tholor (Member) commented Apr 28, 2021

Thanks for checking the behavior and adding the test. We should definitely make it consistent across doc stores.

I think just throwing an exception is not an ideal user experience here. Imagine I add 100 docs in a batch via write_documents() and one of them is a duplicate: I will receive the error, but how should I go on? How can I index the other 99 docs?

IMO it would be nicer to issue a warning that includes the problematic duplicate document and make sure that the rest of the documents get indexed correctly. What do you think @lalitpagaria @oryx1729?

lalitpagaria (Contributor, Author) commented:

I agree with you @tholor.
In this case we need to provide three options in the write_documents function:

  1. Ignore duplicates (with a warning): default option
  2. Overwrite if the document already exists
  3. Fail on duplicates

In all cases write_documents should return the inserted document ids (the same goes for the REST API), so the user knows what was written and what was skipped. A rough sketch of this proposed API is below.

We can tackle this in a separate PR, as the resulting changes will be bigger, and handling them in the ES, memory, and SQL-based stores will differ.
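A rough sketch of the proposed write_documents behavior (not part of this PR); the parameter name duplicate_documents and the helpers _existing_ids() and _index_document() are illustrative placeholders:

```python
import logging
from typing import List

logger = logging.getLogger(__name__)

def write_documents(self, documents: List[dict], duplicate_documents: str = "skip") -> List[str]:
    """Write documents and return the ids that were actually inserted.

    duplicate_documents:
        "skip"      - ignore duplicates and log a warning (default)
        "overwrite" - replace the existing document with the new one
        "fail"      - raise an error on the first duplicate
    """
    existing_ids = self._existing_ids()        # illustrative helper
    written_ids = []
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in existing_ids and duplicate_documents == "skip":
            logger.warning("Skipping duplicate document with id %s", doc_id)
            continue
        if doc_id in existing_ids and duplicate_documents == "fail":
            raise ValueError(f"Duplicate document id: {doc_id}")
        self._index_document(doc)              # illustrative backend call
        written_ids.append(doc_id)
    return written_ids
```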

lalitpagaria (Contributor, Author) commented:

@tholor @oryx1729, please let me know if anything is still needed in this PR.

tholor (Member) commented Apr 30, 2021

I am fine with tackling the above behavior in a separate PR. However, let's at least make sure that the DocumentStores have consistent behavior in the meantime. So I'd suggest (see the sketch below):

  • throw an exception in the memory store as well (e.g. by checking ids before inserting docs)
  • throw the same exception across all stores so that it can easily be caught
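For illustration, a sketch of the "same exception across all stores" idea; the exception name DuplicateDocumentError is illustrative, and each backend would translate its native error (BulkIndexError, IntegrityError, or an explicit id check in the in-memory store) into it:

```python
class DuplicateDocumentError(ValueError):
    """Raised by any document store when a document with an existing id is written."""

# In-memory store example: check ids before inserting, as suggested above.
def write_documents_in_memory(index: dict, documents: list):
    for doc in documents:
        if doc["id"] in index:
            raise DuplicateDocumentError(f"Document with id '{doc['id']}' already exists.")
        index[doc["id"]] = doc
```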

lalitpagaria (Contributor, Author) commented:

I agree, this makes sense:

> throw exception also in memorystore (e.g. by checking ids before inserting docs)

The following, however, is not easy to implement, as Elasticsearch always returns a BulkIndexError even in the case of a network error during the transaction, and the SQL store can similarly throw other constraint errors in the write_documents function:

> throw the same exception across all stores so that it can easily be caught

tholor (Member) left a comment:

Ready to merge.
We will address the aforementioned limitations in a separate PR.
