
Using text hash as id to prevent document duplication #1000

Merged: 11 commits, May 17, 2021
Conversation

lalitpagaria (Contributor) commented Apr 26, 2021

Proposed changes:

This PR is in response to a Slack discussion.

New Haystack users often encounter duplicate answers in their results. This happens when the same text passage is ingested multiple times, because Haystack generates a new id via uuid on every write. To prevent this, we use a hash of the text as the document id. Users can also customize id generation by passing their own values to hash, such as the cleaned text, a file hash, a paragraph number, or a page number. The fast MurmurHash algorithm is used to generate a 128-bit hash (see the sketch below).
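For illustration, a minimal sketch of the hashing idea, assuming the mmh3 package (MurmurHash3 bindings); the helper name and fields below are illustrative and may differ from the actual Document implementation:

```python
# Minimal sketch of deriving a deterministic document id from the text
# (or from user-chosen keys) with MurmurHash, assuming the `mmh3` package.
import mmh3

def generate_doc_id(text: str, id_hash_keys: list = None) -> str:
    """Hash the text, or user-supplied keys (cleaned text, file hash,
    paragraph number, page number, ...), into a stable 128-bit hex id."""
    content = str(id_hash_keys) if id_hash_keys else text
    return "{:02x}".format(mmh3.hash128(content, signed=False))

# Ingesting the same passage twice now yields the same id, so the second
# write can be detected as a duplicate instead of getting a fresh uuid.
assert generate_doc_id("Berlin is the capital of Germany.") == \
       generate_doc_id("Berlin is the capital of Germany.")
```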

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@@ -355,6 +355,7 @@ train a DensePassageRetrieval model
- `train_filename`: training filename
- `dev_filename`: development set filename, file to be used by model in eval step of training
- `test_filename`: test set filename, file to be used by model in test step after training
- `max_sample`: maximum number of input samples to convert. Can be used for debugging a smaller dataset.
lalitpagaria (Contributor, Author) commented on this diff:

Not sure why these changes are in this PR. I have not added them.

lalitpagaria (Contributor, Author) commented:

@oryx1729 @tholor, can you please review?
The test is failing due to an unrelated error (it fails to download the model Helsinki-NLP/opus-mt-en-de).
I think a rerun of the tests will solve this.

tholor (Member) commented Apr 27, 2021

Thanks, @lalitpagaria. I will review it later today.

BTW this is #1000 🎉

tholor (Member) left a comment:

I like the implementation. Looks easy and solid.

One thing that is not clear to me yet: what's the behavior if I now add a second "duplicate" document with the same hash? Will it replace the existing doc with the new one, or will the new doc be ignored? Is this behavior consistent across all document stores? We can probably cover it by adding a few additional test cases and ideally adding a warning message in write_documents() to inform users about duplicates.

(Sorry, I didn't have time today to test the behavior myself.)

lalitpagaria (Contributor, Author) commented:

I have added a test for that, but there is a consistency issue:

  • Memory store: allows duplicates
  • ES-based store: throws a BulkIndexError
  • SQL-based store: throws an IntegrityError because of the UNIQUE constraint

Even though I added the test, I think it would be better to make the memory store consistent as well, i.e. throw an exception when writing a document with a duplicate id. A rough sketch of the test idea is below.

To see what happens when we write documents with duplicate ids, check https://github.com/deepset-ai/haystack/runs/2442736307
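For context, a rough pytest-style sketch of such a duplicate-write test; the document_store fixture is an assumption about the test setup, and the backend-specific errors are the ones listed above:

```python
# Rough sketch of a duplicate-write test; `document_store` is an assumed
# pytest fixture providing one of the store implementations.
def test_write_duplicate_documents(document_store):
    doc = {"text": "Berlin is the capital of Germany."}
    document_store.write_documents([doc])
    try:
        # Same text -> same hash-based id -> a duplicate write.
        document_store.write_documents([doc])
    except Exception as exc:
        # ES-based store: BulkIndexError; SQL-based store: IntegrityError.
        print(f"Duplicate write rejected: {exc!r}")
    # The in-memory store currently reaches this point without raising,
    # which is the inconsistency described above.
    assert document_store.get_document_count() >= 1
```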

tholor (Member) commented Apr 28, 2021

Thanks for checking the behavior and adding the test. We should definitely make it consistent across doc stores.

I think just throwing an exception is not an ideal user experience here. Imagine I add 100 docs in a batch via write_documents() and one of them is a duplicate: I will receive the error, but how should I go on? How can I index the other 99 docs?

IMO it would be nicer to issue a warning that includes the problematic duplicate document and make sure that the rest of the documents get indexed correctly. What do you think @lalitpagaria @oryx1729?

lalitpagaria (Contributor, Author) commented:

I agree with you @tholor.
In this case we need to provide three options in the write_documents function:

  1. Ignore duplicates (with a warning): default option
  2. Overwrite if the document already exists
  3. Fail on duplicates

In all cases write_documents should return the inserted document ids (the same goes for the REST API), so the user knows what was written and what was skipped. A rough sketch of this proposed API is below.

We can tackle this in a separate PR, as the resulting changes will be bigger, and handling them in the ES, memory, and SQL-based stores will differ.
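A rough sketch of the proposed write_documents behavior (not part of this PR); the parameter name duplicate_documents and the helpers _existing_ids() and _index_document() are illustrative placeholders:

```python
import logging
from typing import List

logger = logging.getLogger(__name__)

def write_documents(self, documents: List[dict], duplicate_documents: str = "skip") -> List[str]:
    """Write documents and return the ids that were actually inserted.

    duplicate_documents:
        "skip"      - ignore duplicates and log a warning (default)
        "overwrite" - replace the existing document with the new one
        "fail"      - raise an error on the first duplicate
    """
    existing_ids = self._existing_ids()        # illustrative helper
    written_ids = []
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in existing_ids and duplicate_documents == "skip":
            logger.warning("Skipping duplicate document with id %s", doc_id)
            continue
        if doc_id in existing_ids and duplicate_documents == "fail":
            raise ValueError(f"Duplicate document id: {doc_id}")
        self._index_document(doc)              # illustrative backend call
        written_ids.append(doc_id)
    return written_ids
```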

lalitpagaria (Contributor, Author) commented:

@tholor @oryx1729, please let me know if anything is still needed in this PR.

tholor (Member) commented Apr 30, 2021

I am fine with tackling the above behavior in a separate PR. However, let's at least make sure that the DocumentStores have consistent behavior in the meantime. So I'd suggest (see the sketch below):

  • throw an exception in the memory store as well (e.g. by checking ids before inserting docs)
  • throw the same exception across all stores so that it can easily be caught
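For illustration, a sketch of the "same exception across all stores" idea; the exception name DuplicateDocumentError is illustrative, and each backend would translate its native error (BulkIndexError, IntegrityError, or an explicit id check in the in-memory store) into it:

```python
class DuplicateDocumentError(ValueError):
    """Raised by any document store when a document with an existing id is written."""

# In-memory store example: check ids before inserting, as suggested above.
def write_documents_in_memory(index: dict, documents: list):
    for doc in documents:
        if doc["id"] in index:
            raise DuplicateDocumentError(f"Document with id '{doc['id']}' already exists.")
        index[doc["id"]] = doc
```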

lalitpagaria (Contributor, Author) commented:

I agree, this makes sense:

> throw exception also in memorystore (e.g. by checking ids before inserting docs)

The following, however, is not easy to implement, as Elasticsearch always returns a BulkIndexError even in the case of a network error during the transaction, and the SQL store can similarly throw other constraint errors in the write_documents function:

> throw the same exception across all stores so that it can easily be caught

tholor (Member) left a comment:

Ready to merge.
We will address the aforementioned limitations in a separate PR.
