Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for update_existing_documents to sql and faiss document stores #584

Merged
merged 6 commits into from
Nov 16, 2020
Merged

Adding support for update_existing_documents to sql and faiss document stores #584

merged 6 commits into from
Nov 16, 2020

Conversation

lalitpagaria
Copy link
Contributor

Resolve #562

@lalitpagaria
Copy link
Contributor Author

@tholor and @tanaysoni please review

Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @lalitpagaria !

It seems to me that we are missing one case: When update_existing_documents = True and we call write_documents() with an existing document_id we will update the data in SQL. What about the associated embedding in FAISS? From what I can tell, we will add the new vector in FAISS, add the new vector_id to SQL and leave the "old vector" in FAISS untouched.

I believe deleting single vectors from FAISS is not possible for all index types (https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes#removing-elements-from-an-index), but in that case, we should at least log a warning when settingupdate_existing_documents = True and include it in the docstrings.

My second concern would be around speed if update_existing_documents = True as SQL was often the bottleneck before, but I haven't measured it.

What do you think?

@lalitpagaria
Copy link
Contributor Author

@tholor Yes you are correct, I have added following warning -

if self.update_existing_documents and add_vectors:
            logger.warning("You have enabled `update_existing_documents` feature and "
                           "`FAISSDocumentStore` does not support update in existing `faiss_index`.\n"
                           "Please call `update_embeddings` method to repopulate `faiss_index`")

My second concern would be around speed if update_existing_documents = True as SQL was often the bottleneck before, but I haven't measured it.

We can add this info in doc_string of update_existing_documents parameter, but it is up to user to decide based on their use case.

@lalitpagaria
Copy link
Contributor Author

Can you please update the doc_string as I am away from my system.

@tanaysoni
Copy link
Contributor

Thank you for working on this, @lalitpagaria. It looks good-to-go!

@tanaysoni tanaysoni merged commit 3f81c93 into deepset-ai:master Nov 16, 2020
@lalitpagaria lalitpagaria deleted the add_update_existing_documents branch November 16, 2020 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicate document bulk index error with Faiss document store
3 participants