
Add batch update of embeddings in document stores #733

Merged: 16 commits into master from batch-update-embeddings, Jan 21, 2021

Conversation

@tanaysoni (Contributor) commented Jan 13, 2021

This PR introduces a get_all_documents_generator() method for the document stores. It yields documents iteratively, making it memory efficient when working with a large number of documents.
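For context, a hypothetical usage sketch of the new method (the store type, the "document" index name, and the exact signature are assumptions and may differ from the PR):

```python
from haystack.document_store.sql import SQLDocumentStore

document_store = SQLDocumentStore(url="sqlite:///haystack.db")

# Documents stream lazily, batch by batch, instead of being loaded all at once.
for doc in document_store.get_all_documents_generator(index="document"):
    print(doc.id)
```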

@tanaysoni changed the title from "WIP: Add batch update of embeddings in document stores" to "Add batch update of embeddings in document stores" Jan 14, 2021
        index: Optional[str] = None,
        filters: Optional[Dict[str, List[str]]] = None,
        return_embedding: Optional[bool] = None,
        page_number: Optional[int] = None,
Contributor

I find the term page here quite confusing because it is essentially a document. I see that these two variables are used to take a slice of the full set of documents. Some terminology more along the lines of documents and slices would be clearer in my opinion.

@@ -387,10 +392,12 @@ def get_label_count(self, index: Optional[str] = None) -> int:
        return self.get_document_count(index=index)

    def get_all_documents(
Contributor

With the new params, this fn can be configured to return less than all documents, which makes its name quite confusing. Probably not something to deal with in this PR (not least since it would be a breaking change), but definitely worth addressing as an issue.

@@ -166,6 +173,8 @@ def get_all_documents(
            DocumentORM.text,
            DocumentORM.vector_id
        ).filter_by(index=index)
        if page_number is not None and page_size is not None:
            documents_query = documents_query.offset(page_number * page_size).limit(page_size)
Contributor

Pagination via offset can cause performance issues, as the DB has to scan through all the preceding rows. A window query (supported by many major DBs: Postgres, Oracle, MySQL, etc.) or a non-window query using LIMIT (for SQLite) would give a good performance advantage. For more information: https://github.com/sqlalchemy/sqlalchemy/wiki/RangeQuery-and-WindowedRangeQuery
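For illustration, a minimal keyset-pagination sketch along the lines of the linked wiki page, assuming a SQLAlchemy session and an indexed, monotonically increasing DocumentORM.id column; this avoids OFFSET scans entirely:

```python
def windowed_documents(session, index: str, page_size: int = 1000):
    """Yield DocumentORM rows batch by batch without OFFSET scans."""
    last_id = None
    while True:
        query = (
            session.query(DocumentORM)
            .filter_by(index=index)
            .order_by(DocumentORM.id)
        )
        if last_id is not None:
            # Resume right after the last row seen; the index on `id` makes this cheap.
            query = query.filter(DocumentORM.id > last_id)
        batch = query.limit(page_size).all()
        if not batch:
            break
        yield from batch
        last_id = batch[-1].id
```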

Contributor (Author)

This would be easy to implement if get_all_documents() returned an iterator that yields all documents via a windowed query. In the current implementation in this PR, each paginated query is an independent call to get_all_documents(). That said, it might be a good idea to implement an optional mode where get_all_documents() returns an iterator over paginated results. This would abstract away the details of pagination and allow a different pagination strategy for each document store while keeping a uniform get_all_documents() interface.
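A sketch of that optional iterator mode (the helper name and defaults are hypothetical, not this PR's code):

```python
def iter_all_documents(store, index=None, page_size=1000):
    """Wrap repeated paginated calls so callers never deal with page numbers."""
    page_number = 0
    while True:
        docs = store.get_all_documents(index=index,
                                       page_number=page_number,
                                       page_size=page_size)
        if not docs:
            break
        yield from docs
        page_number += 1
```

Each document store could then swap in a windowed query internally without changing the caller-facing interface.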

        filters: Optional[Dict[str, List[str]]] = None,
        return_embedding: Optional[bool] = None,
        page_number: Optional[int] = None,
        page_size: Optional[int] = None,
    ) -> List[Document]:
Contributor

We could introduce a NextToken class, which is nothing but a holder for page_number and page_size. The get_documents function could then return Tuple[List[Document], NextToken], and this NextToken could be used to iterate. WDYT?
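For concreteness, a minimal sketch of the NextToken idea (all names are hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NextToken:
    page_number: int
    page_size: int


def get_documents_page(store, token: Optional[NextToken] = None,
                       page_size: int = 1000) -> Tuple[List, Optional[NextToken]]:
    token = token or NextToken(page_number=0, page_size=page_size)
    docs = store.get_all_documents(page_number=token.page_number,
                                   page_size=token.page_size)
    # A short page means we've reached the end, so there is no next token.
    next_token = (None if len(docs) < token.page_size
                  else NextToken(token.page_number + 1, token.page_size))
    return docs, next_token
```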

        vector_id = self.faiss_index.ntotal
        page_number = 0
        for _ in tqdm(range(0, document_count, self.index_buffer_size)):
            documents = self.get_all_documents(index=index, page_number=page_number, page_size=self.index_buffer_size)
Contributor

If we add page_number and page_size as parameters of the update_embeddings function as well, then we don't need index_buffer_size. That would also turn update_embeddings into a paginated write function.
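Sketched out, that suggestion might look like this (the signature and the retriever's embed_passages call are assumptions, not this PR's code):

```python
def update_embeddings_page(store, retriever, index=None,
                           page_number=0, page_size=10_000):
    """Re-embed a single page of documents; returns how many were updated."""
    documents = store.get_all_documents(index=index, page_number=page_number,
                                        page_size=page_size)
    if not documents:
        return 0
    embeddings = retriever.embed_passages(documents)  # assumed retriever API
    for doc, emb in zip(documents, embeddings):
        doc.embedding = emb
    store.write_documents(documents, index=index)  # upsert with new embeddings
    return len(documents)
```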

        if len(documents) == 0:
            logger.warning("Calling DocumentStore.update_embeddings() on an empty index")
            return
        document_count = self.get_document_count(index=index)
Contributor

Not related to this PR, but how will we handle simultaneous/concurrent calls to the update_embeddings and write_documents functions? Concurrency would affect the value of self.get_document_count(index=index) and could also cause vector_id duplication. Mainly: do we need a global lock, since this looks like a critical section?

@tanaysoni (Contributor, Author) replied Jan 18, 2021

Yes, we'd need some guards for concurrent use of the FAISSDocumentStore. In the coming weeks, we also plan to explore whether there's community interest in a document store like Milvus, which would already take care of many scenarios pertaining to production deployments.
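As one illustration of such a guard (not part of this PR): a process-level lock that serializes reading faiss_index.ntotal and the subsequent add, so concurrent writers can never assign overlapping vector ids. A multi-process deployment would need an external lock instead.

```python
import threading

_write_lock = threading.Lock()


def add_embeddings(faiss_index, embeddings):
    """Add vectors atomically and return the first sequential vector id."""
    with _write_lock:
        first_vector_id = faiss_index.ntotal  # next free sequential id
        faiss_index.add(embeddings)           # occupies ids first_vector_id..+n-1
    return first_vector_id
```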

@tanaysoni self-assigned this Jan 20, 2021
@tholor (Member) left a comment

Looking good!

haystack/document_store/elasticsearch.py (review threads resolved)
@tholor merged commit 337376c into master Jan 21, 2021
@tholor deleted the batch-update-embeddings branch January 21, 2021 15:00