
Add run_batch method to all nodes and Pipeline to allow batch querying #2481

Merged (45 commits) on May 11, 2022
Conversation

@bogdankostic (Contributor) commented on May 2, 2022

This PR facilitates batch processing with query Pipelines. This is done by adding a run_batch method and a corresponding predict_batch/retrieve_batch/... method to each of the nodes. To achieve this, the transformers dependency has been upgraded to the newest version, because the currently used version does not support batch processing for transformers' question-answering pipeline.

As decided in #1239, the Pipeline's run_batch method can take a single query or a list of queries as its queries argument, and a single list of Documents or a list of lists of Documents as its documents argument. The behavior of the Pipeline depends on whether a single value or a list of values is provided. In general, the following applies:

|  | Single Query | List of Queries |
| --- | --- | --- |
| List of Docs | The single query is applied to each Document individually; answers are returned for each single Document | Each query is applied to each Document individually; answers are returned for each query-Document pair |
| List of Lists of Docs | The single query is applied to each list of Documents; answers are aggregated per list of Documents | Each query is applied to its corresponding list of Documents (matched by index in the provided lists); answers are aggregated per list of Documents |

If run_batch is called on an indexing pipeline, the pipeline's run method will be called instead.
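The dispatch rules in the table above can be sketched in plain Python. This is an illustrative simplification, not Haystack's actual implementation: `Doc` and `answer` are stand-ins for `haystack.Document` and a real node call.

```python
from typing import List, Union

Doc = str  # stand-in for haystack.Document


def answer(query: str, docs: List[Doc]) -> str:
    # placeholder for a real reader/retriever call
    return f"{query} -> {docs}"


def run_batch(queries: Union[str, List[str]],
              documents: Union[List[Doc], List[List[Doc]]]) -> List[str]:
    single_query = isinstance(queries, str)
    query_list = [queries] if single_query else queries
    single_doc_list = len(documents) > 0 and not isinstance(documents[0], list)
    results = []
    if single_doc_list:
        # each query is applied to each Document individually
        for q in query_list:
            for doc in documents:
                results.append(answer(q, [doc]))
    elif single_query:
        # the single query is applied to each list of Documents
        for doc_list in documents:
            results.append(answer(queries, doc_list))
    else:
        # queries and Document lists are matched by index
        assert len(query_list) == len(documents)
        for q, doc_list in zip(query_list, documents):
            results.append(answer(q, doc_list))
    return results
```

Each of the four table cells maps to one branch: the single-query/single-list case simply degenerates to a one-element outer loop.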

Breaking changes

The input and output of the FARMReader's predict_batch method changed. (Therefore, the output of the QuestionAnswerGenerationPipeline also changed.)

Old way

# Old input
FARMReader.predict_batch(
    query_doc_list=[
        {"question": "Some sample query", "docs": [Document(content="sample doc1"), Document(content="sample doc 2")]},
        {"question": "Some sample query", "docs": [Document(content="sample doc3"), Document(content="sample doc 4")]}
    ]
)

# Old output
[
    {"query": "Some sample query", "no_ans_gap": max_no_ans_gap, "answers": Answers from doc1 and doc2, "label": cur_label},
    {"query": "Some sample query", "no_ans_gap": max_no_ans_gap, "answers": Answers from doc3 and doc4, "label": cur_label}
]

New way

# New input
FARMReader.predict_batch(
    queries=["Some sample query", "Some sample query"],
    documents=[
        [Document(content="sample doc1"), Document(content="sample doc 2")],
        [Document(content="sample doc3"), Document(content="sample doc 4")],
    ],
)

# New output
{"queries": ["Some sample query", "Some sample query"], "answers": [[Answers from doc1 and doc2], [Answers from doc3 and doc4]], "no_ans_gaps": [[no_ans_gaps from query1], [no_ans_gaps from query2]]}

Closes #1239

bogdankostic and others added 3 commits May 2, 2022 14:46
@julian-risch (Member) left a comment:
Looks quite good to me! I just have a couple of things that we should discuss/change before merging. Here is a list (see my individual comments for details). Happy to have a call anytime.

  • batch_size should maybe be an attribute of every node, settable in each node's init method (YAML pipeline definition)
  • Maybe two of the list-of-docs/queries cases can actually be joined?
  • Make the error message more extensive: print not only the expected type but also the given type
  • Allow a different filter per query in queries?
  • truncation_warning is emitted every time predict is executed
  • multiple_doc_lists naming: more than one list? For example in batch_prediction_single_query_multiple_doc_lists
  • Use a distilled, tiny model instead of roberta-base-squad in the test case
  • The transformers upgrade would be better in a separate PR
  • Check whether the tutorials still work (QuestionGenerator could break, I think)

Review thread on haystack/nodes/answer_generator/base.py (resolved)
if headers is None:
    headers = {}

single_query = False
Member:

If the handling of the edge case with a single query makes the implementation much more complex, we could discuss whether queries must be a list in all cases. The edge case would be a list with a single item then.

Member:

As discussed, we can address this question and decide what to do in a separate PR.


# Query case 2: list of queries
elif isinstance(queries, list) and len(queries) > 0 and isinstance(queries[0], str):
    # Docs case 1: single list of Documents -> apply each query to all Documents
Member:

This case does essentially the same as "# Docs case 1: single list of Documents -> apply single query to all Documents". So if queries is a string, we could make it a list [queries] and then use the same code for both cases with the for loop for query in queries:. Merging these two cases could simplify the code a little bit. What do you think? I like the readability of the current version. 👍
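The normalization suggested here can be sketched in a few lines (an illustrative snippet, not the PR's code): a single query string is wrapped into a one-element list, so both cases share the same `for query in queries:` loop.

```python
from typing import List, Union


def normalize_queries(queries: Union[str, List[str]]) -> List[str]:
    """Treat a single query string as a one-element list of queries."""
    return [queries] if isinstance(queries, str) else queries
```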

Member:

As discussed, we can address this question and decide what to do in a separate PR.

Further review threads (resolved):
haystack/nodes/document_classifier/base.py
haystack/nodes/extractor/entity.py
test/test_summarizer.py
test/test_extractor.py (two threads)
test/test_document_classifier.py
setup.cfg

    all_terms_must_match: bool = False,
    scale_score: bool = True,
) -> Union[List[Document], List[List[Document]]]:
    raise NotImplementedError("DeepsetCloudDocumentStore currently does not support query_batch method.")
Member:

Could we maybe add this without much overhead before the next release? Is there an issue for it already? It would be a shame if we couldn't use all the batch processing in deepset Cloud because of this. What we need is a self.client.query_batch() call here and then also a query_batch() implementation in IndexClient.

Contributor (author):

I would suggest just calling query here multiple times, as there is no batch_query endpoint in deepset Cloud (yet). (In query_batch() in IndexClient, we would need to call the documents-query endpoint multiple times.) Once deepset Cloud has a batch_query endpoint, we can add a query_batch() implementation to IndexClient. What do you think?
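The fallback proposed here can be sketched as follows. This is an illustrative stand-in class, not the DeepsetCloudDocumentStore implementation, and the parameter names are assumptions: query_batch simply loops over the queries and delegates each one to the existing single-query method.

```python
from typing import List, Optional


class FallbackDocumentStore:
    """Stand-in store with no batch endpoint; query_batch loops over query."""

    def query(self, query: str, filters: Optional[dict] = None,
              top_k: int = 10) -> List[str]:
        # placeholder for one request to the documents-query endpoint
        return [f"doc for {query}"]

    def query_batch(self, queries: List[str], filters: Optional[dict] = None,
                    top_k: int = 10) -> List[List[str]]:
        # no batch endpoint yet, so call query once per query
        return [self.query(q, filters=filters, top_k=top_k) for q in queries]
```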

Member:

Yes, agreed. 👍

@julian-risch (Member) left a comment:

Looks great to me! 👍 Let's just not forget about query_batch in DCDocumentStore so that we can make use of these nice new batch processing features.

Successfully merging this pull request may close these issues.

Allow for batch querying when using Pipelines
3 participants