
Add run_batch method to all nodes and Pipeline to allow batch querying #2481

Merged (45 commits) on May 11, 2022
Conversation

@bogdankostic (Contributor) commented on May 2, 2022

This PR facilitates batch processing with query Pipelines. This is done by adding a run_batch method and a corresponding predict_batch/retrieve_batch/... method to each of the nodes. To achieve this, the transformers dependency has been upgraded to the newest version, because the currently used version does not support batch processing for transformers' question-answering pipeline.

As decided in #1239, the Pipeline's run_batch method can take a single query or a list of queries as its queries argument, and a single list of Documents or a list of lists of Documents as its documents argument. The behavior of the Pipeline depends on whether a single value or a list of values is provided. In general, the following applies:

|  | Single Query | List of Queries |
| --- | --- | --- |
| List of Docs | The single query is applied to each Document individually; answers are returned for each single Document | Each query is applied to each Document individually; answers are returned for each query-Document pair |
| List of Lists of Docs | The single query is applied to each list of Documents; answers are aggregated per list of Documents | Each query is applied to its corresponding list of Documents (matched by index in the provided lists); answers are aggregated per list of Documents |

If run_batch is called on an indexing pipeline, the pipeline's run method will be called instead.
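The dispatch rules in the table above can be sketched in plain Python. This is an illustrative simplification, not Haystack's actual implementation: `Doc` and `answer` are stand-ins for `haystack.Document` and a real node call.

```python
from typing import List, Union

Doc = str  # stand-in for haystack.Document


def answer(query: str, docs: List[Doc]) -> str:
    # placeholder for a real reader/retriever call
    return f"{query} -> {docs}"


def run_batch(queries: Union[str, List[str]],
              documents: Union[List[Doc], List[List[Doc]]]) -> List[str]:
    single_query = isinstance(queries, str)
    query_list = [queries] if single_query else queries
    single_doc_list = len(documents) > 0 and not isinstance(documents[0], list)
    results = []
    if single_doc_list:
        # each query is applied to each Document individually
        for q in query_list:
            for doc in documents:
                results.append(answer(q, [doc]))
    elif single_query:
        # the single query is applied to each list of Documents
        for doc_list in documents:
            results.append(answer(queries, doc_list))
    else:
        # queries and Document lists are matched by index
        assert len(query_list) == len(documents)
        for q, doc_list in zip(query_list, documents):
            results.append(answer(q, doc_list))
    return results
```

Each of the four table cells maps to one branch: the single-query/single-list case simply degenerates to a one-element outer loop.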

Breaking changes

The input and output of the FARMReader's predict_batch method changed. (Therefore, the output of the QuestionAnswerGenerationPipeline also changed.)

Old way

# Old input
FARMReader.predict_batch(
    query_doc_list=[
        {"question": "Some sample query", "docs": [Document(content="sample doc1"), Document(content="sample doc 2")]},
        {"question": "Some sample query", "docs": [Document(content="sample doc3"), Document(content="sample doc 4")]}
    ]
)

# Old output
[
    {"query": "Some sample query", "no_ans_gap": max_no_ans_gap, "answers": Answers from doc1 and doc2, "label": cur_label},
    {"query": "Some sample query", "no_ans_gap": max_no_ans_gap, "answers": Answers from doc3 and doc4, "label": cur_label}
]

New way

# New input
FARMReader.predict_batch(
    queries=["Some sample query", "Some sample query"],
    documents=[
        [Document(content="sample doc1"), Document(content="sample doc 2")],
        [Document(content="sample doc3"), Document(content="sample doc 4")],
    ],
)

# New output
{"queries": ["Some sample query", "Some sample query"], "answers": [[Answers from doc1 and doc2], [Answers from doc3 and doc4]], "no_ans_gaps": [[no_ans_gaps from query1], [no_ans_gaps from query2]]}

Closes #1239

bogdankostic and others added 3 commits May 2, 2022 14:46
@julian-risch (Member) left a comment:
Looks quite good to me! I just have a couple of things that we should discuss/change before merging. Here is a list (see my individual comments for details). Happy to have a call anytime.

  • batch_size should maybe be an attribute of every node, settable in each node's init method (YAML pipeline definition)
  • Maybe two of the list-of-docs/queries cases can actually be joined?
  • Make the error message more extensive: print not only the expected type but also the given type
  • Allow a different filter per query in queries?
  • truncation_warning is emitted every time predict is executed
  • multiple_doc_lists naming: more than one list? For example in batch_prediction_single_query_multiple_doc_lists
  • Use a distilled, tiny model instead of roberta-base-squad in the test case
  • The transformers upgrade would be better in a separate PR
  • Check whether the tutorials still work (QuestionGenerator could break, I think)

Review thread on haystack/nodes/answer_generator/base.py (resolved)
if headers is None:
    headers = {}

single_query = False
Member:

If the handling of the edge case with a single query makes the implementation much more complex, we could discuss whether queries must be a list in all cases. The edge case would be a list with a single item then.

Member:

As discussed, we can address this question and decide what to do in a separate PR.


# Query case 2: list of queries
elif isinstance(queries, list) and len(queries) > 0 and isinstance(queries[0], str):
    # Docs case 1: single list of Documents -> apply each query to all Documents
Member:

This case does essentially the same as "# Docs case 1: single list of Documents -> apply single query to all Documents". So if queries is a string, we could make it a list [queries] and then use the same code for both cases with the for loop for query in queries:. Merging these two cases could simplify the code a little bit. What do you think? I like the readability of the current version. 👍
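The normalization suggested here can be sketched in a few lines (an illustrative snippet, not the PR's code): a single query string is wrapped into a one-element list, so both cases share the same `for query in queries:` loop.

```python
from typing import List, Union


def normalize_queries(queries: Union[str, List[str]]) -> List[str]:
    """Treat a single query string as a one-element list of queries."""
    return [queries] if isinstance(queries, str) else queries
```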

Member:

As discussed, we can address this question and decide what to do in a separate PR.

Further review threads (resolved):
haystack/nodes/document_classifier/base.py
haystack/nodes/extractor/entity.py
test/test_summarizer.py
test/test_extractor.py (two threads)
test/test_document_classifier.py
setup.cfg

    all_terms_must_match: bool = False,
    scale_score: bool = True,
) -> Union[List[Document], List[List[Document]]]:
    raise NotImplementedError("DeepsetCloudDocumentStore currently does not support query_batch method.")
Member:

Could we maybe add this without much overhead before the next release? Is there an issue for it already? It would be a shame if we couldn't use all the batch processing in deepset Cloud because of this. What we need is a self.client.query_batch() call here and then also a query_batch() implementation in IndexClient.

Contributor (author):

I would suggest just calling query here multiple times, as there is no batch_query endpoint in deepset Cloud (yet). (In query_batch() in IndexClient, we would need to call the documents-query endpoint multiple times.) Once deepset Cloud has a batch_query endpoint, we can add a query_batch() implementation to IndexClient. What do you think?
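The fallback proposed here can be sketched as follows. This is an illustrative stand-in class, not the DeepsetCloudDocumentStore implementation, and the parameter names are assumptions: query_batch simply loops over the queries and delegates each one to the existing single-query method.

```python
from typing import List, Optional


class FallbackDocumentStore:
    """Stand-in store with no batch endpoint; query_batch loops over query."""

    def query(self, query: str, filters: Optional[dict] = None,
              top_k: int = 10) -> List[str]:
        # placeholder for one request to the documents-query endpoint
        return [f"doc for {query}"]

    def query_batch(self, queries: List[str], filters: Optional[dict] = None,
                    top_k: int = 10) -> List[List[str]]:
        # no batch endpoint yet, so call query once per query
        return [self.query(q, filters=filters, top_k=top_k) for q in queries]
```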

Member:

Yes, agreed. 👍

@julian-risch (Member) left a comment:

Looks great to me! 👍 Let's just not forget about query_batch in DCDocumentStore so that we can make use of these nice new batch processing features.

Successfully merging this pull request may close these issues.

Allow for batch querying when using Pipelines
3 participants