Return intermediate nodes output in pipelines #1558

Merged
merged 40 commits into from
Oct 7, 2021

@ZanSara (Contributor) commented Oct 5, 2021

Related to #1193

Proposed changes:
These changes make nodes capable of recording some debug information during execution. This is accomplished by managing one extra key in the output dictionary, called _debug.

By default, the data collected includes the input, the output and the logs produced by nodes (all or some of the nodes, depending on the configuration). However, a node can choose to add its own debug information under _debug, and such information will be preserved. The content of each node's _debug entry will be in the final response (grouped by producer, see sample output below).

Note that the content of _debug is generally passed from node to node, but to avoid infinite recursion it is removed from the output that is stored in the _debug key itself (see the example output below).
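The recursion issue can be illustrated with a minimal sketch. The helper name `record_debug` and its signature are hypothetical and not part of this PR; the sketch only assumes that node inputs and outputs are plain dictionaries carrying an optional `_debug` key.

```python
from typing import Any, Dict


def record_debug(
    node_name: str, node_input: Dict[str, Any], node_output: Dict[str, Any]
) -> Dict[str, Any]:
    """Hypothetical helper: snapshot a node's input/output under `_debug`.

    The snapshots never contain `_debug` themselves, so the structure
    cannot nest inside itself.
    """
    # Debug info accumulated so far (from upstream nodes or from this node).
    debug = dict(node_output.get("_debug", {}))

    # Strip `_debug` from the copies that get stored inside `_debug`.
    clean_input = {k: v for k, v in node_input.items() if k != "_debug"}
    clean_output = {k: v for k, v in node_output.items() if k != "_debug"}

    # Preserve anything the node already stored under its own name.
    entry = dict(debug.get(node_name, {}))
    entry.update({"input": clean_input, "output": clean_output})
    debug[node_name] = entry

    # `_debug` keeps travelling with the regular output to the next node.
    return {**clean_output, "_debug": debug}
```

Without the stripping step, each node's stored output would contain the `_debug` of all previous nodes, which in turn contains their outputs, and so on.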

Details:

  • Modifies BaseComponent.run() to make every node accept debug and debug_logs as parameters, and if detected, saves them in the instance state as attributes. This will enable users to set these values through pipeline.run(params={'node_name':{'debug': True}}).

  • Modifies pipeline.run() to accept debug and debug_logs as arguments and to apply them to each node's parameters, overriding whatever was set for those keys in params (see the example below).

  • Modifies BaseComponent._dispatch_run() to deal properly with the _debug key's content.

  • For log collection, introduces an implicit decorator on BaseComponent.run(). If the decorator finds the attribute debug set to True in the state of the current object, it records the debug logs produced while that node executes and pushes them to the node's _debug entry. These logs are also printed to the console if debug_logs is also defined and set to True.
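The log-collection idea in the last point can be sketched as follows. This is a simplified illustration, not the code in this PR: `ListHandler`, `capture_debug_logs` and `ToyNode` are hypothetical names, and the sketch stores the logs directly under `_debug["logs"]` rather than grouping them by node name.

```python
import functools
import logging
from typing import Any, Callable, Dict, List


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted log messages in a list."""

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)
        self.records: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(self.format(record))


def capture_debug_logs(run: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Wrap a node's run() so that its logs end up in the output's `_debug`."""

    @functools.wraps(run)
    def wrapper(self, *args, **kwargs):
        # Do nothing special unless the instance has `debug` set to True.
        if not getattr(self, "debug", False):
            return run(self, *args, **kwargs)

        root = logging.getLogger()
        handler = ListHandler()
        previous_level = root.level
        root.addHandler(handler)
        root.setLevel(logging.DEBUG)
        try:
            output = run(self, *args, **kwargs)
        finally:
            # Always restore the logging configuration.
            root.removeHandler(handler)
            root.setLevel(previous_level)

        # Optionally echo the captured logs to the console.
        if getattr(self, "debug_logs", False):
            for line in handler.records:
                print(line)

        output.setdefault("_debug", {})["logs"] = handler.records
        return output

    return wrapper


class ToyNode:
    """Stand-in for a pipeline node; not a real Haystack component."""

    def __init__(self, debug: bool = False, debug_logs: bool = False) -> None:
        self.debug = debug
        self.debug_logs = debug_logs

    @capture_debug_logs
    def run(self, query: str) -> Dict[str, Any]:
        logging.getLogger(__name__).debug("Retriever query: %s", query)
        return {"documents": []}


result = ToyNode(debug=True).run("Who lives in Berlin?")
```

Because the wrapper reads `debug` and `debug_logs` from the instance, the wrapped `run()` does not need to know about either flag, which matches the note in the example that subclasses do not have to support them explicitly.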

Example code:

from haystack.document_store import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.retriever.dense import DensePassageRetriever
from haystack.reader import FARMReader
from haystack.pipeline import Pipeline, JoinDocuments

def main():

    document_store_with_docs = ElasticsearchDocumentStore()
    es_retriever = ElasticsearchRetriever(document_store=document_store_with_docs)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

    pipeline = Pipeline()
    pipeline.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
    pipeline.add_node(component=reader, name="Reader", inputs=["ESRetriever"])

    prediction = pipeline.run(
        query="Who lives in Berlin?", 
        params={
            # New API: `debug` and `debug_logs` can be passed to single nodes as parameters
            # Note that subclasses of `BaseComponent` don't need to explicitly support them for this to work
            "ESRetriever": {"top_k": 10, "debug": True, "debug_logs": True},
            "Reader": {"top_k": 3}
        },
        # New API: the debug parameters can also be passed to `run()` directly
        # They will override any node-specific setting
        debug=True,
        debug_logs=True,
    )

    # Note: dumping to JSON helps detect circular references
    # (`json.dumps` raises on them, while `pprint` tolerates them)
    print("############### DEBUG LOGS #####################")
    import json
    print(json.dumps(prediction, default=str, indent=4), "\n")
    print("################################################")

if __name__ == "__main__":
    main()

Example output:

{
    "answers": [],
    "_debug": {
        "ESRetriever": {
            "logs": [
                "Retriever query: {'size': '10', 'query': {'bool...",
                "POST http://localhost:9200/document/_se...",
                "> {\"size\":\"10\",\"query\":{\"bool\":{\"shou...",
                "< {\"took\":3,\"timed_out\":false,\"_shards\"...",
                "Retrieved documents with IDs: [67341323..."
            ],
            "input": {
                "root_node": "Query",
                "query": "Who lives in Berlin?",
                "top_k": 10,
                "debug": true
            },
            "output": {
                "documents": []
            }
        },
        "Reader": {
            "logs": [],
            "input": {
                "documents": [],
                "query": "Who lives in Berlin?",
                "top_k": 3,
                "debug": true
            },
            "output": {
                "answers": []
            }
        }
    },
    "documents": [],
    "root_node": "Query",
    "params": {
        "ESRetriever": {
            "top_k": 10
        },
        "Reader": {
            "top_k": 3
        },
        "Query": {
            "debug": true
        }
    },
    "query": "Who lives in Berlin?",
    "node_id": "Reader"
} 
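Since the response groups debug data by the node that produced it, it can be inspected with ordinary dictionary access. The snippet below is a small sketch: `summarize_debug` is a hypothetical helper, and the `response` literal is an abbreviated copy of the example output above, not real pipeline output.

```python
from typing import Any, Dict


def summarize_debug(response: Dict[str, Any]) -> Dict[str, int]:
    """Return the number of recorded log lines per node."""
    return {
        name: len(entry.get("logs", []))
        for name, entry in response.get("_debug", {}).items()
    }


# Abbreviated version of the example output above.
response = {
    "answers": [],
    "_debug": {
        "ESRetriever": {
            "logs": ["Retriever query: ...", "POST http://localhost:9200/..."],
            "input": {"root_node": "Query", "query": "Who lives in Berlin?", "top_k": 10},
            "output": {"documents": []},
        },
        "Reader": {
            "logs": [],
            "input": {"documents": [], "query": "Who lives in Berlin?", "top_k": 3},
            "output": {"answers": []},
        },
    },
}

for node_name, entry in response["_debug"].items():
    # Each entry holds the node's logs, input and output.
    print(node_name, "received keys:", sorted(entry["input"]))

print(summarize_debug(response))
```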

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@ZanSara ZanSara linked an issue Oct 5, 2021 that may be closed by this pull request
@ZanSara ZanSara self-assigned this Oct 5, 2021
haystack/pipeline.py (outdated, resolved)
@ZanSara ZanSara changed the title WIP Return intermediate nodes output in pipelines Return intermediate nodes output in pipelines Oct 6, 2021
@tholor (Member) left a comment

Already looking very good. Left three minor comments.

haystack/pipeline.py (resolved)
haystack/schema.py (resolved)
haystack/__init__.py (resolved)
@tholor (Member) left a comment

LGTM

@ZanSara (Contributor, Author) commented Oct 7, 2021

Hey, I'll put InMemoryLogger back in schema.py for now, because in utils.py it causes a circular import issue :/

Successfully merging this pull request may close these issues.

Return Intermediate Node Output
3 participants