
Fix dependency graph for indexing pipelines during codegen #2311

Merged
merged 20 commits into master on Mar 17, 2022

Conversation

tstadel
Member

@tstadel tstadel commented Mar 15, 2022

This fixes a NetworkXUnfeasible exception raised during topological_sort of the component-construction code lines in generate_code() when the retriever is needed during indexing:
Graph contains a cycle or graph changed during iteration

Sample YAML to reproduce:

version: 'unstable'
components:    # define all the building-blocks for Pipeline
- name: DocumentStore
  type: DeepsetCloudDocumentStore
- name: Retriever
  type: ElasticsearchRetriever
  params:
    document_store: DocumentStore    # params can reference other components defined in the YAML
- name: TextFileConverter
  type: TextConverter
- name: Preprocessor
  type: PreProcessor

pipelines:
- name: indexing
  nodes:
    - name: TextFileConverter
      inputs: [File]
    - name: Preprocessor
      inputs: [TextFileConverter]
    - name: Retriever
      inputs: [Preprocessor]
    - name: DocumentStore
      inputs: [Retriever]

Retriever depends on DocumentStore during initialization.
DocumentStore depends on Retriever's output during pipeline execution.
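To see the unfeasible graph concretely, here is a minimal sketch using Python's stdlib graphlib as a stand-in for the graph built in generate_code() (node names mirror the YAML above; this is an illustration, not the actual Haystack code):

```python
from graphlib import TopologicalSorter, CycleError

# Dependencies as "node -> set of nodes that must be created first".
deps = {
    "Retriever": {"DocumentStore"},   # hard init-time dependency (constructor param)
    "DocumentStore": {"Retriever"},   # weak runtime dependency (indexing output)
}

try:
    list(TopologicalSorter(deps).static_order())
except CycleError:
    print("cycle detected")  # both edges together make the sort unfeasible

# The fix: drop the weak runtime edge because a hard dependency already exists.
deps["DocumentStore"].discard("Retriever")
print(list(TopologicalSorter(deps).static_order()))  # ['DocumentStore', 'Retriever']
```

With the weak edge removed, the sort yields a working creation order: the store is instantiated before the retriever that needs it.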

While fixing that, some other bugs in Pipeline.get_config() surfaced:

  • it didn't reuse components if they were the same instance
  • default values were not properly added if return_defaults was True (due to the new paradigm of _component_config, which stores only the given params)

This is also fixed in this PR.
I also took the opportunity to refactor Pipeline.get_config() so that dependent components beyond one level of dependency are possible (i.e., Pipeline.get_config() is now fully recursive).
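The recursive idea can be sketched like this (hypothetical Component class and helper, not the actual Haystack implementation): each component contributes a definition, recurses into the components it was constructed with, and reuses an existing definition when the same instance shows up again:

```python
class Component:
    """Minimal stand-in for BaseComponent."""
    def __init__(self, name, type_, params=None):
        self.name, self.type, self.params = name, type_, params or {}

def add_to_definitions(component, definitions, seen):
    """Recursively collect component definitions; shared instances are defined once."""
    if id(component) in seen:                 # same instance -> reuse its definition
        return seen[id(component)]
    seen[id(component)] = component.name
    params = {}
    for key, value in component.params.items():
        if isinstance(value, Component):      # dependent component -> recurse
            params[key] = add_to_definitions(value, definitions, seen)
        else:
            params[key] = value
    definitions[component.name] = {"type": component.type, "params": params}
    return component.name

store = Component("DocumentStore", "DeepsetCloudDocumentStore")
retriever = Component("Retriever", "ElasticsearchRetriever", {"document_store": store})
definitions = {}
add_to_definitions(retriever, definitions, seen={})
print(sorted(definitions))  # ['DocumentStore', 'Retriever']
```

Because the recursion bottoms out at non-component params and the seen map short-circuits shared instances, arbitrary dependency depth works without duplicating definitions.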

Proposed changes:

  • neglect weak node dependencies if there is already a hard component dependency (the weak ones exist only to improve the order of component creation, which makes the generated code more understandable for complex pipelines)
  • fix bugs in get_config:
    • fix return_defaults=True option
    • reuse dependent components if they are the same instance
    • make get_config() fully recursive in terms of dependent components

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests

@tstadel
Member Author

tstadel commented Mar 15, 2022

Test to be added.

@tstadel
Copy link
Member Author

tstadel commented Mar 15, 2022

Test added.

Member

@julian-risch julian-risch left a comment


LGTM! Very nice! 👌 Quite complex functionality though. Great to see you added several tests.
@ZanSara I just tagged you as a reviewer because you said you are interested in this PR. In that case, it makes sense that you give feedback before merging the PR. Otherwise, it's ready to be merged.

haystack/pipelines/base.py
# e.g. DensePassageRetriever depends on ElasticsearchDocumentStore.
# In indexing pipelines ElasticsearchDocumentStore depends on DensePassageRetriever's output.
# But this second dependency is looser, so we neglect it.
if not graph.has_edge(node_name, input):
Member

Just a check whether I understand this correctly: as an example, in an indexing pipeline, ElasticsearchDocumentStore could be the node and DensePassageRetriever could be the input. Then we check here whether there is an edge in the graph from ElasticsearchDocumentStore to DensePassageRetriever and, if that's not the case, we add an edge in the opposite direction, correct?

Member Author

Yes, correct.

Contributor

Just a curiosity: why do we need to know the runtime dependencies between the nodes at this stage? I thought this function is used in the codegen code to create the nodes in working order (so that nodes required as input are initialized first). The fact that later on in the pipeline a node will get output from another should not matter... right?

Member Author

@tstadel tstadel Mar 17, 2022

Yes, right. I chose to include those runtime dependencies to make the order of init calls more readable. With them, the code follows pipeline execution order (like a human would normally instantiate the components); without them, it would be in an arbitrary order that is less comprehensible for humans. But that's the only reason.

Contributor

@ZanSara ZanSara left a comment

Cool one! I've managed to comment only on BaseComponent for now, I'll read the new get_config() in a few hours. No worries, only my typical nitpicks 😄

haystack/nodes/base.py
self._component_config["name"] = value

@property
def sub_components(self) -> List[BaseComponent]:
Contributor

The name might be slightly confusing... This is the list of nodes this component depends on, right? We might call it required_components or something else that highlights the dependency relationship.

Member Author

I'd simply call it dependencies. What do you think about it?

Member Author

I'd rather go with utilized_components. At least that makes the most sense to me ;-)

return [param for param in self._component_config["params"].values() if isinstance(param, BaseComponent)]

@property
def type(self) -> str:
Contributor

Technically this property is redundant; type(node) would give you the same value, wouldn't it?

I could not easily get type out of the _component_config dict because it required deep refactoring of get_config, but now that we're at it, it might be worth trying.

Member Author

No, this type property returns a string: basically type(node).__name__.
I also thought about moving name and type out of _component_config. But on the other hand, with it we have the complete config in one place. I like that idea but I'm open to discussing this.
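For illustration (a sketch of the assumed behavior, not the exact Haystack code): the property yields the class name as a string, which is the same value the builtin gives via type(node).__name__:

```python
class BaseComponent:
    def __init__(self):
        # Store the class name in the config dict, as described above.
        self._component_config = {"type": type(self).__name__, "params": {}}

    @property
    def type(self) -> str:
        """Class name of this component as a string."""
        return self._component_config["type"]

class TextConverter(BaseComponent):
    pass

node = TextConverter()
print(node.type)            # TextConverter  (a string, not a class object)
print(type(node).__name__)  # TextConverter  (same string via the builtin)
```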

Comment on lines +93 to +103
def get_params(self, return_defaults: bool = False) -> Dict[str, Any]:
component_signature = self._get_signature()
params: Dict[str, Any] = {}
for key, value in self._component_config["params"].items():
if value != component_signature[key].default or return_defaults:
params[key] = value
if return_defaults:
for key, param in component_signature.items():
if key not in params:
params[key] = param.default
return params
Contributor

Seems strangely convoluted. I think the following would work the same:

def get_params(self, return_defaults: bool = False) -> Dict[str, Any]:
    explicit_params = deepcopy(self._component_config["params"])

    if not return_defaults:
        return explicit_params

    default_params = {key: param.default for key, param in self._get_signature().items()}
    return {**default_params, **explicit_params}

(untested)

Member Author

This version would not account for default values within explicit_params. I think we want to suppress them too if return_defaults=False. But I can definitely get rid of the second loop.
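A sketch of the version without the second loop, still suppressing explicitly-given values that merely repeat the default when return_defaults=False (a standalone toy: the signature is passed in rather than read from _get_signature(), and the example function is hypothetical):

```python
import inspect
from typing import Any, Dict

def get_params(given: Dict[str, Any],
               signature: Dict[str, inspect.Parameter],
               return_defaults: bool = False) -> Dict[str, Any]:
    if return_defaults:
        # All defaults first, overridden by whatever was explicitly given.
        defaults = {key: param.default for key, param in signature.items()}
        return {**defaults, **given}
    # Suppress explicitly-given values that equal the default.
    return {key: value for key, value in given.items()
            if value != signature[key].default}

def example(top_k: int = 10, index: str = "document"):
    pass

sig = dict(inspect.signature(example).parameters)
print(get_params({"top_k": 10, "index": "docs"}, sig))  # {'index': 'docs'}
print(get_params({}, sig, return_defaults=True))        # {'top_k': 10, 'index': 'document'}
```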

Contributor

I see what you mean, but on the other hand I would not necessarily expect this method to filter out default values if they were explicitly given. Keeping them would also make sure that if I load a YAML and save it back, it comes out exactly as I loaded it, no differences, regardless of whether I gave the default parameters or not.

Not a big deal though, feel free to keep the current behavior 👍

Member Author

I left it as is for now, since even with all default_params in one variable we would not gain much understandability or compactness.

@@ -207,3 +235,13 @@ def _dispatch_run(self, **kwargs) -> Tuple[Dict, str]:

output["params"] = params
return output, stream

@classmethod
def _get_signature(cls) -> Dict[str, inspect.Parameter]:
Contributor

Just FYI: there's another very similar method in _json_schema.py:

def get_typed_signature(call: Callable[..., Any]) -> inspect.Signature:

It's interesting that you seem to have taken a completely different approach to the problem... I don't think we need any action here though
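For reference, such a classmethod could be sketched with inspect roughly like this (an assumption about its shape; the actual implementation and the Reader example class are illustrative, not the real Haystack code):

```python
import inspect
from typing import Dict

class BaseComponent:
    @classmethod
    def _get_signature(cls) -> Dict[str, inspect.Parameter]:
        """Map __init__ parameter names to inspect.Parameter, skipping self/*args/**kwargs."""
        return {
            name: param
            for name, param in inspect.signature(cls.__init__).parameters.items()
            if param.kind not in (inspect.Parameter.VAR_POSITIONAL,
                                  inspect.Parameter.VAR_KEYWORD)
            and name != "self"
        }

class Reader(BaseComponent):
    def __init__(self, model: str = "roberta", top_k: int = 5):
        self.model, self.top_k = model, top_k

sig = Reader._get_signature()
print(sorted(sig))           # ['model', 'top_k']
print(sig["top_k"].default)  # 5
```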

Contributor

@ZanSara ZanSara left a comment

Very happy about this one. I found only one small possibility for a bug but overall good job! 🙂

if component_signature[param_key].default != param_value or return_defaults is True:
components[node]["params"][param_key] = param_value
component: BaseComponent = node_attributes["component"]
if node_name != component.name:
Contributor

Just curious really, what could lead to this situation? Manual edit of _components_config?

Member Author

Yep, Pipeline's add_node sets all names appropriately.

haystack/pipelines/base.py
component_names.update(sub_component_names)
return component_names

def _set_sub_component_names(self, component: BaseComponent, component_names: Optional[Set[str]] = None):
Contributor

Just thinking out loud here. I see that this method recursively names all the "sub" components. But we're already going down recursively with _add_component_to_definitions. So how about naming any unnamed component in _add_component_to_definitions as soon as we meet it? That would simply mean replacing PipelineError(f"Component with config '{component._component_config}' does not have a name.") with a function that names the component.

That would also remove the need for _get_all_component_names (which in turn is recursive, so yet another layer of recursion removed).

Member Author

@tstadel tstadel Mar 17, 2022

Unfortunately, during _add_component_to_definitions() we do not yet have all the component names assigned within that pipeline. That's a requirement for generating names without ending up with duplicates, so _get_all_component_names would be required either way.
I also thought about assigning (virtual) names during get_config() and thus in _add_component_to_definitions(). However, it seemed cleaner to me that all names are already assigned when the component is added to the pipeline, because that already happens for non-sub-components.

intermediate = ParentComponent(dependent=child)
parent = ParentComponent(dependent=intermediate)
p_ensemble = Pipeline()
p_ensemble.add_node(component=parent, name="Parent", inputs=["Query"])
Contributor

Ok now, what happens if I call this node ParentComponent? Or if I explicitly give two nodes the same name? I didn't see a check for this case anywhere.

Member Author

Good catch. There was actually a bug here: the name of the node being added wasn't known to the graph when running _get_all_component_names(). Fixed that.
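A hypothetical sketch of such a uniqueness guard (an illustrative helper, not the code from this PR): collect the names already present, then suffix the requested name on collision, registering the new node's name immediately so later lookups see it:

```python
def make_unique_name(existing_names: set, requested: str) -> str:
    """Return requested if free, otherwise append the first free numeric suffix."""
    name, counter = requested, 2
    while name in existing_names:
        name = f"{requested}_{counter}"
        counter += 1
    # Register immediately so the next add_node call already sees this name.
    existing_names.add(name)
    return name

names: set = set()
print(make_unique_name(names, "ParentComponent"))  # ParentComponent
print(make_unique_name(names, "ParentComponent"))  # ParentComponent_2
```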

@tstadel tstadel merged commit 8f7dd13 into master Mar 17, 2022
@tstadel tstadel deleted the fix_indexing_codegen branch March 17, 2022 21:03
Labels
topic:pipeline, type:bug
3 participants