Ensure that `ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs #5801

sphuber · 2022-11-27T08:00:23Z

~~Note that this requires a new version of plumpy to fix a critical bug.~~ The updated version of plumpy has been released and the requirements have been updated.

This essentially makes it possible to call get_builder_restart for any completed ProcessNode and actually get a builder that perfectly matched the inputs that were used for the creation of the node. Up till now this was not possible, especially not for WorkChains. This is the first step to making it possible and easy to, given a provenance graph, rerun any part of it to reproduce it. The one and only downside of this is that part of the input information is duplicated in the database. I feel that this duplication is worth the gain though.

sphuber · 2022-11-29T22:22:32Z

@zhubonan you might be interested in this. Would be good to have your feedback if you are still using AiiDA.

zhubonan · 2022-11-30T14:14:19Z

Thanks this lookds great to me! This is a functionality that I have being waiting for a long time. I can see the get_builder_restart method is removed - I guess this is moved to the plumpy upstream?

Just one comment - in the rase case that the user supplies some inputs that are not json serializable, would it except immediately?

As for storing duplicated data, I think it is not a problem because before this the only way to make get_builder_restart works for workchains is to store the options keys for the calc jobs in a Dict node (this is the case in aiida-vasp). Which also stores duplicated data in the database. Even with this the user still have to set the other keys of metadata such as metadata.label which makes doing restarts cumbersome.

sphuber · 2022-11-30T14:32:49Z

Thanks this lookds great to me! This is a functionality that I have being waiting for a long time. I can see the get_builder_restart method is removed - I guess this is moved to the plumpy upstream?

No, it is still there, it is defined on the Process base class. What is removed in this PR was the method on the CalcJob class, which overrode the method of the Process parent in order to add the metadata options. But this change is now superfluous as this is now taken care off by Process itself for all process classes.

Just one comment - in the rase case that the user supplies some inputs that are not json serializable, would it except immediately?

Yes, if the port is also non_db then now it will except. I thought that this use case would probably not exist since I think this would anyway fail, for at least WorkChains, as when the checkpoint is saved, the entire process instance (including inputs) is YAML serialized. However, there may be edge-cases for inputs that are YAML-serializable but not JSON-serializable, or this could be done for non-workchain classes, such as CalcJobs. So maybe it is best to change this behavior to emitting a UserWarning instead and skipping the input from storing?

zhubonan · 2022-11-30T14:38:23Z

Thanks, I see!

So maybe it is best to change this behavior to emitting a UserWarning instead and skipping the input from storing?

Yes it would be useful to emitt a warning I think.

sphuber · 2022-12-05T20:54:54Z

@ltalirz maybe you are interested in reviewing this? This is the first step required to a PoC I am working on that allows to rerun any workflow run in AiiDA and perfectly reproduce it. The second step is to support Docker containers (which I also have running) and final step is a function that, given a completed workflow node, can recreate the original builder exactly and relaunch it, automatically creating all the external codes necessary and rerunning in docker containers. As you have said before, this would be a very powerful demonstrator, and after this we are almost there.

ltalirz

Thanks a lot @sphuber, very nice to see that we are now on the road towards this project!

I guess one comment I would have is that, after this change, the AiiDA API is effectively lying to the user, right? The user explicitly requests an input to be non_db, but AiiDA simply ignores this and stores the information in the database nevertheless.

If there was a reason for the user not to store this information in the database (e.g. sensitive / size / ...), this could be problematic.

It might be worth discussing this at the AiiDA meeting, i.e. including both the original rationale behind making some fields non_db, and what a future name for this field could be.

aiida/engine/processes/process.py

ltalirz · 2022-12-05T21:15:26Z

aiida/engine/processes/process.py

+
+            result[key] = non_db_value
+
+        return result


You could also

Suggested change

return result

return result or None

here, unless you want the distinction between {} and None

This won't work, since None is technically a value and so I could get a return value:

{ 'sub': { { 'namespace': { 'a': None } } }

And now the prune call to prune empty mappings will no longer work since technically the sub.namespace.a namespace is no longer empty.

Scratch that, this case is handled because None values are automatically handled in the same recursive loop and ignored. So I think your suggestion also works, but effectively there is no change in behavior.

sphuber · 2022-12-05T21:36:02Z

It might be worth discussing this at the AiiDA meeting, i.e. including both the original rationale behind making some fields non_db, and what a future name for this field could be.

The original rationale of non_db was purely an internal one. When we wrote the new process-style interface for v1.0, we needed to deal somehow with inputs that were not nodes. Notably, these were "inputs" that were set on the process node attributes, all of which are essentially part of the metadata namespace, such as label, description and the options of the CalcJob process. It was never really intended to be used to actually circumvent provenance storage, but I can see how maybe some users may have ended up using it. There is a team meeting tomorrow morning and I will bring this up to see if people think we need a deprecation pathway for this. It would be quite cumbersome to implement I think. Do you have any indications of people using non_db to pass sensitive data as inputs and preventing it from being stored in the database?

ltalirz · 2022-12-05T21:49:03Z

Ok, I see. I just browsed the changes quickly and it looked like the storing of this information was new, but I guess this applies only to the non_db ports outside the metadata.
Do those two need to be handled separately?

From your explanation, I guess the meaning is closer to orm=False than non_db=True.
I don't know of the non_db field being used to pass sensitive information, but if the behavior changes we may need to at least update the documentation.

sphuber · 2022-12-05T21:56:28Z

I just browsed the changes quickly and it looked like the storing of this information was new, but I guess this applies only to the non_db ports outside the metadata.

Indeed, for the metadata inputs (i.e. those defined by aiida-core base classes), the data was anyway already being stored. It was just being done manually in the Process and CalcJob class.

That being said, if a user defined a custom process with an input port designated as non_db, that will now be stored (as long as it is JSON-serializable). So if they used this functionality with the exact purposes to not have it stored in any way, that would definitely break that. I always thought that people would mostly just use this functionality (if at all) to get around the requirement that all inputs have to be Node instances. For example, I have seen people use this to pass Group instances as inputs to a WorkChain. This works, but is anyway something that should be discouraged because it is breaking provenance and leads to unpredictable results.

Do those two need to be handled separately?

Kind of yes, because they are not all stored in the same way. For example, the label and description just happen to be stored in a dedicated column on the Node database model, whereas most of the others are just stored in the attributes. Additional complication is that the metadata.options are not stored in their "namespace" but are stored flat. So for example the metadata.options.queue option is simply stored in the attributes directly under queue. Another difference is that currently not only those options that are explicitly defined are stored in the attributes, but also the values provided by the port defaults. This is not the case for what is stored in this PR: there I only store what is explicitly passed by the caller, i.e., the "raw" inputs.

So although it may be possible in principle to generalize all this, this would require quite some discussions, data migrations and other laborious stuff that may not be worth it.

I don't know of the non_db field being used to pass sensitive information, but if the behavior changes we may need to at least update the documentation.

Absolutely, that is a given.

sphuber · 2022-12-13T09:25:58Z

@ltalirz as promised I discussed with the team and we agree with you that it is best to not change the behavior of non_db. So instead, we define a new keyword metadata which takes on the originally intended role of non_db: marking a port whose value is not stored as Data and linked to the process node, but is stored on the process node directly, somehow. The keyword was chosen because it plays nicely with the fact that all the originally non_db ports that need to be changed to metadata port are also directly part of the metadata input namespace. non_db ports specified in plugins will continue working as before, i.e., their values are not permanently stored.

The only question remaining is that I haven't documented the new metadata keyword, but that is because it is not clear whether it should be. It is not clear whether we should allow or encourage plugins to add metadata inputs to processes and have them automatically stored in the attributes or something like that. For the time being, I would not do this as it can always be added later once we have had more time to see if this is an important valid use-case.

ltalirz

Thanks @sphuber !

That seems like a good solution to me.

I understand that non_db is not deprecated for the moment and the idea is to have both non_db and metadata supported for the time being? (fine with me)

aiida/engine/processes/ports.py

ltalirz · 2022-12-13T10:40:39Z

aiida/engine/processes/ports.py

+        return self._explicitly_set
+
+    @property
+    def metadata(self) -> bool:


The name was discussed, and the preference was for metadata over is_metadata, correct?

We (@chrisjsewell and @giovannipizzi ) agreed on metadata but I have to say we didn't consider a lot of other options. is_metadata wasn't brought up for example. Happy to think about the naming if others think this is also a better option.

Brought it up during today's meeting and we agree that is_metadata is a better name. Have updated the code and commits.

aiida/engine/processes/process.py

ltalirz · 2022-12-13T10:48:45Z

aiida/engine/processes/process.py

+        # Store JSON-serializable values of ``netadata`` ports in the node's attributes. Note that instead of passing in
+        # the ``metadata`` inputs directly, the entire namespace of raw inputs is passed. The reason is that although
+        # currently in ``aiida-core`` all input ports that set ``metadata=True`` in the port specification are located
+        # within the ``metadata`` port namespace, this may not always be the case. The ``_filter_serializable_metadata``


you might want to mention how you plan to use this rather than just that it could be used.
(or just remember to update this comment when you add your next PR)

This essentially pertains to my last comment regarding the last remaining open question. Should we allow other plugins to add metadata ports and have them store data in the process node's attributes? If so, then this is necessary. If we don't, then we may still want to keep this although it can be done differently. I am not sure what the answer to the question is though

tests/engine/test_process.py

The `_recursive_merge` method could raise a `KeyError` for a process builder that contains a dynamic port namespace. In this case it would be possible to merge in a value that contains a nested dictionay which would try to access directly `dictionary[key]`, where `dictionary` is the namespace of the builder, and so hit a `KeyError` since this nested namespace didn't exist yet in the dynamical namespace. The solution is to explicitly check if the `key` already exists in the builder's namespace, and if not, we assig the entire `value` to it, as there is no nothing to recurse into.

The method is a generic utility to operate on mappings and is not specific to the `ProcessBuilder`. It needs to be used elsewhere soon so therefore it is moved to a standalone function in a new utility module `aiida.engine.processes.utils`.

In the engine redesign from `v0.x` to `v1.0` the process interface needed a way to distinguish inputs that are lined up to the process node as nodes themselves and those that are stored directly on the process node, for example as an attribute. To this end the `non_db` keyword was added to the `InputPort`. This keyword was used for the `metadata` input namespace, to designate that these inputs would not be `Data` instances. The name is quite unfortunate, however, since the inputs of these ports actually do get stored in the database, contrary to what the keyword suggests. The most straightforward solution would be to rename the keyword, however, for better or worse, the keyword has been adopted by (a limited amount of) plugin packages. A known application is to pass in `Group` instances which are not storable as nodes, but can be passed through a `non_db` port. Hypothetically, users may have used the port as well to pass in sensitive data that should never be stored. Renaming the port or changing its behavior is therefore likely to break existing code. Instead, the `is_metadata` keyword is added to the `InputPort` and `PortNamespace` through the `WithMetadata` mixin. When set to `True` this keyword functions as the original intention of the `non_db` flag and an `is_metadata` port signals that its value will be stored in the database but directly on the `ProcessNode` instead of being linked as a `Data` node. The naming makes sense, because the only use of this keyword is by the `metadata` input namespace that all `Process` classes have. The inputs in this namespace are stored, through custom logic in the `Process` and `CalcJob` class, in the attributes of the node or dedicated columns of the node database model, such as the label and the description. This addition leaves the behavior of `non_db` inputs unchanged and so plugin packages that use this keyword should continue to function as before.

The promise of AiiDA's provenance is that all inputs to a `Process` are stored as nodes in the provenance graph linked to a node representing the process. From the very early beginning, however, there needed to be exceptions of inputs to processes that were not nodes. Notable examples were the various "options" set for calculation jobs. Even in AiiDA v0.x these "settings" as they were called back then, were stored in the attributes of the node. In the AiiDA v1.0 redesign, where all inputs of a process are defined through the process spec, these non-node inputs were implemented by allowing input ports to be made non-database storable, indicated by the `non-db` argument in the port declaration. These inputs would be passed to the `Process` instance and would be available during its lifetime, but would not be stored in the database. Once again there are exceptions as certain inputs defined by `aiida-core` are stored on the node, but in various places. Notable examples are the `label` and `description` of the process, and the `metadata.options` of the `CalcJob` class. This historical decision has as a direct result in that it is difficult if not impossible in certain cases to reconstruct the exact input dictionary that was used to run a `Process` from the data stored on the `ProcessNode`. From a provenance point of view, this is a huge weakpoint and is what is being corrected here. The input ports marked `non_db=True` on the base process classes provided by `aiida-core`, `Process` and `CalcJob` were changed to use the new `is_metadata` keyword instead in the previous commit, to remove the inconsistency between the naming and behavior. In this commit, all inputs that correspond to `is_metadata` ports and are JSON-serializable are stored in the attributes of the process node under the key `metadata_inputs`. All `is_metadata` input ports that are defined on process base classes by `aiida-core`, such as `Process` and `CalcJob` *are* JSON serializable. However, plugin packages can implement process classes with ports that accept inputs that are not JSON serializable, which is why this additional condition has to be added. But all inputs defined by `aiida-core` should be covered.

The `get_builder_restart` method on the `ProcessNode` base class would return a `ProcessBuilder` with the inputs set to those attached to that node instance. The `CalcJobNode` would override this to add the metadata options as well. This would be ok for `CalcJobNode`s, but if a restart builder was created from a `WorkChainNode` that calls a `CalcJobNode` it would have also received options, but those would not be restored. One could think that when calling `get_restart_builder` on a `WorkChainNode` that we can just go down the callstack, find all the `CalcJobNode`s and set the options in the respective input namespaces. But this would not exactly reproduce the original inputs, as the options that a calculation job has received could have been a modified version of the original options passed to the workchain, changed in the logic of the workchain. Instead, now that the exact metadata inputs for each process are stored in the attribute of the node, added in the previous commit, it is this dictionary that is used to restore the exact original inputs. It not only addresses the problem of incorrect `CalcJob` options, but it also restores any other metadata inputs such as the `label` and `description`.

ltalirz

thanks @sphuber ! no further comments from my side

sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 3 times, most recently from 4ae09fc to 7a730b4 Compare November 29, 2022 16:01

sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 2 times, most recently from 2f6277e to 571499b Compare December 5, 2022 16:03

ltalirz self-requested a review December 5, 2022 21:19

ltalirz reviewed Dec 5, 2022

View reviewed changes

sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 3 times, most recently from 248f9c5 to fd31f2b Compare December 12, 2022 10:21

sphuber requested a review from ltalirz December 13, 2022 09:26

sphuber force-pushed the feature/4089/process-store-non-db-inputs branch from fd31f2b to 38c9bbe Compare December 13, 2022 09:28

ltalirz reviewed Dec 13, 2022

View reviewed changes

sphuber added 2 commits December 14, 2022 12:48

sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 2 times, most recently from 8c18f86 to 59ebc13 Compare December 14, 2022 11:56

sphuber added 3 commits December 14, 2022 13:07

sphuber force-pushed the feature/4089/process-store-non-db-inputs branch from 59ebc13 to d46d77b Compare December 14, 2022 12:14

ltalirz self-requested a review December 14, 2022 13:28

ltalirz approved these changes Dec 14, 2022

View reviewed changes

sphuber changed the title ~~Ensure that ProcessNode.get_builder_restart fully restores all inputs including non_db ones~~ Ensure that ProcessNode.get_builder_restart fully restores all inputs including metadata inputs Dec 14, 2022

sphuber merged commit a0cf2ba into aiidateam:main Dec 14, 2022

sphuber deleted the feature/4089/process-store-non-db-inputs branch December 14, 2022 13:48

sphuber mentioned this pull request Feb 23, 2023

Usability: Allow users to relaunch a previously completed process with minimal required setup aiidateam/team-compass#9

Open

4 tasks

sphuber mentioned this pull request Apr 23, 2023

get_builder_restart() failing due to absence of resources #5977

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure that `ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs #5801

Ensure that `ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs #5801

sphuber commented Nov 27, 2022 •

edited

sphuber commented Nov 29, 2022

zhubonan commented Nov 30, 2022

sphuber commented Nov 30, 2022

zhubonan commented Nov 30, 2022

sphuber commented Dec 5, 2022

ltalirz left a comment •

edited

ltalirz Dec 5, 2022 •

edited

sphuber Dec 12, 2022

sphuber Dec 12, 2022

sphuber commented Dec 5, 2022

ltalirz commented Dec 5, 2022 •

edited

sphuber commented Dec 5, 2022 •

edited

sphuber commented Dec 13, 2022

ltalirz left a comment

ltalirz Dec 13, 2022

sphuber Dec 13, 2022

sphuber Dec 14, 2022

ltalirz Dec 13, 2022

sphuber Dec 13, 2022

ltalirz left a comment

Ensure that ProcessNode.get_builder_restart fully restores all inputs including metadata inputs #5801

Ensure that ProcessNode.get_builder_restart fully restores all inputs including metadata inputs #5801

Conversation

sphuber commented Nov 27, 2022 • edited

sphuber commented Nov 29, 2022

zhubonan commented Nov 30, 2022

sphuber commented Nov 30, 2022

zhubonan commented Nov 30, 2022

sphuber commented Dec 5, 2022

ltalirz left a comment • edited

Choose a reason for hiding this comment

ltalirz Dec 5, 2022 • edited

Choose a reason for hiding this comment

sphuber Dec 12, 2022

Choose a reason for hiding this comment

sphuber Dec 12, 2022

Choose a reason for hiding this comment

sphuber commented Dec 5, 2022

ltalirz commented Dec 5, 2022 • edited

sphuber commented Dec 5, 2022 • edited

sphuber commented Dec 13, 2022

ltalirz left a comment

Choose a reason for hiding this comment

ltalirz Dec 13, 2022

Choose a reason for hiding this comment

sphuber Dec 13, 2022

Choose a reason for hiding this comment

sphuber Dec 14, 2022

Choose a reason for hiding this comment

ltalirz Dec 13, 2022

Choose a reason for hiding this comment

sphuber Dec 13, 2022

Choose a reason for hiding this comment

ltalirz left a comment

Choose a reason for hiding this comment

Ensure that `ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs #5801

Ensure that `ProcessNode.get_builder_restart` fully restores all inputs including metadata inputs #5801

sphuber commented Nov 27, 2022 •

edited

ltalirz left a comment •

edited

ltalirz Dec 5, 2022 •

edited

ltalirz commented Dec 5, 2022 •

edited

sphuber commented Dec 5, 2022 •

edited