Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that ProcessNode.get_builder_restart fully restores all inputs including metadata inputs #5801

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Nov 27, 2022

Fixes #4089

Note that this requires a new version of plumpy to fix a critical bug. The updated version of plumpy has been released and the requirements have been updated.

This essentially makes it possible to call get_builder_restart for any completed ProcessNode and actually get a builder that perfectly matched the inputs that were used for the creation of the node. Up till now this was not possible, especially not for WorkChains. This is the first step to making it possible and easy to, given a provenance graph, rerun any part of it to reproduce it. The one and only downside of this is that part of the input information is duplicated in the database. I feel that this duplication is worth the gain though.

@sphuber sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 3 times, most recently from 4ae09fc to 7a730b4 Compare November 29, 2022 16:01
@sphuber
Copy link
Contributor Author

sphuber commented Nov 29, 2022

@zhubonan you might be interested in this. Would be good to have your feedback if you are still using AiiDA.

@zhubonan
Copy link
Contributor

Thanks this lookds great to me! This is a functionality that I have being waiting for a long time. I can see the get_builder_restart method is removed - I guess this is moved to the plumpy upstream?

Just one comment - in the rase case that the user supplies some inputs that are not json serializable, would it except immediately?

As for storing duplicated data, I think it is not a problem because before this the only way to make get_builder_restart works for workchains is to store the options keys for the calc jobs in a Dict node (this is the case in aiida-vasp). Which also stores duplicated data in the database. Even with this the user still have to set the other keys of metadata such as metadata.label which makes doing restarts cumbersome.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 30, 2022

Thanks this lookds great to me! This is a functionality that I have being waiting for a long time. I can see the get_builder_restart method is removed - I guess this is moved to the plumpy upstream?

No, it is still there, it is defined on the Process base class. What is removed in this PR was the method on the CalcJob class, which overrode the method of the Process parent in order to add the metadata options. But this change is now superfluous as this is now taken care off by Process itself for all process classes.

Just one comment - in the rase case that the user supplies some inputs that are not json serializable, would it except immediately?

Yes, if the port is also non_db then now it will except. I thought that this use case would probably not exist since I think this would anyway fail, for at least WorkChains, as when the checkpoint is saved, the entire process instance (including inputs) is YAML serialized. However, there may be edge-cases for inputs that are YAML-serializable but not JSON-serializable, or this could be done for non-workchain classes, such as CalcJobs. So maybe it is best to change this behavior to emitting a UserWarning instead and skipping the input from storing?

@zhubonan
Copy link
Contributor

Thanks, I see!

So maybe it is best to change this behavior to emitting a UserWarning instead and skipping the input from storing?

Yes it would be useful to emitt a warning I think.

@sphuber sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 2 times, most recently from 2f6277e to 571499b Compare December 5, 2022 16:03
@sphuber
Copy link
Contributor Author

sphuber commented Dec 5, 2022

@ltalirz maybe you are interested in reviewing this? This is the first step required to a PoC I am working on that allows to rerun any workflow run in AiiDA and perfectly reproduce it. The second step is to support Docker containers (which I also have running) and final step is a function that, given a completed workflow node, can recreate the original builder exactly and relaunch it, automatically creating all the external codes necessary and rerunning in docker containers. As you have said before, this would be a very powerful demonstrator, and after this we are almost there.

@ltalirz ltalirz self-requested a review December 5, 2022 21:19
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @sphuber, very nice to see that we are now on the road towards this project!

I guess one comment I would have is that, after this change, the AiiDA API is effectively lying to the user, right? The user explicitly requests an input to be non_db, but AiiDA simply ignores this and stores the information in the database nevertheless.

If there was a reason for the user not to store this information in the database (e.g. sensitive / size / ...), this could be problematic.

It might be worth discussing this at the AiiDA meeting, i.e. including both the original rationale behind making some fields non_db, and what a future name for this field could be.

aiida/engine/processes/process.py Show resolved Hide resolved
aiida/engine/processes/process.py Outdated Show resolved Hide resolved

result[key] = non_db_value

return result
Copy link
Member

@ltalirz ltalirz Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also

Suggested change
return result
return result or None

here, unless you want the distinction between {} and None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work, since None is technically a value and so I could get a return value:

{
    'sub': {
        {
            'namespace': {
                 'a': None
            }
    }
}

And now the prune call to prune empty mappings will no longer work since technically the sub.namespace.a namespace is no longer empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, this case is handled because None values are automatically handled in the same recursive loop and ignored. So I think your suggestion also works, but effectively there is no change in behavior.

@sphuber
Copy link
Contributor Author

sphuber commented Dec 5, 2022

It might be worth discussing this at the AiiDA meeting, i.e. including both the original rationale behind making some fields non_db, and what a future name for this field could be.

The original rationale of non_db was purely an internal one. When we wrote the new process-style interface for v1.0, we needed to deal somehow with inputs that were not nodes. Notably, these were "inputs" that were set on the process node attributes, all of which are essentially part of the metadata namespace, such as label, description and the options of the CalcJob process. It was never really intended to be used to actually circumvent provenance storage, but I can see how maybe some users may have ended up using it. There is a team meeting tomorrow morning and I will bring this up to see if people think we need a deprecation pathway for this. It would be quite cumbersome to implement I think. Do you have any indications of people using non_db to pass sensitive data as inputs and preventing it from being stored in the database?

@ltalirz
Copy link
Member

ltalirz commented Dec 5, 2022

Ok, I see. I just browsed the changes quickly and it looked like the storing of this information was new, but I guess this applies only to the non_db ports outside the metadata.
Do those two need to be handled separately?

From your explanation, I guess the meaning is closer to orm=False than non_db=True.
I don't know of the non_db field being used to pass sensitive information, but if the behavior changes we may need to at least update the documentation.

@sphuber
Copy link
Contributor Author

sphuber commented Dec 5, 2022

I just browsed the changes quickly and it looked like the storing of this information was new, but I guess this applies only to the non_db ports outside the metadata.

Indeed, for the metadata inputs (i.e. those defined by aiida-core base classes), the data was anyway already being stored. It was just being done manually in the Process and CalcJob class.

That being said, if a user defined a custom process with an input port designated as non_db, that will now be stored (as long as it is JSON-serializable). So if they used this functionality with the exact purposes to not have it stored in any way, that would definitely break that. I always thought that people would mostly just use this functionality (if at all) to get around the requirement that all inputs have to be Node instances. For example, I have seen people use this to pass Group instances as inputs to a WorkChain. This works, but is anyway something that should be discouraged because it is breaking provenance and leads to unpredictable results.

Do those two need to be handled separately?

Kind of yes, because they are not all stored in the same way. For example, the label and description just happen to be stored in a dedicated column on the Node database model, whereas most of the others are just stored in the attributes. Additional complication is that the metadata.options are not stored in their "namespace" but are stored flat. So for example the metadata.options.queue option is simply stored in the attributes directly under queue. Another difference is that currently not only those options that are explicitly defined are stored in the attributes, but also the values provided by the port defaults. This is not the case for what is stored in this PR: there I only store what is explicitly passed by the caller, i.e., the "raw" inputs.

So although it may be possible in principle to generalize all this, this would require quite some discussions, data migrations and other laborious stuff that may not be worth it.

I don't know of the non_db field being used to pass sensitive information, but if the behavior changes we may need to at least update the documentation.

Absolutely, that is a given.

@sphuber sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 3 times, most recently from 248f9c5 to fd31f2b Compare December 12, 2022 10:21
@sphuber
Copy link
Contributor Author

sphuber commented Dec 13, 2022

@ltalirz as promised I discussed with the team and we agree with you that it is best to not change the behavior of non_db. So instead, we define a new keyword metadata which takes on the originally intended role of non_db: marking a port whose value is not stored as Data and linked to the process node, but is stored on the process node directly, somehow. The keyword was chosen because it plays nicely with the fact that all the originally non_db ports that need to be changed to metadata port are also directly part of the metadata input namespace. non_db ports specified in plugins will continue working as before, i.e., their values are not permanently stored.

The only question remaining is that I haven't documented the new metadata keyword, but that is because it is not clear whether it should be. It is not clear whether we should allow or encourage plugins to add metadata inputs to processes and have them automatically stored in the attributes or something like that. For the time being, I would not do this as it can always be added later once we have had more time to see if this is an important valid use-case.

@sphuber sphuber requested a review from ltalirz December 13, 2022 09:26
@sphuber sphuber force-pushed the feature/4089/process-store-non-db-inputs branch from fd31f2b to 38c9bbe Compare December 13, 2022 09:28
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber !

That seems like a good solution to me.

I understand that non_db is not deprecated for the moment and the idea is to have both non_db and metadata supported for the time being? (fine with me)

aiida/engine/processes/ports.py Show resolved Hide resolved
return self._explicitly_set

@property
def metadata(self) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name was discussed, and the preference was for metadata over is_metadata, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We (@chrisjsewell and @giovannipizzi ) agreed on metadata but I have to say we didn't consider a lot of other options. is_metadata wasn't brought up for example. Happy to think about the naming if others think this is also a better option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brought it up during today's meeting and we agree that is_metadata is a better name. Have updated the code and commits.

aiida/engine/processes/process.py Outdated Show resolved Hide resolved
# Store JSON-serializable values of ``netadata`` ports in the node's attributes. Note that instead of passing in
# the ``metadata`` inputs directly, the entire namespace of raw inputs is passed. The reason is that although
# currently in ``aiida-core`` all input ports that set ``metadata=True`` in the port specification are located
# within the ``metadata`` port namespace, this may not always be the case. The ``_filter_serializable_metadata``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to mention how you plan to use this rather than just that it could be used.
(or just remember to update this comment when you add your next PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This essentially pertains to my last comment regarding the last remaining open question. Should we allow other plugins to add metadata ports and have them store data in the process node's attributes? If so, then this is necessary. If we don't, then we may still want to keep this although it can be done differently. I am not sure what the answer to the question is though

tests/engine/test_process.py Outdated Show resolved Hide resolved
The `_recursive_merge` method could raise a `KeyError` for a process
builder that contains a dynamic port namespace. In this case it would be
possible to merge in a value that contains a nested dictionay which
would try to access directly `dictionary[key]`, where `dictionary` is
the namespace of the builder, and so hit a `KeyError` since this nested
namespace didn't exist yet in the dynamical namespace.

The solution is to explicitly check if the `key` already exists in the
builder's namespace, and if not, we assig the entire `value` to it, as
there is no nothing to recurse into.
The method is a generic utility to operate on mappings and is not
specific to the `ProcessBuilder`. It needs to be used elsewhere soon so
therefore it is moved to a standalone function in a new utility module
`aiida.engine.processes.utils`.
@sphuber sphuber force-pushed the feature/4089/process-store-non-db-inputs branch 2 times, most recently from 8c18f86 to 59ebc13 Compare December 14, 2022 11:56
In the engine redesign from `v0.x` to `v1.0` the process interface
needed a way to distinguish inputs that are lined up to the process node
as nodes themselves and those that are stored directly on the process
node, for example as an attribute. To this end the `non_db` keyword was
added to the `InputPort`. This keyword was used for the `metadata` input
namespace, to designate that these inputs would not be `Data` instances.

The name is quite unfortunate, however, since the inputs of these ports
actually do get stored in the database, contrary to what the keyword
suggests. The most straightforward solution would be to rename the
keyword, however, for better or worse, the keyword has been adopted by
(a limited amount of) plugin packages. A known application is to pass in
`Group` instances which are not storable as nodes, but can be passed
through a `non_db` port. Hypothetically, users may have used the port as
well to pass in sensitive data that should never be stored. Renaming the
port or changing its behavior is therefore likely to break existing
code.

Instead, the `is_metadata` keyword is added to the `InputPort` and
`PortNamespace` through the `WithMetadata` mixin. When set to `True`
this keyword functions as the original intention of the `non_db` flag
and an `is_metadata` port signals that its value will be stored in the
database but directly on the `ProcessNode` instead of being linked as a
`Data` node. The naming makes sense, because the only use of this
keyword is by the `metadata` input namespace that all `Process` classes
have. The inputs in this namespace are stored, through custom logic in
the `Process` and `CalcJob` class, in the attributes of the node or
dedicated columns of the node database model, such as the label and the
description.

This addition leaves the behavior of `non_db` inputs unchanged and so
plugin packages that use this keyword should continue to function as
before.
The promise of AiiDA's provenance is that all inputs to a `Process` are
stored as nodes in the provenance graph linked to a node representing
the process. From the very early beginning, however, there needed to be
exceptions of inputs to processes that were not nodes. Notable examples
were the various "options" set for calculation jobs. Even in AiiDA v0.x
these "settings" as they were called back then, were stored in the
attributes of the node.

In the AiiDA v1.0 redesign, where all inputs of a process are defined
through the process spec, these non-node inputs were implemented by
allowing input ports to be made non-database storable, indicated by the
`non-db` argument in the port declaration. These inputs would be passed
to the `Process` instance and would be available during its lifetime,
but would not be stored in the database. Once again there are exceptions
as certain inputs defined by `aiida-core` are stored on the node, but in
various places. Notable examples are the `label` and `description` of
the process, and the `metadata.options` of the `CalcJob` class.

This historical decision has as a direct result in that it is difficult
if not impossible in certain cases to reconstruct the exact input
dictionary that was used to run a `Process` from the data stored on the
`ProcessNode`. From a provenance point of view, this is a huge weakpoint
and is what is being corrected here.

The input ports marked `non_db=True` on the base process classes
provided by `aiida-core`, `Process` and `CalcJob` were changed to use the
new `is_metadata` keyword instead in the previous commit, to remove the
inconsistency between the naming and behavior. In this commit, all inputs
that correspond to `is_metadata` ports and are JSON-serializable are stored
in the attributes of the process node under the key `metadata_inputs`.
All `is_metadata` input ports that are defined on process base classes by
`aiida-core`, such as `Process` and `CalcJob` *are* JSON serializable.
However, plugin packages can implement process classes with ports that
accept inputs that are not JSON serializable, which is why this
additional condition has to be added. But all inputs defined by
`aiida-core` should be covered.
The `get_builder_restart` method on the `ProcessNode` base class would
return a `ProcessBuilder` with the inputs set to those attached to that
node instance. The `CalcJobNode` would override this to add the metadata
options as well. This would be ok for `CalcJobNode`s, but if a restart
builder was created from a `WorkChainNode` that calls a `CalcJobNode` it
would have also received options, but those would not be restored. One
could think that when calling `get_restart_builder` on a `WorkChainNode`
that we can just go down the callstack, find all the `CalcJobNode`s and
set the options in the respective input namespaces. But this would not
exactly reproduce the original inputs, as the options that a calculation
job has received could have been a modified version of the original
options passed to the workchain, changed in the logic of the workchain.

Instead, now that the exact metadata inputs for each process are stored
in the attribute of the node, added in the previous commit, it is this
dictionary that is used to restore the exact original inputs. It not
only addresses the problem of incorrect `CalcJob` options, but it also
restores any other metadata inputs such as the `label` and `description`.
@sphuber sphuber force-pushed the feature/4089/process-store-non-db-inputs branch from 59ebc13 to d46d77b Compare December 14, 2022 12:14
@ltalirz ltalirz self-requested a review December 14, 2022 13:28
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sphuber ! no further comments from my side

@sphuber sphuber changed the title Ensure that ProcessNode.get_builder_restart fully restores all inputs including non_db ones Ensure that ProcessNode.get_builder_restart fully restores all inputs including metadata inputs Dec 14, 2022
@sphuber sphuber merged commit a0cf2ba into aiidateam:main Dec 14, 2022
@sphuber sphuber deleted the feature/4089/process-store-non-db-inputs branch December 14, 2022 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure that ProcessNode.get_builder_restart fully restores all inputs including non_db ones
3 participants