Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine: fix bug that allowed non-storable inputs to be passed to process #5532

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented May 20, 2022

Fixes #5128

The basic assumption for a Process in aiida-core is that all of its
inputs should be storable in the database as nodes. Under the current
link model, this means that they should be instances of the Data class
or subclasses thereof. There is a noticeable exception for ports that
are explicitly marked as non_db=True, in which case the value is not
linked as a node, but is stored as an attribute directly on the process
node itself, or not stored whatsoever.

This basic rule was never explicitly enforced, which made it possible to
define processes that would happily take non-storable inputs. The input
would not get stored in the database, but would be available within the
processes lifetimes from the inputs property allowing it to be used.
This will most likely result into unintentional loss of provenance.

The reason is that the default valid_type of the top-level inputs
namespace of the Process class was never being set to Data. This
meant that any type would be excepted for a Process and all its
subclasses unless the valid type of a port was explicitly overridden.
This meant that for normal dynamic namespaces, even non-storable types
would be accepted just fine. Setting valid_type=Data for the input
namespace of the Process class fixes the problem therefore.

The basic assumption for a `Process` in `aiida-core` is that all of its
inputs should be storable in the database as nodes. Under the current
link model, this means that they should be instances of the `Data` class
or subclasses thereof. There is a noticeable exception for ports that
are explicitly marked as `non_db=True`, in which case the value is not
linked as a node, but is stored as an attribute directly on the process
node itself, or not stored whatsoever.

This basic rule was never explicitly enforced, which made it possible to
define processes that would happily take non-storable inputs. The input
would not get stored in the database, but would be available within the
processes lifetimes from the `inputs` property allowing it to be used.
This will most likely result into unintentional loss of provenance.

The reason is that the default `valid_type` of the top-level inputs
namespace of the `Process` class was never being set to `Data`. This
meant that any type would be excepted for a `Process` and all its
subclasses unless the valid type of a port was explicitly overridden.
This meant that for normal dynamic namespaces, even non-storable types
would be accepted just fine. Setting `valid_type=Data` for the input
namespace of the `Process` class fixes the problem therefore.
@sphuber sphuber requested review from mbercx and ltalirz May 20, 2022 19:58
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber , this looks fine to me.

Just for me to understand: this only affected sub-namespaces and ports therein or also input ports defined at the top level?

If the latter, then I could imagine some plugins already relying on this.

Let me know if you want me to approve or rather wait for the review by Marnik.

spec.input_namespace('namespace.nested', dynamic=True)
spec.input('namespace.a', valid_type=int)
spec.input('namespace.c', valid_type=dict)
spec.input_namespace('namespace.nested', dynamic=True, non_db=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmyeah, interesting that there were even tests that relied on this behavior ;-)

@sphuber
Copy link
Contributor Author

sphuber commented May 20, 2022

Just for me to understand: this only affected sub-namespaces and ports therein or also input ports defined at the top level?
If the latter, then I could imagine some plugins already relying on this.

This affected the top-level input namespace so this includes all sub-namespaces and input ports at the top-level. If a plugin indeed relied on specifying an input port without a Data (or subclass thereof) as valid_type, then that would have "worked" even if non_db was not set to True. However, maybe it would be good to have that break, because they would probably unintentionally be losing provenance since the inputs are not actually stored in the database.

Let me know if you want me to approve or rather wait for the review by Marnik.

I think we can merge this with just a single review. Thinking of it now, Marnik will actually be away for a few weeks

@ltalirz ltalirz self-requested a review May 20, 2022 23:29
@ltalirz
Copy link
Member

ltalirz commented May 20, 2022

I agree that this is a pattern we would like developers to get away from; just saying that this may break plugins. That said, a quick search on Github for strings like valid_type=int, valid_type=str did not turn up any results.

@sphuber sphuber merged commit 5c1eb3f into aiidateam:main May 21, 2022
@sphuber sphuber deleted the fix/5128/process-function-allows-non-data-input branch May 21, 2022 06:57
@ltalirz
Copy link
Member

ltalirz commented Jun 20, 2022

Just wanted to point out one use case I stumbled upon that is somewhat related to this:

Having a dynamic input namespace (e.g. steps) in a WorkChain, to which you pass multiple CalcJob builder objects via steps['step1']=builder.
This might be a useful pattern, e.g. if some of the logic in creating the inputs / metadata is quite machine-dependent and you don't want to put it inside the workchain.

This approach still seems to be allowed, even after this fix, although the metadata of the builders won't be available at the WorkChain level (which users may not realize).
In practice, however, in this particular example the metadata is typically not totally lost, since it will be available from the calcjob instances called by the workchain (possibly modified by the workchain's internal logic, though).

sphuber added a commit that referenced this pull request Sep 22, 2022
…ess (#5532)

The basic assumption for a `Process` in `aiida-core` is that all of its
inputs should be storable in the database as nodes. Under the current
link model, this means that they should be instances of the `Data` class
or subclasses thereof. There is a noticeable exception for ports that
are explicitly marked as `non_db=True`, in which case the value is not
linked as a node, but is stored as an attribute directly on the process
node itself, or not stored whatsoever.

This basic rule was never explicitly enforced, which made it possible to
define processes that would happily take non-storable inputs. The input
would not get stored in the database, but would be available within the
processes lifetimes from the `inputs` property allowing it to be used.
This will most likely result into unintentional loss of provenance.

The reason is that the default `valid_type` of the top-level inputs
namespace of the `Process` class was never being set to `Data`. This
meant that any type would be excepted for a `Process` and all its
subclasses unless the valid type of a port was explicitly overridden.
This meant that for normal dynamic namespaces, even non-storable types
would be accepted just fine. Setting `valid_type=Data` for the input
namespace of the `Process` class fixes the problem therefore.

Cherry-pick: 5c1eb3f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Process functions allow non-Data arguments to be passed as input
2 participants