Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protocols: Move all static work chain inputs to protocol #902

Merged
merged 4 commits into from
Apr 18, 2023

Conversation

mbercx
Copy link
Member

@mbercx mbercx commented Apr 6, 2023

Fixes #830

Some "default" inputs are set inside one of the steps of several work chains, which:

  1. Can be confusing for the user, since the expected inputs are not there in the input files.
  2. Can be frustrating for the user when a different value is desirable, for a use case that may not immediately be obvious.
  3. Means default values are specified both in the work chain logic and protocol, making it more difficult to get a clear overview of the input parameters.

Here we move the specification of all static default inputs to the corresponding protocol file. Additionally:

  • Since the default protocol now correctly sets the calculation type to bands, the validate_inputs validator of the PwCalculation will raise a warning because a parent_folder has not been initially provided. Hence, we set the validate_inputs_base validator for the pw port of the bands namespace, as is also done for e.g. the nscf of the PdosWorkchain.
  • We fix a bug in the base_final_scf part of the PwRelaxWorkChain protocol, that erroneously specified CELL.press_conv_thr.
  • We move the logic in PwBaseWorkChain.set_max_seconds to PwBaseWorkChain.prepare_process. Having a separate method for these few lines of code seemed unnecessary, especially since it's the only step that is executed in prepare_process.
  • We remove the code added in 0aba276 that automatically set RestartType.FULL in case the parent_folder input is provided for the PwBaseWorkChain. There are simply too many different use cases of restarting I think it's difficult if not impossible to consider all of them, and in several known use cases this code addition would cause more harm than good.

@sphuber
Copy link
Contributor

sphuber commented Apr 6, 2023

Finally, we switch to using 2 spaces for the indentation of the protocol YAML files to make them easier to read.

I realize that this is always going to be opinion-based, but I actually feel the opposite is true. I find the 2-space indentation very confusing and makes it difficult to see what level of nesting I am looking at. Why would we decide that 4 spaces for code is correct, but not for the configuration? Anyway, since this distracts from this PR and also makes the actual changes to the configs more difficult to spot, could you please remove this for now? Happy to discuss this in a separate issue/PR and if others users also really prefer 2-space, then I won't stop it.

Once the tests are then also running, I will review this. Thanks!

@mbercx
Copy link
Member Author

mbercx commented Apr 6, 2023

Anyway, since this distracts from this PR and also makes the actual changes to the configs more difficult to spot, could you please remove this for now?

Sure! My preference for 2-space indentations may also be due to the fact that VScode adds vertical lines for the indentation in YAML files. The more compact 2-space indentations make it easier to see settings at a glance for me. But it's not critical.

Note that this PR is built on top of #903, so is still blocked until we merge that.

Edit: Nevermind, that was kind of unnecessary anyways!

@mbercx mbercx added the pr/blocked PR is blocked by another PR that should be merged first label Apr 6, 2023
@mbercx mbercx removed the pr/blocked PR is blocked by another PR that should be merged first label Apr 6, 2023
@sphuber
Copy link
Contributor

sphuber commented Apr 6, 2023

Sure! My preference for 2-space indentations may also be due to the fact that VScode adds vertical lines for the indentation in YAML files. The more compact 2-space indentations make it easier to see settings at a glance for me. But it's not critical.

My editor also shows the tab-lines, but with 2-spaces, it just all merges into one big blob of lines 😅

@mbercx
Copy link
Member Author

mbercx commented Apr 6, 2023

My editor also shows the tab-lines, but with 2-spaces, it just all merges into one big blob of lines 😅

This may be related to the fact that you use such ridiculously small fonts that only your Dutch-elvish eyes can read 👀

@mbercx mbercx force-pushed the fix/877/bands-restart branch 4 times, most recently from b8e7a46 to fe8938a Compare April 6, 2023 21:51
@mbercx
Copy link
Member Author

mbercx commented Apr 6, 2023

Alright, I also added the restart inputs explicitly, and switched to using paro for the diagonalization. She's ready for review now, @sphuber.

@sphuber
Copy link
Contributor

sphuber commented Apr 6, 2023

Before going into the review, two points that came to mind:

  1. Although I agree that it might make sense to "concentrate" the logic surrounding defaults, moving everything to the protocol has two downsides that I can think of:
    • It cannot use logic, and so "dynamic" defaults will still have to be set in code somewhere, that can still end up overriding initial inputs as defined in the protocol. So I am not sure if we can ever fully get around this. But maybe having slightly more concentrated logic is better than nothing.
    • The defaults and checks will now only be considered if the user goes through get_builder_from_protocol. When building the inputs manually and submitting directly, all of this is bypassed. We can still decide that users should simply (always) go through the get_builder_from_protocol but we should make this super duper clear.
  2. This only addresses the PwBandsWorkChain, but as I understand there are still "problems" downstream. I don't think this PR addresses restart_mode should be set to from_scratch for a bands calculation. #877 (not that you marked it as such, but just mentioning this). The warning is issued in the PwCalculation.validate_inputs_base and is actually incorrect. There are valid cases where you want restart_mode = 'restart' for an NSCF calculation. We should remove that warning. The PwBaseWorkChain.setup method is actually still overriding various input parameters. Is that logic still correct? Should that also be done in the protocol/get_builder_from_protocol? At least for the initialization maybe, the set_restart_type is still needed for actual restarts within the workchain. Anyway, what I am saying is that we might need to take a look at the whole stack and make sure it is consistent, no?

@mbercx
Copy link
Member Author

mbercx commented Apr 6, 2023

Thanks for your thoughts, @sphuber!

It cannot use logic, and so "dynamic" defaults will still have to be set in code somewhere, that can still end up overriding initial inputs as defined in the protocol.

That's definitely true. We should probably try to report such dynamically set inputs so the user can more easily find them.

The defaults and checks will now only be considered if the user goes through get_builder_from_protocol.

Another good point! I would argue beginners are now immediately introduced to the get_builder_from_protocol() method when starting to run work chains, and only advanced users will be able to set up the inputs from scratch. I would definitely put default inputs in the protocol, to avoid the issue I mention here. We could consider also setting defaults (i.e. never fully override) inputs in the work chain logic, but then there is some duplicated information...

I'm all for making it super duper clear that the using the get_builder_from_protocol() is the way to go.

There are valid cases where you want restart_mode = 'restart' for an NSCF calculation.

Are there? In the QE docs on the restart_mode input:

https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm65

The restart option clearly states: "Do not use to start a new one, or to perform a non-scf calculations."

The PwBaseWorkChain.setup method is actually still overriding various input parameters. Is that logic still correct?

Haven't looked at these in the context of this PR, but happy to make it more "holistic", i.e. try to implement this principle downstream. Will give a crack at this tomorrow and report back!

@mbercx
Copy link
Member Author

mbercx commented Apr 6, 2023

One additional thing to note here is that we should really try and come at this from the perspective of the user, and perhaps it would be good to write down as many use cases as we can think of.

@sphuber
Copy link
Contributor

sphuber commented Apr 7, 2023

I'm all for making it super duper clear that the using the get_builder_from_protocol() is the way to go.

Ok, I'm on board. Let's go in that direction then that protocols should be as complete as possible and get_builder_from_protocol should be the main recommended entry point for users.

Are there? In the QE docs on the restart_mode input:
https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm65
The restart option clearly states: "Do not use to start a new one, or to perform a non-scf calculations."

As I understood it, this is to restart an interrupted NSCF calculation. But reading the docs again, it is a bit ambiguous, and thinking about it now, I am not sure if an NSCF calculation could be performed partially and shutdown cleanly and then restarted to perform the rest. Maybe it can actually do part of all the k-points and then write to disk, and you can restart to do the final set of k-points? Would be interesting to test.

One additional thing to note here is that we should really try and come at this from the perspective of the user, and perhaps it would be good to write down as many use cases as we can think of.

Fully agree, but that is the tricky point here. For novice users, you want to do as much as possible in an automatic way, but also correct inputs that are most likely incorrect. This is where it becomes difficult though, because sometimes there are expert users who do something on purpose and there is nothing more frustrating than the code overriding your desired behavior and not allowing to change it. But in that sense, maybe the get_builder_from_protocol is the best solution as we can make it as automated as possible for novice users, while power users can set anything on the builder afterwards without the workchain then interfering and double-guessing them.

@mbercx
Copy link
Member Author

mbercx commented Apr 9, 2023

Maybe it can actually do part of all the k-points and then write to disk, and you can restart to do the final set of k-points? Would be interesting to test.

Well, I stand corrected, you actually can ^^. And this will indeed only work when restart_mode is set to restart. So setting this input is a legitimate use case for nscf and bands calculations.

I'm still conflicted on what to do with this code though:

# If a ``parent_folder`` is specified, automatically set the parameters for a ``RestartType.Full`` unless the
# ``CONTROL.restart_mode`` has explicitly been set to ``from_scratch``. In that case, the user most likely set
# that, and we do not want to override it.
restart_mode = self.ctx.inputs.parameters['CONTROL'].get('restart_mode', None)
if 'parent_folder' in self.ctx.inputs and restart_mode != 'from_scratch':
self.set_restart_type(RestartType.FULL, self.ctx.inputs.parent_folder)

Let me go over some use cases to discuss this change and if we want to keep it.

Use case A

Let me first repeat here the use case you mentioned here that was the reason for adding this code:

This resulted from a case from a user. They launched a calculation and it didn't finish. I pointed them to get_builder_restart as an easy way to get a builder that can be launched to restart and finish the calculation. However, as described in the commit message, this would still keep from_scratch and essentially the calculation would be redone and not finish.

Example run of use case.
builder = PwBaseWorkChain.get_builder_from_protocol(
    code=pw_code,
    structure=structure,
)

builder.clean_workdir = orm.Bool(False)
builder.max_iterations = orm.Int(1)
builder.pw.parameters['CONTROL']['max_seconds'] = 20

wc_node = submit(builder)

while not wc_node.is_finished:
    time.sleep(5)

parent_folder = wc_node.called_descendants[-1].outputs.remote_folder

restart_builder = wc_node.get_builder_restart()
restart_builder.pw.parent_folder = parent_folder

submit(restart_builder)

A couple of comments here:

  1. This will only work in case the user set clean_workdir to False for the initial run. The default for all protocols is True.
  2. The user will still have to get the remote_folder from the last PwCalculation in the failed PwBaseWorkChain. By testing this, I actually found out the outputs or the wrapped PwCalculation are not added in case the PwBaseWorkChain fails. Perhaps this is something we should fix?
  3. Restarting the calculation like this will raise a warning:
    /home/mbercx/envs/aiida-dev/code/aiida-quantumespresso/src/aiida_quantumespresso/calculations/pw.py:178: 
    UserWarning: `parent_folder` input was provided for the `PwCalculation`, but no input parameters are set to restart from 
    these files.
    
    Which I still believe to be a fair warning to have.

Without the code that sets the restart_mode when a parent_folder is present, the user would have to add the following code:

parameters = restart_builder.pw.parameters.get_dict()
parameters['CONTROL']['restart_mode'] = 'restart'
restart_builder.pw.parameters = orm.Dict(parameters)

A bit tedious perhaps, but not insurmountable. And the validation warning implemented on the PwCalculation inputs will point them in this direction (but should perhaps be made more clear).

Use case B

The user has successfully completed a PwBaseWorkChain, and they want to perform a calculation on top of this one starting from the charge density. Since from_scratch is the default setting for restart_mode, they will simply add the parent_folder input and set the ELECTRONS.starting_pot parameter to file.

By adding the code above, restart_mode will now be set to restart and the startingpot input will be removed. Now, in many cases the QE behaviour will still be as desired. It seems QE will ignore the wave function files in case some parts of the setup have changed (e.g. number of bands/kpoints) or the wave function files are missing (restart from stashed charge density). However, there may be use cases where the restart behaviour is not as the user desired:

  1. The restart calculation only changes occupations == fixed. In this case it seems QE will restart from both the charge density and wave functions:

         The initial density is read from file :
         ./out/aiida.save/charge-density
         starting charge       8.0001, renormalised to       8.0000
         Starting wfcs from file

    Which may be undesirable, also see this comment.

  2. The restart calculation wants to use a different geometry but start from the charge density (there may be use cases for this). With restart_mode == restart, the geometry will be read from the previous output, not the new one provided in the inputs:

          Atomic positions and unit cell read from directory:
         ./out/aiida.save/

I can probably think of others. The user could of course explicitly set restart_mode to from_scratch to avoid this, but I don't think we can expect a user to have to consider this, since this is the default of restart_mode in QE.

Use case C

Many work chains rely on restarts from previous steps in the work chain. The example that we try to fix in this PR is the one raised by @unkcpz (#877), where the restart for the band structure calculation suddenly needs to explicitly set restart_mode to from_scratch. We can fix this by being explicit in the protocols, as we do here, but there are other higher-level work chains out there that might not have considered this.

Conclusion

Although the addition of the code above in 0aba276 makes it slightly more straightforward to restart from interrupted runs, it complicates many other use cases in a manner that can be confusing and frustrating for the user. We could think of ways to restrict when the restart_mode input is set to restart, but there are so many different use cases of restarting I think it's difficult if not impossible to consider all of them.

I think we should:

  1. (This PR) Remove the code above, and add documentation on how to restart from interrupted runs to help with use case A.
  2. (PwCalculation: Fix restart validation for nscf/bands #906) Remove some of the warnings related to restarts from a parent_folder. E.g. it seems restarting an nscf or bands calculation with restart_mode == restart is a valid use case for interrupted runs.
  3. (PwBaseWorkChain: Add outputs of final PwCalculation when failed #923) Add the outputs of the last PwCalculation to those of the PwBaseWorkChain even if the calculation didn't complete successfully.
  4. (aiida-core) Fix an issue related to setting nested Dict content silently failing, see below.

Side tangent: trying to set the restart_mode directly on the parameters node will of course not work:

restart_builder.pw.parameters['CONTROL']['restart_mode'] = 'restart'

since the node is the same one as the already run PwBaseWorkChain and hence is stored and immutable. However, the above line of code will raise no warning or error! @sphuber I think this was already raised at some point, but not sure if there is an open issue for this. In short, trying to change a top-level key of the Dict node raises:

In [1]: d = Dict({'a': 1})
   ...: d['b'] = 2
   ...: print(d.get_dict())
{'a': 1, 'b': 2}

In [2]: d.store()
   ...: d['c'] = 3
   ...: print(d.get_dict())
---------------------------------------------------------------------------
ModificationNotAllowed                    Traceback (most recent call last)
Cell In[2], line 2
      1 d.store()
----> 2 d['c'] = 3
      3 print(d.get_dict())

File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/data/dict.py:71, in Dict.__setitem__(self, key, value)
     70 def __setitem__(self, key, value):
---> 71     self.base.attributes.set(key, value)

File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/attributes.py:118, in NodeAttributes.set(self, key, value)
    110 def set(self, key: str, value: Any) -> None:
    111     """Set an attribute to the given value.
    112 
    113     :param key: name of the attribute
   (...)
    116     :raise aiida.common.ModificationNotAllowed: if the entity is stored
    117     """
--> 118     self._node._check_mutability_attributes([key])  # pylint: disable=protected-access
    119     self._backend_node.set_attribute(key, value)

File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/node.py:209, in Node._check_mutability_attributes(self, keys)
    202 """Check if the entity is mutable and raise an exception if not.
    203 
    204 This is called from `NodeAttributes` methods that modify the attributes.
    205 
    206 :param keys: the keys that will be mutated, or all if None
    207 """
    208 if self.is_stored:
--> 209     raise exceptions.ModificationNotAllowed('the attributes of a stored entity are immutable')

ModificationNotAllowed: the attributes of a stored entity are immutable

But trying to change nested content of the Dict node fails silently:

In [3]: d = Dict({'a': {'b': None, 'c': None}})
   ...: d['a']['b'] = 2
   ...: print(d.get_dict())
{'a': {'b': 2, 'c': None}}

In [4]: d.store()
   ...: d['a']['c'] = 3
   ...: print(d.get_dict())
{'a': {'b': 2, 'c': None}}

@sphuber
Copy link
Contributor

sphuber commented Apr 9, 2023

Thanks for the detailed writeup @mbercx , this is super important and useful. Let me address your points, starting with the last:

However, the above line of code will raise no warning or error! @sphuber I think this was already raised at some point, but not sure if there is an open issue for this.

I definitely remember writing about this in an issue and basically coming to the conclusion that it will be very difficult to come up with a solution for this. Essentially what is happening with the following code:

restart_builder.pw.parameters['CONTROL']['restart_mode'] = 'restart'

First restart_builder.pw.parameters['CONTROL'] calls _getitem__ with CONTROL as key. As the implementation shows:

    def __getitem__(self, key):
        try:
            return self.base.attributes.get(key)
        except AttributeError as exc:
            raise KeyError from exc

this will get the attribute from the database with that name and return it. The return type is a plain Python dictionary, let's call it d. So next, the code calls d['restart_mode'] = 'restart , which will indeed set the restart_mode key to restart, but that is of the Python dictionary d and not of the node attributes.

The only "solution" I see here is to have Dict.__getitem__ not return a plain dict but some custom class that implements all dict methods, but overrides all setters and delete attributes (and potentially others) and either raises or emits a warning. Now while this may be possible, I see many potential pitfalls that will be difficult to foresee.

Then, as for your other points. I see how helping making one use-case easier, may hamstring another user. Maybe it is true that in the end, it is quite complicated and we cannot warn/prevent for all use-cases and will have to leave some up to documentation.

Add the outputs of the last PwCalculation to those of the PwBaseWorkChain even if the calculation didn't complete successfully.

Don't see a problem with that, fine to open a PR for this.

(#906) Remove some of the warnings related to restarts from a parent_folder. E.g. it seems restarting an nscf or bands calculation with restart_mode == restart is a valid use case for interrupted runs.

👍

(This PR) Remove the code above, and add documentation on how to restart from interrupted runs to help with use case A.

Fine, but I will be sending all complaints of people, trying to restart and the calculation failing again because it is restarting from scratch, to you 😄

@mbercx
Copy link
Member Author

mbercx commented Apr 15, 2023

Fine, but I will be sending all complaints of people, trying to restart and the calculation failing again because it is restarting from scratch, to you 😄

Hehe, agreed! I think there is more risk of someone using the PwBaseWorkChain running into a "cleaned working directory" is bigger though, see #915.

Regarding the documentation, I'm always sort of paralysed when I start trying to add something to the docs, because there are so many things I would change I feel I would go down a documentation rabbit hole that would cause me sleep deprivation. So I've opened an issue for this instead:

#916

I think we should have a meeting to discuss the documentation again, also in light of this decision:

Ok, I'm on board. Let's go in that direction then that protocols should be as complete as possible and get_builder_from_protocol should be the main recommended entry point for users.

EDIT: on the following:

The only "solution" I see here is to have Dict.getitem not return a plain dict but some custom class that implements all dict methods, but overrides all setters and delete attributes (and potentially others) and either raises or emits a warning.

I see, quite tricky indeed... I understand a solution is not simple, but I think we should thinking about doing so for stored nodes. It's quite an insidious bug that this passes without warning/error. Let me try and dig up some issues on this.

EDIT: Couldn't find an issue for this, so opened aiidateam/aiida-core#5970

CELL:
press_conv_thr: 0.5
CONTROL:
calculation: scf
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not ask questions how a pressure convergence condition was the only part of the "scf" step of the PwRelaxWorkChain protocol. 😅

@mbercx mbercx changed the title PwBandsWorkChain: Move inputs to protocol Protocols: Move all static work chain inputs to protocol Apr 16, 2023
@mbercx
Copy link
Member Author

mbercx commented Apr 16, 2023

@sphuber OP message updated for new commit message, tests adapted. Ready for review!

@mbercx mbercx force-pushed the fix/877/bands-restart branch 2 times, most recently from 2e3cce5 to 969d559 Compare April 17, 2023 09:56
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbercx . Just a single request for change, then we can merge.

That being said, I do think it is very important we add as fast as possible some minimal documentation that at the very least states that the workchains do not automatically set required defaults when launched directly and they can fail, and that therefore they are recommended to use the get_builder_from_protocol. I am pretty sure that there will be quite some users that do not use the protocols and will not set all the proper defaults because the workchains used to do that.

CONTROL:
calculation: bands
ELECTRONS:
diagonalization: paro
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value was added in QE 6.6? Since that is the oldest version we support?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yessir, see

#830 (comment)

@@ -229,7 +229,6 @@ def run_scf(self):
inputs.metadata.call_link_label = 'scf'
inputs.pw.structure = self.ctx.current_structure
inputs.pw.parameters = inputs.pw.parameters.get_dict()
inputs.pw.parameters.setdefault('CONTROL', {})['calculation'] = 'scf'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are removing this, maybe we move line 231 to the if self.ctx.current_number_of_bands: block, since then we only unwrap-and-wrap the parameters if we really need to update it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to still add this change @mbercx ? Then this can be merged

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this comment somehow. On it! 🚀

Several "default" inputs of the `PwBandsWorkChain` are set inside one of the steps of the work chain, which:

1. Can be confusing for the user, since the expected inputs are not there in the input files.
2. Can be frustrating for the user when a different value is desirable, for a use case that may not immediately be obvious.
3. Means default values are specified _both_ in the work chain logic and protocol, making it more difficult to get a clear overview of the input parameters.

Here we move the specification of the default inputs to the protocol file of the `PwBandsWorkChain` (`bands.yaml`). We also explicitly set the restart inputs to make sure the `bands` calculations restarts from the charge density by default.

Since the default protocol now correctly sets the calculation type to `bands`, the `validate_inputs` validator of the `PwCalculation` will raise a warning because a `parent_folder` has not been initially provided. Hence, we set the `validate_inputs_base` validator for the `pw` port of the `bands` namespace, as is also done for e.g. the `nscf` of the `PdosWorkchain`.
This is already the default in QE anyways, and is specified as such in the protocol.
@mbercx
Copy link
Member Author

mbercx commented Apr 18, 2023

That being said, I do think it is very important we add as fast as possible some minimal documentation that at the very least states that the workchains do not automatically set required defaults when launched directly and they can fail, and that therefore they are recommended to use the get_builder_from_protocol. I am pretty sure that there will be quite some users that do not use the protocols and will not set all the proper defaults because the workchains used to do that.

Agreed. I would love to give the documentation some love soon. Will try to find the time later this week!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PwBandsWorkChain: default diagonalization set to 'cg'?
2 participants