Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CalcJob: Fix bug causing exception after restarting daemon #5886

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Feb 8, 2023

Fixes #5882

In v2.2.0, a feature was added to attach monitors to a CalcJob for which the Waiting (see aiida.engine.processes.calcjob.tasks) state of the process was modified. The _command attribute was moved inside the data argument passed to the constructor and was initialized only in the constructor. However, when the process is loaded from a serialized checkpoint, such as after a daemon restart, the constructor is not called but load_instance_state is. In this case, the _command attribute would not be set and when referenced it would cause an exception.

The solution is to add the _command attribute to the auto_persist decorator as this will be called on load_instance_state and will reinitialize these attributes from the persisted checkpoint.

A similar problem was lurking with the _monitors attribute which was also only initialized in the constructor. Since this has a custom data type CalcJobMonitors which may not be serializable by auto_persist a property monitors is added that will recreate the attribute and populate it with the CalcJobMonitors instance with the monitors specified in the inputs.

@sphuber
Copy link
Contributor Author

sphuber commented Feb 8, 2023

This is a fix for a critical bug that affects v2.2.0 and up. Ideally I want to have this reviewed and released a.s.a.p. as it affects the latest releases heavily. Anyone got time to review it @chrisjsewell @mbercx @ramirezfranciscof ?

In `v2.2.0`, a feature was added to attach monitors to a `CalcJob` for
which the `Waiting` (see `aiida.engine.processes.calcjob.tasks`) state
of the process was modified. The `_command` attribute was moved inside
the `data` argument passed to the constructor and was initialized only
in the constructor. However, when the process is loaded from a
serialized checkpoint, such as after a daemon restart, the constructor
is not called but `load_instance_state` is. In this case, the `_command`
attribute would not be set and when referenced it would cause an
exception.

The solution is to add the `_command` attribute to the `auto_persist`
decorator as this will be called on `load_instance_state` and will
reinitialize these attributes from the persisted checkpoint.

A similar problem was lurking with the `_monitors` attribute which was
also only initialized in the constructor. Since this has a custom data
type `CalcJobMonitors` which may not be serializable by `auto_persist`
a property `monitors` is added that will recreate the attribute and
populate it with the `CalcJobMonitors` instance with the monitors
specified in the inputs.
@sphuber sphuber force-pushed the fix/5882/calcjob-checkpoint-deserialization branch from 9ad9042 to 6f161db Compare February 8, 2023 16:50
@sphuber
Copy link
Contributor Author

sphuber commented Feb 9, 2023

@ramirezfranciscof can you please have a look a.s.a.p. As mentioned in the meeting, this bug is critically breaking the last release and we need to ship a patch a.s.a.p. Thanks

Copy link
Member

@chrisjsewell chrisjsewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks fine to me, at a glance. But as discussed, you guys worked on this code

@sphuber
Copy link
Contributor Author

sphuber commented Feb 9, 2023

Cheers @chrisjsewell . I will hold off till tomorrow afternoon to give @ramirezfranciscof the chance to also have a look.

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All seems good to me! The only thing I was thinking is that I would watch out if we have many test that need to interact with the daemon, this might slow down the testing process, right? But in that case we just need to move them to the nightly testing set.

@sphuber
Copy link
Contributor Author

sphuber commented Feb 10, 2023

All seems good to me! The only thing I was thinking is that I would watch out if we have many test that need to interact with the daemon, this might slow down the testing process, right? But in that case we just need to move them to the nightly testing set.

Thanks for the review @ramirezfranciscof . I agree that these tests take quite a bit of time for a "unit test", but I think this one is really crucial. Making sure that CalcJobs don't except after restarting the daemon is quite useful 😅 But I take your point that this should not be carte blanche to add these kinds of tests for most test cases

@sphuber sphuber merged commit 46ff10a into aiidateam:release/2.2.2 Feb 10, 2023
@sphuber sphuber deleted the fix/5882/calcjob-checkpoint-deserialization branch February 10, 2023 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants