AEP: `CalcJob` live monitoring #5659

sphuber · 2022-09-24T13:35:21Z

Fixes #1925

Implementation of AEP: Allow CalcJobs to be actively monitored and interrupted

sphuber · 2022-10-18T13:01:26Z

@giovannipizzi @ramirezfranciscof let me know if you would still like to review this. I would like to get this merged soon for the v2.1 release.

ramirezfranciscof · 2022-10-18T15:12:35Z

Sorry @sphuber , busy weeks. Now I need to prepare for my GM on friday, would it be ok if I take a look on friday afternoon?

ramirezfranciscof · 2022-11-07T17:37:50Z

Hey, I'm still working on testing this, but one issue I found is that this seems to be using a type operation that is only supported from python 3.10 on (source). When running on 3.9 I get the following (in 3.10 it works fine):

Traceback (most recent call last):
  File "/home/framirez/miniconda3/envs/aurora-aiida/bin/verdi", line 8, in <module>
    sys.exit(verdi())
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/framirez/Workenvs/aurora/aiida-core/aiida/cmdline/utils/decorators.py", line 73, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/framirez/Workenvs/aurora/aiida-core/aiida/cmdline/commands/cmd_run.py", line 115, in run
    exec(compile(handle.read(), str(filepath), 'exec', dont_inherit=True), globals_dict)  # pylint: disable=exec-used
  File "submit_newversion.py", line 68, in <module>
    cycler_builder.monitors = {
  File "/home/framirez/Workenvs/aurora/aiida-core/aiida/engine/processes/builder.py", line 123, in __setattr__
    raise ValueError(f'invalid attribute value {validation_error.message}')
ValueError: invalid attribute value `monitors.testing_monitor` is invalid: unsupported operand type(s) for |: 'type' and 'type'

sphuber · 2022-11-07T18:24:13Z

Thanks for the report @ramirezfranciscof. It's interesting that it is failing because the CI runs on Python 3.8 and the tests pass just fine. Can you share the script that you are running? Note that if you use this notation in older Python versions, you need to explicitly enable it with `from __future__ import annotations`. Maybe you are using this type notation (`str | None`, for example) in your test script and that is why it is complaining.
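To illustrate the point about the type notation, here is a minimal sketch in plain Python that does not depend on aiida-core. The function names are hypothetical. An annotation written as a string (which is effectively what `from __future__ import annotations` does for every annotation in a module) is stored unevaluated, so the `|` operator is never executed at definition time. `typing.Optional[str]` is an equivalent spelling that also evaluates on older interpreters.

```python
import typing as t


def monitor_quoted(node, transport) -> 'str | None':
    """Hypothetical monitor whose return annotation is kept as a plain string."""
    return None


def monitor_optional(node, transport) -> t.Optional[str]:
    """Same contract, spelled with ``typing.Optional`` so it evaluates on any Python 3."""
    return None


def union_operator_supported() -> bool:
    """Return whether ``type | type`` can be evaluated at runtime on this interpreter."""
    try:
        _ = str | int  # raises TypeError on interpreters older than Python 3.10
    except TypeError:
        return False
    return True


# The quoted annotation is stored verbatim; nothing evaluates it at definition time.
quoted_annotation = monitor_quoted.__annotations__['return']
optional_annotation = monitor_optional.__annotations__['return']
```

Anything that later evaluates the quoted annotation (for example `typing.get_type_hints`) would still need an interpreter that supports the `|` syntax for types.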

ramirezfranciscof · 2022-11-08T11:38:51Z

Yep, you are right. It is not exactly in the test script but in the plugin implementation, as I was copying the examples in the AEP and didn't realize they were using this Python 3.10 feature.

ramirezfranciscof

Hey @sphuber apologies for the delay, but I can confirm that the feature seems to be compatible with what we are trying to do with the Aurora project. Here are my comments on the code.

One general issue: whenever a calcjob finishes correctly, the status gets stuck in Waiting for transport task: retrieve. Also the kill command leaves a None status rather than the Killed through 'verdi process kill':

  PK  Created    Process label            Process State     Process status
----  ---------  -----------------------  ----------------  ------------------------------------
 161  5D ago     datanode_preparation     ⏹ Finished [0]
 165  5D ago     BatteryCyclerExperiment  ☠ Killed          Killed through `verdi process kill`
 168  5D ago     CalcjobMonitor           ⏹ Finished [0]
 176  5D ago     datanode_preparation     ⏹ Finished [0]
 180  5D ago     BatteryCyclerExperiment  ☠ Killed          None
 183  5D ago     CalcjobMonitor           ⏹ Finished [0]    Waiting for transport task: retrieve
 186  5D ago     datanode_preparation     ⏹ Finished [0]
 190  5D ago     BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 218  1D ago     datanode_preparation     ⏹ Finished [0]
 223  1D ago     BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 274  23h ago    datanode_preparation     ⏹ Finished [0]
 279  23h ago    BatteryCyclerExperiment  ⏹ Finished [150]
 284  22h ago    datanode_preparation     ⏹ Finished [0]
 289  22h ago    BatteryCyclerExperiment  ⏹ Finished [150]
 334  21h ago    datanode_preparation     ⏹ Finished [0]
 339  21h ago    BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 344  5h ago     datanode_preparation     ⏹ Finished [0]
 349  5h ago     BatteryCyclerExperiment  ⏹ Finished [150]
 354  4h ago     datanode_preparation     ⏹ Finished [0]
 358  4h ago     BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 363  4h ago     datanode_preparation     ⏹ Finished [0]
 367  4h ago     BatteryCyclerExperiment  ☠ Killed          None
 370  4h ago     CalcjobMonitor           ⏹ Finished [0]    Waiting for transport task: retrieve

Total results: 23

(Calculation 165 was run with a previous version of aiida)

I think this might be related to moving the set_process_status method out of the try statement but this is just a guess.

aiida/engine/processes/calcjobs/calcjob.py

aiida/engine/processes/calcjobs/monitors.py

aiida/engine/processes/calcjobs/tasks.py

sphuber · 2022-11-09T16:53:02Z

Thanks for testing and reviewing the code @ramirezfranciscof

> One general issue: whenever a calcjob finishes correctly, the status gets stuck in Waiting for transport task: retrieve.

That's a bug indeed. Will have a look to fix it.

> Also the kill command leaves a None status rather than the Killed through 'verdi process kill':

I don't think it should show Killed through 'verdi process kill' though because, well, that's not true 😄 We could show something else, but since we already have the dedicated exit code 150 I thought this wasn't necessary.

ramirezfranciscof · 2022-11-09T17:23:33Z

> I don't think it should show Killed through 'verdi process kill' though because, well, that's not true 😄 We could show something else, but since we already have the dedicated exit code 150 I thought this wasn't necessary.

Haha, yes that is true. I guess what I meant is that the None status is a bit atypical, shouldn't it just show nothing like in regular finished calcjobs? Although now that I think of it, is there any drawback to showing something like Killed by monitor <monitor_keyname> in the status? Or just Killed by monitor if these statuses can't be "personalized".

sphuber · 2022-11-10T08:25:25Z

> Although now that I think of it, is there any drawback to showing something like Killed by monitor <monitor_keyname> in the status? Or just Killed by monitor if these status can't be "personalized".

We could. We could set the process status to the message of the exit code, which does contain the specific monitor error, but the thing is that it is not really consistent as we also don't do this for any other exit code. In my mind, the code being killed by a monitor is just like the parser returning a non-zero exit code. If a user sees [150] they can run verdi process show or node.exit_message in the API to get the info on the particular monitor that triggered.

We could discuss adding this, but then I think we should add it for all non-zero exit codes. But then we are really duplicating data in the database, since the process_status and exit_message attributes will contain the same value.

sphuber · 2022-11-10T08:33:34Z

> Haha, yes that is true. I guess what I meant is that the None status is a bit atypical, shouldn't it just show nothing like in regular finished calcjobs?

Ah my bad, I now see what you mean, it is when you actually kill a job it shows literal None in the process status. That would be a regression bug yeah. Will look into it.

Edit: this was actually a change I accidentally introduced in 8bb7b34 . I will open a separate PR to reinstate this behavior.

sphuber · 2022-11-10T10:47:02Z

@ramirezfranciscof thanks again for the thorough testing. I have added two commits for the bugs that you uncovered. I added them on top so it is easier for you to review the changes. Once you sign off on the changes, I will rebase them into the proper commits.

ramirezfranciscof

Good for me, issues seem addressed. I can't re-test it right now because my computer is doing some lengthy maintenance, but I think it should be ok to merge this and if I find any outstanding problem in testing later we can just do a different PR to fix it.

Instead of killing the process outright, it is preferable to have the engine retrieve the files from the job and call the parser, if one was defined. However, this would result in the process being marked as terminated nominally with the exit status returned by the parser, making it impossible to see that the job was stopped by a monitor other than from the process report. It is important that one can query for processes that were stopped by a monitor. To that end, a dedicated exit code called `STOPPED_BY_MONITOR` is defined on the `CalcJob`; it is set on the node, overriding any exit status returned by the parser. There is no way around this since only one exit status can be set on a node. Nevertheless, the result from the parser is logged and so is visible in the process report.

If monitors are defined for the `CalcJob`, the corresponding node will contain information on the package versions from which the monitors come. The information is added to the `version` attribute, which already contains version information of `aiida-core` and the `CalcJob` plugin itself.

The `priority` attribute takes an integer and is zero by default. It allows the user to define the order in which monitors are called when multiple are defined. The ordering is implemented in the `CalcJobMonitors` utility class, mostly so that it is easy to unit test. It ensures the monitors are ordered by their priority, going from high to low. In case of identical priorities, the monitors are sorted alphabetically by their keys in the `monitors` input namespace.
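The ordering rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code of the `CalcJobMonitors` class: the function name `order_monitor_keys` and the monitor keys are hypothetical, and each monitor is modelled as a plain dictionary with an optional `priority` entry.

```python
def order_monitor_keys(monitors: dict) -> list:
    """Return monitor keys sorted by descending priority, ties broken alphabetically by key.

    A missing ``priority`` is treated as zero, matching the default described above.
    """
    return sorted(monitors, key=lambda key: (-monitors[key].get('priority', 0), key))


# Hypothetical contents of a ``monitors`` input namespace.
monitors = {
    'zeta': {'priority': 5},
    'alpha': {},
    'beta': {'priority': 5},
    'gamma': {'priority': -1},
}

ordered = order_monitor_keys(monitors)
```

Negating the priority inside the sort key gives the high-to-low ordering while leaving the key comparison ascending, so a single `sorted` call covers both criteria.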

The `minimum_poll_interval` is an optional positive integer that can be defined for a monitor. If defined, the engine ensures that the interval between two successive calls of the monitor is at least this long. To track the last time a monitor was called, the timestamp is stored in the `call_timestamp` attribute when the monitor is called by the `CalcJobMonitors.process` method.
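A minimal sketch of this throttling logic, assuming timestamps are plain floats in seconds (the source does not state the unit or the type used by the engine). `MonitorState` and `call_if_due` are hypothetical names introduced only for illustration.

```python
import time
import typing as t
from dataclasses import dataclass


@dataclass
class MonitorState:
    """Per-monitor bookkeeping; a stand-in, not the aiida-core class."""

    minimum_poll_interval: t.Optional[int] = None  # assumed to be in seconds
    call_timestamp: t.Optional[float] = None  # time of the last call, if any


def call_if_due(state: MonitorState, monitor: t.Callable[[], t.Any], now: t.Optional[float] = None):
    """Call ``monitor`` unless it was called less than ``minimum_poll_interval`` ago.

    Returns a ``(called, result)`` tuple so a skipped call is distinguishable
    from a monitor that ran and returned ``None``.
    """
    now = time.monotonic() if now is None else now
    interval = state.minimum_poll_interval

    if interval is not None and state.call_timestamp is not None:
        if now - state.call_timestamp < interval:
            return False, None

    # Record the timestamp at call time so the next check measures from here.
    state.call_timestamp = now
    return True, monitor()
```

Passing `now` explicitly keeps the function deterministic in tests; in normal use it falls back to `time.monotonic()`.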

The `CalcJobMonitorResult` dataclass can be returned by a monitor to communicate to the engine the course of action to take. Returning a `str` from a monitor remains supported, as it is automatically converted to a `CalcJobMonitorResult`. The first attribute that is added is `action`, which takes an instance of the `CalcJobMonitorAction` enum. Currently the only value is `KILL`, which is therefore also the default and instructs the engine to kill the job and stop monitoring.

The `parse` attribute is `True` by default, but if set to `False`, the engine will skip the parsing step. In this case, the `STOPPED_BY_MONITOR` exit code will be set on the node.

The `retrieve` attribute is `True` by default, but if set to `False`, the engine will skip the retrieval step and terminate the process straight away. The exit code that will be set is `CalcJob.exit_codes.STOPPED_BY_MONITOR` and there will not be a `retrieved` output node.

The `override_exit_code` attribute is `True` by default, but if set to `False`, the engine will not override the exit code returned by the parser with the default exit code `STOPPED_BY_MONITOR` that is set when a job is stopped through a monitor. Naturally, this attribute is ignored when the `parse` and/or `retrieve` attributes are set to `False`, as in that case the parser is never called and there is nothing to override.
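The interplay of the three flags described above can be summarized in a small decision function. This is a sketch under stated assumptions rather than the engine's code: `MonitorResult` is a simplified stand-in for `CalcJobMonitorResult`, `final_exit_status` is a hypothetical helper, the attribute name `override_exit_code` is assumed, and the value 150 is the exit status mentioned in the discussion earlier in this thread.

```python
from dataclasses import dataclass

STOPPED_BY_MONITOR = 150  # exit status referred to in the conversation above


@dataclass
class MonitorResult:
    """Simplified stand-in for ``CalcJobMonitorResult``; only the three flags are modelled."""

    retrieve: bool = True
    parse: bool = True
    override_exit_code: bool = True  # assumed attribute name


def final_exit_status(result: MonitorResult, parser_exit_status: int) -> int:
    """Return the exit status that ends up on the node after a monitor stopped the job."""
    # Without retrieval or parsing the parser never runs, so its status cannot be kept.
    if not result.retrieve or not result.parse:
        return STOPPED_BY_MONITOR

    # The parser ran: keep its status only if overriding was explicitly switched off.
    if result.override_exit_code:
        return STOPPED_BY_MONITOR

    return parser_exit_status
```

For example, with all defaults the node gets status 150 regardless of what the parser returned, whereas `MonitorResult(override_exit_code=False)` lets the parser's own status through.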

So far the `CalcJobMonitorAction`, the enum that can be returned by a calcjob monitor, supported only a single option, `KILL`. When the engine receives this instruction, the job is killed immediately. An alternative use case is where a monitor has performed a check and an optional action and now simply wants to let the job run its course. Often in this case it is important that the monitor itself, and any others that may have been registered, are no longer run for the lifetime of the job. The `CalcJobMonitorAction` now adds the `DISABLE_ALL` option: when it is set on the `action` attribute of a `CalcJobMonitorResult`, the engine disables all monitors for the remainder of the job.

The default action for a `CalcJobMonitorResult` is the option `CalcJobMonitorAction.KILL` which immediately kills the job. However, sometimes, one wants to simply disable the monitor and continue running the job nominally. The `CalcJobMonitorAction.DISABLE_SELF` option instructs the engine to not call the monitor that returned it again in future monitor evaluations.
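The three actions can be illustrated with a small stand-alone model. It is a sketch, not the engine's implementation: `Action`, `Result` and `MonitorRunner` are hypothetical stand-ins for `CalcJobMonitorAction`, `CalcJobMonitorResult` and `CalcJobMonitors`, the enum values are arbitrary, and it assumes that `DISABLE_ALL` also skips any monitors that have not yet run in the current round.

```python
import enum
import typing as t
from dataclasses import dataclass


class Action(enum.Enum):
    KILL = 'kill'
    DISABLE_ALL = 'disable-all'
    DISABLE_SELF = 'disable-self'


@dataclass
class Result:
    message: str = ''
    action: Action = Action.KILL  # killing the job is the default action


class MonitorRunner:
    """Calls monitors in insertion order and applies the action each one returns."""

    def __init__(self, monitors: t.Dict[str, t.Callable[[], t.Any]]):
        self.monitors = dict(monitors)
        self.disabled: t.Set[str] = set()
        self.all_disabled = False

    def process(self) -> t.Optional[Result]:
        """Run one monitoring round; return a ``Result`` only if the job should be killed."""
        if self.all_disabled:
            return None

        for key, monitor in self.monitors.items():
            if key in self.disabled:
                continue

            outcome = monitor()
            if outcome is None:
                continue  # the monitor has nothing to report
            if isinstance(outcome, str):
                outcome = Result(message=outcome)  # a bare string means "kill with this message"

            if outcome.action is Action.DISABLE_SELF:
                self.disabled.add(key)  # never call this particular monitor again
            elif outcome.action is Action.DISABLE_ALL:
                self.all_disabled = True  # no monitor is called again for this job
                return None
            else:
                return outcome  # Action.KILL

        return None
```

A monitor that only needs to act once would return `Result(action=Action.DISABLE_SELF)`, leaving the other monitors active, while `Action.DISABLE_ALL` lets the job run to completion unmonitored.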

sphuber force-pushed the feature/1925/calcjob-monitoring branch 5 times, most recently from 2e87140 to f20bb13 Compare September 25, 2022 20:46

sphuber mentioned this pull request Sep 26, 2022

Add AEP: Allow CalcJobs to be actively monitored and interrupted aiidateam/AEP#36

Merged

5 tasks

sphuber force-pushed the feature/1925/calcjob-monitoring branch from f20bb13 to 958b565 Compare September 28, 2022 15:02

sphuber requested review from ramirezfranciscof and giovannipizzi September 28, 2022 15:40

sphuber force-pushed the feature/1925/calcjob-monitoring branch from 958b565 to bbd7b83 Compare October 14, 2022 08:02

ramirezfranciscof suggested changes Nov 9, 2022

sphuber force-pushed the feature/1925/calcjob-monitoring branch from bbd7b83 to fe8979c Compare November 10, 2022 08:28

sphuber force-pushed the feature/1925/calcjob-monitoring branch 2 times, most recently from d4d22f0 to f9035fe Compare November 10, 2022 10:45

sphuber requested a review from ramirezfranciscof November 10, 2022 10:47

ramirezfranciscof approved these changes Nov 10, 2022

sphuber force-pushed the feature/1925/calcjob-monitoring branch from f9035fe to a69a84b Compare November 10, 2022 17:06

sphuber added 2 commits November 10, 2022 18:26

sphuber added 6 commits November 10, 2022 18:30

CalcJobMonitorResult: Add the parse attribute

eebb811

By default it is `True`, but if set to `False`, the engine will skip the parsing step. In this case, the `STOPPED_BY_MONITOR` exit code will be set on the node.

sphuber force-pushed the feature/1925/calcjob-monitoring branch from a69a84b to 465d499 Compare November 10, 2022 17:34

sphuber merged commit afccd6f into aiidateam:main Nov 10, 2022

sphuber deleted the feature/1925/calcjob-monitoring branch November 10, 2022 20:09

edan-bainglass mentioned this pull request Oct 19, 2023

Connecting monitor actions to the provenance #6158

Open
