Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AEP: CalcJob live monitoring #5659

Merged
merged 9 commits into from
Nov 10, 2022

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Sep 24, 2022

@sphuber
Copy link
Contributor Author

sphuber commented Oct 18, 2022

@giovannipizzi @ramirezfranciscof let me know if you would like to review this still. I would like to get this merged soon for the v2.1 releaes

@ramirezfranciscof
Copy link
Member

Sorry @sphuber , busy weeks. Now I need to prepare for my GM on friday, would it be ok if I take a look on friday afternoon?

@ramirezfranciscof
Copy link
Member

Hey, I'm still working on testing this, but one issue I found is that this seems to be using a type operation that is only supported from python 3.10 on (source). When running on 3.9 I get the following (in 3.10 it works fine):

Traceback (most recent call last):
  File "/home/framirez/miniconda3/envs/aurora-aiida/bin/verdi", line 8, in <module>
    sys.exit(verdi())
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/framirez/miniconda3/envs/aurora-aiida/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/framirez/Workenvs/aurora/aiida-core/aiida/cmdline/utils/decorators.py", line 73, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/framirez/Workenvs/aurora/aiida-core/aiida/cmdline/commands/cmd_run.py", line 115, in run
    exec(compile(handle.read(), str(filepath), 'exec', dont_inherit=True), globals_dict)  # pylint: disable=exec-used
  File "submit_newversion.py", line 68, in <module>
    cycler_builder.monitors = {
  File "/home/framirez/Workenvs/aurora/aiida-core/aiida/engine/processes/builder.py", line 123, in __setattr__
    raise ValueError(f'invalid attribute value {validation_error.message}')
ValueError: invalid attribute value `monitors.testing_monitor` is invalid: unsupported operand type(s) for |: 'type' and 'type'

@sphuber
Copy link
Contributor Author

sphuber commented Nov 7, 2022

Thanks for the report @ramirezfranciscof . It's interesting that it is failing because the CI runs on Python 3.8 and the tests pass just fine. Can you share the script that you are running? Note that if you use this notation in older Python versions, you need to explicitly enable it with from __future__ import annotations. Maybe you are using this type notation (str | None, for example) in your test script and that is why it is complaining.

@ramirezfranciscof
Copy link
Member

Yeap, you are right. It is not exactly on the test script but on the plugin implementation, as I was copying the examples in the AEP and didn't realize they were using this python 10 feature.

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @sphuber apologies for the delay, but I can confirm that the feature seems to be compatible with what we are trying to do with the Aurora project. Here are my comments on the code.

One general issue: whenever a calcjob finishes correctly, the status gets stuck in Waiting for transport task: retrieve. Also the kill command leaves a None status rather than the Killed through 'verdi process kill':

  PK  Created    Process label            Process State     Process status
----  ---------  -----------------------  ----------------  ------------------------------------
 161  5D ago     datanode_preparation     ⏹ Finished [0]
 165  5D ago     BatteryCyclerExperiment  ☠ Killed          Killed through `verdi process kill`
 168  5D ago     CalcjobMonitor           ⏹ Finished [0]
 176  5D ago     datanode_preparation     ⏹ Finished [0]
 180  5D ago     BatteryCyclerExperiment  ☠ Killed          None
 183  5D ago     CalcjobMonitor           ⏹ Finished [0]    Waiting for transport task: retrieve
 186  5D ago     datanode_preparation     ⏹ Finished [0]
 190  5D ago     BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 218  1D ago     datanode_preparation     ⏹ Finished [0]
 223  1D ago     BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 274  23h ago    datanode_preparation     ⏹ Finished [0]
 279  23h ago    BatteryCyclerExperiment  ⏹ Finished [150]
 284  22h ago    datanode_preparation     ⏹ Finished [0]
 289  22h ago    BatteryCyclerExperiment  ⏹ Finished [150]
 334  21h ago    datanode_preparation     ⏹ Finished [0]
 339  21h ago    BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 344  5h ago     datanode_preparation     ⏹ Finished [0]
 349  5h ago     BatteryCyclerExperiment  ⏹ Finished [150]
 354  4h ago     datanode_preparation     ⏹ Finished [0]
 358  4h ago     BatteryCyclerExperiment  ⏹ Finished [0]    Waiting for transport task: retrieve
 363  4h ago     datanode_preparation     ⏹ Finished [0]
 367  4h ago     BatteryCyclerExperiment  ☠ Killed          None
 370  4h ago     CalcjobMonitor           ⏹ Finished [0]    Waiting for transport task: retrieve

Total results: 23

(Calculation 165 was run with a previous version of aiida)

I think this might be related to moving the set_process_status method out of the try statement but this is just a guess.

aiida/engine/processes/calcjobs/calcjob.py Outdated Show resolved Hide resolved
aiida/engine/processes/calcjobs/monitors.py Show resolved Hide resolved
aiida/engine/processes/calcjobs/monitors.py Show resolved Hide resolved
aiida/engine/processes/calcjobs/monitors.py Show resolved Hide resolved
aiida/engine/processes/calcjobs/tasks.py Outdated Show resolved Hide resolved
@sphuber
Copy link
Contributor Author

sphuber commented Nov 9, 2022

Thanks for testing and reviewing the code @ramirezfranciscof

One general issue: whenever a calcjob finishes correctly, the status gets stuck in Waiting for transport task: retrieve.

That's a bug indeed. Will have a look to fix it.

Also the kill command leaves a None status rather than the Killed through 'verdi process kill':

I don't think it should show Killed through 'verdi process kill' though because, well, that's not true 😄 We could show something else, but since we already have the dedicated exit code 150 I thought this wasn't necessary.

@ramirezfranciscof
Copy link
Member

I don't think it should show Killed through 'verdi process kill' though because, well, that's not true 😄 We could show something else, but since we already have the dedicated exit code 150 I thought this wasn't necessary.

Haha, yes that is true. I guess what I meant is that the None status is a bit atypical, shouldn't it just show nothing like in regular finished calcjobs? Although now that I think of it, is there any drawback to showing something like Killed by monitor <monitor_keyname> in the status? Or just Killed by monitor if these status can't be "personalized".

@sphuber
Copy link
Contributor Author

sphuber commented Nov 10, 2022

Although now that I think of it, is there any drawback to showing something like Killed by monitor <monitor_keyname> in the status? Or just Killed by monitor if these status can't be "personalized".

We could. We could set the process status to the message of the exit code, which does contain the specific monitor error, but the thing is that it is not really consistent as we also don't do this for any other exit code. In my mind, the code being killed by a monitor is just like the parser returning a non-zero exit code. If a user sees [150] they can run verdi process show or node.exit_message in the API to get the info on the particular monitor that triggered.

We could discuss adding this, but then I think we should add it for all non-zero exit codes. But then we are really duplicating data in the database, since the process_status and exit_message attributes will contain the same value.

@sphuber sphuber force-pushed the feature/1925/calcjob-monitoring branch from bbd7b83 to fe8979c Compare November 10, 2022 08:28
@sphuber
Copy link
Contributor Author

sphuber commented Nov 10, 2022

Haha, yes that is true. I guess what I meant is that the None status is a bit atypical, shouldn't it just show nothing like in regular finished calcjobs?

Ah my bad, I now see what you mean, it is when you actually kill a job it shows literal None in the process status. That would be a regression bug yeah. Will look into it.

Edit: this was actually a change I accidentally introduced in 8bb7b34 . I will open a separate PR to reinstate this behavior.

@sphuber sphuber force-pushed the feature/1925/calcjob-monitoring branch 2 times, most recently from d4d22f0 to f9035fe Compare November 10, 2022 10:45
@sphuber
Copy link
Contributor Author

sphuber commented Nov 10, 2022

@ramirezfranciscof thanks again for the thorough testing. I have added two commits for the bugs that you uncovered. I added them on top so it is easier for you to review the changes. Once you sign off on the changes, I will rebase them into the proper commits.

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me, issues seem addressed. I can't re-test it right now because my computer is doing some lengthy maintenance, but I think it should be ok to merge this and if I find any outstanding problem in testing later we can just do a different PR to fix it.

Instead of killing the process, it is preferable to have the engine
retrieve the files from the job and call the parser, if one was defined.
However, this would result in the process to be marked as terminated
nominally with the exit status returned by the parser. It would be
impossible to see that the job was stopped by a monitor other than from
the process report. It is important that one can query for processes
that were stopped by a monitor. To do so, a dedicated exit code is
defined on the `CalcJob` called `STOPPED_BY_MONITOR` which is set on the
node, overriding any exit status returned by the parser. There is no way
around this since only one exit status can be set on a node.
Nevertheless the result from the parser is logged and so is visible in
the process report.

If monitors are defined for the `CalcJob` the corresponding node will
contains information of the package versions from which the monitors
come. The information is added to the `version` attribute which already
contains version information of `aiida-core` and the `CalcJob` plugin
itself.
@sphuber sphuber force-pushed the feature/1925/calcjob-monitoring branch from f9035fe to a69a84b Compare November 10, 2022 17:06
The `priority` attribute takes an integer and is zero by default. It
allows the user to define the order in which monitors need to be called
in case multiple are defined.

The ordering is implemented in the `CalcJobMonitors` utility class. This
is mostly so that it is easy to unit test. It ensures the monitors are
ordered by their priority, going from high to low. In case of identical
priorities, the monitors are sorted alphabetically by the keys in the
imonitors` input namespace.
The `minimum_poll_interval` is an optional positive integer that can be
defined for a monitor. If defined, the engine will ensure that the
interval between two successive calls of the monitor will at least be
this long.

In order to track the last time a monitor was called, the timestamp is
added to the `call_timestamp` attribute when it is called by the
`CalcJobMonitors.process` method.
This dataclass can be returned by a monitor to communicate to the engine
the course of action to take. Returning a str from a monitor remains
supported as it is automatically converted to a `CalcJobMonitorResult`.

The first attribute that is added is `action` which takes an instance of
the `CalcJobMonitorAction` enum. Currently the only value is `KILL`
which is therefore also the default and instructs the engine to kill the
job and stop monitoring.
By default it is `True`, but if set to `False`, the engine will skip the
parsing step. In this case, the `STOPPED_BY_MONITOR` exit code will be
set on the node.
By default it is `True`, but if set to `False`, the engine will skip the
retrieval step and terminate the process straight away. The exit code
that will be set is `CalcJob.exit_codes.STOPPED_BY_MONITOR` and there
will not be a `retrieved` output node.
By default it is `True`, but if set to `False`, the engine will not
override the exit code returned by the parser with the default exit code
`STOPPED_BY_MONITOR` that is set when a job is stopped through a monitor.
Naturally, this attribute is ignored when the `parse` and or `retrieve`
attribute are set to `False` as in that case the parser is never even
called so there is nothing to not override.
So far the `CalcJobMonitorAction`, the enum that can be returned by a
calcjob monitor, only supported a single option `KILL`. When the engine
receives this instruction, the job will be killed immediately.

An alternative use case is where a monitor has performed a check and an
optional action and now simply wants to let the job run its course.
Often in this case it is important that the monitor itself, and any
others that may have been registered, are no longer run for the lifetime
of the job.

The `CalcJobMonitorAction` now adds the `DISABLE_ALL` option, which when
set on the `action` attribute of a `CalcJobMonitorResult`, the engine
will disable all monitors for the remainder of the duration of the job.
The default action for a `CalcJobMonitorResult` is the option
`CalcJobMonitorAction.KILL` which immediately kills the job. However,
sometimes, one wants to simply disable the monitor and continue running
the job nominally. The `CalcJobMonitorAction.DISABLE_SELF` option
instructs the engine to not call the monitor that returned it again in
future monitor evaluations.
@sphuber sphuber force-pushed the feature/1925/calcjob-monitoring branch from a69a84b to 465d499 Compare November 10, 2022 17:34
@sphuber sphuber merged commit afccd6f into aiidateam:main Nov 10, 2022
@sphuber sphuber deleted the feature/1925/calcjob-monitoring branch November 10, 2022 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Monitor output of jobs while they run
2 participants