workflow events: fix an issue where "timeout" events would not fire #5959
Conversation
```python
abort_conf = f"abort on {event}"
if self._get_events_conf(abort_conf):
    # "cylc play" needs to exit with error status here.
    raise SchedulerError(f'"{abort_conf}" is set')
if self._get_events_conf(f"{event} handlers") is not None:
```
This was the source of the bug.

In my case I had `mail events = <event>` set, but not `<event> handlers`, and this was preventing mail events from running.

I think the workflow events manager already has logic for determining when events should run, so this check was superfluous.
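The shape of the bug can be illustrated with a minimal sketch (hypothetical names, not the actual Cylc API): gating on `"<event> handlers"` alone skips events that are only enabled via `mail events`, whereas the fix hands every event to the events manager and lets it decide.

```python
def should_fire(event, conf):
    """Decide whether a workflow event should be processed.

    `conf` is a plain dict standing in for the events configuration.
    The buggy pre-check only let the event through when
    "<event> handlers" was set, so an event enabled solely via
    "mail events" was silently dropped.
    """
    # Buggy version: only handler-configured events get through.
    buggy = conf.get(f"{event} handlers") is not None
    # Fixed version: always pass the event on; the events manager
    # itself determines whether handlers and/or mail apply.
    fixed = True
    return buggy, fixed

# mail is configured for "timeout", but no handlers are set
conf = {"mail events": ["timeout"]}
buggy, fixed = should_fire("timeout", conf)
```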
```python
'simulation': {
    'default run length': 'PT0S',
}
```
Note to self, subtract this when merging into master.
Should I let you merge it in, @oliver-sanders?
Force-pushed from ee762a5 to 2f3da88.
* Fix a small bug where "workflow timeout" and "inactivity timeout" events might not fire depending on the global config.
* Add integration tests to lock down the behaviour of workflow events.
Force-pushed from 2f3da88 to 6977159.
```python
schd._main_loop = killer

# start the scheduler and wait for it to hit the exception
with pytest.raises(Exception):
```
flake8-bugbear points out this is "evil" (their words) - I suggest making sure we know it's the correct exception
```diff
-with pytest.raises(Exception):
+with pytest.raises(Exception, match=r':\('):
```
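For context, `pytest.raises(..., match=pattern)` applies `re.search(pattern, str(exc))` to the raised exception, so it narrows on the message, not the type. A small stdlib-only sketch of that semantics (the helper and the `boom` function are hypothetical stand-ins, not part of the test under discussion):

```python
import re

def raises_matching(fn, exc_type, pattern):
    # Mimics pytest.raises(exc_type, match=pattern): pytest runs
    # re.search(pattern, str(exc)) against the raised exception.
    try:
        fn()
    except exc_type as exc:
        return re.search(pattern, str(exc)) is not None
    return False  # nothing was raised

def boom():
    # hypothetical stand-in for the scheduler bailing out
    raise RuntimeError("scheduler aborted :(")

# match=r':\(' only accepts exceptions whose message contains ":("
matched = raises_matching(boom, Exception, r':\(')
```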
I'm not sure how that match narrows it down?

I didn't want to test the exact exception that comes out, as that would test more internal implementation than needed (making the test more fragile), but I could try for a more specific exception if it's important.

Because the same workflow is also run in other tests (sans exception), the presence of any exception should be good enough here?
I agree it's not too bad considering the full context of the test, but it shouldn't be fragile to narrow it down to the particular exception that's supposed to be raised?
Note, Cylc catches exceptions and re-raises them as something else; this logic is beyond the scope of the test. The test is trying to ensure that, if an unexpected error occurs, the handlers are fired. It doesn't matter to the test what exception is raised, only that the scheduler bails, so checking that it raises `Exception` is sufficient to ensure it's failing, which is all we're interested in here.
Can we at least turn off the red squiggly line?
```diff
-with pytest.raises(Exception):
+with pytest.raises(Exception):  # noqa: B017
```
Could do, but we're not really in the habit of making changes outside of working practices to support people's text editor setups (pyright flags many [mostly invalid] warnings, but we let 'em be).
(This is the repo-configured flake8; when flake8 is run per-file it ignores the `exclude` setting in `tox.ini`.)
(We do not enforce flake8 in the tests, there are many other errors besides this one)
(9 times out of 10 it would be a no-brainer to follow this rule, so there are benefits to flake8-ing the tests)
> The workflow timeout handler is not running when
So, it looks like it actually is firing off the event (and the test picked up on this). BUT, on workflow abort, we terminate all processes in the subprocpool, which kills the handler command, likely before it has had the chance to be issued (see `cylc/flow/scheduler.py` lines 1925 to 1937 at acb164f).

We could implement a mini main loop, just to run the subprocpool until it empties naturally, but it could contain all sorts of stuff, like job submission commands, which we probably don't want to run. Dammit. All I wanted to do was to clarify one line in the docs!
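The "mini main loop" idea could look something like this sketch. `is_not_done()`, `process()` and `terminate()` mirror the `SubProcPool` interface discussed in this PR, but the deadline fallback is invented here so that shutdown cannot hang forever:

```python
import time

def drain_pool(pool, poll_interval=0.05, timeout=10.0):
    """Run a subprocess pool until it empties naturally, with a deadline.

    Sketch only: the pool methods follow the SubProcPool interface
    (is_not_done / process / terminate); the timeout safeguard is an
    assumption, not existing Cylc behaviour.
    """
    deadline = time.monotonic() + timeout
    while pool.is_not_done():
        pool.process()  # reap finished commands, run their callbacks
        if time.monotonic() >= deadline:
            pool.terminate()  # give up and kill whatever is left
            break
        time.sleep(poll_interval)
```

The drawback discussed above still applies: anything already queued (e.g. job submissions) would run to completion rather than being cancelled.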
I've just tried this, but it gets messy.

```diff
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 432ef11ff..687ebbb64 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1415,6 +1415,7 @@ class Scheduler:
         Run workflow events in simulation and dummy mode ONLY if enabled.
         """
+        LOG.warning(f'EVENT {event} - {reason}')
         conf = self.config
         with suppress(KeyError):
             if (
@@ -1925,7 +1926,8 @@ class Scheduler:
             if self.proc_pool.is_not_done():
                 # e.g. KeyboardInterrupt
                 self.proc_pool.terminate()
-            self.proc_pool.process()
+            while self.proc_pool.is_not_done():
+                self.proc_pool.process()
         except Exception as exc:
             LOG.exception(exc)
diff --git a/cylc/flow/subprocpool.py b/cylc/flow/subprocpool.py
index c380bdaa3..a539aa8c5 100644
--- a/cylc/flow/subprocpool.py
+++ b/cylc/flow/subprocpool.py
@@ -332,16 +332,38 @@ class SubProcPool:
     def terminate(self):
         """Drain queue, and kill and process remaining child processes."""
         self.close()
+        keep = []
         # Drain queue
         while self.queuings:
-            ctx = self.queuings.popleft()[0]
-            ctx.err = self.ERR_WORKFLOW_STOPPING
-            ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
-            self._run_command_exit(ctx)
+            item = self.queuings.popleft()
+            ctx = item[0]
+            # if ctx.cmd_key.startswith(WORKFLOW_EVENT_HANDLER)
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if cmd_key.startswith('workflow-event-handler'):
+                keep.append(item)
+            else:
+                ctx.err = self.ERR_WORKFLOW_STOPPING
+                ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
+                self._run_command_exit(ctx)
+
+        # Re-add event handlers back into the queue
+        for item in keep:
+            self.queuings.append(item)
+
         # Kill remaining processes
         for value in self.runnings:
             proc = value[0]
-            if proc:
+            ctx = value[1]
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if not proc:
+                # no process to kill
+                continue
+            if not cmd_key.startswith('workflow-event-handler'):
+                # don't kill event handlers
                 _killpg(proc, SIGKILL)
         # Wait for child processes
         self.process()
```

Any job submission commands get killed; the fallout from this causes the task(s) to be put in the submit-failed state, which in turn causes them to be marked as incomplete. I'm not sure this has ever worked!
Opened a new issue for that: #5997
Looks good, one question though.
```diff
-        await self.main_loop()
+        while True:  # MAIN LOOP
+            await self._main_loop()
```
On changing to this, it might make sense (for clarity) to move the sleep code from the end of the `_main_loop` method into the `while` block?
Reasonable. I think something (besides the sleep) is using `tinit`, which might be why I left it.
True. In that case, fine to leave as-is.
Merging with three approvals.
The result of attempting to document workflow events here: cylc/cylc-doc#685

Writing some integration tests, I discovered that a couple of events weren't being fired in some situations.

[edit] It turns out that the way the scheduler terminates the subproc pool on shutdown means that, whilst this PR fixes the "event not being fired" problem, the event handler still won't actually run. This part of the issue has been bumped to #5997
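For context, the events in question are configured in the workflow's `flow.cylc`. An illustrative fragment (the handler command and timeout values are made up; check the Cylc documentation for the exact settings and handler templates):

```ini
[scheduler]
    [[events]]
        # fire the "timeout" event if the workflow runs too long
        workflow timeout = PT1H
        workflow timeout handlers = my-alert-script  # hypothetical command
        # send mail for the event instead of / as well as running handlers
        mail events = workflow timeout
```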
Check List

* I have read `CONTRIBUTING.md` and added my name as a Code Contributor.
* Dependency changes applied to `setup.cfg` (and `conda-environment.yml` if present).
* `CHANGES.md` entry included if this is a change that can affect users.
* Raised against the relevant `?.?.x` branch.