workflow events: fix an issue where "timeout" events would not fire #5959
Conversation
```python
abort_conf = f"abort on {event}"
if self._get_events_conf(abort_conf):
    # "cylc play" needs to exit with error status here.
    raise SchedulerError(f'"{abort_conf}" is set')
if self._get_events_conf(f"{event} handlers") is not None:
```
This was the source of the bug.

In my case I had `mail events = <event>` set, but not `<event> handlers`, and this was preventing mail events from running.

I think the workflow events manager already has logic for determining when events should run, so this check was superfluous.
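The shape of the bug can be illustrated with a minimal sketch (hypothetical names, not the actual Cylc API): gating on `"<event> handlers"` alone skips events that are only enabled via `mail events`, whereas the fix hands every event to the events manager and lets it decide.

```python
def should_fire(event, conf):
    """Decide whether a workflow event should be processed.

    `conf` is a plain dict standing in for the events configuration.
    The buggy pre-check only let the event through when
    "<event> handlers" was set, so an event enabled solely via
    "mail events" was silently dropped.
    """
    # Buggy version: only handler-configured events get through.
    buggy = conf.get(f"{event} handlers") is not None
    # Fixed version: always pass the event on; the events manager
    # itself determines whether handlers and/or mail apply.
    fixed = True
    return buggy, fixed

# mail is configured for "timeout", but no handlers are set
conf = {"mail events": ["timeout"]}
buggy, fixed = should_fire("timeout", conf)
```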
```python
'simulation': {
    'default run length': 'PT0S',
}
```
Note to self, subtract this when merging into master.
Should I let you merge it in, @oliver-sanders?
Force-pushed from ee762a5 to 2f3da88.
* Fix a small bug where "workflow timeout" and "inactivity timeout" events might not fire depending on the global config.
* Add integration tests to lock down the behaviour of workflow events.
Force-pushed from 2f3da88 to 6977159.
```python
schd._main_loop = killer

# start the scheduler and wait for it to hit the exception
with pytest.raises(Exception):
```
flake8-bugbear points out this is "evil" (their words) - I suggest making sure we know it's the correct exception
```diff
-with pytest.raises(Exception):
+with pytest.raises(Exception, match=r':\('):
```
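For context, `pytest.raises(..., match=pattern)` applies `re.search(pattern, str(exc))` to the raised exception, so it narrows on the message, not the type. A small stdlib-only sketch of that semantics (the helper and the `boom` function are hypothetical stand-ins, not part of the test under discussion):

```python
import re

def raises_matching(fn, exc_type, pattern):
    # Mimics pytest.raises(exc_type, match=pattern): pytest runs
    # re.search(pattern, str(exc)) against the raised exception.
    try:
        fn()
    except exc_type as exc:
        return re.search(pattern, str(exc)) is not None
    return False  # nothing was raised

def boom():
    # hypothetical stand-in for the scheduler bailing out
    raise RuntimeError("scheduler aborted :(")

# match=r':\(' only accepts exceptions whose message contains ":("
matched = raises_matching(boom, Exception, r':\(')
```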
I'm not sure how that match narrows it down?

I didn't want to test the exact exception that comes out, as that would test more internal implementation than needed (making the test more fragile), but I could try for a more specific exception if it's important.

Because the same workflow is also run in other tests (sans exception), the presence of any exception should be good enough here?
I agree it's not too bad considering the full context of the test, but it shouldn't be fragile to narrow it down to the particular exception that's supposed to be raised?
Note, Cylc catches exceptions and re-raises them as something else; this logic is beyond the scope of the test. The test is trying to ensure that, if an unexpected error occurs, the handlers are fired. It doesn't matter to the test what exception is raised, only that the scheduler bails, so checking that it raises `Exception` is sufficient to ensure it's failing, which is all we're interested in here.
Can we at least turn off the red squiggly line?
```diff
-with pytest.raises(Exception):
+with pytest.raises(Exception):  # noqa: B017
```
Could do, but we're not really in the habit of making changes outside of working practices to support people's text editor setups (pyright flags many [mostly invalid] warnings, but we let 'em be).
(This is the repo-configured flake8; when flake8 is run per-file it ignores the `exclude` setting in `tox.ini`.)
(We do not enforce flake8 in the tests, there are many other errors besides this one)
(9 times out of 10 it would be a no-brainer to follow this rule, so there are benefits to flake8-ing the tests)
> The workflow timeout handler is not running when
So, it looks like it actually is firing off the event (and the test picked up on this). BUT, on workflow abort, we terminate all processes in the subprocpool, which kills the handler command, likely before it has had the chance to be issued (see `cylc/flow/scheduler.py` lines 1925 to 1937 at acb164f).

We could implement a mini main loop, just to run the subprocpool until it empties naturally, but it could contain all sorts of stuff, like job submission commands, which we probably don't want to run. Dammit. All I wanted to do was to clarify one line in the docs!
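The "mini main loop" idea could look something like this sketch. `is_not_done()`, `process()` and `terminate()` mirror the `SubProcPool` interface discussed in this PR, but the deadline fallback is invented here so that shutdown cannot hang forever:

```python
import time

def drain_pool(pool, poll_interval=0.05, timeout=10.0):
    """Run a subprocess pool until it empties naturally, with a deadline.

    Sketch only: the pool methods follow the SubProcPool interface
    (is_not_done / process / terminate); the timeout safeguard is an
    assumption, not existing Cylc behaviour.
    """
    deadline = time.monotonic() + timeout
    while pool.is_not_done():
        pool.process()  # reap finished commands, run their callbacks
        if time.monotonic() >= deadline:
            pool.terminate()  # give up and kill whatever is left
            break
        time.sleep(poll_interval)
```

The drawback discussed above still applies: anything already queued (e.g. job submissions) would run to completion rather than being cancelled.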
I've just tried this, but it gets messy.

```diff
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 432ef11ff..687ebbb64 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1415,6 +1415,7 @@ class Scheduler:
         Run workflow events in simulation and dummy mode ONLY if enabled.
         """
+        LOG.warning(f'EVENT {event} - {reason}')
         conf = self.config
         with suppress(KeyError):
             if (
@@ -1925,7 +1926,8 @@ class Scheduler:
             if self.proc_pool.is_not_done():
                 # e.g. KeyboardInterrupt
                 self.proc_pool.terminate()
-            self.proc_pool.process()
+            while self.proc_pool.is_not_done():
+                self.proc_pool.process()
         except Exception as exc:
             LOG.exception(exc)
diff --git a/cylc/flow/subprocpool.py b/cylc/flow/subprocpool.py
index c380bdaa3..a539aa8c5 100644
--- a/cylc/flow/subprocpool.py
+++ b/cylc/flow/subprocpool.py
@@ -332,16 +332,38 @@ class SubProcPool:
     def terminate(self):
         """Drain queue, and kill and process remaining child processes."""
         self.close()
+        keep = []
         # Drain queue
         while self.queuings:
-            ctx = self.queuings.popleft()[0]
-            ctx.err = self.ERR_WORKFLOW_STOPPING
-            ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
-            self._run_command_exit(ctx)
+            item = self.queuings.popleft()
+            ctx = item[0]
+            # if ctx.cmd_key.startswith(WORKFLOW_EVENT_HANDLER)
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if cmd_key.startswith('workflow-event-handler'):
+                keep.append(item)
+            else:
+                ctx.err = self.ERR_WORKFLOW_STOPPING
+                ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
+                self._run_command_exit(ctx)
+
+        # Re-add event handlers back into the queue
+        for item in keep:
+            self.queuings.append(item)
+
         # Kill remaining processes
         for value in self.runnings:
             proc = value[0]
-            if proc:
+            ctx = value[1]
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if not proc:
+                # no process to kill
+                continue
+            if not cmd_key.startswith('workflow-event-handler'):
+                # don't kill event handlers
                 _killpg(proc, SIGKILL)
         # Wait for child processes
         self.process()
```

Any job submission commands get killed; the fallout from this causes the task(s) to be put in the submit-failed state, which in turn causes them to be marked as incomplete. I'm not sure this has ever worked!
Opened a new issue for that: #5997
Looks good, one question though.
```diff
-        await self.main_loop()
+        while True:  # MAIN LOOP
+            await self._main_loop()
```
On changing to this, it might make sense (for clarity) to move the sleep code from the end of the `_main_loop` method into the `while` block?
Reasonable. I think something (besides the sleep) is using `tinit`, which might be why I left it.
True. In that case, fine to leave as-is.
Merging with three approvals.
The result of attempting to document workflow events here: cylc/cylc-doc#685

Writing some integration tests, I discovered that a couple of events weren't being fired in some situations.

[edit] It turns out that the way the scheduler terminates the subproc pool on shutdown means that, whilst this PR fixes the "event not being fired" problem, the event handler still won't actually run. This part of the issue has been bumped to #5997
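For context, the events in question are configured in the workflow's `flow.cylc`. An illustrative fragment (the handler command and timeout values are made up; check the Cylc documentation for the exact settings and handler templates):

```ini
[scheduler]
    [[events]]
        # fire the "timeout" event if the workflow runs too long
        workflow timeout = PT1H
        workflow timeout handlers = my-alert-script  # hypothetical command
        # send mail for the event instead of / as well as running handlers
        mail events = workflow timeout
```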
Check List

* I have read `CONTRIBUTING.md` and added my name as a Code Contributor.
* Dependency changes applied to `setup.cfg` (and `conda-environment.yml` if present).
* `CHANGES.md` entry included if this is a change that can affect users.
* Raised against the relevant `?.?.x` branch.