Handle restart with changed sequence better #6154

Open
MetRonnie opened this issue Jun 18, 2024 · 7 comments
MetRonnie commented Jun 18, 2024

Description

Restarting a workflow with a waiting or orphaned task that is no longer on sequence (because flow.cylc has been changed) can result in a traceback.

Reproducible Example

Run this workflow:

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = foo[-P1] => foo
[runtime]
    [[foo]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT == 2 ]]; then
                cylc stop "$CYLC_WORKFLOW_ID" --now;
            fi
        """

When it shuts down, edit flow.cylc:

     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo
 [runtime]

Then restart the workflow to get:

$ cylc vr <id>
$ cylc cat-log <id>
ERROR - list index out of range
    Traceback (most recent call last):
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 742, in start
        await self.configure(params)
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 490, in configure
        self._load_pool_from_db()
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 792, in _load_pool_from_db
        self.workflow_db_mgr.pri_dao.select_task_pool_for_restart(
      File "~/github/cylc-flow/cylc/flow/rundb.py", line 932, in select_task_pool_for_restart
        platform_error = callback(row_idx, list(row))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "~/github/cylc-flow/cylc/flow/task_pool.py", line 597, in load_db_task_pool_for_restart
        self.compute_runahead()
      File "~/github/cylc-flow/cylc/flow/task_pool.py", line 400, in compute_runahead
        limit_point = sorted(sequence_points)[:ilimit + 1][-1]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
    IndexError: list index out of range
CRITICAL - Workflow shutting down - list index out of range

Expected Behaviour

No traceback; a more helpful error message instead.

Need to handle sequence_points being empty here:

limit_point = sorted(sequence_points)[:ilimit + 1][-1]
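
For illustration, a minimal standalone reproduction (plain Python, not cylc-flow code; the values here are made up) of why that line raises when sequence_points is empty:

# Minimal sketch, assuming an empty set of sequence points and an arbitrary
# integer runahead limit -- not actual cylc-flow code.
sequence_points = set()  # nothing left on sequence after the graph change
ilimit = 2               # hypothetical integer runahead limit

try:
    limit_point = sorted(sequence_points)[:ilimit + 1][-1]
except IndexError as exc:
    # Slicing an empty list yields another empty list, so [-1] has nothing to index.
    print(exc)  # list index out of range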

MetRonnie added the bug label Jun 18, 2024
MetRonnie added this to the 8.3.x milestone Jun 18, 2024

oliver-sanders commented Jul 15, 2024

@MetRonnie, I've not been able to reproduce this one; I have tried 8.3.x, 8.3.0 and 8.2.3.

Might need a tweak to the example.

This PR might help: #6229

MetRonnie commented

I can still reproduce this on 8.3.2 and on 8.3.x as of dc9f01b, copying the example above.


oliver-sanders commented Jul 15, 2024

Ach, blindingly obvious: I didn't reinstall! #6229 does not appear to fix this :(

oliver-sanders self-assigned this Jul 15, 2024
oliver-sanders commented

This is very similar to #6229: the compute_runahead algorithm is (correctly) erroring out due to a complete lack of tasks.

This particular case should be extremely rare, and it is an error case in the first place, so it is fairly low priority.

However, it's an easy fix!

With this diff the workflow will shut down gracefully on restart/reload:

diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..c992e880b 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
             self._prev_runahead_sequence_points = sequence_points
             self._prev_runahead_base_point = base_point
 
+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            return False
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]

However, a reload that essentially empties the task pool is almost certainly a mistake, so going ahead with the restart/reload and wiping out the pool (which makes it difficult to recover the workflow) is not necessarily the best move.

We could consider raising an error here (examples of this form are the only confirmed way to hit this bug). This would cause the restart/reload to fail with an informative message:

diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..647353378 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
             self._prev_runahead_sequence_points = sequence_points
             self._prev_runahead_base_point = base_point
 
+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            raise WorkflowConfigError('No tasks scheduled to run')
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]

Which will it be?

  1. Do what the user asked and log something helpful.
  2. Prevent the operation from being actioned and raise a helpful error.

oliver-sanders removed their assignment Jul 15, 2024
MetRonnie commented

Option 2 is no worse than the current behaviour, so it could at least be implemented as a quick way to address the issue.

Edit: however, with the option 2 patch, I've just seen that reloading a workflow after changing

     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo

causes the workflow to get stuck in the reloading state after the WorkflowConfigError is raised. (This is much the same as the current behaviour.)


oliver-sanders commented Jul 16, 2024

Errors raised during reload should cause the reload to be aborted. The error should be logged and the workflow should continue with the original config.

If that's not happening, we should fix it. I think the reload aborts correctly on master (due to the error in compute_runahead).
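
For what it's worth, a rough sketch of that intended behaviour (hypothetical code, not the actual cylc-flow scheduler; the attribute and method names are invented for illustration):

import logging

LOG = logging.getLogger(__name__)

def reload_workflow(scheduler, new_config):
    # Hypothetical sketch: abort the reload on error, keep the original
    # config, and always clear the "reloading" status so the workflow
    # cannot get stuck in that state.
    old_config = scheduler.config      # assumed attribute, for illustration only
    scheduler.status = 'reloading'     # assumed status flag
    try:
        scheduler.apply_config(new_config)  # assumed to raise WorkflowConfigError on a bad graph
    except Exception as exc:
        LOG.error("Reload aborted: %s", exc)
        scheduler.config = old_config  # continue with the original config
    finally:
        scheduler.status = 'running'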

MetRonnie commented

It aborts; however, the workflow status gets stuck as reloading.
