Handle restart with changed sequence better #6154

Open
MetRonnie opened this issue Jun 18, 2024 · 7 comments
MetRonnie commented Jun 18, 2024

Description

Restarting a workflow with a waiting or orphaned task that is no longer on sequence (because flow.cylc has been changed) can result in a traceback.

Reproducible Example

Run this workflow:

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = foo[-P1] => foo
[runtime]
    [[foo]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT == 2 ]]; then
                cylc stop "$CYLC_WORKFLOW_ID" --now;
            fi
        """

When it shuts down, edit flow.cylc:

     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo
 [runtime]

Then restart the workflow to get:

$ cylc vr <id>
$ cylc cat-log <id>
ERROR - list index out of range
    Traceback (most recent call last):
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 742, in start
        await self.configure(params)
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 490, in configure
        self._load_pool_from_db()
      File "~/github/cylc-flow/cylc/flow/scheduler.py", line 792, in _load_pool_from_db
        self.workflow_db_mgr.pri_dao.select_task_pool_for_restart(
      File "~/github/cylc-flow/cylc/flow/rundb.py", line 932, in select_task_pool_for_restart
        platform_error = callback(row_idx, list(row))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "~/github/cylc-flow/cylc/flow/task_pool.py", line 597, in load_db_task_pool_for_restart
        self.compute_runahead()
      File "~/github/cylc-flow/cylc/flow/task_pool.py", line 400, in compute_runahead
        limit_point = sorted(sequence_points)[:ilimit + 1][-1]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
    IndexError: list index out of range
CRITICAL - Workflow shutting down - list index out of range

Expected Behaviour

No traceback; a more helpful error message instead.

Need to handle sequence_points being empty here:

limit_point = sorted(sequence_points)[:ilimit + 1][-1]
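
For illustration, a minimal standalone reproduction (plain Python, not cylc-flow code; the values here are made up) of why that line raises when sequence_points is empty:

# Minimal sketch, assuming an empty set of sequence points and an arbitrary
# integer runahead limit -- not actual cylc-flow code.
sequence_points = set()  # nothing left on sequence after the graph change
ilimit = 2               # hypothetical integer runahead limit

try:
    limit_point = sorted(sequence_points)[:ilimit + 1][-1]
except IndexError as exc:
    # Slicing an empty list yields another empty list, so [-1] has nothing to index.
    print(exc)  # list index out of range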

MetRonnie added the bug label Jun 18, 2024
MetRonnie added this to the 8.3.x milestone Jun 18, 2024

oliver-sanders commented Jul 15, 2024

@MetRonnie, I've not been able to reproduce this one; I have tried 8.3.x, 8.3.0 and 8.2.3.

Might need a tweak to the example.

This PR might help: #6229

MetRonnie commented

I can still reproduce this on 8.3.2 and on 8.3.x as of dc9f01b, copying the example above.


oliver-sanders commented Jul 15, 2024

Ach, blindingly obvious: I didn't reinstall! #6229 does not appear to fix this :(

oliver-sanders self-assigned this Jul 15, 2024
oliver-sanders commented

This is very similar to #6229: the compute_runahead algorithm is (correctly) erroring out due to a complete lack of tasks.

This particular case should be extremely rare, and it is an error case in the first place, so it is fairly low priority.

However, it's an easy fix!

With this diff the workflow will shut down gracefully on restart/reload:

diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..c992e880b 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
             self._prev_runahead_sequence_points = sequence_points
             self._prev_runahead_base_point = base_point
 
+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            return False
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]

However, a reload that essentially empties the task pool is almost certainly a mistake, so going ahead with the restart/reload and wiping out the pool (which makes it difficult to recover the workflow) is not necessarily the best move.

We could consider raising an error here (examples of this form are the only confirmed way to hit this bug). This would cause the restart/reload to fail with an informative message:

diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index f1a6942f0..647353378 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -398,6 +398,9 @@ class TaskPool:
             self._prev_runahead_sequence_points = sequence_points
             self._prev_runahead_base_point = base_point
 
+        if len(sequence_points) == 0:
+            # no cycles to schedule
+            raise WorkflowConfigError('No tasks scheduled to run')
         if count_cycles:
             # (len(list) may be less than ilimit due to sequence end)
             limit_point = sorted(sequence_points)[:ilimit + 1][-1]

Which will it be?

  1. Do what the user asked and log something helpful.
  2. Prevent the operation from being actioned and raise a helpful error.

oliver-sanders removed their assignment Jul 15, 2024
MetRonnie commented

Option 2 is no worse than the current behaviour, so it could at least be implemented as a quick way to address the issue.

Edit: however, with the option 2 patch, I've just seen that reloading a workflow after changing

     [[graph]]
-        P1 = foo[-P1] => foo
+        R1 = foo[-P1] => foo

causes the workflow to get stuck in the reloading state after the WorkflowConfigError is raised. (This is much the same as the current behaviour.)


oliver-sanders commented Jul 16, 2024

Errors raised during reload should cause the reload to be aborted. The error should be logged and the workflow should continue with the original config.

If that's not happening, we should fix it. I think the reload aborts correctly on master (due to the error in compute_runahead).
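
For what it's worth, a rough sketch of that intended behaviour (hypothetical code, not the actual cylc-flow scheduler; the attribute and method names are invented for illustration):

import logging

LOG = logging.getLogger(__name__)

def reload_workflow(scheduler, new_config):
    # Hypothetical sketch: abort the reload on error, keep the original
    # config, and always clear the "reloading" status so the workflow
    # cannot get stuck in that state.
    old_config = scheduler.config      # assumed attribute, for illustration only
    scheduler.status = 'reloading'     # assumed status flag
    try:
        scheduler.apply_config(new_config)  # assumed to raise WorkflowConfigError on a bad graph
    except Exception as exc:
        LOG.error("Reload aborted: %s", exc)
        scheduler.config = old_config  # continue with the original config
    finally:
        scheduler.status = 'running'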

MetRonnie commented

It aborts; however, the workflow status gets stuck as reloading.
