Batch spawn POC #5438

Draft: wants to merge 4 commits into master

Conversation

@hjoliver (Member) commented Mar 29, 2023

Zero-th level workaround for #5437

(We still need to optimize the problematic code, but this approach might be useful in the interim, and possibly in the long run as well).

UPDATE: the main problem (n-window computation) was fixed by #5475. But it may still be worth doing this as well.

When an output gets completed, instead of spawning all children into the task pool at once, record what needs to be spawned, and spawn them batch-wise via the main loop.

This plays well with queuing, because queues work with what they've got, so tasks can be released to run throughout the long spawning period.
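
A rough sketch of the batching idea, for illustration only (the names here, e.g. BatchSpawner, MAX_SPAWN_PER_CYCLE and pool.spawn, are made up and not the actual PR code):

from collections import defaultdict

MAX_SPAWN_PER_CYCLE = 500  # illustrative cap on spawns per main-loop cycle

class BatchSpawner:
    def __init__(self):
        # (parent task, completed output) -> children still to be spawned
        self.tasks_to_spawn = defaultdict(list)

    def record_children(self, itask, output):
        # Called when an output completes: note the children, don't spawn yet.
        self.tasks_to_spawn[(itask, output)].extend(itask.graph_children[output])

    def spawn_children(self, pool):
        # Called once per main-loop cycle: spawn at most one batch, so the
        # queues can keep releasing runnable tasks while spawning catches up.
        budget = MAX_SPAWN_PER_CYCLE
        for key in list(self.tasks_to_spawn):
            itask, output = key
            children = self.tasks_to_spawn[key]
            while children and budget:
                pool.spawn(children.pop(), itask, output)  # hypothetical spawn call
                budget -= 1
            if not children:
                del self.tasks_to_spawn[key]
            if not budget:
                break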

The example from #5437 is quite usable on this branch, although CPU remains high till spawning is complete, and only the GUI table view is workable (and that with filtering for active tasks):

[task parameters]
   m = 0..6999
[scheduling]
   [[queues]]
      [[[default]]]
         limit = 4
   [[graph]]
      R1 = "a => b<m>"
[runtime]
   [[a]]
      script = sleep 10
   [[b<m>]]

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver (Member, Author)

(Tests not happy but I'm pretty sure it'll be something trivial...)

@oliver-sanders (Member) left a comment:

This seems like the best option to me.

Tried it out, the workflow is able to make progress as expected.

I found that client connections were taking longer than expected. Not sure why; a bit of debugging showed that once requests were received they were processed quickly by the workflow, so it seems to be the connection/security step that is slow. Possibly the authenticator thread isn't getting enough CPU?

For the most part connections worked fine, but requests with larger payloads (e.g. cylc dump and cylc tui) don't work without a major increase in --comms-timeout. Assuming we can reduce the number of increment_graph_window calls, this problem should go away and we should be able to raise the maximum number of tasks spawned per main-loop cycle.

@@ -1645,6 +1645,7 @@ async def main_loop(self) -> None:
             # Shutdown workflow if timeouts have occurred
             self.timeout_check()
 
+            self.pool.spawn_children()

I think a lot of the functional tests can be sensitive to the order of main-loop events. Maybe try relocating this to where spawning happened before, near process_queued_task_messages or thereabouts.

children = itask.graph_children[output]
if forced:
    self.tasks_to_spawn_forced[
        (itask, output)

(itask, output, forced) to avoid needing two dicts?

Possibly consider using a deque or list rather than a dict, unless there's a need to perform "in" membership checks.
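
For example, a minimal sketch of that suggestion (all names here are illustrative, not the PR code):

from collections import deque

# One FIFO of (itask, output, forced) triples instead of two dicts.
tasks_to_spawn = deque()

def record(itask, output, forced=False):
    tasks_to_spawn.append((itask, output, forced))

def spawn_batch(pool, limit=500):
    # Drain up to `limit` entries per main-loop cycle.
    for _ in range(min(limit, len(tasks_to_spawn))):
        itask, output, forced = tasks_to_spawn.popleft()
        pool.spawn_children_of(itask, output, forced=forced)  # hypothetical call

Note that membership tests ("(itask, output, forced) in tasks_to_spawn") are O(n) on a deque but O(1) on a dict, hence the proviso above.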

@oliver-sanders (Member)

FYI: I've bumped into a workflow today which would benefit from this change even after the increment_graph_window efficiency improvements.

@hjoliver (Member, Author) commented May 8, 2023

OK, I'll resurrect this and try to finish it off soon.

@hjoliver (Member, Author) commented May 11, 2023

Damn it, I've just realized a problem with this approach. 😡

Sometimes we need to update the prerequisites of already-spawned tasks. On this branch, it's possible for those tasks to be queued for spawning by the main loop and hence not actually spawned yet. So for this to work, we'll need a more complex data structure that records tasks-to-be-spawned along with the prerequisites to update as soon as they are spawned.

[UPDATE] Actually it's more subtle than that. I was already recording children-to-be-spawned against the spawning outputs, and updating that prerequisite at spawn time; the problem is that there can be multiple such outputs, so the mapping really needs to be the other way round (child -> list-of-parent-outputs, where the list can grow over time before spawning).
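
A sketch of that reversed mapping (names and pool methods here are hypothetical, not the eventual implementation):

# child task ID -> (parent, output) pairs recorded so far; the list can keep
# growing until the child is actually spawned
pending_children = {}

def record_child(pool, child_id, itask, output):
    existing = pool.get_task(child_id)  # hypothetical pool lookup
    if existing is not None:
        # Child already spawned: update its prerequisite immediately.
        existing.satisfy_prerequisite(itask, output)  # hypothetical
    else:
        pending_children.setdefault(child_id, []).append((itask, output))

def spawn_batch(pool, limit=500):
    # Main-loop batch spawn: satisfy all recorded parent outputs at spawn time.
    for child_id in list(pending_children)[:limit]:
        child = pool.spawn(child_id)  # hypothetical spawn call
        for itask, output in pending_children.pop(child_id):
            child.satisfy_prerequisite(itask, output)  # hypothetical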

[UPDATE 2] Got it working, but major tidying needed before I push it up...
