Rationalize dummy and simulation modes #1859

hjoliver · 2016-05-25T04:41:55Z

In dummy mode, tasks are supposed to run (real) dummy jobs instead of the real jobs.

In simulation mode, tasks are supposed to not run real jobs at all, just simulate job execution.

The purpose of both of these is really to test-run the suite quickly without running compute-intensive real jobs.

As implemented, dummy mode is not very useful on real suites. Task script items (but not env-script, pre-script, etc.) are replaced with dummy scripting, but everything else is left alone, e.g. job submission and hosting (and users shouldn't be submitting dummy jobs to a remote HPC with resource directives intended for a huge parallel model run...).

Simulation mode, on the other hand, probably shouldn't be assumed to be a reliable test of the system because it has to fake a bunch of important processes associated with job submission and execution, and it doesn't generate the usual job output files or populate the suite DB properly. Even if it was reliable, I can't really think of a use-case for it that wouldn't be covered by a better dummy mode.

Proposal

- remove simulation mode from cylc
- in dummy mode, ignore all existing task config (except inheritance, and message outputs - see "Dummy mode custom output messages" #1420), replacing it with the usual dummy script item

This will result in a dummy mode that works for any real suite: every task will run dummy background jobs on the suite host. (Potentially we could rename it "simulation mode").
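
The transformation proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the task config is modelled as a plain dict, and the key names, the `DUMMY_SCRIPT` value, and the function `to_local_dummy` are all hypothetical, not cylc's actual internals.

```python
import copy

# Assumed placeholder for "the usual dummy script item".
DUMMY_SCRIPT = "sleep 10"

# Hypothetical: the only parts of the real task config that survive.
KEPT_ITEMS = ("inherit", "outputs")


def to_local_dummy(task_cfg):
    """Return a new config that runs a dummy background job on the suite host."""
    # Keep inheritance and message outputs, drop everything else.
    dummy = {k: copy.deepcopy(task_cfg[k]) for k in KEPT_ITEMS if k in task_cfg}
    dummy["script"] = DUMMY_SCRIPT
    dummy["remote"] = {"host": "localhost"}
    dummy["job submission"] = {"method": "background"}
    return dummy


# Made-up example of a "real" task with HPC settings.
real_task = {
    "inherit": ["MODEL"],
    "script": "run-big-model.sh",
    "pre-script": "module load big-model",
    "outputs": {"halfway": "model reached the halfway point"},
    "remote": {"host": "hpc1"},
    "job submission": {"method": "pbs"},
    "directives": {"-l nodes": "512"},
}
dummy_task = to_local_dummy(real_task)
```

Because only an allow-list of items is copied, the HPC directives and the extra scripting items never reach the dummy job, which is what lets this mode run someone else's suite out of the box.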

If a user wants to do more complex things, such as run dummy jobs on the real task hosts, that can be done with a small built-for-purpose test suite, or by editing the real suite.

@cylc/core - do you agree with this?

(motivation: to diagnose this problem #1857 (comment) I had to run a large real suite in dummy mode, and it required extensive surgery to get the result that I'm suggesting should be automatic in dummy mode).

hjoliver · 2016-05-25T04:53:39Z

Regards simulation mode reliability: it currently crashes on reload. But I won't bother fixing that if we can remove the mode instead.

trwhitcomb · 2016-05-25T21:33:48Z

I've actually used both of these and appreciate having the separation there - for suite development, the simulation mode is great as it removes any particulars of the batch system when trying to identify tasks and the relationship between them. Dummy mode is great, especially when moving suites between machines: I like to be able to test that the job submission is set up properly without worrying about the particulars of the scripts, e.g. if not all the data is available yet, etc.

I have a hard time remembering which one is which (simulation vs. dummy) but I like having the hierarchy of "check the tasks" => "check job submission" => "run live" without needing to change the suite (although I admittedly will sometimes drop the processor count request for the dummy mode, but I want to check that it works for real before going live!)

hjoliver · 2016-05-25T23:45:51Z

@trwhitcomb - good to see you're still keeping tabs on us :-)

OK, this raises a couple of questions. Or a question and a statement, anyway:

1. Do you have any problem with the current simulation mode being replaced with the more extreme dummy mode as described above (just dummy background tasks on the suite host)? I think the only difference to users is it will run real dummy jobs locally, and job logs will show up as normal (which is a good thing).
2. I agree it can be useful to have a dummy mode that uses the real job submission config on the real task hosts, but there are several problems in practice:
   - You typically can't run someone else's suite out of the box in this mode (which I often need to do when diagnosing scheduling problems) because of task host login access, batch scheduler job accounting IDs, and the like.
   - I think it would generally be frowned on by HPC systems engineers to submit small dummy jobs with big-job resource requests.
   - We only replace script items with dummy scripting, but users often run non-trivial stuff via the various other scripting items.

Still, you have a point.

Modified proposal:

1. Replace current sim mode with local job dummy mode (as described above)
2. Maybe streamline current dummy mode a bit, e.g. dummy out all script items, and warn users of the consequences, i.e. use of real job submission settings on real task hosts.

(Regarding 2. - can script items, e.g. init-script, ever have content that would be required to run even dummy jobs on the correct task host?)

arjclark · 2016-05-26T08:04:09Z

> I think the only difference to users is it will run real dummy jobs locally, and job logs will show up as normal (which is a good thing).

Not sure why having logs (and generating job scripts) is a good thing. Surely there's no need for logs etc. to be generated from a task that sleeps some number of seconds purely for workflow testing? Similarly, I don't see the benefit of creating a bunch of task processes with all the associated overheads (cylc message etc.) when just testing workflow.

> Regarding 2. - can script items, e.g. init-script, ever have content that would be required to run even dummy jobs on the correct task host?

Depends where you're running I guess; I don't know that it's something we can safely assume isn't needed. There are some weird and convoluted HPC platforms out there...

hjoliver · 2016-05-27T04:35:13Z

@arjclark - the point of a simulation mode IMO is simply to simulate running a suite without the huge overheads and wait-times of real jobs, not just to test the workflow (testing the scheduling is the most important aspect, of course, but it can be useful for other things too, e.g. to quickly populate a suite DB just as a real run would).

Simulation by means of dummy jobs achieves this goal with zero complexity and maintenance cost to us - simply dummy out script items and we're done.

By contrast, the current simulation mode is quite complicated (and currently somewhat broken!) because we have to fake out job submission, job execution, task communications, and everything associated with those processes, and then ensure it doesn't break in future despite all those differences from live mode. Why should we bother with all that just to avoid the small overhead of running dummy jobs on the suite host? (if that's a problem you certainly won't be able to run the real suite!)

hjoliver · 2016-05-27T05:06:17Z

To be as clear as mud, I'll put my complete "modified proposal" down in one place:

modified proposal

1. drop current simulation mode - it is somewhat complicated and aside from lack of job logs it doesn't give users anything different than a much simpler local-job dummy mode
2. change the default dummy mode to run dummy jobs on localhost - this will work out of the box for all suites, even those you can't run in live mode (e.g. for lack of a login account on the task hosts).
3. add a secondary non-default dummy mode that runs its dummy jobs by the real job submission on the real hosts (but warn users of the consequences of this).

[3. is the current dummy mode, but we should ensure it really does only run dummy jobs by disabling all xxx-script items, not just the main one as now]

trwhitcomb · 2016-05-30T17:01:08Z

I think I'm OK with this. It simplifies the code, and keeps the ability to test the workflow and the job submission. For the list that @hjoliver gave, (2) and (3) may even be able to be controlled via an option switch rather than a separate mode (so there's just live mode and dummy mode, and dummy mode can use or ignore the job submission settings).
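
As a rough sketch of what such an option switch could look like, assuming a task config held in a plain dict: the item lists, the `use_real_submission` flag, and the function `dummy_config` are hypothetical names for illustration, not an existing cylc interface.

```python
import copy

# Hypothetical item names, loosely following the script items discussed above.
SCRIPT_ITEMS = ("init-script", "env-script", "pre-script", "script", "post-script")
SUBMISSION_ITEMS = ("remote", "job submission", "directives")
DUMMY_SCRIPT = "sleep 10"  # assumed placeholder dummy scripting


def dummy_config(task_cfg, use_real_submission=False):
    """Dummy out all script items; optionally keep real job submission settings."""
    cfg = copy.deepcopy(task_cfg)
    # Disable every xxx-script item, not just the main one.
    for item in SCRIPT_ITEMS:
        cfg.pop(item, None)
    cfg["script"] = DUMMY_SCRIPT
    if not use_real_submission:
        # Default: background jobs on the suite host.
        for item in SUBMISSION_ITEMS:
            cfg.pop(item, None)
        cfg["remote"] = {"host": "localhost"}
        cfg["job submission"] = {"method": "background"}
    return cfg


# Made-up example task.
real_task = {
    "script": "run-big-model.sh",
    "pre-script": "module load big-model",
    "remote": {"host": "hpc1"},
    "job submission": {"method": "pbs"},
    "directives": {"-l nodes": "512"},
}
local_dummy = dummy_config(real_task)
remote_dummy = dummy_config(real_task, use_real_submission=True)
```

With the switch off, the dummy job ignores the submission settings; with it on, the same dummy script goes through the real submission path, so any warning about big-job resource requests would belong on that branch. Whether init-script can always be dropped on real hosts is the open question raised earlier in the thread.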

hjoliver · 2016-05-30T22:24:42Z

@trwhitcomb - that's good, just gotta convince @arjclark now 😁 . I like your dummy mode option switch idea.

arjclark · 2016-05-31T11:27:21Z

@hjoliver - maybe one to discuss in our "issue prioritisation" conversation when we see you?

hjoliver · 2016-05-31T22:01:43Z

@arjclark - sure.

BTW, I'm not quite as determined to dump simulation mode as the verbosity of this ticket might suggest. I just don't like the way it is currently implemented, and I think the marginal benefit (if any) over dummy mode does not justify maintaining it as is. However, another option might be a better-designed simulation mode modelled (in terms of simplicity) on the dummy mode:

alternative proposal

1. improve dummy mode as described above
2. re-implement simulation mode more cleanly by simply swapping out the real tasks (perhaps in the guts of the multi-process pool) with simulated tasks that send started and succeeded messages and sleep a bit in between. (Note this would still write job scripts, but not job output logs, to the job log dir.)

We should probably also check that anything related to real job execution - timeouts, retries, event handlers - is disabled in simulation (and dummy?) mode. (may already be OK).
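
A minimal sketch of the simulated-task idea, assuming the scheduler hands the task some callable for receiving messages: `run_simulated_task`, `send_message`, and `run_seconds` are hypothetical names, and the real multi-process pool is not modelled here.

```python
import time


def run_simulated_task(task_id, send_message, run_seconds=1.0, sleep=time.sleep):
    """Pretend to run a job: report started, wait a bit, report succeeded."""
    send_message(task_id, "started")
    sleep(run_seconds)
    send_message(task_id, "succeeded")


# Usage: collect the messages in a list instead of a real scheduler.
received = []
run_simulated_task(
    "foo.1",  # arbitrary example task ID
    lambda task_id, msg: received.append((task_id, msg)),
    run_seconds=0.01,
)
```

Passing `sleep` as a parameter is just a convenience so the wait can be shortened or replaced when testing.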

trwhitcomb · 2016-06-08T15:42:27Z

One issue that I've run into is that the simulation mode/dummy mode doesn't handle it well when you have tasks that use message triggers. I suppose you could handle this by explicitly specifying a dummy mode script, but as long as we're discussing switching how these modes work, since you need to specify the outputs from a task in the suite.rc file, it would be nice to actually emit those from the simulated task (in addition to any sleep commands) so that things dependent on the messages don't stall out.
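
To illustrate the stall, here is a small sketch under stated assumptions: task outputs are modelled as a dict of label to message text, and `simulated_messages` and `unsatisfied_triggers` are hypothetical helpers, not cylc functions.

```python
def simulated_messages(outputs):
    """Messages a simulated task would send, in order."""
    return ["started"] + list(outputs.values()) + ["succeeded"]


def unsatisfied_triggers(required_messages, sent_messages):
    """Return the messages a downstream task is still waiting for."""
    sent = set(sent_messages)
    return [m for m in required_messages if m not in sent]


# Made-up example: a downstream task triggers off a custom message.
outputs = {"halfway": "model reached the halfway point"}
downstream_needs = ["model reached the halfway point"]

without_outputs = simulated_messages({})      # custom outputs ignored
with_outputs = simulated_messages(outputs)    # custom outputs emitted

stalled = unsatisfied_triggers(downstream_needs, without_outputs)
ready = unsatisfied_triggers(downstream_needs, with_outputs)
```

Here `stalled` still lists the halfway message, while `ready` is empty once the simulated task emits its registered outputs.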

hjoliver · 2016-06-08T21:10:26Z

@trwhitcomb - that's #1420 - easy enough to fix, and yes we might as well do both at once.

hjoliver · 2016-06-21T19:58:42Z

[meeting]

- improve dummy mode as above
- simplify sim mode as above if possible ("sim-out" task jobs, but still generate job files etc.?)

hjoliver · 2016-06-24T11:03:14Z

(small, later: may need to do dummy and simulation separately - dummy is definitely small)

matthewrmshin added this to the soon milestone May 25, 2016

hjoliver self-assigned this Jun 21, 2016

hjoliver added the small label Jun 24, 2016

hjoliver modified the milestones: later, soon Jun 24, 2016

hjoliver mentioned this issue Mar 29, 2017

simulation modes: proportional run length and more #2220

Merged

matthewrmshin modified the milestones: next release, later Apr 6, 2017

oliver-sanders closed this as completed in #2220 Apr 6, 2017
