Rationalize dummy and simulation modes #1859

Closed
hjoliver opened this issue May 25, 2016 · 14 comments

@hjoliver
Member

hjoliver commented May 25, 2016

In dummy mode, tasks are supposed to run (real) dummy jobs instead of the real jobs.

In simulation mode, tasks are supposed to not run real jobs at all, just simulate job execution.

The purpose of both of these is really to test-run the suite quickly without running compute-intensive real jobs.

As implemented, dummy mode is not very useful on real suites. Task script items (but not env-script, pre-script, etc.) are replaced with dummy scripting, but everything else is left alone, including job submission and host settings, for example (and users shouldn't be submitting dummy jobs to a remote HPC with resource directives intended for a huge parallel model run...).

Simulation mode, on the other hand, probably shouldn't be assumed to be a reliable test of the system because it has to fake a bunch of important processes associated with job submission and execution, and it doesn't generate the usual job output files or populate the suite DB properly. Even if it were reliable, I can't really think of a use case for it that wouldn't be covered by a better dummy mode.

Proposal

  • remove simulation mode from cylc
  • in dummy mode, ignore all existing task config (except inheritance and message outputs; see #1420, "Dummy mode custom output messages"), replacing it with the usual dummy script item

This will result in a dummy mode that works for any real suite: every task will run dummy background jobs on the suite host. (Potentially we could rename it "simulation mode").
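
As a rough illustration of the second bullet, here is a minimal Python sketch of reducing one task's config to a local background dummy job. The dict keys, the `to_local_dummy` helper and the `sleep 10` dummy script are all hypothetical stand-ins chosen for this example; they are not cylc's real config schema or API.

```python
# Hypothetical sketch: key names are illustrative, not cylc's real schema.
KEEP_KEYS = ("inherit", "outputs")  # the only items the proposal preserves
DUMMY_SCRIPT = "sleep 10"  # placeholder for the usual dummy script item


def to_local_dummy(task_cfg):
    """Return a new config that runs a dummy background job on the suite host."""
    # Keep inheritance and message outputs, drop everything else.
    dummy = {key: task_cfg[key] for key in KEEP_KEYS if key in task_cfg}
    dummy["script"] = DUMMY_SCRIPT
    dummy["host"] = "localhost"  # run on the suite host
    dummy["batch system"] = "background"  # no real job submission
    return dummy


real_task = {
    "inherit": "MODEL",
    "script": "run-big-model.sh",
    "pre-script": "module load model-env",
    "host": "hpc.example.org",
    "batch system": "pbs",
    "directives": {"-l nodes": "512"},
    "outputs": {"halfway": "model reached half way"},
}
dummy_task = to_local_dummy(real_task)
```

The remote host, batch system and resource directives are all gone from `dummy_task`, which is the property that would let the suite run without a login account on the task hosts.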

If a user wants to do more complex things, such as run dummy jobs on the real task hosts, that can be done with a small built-for-purpose test suite, or by editing the real suite.

@cylc/core - do you agree with this?

(motivation: to diagnose this problem #1857 (comment) I had to run a large real suite in dummy mode, and it required extensive surgery to get the result that I'm suggesting should be automatic in dummy mode).

@hjoliver
Member Author

Regarding simulation mode reliability: it currently crashes on reload. But I won't bother fixing that if we can remove the mode instead.

@matthewrmshin added this to the soon milestone May 25, 2016

@trwhitcomb
Collaborator

trwhitcomb commented May 25, 2016

I've actually used both of these and appreciate having the separation there - for suite development, the simulation mode is great as it removes any particulars of the batch system when trying to identify tasks and the relationship between them. Dummy mode is great, especially when moving suites between machines: I like to be able to test that the job submission is set up properly without worrying about the particulars of the scripts, e.g. if not all the data is available yet, etc.

I have a hard time remembering which one is which (simulation vs. dummy) but I like having the hierarchy of "check the tasks" => "check job submission" => "run live" without needing to change the suite (although I admittedly will sometimes drop the processor count request for the dummy mode, but I want to check that it works for real before going live!)

@hjoliver
Member Author

hjoliver commented May 25, 2016

@trwhitcomb - good to see you're still keeping tabs on us :-)

OK, this raises a couple of questions. Or a question and a statement, anyway:

  1. Do you have any problem with the current simulation mode being replaced with the more extreme dummy mode as described above (just dummy background tasks on the suite host)? I think the only difference to users is it will run real dummy jobs locally, and job logs will show up as normal (which is a good thing).
  2. I agree it can be useful to have a dummy mode that uses the real job submission config on the real task hosts, but there are several problems in practice:
    • You typically can't run someone else's suite out of the box in this mode (which I often need to do when diagnosing scheduling problems) because of task host login access, batch scheduler job accounting IDs, and the like.
    • I think it would generally be frowned on by HPC systems engineers to submit small dummy jobs with big-job resource requests.
    • We only replace script items with dummy scripting, but users often run non-trivial stuff via the various other scripting items.

Still, you have a point.

Modified proposal:

  1. Replace current sim mode with local job dummy mode (as described above)
  2. Maybe streamline current dummy mode a bit, e.g. dummy out all script items, and warn users of the consequences, i.e. use of real job submission settings on real task hosts.

(Regarding 2. - can script items, e.g. init-script, ever have content that would be required to run even dummy jobs on the correct task host?)

@arjclark
Contributor

> I think the only difference to users is it will run real dummy jobs locally, and job logs will show up as normal (which is a good thing).

Not sure why having logs (and generating job scripts) is a good thing. Surely there's no need for logs etc. to be generated from a task that sleeps some number of seconds purely for workflow testing? Similarly, I don't see the benefit of creating a bunch of task processes with all the associated overheads (cylc message etc.) when just testing workflow.

> Regarding 2. - can script items, e.g. init-script, ever have content that would be required to run even dummy jobs on the correct task host?

Depends where you're running, I guess. I don't know that it's something we can safely assume isn't needed. There are some weird and convoluted HPC platforms out there...

@hjoliver
Member Author

hjoliver commented May 27, 2016

@arjclark - the point of a simulation mode IMO is simply to simulate running a suite without the huge overheads and wait-times of real jobs, not just to test the workflow (testing the scheduling is the most important aspect, of course, but it can be useful for other things too, e.g. to quickly populate a suite DB just as a real run would).

Simulation by means of dummy jobs achieves this goal with zero complexity and maintenance cost to us - simply dummy out script items and we're done.

By contrast, the current simulation mode is quite complicated (and currently somewhat broken!) because we have to fake out job submission, job execution, task communications, and everything associated with those processes, and then ensure it doesn't break in future despite all those differences from live mode. Why should we bother with all that just to avoid the small overhead of running dummy jobs on the suite host? (if that's a problem you certainly won't be able to run the real suite!)

@hjoliver
Member Author

hjoliver commented May 27, 2016

To be as clear as mud, I'll put my complete "modified proposal" down in one place:

modified proposal

  1. drop current simulation mode - it is somewhat complicated and aside from lack of job logs it doesn't give users anything different than a much simpler local-job dummy mode

  2. change the default dummy mode to run dummy jobs on localhost - this will work out of the box for all suites, even those you can't run in live mode (e.g. for lack of a login account on the task hosts).

  3. add a secondary non-default dummy mode that runs its dummy jobs by the real job submission on the real hosts (but warn users of the consequences of this).

    [3. is the current dummy mode, but we should ensure it really does only run dummy jobs by disabling all xxx-script items, not just the main one as now]

@trwhitcomb
Collaborator

I think I'm OK with this. It simplifies the code, and keeps the ability to test the workflow and the job submission. For the list that @hjoliver gave, (2) and (3) may even be able to be controlled via an option switch rather than a separate mode (so there's just live mode and dummy mode, and dummy mode can use or ignore the job submission settings).
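
To make the option-switch idea concrete, here is a minimal Python sketch in which a single dummy mode either ignores or keeps the real job submission settings. The `dummy_out` function, its `use_real_submission` flag and the dict keys are hypothetical names invented for this example, not cylc's schema or command line options. In both branches every script item is disabled, not just the main one.

```python
# Hypothetical sketch: names are illustrative, not cylc's schema or CLI.
SCRIPT_ITEMS = ("init-script", "env-script", "pre-script", "script", "post-script")
SUBMISSION_ITEMS = ("host", "batch system", "directives")
DUMMY_SCRIPT = "sleep 10"  # placeholder dummy script


def dummy_out(task_cfg, use_real_submission=False):
    """Dummy out one task, optionally keeping its real job submission settings."""
    dummy = dict(task_cfg)
    # Disable every script item, then install the dummy script.
    for item in SCRIPT_ITEMS:
        dummy.pop(item, None)
    dummy["script"] = DUMMY_SCRIPT
    if not use_real_submission:
        # Default: ignore submission settings and run locally in the background.
        for item in SUBMISSION_ITEMS:
            dummy.pop(item, None)
        dummy["host"] = "localhost"
        dummy["batch system"] = "background"
    return dummy


task = {
    "script": "run-big-model.sh",
    "pre-script": "module load model-env",
    "host": "hpc.example.org",
    "batch system": "pbs",
    "directives": {"-l nodes": "512"},
}
local_dummy = dummy_out(task)
remote_dummy = dummy_out(task, use_real_submission=True)
```

With the flag off the task becomes a local background job; with it on, the host, batch system and directives survive, so a warning about submitting tiny jobs with big-job resource requests would belong on that branch.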

@hjoliver
Member Author

@trwhitcomb - that's good, just gotta convince @arjclark now 😁 . I like your dummy mode option switch idea.

@arjclark
Contributor

@hjoliver - maybe one to discuss in our "issue prioritisation" conversation when we see you?

@hjoliver
Member Author

hjoliver commented May 31, 2016

@arjclark - sure.

BTW, I'm not quite as determined to dump simulation mode as the verbosity of this ticket might suggest. I just don't like the way it is currently implemented, and I think the marginal benefit (if any) over dummy mode does not justify maintaining it as is. However, another option might be a better-designed simulation mode modelled (in terms of simplicity) on the dummy mode:

alternative proposal

  1. improve dummy mode as described above
  2. re-implement simulation mode more cleanly by simply swapping out the real tasks (perhaps in the guts of the multi-process pool) with simulated tasks that send started and succeeded messages and sleep a bit in between. (Note this would still write job scripts, but not job output logs, to the job log dir.)
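
A minimal Python sketch of what such a simulated task could look like, assuming the scheduler hands it a callback for task messages. The `run_simulated_task` function and the `send_message` callback are hypothetical; nothing here reflects how cylc's process pool is actually wired.

```python
import time


def run_simulated_task(task_id, send_message, run_time=1.0, sleep=time.sleep):
    """Stand in for a real job: report started, wait a bit, report succeeded."""
    send_message(task_id, "started")
    sleep(run_time)  # pretend the job is running
    send_message(task_id, "succeeded")


# Usage: collect the messages in a list instead of sending them anywhere.
received = []
run_simulated_task(
    "model.1",
    lambda task_id, message: received.append((task_id, message)),
    run_time=0.01,
)
```

Passing `sleep` in as a parameter is only a convenience so the wait can be shortened or stubbed out in tests.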

We should probably also check that anything related to real job execution (timeouts, retries, event handlers) is disabled in simulation (and dummy?) mode. This may already be OK.

@trwhitcomb
Collaborator

One issue that I've run into is that simulation mode and dummy mode don't cope well with tasks that use message triggers. I suppose you could handle this by explicitly specifying a dummy mode script. But as long as we're discussing changing how these modes work: since you already have to specify a task's outputs in the suite.rc file, it would be nice for the simulated task to actually emit those messages (in addition to any sleep commands) so that things dependent on the messages don't stall out.
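
A minimal Python sketch of that idea: build the dummy script text from the outputs a task declares, so that each custom message is emitted after the sleep. The `dummy_script` helper is hypothetical, and the `cylc message` default for `message_cmd` is an assumption about how a job reports a message; substitute whatever command the installed cylc version provides.

```python
import shlex


def dummy_script(outputs, sleep_seconds=10, message_cmd="cylc message"):
    """Build dummy job scripting that sleeps, then emits each declared output."""
    lines = ["sleep {0}".format(sleep_seconds)]
    for message in outputs.values():
        # Quote the message so spaces and shell metacharacters stay literal.
        lines.append("{0} {1}".format(message_cmd, shlex.quote(message)))
    return "\n".join(lines)


# Usage: outputs as they might be declared for one task (labels are made up).
script_text = dummy_script(
    {"halfway": "model reached half way", "done": "products ready"},
    sleep_seconds=5,
)
```

Tasks that trigger off these messages would then see them arrive in a dummy run just as in a live one.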

@hjoliver
Member Author

hjoliver commented Jun 8, 2016

@trwhitcomb - that's #1420 - easy enough to fix, and yes we might as well do both at once.

@hjoliver self-assigned this Jun 21, 2016

@hjoliver
Member Author

[meeting]

  • improve dummy mode as above
  • simplify sim mode as above if possible ("sim-out" task jobs, but still generate job files etc.?)

@hjoliver added the small label Jun 24, 2016
@hjoliver modified the milestones: later, soon Jun 24, 2016

@hjoliver
Member Author

(small, later: may need to do dummy and simulation separately - dummy is definitely small)
