Support symlinked alternate run directories (cylc 8). #2935

Open: wants to merge 3 commits into master

@hjoliver
Contributor

commented Jan 29, 2019

These changes close #2797

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (functional).
  • Includes an appropriate entry in the release change log CHANGES.md.
  • I have opened a corresponding documentation PR at cylc/cylc-doc#XXXX.

Supersedes #2817. (On reflection, my initial take on this was unnecessarily focused, for historical reasons, on the idea of using an absolute suite path instead of a registered suite name to distinguish this from a normal run. In fact, registration is an entirely orthogonal concept.) The basic requirement (or desired feature) is:

Better support for quick-running sub-suites (of jobs that run on a single node?) by allowing sub-suite run directories to be located on fast local disk or RAM disk.

Why? (1) Too many small files can overwhelm HPC shared filesystems such as Lustre; and (2) top-level suite run directories proliferate rapidly if you use sub-suites in a cycling suite.

This trivial PR just adds a new registration option, --run-dir=DIR, which makes the standard suite run directory under cylc-run/ a symlink to DIR/SUITE, so all suite job logs get written under DIR.

$ cylc reg --run-dir=/tmp cat/dog suites/foo
REGISTERED cat/dog -> /home/oliverh/suites/foo

$ ls -l cylc-run/cat/
lrwxrwxrwx 1 oliverh oliverh 12 Jan 29 14:50 dog -> /tmp/cat/dog/

$ ls -a /tmp/cat/dog/
./  ../  .service/

Everything works as normal (including cylc scan and cylc review) but the log files are on (e.g.) /tmp.
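The registration step shown above boils down to creating a directory under DIR and pointing the standard cylc-run path at it. The following is a minimal Python sketch under that assumption; `make_symlinked_run_dir` is a hypothetical helper, not the code in this PR, and the `.service` subdirectory is created only to mirror the `ls -a` output above. Because tools address the run directory through the standard path and the OS follows the symlink transparently, this is presumably why `cylc scan` and `cylc review` keep working unchanged.

```python
import os
import tempfile
from pathlib import Path


def make_symlinked_run_dir(cylc_run: Path, suite_name: str, alt_root: Path) -> Path:
    """Hypothetical sketch: create ALT_ROOT/SUITE and symlink CYLC_RUN/SUITE to it."""
    link = cylc_run / suite_name      # standard location, e.g. ~/cylc-run/cat/dog
    target = alt_root / suite_name    # relocated dir, e.g. /tmp/cat/dog
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"run directory already exists: {link}")
    (target / ".service").mkdir(parents=True, exist_ok=True)  # mirrors `ls -a` above
    link.parent.mkdir(parents=True, exist_ok=True)           # e.g. ~/cylc-run/cat/
    link.symlink_to(target, target_is_directory=True)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as home, tempfile.TemporaryDirectory() as fast:
        link = make_symlinked_run_dir(Path(home) / "cylc-run", "cat/dog", Path(fast))
        (link / "log").mkdir()  # written via the standard path...
        # ...but it physically lands on the alternate disk.
        print((Path(fast) / "cat" / "dog" / "log").is_dir())  # True
```

Refusing to overwrite an existing run directory is a design choice in this sketch; the real registration command may behave differently.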

We still have (for the moment; see below) proliferation of sub-suite run directories, but at least they're not on the shared FS.

@hjoliver hjoliver self-assigned this Jan 29, 2019

@hjoliver

Contributor Author

commented Jan 29, 2019

@TomekTrzeciak - and others - what do you think?

If this does the trick in principle, I can enhance it further before merge, e.g.:

  • when the suite shuts down, automatically concatenate its job.out/errs to a single summary log, or tar them up.
  • then delete the suite run dir and its standard-location symlink, if configured to copy the result to another (main suite, presumably) location
  • add the same option to cylc run, so that explicit registration can be skipped.
  • check that all job hosts in the suite do actually see the alternate run-dir location.
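
The first two enhancements above (summarise or tar up the job logs at shutdown, then remove the relocated run dir and its symlink) could look roughly like the following. It is a sketch only: both function names are hypothetical, and it assumes job logs live under `log/job/CYCLE/TASK/NN/job.out` and `job.err` inside the run directory.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def summarise_job_logs(run_dir: Path, summary: Path) -> int:
    """Concatenate every job.out/job.err under log/job/ into one summary file.
    Returns the number of files concatenated."""
    files = sorted(p for p in (run_dir / "log" / "job").rglob("*")
                   if p.name in ("job.out", "job.err"))
    with summary.open("w") as out:
        for p in files:
            out.write(f"==> {p.relative_to(run_dir)} <==\n")
            out.write(p.read_text())
    return len(files)


def archive_and_remove(link: Path, dest_dir: Path) -> Path:
    """Tar+gzip the job logs of a symlinked run dir into dest_dir, then delete
    both the real run dir and the standard-location symlink."""
    real = link.resolve()
    tarball = dest_dir / f"{link.name}-job-logs.tar.gz"
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(real / "log" / "job", arcname=f"{link.name}/log/job")
    shutil.rmtree(real)  # the run directory on fast local disk
    link.unlink()        # the (now dangling) symlink under cylc-run/
    return tarball


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        real = tmp / "fast" / "cat" / "dog"
        job = real / "log" / "job" / "1" / "foo" / "01"
        job.mkdir(parents=True)
        (job / "job.out").write_text("hello\n")
        (job / "job.err").write_text("")
        link = tmp / "cylc-run" / "cat" / "dog"
        link.parent.mkdir(parents=True)
        link.symlink_to(real, target_is_directory=True)
        main_logs = tmp / "main-suite-logs"  # stand-in for the main suite's log dir
        main_logs.mkdir()
        print(summarise_job_logs(link, main_logs / "dog-summary.log"))  # 2
        print(archive_and_remove(link, main_logs).name)  # dog-job-logs.tar.gz
```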
@hjoliver

Contributor Author

commented Jan 29, 2019

Note this is reminiscent of rose suite-run root-dir symlinks, but: (a) the purpose is different; (b) you would (or could) run sub-suites with cylc run rather than rose suite-run; and (c) rose suite-run will be migrated to Cylc soon anyway, at which point we can easily make the two functionalities consistent.

@hjoliver hjoliver added this to the soon milestone Jan 29, 2019

@hjoliver

Contributor Author

commented Jan 29, 2019

(BTW I've assumed that we want sub-suite share and work dirs on local disk too, with main-suite share path passed in for final sub-suite outputs, but maybe that needs to be configurable?)

@TomekTrzeciak

Contributor

commented Feb 1, 2019

> @TomekTrzeciak - and others - what do you think?

@hjoliver, this looks really good to me 👍. I think this is both simpler and more explicit than the previous approach.

> If this does the trick in principle, I can enhance further before merge, e.g.:

>   • when the sub-suite shuts down, automatically tar and gzip its job logs, copy the tarball to the main-suite job log dir, then delete the sub-suite run dir and its standard-location symlink.
>   • (or concatenate all sub-suite job.out/errs to a single summary log, and copy that back)

I would probably hold off on handling the logs until the interaction with cylc review is figured out, as these things might be quite interdependent.

>   • add the same option to cylc run, so that explicit registration can be skipped.

Yes. I also wonder if it would be useful to add a new cylc subrun command in the future. This would be just like the cylc run but use more appropriate defaults (--no-detach --auto-shutdown), like I suggested at the end of this comment.

>   • check that all job hosts in the sub-suite do actually see the alternate run-dir location.

Isn't that part of remote invocation (or is that feature specific to rose suite-run)?

> (BTW I've assumed that we want sub-suite share and work dirs on local disk too, with main-suite share path passed in for final sub-suite outputs, but maybe that needs to be configurable?)

My gut feeling is that there will be a need for flexibility here. Another possible choice is to make share global by default (i.e., inherited from the master suite) and keep only work private. At the moment it is rose suite-run that manages share and work redirection, so this could be left undecided until that functionality is migrated to cylc.
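
The two placement options discussed here (share and work both private on local disk, versus share inherited from the main suite with only work private) can be expressed as a small path-resolution policy. This is a hypothetical sketch: the names `SubSuiteDirs` and `resolve_sub_suite_dirs`, the `share`/`work` layout under the run directory, and the example path `/scratch/main/share` are all assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class SubSuiteDirs:
    run: PurePosixPath
    share: PurePosixPath
    work: PurePosixPath


def resolve_sub_suite_dirs(alt_root: PurePosixPath, suite_name: str,
                           main_share: Optional[PurePosixPath] = None) -> SubSuiteDirs:
    run = alt_root / suite_name
    # "Global" share: reuse the main suite's share dir; otherwise keep it private.
    share = main_share if main_share is not None else run / "share"
    # work stays private (on fast local disk) in both options.
    return SubSuiteDirs(run=run, share=share, work=run / "work")


private = resolve_sub_suite_dirs(PurePosixPath("/tmp"), "cat/dog")
inherited = resolve_sub_suite_dirs(PurePosixPath("/tmp"), "cat/dog",
                                   main_share=PurePosixPath("/scratch/main/share"))
print(private.share)    # /tmp/cat/dog/share
print(inherited.share)  # /scratch/main/share
print(inherited.work)   # /tmp/cat/dog/work
```

Making the choice a single optional argument keeps the default (everything local) safe, while letting a main suite opt in to sharing its share dir.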

@hjoliver

Contributor Author

commented Feb 4, 2019

@TomekTrzeciak -

> Yes. I also wonder if it would be useful to add a new cylc subrun command in the future. This would be just like the cylc run but use more appropriate defaults (--no-detach --auto-shutdown), like I suggested at the end of this comment.

Yeah, I saw your comment but didn't make --no-detach etc. the default because this log dir relocation might sometimes be used for "main" suites, not just sub-suites.

cylc subrun is another option, but I'd rather avoid that as there are too many cylc commands already, and we're hoping to rethink and rationalize the command set (particularly for start and restart) before too long. So for the moment, I think I'll just document clearly that sub-suites should be non-detaching.

@TomekTrzeciak

Contributor

commented Feb 7, 2019

> Yeah, I saw your comment but didn't make --no-detach etc. the default because this log dir relocation might sometimes be used for "main" suites, not just sub-suites.

Yes, I agree this shouldn't be tied to the --run-dir=DIR option; it's a different concern.

> cylc subrun is another option, but I'd rather avoid that as there are too many cylc commands already, and we're hoping to rethink and rationalize the command set (particularly for start and restart) before too long. So for the moment, I think I'll just document clearly that sub-suites should be non-detaching.

Absolutely happy with this. Passing some extra options to cylc run for sub-suite runs is perfectly OK for now.
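
Passing extra options to cylc run for sub-suites, from a task in the main suite, might be wrapped as follows. This is a hedged sketch: `sub_suite_run_cmd` is hypothetical, the `--no-detach --auto-shutdown` defaults are simply the ones suggested earlier in this thread (check them against `cylc run --help` for your Cylc version), and `--run-dir` on `cylc run` was only proposed in this PR, not implemented.

```python
import shlex
from typing import List, Optional

# Sub-suite defaults suggested in this discussion; flag availability per
# Cylc version is an assumption to verify.
SUB_SUITE_DEFAULTS = ("--no-detach", "--auto-shutdown")


def sub_suite_run_cmd(suite: str, run_dir: Optional[str] = None,
                      extra: Optional[List[str]] = None) -> List[str]:
    """Build (but do not execute) a `cylc run` command line for a sub-suite."""
    cmd = ["cylc", "run", *SUB_SUITE_DEFAULTS]
    if run_dir is not None:
        # Proposed here as a `cylc run` option; hypothetical at this point.
        cmd.append(f"--run-dir={run_dir}")
    cmd.extend(extra or [])
    cmd.append(suite)  # suite name goes last, after all options
    return cmd


print(shlex.join(sub_suite_run_cmd("cat/dog", run_dir="/tmp")))
# cylc run --no-detach --auto-shutdown --run-dir=/tmp cat/dog
```

A task script could then launch it with `subprocess.run(cmd, check=True)`; keeping the sub-suite in the foreground means the parent task succeeds or fails along with the sub-suite.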

@hjoliver hjoliver force-pushed the hjoliver:alt-run-dir-take-2 branch from 423de83 to a22cd9c Jun 25, 2019

@hjoliver

Contributor Author

commented Jun 25, 2019

Rebased and deconflicted.

@hjoliver hjoliver added the WIP label Jun 25, 2019

@hjoliver hjoliver changed the title from "WIP: support alternate cylc-run locations." to "Support alternate cylc-run locations." Jun 25, 2019

@matthewrmshin

Member

commented Jun 25, 2019

Lacking tests. LGTM otherwise.

@hjoliver hjoliver force-pushed the hjoliver:alt-run-dir-take-2 branch from a22cd9c to 862b59c Jun 30, 2019

@hjoliver hjoliver referenced this pull request Jun 30, 2019

@hjoliver hjoliver changed the title from "Support alternate cylc-run locations." to "Support symlinked alternate run directories (cylc 8)." Jun 30, 2019

@hjoliver hjoliver force-pushed the hjoliver:alt-run-dir-take-2 branch from 862b59c to 8b20137 Jun 30, 2019

@cylc cylc deleted a comment from codacy-bot Jun 30, 2019

@hjoliver hjoliver closed this Jun 30, 2019

@hjoliver hjoliver deleted the hjoliver:alt-run-dir-take-2 branch Jun 30, 2019

@hjoliver hjoliver restored the hjoliver:alt-run-dir-take-2 branch Jun 30, 2019

@hjoliver hjoliver reopened this Jun 30, 2019

@hjoliver

Contributor Author

commented Jun 30, 2019

(oops, accidental closure reverted!)

@hjoliver

Contributor Author

commented Jul 8, 2019

(this is just waiting on documentation; might get it done today if I'm lucky)

@matthewrmshin matthewrmshin removed this from the soon milestone Jul 25, 2019

@matthewrmshin matthewrmshin modified the milestones: cylc-8.0a1, cylc-8.0a2 Jul 25, 2019

3 participants