Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support symlinked alternate run directories (cylc 8). #2935

Merged
merged 8 commits into from
Aug 30, 2019

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Jan 29, 2019

These changes close #2797

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (functional).
  • Includes an appropriate entry in the release change log CHANGES.md.
  • I have opened a corresponding documentation PR at Document alternate run diretories. cylc-doc#77.

Supersedes #2817 (On reflection my initial take on this was unnecessarily focused - for historical reasons - on the idea of using an absolute suite path (instead of a registered suite name) to distinguish this from a normal run. In fact, registration is really an entirely orthogonal concept. The basic requirement (/desired feature) is:

Better support for quick-running sub-suites (of jobs that run on a single node?) by allowing sub-suite run directories to be located on fast local disk or RAM disk.

Why? 1) too many small files kills HPC shared filesystems like Lustre; and 2) top-level suite run directories rapidly proliferate if you use sub-suites in a cycling suite.

This trivial PR just adds a new registration option --run-dir=DIR that symlinks the standard suite run directory to an alternative location, so all suite job logs get written there.

$ cylc reg --run-dir=/tmp cat/dog suites/foo
REGISTERED cat/dog -> /home/oliverh/suites/foo

$ ls -l cylc-run/cat/
lrwxrwxrwx 1 oliverh oliverh 12 Jan 29 14:50 dog -> /tmp/cat/dog/

$ ls -a /tmp/cat/dog/
./  ../  .service/

Everything works as normal (including cylc scan and cylc review) but the log files are on (e.g.) /tmp.

We still have (for the moment; see below) proliferation of sub-suite run directories, but at least they're not on the shared FS.

@hjoliver hjoliver self-assigned this Jan 29, 2019
@hjoliver
Copy link
Member Author

hjoliver commented Jan 29, 2019

@TomekTrzeciak - and others - what do you think?

If this does the trick in principle, I can enhance it further before merge, e.g.:

  • when the suite shuts down, automatically concatenate its job.out/errs to a single summary log, or tar them up.
  • then delete the suite run dir and its standard-location symlink, if configured to copy the result to another (main suite, presumably) location
  • add the same option to cylc run, so that explicit registration can be skipped.
  • check that all job hosts in the suite do actually see the alternate run-dir location.

@hjoliver
Copy link
Member Author

hjoliver commented Jan 29, 2019

Note this is reminiscent of rose suite-run root-dir symlinks, but: (a) the purpose is different; (b) you would (or could) run sub-suites with cylc run rather than rose suite-run; and (c) rose suite-run will be migrated to Cylc soon anyway ... we can easily make the two functionalities consistent.

@hjoliver hjoliver added this to the soon milestone Jan 29, 2019
@hjoliver
Copy link
Member Author

hjoliver commented Jan 29, 2019

(BTW I've assumed that we want sub-suite share and work dirs on local disk too, with main-suite share path passed in for final sub-suite outputs, but maybe that needs to be configurable?)

@TomekTrzeciak
Copy link
Contributor

@TomekTrzeciak - and others - what do you think?

@hjoliver, this looks really good to me 👍. I think this is both: simpler and more explicit than the previous approach.

If this does the trick in principle, I can enhance further before merge, e.g.:

  • when the sub-suite shuts down, automatically tar and gzip its job logs, copy the tarball to the main-suite job log dir, then delete the sub-suite run dir and its standard-location symlink.
  • (or concatenate all sub-suite job.out/errs to a single summary log, and copy that back)

I would probably hold off with handling the logs until interaction with cylc review is figured out, as these things might be quite interdependent.

  • add the same option to cylc run, so that explicit registration can be skipped.

Yes. I also wonder if it would be useful to add a new cylc subrun command in the future. This would be just like the cylc run but use more appropriate defaults (--no-detach --auto-shutdown), like I suggested at the end of this comment.

  • check that all job hosts in the sub-suite do actually see the alternate run-dir location.

Isn't that part of remote invocation (or is that feature rose suite-run only)?

(BTW I've assumed that we want sub-suite share and work dirs on local disk too, with main-suite share path passed in for final sub-suite outputs, but maybe that needs to be configurable?)

My gut feeling is that there will be need for flexibility here. Another possible choice is to make share global by default (i.e., inherited from the master suite) and only work private. At the moment it is rose suite-run that manages share and work redirection, so this could be left undecided until this is migrated to cylc.

@hjoliver
Copy link
Member Author

hjoliver commented Feb 4, 2019

@TomekTrzeciak -

Yes. I also wonder if it would be useful to add a new cylc subrun command in the future. This would be just like the cylc run but use more appropriate defaults (--no-detach --auto-shutdown), like I suggested at the end of this comment.

Yeah, I saw your comment but didn't make --no-detach etc. the default because this log dir relocation might sometimes be used for "main" suites, not just sub-suites.

cylc subrun is another option, but I'd rather avoid that as there are too many cylc commands already, and we're hoping to rethink and rationalize the command set (particularly for start and restart) before too long. So for the moment, I think I'll just document clearly that sub-suites should be non-detaching.

@TomekTrzeciak
Copy link
Contributor

Yeah, I saw your comment but didn't make --no-detach etc. the default because this log dir relocation might sometimes be used for "main" suites, not just sub-suites.

Yes, I agree this shouldn't be tied to --run-dir=DIR option, it's a different concern.

cylc subrun is another option, but I'd rather avoid that as there are too many cylc commands already, and we're hoping to rethink and rationalize the command set (particularly for start and restart) before too long. So for the moment, I think I'll just document clearly that sub-suites should be non-detaching.

Absolutely happy with this. Passing some extra options to cylc run for sub-suite runs is perfectly OK for now.

@hjoliver
Copy link
Member Author

Rebased and deconflicted.

@hjoliver hjoliver added the WIP label Jun 25, 2019
@hjoliver hjoliver changed the title WIP: support alternate cylc-run locations. Support alternate cylc-run locations. Jun 25, 2019
@matthewrmshin
Copy link
Contributor

Lacking tests. LGTM otherwise.

@hjoliver hjoliver changed the title Support alternate cylc-run locations. Support symlinked alternate run directories (cylc 8). Jun 30, 2019
@cylc cylc deleted a comment Jun 30, 2019
@hjoliver hjoliver closed this Jun 30, 2019
@hjoliver hjoliver deleted the alt-run-dir-take-2 branch June 30, 2019 13:56
@hjoliver hjoliver restored the alt-run-dir-take-2 branch June 30, 2019 15:11
@hjoliver hjoliver reopened this Jun 30, 2019
@hjoliver
Copy link
Member Author

(oops, accidental closure reverted!)

@hjoliver
Copy link
Member Author

hjoliver commented Jul 8, 2019

(this is just waiting on documentation; might get it done today if I'm lucky)

@matthewrmshin matthewrmshin modified the milestones: soon, cylc-8.0a1, cylc-8.0a2 Jul 25, 2019
@matthewrmshin
Copy link
Contributor

Looks like a Travis CI issue. I've just kicked the job. If that doesn't work, I'll restart the whole build job, which may help.

@matthewrmshin
Copy link
Contributor

Travis CI is hating us today. 👎

@kinow
Copy link
Member

kinow commented Aug 29, 2019

Restarted whole build.

@matthewrmshin
Copy link
Contributor

Looks like the same chunk is still failing with 4 different tests again.

@hjoliver
Copy link
Member Author

Travis CI is hating us today. -1

The hating is mutual 😠

@kinow
Copy link
Member

kinow commented Aug 29, 2019

Looks like the same chunk is still failing with 4 different tests again.

3 jobs (chunks) failed. So I kicked them all. One is still failing, I believe he same as before 😕

./tests/cylc-poll/00-basic.t                                                   (Wstat: 0 Tests: 2 Failed: 1)
  Failed test:  2
./tests/cylc-insert/10-insert-non-graphed-cycle-point.t                        (Wstat: 0 Tests: 3 Failed: 1)
  Failed test:  2
./tests/events/13-task-event-mail-globalcfg.t                                  (Wstat: 0 Tests: 5 Failed: 4)
  Failed tests:  2-5
./tests/job-file-trap/02-pipefail.t                                            (Wstat: 0 Tests: 4 Failed: 2)
  Failed tests:  3-4
./tests/cylc-trigger/08-edit-run-host-select.t                                 (Wstat: 0 Tests: 3 Failed: 2)
  Failed tests:  2-3

Looking at the error details, some tracebacks with sqlite3.OperationalError: unable to open database file, FileNotFoundError: [Errno 2] No such file or directory: '/home/travis/cylc-run/cylctb-20190829T201640Z/cylc-trigger/08-edit-run-host-select'. Strange.

@hjoliver
Copy link
Member Author

But they are different specific tests than before! Super-weird.

@hjoliver
Copy link
Member Author

hjoliver commented Aug 29, 2019

Looking at the code changes on this branch again, there is no difference at all from master unless the new --run-dir option is used with cylc register.

Unless the new tests that do use --run-dir are interfering with the main run directory structure (seems unlikely, but better check that...)

@hjoliver
Copy link
Member Author

Unless the new tests that do use --run-dir are interfering with the main run directory structure (seems unlikely, but better check that...)

Huh, tests/registration/00-simple.t does run in the bad batch, and the four failures occur after it runs. Very suspicious 🤔

@hjoliver
Copy link
Member Author

That does seem to be the problem! (sorry I doubted you Travis CI 😭 )

@matthewrmshin
Copy link
Contributor

Oh. That's why it is always four random tests! We run our tests shuffled with -j 5...

@hjoliver
Copy link
Member Author

hjoliver commented Aug 29, 2019

Yes, it all makes sense. Solved, I think. I will claim the "idiot of the day" prize 😜 I'm just checking for other instances of the same problem (which is, removing the top level test directory under cylc-run). It could potentially explain other missing-file flakiness that we've seen. [None found]

@hjoliver
Copy link
Member Author

I've fixed that problem. Chunk 1 failed, tests/shutdown/08-now1.t may be flaky. Kicked T-CI...

@hjoliver
Copy link
Member Author

Chunk 1 failed, tests/shutdown/08-now1.t may be flaky.

#3336 might fix it.

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this. I'll kick the build on travis.

@wxtim wxtim merged commit 3918589 into cylc:master Aug 30, 2019
@matthewrmshin matthewrmshin modified the milestones: cylc-8.0a2, cylc-8.0a1 Sep 3, 2019
@hjoliver hjoliver mentioned this pull request Oct 8, 2020
7 tasks
@hjoliver hjoliver deleted the alt-run-dir-take-2 branch May 10, 2021 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

local run-dir mode for sub-suites
5 participants