Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new 'RAW' variables for SHARE and WORK variables to ensure share/work dirs are created, not left with broken symlinks #5978

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ColemanTom
Copy link
Contributor

Closes #5567

CYLC_WORKFLOW_SHARE_DIR_RAW and CYLC_WORKFLOW_WORK_DIR_RAW added. These are then used for 'mkdir' to ensure directories are created. Currently, if SHARE_DIR and WORK_DIR are symlinks, and the dir they point to does not exist, the dirs will not be created, and tasks may fail. The new variables ensure the correct directories are always created.

I've not fully tested this (and am honestly not sure where to test this in an automated sense or what I can leverage from the existing infrastructure) but wanted to open up for comment before I spend any more time on it.

  • Are you happy with this approach or bin the code?
  • Any advice on where I can look to test this, preferably from unit tests, but maybe that isn't sufficient?
  • If proceed, do you want the variable name changed? I could not think of a name I am really happy with
  • Are you happy with the variables being exported or do you want the scope changed to not be generally available?

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.

CYLC_WORKFLOW_SHARE_DIR_RAW and CYLC_WORKFLOW_WORK_DIR_RAW added.
These are then used for 'mkdir' to ensure directories are created.
Currently, if SHARE_DIR and WORK_DIR are symlinks, and the dir
they point to does not exist, the dirs will not be created, and
tasks may fail. The new variables ensure the correct directories
are always created.
@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Feb 19, 2024
@hjoliver
Copy link
Member

@ColemanTom - we discussed this in the project meeting today.

We wondered whether having the disk yanked out from underneath you is really something that Cylc should be expected to handle. @dpmatthews might have some more specific comments tomorrow.

At this stage I don't see a problem with handling it if we can, in principle, but given the purpose of the share directory (shared space for workflow task IO) it seems likely that simply recreating it mid-run would be insufficient to allow the workflow continue - wouldn't you expect critical input files for upcoming tasks to be disappeared by the disk failover?

Also can you explain where you expect these new RAW variables to be defined?

@ColemanTom
Copy link
Contributor Author

ColemanTom commented Feb 19, 2024

Assuming workflows are using share and work as per the user guide (i.e. warm/shared data in share and transient data in work), this issue only affects cold-running models for the share area but work for all models. Warm models should always have a failover disk setup as the warm start data should always be backed up to the failover disk routinely I hope, so that means they should always have share created. However, they shouldn't ever have work created on the failover disk.

At this stage I don't see a problem with handling it if we can, in principle, but given the purpose of the share directory (shared space for workflow task IO) it seems likely that simply recreating it mid-run would be insufficient to allow the workflow continue

It depends on the system and timing. Hypothetical based on real models.

Hypothetical 1 - cold start, easy rewind

  • model is always a cold start
  • takes 10 minutes to run
  • does not need to back anything up to the failover disk as in the event of a disk failover, you can just rerun the whole cycle by starting a new flow
  • If the failover disk is swapped in between runs, and has never run on the failover disk before, then the share and work directories (assuming that they are symlinks) are not there, model will fail randomly the first time they try to be used (but not in the job.sh setup part as they have || true for a reason I'm still not sure about as mkdir -p is meant to be thread safe and atomic)
  • With this change, it will create them straight away. Failure won't be because the share and work dirs don't exist, but if one happens, it will be because input data is missing and it will tell you that in error logs

Hypothetical 2 - warm start, no work dir

  • model runs every 6 hours
  • data is backed up to the failover disk for the share area
  • nothing is backed up or setup on the failover disk for the work area - so nothing dynamically ensures work is made
  • disk failover happens
  • model tries to run, but fails due to the lack of the work directory randomly (maybe via rose task-run trying to copy files in)
  • Someone then needs to go in and manually create the correct work directory so the model can continue running

Next question

Also can you explain where you expect these new RAW variables to be defined?

Unfortunately I cleaned out my examples yesterday due to quotas being neared, so I can't show you a screenshot, but, in this current MR, they would be defined in the job file if the directories are symbolic links, inserted at the red line in this screenshot.

image

@@ -1345,6 +1348,9 @@ def get_job_conf(
'pre-script': rtconfig['pre-script'],
'script': rtconfig['script'],
'submit_num': itask.submit_num,
'symlink_dirs': get_dirs_to_symlink(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this were to go forward, I imagine making sure get_dirs_to_symlink was cached would be a reasonable thing to do for a minor speedup with limited extra memory load.

@hjoliver
Copy link
Member

hjoliver commented Feb 19, 2024

Note that cylc play and cylc reinstall fail if the existing run-dir symlinks are dangling, which is reasonable:

WorkflowFilesError: /home/oliverh/cylc-run/bug/runN symlink not valid

I wonder if it might actually be simplest and safest just to do the run-time equivalent of that, for workflows that are running when the failover occurs, rather than try to handle it automatically. I think that would be this:

# job.sh
     # Create share and work directories
-    mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}" || true
+    mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}" 

Then jobs submitted after the failover will just fail, with a simple error message that explains what's wrong.

Then it would be up to the user to diagnose the problem (pretty easy, given that you ought to know that disk failover is a possibility) and:

  • manually fix the dangling symlinks
  • (and also check that critical input files for upcoming tasks haven't been disappeared, which we can't automate anyway)
  • retrigger the failed tasks, to carry on

To me that seems a reasonable way to respond to something as drastic as the disk being replaced under the running workflow.

What do you think?

@ColemanTom
Copy link
Contributor Author

manually fix the dangling symlinks

Hard to do in our environment where there is no direct access to a user and access is via Cylc edit runs only. Can your proposal fix the dangling symlinks via an edit run? You can discuss our environment requirements more with @jarich as I'm defintely able to handle making directories myself because I work as myself. Although.. what if a workflow is running under a different user, you have permissions to run their system, and they are on leave so can't make the folder themselves? Again, I guess it comes back to, it needs to be doable via a Cylc edit run still?

will just fail, with a simple error message that explains what's wrong.

I do support failing early (removing || true) because I like things failing early, for example permisson error or quota breach.

@jarich
Copy link

jarich commented Feb 20, 2024

We wondered whether having the disk yanked out from underneath you is really something that Cylc should be expected to handle.

This is something that our workflows need to handle. This isn't the same as it being something Cylc needs to handle, but it would be nice if it did.

Disk failovers are rare, but they happen sometimes, and while we absolutely expect jobs will fail when we do a disk failover, we need them to succeed with no more actions required than to re-run the job. First level support will not have access to create any missing directories or symbolic links.

There are other moments in time when we can make sure that these actions are carried out if Cylc isn't the right place to do it, but I would not have thought some basic disk filesystem failover was out of scope for how other agencies run their HPCs.

@ColemanTom ColemanTom changed the title Add new 'RAW' variables for SHARE and WORK variables to ensure share/work dirs are created with broken symlinks Add new 'RAW' variables for SHARE and WORK variables to ensure share/work dirs are created, not left with broken symlinks Feb 20, 2024
@ColemanTom
Copy link
Contributor Author

This isn't the same as it being something Cylc needs to handle, but it would be nice if it did.

Yes, we do have code which makes them, its just tedious having every workflow have to call a script/function (and I'm pretty sure some don't so this has caused us problems in the past).

@hjoliver
Copy link
Member

hjoliver commented Feb 20, 2024

we need them to succeed with no more actions required than to re-run the job. First level support will not have access to create any missing directories or symbolic links.

What I was trying to get at above, in part, is whether or not having Cylc automatically fix the symlinks really is sufficient under the circumstances.

Consider some generic Cylc workflow, regardless of model cold/warm starts.

foo => bar => baz

These task will almost certainly be communicating through the share directory: foo's output files are bar's input files, etc.

If I understand Tom's response above, some of your workflows may require rewinding to the previous cycle, and some may require manual run-dir manipulation to prepare these "intermediate files" after the fail-over?

At least in the second case, I presume your first level support staff can't do that either?

If so, maybe it's reasonable to say that disk fail-over isn't a "first level support" problem. Or if it is, those staff need to be given the power to fix it?

(Note, I'm not playing devil's advocate for the hell of it, just trying to understand exactly what your situation is, to form an opinion on whether or not it should be a "Cylc problem" 😁 !)

@ColemanTom
Copy link
Contributor Author

If I understand Tom's response above, some of your workflows may require rewinding to the previous cycle, and some may require manual run-dir manipulation to prepare these "intermediate files" after the fail-over?

Not to the previous cycle, just rerun the current cycle. If in Cylc7 world, right click on the cycle, reset status to waiting, let it run.

No manual run-dir manipulation takes place.

If the Cylc team don't want this sort of functionality, that is completely fine, its a proposal/idea. I imagine I can easily modify our global-init-script to do mkdir on the resolved directories (as its a known pattern in the global.cylc file already), and the global init-script part is run prior to cylc__job_main, so before Cylc does it's mkdir, so if you were to remove the || true for example, it would be non-breaking.

@hjoliver
Copy link
Member

@ColemanTom - I'm not dismissing the idea outright, at least not yet, it seems worth considering, especially if it's true that no other external manual intervention would be required for your fail-over use case. Others on the team who work closer to the metal than me might have stronger opinions, let's see.

At the very least, I think we should remove the || true after the mkdir -p commands. I can't think why that's there in the first place. We must have thought, in the past, that un-createable (not merely un-created) share and work directories should not bring tasks down. But given the importance of those directories, that seems a bit silly.

@ColemanTom
Copy link
Contributor Author

I think we should remove the || true after the mkdir -p commands. I can't think why that's there in the first place. We must have thought, in the past, that un-createable (not merely un-created) share and work directories should not bring tasks down. But given the importance of those directories, that seems a bit silly.

I just assumed there was fear of a race condition - and perhaps in a non-POSIX linux environment there is?

@hjoliver
Copy link
Member

hjoliver commented Feb 20, 2024

I just assumed there was fear of a race condition - and perhaps in a non-POSIX linux environment there is?

Ah, yeah .. maybe that's it (I'd have thought -p avoids that, but not 100% sure...)

@oliver-sanders
Copy link
Member

At the very least, I think we should remove the || true after the mkdir -p commands

Changes like this are high risk and require a lot of thinking and testing (this may have been protecting us for some use cases).

@hjoliver
Copy link
Member

Changes like this are high risk and require a lot of thinking and testing (this may have been protecting us for some use cases).

It was added (without an explanatory comment) in PR #17 - merged 30 Sep 2011 🤯

@ColemanTom
Copy link
Contributor Author

I feel like the discussion around || true should perhaps be in a separate Issue just for that topic? Perhaps Matt even remembers why it is there? I know in the UM there was (is?) some oddities in the I/O that are remanants from old HPC the UKMO owned, so it is quite possible they had issues? Or it was just ultra-caution at the time.

@ColemanTom
Copy link
Contributor Author

On the topic of this MR itself, the Issue is currently listed as against the 8.3.0 milestone. Given I raised the Issue and had an idea on how one may approach it, I figured I should provide the idea to help 8.3.0 in some small way instead of distracting you with random questions and Issues. Whilst I like this idea as it saves me making changes for our local env specifically, it does add some extra complexity which may not be desired. Happy for this to be punted to a discussion for 8.4.0 if desired too. Just let me know if you want me to proceed with this and adding tests, etc, else I'll just forget it for now and come back to the topic in a couple of months.

@hjoliver
Copy link
Member

I feel like the discussion around || true should perhaps be in a separate Issue just for that topic? Perhaps Matt even remembers why it is there?

See #6000

@hjoliver hjoliver added this to the cylc-8.4.0 milestone Feb 27, 2024
@hjoliver
Copy link
Member

On the topic of this MR itself, ...

@ColemanTom - thanks for engaging on this (and other!) issues, much appreciated. However, I don't think we've reached a consensus on this one yet, so I've punted it back to 8.4.0 for now.

@dpmatthews
Copy link
Contributor

I would prefer the readlink approach suggested in the original issue.
Something like:

if [[ ! -e "${CYLC_WORKFLOW_SHARE_DIR}" ]]; then
  if [[ -L "${CYLC_WORKFLOW_SHARE_DIR}" ]]; then
    mkdir -p $(readlink "${CYLC_WORKFLOW_SHARE_DIR}")
  else
    mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}"
  fi
fi

As far as I can see this is safe and harmless,
It also has the advantage of avoiding the unnecessary mkdir calls (mkdir is not a bash built-in) so is more efficient that the current code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Flag this as a question for the next Cylc project meeting.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job.sh improvement for creating share and work directories?
5 participants