use case: SCR needs to hold onto resources while it copies data to global storage #2080
Comments
Also from #2040:
More specifically, resources are not released on a rank until the job shell exits and any subsequent cleanup has been issued and completes successfully on that rank. Here cleanup includes executing a set of cleanup/epilog scripts, as well as killing all stray processes and destroying any cgroups and namespaces for the job. Therefore other plugins and tools can hold resources for the job by adding themselves to the system cleanup/epilog configuration (probably via scripts). Note that work in the cleanup phase is executed as a privileged user by the IMP, and therefore available plugins/tools would have to be configured by system administrators (possibly selectable at runtime). In the future it might be interesting if unneeded resources could optionally be released before or during cleanup for long-running cleanup tasks. For example, a cleanup task that copies data off local storage could (somehow) release all resources except a core and the local storage itself for the duration of the copy.
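As a rough illustration of the kind of cleanup task described above, here is a minimal Python sketch of a script that copies files from node-local storage to global storage. Since resources are held until cleanup completes, the rank's resources would stay allocated for as long as such a copy runs. The function name `drain_local_storage` and its directory arguments are hypothetical and not part of Flux or SCR; how a script like this would be registered in the cleanup/epilog configuration is not shown.

```python
import shutil
from pathlib import Path


def drain_local_storage(local_dir, global_dir):
    """Copy every regular file under local_dir into global_dir.

    Hypothetical helper for illustration only. The directory layout is
    preserved, and the relative paths of the copied files are returned.
    """
    local_dir, global_dir = Path(local_dir), Path(global_dir)
    copied = []
    for src in sorted(local_dir.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(local_dir)
        dst = global_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies file contents and metadata
        copied.append(rel.as_posix())
    return copied
```

A cleanup script could call this once per rank and exit nonzero on failure, so that the copy is known to have finished before the resources are released.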
Note that one form of "application terminates abnormally" is that one of the nodes failed, e.g. crashed due to a Lustre bug or a hardware issue, in case that affects your architecture.
Coming back around to some old issues and noticed this one. There's currently a plan to move the administrative epilog/cleanup scripts to a "housekeeping" service which would not be associated with the job. Therefore, it is probably no longer (or soon will no longer be) appropriate to use the epilog script(s) to handle this movement of data. A couple of alternatives come to mind here. I'm not sure of the status of SCR and Flux, so these may not be good solutions either, but I'll throw them out there for reference:
As described by @kathrynmohror in #2040:
We should ensure that we have an initial plan to support this sort of thing via plugin or cleanup/epilog script etc.