idea: add shell plugin to work around non-locality-aware scheduling of cores and GPUs #5342
Comments
grondo added a commit to grondo/flux-core that referenced this issue on Jul 26, 2023:

> Problem: An epilog-start event emitted for a job that never had an alloc event will proceed immediately to inactive, because there is no pending "free" event to hold the job until the matching epilog-finish is posted. A pending epilog-start event should prevent the "clean" event for a job as well as the "free" event. This means a job cannot become inactive until all epilogs are complete.
>
> Fixes flux-framework#5342
grondo added a commit to grondo/flux-core that referenced this issue on Jul 27, 2023:

> Problem: An epilog-start event emitted for a job that never had an alloc event will proceed immediately to inactive, because there is no pending "free" event to hold the job until the matching epilog-finish is posted. A pending epilog-start event should prevent the "clean" event for a job as well as the "free" event. This means a job cannot become inactive until all epilogs are complete.
>
> Fixes flux-framework#5342
@grondo, was this the intended target of that commit?

Dang no. Typo 🤦
@trws had the idea to build a simple shell plugin which could work around the current lack of guaranteed locality-aware scheduling of cores and GPUs. The idea would be to implement a `-o follow-gpus` (or similar) option which would ignore the cores assigned to the job and instead pick one or more cores near the allocated GPUs.

From @trws:

> The case in question was a user that wanted to run 2 jobs per node, each with 4 tasks using 1 core and 1 gpu each.
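No such plugin exists yet, but as a rough illustration of the shape it might take, here is a minimal sketch in C against the flux shell plugin API (`flux/shell.h`) and hwloc's CUDA helper. The plugin name, the option name `follow-gpus`, the build line, and especially the mapping from `CUDA_VISIBLE_DEVICES` entries to hwloc device indices are assumptions for illustration, not code from flux-core:

```c
/* follow-gpus.c - hypothetical sketch of the proposed shell plugin.
 *
 * Assumed build line (untested):
 *   cc -shared -fPIC follow-gpus.c -o follow-gpus.so \
 *      $(pkg-config --cflags --libs flux-core hwloc) -lcudart
 */
#define FLUX_SHELL_PLUGIN_NAME "follow-gpus"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <hwloc.h>
#include <hwloc/cudart.h> /* hwloc_cudart_get_device_cpuset() */
#include <flux/shell.h>

/* task.exec callbacks run in the task process just before exec(2),
 * so binding the current process here binds the task itself.
 */
static int follow_gpus (flux_plugin_t *p,
                        const char *topic,
                        flux_plugin_arg_t *args,
                        void *data)
{
    flux_shell_t *shell = flux_plugin_get_shell (p);
    flux_shell_task_t *task;
    flux_cmd_t *cmd;
    const char *devices;
    hwloc_topology_t topo;
    hwloc_cpuset_t cpus;
    char *s, *tok, *saveptr = NULL;
    int rc = 0;

    /* Do nothing unless the job was run with -o follow-gpus */
    if (flux_shell_getopt (shell, "follow-gpus", NULL) != 1)
        return 0;

    if (!(task = flux_shell_current_task (shell))
        || !(cmd = flux_shell_task_cmd (task))
        || !(devices = flux_cmd_getenv (cmd, "CUDA_VISIBLE_DEVICES")))
        return 0; /* no GPUs assigned to this task: leave binding alone */

    if (hwloc_topology_init (&topo) < 0 || hwloc_topology_load (topo) < 0)
        return -1;
    cpus = hwloc_bitmap_alloc ();

    /* Union the cpusets local to each assigned GPU, ignoring whatever
     * cores the scheduler originally handed the task.  (Hand-waved:
     * the entries in CUDA_VISIBLE_DEVICES are assumed to line up with
     * the device indices the cudart helper expects.)
     */
    s = strdup (devices);
    for (tok = strtok_r (s, ",", &saveptr); tok != NULL;
         tok = strtok_r (NULL, ",", &saveptr)) {
        hwloc_cpuset_t devcpus = hwloc_bitmap_alloc ();
        if (hwloc_cudart_get_device_cpuset (topo, atoi (tok), devcpus) == 0)
            hwloc_bitmap_or (cpus, cpus, devcpus);
        hwloc_bitmap_free (devcpus);
    }
    free (s);

    /* Bind to the GPU-local cores, if any were found */
    if (!hwloc_bitmap_iszero (cpus))
        rc = hwloc_set_cpubind (topo, cpus, HWLOC_CPUBIND_PROCESS);

    hwloc_bitmap_free (cpus);
    hwloc_topology_destroy (topo);
    return rc;
}

int flux_plugin_init (flux_plugin_t *p)
{
    flux_plugin_set_name (p, FLUX_SHELL_PLUGIN_NAME);
    return flux_plugin_add_handler (p, "task.exec", follow_gpus, NULL);
}
```

With something along these lines loaded, the user in the example above could run each of their two per-node jobs as, e.g., `flux run -n4 -c1 -g1 -o follow-gpus app`, and each task would be re-bound to cores near its assigned GPU regardless of which cores the scheduler picked.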