idea: add shell plugin to work around non-locality-aware scheduling of cores and GPUs #5342

Open
grondo opened this issue Jul 18, 2023 · 2 comments

Comments

@grondo
Contributor

grondo commented Jul 18, 2023

@trws had the idea to build a simple shell plugin to work around the current lack of guaranteed locality-aware scheduling of cores and GPUs.

The idea is to implement a -o follow-gpus (or similar) option that ignores the cores assigned to the job and instead picks one or more cores near the allocated GPUs.

From @trws:

So it would be this case, but slightly broadened: if you have something you want to run where the main constraint is "different GPUs and some nearby cores", we could handle it better as an interim solution, without having to write a one-off that's specific to the numbers for a given job on a given box.

The case in question was a user who wanted to run 2 jobs per node, each with 4 tasks, with each task using 1 core and 1 GPU.
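
To make the "pick cores near the allocated GPUs" idea concrete, here is a minimal Python sketch of the selection logic only. It is not Flux code and does not use the shell plugin API: the function name cores_near_gpus, the GPU_NUMA and NUMA_CORES tables, and the node layout (2 NUMA domains, each with 8 cores and 4 GPUs) are all hypothetical. A real plugin would presumably get locality information from something like hwloc. The sketch assumes each NUMA domain's cores are split evenly among the GPUs in that domain, so that two jobs holding different GPUs end up on different cores without coordinating with each other.

```python
from typing import Dict, List


def cores_near_gpus(
    gpu_numa: Dict[int, int],
    numa_cores: Dict[int, List[int]],
    allocated_gpus: List[int],
    cores_per_gpu: int = 1,
) -> Dict[int, List[int]]:
    """Map each allocated GPU to cores in the same NUMA domain.

    Each domain's cores are divided into equal slices, one per GPU in
    that domain, and a GPU only ever takes cores from its own slice.
    (Hypothetical policy, chosen so separate jobs never collide.)
    """
    picked: Dict[int, List[int]] = {}
    for gpu in allocated_gpus:
        node = gpu_numa[gpu]
        # All GPUs sharing this NUMA domain, in a stable order.
        siblings = sorted(g for g, n in gpu_numa.items() if n == node)
        cores = sorted(numa_cores[node])
        share = len(cores) // len(siblings)  # slice size per GPU
        if cores_per_gpu > share:
            raise ValueError(
                f"GPU {gpu}: want {cores_per_gpu} cores, slice has {share}"
            )
        start = siblings.index(gpu) * share
        picked[gpu] = cores[start:start + cores_per_gpu]
    return picked


# Hypothetical node: GPUs 0-3 on NUMA 0 (cores 0-7), GPUs 4-7 on NUMA 1
# (cores 8-15). These numbers are made up for illustration.
GPU_NUMA = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
NUMA_CORES = {0: list(range(0, 8)), 1: list(range(8, 16))}

# The case from this issue: 2 jobs per node, 4 tasks each, 1 core + 1 GPU
# per task. Each job ignores its assigned cores and follows its GPUs.
job_a = cores_near_gpus(GPU_NUMA, NUMA_CORES, [0, 1, 2, 3])
job_b = cores_near_gpus(GPU_NUMA, NUMA_CORES, [4, 5, 6, 7])
print(job_a)
print(job_b)
```

Under those assumptions the two jobs land on disjoint cores, each local to its own GPUs. Ignoring the scheduler's core assignment this way only works as an interim measure if every job on the node uses the same policy.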

grondo added a commit to grondo/flux-core that referenced this issue Jul 26, 2023
Problem: An epilog-start event emitted for a job that never had an
alloc event will proceed immediately to inactive because there is no
pending "free" event to hold the job until the matching epilog-finish
is posted.

A pending epilog-start event should prevent the "clean" event for
a job as well as the "free" event. This means a job cannot become
inactive until all epilogs are complete.

Fixes flux-framework#5342
grondo added a commit to grondo/flux-core that referenced this issue Jul 27, 2023
@mergify mergify bot closed this as completed in f276677 Jul 27, 2023
@trws
Member

trws commented Jul 27, 2023

@grondo, was this the intended target of that commit?

@grondo
Contributor Author

grondo commented Jul 27, 2023

Dang no. Typo 🤦

@grondo grondo reopened this Jul 27, 2023