calculating resources with clusterscope #105
Merged
Why?
clusterscope has an API that calculates a proportionate share of a machine's resources (CPUs and memory) given the number of GPUs or CPUs requested.
Notice that the original code has the following args:
H100 nodes have 192 CPUs, but the original code uses only 96 CPUs per node (96 cpus_per_task * 1 ntasks_per_node), leaving the other 96 CPUs idle.
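To make the arithmetic explicit, here is a minimal sketch using only the numbers quoted above. The variable names are illustrative and are not taken from the original code.

```python
# Figures quoted in this description; the variable names are illustrative.
cpus_per_node = 192    # CPUs available on an H100 node
cpus_per_task = 96     # value used by the original args
ntasks_per_node = 1    # value used by the original args

# CPUs actually requested per node, and what is left unused.
cpus_used = cpus_per_task * ntasks_per_node
cpus_idle = cpus_per_node - cpus_used

print(f"used={cpus_used}, idle={cpus_idle}, utilization={cpus_used / cpus_per_node:.0%}")
```

With these values, half of each node's CPUs are never requested.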
clusterscope provides an API that allocates the appropriate number of CPUs and amount of memory for the number of GPUs requested. It goes something like:
H100 nodes have 8 GPUs in total, so requesting all 8 means clusterscope allocates all of the node's memory and all of its CPUs.
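The proportional rule can be sketched in plain Python as below. This is not clusterscope's API: `NodeSpec` and `proportional_share` are hypothetical names used only for illustration, and the memory figure is an assumed placeholder, since the description gives only the CPU and GPU counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of a node's total resources."""
    gpus: int
    cpus: int
    mem_gb: int


def proportional_share(node: NodeSpec, gpus_requested: int) -> tuple[int, int]:
    """Return (cpus, mem_gb) scaled by the fraction of the node's GPUs requested."""
    if not 1 <= gpus_requested <= node.gpus:
        raise ValueError(f"gpus_requested must be between 1 and {node.gpus}")
    # Integer division rounds down so the share never exceeds the node's totals.
    cpus = node.cpus * gpus_requested // node.gpus
    mem_gb = node.mem_gb * gpus_requested // node.gpus
    return cpus, mem_gb


# 8 GPUs and 192 CPUs come from the description; mem_gb=2048 is an assumed placeholder.
h100 = NodeSpec(gpus=8, cpus=192, mem_gb=2048)

full_node = proportional_share(h100, 8)   # every GPU requested -> whole node
one_gpu = proportional_share(h100, 1)     # one GPU -> one eighth of the node
print(full_node, one_gpu)
```

Under these assumptions, a full 8-GPU request gets all 192 CPUs, while a single-GPU request gets 24.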
Test plan
GitHub CI