@luccabb luccabb commented Oct 7, 2025

Why?

clusterscope has an API that calculates a proportionate share of a machine's resources (CPUs and memory) given the number of GPUs or CPUs requested.

Notice that the original code has the following args:

"ntasks_per_node": 1,
"cpus_per_task": 96,
"gpus_per_node": 8,

H100 nodes have 192 CPUs. The code above uses only 96 CPUs per node (96 cpus_per_task * 1 ntasks_per_node), leaving the other 96 CPUs idle.
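The arithmetic behind that claim can be checked with a few lines of Python. This is only an illustration using the figures quoted above; the helper name `requested_cpu_fraction` is made up for this sketch and is not part of matrix or clusterscope.

```python
def requested_cpu_fraction(cpus_per_task: int, ntasks_per_node: int, node_cpus: int) -> float:
    """Fraction of a node's CPUs that the job arguments actually request."""
    return (cpus_per_task * ntasks_per_node) / node_cpus


# Values from the original job arguments and the stated H100 node size.
requested = 96 * 1            # cpus_per_task * ntasks_per_node
idle = 192 - requested        # CPUs left unused on the node
fraction = requested_cpu_fraction(96, 1, 192)

print(f"requested={requested} idle={idle} fraction={fraction}")
```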

What clusterscope provides is an API that allocates the appropriate number of CPUs and amount of memory for the number of GPUs requested. It goes something like this:

  1. request 8 GPUs
  2. clusterscope gets the total CPUs and memory for the nodes
  3. clusterscope allocates CPUs and memory in proportion to the share of GPUs being requested

H100 nodes have 8 GPUs in total, so requesting all 8 means clusterscope allocates all of the node's memory and all of its CPUs.
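The three steps above can be sketched as follows. This is a minimal illustration of proportional allocation, not clusterscope's actual API: `NodeSpec`, `Allocation`, and `proportional_allocation` are hypothetical names, and the 2048 GB memory figure is an assumed placeholder (the description gives only the CPU and GPU counts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    gpus: int
    cpus: int
    mem_gb: int


@dataclass(frozen=True)
class Allocation:
    gpus: int
    cpus: int
    mem_gb: int


def proportional_allocation(node: NodeSpec, gpus_requested: int) -> Allocation:
    """Scale CPUs and memory by the fraction of the node's GPUs requested."""
    if not 0 < gpus_requested <= node.gpus:
        raise ValueError(f"gpus_requested must be between 1 and {node.gpus}")
    # Integer division rounds down so the request never exceeds the node.
    return Allocation(
        gpus=gpus_requested,
        cpus=node.cpus * gpus_requested // node.gpus,
        mem_gb=node.mem_gb * gpus_requested // node.gpus,
    )


# 8 GPUs and 192 CPUs come from the description; mem_gb is an assumption.
h100 = NodeSpec(gpus=8, cpus=192, mem_gb=2048)
full_node = proportional_allocation(h100, 8)
half_node = proportional_allocation(h100, 4)
print(full_node)
print(half_node)
```

Requesting all 8 GPUs yields the whole node, while requesting 4 yields half of the CPUs and memory, which is the behavior the description attributes to clusterscope.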

Test plan

GitHub CI

@luccabb luccabb requested a review from skalyan October 7, 2025 22:16
@meta-cla meta-cla bot added the CLA Signed label Oct 7, 2025
@luccabb luccabb requested a review from gunchu October 7, 2025 22:16
@luccabb luccabb changed the title from "Clusterscope" to "calculating resources with clusterscope" Oct 7, 2025
@skalyan skalyan left a comment

Looks okay to me, unless there is some special-case reason for requesting a hard-coded CPU core count.

luccabb commented Oct 7, 2025

> maybe, we can also retire https://github.com/facebookresearch/matrix/blob/main/matrix/cluster/ray_worker_job.py#L30-L43?

Yeah, I think we can retire this. I'll have to write code on the clusterscope side to support this operation, so I think it's best to leave it for a future PR.

dongwang218 commented

Sure, we can do it in a follow-up.

@dongwang218 dongwang218 merged commit d86c493 into main Oct 7, 2025
8 checks passed
@dongwang218 dongwang218 deleted the clusterscope branch October 7, 2025 22:53
skalyan commented Oct 7, 2025

@dongwang218 Where do you set this env variable? Is it custom to you?

num_gpus = int(os.environ.get("SLURM_GPUS_ON_NODE", 0))
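For context on the line quoted above: Slurm's sbatch documentation lists `SLURM_GPUS_ON_NODE` among the output environment variables it sets for a job, so the variable is typically absent outside a Slurm GPU allocation. Below is a hedged sketch of a more defensive read; the helper `gpus_on_node` is a hypothetical name for illustration and is not code from this PR.

```python
import os
from typing import Mapping, Optional


def gpus_on_node(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the GPU count from SLURM_GPUS_ON_NODE, or 0 if unset or malformed."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_GPUS_ON_NODE", "")
    try:
        # Clamp at 0 so a bogus negative value is never reported.
        return max(int(raw), 0)
    except ValueError:
        # Unset (empty string) or non-numeric value: treat as no GPUs.
        return 0


# Outside a Slurm GPU job this would usually be 0.
num_gpus = gpus_on_node()
```

Passing a mapping explicitly, for example `gpus_on_node({"SLURM_GPUS_ON_NODE": "8"})`, makes the behavior easy to test without touching the real environment.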
