
Conversation

@un-def un-def (Collaborator) commented Mar 26, 2025

# Generate hostfile for mpirun
: > hostfile
for ip in ${DSTACK_NODES_IPS}; do
  echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
done
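For illustration, with two nodes and DSTACK_GPUS_PER_NODE=8 (hypothetical values, not taken from a real run), the loop produces a hostfile along these lines:

  # hostfile (hypothetical IPs)
  10.0.0.1 slots=8
  10.0.0.2 slots=8

mpirun reads the slots=N clause to decide how many processes it may place on each host.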
Collaborator

What was the effect of missing slots? I got errors when using a hostfile without slots with TCPRX on 8xH100.

@un-def (Collaborator, Author)

Error message:

There are not enough slots available in the system to satisfy the N
slots that were requested by the application:

  ./all_reduce_perf

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In other words, if slots is not specified, Open MPI defaults to the number of processor cores.

For example, an n1-highmem-2 instance has 4 T4 GPUs but only 2 vCPUs, so slots defaults to 2 while we request 4 processes per node.
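A minimal sketch of the failure mode, assuming two such n1-highmem-2 nodes in the hostfile and a plain Open MPI invocation (the command below is hypothetical, for illustration only):

  # Without slots=, Open MPI counts 2 processor cores per node: 2 + 2 = 4 slots.
  # Requesting 8 ranks (4 per node, one per GPU) then fails with
  # "There are not enough slots available in the system ...":
  mpirun --hostfile hostfile -np 8 ./all_reduce_perf

  # Adding "slots=4" to each node's line in the hostfile provides 8 slots,
  # and the same command runs.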

@un-def un-def merged commit 9d0b83f into master Mar 26, 2025
24 checks passed
@un-def un-def deleted the pr_update_nccl_test_examples branch March 26, 2025 09:47