[AWS]: dstack doesn't use the EFA-enabled Docker image for H100:1 on AWS (p5.4xlarge) #3069

@peterschmidt85

Steps to reproduce

  1. Create a fleet
type: fleet
name: gpu
nodes: 2
placement: cluster
backends: [aws]
resources:
  gpu: H100:1
  2. Run nccl-tests
type: task
nodes: 2
working_dir: "."
startup_order: workers-first
stop_criteria: master-done
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: nvidia:1..8
  shm_size: 16GB
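
For reference, a minimal sketch of how the two rank counts passed to mpirun resolve for this run, assuming `DSTACK_GPUS_NUM` is the total number of GPUs in the run and `DSTACK_GPUS_PER_NODE` is the number of GPUs on each node. The helper name `mpirun_rank_args` is made up for illustration and is not part of dstack.

```python
def mpirun_rank_args(nodes: int, gpus_per_node: int) -> list[str]:
    """Build the -n/-N arguments the task passes to mpirun.

    -n is the total number of ranks (one per GPU in the run),
    -N is the number of ranks started on each node.
    """
    total_gpus = nodes * gpus_per_node  # what DSTACK_GPUS_NUM is assumed to hold
    return ["-n", str(total_gpus), "-N", str(gpus_per_node)]


# Two nodes with one H100 each, as in the fleet above.
print(mpirun_rank_args(nodes=2, gpus_per_node=1))  # ['-n', '2', '-N', '1']
```

This matches the `nranks 2` and `localRanks 1` values that NCCL reports in the log.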

Actual behaviour

  1. dstack automatically picks dstackai/base:0.10-base-ubuntu22.04 (no EFA)

NCCL logs (no EFA initialized):

# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     87 on ip-172-31-41-23 device  0 [0000:33:00] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid    105 on ip-172-31-37-112 device  0 [0000:33:00] NVIDIA H100 80GB HBM3
ip-172-31-41-23:87:87 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
ip-172-31-41-23:87:87 [0] NCCL INFO Bootstrap: Using enp39s0:172.31.41.23<0>
ip-172-31-41-23:87:87 [0] NCCL INFO cudaDriverVersion 12080
ip-172-31-41-23:87:87 [0] NCCL INFO NCCL version 2.26.2+cuda12.1
ip-172-31-41-23:87:102 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
ip-172-31-41-23:87:102 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
ip-172-31-41-23:87:102 [0] NCCL INFO NET/IB : No device found.
ip-172-31-41-23:87:102 [0] NCCL INFO NET/IB : Using [RO]; OOB enp39s0:172.31.41.23<0>
ip-172-31-41-23:87:102 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
ip-172-31-41-23:87:102 [0] NCCL INFO NET/Socket : Using [0]enp39s0:172.31.41.23<0>
ip-172-31-41-23:87:102 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ip-172-31-41-23:87:102 [0] NCCL INFO Using network Socket
ip-172-31-41-23:87:102 [0] NCCL INFO ncclCommInitRank comm 0x57c72df01450 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 33000 commId 0x51d1ccf57958a510 - Init START
ip-172-31-41-23:87:102 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
ip-172-31-41-23:87:102 [0] NCCL INFO Bootstrap timings total 0.029221 (create 0.000033, send 0.000143, recv 0.009620, ring 0.009715, delay 0.000000)
ip-172-31-41-23:87:102 [0] NCCL INFO Setting affinity for GPU 0 to ffff
ip-172-31-41-23:87:102 [0] NCCL INFO comm 0x57c72df01450 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ip-172-31-41-23:87:102 [0] NCCL INFO Channel 00/02 : 0 1
ip-172-31-41-23:87:102 [0] NCCL INFO Channel 01/02 : 0 1
ip-172-31-41-23:87:102 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ip-172-31-41-23:87:102 [0] NCCL INFO P2P Chunksize set to 131072
ip-172-31-41-23:87:102 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
ip-172-31-41-23:87:104 [0] NCCL INFO [Proxy Service] Device 0 CPU core 11
ip-172-31-41-23:87:105 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 13
ip-172-31-41-23:87:102 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-172-31-41-23:87:102 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ip-172-31-41-23:87:102 [0] NCCL INFO CC Off, workFifoBytes 1048576
ip-172-31-41-23:87:102 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
ip-172-31-41-23:87:102 [0] NCCL INFO ncclCommInitRank comm 0x57c72df01450 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 33000 commId 0x51d1ccf57958a510 - Init COMPLETE
ip-172-31-41-23:87:102 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.36 (kernels 0.31, alloc 0.02, bootstrap 0.03, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00)
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
ip-172-31-41-23:87:107 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 7
ip-172-31-41-23:87:104 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
ip-172-31-41-23:87:104 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
ip-172-31-41-23:87:106 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 0
           8             2     float     sum      -1    117.7    0.00    0.00      0    118.4    0.00    0.00      0
          16             4     float     sum      -1    116.1    0.00    0.00      0    122.7    0.00    0.00      0
          32             8     float     sum      -1    119.9    0.00    0.00      0    118.2    0.00    0.00      0
          64            16     float     sum      -1    113.5    0.00    0.00      0    114.4    0.00    0.00      0
         128            32     float     sum      -1    120.4    0.00    0.00      0    116.6    0.00    0.00      0
         256            64     float     sum      -1    119.3    0.00    0.00      0    129.5    0.00    0.00      0
         512           128     float     sum      -1    125.4    0.00    0.00      0    126.3    0.00    0.00      0
        1024           256     float     sum      -1    124.7    0.01    0.01      0    119.0    0.01    0.01      0
        2048           512     float     sum      -1    124.4    0.02    0.02      0    122.3    0.02    0.02      0
        4096          1024     float     sum      -1    130.1    0.03    0.03      0    122.8    0.03    0.03      0
        8192          2048     float     sum      -1    131.3    0.06    0.06      0    132.9    0.06    0.06      0
       16384          4096     float     sum      -1    180.4    0.09    0.09      0    181.8    0.09    0.09      0
       32768          8192     float     sum      -1    203.8    0.16    0.16      0    198.7    0.16    0.16      0
       65536         16384     float     sum      -1    256.8    0.26    0.26      0    253.8    0.26    0.26      0
      131072         32768     float     sum      -1    279.0    0.47    0.47      0    284.7    0.46    0.46      0
      262144         65536     float     sum      -1    336.1    0.78    0.78      0    336.2    0.78    0.78      0
      524288        131072     float     sum      -1    481.3    1.09    1.09      0    495.4    1.06    1.06      0
     1048576        262144     float     sum      -1    643.2    1.63    1.63      0    650.7    1.61    1.61      0
     2097152        524288     float     sum      -1    976.6    2.15    2.15      0   1016.0    2.06    2.06      0
     4194304       1048576     float     sum      -1   3936.6    1.07    1.07      0   3521.7    1.19    1.19      0
     8388608       2097152     float     sum      -1   5162.4    1.62    1.62      0   7949.8    1.06    1.06      0
    16777216       4194304     float     sum      -1    13277    1.26    1.26      0    11353    1.48    1.48      0
    33554432       8388608     float     sum      -1    36911    0.91    0.91      0    18357    1.83    1.83      0
    67108864      16777216     float     sum      -1    74591    0.90    0.90      0    49076    1.37    1.37      0
   134217728      33554432     float     sum      -1    95068    1.41    1.41      0   104888    1.28    1.28      0
   268435456      67108864     float     sum      -1   156613    1.71    1.71      0   160618    1.67    1.67      0
   536870912     134217728     float     sum      -1   303174    1.77    1.77      0   306972    1.75    1.75      0
  1073741824     268435456     float     sum      -1   562061    1.91    1.91      0   545488    1.97    1.97      0
  2147483648     536870912     float     sum      -1  1106222    1.94    1.94      0  1094698    1.96    1.96      0
  4294967296    1073741824     float     sum      -1  2132099    2.01    2.01      0  2106053    2.04    2.04      0
  8589934592    2147483648     float     sum      -1  4233864    2.03    2.03      0  4258548    2.02    2.02      0
ip-172-31-41-23:87:166 [0] NCCL INFO comm 0x57c72df01450 rank 0 nranks 2 cudaDev 0 busId 33000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.831064
#

Expected behaviour

  1. dstack picks dstackai/base:0.10-devel-efa-ubuntu22.04 (EFA-enabled)
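
One way to tell from `NCCL_DEBUG=INFO` output which transport a run ended up on is to look for the `Using network ...` line. Below is a minimal sketch of such a check. The function names are made up for illustration. It only flags the `Socket` fallback seen in the log above and makes no assumption about the name NCCL reports when an EFA plugin is loaded.

```python
import re

# Matches the transport line NCCL prints at INFO level, e.g.
# "... NCCL INFO Using network Socket"
_NETWORK_RE = re.compile(r"NCCL INFO Using network (\S+)")


def nccl_transports(log_text: str) -> set[str]:
    """Return the set of network transports named in an NCCL INFO log."""
    return set(_NETWORK_RE.findall(log_text))


def fell_back_to_sockets(log_text: str) -> bool:
    """True if any rank reports the plain TCP socket transport."""
    return "Socket" in nccl_transports(log_text)


# Two lines copied from the log in this issue.
sample = """\
ip-172-31-41-23:87:102 [0] NCCL INFO NET/IB : No device found.
ip-172-31-41-23:87:102 [0] NCCL INFO Using network Socket
"""
print(fell_back_to_sockets(sample))  # True
```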

dstack version

0.19.26

Server logs

Additional information

On AWS, H100:1 maps to p5.4xlarge, which supports EFA. This instance type was added only recently, and our Docker image patching logic doesn't handle it yet:

https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/server/background/tasks/process_running_jobs.py#L1130
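
To illustrate the kind of check involved, here is a hypothetical sketch. It is not the code in `process_running_jobs.py`. It assumes the image is chosen from an allowlist of EFA-capable instance types, which this issue does not confirm. The names `EFA_INSTANCE_TYPES` and `select_base_image` are made up, and the allowlist contents are illustrative only.

```python
# Image names as quoted in this issue.
BASE_IMAGE = "dstackai/base:0.10-base-ubuntu22.04"
EFA_IMAGE = "dstackai/base:0.10-devel-efa-ubuntu22.04"

# Assumed allowlist, for illustration only. Under this assumption the bug
# would amount to "p5.4xlarge" being absent from such a list.
EFA_INSTANCE_TYPES = {"p5.48xlarge", "p5.4xlarge"}


def select_base_image(backend: str, instance_type: str) -> str:
    """Pick the EFA-enabled image only for AWS instance types on the allowlist."""
    if backend == "aws" and instance_type in EFA_INSTANCE_TYPES:
        return EFA_IMAGE
    return BASE_IMAGE


print(select_base_image("aws", "p5.4xlarge"))
```

If the real logic uses patterns or instance-family prefixes instead of exact names, the same idea applies: the entry covering p5.4xlarge needs to be added.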
