Closed
Labels: bug (Something isn't working)
Description
Steps to reproduce
- Create a fleet:
type: fleet
name: gpu
nodes: 2
placement: cluster
backends: [aws]
resources:
  gpu: H100:1
- Run nccl-tests:
type: task
nodes: 2
working_dir: "."
startup_order: workers-first
stop_criteria: master-done
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: nvidia:1..8
  shm_size: 16GB
Actual behaviour
dstack automatically picks dstackai/base:0.10-devel-efa-ubuntu22.04 (EFA-enabled).
NCCL logs (no EFA initialized):
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 87 on ip-172-31-41-23 device 0 [0000:33:00] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 105 on ip-172-31-37-112 device 0 [0000:33:00] NVIDIA H100 80GB HBM3
ip-172-31-41-23:87:87 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
ip-172-31-41-23:87:87 [0] NCCL INFO Bootstrap: Using enp39s0:172.31.41.23<0>
ip-172-31-41-23:87:87 [0] NCCL INFO cudaDriverVersion 12080
ip-172-31-41-23:87:87 [0] NCCL INFO NCCL version 2.26.2+cuda12.1
ip-172-31-41-23:87:102 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
ip-172-31-41-23:87:102 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
ip-172-31-41-23:87:102 [0] NCCL INFO NET/IB : No device found.
ip-172-31-41-23:87:102 [0] NCCL INFO NET/IB : Using [RO]; OOB enp39s0:172.31.41.23<0>
ip-172-31-41-23:87:102 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker,lo
ip-172-31-41-23:87:102 [0] NCCL INFO NET/Socket : Using [0]enp39s0:172.31.41.23<0>
ip-172-31-41-23:87:102 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ip-172-31-41-23:87:102 [0] NCCL INFO Using network Socket
ip-172-31-41-23:87:102 [0] NCCL INFO ncclCommInitRank comm 0x57c72df01450 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 33000 commId 0x51d1ccf57958a510 - Init START
ip-172-31-41-23:87:102 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
ip-172-31-41-23:87:102 [0] NCCL INFO Bootstrap timings total 0.029221 (create 0.000033, send 0.000143, recv 0.009620, ring 0.009715, delay 0.000000)
ip-172-31-41-23:87:102 [0] NCCL INFO Setting affinity for GPU 0 to ffff
ip-172-31-41-23:87:102 [0] NCCL INFO comm 0x57c72df01450 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ip-172-31-41-23:87:102 [0] NCCL INFO Channel 00/02 : 0 1
ip-172-31-41-23:87:102 [0] NCCL INFO Channel 01/02 : 0 1
ip-172-31-41-23:87:102 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ip-172-31-41-23:87:102 [0] NCCL INFO P2P Chunksize set to 131072
ip-172-31-41-23:87:102 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
ip-172-31-41-23:87:104 [0] NCCL INFO [Proxy Service] Device 0 CPU core 11
ip-172-31-41-23:87:105 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 13
ip-172-31-41-23:87:102 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-172-31-41-23:87:102 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ip-172-31-41-23:87:102 [0] NCCL INFO CC Off, workFifoBytes 1048576
ip-172-31-41-23:87:102 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
ip-172-31-41-23:87:102 [0] NCCL INFO ncclCommInitRank comm 0x57c72df01450 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 33000 commId 0x51d1ccf57958a510 - Init COMPLETE
ip-172-31-41-23:87:102 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.36 (kernels 0.31, alloc 0.02, bootstrap 0.03, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-41-23:87:107 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 7
ip-172-31-41-23:87:104 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
ip-172-31-41-23:87:104 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
ip-172-31-41-23:87:106 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
ip-172-31-41-23:87:106 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 0
8 2 float sum -1 117.7 0.00 0.00 0 118.4 0.00 0.00 0
16 4 float sum -1 116.1 0.00 0.00 0 122.7 0.00 0.00 0
32 8 float sum -1 119.9 0.00 0.00 0 118.2 0.00 0.00 0
64 16 float sum -1 113.5 0.00 0.00 0 114.4 0.00 0.00 0
128 32 float sum -1 120.4 0.00 0.00 0 116.6 0.00 0.00 0
256 64 float sum -1 119.3 0.00 0.00 0 129.5 0.00 0.00 0
512 128 float sum -1 125.4 0.00 0.00 0 126.3 0.00 0.00 0
1024 256 float sum -1 124.7 0.01 0.01 0 119.0 0.01 0.01 0
2048 512 float sum -1 124.4 0.02 0.02 0 122.3 0.02 0.02 0
4096 1024 float sum -1 130.1 0.03 0.03 0 122.8 0.03 0.03 0
8192 2048 float sum -1 131.3 0.06 0.06 0 132.9 0.06 0.06 0
16384 4096 float sum -1 180.4 0.09 0.09 0 181.8 0.09 0.09 0
32768 8192 float sum -1 203.8 0.16 0.16 0 198.7 0.16 0.16 0
65536 16384 float sum -1 256.8 0.26 0.26 0 253.8 0.26 0.26 0
131072 32768 float sum -1 279.0 0.47 0.47 0 284.7 0.46 0.46 0
262144 65536 float sum -1 336.1 0.78 0.78 0 336.2 0.78 0.78 0
524288 131072 float sum -1 481.3 1.09 1.09 0 495.4 1.06 1.06 0
1048576 262144 float sum -1 643.2 1.63 1.63 0 650.7 1.61 1.61 0
2097152 524288 float sum -1 976.6 2.15 2.15 0 1016.0 2.06 2.06 0
4194304 1048576 float sum -1 3936.6 1.07 1.07 0 3521.7 1.19 1.19 0
8388608 2097152 float sum -1 5162.4 1.62 1.62 0 7949.8 1.06 1.06 0
16777216 4194304 float sum -1 13277 1.26 1.26 0 11353 1.48 1.48 0
33554432 8388608 float sum -1 36911 0.91 0.91 0 18357 1.83 1.83 0
67108864 16777216 float sum -1 74591 0.90 0.90 0 49076 1.37 1.37 0
134217728 33554432 float sum -1 95068 1.41 1.41 0 104888 1.28 1.28 0
268435456 67108864 float sum -1 156613 1.71 1.71 0 160618 1.67 1.67 0
536870912 134217728 float sum -1 303174 1.77 1.77 0 306972 1.75 1.75 0
1073741824 268435456 float sum -1 562061 1.91 1.91 0 545488 1.97 1.97 0
2147483648 536870912 float sum -1 1106222 1.94 1.94 0 1094698 1.96 1.96 0
4294967296 1073741824 float sum -1 2132099 2.01 2.01 0 2106053 2.04 2.04 0
8589934592 2147483648 float sum -1 4233864 2.03 2.03 0 4258548 2.02 2.02 0
ip-172-31-41-23:87:166 [0] NCCL INFO comm 0x57c72df01450 rank 0 nranks 2 cudaDev 0 busId 33000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.831064
#
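To check a run like this without reading the whole log, the transport NCCL selected can be pulled from its "Using network ..." line. The sketch below is only an illustration: the helper names `selected_networks` and `fell_back_to_sockets` are made up here and are not part of dstack or NCCL. It only detects the plain "Socket" fallback seen in the log above and makes no assumption about what an EFA-enabled run prints.

```python
import re

# Matches the line where NCCL announces its chosen transport,
# e.g. "... NCCL INFO Using network Socket".
_NETWORK_RE = re.compile(r"NCCL INFO Using network (.+?)\s*$")


def selected_networks(log_text: str) -> set[str]:
    """Return every transport name reported via 'Using network ...'."""
    found = set()
    for line in log_text.splitlines():
        match = _NETWORK_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


def fell_back_to_sockets(log_text: str) -> bool:
    """True if any rank reported the plain TCP 'Socket' transport."""
    return "Socket" in selected_networks(log_text)


# Three lines copied from the log in this report.
sample = "\n".join([
    "ip-172-31-41-23:87:102 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.",
    "ip-172-31-41-23:87:102 [0] NCCL INFO NET/IB : No device found.",
    "ip-172-31-41-23:87:102 [0] NCCL INFO Using network Socket",
])

print(selected_networks(sample))
print(fell_back_to_sockets(sample))
```

In a real run the input would be the full task log with NCCL_DEBUG=INFO set, as in the task configuration above.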
Expected behaviour
dstack picks dstackai/base:0.10-base-ubuntu22.04 (no EFA).
dstack version
0.19.26
Server logs
Additional information
H100:1 on AWS corresponds to p5.4xlarge, which supports EFA, but this instance type was added only recently and our Docker image patching logic doesn't support it yet.
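The expected behaviour can be sketched as a selection rule: use the EFA image only when the instance type is one the patching logic is known to handle, and fall back to the base image otherwise. This is a hypothetical illustration, not dstack's actual implementation. The two image names come from this report, while `pick_base_image`, `PATCHABLE_EFA_INSTANCE_TYPES`, and the contents of that set are assumptions made for the example.

```python
# Image names are taken from this report; everything else is hypothetical.
EFA_IMAGE = "dstackai/base:0.10-devel-efa-ubuntu22.04"
BASE_IMAGE = "dstackai/base:0.10-base-ubuntu22.04"

# Hypothetical allowlist of instance types the image patching logic handles.
# The real list lives in dstack and is not reproduced here.
PATCHABLE_EFA_INSTANCE_TYPES = frozenset({"p5.48xlarge"})


def pick_base_image(
    instance_type: str,
    patchable: frozenset = PATCHABLE_EFA_INSTANCE_TYPES,
) -> str:
    """Pick the EFA image only for instance types the patching logic supports."""
    if instance_type in patchable:
        return EFA_IMAGE
    # EFA-capable hardware alone is not enough: without patching support
    # the run should use the plain base image.
    return BASE_IMAGE


print(pick_base_image("p5.4xlarge"))
```

Keying the choice on patching support rather than on raw EFA capability means a newly added instance type gets the base image until support for it is added explicitly.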