Traceback (most recent call last):
  File "./comms.py", line 526, in <module>
    main()
  File "./comms.py", line 523, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "./comms.py", line 485, in runBench
    backendObj.benchmark_comms()
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 252, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "./comms.py", line 426, in benchTime
    comm_fn=collectiveFunc, compute_fn=computeFunc
  File "./comms.py", line 164, in runColl
    comm_fn(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 108, in all_to_all
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0
However, when I set NCCL_DEBUG=INFO, I see that I have an NCCL lib version >= 2.7.0:
ip-172-31-44-177:11401:11401 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-44-177:11404:11404 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.44.177<0>
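One way to cross-check this: the version NCCL_DEBUG prints is the runtime library's, which can differ from the NCCL version PyTorch itself was compiled against. A minimal sketch for inspecting the latter, assuming a CUDA-enabled PyTorch build (the `decode_nccl_version` helper is hypothetical, based on NCCL's documented integer encoding, e.g. 2.7.8 → 2708):

```python
def decode_nccl_version(code):
    """Decode NCCL's integer version code into (major, minor, patch).
    Releases before 2.9 encode major*1000 + minor*100 + patch (2.7.8 -> 2708);
    newer releases encode major*10000 + minor*100 + patch (2.10.3 -> 21003)."""
    if code < 10000:
        return (code // 1000, (code % 1000) // 100, code % 100)
    return (code // 10000, (code % 10000) // 100, code % 100)

try:
    import torch
    v = torch.cuda.nccl.version()  # int on older PyTorch, tuple on newer
    if isinstance(v, int):
        v = decode_nccl_version(v)
    print("PyTorch", torch.__version__, "was compiled against NCCL", v)
except Exception as exc:  # torch missing, or built without NCCL support
    print("Could not query PyTorch's NCCL version:", exc)
```

If this prints a version below 2.7.0, the error would be expected despite the 2.7.8 runtime library, since the `alltoall*` check in ProcessGroupNCCL is against the version PyTorch was built with.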
However, if I remove the --collective stanza altogether, it works.