
ProcessGroupNCCL alltoall error #13

Closed

amrragab8080 opened this issue Nov 2, 2020 · 3 comments

amrragab8080 commented Nov 2, 2020

Traceback (most recent call last):
  File "./comms.py", line 526, in <module>
    main()
  File "./comms.py", line 523, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "./comms.py", line 485, in runBench
    backendObj.benchmark_comms()
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 252, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "./comms.py", line 426, in benchTime
    comm_fn=collectiveFunc, compute_fn=computeFunc
  File "./comms.py", line 164, in runColl
    comm_fn(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 108, in all_to_all
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0

However, setting NCCL_DEBUG=INFO, I see that I have an NCCL lib version >= 2.7.0:

ip-172-31-44-177:11401:11401 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-44-177:11404:11404 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.44.177<0>

However, if I remove the --collective stanza altogether, it works.
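
For reference, a quick diagnostic sketch (my addition, not part of the benchmark) to print the NCCL version that the PyTorch build itself reports, which appears to be what the ProcessGroupNCCL check looks at rather than the runtime version printed by NCCL_DEBUG. Note the return format of torch.cuda.nccl.version() varies across PyTorch releases (an integer such as 2708 for NCCL 2.7.8 on older builds, a tuple on newer ones):

# Diagnostic sketch: show the NCCL version compiled into the PyTorch build
# and whether the NCCL backend is available to torch.distributed.
import torch
import torch.distributed as dist

print("torch:", torch.__version__)
# NCCL version PyTorch was built against (format differs by PyTorch release).
print("built-in NCCL:", torch.cuda.nccl.version())
print("NCCL backend available:", dist.is_nccl_available())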

shz0116 commented Nov 2, 2020

@amrragab8080 Could you tell us which platform you are testing on, and the exact test command? Thanks.

amrragab8080 commented Nov 3, 2020

I am specifically testing the collective-comms benchmark. I am using AWS p4d instances (A100 GPUs) with PyTorch 1.7.
My stack is built according to this: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline

/opt/amazon/openmpi/bin/mpirun -np 128 -N 8 --hostfile hostfile \
-x PATH -x LD_LIBRARY_PATH -x NCCL_ALGO=ring -x NCCL_DEBUG=info -x RDMAV_FORK_SAFE=1 \
-x FI_EFA_USE_DEVICE_RDMA=1 --mca pml ^cm --mca btl tcp,self \
--mca btl_tcp_if_exclude lo,docker0 --bind-to core --map-by numa \
 /home/ubuntu/param/train/comms/pt/comms.py --master-ip 172.31.76.84 \
--b 8 --e 8192M --n 100 \
 --f 2 --z 1 --collective all_to_all

I believe this is a PyTorch issue rather than a param/dlrm issue. It's how ProcessGroupNCCL checks the NCCL version.
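
In case it helps with triage, a hypothetical standalone repro (independent of param; the file name repro.py and the launch command are my assumptions) that exercises the same all_to_all_single code path would look roughly like this:

# repro.py - hypothetical minimal repro of the ProcessGroupNCCL alltoall check.
# Launch, for example, with:
#   python -m torch.distributed.launch --nproc_per_node=2 repro.py
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    world_size = dist.get_world_size()
    # One float destined for each rank in the job.
    inp = torch.arange(world_size, dtype=torch.float32, device="cuda")
    out = torch.empty_like(inp)

    # On affected PyTorch 1.7.0 builds this raises:
    # "ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0"
    dist.all_to_all_single(out, inp)
    print(f"rank {dist.get_rank()}: {out.tolist()}")

if __name__ == "__main__":
    main()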

amrragab8080 commented

Linked issue: pytorch/pytorch#47291

The all2all support will be merged in PyTorch 1.7.1.
