
ProcessGroupNCCL alltoall error #13

Closed

amrragab8080 opened this issue Nov 2, 2020 · 3 comments

amrragab8080 commented Nov 2, 2020

Traceback (most recent call last):
  File "./comms.py", line 526, in <module>
    main()
  File "./comms.py", line 523, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "./comms.py", line 485, in runBench
    backendObj.benchmark_comms()
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 252, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "./comms.py", line 426, in benchTime
    comm_fn=collectiveFunc, compute_fn=computeFunc
  File "./comms.py", line 164, in runColl
    comm_fn(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 108, in all_to_all
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0

However, setting NCCL_DEBUG=INFO, I see that I have an NCCL lib version >= 2.7.0:

ip-172-31-44-177:11401:11401 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-44-177:11404:11404 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.44.177<0>

However, if I remove the --collective stanza altogether, it works.
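
For reference, a quick diagnostic sketch (my addition, not part of the benchmark) to print the NCCL version that the PyTorch build itself reports, which appears to be what the ProcessGroupNCCL check looks at rather than the runtime version printed by NCCL_DEBUG. Note the return format of torch.cuda.nccl.version() varies across PyTorch releases (an integer such as 2708 for NCCL 2.7.8 on older builds, a tuple on newer ones):

# Diagnostic sketch: show the NCCL version compiled into the PyTorch build
# and whether the NCCL backend is available to torch.distributed.
import torch
import torch.distributed as dist

print("torch:", torch.__version__)
# NCCL version PyTorch was built against (format differs by PyTorch release).
print("built-in NCCL:", torch.cuda.nccl.version())
print("NCCL backend available:", dist.is_nccl_available())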

shz0116 commented Nov 2, 2020

@amrragab8080 Could you tell us which platform you are testing on, and the exact test command? Thanks.

amrragab8080 commented Nov 3, 2020

I am specifically testing the collective-comms benchmark. I am using AWS p4d instances (A100 GPUs) with PyTorch 1.7.
My stack is built according to this: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline

/opt/amazon/openmpi/bin/mpirun -np 128 -N 8 --hostfile hostfile \
-x PATH -x LD_LIBRARY_PATH -x NCCL_ALGO=ring -x NCCL_DEBUG=info -x RDMAV_FORK_SAFE=1 \
-x FI_EFA_USE_DEVICE_RDMA=1 --mca pml ^cm --mca btl tcp,self \
--mca btl_tcp_if_exclude lo,docker0 --bind-to core --map-by numa \
 /home/ubuntu/param/train/comms/pt/comms.py --master-ip 172.31.76.84 \
--b 8 --e 8192M --n 100 \
 --f 2 --z 1 --collective all_to_all

I believe this is a PyTorch issue rather than a param/dlrm issue. It's how ProcessGroupNCCL checks the NCCL version.
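
In case it helps with triage, a hypothetical standalone repro (independent of param; the file name repro.py and the launch command are my assumptions) that exercises the same all_to_all_single code path would look roughly like this:

# repro.py - hypothetical minimal repro of the ProcessGroupNCCL alltoall check.
# Launch, for example, with:
#   python -m torch.distributed.launch --nproc_per_node=2 repro.py
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    world_size = dist.get_world_size()
    # One float destined for each rank in the job.
    inp = torch.arange(world_size, dtype=torch.float32, device="cuda")
    out = torch.empty_like(inp)

    # On affected PyTorch 1.7.0 builds this raises:
    # "ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0"
    dist.all_to_all_single(out, inp)
    print(f"rank {dist.get_rank()}: {out.tolist()}")

if __name__ == "__main__":
    main()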

amrragab8080 commented

Linked issue: pytorch/pytorch#47291

The all2all support will be merged in PyTorch 1.7.1.
