[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument #24

Closed
eric-haibin-lin opened this issue Sep 13, 2019 · 7 comments

Comments

@eric-haibin-lin

I'm using Horovod with EFA, and the multi-node job hangs with the following output:

...
[1,26]<stdout>:ip-172-32-36-209:8716:9558 [2] NCCL INFO Ring 00 : 26[2] -> 30[6] via P2P/IPC
[1,25]<stdout>:ip-172-32-36-209:8715:9546 [1] NCCL INFO Ring 00 : 25[1] -> 27[3] via P2P/IPC
[1,29]<stdout>:ip-172-32-36-209:8719:9555 [5] NCCL INFO Ring 00 : 29[5] -> 31[7] via P2P/IPC
[1,28]<stdout>:ip-172-32-36-209:8718:9563 [4] NCCL INFO Ring 00 : 28[4] -> 29[5] via P2P/IPC
[1,27]<stdout>:ip-172-32-36-209:8717:9550 [3] NCCL INFO Ring 00 : 27[3] -> 26[2] via P2P/IPC
[1,7]<stdout>:ip-172-32-38-1:21471:22316 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/AWS Libfabric/0
[1,15]<stdout>:ip-172-32-34-121:70560:71394 [7] NCCL INFO Ring 00 : 15 -> 16 [send] via NET/AWS Libfabric/0
[1,30]<stdout>:ip-172-32-36-209:8720:9567 [6] NCCL INFO Ring 00 : 30[6] -> 28[4] via P2P/IPC
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO Ring 00 : 23 -> 24 [send] via NET/AWS Libfabric/0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO NET/OFI No NIC info for dev 0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO include/net.h:24 -> 2
[1,23]<stdout>:
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO include/net.h:27 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO transport/net.cc:357 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:668 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:814 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:950 -> 2
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO Ring 00 : 31 -> 0 [send] via NET/AWS Libfabric/0
...
ubuntu@ip-172-32-38-1:~$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
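
In an interleaved multi-rank log like the one above, it can help to first narrow down which hosts actually emitted the `NCCL WARN` line. The following is a minimal sketch, not part of the original report: the function name `nccl_warn_hosts` is hypothetical, and it assumes each log line starts with the hostname, optionally preceded by a rank tag of the form `[1,23]<stdout>:` as seen in the log.

```shell
# Print the unique hostnames that emitted an "NCCL WARN" line.
# Reads the job log on stdin. The optional "[job,rank]<stdout>:" tag
# at the start of a line is stripped before the hostname is extracted.
# (Hypothetical helper; the line layout is assumed from the log above.)
nccl_warn_hosts() {
  grep 'NCCL WARN' \
    | sed -E 's/^\[[0-9]+,[0-9]+\]<std(out|err)>://; s/:.*$//' \
    | sort -u
}
```

For example, `nccl_warn_hosts < efa.log` would list the affected hosts one per line, assuming the job output was saved to `efa.log`.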
eric-haibin-lin changed the title from "[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument" to "[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument" on Sep 13, 2019
@AddyLaddy
Contributor

Does your security group include a rule allowing All Traffic to and from itself, both inbound and outbound?

@rashikakheria
Contributor

In addition to what David asked, could you provide the following information as well?

  1. Complete log of your run.
  2. EFA installer version. You can find this using
cat /opt/amazon/efa_installed_packages
  3. Which commit of aws-ofi-nccl are you using to run the tests? Is it the latest commit on the aws branch?

@eric-haibin-lin
Author

Thanks for the quick reply.

Does your security group include a rule for All Traffic for itself Inbound & Outbound ?

Yes.

Complete log of your run.

efa.log

EFA installer version. You can find this using

# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64

I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complained about some missing .so files.
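
The installer summary quoted above starts with a `# EFA installer version:` line. As a rough sketch for pulling out just that value (for instance, to compare it across nodes), something like the following could be used. The function name `efa_installer_version` is hypothetical, and the file layout is assumed to match the output shown in this thread.

```shell
# Print the EFA installer version recorded in the installer summary file.
# Defaults to the path mentioned in this thread; pass another path to
# override it. (Hypothetical helper; file layout assumed from the output
# quoted above.)
efa_installer_version() {
  local file="${1:-/opt/amazon/efa_installed_packages}"
  sed -n 's/^# EFA installer version: *//p' "$file"
}
```

Running it on each node and comparing the results is one way to confirm that all nodes were set up with the same installer.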

@rashikakheria
Contributor

rashikakheria commented Sep 13, 2019

Could you confirm the commit that you are using for aws-ofi-nccl? Please use the latest aws branch of the plugin when running on EC2 infrastructure.

Also, does host "ip-172-32-36-209" have EFA installed?
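
One way to answer that question is to run `fi_info -p efa` on the host (the same command used earlier in this thread) and check whether an `efa` provider is listed. The sketch below is an illustration only: the helper name `has_efa_provider` is hypothetical, and it assumes the `provider:` line format shown in the `fi_info` output above.

```shell
# Succeed (exit 0) if the fi_info output on stdin lists a provider named
# "efa", or a layered provider that starts with "efa;" such as
# "efa;ofi_rxd". (Hypothetical helper; line format assumed from the
# fi_info output earlier in this thread.)
has_efa_provider() {
  grep -Eq '^provider: efa(;|$)'
}
```

Assuming SSH access to the node, it could be used as `ssh ip-172-32-36-209 fi_info -p efa | has_efa_provider && echo "EFA provider found"`.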

@apeforest

I got the same error while running nccl-tests:

~/anaconda3/bin/mpirun \
        -x FI_PROVIDER="efa" \
        -x FI_EFA_TX_MIN_CREDITS=64 \
        -x LD_LIBRARY_PATH=$HOME/drivers/aws-ofi-nccl/install/lib/:$HOME/drivers/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
        -x NCCL_DEBUG=WARN -x NCCL_TREE_THRESHOLD=0 --hostfile $HOME/hosts -n 16 -N 8 \
        --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        $HOME/drivers/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

Running on a single node is fine. The error only occurs when I run on multiple nodes.
I am following the Quip doc "Running nccl-tests on AWS EC2" and got the following error:

[ec2-user@ip-172-31-10-20 nccl-tests]$ ./run-test.sh  | tee test.log
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
#   Rank  0 Pid  23389 on ip-172-31-10-20 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  23390 on ip-172-31-10-20 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  23391 on ip-172-31-10-20 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  23392 on ip-172-31-10-20 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid  23393 on ip-172-31-10-20 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid  23394 on ip-172-31-10-20 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid  23395 on ip-172-31-10-20 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid  23396 on ip-172-31-10-20 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid  10508 on ip-172-31-1-59 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid  10509 on ip-172-31-1-59 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid  10510 on ip-172-31-1-59 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid  10511 on ip-172-31-1-59 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid  10512 on ip-172-31-1-59 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid  10513 on ip-172-31-1-59 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid  10514 on ip-172-31-1-59 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid  10515 on ip-172-31-1-59 device  7 [0x00] Tesla V100-SXM2-32GB
NCCL version 2.4.6+cuda10.0

ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'

ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42896,1],8]
  Exit code:    3

@rashikakheria
Contributor

This happens if you are using the master branch of the plugin. Please use the aws branch when working with EFA.
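
To check which branch a local plugin checkout is on, a small sketch along these lines may help. The function names are hypothetical. It reads `.git/HEAD` directly, so it assumes a regular clone where `.git` is a directory, and it only reports the branch of the source tree, not which branch the installed library was built from.

```shell
# Print the branch a git checkout is on by reading .git/HEAD directly.
# Prints nothing for a detached HEAD. Assumes a regular clone where
# .git is a directory (not a worktree or submodule).
checkout_branch() {
  sed -n 's|^ref: refs/heads/||p' "$1/.git/HEAD"
}

# Return 0 only if the checkout in $1 is on the branch named in $2.
on_branch() {
  [ "$(checkout_branch "$1")" = "$2" ]
}
```

With the checkout path from the `mpirun` command above, this could be used as `on_branch "$HOME/drivers/aws-ofi-nccl" aws || echo "plugin checkout is not on the aws branch"`. If the branch is wrong, the plugin would need to be switched to the aws branch and rebuilt.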

@rashikakheria
Contributor

Please re-open if you see the issue again.

4 participants