[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument #24
Comments
Does your security group include an All Traffic rule that references itself, for both inbound and outbound?
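For anyone checking this, here is a minimal Python sketch of the self-referencing rule test. It assumes the security group is a dict shaped like one entry of the `SecurityGroups` list that boto3's `describe_security_groups` returns. The function name `has_self_referencing_all_traffic` and the sample group ID are hypothetical, used only for illustration.

```python
def has_self_referencing_all_traffic(group: dict) -> bool:
    """Return True if the group allows all traffic to and from itself.

    `group` is assumed to look like one entry of the "SecurityGroups"
    list in a describe_security_groups response.
    """
    group_id = group["GroupId"]

    def allows_self(permissions: list) -> bool:
        for perm in permissions:
            # IpProtocol "-1" means all protocols and all ports.
            if perm.get("IpProtocol") != "-1":
                continue
            pairs = perm.get("UserIdGroupPairs", [])
            if any(pair.get("GroupId") == group_id for pair in pairs):
                return True
        return False

    inbound_ok = allows_self(group.get("IpPermissions", []))
    outbound_ok = allows_self(group.get("IpPermissionsEgress", []))
    return inbound_ok and outbound_ok


# Hypothetical sample data; the group ID is made up.
self_rule = {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": "sg-0123example"}]}
example_group = {
    "GroupId": "sg-0123example",
    "IpPermissions": [self_rule],
    "IpPermissionsEgress": [self_rule],
}
print(has_self_referencing_all_traffic(example_group))
```

A group that only has the inbound self-reference, or an outbound rule to `0.0.0.0/0` with no rule referencing the group itself, fails this check.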
In addition to what David asked, could you provide the following information as well?
Thanks for the quick reply.
Yes.
I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complained about some missing .so files.
Could you confirm the commit that you are using for aws-ofi-nccl? Please use the latest
Also, does host "ip-172-32-36-209" have EFA installed?
I got the same error while running nccl-tests.
Running on a single node is fine. The error only occurs when I run on multiple nodes.
This would happen if you are using the
Please re-open if you see the issue again.
I'm using horovod with EFA, and the multi-node job hangs with