Propagate "Invalid address" to NCCL communicator #346

Closed
vmarkovtsev opened this issue Feb 21, 2024 · 0 comments · Fixed by #347

Comments

@vmarkovtsev

We are using EFA to do asynchronous ncclAllReduce over 120 ranks. Every once in a while, roughly every few hours, the operation breaks. We poll for NCCL errors by first checking the status of the scheduled stream and, if it equals cudaErrorNotReady, then checking the communicator for errors with ncclCommGetAsyncError. Under circumstances that are not yet fully understood (we are investigating together with Amazon support), we see the following log under NCCL_DEBUG=TRACE, FI_LOG_LEVEL=1:

    libfabric:451707:1708520807::efa:cq:efa_rdm_txe_handle_error():737<warn> err: 5, message: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c (7)
     8[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 00/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 01/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 02/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 03/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453422 [4] 222743.762603 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222746.623005 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222767.773259 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222780.007305 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Connected all trees
    ip-172-31-69-127:451707:453373 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
    ip-172-31-69-127:451707:453373 [4] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    ip-172-31-69-127:451707:453373 [4] NCCL INFO comm 0x55dcf7b0a630 rank 0 nranks 12 cudaDev 4 nvmlDev 4 busId 97000 commId 0xe25512ad6c40b2e2 - Init COMPLETE
    ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453404 [4] 235242.807784 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453404 [4] 235245.883529 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453391 [4] 294782.055802 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453391 [4] 294785.363301 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
  
    ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae859a0 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }
  
    ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae86e00 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }

Afterward, the NCCL communicator never reports an error through ncclCommGetAsyncError, so we hang in our error-polling loop. It would be very useful if the plugin propagated that error to the communicator so that we could recreate it.
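
For reference, here is a minimal sketch of that polling pattern (single communicator and stream, error handling trimmed; the setup around it is assumed):

    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdio.h>

    /* Returns 0 once the collective finished, -1 on any error. */
    static int poll_allreduce(ncclComm_t comm, cudaStream_t stream) {
        for (;;) {
            cudaError_t st = cudaStreamQuery(stream);
            if (st == cudaSuccess)
                return 0;                 /* collective finished */
            if (st != cudaErrorNotReady)
                return -1;                /* real CUDA failure */

            /* Stream still busy: check for asynchronous NCCL errors. */
            ncclResult_t async = ncclSuccess;
            if (ncclCommGetAsyncError(comm, &async) != ncclSuccess)
                return -1;
            /* ncclInProgress requires NCCL >= 2.14 (nonblocking API). */
            if (async != ncclSuccess && async != ncclInProgress) {
                fprintf(stderr, "NCCL async error: %s\n",
                        ncclGetErrorString(async));
                return -1;                /* caller aborts and recreates */
            }
            /* Without the error being propagated to the communicator, a
             * fatal EFA completion error never shows up here and this
             * loop spins forever. */
        }
    }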

The same code handles errors correctly on other, non-Amazon clusters with InfiniBand: when something bad happens to the InfiniBand fabric, we pick up the errors polled from the communicator and restart.
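
On those clusters the restart amounts to aborting the broken communicator and creating a fresh one. A hedged sketch, assuming an application-specific out-of-band broadcast of the new ncclUniqueId (exchange_unique_id below is hypothetical):

    #include <nccl.h>

    /* Hypothetical out-of-band broadcast of the unique id from rank 0
     * to all ranks (e.g. over TCP or MPI); application-specific. */
    extern void exchange_unique_id(ncclUniqueId *id, int rank);

    static ncclResult_t restart_comm(ncclComm_t *comm, int nranks, int rank) {
        ncclCommAbort(*comm);            /* tear down the failed communicator */

        ncclUniqueId id;
        if (rank == 0)
            ncclGetUniqueId(&id);        /* rank 0 generates a fresh id */
        exchange_unique_id(&id, rank);   /* hypothetical broadcast */

        return ncclCommInitRank(comm, nranks, id, rank);
    }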

rajachan added a commit to rajachan/aws-ofi-nccl that referenced this issue Feb 21, 2024
When progressing completions, we could get a completion error for a
request before NCCL gets to calling test() explicitly for that request.
Since NCCL tests for completions in order, this can lead to hangs when
there are non-recoverable failures in the network and NCCL never
receives a successful completion for the earliest request. With this
change, completion errors are always passed up the stack so NCCL can
abort the job and fail gracefully where possible.

This logic could be further enhanced using provider-specific
information from the completion error entry to distinguish fatal
errors from recoverable user errors, but that would not be portable.

Fixes aws#346

Signed-off-by: Raghu Raja <raghunch@amazon.com>
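
In essence, the change saves the completion-queue error entry on the owning request so that a later test() call reports the failure instead of "not done" forever. The sketch below is illustrative only; the type and function names are hypothetical stand-ins, not the plugin's actual symbols:

    #include <rdma/fabric.h>
    #include <rdma/fi_cq.h>

    typedef enum { REQ_PENDING, REQ_COMPLETED, REQ_ERROR } req_state_t;

    typedef struct {
        req_state_t state;
        int         error;       /* saved provider error code, e.g. 5 */
    } ofi_req_t;

    /* Called while progressing the completion queue. */
    static void handle_cq_error(struct fid_cq *cq) {
        struct fi_cq_err_entry err = { 0 };
        if (fi_cq_readerr(cq, &err, 0) > 0) {
            ofi_req_t *req = (ofi_req_t *)err.op_context;
            req->state = REQ_ERROR;      /* remember the failure...       */
            req->error = err.err;        /* ...instead of only logging it */
        }
    }

    /* test() hook: done=1 plus a nonzero return tells the caller the
     * request failed, so NCCL can abort instead of hanging. */
    static int test_req(ofi_req_t *req, int *done) {
        switch (req->state) {
        case REQ_COMPLETED: *done = 1; return 0;
        case REQ_ERROR:     *done = 1; return req->error;
        default:            *done = 0; return 0;
        }
    }
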
rajachan added a commit that referenced this issue Feb 22, 2024
When progressing completions, we could get a completion error for a
request before NCCL gets to calling test() explicitly for that request.
Since NCCL tests for completions in order, this can lead to hangs when
there are non-recoverable failures in the network and NCCL never
receives a successful completion for the earliest request. With this
change, completion errors are always passed up the stack so NCCL can
abort the job and fail gracefully where possible.

This logic could be further enhanced using provider-specific
information from the completion error entry to distinguish fatal
errors from recoverable user errors, but that would not be portable.

Fixes #346

Signed-off-by: Raghu Raja <raghunch@amazon.com>
rajachan added a commit that referenced this issue Feb 23, 2024
When progressing completions, we could get a completion error for a
request before NCCL gets to calling test() explicitly for that request.
Since NCCL tests for completions in order, this can lead to hangs when
there are non-recoverable failures in the network and NCCL never
receives a successful completion for the earliest request. With this
change, completion errors are always passed up the stack so NCCL can
abort the job and fail gracefully where possible.

This logic could be further enhanced using provider-specific
information from the completion error entry to distinguish fatal
errors from recoverable user errors, but that would not be portable.

Fixes #346

Signed-off-by: Raghu Raja <raghunch@amazon.com>
(cherry picked from commit 5aac4dc)