Propagate "Invalid address" to NCCL communicator #346

Closed
vmarkovtsev opened this issue Feb 21, 2024 · 0 comments · Fixed by #347

Comments

@vmarkovtsev

We are using EFA to do asynchronous ncclAllReduce over 120 ranks. Every once in a while, roughly every few hours, the operation breaks. We poll for NCCL errors by first checking the status of the scheduled stream and, if it equals cudaErrorNotReady, then checking the communicator for errors with ncclCommGetAsyncError. Under circumstances that are not yet fully understood (we are investigating together with Amazon support), we see the following log under NCCL_DEBUG=TRACE, FI_LOG_LEVEL=1:

    libfabric:451707:1708520807::efa:cq:efa_rdm_txe_handle_error():737<warn> err: 5, message: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c (7)
     8[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 00/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 01/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 02/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 03/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453422 [4] 222743.762603 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222746.623005 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222767.773259 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222780.007305 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Connected all trees
    ip-172-31-69-127:451707:453373 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
    ip-172-31-69-127:451707:453373 [4] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    ip-172-31-69-127:451707:453373 [4] NCCL INFO comm 0x55dcf7b0a630 rank 0 nranks 12 cudaDev 4 nvmlDev 4 busId 97000 commId 0xe25512ad6c40b2e2 - Init COMPLETE
    ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453404 [4] 235242.807784 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453404 [4] 235245.883529 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453391 [4] 294782.055802 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453391 [4] 294785.363301 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
  
    ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae859a0 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }
  
    ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae86e00 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }

Afterward, the NCCL communicator never reports an error through ncclCommGetAsyncError, so we hang in our error-polling loop. It would be very useful if the plugin propagated that error to the communicator so that we could recreate it.
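
For reference, here is a minimal sketch of that polling pattern (single communicator and stream, error handling trimmed; the setup around it is assumed):

    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdio.h>

    /* Returns 0 once the collective finished, -1 on any error. */
    static int poll_allreduce(ncclComm_t comm, cudaStream_t stream) {
        for (;;) {
            cudaError_t st = cudaStreamQuery(stream);
            if (st == cudaSuccess)
                return 0;                 /* collective finished */
            if (st != cudaErrorNotReady)
                return -1;                /* real CUDA failure */

            /* Stream still busy: check for asynchronous NCCL errors. */
            ncclResult_t async = ncclSuccess;
            if (ncclCommGetAsyncError(comm, &async) != ncclSuccess)
                return -1;
            /* ncclInProgress requires NCCL >= 2.14 (nonblocking API). */
            if (async != ncclSuccess && async != ncclInProgress) {
                fprintf(stderr, "NCCL async error: %s\n",
                        ncclGetErrorString(async));
                return -1;                /* caller aborts and recreates */
            }
            /* Without the error being propagated to the communicator, a
             * fatal EFA completion error never shows up here and this
             * loop spins forever. */
        }
    }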

The same code handles errors correctly on other, non-Amazon clusters with InfiniBand: when something bad happens to the InfiniBand fabric, we pick up the errors polled from the communicator and restart.
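
On those clusters the restart amounts to aborting the broken communicator and creating a fresh one. A hedged sketch, assuming an application-specific out-of-band broadcast of the new ncclUniqueId (exchange_unique_id below is hypothetical):

    #include <nccl.h>

    /* Hypothetical out-of-band broadcast of the unique id from rank 0
     * to all ranks (e.g. over TCP or MPI); application-specific. */
    extern void exchange_unique_id(ncclUniqueId *id, int rank);

    static ncclResult_t restart_comm(ncclComm_t *comm, int nranks, int rank) {
        ncclCommAbort(*comm);            /* tear down the failed communicator */

        ncclUniqueId id;
        if (rank == 0)
            ncclGetUniqueId(&id);        /* rank 0 generates a fresh id */
        exchange_unique_id(&id, rank);   /* hypothetical broadcast */

        return ncclCommInitRank(comm, nranks, id, rank);
    }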

rajachan added a commit to rajachan/aws-ofi-nccl that referenced this issue Feb 21, 2024
When progressing completions, we could get a completion error for a
request before NCCL gets to calling test() explicitly for that request.
Since NCCL tests for completions in order, this can lead to hangs when
there are non-recoverable failures in the network and NCCL never
receives a successful completion for the earliest request. With this
change, completion errors are always passed up the stack so NCCL can
abort the job and fail gracefully where possible.

This logic could be further enhanced using provider-specific
information from the completion error entry to distinguish fatal
errors from recoverable user errors, but that would not be portable.

Fixes aws#346

Signed-off-by: Raghu Raja <raghunch@amazon.com>
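
In essence, the change saves the completion-queue error entry on the owning request so that a later test() call reports the failure instead of "not done" forever. The sketch below is illustrative only; the type and function names are hypothetical stand-ins, not the plugin's actual symbols:

    #include <rdma/fabric.h>
    #include <rdma/fi_cq.h>

    typedef enum { REQ_PENDING, REQ_COMPLETED, REQ_ERROR } req_state_t;

    typedef struct {
        req_state_t state;
        int         error;       /* saved provider error code, e.g. 5 */
    } ofi_req_t;

    /* Called while progressing the completion queue. */
    static void handle_cq_error(struct fid_cq *cq) {
        struct fi_cq_err_entry err = { 0 };
        if (fi_cq_readerr(cq, &err, 0) > 0) {
            ofi_req_t *req = (ofi_req_t *)err.op_context;
            req->state = REQ_ERROR;      /* remember the failure...       */
            req->error = err.err;        /* ...instead of only logging it */
        }
    }

    /* test() hook: done=1 plus a nonzero return tells the caller the
     * request failed, so NCCL can abort instead of hanging. */
    static int test_req(ofi_req_t *req, int *done) {
        switch (req->state) {
        case REQ_COMPLETED: *done = 1; return 0;
        case REQ_ERROR:     *done = 1; return req->error;
        default:            *done = 0; return 0;
        }
    }
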
rajachan added a commit that referenced this issue Feb 22, 2024
When progressing completions, we could get a completion error for a
request before NCCL gets to calling test() explicitly for that request.
Since NCCL tests for completions in order, this can lead to hangs when
there are non-recoverable failures in the network and NCCL never
receives a successful completion for the earliest request. With this
change, completion errors are always passed up the stack so NCCL can
abort the job and fail gracefully where possible.

This logic could be further enhanced using provider-specific
information from the completion error entry to distinguish fatal
errors from recoverable user errors, but that would not be portable.

Fixes #346

Signed-off-by: Raghu Raja <raghunch@amazon.com>
rajachan added a commit that referenced this issue Feb 23, 2024
When progressing completions, we could get a completion error for a
request before NCCL gets to calling test() explicitly for that request.
Since NCCL tests for completions in order, this can lead to hangs when
there are non-recoverable failures in the network and NCCL never
receives a successful completion for the earliest request. With this
change, completion errors are always passed up the stack so NCCL can
abort the job and fail gracefully where possible.

This logic could be further enhanced using provider-specific
information from the completion error entry to distinguish fatal
errors from recoverable user errors, but that would not be portable.

Fixes #346

Signed-off-by: Raghu Raja <raghunch@amazon.com>
(cherry picked from commit 5aac4dc)