Propagate "Invalid address" to NCCL communicator #346
Comments
rajachan added a commit to rajachan/aws-ofi-nccl that referenced this issue on Feb 21, 2024
When progressing completions, we could get a completion error for a request before NCCL gets to calling test() explicitly for that request. Since NCCL tests for completions in order, this can lead to hangs when there are non-recoverable failures in the network and NCCL never receives a successful completion for the earliest request. With this change, completion errors are always passed up the stack so NCCL can abort the job and fail gracefully where possible. This logic can further be enhanced based on provider-specific information from completion error entry to distinguish between fatal errors vs recoverable user errors, but that would not be portable. Fixes aws#346 Signed-off-by: Raghu Raja <raghunch@amazon.com>
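The behavior described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the plugin's code: `Endpoint`, `Request`, `Completion`, `Status`, and `push_completion` are hypothetical names, and the "sticky" failure flag is one simple way to model "completion errors are always passed up the stack". The point it shows is that a `test()` call on an earlier request reports an error that arrived for a later one, instead of reporting "not done yet" forever.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-ins for illustration; not the plugin's real types.
enum class Status { Ok, Error };

struct Request {
    bool done = false;
};

struct Completion {
    std::size_t request_id;  // which request this completion belongs to
    bool failed;             // true models a completion error entry
};

class Endpoint {
public:
    explicit Endpoint(std::size_t num_requests) : requests_(num_requests) {}

    // Simulates the network delivering a completion to be processed later.
    void push_completion(Completion c) { pending_.push_back(c); }

    // Models test(): progress all pending completions, then report on `id`.
    Status test(std::size_t id, bool* done) {
        assert(id < requests_.size() && done != nullptr);
        progress();
        if (failed_) {
            // A completion error for ANY request is surfaced to the caller,
            // even if the caller is testing a different (earlier) request.
            *done = false;
            return Status::Error;
        }
        *done = requests_[id].done;
        return Status::Ok;
    }

private:
    void progress() {
        while (!pending_.empty()) {
            const Completion c = pending_.front();
            pending_.pop_front();
            if (c.failed) {
                failed_ = true;  // sticky: remembered across test() calls
            } else if (c.request_id < requests_.size()) {
                requests_[c.request_id].done = true;
            }
        }
    }

    std::vector<Request> requests_;
    std::deque<Completion> pending_;
    bool failed_ = false;
};
```

For example, with two outstanding requests, pushing a failed completion for request 1 and then calling `test(0, &done)` returns `Status::Error`. Without the sticky flag, that call would keep returning `Status::Ok` with `done == false`, which is the hang described above.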
rajachan added a commit to rajachan/aws-ofi-nccl that referenced this issue on Feb 22, 2024
When progressing completions, we could get a completion error for a request before NCCL gets to calling test() explicitly for that request. Since NCCL tests for completions in order, this can lead to hangs when there are non-recoverable failures in the network and NCCL never receives a successful completion for the earliest request. With this change, completion errors are always passed up the stack so NCCL can abort the job and fail gracefully where possible. This logic can further be enhanced based on provider-specific information from completion error entry to distinguish between fatal errors vs recoverable user errors, but that would not be portable. Fixes aws#346 Signed-off-by: Raghu Raja <raghunch@amazon.com>
rajachan added a commit that referenced this issue on Feb 22, 2024
When progressing completions, we could get a completion error for a request before NCCL gets to calling test() explicitly for that request. Since NCCL tests for completions in order, this can lead to hangs when there are non-recoverable failures in the network and NCCL never receives a successful completion for the earliest request. With this change, completion errors are always passed up the stack so NCCL can abort the job and fail gracefully where possible. This logic can further be enhanced based on provider-specific information from completion error entry to distinguish between fatal errors vs recoverable user errors, but that would not be portable. Fixes #346 Signed-off-by: Raghu Raja <raghunch@amazon.com>
rajachan added a commit to rajachan/aws-ofi-nccl that referenced this issue on Feb 23, 2024
When progressing completions, we could get a completion error for a request before NCCL gets to calling test() explicitly for that request. Since NCCL tests for completions in order, this can lead to hangs when there are non-recoverable failures in the network and NCCL never receives a successful completion for the earliest request. With this change, completion errors are always passed up the stack so NCCL can abort the job and fail gracefully where possible. This logic can further be enhanced based on provider-specific information from completion error entry to distinguish between fatal errors vs recoverable user errors, but that would not be portable. Fixes aws#346 Signed-off-by: Raghu Raja <raghunch@amazon.com> (cherry picked from commit 5aac4dc)
rajachan added a commit that referenced this issue on Feb 23, 2024
When progressing completions, we could get a completion error for a request before NCCL gets to calling test() explicitly for that request. Since NCCL tests for completions in order, this can lead to hangs when there are non-recoverable failures in the network and NCCL never receives a successful completion for the earliest request. With this change, completion errors are always passed up the stack so NCCL can abort the job and fail gracefully where possible. This logic can further be enhanced based on provider-specific information from completion error entry to distinguish between fatal errors vs recoverable user errors, but that would not be portable. Fixes #346 Signed-off-by: Raghu Raja <raghunch@amazon.com> (cherry picked from commit 5aac4dc)
We are using EFA to run asynchronous `ncclAllReduce` over 120 ranks. Every once in a while, roughly every few hours, the operation breaks. We poll for NCCL errors by first checking the status of the stream the operation was scheduled on; if it is `cudaErrorNotReady`, we then check the communicator for errors by calling `ncclCommGetAsyncError`. Under circumstances that are not yet fully understood (we are investigating together with Amazon support), we see the following log under `NCCL_DEBUG=TRACE`, `FI_LOG_LEVEL=1`:

The NCCL communicator doesn't return errors from `ncclCommGetAsyncError` afterward, and we hang in our error polling loop. It would be very useful if the plugin propagated that error to the communicator so that we could recreate it.

The same code correctly handles errors on other, non-Amazon clusters with InfiniBand. When something bad happens to InfiniBand, we handle the errors polled from the communicator and restart.
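For readers unfamiliar with the polling pattern described above, here is a minimal sketch of its shape. It is not our actual code. To stay self-contained it takes two callables instead of calling CUDA and NCCL directly: in real use `stream_ready` would wrap `cudaStreamQuery` (returning false on `cudaErrorNotReady`) and `comm_failed` would wrap `ncclCommGetAsyncError` (returning true when the reported result is not `ncclSuccess`). The names `poll_collective` and `PollResult`, and the timeout itself, are assumptions added for illustration; the timeout is a watchdog for exactly the case in this issue, where the error never reaches the communicator.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result type for this sketch.
enum class PollResult { Completed, CommError, TimedOut };

// Polls until the stream finishes, the communicator reports an error,
// or the deadline passes. Simplification: a real wrapper around
// cudaStreamQuery should also handle CUDA errors other than
// cudaErrorNotReady, which a plain bool cannot express.
PollResult poll_collective(
    const std::function<bool()>& stream_ready,
    const std::function<bool()>& comm_failed,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        // Step 1: has the scheduled work on the stream completed?
        if (stream_ready()) {
            return PollResult::Completed;
        }
        // Step 2: stream not ready, so ask the communicator for async errors.
        if (comm_failed()) {
            return PollResult::CommError;
        }
        // Step 3 (assumption, not in the original loop): give up after a
        // deadline so a never-reported network failure cannot hang forever.
        if (std::chrono::steady_clock::now() >= deadline) {
            return PollResult::TimedOut;
        }
        std::this_thread::sleep_for(interval);
    }
}
```

On `CommError` or `TimedOut`, the caller would typically abort the communicator (NCCL provides `ncclCommAbort`) and recreate it. Choosing a safe timeout is workload-specific, which is why error propagation from the plugin is the preferable fix.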