Hello aws_ofi_nccl maintainers,
During its GDR capability check, NCCL calls ofi_listen, ofi_connect and ofi_accept sequentially, in that order.
ofi_connect issues a non-blocking fi_tsend to send a small buffer of data to the recipient.
It then waits to confirm that the data was actually transferred.
The underlying provider detects that the caller is talking to itself.
If the provider implements rendezvous mode for self-communication, the send is considered complete only once the caller has actually received the buffer, i.e. ofi_accept must be called before ofi_connect can finish. At the same time, ofi_accept is considered complete only if the data is already there when it is called. The result is a deadlock.
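To make the ordering problem concrete, here is a minimal sketch in plain C. It is a toy model, not libfabric or aws_ofi_nccl code: `toy_endpoint`, `toy_post_send`, `toy_post_recv`, `toy_progress` and both handshake functions are hypothetical names invented for illustration. It assumes a provider whose self-send completes only after a matching receive has been posted, and it uses a bounded poll count in place of the endless wait described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one endpoint talking to itself (hypothetical, for illustration). */
struct toy_endpoint {
    bool send_posted; /* a non-blocking send is outstanding */
    bool recv_posted; /* a matching receive buffer has been posted */
    bool send_done;   /* send completion has been reported */
    bool recv_done;   /* receive completion has been reported */
};

static void toy_post_send(struct toy_endpoint *ep) { ep->send_posted = true; }
static void toy_post_recv(struct toy_endpoint *ep) { ep->recv_posted = true; }

/* Assumed rendezvous rule: nothing completes until both sides are posted. */
static void toy_progress(struct toy_endpoint *ep)
{
    if (ep->send_posted && ep->recv_posted) {
        ep->send_done = true;
        ep->recv_done = true;
    }
}

/* Order from the report: connect waits for its send to complete
 * before accept gets a chance to post the receive.
 * Returns true if the handshake completes within max_polls polls. */
bool connect_waits_before_accept(int max_polls)
{
    struct toy_endpoint ep = {0};
    assert(max_polls > 0);

    toy_post_send(&ep);                 /* "connect": non-blocking send */
    for (int i = 0; i < max_polls && !ep.send_done; i++)
        toy_progress(&ep);              /* wait for send completion */
    if (!ep.send_done)
        return false;                   /* an unbounded wait would hang here */

    toy_post_recv(&ep);                 /* "accept": never reached in this model */
    toy_progress(&ep);
    return ep.recv_done;
}

/* One possible alternative: connect only posts the send, and completion
 * is checked after accept has posted the receive. */
bool connect_defers_completion(int max_polls)
{
    struct toy_endpoint ep = {0};
    assert(max_polls > 0);

    toy_post_send(&ep);                 /* "connect": post and return */
    toy_post_recv(&ep);                 /* "accept": post the receive */
    for (int i = 0; i < max_polls; i++) {
        toy_progress(&ep);
        if (ep.send_done && ep.recv_done)
            return true;
    }
    return false;
}
```

In this model `connect_waits_before_accept` never completes, however many polls it is given, while `connect_defers_completion` completes on the first poll. Deferring the completion check is only one conceivable way to break the cycle and is not necessarily what the pull request does.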
To fix this, I created the following pull request: #84
Please consider merging it.
BRs,
Denis