Deadlock for rendezvous providers #85

Closed
dmaryin opened this issue Jan 13, 2022 · 2 comments

Comments

dmaryin commented Jan 13, 2022

Hello aws_ofi_nccl maintainers,

During its GDR capability check, NCCL calls ofi_listen, ofi_connect and ofi_accept sequentially.
ofi_connect issues a non-blocking fi_tsend to send a small buffer of data to the recipient.
It then waits to confirm that the data was actually transferred.
The underlying provider detects that the caller wants to talk to itself.
If the provider implements rendezvous mode for self-communication, the transfer is considered complete only once the caller has actually received the buffer, i.e. ofi_accept must be called before ofi_connect can complete. At the same time, the accept is considered complete only if the data is already there when accept is called. The result is a deadlock.
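
To make the cycle easier to see, here is a minimal toy model in Python. It is only a sketch and does not use libfabric or the plugin code: `RendezvousLoopback`, `blocking_connect`, `accept` and the `max_polls` bound are hypothetical names invented for illustration. It assumes a provider that completes a self-send only after a matching receive has been posted. The second scenario shows one possible way to break the cycle (connect posts the send and returns without waiting); it is not necessarily what the pull request mentioned below does.

```python
class RendezvousLoopback:
    """Toy model (hypothetical) of a provider using rendezvous for self-sends."""

    def __init__(self):
        self.pending_sends = []   # payloads posted but not yet matched
        self.posted_recv = False  # has the receiver posted a buffer?
        self.send_done = False    # send completion flag
        self.recv_data = None     # data delivered to the receiver

    def post_send(self, payload):
        # Non-blocking: only queues the payload.
        self.pending_sends.append(payload)

    def post_recv(self):
        self.posted_recv = True

    def progress(self):
        # Rendezvous: a send completes only once a receive is posted.
        if self.posted_recv and self.pending_sends and self.recv_data is None:
            self.recv_data = self.pending_sends.pop(0)
            self.send_done = True


def blocking_connect(ep, payload, max_polls=1000):
    """Post the send, then poll for its completion before returning."""
    ep.post_send(payload)
    for _ in range(max_polls):  # real code would spin without a bound
        ep.progress()
        if ep.send_done:
            return True
    return False


def accept(ep, max_polls=1000):
    """Post the receive and poll until the payload arrives."""
    ep.post_recv()
    for _ in range(max_polls):
        ep.progress()
        if ep.recv_data is not None:
            return ep.recv_data
    return None


def sequential_blocking():
    # connect waits for a completion that needs accept, which never runs.
    ep = RendezvousLoopback()
    return blocking_connect(ep, b"handle")


def sequential_nonblocking():
    # connect only posts the send; accept then matches and completes it.
    ep = RendezvousLoopback()
    ep.post_send(b"handle")
    data = accept(ep)
    return ep.send_done, data
```

In this model `sequential_blocking()` gives up after `max_polls` iterations, which stands in for the endless wait, while `sequential_nonblocking()` completes both sides.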

To fix this, pull request #84 was created.
Please consider merging it.

BRs,
Denis

dmaryin commented Feb 21, 2022

Hello @rashikakheria, this issue has been resolved. Would you mind closing it?

@rashikakheria
Contributor

Closing
