Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancellation safety #47

Closed
Nugine opened this issue Apr 18, 2022 · 1 comment · Fixed by #65
Closed

Cancellation safety #47

Nugine opened this issue Apr 18, 2022 · 1 comment · Fixed by #65
Labels
bug Something isn't working

Comments

@Nugine
Copy link
Contributor

Nugine commented Apr 18, 2022

Disclaimer: I know nothing about the details of RDMA before.

I have some questions about the example at first glance.

/// Server can't aware of rdma `read` or `write`, so we need sync with client
/// before client `read` and after `write`.
async fn sync_with_client(rdma: &Rdma) {
let mut lmr_sync = rdma
.alloc_local_mr(Layout::new::<Request>())
.map_err(|err| println!("{}", &err))
.unwrap();
//write data to lmr
unsafe { *(lmr_sync.as_mut_ptr() as *mut Request) = Request::Sync };
rdma.send(&lmr_sync)
.await
.map_err(|err| println!("{}", &err))
.unwrap();
}

  1. When will the local memory region be deallocated?
  2. Does the unsafe statement have any safety requirements?
  3. What if the future returned by sync_with_client is dropped (cancelled) before the request completes?

Related article: https://zhuanlan.zhihu.com/p/346219893

@GTwhy
Copy link
Collaborator

GTwhy commented Apr 19, 2022

Hi, thanks for your thoughtful questions and article.

  1. For now, the relationship between LocalMr and RawMemoryRegion(contains ibv_mr : the "real" RDMA memory region) is one-to-one, and they are connected through Arc . So the memory region drop flow is : lmr.drop() -> raw_mr.drop() -> ibv_dreg_mr(ibv_mr). It's simple but inefficient, so we will transform jemalloc to manage ibv_mr and reuse it's preapplication and retain strategy in the next release.

  2. The operations in the unsafe statement here almost degenerate to operations on raw memory pointers in C. I haven't thought carefully about how to pack this part better. I saw in the article that you have some thoughts about this. Maybe you have good ideas.

  3. Good question! That's depending on the stage of the request. If the future dropped before or after ibv_post_send(rdma api) done, there will only be some error messages. But the handling of this during the execution of ibv_post_send has not been taken into account. This question is similar to the one mentioned in your article.

I will deal with the above issues later. You are welcome to contribute if you are interested, Thank you!

@GTwhy GTwhy added the bug Something isn't working label Apr 19, 2022
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 9, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Fixes: datenlord#47
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 9, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Fixes: datenlord#47
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 9, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Fixes: datenlord#47
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 15, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Fixes: datenlord#47 datenlord#56
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 19, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Fixes: datenlord#47 datenlord#56
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 20, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Add `MrState` to avoid potential race condition

When the RDMA operations future were canceled but the related
mr is still being used by netdev or kernel. This is a potential
race condition. So we add `MrState` to `LocalMrInner`, to make
sure we can get slice or mutable slice only if the mr is readable
or writeable.

Fixes: datenlord#47 datenlord#56
GTwhy added a commit to GTwhy/async-rdma that referenced this issue May 23, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Add `RwLock` to avoid potential race condition

When the RDMA operations future were canceled but the related
mr is still being used by netdev or kernel. This is a potential
race condition. So we add `Rwlock` to `LocalMrInner`, to make
sure we can get slice or mutable slice only if the mr is readable
or writeable.

Replace hashmap to reduce time overhead and ensure guards drop after
remove immediately to unlock the rwlock.`std::collections::HashMap`
has less time overhead than `LockFreeCuckooHash` in the low concurrency
situation. `LockFreeCuckooHash` can't drop guards immediately after remove
them and that led to a deadlock.

Add some unchecked methods without lock, which should be put into unsafe
trait later.

Integrate common `HashMap` methods in hashmap_utilities.

Fixes: datenlord#47 datenlord#56
rogercloud pushed a commit that referenced this issue May 24, 2022
Let `event_listener` holds the `Arc`s of the `LocalMrInner`s
that are being used by RDMA ops to ensure cancellation safety.

`LocalMr` was replaced with New struct `LocalMrInner`.
Because every struct that can use APIs should hold an `Arc`
of the mr's metadata, but previous `LocalMr` can't hold itself.

Add `RwLock` to avoid potential race condition

When the RDMA operations future were canceled but the related
mr is still being used by netdev or kernel. This is a potential
race condition. So we add `Rwlock` to `LocalMrInner`, to make
sure we can get slice or mutable slice only if the mr is readable
or writeable.

Replace hashmap to reduce time overhead and ensure guards drop after
remove immediately to unlock the rwlock.`std::collections::HashMap`
has less time overhead than `LockFreeCuckooHash` in the low concurrency
situation. `LockFreeCuckooHash` can't drop guards immediately after remove
them and that led to a deadlock.

Add some unchecked methods without lock, which should be put into unsafe
trait later.

Integrate common `HashMap` methods in hashmap_utilities.

Fixes: #47 #56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants