[BUG] mr_allocator related bug when dealing two big lmr with one rdma object #116

Closed
fishiu opened this issue Aug 22, 2023 · 1 comment · Fixed by #117

fishiu commented Aug 22, 2023

The following code should reproduce the memory problem.

The scenario: the server uses two lmrs, one after the other, to read data from the client.

The first read succeeds. When the first lmr is dropped, jemalloc's dalloc is triggered (if the lmr is fairly large), and the raw_mr item is removed from EXTENT_TOKEN_MAP.

However, when the second lmr is created, the lookup_raw_mr function in mr_allocator.rs returns an error. There are actually three error situations in this function, and I have seen all of them (still wondering why...).
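The sequence described above can be sketched with a small model of an address-keyed map. This is not the code from mr_allocator.rs: `ExtentMap`, `Extent`, `LookupError`, and the three checks in `lookup` are hypothetical names and assumptions chosen only to illustrate how an entry removed on dalloc can make a later lookup fail.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one registered extent (not the crate's real type).
#[derive(Debug, Clone, PartialEq)]
struct Extent {
    len: usize,
    arena: u32,
}

/// Ways a lookup can fail in this model (illustrative names only).
#[derive(Debug, PartialEq)]
enum LookupError {
    NotFound,
    WrongArena,
    OutOfRange,
}

/// Simplified model of a map from extent start address to extent metadata.
#[derive(Default)]
struct ExtentMap {
    by_addr: BTreeMap<usize, Extent>,
}

impl ExtentMap {
    /// Record a newly allocated extent.
    fn on_alloc(&mut self, addr: usize, len: usize, arena: u32) {
        self.by_addr.insert(addr, Extent { len, arena });
    }

    /// Forget an extent when the allocator gives it back (the dalloc step).
    fn on_dalloc(&mut self, addr: usize) {
        self.by_addr.remove(&addr);
    }

    /// Find the extent that should contain `[addr, addr + len)`.
    fn lookup(&self, addr: usize, len: usize, arena: u32) -> Result<&Extent, LookupError> {
        // Nearest entry whose start address is at or below `addr`.
        let (start, ext) = self
            .by_addr
            .range(..=addr)
            .next_back()
            .ok_or(LookupError::NotFound)?;
        if ext.arena != arena {
            return Err(LookupError::WrongArena);
        }
        if addr + len > *start + ext.len {
            return Err(LookupError::OutOfRange);
        }
        Ok(ext)
    }
}

fn main() {
    let mut map = ExtentMap::default();
    map.on_alloc(0x1000, 0x4000, 1);
    assert!(map.lookup(0x1000, 0x4000, 1).is_ok());

    // Dropping the first region removes its entry from the map.
    map.on_dalloc(0x1000);
    // A later lookup for the same address now has nothing to find.
    assert_eq!(map.lookup(0x1000, 0x4000, 1), Err(LookupError::NotFound));
}
```

In this model a lookup can only succeed if the extent was registered again after the removal; whether that is what goes wrong in the real allocator is not established by the sketch.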

Some details about my system setup:

  • I tried running this with sudo; it still failed
  • I have set my user's ulimit to unlimited
  • I use Soft-iWARP on Ubuntu 20.04, but I think the bug is related only to mr_allocator
use async_rdma::{LocalMrWriteAccess, RdmaBuilder};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io::{self, Write},
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};

const SIZE: usize = 44444444;

async fn client(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().connect(addr).await?;
    let data = vec![0u8; SIZE];

    // first send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;

    // second send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;

    // wait for the server to finish reading; otherwise the client exits early
    tokio::time::sleep(Duration::from_secs(5)).await;

    Ok(())
}

#[tokio::main]
async fn server(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().listen(addr).await?;

    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }

    // lmr will drop here

    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        // the memory bug occurs here
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }

    Ok(())
}

#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(127, 0, 0, 1), pick_unused_port().unwrap());
    std::thread::spawn(move || server(addr));
    tokio::time::sleep(Duration::from_secs(1)).await;
    client(addr)
        .await
        .map_err(|err| println!("{}", err))
        .unwrap();
}

GTwhy (Collaborator) commented Aug 23, 2023

Hi @fishiu, thanks for your feedback.
The three errors you mentioned may all be related to jemalloc's retain option.
You can try setting retain to false to fix the OOM and "can not find raw mr" errors, as described in #110. Your test passes for me with that setting.
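The exact steps are in #110 and are not repeated here. As a rough sketch of one way to pass such an option, jemalloc reads an option string from an environment variable, so a small launcher can start the test binary with `retain:false` set. The binary path below is a placeholder, and the variable names are assumptions: `MALLOC_CONF` is jemalloc's usual name, while `_RJEM_MALLOC_CONF` is assumed for symbol-prefixed builds of the Rust jemalloc bindings.

```rust
use std::process::Command;

/// jemalloc option string that turns off retention of unused virtual memory.
const RETAIN_OFF: &str = "retain:false";

/// Build a command that runs `program` with the option set in its environment.
/// Both variable names are assumptions; which one applies depends on how
/// jemalloc was built for the binary being launched.
fn with_retain_disabled(program: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.env("MALLOC_CONF", RETAIN_OFF);
    cmd.env("_RJEM_MALLOC_CONF", RETAIN_OFF);
    cmd
}

fn main() {
    // Placeholder path; substitute the binary built from the reproduction.
    let cmd = with_retain_disabled("./target/debug/repro");

    // Show what would be passed to the child process.
    for (key, value) in cmd.get_envs() {
        println!("{:?} = {:?}", key, value);
    }
    // Calling `status()` on a mutable `cmd` would launch the program.
}
```

Setting the same variable directly in the shell before running the binary has the same effect; the launcher only keeps the setting next to the test.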

It should actually default to false on Linux, but user reports suggest that this is not the case.
The arena_id assertion failure may occur because the client and server share one address space during the test: lazy_static gives both of them the same EXTENT_TOKEN_MAP, and, combined with jemalloc's retain, an address is looked up in the wrong arena.
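To make the hypothesis concrete, here is a minimal sketch of two sides of one process sharing a single global map. It is an illustration only: `shared_map`, `register`, `owner_of`, and the arena numbers are hypothetical, and `OnceLock` stands in for the lazy_static global.

```rust
use std::collections::BTreeMap;
use std::sync::{Mutex, OnceLock};

/// Process-wide map from extent start address to the arena that registered it.
/// Stand-in for a lazy_static global; the name is hypothetical.
fn shared_map() -> &'static Mutex<BTreeMap<usize, u32>> {
    static MAP: OnceLock<Mutex<BTreeMap<usize, u32>>> = OnceLock::new();
    MAP.get_or_init(|| Mutex::new(BTreeMap::new()))
}

/// Record that `arena` owns the extent starting at `addr`.
fn register(addr: usize, arena: u32) {
    shared_map().lock().unwrap().insert(addr, arena);
}

/// Return the arena recorded for the nearest extent at or below `addr`.
fn owner_of(addr: usize) -> Option<u32> {
    let map = shared_map().lock().unwrap();
    let found = map.range(..=addr).next_back().map(|(_, arena)| *arena);
    found
}

fn main() {
    const CLIENT_ARENA: u32 = 1;
    const SERVER_ARENA: u32 = 2;

    // The client side registers an extent in the shared map.
    register(0x10_0000, CLIENT_ARENA);

    // The server side, in the same process, looks up a nearby address
    // that it never registered itself.
    let found = owner_of(0x10_0800);

    // The nearest entry belongs to the client, so an arena check would fail.
    assert_eq!(found, Some(CLIENT_ARENA));
    assert_ne!(found, Some(SERVER_ARENA));
}
```

Running the client and server as separate processes would give each its own map, which is one way to check whether the shared global contributes to the assertion failure.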

Further analysis is required, and the default settings and error messages for this case need to be improved.
