
msg/async/rdma: fixes crash in fio #16957

Closed

Conversation

alex-mikheev (Contributor)

@yuyuyu101 this PR fixes the same problem as PR #16893. However, instead of using the CephContext singleton, both the Infiniband object and the dispatcher are made part of the RDMA stack, so each CephContext has completely separate resources. It is also better for performance because fewer QPs are attached to the same SRQ.

fio creates multiple CephContexts in a single process. Crashes happen because the RDMA stack has global resources that are still used by one CephContext after they have already been destroyed by another.

The commit removes the global instances of the RDMA dispatcher and Infiniband and makes them context (RDMA stack) specific.

Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Alex Mikheev <alexm@mellanox.com>

@@ -776,9 +778,45 @@ int Infiniband::MemoryManager::get_send_buffers(std::vector<Chunk*> &c, size_t b
return send->get_buffers(c, bytes);
}

Infiniband::Infiniband(CephContext *cct, const std::string &device_name, uint8_t port_num)
: cct(cct), lock("IB lock"), device_name(device_name), port_num(port_num)
bool Infiniband::init_prereq = false;
Member:

std::atomic

ldout(cct, 20) << __func__ << " ms_async_rdma_enable_hugepage value is: " << cct->_conf->ms_async_rdma_enable_hugepage << dendl;
if (cct->_conf->ms_async_rdma_enable_hugepage){
rc = setenv("RDMAV_HUGEPAGES_SAFE","1",1);
ldout(cct, 20) << __func__ << " RDMAV_HUGEPAGES_SAFE is set as: " << getenv("RDMAV_HUGEPAGES_SAFE") << dendl;
Member:

level 0 is preferred; it's important info to know

  assert(tx_cq);
- rx_cq = global_infiniband->create_comp_queue(cct, rx_cc);
+ rx_cq = get_stack()->get_infiniband().create_comp_queue(cct, rx_cc);
  assert(rx_cq);

t = std::thread(&RDMADispatcher::polling, this);
}

void RDMADispatcher::polling_stop()
{
Member:

Mutex::Locker l(lock)

yuyuyu101 (Member)

I will test locally

alex-mikheev (Contributor, Author)

@yuyuyu101 thanks.

One remaining problem is that the rx buffer pool is still a singleton. However, there are pending changes that address this in https://github.com/Mellanox/ceph/commits/luminous-rdma
Commits:
Mellanox@a2df3cb
Mellanox@3650b9b
Mellanox@515b390

yuyuyu101 (Member)

replaced by #16981

yuyuyu101 closed this Aug 11, 2017