msg/async/rdma: fix Tx buffer leakage that can introduce "heartbeat no reply" #18053
Conversation
reply" due to out of Tx buffers, this can be reproduced by marking some OSDs down in a big Ceph cluster, say 300+ OSDs. rootcause: when RDMAStack wants to delete faulty connections there are chances that those QPs still have inflight CQEs, thus inflight Tx buffers; without waiting for them to complete, Tx buffer pool will run out of buffers finally. fix: ideally the best way to fix this bug is to destroy QPs gracefully such as to_dead(), we now just reply on the number of Tx WQE and CQE to avoid buffer leakage; RDMAStack polling is always running so we are safe to simply bypass some QPs that are not in 'complete' state. Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@yuyuyu101 @Adirl pls help review; thanks!
@tchaikov in case you are interested in this topic ...
Really good. Currently we avoid this case by setting the rx_buffer number to 0, which allows dynamic memory allocation and registration. @alex-mikheev
src/msg/async/rdma/RDMAStack.cc
Mutex::Locker l(lock);
// Try to find the QP in qp_conns firstly.
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
early return if found to reduce the indent level.
Good catch:
if (it != qp_conns.end())
return it->second.first;
src/msg/async/rdma/RDMAStack.cc
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
// Try again in dead_queue_pairs.
for(auto dead_qp = dead_queue_pairs.begin(); dead_qp != dead_queue_pairs.end(); dead_qp++) {
might want to use range-based loop or std::find_if() instead.
The same thing that may need your review:
for (auto &i : dead_queue_pairs)
if (i->get_local_qp_number() == qp)
return i;
return nullptr;
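For reference, the std::find_if() variant mentioned above would look roughly like this (just a sketch; it assumes dead_queue_pairs holds QueuePair pointers as in the snippet above and needs <algorithm>):

auto it = std::find_if(dead_queue_pairs.begin(), dead_queue_pairs.end(),
                       [qp](QueuePair *dead) {
                         return dead->get_local_qp_number() == qp;
                       });
return it != dead_queue_pairs.end() ? *it : nullptr;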
src/msg/async/rdma/RDMAStack.cc
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
// Try again in dead_queue_pairs.
for(auto dead_qp = dead_queue_pairs.begin(); dead_qp != dead_queue_pairs.end(); dead_qp++) {
add a space after for.
src/msg/async/rdma/RDMAStack.cc
while (!dead_queue_pairs.empty()) {
ldout(cct, 10) << __func__ << " finally delete qp=" << dead_queue_pairs.back() << dendl;
delete dead_queue_pairs.back();
for (auto idx = 0; idx < dead_queue_pairs.size(); idx++) {
use range-based loop please to avoid repeating dead_queue_pairs.at(idx).
Do you want me to use the following style? Pls correct me if anything is incorrect; thanks!
for (auto &i : dead_queue_pairs) {
  if (i->get_tx_wc() != i->get_tx_wr())
    continue;
  auto it = std::find(dead_queue_pairs.begin(), dead_queue_pairs.end(), i);
  if (it != dead_queue_pairs.end())
    dead_queue_pairs.erase(it);
  ldout(cct, 10) << __func__ << " finally delete qp=" << i << dendl;
  delete i;
  perf_logger->dec(l_msgr_rdma_active_queue_pair);
  --num_dead_queue_pair;
}
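One caveat with this shape, independent of style: erasing from dead_queue_pairs while the range-based for is still iterating over it invalidates the iteration, which is what the follow-up coredump fix below ends up addressing. A safer sketch of the same cleanup, using the iterator returned by erase() and keeping the log line and counters from the proposal above:

for (auto it = dead_queue_pairs.begin(); it != dead_queue_pairs.end(); ) {
  auto *qp = *it;
  if (qp->get_tx_wc() != qp->get_tx_wr()) {
    ++it;  // still has inflight Tx WQEs, check again on the next poll
    continue;
  }
  ldout(cct, 10) << __func__ << " finally delete qp=" << qp << dendl;
  it = dead_queue_pairs.erase(it);  // erase() returns the next valid iterator
  delete qp;
  perf_logger->dec(l_msgr_rdma_active_queue_pair);
  --num_dead_queue_pair;
}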
@@ -575,6 +577,8 @@ int RDMAConnectedSocketImpl::post_work_request(std::vector<Chunk*> &tx_buffers)
worker->perf_logger->inc(l_msgr_rdma_tx_failed);
return -errno;
}
// Update the Tx WQE counter |
i am not sure if this comment helps.
@@ -595,6 +599,8 @@ void RDMAConnectedSocketImpl::fin() {
worker->perf_logger->inc(l_msgr_rdma_tx_failed);
return ;
}
// Update the Tx WQE counter |
ditto.
src/msg/async/rdma/Infiniband.h
@@ -486,6 +497,8 @@ class Infiniband {
uint32_t max_recv_wr;
uint32_t q_key;
bool dead;
std::atomic<uint32_t> tx_wr; // atomic counter for successful Tx WQEs
better off initializing it like:
std::atomic<uint32_t> tx_wr{0};
or
std::atomic<uint32_t> tx_wr = {0};
s/atomic counter/counter/
OK, and with such initialization we should also get rid of the following lines in Infiniband::QueuePair::QueuePair():
- tx_wr(0),
- tx_wc(0)
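A generic illustration (not the actual Infiniband::QueuePair declaration) of why those initializer-list entries become redundant: the in-class default member initializer already runs for every constructor.

#include <atomic>
#include <cstdint>

struct Counters {
  std::atomic<uint32_t> tx_wr{0};  // zero-initialized by the in-class initializer
  std::atomic<uint32_t> tx_wc{0};
  Counters() {}  // no tx_wr(0), tx_wc(0) needed in the initializer list
};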
src/msg/async/rdma/Infiniband.h
@@ -464,6 +465,16 @@ class Infiniband {
* Return true if the queue pair is in an error state, false otherwise.
*/
bool is_error() const;
/** | |||
* Add Tx work request and completion counters. |
drop this comment, as the code is self-documenting.
@yuyuyu101 good catch! I don't think setting number of rx buffers to infinite will avoid this bug. @ownedu Is it possible to rely on already existing rdma socket/stack locks instead of adding two more atomic variables? And use just one counter for the outstanding tx work requests?
@alex-mikheev I do not think the current RDMAStack has QP-level counters for such inflight WQEs/CQEs, but it is a good idea to use just a single atomic counter and I will address this in the following CR commit; thanks.
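A rough sketch of what the single-counter variant could look like (names are illustrative, not the final commit): one atomic tracks Tx work requests that were posted but have not yet completed, and a dead QP is safe to destroy once it drops to zero.

#include <atomic>
#include <cstdint>

class QueuePair {
  std::atomic<uint32_t> tx_outstanding{0};  // posted but not yet completed Tx WRs
public:
  void tx_posted(uint32_t n)    { tx_outstanding += n; }
  void tx_completed(uint32_t n) { tx_outstanding -= n; }
  // nothing in flight means every Tx buffer is back in the pool
  bool tx_drained() const { return tx_outstanding.load() == 0; }
};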
Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
atomic counter for inflight Tx CQEs. Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@alex-mikheev addressed your comments in commit e323771; thanks.
@yuyuyu101 thanks for the review and pls help merge.
where the iterator is not working properly after erase(). Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
msg/async/rdma: fix a coredump introduced by PR #18053 Reviewed-by: Haomai Wang <haomai@xsky.com> Reviewed-by: Kefu Chai <kchai@redhat.com>
This can be reproduced by marking some OSDs down in a big Ceph cluster, say 300+ OSDs.
Root cause: when RDMAStack wants to delete faulty connections, there are chances that those QPs still have inflight CQEs, and thus inflight Tx buffers; without waiting for them to complete, the Tx buffer pool will eventually run out of buffers.
Fix: ideally the best way to fix this bug is to destroy QPs gracefully, e.g. via to_dead(); for now we rely on the number of Tx WQEs and CQEs to avoid buffer leakage. RDMAStack polling is always running, so it is safe to simply bypass QPs that are not yet in the 'complete' state.
Signed-off-by: Yan Lei yongyou.yl@alibaba-inc.com