
msg/async/rdma: fix Tx buffer leakage that can introduce "heartbeat no reply" #18053

Merged
merged 3 commits into ceph:master from ownedu:wip-fix-async-rdma-tx-buffer-leakage on Oct 3, 2017

Conversation

@ownedu (Contributor) commented Sep 30, 2017

The "heartbeat no reply" issue can be reproduced by marking some OSDs down in a big Ceph cluster, say 300+ OSDs.

Root cause: when RDMAStack wants to delete faulty connections, there is a chance that those QPs still have inflight CQEs, and thus inflight Tx buffers; without waiting for those to complete, the Tx buffer pool will eventually run out of buffers.

Fix: ideally the best way to fix this bug would be to destroy QPs gracefully, e.g. via to_dead(); for now we simply rely on the counts of Tx WQEs and CQEs to avoid buffer leakage. RDMAStack polling is always running, so it is safe to just bypass QPs that are not yet in the 'complete' state.

Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
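
As a rough illustration of the approach (a sketch, not the patch itself; the get_tx_wr()/get_tx_wc() getters follow the review snippets below), a dead QP is reclaimable only once every posted Tx work request has a matching completion:

    // Sketch: a dead QP still owns a Tx buffer for every WQE posted but not yet
    // completed, so deleting it early would leak those buffers from the pool.
    bool qp_drained(Infiniband::QueuePair *qp) {
      return qp->get_tx_wc() == qp->get_tx_wr();  // every Tx WQE has its CQE
    }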

reply" due to out of Tx buffers, this can be reproduced by marking some
OSDs down in a big Ceph cluster, say 300+ OSDs.

rootcause: when RDMAStack wants to delete faulty connections there are
chances that those QPs still have inflight CQEs, thus inflight Tx
buffers; without waiting for them to complete, Tx buffer pool will run
out of buffers finally.

fix: ideally the best way to fix this bug is to destroy QPs gracefully
such as to_dead(), we now just reply on the number of Tx WQE and CQE to
avoid buffer leakage; RDMAStack polling is always running so we are safe
to simply bypass some QPs that are not in 'complete' state.

Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@ownedu (Contributor, Author) commented Sep 30, 2017

@yuyuyu101 @Adirl pls help review; thanks!

@ownedu (Contributor, Author) commented Sep 30, 2017

@tchaikov in case you are interested in this topic ...

@yuyuyu101 (Member) commented:

Really good. Currently we avoid this case by setting the rx_buffer number to 0, which allows dynamic memory allocation and registration. @alex-mikheev
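
For reference, assuming the option meant here is ms_async_rdma_receive_buffers (an assumption; check the options of your Ceph version), that workaround would be a ceph.conf setting along these lines:

    [global]
    ms_type = async+rdma
    # 0 = no fixed preallocated Rx pool; buffers are allocated and registered dynamically
    ms_async_rdma_receive_buffers = 0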

Mutex::Locker l(lock);
// Try to find the QP in qp_conns firstly.
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {

Contributor: early return if found to reduce the indent level.

@ownedu (Contributor, Author) replied Sep 30, 2017:

Good catch:

    if (it != qp_conns.end())
      return it->second.first;
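
Put together, the lookup with the early return applied could look like this (a sketch only, assuming the surrounding RDMADispatcher::get_qp() signature; the dead_queue_pairs fallback is the subject of the next thread):

    Infiniband::QueuePair* RDMADispatcher::get_qp(uint32_t qp)
    {
      Mutex::Locker l(lock);
      // Try to find the QP in qp_conns first; return early to keep the indent flat.
      auto it = qp_conns.find(qp);
      if (it != qp_conns.end())
        return it->second.first;
      // Not a live connection: fall back to the dead-QP list.
      for (auto &i : dead_queue_pairs)
        if (i->get_local_qp_number() == qp)
          return i;
      return nullptr;
    }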

auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
// Try again in dead_queue_pairs.
for(auto dead_qp = dead_queue_pairs.begin(); dead_qp != dead_queue_pairs.end(); dead_qp++) {

Contributor: might want to use range-based loop or std::find_if() instead.

@ownedu (Contributor, Author) replied Sep 30, 2017:

Here is the same treatment, which may need your review:

    for (auto &i : dead_queue_pairs)
      if (i->get_local_qp_number() == qp)
        return i;
      
    return nullptr;
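
A std::find_if() variant of the same fallback (a sketch; assumes <algorithm> is included and that dead_queue_pairs holds QueuePair*):

    auto dead = std::find_if(dead_queue_pairs.begin(), dead_queue_pairs.end(),
                             [qp](Infiniband::QueuePair *d) {
                               return d->get_local_qp_number() == qp;
                             });
    return dead != dead_queue_pairs.end() ? *dead : nullptr;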

Contributor (on the same hunk as above): add a space after for.

while (!dead_queue_pairs.empty()) {
ldout(cct, 10) << __func__ << " finally delete qp=" << dead_queue_pairs.back() << dendl;
delete dead_queue_pairs.back();
for (auto idx = 0; idx < dead_queue_pairs.size(); idx++) {

Contributor: use range-based loop please to avoid repeating dead_queue_pairs.at(idx).

@ownedu (Contributor, Author) replied Sep 30, 2017:

Do you want me to use the following style? Pls correct me if anything is incorrect; thanks!

    for (auto &i : dead_queue_pairs) {
            if (i->get_tx_wc() != i->get_tx_wr())
                    continue;
            auto it = std::find(dead_queue_pairs.begin(), dead_queue_pairs.end(), i);
            if (it != dead_queue_pairs.end())
                    dead_queue_pairs.erase(it);
            ldout(cct, 10) << __func__ << "finally delete qp=" << i << dendl;
            delete i;
            perf_logger->dec(l_msgr_rdma_active_queue_pair);
            --num_dead_queue_pair;
    }
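
One caveat with the style above: erasing from dead_queue_pairs while a range-based loop iterates over it invalidates the iterator the loop is built on; this is exactly what the follow-up coredump fix referenced at the bottom of this page addresses. An explicit-iterator sketch that erases safely (same counters and getters assumed):

    for (auto it = dead_queue_pairs.begin(); it != dead_queue_pairs.end(); ) {
      Infiniband::QueuePair *qp = *it;
      if (qp->get_tx_wc() != qp->get_tx_wr()) {
        ++it;  // Tx completions still in flight: skip this QP for now
        continue;
      }
      ldout(cct, 10) << __func__ << " finally delete qp=" << qp << dendl;
      delete qp;
      it = dead_queue_pairs.erase(it);  // erase() returns the next valid iterator
      perf_logger->dec(l_msgr_rdma_active_queue_pair);
      --num_dead_queue_pair;
    }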

@@ -575,6 +577,8 @@ int RDMAConnectedSocketImpl::post_work_request(std::vector<Chunk*> &tx_buffers)
worker->perf_logger->inc(l_msgr_rdma_tx_failed);
return -errno;
}
// Update the Tx WQE counter

Contributor: I am not sure if this comment helps.

@@ -595,6 +599,8 @@ void RDMAConnectedSocketImpl::fin() {
worker->perf_logger->inc(l_msgr_rdma_tx_failed);
return ;
}
// Update the Tx WQE counter

Contributor: ditto.

@@ -486,6 +497,8 @@ class Infiniband {
uint32_t max_recv_wr;
uint32_t q_key;
bool dead;
std::atomic<uint32_t> tx_wr; // atomic counter for successful Tx WQEs

Contributor: better off initializing it like:

std::atomic<uint32_t> tx_wr{0};

or

std::atomic<uint32_t> tx_wr = {0};

s/atomic counter/counter/

@ownedu (Contributor, Author) replied:

OK, and with such initialization we should also get rid of the following lines in Infiniband::QueuePair::QueuePair():

  • tx_wr(0),
  • tx_wc(0)
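
A sketch of the resulting declarations with in-class brace initialization, assuming both counters remain as in this revision:

    std::atomic<uint32_t> tx_wr{0};  // counter for posted Tx work requests
    std::atomic<uint32_t> tx_wc{0};  // counter for reaped Tx completions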

@@ -464,6 +465,16 @@ class Infiniband {
* Return true if the queue pair is in an error state, false otherwise.
*/
bool is_error() const;
/**
* Add Tx work request and completion counters.

Contributor: drop this comment, as the code is self-documenting.

@alex-mikheev (Contributor) commented:

@yuyuyu101 good catch! I don't think setting the number of rx buffers to infinite will avoid this bug. @ownedu is it possible to rely on the already existing RDMA socket/stack locks instead of adding two more atomic variables? And to use just one counter for the outstanding Tx work requests?

@ownedu (Contributor, Author) commented Oct 1, 2017

@alex-mikheev I do not think the current RDMAStack has QP-level counters for such inflight WQEs/CQEs, but it is a good idea to use just a single atomic counter, and I will address it in the following CR commit; thanks.
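
A minimal sketch of that single-counter idea (illustrative names, not the exact code that landed): increment when Tx WQEs are posted, decrement when their CQEs are reaped, and treat a dead QP as destroyable only once the counter drains to zero.

    #include <atomic>
    #include <cstdint>

    // Sketch: one atomic tracks Tx WQEs posted but not yet completed.
    std::atomic<uint32_t> tx_outstanding{0};

    void on_tx_posted(uint32_t n)    { tx_outstanding += n; }  // after a successful post
    void on_tx_completed(uint32_t n) { tx_outstanding -= n; }  // after polling Tx CQEs

    bool can_destroy_dead_qp() {
      return tx_outstanding.load() == 0;  // every Tx buffer returned to the pool
    }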

Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@ownedu (Contributor, Author) commented Oct 1, 2017

@tchaikov Pls help review commit 303e640 which addresses your CR comments; thanks!

atomic counter for inflight Tx CQEs.

Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@ownedu (Contributor, Author) commented Oct 1, 2017

@alex-mikheev addressed your comments in commit e323771; thanks.

@ownedu (Contributor, Author) commented Oct 3, 2017

@yuyuyu101 thanks for the review and pls help merge.

@yuyuyu101 yuyuyu101 merged commit 5d6d138 into ceph:master Oct 3, 2017
@ownedu ownedu deleted the wip-fix-async-rdma-tx-buffer-leakage branch October 3, 2017 10:53
ownedu added a commit to ownedu/ceph that referenced this pull request Oct 10, 2017
where the iterator is not working properly after erase().

Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
ownedu added a commit to ownedu/ceph that referenced this pull request Oct 11, 2017
where the iterator is not working properly after erase().

Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
tchaikov added a commit that referenced this pull request Oct 13, 2017
msg/async/rdma: fix a coredump introduced by PR #18053

Reviewed-by: Haomai Wang <haomai@xsky.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>