msg/async/rdma: fix Tx buffer leakage that can introduce "heartbeat no reply" #18053
Conversation
reply" due to out of Tx buffers, this can be reproduced by marking some OSDs down in a big Ceph cluster, say 300+ OSDs. rootcause: when RDMAStack wants to delete faulty connections there are chances that those QPs still have inflight CQEs, thus inflight Tx buffers; without waiting for them to complete, Tx buffer pool will run out of buffers finally. fix: ideally the best way to fix this bug is to destroy QPs gracefully such as to_dead(), we now just reply on the number of Tx WQE and CQE to avoid buffer leakage; RDMAStack polling is always running so we are safe to simply bypass some QPs that are not in 'complete' state. Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@yuyuyu101 @Adirl pls help review; thanks!
@tchaikov in case you are interested in this topic ...
Really good. Currently we avoid this case by setting the rx_buffer number to 0, which allows dynamic memory allocation and registration. @alex-mikheev
src/msg/async/rdma/RDMAStack.cc
Mutex::Locker l(lock);
// Try to find the QP in qp_conns firstly.
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
early return if found to reduce the indent level.
Good catch:
if (it != qp_conns.end())
return it->second.first;
src/msg/async/rdma/RDMAStack.cc
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
// Try again in dead_queue_pairs.
for(auto dead_qp = dead_queue_pairs.begin(); dead_qp != dead_queue_pairs.end(); dead_qp++) {
might want to use range-based loop or std::find_if() instead.
The same thing that may need your review:
for (auto &i : dead_queue_pairs)
if (i->get_local_qp_number() == qp)
return i;
return nullptr;
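For reference, the std::find_if() variant mentioned above would look roughly like this (just a sketch; it assumes dead_queue_pairs holds QueuePair pointers as in the snippet above and needs <algorithm>):

auto it = std::find_if(dead_queue_pairs.begin(), dead_queue_pairs.end(),
                       [qp](QueuePair *dead) {
                         return dead->get_local_qp_number() == qp;
                       });
return it != dead_queue_pairs.end() ? *it : nullptr;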
src/msg/async/rdma/RDMAStack.cc
auto it = qp_conns.find(qp);
if (it == qp_conns.end()) {
// Try again in dead_queue_pairs.
for(auto dead_qp = dead_queue_pairs.begin(); dead_qp != dead_queue_pairs.end(); dead_qp++) {
add a space after for.
src/msg/async/rdma/RDMAStack.cc
while (!dead_queue_pairs.empty()) {
ldout(cct, 10) << __func__ << " finally delete qp=" << dead_queue_pairs.back() << dendl;
delete dead_queue_pairs.back();
for (auto idx = 0; idx < dead_queue_pairs.size(); idx++) {
use range-based loop please to avoid repeating dead_queue_pairs.at(idx).
Do you want me to use the following style? Pls correct me if anything is incorrect; thanks!
for (auto &i : dead_queue_pairs) {
  if (i->get_tx_wc() != i->get_tx_wr())
    continue;
  auto it = std::find(dead_queue_pairs.begin(), dead_queue_pairs.end(), i);
  if (it != dead_queue_pairs.end())
    dead_queue_pairs.erase(it);
  ldout(cct, 10) << __func__ << " finally delete qp=" << i << dendl;
  delete i;
  perf_logger->dec(l_msgr_rdma_active_queue_pair);
  --num_dead_queue_pair;
}
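One caveat with this shape, independent of style: erasing from dead_queue_pairs while the range-based for is still iterating over it invalidates the iteration, which is what the follow-up coredump fix below ends up addressing. A safer sketch of the same cleanup, using the iterator returned by erase() and keeping the log line and counters from the proposal above:

for (auto it = dead_queue_pairs.begin(); it != dead_queue_pairs.end(); ) {
  auto *qp = *it;
  if (qp->get_tx_wc() != qp->get_tx_wr()) {
    ++it;  // still has inflight Tx WQEs, check again on the next poll
    continue;
  }
  ldout(cct, 10) << __func__ << " finally delete qp=" << qp << dendl;
  it = dead_queue_pairs.erase(it);  // erase() returns the next valid iterator
  delete qp;
  perf_logger->dec(l_msgr_rdma_active_queue_pair);
  --num_dead_queue_pair;
}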
@@ -575,6 +577,8 @@ int RDMAConnectedSocketImpl::post_work_request(std::vector<Chunk*> &tx_buffers)
worker->perf_logger->inc(l_msgr_rdma_tx_failed);
return -errno;
}
// Update the Tx WQE counter |
i am not sure if this comment helps.
@@ -595,6 +599,8 @@ void RDMAConnectedSocketImpl::fin() {
worker->perf_logger->inc(l_msgr_rdma_tx_failed);
return ;
}
// Update the Tx WQE counter |
ditto.
src/msg/async/rdma/Infiniband.h
@@ -486,6 +497,8 @@ class Infiniband {
uint32_t max_recv_wr;
uint32_t q_key;
bool dead;
std::atomic<uint32_t> tx_wr; // atomic counter for successful Tx WQEs
better off initializing it like:
std::atomic<uint32_t> tx_wr{0};
or
std::atomic<uint32_t> tx_wr = {0};
s/atomic counter/counter/
OK, and with such initialization we should also get rid of the following lines in Infiniband::QueuePair::QueuePair():
- tx_wr(0),
- tx_wc(0)
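A generic illustration (not the actual Infiniband::QueuePair declaration) of why those initializer-list entries become redundant: the in-class default member initializer already runs for every constructor.

#include <atomic>
#include <cstdint>

struct Counters {
  std::atomic<uint32_t> tx_wr{0};  // zero-initialized by the in-class initializer
  std::atomic<uint32_t> tx_wc{0};
  Counters() {}  // no tx_wr(0), tx_wc(0) needed in the initializer list
};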
src/msg/async/rdma/Infiniband.h
@@ -464,6 +465,16 @@ class Infiniband {
* Return true if the queue pair is in an error state, false otherwise.
*/
bool is_error() const;
/** | |||
* Add Tx work request and completion counters. |
drop this comment, as the code is self-documenting.
@yuyuyu101 good catch! I don't think setting number of rx buffers to infinite will avoid this bug. @ownedu Is it possible to rely on already existing rdma socket/stack locks instead of adding two more atomic variables? And use just one counter for the outstanding tx work requests?
@alex-mikheev I do not think the current RDMAStack has QP-level counters for such inflight WQEs/CQEs, but it is a good idea to use just a single atomic counter and I will address this in the following CR commit; thanks.
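A rough sketch of what the single-counter variant could look like (names are illustrative, not the final commit): one atomic tracks Tx work requests that were posted but have not yet completed, and a dead QP is safe to destroy once it drops to zero.

#include <atomic>
#include <cstdint>

class QueuePair {
  std::atomic<uint32_t> tx_outstanding{0};  // posted but not yet completed Tx WRs
public:
  void tx_posted(uint32_t n)    { tx_outstanding += n; }
  void tx_completed(uint32_t n) { tx_outstanding -= n; }
  // nothing in flight means every Tx buffer is back in the pool
  bool tx_drained() const { return tx_outstanding.load() == 0; }
};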
Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
atomic counter for inflight Tx CQEs. Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
@alex-mikheev addressed your comments in commit e323771; thanks.
@yuyuyu101 thanks for the review and pls help merge.
where the iterator is not working properly after erase(). Signed-off-by: Yan Lei <yongyou.yl@alibaba-inc.com>
msg/async/rdma: fix a coredump introduced by PR #18053 Reviewed-by: Haomai Wang <haomai@xsky.com> Reviewed-by: Kefu Chai <kchai@redhat.com>
This can be reproduced by marking some OSDs down in a big Ceph cluster, say 300+ OSDs.
Root cause: when RDMAStack wants to delete faulty connections, there are chances that those QPs still have inflight CQEs, and thus inflight Tx buffers; without waiting for them to complete, the Tx buffer pool will eventually run out of buffers.
Fix: ideally the best way to fix this bug is to destroy QPs gracefully, e.g. via to_dead(); for now we rely on the number of Tx WQEs and CQEs to avoid buffer leakage. RDMAStack polling is always running, so it is safe to simply bypass QPs that are not yet in the 'complete' state.
Signed-off-by: Yan Lei yongyou.yl@alibaba-inc.com