msg/async/rdma: destroy QueuePair if needed #13810

Merged
yuyuyu101 merged 1 commit into ceph:master from yuyuyu101:wip-rdma-inflight on Mar 7, 2017

3 participants
Member

yuyuyu101 commented Mar 6, 2017

Signed-off-by: Haomai Wang <haomai@xsky.com>

msg/async/rdma: destroy QueuePair if needed
Signed-off-by: Haomai Wang <haomai@xsky.com>

Adirl commented Mar 6, 2017

@yuyuyu101
reviewing

Mutex::Locker l(lock); // FIXME reuse dead qp because creating one qp costs 1 ms
while (!dead_queue_pairs.empty()) {
  ldout(cct, 10) << __func__ << " finally delete qp=" << dead_queue_pairs.back() << dendl;
  delete dead_queue_pairs.back();
  perf_logger->dec(l_msgr_rdma_active_queue_pair);
  dead_queue_pairs.pop_back();
  --num_dead_queue_pair;
}

Adirl Mar 6, 2017

We might want to move this to ~QueuePair() to cover the other cases where a qp is deleted.

yuyuyu101 Mar 6, 2017

Author Member

No, we only track qps that are in the queue.
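
The distinction in this exchange can be illustrated with a minimal sketch. It assumes the counter is meant to cover only queue pairs that the dispatcher itself has placed on its dead list, so the decrement sits next to the drain loop rather than in the queue pair's destructor. `FakeQueuePair` and `FakeDispatcher` are hypothetical stand-ins written for this example, not Ceph's classes, and `std::mutex` replaces Ceph's `Mutex`.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a queue pair; not Ceph's QueuePair.
struct FakeQueuePair {
  explicit FakeQueuePair(int id) : id(id) {}
  int id;
};

// Hypothetical dispatcher that counts only the queue pairs it has queued.
class FakeDispatcher {
 public:
  // Take ownership of a qp that is no longer in use and count it.
  void enqueue_dead(std::unique_ptr<FakeQueuePair> qp) {
    std::lock_guard<std::mutex> l(lock);
    dead_queue_pairs.push_back(std::move(qp));
    ++num_dead_queue_pair;
  }

  // Destroy every queued qp and return how many were destroyed.
  int drain_dead() {
    std::lock_guard<std::mutex> l(lock);
    int destroyed = 0;
    while (!dead_queue_pairs.empty()) {
      dead_queue_pairs.pop_back();  // unique_ptr deletes the qp here
      --num_dead_queue_pair;        // counter moves only with the queue
      ++destroyed;
    }
    // The counter and the queue are updated together, so both end at zero.
    assert(num_dead_queue_pair == 0);
    return destroyed;
  }

  int dead_count() {
    std::lock_guard<std::mutex> l(lock);
    return num_dead_queue_pair;
  }

 private:
  std::mutex lock;
  std::vector<std::unique_ptr<FakeQueuePair>> dead_queue_pairs;
  int num_dead_queue_pair = 0;
};
```

With this layout, a queue pair that is destroyed without ever being queued does not touch the counter, which is the behavior the reply above describes.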

@@ -202,13 +203,14 @@ void RDMADispatcher::polling()
// Additionally, don't delete qp while outstanding_buffers isn't empty,

Adirl Mar 6, 2017

Since we don't check the inflight value, is this still true?

yuyuyu101 Mar 6, 2017

Author Member

Yes, we don't need to care about inflight tx messages.

Adirl approved these changes Mar 6, 2017

yuyuyu101 merged commit d124e6f into ceph:master on Mar 7, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

yuyuyu101 deleted the yuyuyu101:wip-rdma-inflight branch on Mar 7, 2017

Adirl commented Mar 9, 2017

@yuyuyu101

[cephuser@clx-ssp-055 ~]$ ceph -s
    cluster 68e56c22-d9d3-4680-872e-e547ab7fdf80
     health HEALTH_OK
     monmap e5: 4 mons at {clx-ssp-055=110.168.1.55:6789/0,clx-ssp-060=110.168.1.60:6789/0,clx-ssp-065=110.168.1.65:6789/0,clx-ssp-070=110.168.1.70:6789/0}
            election epoch 36, quorum 0,1,2,3 clx-ssp-055,clx-ssp-060,clx-ssp-065,clx-ssp-070
        mgr active: clx-ssp-060 standbys: clx-ssp-065, clx-ssp-070, clx-ssp-055
     osdmap e1404: 256 osds: 256 up, 256 in
            flags sortbitwise,require_jewel_osds,require_kraken_osds,require_luminous_osds
      pgmap v84918: 8192 pgs, 1 pools, 16000 GB data, 4000 kobjects
            48051 GB used, 45854 GB / 93905 GB avail
                8192 active+clean
/mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/msg/async/rdma/RDMAStack.cc: In function 'virtual RDMADispatcher::~RDMADispatcher()' thread 7fb71b472700 time 2017-03-09 16:33:55.296822
/mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/msg/async/rdma/RDMAStack.cc: 39: FAILED assert(dead_queue_pairs.empty())
 ceph version 12.0.0-1037-gf7e0f57 (f7e0f57f797e1bf6a80a7226bcc21765024e7e8a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7fb721668bc0]
 2: (RDMADispatcher::~RDMADispatcher()+0x336) [0x7fb7217a3116]
 3: (RDMADispatcher::~RDMADispatcher()+0x9) [0x7fb7217a3149]
 4: (RDMAStack::~RDMAStack()+0x38) [0x7fb7217a1728]
 5: (CephContext::TypedSingletonWrapper<StackSingleton>::~TypedSingletonWrapper()+0x8a) [0x7fb72178311a]
 6: (CephContext::~CephContext()+0x47) [0x7fb72181b597]
 7: (CephContext::put()+0x17c) [0x7fb72181bc4c]
 8: (librados::RadosClient::~RadosClient()+0x1b0) [0x7fb729f78520]
 9: (librados::RadosClient::~RadosClient()+0x9) [0x7fb729f78579]
 10: (rados_shutdown()+0x2e) [0x7fb729f2b3ce]
 11: (()+0x17cc2) [0x7fb72a236cc2]
 12: (PyEval_EvalFrameEx()+0x730a) [0x7fb72c7cd00a]
 13: (PyEval_EvalCodeEx()+0x7ed) [0x7fb72c7cee3d]
 14: (PyEval_EvalFrameEx()+0x663c) [0x7fb72c7cc33c]
 15: (PyEval_EvalFrameEx()+0x67bd) [0x7fb72c7cc4bd]
 16: (PyEval_EvalCodeEx()+0x7ed) [0x7fb72c7cee3d]
 17: (()+0x70798) [0x7fb72c758798]
 18: (PyObject_Call()+0x43) [0x7fb72c7338e3]
 19: (()+0x5a8d5) [0x7fb72c7428d5]
 20: (PyObject_Call()+0x43) [0x7fb72c7338e3]
 21: (PyEval_CallObjectWithKeywords()+0x47) [0x7fb72c7c56f7]
 22: (()+0x1155c2) [0x7fb72c7fd5c2]
 23: (()+0x7dc5) [0x7fb72c4d3dc5]
 24: (clone()+0x6d) [0x7fb72baf821d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
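
The failed check in this trace is a destructor asserting that no dead queue pairs remain. A minimal sketch of that invariant follows, assuming the owner is expected to drain its dead list explicitly before it is destroyed. `FakeQp`, `FakeQpOwner`, `retire`, and `shutdown` are hypothetical names for illustration only; this is not Ceph's code and not necessarily how the crash was addressed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a queue pair; not Ceph's QueuePair.
struct FakeQp {
  int id;
};

// Hypothetical owner of retired queue pairs.
class FakeQpOwner {
 public:
  FakeQpOwner() = default;
  FakeQpOwner(const FakeQpOwner&) = delete;
  FakeQpOwner& operator=(const FakeQpOwner&) = delete;

  ~FakeQpOwner() {
    // Same shape of check as in the trace: nothing may be left behind.
    assert(dead_queue_pairs.empty());
  }

  // Hand over a qp that is no longer in use.
  void retire(FakeQp* qp) { dead_queue_pairs.push_back(qp); }

  // Drain explicitly before destruction; returns the number deleted.
  std::size_t shutdown() {
    std::size_t n = 0;
    while (!dead_queue_pairs.empty()) {
      delete dead_queue_pairs.back();
      dead_queue_pairs.pop_back();
      ++n;
    }
    return n;
  }

  std::size_t pending() const { return dead_queue_pairs.size(); }

 private:
  std::vector<FakeQp*> dead_queue_pairs;
};
```

If `shutdown()` is skipped while entries are still pending, the destructor's assert fires, which matches the shape of the failure reported above.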

Adirl commented Mar 12, 2017

@yuyuyu101 have you seen this?

Adirl commented Mar 12, 2017

@yuyuyu101
Here's another one, on a different cluster:


    -2> 2017-03-12 10:12:01.131113 7f14bb7ff700  1 RDMAStack handle_async_event it's not forwardly stopped by us, reenable=0x7f14ca196a80
    -1> 2017-03-12 10:12:01.131124 7f14bb7ff700  1  RDMAConnectedSocketImpl fault tcp fd 23
     0> 2017-03-12 10:12:01.133464 7f14bb7ff700 -1 /mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f14bb7ff700 time 2017-03-12 10:12:01.131139
/mnt/jenkins/ceph/rpmbuild/BUILD/ceph-12.0.0-1037-gf7e0f57/src/common/Mutex.cc: 113: FAILED assert(r == 0)



 ceph version 12.0.0-1037-gf7e0f57 (f7e0f57f797e1bf6a80a7226bcc21765024e7e8a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f14c08c50e0]
 2: (Mutex::Lock(bool)+0x194) [0x7f14c088adb4]
 3: (RDMADispatcher::erase_qpn(unsigned int)+0x1d) [0x7f14c096c1ed]
 4: (RDMADispatcher::handle_async_event()+0x49c) [0x7f14c096c7ec]
 5: (RDMADispatcher::polling()+0x566) [0x7f14c096eff6]
 6: (()+0xb5220) [0x7f14bdc91220]
 7: (()+0x7dc5) [0x7f14be319dc5]
 8: (clone()+0x6d) [0x7f14bd3f921d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Member Author

yuyuyu101 commented Mar 12, 2017

@Adirl fixed in #13905

Contributor

DanielBar-On commented Mar 12, 2017

Couldn't replicate the crash. Regardless, the qp leak from #13435 is resolved.
