msg/async/rdm: fix leak when existing failure in ip network #13435

Merged

merged 1 commit into ceph:master from yuyuyu101:wip-rdma-leak on Feb 18, 2017

Conversation

3 participants
@yuyuyu101
Member

yuyuyu101 commented Feb 15, 2017

Signed-off-by: Haomai Wang haomai@xsky.com

msg/async/rdm: fix leak when existing failure in ip network
Signed-off-by: Haomai Wang <haomai@xsky.com>
@yuyuyu101


Member

yuyuyu101 commented Feb 15, 2017

@Adirl @orendu I found the leaking QP: it is created, but the subsequent "try_connect" call fails. So the leak happens in an unhealthy cluster, and there is no problem in a healthy cluster.
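Below is a minimal, hypothetical sketch of the leak pattern being described here; QueuePair, tcp_connect, try_connect_leaky and try_connect_fixed are illustrative stand-ins, not Ceph's actual RDMAConnectedSocketImpl code. The point is that the QP is allocated before the plain TCP/IP exchange with the peer, and an early return on a failed connect skips its teardown.

```cpp
// Hedged sketch of the leak pattern (hypothetical names, not Ceph's code).
#include <iostream>
#include <memory>

struct QueuePair {                    // stand-in for an RDMA QP wrapper
    QueuePair()  { std::cout << "QP created\n"; }
    ~QueuePair() { std::cout << "QP destroyed\n"; }
};

// Returns -1 when the TCP/IP exchange of QP info fails
// (e.g. the peer is unreachable in an unhealthy cluster).
int tcp_connect(bool peer_healthy) { return peer_healthy ? 0 : -1; }

// Leaky shape: the raw QP is allocated first, and the early return on a
// failed IP-network connect never frees it.
int try_connect_leaky(bool peer_healthy) {
    QueuePair* qp = new QueuePair();
    int r = tcp_connect(peer_healthy);
    if (r < 0)
        return r;                     // leak: qp is never destroyed
    delete qp;                        // normal teardown happens elsewhere in reality
    return 0;
}

// Fixed shape (in spirit): tie the QP lifetime to the failure path as well,
// e.g. by destroying it before the early return or owning it with a smart pointer.
int try_connect_fixed(bool peer_healthy) {
    auto qp = std::make_unique<QueuePair>();
    int r = tcp_connect(peer_healthy);
    if (r < 0)
        return r;                     // qp destroyed automatically on return
    return 0;
}

int main() {
    try_connect_leaky(false);         // prints "QP created" only
    try_connect_fixed(false);         // prints "QP created" then "QP destroyed"
}
```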

@yuyuyu101


Member

yuyuyu101 commented Feb 15, 2017

@Adirl please test this PR in your cluster to verify the current QP number and see whether it matches our assumption.

@Adirl


Adirl commented Feb 15, 2017

Great! Thanks.
The patch looks good; I need to test it and will send an update.

@yuyuyu101 yuyuyu101 merged commit 6dcd79c into ceph:master Feb 18, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

@yuyuyu101 yuyuyu101 deleted the yuyuyu101:wip-rdma-leak branch Feb 18, 2017

@DanielBar-On


Contributor

DanielBar-On commented Feb 28, 2017

@yuyuyu101
Confirming the QP leak issue is resolved. Checked on 3 nodes with 13 OSDs.

With 1 OSD up out of 13 in, we got:
1528 QPs created, 9 QPs active, and 16 QPs according to kernel system information.
After a few minutes, the results stayed pretty much the same:
1580 QPs created, 9 QPs active, and 14 QPs according to kernel system information.

With 6 OSDs up out of 13 in, we got:
2314 QPs created, 164 QPs active, and 182 QPs according to kernel system information.
Again, after a few minutes, the results didn't change:
2407 QPs created, 164 QPs active, and 182 QPs according to kernel system information.

With 13 OSDs up out of 13 in, we got:
2287 QPs created, 712 QPs active, and 737 QPs according to kernel system information.
And again, after a few minutes:
2616 QPs created, 712 QPs active, and 733 QPs according to kernel system information.

@Adirl


Adirl commented Feb 28, 2017

@DanielBo @yuyuyu101
QP numbers look good!

@DanielBar-On


Contributor

DanielBar-On commented Feb 28, 2017

@yuyuyu101
Encountered a new issue:
On a setup of 3 nodes, 1 mon, and 13 OSDs, we found that after killing OSDs and bringing them back up a few times, a machine starts showing a constantly increasing number of QPs according to the kernel system information. The created-QPs counter increases as well, but the active-QPs counter stays the same.

Following the logs from the problematic node, this is what happens: we hit a "wrong node!" error, which invokes the destructor ~RDMAConnectedSocketImpl, but the polling thread is busy and the dead QPs never get destroyed (normally, polling would destroy them).

The log is also attached.

2017-02-28 12:06:25.992135 7f9c3cc5f700 25 -- 11.0.0.4:6800/17441 >> 11.0.0.2:6800/5645 conn(0x7f9c4e324800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).read_until read_bulk recv_end is 0 left is 281 got 281
2017-02-28 12:06:25.992145 7f9c3cc5f700 20 -- 11.0.0.4:6800/17441 >> 11.0.0.2:6800/5645 conn(0x7f9c4e324800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect read peer addr 11.0.0.2:6800/24196 on socket 47
2017-02-28 12:06:25.992154 7f9c3cc5f700 0 -- 11.0.0.4:6800/17441 >> 11.0.0.2:6800/5645 conn(0x7f9c4e324800 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._process_connection connect claims to be 11.0.0.2:6800/24196 not 11.0.0.2:6800/5645 - wrong node!
2017-02-28 12:06:25.992164 7f9c3cc5f700 20 Event(0x7f9c4d12c340 nevent=5000 time_id=783).delete_file_event delete event started fd=47 mask=3 original mask is 3
2017-02-28 12:06:25.992167 7f9c3cc5f700 20 EpollDriver.del_event del event fd=47 cur_mask=3 delmask=3 to 4
2017-02-28 12:06:25.992172 7f9c3cc5f700 10 Event(0x7f9c4d12c340 nevent=5000 time_id=783).delete_file_event delete event end fd=47 mask=3 original mask is 0
2017-02-28 12:06:25.992176 7f9c3cc5f700 20 RDMAConnectedSocketImpl ~RDMAConnectedSocketImpl destruct.
2017-02-28 12:06:25.992179 7f9c3d460700 20 RDMAStack polling pool completion queue got 1 responses.
2017-02-28 12:06:25.992179 7f9c3cc5f700 20 Event(0x7f9c4d12c340 nevent=5000 time_id=783).delete_file_event delete event started fd=50 mask=1 original mask is 1
2017-02-28 12:06:25.992181 7f9c3d460700 25 RDMAStack got a tx cqe, bytes:281

osd9log.txt
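Below is a minimal, hypothetical sketch of the starvation scenario described above; Poller, enqueue_dead_qp and poll_once are illustrative stand-ins, not Ceph's actual RDMAStack polling code. The destructor only queues the dead QP for destruction, and if the polling loop is never idle, the reap step is starved and the dead-QP count keeps growing, matching the rising kernel-side QP number.

```cpp
// Hedged sketch of polling-thread starvation (hypothetical names, not Ceph's code).
#include <cstdio>
#include <mutex>
#include <vector>

struct QueuePair { };                        // stand-in for an RDMA QP wrapper

class Poller {
    std::mutex lock_;
    std::vector<QueuePair*> dead_qps_;       // QPs waiting to be destroyed
public:
    // Called from ~RDMAConnectedSocketImpl-like teardown (e.g. on "wrong node!").
    void enqueue_dead_qp(QueuePair* qp) {
        std::lock_guard<std::mutex> l(lock_);
        dead_qps_.push_back(qp);
    }

    // One polling iteration: dead QPs are only reaped when there was no CQ work.
    void poll_once(bool got_completions) {
        if (got_completions)
            return;                          // starvation point: reap step skipped
        std::lock_guard<std::mutex> l(lock_);
        for (QueuePair* qp : dead_qps_)
            delete qp;                       // real code would destroy the verbs QP
        dead_qps_.clear();
    }

    size_t pending() {
        std::lock_guard<std::mutex> l(lock_);
        return dead_qps_.size();
    }
};

int main() {
    Poller p;
    p.enqueue_dead_qp(new QueuePair());      // a destructor handed off a dead QP
    for (int i = 0; i < 1000; ++i)
        p.poll_once(/*got_completions=*/true);   // the CQ is never idle
    std::printf("dead QPs still pending: %zu\n", p.pending());  // prints 1
}
```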

@DanielBar-On


Contributor

DanielBar-On commented Mar 5, 2017

Hey @yuyuyu101, any idea why this is happening?

@yuyuyu101


Member

yuyuyu101 commented Mar 6, 2017
