AsyncMessenger: Bind thread to core, use buffer read and fix some bugs #3211
liewegas merged 28 commits into ceph:master
Conversation
SUCCESS: make check on 701ca11 output is http://paste.ubuntu.com/9566231/
SUCCESS: the output of run-make-check.sh on 455980e is http://paste.pound-python.org/show/aTHR3KLWV19wSVbADhud/
SUCCESS: the output of run-make-check.sh on c8417fc is http://paste.pound-python.org/show/iX3x8h1MhDl5LB5QSQdX/
SUCCESS: the output of run-make-check.sh on fb8aa2e is http://paste.pound-python.org/show/0aqTftCXfPWZe7w1YHom/
FAIL: the output of run-make-check.sh on 1b93d23 is http://paste.pound-python.org/show/0KX0dZXsMqd9Dx4ROgM4/
FAIL: the output of run-make-check.sh on b47ddad is http://paste.pound-python.org/show/1EYiXYettCig28uMbscx/
FAIL: the output of run-make-check.sh on 17f40af is http://paste.pound-python.org/show/RsRXI3vJERIzAzQO5MeQ/
SUCCESS: the output of run-make-check.sh on 809539b is http://paste.pound-python.org/show/k37wrugTh6jwfLs0GpDz/
SUCCESS: the output of run-make-check.sh on d28bb83 is http://paste.pound-python.org/show/pSIQecSuNEqhW6VY5jNc/
SUCCESS: the output of run-make-check.sh on centos-centos7 for d28bb83 is http://paste2.org/sELM1yeW
SUCCESS: the output of run-make-check.sh on centos-centos7 for 905d86f is http://paste2.org/7KZE8YeO
SUCCESS: the output of run-make-check.sh on centos-centos7 for 905d86f is http://paste2.org/AemLt83t
Force-pushed 0db2e5f to 867cd71
FAIL: the output of run-make-check.sh on centos-centos7 for 12423e2589bc3f9215febd6f62511aeb0a6344cb is http://paste2.org/ssHLfZMX
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Make handle_connect_msg follow the lock rule: release any held lock before acquiring the messenger's lock; otherwise a deadlock can occur. Also strengthen the lock-condition checks, because the connection's state may change while the connection unlocks and re-locks itself. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Because AsyncConnection never enters the "open" tag from the "replace" tag, the code that sets reply_tag is never exercised when entering the "open" tag. This causes the server side to discard out_q and lose state. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a connection has sent many messages without being acked and is then marked down, the next new connection will issue a connect_msg with connect_seq=0. The server side needs to detect "connect_seq == 0 && existing->connect_seq > 0" so that it resets out_q and detects the remote reset. But if the client side fails before sending the connect_msg, it will then issue a connect_msg with a non-zero connect_seq, so the server side cannot detect the remote reset. The server side will reply with a non-zero in_seq and crash the client. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a client reconnects to an already marked-down endpoint, the server side detects a remote reset and resets the existing connection. Meanwhile, the client-side connection receives the RETRY tag and tries to reconnect, again sending a connect_msg with connect_seq=1. This meets the server-side connection's connect_seq=0, which makes the server side reply with a RESET tag, so the connection loops between the RESET and RETRY tags. One solution is to close the server-side connection if connect_seq == 0 and no message is queued, but that triggers another problem:
1. The client connects to an already marked-down endpoint.
2. The client calls send_message.
3. The server side accepts the new socket, replaces the old one, and replies with a RETRY tag.
4. The client increments connect_seq, but a socket failure happens.
5. The server-side connection detects this and closes, because connect_seq == 0 and no message is queued.
6. The client reconnects; the server side has no existing connection and sees "connect.connect_seq > 0", so it replies with a RESET tag.
7. The client discards all messages in its queue, so we lose a message that was never delivered.
This solution instead adds a new "once_session_reset" flag to indicate whether "existing" was ever reset. The server side's connect_seq is 0 only if it never connected successfully or a session reset happened, and we only need to reply with a RESET tag in the latter case. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
@dachary I guess "make check" thinks it timed out? The new unittest_msgr actually runs for much longer than before.
SUCCESS: the output of run-make-check.sh on centos-centos7 for 3162e6d is http://paste2.org/m4AZtsbs
Please ignore the latest failure from the bot.
@yuyuyu101 If I'm not mistaken, the reason the tests take more time is that you increased the timers in CHECK_AND_WAIT_TRUE. The problem with this approach is that it does not fix the root of the problem, which is that the tests are racy and will fail depending on how slow or fast the machine is. Increasing the timers only makes the problems less frequent; it does not fix them. It means that someone working on an unrelated part of Ceph may see test_msgr.cc fail, and it will be quite difficult for her/him to figure out that it is because test_msgr.cc will sometimes fail, but not most of the time. If I missed the real fix, please ignore me ;-)
@dachary No, CHECK_AND_WAIT_TRUE isn't the root cause. For example, if we run this test in a slow VM, it needs more time to build the TCP connection and handshake. Previously it was only left 1 ms to do so, and now I have changed that to 0.5 s at most. It doesn't hide any potential bug. The last assertion failure happened because the test marks down a connection, but the peer needs more time than I expected to detect the closed connection; the detection time depends on the OS and the network. So I added a lock/cond method instead of a fixed waiting time. The reason this test now runs for more than two minutes is that SyntheticStressTest and SyntheticInjecteTest were added, which consume nearly 5 minutes even on my physical server. The time changed for CHECK_AND_WAIT_TRUE is very small. I hope I have clarified the reason :-)
This is exactly what I mean, thanks for clarifying. My point is that every time you use CHECK_AND_WAIT_TRUE you can replace it with a lock/cond. That will make the test bulletproof and impossible to fail, no matter how slow the machine is. Does that make sense, or is there a CHECK_AND_WAIT_TRUE call that will never fail?
*_handler stores a reference to AsyncConnection; it needs to be explicitly reset. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
@dachary Yes, all CHECK_AND_WAIT_TRUE call sites have now been reviewed again; I think they shouldn't fail anymore if everything else is well. You also suggested that unittest_msgr runs too long for "make check": it now runs for nearly 7-8 minutes on my physical server. But if we decrease the op count to bring it under 1 minute, it may not meet the test's purpose, so I think it could be moved to qa-suite. @liewegas Updated; add to qa?
SUCCESS: the output of run-make-check.sh on centos-centos7 for acf4188 is http://paste2.org/AF7LGDb6
AsyncMessenger: Bind thread to core, use buffer read and fix some bugs
Implements features #10172 and #10396