AsyncMessenger: Bind thread to core, use buffer read and fix some bugs #3211
liewegas merged 28 commits into ceph:master
Conversation
SUCCESS: make check on 701ca11 output is http://paste.ubuntu.com/9566231/
SUCCESS: the output of run-make-check.sh on 455980e is http://paste.pound-python.org/show/aTHR3KLWV19wSVbADhud/
SUCCESS: the output of run-make-check.sh on c8417fc is http://paste.pound-python.org/show/iX3x8h1MhDl5LB5QSQdX/
SUCCESS: the output of run-make-check.sh on fb8aa2e is http://paste.pound-python.org/show/0aqTftCXfPWZe7w1YHom/
FAIL: the output of run-make-check.sh on 1b93d23 is http://paste.pound-python.org/show/0KX0dZXsMqd9Dx4ROgM4/
FAIL: the output of run-make-check.sh on b47ddad is http://paste.pound-python.org/show/1EYiXYettCig28uMbscx/
FAIL: the output of run-make-check.sh on 17f40af is http://paste.pound-python.org/show/RsRXI3vJERIzAzQO5MeQ/
SUCCESS: the output of run-make-check.sh on 809539b is http://paste.pound-python.org/show/k37wrugTh6jwfLs0GpDz/
SUCCESS: the output of run-make-check.sh on d28bb83 is http://paste.pound-python.org/show/pSIQecSuNEqhW6VY5jNc/
SUCCESS: the output of run-make-check.sh on centos-centos7 for d28bb83 is http://paste2.org/sELM1yeW
SUCCESS: the output of run-make-check.sh on centos-centos7 for 905d86f is http://paste2.org/7KZE8YeO
SUCCESS: the output of run-make-check.sh on centos-centos7 for 905d86f is http://paste2.org/AemLt83t
Force-pushed 0db2e5f to 867cd71
FAIL: the output of run-make-check.sh on centos-centos7 for 12423e2589bc3f9215febd6f62511aeb0a6344cb is http://paste2.org/ssHLfZMX
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Make handle_connect_msg follow the lock rule: release any held lock before acquiring the messenger's lock; otherwise a deadlock can occur. Also strengthen the lock-condition checks, because the connection's state may change while the connection unlocks and re-locks itself. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Because AsyncConnection never enters the "open" tag from the "replace" tag, the code that sets reply_tag is never exercised when entering the "open" tag. This causes the server side to discard out_q and lose state. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a connection has sent many messages without being acked and is then marked down, the next new connection will issue a connect_msg with connect_seq=0. The server side needs to detect "connect_seq == 0 && existing->connect_seq > 0" so that it resets out_q and detects the remote reset. But if the client side fails before sending the connect_msg, it will then issue a connect_msg with a non-zero connect_seq, so the server side cannot detect the remote reset. The server side will reply with a non-zero in_seq and crash the client. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a client reconnects to an already marked-down endpoint, the server side detects a remote reset and resets the existing connection. Meanwhile, the client-side connection receives the RETRY tag and tries to reconnect, again sending a connect_msg with connect_seq=1. This meets the server-side connection's connect_seq=0, which makes the server side reply with a RESET tag, so the connection loops between the RESET and RETRY tags. One solution is to close the server-side connection if connect_seq == 0 and no message is queued, but that triggers another problem:
1. The client connects to an already marked-down endpoint.
2. The client calls send_message.
3. The server side accepts the new socket, replaces the old one, and replies with a RETRY tag.
4. The client increments connect_seq, but a socket failure happens.
5. The server-side connection detects this and closes, because connect_seq == 0 and no message is queued.
6. The client reconnects; the server side has no existing connection and sees "connect.connect_seq > 0", so it replies with a RESET tag.
7. The client discards all messages in its queue, so we lose a message that was never delivered.
This solution instead adds a new "once_session_reset" flag to indicate whether "existing" was ever reset. The server side's connect_seq is 0 only if it never connected successfully or a session reset happened, and we only need to reply with a RESET tag in the latter case. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
@dachary I guess "make check" thinks it timed out? The new unittest_msgr actually runs for much longer than before.
SUCCESS: the output of run-make-check.sh on centos-centos7 for 3162e6d is http://paste2.org/m4AZtsbs
Please ignore the latest failure from the bot.
@yuyuyu101 If I'm not mistaken, the reason the tests take more time is that you increased the timers in CHECK_AND_WAIT_TRUE. The problem with this approach is that it does not fix the root of the problem, which is that the tests are racy and will fail depending on how slow or fast the machine is. Increasing the timers only makes the problems less frequent; it does not fix them. It means that someone working on an unrelated part of Ceph may see test_msgr.cc fail, and it will be quite difficult for her/him to figure out that it is because test_msgr.cc will sometimes fail, but not most of the time. If I missed the real fix, please ignore me ;-)
@dachary No, CHECK_AND_WAIT_TRUE isn't the root cause. For example, if we run this test in a slow VM, it needs more time to build the TCP connection and handshake. Previously it was only left 1 ms to do so, and now I have changed that to 0.5 s at most. It doesn't hide any potential bug. The last assertion failure happened because the test marks down a connection, but the peer needs more time than I expected to detect the closed connection; the detection time depends on the OS and the network. So I added a lock/cond method instead of a fixed waiting time. The reason this test now runs for more than two minutes is that SyntheticStressTest and SyntheticInjecteTest were added, which consume nearly 5 minutes even on my physical server. The time changed for CHECK_AND_WAIT_TRUE is very small. I hope I have clarified the reason :-)
This is exactly what I mean, thanks for clarifying. My point is that every time you use CHECK_AND_WAIT_TRUE you can replace it with a lock/cond. That will make the test bulletproof and impossible to fail, no matter how slow the machine is. Does that make sense, or is there a CHECK_AND_WAIT_TRUE call that will never fail?
*_handler stores a reference to AsyncConnection; it needs to be explicitly reset. Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
@dachary Yes, all CHECK_AND_WAIT_TRUE call sites have now been reviewed again; I think they shouldn't fail anymore if everything else is well. You also suggested that unittest_msgr runs too long for "make check": it now runs for nearly 7-8 minutes on my physical server. But if we decrease the op count to bring it under 1 minute, it may not meet the test's purpose, so I think it could be moved to qa-suite. @liewegas Updated; add to qa?
SUCCESS: the output of run-make-check.sh on centos-centos7 for acf4188 is http://paste2.org/AF7LGDb6
AsyncMessenger: Bind thread to core, use buffer read and fix some bugs
Implements features #10172 and #10396