
AsyncMessenger: Bind thread to core, use buffer read and fix some bugs#3211

Merged
liewegas merged 28 commits into ceph:master from yuyuyu101:wip-10172 on Jan 19, 2015

Conversation

@yuyuyu101
Member

  1. Bind a limited number of threads to specified cores
  2. Use buffered reads, as Pipe does
  3. Discard the poll call
  4. Add ms_inject_* options to async
  5. Fix the replace process
  6. Fix a memory leak
  7. Fix many small bugs

Implements features #10172 and #10396.

@loic-bot

SUCCESS: make check on 701ca11 output is http://paste.ubuntu.com/9566231/

:octocat: Sent from GH.

@loic-bot

SUCCESS: the output of run-make-check.sh on 455980e is http://paste.pound-python.org/show/aTHR3KLWV19wSVbADhud/

:octocat: Sent from GH.

@loic-bot

SUCCESS: the output of run-make-check.sh on c8417fc is http://paste.pound-python.org/show/iX3x8h1MhDl5LB5QSQdX/

:octocat: Sent from GH.

@loic-bot

loic-bot commented Jan 5, 2015

SUCCESS: the output of run-make-check.sh on fb8aa2e is http://paste.pound-python.org/show/0aqTftCXfPWZe7w1YHom/

:octocat: Sent from GH.

@loic-bot

loic-bot commented Jan 7, 2015

FAIL: the output of run-make-check.sh on 1b93d23 is http://paste.pound-python.org/show/0KX0dZXsMqd9Dx4ROgM4/

:octocat: Sent from GH.

@loic-bot

loic-bot commented Jan 7, 2015

FAIL: the output of run-make-check.sh on b47ddad is http://paste.pound-python.org/show/1EYiXYettCig28uMbscx/

:octocat: Sent from GH.

@loic-bot

loic-bot commented Jan 7, 2015

FAIL: the output of run-make-check.sh on 17f40af is http://paste.pound-python.org/show/RsRXI3vJERIzAzQO5MeQ/

:octocat: Sent from GH.

@loic-bot

loic-bot commented Jan 7, 2015

SUCCESS: the output of run-make-check.sh on 809539b is http://paste.pound-python.org/show/k37wrugTh6jwfLs0GpDz/

:octocat: Sent from GH.

@loic-bot

loic-bot commented Jan 8, 2015

SUCCESS: the output of run-make-check.sh on d28bb83 is http://paste.pound-python.org/show/pSIQecSuNEqhW6VY5jNc/

:octocat: Sent from GH.

@yuyuyu101 changed the title from "AsyncMessenger: Bind thread to core, use buffer read and clean up" to "AsyncMessenger: Bind thread to core, use buffer read and fix some bugs" on Jan 13, 2015
@loic-bot

SUCCESS: the output of run-make-check.sh on centos-centos7 for d28bb83 is http://paste2.org/sELM1yeW

:octocat: Sent from GH.

@loic-bot

SUCCESS: the output of run-make-check.sh on centos-centos7 for 905d86f is http://paste2.org/7KZE8YeO

:octocat: Sent from GH.

@loic-bot

SUCCESS: the output of run-make-check.sh on centos-centos7 for 905d86f is http://paste2.org/AemLt83t

:octocat: Sent from GH.

@yuyuyu101 force-pushed the wip-10172 branch 2 times, most recently from 0db2e5f to 867cd71 on January 14, 2015 03:56
@loic-bot

FAIL: the output of run-make-check.sh on centos-centos7 for 12423e2589bc3f9215febd6f62511aeb0a6344cb is http://paste2.org/ssHLfZMX

:octocat: Sent from GH.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Make handle_connect_msg follow the lock rule: release any held lock before
acquiring the messenger's lock; otherwise a deadlock can occur.

Strengthen the condition checks after relocking, because the connection's
state may change while it is unlocked and locked again.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Because AsyncConnection never enters the "open" tag from the "replace" tag,
the code that sets reply_tag is never reached when entering the "open" tag.
This causes the server side to discard out_q and lose state.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a connection has sent many messages without being acked and is then marked
down, the next new connection will issue a connect_msg with connect_seq=0.
The server side needs to detect "connect_seq == 0 && existing->connect_seq > 0"
so it can reset out_q and report a remote reset. But if the client side fails
before sending the connect_msg, it will then issue a connect_msg with a
non-zero connect_seq, so the server side cannot detect the existing remote
reset. The server side will reply with a non-zero in_seq and crash the client.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
If a client reconnects to an endpoint that was already marked down, the server
side detects a remote reset and resets the existing connection. Meanwhile, the
client-side connection receives the RETRY tag and tries to reconnect, again
sending a connect_msg with connect_seq(1). This meets the server-side
connection's connect_seq(0), so the server replies with a RESET tag, and the
connection loops between the RESET and RETRY tags.

One solution is to close the server-side connection if connect_seq == 0 and
no messages are queued, but that triggers another problem:
1. The client connects to an endpoint that was already marked down.
2. The client calls send_message.
3. The server side accepts the new socket, replaces the old one, and replies with a RETRY tag.
4. The client increments connect_seq, but a socket failure occurs.
5. The server-side connection detects the failure and closes, because connect_seq == 0 and no messages are queued.
6. The client reconnects; the server side has no existing connection and sees "connect.connect_seq > 0", so it replies with a RESET tag.
7. The client discards all messages in its queue, so a message is lost and never delivered.

This solution adds a new "once_session_reset" flag to indicate whether the
"existing" connection has been reset. The server side's connect_seq is 0 only
if it has never successfully connected or has gone through a session reset,
so we only need to reply with a RESET tag if a session reset actually happened.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
@yuyuyu101
Member Author

@dachary I guess "make check" thinks it timed out? The new unittest_msgr actually runs much longer than before.

@loic-bot

SUCCESS: the output of run-make-check.sh on centos-centos7 for 3162e6d is http://paste2.org/m4AZtsbs

:octocat: Sent from GH.

@ghost

ghost commented Jan 15, 2015

Please ignore the bot's latest failure.
Is there a reason for unittest_msgr to take more than two minutes?

@ghost

ghost commented Jan 15, 2015

@yuyuyu101 if I'm not mistaken, the reason the tests take more time is that you increased the timers in CHECK_AND_WAIT_TRUE. The problem with this approach is that it does not fix the root of the problem, which is that the tests are racy and will fail depending on how slow or fast the machine is. Increasing the timers only makes the failures less frequent; it does not fix them. It means that someone working on an unrelated part of Ceph may see test_msgr.cc fail, and it will be quite difficult for her/him to figure out that it is because test_msgr.cc sometimes fails, but not most of the time.

If I missed the real fix, please ignore me ;-)

@yuyuyu101
Member Author

@dachary No, CHECK_AND_WAIT_TRUE isn't the root cause. For example, if we run this test in a slow VM, it needs more time to build the TCP connection and complete the handshake. Previously it allowed only 1ms for that; I changed it to 0.5s at most. It doesn't hide any potential bug.

The last assert failure happened because the test calls mark_down on a connection, but the peer needs more time than I expected to detect the closed connection; the detection time depends on the OS and the network. So I added a lock/cond method instead of a fixed waiting time.

The reason this test runs for more than two minutes is that SyntheticStressTest and SyntheticInjecteTest were added, which consume nearly 5 minutes even on my physical server. The time change for CHECK_AND_WAIT_TRUE contributes very little.

I hope that clarifies the reason :-)

@ghost

ghost commented Jan 16, 2015

@yuyuyu101

The last assert failure happened because the test calls mark_down on a connection, but the peer needs more time than I expected to detect the closed connection; the detection time depends on the OS and the network. So I added a lock/cond method instead of a fixed waiting time.

This is exactly what I mean, thanks for clarifying. My point is that every time you use CHECK_AND_WAIT_TRUE you can replace it with a lock/cond. That will make the test bulletproof and impossible to fail, no matter how slow the machine is.

Does that make sense, or is there a CHECK_AND_WAIT_TRUE call that can never fail?

*_handler stores a reference to AsyncConnection; it needs to be explicitly
reset.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
@yuyuyu101
Member Author

@dachary Yes, all the current CHECK_AND_WAIT_TRUE calls have been reviewed again. I think they shouldn't fail again if everything else is working.

And regarding your suggestion that unittest_msgr runs too long for "make check": unittest_msgr now runs for nearly 7-8 minutes on my physical server. But I think that if we decrease the op count to bring it under 1 minute, it may not meet the test's purpose. So I think it could be moved to the qa-suite instead.

@liewegas Updated, add to qa?

@loic-bot

SUCCESS: the output of run-make-check.sh on centos-centos7 for acf4188 is http://paste2.org/AF7LGDb6

:octocat: Sent from GH.

liewegas added a commit that referenced this pull request Jan 19, 2015
AsyncMessenger: Bind thread to core, use buffer read and fix some bugs
@liewegas liewegas merged commit 01af73b into ceph:master Jan 19, 2015