[ROCKETMQ-184]-It takes too long(3-33 seconds) to switch to read from slave when master crashes #95

Jaskey · 2017-04-20T14:21:29Z

JIRA:https://issues.apache.org/jira/browse/ROCKETMQ-184?jql=project%20%3D%20ROCKETMQ

Problem: no listener is triggered when the channel is closed.

When an async command is sent to the server and the server crashes before sending the response to the client, the callback cannot be invoked in time. Instead, the callback can only be triggered by the timeout scan service.

This is most noticeable when pulling messages, since the timeout is 30 seconds by default. If the master crashes before sending the response to the client, the client cannot re-pull until the scan service notifies it, which takes at most 30 seconds. The re-pull then has a 3-second delay, so the HA switch to reading from the slave takes 3-33 seconds when this problem occurs.
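To make the 3-33 second window concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not RocketMQ's actual code: the class `TimeoutScanSketch` and its methods are hypothetical, and it assumes the scan runs exactly at the deadline, ignoring the scan interval.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a pending-request table that only a timeout scan can clear.
class TimeoutScanSketch {
    static final long PULL_TIMEOUT_MS = 30_000; // default pull timeout cited above
    static final long REPULL_DELAY_MS = 3_000;  // delay before the client pulls again

    // opaque (request id) -> absolute deadline in milliseconds
    private final Map<Integer, Long> pending = new ConcurrentHashMap<>();

    void sendAsync(int opaque, long nowMs) {
        pending.put(opaque, nowMs + PULL_TIMEOUT_MS);
    }

    // Periodic scan: with no close listener, this is the only place a callback
    // can fire when the response never arrives.
    List<Integer> scan(long nowMs) {
        List<Integer> timedOut = new ArrayList<>();
        Iterator<Map.Entry<Integer, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue() <= nowMs) {
                it.remove();
                timedOut.add(entry.getKey());
            }
        }
        return timedOut;
    }

    // Time from the master crash until the client pulls again.
    // Assumes the scan fires exactly at the request deadline.
    static long failoverDelayMs(long sentAtMs, long crashAtMs) {
        long callbackAtMs = sentAtMs + PULL_TIMEOUT_MS;
        return (callbackAtMs - crashAtMs) + REPULL_DELAY_MS;
    }

    public static void main(String[] args) {
        TimeoutScanSketch table = new TimeoutScanSketch();
        table.sendAsync(1, 0);
        System.out.println(table.scan(1_000));          // [] : a crash at 1s goes unnoticed
        System.out.println(table.scan(30_000));         // [1] : callback fires at the deadline
        System.out.println(failoverDelayMs(0, 0));      // 33000 : crash right after sending
        System.out.println(failoverDelayMs(0, 30_000)); // 3000 : crash right before the deadline
    }
}
```

The two extremes correspond to the master crashing immediately after the request is sent (worst case) and immediately before the request would have timed out anyway (best case).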

coveralls · 2017-04-20T14:57:46Z

Coverage increased (+0.04%) to 34.666% when pulling 0dc37e1 on Jaskey:ROCKETMQ-184-slave-switch into 42f78c2 on apache:develop.

coveralls · 2017-04-21T03:42:37Z

Coverage decreased (-0.0007%) to 34.631% when pulling 65ffffd on Jaskey:ROCKETMQ-184-slave-switch into 42f78c2 on apache:develop.

lizhanhui · 2017-04-21T07:12:12Z

remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingAbstract.java

            final SemaphoreReleaseOnlyOnce once = new SemaphoreReleaseOnlyOnce(this.semaphoreAsync);

-            final ResponseFuture responseFuture = new ResponseFuture(opaque, timeoutMillis, invokeCallback, once);
+            final GenericFutureListener<ChannelFuture> chanelCloseListener = new ChannelFutureListener() {


chanel-->channel

lizhanhui · 2017-04-22T01:08:47Z

Thanks @Jaskey, this is indeed a good place to improve.

For the implementation, I suggest a more generic alternative: instead of adding a close future for each request, we add the opaque integer to a per-channel collection. We remove the opaque integer on response, or invalidate all of them in NettyConnectManageHandler. The suggested approach has a smaller memory footprint, and we can also easily cover the sync request scenario: respond early, before timeoutMillis has elapsed, if the channel experiences problems.
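A rough sketch of what this per-channel bookkeeping might look like follows. All names here (`PendingRequestsSketch`, `register`, `onResponse`, `onChannelClosed`) are hypothetical, and channels are modelled as plain strings so the example runs without Netty; in the real code the close hook would presumably be called from NettyConnectManageHandler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch: track pending opaques per channel and fail them all on close.
class PendingRequestsSketch {
    // opaque -> callback to invoke with a failure reason
    private final Map<Integer, Consumer<String>> callbacks = new ConcurrentHashMap<>();
    // channel id -> opaques still awaiting a response on that channel
    private final Map<String, Set<Integer>> opaquesByChannel = new ConcurrentHashMap<>();

    void register(String channelId, int opaque, Consumer<String> onFailure) {
        callbacks.put(opaque, onFailure);
        opaquesByChannel
            .computeIfAbsent(channelId, k -> ConcurrentHashMap.newKeySet())
            .add(opaque);
    }

    // A response arrived, so the request is no longer pending.
    void onResponse(String channelId, int opaque) {
        callbacks.remove(opaque);
        Set<Integer> opaques = opaquesByChannel.get(channelId);
        if (opaques != null) {
            opaques.remove(opaque);
        }
    }

    // The channel closed: fail every request still pending on it, without
    // waiting for the timeout scan. Returns the number of callbacks invoked.
    int onChannelClosed(String channelId) {
        Set<Integer> opaques = opaquesByChannel.remove(channelId);
        if (opaques == null) {
            return 0;
        }
        int failed = 0;
        for (Integer opaque : opaques) {
            Consumer<String> callback = callbacks.remove(opaque);
            if (callback != null) {
                callback.accept("channel closed: " + channelId);
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        PendingRequestsSketch pending = new PendingRequestsSketch();
        List<String> failures = new ArrayList<>();
        pending.register("master", 1, failures::add);
        pending.register("master", 2, failures::add);
        pending.onResponse("master", 1);                       // request 1 completed normally
        System.out.println(pending.onChannelClosed("master")); // 1
        System.out.println(failures);                          // [channel closed: master]
    }
}
```

Only one small set per channel is kept, rather than one listener object per request, which is the memory saving the comment refers to. The same hook could wake up a blocked sync caller early.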

Jaskey · 2017-04-22T04:17:39Z

@lizhanhui

I have considered that, but it takes more effort to achieve the same goal, since we need to change some structures so that the connection manager can access the responseFuture map.

With my first implementation, I just wanted to raise this problem and get you all involved in the discussion.

If you think that is indeed a better approach, I will submit an updated implementation for it and let everyone choose.

lizhanhui · 2017-04-23T11:38:01Z

You are right that my suggested approach requires more changes. But, IMO, it is more unified in design and may also save some memory when the semaphore's initial capacity is very large.

Indeed, this is a place we need to enhance.

Let's bring more people into the discussion before you implement the suggested approach. They should easily understand what's going on here by checking the changes made in your PR. Any opinions on this issue? @zhouxinyu @shroman @vongosling

zhouxinyu · 2017-04-23T13:16:27Z

Thanks @Jaskey, I agree with @lizhanhui: using NettyConnectManageHandler to handle this close event is a better way, with no need to introduce two mechanisms.

coveralls · 2017-04-24T11:35:09Z

Coverage increased (+0.1%) to 37.974% when pulling 388e88b on Jaskey:ROCKETMQ-184-slave-switch into 6a9628b on apache:develop.

coveralls · 2017-04-24T11:40:58Z

Coverage increased (+0.02%) to 37.869% when pulling 40d77ea on Jaskey:ROCKETMQ-184-slave-switch into 6a9628b on apache:develop.

coveralls · 2017-04-24T11:50:59Z

Coverage increased (+0.004%) to 37.858% when pulling abd61c8 on Jaskey:ROCKETMQ-184-slave-switch into 6a9628b on apache:develop.

coveralls · 2017-04-24T11:53:58Z

Coverage decreased (-0.09%) to 37.761% when pulling abd61c8 on Jaskey:ROCKETMQ-184-slave-switch into 6a9628b on apache:develop.

coveralls · 2017-04-24T12:37:09Z

Coverage increased (+0.04%) to 37.894% when pulling d8faa33 on Jaskey:ROCKETMQ-184-slave-switch into 6a9628b on apache:develop.

Jaskey · 2017-04-24T13:07:16Z

@zhouxinyu @lizhanhui
please review the updated pr

lizhanhui · 2017-05-11T01:59:28Z

please review the updated pr

It looks like this PR has not been updated. Did you forget to push your changes?

Jaskey · 2017-05-11T02:16:22Z

@lizhanhui

It has been refactored to use the close event handler mechanism, following your advice. Please review the PR, starting from the close callback in NettyRemotingClient.java.

Jaskey · 2017-05-18T12:37:09Z

@lizhanhui @vsair @zhouxinyu @shroman

Any thoughts on this updated solution?

lizhanhui · 2017-05-19T08:44:25Z

+1

dongeforever · 2017-05-25T08:28:59Z

LGTM @zhouxinyu

Jaskey · 2017-06-06T02:43:57Z

@zhouxinyu @vongosling @shroman what's your advice? Can this PR be merged?

coveralls · 2017-11-14T02:08:43Z

Coverage increased (+0.09%) to 39.238% when pulling 80f4e6b on Jaskey:ROCKETMQ-184-slave-switch into cba3089 on apache:develop.

Jaskey · 2017-11-14T04:04:00Z

This PR had not been synced with upstream for a long time, so I have just updated it. Please review.

I think this is a good improvement for HA.

@zhouxinyu @vongosling @shroman @dongeforever

vongosling

LGTM

lizhanhui reviewed Apr 21, 2017

invoke callback at once when channel is close

80f4e6b

vongosling approved these changes Jul 14, 2018

vongosling merged commit 6ae619c into apache:develop Jul 14, 2018

vongosling added this to the 4.3.0 milestone Jul 14, 2018

lizhanhui pushed a commit to lizhanhui/rocketmq that referenced this pull request Jun 25, 2019

Issue apache#95 fix CommitLog bug

1b50547

renshuaibing-aaron pushed a commit to renshuaibing-aaron/rocketmq that referenced this pull request Apr 13, 2020

Invoke callback at once when channel is close (apache#95)

6e6c22b

JiaMingLiu93 pushed a commit to JiaMingLiu93/rocketmq that referenced this pull request May 28, 2020

Invoke callback at once when channel is close (apache#95)

e71631f

6 participants
