Skip to content

[SCB-548] support gracefully shutdown#693

Merged
wujimin merged 13 commits intoapache:masterfrom
zhengyangyong:SCB-548
May 18, 2018
Merged

[SCB-548] support gracefully shutdown#693
wujimin merged 13 commits intoapache:masterfrom
zhengyangyong:SCB-548

Conversation

@zhengyangyong
Copy link
Copy Markdown

@zhengyangyong zhengyangyong commented May 8, 2018

Signed-off-by: zhengyangyong yangyong.zheng@huawei.com

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [SCB-XXX] Fixes bug in ApproximateQuantiles, where you replace SCB-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean install to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

When user call system.exit(0),these three works will do:
1.Unregister microservice instance from Service Center:
this work will do by process ContextClosedEvent, we need unregister immediately for stopping provide service any more.
2.Waiting for all invocations to finish:
this work will do by ShutdownHandler, when all invocations finished or 'deadline time reached', cc and transport vertx threads will close.
3.All spring bean do close process
we had registerShutdownHook for spring ApplicationContext and beans can define 'destroy-method' do cleaning.

here is example:

Spring mvc Hello Java Chassis
Pojo Hello person ServiceComb/Java Chassis
Jaxrs Hello person ServiceComb/Java Chassis
Spring mvc Hello person ServiceComb/Java Chassis
2018-05-09 16:02:14,288 [WARN] handler chain is shutting down org.apache.servicecomb.core.handler.ShutdownHookHandler.run(ShutdownHookHandler.java:87)
2018-05-09 16:02:14,289 [INFO] Closing org.springframework.context.support.ClassPathXmlApplicationContext@7cf10a6f: startup date [Wed May 09 16:02:05 CST 2018]; root of context hierarchy org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:984)
2018-05-09 16:02:14,290 [WARN] cse is closing now... org.apache.servicecomb.core.CseApplicationListener.onApplicationEvent(CseApplicationListener.java:148)
2018-05-09 16:02:14,291 [INFO] service center task is shutdown. org.apache.servicecomb.serviceregistry.registry.RemoteServiceRegistry.onShutdown(RemoteServiceRegistry.java:72)
2018-05-09 16:02:14,295 [WARN] handler chain is shut down org.apache.servicecomb.core.handler.ShutdownHookHandler.run(ShutdownHookHandler.java:103)
2018-05-09 16:02:14,296 [INFO] Unregister microservice instance success. microserviceId=90b76fd551c511e8b51db4b676a39f40 instanceId=481f630d535f11e8bc19b4b676a39f40 org.apache.servicecomb.serviceregistry.registry.AbstractServiceRegistry.unregisterInstance(AbstractServiceRegistry.java:232)
Process finished with exit code 0

@zhengyangyong
Copy link
Copy Markdown
Author

Springmvc it test failed because we had fully shutdown and re-init seems do not init transport again,I will try fix this problem

@WillemJiang
Copy link
Copy Markdown
Member

As the Server socket maybe reused, it's hard to reconnect the server which is restarted with the same port. I think we need to avoid restarting the server in the unit test or system test.

@coveralls
Copy link
Copy Markdown

coveralls commented May 9, 2018

Coverage Status

Coverage increased (+0.2%) to 87.551% when pulling 4b5722c on zhengyangyong:SCB-548 into 5c29a7c on apache:master.

@Override
public void handle(Invocation invocation, AsyncResponse asyncResp) throws Exception {
if (shuttingDown) {
System.out.println("shutting down in progress");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use System.out.println to print logs. And we have already throw an exception with this message, and I think this log is not necessary.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So sorry is for debug,thanks!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted

@liubao68 liubao68 closed this May 10, 2018
@liubao68 liubao68 reopened this May 10, 2018
@zhengyangyong zhengyangyong reopened this May 10, 2018
@zhengyangyong
Copy link
Copy Markdown
Author

[�[1;31mERROR�[m] Failed to execute goal �[32mio.fabric8:docker-maven-plugin:0.20.0:start�[m �[1m(start)�[m on project �[36mdynamic-config-tests�[m: �[1;31mExecution start of goal io.fabric8:docker-maven-plugin:0.20.0:start failed: Start-Job failed with unexpected exception: [nobodyiam/apollo-quick-start] "apollo.servicecomb.apache.org": Timeout after 120033 ms while waiting on log out 'Portal started' and on tcp port '[/172.17.0.3:8080, /172.17.0.3:8070]'�[m -> �[1m[Help 1]�[m

Failed by apollo IT

@wujimin
Copy link
Copy Markdown
Contributor

wujimin commented May 10, 2018

if ContextCloseEvent and handler/ShutdownHookHandler.java is random ordered, then it's not ok

@zhengyangyong
Copy link
Copy Markdown
Author

zhengyangyong commented May 11, 2018

I had used a Semaphore in order to sync all invocation had finished.now gracefully shutdown had fixed below order :

  1. Unregister microservice instance from Service Center and close vertx
  2. Wait all invocation finished (delete timeout mechanism in ShutdownHandler)
  3. Notify all component do clean works via Event
  4. Stop transport and cc vertx to prevent blocking exit

zhengyangyong added 5 commits May 11, 2018 15:23
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
…gmvc-tests)

make four independent module in order to eliminate disturbance(from gracefully shutdown)

Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
@zhengyangyong
Copy link
Copy Markdown
Author

Rebase on latest master

Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
@zhengyangyong
Copy link
Copy Markdown
Author

improve gracefully shutdown process order

  private void gracefullyShutdown() {
    //Step 1: notify all component stop invoke via BEFORE_CLOSE Event
    triggerEvent(EventType.BEFORE_CLOSE);

    //Step 2: Unregister microservice instance from Service Center and close vertx
    RegistryUtils.destroy();
    VertxUtils.closeVertxByName("registry");

    //Step 3: wait all invocation finished
    try {
      ShutdownHookHandler.INSTANCE.ALL_INVOCATION_FINISHED.acquire();
      LOGGER.warn("all invocation finished");
    } catch (InterruptedException e) {
      LOGGER.error("invocation finished semaphore interrupted", e);
    }

    //Step 4: Stop vertx to prevent blocking exit
    VertxUtils.closeVertxByName("config-center");
    VertxUtils.closeVertxByName("transport");

    //Step 5: notify all component do clean works via AFTER_CLOSE Event
    triggerEvent(EventType.AFTER_CLOSE);
  }

@wujimin
Copy link
Copy Markdown
Contributor

wujimin commented May 12, 2018

ShutdownHookHandler is not so good, sometimes will block shutdown process cause of wrong statistics, must depend on timeout to make shutdown process continue.

we should:
1.prevent create new consumer invocation
2.prevent other find me
3.make accurate statistics

@wujimin
Copy link
Copy Markdown
Contributor

wujimin commented May 12, 2018

and CI failed, it's about trace, i can find what happened, it seems that did not print error test case name?

@zhengyangyong
Copy link
Copy Markdown
Author

@wujimin failed by 'ExecutionException Error occurred in starting fork, check output in log', and I do not change zuul test case, I will reopen for restart ci check

Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>

private ShutdownHookHandler() {
try {
ALL_INVOCATION_FINISHED.acquire();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to make semaphore private and add a method like waitInvocationFinish. And in initialization, acquire is not required and shutdownhook can be removed too.

}
}

private synchronized void validAllInvocationFinished() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This synchronized code blocking all invocations and not good idea

@liubao68
Copy link
Copy Markdown
Contributor

liubao68 commented May 14, 2018

@wujimin

ShutdownHookHandler is not so good, sometimes will block shutdown process cause of wrong statistics, must depend on timeout to make shutdown process continue. ----I don't know how this comes. Because
ShutdownHookHandler will wait quit a long time much more than the request timeout. If the time reached there usually means timeout not properly set or some other bug.

we should:
1.prevent create new consumer invocation ----This is have already done
2.prevent other find me ---We can not prevent it from actually happen and maybe time delay.
3.make accurate statistics ---The process is shutdown and we force close some unexpected invocations and I think the precision is not very important and accurate statistics can not be achieved, due to metrics shutdown, for example.

We are experiencing a lot of scenarios that can not be shutdown but need force to exit the program. Program not exit can cause more bad consequences.

@wujimin
Copy link
Copy Markdown
Contributor

wujimin commented May 14, 2018

in performance test, ShutdownHookHandler will always block shutdown process for many seconds (wait for timeout for wrong request/response count statistics)

will change statistics from invocation handler to invocation event
after change to invocation event, then reject in invocation handler is too late
so shutdown hook will rewrite

by new hook, even in performance test, we can shutdown normally in one or two second.

Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
}

public synchronized void init() {
if (validIsDown()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use the SCBStatus directly , I cannot get the method work by reading the name.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good , fixed

zhengyangyong added 2 commits May 16, 2018 17:17
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
safeTriggerEvent(EventType.AFTER_CLOSE);

//Step 6: Clean flags for re-init
eventBus.unregister(this);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when succeed to init, then already unregister this, unregister a not register subscribe will cause exception.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes,it had unregister in triggerAfterRegistryEvent after succeed to init,I will delete this line

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}

private void doUninit() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should not throw exception
even one step failed, must finish remain steps

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, i will fix

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

//Chassis is Stopping (progressing)
STOPPING,
//Chassis Init Failed
FAILED
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems that uninit failed, will set status to FAILED.
so what's your design?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if uninit failed, can enable init again ,so I think we may better set to FAILED ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, we will only init and uninit only one time
but i remember your UT will do this multi times?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done,set to DOWN because no exception will be throw

Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
@zhengyangyong
Copy link
Copy Markdown
Author

Failed by apollo it , [�[1;31mERROR�[m] Failed to execute goal �[32mio.fabric8:docker-maven-plugin:0.20.0:start�[m �[1m(start)�[m on project �[36mdynamic-config-tests�[m: �[1;31mExecution start of goal io.fabric8:docker-maven-plugin:0.20.0:start failed: Start-Job failed with unexpected exception: [nobodyiam/apollo-quick-start] "apollo.servicecomb.apache.org": Timeout after 120162 ms while waiting on log out 'Portal started' and on tcp port '[/172.17.0.3:8080, /172.17.0.3:8070]'�[m -> �[1m[Help 1]�[m Reopen

VertxUtils.blockCloseVertxByName("registry");

//Step 3: wait all invocation finished
// forbit create new consumer invocation
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forbit comments should move to status = SCBStatus.STOPPING;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes,it's my mistake,will fix

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SchemaMeta schemaMeta = referenceConfig.getMicroserviceMeta().ensureFindSchemaMeta(schemaId);
Invocation invocation = InvocationFactory.forConsumer(referenceConfig, schemaMeta, operationName, args);
return syncInvoke(invocation);
validCanInvoke();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check validCanInvoke()
this invoked multi times for one invocation

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the check cost is low, I think it's no problem becasuse these method are all public and user may direct use them for general invoke.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,will changed in future PR

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"validCanInvoke" is a bad name (It just check the status of Engine), how about using the "checkEngineStatus".

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
Signed-off-by: zhengyangyong <yangyong.zheng@huawei.com>
@wujimin wujimin merged commit 731e09e into apache:master May 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants