
[3.0] Manage global resources and executor services, fix zk client connections #9033

Merged: 57 commits into apache:3.0 (Oct 19, 2021)

Conversation

@kylixs (Member) commented Oct 13, 2021:

What is the purpose of the change

  • Improve the management and destruction of executor services
  • Support managing global static resources, including EventLoopGroup, HashedWheelTimer, etc.
  • Improve application/module deploying
  • Fix zk client connection leaks and the early-closing problem; change ZookeeperTransporter to application scope, see [3.0] Manage global resources and executor services, fix zk client connections #9033 (comment)
  • Destroy nacos client and MetadataReportFactory
  • Support checking for unclosed threads in unit tests via mvn test -DcheckThreads=true, see org.apache.dubbo.test.check.DubboTestChecker
  • Sort methods of adaptive classes in dubbo-native

Brief changelog

Verifying this change

…ion is divided into pre-destroy and post-destroy
-        applicationDeployer.checkStarted(startFuture);
+        applicationDeployer.checkStarted();
+        // complete module start future after application state changed, fix #9012 ?
+        startFuture.complete(true);
@zrlw (Contributor), Oct 13, 2021:

No. Your applicationDeployer.checkStarted() only sets the starting flag and returns directly if one module is starting; it doesn't mean all modules have started after applicationDeployer.checkStarted().

@kylixs (Member Author):

The DefaultApplicationDeployer.checkStarted() method is used to check the state of all modules and notify start checking.
When modules are started manually in some scenarios, some modules may already be started while the application stays in STARTING until all modules have started.

    private void doStart() {
        startModules();

        // prepare application instance
        prepareApplicationInstance();

        executorRepository.getSharedExecutor().submit(() -> {
            while (true) {
                // notify on each module started
                synchronized (startedLock) {
                    try {
                        startedLock.wait(500);
                    } catch (InterruptedException e) {
                        // ignore
                    }
                }

                // if has new module, do start again
                if (hasPendingModule()) {
                    startModules();
                    continue;
                }

                DeployState state = checkState();
                if (!(state == DeployState.STARTING || state == DeployState.PENDING)) {
                    // start finished or error
                    break;
                }
            }
        });
    }
   ....

    public void checkStarted() {
        // TODO improve state checking
        DeployState _state = checkState();
        switch (_state) {
            case STARTED:
                onStarted();
                break;
            case STARTING:
                onStarting();
                break;
            case PENDING:
                setPending();
                break;
        }

        // notify started
        synchronized (startedLock) {
            startedLock.notifyAll();
        }
    }

@kylixs (Member Author):

The startFuture of DefaultModuleDeployer just monitors the start action of the module and has no strong relationship with the application.

@zrlw (Contributor), Oct 13, 2021:

Set a breakpoint at checkStarted() and at DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload line 306:

    moduleDeployer1.start().get();

Debug DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload as a JUnit test and you will see that checkStarted is called twice.
At the first call, if you let the main thread continue, the test will fail.

@kylixs (Member Author):

Debug DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload as a JUnit test and you will see that checkStarted is called twice.

The first call happens when the internal module starts, the second when the module of serviceConfig1 starts; the application state is STARTING and the test passes. This behavior is expected.

Contributor:

DubboBootstrapMultiInstanceTest.testMultiModuleDeployAndReload:

    ModuleDeployer internalModuleDeployer = applicationModel.getInternalModule().getDeployer();
    Assertions.assertTrue(internalModuleDeployer.isStarted());

The assertion will fail if it runs before the second call to checkStarted.

@kylixs (Member Author):

Please check out this PR and test again.
Module start processing has changed; the internal module is now guaranteed to start before pub modules, so the check Assertions.assertTrue(internalModuleDeployer.isStarted()) is OK.

Contributor:

You are right. I modified my fix based on your code to await the internal module deployer finishing.

@@ -69,7 +69,6 @@

     private Map<String, Object> attributes;
     private AtomicBoolean destroyed = new AtomicBoolean(false);
-    private volatile boolean stopping;

Contributor:

see #9000

@kylixs (Member Author):

If the ApplicationModel is stopping, isDestroyed() returns true, so there is no need to add a stopping field. Furthermore, I want to add a state field to ApplicationModel and ModuleModel.

Contributor:

See what AlbumenJ said in #9001: the metadata refresh future should be canceled after unregistering the service instance, which should be called before applicationModel.destroy().

@kylixs (Member Author), Oct 14, 2021:

Let's take a look at the new processing of application destroy:

  1. Destroy by application

ApplicationModel.destroy() (sets destroyed to true)
-> ApplicationModel.onDestroy()
-> ApplicationDeployer.preDestroy()
   -> set state to stopping
-> destroy all modules and remove self from frameworkModel
-> ApplicationDeployer.postDestroy()
   -> cancel asyncMetadataFuture and unregisterServiceInstance
   -> ...
   -> set state to stopped
-> frameworkModel.tryDestroy()

ApplicationModel and ApplicationDeployer have a somewhat odd relationship: ApplicationDeployer is designed to handle application start/stop and is an associated object of ApplicationModel; the relationship becomes clearer when it is destroyed from ApplicationModel.destroy().

  2. Destroy by module

ModuleModel.destroy()
-> ModuleModel.onDestroy()
-> ModuleDeployer.preDestroy()
   -> set module state to stopping
-> remove module from application
-> ModuleDeployer.postDestroy()
   -> unexport and unrefer services
   -> set module state to stopped
-> applicationModel.tryDestroy()

Pub modules are destroyed one by one; once only the internal module remains, ApplicationModel.destroy() is called. A minimal sketch of the application-side two-phase flow follows.
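A minimal Java sketch of this two-phase destroy, assuming simplified state handling; the class and helper names are illustrative, not the actual Dubbo implementation:

    // Illustrative sketch of the two-phase destroy flow described above;
    // the real deployers do much more work in each phase.
    public abstract class TwoPhaseDestroySketch {

        enum DeployState { STARTED, STOPPING, STOPPED }

        private volatile boolean destroyed;
        private volatile DeployState state = DeployState.STARTED;

        public void destroy() {
            destroyed = true;            // 1. mark the model destroyed first
            onDestroy();
        }

        private void onDestroy() {
            preDestroy();                // 2. deployer enters STOPPING
            destroyModulesAndDetach();   // 3. destroy modules, detach from frameworkModel
            postDestroy();               // 4. cancel metadata future, unregister, enter STOPPED
        }

        private void preDestroy() { state = DeployState.STOPPING; }

        private void postDestroy() {
            cancelAsyncMetadataFuture();
            unregisterServiceInstance();
            state = DeployState.STOPPED;
        }

        public boolean isDestroyed() { return destroyed; }

        // Placeholders for the real work; hypothetical names.
        protected abstract void destroyModulesAndDetach();
        protected abstract void cancelAsyncMetadataFuture();
        protected abstract void unregisterServiceInstance();
    }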

@kylixs (Member Author):

It seems better to cancel asyncMetadataFuture in ApplicationDeployer.preDestroy().

@@ -324,7 +328,7 @@ public void blockUntilUpdated() {
         metadataSemaphore.drainPermits();
         updateLock.writeLock().lock();
     } catch (InterruptedException e) {
-        if (!applicationModel.isStopping()) {
+        if (!applicationModel.isDestroyed()) {
Contributor:

AlbumenJ said in #9001: the metadata refresh future should be canceled after unregistering the service instance, which should be called before applicationModel.destroy().

@kylixs (Member Author):

The entry point for destroying an application is ApplicationModel.destroy(), so checking destroyed is OK.


@@ -83,7 +82,9 @@ public String getInternalProperty(String key) {

     @Override
     protected void doClose() throws Exception {
-        zkClient.close();
+        // TODO zkClient is shared in framework, should not close it here?
+        // zkClient.close();
@zrlw (Contributor), Oct 13, 2021:

See #8993.
The send thread of ZooKeeper's ClientCnxn will keep trying to reconnect to the closed ZK server if you don't close the ZK client, and it will cause more integration-testing problems because huge numbers of failed reconnect events will block Curator's global event-loop processing.

@kylixs (Member Author):

See org.apache.dubbo.remoting.zookeeper.AbstractZookeeperTransporter#destroy(): all ZK clients are created and destroyed in the ZookeeperTransporter.

    @Override
    public void destroy() {
        // only destroy zk clients here
        for (ZookeeperClient client : zookeeperClientMap.values()) {
            client.close();
        }
        zookeeperClientMap.clear();
    }

@kylixs (Member Author), Oct 13, 2021:

The ZookeeperTransporter is shared at framework scope, so ZK clients are shared across the framework and cannot be destroyed when one application shuts down while another application is still alive.

@zrlw (Contributor), Oct 13, 2021:

I tested SingleRegistryCenterInjvmIntegrationTest (registry only, as the config-center test has no registry) and the zkClient was closed eventually, so the fix is OK for the ZooKeeper client. But the zkClient is just one kind of dynamic configuration, and the others are not closed for now.

@zrlw (Contributor), Oct 13, 2021:

Let's say we have two applications, A and B. A connects to ZK A, B connects to ZK B, and A is a short-lived app that only runs for one minute. The zkClient of A should be closed when A exits; otherwise it still causes the Curator reconnection-failure event surge problem.

@kylixs (Member Author), Oct 14, 2021:

If applications A and B do not share the same FrameworkModel, they use two isolated instances of ZookeeperTransporter, so there are no redundant ZK connections.
If A and B use the same FrameworkModel, they share ZK connections, and we cannot simply close part of them when application B stops; some ZK connections may be used by both A and B.
If we want to close ZK connections on application shutdown, we must first record which applications each ZK connection is used by; only then can we safely close idle ZK connections. You can submit a PR later for fine-grained application-level ZK connection management, along the lines of the sketch below.
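A hedged sketch of such per-application tracking; the class and its methods are hypothetical, not part of this PR, and only ZookeeperClient.close() is taken from the existing API:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.dubbo.remoting.zookeeper.ZookeeperClient;

    // Hypothetical transporter that records which applications use each client
    // and closes a client once no application references it anymore.
    public abstract class ApplicationTrackingZookeeperTransporter {

        private static final class Entry {
            final ZookeeperClient client;
            final Set<String> applications = ConcurrentHashMap.newKeySet();
            Entry(ZookeeperClient client) { this.client = client; }
        }

        private final Map<String, Entry> clients = new ConcurrentHashMap<>();

        protected abstract ZookeeperClient createClient(String address);

        public ZookeeperClient connect(String address, String application) {
            Entry entry = clients.computeIfAbsent(address, a -> new Entry(createClient(a)));
            entry.applications.add(application);   // remember who uses this connection
            return entry.client;
        }

        // Called on application shutdown: close clients no other application uses.
        public void closeClientsOfApplication(String application) {
            clients.entrySet().removeIf(e -> {
                Entry entry = e.getValue();
                entry.applications.remove(application);
                if (entry.applications.isEmpty()) {
                    entry.client.close();
                    return true;
                }
                return false;
            });
        }
    }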

Contributor:

If A and B use the same FrameworkModel but connect to different ZK servers, keeping both clients in the global ZookeeperTransporter will cause Curator to keep retrying the connection if only one of the applications exits.

@kylixs (Member Author):

If A and B use the same FrameworkModel but connect to different ZK servers, keeping both clients in the global ZookeeperTransporter will cause Curator to keep retrying the connection if only one of the applications exits.

ZookeeperTransporter has been changed to application scope to solve the ZK client leak problem after an application is stopped.

@zrlw (Contributor) commented Oct 13, 2021:

You'd better split this PR by subject; otherwise it might revert work from other merged PRs.

@kylixs (Member Author) commented Oct 13, 2021:

You'd better split this PR by subject; otherwise it might revert work from other merged PRs.

Most of the problems have been solved. Because the processing flow has changed greatly, some of the previous fixes are unnecessary, and if other problems remain they need to be dealt with again.

@@ -82,8 +82,10 @@ public String getInternalProperty(String key) {

     @Override
     protected void doClose() throws Exception {
-        // TODO zkClient is shared in framework, should not close it here?
+        // zkClient is shared in framework, should not close it here
Contributor:

Keeping the zkClient independent in each application might be better; not all apps connect to the same registry or have the same lifecycle.
#9015

@kylixs (Member Author), Oct 14, 2021:

@AlbumenJ @zrlw
There are two ideas:

  1. ZK clients are shared within application scope and isolated among different applications.
  2. Record the applications associated with each ZK client; when an application stops, close the ZK clients that are no longer used.

If there is no strong need to share ZK clients within the framework, I think the first one is better; a sketch of what it implies follows.
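A minimal sketch of idea 1, assuming the transporter can be resolved through the application's own extension loader; the lookup shown here is an assumption about the 3.0 scope-model API, not a confirmed call path:

    import org.apache.dubbo.common.URL;
    import org.apache.dubbo.remoting.zookeeper.ZookeeperClient;
    import org.apache.dubbo.remoting.zookeeper.ZookeeperTransporter;
    import org.apache.dubbo.rpc.model.ApplicationModel;

    public final class ApplicationScopedZkClients {

        private ApplicationScopedZkClients() {}

        // Resolving ZookeeperTransporter from the application model (rather than
        // the framework model) gives every application its own client map, so all
        // of its clients can be closed when that application stops.
        public static ZookeeperClient connect(ApplicationModel applicationModel, URL registryUrl) {
            ZookeeperTransporter transporter = applicationModel
                    .getExtensionLoader(ZookeeperTransporter.class)
                    .getAdaptiveExtension();
            return transporter.connect(registryUrl);
        }
    }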

@kylixs (Member Author):

DubboTestChecker now force-destroys all FrameworkModels after each test class finishes, so no ZK client leaks are expected in unit tests.

Contributor:

@AlbumenJ @zrlw There are two ideas:

  1. ZK clients are shared within application scope and isolated among different applications.
  2. Record the applications associated with each ZK client; when an application stops, close the ZK clients that are no longer used.

If there is no strong need to share ZK clients within the framework, I think the first one is better.

#9015 closes ZooKeeper clients that are no longer used by any application.
The advantage is that a ZK connection is closed immediately once it is no longer used.
App A -> ZK server IP-A, app B -> ZK server IP-B (note: IP-A is not equal to IP-B!).
The ZK connection of app A is closed immediately if A exits while B keeps running.
If you keep them in the global ZookeeperTransporter, the close operation must wait for the other app to exit, which causes the Curator client to keep trying to reconnect to the closed ZK server and triggers huge numbers of connection-failed events.

@kylixs (Member Author), Oct 14, 2021:

No. The ZK client is shared between ZookeeperRegistry, ZookeeperDynamicConfiguration, and ZookeeperMetadataReport; if it is closed in one place, other code still holding the client may throw an already-destroyed exception.
The safe way is to close ZK clients in ApplicationDeployer.postDestroy(), after releasing all ZK components and before onStopped(), with something like zookeeperTransporter.closeClientsOfApplication(application); see the ordering sketch below.
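A sketch of that ordering; all helper names are hypothetical, and closeClientsOfApplication is the suggestion above, not an existing API:

    // Hypothetical shape of a safe ApplicationDeployer.postDestroy():
    // release every component that holds a ZK client before closing clients.
    public abstract class PostDestroyOrderingSketch {

        protected abstract void unregisterServiceInstance();
        protected abstract void destroyRegistries();             // ZookeeperRegistry
        protected abstract void destroyDynamicConfigurations();  // ZookeeperDynamicConfiguration
        protected abstract void destroyMetadataReports();        // ZookeeperMetadataReport
        protected abstract void closeClientsOfApplication(String applicationName);
        protected abstract void onStopped();

        public final void postDestroy(String applicationName) {
            unregisterServiceInstance();
            destroyRegistries();
            destroyDynamicConfigurations();
            destroyMetadataReports();
            // Only now is it safe to close this application's ZK connections.
            closeClientsOfApplication(applicationName);
            onStopped();
        }
    }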

Contributor:

#9015 now destroys zkClients that are no longer used in DefaultApplicationDeployer.destroy().

@codecov-commenter commented Oct 14, 2021:

Codecov Report

Merging #9033 (0752315) into 3.0 (dcfd3ed) will increase coverage by 0.47%.
The diff coverage is 71.62%.


@@             Coverage Diff              @@
##                3.0    #9033      +/-   ##
============================================
+ Coverage     63.75%   64.23%   +0.47%     
- Complexity      313      322       +9     
============================================
  Files          1168     1174       +6     
  Lines         50192    50569     +377     
  Branches       7465     7518      +53     
============================================
+ Hits          32001    32481     +480     
+ Misses        14703    14540     -163     
- Partials       3488     3548      +60     
Impacted Files Coverage Δ
...nfig/configcenter/nop/NopDynamicConfiguration.java 22.22% <0.00%> (-2.78%) ⬇️
...ava/org/apache/dubbo/config/ServiceConfigBase.java 61.23% <0.00%> (-0.35%) ⬇️
...in/java/org/apache/dubbo/rpc/model/ScopeModel.java 80.59% <ø> (+11.58%) ⬆️
...ava/org/apache/dubbo/config/DubboShutdownHook.java 42.42% <0.00%> (ø)
...ter/support/apollo/ApolloDynamicConfiguration.java 46.98% <ø> (ø)
...enter/support/nacos/NacosConfigServiceWrapper.java 37.50% <0.00%> (-5.36%) ⬇️
...enter/support/nacos/NacosDynamicConfiguration.java 22.36% <0.00%> (-0.30%) ⬇️
...e/dubbo/metadata/report/MetadataReportFactory.java 0.00% <0.00%> (ø)
.../apache/dubbo/qos/command/impl/ShutdownTelnet.java 68.00% <0.00%> (-12.77%) ⬇️
...g/apache/dubbo/rpc/protocol/grpc/GrpcProtocol.java 63.52% <0.00%> (-1.54%) ⬇️
... and 129 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dcfd3ed...0752315.

@AlbumenJ (Member) left a comment:

LGTM

@AlbumenJ AlbumenJ merged commit 1bdf359 into apache:3.0 Oct 19, 2021