
Clean up Node#close. #39317

Merged: jpountz merged 9 commits into elastic:master on Apr 17, 2019

Conversation

@jpountz (Contributor) commented on Feb 22, 2019

`Node#close` is pretty hard to rely on today:

 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after those 10 seconds

This commit propagates `IOException`s and splits `Node#close` into `Node#close`
and `Node#awaitClose`, so that the decision about what to do when a node takes
too long to close can be made on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.

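Roughly, the caller-side pattern this split enables looks like the sketch below. This is a minimal illustration, not the merged code: the `NodeLike` interface is a hypothetical stand-in for `Node`, just enough to make the snippet compile; the actual call sites are quoted in the review comments further down.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class CloseThenAwaitSketch {

    // Hypothetical stand-in for org.elasticsearch.node.Node, just enough to compile.
    interface NodeLike extends AutoCloseable {
        @Override
        void close() throws IOException;  // initiates shutdown and now propagates IOExceptions
        boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException;
    }

    static void shutdown(NodeLike node) throws IOException, InterruptedException {
        node.close();
        // The caller, not Node#close, decides what a slow shutdown means.
        if (node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IOException("Node didn't close within 10 seconds.");
        }
    }
}
```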
@jpountz added the >enhancement and :Delivery/Build (Build or test infrastructure) labels on Feb 22, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

@jpountz (Contributor, Author) commented on Feb 28, 2019

@elasticmachine run elasticsearch-ci/packaging-sample

@danielmitterdorfer (Member) left a comment:

I left a couple of comments.

    } catch (IOException ex) {
        throw new ElasticsearchException("failed to stop node", ex);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LogManager.getLogger(Bootstrap.class).warn("Thread got interrupted while waiting for the node to shutdown.");
@danielmitterdorfer (Member):

How about we reverse the order here (i.e. first log the message, then restore the interrupt status)?
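For illustration, the suggested ordering in a self-contained sketch; this uses the JDK logger and a plain `CountDownLatch` instead of the actual Bootstrap code so that it compiles on its own:

```java
import java.util.concurrent.CountDownLatch;
import java.util.logging.Logger;

public class InterruptOrderingSketch {

    private static final Logger logger = Logger.getLogger(InterruptOrderingSketch.class.getName());

    static void awaitShutdown(CountDownLatch shutdownLatch) {
        try {
            shutdownLatch.await();
        } catch (InterruptedException e) {
            // Log first ...
            logger.warning("Thread got interrupted while waiting for the node to shutdown.");
            // ... then restore the interrupt status for callers further up the stack.
            Thread.currentThread().interrupt();
        }
    }
}
```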

        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LogManager.getLogger(Bootstrap.class).warn("Thread got interrupted while waiting for the node to shutdown.");
@danielmitterdorfer (Member):

Also here I suggest reversing the order.

lifecycle.moveToStarted();
for (LifecycleListener listener : listeners) {
listener.afterStart();
synchronized (lifecycle) {
@danielmitterdorfer (Member):

I think it can become tricky to synchronize on an object that can also be directly accessed by subclasses. The state of Lifecycle has only one field that is declared volatile and as far as I could see all usages in subclasses only query the state and only the base class modifies it. From that perspective we are safe but it is easy to introduce subtle bugs. I wonder whether in the future it would make sense to think about encapsulating Lifecycle in AbstractLifecycleComponent to make this a bit more robust.
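One possible reading of that suggestion, as a rough sketch: keep the lock and the state transition inside Lifecycle so components only ever read the state. The state names and methods below are simplified assumptions, not the actual class.

```java
// Hypothetical, simplified sketch; not the actual org.elasticsearch.common.component.Lifecycle.
public class LifecycleSketch {

    public enum State { INITIALIZED, STARTED, STOPPED, CLOSED }

    private volatile State state = State.INITIALIZED;

    // Reads stay lock-free thanks to the volatile field.
    public State currentState() {
        return state;
    }

    // The transition is atomic because the only mutation happens inside a synchronized
    // method, so callers never need to synchronize on the Lifecycle instance themselves.
    public synchronized boolean moveToStarted() {
        if (state == State.CLOSED) {
            throw new IllegalStateException("can't move to started from " + state);
        }
        if (state == State.STARTED) {
            return false; // already started, nothing to do
        }
        state = State.STARTED;
        return true;
    }
}
```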

@@ -308,6 +312,15 @@ protected void doClose() throws IOException {
indicesRefCount.decRef();
}

/**
* Wait for this {@link IndicesService} to be effectively closed. When this returns, all shard and shard stores are closed and all
@danielmitterdorfer (Member):

I think the comment is not entirely accurate? The guarantees described here only hold if this method returns true. If it returns false it timed out waiting for the condition to happen.
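A wording along the lines of that remark might look like the following; this is a hypothetical sketch of the javadoc, not the text that was merged:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical interface, only here to carry the suggested javadoc wording.
interface EffectivelyClosable {

    /**
     * Waits for this service to be effectively closed.
     *
     * @return {@code true} if all shards and shard stores were closed within the timeout;
     *         {@code false} if the timeout elapsed first, in which case the guarantees above do not hold
     * @throws InterruptedException if the waiting thread is interrupted
     */
    boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException;
}
```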

@@ -183,8 +184,15 @@ public void run() {
    IOUtils.close(node, spawner);
    LoggerContext context = (LoggerContext) LogManager.getContext(false);
    Configurator.shutdown(context);
    if (node != null && node.awaitClose(10, TimeUnit.SECONDS) == false) {
        throw new IOException("Node didn't stop within 10 seconds. " +
@danielmitterdorfer (Member):

I think IllegalStateException might be more appropriate?
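Concretely, the suggestion would change the hunk above along these lines. This is a sketch against a hypothetical `NodeLike` stand-in so the snippet compiles on its own; the truncated log message is borrowed from the equivalent call site quoted further down.

```java
import java.util.concurrent.TimeUnit;

public class StopWithIllegalStateSketch {

    // Hypothetical stand-in for org.elasticsearch.node.Node, just enough to compile.
    interface NodeLike {
        boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException;
    }

    static void stop(NodeLike node) throws InterruptedException {
        // IllegalStateException instead of IOException: the node failed to reach the
        // expected state in time, rather than an I/O operation failing.
        if (node != null && node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IllegalStateException(
                "Node didn't stop within 10 seconds. Any outstanding requests or tasks might get killed.");
        }
    }
}
```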

@@ -267,6 +275,12 @@ private void start() throws NodeValidationException {
    static void stop() throws IOException {
        try {
            IOUtils.close(INSTANCE.node, INSTANCE.spawner);
            if (INSTANCE.node != null && INSTANCE.node.awaitClose(10, TimeUnit.SECONDS) == false) {
                throw new IOException("Node didn't stop within 10 seconds. Any outstanding requests or tasks might get killed.");
@danielmitterdorfer (Member):

I think IllegalStateException might be more appropriate?



ThreadPool threadPool = injector.getInstance(ThreadPool.class);
final boolean terminated = ThreadPool.terminate(threadPool, timeout, timeUnit);
@danielmitterdorfer (Member):

I find it a bit odd to call a mutating method in an await-style method but I do see the reason why it is needed here.

@@ -1049,6 +1049,13 @@ public void close() throws IOException {
    closed.set(true);
    markNodeDataDirsAsPendingForWipe(node);
    node.close();
    try {
        if (node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IOException("Node didn't close within 10 seconds.");
@danielmitterdorfer (Member):

I think IllegalStateException might be more appropriate?

@jpountz (Contributor, Author) commented on Apr 9, 2019

Thanks @danielmitterdorfer for looking. I wonder whether you have thoughts on the overall approach as well?

@danielmitterdorfer (Member) commented:

> I wonder whether you have thoughts on the overall approach as well?

The overall approach makes sense to me.

@danielmitterdorfer (Member) left a comment:

Thanks for iterating! LGTM

@jpountz (Contributor, Author) commented on Apr 17, 2019

Thanks @danielmitterdorfer !

@jpountz jpountz merged commit d706b40 into elastic:master Apr 17, 2019
@jpountz jpountz deleted the enhancement/wait_for_closed_indices branch April 17, 2019 11:49
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 17, 2019
`Node#close` is pretty hard to rely on today:
 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
jpountz added a commit that referenced this pull request Apr 17, 2019
`Node#close` is pretty hard to rely on today:
 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
jaymode added a commit to jaymode/elasticsearch that referenced this pull request May 8, 2019
The changes in elastic#39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. elastic#41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes elastic#41683
jaymode added a commit that referenced this pull request May 10, 2019
The changes in #39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. #41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes #41683
jaymode added a commit that referenced this pull request May 10, 2019
The changes in #39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. #41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes #41683
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
`Node#close` is pretty hard to rely on today:
 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
The changes in elastic#39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. elastic#41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes elastic#41683
@mark-vieira added the Team:Delivery (Meta label for Delivery team) label on Nov 11, 2020
Labels: :Delivery/Build (Build or test infrastructure), >enhancement, Team:Delivery (Meta label for Delivery team), v7.2.0, v8.0.0-alpha1