
Clean up Node#close. #39317

Merged: jpountz merged 9 commits into elastic:master on Apr 17, 2019

Conversation

@jpountz (Contributor) commented on Feb 22, 2019

`Node#close` is pretty hard to rely on today:

 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after those 10 seconds

This commit propagates `IOException`s and splits `Node#close` into `Node#close`
and `Node#awaitClose`, so that the decision about what to do when a node takes
too long to close can be made on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.

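Roughly, the caller-side pattern this split enables looks like the sketch below. This is a minimal illustration, not the merged code: the `NodeLike` interface is a hypothetical stand-in for `Node`, just enough to make the snippet compile; the actual call sites are quoted in the review comments further down.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class CloseThenAwaitSketch {

    // Hypothetical stand-in for org.elasticsearch.node.Node, just enough to compile.
    interface NodeLike extends AutoCloseable {
        @Override
        void close() throws IOException;  // initiates shutdown and now propagates IOExceptions
        boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException;
    }

    static void shutdown(NodeLike node) throws IOException, InterruptedException {
        node.close();
        // The caller, not Node#close, decides what a slow shutdown means.
        if (node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IOException("Node didn't close within 10 seconds.");
        }
    }
}
```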
@jpountz added the >enhancement and :Delivery/Build (Build or test infrastructure) labels on Feb 22, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

@jpountz (Contributor, Author) commented on Feb 28, 2019

@elasticmachine run elasticsearch-ci/packaging-sample

@danielmitterdorfer (Member) left a comment:

I left a couple of comments.

    } catch (IOException ex) {
        throw new ElasticsearchException("failed to stop node", ex);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LogManager.getLogger(Bootstrap.class).warn("Thread got interrupted while waiting for the node to shutdown.");
@danielmitterdorfer (Member):

How about we reverse the order here (i.e. first log the message, then restore the interrupt status)?
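For illustration, the suggested ordering in a self-contained sketch; this uses the JDK logger and a plain `CountDownLatch` instead of the actual Bootstrap code so that it compiles on its own:

```java
import java.util.concurrent.CountDownLatch;
import java.util.logging.Logger;

public class InterruptOrderingSketch {

    private static final Logger logger = Logger.getLogger(InterruptOrderingSketch.class.getName());

    static void awaitShutdown(CountDownLatch shutdownLatch) {
        try {
            shutdownLatch.await();
        } catch (InterruptedException e) {
            // Log first ...
            logger.warning("Thread got interrupted while waiting for the node to shutdown.");
            // ... then restore the interrupt status for callers further up the stack.
            Thread.currentThread().interrupt();
        }
    }
}
```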

        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LogManager.getLogger(Bootstrap.class).warn("Thread got interrupted while waiting for the node to shutdown.");
@danielmitterdorfer (Member):

Also here I suggest reversing the order.

lifecycle.moveToStarted();
for (LifecycleListener listener : listeners) {
listener.afterStart();
synchronized (lifecycle) {
@danielmitterdorfer (Member):

I think it can become tricky to synchronize on an object that can also be directly accessed by subclasses. The state of Lifecycle has only one field that is declared volatile and as far as I could see all usages in subclasses only query the state and only the base class modifies it. From that perspective we are safe but it is easy to introduce subtle bugs. I wonder whether in the future it would make sense to think about encapsulating Lifecycle in AbstractLifecycleComponent to make this a bit more robust.
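One possible reading of that suggestion, as a rough sketch: keep the lock and the state transition inside Lifecycle so components only ever read the state. The state names and methods below are simplified assumptions, not the actual class.

```java
// Hypothetical, simplified sketch; not the actual org.elasticsearch.common.component.Lifecycle.
public class LifecycleSketch {

    public enum State { INITIALIZED, STARTED, STOPPED, CLOSED }

    private volatile State state = State.INITIALIZED;

    // Reads stay lock-free thanks to the volatile field.
    public State currentState() {
        return state;
    }

    // The transition is atomic because the only mutation happens inside a synchronized
    // method, so callers never need to synchronize on the Lifecycle instance themselves.
    public synchronized boolean moveToStarted() {
        if (state == State.CLOSED) {
            throw new IllegalStateException("can't move to started from " + state);
        }
        if (state == State.STARTED) {
            return false; // already started, nothing to do
        }
        state = State.STARTED;
        return true;
    }
}
```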

@@ -308,6 +312,15 @@ protected void doClose() throws IOException {
indicesRefCount.decRef();
}

/**
* Wait for this {@link IndicesService} to be effectively closed. When this returns, all shard and shard stores are closed and all
@danielmitterdorfer (Member):

I think the comment is not entirely accurate? The guarantees described here only hold if this method returns true. If it returns false it timed out waiting for the condition to happen.
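A wording along the lines of that remark might look like the following; this is a hypothetical sketch of the javadoc, not the text that was merged:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical interface, only here to carry the suggested javadoc wording.
interface EffectivelyClosable {

    /**
     * Waits for this service to be effectively closed.
     *
     * @return {@code true} if all shards and shard stores were closed within the timeout;
     *         {@code false} if the timeout elapsed first, in which case the guarantees above do not hold
     * @throws InterruptedException if the waiting thread is interrupted
     */
    boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException;
}
```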

@@ -183,8 +184,15 @@ public void run() {
    IOUtils.close(node, spawner);
    LoggerContext context = (LoggerContext) LogManager.getContext(false);
    Configurator.shutdown(context);
    if (node != null && node.awaitClose(10, TimeUnit.SECONDS) == false) {
        throw new IOException("Node didn't stop within 10 seconds. " +
@danielmitterdorfer (Member):

I think IllegalStateException might be more appropriate?
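Concretely, the suggestion would change the hunk above along these lines. This is a sketch against a hypothetical `NodeLike` stand-in so the snippet compiles on its own; the truncated log message is borrowed from the equivalent call site quoted further down.

```java
import java.util.concurrent.TimeUnit;

public class StopWithIllegalStateSketch {

    // Hypothetical stand-in for org.elasticsearch.node.Node, just enough to compile.
    interface NodeLike {
        boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException;
    }

    static void stop(NodeLike node) throws InterruptedException {
        // IllegalStateException instead of IOException: the node failed to reach the
        // expected state in time, rather than an I/O operation failing.
        if (node != null && node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IllegalStateException(
                "Node didn't stop within 10 seconds. Any outstanding requests or tasks might get killed.");
        }
    }
}
```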

@@ -267,6 +275,12 @@ private void start() throws NodeValidationException {
    static void stop() throws IOException {
        try {
            IOUtils.close(INSTANCE.node, INSTANCE.spawner);
            if (INSTANCE.node != null && INSTANCE.node.awaitClose(10, TimeUnit.SECONDS) == false) {
                throw new IOException("Node didn't stop within 10 seconds. Any outstanding requests or tasks might get killed.");
@danielmitterdorfer (Member):

I think IllegalStateException might be more appropriate?



ThreadPool threadPool = injector.getInstance(ThreadPool.class);
final boolean terminated = ThreadPool.terminate(threadPool, timeout, timeUnit);
@danielmitterdorfer (Member):

I find it a bit odd to call a mutating method in an await-style method but I do see the reason why it is needed here.

@@ -1049,6 +1049,13 @@ public void close() throws IOException {
    closed.set(true);
    markNodeDataDirsAsPendingForWipe(node);
    node.close();
    try {
        if (node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IOException("Node didn't close within 10 seconds.");
@danielmitterdorfer (Member):

I think IllegalStateException might be more appropriate?

@jpountz (Contributor, Author) commented on Apr 9, 2019

Thanks @danielmitterdorfer for looking. I wonder whether you have thoughts on the overall approach as well?

@danielmitterdorfer (Member) commented:

> I wonder whether you have thoughts on the overall approach as well?

The overall approach makes sense to me.

@danielmitterdorfer (Member) left a comment:

Thanks for iterating! LGTM

@jpountz (Contributor, Author) commented on Apr 17, 2019

Thanks @danielmitterdorfer !

@jpountz jpountz merged commit d706b40 into elastic:master Apr 17, 2019
@jpountz jpountz deleted the enhancement/wait_for_closed_indices branch April 17, 2019 11:49
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Apr 17, 2019
`Node#close` is pretty hard to rely on today:
 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
jpountz added a commit that referenced this pull request Apr 17, 2019
`Node#close` is pretty hard to rely on today:
 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
jaymode added a commit to jaymode/elasticsearch that referenced this pull request May 8, 2019
The changes in elastic#39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. elastic#41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes elastic#41683
jaymode added a commit that referenced this pull request May 10, 2019
The changes in #39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. #41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes #41683
jaymode added a commit that referenced this pull request May 10, 2019
The changes in #39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. #41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes #41683
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
`Node#close` is pretty hard to rely on today:
 - it might swallow exceptions
 - it waits for 10 seconds for threads to terminate but doesn't signal anything
   if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.

It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
The changes in elastic#39317 brought to light some concurrency issues in the
close method of Recyclers as we do not wait for threads running in the
threadpool to be finished prior to the closing of the PageCacheRecycler
and the Recyclers that are used internally. elastic#41695 was opened to
address the concurrent close issues but upon review, the closing of
these classes is not really needed as the instances should become
available for garbage collection once there is no longer a reference to
the closed node.

Closes elastic#41683
@mark-vieira added the Team:Delivery (Meta label for Delivery team) label on Nov 11, 2020
Labels: :Delivery/Build (Build or test infrastructure), >enhancement, Team:Delivery (Meta label for Delivery team), v7.2.0, v8.0.0-alpha1