New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the lifecycle management of the join control thread in zen discovery. #8327

Merged
merged 1 commit into from Nov 4, 2014

Conversation

Projects
None yet
4 participants
@martijnvg
Member

martijnvg commented Nov 3, 2014

This PR also includes:

  • Better exception handling in UnicastZenPing#ping
  • In the join thread that runs the innerJoinCluster loop, remember the last known exception and throw that when assertions are enabled. We loop until inner join has successfully completed and if continues exceptions are thrown we should fail the test, because the exception shouldn't occur in production (at least not too often).
}
});
} catch (Exception e) {
sendPingsHandler.close();

This comment has been minimized.

@bleskes

bleskes Nov 3, 2014

Member

we want to rethrow here, right?

@bleskes

bleskes Nov 3, 2014

Member

we want to rethrow here, right?

@bleskes

View changes

Show outdated Hide outdated src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java
@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Nov 3, 2014

Member

Looking good. left two comments.

Member

bleskes commented Nov 3, 2014

Looking good. left two comments.

@martijnvg

This comment has been minimized.

Show comment
Hide comment
@martijnvg

martijnvg Nov 3, 2014

Member

@bleskes I updated the PR.

Member

martijnvg commented Nov 3, 2014

@bleskes I updated the PR.

@bleskes

View changes

Show outdated Hide outdated src/main/java/org/elasticsearch/ExceptionsHelper.java
@martijnvg

This comment has been minimized.

Show comment
Hide comment
@martijnvg

martijnvg Nov 3, 2014

Member

@bleskes I applied the feedback, also added better error handling for multicast ping.

Member

martijnvg commented Nov 3, 2014

@bleskes I applied the feedback, also added better error handling for multicast ping.

@bleskes

View changes

Show outdated Hide outdated src/main/java/org/elasticsearch/ExceptionsHelper.java
@bleskes

View changes

Show outdated Hide outdated ...ava/org/elasticsearch/discovery/zen/ping/multicast/MulticastZenPing.java
@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Nov 3, 2014

Member

@martijnvg looking good. left two last little comments

Member

bleskes commented Nov 3, 2014

@martijnvg looking good. left two last little comments

@martijnvg

This comment has been minimized.

Show comment
Hide comment
@martijnvg

martijnvg Nov 3, 2014

Member

@bleskes Thanks, I updated the PR to address your comments.

Member

martijnvg commented Nov 3, 2014

@bleskes Thanks, I updated the PR to address your comments.

@@ -1281,7 +1281,7 @@ public ClusterState stopRunningThreadAndRejoin(ClusterState clusterState, String
return rejoin(clusterState, reason);
}
/** starts a new joining thread if there is no currently active one */
/** starts a new joining thread if there is no currently active one and join thread controlling is started */

This comment has been minimized.

@bleskes

bleskes Nov 4, 2014

Member

join thread controlling is started <-- love it :)

@bleskes

bleskes Nov 4, 2014

Member

join thread controlling is started <-- love it :)

@bleskes

View changes

Show outdated Hide outdated ...ava/org/elasticsearch/discovery/zen/ping/multicast/MulticastZenPing.java
@bleskes

This comment has been minimized.

Show comment
Hide comment
@bleskes

bleskes Nov 4, 2014

Member

Left one minor logging comment. LGTM!

Member

bleskes commented Nov 4, 2014

Left one minor logging comment. LGTM!

Discovery: Improve the lifecycle management of the join control threa…
…d in zen discovery.

Also added:
* Better exception handling in UnicastZenPing#ping and MulticastZenPing#ping
* In the join thread that runs the innerJoinCluster loop, remember the last known exception and throw that when assertions are enabled. We loop until inner join has completed and if continues exceptions are thrown we should fail the test, because the exception shouldn't occur in production (at least not too often).
Applied feedback 3

Closes #8327

martijnvg added a commit that referenced this pull request Nov 4, 2014

Discovery: Improve the lifecycle management of the join control threa…
…d in zen discovery.

Also added:
* Better exception handling in UnicastZenPing#ping and MulticastZenPing#ping
* In the join thread that runs the innerJoinCluster loop, remember the last known exception and throw that when assertions are enabled. We loop until inner join has completed and if continues exceptions are thrown we should fail the test, because the exception shouldn't occur in production (at least not too often).
Applied feedback 3

Closes #8327

@martijnvg martijnvg merged commit 4ddb057 into elastic:master Nov 4, 2014

@martijnvg

This comment has been minimized.

Show comment
Hide comment
@martijnvg

martijnvg Nov 4, 2014

Member

Pushed, thanks @bleskes!

Member

martijnvg commented Nov 4, 2014

Pushed, thanks @bleskes!

martijnvg added a commit that referenced this pull request Nov 4, 2014

Discovery: Improve the lifecycle management of the join control threa…
…d in zen discovery.

Also added:
* Better exception handling in UnicastZenPing#ping and MulticastZenPing#ping
* In the join thread that runs the innerJoinCluster loop, remember the last known exception and throw that when assertions are enabled. We loop until inner join has completed and if continues exceptions are thrown we should fail the test, because the exception shouldn't occur in production (at least not too often).
Applied feedback 3

Closes #8327

@s1monw s1monw removed the review label Nov 4, 2014

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Nov 6, 2014

Discovery: a more lenient wait joinThread when stopping
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. In our tests we typically shutdown an entire cluster at once, which makes it very likely for nodes to be joining while shutting down. This introduces a race condition where the joinThread.interrupt can happen before the thread starts waiting on pings which causes shutdown logic to be slow. This commits improves by repeatedly trying to stop the thread in smaller waits.

Another side effect of the change is that we are now more likely to ping ourselves while shutting down, we results in an ugly warn level log. We now log all remote exception during pings at a debug level.

bleskes added a commit that referenced this pull request Nov 6, 2014

Discovery: a more lenient wait joinThread when stopping
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. In our tests we typically shutdown an entire cluster at once, which makes it very likely for nodes to be joining while shutting down. This introduces a race condition where the joinThread.interrupt can happen before the thread starts waiting on pings which causes shutdown logic to be slow. This commits improves by repeatedly trying to stop the thread in smaller waits.

Another side effect of the change is that we are now more likely to ping ourselves while shutting down, we results in an ugly warn level log. We now log all remote exception during pings at a debug level.

Closes #8359

bleskes added a commit that referenced this pull request Nov 7, 2014

Discovery: don't wait joinThread when stopping
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. However, the joining thread is part of a thread pool and will not stop until the thread pool is shutdown.

Another issue raised by the unneeded wait is that when we shutdown, we may ping ourselves - which results in an ugly warn level log. We now log all remote exception during pings at a debug level.

Closes #8359

bleskes added a commit that referenced this pull request Nov 7, 2014

Discovery: don't wait joinThread when stopping
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. However, the joining thread is part of a thread pool and will not stop until the thread pool is shutdown.

Another issue raised by the unneeded wait is that when we shutdown, we may ping ourselves - which results in an ugly warn level log. We now log all remote exception during pings at a debug level.

Closes #8359

bleskes added a commit that referenced this pull request Dec 11, 2014

Discovery: a more lenient wait joinThread when stopping
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. In our tests we typically shutdown an entire cluster at once, which makes it very likely for nodes to be joining while shutting down. This introduces a race condition where the joinThread.interrupt can happen before the thread starts waiting on pings which causes shutdown logic to be slow. This commits improves by repeatedly trying to stop the thread in smaller waits.

Another side effect of the change is that we are now more likely to ping ourselves while shutting down, we results in an ugly warn level log. We now log all remote exception during pings at a debug level.

Closes #8359

@martijnvg martijnvg deleted the martijnvg:improvements/join_thread_control_life_cycle branch May 18, 2015

@clintongormley clintongormley changed the title from Discovery: Improve the lifecycle management of the join control thread in zen discovery. to Improve the lifecycle management of the join control thread in zen discovery. Jun 7, 2015

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Discovery: Improve the lifecycle management of the join control threa…
…d in zen discovery.

Also added:
* Better exception handling in UnicastZenPing#ping and MulticastZenPing#ping
* In the join thread that runs the innerJoinCluster loop, remember the last known exception and throw that when assertions are enabled. We loop until inner join has completed and if continues exceptions are thrown we should fail the test, because the exception shouldn't occur in production (at least not too often).
Applied feedback 3

Closes #8327

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Discovery: a more lenient wait joinThread when stopping
When a node stops, we cancel any ongoing join process. With #8327, we improved this logic and wait for it to complete before shutting down the node. In our tests we typically shutdown an entire cluster at once, which makes it very likely for nodes to be joining while shutting down. This introduces a race condition where the joinThread.interrupt can happen before the thread starts waiting on pings which causes shutdown logic to be slow. This commits improves by repeatedly trying to stop the thread in smaller waits.

Another side effect of the change is that we are now more likely to ping ourselves while shutting down, we results in an ugly warn level log. We now log all remote exception during pings at a debug level.

Closes #8359
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment