Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only retry join when other node is not (yet) a master #8972

Closed

Conversation

Projects
None yet
3 participants
@bleskes
Copy link
Member

bleskes commented Dec 16, 2014

When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for ElasticsearchIllegalStateException. However, local ElasticsearchIllegalStateException can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

The PR adds an explicit NotMasterException to indicate the remote node is not a master. A similarly named exception (but meaning something else) in the master fault detection code was given a better name. Also clean up some other exceptions while at it.

See http://build-us-00.elasticsearch.org/job/es_g1gc_master_metal/499/testReport/junit/org.elasticsearch.discovery.zen/ZenDiscoveryTests/testNodeFailuresAreProcessedOnce/ for a test that gets confused by the extra join

Discovery: only retry join for remote exceptions
When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detec this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for ElasticsearchIllegalStateException. However, local ElasticsearchIllegalStateException can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

@bleskes bleskes force-pushed the bleskes:zen_disco_only_retry_on_remote_exception branch Dec 16, 2014

@bleskes bleskes force-pushed the bleskes:zen_disco_only_retry_on_remote_exception branch to de2bb69 Dec 16, 2014

@bleskes

This comment has been minimized.

Copy link
Member Author

bleskes commented Dec 16, 2014

@kimchy I updated the PR to use an explicit exception. Also cleaned up some unused exceptions. I'll update the PR description + commit msg once the review is done.

@kimchy

This comment has been minimized.

Copy link
Member

kimchy commented Dec 16, 2014

LGTM

@bleskes bleskes removed the v1.5.0 label Dec 16, 2014

@bleskes bleskes closed this in 8f146f9 Dec 16, 2014

@bleskes bleskes deleted the bleskes:zen_disco_only_retry_on_remote_exception branch Dec 16, 2014

@bleskes bleskes changed the title Discovery: only retry join for remote exceptions Discovery: only retry join when other node is not (yet) a master Dec 16, 2014

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Dec 16, 2014

Discovery: only retry join when other node is not (yet) a master
When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for `ElasticsearchIllegalStateException`. However, local `ElasticsearchIllegalStateException` can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

Since we can't introduce new exceptions in a BWC manner, we are forced to check the message of the exception.

Relates to elastic#8972

bleskes added a commit that referenced this pull request Dec 16, 2014

Discovery: only retry join when other node is not (yet) a master
When a node tries to join a master, the master may not yet be ready to accept the join request. In such cases we retry sending the join request up to 3 times before going back to ping. To detect this the current logic uses ExceptionsHelper.unwrapCause(t) to unwrap the incoming RemoteTransportException and inspect it's source, looking for `ElasticsearchIllegalStateException`. However, local `ElasticsearchIllegalStateException` can also be thrown when the join process should be cancelled (i.e., node shut down). In this case we shouldn't retry.

Since we can't introduce new exceptions in a BWC manner, we are forced to check the message of the exception.

Relates to #8972
Closes #8979

@bleskes bleskes added the resiliency label Feb 2, 2015

@clintongormley clintongormley removed the review label Mar 19, 2015

@clintongormley clintongormley changed the title Discovery: only retry join when other node is not (yet) a master Only retry join when other node is not (yet) a master Jun 7, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.