Only connect to new nodes on new cluster state #39629

@DaveCTurner
Contributor

commented Mar 4, 2019

Today, when applying a new cluster state we attempt to connect to all of its
nodes as a blocking part of the application process. This is the right thing to
do with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states, which hinders the recovery of the cluster
during certain kinds of partition.

If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
to reconnect to them like this. Moreover, there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try to reconnect to any disconnected nodes,
and handle their disconnectedness reasonably gracefully in the meantime.

This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.

Resolves #29025.
Supersedes #31547.
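
For readers skimming the description, the behaviour it proposes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `NodeConnectionsService` code: the class `NodeConnectionsSketch`, its methods, and the `Predicate`-based connector are all hypothetical names, and nodes are reduced to plain strings. The real service connects asynchronously, whereas the sketch only records which nodes each path would attempt.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical, simplified model of the idea in this PR; not the real implementation.
final class NodeConnectionsSketch {
    private final Set<String> knownNodes = new TreeSet<>();
    private final Set<String> connectedNodes = new TreeSet<>();
    // Assumption: the connector returns true if the connection attempt succeeded.
    private final Predicate<String> connector;

    NodeConnectionsSketch(Predicate<String> connector) {
        this.connector = connector;
    }

    // Called while applying a cluster state. Only nodes we have never seen are
    // connected to here; known nodes are skipped whether or not they are connected.
    synchronized List<String> applyClusterState(Collection<String> nodes) {
        // Forget nodes that are no longer part of the cluster state.
        knownNodes.retainAll(nodes);
        connectedNodes.retainAll(nodes);
        List<String> attempted = new ArrayList<>();
        for (String node : nodes) {
            if (knownNodes.add(node)) { // add() returns true only for a new node
                attempted.add(node);
                tryConnect(node);
            }
        }
        return attempted;
    }

    // Called when the transport layer reports a dropped connection.
    synchronized void onDisconnect(String node) {
        connectedNodes.remove(node);
    }

    // Periodic background task: retries known nodes that are currently disconnected.
    synchronized List<String> reconnectDisconnected() {
        List<String> attempted = new ArrayList<>();
        for (String node : knownNodes) {
            if (!connectedNodes.contains(node)) {
                attempted.add(node);
                tryConnect(node);
            }
        }
        return attempted;
    }

    synchronized boolean isConnected(String node) {
        return connectedNodes.contains(node);
    }

    private void tryConnect(String node) {
        if (connector.test(node)) {
            connectedNodes.add(node);
        }
    }
}
```

In this sketch a known-but-disconnected node never delays `applyClusterState`; it is only retried by `reconnectDisconnected`, which stands in for the periodic reconnection mentioned above.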


@andrershov
Contributor

left a comment

initial pass

server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java (outdated)
"connection cancelled by disconnection");
}

Runnable ensureConnected(ActionListener<Void> listener) {

@andrershov

andrershov Mar 4, 2019

Contributor

Is it possible that ensureConnected and connect/disconnect are called in different threads at the same time? I'm not sure how we're protecting listeners from races

@DaveCTurner

DaveCTurner Mar 4, 2019

Author Contributor

Yes, they can be called in different threads, but we only ever read or write listeners under the mutex. The listeners are never called under the mutex so it is possible that the notifications happen out of order, but this is benign.
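
To illustrate the locking discipline described in this reply, here is a minimal sketch, assuming a much-simplified target with a single completion event. The class `ConnectionTargetSketch` and its methods are hypothetical and use plain `Runnable` listeners rather than `ActionListener`; the only point is that the listener list is read and written under the mutex while the listeners themselves run outside it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the ConnectionTarget from this PR.
final class ConnectionTargetSketch {
    private final Object mutex = new Object();
    private List<Runnable> listeners = new ArrayList<>(); // guarded by mutex
    private boolean completed;                            // guarded by mutex

    // May be called from any thread.
    void addListener(Runnable listener) {
        synchronized (mutex) {
            if (!completed) {
                listeners.add(listener);
                return;
            }
        }
        // Already completed: notify straight away, but not while holding the mutex.
        listener.run();
    }

    // May be called from any thread; only the first call notifies anyone.
    void complete() {
        final List<Runnable> toNotify;
        synchronized (mutex) {
            if (completed) {
                return;
            }
            completed = true;
            // Take ownership of the current listeners while holding the mutex.
            toNotify = listeners;
            listeners = new ArrayList<>();
        }
        // Run the listeners outside the mutex so a slow or re-entrant listener
        // cannot block other threads. Notifications from different threads may
        // therefore interleave or arrive out of order.
        for (Runnable listener : toNotify) {
            listener.run();
        }
    }
}
```

Because the listeners run after the mutex is released, a listener may safely call back into the same object, at the cost of losing any ordering guarantee between notifications.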

DaveCTurner added some commits Mar 4, 2019

DaveCTurner added some commits Mar 8, 2019

@DaveCTurner DaveCTurner requested a review from andrershov Mar 8, 2019

@DaveCTurner

Contributor Author

commented Mar 8, 2019

@andrershov you'll be pleased to hear I adjusted this to use a future 😁

DaveCTurner added some commits Mar 12, 2019

@henningandersen
Contributor

left a comment

LGTM.

I added a few nits/minor comments.

DaveCTurner added some commits Mar 12, 2019

@andrershov
Contributor

left a comment

Unfortunately, replacing the list of listeners with a future does not make the code much simpler. But I must confess I also cannot come up with an easier ConnectionTarget implementation using future chaining.
Nice job! LGTM

@DaveCTurner DaveCTurner merged commit 839237d into elastic:master Mar 12, 2019

8 checks passed

CLA: All commits in pull request signed
elasticsearch-ci/1: Build finished.
elasticsearch-ci/2: Build finished.
elasticsearch-ci/bwc: Build finished.
elasticsearch-ci/default-distro: Build finished.
elasticsearch-ci/docbldesx: Build finished.
elasticsearch-ci/oss-distro-docs: Build finished.
elasticsearch-ci/packaging-sample: Build finished.

@DaveCTurner DaveCTurner deleted the DaveCTurner:2019-03-02-nodeconnectionsservice-avoid-blocking-on-known-nodes branch Mar 12, 2019

DaveCTurner added a commit that referenced this pull request Mar 12, 2019

Only connect to new nodes on new cluster state (#39629)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 7, 2019

Only connect to new nodes on new cluster state

Resolves elastic#29025.
Backport of elastic#39629 and elastic#40037.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 7, 2019

Only connect to new nodes on new cluster state

Resolves elastic#29025.
Backport of elastic#39629 and elastic#40037.