Reduce connection timeout for intra-cluster connections #29022

Open
DaveCTurner opened this issue Mar 13, 2018 · 4 comments
Labels
:Distributed/Network Http and internode communication implementations >enhancement stalled Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Contributor

A 30-second timeout for establishing node-to-node connections within a cluster is unreasonably long. We need longer timeouts for connections out of the cluster, so simply reducing transport.tcp.connect_timeout isn't feasible. A separate connection profile for node-to-node connections, with its own configurable timeout, would let attempts to connect to an unresponsive node fail much more quickly.

Relates to #28920, in which cluster state application is blocked for multiple minutes while repeated attempts to connect to unresponsive nodes take place.
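
To illustrate the idea of a per-profile connect timeout, here is a minimal sketch using plain `java.net.Socket` rather than Elasticsearch's transport code. The `ProfileKind` enum, the `connectTimeoutFor` helper, and the 2-second intra-cluster value are all assumptions made for illustration; only the 30-second figure comes from this issue.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;

final class ConnectTimeoutSketch {
    // Hypothetical profile kinds; these names do not exist in Elasticsearch.
    enum ProfileKind { INTRA_CLUSTER, EXTERNAL }

    // Picks a connect timeout per profile kind.
    static Duration connectTimeoutFor(ProfileKind kind) {
        switch (kind) {
            case INTRA_CLUSTER:
                return Duration.ofSeconds(2);  // assumed value, for illustration only
            default:
                return Duration.ofSeconds(30); // the current timeout discussed in this issue
        }
    }

    // Attempts a TCP connection, giving up after the profile's timeout.
    static boolean tryConnect(String host, int port, ProfileKind kind) {
        int timeoutMillis = (int) connectTimeoutFor(kind).toMillis();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException (timeout elapsed) and refused connections.
            return false;
        }
    }

    public static void main(String[] args) {
        for (ProfileKind kind : ProfileKind.values()) {
            System.out.println(kind + " connect timeout: " + connectTimeoutFor(kind));
        }
    }
}
```

With a split like this, an attempt to reach an unresponsive node inside the cluster would give up after the shorter timeout, while connections out of the cluster would keep the longer one.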

@DaveCTurner DaveCTurner added help wanted adoptme :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Mar 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented Mar 13, 2018

We assumed it would be simpler to change the profiles used by NodeConnectionsService and the pinging, but we should double check. Maybe it is better to reduce transport.tcp.connect_timeout and increase it for the cases where we need a longer timeout (transport client and CCS come to mind).

@DaveCTurner
Contributor Author

Altering the profile used by the NodeConnectionsService involves a nontrivial amount of plumbing, because it needs to be based on TcpTransport#defaultConnectionProfile so that it has the correct numbers of each kind of channel. However, reducing all connection timeouts except in the cases where (we think) we need longer ones seems like an invitation for trouble. This needs further discussion.
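
As a rough sketch of what "based on the default profile" could mean, the following derives a new profile from a base one, keeping its channel counts and overriding only the connect timeout. `ProfileSketch`, `ChannelType`, `withConnectTimeout`, and the channel counts used here are hypothetical stand-ins, not the actual Elasticsearch classes or defaults, and the real plumbing is likely more involved.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for a connection profile; not the Elasticsearch class.
final class ProfileSketch {
    // Illustrative channel kinds.
    enum ChannelType { RECOVERY, BULK, REG, STATE, PING }

    private final Map<ChannelType, Integer> channelCounts;
    private final Duration connectTimeout;

    ProfileSketch(Map<ChannelType, Integer> channelCounts, Duration connectTimeout) {
        // Defensive copy so later changes to the caller's map do not leak in.
        EnumMap<ChannelType, Integer> copy = new EnumMap<>(ChannelType.class);
        copy.putAll(channelCounts);
        this.channelCounts = Collections.unmodifiableMap(copy);
        this.connectTimeout = connectTimeout;
    }

    Map<ChannelType, Integer> channelCounts() {
        return channelCounts;
    }

    Duration connectTimeout() {
        return connectTimeout;
    }

    // Returns a new profile with the same channel counts but a different timeout.
    ProfileSketch withConnectTimeout(Duration timeout) {
        return new ProfileSketch(channelCounts, timeout);
    }

    // Stand-in for whatever the transport would supply as its default profile.
    static ProfileSketch assumedDefault() {
        EnumMap<ChannelType, Integer> counts = new EnumMap<>(ChannelType.class);
        counts.put(ChannelType.RECOVERY, 2); // counts are assumed, for illustration only
        counts.put(ChannelType.BULK, 3);
        counts.put(ChannelType.REG, 6);
        counts.put(ChannelType.STATE, 1);
        counts.put(ChannelType.PING, 1);
        return new ProfileSketch(counts, Duration.ofSeconds(30));
    }
}
```

A caller would then write something like `ProfileSketch.assumedDefault().withConnectTimeout(Duration.ofSeconds(2))` to get a node-to-node profile that keeps the default channel layout. The base profile is left unchanged, so other users of the default keep the longer timeout.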

@DaveCTurner DaveCTurner added team-discuss and removed help wanted adoptme labels Jun 24, 2018
@DaveCTurner DaveCTurner added help wanted adoptme and removed team-discuss labels Jul 3, 2018
@DaveCTurner
Contributor Author

We discussed this and decided that connection management is likely to undergo broader changes than this, so it makes sense to park this for now and revisit it when the dust has settled.

@DaveCTurner DaveCTurner added stalled and removed help wanted adoptme labels Jul 3, 2018
@lcawl lcawl added v6.4.1 and removed v6.4.0 labels Aug 23, 2018
@DaveCTurner DaveCTurner added :Distributed/Network Http and internode communication implementations and removed :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Mar 12, 2019
@polyfractal polyfractal added v7.2.0 and removed v7.0.0 labels Apr 9, 2019
@jakelandis jakelandis added v7.3.0 and removed v7.2.0 labels Jun 17, 2019
@jpountz jpountz removed the v6.4.1 label Jul 5, 2019
@jpountz jpountz removed the v7.3.0 label Jul 5, 2019
@rjernst rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
9 participants