Reduce connection timeout for intra-cluster connections #29022

DaveCTurner · 2018-03-13T16:33:09Z

A 30-second timeout for establishing node-to-node connections within a cluster is unreasonably long. We need longer timeouts for connections out of the cluster, so simply reducing transport.tcp.connect_timeout isn't feasible. A separate connection profile for node-to-node connections, with its own configurable timeout, would let attempts to connect to an unresponsive node fail much more quickly.

Relates to #28920, in which cluster state application is blocked for multiple minutes while repeated attempts to connect to unresponsive nodes take place.
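To make the effect concrete, here is a rough sketch (in Python rather than Elasticsearch's Java, and not taken from the codebase). It assumes that failed attempts happen one after another and that each one runs to its full timeout. The `ConnectionKind` names, the lookup table, and the 5-second intra-cluster value are all hypothetical; only the 30-second figure comes from the description above.

```python
from enum import Enum


class ConnectionKind(Enum):
    """Hypothetical split between the two families of connections."""
    INTRA_CLUSTER = "intra_cluster"  # node-to-node within one cluster
    EXTERNAL = "external"            # connections out of the cluster


# 30 s is the timeout described in this issue; 5 s is an arbitrary shorter
# value picked purely for illustration, not a proposed default.
CONNECT_TIMEOUT_SECONDS = {
    ConnectionKind.EXTERNAL: 30.0,
    ConnectionKind.INTRA_CLUSTER: 5.0,
}


def connect_timeout_for(kind: ConnectionKind) -> float:
    """Return the connect timeout that applies to this kind of connection."""
    return CONNECT_TIMEOUT_SECONDS[kind]


def worst_case_wait(kind: ConnectionKind, failed_attempts: int) -> float:
    """Seconds spent waiting if every attempt times out, one after another."""
    return connect_timeout_for(kind) * failed_attempts
```

Under these assumptions, six consecutive failed attempts cost 180 seconds with the 30-second timeout, against 30 seconds with the illustrative 5-second one.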


elasticmachine · 2018-03-13T16:33:11Z

Pinging @elastic/es-distributed

bleskes · 2018-03-13T21:26:32Z

We assumed it would be simpler to change the profiles used by NodeConnectionService and the pinging, but we should double-check. Maybe it is better to reduce transport.tcp.connect_timeout and increase it for the cases where we need it (transport client and CCS come to mind).

DaveCTurner · 2018-06-24T19:50:38Z

Altering the profile used by the NodeConnectionsService involves a nontrivial amount of plumbing, because it needs to be based on TcpTransport#defaultConnectionProfile so that it has the correct numbers of each kind of channel. Yet, reducing all connection timeouts except for the cases (we think) we need seems like an invitation for trouble. This needs further discussion.
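The constraint can be pictured with a small sketch (in Python rather than Java, and not the Elasticsearch API): derive the node-to-node profile from the default one so that the channel counts carry over and only the connect timeout changes. The `ConnectionProfile` class here, its field names, the channel counts, and the 5-second value are all hypothetical stand-ins.

```python
from dataclasses import dataclass, replace
from typing import Mapping


@dataclass(frozen=True)
class ConnectionProfile:
    """Hypothetical stand-in for a transport connection profile."""
    channels_per_kind: Mapping[str, int]
    connect_timeout_seconds: float


# Illustrative default only: the channel kinds and counts are placeholders,
# not values read from TcpTransport#defaultConnectionProfile.
DEFAULT_PROFILE = ConnectionProfile(
    channels_per_kind={"recovery": 2, "bulk": 3, "reg": 6, "state": 1, "ping": 1},
    connect_timeout_seconds=30.0,
)


def with_connect_timeout(base: ConnectionProfile, timeout_seconds: float) -> ConnectionProfile:
    """Copy `base`, keeping its channel counts and changing only the timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return replace(base, connect_timeout_seconds=timeout_seconds)


# A node-to-node profile with a shorter, purely illustrative timeout.
NODE_TO_NODE_PROFILE = with_connect_timeout(DEFAULT_PROFILE, 5.0)
```

The derivation itself is simple. The plumbing cost mentioned above presumably comes from making the default profile available wherever the node-to-node profile has to be built.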

DaveCTurner · 2018-07-03T14:40:50Z

We discussed this and decided that connection management is likely to undergo broader changes than this one, so it makes sense to park this for now. To be revisited when the dust has settled.

DaveCTurner added help wanted adoptme :Distributed/Recovery labels Mar 13, 2018

DaveCTurner added v7.0.0 v6.3.0 labels Mar 13, 2018

DaveCTurner mentioned this issue Mar 13, 2018

Slow recovery of write availability after partition of a large cluster #28920

Closed

colings86 added the >enhancement label Apr 24, 2018

bleskes added v6.3.1 v6.4.0 and removed v6.3.0 v6.3.1 labels Apr 26, 2018

DaveCTurner added team-discuss and removed help wanted adoptme labels Jun 24, 2018

DaveCTurner added help wanted adoptme and removed team-discuss labels Jul 3, 2018

DaveCTurner added stalled and removed help wanted adoptme labels Jul 3, 2018

DaveCTurner mentioned this issue Aug 1, 2018

MinimumMasterNodesIT fails due to connection timeout to shut-down node #32552

Closed

lcawl added v6.4.1 and removed v6.4.0 labels Aug 23, 2018

DaveCTurner added :Distributed/Network and removed :Distributed/Recovery labels Mar 12, 2019

polyfractal added v7.2.0 and removed v7.0.0 labels Apr 9, 2019

jakelandis added v7.3.0 and removed v7.2.0 labels Jun 17, 2019

jpountz removed the v6.4.1 label Jul 5, 2019

jpountz removed the v7.3.0 label Jul 5, 2019

rjernst added the Team:Distributed label May 4, 2020
