Generalize TCP retxn docs to cover remote clusters (#74732)

Today the docs on setting `tcp_retries2` only talk about intra-cluster
connections, but in fact this setting is equally important to the
resilience of remote cluster connections too. This commit rewords these
docs to cover both cases.

Relates #34405
DaveCTurner committed Jul 5, 2021
1 parent c13556e commit 13f0284
Showing 1 changed file with 29 additions and 23 deletions.
docs/reference/setup/sysconfig/tcpretries.asciidoc (52 changes: 29 additions, 23 deletions)
@@ -1,32 +1,38 @@
 [[system-config-tcpretries]]
 === TCP retransmission timeout
 
-Each pair of nodes in a cluster communicates via a number of TCP connections
-which <<long-lived-connections,remain open>> until one of the nodes shuts down
-or communication between the nodes is disrupted by a failure in the underlying
+Each pair of {es} nodes communicates via a number of TCP connections which
+<<long-lived-connections,remain open>> until one of the nodes shuts down or
+communication between the nodes is disrupted by a failure in the underlying
 infrastructure.
 
-TCP provides reliable communication over occasionally-unreliable networks by
+TCP provides reliable communication over occasionally unreliable networks by
 hiding temporary network disruptions from the communicating applications. Your
 operating system will retransmit any lost messages a number of times before
-informing the sender of any problem. Most Linux distributions default to
-retransmitting any lost packets 15 times. Retransmissions back off
-exponentially, so these 15 retransmissions take over 900 seconds to complete.
-This means it takes Linux many minutes to detect a network partition or a
-failed node with this method. Windows defaults to just 5 retransmissions which
-corresponds with a timeout of around 6 seconds.
+informing the sender of any problem. {es} must wait while the retransmissions
+are happening and can only react once the operating system decides to give up.
+Users must therefore also wait for a sequence of retransmissions to complete.
+
+Most Linux distributions default to retransmitting any lost packets 15 times.
+Retransmissions back off exponentially, so these 15 retransmissions take over
+900 seconds to complete. This means it takes Linux many minutes to detect a
+network partition or a failed node with this method. Windows defaults to just 5
+retransmissions which corresponds with a timeout of around 6 seconds.
 
 The Linux default allows for communication over networks that may experience
-very long periods of packet loss, but this default is excessive for production
-networks within a single data centre as is the case for most {es} clusters.
-Highly-available clusters must be able to detect node failures quickly so that
-they can react promptly by reallocating lost shards, rerouting searches and
-perhaps electing a new master node. Linux users should therefore reduce the
-maximum number of TCP retransmissions.
+very long periods of packet loss, but this default is excessive and even harmful
+on the high quality networks used by most {es} installations. When a cluster
+detects a node failure it reacts by reallocating lost shards, rerouting
+searches, and maybe electing a new master node. Highly available clusters must
+be able to detect node failures promptly, which can be achieved by reducing the
+permitted number of retransmissions. Connections to
+<<modules-remote-clusters,remote clusters>> should also prefer to detect
+failures much more quickly than the Linux default allows. Linux users should
+therefore reduce the maximum number of TCP retransmissions.
 
-You can decrease the maximum number of TCP retransmissions to `5` by running
-the following command as `root`. Five retransmissions corresponds with a
-timeout of around six seconds.
+You can decrease the maximum number of TCP retransmissions to `5` by running the
+following command as `root`. Five retransmissions corresponds with a timeout of
+around six seconds.
 
 [source,sh]
 -------------------------------------
@@ -38,8 +44,8 @@ To set this value permanently, update the `net.ipv4.tcp_retries2` setting in
 `sysctl net.ipv4.tcp_retries2`.
 
 IMPORTANT: This setting applies to all TCP connections and will affect the
-reliability of communication with systems outside your cluster too. If your
-cluster communicates with external systems over an unreliable network then you
+reliability of communication with systems other than {es} clusters too. If your
+clusters communicate with external systems over a low quality network then you
 may need to select a higher value for `net.ipv4.tcp_retries2`. For this reason,
 {es} does not adjust this setting automatically.
 
@@ -54,6 +60,6 @@ related to these application-level health checks.
 
 You must also ensure your network infrastructure does not interfere with the
 long-lived connections between nodes, <<long-lived-connections,even if those
 connections appear to be idle>>. Devices which drop connections when they reach
-a certain age are a common source of problems to Elasticsearch clusters, and
-must not be used.
+a certain age are a common source of problems to {es} clusters, and must not be
+used.
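The timings quoted in this change (over 900 seconds for 15 retransmissions, around six seconds for five) follow from exponential backoff. The bash sketch below reproduces that arithmetic under stated assumptions that are not part of the commit: the first retransmission timeout (RTO) is the Linux minimum of 200 ms, it doubles after every retransmission, and it is capped at the Linux maximum of 120 s. The function name `retransmission_schedule` is hypothetical and is not part of Elasticsearch or any Linux tool. Because real RTOs start from the measured round-trip time, treat the results as lower bounds.

```shell
#!/usr/bin/env bash
# Sketch: approximate Linux TCP retransmission timing for a given
# net.ipv4.tcp_retries2 value.
# ASSUMPTIONS (not from the Elasticsearch docs): the first RTO equals the
# Linux minimum of 200 ms, each retransmission doubles it, and it is capped
# at the Linux maximum of 120 s. Real RTOs derive from the measured
# round-trip time, so real timeouts can be longer.

RTO_MIN_MS=200
RTO_MAX_MS=120000

# Hypothetical helper. Prints "<n> <ms>" for each retransmission n, where
# <ms> is the time after the original send at which it goes out, followed by
# "giveup <ms>", the time at which the connection is reported as failed.
retransmission_schedule() {
  local retries=$1
  local rto=$RTO_MIN_MS
  local elapsed=0
  local n
  for ((n = 1; n <= retries; n++)); do
    elapsed=$((elapsed + rto))   # wait one RTO, then retransmit
    echo "$n $elapsed"
    rto=$((rto * 2))             # exponential backoff
    if ((rto > RTO_MAX_MS)); then
      rto=$RTO_MAX_MS            # capped at the maximum RTO
    fi
  done
  # Under these assumptions the sender gives up one further RTO after the
  # final retransmission if it still has no acknowledgement.
  echo "giveup $((elapsed + rto))"
}

# When executed directly, print the schedule for the retry count given as
# the first argument (default 15, the usual Linux default).
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  retransmission_schedule "${1:-15}"
fi
```

For example, `retransmission_schedule 5` reports the fifth retransmission at 6200 ms and gives up at 12600 ms, while `retransmission_schedule 15` gives up at 924600 ms. The 924.6-second figure matches the value the Linux kernel's `ip-sysctl` documentation gives for the default `tcp_retries2` of 15, consistent with "over 900 seconds". The "around six seconds" quoted for five retransmissions lines up with the moment the fifth retransmission is sent; under the same assumptions the failure is reported after about 12.6 seconds.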