Improve CCS network faults detection #34405

Open
javanna opened this Issue Oct 11, 2018 · 9 comments

6 participants
@javanna
Member

javanna commented Oct 11, 2018

When using Cross Cluster Search and a remote cluster becomes unreachable due to network issues, it takes the CCS node a while to detect that. This seems particularly bad if a firewall in between drops connections, as it makes CCS searches hang even though TCP connections can still be initiated from the CCS node to the remote cluster nodes on port 9300.

This has been reported on our forum and also in #30247.

There are a few things that we should do to try and improve this:

  1. enable scheduled pings at the transport layer for CCS, like we already do for the transport client (see also #5067 and #10189; a sketch of the transport client setting follows this list)

  2. possibly adapt the transport ping to come back with a response and support a timeout

  3. given that CCS searches are timing out in the initial search_shards phase, we may also want to apply a sensible timeout and/or make the timeout configurable (see #32678)
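
For reference, the scheduled pings mentioned in item 1 are driven in the transport client by the transport.ping_schedule setting (it defaults to 5s there and is disabled elsewhere); here is a minimal sketch of how that existing setting is used today, assuming the 6.x transport client API, with illustrative cluster and host names. The proposal is to offer something equivalent on the connections a CCS node holds to its remote clusters.

```java
import java.net.InetAddress;

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class PingScheduleExample {
    public static void main(String[] args) throws Exception {
        // The transport client can already send lightweight pings on a schedule to keep
        // its connections alive; item 1 proposes the same for CCS remote connections.
        Settings settings = Settings.builder()
                .put("cluster.name", "remote-cluster")   // illustrative cluster name
                .put("transport.ping_schedule", "5s")    // ping each connection every 5 seconds
                .build();
        try (PreBuiltTransportClient client = new PreBuiltTransportClient(settings)) {
            client.addTransportAddress(
                    new TransportAddress(InetAddress.getByName("remote-node.example.com"), 9300));
            // ... use the client; scheduled pings run in the background on each connection
        }
    }
}
```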

@tbrooks8
Contributor

tbrooks8 commented Oct 11, 2018

possibly adapt the transport ping to come back with a response and support a timeout

I want to leave a comment about this option. Currently a ping is 6 bytes ('E','S',-1). There is no response.

  1. We could move this to be an actual transport message (say TransportPing). This makes the message larger as there are a number of things required for a whole transport message.

  2. We could also expand the unique framing of a ping (say 'E', 'S', -1, and then a random or incrementing int or long as a ping identifier). This would allow the receiving node to echo the ping back to the sending node. We could make it so that the client connection sends pings and the server connection responds. This would still test writes in both directions. If the echo were to time out, the client would close the connection. (A rough sketch follows this list.)

  3. We could make some type of CCS specific ping (similar to how I think we have a ZenPing). But then we probably need one for CCR eventually.
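
To make option 2 concrete, here is a rough sketch of what such an extended frame could look like; the class, the identifier field, and the echo/timeout behaviour are hypothetical illustrations, not the actual TcpTransport wire format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical option-2 ping frame: the current 6-byte header ('E', 'S', length = -1)
// followed by an identifier that the receiving side echoes back unchanged.
final class ExtendedPing {

    static byte[] encode(long pingId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(14);
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte('E');
        out.writeByte('S');
        out.writeInt(-1);       // -1 length is how pings are identified today
        out.writeLong(pingId);  // new: random or incrementing ping identifier
        return bytes.toByteArray();
    }

    // Reads the identifier from an incoming ping; the receiver would write the same
    // frame back so the sender can match the echo and cancel its timeout.
    static long readPingId(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readByte() != 'E' || data.readByte() != 'S' || data.readInt() != -1) {
            throw new IOException("not a ping frame");
        }
        return data.readLong();
    }
}
```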

@tbrooks8
Contributor

tbrooks8 commented Oct 11, 2018

I think that #3 is the worst option.

Option #2 is okay, but it is a little gross, as -1 is technically supposed to mean that the message has no length (that is how we identify pings), yet now there would be an additional int after it.

@DaveCTurner
Contributor

DaveCTurner commented Oct 12, 2018

I think #1 would be a fine solution here.

I've most often seen this where a firewall decides a connection is idle and drops it, black-holing any future traffic on that connection. In this case, either TCP keepalives or transport pings are sufficient to prevent it, because the firewall will see packets in both directions (the message itself, and the corresponding ACK) and will not drop the connection. The issue we often face with TCP keepalives is that security policy sometimes oddly dictates that keepalives may not be set below the default of 2h (on Linux) while the firewall drops connections after 1h, which is why we have to use transport pings too. We recommend properly configured TCP keepalives in the docs but do not spell out that this applies to cross-cluster connections too.

In any case if a keepalive or a ping doesn't go through then we will receive a notification a short while later, regardless of whether there's any application-level response, because we can rely on TCP retrying a few times until it receives an ACK and then eventually closing the connection, to which we react.
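
For reference, per-connection keepalive timing can also be tuned from Java without touching the system-wide sysctls, although only on JDK 11+ via jdk.net.ExtendedSocketOptions; a minimal sketch with illustrative values (the host name is made up and the numbers are not a recommendation):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

import jdk.net.ExtendedSocketOptions;

public class KeepaliveSketch {
    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            // Enable TCP keepalives on this connection.
            channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // Probe after 5 minutes of idleness instead of the 2h Linux default, so an
            // idle-connection firewall keeps seeing packets (and ACKs) in both directions.
            channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 300);
            channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // seconds between probes
            channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // probes before giving up
            channel.connect(new InetSocketAddress("remote-node.example.com", 9300));
        }
    }
}
```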

@DaveCTurner
Contributor

DaveCTurner commented Oct 12, 2018

I think #1 would be a fine solution here.

I meant @javanna's #1, i.e., using application-level pings.

@javanna
Member

javanna commented Oct 12, 2018

I agree that the application-level pings will help keep connections alive, but sadly we will still not detect network disconnections quickly enough (see https://discuss.elastic.co/t/elasticsearch-ccs-client-get-timeout-when-remote-cluster-is-isolated-by-firewall/152019/6). Does that make sense to you as well, @DaveCTurner?

@DaveCTurner
Contributor

DaveCTurner commented Oct 12, 2018

I dug into the details a bit further and was surprised by Linux's default behaviour here.

On Linux the number of retransmissions for a TCP packet before the connection is dropped is /proc/sys/net/ipv4/tcp_retries2, which defaults to 15. Retries start quickly but back off exponentially, so 15 retries take a little over 15 minutes. I'm surprised to learn it's this long, and 15 retries seems unreasonably many here. I think I must have tested things on systems on which this had been lowered from the default; I tried reducing it to 6 and observed connections being detected as dead and dropped after ~30sec. However, this is a system-wide parameter, so it may not be desirable to change it, and we'd certainly need to think carefully before issuing a general recommendation about this. There are already places that suggest reducing this to 3 in an HA situation, and they are not alone, even though RFC 1122 suggests it should be ≥8. To me 3 feels too low for connections that span any appreciable geographical distance, but it might be fine within a single datacentre.
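
As a back-of-the-envelope illustration of where these durations come from (assuming the retransmission timeout starts around 200 ms, doubles on every retry, and is capped at 120 s; the kernel's actual boundary calculation also depends on the measured RTT, so real timings will differ from this lower-bound estimate):

```java
public class RetransmitBackoff {
    // Rough lower bound on how long the kernel keeps retransmitting before the
    // connection is dropped, for a given tcp_retries2 value.
    static double estimateSeconds(int retries, double initialRtoSeconds, double maxRtoSeconds) {
        double rto = initialRtoSeconds;
        double total = 0;
        for (int i = 0; i < retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, maxRtoSeconds); // exponential back-off with a cap
        }
        return total;
    }

    public static void main(String[] args) {
        // With the default of 15 retries the back-off sums to many minutes;
        // with 6 retries it gives up well within a minute.
        System.out.printf("tcp_retries2=15 -> ~%.0f s%n", estimateSeconds(15, 0.2, 120));
        System.out.printf("tcp_retries2=6  -> ~%.0f s%n", estimateSeconds(6, 0.2, 120));
    }
}
```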

On Linux there's also the per-connection TCP_USER_TIMEOUT but this isn't portable (see https://bugs.openjdk.java.net/browse/JDK-8038145). I haven't looked into the alternatives available on Windows.

One advantage of dealing with this at the TCP layer is that it's really just looking at the connection, and is insensitive to things like a GC pause on the remote node. However if we want to avoid this kind of TCP tuning then adapting the application-level pings to be bidirectional does seem like a better approach. One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model.

@s1monw
Contributor

s1monw commented Oct 13, 2018

One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model.

I did think of this as well, but it might be tricky for us since we don't necessarily have a bi-directional connection here. What we can do is drive the heartbeat from one side of the connection and not necessarily wait for a response. In such a case we can just send back a ping every time we receive one. This way we can implement it at the top level in TcpTransport and don't have to break all our abstractions. If we then didn't receive a ping from a node for X ms we could still declare the connection dead.
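
A hypothetical sketch of that one-sided heartbeat (the class, names, and scheduling mechanism are made up for illustration; the real change would live inside TcpTransport):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// One side sends pings on a schedule, the other side simply echoes every ping it
// receives, and the sender declares the connection dead if nothing has arrived
// for `timeoutMillis` - no strict request/response pairing required.
final class HeartbeatSketch {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long lastPingReceivedMillis = System.currentTimeMillis();

    void start(Runnable sendPing, Runnable closeConnection, long intervalMillis, long timeoutMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            sendPing.run(); // fire-and-forget ping; the remote side echoes it back
            if (System.currentTimeMillis() - lastPingReceivedMillis > timeoutMillis) {
                closeConnection.run(); // no ping seen for too long: treat the connection as dead
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Called from the transport layer whenever a ping frame arrives; the receiving
    // side would also respond by sending a ping back on the same connection.
    void onPingReceived() {
        lastPingReceivedMillis = System.currentTimeMillis();
    }
}
```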
