Improve CCS network faults detection #34405

Open
javanna opened this Issue Oct 11, 2018 · 9 comments

6 participants
@javanna
Member

javanna commented Oct 11, 2018

When using Cross Cluster Search and a remote cluster becomes unreachable due to network issues, it takes the CCS node a while to detect that. This seems particularly bad if a firewall in between drops connections, as it makes CCS searches hang even though TCP connections can still be initiated from the CCS node to the remote cluster nodes on port 9300.

This has been reported on our forum and also in #30247.

There are a few things that we should do to try and improve this:

  1. enable scheduled pings at the transport layer for CCS, like we already do for the transport client (see also #5067 and #10189; a sketch of the transport client setting follows this list)

  2. possibly adapt the transport ping to come back with a response and support a timeout

  3. given that CCS searches are timing out in the initial search_shards phase, we may also want to apply a sensible timeout and/or make the timeout configurable (see #32678)
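
For reference, the scheduled pings mentioned in item 1 are driven in the transport client by the transport.ping_schedule setting (it defaults to 5s there and is disabled elsewhere); here is a minimal sketch of how that existing setting is used today, assuming the 6.x transport client API, with illustrative cluster and host names. The proposal is to offer something equivalent on the connections a CCS node holds to its remote clusters.

```java
import java.net.InetAddress;

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class PingScheduleExample {
    public static void main(String[] args) throws Exception {
        // The transport client can already send lightweight pings on a schedule to keep
        // its connections alive; item 1 proposes the same for CCS remote connections.
        Settings settings = Settings.builder()
                .put("cluster.name", "remote-cluster")   // illustrative cluster name
                .put("transport.ping_schedule", "5s")    // ping each connection every 5 seconds
                .build();
        try (PreBuiltTransportClient client = new PreBuiltTransportClient(settings)) {
            client.addTransportAddress(
                    new TransportAddress(InetAddress.getByName("remote-node.example.com"), 9300));
            // ... use the client; scheduled pings run in the background on each connection
        }
    }
}
```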

@tbrooks8
Contributor

tbrooks8 commented Oct 11, 2018

possibly adapt the transport ping to come back with a response and support a timeout

I want to leave a comment about this option. Currently a ping is 6 bytes ('E','S',-1). There is no response.

  1. We could move this to be an actual transport message (say TransportPing). This makes the message larger as there are a number of things required for a whole transport message.

  2. We could also expand the unique framing of a ping (say 'E', 'S', -1, and then a random or incrementing int or long as a ping identifier). This would allow the receiving node to echo the ping back to the sending node. We could make it so that the client connection sends pings and the server connection responds. This would still test writes in both directions. If the echo were to time out, the client would close the connection. (A rough sketch follows this list.)

  3. We could make some type of CCS specific ping (similar to how I think we have a ZenPing). But then we probably need one for CCR eventually.
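
To make option 2 concrete, here is a rough sketch of what such an extended frame could look like; the class, the identifier field, and the echo/timeout behaviour are hypothetical illustrations, not the actual TcpTransport wire format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical option-2 ping frame: the current 6-byte header ('E', 'S', length = -1)
// followed by an identifier that the receiving side echoes back unchanged.
final class ExtendedPing {

    static byte[] encode(long pingId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(14);
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte('E');
        out.writeByte('S');
        out.writeInt(-1);       // -1 length is how pings are identified today
        out.writeLong(pingId);  // new: random or incrementing ping identifier
        return bytes.toByteArray();
    }

    // Reads the identifier from an incoming ping; the receiver would write the same
    // frame back so the sender can match the echo and cancel its timeout.
    static long readPingId(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readByte() != 'E' || data.readByte() != 'S' || data.readInt() != -1) {
            throw new IOException("not a ping frame");
        }
        return data.readLong();
    }
}
```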

@tbrooks8
Contributor

tbrooks8 commented Oct 11, 2018

I think that #3 is the worst option.

Option #2 is okay, but it is a little gross, as -1 is technically supposed to mean that the message has no length (that is how we identify pings), yet now there would be an additional int after it.

@DaveCTurner
Contributor

DaveCTurner commented Oct 12, 2018

I think #1 would be a fine solution here.

I've most often seen this where a firewall decides a connection is idle and drops it, black-holing any future traffic on that connection. In this case, either TCP keepalives or transport pings are sufficient to prevent it, because the firewall will see packets in both directions (the message itself, and the corresponding ACK) and will not drop the connection. The issue we often face with TCP keepalives is that security policy sometimes oddly dictates that keepalives may not be set below the default of 2h (on Linux) while the firewall drops connections after 1h, which is why we have to use transport pings too. We recommend properly configured TCP keepalives in the docs but do not spell out that this applies to cross-cluster connections too.

In any case if a keepalive or a ping doesn't go through then we will receive a notification a short while later, regardless of whether there's any application-level response, because we can rely on TCP retrying a few times until it receives an ACK and then eventually closing the connection, to which we react.
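
For reference, per-connection keepalive timing can also be tuned from Java without touching the system-wide sysctls, although only on JDK 11+ via jdk.net.ExtendedSocketOptions; a minimal sketch with illustrative values (the host name is made up and the numbers are not a recommendation):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

import jdk.net.ExtendedSocketOptions;

public class KeepaliveSketch {
    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            // Enable TCP keepalives on this connection.
            channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // Probe after 5 minutes of idleness instead of the 2h Linux default, so an
            // idle-connection firewall keeps seeing packets (and ACKs) in both directions.
            channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 300);
            channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // seconds between probes
            channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // probes before giving up
            channel.connect(new InetSocketAddress("remote-node.example.com", 9300));
        }
    }
}
```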

@DaveCTurner
Contributor

DaveCTurner commented Oct 12, 2018

I think #1 would be a fine solution here.

I meant @javanna's #1, i.e., using application-level pings.

@javanna
Member

javanna commented Oct 12, 2018

I agree that the application-level pings will help keep connections alive, but sadly we will still not detect network disconnections quickly enough (see https://discuss.elastic.co/t/elasticsearch-ccs-client-get-timeout-when-remote-cluster-is-isolated-by-firewall/152019/6). Does that make sense to you as well, @DaveCTurner?

@DaveCTurner
Contributor

DaveCTurner commented Oct 12, 2018

I dug into the details a bit further and was surprised by Linux's default behaviour here.

On Linux the number of retransmissions for a TCP packet before the connection is dropped is /proc/sys/net/ipv4/tcp_retries2, which defaults to 15. Retries start quickly but back off exponentially, so 15 retries take a little over 15 minutes. I'm surprised to learn it's this long, and 15 retries seems unreasonably many here. I think I must have tested things on systems on which this had been lowered from the default; I tried reducing it to 6 and observed connections being detected as dead and dropped after ~30sec. However, this is a system-wide parameter, so it may not be desirable to change it, and we'd certainly need to think carefully before issuing a general recommendation about this. There are already places that suggest reducing this to 3 in an HA situation, and they are not alone, even though RFC 1122 suggests it should be ≥8. To me 3 feels too low for connections that span any appreciable geographical distance, but it might be fine within a single datacentre.
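
As a back-of-the-envelope illustration of where these durations come from (assuming the retransmission timeout starts around 200 ms, doubles on every retry, and is capped at 120 s; the kernel's actual boundary calculation also depends on the measured RTT, so real timings will differ from this lower-bound estimate):

```java
public class RetransmitBackoff {
    // Rough lower bound on how long the kernel keeps retransmitting before the
    // connection is dropped, for a given tcp_retries2 value.
    static double estimateSeconds(int retries, double initialRtoSeconds, double maxRtoSeconds) {
        double rto = initialRtoSeconds;
        double total = 0;
        for (int i = 0; i < retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, maxRtoSeconds); // exponential back-off with a cap
        }
        return total;
    }

    public static void main(String[] args) {
        // With the default of 15 retries the back-off sums to many minutes;
        // with 6 retries it gives up well within a minute.
        System.out.printf("tcp_retries2=15 -> ~%.0f s%n", estimateSeconds(15, 0.2, 120));
        System.out.printf("tcp_retries2=6  -> ~%.0f s%n", estimateSeconds(6, 0.2, 120));
    }
}
```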

On Linux there's also the per-connection TCP_USER_TIMEOUT but this isn't portable (see https://bugs.openjdk.java.net/browse/JDK-8038145). I haven't looked into the alternatives available on Windows.

One advantage of dealing with this at the TCP layer is that it's really just looking at the connection, and is insensitive to things like a GC pause on the remote node. However if we want to avoid this kind of TCP tuning then adapting the application-level pings to be bidirectional does seem like a better approach. One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model.

@s1monw
Contributor

s1monw commented Oct 13, 2018

One possible alternative is to follow STOMP's model and negotiate bidirectional pings in the handshake instead of having a strict request/response model.

I did think of this as well, but it might be tricky for us since we don't necessarily have a bi-directional connection here. What we can do is drive the heartbeat from one side of the connection and not necessarily wait for a response. In such a case we can just send back a ping every time we receive one. This way we can implement it at the top level in TcpTransport and don't have to break all our abstractions. If we then didn't receive a ping from a node for X ms we could still declare the connection dead.
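
A hypothetical sketch of that one-sided heartbeat (the class, names, and scheduling mechanism are made up for illustration; the real change would live inside TcpTransport):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// One side sends pings on a schedule, the other side simply echoes every ping it
// receives, and the sender declares the connection dead if nothing has arrived
// for `timeoutMillis` - no strict request/response pairing required.
final class HeartbeatSketch {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long lastPingReceivedMillis = System.currentTimeMillis();

    void start(Runnable sendPing, Runnable closeConnection, long intervalMillis, long timeoutMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            sendPing.run(); // fire-and-forget ping; the remote side echoes it back
            if (System.currentTimeMillis() - lastPingReceivedMillis > timeoutMillis) {
                closeConnection.run(); // no ping seen for too long: treat the connection as dead
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Called from the transport layer whenever a ping frame arrives; the receiving
    // side would also respond by sending a ping back on the same connection.
    void onPingReceived() {
        lastPingReceivedMillis = System.currentTimeMillis();
    }
}
```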
