Allow to enable pings for specific remote clusters #34753

javanna · 2018-10-23T14:37:20Z

When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. It is already possible to enable application-level pings through the `transport.ping_schedule` setting, but that setting also affects intra-cluster communication. With this PR we add a new per-cluster setting (called `cluster.remote.${cluster_alias}.transport.ping_schedule`) that allows configuring application-level pings for each remote cluster.

Relates to #34405
Possibly relates to #30247

When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we enable application-level pings by default every 5 seconds from CCS nodes to the selected remote nodes. We also add a setting called `cluster.remote.ping_schedule` that allows changing the interval and potentially disabling application-level pings. It is similar to `transport.ping_schedule`, but the new setting only affects connections made to remote clusters. Relates to elastic#34405

elasticmachine · 2018-10-23T14:37:23Z

Pinging @elastic/es-distributed

javanna · 2018-10-23T14:38:27Z

I am contemplating marking this as a bug and backporting this, given that we heard of connectivity problems to remote clusters in quite a few cases, and this change should help those.

jasontedor · 2018-10-23T15:00:14Z

I am not sure about making this the default. I can be convinced, but my bias would be that this be similar to the within-cluster pings and be disabled by default.

Also, we are going to have a larger need to enable different transport settings on remote cluster connections versus within-cluster connections. For example, see #34483. I think we should have an eye towards how it would look and so I would propose that the setting be namespaced under cluster.remote.<name of cluster>.transport.<transport setting>. So that's two changes:

- this would be on a per remote cluster basis
- a different namespace for the setting

Finally, I don't think this should be considered a bug fix. I agree with your current labeling of an enhancement.

javanna · 2018-10-23T15:43:42Z

We've had multiple cases in discuss forums etc. where users have connectivity problems when using CCS. We've discussed this as part of #34405 with @tbrooks8, @s1monw and @DaveCTurner, which makes me reasonably sure that this is a good improvement. Scheduled pings have defaulted to 5s for the transport client for a long time, and I think that CCS is comparable to the transport client: it connects to remote nodes, there are firewalls in-between etc., so the issues are different compared to those that you have within the same cluster.

One reason why this (or some other similar) change is needed is that it affects only connections to remote clusters. In some cases we've suggested enabling scheduled transport pings in order to fix CCS connectivity issues, but that also enables intra-cluster transport pings, which is not desirable. Adding the new setting also allows disabling those pings if they are not needed, but having them enabled by default should be a good trade-off. I specifically added the setting as a global one for all remote clusters, because I did not see the need to make things configurable per cluster, which would also complicate the implementation. Would you see this configured per cluster because in the context of CCR scheduled pings would not be needed? Or what is the scenario where different clusters need different ping intervals?

DaveCTurner · 2018-10-24T08:39:49Z

I think this is a valuable feature, and I think Jason's proposal to have this setting be different for different remote clusters is a good improvement.

I'm undecided about whether we should switch this on by default. I don't expect it to have a very significant impact on resource usage, but it does have some impact, particularly if there are many clusters in a deployment. Also it would (for instance) make it slightly harder to interpret the flow of traffic in a tcpdump capture. Based on the number of times this comes up, I think this solves a problem experienced by a fairly small fraction of users, and these users can switch it on themselves.

I'm also undecided about the default being 5s. A common idle-connection timeout is 1h, so this could reasonably default to tens-of-minutes and still solve a lot of problems.

I think it'd be best to introduce this feature but not to change the default in this PR, and we can have a separate debate about the default at a later date.

javanna · 2018-10-24T08:46:30Z

Thanks for your feedback David. What would be the use case that requires configuring the scheduled pings per cluster?

DaveCTurner · 2018-10-24T09:20:10Z

For instance, a deployment that federates a bunch of clusters spread around the world may want to avoid the unnecessary pings between clusters within each datacentre but may require them for more distant clusters; the need for pings (and their required frequency) may vary on a link-by-link basis.

javanna · 2018-10-24T15:15:44Z

I have updated the PR based on the comments, reviews would be appreciated ;)

jasontedor · 2018-10-24T15:23:05Z

> For instance, a deployment that federates a bunch of clusters spread around the world may want to avoid the unnecessary pings between clusters within each datacentre but may require them for more distant clusters; the need for pings (and their required frequency) may vary on a link-by-link basis.

+1

Indeed, we renamed CCR from XDCR (i.e., cross-cluster over cross-datacenter) in recognition of the fact that we expect many use-cases for replicating amongst clusters within a single datacenter; it's not limited to replicating across the globe.

DaveCTurner

I mainly looked at the docs and tests and left some thoughts.

DaveCTurner · 2018-10-24T15:22:34Z

docs/reference/modules/remote-clusters.asciidoc

@@ -152,6 +152,14 @@ PUT _cluster/settings
  by default, but they can selectively be made optional by setting this setting
  to `true`.

+`cluster.remote.${cluster_alias}.transport.ping_schedule`::
+
+  Schedule a regular application-level ping message to ensure that transport


I think we need to say that this setting sets the time between pings, otherwise it's not clear what values other than -1 mean. I also think that the "defaults to ... which defaults to ..." in the last sentence might cause confusion. I drafted an alternative:

Sets the time interval between regular application-level ping messages that are sent to ensure that transport connections to nodes belonging to remote clusters are kept alive. If set to -1, application-level ping messages to this remote cluster are not sent. If unset, application-level ping messages are sent according to the global transport.ping_schedule setting, which defaults to -1 meaning that pings are not sent.

I'm not sure this is correct, however. If we set

transport.ping_schedule: 5s
cluster.remote.foo.transport.ping_schedule: -1

Does this disable pings to the foo remote? Should it? I think it'd be useful to be able to do so. I haven't dug into the implementation but there's no test for this case as far as I can see.

javanna · 2018-10-24

It sounds great; I was hoping you would help out rephrasing the docs, thanks a lot for that. The behaviour should be what you describe, with `transport.ping_schedule` as a fallback, but I will add a test for the specific case that you mention. It's a good point.
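The fallback behaviour discussed in this thread can be illustrated with a small sketch. It is not the Elasticsearch implementation: the function name `effective_ping_schedule` and the plain-dictionary settings are hypothetical, and only the two setting keys and the `-1` "disabled" value come from the discussion above.

```python
DISABLED = -1  # value used in the thread to mean "no application-level pings"


def effective_ping_schedule(settings, cluster_alias):
    """Resolve the ping interval for one remote cluster (hypothetical helper).

    The per-cluster setting wins when it is present, even when it is -1;
    otherwise the global transport.ping_schedule is used, which itself
    defaults to -1 (disabled).
    """
    per_cluster_key = f"cluster.remote.{cluster_alias}.transport.ping_schedule"
    if per_cluster_key in settings:
        return settings[per_cluster_key]
    return settings.get("transport.ping_schedule", DISABLED)


# The case raised in the review: global pings on, remote "foo" opted out.
settings = {
    "transport.ping_schedule": "5s",
    "cluster.remote.foo.transport.ping_schedule": DISABLED,
}
print(effective_ping_schedule(settings, "foo"))  # per-cluster override
print(effective_ping_schedule(settings, "bar"))  # falls back to the global value
```

Under these assumptions, an explicit `-1` for `foo` disables pings to that remote even though the global schedule is `5s`, while a remote with no override inherits the global value.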

DaveCTurner · 2018-10-24T15:30:45Z

docs/reference/modules/transport.asciidoc

-keep-alives apply to all kinds of long-lived connection and not just to
+`5s` in the transport client and `-1` (disabled) elsewhere. It is preferable
+to correctly configure TCP keep-alives instead of using this feature, because
+TCP keep-alives apply to all kinds of long-lived connections and not just to


Debates rage to this day on the internet about the use of singular or plural after "kinds of". I suspect that British English prefers the singular and US English the plural, and both are ok 😄 🇬🇧 (I'm ok with this change, just thought you'd like to know)

javanna · 2018-10-24

interesting :)

Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753

javanna · 2018-10-26T13:23:36Z

I have addressed the comments, tests are green, @DaveCTurner would you mind having another look please?

DaveCTurner · 2018-10-29T11:24:26Z

I think you didn't intend to close this @matriv - perhaps a typo?

javanna · 2018-10-29T14:48:17Z

retest this please

DaveCTurner

LGTM. I left a handful of optional suggestions but this looks good either way.

server/src/test/java/org/elasticsearch/transport/RemoteClusterServiceTests.java

javanna · 2018-10-30T14:55:44Z

retest this please

When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we allow enabling application-level pings specifically from CCS nodes to the selected remote nodes through the new setting `cluster.remote.${clusterAlias}.transport.ping_schedule`. The new setting is similar to `transport.ping_schedule` but it does not affect intra-cluster communication: pings are only sent to a specific remote cluster when explicitly enabled, as they are disabled by default. Relates to #34405

This is related to #34405 and a follow-up to #34753. It makes a number of changes to our current keepalive pings. The ping interval configuration is moved to the ConnectionProfile. The server channel now responds to pings. This makes the keepalive pings bidirectional. On the client-side, the pings can now be optimized away. What this means is that if the channel has received a message or sent a message since the last pinging round, the ping is not sent for this round.
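The client-side optimization described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the follow-up PR: the class `KeepAliveChannel` and its methods are hypothetical names, and the sketch assumes that a ping itself does not count as channel traffic.

```python
class KeepAliveChannel:
    """Hypothetical channel that skips a ping when it was recently active."""

    def __init__(self):
        self.had_traffic = False  # set whenever a message is sent or received
        self.pings_sent = 0

    def on_message(self):
        """Record that a regular message was sent or received."""
        self.had_traffic = True

    def ping_round(self):
        """Run one pinging round; return True if a ping was actually sent."""
        if self.had_traffic:
            # The channel was active since the last round, so the
            # connection is known to be alive and the ping is skipped.
            self.had_traffic = False
            return False
        self.pings_sent += 1
        return True


channel = KeepAliveChannel()
channel.ping_round()   # idle channel: a ping is sent
channel.on_message()   # regular traffic arrives
channel.ping_round()   # active since the last round: the ping is skipped
```

The design choice here is that regular traffic already proves the connection is alive, so pings are only needed on channels that would otherwise sit idle for a whole round.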

javanna added 3 commits October 23, 2018 16:34

remove constructor

e016ae7

remove connection manager public constructor

edfdac5

javanna added >enhancement :Distributed/Network Http and internode communication implementations v7.0.0 v6.5.0 labels Oct 23, 2018

javanna requested review from Tim-Brooks and DaveCTurner October 23, 2018 14:37

fix check-style

19b4bac

javanna mentioned this pull request Oct 24, 2018

Improve CCS network faults detection #34405

Closed

5 tasks

per cluster setting, default to -1

239ddc1

javanna changed the title ~~Schedule ping by default for remote clusters~~ Allow to enable pings for specific remote clusters Oct 24, 2018

make method package private

fae393a

javanna added v6.6.0 and removed v6.5.0 labels Oct 24, 2018

DaveCTurner reviewed Oct 24, 2018

Merge branch 'master' into enhancement/remote_clusters_pings

ce98370

matriv added a commit to matriv/elasticsearch that referenced this pull request Oct 25, 2018

SQL: Fix and enable test with randomness

c99a35a

Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753

matriv added a commit to matriv/elasticsearch that referenced this pull request Oct 25, 2018

SQL: [Test] Fix and enable test with randomness

778e4d1

Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753

matriv mentioned this pull request Oct 25, 2018

SQL: [Test] Fix and enable test with randomness #34850

Merged

matriv added a commit to matriv/elasticsearch that referenced this pull request Oct 25, 2018

SQL: [Test] Fix and enable test with randomness

b9b90e2

Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753

javanna added 3 commits October 25, 2018 19:42

address comments

dafcabf

checkstyle

74af6d7

Merge branch 'master' into enhancement/remote_clusters_pings

4df073d

matriv closed this in #34850 Oct 29, 2018

DaveCTurner reopened this Oct 29, 2018

DaveCTurner approved these changes Oct 30, 2018

DaveCTurner and others added 5 commits October 30, 2018 11:51

Update server/src/test/java/org/elasticsearch/transport/RemoteCluster…

fd43e39

…ServiceTests.java Co-Authored-By: javanna <javanna@users.noreply.github.com>

Update server/src/test/java/org/elasticsearch/transport/RemoteCluster…

8550bad

…ServiceTests.java Co-Authored-By: javanna <javanna@users.noreply.github.com>

Update server/src/test/java/org/elasticsearch/transport/RemoteCluster…

a0ca97c

…ServiceTests.java Co-Authored-By: javanna <javanna@users.noreply.github.com>

address review comments

3e984ac

Merge branch 'master' into enhancement/remote_clusters_pings

8b27267

Merge branch 'master' into enhancement/remote_clusters_pings

076aef8

javanna merged commit ef5181c into elastic:master Oct 31, 2018

javanna added the backport pending label Oct 31, 2018

javanna removed the backport pending label Nov 1, 2018

Tim-Brooks mentioned this pull request Nov 12, 2018

Make keepalive pings bidirectional and optimizable #35441

Merged

Tim-Brooks mentioned this pull request Nov 29, 2018

Make keepalive pings bidirectional and optimizable (#35441) #36063

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
