New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to enable pings for specific remote clusters #34753
Allow to enable pings for specific remote clusters #34753
Conversation
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we enable application-level pings by default every 5 seconds from CCS nodes to the selected remote nodes. We also add a setting called `cluster.remote.ping_schedule` that allows to change the interval and potentially disable application-level pings, similar to `transport.ping_schedule` but the new setting only affects connections made to remote clusters. Relates to elastic#34405
Pinging @elastic/es-distributed |
I am contemplating marking this as a bug and backporting this, given that we heard of connectivity problems to remote clusters in quite a few cases, and this change should help those. |
I am not sure about making this the default. I can be convinced, but my bias would be that this be similar to the within-cluster pings and be disabled by default. Also, we are going to have a larger need to enable different transport settings on remote cluster connections versus within-cluster connections. For example, see #34483. I think we should have an eye towards how it would look and so I would propose that the setting be namespaced under
Finally, I don't think this should be considered a bug fix. I agree with your current labeling of an enhancement. |
We've had multiple cases in discuss forums etc. where users have connectivity problems when using CCS. We've discussed this as part of #34405 with @tbrooks8 , @s1monw and @DaveCTurner which makes me reasonably sure that this is a good improvement. Scheduled pings are defaulted to 5s for the transport client for a long time, and I think that CCS is comparable to transport client as it connects to remote nodes, there are firewalls in-between etc. so the issues are different compared to those that you have within the same cluster. One reason why this (or some other similar) change is needed is that it affects only connections to remote clusters, while we've suggested in some cases to enable scheduled transport pings in order to fix CCS connectivity issues, and that will also enable intra-cluster transport pings which is not desirable. Adding the new setting also allows to disable those pings if they are not needed, but having them enabled by default should be a good trade-off. I specifically added the setting as a global one for all remote clusters, because I did not see the need to make things configurable per cluster, which would also complicate the implementation. Would you see this configured per cluster because in the context of CCR scheduled pings would not be needed? Or what is the scenario where different clusters need different ping intervals? |
I think this is a valuable feature, and I think Jason's proposal to have this setting be different for different remote clusters is a good improvement. I'm undecided about whether we should switch this on by default. I don't expect it to have a very significant impact on resource usage, but it does have some impact, particularly if there are many clusters in a deployment. Also it would (for instance) make it slightly harder to interpret the flow of traffic in a I'm also undecided about the default being I think it'd be best to introduce this feature but not to change the default in this PR, and we can have a separate debate about the default at a later date. |
Thanks for your feedback David, what would the use-case that requires configuring the scheduled pings per cluster? |
For instance, a deployment that federates a bunch of clusters spread around the world may want to avoid the unnecessary pings between clusters within each datacentre but may require them for more distant clusters; the need for pings (and their required frequency) may vary on a link-by-link basis. |
I have updated the PR based on the comments, reviews would be appreciated ;) |
+1 Indeed, we renamed CCR from XDCR (i.e., cross-cluster over cross-datacenter) in recognition of the fact that we expect many use-cases for replicating amongst clusters within in a single datacenter, it's not only limited to replicating across the globe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly looked at the docs and tests and left some thoughts.
@@ -152,6 +152,14 @@ PUT _cluster/settings | |||
by default, but they can selectively be made optional by setting this setting | |||
to `true`. | |||
|
|||
`cluster.remote.${cluster_alias}.transport.ping_schedule`:: | |||
|
|||
Schedule a regular application-level ping message to ensure that transport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to say that this setting sets the time between pings, otherwise it's not clear what values other than -1
mean. I also think that the "defaults to ... which defaults to ..." in the last sentence might cause confusion. I drafted an alternative:
Sets the time interval between regular application-level ping messages that are sent to ensure that transport connections to nodes belonging to remote clusters are kept alive. If set to
-1
, application-level ping messages to this remote cluster are not sent. If unset, application-level ping messages are sent according to the globaltransport.ping_schedule
setting, which defaults to-1
meaning that pings are not sent.
I'm not sure this is correct, however. If we set
transport.ping_schedule: 5s
cluster.remote.foo.transport.ping_schedule: -1
Does this disable pings to the foo
remote? Should it? I think it'd be useful to be able to do so. I haven't dug into the implementation but there's no test for this case as far as I can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it sounds great, I was hoping you would help out rephrasing the docs, thanks a lot for that. The behaviour should be what you describe with transport.ping_schedule as a fallback, but I will add a test for this specific case that you mention, it's a good point.
keep-alives apply to all kinds of long-lived connection and not just to | ||
`5s` in the transport client and `-1` (disabled) elsewhere. It is preferable | ||
to correctly configure TCP keep-alives instead of using this feature, because | ||
TCP keep-alives apply to all kinds of long-lived connections and not just to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debates rage to this day on the internet about the use of singular or plural after "kinds of". I suspect that British English prefers the singular and US English the plural, and both are ok 😄 🇬🇧 (I'm ok with this change, just thought you'd like to know)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting :)
Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753
Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753
Increase minimum number of elements for List<> ctor arguments for specific classes that validate the size of the list. Fixes: elastic#34753
I have addressed the comments, tests are green, @DaveCTurner would you mind having another look please? |
I think you didn't intend to close this @matriv - perhaps a typo? |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I left a handful of optional suggestions but this looks good either way.
server/src/test/java/org/elasticsearch/transport/RemoteClusterServiceTests.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/transport/RemoteClusterServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/transport/RemoteClusterServiceTests.java
Outdated
Show resolved
Hide resolved
…ServiceTests.java Co-Authored-By: javanna <javanna@users.noreply.github.com>
…ServiceTests.java Co-Authored-By: javanna <javanna@users.noreply.github.com>
…ServiceTests.java Co-Authored-By: javanna <javanna@users.noreply.github.com>
retest this please |
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we allow to enable application-level pings specifically from CCS nodes to the selected remote nodes through the new setting `cluster.remote.${clusterAlias}.transport.ping_schedule`. The new setting is similar `transport.ping_schedule` but it does not affect intra-cluster communication, pings are only sent to specific remote cluster when specifically enabled, as they are disabled by default. Relates to #34405
This is related to #34405 and a follow-up to #34753. It makes a number of changes to our current keepalive pings. The ping interval configuration is moved to the ConnectionProfile. The server channel now responds to pings. This makes the keepalive pings bidirectional. On the client-side, the pings can now be optimized away. What this means is that if the channel has received a message or sent a message since the last pinging round, the ping is not sent for this round.
This is related to elastic#34405 and a follow-up to elastic#34753. It makes a number of changes to our current keepalive pings. The ping interval configuration is moved to the ConnectionProfile. The server channel now responds to pings. This makes the keepalive pings bidirectional. On the client-side, the pings can now be optimized away. What this means is that if the channel has received a message or sent a message since the last pinging round, the ping is not sent for this round.
This is related to #34405 and a follow-up to #34753. It makes a number of changes to our current keepalive pings. The ping interval configuration is moved to the ConnectionProfile. The server channel now responds to pings. This makes the keepalive pings bidirectional. On the client-side, the pings can now be optimized away. What this means is that if the channel has received a message or sent a message since the last pinging round, the ping is not sent for this round.
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. It is already possible to enable application-level pings through the
transport.ping_schedule
setting, but such setting affects also intra-cluster communication. With this PR we add a per cluster new setting (calledcluster.remote.${cluster_alias}.transport.ping_schedule
) that allows to configure application-level pings for each remote cluster.Relates to #34405
Possible relates to #30247