Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to enable pings for specific remote clusters #34753

Merged
merged 16 commits into from Oct 31, 2018

Conversation

javanna
Copy link
Member

@javanna javanna commented Oct 23, 2018

When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. It is already possible to enable application-level pings through the transport.ping_schedule setting, but such setting affects also intra-cluster communication. With this PR we add a per cluster new setting (called cluster.remote.${cluster_alias}.transport.ping_schedule ) that allows to configure application-level pings for each remote cluster.

Relates to #34405
Possible relates to #30247

When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we enable application-level pings by default every 5 seconds from CCS nodes to the selected remote nodes. We also add a setting called `cluster.remote.ping_schedule` that allows to change the interval and potentially disable application-level pings, similar to `transport.ping_schedule` but the new setting only affects connections made to remote clusters.

Relates to elastic#34405
@javanna javanna added >enhancement :Distributed/Network Http and internode communication implementations v7.0.0 v6.5.0 labels Oct 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@javanna
Copy link
Member Author

javanna commented Oct 23, 2018

I am contemplating marking this as a bug and backporting this, given that we heard of connectivity problems to remote clusters in quite a few cases, and this change should help those.

@jasontedor
Copy link
Member

jasontedor commented Oct 23, 2018

I am not sure about making this the default. I can be convinced, but my bias would be that this be similar to the within-cluster pings and be disabled by default.

Also, we are going to have a larger need to enable different transport settings on remote cluster connections versus within-cluster connections. For example, see #34483. I think we should have an eye towards how it would look and so I would propose that the setting be namespaced under cluster.remote.<name of cluster>.transport.<transport setting>. So that's two changes:

  • this would be on a per remote cluster basis
  • a different namespace for the setting

Finally, I don't think this should be considered a bug fix. I agree with your current labeling of an enhancement.

@javanna
Copy link
Member Author

javanna commented Oct 23, 2018

We've had multiple cases in discuss forums etc. where users have connectivity problems when using CCS. We've discussed this as part of #34405 with @tbrooks8 , @s1monw and @DaveCTurner which makes me reasonably sure that this is a good improvement. Scheduled pings are defaulted to 5s for the transport client for a long time, and I think that CCS is comparable to transport client as it connects to remote nodes, there are firewalls in-between etc. so the issues are different compared to those that you have within the same cluster.

One reason why this (or some other similar) change is needed is that it affects only connections to remote clusters, while we've suggested in some cases to enable scheduled transport pings in order to fix CCS connectivity issues, and that will also enable intra-cluster transport pings which is not desirable. Adding the new setting also allows to disable those pings if they are not needed, but having them enabled by default should be a good trade-off. I specifically added the setting as a global one for all remote clusters, because I did not see the need to make things configurable per cluster, which would also complicate the implementation. Would you see this configured per cluster because in the context of CCR scheduled pings would not be needed? Or what is the scenario where different clusters need different ping intervals?

@DaveCTurner
Copy link
Contributor

I think this is a valuable feature, and I think Jason's proposal to have this setting be different for different remote clusters is a good improvement.

I'm undecided about whether we should switch this on by default. I don't expect it to have a very significant impact on resource usage, but it does have some impact, particularly if there are many clusters in a deployment. Also it would (for instance) make it slightly harder to interpret the flow of traffic in a tcpdump capture. Based on the number of times this comes up, I think this solves a problem experienced by a fairly small fraction of users, and these users can switch it on themselves.

I'm also undecided about the default being 5s. A common idle-connection timeout is 1h, so this could reasonably default to tens-of-minutes and still solve a lot of problems.

I think it'd be best to introduce this feature but not to change the default in this PR, and we can have a separate debate about the default at a later date.

@javanna
Copy link
Member Author

javanna commented Oct 24, 2018

Thanks for your feedback David, what would the use-case that requires configuring the scheduled pings per cluster?

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Oct 24, 2018

For instance, a deployment that federates a bunch of clusters spread around the world may want to avoid the unnecessary pings between clusters within each datacentre but may require them for more distant clusters; the need for pings (and their required frequency) may vary on a link-by-link basis.

@javanna javanna changed the title Schedule ping by default for remote clusters Allow to enable pings for specific remote clusters Oct 24, 2018
@javanna
Copy link
Member Author

javanna commented Oct 24, 2018

I have updated the PR based on the comments, reviews would be appreciated ;)

@jasontedor
Copy link
Member

For instance, a deployment that federates a bunch of clusters spread around the world may want to avoid the unnecessary pings between clusters within within each datacentre but may require them for more distant clusters; the need for pings (and their required frequency) may vary on a link-by-link basis.

+1

Indeed, we renamed CCR from XDCR (i.e., cross-cluster over cross-datacenter) in recognition of the fact that we expect many use-cases for replicating amongst clusters within in a single datacenter, it's not only limited to replicating across the globe.

@javanna javanna added v6.6.0 and removed v6.5.0 labels Oct 24, 2018
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly looked at the docs and tests and left some thoughts.

@@ -152,6 +152,14 @@ PUT _cluster/settings
by default, but they can selectively be made optional by setting this setting
to `true`.

`cluster.remote.${cluster_alias}.transport.ping_schedule`::

Schedule a regular application-level ping message to ensure that transport
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to say that this setting sets the time between pings, otherwise it's not clear what values other than -1 mean. I also think that the "defaults to ... which defaults to ..." in the last sentence might cause confusion. I drafted an alternative:

Sets the time interval between regular application-level ping messages that are sent to ensure that transport connections to nodes belonging to remote clusters are kept alive. If set to -1, application-level ping messages to this remote cluster are not sent. If unset, application-level ping messages are sent according to the global transport.ping_schedule setting, which defaults to -1 meaning that pings are not sent.

I'm not sure this is correct, however. If we set

transport.ping_schedule: 5s
cluster.remote.foo.transport.ping_schedule: -1

Does this disable pings to the foo remote? Should it? I think it'd be useful to be able to do so. I haven't dug into the implementation but there's no test for this case as far as I can see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sounds great, I was hoping you would help out rephrasing the docs, thanks a lot for that. The behaviour should be what you describe with transport.ping_schedule as a fallback, but I will add a test for this specific case that you mention, it's a good point.

keep-alives apply to all kinds of long-lived connection and not just to
`5s` in the transport client and `-1` (disabled) elsewhere. It is preferable
to correctly configure TCP keep-alives instead of using this feature, because
TCP keep-alives apply to all kinds of long-lived connections and not just to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debates rage to this day on the internet about the use of singular or plural after "kinds of". I suspect that British English prefers the singular and US English the plural, and both are ok 😄 🇬🇧 (I'm ok with this change, just thought you'd like to know)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting :)

matriv added a commit to matriv/elasticsearch that referenced this pull request Oct 25, 2018
Increase minimum number of elements for List<> ctor arguments
for specific classes that validate the size of the list.

Fixes: elastic#34753
matriv added a commit to matriv/elasticsearch that referenced this pull request Oct 25, 2018
Increase minimum number of elements for List<> ctor arguments
for specific classes that validate the size of the list.

Fixes: elastic#34753
matriv added a commit to matriv/elasticsearch that referenced this pull request Oct 25, 2018
Increase minimum number of elements for List<> ctor arguments
for specific classes that validate the size of the list.

Fixes: elastic#34753
@javanna
Copy link
Member Author

javanna commented Oct 26, 2018

I have addressed the comments, tests are green, @DaveCTurner would you mind having another look please?

@DaveCTurner
Copy link
Contributor

I think you didn't intend to close this @matriv - perhaps a typo?

@DaveCTurner DaveCTurner reopened this Oct 29, 2018
@javanna
Copy link
Member Author

javanna commented Oct 29, 2018

retest this please

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left a handful of optional suggestions but this looks good either way.

DaveCTurner and others added 5 commits October 30, 2018 11:51
…ServiceTests.java

Co-Authored-By: javanna <javanna@users.noreply.github.com>
…ServiceTests.java

Co-Authored-By: javanna <javanna@users.noreply.github.com>
…ServiceTests.java

Co-Authored-By: javanna <javanna@users.noreply.github.com>
@javanna
Copy link
Member Author

javanna commented Oct 30, 2018

retest this please

@javanna javanna merged commit ef5181c into elastic:master Oct 31, 2018
javanna added a commit that referenced this pull request Nov 1, 2018
When we connect to remote clusters, there may be a few more routers/firewalls in-between compared to when we connect to nodes in the same cluster. We've experienced cases where firewalls drop connections completely and keep-alives seem not to be enough, or they are not properly configured. With this commit we allow to enable application-level pings specifically from CCS nodes to the selected remote nodes through the new setting `cluster.remote.${clusterAlias}.transport.ping_schedule`.  The new setting is similar `transport.ping_schedule` but it does not affect intra-cluster communication, pings are only sent to specific remote cluster when specifically enabled, as they are disabled by default.

Relates to #34405
Tim-Brooks added a commit that referenced this pull request Nov 29, 2018
This is related to #34405 and a follow-up to #34753. It makes a number
of changes to our current keepalive pings.

The ping interval configuration is moved to the ConnectionProfile.

The server channel now responds to pings. This makes the keepalive
pings bidirectional.

On the client-side, the pings can now be optimized away. What this
means is that if the channel has received a message or sent a message
since the last pinging round, the ping is not sent for this round.
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request Nov 29, 2018
This is related to elastic#34405 and a follow-up to elastic#34753. It makes a number
of changes to our current keepalive pings.

The ping interval configuration is moved to the ConnectionProfile.

The server channel now responds to pings. This makes the keepalive
pings bidirectional.

On the client-side, the pings can now be optimized away. What this
means is that if the channel has received a message or sent a message
since the last pinging round, the ping is not sent for this round.
Tim-Brooks added a commit that referenced this pull request Nov 29, 2018
This is related to #34405 and a follow-up to #34753. It makes a number
of changes to our current keepalive pings.

The ping interval configuration is moved to the ConnectionProfile.

The server channel now responds to pings. This makes the keepalive
pings bidirectional.

On the client-side, the pings can now be optimized away. What this
means is that if the channel has received a message or sent a message
since the last pinging round, the ping is not sent for this round.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Network Http and internode communication implementations >enhancement v6.6.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants